Daily arXiv Papers - 2026-02-23

AI-enhanced summaries of 0 research papers from arXiv

Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] Scaling Audio-Text Retrieval with Multimodal Large Language Models

Jilan Xu, Carl Thomé, Danijela Horak, Weidi Xie, Andrew Zisserman

Main category: cs.SD

TL;DR: AuroLA is a novel audio-text retrieval framework that repurposes Multimodal Large Language Models (MLLMs) as a unified backbone for retrieval, outperforming state-of-the-art models with significantly less training data.

DetailsMotivation: Existing contrastive dual-encoder architectures like CLAP are limited by small-scale encoders that struggle with complex queries requiring reasoning or world knowledge. The authors aim to leverage the superior capabilities of MLLMs for better audio-text retrieval.

Method: Three key contributions: (1) Scalable data pipeline curating diverse audio with multi-granular captions via automated annotation; (2) Adapting MLLMs for retrieval by prompting them to summarize inputs and using hidden states of special tokens as embeddings, trained with Hybrid-NCE loss using multi-granular supervision and hard-negative reweighting; (3) MLLM-based bidirectional re-ranking module for refining retrieval candidates through deep cross-modal interaction.

Result: AuroLA consistently outperforms state-of-the-art models including PE-AV while using only ~1% of PE-AV’s training data. Clear scaling trends observed regarding dataset size and model capacity, validating MLLMs as effective unified backbones for audio-text retrieval.

Conclusion: MLLMs can be effectively repurposed as unified backbones for audio-text retrieval, achieving superior performance with significantly less training data through innovative data curation, model adaptation, and re-ranking techniques.

Abstract: Audio-text retrieval is crucial for bridging acoustic signals and natural language. While contrastive dual-encoder architectures like CLAP have shown promise, they are fundamentally limited by the capacity of small-scale encoders. Specifically, the text encoders struggle to understand complex queries that require reasoning or world knowledge. In this paper, we propose AuroLA, a novel contrastive language-audio pre-training framework that re-purposes Multimodal Large Language Models (MLLMs) as a unified backbone for retrieval. Specifically, we make three contributions: (i) we construct a scalable data pipeline that curates diverse audio from multiple sources and generates multi-granular captions, ranging from long descriptions to structured tags, via automated annotation; (ii) we adapt an MLLM for retrieval by prompting it to summarize the audio/text input and using the hidden state of a special token as audio/text embeddings. For model training, we devise a novel Hybrid-NCE loss, which employs multi-granular supervision and hard-negative reweighting to robustly align audio with diverse textual supervision; and (iii) we design an MLLM-based bidirectional re-ranking module that refines retrieval candidates through deep cross-modal interaction. Extensive experiments demonstrate that AuroLA consistently outperforms state-of-the-art models, including the recent PE-AV, while utilizing only approximately 1% of PE-AV’s training data. Lastly, we observe clear scaling trends regarding dataset size and model capacity, validating the effectiveness of MLLM as a unified backbone for audio-text retrieval. Code is available at https://github.com/Jazzcharles/AuroLA.

Relevance: 9/10

[2] When Audio-LLMs Don’t Listen: A Cross-Linguistic Study of Modality Arbitration

Jayadev Billa

Main category: cs.CL

TL;DR: Speech-enabled LLMs show strong text bias (16.6% text dominance) in audio-text conflicts despite audio having higher accuracy, revealing modality arbitration as a distinct reliability issue not explained by information content but by reasoning accessibility.

DetailsMotivation: To understand why speech-enabled language models exhibit text dominance when audio and text conflict, even when audio contains more accurate information, and to investigate whether this bias stems from information content differences or reasoning accessibility issues.

Method: Created ALME benchmark with 57,602 controlled audio-text conflict stimuli across 8 languages. Tested Gemini 2.0 Flash and other audio-LLMs, comparing text dominance in audio-text vs text-text conflicts. Conducted interventions: forced transcription, framing text as corrupted, and fine-tuning ablations (audio projection layer vs LoRA on LLM).

Result: Audio-text conflict shows 16.6% text dominance vs 1.6% in text-text conflicts. Audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%). Forced transcription increases text dominance (19% to 33%). Framing text as corrupted reduces text dominance by 80%. Fine-tuning audio projection layer increases text dominance (+26.5%), while LoRA on LLM halves it (-23.9%).

Conclusion: Text dominance reflects asymmetry in arbitration accessibility, not information content. Modality arbitration is a distinct reliability dimension not captured by standard speech benchmarks, with implications for multimodal model design and evaluation.

Abstract: When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio’s information advantage without improving accessibility. Framing text as “deliberately corrupted” reduces text dominance by 80%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it ($-$23.9%), localizing text dominance to the LLM’s reasoning rather than the audio encoder. Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.

Relevance: 9/10

[3] UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing

Dianyi Wang, Chaofan Ma, Feng Han, Size Wu, Wei Song, Yibin Wang, Zhixiong Zhang, Tianhang Wang, Siyuan Wang, Zhongyu Wei, Jiaqi Wang

Main category: cs.CV

TL;DR: UniReason is a unified multimodal framework that connects text-to-image generation and image editing through complementary reasoning paradigms, incorporating world knowledge-enhanced textual reasoning and visual refinement via self-reflection.

DetailsMotivation: Current unified multimodal models struggle with complex synthesis tasks requiring deep reasoning and treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps.

Method: Proposes UniReason framework with two complementary reasoning paradigms: 1) world knowledge-enhanced textual reasoning for inferring implicit knowledge during generation, and 2) editing capabilities for fine-grained visual refinement via self-reflection. Unifies generation and editing within shared architecture mirroring human cognitive process of planning followed by refinement. Constructs large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains for textual reasoning, plus agent-generated corpus for visual refinement.

Result: Extensive experiments show UniReason achieves advanced performance on reasoning-intensive benchmarks (WISE, KrisBench, UniREditBench) while maintaining superior general synthesis capabilities.

Conclusion: UniReason successfully unifies generation and editing through complementary reasoning paradigms, demonstrating improved performance on complex reasoning tasks while maintaining strong general synthesis abilities.

Abstract: Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through two complementary reasoning paradigms. We incorporate world knowledge-enhanced textual reasoning into generation to infer implicit knowledge, and leverage editing capabilities for fine-grained editing-like visual refinement to further correct visual errors via self-reflection. This approach unifies generation and editing within a shared architecture, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense, physics, etc.) for textual reasoning, alongside an agent-generated corpus for visual refinement. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench and UniREditBench, while maintaining superior general synthesis capabilities.

Relevance: 9/10


Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] QueryPlot: Generating Geological Evidence Layers using Natural Language Queries for Mineral Exploration

Meng Ye, Xiao Lin, Georgina Lukoczki, Graham W. Lederer, Yi Yao

Main category: cs.CL

TL;DR: QueryPlot: A semantic retrieval framework that integrates geological text corpora with map data using NLP to automate mineral prospectivity mapping through natural language queries.

DetailsMotivation: Traditional mineral prospectivity mapping is manual and knowledge-intensive, requiring synthesis of heterogeneous geological knowledge from textual deposit models and geospatial datasets. There's a need to automate this process using modern computational techniques.

Method: Curates descriptive deposit models for 120+ deposit types, transforms geologic map polygons into structured textual representations, uses pretrained embedding models to encode natural language queries and region descriptions, computes semantic similarity scores, and supports compositional querying for multi-criteria analysis.

Result: In tungsten skarn case study, achieves high recall of known occurrences, produces prospective regions aligning with expert-defined permissive tracts, and similarity scores improve supervised learning classification performance when used as features.

Conclusion: QueryPlot successfully automates mineral prospectivity mapping through semantic retrieval, demonstrating practical utility for geological exploration and providing a web-based interactive system with publicly available code and datasets.

Abstract: Mineral prospectivity mapping requires synthesizing heterogeneous geological knowledge, including textual deposit models and geospatial datasets, to identify regions likely to host specific mineral deposit types. This process is traditionally manual and knowledge-intensive. We present QueryPlot, a semantic retrieval and mapping framework that integrates large-scale geological text corpora with geologic map data using modern Natural Language Processing techniques. We curate descriptive deposit models for over 120 deposit types and transform the State Geologic Map Compilation (SGMC) polygons into structured textual representations. Given a user-defined natural language query, the system encodes both queries and region descriptions using a pretrained embedding model and computes semantic similarity scores to rank and spatially visualize regions as continuous evidence layers. QueryPlot supports compositional querying over deposit characteristics, enabling aggregation of multiple similarity-derived layers for multi-criteria prospectivity analysis. In a case study on tungsten skarn deposits, we demonstrate that embedding-based retrieval achieves high recall of known occurrences and produces prospective regions that closely align with expert-defined permissive tracts. Furthermore, similarity scores can be incorporated as additional features in supervised learning pipelines, yielding measurable improvements in classification performance. QueryPlot is implemented as a web-based system supporting interactive querying, visualization, and export of GIS-compatible prospectivity layers.To support future research, we have made the source code and datasets used in this study publicly available.

[2] Neural Synchrony Between Socially Interacting Language Models

Zhining Zhang, Wentao Zhu, Chi Han, Yizhou Wang, Heng Ji

Main category: cs.CL

TL;DR: LLMs show neural synchrony during social interactions similar to humans, providing evidence for their social minds

DetailsMotivation: To investigate whether LLMs can meaningfully be compared to human social minds by examining neural synchrony during social interactions, addressing the controversy about LLMs' social capabilities

Method: Introduce neural synchrony during social simulations as a novel proxy for analyzing LLM sociality at the representational level, using carefully designed experiments to measure synchrony between interacting LLMs

Result: Neural synchrony between LLMs reliably reflects social engagement and temporal alignment in interactions, and is strongly correlated with their social performance, showing parallels with human social brain dynamics

Conclusion: LLMs exhibit neural synchrony patterns similar to humans during social interactions, providing empirical evidence for meaningful comparisons between LLM and human social minds

Abstract: Neuroscience has uncovered a fundamental mechanism of our social nature: human brain activity becomes synchronized with others in many social contexts involving interaction. Traditionally, social minds have been regarded as an exclusive property of living beings. Although large language models (LLMs) are widely accepted as powerful approximations of human behavior, with multi-LLM system being extensively explored to enhance their capabilities, it remains controversial whether they can be meaningfully compared to human social minds. In this work, we explore neural synchrony between socially interacting LLMs as an empirical evidence for this debate. Specifically, we introduce neural synchrony during social simulations as a novel proxy for analyzing the sociality of LLMs at the representational level. Through carefully designed experiments, we demonstrate that it reliably reflects both social engagement and temporal alignment in their interactions. Our findings indicate that neural synchrony between LLMs is strongly correlated with their social performance, highlighting an important link between neural synchrony and the social behaviors of LLMs. Our work offers a new perspective to examine the “social minds” of LLMs, highlighting surprising parallels in the internal dynamics that underlie human and LLM social interaction.

[3] On the scaling relationship between cloze probabilities and language model next-token prediction

Cassandra L. Jacobs, Morgan Grobol

Main category: cs.CL

TL;DR: Larger language models show better predictive power for eye movement and reading time data, with improved semantic alignment to human responses but reduced sensitivity to low-level lexical information.

DetailsMotivation: To investigate how model size affects predictive power for human language processing data (eye movements, reading times, cloze tasks) and understand the trade-offs between semantic alignment and sensitivity to low-level linguistic information.

Method: Analysis of language models of varying sizes on their ability to predict human eye movement patterns, reading times, and cloze task responses, examining how model characteristics affect alignment with human language processing.

Result: Larger models show better predictive power for eye movement and reading time data, assign higher-quality estimates for next tokens in cloze data, are less sensitive to lexical co-occurrence statistics, and are better semantically aligned to human responses.

Conclusion: Larger models’ greater memorization capacity helps them guess more semantically appropriate words but makes them less sensitive to low-level information relevant for word recognition, revealing trade-offs in model design for different language processing tasks.

Abstract: Recent work has shown that larger language models have better predictive power for eye movement and reading time data. While even the best models under-allocate probability mass to human responses, larger models assign higher-quality estimates of next tokens and their likelihood of production in cloze data because they are less sensitive to lexical co-occurrence statistics while being better aligned semantically to human cloze responses. The results provide support for the claim that the greater memorization capacity of larger models helps them guess more semantically appropriate words, but makes them less sensitive to low-level information that is relevant for word recognition.

[4] Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations

Joschka Braun

Main category: cs.CL

TL;DR: Steering vectors for controlling LLM behavior have reliability issues; thesis investigates why reliability varies across behaviors and how training data impacts it, finding that activation similarity and separation predict reliability.

DetailsMotivation: Steering vectors are a lightweight method for controlling language model behavior by adding learned biases to activations, but they show unreliable effect sizes across samples and behaviors. The research aims to understand why steering reliability differs across behaviors and how training data impacts it.

Method: Investigates steering vector reliability through analysis of activation patterns: 1) Measures cosine similarity between training activation differences to predict reliability, 2) Examines how well positive and negative activations are separated along steering directions, 3) Compares steering vectors trained on different prompt variations.

Result: 1) Higher cosine similarity between training activation differences predicts more reliable steering, 2) Behavior datasets with better separation of positive/negative activations along steering direction are more reliably steerable, 3) Steering vectors from different prompt variations are directionally distinct but perform similarly and show correlated efficacy.

Conclusion: Steering vectors are unreliable when the latent target behavior representation cannot be effectively approximated by a linear steering direction. The findings provide diagnostic tools for identifying steering unreliability and motivate development of more robust steering methods that account for non-linear latent behavior representations.

Abstract: Steering vectors are a lightweight method for controlling language model behavior by adding a learned bias to the activations at inference time. Although effective on average, steering effect sizes vary across samples and are unreliable for many target behaviors. In my thesis, I investigate why steering reliability differs across behaviors and how it is impacted by steering vector training data. First, I find that higher cosine similarity between training activation differences predicts more reliable steering. Second, I observe that behavior datasets where positive and negative activations are better separated along the steering direction are more reliably steerable. Finally, steering vectors trained on different prompt variations are directionally distinct, yet perform similarly well and exhibit correlated efficacy across datasets. My findings suggest that steering vectors are unreliable when the latent target behavior representation is not effectively approximated by the linear steering direction. Taken together, these insights offer a practical diagnostic for steering unreliability and motivate the development of more robust steering methods that explicitly account for non-linear latent behavior representations.

[5] Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions

Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini

Main category: cs.CL

TL;DR: A novel neural topic modeling approach that uses language models to create semantically-grounded soft label targets for better topic quality and document retrieval.

DetailsMotivation: Traditional neural topic models rely on reconstructing Bag-of-Words representations, which ignores contextual information and suffers from data sparsity issues, limiting their ability to capture semantic relationships and produce high-quality topics.

Method: The method constructs semantically-grounded soft label targets using language models by projecting next token probabilities onto a pre-defined vocabulary via specialized prompts. Topic models are then trained to reconstruct these soft labels using LM hidden states, providing contextually enriched supervision signals.

Result: Experiments on three datasets show substantial improvements in topic coherence and purity over existing baselines. A new retrieval-based metric demonstrates significantly better performance in identifying semantically similar documents for retrieval applications.

Conclusion: The approach successfully addresses limitations of traditional neural topic models by incorporating contextual information from language models, producing higher-quality topics that better align with corpus thematic structure and enabling effective retrieval-oriented applications.

Abstract: Traditional neural topic models are typically optimized by reconstructing the document’s Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we propose a novel approach to construct semantically-grounded soft label targets using Language Models (LMs) by projecting the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary to obtain contextually enriched supervision signals. By training the topic models to reconstruct the soft labels using the LM hidden states, our method produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Experiments on three datasets show that our method achieves substantial improvements in topic coherence, purity over existing baselines. Additionally, we also introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.

[6] When Audio-LLMs Don’t Listen: A Cross-Linguistic Study of Modality Arbitration

Jayadev Billa

Main category: cs.CL

TL;DR: Speech-enabled LLMs show strong text bias (16.6% text dominance) in audio-text conflicts despite audio having higher accuracy, revealing modality arbitration as a distinct reliability issue not explained by information content but by reasoning accessibility.

DetailsMotivation: To understand why speech-enabled language models exhibit text dominance when audio and text conflict, even when audio contains more accurate information, and to investigate whether this bias stems from information content differences or reasoning accessibility issues.

Method: Created ALME benchmark with 57,602 controlled audio-text conflict stimuli across 8 languages. Tested Gemini 2.0 Flash and other audio-LLMs, comparing text dominance in audio-text vs text-text conflicts. Conducted interventions: forced transcription, framing text as corrupted, and fine-tuning ablations (audio projection layer vs LoRA on LLM).

Result: Audio-text conflict shows 16.6% text dominance vs 1.6% in text-text conflicts. Audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%). Forced transcription increases text dominance (19% to 33%). Framing text as corrupted reduces text dominance by 80%. Fine-tuning audio projection layer increases text dominance (+26.5%), while LoRA on LLM halves it (-23.9%).

Conclusion: Text dominance reflects asymmetry in arbitration accessibility, not information content. Modality arbitration is a distinct reliability dimension not captured by standard speech benchmarks, with implications for multimodal model design and evaluation.

Abstract: When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio’s information advantage without improving accessibility. Framing text as “deliberately corrupted” reduces text dominance by 80%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it ($-$23.9%), localizing text dominance to the LLM’s reasoning rather than the audio encoder. Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.

[7] Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering

Jash Rajesh Parekh, Wonbin Kweon, Joey Chan, Rezarta Islamaj, Robert Leaman, Pengcheng Jiang, Chih-Hsuan Wei, Zhizheng Wang, Zhiyong Lu, Jiawei Han

Main category: cs.CL

TL;DR: CondMedQA benchmark and Condition-Gated Reasoning framework for conditional biomedical question answering that accounts for patient-specific factors like comorbidities and contraindications.

DetailsMotivation: Current biomedical QA systems assume uniform medical knowledge application, but real clinical reasoning is conditional on patient-specific factors. Existing benchmarks don't evaluate conditional reasoning, and current methods lack mechanisms to ensure retrieved knowledge is contextually applicable.

Method: Proposes CondMedQA benchmark with multi-hop questions whose answers vary with patient conditions, and Condition-Gated Reasoning (CGR) framework that constructs condition-aware knowledge graphs and selectively activates/prunes reasoning paths based on query conditions.

Result: CGR more reliably selects condition-appropriate answers while matching or exceeding state-of-the-art performance on biomedical QA benchmarks.

Conclusion: Explicitly modeling conditionality is crucial for robust medical reasoning, and the proposed framework addresses limitations of current biomedical QA systems.

Abstract: Current biomedical question answering (QA) systems often assume that medical knowledge applies uniformly, yet real-world clinical reasoning is inherently conditional: nearly every decision depends on patient-specific factors such as comorbidities and contraindications. Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to given context. To address this gap, we propose CondMedQA, the first benchmark for conditional biomedical QA, consisting of multi-hop questions whose answers vary with patient conditions. Furthermore, we propose Condition-Gated Reasoning (CGR), a novel framework that constructs condition-aware knowledge graphs and selectively activates or prunes reasoning paths based on query conditions. Our findings show that CGR more reliably selects condition-appropriate answers while matching or exceeding state-of-the-art performance on biomedical QA benchmarks, highlighting the importance of explicitly modeling conditionality for robust medical reasoning.

[8] Analyzing LLM Instruction Optimization for Tabular Fact Verification

Xiaotang Du, Giwon Hong, Wai-Chung Kwan, Rohit Saxena, Ivan Titov, Pasquale Minervini, Emily Allaway

Main category: cs.CL

TL;DR: Systematic comparison of instruction optimization techniques for tabular fact verification using DSPy framework, evaluating CoT, ReAct, and CodeAct methods across multiple benchmarks and model families.

DetailsMotivation: Instruction optimization offers a lightweight, model-agnostic way to improve LLM reasoning performance, but there's a need for systematic comparison of different optimization techniques for tabular fact verification tasks.

Method: Evaluated four prompting techniques (direct prediction, Chain-of-Thought, ReAct with SQL tools, CodeAct with Python execution) using three DSPy optimizers (COPRO, MiPROv2, SIMBA) across four benchmarks and three model families.

Result: Instruction optimization consistently improves verification accuracy; MiPROv2 yields most stable gains for CoT, SIMBA provides largest benefits for ReAct agents at larger scales; CoT remains effective especially with smaller models.

Conclusion: Instruction optimization is valuable for tabular fact verification, with different optimizers suited to different prompting techniques; CoT is effective for smaller models while optimized ReAct agents can achieve competitive performance with larger models.

Abstract: Instruction optimization provides a lightweight, model-agnostic approach to enhancing the reasoning performance of large language models (LLMs). This paper presents the first systematic comparison of instruction optimization, based on the DSPy optimization framework, for tabular fact verification. We evaluate four out-of-the-box prompting techniques that cover both text-only prompting and code use: direct prediction, Chain-of-Thought (CoT), ReAct with SQL tools, and CodeAct with Python execution. We study three optimizers from the DSPy framework – COPRO, MiPROv2, and SIMBA – across four benchmarks and three model families. We find that instruction optimization consistently improves verification accuracy, with MiPROv2 yielding the most stable gains for CoT, and SIMBA providing the largest benefits for ReAct agents, particularly at larger model scales. Behavioral analyses reveal that SIMBA encourages more direct reasoning paths by applying heuristics, thereby improving numerical comparison abilities in CoT reasoning and helping avoid unnecessary tool calls in ReAct agents. Across different prompting techniques, CoT remains effective for tabular fact checking, especially with smaller models. Although ReAct agents built with larger models can achieve competitive performance, they require careful instruction optimization.

[9] CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego

Main category: cs.CL

TL;DR: CUICurate: GraphRAG framework for automated UMLS concept set curation using knowledge graph embeddings and LLMs for clinical NLP pipelines.

DetailsMotivation: Clinical NLP needs concept sets (related synonyms, subtypes, supertypes) beyond single UMLS CUIs, but manual curation is labor-intensive, inconsistent, and poorly supported by existing tools.

Method: Graph-based retrieval-augmented generation (GraphRAG) framework: construct UMLS knowledge graph with embeddings for semantic retrieval, retrieve candidate CUIs, then use LLMs (GPT-5 and GPT-5-mini) for filtering and classification.

Result: Produced larger, more complete concept sets than manual benchmarks while matching human precision; GPT-5-mini had higher recall, GPT-5 classifications aligned better with clinician judgments; outputs stable and computationally inexpensive.

Conclusion: CUICurate provides scalable, reproducible approach for UMLS concept set curation, reducing manual effort by integrating graph-based retrieval with LLM reasoning for clinical NLP pipelines.

Abstract: Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and supertypes. Constructing such concept sets is labour-intensive, inconsistently performed, and poorly supported by existing tools, particularly for NLP pipelines that operate directly on UMLS CUIs. Methods We present CUICurate, a Graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. For each target concept, candidate CUIs were retrieved from the KG, followed by large language model (LLM) filtering and classification steps comparing two LLMs (GPT-5 and GPT-5-mini). The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets. Results Across all concepts, CUICurate produced substantially larger and more complete concept sets than the manual benchmarks whilst matching human precision. Comparisons between the two LLMs found that GPT-5-mini achieved higher recall during filtering, while GPT-5 produced classifications that more closely aligned with clinician judgements. Outputs were stable across repeated runs and computationally inexpensive. Conclusions CUICurate offers a scalable and reproducible approach to support UMLS concept set curation that substantially reduces manual effort. By integrating graph-based retrieval with LLM reasoning, the framework produces focused candidate concept sets that can be adapted to clinical NLP pipelines for different phenotyping and analytic requirements.

[10] Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering

Amine Kobeissi, Philippe Langlais

Main category: cs.CL

TL;DR: The paper studies within-document retrieval failures in financial QA, introduces multi-granularity evaluation, and proposes a domain fine-tuned page scorer to improve page-level retrieval in regulatory filings.

DetailsMotivation: Current retrieval-augmented generation for financial QA often fails even when correct documents are retrieved, because the specific page or chunk containing the answer is missed, leading to unreliable answers in high-stakes financial settings.

Method: Evaluates retrieval at document, page, and chunk levels; introduces oracle analysis; compares dense, sparse, hybrid, and hierarchical retrieval methods; and proposes a domain fine-tuned bi-encoder page scorer specifically trained for page-level relevance in financial filings.

Result: Shows that gains in document discovery improve page recall, but oracle performance indicates room for improvement; the proposed page scorer demonstrates significant improvements in page recall and chunk retrieval.

Conclusion: Within-document retrieval failures are a critical issue in financial QA, and domain-specific fine-tuning for intermediate retrieval units (pages) can substantially improve retrieval performance for regulatory filings.

Abstract: Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings. We study a frequent failure mode in which the correct document is retrieved but the page or chunk that contains the answer is missed, leading the generator to extrapolate from incomplete context. Despite its practical significance, this within-document retrieval failure mode has received limited systematic attention in the Financial Question Answering (QA) literature. We evaluate retrieval at multiple levels of granularity, document, page, and chunk level, and introduce an oracle based analysis to provide empirical upper bounds on retrieval and generative performance. On a 150 question subset of FinanceBench, we reproduce and compare diverse retrieval strategies including dense, sparse, hybrid, and hierarchical methods with reranking and query reformulation. Across methods, gains in document discovery tend to translate into stronger page recall, yet oracle performance still suggests headroom for page and chunk level retrieval. To target this gap, we introduce a domain fine-tuned page scorer that treats pages as an intermediate retrieval unit between documents and chunks. Unlike prior passage-based hierarchical retrieval, we fine-tune a bi-encoder specifically for page-level relevance on financial filings, exploiting the semantic coherence of pages. Overall, our results demonstrate a significant improvement in page recall and chunk retrieval.

[11] Towards More Standardized AI Evaluation: From Models to Agents

Ali El Filali, Inès Bedar

Main category: cs.CL

TL;DR: The paper argues that traditional ML evaluation methods are inadequate for modern AI systems, especially agentic systems, and proposes rethinking evaluation as a continuous measurement discipline rather than static performance theater.

DetailsMotivation: Current evaluation practices are outdated and misleading for modern AI systems. They fail to capture the dynamic, non-deterministic nature of agentic systems and compound AI applications, leading to false confidence and silent failure modes.

Method: The paper takes a conceptual/analytical approach rather than proposing new metrics. It examines how evaluation pipelines themselves introduce failure modes, why benchmark scores mislead, and how agentic systems fundamentally change performance measurement.

Result: The analysis reveals that traditional evaluation approaches obscure rather than illuminate system behavior for agentic AI systems, and that evaluation should shift from being a final checkpoint to a core control function.

Conclusion: Evaluation must be reimagined as a continuous measurement discipline that conditions trust, iteration, and governance in non-deterministic systems, rather than as performance theater with static benchmarks.

Abstract: Evaluation is no longer a final checkpoint in the machine learning lifecycle. As AI systems evolve from static models to compound, tool-using agents, evaluation becomes a core control function. The question is no longer “How good is the model?” but “Can we trust the system to behave as intended, under change, at scale?”. Yet most evaluation practices remain anchored in assumptions inherited from the model-centric era: static benchmarks, aggregate scores, and one-off success criteria. This paper argues that such approaches are increasingly obscure rather than illuminating system behavior. We examine how evaluation pipelines themselves introduce silent failure modes, why high benchmark scores routinely mislead teams, and how agentic systems fundamentally alter the meaning of performance measurement. Rather than proposing new metrics or harder benchmarks, we aim to clarify the role of evaluation in the AI era, and especially for agents: not as performance theater, but as a measurement discipline that conditions trust, iteration, and governance in non-deterministic systems.

[12] Perceived Political Bias in LLMs Reduces Persuasive Abilities

Matthew DiGiuseppe, Joshua Robison

Main category: cs.CL

TL;DR: Study shows political bias warnings reduce ChatGPT’s persuasive effectiveness by 28% in correcting economic policy misconceptions

DetailsMotivation: To investigate whether perceptions of political bias affect the persuasive power of conversational AI, particularly as LLMs become embroiled in partisan conflicts and face credibility attacks from political elites.

Method: Preregistered U.S. survey experiment with 2,144 participants who engaged in three-round conversations with ChatGPT about economic policy misconceptions. Compared neutral control condition to treatment where participants received warning that ChatGPT was biased against their political party.

Result: Warning about political bias reduced ChatGPT’s persuasive effectiveness by 28%. Transcript analysis revealed participants pushed back more and engaged less receptively when warned about bias.

Conclusion: Persuasive impact of conversational AI is politically contingent and constrained by perceptions of partisan alignment, suggesting that credibility attacks can significantly undermine AI’s ability to correct misinformation.

Abstract: Conversational AI has been proposed as a scalable way to correct public misconceptions and spread misinformation. Yet its effectiveness may depend on perceptions of its political neutrality. As LLMs enter partisan conflict, elites increasingly portray them as ideologically aligned. We test whether these credibility attacks reduce LLM-based persuasion. In a preregistered U.S. survey experiment (N=2144), participants completed a three-round conversation with ChatGPT about a personally held economic policy misconception. Compared to a neutral control, a short message indicating that the LLM was biased against the respondent’s party attenuated persuasion by 28%. Transcript analysis indicates that the warnings alter the interaction: respondents push back more and engage less receptively. These findings suggest that the persuasive impact of conversational AI is politically contingent, constrained by perceptions of partisan alignment.

[13] Agentic Adversarial QA for Improving Domain-Specific LLMs

Vincent Grari, Ciprian Tomoiaga, Sylvain Lamprier, Tatsunori Hashimoto, Marcin Detyniecki

Main category: cs.CL

TL;DR: Adversarial question-generation framework creates compact, challenging synthetic data for specialized domain adaptation of LLMs, outperforming traditional methods with fewer samples.

DetailsMotivation: LLMs struggle with specialized domains due to limited high-quality task-relevant data. Traditional synthetic data methods (paraphrasing, knowledge extraction) fail to develop interpretive reasoning and produce inefficient, redundant corpora.

Method: Proposes adversarial question-generation framework that compares outputs between target model and expert model grounded in reference documents. Uses iterative feedback-driven process to identify comprehension gaps and generate semantically challenging questions.

Result: Evaluation on LegalBench corpus subsets shows method achieves higher accuracy with substantially fewer synthetic samples compared to traditional approaches.

Conclusion: The adversarial framework effectively addresses limitations of existing synthetic data methods by producing compact, challenging datasets that improve specialized domain adaptation efficiency.

Abstract: Large Language Models (LLMs), despite extensive pretraining on broad internet corpora, often struggle to adapt effectively to specialized domains. There is growing interest in fine-tuning these models for such domains; however, progress is constrained by the scarcity and limited coverage of high-quality, task-relevant data. To address this, synthetic data generation methods such as paraphrasing or knowledge extraction are commonly applied. Although these approaches excel at factual recall and conceptual knowledge, they suffer from two critical shortcomings: (i) they provide minimal support for interpretive reasoning capabilities in these specialized domains, and (ii) they often produce synthetic corpora that are excessively large and redundant, resulting in poor sample efficiency. To overcome these gaps, we propose an adversarial question-generation framework that produces a compact set of semantically challenging questions. These questions are constructed by comparing the outputs of the model to be adapted and a robust expert model grounded in reference documents, using an iterative, feedback-driven process designed to reveal and address comprehension gaps. Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.

[14] Detecting Contextual Hallucinations in LLMs with Frequency-Aware Attention

Siya Qi, Yudong Chen, Runcong Zhao, Qinglin Zhu, Zhanghao Hu, Wei Liu, Yulan He, Zheng Yuan, Lin Gui

Main category: cs.CL

TL;DR: A frequency-based approach to detect hallucinations in LLMs by analyzing high-frequency components in attention distributions during text generation.

DetailsMotivation: Hallucination detection is critical for LLM reliability in context-based generation. While attention offers a direct view of grounding behavior, existing approaches use coarse summaries that miss fine-grained instabilities in attention patterns.

Method: Model attention distributions as discrete signals and extract high-frequency components reflecting rapid local changes in attention. Analyze attention variation during generation from a signal processing perspective, focusing on high-frequency attention energy associated with hallucinated tokens.

Result: The approach achieves performance gains over verification-based, internal-representation-based, and attention-based methods on RAGTruth and HalluRAG benchmarks across models and tasks.

Conclusion: Hallucinated tokens are associated with high-frequency attention energy, reflecting fragmented and unstable grounding behavior. A lightweight hallucination detector using high-frequency attention features is effective and outperforms existing methods.

Abstract: Hallucination detection is critical for ensuring the reliability of large language models (LLMs) in context-based generation. Prior work has explored intrinsic signals available during generation, among which attention offers a direct view of grounding behavior. However, existing approaches typically rely on coarse summaries that fail to capture fine-grained instabilities in attention. Inspired by signal processing, we introduce a frequency-aware perspective on attention by analyzing its variation during generation. We model attention distributions as discrete signals and extract high-frequency components that reflect rapid local changes in attention. Our analysis reveals that hallucinated tokens are associated with high-frequency attention energy, reflecting fragmented and unstable grounding behavior. Based on this insight, we develop a lightweight hallucination detector using high-frequency attention features. Experiments on the RAGTruth and HalluRAG benchmarks show that our approach achieves performance gains over verification-based, internal-representation-based, and attention-based methods across models and tasks.

[15] The Statistical Signature of LLMs

Ortal Hadad, Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli, Walter Quattrociocchi

Main category: cs.CL

TL;DR: LLM-generated text shows higher structural regularity and compressibility than human text, detectable via lossless compression without model internals, though this distinction weakens in small-scale fragmented interactions.

DetailsMotivation: To understand how LLM text generation reshapes the statistical organization of language and develop a model-agnostic method to distinguish generative regimes from surface text alone.

Method: Uses lossless compression as a measure of statistical regularity across three information ecosystems: controlled human-LLM continuations, generative mediation of knowledge infrastructure (Wikipedia vs. Grokipedia), and synthetic social interactions (Moltbook vs. Reddit).

Result: LLM-produced language exhibits higher structural regularity and compressibility than human-written text in controlled and mediated contexts, showing consistent separation across models, tasks, and domains. However, this signature shows scale dependence and attenuates in fragmented interaction environments.

Conclusion: Compression provides a simple, robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on communication complexity evolution, though surface-level distinguishability has fundamental limits at small scales.

Abstract: Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of communication.

[16] FENCE: A Financial and Multimodal Jailbreak Detection Dataset

Mirae Kim, Seonghun Jeong, Youngjun Kwak

Main category: cs.CL

TL;DR: FENCE is a bilingual Korean-English multimodal dataset for training jailbreak detectors in financial applications, featuring finance-relevant queries with image-grounded threats to address vulnerabilities in Vision Language Models.

DetailsMotivation: Jailbreaking poses significant risks to LLMs and VLMs, with VLMs being particularly vulnerable due to their multimodal nature. There's a scarcity of resources for jailbreak detection, especially in finance, creating a critical gap for safe AI deployment in sensitive domains.

Method: Created FENCE dataset with bilingual (Korean-English) multimodal content featuring finance-relevant queries paired with image-grounded threats. Conducted experiments with commercial and open-source VLMs, and trained a baseline detector on this dataset.

Result: Experiments revealed consistent vulnerabilities: GPT-4o showed measurable attack success rates, open-source models displayed greater exposure. Baseline detector achieved 99% in-distribution accuracy and maintained strong performance on external benchmarks.

Conclusion: FENCE provides a focused resource for advancing multimodal jailbreak detection in finance, supporting safer AI systems in sensitive domains by addressing vulnerabilities in Vision Language Models through realistic, domain-specific multimodal threats.

Abstract: Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset’s robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.

[17] Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models

Wojciech Michaluk, Tymoteusz Urban, Mateusz Kubita, Soveatin Kuntur, Anna Wroblewska

Main category: cs.CL

TL;DR: A hybrid approach combining transformer embeddings with linguistic features achieves 91% F1-score for clickbait detection, outperforming traditional NLP methods and enhancing interpretability.

DetailsMotivation: Clickbait headlines degrade online information quality and undermine user trust, creating a need for effective detection methods that can identify manipulative content.

Method: Hybrid approach combining transformer-based text embeddings with linguistically motivated informativeness features, using XGBoost classifier over embeddings augmented with 15 explicit linguistic features.

Result: Best-performing model achieves 91% F1-score, outperforming TF-IDF, Word2Vec, GloVe, LLM prompt-based classification, and feature-only baselines.

Conclusion: The proposed hybrid approach effectively detects clickbait while maintaining interpretability through explicit linguistic features, with released code and models for reproducible research.

Abstract: Clickbait headlines degrade the quality of online information and undermine user trust. We present a hybrid approach to clickbait detection that combines transformer-based text embeddings with linguistically motivated informativeness features. Using natural language processing techniques, we evaluate classical vectorizers, word embedding baselines, and large language model embeddings paired with tree-based classifiers. Our best-performing model, XGBoost over embeddings augmented with 15 explicit features, achieves an F1-score of 91%, outperforming TF-IDF, Word2Vec, GloVe, LLM prompt based classification, and feature-only baselines. The proposed feature set enhances interpretability by highlighting salient linguistic cues such as second-person pronouns, superlatives, numerals, and attention-oriented punctuation, enabling transparent and well-calibrated clickbait predictions. We release code and trained models to support reproducible research.

[18] Improving Sampling for Masked Diffusion Models via Information Gain

Kaisen Yang, Jayden Teoh, Kaicheng Yang, Yitong Zhang, Alex Lamb

Main category: cs.CL

TL;DR: Info-Gain Sampler improves masked diffusion model decoding by considering both immediate uncertainty and future information gain, outperforming greedy heuristics across reasoning, coding, creative writing, and image generation tasks.

DetailsMotivation: Existing samplers for Masked Diffusion Models (MDMs) use greedy heuristics that prioritize positions with highest local certainty, but fail to consider downstream impacts of decoding choices on subsequent steps and don't minimize cumulative uncertainty.

Method: Proposes Info-Gain Sampler, a principled decoding framework that balances immediate uncertainty with information gain over future masked tokens, exploiting the non-causal nature of MDMs to evaluate how decoding decisions reshape token probabilities across all remaining masked positions.

Result: Achieves 3.6% improvement in average accuracy on reasoning tasks, 63.1% win-rate in creative writing, reduces cumulative uncertainty from 78.4 to 48.6 on reasoning tasks, and consistently outperforms existing samplers across diverse architectures and tasks.

Conclusion: Info-Gain Sampler provides a more effective decoding strategy for MDMs by considering both immediate and future uncertainty, demonstrating significant improvements across multiple domains including reasoning, coding, creative writing, and image generation.

Abstract: Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models but require careful planning to achieve high-quality generation. Existing samplers typically adopt greedy heuristics, prioritizing positions with the highest local certainty to decode at each step. Through failure case analysis, we identify a fundamental limitation of this approach: it neglects the downstream impact of current decoding choices on subsequent steps and fails to minimize cumulative uncertainty. In particular, these methods do not fully exploit the non-causal nature of MDMs, which enables evaluating how a decoding decision reshapes token probabilities/uncertainty across all remaining masked positions. To bridge this gap, we propose the Info-Gain Sampler, a principled decoding framework that balances immediate uncertainty with information gain over future masked tokens. Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that Info-Gain Sampler consistently outperforms existing samplers for MDMs. For instance, it achieves a 3.6% improvement in average accuracy on reasoning tasks and a 63.1% win-rate in creative writing. Notably, on reasoning tasks it reduces cumulative uncertainty from 78.4 to 48.6, outperforming the best baseline by a large margin. The code will be available at https://github.com/yks23/Information-Gain-Sampler.

[19] Information-Theoretic Storage Cost in Sentence Comprehension

Kohei Kajikawa, Shinnosuke Isono, Ethan Gotlieb Wilcox

Main category: cs.CL

TL;DR: Proposes an information-theoretic measure of processing storage cost in sentence comprehension, estimated from neural language models, validated through psycholinguistic analyses.

DetailsMotivation: Current measures of working memory load in sentence comprehension rely on symbolic grammars with discrete, uniform costs. There's a need for continuous, theory-neutral measures that better capture the probabilistic nature of language processing.

Method: Develops an information-theoretic measure: the amount of information previous words carry about future context under uncertainty. This continuous measure can be estimated from pre-trained neural language models rather than symbolic grammars.

Result: The measure successfully: (i) recovers known processing asymmetries in center embeddings and relative clauses, (ii) correlates with grammar-based storage costs in annotated corpora, and (iii) predicts reading-time variance in naturalistic datasets beyond traditional information-based predictors.

Conclusion: The proposed information-theoretic measure provides a continuous, theory-neutral alternative to discrete grammar-based metrics for quantifying processing storage costs, with practical applications in psycholinguistics using neural language models.

Abstract: Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input. While measures of such load have played an important role in psycholinguistic theories, they have been formalized, largely, using symbolic grammars, which assign discrete, uniform costs to syntactic predictions. This study proposes a measure of processing storage cost based on an information-theoretic formalization, as the amount of information previous words carry about future context, under uncertainty. Unlike previous discrete, grammar-based metrics, this measure is continuous, theory-neutral, and can be estimated from pre-trained neural language models. The validity of this approach is demonstrated through three analyses in English: our measure (i) recovers well-known processing asymmetries in center embeddings and relative clauses, (ii) correlates with a grammar-based storage cost in a syntactically-annotated corpus, and (iii) predicts reading-time variance in two large-scale naturalistic datasets over and above baseline models with traditional information-based predictors.

[20] Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning

Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao jin, Bang Yang, Yuexian Zou

Main category: cs.CL

TL;DR: Confidence-Driven Contrastive Decoding (CCD) improves LLM reasoning by selectively intervening at low-confidence tokens during decoding, reducing errors and output length without extra training.

DetailsMotivation: Current test-time scaling for LLM reasoning assumes more computation uniformly improves correctness, but reasoning uncertainty is localized to specific low-confidence tokens that cause most errors and unnecessary output expansion.

Method: CCD detects low-confidence tokens during decoding, constructs a contrastive reference by replacing high-confidence tokens with minimal placeholders, and refines predictions by subtracting this reference distribution at low-confidence positions.

Result: CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead.

Conclusion: As a training-free method, CCD enhances reasoning reliability through targeted low-confidence intervention without computational redundancy, offering an efficient approach to improve LLM reasoning.

Abstract: Recent work on test-time scaling for large language model (LLM) reasoning typically assumes that allocating more inference-time computation uniformly improves correctness. However, prior studies show that reasoning uncertainty is highly localized: a small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. Motivated by this observation, we propose Thinking by Subtraction, a confidence-driven contrastive decoding approach that improves reasoning reliability through targeted token-level intervention. Our method, Confidence-Driven Contrastive Decoding, detects low-confidence tokens during decoding and intervenes selectively at these positions. It constructs a contrastive reference by replacing high-confidence tokens with minimal placeholders, and refines predictions by subtracting this reference distribution at low-confidence locations. Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead. As a training-free method, CCD enhances reasoning reliability through targeted low-confidence intervention without computational redundancy. Our code will be made available at: https://github.com/bolo-web/CCD.

[21] Simplifying Outcomes of Language Model Component Analyses with ELIA

Aaron Louis Eidt, Nils Feldhus

Main category: cs.CL

TL;DR: ELIA is an interactive web app that simplifies complex LLM interpretability analyses (Attribution, Function Vector, Circuit Tracing) for broader audiences using AI-generated natural language explanations from visualizations.

DetailsMotivation: To address the accessibility gap in mechanistic interpretability of LLMs, which currently requires specialized expertise, by creating tools that make complex analyses understandable to non-experts.

Method: Developed ELIA web application integrating three interpretability techniques with a novel approach: using vision-language models to automatically generate natural language explanations from complex visualizations. Conducted mixed-methods user study to evaluate effectiveness.

Result: User study showed clear preference for interactive interfaces over static visualizations. AI-powered explanations helped bridge knowledge gaps - no significant correlation found between prior LLM experience and comprehension scores, indicating reduced barriers across experience levels.

Conclusion: AI systems can simplify complex model analyses, but their true power emerges when combined with user-centered design emphasizing interactivity, specificity, and narrative guidance.

Abstract: While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key techniques – Attribution Analysis, Function Vector Analysis, and Circuit Tracing – and introduces a novel methodology: using a vision-language model to automatically generate natural language explanations (NLEs) for the complex visualizations produced by these methods. The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations. A key finding was that the AI-powered explanations helped bridge the knowledge gap for non-experts; a statistical analysis showed no significant correlation between a user’s prior LLM experience and their comprehension scores, suggesting that the system reduced barriers to comprehension across experience levels. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user-centered design that prioritizes interactivity, specificity, and narrative guidance.

[22] PsihoRo: Depression and Anxiety Romanian Text Corpus

Alexandra Ciobotaru, Ana-Maria Bucur, Liviu P. Dinu

Main category: cs.CL

TL;DR: Created PsihoRo, the first open-source Romanian corpus for depression and anxiety analysis using open-ended questions paired with PHQ-9/GAD-7 screening questionnaires.

DetailsMotivation: Romanian lacks open-source mental health corpora despite the need for psychological NLP resources. Existing approaches often rely on problematic social media data collection with collector assumptions, whereas a more reliable method uses open-ended questions with validated screening surveys.

Method: Collected data from 205 respondents using a form with 6 open-ended questions paired with standardized PHQ-9 (depression) and GAD-7 (anxiety) screening questionnaires. Analyzed using statistical analysis, Romanian LIWC text analysis, emotion detection, and topic modeling.

Result: Created PsihoRo corpus with 205 Romanian texts. Analysis revealed important linguistic and emotional features of depression and anxiety expressions in Romanian, providing foundational insights for mental health text analysis in this language.

Conclusion: PsihoRo represents the first step toward understanding mental health texts in Romanian, providing a valuable resource for the NLP community and enabling future research on depression and anxiety analysis in this language.

Abstract: Psychological corpora in NLP are collections of texts used to analyze human psychology, emotions, and mental health. These texts allow researchers to study psychological constructs, detect mental health issues and analyze emotional language. However, mental health data can be difficult to collect correctly from social media, due to suppositions made by the collectors. A more pragmatic strategy involves gathering data through open-ended questions and then assessing this information with self-report screening surveys. This method was employed successfully for English, a language with a lot of psychological NLP resources. However, this cannot be stated for Romanian, which currently has no open-source mental health corpus. To address this gap, we have created the first corpus for depression and anxiety in Romanian, by utilizing a form with 6 open-ended questions along with the standardized PHQ-9 and GAD-7 screening questionnaires. Consisting of the texts of 205 respondents and although it may seem small, PsihoRo is a first step towards understanding and analyzing texts regarding the mental health of the Romanian population. We employ statistical analysis, text analysis using Romanian LIWC, emotion detection and topic modeling to show what are the most important features of this newly introduced resource to the NLP community.

[23] Predicting Contextual Informativeness for Vocabulary Learning using Deep Learning

Tao Wu, Adam Kapelner

Main category: cs.CL

TL;DR: A deep learning system for automatically selecting high-quality contextual examples for vocabulary instruction, comparing unsupervised and supervised embedding approaches with novel evaluation metrics.

DetailsMotivation: To develop an automated system for identifying informative contextual examples for first language vocabulary instruction, addressing the need for high-quality teaching materials at scale.

Method: Three modeling approaches: (1) unsupervised similarity-based using MPNet embeddings, (2) supervised framework with instruction-aware fine-tuned Qwen3 embeddings and nonlinear regression head, and (3) model (2) plus handcrafted context features. Introduced Retention Competency Curve metric for evaluation.

Result: Model (3) achieved the best performance with a good-to-bad ratio of 440 while discarding only 70% of good contexts, demonstrating that supervised embedding models with human guidance can produce near-perfect teaching contexts.

Conclusion: Modern embedding models with neural network architectures, when combined with human supervision, can efficiently generate high-quality contextual examples for vocabulary instruction at scale.

Abstract: We describe a modern deep learning system that automatically identifies informative contextual examples (\qu{contexts}) for first language vocabulary instruction for high school student. Our paper compares three modeling approaches: (i) an unsupervised similarity-based strategy using MPNet’s uniformly contextualized embeddings, (ii) a supervised framework built on instruction-aware, fine-tuned Qwen3 embeddings with a nonlinear regression head and (iii) model (ii) plus handcrafted context features. We introduce a novel metric called the Retention Competency Curve to visualize trade-offs between the discarded proportion of good contexts and the \qu{good-to-bad} contexts ratio providing a compact, unified lens on model performance. Model (iii) delivers the most dramatic gains with performance of a good-to-bad ratio of 440 all while only throwing out 70% of the good contexts. In summary, we demonstrate that a modern embedding model on neural network architecture, when guided by human supervision, results in a low-cost large supply of near-perfect contexts for teaching vocabulary for a variety of target words.

[24] Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System

Pavithra PM Nair, Preethu Rose Anish

Main category: cs.CL

TL;DR: Vichara is an AI framework for predicting and explaining appellate judgments in the Indian legal system by decomposing case documents into structured decision points and generating IRAC-style explanations.

DetailsMotivation: To address India's extensive court backlog by developing an AI system that can accurately predict appellate judgments while providing interpretable explanations for legal professionals.

Method: Processes English-language appellate case documents, decomposes them into structured decision points (legal issue, authority, outcome, reasoning, temporal context), and generates explanations using an IRAC-inspired framework adapted for Indian legal reasoning. Evaluated with four LLMs (GPT-4o mini, Llama-3.1-8B, Mistral-7B, Qwen2.5-7B) on PredEx and ILDC_expert datasets.

Result: Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert). Human evaluation shows GPT-4o mini’s superior interpretability across Clarity, Linking, and Usefulness metrics.

Conclusion: Vichara demonstrates effective AI-powered legal judgment prediction with interpretable explanations, offering potential to address court backlogs while maintaining transparency for legal professionals.

Abstract: In jurisdictions like India, where courts face an extensive backlog of cases, artificial intelligence offers transformative potential for legal judgment prediction. A critical subset of this backlog comprises appellate cases, which are formal decisions issued by higher courts reviewing the rulings of lower courts. To this end, we present Vichara, a novel framework tailored to the Indian judicial system that predicts and explains appellate judgments. Vichara processes English-language appellate case proceeding documents and decomposes them into decision points. Decision points are discrete legal determinations that encapsulate the legal issue, deciding authority, outcome, reasoning, and temporal context. The structured representation isolates the core determinations and their context, enabling accurate predictions and interpretable explanations. Vichara’s explanations follow a structured format inspired by the IRAC (Issue-Rule-Application-Conclusion) framework and adapted for Indian legal reasoning. This enhances interpretability, allowing legal professionals to assess the soundness of predictions efficiently. We evaluate Vichara on two datasets, PredEx and the expert-annotated subset of the Indian Legal Documents Corpus (ILDC_expert), using four large language models: GPT-4o mini, Llama-3.1-8B, Mistral-7B, and Qwen2.5-7B. Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B. Human evaluation of the generated explanations across Clarity, Linking, and Usefulness metrics highlights GPT-4o mini’s superior interpretability.

[25] Validating Political Position Predictions of Arguments

Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn

Main category: cs.CL

TL;DR: A dual-scale validation framework for political stance prediction combining pointwise and pairwise human annotation to address subjectivity in knowledge representation, applied to 23,228 arguments from UK political debates.

DetailsMotivation: Real-world knowledge representation often involves subjective, continuous attributes (like political positions) that conflict with traditional pairwise validation methods, requiring new approaches to handle subjectivity in human evaluation.

Method: Dual-scale validation framework combining pointwise and pairwise human annotation, applied to political stance prediction using 22 language models on 23,228 arguments from 30 UK political TV debates (Question Time).

Result: Pointwise evaluation shows moderate human-model agreement (α=0.578), reflecting intrinsic subjectivity, while pairwise validation reveals substantially stronger alignment between human- and model-derived rankings (α=0.86 for best model).

Conclusion: The work contributes: (1) practical validation methodology for subjective continuous knowledge, (2) validated structured argumentation knowledge base for political domains, and (3) evidence that ordinal structure can be extracted from pointwise language model predictions in subjective discourse.

Abstract: Real-world knowledge representation often requires capturing subjective, continuous attributes – such as political positions – that conflict with pairwise validation, the widely accepted gold standard for human evaluation. We address this challenge through a dual-scale validation framework applied to political stance prediction in argumentative discourse, combining pointwise and pairwise human annotation. Using 22 language models, we construct a large-scale knowledge base of political position predictions for 23,228 arguments drawn from 30 debates that appeared on the UK politicial television programme \textit{Question Time}. Pointwise evaluation shows moderate human-model agreement (Krippendorff’s $α=0.578$), reflecting intrinsic subjectivity, while pairwise validation reveals substantially stronger alignment between human- and model-derived rankings ($α=0.86$ for the best model). This work contributes: (i) a practical validation methodology for subjective continuous knowledge that balances scalability with reliability; (ii) a validated structured argumentation knowledge base enabling graph-based reasoning and retrieval-augmented generation in political domains; and (iii) evidence that ordinal structure can be extracted from pointwise language models predictions from inherently subjective real-world discourse, advancing knowledge representation capabilities for domains where traditional symbolic or categorical approaches are insufficient.

[26] SPQ: An Ensemble Technique for Large Language Model Compression

Jiamin Yao, Eren Gultepe

Main category: cs.CL

TL;DR: SPQ is an ensemble compression technique for LLMs combining SVD, pruning, and quantization to reduce memory usage while maintaining performance.

DetailsMotivation: To enable practical deployment of large language models in memory-constrained environments by developing an effective compression method that combines complementary techniques.

Method: Combines three complementary compression techniques: variance-retained SVD for attention projections, activation-based pruning for MLP layers, and post-training 8-bit linear quantization for all linear layers.

Result: Achieves up to 75% memory reduction on LLaMA-2-7B while improving perplexity (WikiText-2 from 5.47 to 4.91) and maintaining downstream task accuracy. Outperforms GPTQ and SparseGPT in memory efficiency (6.86GB vs 7.16GB) with up to 1.9x inference speedup.

Conclusion: SPQ demonstrates that combining complementary compression techniques (SVD, pruning, quantization) provides superior compression efficiency compared to individual methods, enabling practical LLM deployment in memory-constrained environments.

Abstract: This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD reduces attention projections into compact low-rank factors, iii) and 8-bit quantization uniformly compresses all linear layers. At matched compression ratios, SPQ outperforms individual methods (SVD-only, pruning-only, or quantization-only) in perplexity, demonstrating the benefit of combining complementary techniques. Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K. Compared to strong baselines like GPTQ and SparseGPT, SPQ offers competitive perplexity and accuracy while using less memory (6.86 GB vs. 7.16 GB for GPTQ). Moreover, SPQ improves inference throughput over GPTQ, achieving up to a 1.9x speedup, which further enhances its practicality for real-world deployment. The effectiveness of SPQ’s robust compression through layer-aware and complementary compression techniques may provide practical deployment of LLMs in memory-constrained environments. Code is available at: https://github.com/JiaminYao/SPQ_LLM_Compression/

[27] RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering

Deniz Qian, Hung-Ting Chen, Eunsol Choi

Main category: cs.CL

TL;DR: RVR is a multi-round retrieval framework that iteratively retrieves and verifies documents to maximize answer coverage for queries with multiple valid answers.

DetailsMotivation: Current retrieval systems struggle with queries that have diverse valid answers, often failing to comprehensively cover all possible answers. There's a need for methods that can systematically explore answer space beyond initial retrieval results.

Method: Retrieve-Verify-Retrieve (RVR) framework: 1) Initial retrieval with original query, 2) Verification to identify high-quality subset, 3) Query augmentation with verified documents, 4) Repeat retrieval with augmented queries to uncover uncovered answers.

Result: RVR outperforms baselines including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on QAMPARI dataset. Consistent gains also observed on out-of-domain datasets (QUEST and WebQuestionsSP).

Conclusion: RVR presents an effective iterative approach for comprehensive answer recall that leverages verification and adapts retrievers to new inference scenarios, even working well with off-the-shelf retrievers.

Abstract: Comprehensively retrieving diverse documents is crucial to address queries that admit a wide range of valid answers. We introduce retrieve-verify-retrieve (RVR), a multi-round retrieval framework designed to maximize answer coverage. Initially, a retriever takes the original query and returns a candidate document set, followed by a verifier that identifies a high-quality subset. For subsequent rounds, the query is augmented with previously verified documents to uncover answers that are not yet covered in previous rounds. RVR is effective even with off-the-shelf retrievers, and fine-tuning retrievers for our inference procedure brings further gains. Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI). We also see consistent gains on two out-of-domain datasets (QUEST and WebQuestionsSP) across different base retrievers. Our work presents a promising iterative approach for comprehensive answer recall leveraging a verifier and adapting retrievers to a new inference scenario.

[28] VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning

Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha, Amit Sheth

Main category: cs.CL

TL;DR: VIRAASAT introduces a semi-automated multi-hop QA dataset for Indian culture evaluation and proposes Symbolic Chain-of-Manipulation (SCoM) to improve LLM reasoning on cultural knowledge graphs.

DetailsMotivation: LLMs struggle with tasks requiring rich socio-cultural knowledge and diverse local contexts, especially Indian culture. Existing cultural benchmarks are manually crafted, single-hop, and costly to scale, leaving this deficiency largely unmeasured.

Method: Developed VIRAASAT dataset using a semi-automated multi-hop approach based on a knowledge graph with 700+ expert-curated cultural artifacts covering 13 attributes across all Indian states/UTs. Proposed Symbolic Chain-of-Manipulation (SCoM) framework that trains models to simulate atomic knowledge graph manipulations internally for reliable graph traversal.

Result: Generated 3,200+ multi-hop questions requiring chained cultural reasoning. SCoM outperformed standard Chain-of-Thought baselines by up to 20% in supervised fine-tuning experiments, demonstrating improved ability to ground and synthesize low-probability facts.

Conclusion: VIRAASAT provides a foundation for culturally aware reasoning models. SCoM effectively addresses LLM limitations in cultural reasoning by teaching models to reliably traverse knowledge graph structures.

Abstract: Large Language Models (LLMs) have made significant progress in reasoning tasks across various domains such as mathematics and coding. However, their performance deteriorates in tasks requiring rich socio-cultural knowledge and diverse local contexts, particularly those involving Indian Culture. Existing Cultural benchmarks are (i) Manually crafted, (ii) contain single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured. To address this, we introduce VIRAASAT, a novel, semi-automated multi-hop approach for generating cultural specific multi-hop Question-Answering dataset for Indian culture. VIRAASAT leverages a Knowledge Graph comprising more than 700 expert-curated cultural artifacts, covering 13 key attributes of Indian culture (history, festivals, etc). VIRAASAT spans all 28 states and 8 Union Territories, yielding more than 3,200 multi-hop questions that necessitate chained cultural reasoning. We evaluate current State-of-the-Art (SOTA) LLMs on VIRAASAT and identify key limitations in reasoning wherein fine-tuning on Chain-of-Thought(CoT) traces fails to ground and synthesize low-probability facts. To bridge this gap, we propose a novel framework named Symbolic Chain-of-Manipulation (SCoM). Adapting the Chain-of-Manipulation paradigm, we train the model to simulate atomic Knowledge Graph manipulations internally. SCoM teaches the model to reliably traverse the topological structure of the graph. Experiments on Supervised Fine-Tuning (SFT) demonstrate that SCoM outperforms standard CoT baselines by up to 20%. We release the VIRAASAT dataset along with our findings, laying a strong foundation towards building Culturally Aware Reasoning Models.

[29] Topic Modeling with Fine-tuning LLMs and Bag of Sentences

Johannes Schneider

Main category: cs.CL

TL;DR: FT-Topic enables unsupervised fine-tuning of LLM encoders for improved topic modeling by automatically constructing training data from sentence groups, leading to state-of-the-art performance with SenClu method.

DetailsMotivation: While LLMs outperform classical topic models, they're typically used out-of-the-box despite fine-tuning's known benefits. The main challenge is obtaining labeled datasets for fine-tuning in an unsupervised setting.

Method: FT-Topic uses bags of sentences as basic units, automatically constructs training data through: 1) heuristic identification of same/different topic sentence group pairs, 2) removal of likely incorrect labels. The fine-tuned encoder can be used by any embedding-based topic modeling approach, with SenClu demonstrating fast inference via EM algorithm and hard assignments.

Result: FT-Topic enables effective unsupervised fine-tuning of LLM encoders, and SenClu achieves state-of-the-art performance in topic modeling with fast inference and ability to incorporate prior knowledge.

Conclusion: Unsupervised fine-tuning of LLM encoders for topic modeling is feasible and effective through automatic training data construction, leading to improved performance over out-of-the-box LLM usage.

Abstract: Large language models (LLMs) are increasingly used for topic modeling, outperforming classical topic models such as LDA. Commonly, pre-trained LLM encoders such as BERT are used out-of-the-box despite the fact that fine-tuning is known to improve LLMs considerably. The challenge lies in obtaining a suitable labeled dataset for fine-tuning. In this paper, we build on the recent idea of using bags of sentences as the elementary unit for computing topics. Based on this idea, we derive an approach called FT-Topic to perform unsupervised fine-tuning, relying primarily on two steps for constructing a training dataset in an automatic fashion. First, a heuristic method identifies pairs of sentence groups that are assumed to belong either to the same topic or to different topics. Second, we remove sentence pairs that are likely labeled incorrectly. The resulting dataset is then used to fine-tune an encoder LLM, which can be leveraged by any topic modeling approach that uses embeddings. In this work, we demonstrate its effectiveness by deriving a novel state-of-the-art topic modeling method called SenClu. The method achieves fast inference through an expectation-maximization algorithm and hard assignments of sentence groups to a single topic, while allowing users to encode prior knowledge about the topic-document distribution. Code is available at https://github.com/JohnTailor/FT-Topic

[30] Beyond Mimicry to Contextual Guidance: Knowledge Distillation for Interactive AI

Tong Wang, K. Sudhir

Main category: cs.CL

TL;DR: A framework for distilling knowledge from large language models by having a teacher model create reusable textual guidance for specific scenarios, which students retrieve at inference time for adaptive behavior in interactive settings like customer service.

DetailsMotivation: Firms face a tradeoff between using highly capable but costly large language models versus weaker, more deployable models. Existing knowledge distillation methods focus on output imitation but are poorly suited for interactive, multi-turn conversational settings where responses need to be sequenced coherently across states.

Method: Proposes distilling knowledge from output imitation to contextual guidance. A superior teacher model constructs a reusable library of strategic textual guidance for specific scenarios likely to be encountered by the student. During deployment, the student retrieves context-specific guidance at inference time, enabling adaptive behavior without retraining.

Result: In customer-service interactions, this approach improves service quality and customer satisfaction relative to standard fine-tuning while maintaining alignment with firm policies.

Conclusion: Inference-time textual guidance provides a scalable and controllable approach to distillation for interactive AI agents in marketing settings, addressing the limitations of traditional open-loop distillation methods.

Abstract: As large language models increasingly mediate firm - customer interactions, firms face a tradeoff: the most capable models perform well but are costly and difficult to control at scale. Existing knowledge distillation methods address this challenge by training weaker, deployable models to imitate frontier outputs; however, such open-loop approaches are poorly suited to interactive, multi-turn settings where responses must be sequenced coherently across conversational states. We propose a shift in what knowledge is distilled - from output imitation to contextual guidance. We develop a framework in which a superior teacher model constructs a reusable library of strategic textual guidance for particular scenarios likely to be encountered by the student. When deployed, the student retrieves the context-specific guidance at inference time, enabling adaptive behavior without retraining. Using customer-service interactions, we show that this approach improves service quality and customer satisfaction relative to standard fine-tuning while maintaining alignment with firm policies. The results position inference-time textual guidance as a scalable and controllable approach to distillation for interactive AI agents in marketing settings.

[31] HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs

Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen

Main category: cs.CL

TL;DR: HoT (Highlighted Chain-of-Thought) is a prompting technique that uses XML tags to highlight factual statements in LLM responses, reducing hallucinations and improving accuracy across 22+ tasks, though it can mislead users when LLMs are wrong.

DetailsMotivation: LLMs frequently hallucinate non-factual statements, creating responses that mix factual and non-factual content, making it difficult for humans to verify accuracy and make reliable decisions based on LLM outputs.

Method: Proposes Highlighted Chain-of-Thought Prompting (HoT), where LLMs are prompted to generate responses with XML tags that ground facts to those provided in the question. The method involves: 1) reformatting input questions with XML tags highlighting key facts, and 2) generating responses with highlights over facts referenced from the input.

Result: Compared to vanilla CoT, HoT reduces hallucination rates and improves LLM accuracy consistently across over 22 tasks including arithmetic, reading comprehension, and logical reasoning. Human verification studies show highlights help time-limited participants more accurately and efficiently recognize correct LLM responses, but surprisingly, when LLMs are wrong, HoT tends to fool users into believing incorrect answers are correct.

Conclusion: HoT is an effective technique for reducing hallucinations and improving LLM accuracy by explicitly grounding facts through XML highlighting, though it introduces a new risk of misleading users when LLMs produce incorrect but confidently highlighted responses.

Abstract: An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response mixed of factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the question. That is, given an input question, LLMs would first re-format the question to add XML tags highlighting key facts, and then, generate a response with highlights over the facts referenced from the input. Compared to vanilla chain of thought prompting (CoT), HoT reduces the rate of hallucination and separately improves LLM accuracy consistently on over 22 tasks from arithmetic, reading comprehension, to logical reasoning. When asking humans to verify LLM responses, highlights help time-limited participants to more accurately and efficiently recognize when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoTs tend to fool users into believing that an answer is correct.

[32] FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation

Yulia Otmakhova, Hung Thinh Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu, Daniel Beck, Jey Han Lau

Main category: cs.CL

TL;DR: FLUKE is a framework for evaluating model robustness through systematic linguistic variations across multiple levels, revealing task-dependent impacts and significant brittleness in LLMs to natural modifications.

DetailsMotivation: To address the need for systematic robustness evaluation of NLP models by creating a framework that can assess how models handle controlled linguistic variations across different levels (orthography to dialect/style).

Method: FLUKE introduces systematic minimal variations across linguistic levels, uses LLMs with human validation to generate modifications, and evaluates both fine-tuned models and LLMs across six diverse NLP tasks (four classification, two generation).

Result: Findings show: 1) Impact of variations is highly task-dependent, 2) LLMs exhibit significant brittleness to certain variations (reasoning LLMs less robust than base models), 3) Models more brittle to natural modifications than corruption-style tests, 4) Generation ability doesn’t correlate with robustness on downstream tasks.

Conclusion: Systematic robustness testing is crucial for understanding model behaviors, and current models show significant vulnerabilities to natural linguistic variations despite their capabilities.

Abstract: We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels – from orthography to dialect and style – and leverages large language models (LLMs) with human validation to generate modifications. We demonstrate FLUKE’s utility by evaluating both fine-tuned models and LLMs across six diverse NLP tasks (four classification and two generation tasks), and reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) LLMs still exhibit significant brittleness to certain linguistic variations, with reasoning LLMs surprisingly showing less robustness on some tasks compared to base models, and scaling improving robustness only for surface-level modifications; (3) models are overall more brittle to natural, fluent modifications such as syntax or style changes (and especially to negation), compared to corruption-style tests such as letter flipping; (4) the ability of a model to use a linguistic feature in generation does not correlate to its robustness to this feature on downstream tasks. These findings highlight the importance of systematic robustness testing for understanding model behaviors.

[33] ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees

David Smith Sundarsingh, Jun Wang, Jyotirmoy V. Deshmukh, Yiannis Kantaros

Main category: cs.CL

TL;DR: ConformalNL2LTL: A method that uses LLMs with conformal prediction to translate natural language instructions to LTL formulas with guaranteed success rates while minimizing user intervention.

DetailsMotivation: LTL is widely used for autonomous system task specification but requires significant manual effort and expertise. Existing NL-to-LTL translation methods lack correctness guarantees, creating a need for reliable translation with user-defined success rates.

Method: Iteratively constructs LTL formulas by solving open-vocabulary QA problems using LLMs. Uses a primary model with conformal prediction for uncertainty quantification, and an auxiliary model for assistance when confidence thresholds aren’t met. Can request user intervention when needed.

Result: Theoretical and empirical demonstration that ConformalNL2LTL achieves desired translation accuracy while minimizing user intervention.

Conclusion: Proposed method provides reliable NL-to-LTL translation with guaranteed success rates, reducing the expertise barrier for LTL task specification in autonomous systems.

Abstract: Linear Temporal Logic (LTL) is a widely used task specification language for autonomous systems. To mitigate the significant manual effort and expertise required to define LTL-encoded tasks, several methods have been proposed for translating Natural Language (NL) instructions into LTL formulas, which, however, lack correctness guarantees. To address this, we propose a new NL-to-LTL translation method, called ConformalNL2LTL that achieves user-defined translation success rates on unseen NL commands. Our method constructs LTL formulas iteratively by solving a sequence of open-vocabulary question-answering (QA) problems using large language models (LLMs). These QA tasks are handled collaboratively by a primary and an auxiliary model. The primary model answers each QA instance while quantifying uncertainty via conformal prediction; when it is insufficiently certain according to user-defined confidence thresholds, it requests assistance from the auxiliary model and, if necessary, from the user. We demonstrate theoretically and empirically that ConformalNL2LTL achieves the desired translation accuracy while minimizing user intervention.

[34] Entailed Opinion Matters: Improving the Fact-Checking Performance of Language Models by Relying on their Entailment Ability

Gaurav Kumar, Ayush Garg, Debajyoti Mazumder, Aditya Kishore, Babu kumar, Jasabanta Patro

Main category: cs.CL

TL;DR: A new learning paradigm for fact-checking that uses generative language models’ evidence classification and justifications to train encoder-only models, achieving improved accuracy through comprehensive experiments.

DetailsMotivation: Automated fact-checking remains challenging with insufficient accuracy for real-world deployment despite various approaches like end-to-end training, retrieval-augmented generation, and prompt engineering.

Method: Proposes a novel learning paradigm where evidence classification and entailed justifications generated by GLMs are used to train ELMs, with comprehensive experiments comparing various prompting and fine-tuning strategies.

Result: The approach shows improved performance compared to recent works, with additional ablation studies, error analysis, explanation quality assessment, and domain generalization studies providing comprehensive understanding.

Conclusion: The proposed paradigm effectively leverages generative models’ capabilities to enhance encoder-only models for fact-checking, offering a promising direction for improving automated fact-checking systems.

Abstract: Automated fact-checking has been a challenging task for the research community. Prior work has explored various strategies, such as end-to-end training, retrieval-augmented generation, and prompt engineering, to build robust fact-checking systems. However, their accuracy has not been high enough for real-world deployment. We, on the other hand, propose a new learning paradigm, where evidence classification and entailed justifications made by generative language models (GLMs) are used to train encoder-only language models (ELMs). We conducted a rigorous set of experiments, comparing our approach with recent works along with various prompting and fine-tuning strategies. Additionally, we performed ablation studies, error analysis, quality analysis of model explanations, and a domain generalisation study to provide a comprehensive understanding of our approach.

[35] PonderLM: Pretraining Language Models to Ponder in Continuous Space

Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li, Ziwei He, Xinbing Wang, Zhiyu Li, Zhouhan Lin

Main category: cs.CL

TL;DR: Introduces pondering mechanism into language models by repeatedly invoking forward passes within single token generation, using weighted token embeddings instead of sampling tokens, enabling deeper cognitive processing without human annotations.

DetailsMotivation: Humans engage in pondering before articulating complex thoughts, allowing deeper cognitive processing. Current language models generate tokens directly without this internal deliberation process, potentially limiting their ability to handle complex reasoning tasks.

Method: Proposes pondering process where during token generation, instead of sampling a token, the model outputs a weighted sum of all token embeddings based on predicted distribution, then feeds this embedding back as input for another forward pass. This creates iterative internal deliberation within a single generation step, trained via self-supervised learning without human annotations.

Result: Experiments across GPT-2, Pythia, and LLaMA architectures show effectiveness. On 9 downstream benchmarks, pondering-enhanced Pythia models outperform official versions. PonderPythia-2.8B surpasses Pythia-6.9B and rivals Pythia-12B, while PonderPythia-1B matches TinyLlama-1.1B trained on 10x more data.

Conclusion: Pondering mechanism enables language models to engage in deeper cognitive processing similar to humans, significantly improving performance across architectures and benchmarks without additional training data or human supervision.

Abstract: Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort. In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Experiments across three widely used open-source architectures-GPT-2, Pythia, and LLaMA-and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, our PonderPythia models demonstrate remarkable effectiveness: PonderPythia-2.8B surpasses Pythia-6.9B and rivals Pythia-12B, while our PonderPythia-1B matches TinyLlama-1.1B, a model trained on 10 times more data. The code is available at https://github.com/LUMIA-Group/PonderingLM.

[36] Structure-Augmented Reasoning Generation

Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han

Main category: cs.CL

TL;DR: SARG enhances RAG by creating explicit reasoning structures from retrieved documents to improve multi-hop question answering through knowledge graph construction and traversal.

DetailsMotivation: Standard RAG pipelines treat retrieved documents as independent text chunks, forcing models to implicitly connect information across fragmented context, which is problematic for multi-hop queries requiring synthesis of information scattered across different documents.

Method: Three-stage framework: 1) Extract relational triples from retrieved documents via few-shot prompting, 2) Organize triples into domain-adaptive knowledge graph, 3) Perform multi-hop traversal to identify relevant reasoning chains, then integrate chains with text chunks into generation prompts.

Result: Significantly outperforms state-of-the-art flat-context RAG baselines on open-domain QA benchmarks and specialized reasoning datasets in finance and medicine, improving both factual accuracy and reasoning coherence while providing traceable inference.

Conclusion: SARG addresses the fragmentation problem in standard RAG by materializing explicit reasoning structures, offering a modular enhancement that improves multi-hop reasoning without requiring custom retrievers or domain-specific fine-tuning.

Abstract: Recent advances in Large Language Models (LLMs) have significantly improved complex reasoning capabilities. Retrieval-Augmented Generation (RAG) has further extended these capabilities by grounding generation in dynamically retrieved evidence, enabling access to information beyond the model’s training parameters. However, while RAG addresses knowledge availability, standard pipelines treat retrieved documents as independent, unstructured text chunks, forcing models to implicitly connect information across fragmented context. This limitation becomes critical for multi-hop queries, where answering correctly requires synthesizing information scattered across different documents. We present Structure-Augmented Reasoning Generation (SARG), a post-retrieval framework that addresses this gap by materializing explicit reasoning structures from retrieved context. SARG operates in three stages: extracting relational triples from retrieved documents via few-shot prompting, organizing these triples into a domain-adaptive knowledge graph, and performing multi-hop traversal to identify relevant reasoning chains. These chains, along with their associated text chunks, are then integrated into the generation prompt to explicitly guide the model’s reasoning process. Importantly, SARG doesn’t require custom retrievers or domain-specific fine-tuning. Instead, it functions as a modular layer compatible with all existing RAG pipelines. Extensive experiments on open-domain QA benchmarks and specialized reasoning datasets in finance and medicine demonstrate that SARG significantly outperforms state-of-the-art flat-context RAG baselines in both factual accuracy and reasoning coherence. Furthermore, by surfacing the exact traversal paths used during generation, SARG provides fully traceable and interpretable inference.

[37] Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders

Mathis Le Bail, Jérémie Dentan, Davide Buscaldi, Sonia Vanier

Main category: cs.CL

TL;DR: SAE-based explainability methods for sentence classification, introducing ClassifSAE with specialized classifier head and sparsity loss, benchmarking against existing methods with novel evaluation metrics.

DetailsMotivation: Sparse Autoencoders (SAEs) have been successful for probing LLMs and extracting interpretable concepts, but their effectiveness hasn't been extensively explored for sentence classification tasks.

Method: Proposes ClassifSAE model with specialized classifier head and activation rate sparsity loss for text classification. Benchmarks against ConceptShap, ICA, HI-Concept, and TopK-SAE baseline across multiple classification benchmarks and backbone LLMs. Introduces two novel metrics using external sentence encoder to measure precision of concept-based explanations.

Result: Empirical results show that ClassifSAE improves both the causality and interpretability of extracted features compared to established methods.

Conclusion: SAE-based approaches can be effectively adapted for sentence classification tasks, with ClassifSAE demonstrating superior performance in extracting causal and interpretable features for text classification.

Abstract: Sparse Autoencoders (SAEs) have been successfully used to probe Large Language Models (LLMs) and extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that correspond to human-interpretable features. In this paper, we investigate the effectiveness of SAE-based explainability approaches for sentence classification, a domain where such methods have not been extensively explored. We present a novel SAE-based model ClassifSAE tailored for text classification, leveraging a specialized classifier head and incorporating an activation rate sparsity loss. We benchmark this architecture against established methods such as ConceptShap, Independent Component Analysis, HI-Concept and a standard TopK-SAE baseline. Our evaluation covers several classification benchmarks and backbone LLMs. We further enrich our analysis with two novel metrics for measuring the precision of concept-based explanations, using an external sentence encoder. Our empirical results show that ClassifSAE improves both the causality and interpretability of the extracted features.

[38] Anthropomimetic Uncertainty: What Verbalized Uncertainty in Language Models is Missing

Dennis Ulmer, Alexandra Lorson, Ivan Titov, Christian Hardmeier

Main category: cs.CL

TL;DR: The paper argues for “anthropomimetic uncertainty” - having language models express uncertainty in ways that mimic human linguistic behaviors to improve trustworthiness and human-machine collaboration.

DetailsMotivation: LLMs are often overconfident even when wrong, undermining trust. There's a need for better uncertainty signaling to enable effective human-machine collaboration and mitigate potential harms.

Method: The paper presents a comprehensive overview of human uncertainty communication research, surveys NLP approaches, and performs additional analyses to identify underexplored biases in verbalized uncertainty.

Result: The analysis reveals that current NLP research overlooks nuances in human uncertainty communication and biases that affect machine communication. The paper demonstrates unexplored biases in verbalized uncertainty.

Conclusion: The paper advocates for anthropomimetic uncertainty - imitating human linguistic behaviors for intuitive uncertainty communication - and outlines future research directions for implementing this approach in human-machine interactions.

Abstract: Human users increasingly communicate with large language models (LLMs), but LLMs suffer from frequent overconfidence in their output, even when its accuracy is questionable, which undermines their trustworthiness and perceived legitimacy. Therefore, there is a need for language models to signal their confidence in order to reap the benefits of human-machine collaboration and mitigate potential harms. Verbalized uncertainty is the expression of confidence with linguistic means, an approach that integrates perfectly into language-based interfaces. Most recent research in natural language processing (NLP) overlooks the nuances surrounding human uncertainty communication and the biases that influence the communication of and with machines. We argue for anthropomimetic uncertainty, the principle that intuitive and trustworthy uncertainty communication requires a degree of imitation of human linguistic behaviors. We present a thorough overview of the research in human uncertainty communication, survey ongoing research in NLP, and perform additional analyses to demonstrate so-far underexplored biases in verbalized uncertainty. We conclude by pointing out unique factors in human-machine uncertainty and outlining future research directions towards implementing anthropomimetic uncertainty.

[39] CoAct-1: Computer-using Multi-Agent System with Coding Actions

Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, Ran Xu, Caiming Xiong

Main category: cs.CL

TL;DR: CoAct-1 introduces a hybrid multi-agent system combining GUI control with programmatic execution to improve computer automation efficiency and reliability.

DetailsMotivation: Current autonomous agents operating through GUIs struggle with efficiency and reliability on complex tasks, constrained by performing all actions through GUI manipulation which leads to brittleness and inefficiency.

Method: CoAct-1 uses a multi-agent system with an Orchestrator that dynamically delegates subtasks to either a GUI Operator or a specialized Programmer agent that can write and execute Python or Bash scripts, enabling hybrid GUI-programmatic control.

Result: Achieves state-of-the-art 60.76% success rate on OSWorld benchmark, reduces average steps from 15 to 10.15 compared to GUI-only agents, demonstrating superior efficiency and reliability.

Conclusion: Integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation compared to GUI-only approaches.

Abstract: Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as a enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.

[40] Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning

Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis

Main category: cs.CL

TL;DR: MASA proposes structured weight sharing across transformer layers using dictionary learning to reduce attention parameters by 66.7% while maintaining performance across language and vision tasks.

DetailsMotivation: Transformer models have high computational and memory demands due to parameter redundancy across layers. While existing compression techniques focus on intra-block optimizations, the repetitive layered structure implies significant inter-block redundancy that remains largely unexplored beyond KV caching.

Method: MASA decomposes attention projection matrices (Q, K, V, O) into shared dictionary atoms, representing each layer’s weights as linear combinations of these shared matrix atoms. It operates as a drop-in replacement trained with standard optimizers without requiring distillation or architectural changes.

Result: Experiments across 100M-700M parameter models show MASA achieves better benchmark accuracy and perplexity than GQA, low-rank baselines, and recent sharing methods at comparable parameter budgets. Extending to Vision Transformers, it matches performance on image classification with 66.7% fewer attention parameters.

Conclusion: MASA offers a scalable blueprint for parameter-efficient transformer models without sacrificing performance by combining dictionary learning with transformer efficiency. It also shows potential for reducing parameters in large pretrained models without significant performance drops.

Abstract: Large language models have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g., low-rank approximation or attention pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy - a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in convolutional networks, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices (Q, K, V, O) into shared dictionary atoms, reducing the attention module’s parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement-trained with standard optimizers - and represents each layer’s weights as linear combinations of shared matrix atoms. Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than GQA, low-rank baselines and recent Repeat-all-over/Sequential sharing at comparable parameter budgets. Ablation studies confirm robustness to the dictionary size and the efficacy of shared representations in capturing cross-layer statistical regularities. Extending to Vision Transformers (ViT), MASA matches performance metrics on image classification tasks with 66.7% fewer attention parameters. By combining dictionary learning strategies with transformer efficiency, MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance. Finally, we investigate the possibility of employing MASA on large pretrained models to reduce their number of parameters without experiencing any significant drop in their performance.

[41] Probability Distributions Computed by Autoregressive Transformers

Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski, Ryan Cotterell, David Chiang

Main category: cs.CL

TL;DR: Transformers as probabilistic language models (autoregressive generation) vs. language recognizers (accept/reject strings), exploring expressivity differences in these operational modes.

DetailsMotivation: Most theoretical expressivity results for transformers treat them as language recognizers, but in practice they're used as autoregressive language models that generate strings probabilistically. The paper aims to characterize the probability distributions that transformer language models can express and understand how different operational modes affect expressivity.

Method: Theoretical analysis comparing transformers as language recognizers vs. language models, examining how making transformers autoregressive and probabilistic affects their expressivity. Investigates whether autoregressive operation increases expressivity and how probabilistic modeling breaks equivalences that hold in non-probabilistic cases.

Result: Shows that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Characterizes the probability distributions that transformer language models can express.

Conclusion: The paper teases apart what functions transformers are capable of expressing in their most common use-case as language models, providing theoretical insights into the expressivity differences between recognition and generation modes.

Abstract: Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). We characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.

[42] When Distributions Shifts: Causal Generalization for Low-Resource Languages

Mahi Aliyu Aminu, Chisom Chibuike, Fatimo Adebanjo, Omokolade Awosanya, Samuel Oyeneye

Main category: cs.CL

TL;DR: Causal domain generalization methods for low-resource NLP: using GPT-4o-mini for counterfactual data augmentation on African language sentiment analysis, and extending DINER framework to multilingual aspect-based sentiment analysis with new Afri-SemEval dataset.

DetailsMotivation: Address machine learning model failures under distribution shifts, especially in low-resource settings where limited data restricts robust generalization. Focus on domain generalization using causal principles for natural language processing.

Method: Two approaches: 1) Causal data augmentation using GPT-4o-mini to generate counterfactual paraphrases for sentiment classification on NaijaSenti Twitter corpus (Yoruba and Igbo). 2) Invariant causal representation learning with DINER framework extended to multilingual setting, introducing Afri-SemEval dataset (17 languages translated from SemEval-2014).

Result: Improved robustness to unseen domains with consistent gains from counterfactual augmentation and enhanced out-of-distribution performance from causal representation learning across multiple languages.

Conclusion: Causal domain generalization methods effectively address distribution shift challenges in low-resource NLP settings, demonstrating practical value through data augmentation and invariant representation learning approaches.

Abstract: Machine learning models often fail under distribution shifts, a problem exacerbated in low-resource settings where limited data restricts robust generalization. Domain generalization(DG) methods address this challenge by learning representations that remain invariant across domains, frequently leveraging causal principles. In this work, we study two causal DG approaches for low-resource natural language processing. First, we apply causal data augmentation using GPT-4o-mini to generate counterfactual paraphrases for sentiment classification on the NaijaSenti Twitter corpus in Yoruba and Igbo. Second, we investigate invariant causal representation learning with the Debiasing in Aspect Review (DINER) framework for aspect-based sentiment analysis. We extend DINER to a multilingual setting by introducing Afri-SemEval, a dataset of 17 languages translated from SemEval-2014 Task. Experiments show improved robustness to unseen domains, with consistent gains from counterfactual augmentation and enhanced out-of-distribution performance from causal representation learning across multiple languages.

[43] Batch Prompting Suppresses Overthinking Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models

Saurabh Srivastava, Janit Bidhan, Hao Yan, Abhishek Dey, Tanu Kansal, Paras Kath, Sina Mansouri, Mohit Marvania, Vamsi Shankar Simhadri, Gaurav Singh

Main category: cs.CL

TL;DR: Batch prompting reduces overthinking in Large Reasoning Models by 76% while maintaining accuracy, serving as an effective inference-time technique without model modification.

DetailsMotivation: Large Reasoning Models suffer from overthinking - generating excessive reasoning tokens even for simple queries, which increases costs and can cause API timeouts that hurt accuracy.

Method: Empirical study using batch prompting across 13 benchmarks with DeepSeek-R1 and OpenAI-o1 models, analyzing behavioral effects of batching on reasoning patterns.

Result: Batch prompting reduces reasoning tokens by 76% (2,950→710) on average while preserving or improving accuracy, and induces beneficial effects like reduced per-query effort, pattern induction, and suppression of hedging behavior.

Conclusion: Batch prompting is more than just cost optimization - it’s a practical inference-time technique that improves efficiency and reliability of Large Reasoning Models without requiring model modifications.

Abstract: Large Reasoning Models (LRMs) achieve strong performance through explicit chain-of-thought reasoning but suffer from \textit{overthinking}: generating excessive reasoning tokens even for trivial queries. {Beyond inflating cost, overthinking can be self-defeating: models enter recursive self-doubt loops that exhaust token budgets without producing an answer, causing API timeouts that directly hurt accuracy.} We present an empirical study showing that \textbf{batch prompting}, originally introduced for throughput optimization, effectively suppresses overthinking at inference time. Across 13 diverse benchmarks with DeepSeek-R1 and OpenAI-o1, batch prompting {reduces reasoning tokens by 76% (2{,}950$\mapsto$710), on average, while preserving or improving accuracy}. Through behavioral analysis, we find that batching induces three beneficial effects: (1) it reduces per-query reasoning effort when multiple queries share a context; (2) it enables pattern induction, where models generalize from earlier examples to solve later ones; and (3) it suppresses hedging behavior (e.g., \texttt{wait,}'' \texttt{let me double-check}’’) that signals metacognitive loops. We also show that explicit prompt constraints (``\texttt{Use no more than 100 tokens in thinking.}’’) fail to reduce overthinking; models either ignore them or sacrifice accuracy. These findings reframe batch prompting as more than a cost optimization: it is a practical inference-time technique that improves efficiency and reliability without model modification.

[44] MUCH: A Multilingual Claim Hallucination Benchmark

Jérémie Dentan, Alexi Canesse, Davide Buscaldi, Aymen Shabou, Sonia Vanier

Main category: cs.CL

TL;DR: MUCH is the first claim-level uncertainty quantification benchmark for LLMs with multilingual support, deterministic claim segmentation, and released generation logits for reproducible evaluation.

DetailsMotivation: Address the lack of reliability in Large Language Models by developing a fair and reproducible benchmark for claim-level uncertainty quantification under realistic deployment conditions.

Method: Created MUCH benchmark with 4,873 samples across four European languages, released 24 generation logits per token, and developed a deterministic algorithm for efficient claim segmentation requiring only 0.2% of LLM generation time.

Result: Current uncertainty quantification methods show substantial room for improvement in both performance and efficiency, highlighting the need for better approaches.

Conclusion: MUCH provides a comprehensive benchmark for evaluating claim-level uncertainty quantification methods with multilingual support, efficient segmentation, and reproducible evaluation capabilities.

Abstract: Claim-level Uncertainty Quantification (UQ) is a promising approach to mitigate the lack of reliability in Large Language Models (LLMs). We introduce MUCH, the first claim-level UQ benchmark designed for fair and reproducible evaluation of future methods under realistic conditions. It includes 4,873 samples across four European languages (English, French, Spanish, and German) and four instruction-tuned open-weight LLMs. Unlike prior claim-level benchmarks, we release 24 generation logits per token, facilitating the development of future white-box methods without re-generating data. Moreover, in contrast to previous benchmarks that rely on manual or LLM-based segmentation, we propose a new deterministic algorithm capable of segmenting claims using as little as 0.2% of the LLM generation time. This makes our segmentation approach suitable for real-time monitoring of LLM outputs, ensuring that MUCH evaluates UQ methods under realistic deployment constraints. Finally, our evaluations show that current methods still have substantial room for improvement in both performance and efficiency.

[45] Cross-Lingual Interleaving for Speech Language Models

Adel Moumen, Guangzhi Sun, Philip C. Woodland

Main category: cs.CL

TL;DR: Cross-lingual interleaving method for Spoken Language Models that mixes speech tokens across languages without text supervision, improving multilingual understanding and conversation capabilities.

DetailsMotivation: Current Spoken Language Models (SLMs) are mostly English-centric due to scarce spoken evaluation benchmarks and training data for other languages, limiting cross-lingual learning and access to NLP technologies for languages with limited written resources.

Method: Proposes cross-lingual interleaving method that mixes speech tokens across languages without textual supervision. Releases EN-FR training dataset TinyStories (~42k hours) and spoken StoryCloze/TopicCloze benchmarks synthetically generated using GPT-4 for cross-lingual semantic evaluation.

Result: On 360M and 1B parameter SLMs with matched training-token budgets, interleaving improves monolingual semantic accuracy, enables robust cross-lingual continuation, and strengthens cross-lingual hidden-state alignment.

Conclusion: Cross-lingual interleaving is a simple, scalable approach to building multilingual SLMs that can understand and converse across languages, with all resources made open-source for reproducibility.

Abstract: Spoken Language Models (SLMs) aim to learn linguistic competence directly from speech using discrete units, widening access to Natural Language Processing (NLP) technologies for languages with limited written resources. However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult. We present a cross-lingual interleaving method that mixes speech tokens across languages without textual supervision. We also release an EN-FR training dataset, TinyStories (~42k hours), together with EN-FR spoken StoryCloze and TopicCloze benchmarks for cross-lingual semantic evaluation, both synthetically generated using GPT-4. On 360M and 1B SLMs under matched training-token budgets, interleaving improves monolingual semantic accuracy, enables robust cross-lingual continuation, and strengthens cross-lingual hidden-state alignment. Taken together, these results indicate that cross-lingual interleaving is a simple, scalable route to building multilingual SLMs that understand and converse across languages. All resources will be made open-source to support reproducibility.

[46] WISE: Web Information Satire and Fakeness Evaluation

Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury

Main category: cs.CL

TL;DR: Lightweight transformer models benchmarked for distinguishing fake news from satire, with MiniLM achieving highest accuracy (87.58%) and RoBERTa-base achieving highest ROC-AUC (95.42%).

DetailsMotivation: The challenge of distinguishing fake news from satire due to overlapping linguistic features but divergent intent, requiring effective misinformation detection systems for real-world deployment.

Method: Developed WISE framework benchmarking eight lightweight transformer models and two baselines on 20,000 samples from Fakeddit dataset, using stratified 5-fold cross-validation with comprehensive evaluation metrics.

Result: MiniLM achieved highest accuracy (87.58%), RoBERTa-base achieved highest ROC-AUC (95.42%), DistilBERT offered best efficiency-accuracy trade-off. Statistical tests confirmed significant performance differences between models.

Conclusion: Lightweight models can match or exceed baseline performance, offering practical solutions for deploying misinformation detection in resource-constrained settings.

Abstract: Distinguishing fake or untrue news from satire or humor poses a unique challenge due to their overlapping linguistic features and divergent intent. This study develops WISE (Web Information Satire and Fakeness Evaluation) framework which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as either fake news or satire. Using stratified 5-fold cross-validation, we evaluate models across comprehensive metrics including accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error. Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%). DistilBERT offers an excellent efficiency-accuracy trade-off with 86.28% accuracy and 93.90% ROC-AUC. Statistical tests confirm significant performance differences between models, with paired t-tests and McNemar tests providing rigorous comparisons. Our findings highlight that lightweight models can match or exceed baseline performance, offering actionable insights for deploying misinformation detection systems in real-world, resource-constrained settings.

[47] Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa, Kyle O’Brien

Main category: cs.CL

TL;DR: Pretraining LLMs with different types of AI alignment discourse causally influences their downstream alignment behavior - negative discourse increases misalignment while positive discourse reduces it.

DetailsMotivation: To understand how discourse about AI systems in pretraining corpora causally influences downstream alignment, testing the hypothesis that negative AI descriptions create self-fulfilling misalignment.

Method: Pretrained 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse, upsampling synthetic training documents about either AI misalignment or aligned behavior, then evaluated alignment through post-training.

Result: Discussion of AI contributes to misalignment: upsampling misalignment discourse increases misaligned behavior, while upsampling aligned behavior reduces misalignment scores from 45% to 9%. Effects persist through post-training.

Conclusion: Establishes “alignment pretraining” as a complement to post-training, showing pretraining data shapes alignment priors. Recommends practitioners consider pretraining for alignment alongside capabilities.

Abstract: Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners consider pretraining for alignment alongside capabilities. We share our models, data, and evaluations at AlignmentPretraining.ai.

[48] AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers

Prachuryya Kaushik, Ashish Anand

Main category: cs.CL

TL;DR: AWED-FiNER is an open-source multilingual fine-grained NER system with agentic tool, web app, and 53 expert models covering 36 languages including low-resource ones.

DetailsMotivation: To address the need for multilingual fine-grained named entity recognition accessible to both technical and non-technical users, especially for low-resource languages.

Method: Developed an agentic tool that routes multilingual text to specialized expert models, a web application for non-technical users, and 53 language-specific expert models for offline deployment.

Result: Created a comprehensive FgNER system covering 36 languages spoken by over 6.6 billion people, including global languages and extremely low-resource vulnerable languages.

Conclusion: AWED-FiNER provides scalable, accessible multilingual FgNER solutions with applications in semantic search and structured data extraction across diverse linguistic contexts.

Abstract: Named Entity Recognition (NER) is a foundational task in Natural Language Processing (NLP) and Information Retrieval (IR), which facilitates semantic search and structured data extraction. We introduce \textbf{AWED-FiNER}, an open-source collection of agentic tool, web application, and 53 state-of-the-art expert models that provide Fine-grained Named Entity Recognition (FgNER) solutions across 36 languages spoken by more than 6.6 billion people. The agentic tool enables routing multilingual text to specialized expert models to fetch FgNER annotations within seconds. The web-based platform provides a ready-to-use FgNER annotation service for non-technical users. Moreover, the collection of language-specific extremely small open-source state-of-the-art expert models facilitates offline deployment in resource-constrained scenarios, including edge devices. AWED-FiNER covers languages spoken by over 6.6 billion people, ranging from global languages like English, Chinese, Spanish, and Hindi, to low-resource languages like Assamese, Santali, and Odia, along with a specific focus on extremely low-resource vulnerable languages such as Bodo, Manipuri, Bishnupriya, and Mizo. The resources can be accessed here: Agentic Tool (https://github.com/PrachuryyaKaushik/AWED-FiNER), Web Application (https://hf.co/spaces/prachuryyaIITG/AWED-FiNER), and 53 Expert Detector Models (https://hf.co/collections/prachuryyaIITG/awed-finer).

[49] One Token Is Enough: Improving Diffusion Language Models with a Sink Token

Zihou Zhang, Zheyong Xie, Li Zhong, Haifeng Liu, Shaosheng Cao

Main category: cs.CL

TL;DR: DLMs suffer from moving sink phenomenon causing instability; proposed extra sink token with modified attention mask stabilizes attention sinks and improves performance

DetailsMotivation: Diffusion Language Models (DLMs) enable parallel text generation but suffer from critical instability due to the moving sink phenomenon, where sink tokens unpredictably shift across diffusion steps, undermining inference robustness

Method: Introduce a simple extra sink token with modified attention mask - a special token constrained to attend solely to itself while remaining globally visible to all other tokens, stabilizing attention sinks

Result: Experimental results show that introducing a single extra token stabilizes attention sinks and substantially improves model performance; analysis confirms effectiveness is independent of position and has negligible semantic content

Conclusion: The extra sink token serves as a robust and dedicated structural sink that resolves the moving sink phenomenon in DLMs, improving inference stability and performance

Abstract: Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches, enabling parallel text generation with competitive performance. Despite these advantages, there is a critical instability in DLMs: the moving sink phenomenon. Our analysis indicates that sink tokens exhibit low-norm representations in the Transformer’s value space, and that the moving sink phenomenon serves as a protective mechanism in DLMs to prevent excessive information mixing. However, their unpredictable positions across diffusion steps undermine inference robustness. To resolve this, we propose a simple but effective extra sink token implemented via a modified attention mask. Specifically, we introduce a special token constrained to attend solely to itself, while remaining globally visible to all other tokens. Experimental results demonstrate that introducing a single extra token stabilizes attention sinks, substantially improving model performance. Crucially, further analysis confirms that the effectiveness of this token is independent of its position and characterized by negligible semantic content, validating its role as a robust and dedicated structural sink.

[50] Argument Rarity-based Originality Assessment for AI-Assisted Writing

Keito Inoshita, Michiaki Omura, Tsukasa Yamanaka, Go Maeda, Kentaro Tsuji

Main category: cs.CL

TL;DR: AROA framework evaluates argumentative originality in essays using rarity-based metrics across structural, claim, evidence, and cognitive dimensions, revealing AI essays lack semantic originality despite high quality.

DetailsMotivation: Need for automated assessment of argumentative originality in student essays, especially in the AI era where LLMs can produce high-quality but unoriginal content, requiring a shift from quality-focused to originality-focused evaluation.

Method: AROA framework defines originality as rarity within a reference corpus, using four components: structural rarity (argument structure), claim rarity (semantic content), evidence rarity (supporting materials), and cognitive depth (reasoning complexity), quantified via density estimation with quality adjustment.

Result: Strong negative correlation (r=-0.67) between text quality and claim rarity shows quality-originality trade-off; AI essays achieved near-perfect quality (Q=0.998) but claim rarity only 1/5 of human levels (AI:0.037 vs human:0.170); low correlations (r=0.06-0.13) between components confirm independent aspects.

Conclusion: Writing assessment must shift from quality to originality in the AI era, as LLMs can reproduce argumentative structure but lack semantic originality; AROA provides a comprehensive framework for evaluating multiple independent dimensions of argumentative originality.

Abstract: This study proposes Argument Rarity-based Originality Assessment (AROA), a framework for automatically evaluating argumentative originality in student essays. AROA defines originality as rarity within a reference corpus and evaluates it through four complementary components: structural rarity, claim rarity, evidence rarity, and cognitive depth, quantified via density estimation and integrated with quality adjustment. Experiments using 1,375 human essays and 1,000 AI-generated essays on two argumentative topics revealed three key findings. First, a strong negative correlation ($r = -0.67$) between text quality and claim rarity demonstrates a quality-originality trade-off. Second, while AI essays achieved near-perfect quality scores ($Q = 0.998$), their claim rarity was approximately one-fifth of human levels (AI: 0.037, human: 0.170), indicating that LLMs can reproduce argumentative structure but not semantic originality. Third, the four components showed low mutual correlations ($r = 0.06$–$0.13$ between structural and semantic dimensions), confirming that they capture genuinely independent aspects of originality. These results suggest that writing assessment in the AI era must shift from quality to originality.

[51] VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration

Jaeyoon Jung, Yejun Yoon, Kunwoo Park

Main category: cs.CL

TL;DR: VILLAIN is a multimodal fact-checking system that uses vision-language model agents in a multi-stage pipeline to verify image-text claims through evidence retrieval, analysis, and verdict prediction.

DetailsMotivation: The paper addresses the challenge of multimodal fact-checking where claims combine both images and text, requiring systems that can effectively process and verify information across different modalities.

Method: The system employs a multi-agent collaboration approach with vision-language models across several stages: 1) retrieving textual and visual evidence from enriched knowledge stores, 2) using modality-specific and cross-modal agents to analyze evidence and identify inconsistencies, 3) generating question-answer pairs based on analysis reports, and 4) a final verdict prediction agent that produces verification outcomes.

Result: VILLAIN ranked first on the AVerImaTeC shared task leaderboard across all evaluation metrics, demonstrating superior performance in multimodal fact-checking.

Conclusion: The multi-agent prompt-based approach with vision-language models is effective for multimodal fact-checking, and the system’s success on the benchmark task shows the potential of such architectures for verifying image-text claims.

Abstract: This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision-language model agents across multiple stages of fact-checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality-specific and cross-modal agents generate analysis reports. In the subsequent stage, question-answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image-text claim and the generated question-answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at https://github.com/ssu-humane/VILLAIN.

[52] Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

Jingyi Xu, Xingyu Ren, Zhoupeng Shou, Yumeng Zhang, Zhiqiang You

Main category: cs.CL

TL;DR: GOPO is a hierarchical RL framework for task-oriented dialogue systems that decouples strategy planning from response generation using Expert and Customer Service Agents, achieving superior performance on e-commerce benchmarks.

DetailsMotivation: Existing training methods for task-oriented dialogue systems rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success. There's a need for better alignment between dialogue generation and overall task completion goals.

Method: Goal-Oriented Preference Optimization (GOPO) uses hierarchical reinforcement learning with two agents: Expert Agent optimizes multi-turn goal preferences at dialogue-trajectory level, while Customer Service Agent generates responses strictly aligned with selected strategy.

Result: On Mgshop dataset, GOPO improves TSE metric by 7.7% and 10.3% over PPO and Memento, with consistent gains in sequence-level reward and generation quality. A 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2.

Conclusion: GOPO establishes a new paradigm for task-oriented dialogue systems in commercial scenarios, demonstrating the critical role of hierarchical planning for long-horizon optimization in dialogue systems.

Abstract: Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success. To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent. The Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while the Customer Service Agent generates responses strictly aligned with the selected strategy. We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data. On the Mgshop dataset, GOPO improves TSE by 7.7% and 10.3% over PPO and Memento, with consistent gains in sequence-level reward and generation quality. Furthermore, a 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2, respectively. Ablation studies confirm the Expert Agent’s critical role in long-horizon optimization. GOPO demonstrates consistent improvements across other datasets as well. This work establishes a new paradigm for task-oriented dialogue systems in commercial scenarios, with code and datasets to be made public.

cs.CV

[53] Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models

Dhruba Ghosh, Yuhui Zhang, Ludwig Schmidt

Main category: cs.CV

TL;DR: VLMs excel at visual QA tasks but lag in fine-grained image classification; better vision encoders and pretraining strategies are key to improving fine-grained visual understanding.

DetailsMotivation: While vision-language models have shown strong performance on various visual question answering benchmarks, they underperform on traditional fine-grained image classification tasks that test detailed visual knowledge. This disconnect between general vision-language capabilities and fine-grained visual understanding needs investigation.

Method: The authors test numerous recent VLMs on fine-grained classification benchmarks and conduct ablation experiments to identify factors affecting performance. They analyze the impact of different LLMs, vision encoders, and pretraining strategies (particularly when language model weights are unfrozen during pretraining).

Result: Better LLMs improve all benchmark scores equally, while better vision encoders disproportionately boost fine-grained classification performance. The pretraining stage is crucial for fine-grained performance, especially when language model weights remain unfrozen during pretraining.

Conclusion: Improving fine-grained visual understanding in VLMs requires focusing on vision encoder quality and pretraining strategies rather than just language model improvements. These insights can guide development of VLMs with better vision-centric capabilities.

Abstract: Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks, spanning visual reasoning, document understanding, and multimodal dialogue. These improvements are evident in a wide range of VLMs built on a variety of base models, alignment architectures, and training data. However, recent works show that these models trail behind in traditional image classification benchmarks, which test fine-grained visual knowledge. We test a large number of recent VLMs on fine-grained classification benchmarks and identify potential factors in the disconnect between fine-grained knowledge and other vision benchmarks. Through a series of ablation experiments, we find that using a better LLM improves all benchmark scores equally, while a better vision encoder disproportionately improves fine-grained classification performance. Furthermore, we find that the pretraining stage is also vital to fine-grained performance, particularly when the language model weights are unfrozen during pretraining. These insights pave the way for enhancing fine-grained visual understanding and vision-centric capabilities in VLMs.

[54] KPM-Bench: A Kinematic Parsing Motion Benchmark for Fine-grained Motion-centric Video Understanding

Boda Lin, Yongjie Zhu, Xiaocheng Gong, Wenyu Qin, Meng Wang

Main category: cs.CV

TL;DR: Proposes KPM-Bench dataset and MoPE algorithm to address fine-grained motion understanding and hallucination issues in video captioning, with kinematic-based motion parsing and evaluation metrics.

DetailsMotivation: Current video captioning models struggle with accurately describing fine-grained motion details and suffer from severe hallucination issues, especially for motion-centric videos where precise depiction of intricate movements and limb dynamics is crucial.

Method: Introduces automated annotation pipeline integrating kinematic-based motion computation with linguistic parsing; constructs KPM-Bench dataset with fine-grained video-caption pairs, QA pairs, and evaluation set; proposes MoPE algorithm for motion attribute extraction from text; integrates MoPE into GRPO post-training framework.

Result: Creates KPM-Bench dataset for fine-grained motion understanding; develops MoPE algorithm for accurate motion attribute extraction; introduces hallucination evaluation metric; shows improved reliability of motion-centric video captioning models through GRPO integration.

Conclusion: The proposed approach addresses critical limitations in video captioning by providing better motion understanding datasets, evaluation methods, and hallucination mitigation techniques, advancing the field of motion-centric video understanding.

Abstract: Despite recent advancements, video captioning models still face significant limitations in accurately describing fine-grained motion details and suffer from severe hallucination issues. These challenges become particularly prominent when generating captions for motion-centric videos, where precise depiction of intricate movements and limb dynamics is crucial yet often neglected. To alleviate this gap, we introduce an automated annotation pipeline that integrates kinematic-based motion computation with linguistic parsing, enabling detailed decomposition and description of complex human motions. Based on this pipeline, we construct and release the Kinematic Parsing Motion Benchmark (KPM-Bench), a novel open-source dataset designed to facilitate fine-grained motion understanding. KPM-Bench consists of (i) fine-grained video-caption pairs that comprehensively illustrate limb-level dynamics in complex actions, (ii) diverse and challenging question-answer pairs focusing specifically on motion understanding, and (iii) a meticulously curated evaluation set specifically designed to assess hallucination phenomena associated with motion descriptions. Furthermore, to address hallucination issues systematically, we propose the linguistically grounded Motion Parsing and Extraction (MoPE) algorithm, capable of accurately extracting motion-specific attributes directly from textual captions. Leveraging MoPE, we introduce a precise hallucination evaluation metric that functions independently of large-scale vision-language or language-only models. By integrating MoPE into the GRPO post-training framework, we effectively mitigate hallucination problems, significantly improving the reliability of motion-centric video captioning models.

[55] CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild

Balamurugan Thambiraja, Omid Taheri, Radek Danecek, Giorgio Becherini, Gerard Pons-Moll, Justus Thies

Main category: cs.CV

TL;DR: CLUTCH introduces 3D-HIW dataset and LLM-based system for in-the-wild hand motion modeling with text-to-motion and motion-to-text capabilities

DetailsMotivation: Existing hand motion modeling methods rely on limited studio-captured datasets, making them costly to scale to real-world settings and struggling with animation fidelity and text-motion alignment

Method: (1) 3D-HIW dataset creation using VLMs and 3D hand trackers on egocentric videos; (2) CLUTCH system with SHIFT VQ-VAE for motion tokenization and geometric refinement stage for LLM finetuning

Result: State-of-the-art performance on text-to-motion and motion-to-text tasks, establishing first benchmark for scalable in-the-wild hand motion modeling

Conclusion: The approach enables scalable modeling of natural hand motions in real-world settings with improved animation fidelity and text-motion alignment

Abstract: Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to “in-the-wild” settings. Further, contemporary models and their training schemes struggle to capture animation fidelity with text-motion alignment. To address this, we (1) introduce ‘3D Hands in the Wild’ (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) and state-of-the-art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture motion in-the-wild, CLUTCH employs SHIFT, a part-modality decomposed VQ-VAE, which improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage, where CLUTCH is co-supervised with a reconstruction loss applied directly to decoded hand motion parameters. Experiments demonstrate state-of-the-art performance on text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modelling. Code, data and models will be released.

[56] Multi-Modal Monocular Endoscopic Depth and Pose Estimation with Edge-Guided Self-Supervision

Xinwei Ju, Rema Daher, Danail Stoyanov, Sophia Bano, Francisco Vasconcelos

Main category: cs.CV

TL;DR: PRISM: A self-supervised learning framework for monocular depth and pose estimation in colonoscopy using edge detection and luminance decoupling for structural guidance, achieving SOTA performance.

DetailsMotivation: Monocular depth and pose estimation are crucial for colonoscopy-assisted navigation to improve screening quality, but challenging due to texture-less surfaces, complex illumination, deformation, and lack of reliable in-vivo ground truth datasets.

Method: PRISM leverages anatomical and illumination priors through edge detection (using learning-based edge detectors like DexiNed or HED) and luminance decoupling via intrinsic decomposition to separate shading and reflectance, providing structural guidance for geometric learning.

Result: State-of-the-art performance on multiple real and synthetic datasets, with ablation studies showing self-supervised training on real-world data outperforms supervised training on realistic phantom data, and video frame rate is critical for model performance.

Conclusion: PRISM effectively addresses colonoscopy depth/pose estimation challenges using self-supervised learning with anatomical priors, establishing best practices including the importance of domain realism over ground truth availability and careful video frame sampling.

Abstract: Monocular depth and pose estimation play an important role in the development of colonoscopy-assisted navigation, as they enable improved screening by reducing blind spots, minimizing the risk of missed or recurrent lesions, and lowering the likelihood of incomplete examinations. However, this task remains challenging due to the presence of texture-less surfaces, complex illumination patterns, deformation, and a lack of in-vivo datasets with reliable ground truth. In this paper, we propose PRISM (Pose-Refinement with Intrinsic Shading and edge Maps), a self-supervised learning framework that leverages anatomical and illumination priors to guide geometric learning. Our approach uniquely incorporates edge detection and luminance decoupling for structural guidance. Specifically, edge maps are derived using a learning-based edge detector (e.g., DexiNed or HED) trained to capture thin and high-frequency boundaries, while luminance decoupling is obtained through an intrinsic decomposition module that separates shading and reflectance, enabling the model to exploit shading cues for depth estimation. Experimental results on multiple real and synthetic datasets demonstrate state-of-the-art performance. We further conduct a thorough ablation study on training data selection to establish best practices for pose and depth estimation in colonoscopy. This analysis yields two practical insights: (1) self-supervised training on real-world data outperforms supervised training on realistic phantom data, underscoring the superiority of domain realism over ground truth availability; and (2) video frame rate is an extremely important factor for model performance, where dataset-specific video frame sampling is necessary for generating high quality training data.

[57] LGD-Net: Latent-Guided Dual-Stream Network for HER2 Scoring with Task-Specific Domain Knowledge

Peide Zhu, Linbin Lu, Zhiqin Chen, Xiong Chen

Main category: cs.CV

TL;DR: LGD-Net predicts HER2 expression from H&E slides using cross-modal feature hallucination instead of pixel-level virtual staining, achieving SOTA performance with efficient inference.

DetailsMotivation: Standard IHC staining for HER2 evaluation is resource-intensive and unavailable in many areas. While predicting HER2 from H&E slides is promising, existing pixel-level virtual staining methods are computationally expensive and prone to reconstruction artifacts that can cause diagnostic errors.

Method: Proposes Latent-Guided Dual-Stream Network (LGD-Net) that uses cross-modal feature hallucination instead of explicit pixel-level image generation. The model learns to map morphological H&E features directly to molecular latent space, guided by a teacher IHC encoder. Includes task-specific domain knowledge regularization via lightweight auxiliary tasks focusing on nuclei distribution and membrane staining intensity.

Result: Extensive experiments on the public BCI dataset demonstrate state-of-the-art performance, significantly outperforming baseline methods while enabling efficient inference using single-modality H&E inputs.

Conclusion: LGD-Net provides an effective alternative to resource-intensive IHC staining by predicting HER2 expression directly from H&E slides through cross-modal feature hallucination, avoiding computational costs and artifacts of pixel-level methods while maintaining diagnostic accuracy.

Abstract: It is a critical task to evalaute HER2 expression level accurately for breast cancer evaluation and targeted treatment therapy selection. However, the standard multi-step Immunohistochemistry (IHC) staining is resource-intensive, expensive, and time-consuming, which is also often unavailable in many areas. Consequently, predicting HER2 levels directly from H&E slides has emerged as a potential alternative solution. It has been shown to be effective to use virtual IHC images from H&E images for automatic HER2 scoring. However, the pixel-level virtual staining methods are computationally expensive and prone to reconstruction artifacts that can propagate diagnostic errors. To address these limitations, we propose the Latent-Guided Dual-Stream Network (LGD-Net), a novel framework that employes cross-modal feature hallucination instead of explicit pixel-level image generation. LGD-Net learns to map morphological H&E features directly to the molecular latent space, guided by a teacher IHC encoder during training. To ensure the hallucinated features capture clinically relevant phenotypes, we explicitly regularize the model training with task-specific domain knowledge, specifically nuclei distribution and membrane staining intensity, via lightweight auxiliary regularization tasks. Extensive experiments on the public BCI dataset demonstrate that LGD-Net achieves state-of-the-art performance, significantly outperforming baseline methods while enabling efficient inference using single-modality H&E inputs.

[58] Enabling Training-Free Text-Based Remote Sensing Segmentation

Jose Sosa, Danila Rukhovich, Anis Kacem, Djamila Aouada

Main category: cs.CV

TL;DR: Training-free remote sensing segmentation using foundation models: combines CLIP+SAM for open-vocabulary segmentation and GPT/Qwen-VL+SAM for referring/reasoning segmentation.

DetailsMotivation: Existing text-guided remote sensing segmentation methods require additional trainable components, limiting generalization and practical applicability. The authors aim to achieve segmentation without training by leveraging existing foundation models.

Method: Two approaches: 1) Contrastive: CLIP as mask selector for SAM’s proposals for open-vocabulary semantic segmentation. 2) Generative: GPT-5 (zero-shot) or LoRA-tuned Qwen-VL to generate click prompts for SAM for referring/reasoning segmentation.

Result: State-of-the-art open-vocabulary semantic segmentation in zero-shot setting. Extensive experiments across 19 remote sensing benchmarks show strong capabilities in open-vocabulary, referring, and reasoning-based tasks.

Conclusion: Training-free or lightweight LoRA-tuned pipelines using foundation models can achieve effective text-guided remote sensing segmentation without additional training components.

Abstract: Recent advances in Vision Language Models (VLMs) and Vision Foundation Models (VFMs) have opened new opportunities for zero-shot text-guided segmentation of remote sensing imagery. However, most existing approaches still rely on additional trainable components, limiting their generalisation and practical applicability. In this work, we investigate to what extent text-based remote sensing segmentation can be achieved without additional training, by relying solely on existing foundation models. We propose a simple yet effective approach that integrates contrastive and generative VLMs with the Segment Anything Model (SAM), enabling a fully training-free or lightweight LoRA-tuned pipeline. Our contrastive approach employs CLIP as mask selector for SAM’s grid-based proposals, achieving state-of-the-art open-vocabulary semantic segmentation (OVSS) in a completely zero-shot setting. In parallel, our generative approach enables reasoning and referring segmentation by generating click prompts for SAM using GPT-5 in a zero-shot setting and a LoRA-tuned Qwen-VL model, with the latter yielding the best results. Extensive experiments across 19 remote sensing benchmarks, including open-vocabulary, referring, and reasoning-based tasks, demonstrate the strong capabilities of our approach. Code will be released at https://github.com/josesosajs/trainfree-rs-segmentation.

[59] VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

Narges Norouzi, Idil Esen Zulfikar, Niccol`o Cavagnero, Tommie Kerssies, Bastian Leibe, Gijs Dubbelman, Daan de Geus

Main category: cs.CV

TL;DR: VidEoMT is a simple encoder-only video segmentation model that eliminates specialized tracking modules through query propagation and fusion, achieving competitive accuracy while being 5-10x faster than existing methods.

DetailsMotivation: Existing video segmentation models combine per-frame segmenters with complex tracking modules, introducing architectural complexity and computational overhead. Recent work shows plain Vision Transformers can do accurate image segmentation without specialized modules, motivating a simpler approach for video.

Method: Proposes Video Encoder-only Mask Transformer (VidEoMT) with: 1) Lightweight query propagation mechanism that reuses queries from previous frames for temporal modeling, 2) Query fusion strategy combining propagated queries with temporally-agnostic learned queries to balance temporal consistency with adaptability to new content.

Result: Achieves competitive accuracy while being 5-10x faster than existing methods, running at up to 160 FPS with a ViT-L backbone. Eliminates need for dedicated tracking modules while maintaining performance.

Conclusion: VidEoMT demonstrates that encoder-only video segmentation can achieve the benefits of tracking without added complexity, offering a simpler and more efficient alternative to existing video segmentation architectures.

Abstract: Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally-agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x–10x faster, running at up to 160 FPS with a ViT-L backbone. Code: https://www.tue-mps.org/videomt/

[60] VQPP: Video Query Performance Prediction Benchmark

Adrian Catalin Lutu, Eduard Poesina, Radu Tudor Ionescu

Main category: cs.CV

TL;DR: First benchmark for video query performance prediction (VQPP) with 56K text queries and 51K videos, exploring pre-retrieval and post-retrieval predictors for content-based video retrieval.

DetailsMotivation: Query performance prediction (QPP) is well-studied for text and image retrieval but remains underexplored for content-based video retrieval (CBVR), creating a need for standardized benchmarks and methods.

Method: Created VQPP benchmark with two text-to-video retrieval datasets and two CBVR systems, explored multiple pre-retrieval and post-retrieval performance predictors, and demonstrated application using best pre-retrieval predictor as reward model for LLM training via DPO.

Result: Pre-retrieval predictors achieved competitive performance, enabling applications before retrieval; benchmark established with official splits for reproducible comparisons.

Conclusion: VQPP provides first standardized benchmark for video QPP, showing pre-retrieval predictors are effective and applicable to tasks like query reformulation via LLM training.

Abstract: Query performance prediction (QPP) is an important and actively studied information retrieval task, having various applications, such as query reformulation, query expansion, and retrieval system selection, among many others. The task has been primarily studied in the context of text and image retrieval, whereas QPP for content-based video retrieval (CBVR) remains largely underexplored. To this end, we propose the first benchmark for video query performance prediction (VQPP), comprising two text-to-video retrieval datasets and two CBVR systems, respectively. VQPP contains a total of 56K text queries and 51K videos, and comes with official training, validation and test splits, fostering direct comparisons and reproducible results. We explore multiple pre-retrieval and post-retrieval performance predictors, creating a representative benchmark for future exploration of QPP in the video domain. Our results show that pre-retrieval predictors obtain competitive performance, enabling applications before performing the retrieval step. We also demonstrate the applicability of VQPP by employing the best performing pre-retrieval predictor as reward model for training a large language model (LLM) on the query reformulation task via direct preference optimization (DPO). We release our benchmark and code at https://github.com/AdrianLutu/VQPP.

[61] On the Evaluation Protocol of Gesture Recognition for UAV-based Rescue Operation based on Deep Learning: A Subject-Independence Perspective

Domonkos Varga

Main category: cs.CV

TL;DR: Critical analysis showing that a gesture recognition paper’s near-perfect results stem from data leakage due to improper subject-independent train-test splitting, undermining claims about generalization to unseen individuals.

DetailsMotivation: To critically examine the evaluation methodology of a gesture recognition approach, specifically investigating whether the reported high accuracy metrics genuinely reflect generalization to unseen subjects or result from data leakage issues.

Method: Analyzes the published confusion matrix, learning curves, and dataset construction to identify methodological flaws. Focuses on examining the train-test split protocol and demonstrating how frame-level random splitting mixes samples from the same subjects across both sets.

Result: Shows that the near-perfect accuracy metrics result from severe data leakage caused by improper subject-independent data partitioning. The evaluation does not measure true generalization to unseen individuals as claimed.

Conclusion: Emphasizes the critical importance of subject-independent data partitioning in vision-based gesture recognition research, especially for real-world applications like UAV-human interaction that require reliable recognition of gestures from previously unseen people.

Abstract: This paper presents a methodological analysis of the gesture-recognition approach proposed by Liu and Szirányi, with a particular focus on the validity of their evaluation protocol. We show that the reported near-perfect accuracy metrics result from a frame-level random train-test split that inevitably mixes samples from the same subjects across both sets, causing severe data leakage. By examining the published confusion matrix, learning curves, and dataset construction, we demonstrate that the evaluation does not measure generalization to unseen individuals. Our findings underscore the importance of subject-independent data partitioning in vision-based gesture-recognition research, especially for applications - such as UAV-human interaction - that require reliable recognition of gestures performed by previously unseen people.

[62] ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging

Athanasios Angelakis

Main category: cs.CV

TL;DR: ZACH-ViT is a compact Vision Transformer that removes positional embeddings and CLS tokens for permutation invariance, achieving competitive performance on medical imaging datasets with minimal parameters and fast inference.

DetailsMotivation: Traditional Vision Transformers rely on fixed spatial priors (positional embeddings and class tokens) that may hinder generalization in medical imaging where spatial layout is weakly informative or inconsistent, especially in resource-constrained clinical environments.

Method: Introduces ZACH-ViT which removes both positional embeddings and the [CLS] token, achieving permutation invariance through global average pooling over patch representations. Uses adaptive residual projections to maintain training stability while keeping a strict parameter budget.

Result: ZACH-ViT (0.25M parameters, trained from scratch) shows strongest advantage on BloodMNIST, remains competitive with TransMIL on PathMNIST, but has reduced advantage on datasets with strong anatomical priors (OCTMNIST, OrganAMNIST). Achieves competitive performance with sub-second inference times.

Conclusion: Aligning architectural inductive bias with data structure is more important than universal benchmark dominance. ZACH-ViT’s minimal size and lack of pretraining support deployment in resource-constrained clinical environments while maintaining competitive performance.

Abstract: Vision Transformers rely on positional embeddings and class tokens that encode fixed spatial priors. While effective for natural images, these priors may hinder generalization when spatial layout is weakly informative or inconsistent, a frequent condition in medical imaging and edge-deployed clinical systems. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes both positional embeddings and the [CLS] token, achieving permutation invariance through global average pooling over patch representations. The term “Zero-token” specifically refers to removing the dedicated [CLS] aggregation token and positional embeddings; patch tokens remain unchanged and are processed normally. Adaptive residual projections preserve training stability in compact configurations while maintaining a strict parameter budget. Evaluation is performed across seven MedMNIST datasets spanning binary and multi-class tasks under a strict few-shot protocol (50 samples per class, fixed hyperparameters, five random seeds). The empirical analysis demonstrates regime-dependent behavior: ZACH-ViT (0.25M parameters, trained from scratch) achieves its strongest advantage on BloodMNIST and remains competitive with TransMIL on PathMNIST, while its relative advantage decreases on datasets with strong anatomical priors (OCTMNIST, OrganAMNIST), consistent with the architectural hypothesis. These findings support the view that aligning architectural inductive bias with data structure can be more important than pursuing universal benchmark dominance. Despite its minimal size and lack of pretraining, ZACH-ViT achieves competitive performance while maintaining sub-second inference times, supporting deployment in resource-constrained clinical environments. Code and models are available at https://github.com/Bluesman79/ZACH-ViT.

[63] Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models

Yuxiao Chen, Jue Wang, Zhikang Zhang, Jingru Yi, Xu Zhang, Yang Zou, Zhaowei Cai, Jianbo Yuan, Xinyu Li, Hao Yang, Davide Modolo

Main category: cs.CV

TL;DR: A novel end-to-end framework for long-form video understanding that combines adaptive video sampling, spatiotemporal compression, and multimodal LLMs to handle video redundancy and memory constraints.

DetailsMotivation: Long-form video analysis faces challenges due to video redundancy: 1) efficiently incorporating many frames within memory limits, and 2) extracting discriminative information from vast input data, despite recent advances in video backbones and LLMs.

Method: Proposes an end-to-end system with: 1) Information-density-based adaptive video sampler (AVS) to capture essential information adaptively, and 2) Autoencoder-based spatiotemporal video compressor (SVC) integrated with a multimodal large language model (MLLM) for high compression while preserving discriminative information.

Result: The framework shows promising performance across various benchmarks, excelling in both long-form video understanding tasks and standard video understanding benchmarks, demonstrating versatility and efficacy in managing prolonged video sequences.

Conclusion: The proposed system effectively addresses challenges in long-form video understanding through adaptive sampling and compression techniques integrated with MLLMs, offering a versatile solution for handling video redundancy and memory constraints.

Abstract: With recent advancements in video backbone architectures, combined with the remarkable achievements of large language models (LLMs), the analysis of long-form videos spanning tens of minutes has become both feasible and increasingly prevalent. However, the inherently redundant nature of video sequences poses significant challenges for contemporary state-of-the-art models. These challenges stem from two primary aspects: 1) efficiently incorporating a larger number of frames within memory constraints, and 2) extracting discriminative information from the vast volume of input data. In this paper, we introduce a novel end-to-end schema for long-form video understanding, which includes an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC) integrated with a multimodal large language model (MLLM). Our proposed system offers two major advantages: it adaptively and effectively captures essential information from video sequences of varying durations, and it achieves high compression rates while preserving crucial discriminative information. The proposed framework demonstrates promising performance across various benchmarks, excelling in both long-form video understanding tasks and standard video understanding benchmarks. These results underscore the versatility and efficacy of our approach, particularly in managing the complexities of prolonged video sequences.

[64] A Single Image and Multimodality Is All You Need for Novel View Synthesis

Amirhosein Javadi, Chi-Shiang Gau, Konstantinos D. Polyzos, Tara Javidi

Main category: cs.CV

TL;DR: Sparse multimodal range measurements (radar/LiDAR) improve diffusion-based novel view synthesis by providing robust geometric priors instead of unreliable monocular depth estimation.

DetailsMotivation: Current diffusion-based novel view synthesis relies on monocular depth estimation, which fails under challenging conditions like low texture, adverse weather, and heavy occlusion. Sparse range measurements can overcome these limitations.

Method: Multimodal depth reconstruction framework using extremely sparse range data (radar/LiDAR) with localized Gaussian Process formulation in angular domain. Produces dense depth maps with uncertainty quantification for use as drop-in replacement in existing diffusion pipelines.

Result: Substantial improvement in geometric consistency and visual quality for single-image novel-view video generation on real-world driving scenes compared to vision-only depth estimation.

Conclusion: Reliable geometric priors are crucial for diffusion-based view synthesis, and multimodal sensing with sparse range data provides practical benefits even at extreme sparsity levels.

Abstract: Diffusion-based approaches have recently demonstrated strong performance for single-image novel view synthesis by conditioning generative models on geometry inferred from monocular depth estimation. However, in practice, the quality and consistency of the synthesized views are fundamentally limited by the reliability of the underlying depth estimates, which are often fragile under low texture, adverse weather, and occlusion-heavy real-world conditions. In this work, we show that incorporating sparse multimodal range measurements provides a simple yet effective way to overcome these limitations. We introduce a multimodal depth reconstruction framework that leverages extremely sparse range sensing data, such as automotive radar or LiDAR, to produce dense depth maps that serve as robust geometric conditioning for diffusion-based novel view synthesis. Our approach models depth in an angular domain using a localized Gaussian Process formulation, enabling computationally efficient inference while explicitly quantifying uncertainty in regions with limited observations. The reconstructed depth and uncertainty are used as a drop-in replacement for monocular depth estimators in existing diffusion-based rendering pipelines, without modifying the generative model itself. Experiments on real-world multimodal driving scenes demonstrate that replacing vision-only depth with our sparse range-based reconstruction substantially improves both geometric consistency and visual quality in single-image novel-view video generation. These results highlight the importance of reliable geometric priors for diffusion-based view synthesis and demonstrate the practical benefits of multimodal sensing even at extreme levels of sparsity.

[65] ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models

Guoheng Sun, Tingting Du, Kaixi Feng, Chenxiang Luo, Xingguo Ding, Zheyu Shen, Ziyao Wang, Yexiao He, Ang Li

Main category: cs.CV

TL;DR: ROCKET introduces a residual-oriented multi-layer representation alignment framework for 3D-aware Vision-Language-Action models, using shared projector for efficient multi-layer alignment with minimal compute.

DetailsMotivation: Current VLA models lack 3D spatial understanding as they're pretrained on 2D data. Existing representation alignment methods use single-layer supervision, failing to exploit rich information across depth, while naive multi-layer alignment causes gradient interference.

Method: ROCKET uses residual-oriented multi-layer alignment with a shared projector to align multiple VLA backbone layers with a 3D vision foundation model via layer-invariant mapping. Includes Matryoshka-style sparse activation and training-free layer selection.

Result: Achieves 98.5% state-of-the-art success rate on LIBERO with only 4% compute budget. Shows superior performance across LIBERO-Plus, RoboTwin, and multiple VLA models.

Conclusion: ROCKET provides efficient and effective multi-layer representation alignment for 3D-aware VLA models, significantly improving performance with minimal computational cost.

Abstract: Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective approach is representation alignment, where a strong vision foundation model is used to guide a 2D VLA model. However, existing methods usually apply supervision at only a single layer, failing to fully exploit the rich information distributed across depth; meanwhile, naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses showing that a shared projector is sufficient and outperforms prior designs, and further propose a Matryoshka-style sparse activation scheme for the shared projector to balance multiple alignment losses. Our experiments show that, combined with a training-free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving 98.5% state-of-the-art success rate on LIBERO. We further demonstrate the superior performance of ROCKET across LIBERO-Plus and RoboTwin, as well as multiple VLA models. The code and model weights can be found at https://github.com/CASE-Lab-UMD/ROCKET-VLA.

[66] Image Quality Assessment: Exploring Quality Awareness via Memory-driven Distortion Patterns Matching

Xuting Lan, Mingliang Zhou, Xuekai Wei, Jielu Yan, Yueting Huang, Huayan Pu, Jun Luo, Weijia Jia

Main category: cs.CV

TL;DR: MQAF is a memory-driven quality assessment framework that uses a memory bank of distortion patterns to enable both full-reference and no-reference image quality assessment, reducing reliance on high-quality reference images.

DetailsMotivation: Existing FR-IQA methods are limited by their dependence on high-quality reference images, which are often unavailable in real-world applications. Inspired by the human visual system's ability to use accumulated visual memory for quality assessment, the authors aim to create a more flexible framework that can work with or without reference images.

Method: Proposes MQAF with a memory bank storing distortion patterns. Uses dual-mode assessment: 1) With reference images - adaptively weights reference information and compares distorted images with stored patterns; 2) Without reference images - relies solely on memory bank patterns for no-reference quality assessment.

Result: Outperforms state-of-the-art approaches across multiple datasets while adapting to both no-reference and full-reference tasks.

Conclusion: The memory-driven framework successfully reduces reliance on high-quality reference images while maintaining strong performance across both FR-IQA and NR-IQA tasks, demonstrating the value of biological memory-inspired approaches.

Abstract: Existing full-reference image quality assessment (FR-IQA) methods achieve high-precision evaluation by analysing feature differences between reference and distorted images. However, their performance is constrained by the quality of the reference image, which limits real-world applications where ideal reference sources are unavailable. Notably, the human visual system has the ability to accumulate visual memory, allowing image quality assessment on the basis of long-term memory storage. Inspired by this biological memory mechanism, we propose a memory-driven quality-aware framework (MQAF), which establishes a memory bank for storing distortion patterns and dynamically switches between dual-mode quality assessment strategies to reduce reliance on high-quality reference images. When reference images are available, MQAF obtains reference-guided quality scores by adaptively weighting reference information and comparing the distorted image with stored distortion patterns in the memory bank. When the reference image is absent, the framework relies on distortion patterns in the memory bank to infer image quality, enabling no-reference quality assessment (NR-IQA). The experimental results show that our method outperforms state-of-the-art approaches across multiple datasets while adapting to both no-reference and full-reference tasks.

[67] MUOT_3M: A 3 Million Frame Multimodal Underwater Benchmark and the MUTrack Tracking Method

Ahsan Baidar Bakht, Mohamad Alansari, Muhayy Ud Din, Muzammal Naseer, Sajid Javed, Irfan Hussain, Jiri Matas, Arif Mahmood

Main category: cs.CV

TL;DR: MUOT_3M is the first large-scale multimodal underwater object tracking benchmark with 3M frames, and MUTrack is a SAM-based multimodal tracker that achieves state-of-the-art performance through knowledge distillation.

DetailsMotivation: Underwater object tracking is crucial for marine applications but limited by small, RGB-only datasets that can't handle challenging underwater conditions like color distortion and low visibility.

Method: Created MUOT_3M benchmark with 3M frames, 3,030 videos, and synchronized RGB, enhanced RGB, depth, and language modalities. Developed MUTrack with visual geometric alignment, vision-language fusion, and four-level knowledge distillation to transfer multimodal knowledge to a unimodal student model.

Result: MUTrack achieves up to 8.40% higher AUC and 7.80% higher precision than state-of-the-art baselines across five UOT benchmarks while running at 24 FPS.

Conclusion: MUOT_3M and MUTrack establish a new foundation for scalable, multimodally trained yet practically deployable underwater tracking systems.

Abstract: Underwater Object Tracking (UOT) is crucial for efficient marine robotics, large scale ecological monitoring, and ocean exploration; however, progress has been hindered by the scarcity of large, multimodal, and diverse datasets. Existing benchmarks remain small and RGB only, limiting robustness under severe color distortion, turbidity, and low visibility conditions. We introduce MUOT_3M, the first pseudo multimodal UOT benchmark comprising 3 million frames from 3,030 videos (27.8h) annotated with 32 tracking attributes, 677 fine grained classes, and synchronized RGB, estimated enhanced RGB, estimated depth, and language modalities validated by a marine biologist. Building upon MUOT_3M, we propose MUTrack, a SAM-based multimodal to unimodal tracker featuring visual geometric alignment, vision language fusion, and four level knowledge distillation that transfers multimodal knowledge into a unimodal student model. Extensive evaluations across five UOT benchmarks demonstrate that MUTrack achieves up to 8.40% higher AUC and 7.80% higher precision than the strongest SOTA baselines while running at 24 FPS. MUOT_3M and MUTrack establish a new foundation for scalable, multimodally trained yet practically deployable underwater tracking.

[68] Towards LLM-centric Affective Visual Customization via Efficient and Precise Emotion Manipulating

Jiamin Luo, Xuqian Gu, Jingjing Wang, Jiahong Lu

Main category: cs.CV

TL;DR: Proposes L-AVC (LLM-centric Affective Visual Customization) for emotion-based image editing using multimodal LLMs, with EPEM approach for efficient emotion conversion and content retention.

DetailsMotivation: Current visual customization methods focus on objective alignment (language, layout, canny) but ignore subjective emotional content, lacking general-purpose foundation models for affective visual customization.

Method: Proposes EPEM (Efficient and Precise Emotion Manipulating) approach with two modules: EIC (Efficient Inter-emotion Converting) for semantic emotion alignment, and PER (Precise Exter-emotion Retaining) for preserving emotion-agnostic content.

Result: Comprehensive experiments on constructed L-AVC dataset show EPEM outperforms state-of-the-art baselines, demonstrating importance of emotion information and effectiveness of the approach.

Conclusion: The paper justifies the importance of emotion information for L-AVC and shows EPEM effectively manipulates such information efficiently and precisely.

Abstract: Previous studies on visual customization primarily rely on the objective alignment between various control signals (e.g., language, layout and canny) and the edited images, which largely ignore the subjective emotional contents, and more importantly lack general-purpose foundation models for affective visual customization. With this in mind, this paper proposes an LLM-centric Affective Visual Customization (L-AVC) task, which focuses on generating images within modifying their subjective emotions via Multimodal LLM. Further, this paper contends that how to make the model efficiently align emotion conversion in semantics (named inter-emotion semantic conversion) and how to precisely retain emotion-agnostic contents (named exter-emotion semantic retaining) are rather important and challenging in this L-AVC task. To this end, this paper proposes an Efficient and Precise Emotion Manipulating approach for editing subjective emotions in images. Specifically, an Efficient Inter-emotion Converting (EIC) module is tailored to make the LLM efficiently align emotion conversion in semantics before and after editing, followed by a Precise Exter-emotion Retaining (PER) module to precisely retain the emotion-agnostic contents. Comprehensive experimental evaluations on our constructed L-AVC dataset demonstrate the great advantage of the proposed EPEM approach to the L-AVC task over several state-of-the-art baselines. This justifies the importance of emotion information for L-AVC and the effectiveness of EPEM in efficiently and precisely manipulating such information.

[69] DeepSVU: Towards In-depth Security-oriented Video Understanding via Unified Physical-world Regularized MoE

Yujie Jin, Wenxin Zhang, Jingjing Wang, Guodong Zhou

Main category: cs.CV

TL;DR: A new security video understanding task (DeepSVU) that goes beyond threat detection to include cause attribution and evaluation, with a novel MoE-based approach incorporating physical-world modeling.

DetailsMotivation: Existing security video understanding focuses only on detecting and localizing threats, lacking capabilities for generating and evaluating threat causes. The paper aims to create a more comprehensive security video analysis system.

Method: Proposes Unified Physical-world Regularized MoE (UPRM) with two components: Unified Physical-world Enhanced MoE (UPE) Block for modeling coarse-to-fine physical-world information, and Physical-world Trade-off Regularizer (PTR) for adaptive factor balancing.

Result: UPRM outperforms several advanced Video-LLMs and non-VLM approaches on DeepSVU instruction datasets (UCF-C instructions and CUVA instructions).

Conclusion: The work demonstrates the importance of coarse-to-fine physical-world information in security video understanding and shows UPRM’s effectiveness in capturing such information.

Abstract: In the literature, prior research on Security-oriented Video Understanding (SVU) has predominantly focused on detecting and localize the threats (e.g., shootings, robberies) in videos, while largely lacking the effective capability to generate and evaluate the threat causes. Motivated by these gaps, this paper introduces a new chat paradigm SVU task, i.e., In-depth Security-oriented Video Understanding (DeepSVU), which aims to not only identify and locate the threats but also attribute and evaluate the causes threatening segments. Furthermore, this paper reveals two key challenges in the proposed task: 1) how to effectively model the coarse-to-fine physical-world information (e.g., human behavior, object interactions and background context) to boost the DeepSVU task; and 2) how to adaptively trade off these factors. To tackle these challenges, this paper proposes a new Unified Physical-world Regularized MoE (UPRM) approach. Specifically, UPRM incorporates two key components: the Unified Physical-world Enhanced MoE (UPE) Block and the Physical-world Trade-off Regularizer (PTR), to address the above two challenges, respectively. Extensive experiments conduct on our DeepSVU instructions datasets (i.e., UCF-C instructions and CUVA instructions) demonstrate that UPRM outperforms several advanced Video-LLMs as well as non-VLM approaches. Such information.These justify the importance of the coarse-to-fine physical-world information in the DeepSVU task and demonstrate the effectiveness of our UPRM in capturing such information.

[70] UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models

Jiabing Yang, Yixiang Chen, Yuan Xu, Peiyan Li, Xiangnan Wu, Zichen Wen, Bowen Fang, Tao Yu, Zhengbo Zhang, Yingda Li, Kai Wang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang

Main category: cs.CV

TL;DR: A training-free plug-and-play module called Uncertainty-aware Observation Reinjection (UAOR) that improves Vision-Language-Action models by reinjecting observation information when the model shows high uncertainty, without needing extra data or modules.

DetailsMotivation: Existing VLA models often require additional observation cues (depth maps, point clouds) or auxiliary modules (object detectors) for better performance, which demands costly data collection and training. The authors aim to create a training-free solution that enhances VLA performance without these requirements.

Method: UAOR uses Action Entropy to measure uncertainty in language model layers. When high uncertainty is detected, it retrieves and reinjects key observation information into the next layer’s Feed-Forward Network through attention retrieval, helping the model better attend to observations during inference.

Result: The method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. It eliminates the need for additional observation cues or modules, making it a versatile plug-in for existing VLA pipelines.

Conclusion: UAOR provides an effective, training-free, and plug-and-play solution to enhance VLA model performance by dynamically reinjecting observation information when needed, offering a practical improvement for robotic manipulation tasks.

Abstract: Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Network (FFN) in language models can act as “key-value memory”, we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer’s Feed-Forward Network (FFN) through attention retrieval. This mechanism helps VLAs better attend to observations during inference, enabling more confident and faithful action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at https://uaor.jiabingyang.cn.

[71] Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers

Guandong Li, Mengxia Ye

Main category: cs.CV

TL;DR: DCAG is a training-free framework for diffusion-based image editing that simultaneously manipulates both Key and Value channels in DiT’s attention layers, enabling more precise control over editing intensity than existing Key-only methods.

DetailsMotivation: Existing attention manipulation methods for diffusion-based image editing focus only on the Key space to control attention routing, ignoring the Value space which governs feature aggregation. This limits the ability to achieve training-free control over editing intensity and fidelity trade-offs.

Method: The authors discovered that both Key and Value projections in DiT’s multi-modal attention layers exhibit a bias-delta structure. They propose Dual-Channel Attention Guidance (DCAG) that manipulates both Key (controlling where to attend) and Value (controlling what to aggregate) channels simultaneously. Key operates through softmax for coarse control, while Value operates through linear weighted summation for fine-grained complement.

Result: Extensive experiments on PIE-Bench (700 images, 10 editing categories) show DCAG consistently outperforms Key-only guidance across all fidelity metrics. Most significant improvements are in localized editing tasks: 4.9% LPIPS reduction for object deletion and 3.2% LPIPS reduction for object addition.

Conclusion: DCAG enables more precise editing-fidelity trade-offs through a two-dimensional parameter space (δ_k, δ_v), demonstrating that simultaneous manipulation of both Key and Value channels provides superior control over diffusion-based image editing compared to single-channel methods.

Abstract: Training-free control over editing intensity is a critical requirement for diffusion-based image editing models built on the Diffusion Transformer (DiT) architecture. Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing, leaving the Value space – which governs feature aggregation – entirely unexploited. In this paper, we first reveal that both Key and Value projections in DiT’s multi-modal attention layers exhibit a pronounced bias-delta structure, where token embeddings cluster tightly around a layer-specific bias vector. Building on this observation, we propose Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key channel (controlling where to attend) and the Value channel (controlling what to aggregate). We provide a theoretical analysis showing that the Key channel operates through the nonlinear softmax function, acting as a coarse control knob, while the Value channel operates through linear weighted summation, serving as a fine-grained complement. Together, the two-dimensional parameter space $(δ_k, δ_v)$ enables more precise editing-fidelity trade-offs than any single-channel method. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) demonstrate that DCAG consistently outperforms Key-only guidance across all fidelity metrics, with the most significant improvements observed in localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).

[72] Spatio-temporal Decoupled Knowledge Compensator for Few-Shot Action Recognition

Hongyu Qu, Xiangbo Shu, Rui Yan, Hailiang Gao, Wenguan Wang, Jinhui Tang

Main category: cs.CV

TL;DR: DiST: A novel framework for Few-Shot Action Recognition that uses LLM-generated spatial and temporal attribute descriptions to learn expressive multi-granularity prototypes, achieving state-of-the-art results.

DetailsMotivation: Current FSAR methods use coarse category names as auxiliary contexts, which provide limited background knowledge for capturing novel spatial and temporal concepts in actions. There's a need for more comprehensive semantic guidance.

Method: Two-stage framework: 1) Decomposition stage uses LLMs to decouple action names into diverse spatio-temporal attribute descriptions. 2) Incorporation stage uses Spatial/Temporal Knowledge Compensators (SKC/TKC) to discover object-level and frame-level prototypes guided by spatial and temporal knowledge respectively.

Result: Achieves state-of-the-art results on five standard Few-Shot Action Recognition datasets, demonstrating effectiveness of using LLM-generated spatial and temporal knowledge.

Conclusion: DiST effectively leverages LLM-generated spatial and temporal knowledge to learn discriminative prototypes, providing transparency in capturing fine-grained spatial details and diverse temporal patterns for few-shot action recognition.

Abstract: Few-Shot Action Recognition (FSAR) is a challenging task that requires recognizing novel action categories with a few labeled videos. Recent works typically apply semantically coarse category names as auxiliary contexts to guide the learning of discriminative visual features. However, such context provided by the action names is too limited to provide sufficient background knowledge for capturing novel spatial and temporal concepts in actions. In this paper, we propose DiST, an innovative Decomposition-incorporation framework for FSAR that makes use of decoupled Spatial and Temporal knowledge provided by large language models to learn expressive multi-granularity prototypes. In the decomposition stage, we decouple vanilla action names into diverse spatio-temporal attribute descriptions (action-related knowledge). Such commonsense knowledge complements semantic contexts from spatial and temporal perspectives. In the incorporation stage, we propose Spatial/Temporal Knowledge Compensators (SKC/TKC) to discover discriminative object-level and frame-level prototypes, respectively. In SKC, object-level prototypes adaptively aggregate important patch tokens under the guidance of spatial knowledge. Moreover, in TKC, frame-level prototypes utilize temporal attributes to assist in inter-frame temporal relation modeling. These learned prototypes thus provide transparency in capturing fine-grained spatial details and diverse temporal patterns. Experimental results show DiST achieves state-of-the-art results on five standard FSAR datasets.

[73] CityGuard: Graph-Aware Private Descriptors for Bias-Resilient Identity Search Across Urban Cameras

Rong Fu, Wenxin Zhang, Yibo Meng, Jia Yee Tan, Jiaxuan Lu, Rui Lu, Jiekai Wu, Zhaolu Kang, Simon Fong

Main category: cs.CV

TL;DR: CityGuard is a privacy-preserving person re-identification framework for decentralized surveillance that uses topology-aware transformers with dispersion-adaptive metric learning, spatially conditioned attention, and differentially private embeddings.

DetailsMotivation: City-scale person re-identification faces challenges of severe appearance changes from viewpoint, occlusion, and domain shift while needing to comply with data protection rules that prevent sharing raw imagery. There's a need for privacy-preserving identity retrieval in decentralized surveillance systems.

Method: Three-component framework: 1) Dispersion-adaptive metric learner adjusts instance-level margins based on feature spread to increase intra-class compactness; 2) Spatially conditioned attention injects coarse geometry (GPS or floor plans) into graph-based self-attention for projectively consistent cross-view alignment without survey-grade calibration; 3) Differentially private embedding maps with compact approximate indexes for secure and cost-efficient deployment.

Result: Experiments on Market-1501 and additional public benchmarks show consistent gains in retrieval precision and query throughput over strong baselines. The framework produces descriptors robust to viewpoint variation, occlusion, and domain shifts, enabling tunable balance between privacy and utility under differential-privacy accounting.

Conclusion: CityGuard provides a practical framework for privacy-critical urban identity matching that addresses both technical challenges (viewpoint variation, occlusion, domain shifts) and privacy requirements through differentially private embeddings and decentralized architecture.

Abstract: City-scale person re-identification across distributed cameras must handle severe appearance changes from viewpoint, occlusion, and domain shift while complying with data protection rules that prevent sharing raw imagery. We introduce CityGuard, a topology-aware transformer for privacy-preserving identity retrieval in decentralized surveillance. The framework integrates three components. A dispersion-adaptive metric learner adjusts instance-level margins according to feature spread, increasing intra-class compactness. Spatially conditioned attention injects coarse geometry, such as GPS or deployment floor plans, into graph-based self-attention to enable projectively consistent cross-view alignment using only coarse geometric priors without requiring survey-grade calibration. Differentially private embedding maps are coupled with compact approximate indexes to support secure and cost-efficient deployment. Together these designs produce descriptors robust to viewpoint variation, occlusion, and domain shifts, and they enable a tunable balance between privacy and utility under rigorous differential-privacy accounting. Experiments on Market-1501 and additional public benchmarks, complemented by database-scale retrieval studies, show consistent gains in retrieval precision and query throughput over strong baselines, confirming the practicality of the framework for privacy-critical urban identity matching.

[74] Temporal Consistency-Aware Text-to-Motion Generation

Hongsong Wang, Wenjing Yan, Qiuxia Lai, Xin Geng

Main category: cs.CV

TL;DR: TCA-T2M introduces a temporal consistency-aware framework for text-to-motion generation that addresses cross-sequence temporal alignment issues to produce more realistic and physically plausible human motions.

DetailsMotivation: Current two-stage text-to-motion generation frameworks often neglect cross-sequence temporal consistency, leading to semantic misalignments and physically implausible motions. The paper aims to address this limitation by focusing on shared temporal structures across different instances of the same action.

Method: Proposes TCA-T2M with three key components: 1) Temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment, 2) Masked motion transformer for text-conditioned motion generation, and 3) Kinematic constraint block to mitigate discretization artifacts and ensure physical plausibility.

Result: Experiments on HumanML3D and KIT-ML benchmarks demonstrate state-of-the-art performance, highlighting the importance of temporal consistency in robust and coherent text-to-motion generation.

Conclusion: Temporal consistency is crucial for generating realistic and physically plausible human motions from text descriptions, and the proposed TCA-T2M framework effectively addresses this challenge through cross-sequence temporal alignment and kinematic constraints.

Abstract: Text-to-Motion (T2M) generation aims to synthesize realistic human motion sequences from natural language descriptions. While two-stage frameworks leveraging discrete motion representations have advanced T2M research, they often neglect cross-sequence temporal consistency, i.e., the shared temporal structures present across different instances of the same action. This leads to semantic misalignments and physically implausible motions. To address this limitation, we propose TCA-T2M, a framework for temporal consistency-aware T2M generation. Our approach introduces a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment, coupled with a masked motion transformer for text-conditioned motion generation. Additionally, a kinematic constraint block mitigates discretization artifacts to ensure physical plausibility. Experiments on HumanML3D and KIT-ML benchmarks demonstrate that TCA-T2M achieves state-of-the-art performance, highlighting the importance of temporal consistency in robust and coherent T2M generation.

[75] 3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis

Ziyue Wang, Linghan Cai, Chang Han Low, Haofeng Liu, Junde Wu, Jingyu Wang, Rui Wang, Lei Song, Jiang Bian, Jingjing Fu, Yueming Jin

Main category: cs.CV

TL;DR: 3DMedAgent enables 2D multimodal LLMs to perform 3D CT analysis without 3D fine-tuning by coordinating visual/textual tools and maintaining structured memory for evidence-driven reasoning.

DetailsMotivation: Existing 3D analysis methods use isolated task-specific modeling or task-agnostic end-to-end approaches, lacking systematic perceptual evidence accumulation. Current MLLMs are predominantly 2D-oriented and limited for volumetric medical data analysis.

Method: 3DMedAgent coordinates heterogeneous visual and textual tools through an MLLM agent, decomposing complex 3D analysis into tractable subtasks: global to regional views, 3D volumes to 2D slices, visual evidence to structured textual representations. Maintains long-term structured memory for query-adaptive, evidence-driven multi-step reasoning.

Result: Experiments across 40+ tasks show 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs. Introduced DeepChestVQA benchmark for evaluating unified perception-to-understanding in 3D thoracic imaging.

Conclusion: 3DMedAgent provides a scalable path toward general-purpose 3D clinical assistants by enabling 2D MLLMs to perform comprehensive 3D CT analysis without 3D-specific fine-tuning.

Abstract: 3D CT analysis spans a continuum from low-level perception to high-level clinical understanding. Existing 3D-oriented analysis methods adopt either isolated task-specific modeling or task-agnostic end-to-end paradigms to produce one-hop outputs, impeding the systematic accumulation of perceptual evidence for downstream reasoning. In parallel, recent multimodal large language models (MLLMs) exhibit improved visual perception and can integrate visual and textual information effectively, yet their predominantly 2D-oriented designs fundamentally limit their ability to perceive and analyze volumetric medical data. To bridge this gap, we propose 3DMedAgent, a unified agent that enables 2D MLLMs to perform general 3D CT analysis without 3D-specific fine-tuning. 3DMedAgent coordinates heterogeneous visual and textual tools through a flexible MLLM agent, progressively decomposing complex 3D analysis into tractable subtasks that transition from global to regional views, from 3D volumes to informative 2D slices, and from visual evidence to structured textual representations. Central to this design, 3DMedAgent maintains a long-term structured memory that aggregates intermediate tool outputs and supports query-adaptive, evidence-driven multi-step reasoning. We further introduce the DeepChestVQA benchmark for evaluating unified perception-to-understanding capabilities in 3D thoracic imaging. Experiments across over 40 tasks demonstrate that 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs, highlighting a scalable path toward general-purpose 3D clinical assistants.Code and data are available at \href{https://github.com/jinlab-imvr/3DMedAgent}{https://github.com/jinlab-imvr/3DMedAgent}.

[76] Faster Training, Fewer Labels: Self-Supervised Pretraining for Fine-Grained BEV Segmentation

Daniel Busch, Christian Bohn, Thomas Kurbiel, Klaus Friedrichs, Richard Meyes, Tobias Meisen

Main category: cs.CV

TL;DR: Self-supervised BEV road marking segmentation using image pseudo-labels and temporal consistency, reducing annotation needs by 50% while improving performance.

DetailsMotivation: Current BEV semantic mapping methods for autonomous driving rely on expensive, inconsistently annotated ground truth data, creating scalability challenges.

Method: Two-phase training: 1) Self-supervised pretraining with BEV predictions reprojected to image plane and trained against Mask2Former pseudo-labels plus temporal consistency loss, 2) Supervised fine-tuning with only 50% of labeled data.

Result: Outperforms fully supervised baseline by up to +2.5pp mIoU on nuScenes while halving annotation data usage and reducing total training time by up to two-thirds.

Conclusion: Differentiable reprojection with camera perspective pseudo-labels yields transferable BEV features and provides a scalable path toward reduced-label autonomous perception.

Abstract: Dense Bird’s Eye View (BEV) semantic maps are central to autonomous driving, yet current multi-camera methods depend on costly, inconsistently annotated BEV ground truth. We address this limitation with a two-phase training strategy for fine-grained road marking segmentation that removes full supervision during pretraining and halves the amount of training data during fine-tuning while still outperforming the comparable supervised baseline model. During the self-supervised pretraining, BEVFormer predictions are differentiably reprojected into the image plane and trained against multi-view semantic pseudo-labels generated by the widely used semantic segmentation model Mask2Former. A temporal loss encourages consistency across frames. The subsequent supervised fine-tuning phase requires only 50% of the dataset and significantly less training time. With our method, the fine-tuning benefits from rich priors learned during pretraining boosting the performance and BEV segmentation quality (up to +2.5pp mIoU over the fully supervised baseline) on nuScenes. It simultaneously halves the usage of annotation data and reduces total training time by up to two thirds. The results demonstrate that differentiable reprojection plus camera perspective pseudo labels yields transferable BEV features and a scalable path toward reduced-label autonomous perception.

[77] Comparative Assessment of Multimodal Earth Observation Data for Soil Moisture Estimation

Ioannis Kontogiorgakis, Athanasios Askitopoulos, Iason Tsardanidis, Dimitrios Bormpoudakis, Ilias Tsoumas, Fotios Balampanis, Charalampos Kontoes

Main category: cs.CV

TL;DR: High-resolution (10m) soil moisture estimation framework combining Sentinel-1 SAR, Sentinel-2 optical imagery, and ERA-5 reanalysis data using machine learning for pan-European field-scale monitoring.

DetailsMotivation: Existing satellite soil moisture products are too coarse (>1km) for farm-level applications, creating a need for high-resolution estimation for precision agriculture, water resources management, and climate monitoring.

Method: Combines Sentinel-1 SAR, Sentinel-2 optical imagery, and ERA-5 reanalysis data through machine learning with spatial cross-validation. Evaluates modality combinations with temporal parameterizations and compares foundation model embeddings (IBM-NASA’s Prithvi) against traditional hand-crafted spectral features.

Result: Hybrid temporal matching (Sentinel-2 current-day with Sentinel-1 descending orbit) achieves R²=0.514, improved to R²=0.518 with 10-day ERA5 lookback. Foundation model embeddings provide negligible improvement over hand-crafted features (R²=0.515 vs. 0.514).

Conclusion: Domain-specific spectral indices combined with tree-based ensemble methods offer a practical and computationally efficient solution for operational pan-European field-scale soil moisture monitoring, with traditional feature engineering remaining competitive for sparse-data regression tasks.

Abstract: Accurate soil moisture (SM) estimation is critical for precision agriculture, water resources management and climate monitoring. Yet, existing satellite SM products are too coarse (>1km) for farm-level applications. We present a high-resolution (10m) SM estimation framework for vegetated areas across Europe, combining Sentinel-1 SAR, Sentinel-2 optical imagery and ERA-5 reanalysis data through machine learning. Using 113 International Soil Moisture Network (ISMN) stations spanning diverse vegetated areas, we compare modality combinations with temporal parameterizations, using spatial cross-validation, to ensure geographic generalization. We also evaluate whether foundation model embeddings from IBM-NASA’s Prithvi model improve upon traditional hand-crafted spectral features. Results demonstrate that hybrid temporal matching - Sentinel-2 current-day acquisitions with Sentinel-1 descending orbit - achieves R^2=0.514, with 10-day ERA5 lookback window improving performance to R^2=0.518. Foundation model (Prithvi) embeddings provide negligible improvement over hand-crafted features (R^2=0.515 vs. 0.514), indicating traditional feature engineering remains highly competitive for sparse-data regression tasks. Our findings suggest that domain-specific spectral indices combined with tree-based ensemble methods offer a practical and computationally efficient solution for operational pan-European field-scale soil moisture monitoring.

[78] DohaScript: A Large-Scale Multi-Writer Dataset for Continuous Handwritten Hindi Text

Kunwar Arpit Singh, Ankush Prakash, Haroon R Lone

Main category: cs.CV

TL;DR: DohaScript: A large-scale multi-writer dataset of handwritten Hindi text with controlled lexical content for systematic handwriting analysis in Devanagari script.

DetailsMotivation: Handwritten Devanagari text is severely underrepresented in benchmark datasets, with existing resources being limited in scale, focusing on isolated characters/short words, lacking controlled content and writer diversity, and failing to capture the continuous, fused nature of Devanagari handwriting with shared shirorekha and ligature formations.

Method: Collected handwritten Hindi text from 531 unique contributors, designed as a parallel stylistic corpus where all writers transcribe the same fixed set of six traditional Hindi dohas (couplets). Includes non-identifiable demographic metadata, rigorous quality curation based on objective sharpness/resolution criteria, and page-level layout difficulty annotations.

Result: Baseline experiments demonstrate clear quality separation and strong generalization to unseen writers, highlighting the dataset’s reliability and practical value for handwriting recognition, writer identification, style analysis, and generative modeling.

Conclusion: DohaScript serves as a standardized, reproducible benchmark for advancing research on continuous handwritten Devanagari text in low-resource script settings, enabling systematic analysis of writer-specific variation independent of linguistic content.

Abstract: Despite having hundreds of millions of speakers, handwritten Devanagari text remains severely underrepresented in publicly available benchmark datasets. Existing resources are limited in scale, focus primarily on isolated characters or short words, and lack controlled lexical content and writer level diversity, which restricts their utility for modern data driven handwriting analysis. As a result, they fail to capture the continuous, fused, and structurally complex nature of Devanagari handwriting, where characters are connected through a shared shirorekha (horizontal headline) and exhibit rich ligature formations. We introduce DohaScript, a large scale, multi writer dataset of handwritten Hindi text collected from 531 unique contributors. The dataset is designed as a parallel stylistic corpus, in which all writers transcribe the same fixed set of six traditional Hindi dohas (couplets). This controlled design enables systematic analysis of writer specific variation independent of linguistic content, and supports tasks such as handwriting recognition, writer identification, style analysis, and generative modeling. The dataset is accompanied by non identifiable demographic metadata, rigorous quality curation based on objective sharpness and resolution criteria, and page level layout difficulty annotations that facilitate stratified benchmarking. Baseline experiments demonstrate clear quality separation and strong generalization to unseen writers, highlighting the dataset’s reliability and practical value. DohaScript is intended to serve as a standardized and reproducible benchmark for advancing research on continuous handwritten Devanagari text in low resource script settings.

[79] Predict to Skip: Linear Multistep Feature Forecasting for Efficient Diffusion Transformers

Hanshuai Cui, Zhiqing Tang, Qianli Ma, Zhi Yao, Weijia Jia

Main category: cs.CV

TL;DR: PrediT accelerates Diffusion Transformers by predicting future model outputs using linear multistep methods instead of naive feature reuse, achieving 5.54× speedup with minimal quality loss.

DetailsMotivation: Diffusion Transformers (DiT) are computationally expensive due to iterative denoising. Existing acceleration methods reuse cached features but suffer from latent drift and visual degradation. The authors observe smooth evolution of model outputs during diffusion, enabling principled prediction rather than naive reuse.

Method: PrediT formulates feature prediction as a linear multistep problem using classical linear multistep methods to forecast future model outputs from historical information. It includes a corrector for high-dynamics regions to prevent error accumulation and a dynamic step modulation mechanism that adaptively adjusts prediction horizon based on feature change rate.

Result: Achieves up to 5.54× latency reduction across various DiT-based image and video generation models with negligible quality degradation. Extensive experiments validate effectiveness.

Conclusion: PrediT provides a training-free acceleration framework for DiT models that significantly reduces computational costs while maintaining generation fidelity through principled feature prediction rather than naive reuse.

Abstract: Diffusion Transformers (DiT) have emerged as a widely adopted backbone for high-fidelity image and video generation, yet their iterative denoising process incurs high computational costs. Existing training-free acceleration methods rely on feature caching and reuse under the assumption of temporal stability. However, reusing features for multiple steps may lead to latent drift and visual degradation. We observe that model outputs evolve smoothly along much of the diffusion trajectory, enabling principled predictions rather than naive reuse. Based on this insight, we propose \textbf{PrediT}, a training-free acceleration framework that formulates feature prediction as a linear multistep problem. We employ classical linear multistep methods to forecast future model outputs from historical information, combined with a corrector that activates in high-dynamics regions to prevent error accumulation. A dynamic step modulation mechanism adaptively adjusts the prediction horizon by monitoring the feature change rate. Together, these components enable substantial acceleration while preserving generation fidelity. Extensive experiments validate that our method achieves up to $5.54\times$ latency reduction across various DiT-based image and video generation models, while incurring negligible quality degradation.

[80] OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models

Ling Lin, Yang Bai, Heng Su, Congcong Zhu, Yaoxing Wang, Yang Zhou, Huazhu Fu, Jingrun Chen

Main category: cs.CV

TL;DR: OODBench: A benchmark for evaluating Visual-Language Models’ performance on out-of-distribution data, featuring 40K OOD instance-category pairs and automated assessment metrics.

DetailsMotivation: Current VLMs assume IID data, but real-world applications face OOD scenarios that can introduce safety risks. There's a lack of comprehensive benchmarks to assess VLM performance on OOD data.

Method: Proposes OODBench - an automated method with minimal human verification to construct benchmarks for evaluating VLM performance on OOD data. Includes 40K OOD instance-category pairs and uses Basic-to-Advanced Progression prompted questions for assessment.

Result: Current VLMs show notable performance degradation on OODBench, even for common image categories. The automated assessment metric effectively evaluates OOD impact across varying question difficulties.

Conclusion: OODBench provides a comprehensive benchmark for evaluating VLM robustness to OOD data, revealing current limitations and offering insights for future research in OOD data acquisition and evaluation.

Abstract: Existing Visual-Language Models (VLMs) have achieved significant progress by being trained on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). However, in real-world scenarios, it is often impractical to expect that all data processed by an AI system satisfy this assumption. Furthermore, failure to appropriately handle out-of-distribution (OOD) objects may introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Unfortunately, current research has not yet provided valid benchmarks that can comprehensively assess the performance of VLMs in response to OOD data. Therefore, we propose OODBench, a predominantly automated method with minimal human verification, for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench, even when the underlying image categories are common. In addition, we propose a reliable automated assessment metric that employs a Basic-to-Advanced Progression of prompted questions to assess the impact of OOD data on questions of varying difficulty more fully. Lastly, we summarize substantial findings and insights to facilitate future research in the acquisition and evaluation of OOD data.

[81] Evaluating Graphical Perception Capabilities of Vision Transformers

Poonam Poonam, Pere-Pau Vázquez, Timo Ropinski

Main category: cs.CV

TL;DR: ViTs perform well on general vision tasks but show limited alignment with human graphical perception in visualization tasks compared to CNNs and human participants.

DetailsMotivation: While Vision Transformers (ViTs) have become powerful alternatives to CNNs for image tasks, their perceptual capabilities for graphical perception tasks (essential for interpreting visualizations) remain unexplored, unlike CNNs which have been evaluated for such tasks.

Method: Benchmark ViTs against CNNs and human participants in controlled graphical perception tasks inspired by Cleveland and McGill’s foundational studies that quantified human perception accuracy across different visual encodings.

Result: ViTs demonstrate strong performance in general vision tasks but show limited alignment with human-like graphical perception in the visualization domain, revealing perceptual gaps compared to CNNs and human participants.

Conclusion: The study highlights important perceptual limitations of ViTs for visualization tasks and points to key considerations for applying ViTs in visualization systems and graphical perceptual modeling.

Abstract: Vision Transformers, ViTs, have emerged as a powerful alternative to convolutional neural networks, CNNs, in a variety of image-based tasks. While CNNs have previously been evaluated for their ability to perform graphical perception tasks, which are essential for interpreting visualizations, the perceptual capabilities of ViTs remain largely unexplored. In this work, we investigate the performance of ViTs in elementary visual judgment tasks inspired by the foundational studies of Cleveland and McGill, which quantified the accuracy of human perception across different visual encodings. Inspired by their study, we benchmark ViTs against CNNs and human participants in a series of controlled graphical perception tasks. Our results reveal that, although ViTs demonstrate strong performance in general vision tasks, their alignment with human-like graphical perception in the visualization domain is limited. This study highlights key perceptual gaps and points to important considerations for the application of ViTs in visualization systems and graphical perceptual modeling.

[82] BLM-Guard: Explainable Multimodal Ad Moderation with Chain-of-Thought and Policy-Aligned Rewards

Yiran Yang, Zhaowei Liu, Yuan Yuan, Yukun Song, Xiong Ma, Yinghao Song, Xiangji Zeng, Lu Sun, Yulu Wang, Hai Zhou, Shuai Cui, Zhaohan Gong, Jiefei Zhang

Main category: cs.CV

TL;DR: BLM-Guard: A multimodal content moderation framework for short-video ads using Chain-of-Thought reasoning, rule-based policies, and reinforcement learning to detect deceptive visuals, speech, and subtitles.

DetailsMotivation: Short-video platforms host deceptive multimodal ads that require finer-grained, policy-driven moderation beyond community safety filters, as current approaches lack the sophistication to handle complex multimodal manipulations.

Method: Combines Chain-of-Thought reasoning with rule-based policy principles and critic-guided rewards. Uses a rule-driven ICoT data-synthesis pipeline to generate structured scene descriptions, reasoning chains, and labels. Employs reinforcement learning with composite rewards balancing causal coherence and policy adherence. Features a multitask architecture modeling intra-modal manipulations and cross-modal mismatches.

Result: Experiments on real short-video ads show BLM-Guard surpasses strong baselines in accuracy, consistency, and generalization for multimodal content moderation.

Conclusion: BLM-Guard provides an effective framework for policy-driven multimodal content moderation of commercial ads, addressing both intra-modal manipulations and cross-modal inconsistencies through structured reasoning and reinforcement learning.

Abstract: Short-video platforms now host vast multimodal ads whose deceptive visuals, speech and subtitles demand finer-grained, policy-driven moderation than community safety filters. We present BLM-Guard, a content-audit framework for commercial ads that fuses Chain-of-Thought reasoning with rule-based policy principles and a critic-guided reward. A rule-driven ICoT data-synthesis pipeline jump-starts training by generating structured scene descriptions, reasoning chains and labels, cutting annotation costs. Reinforcement learning then refines the model using a composite reward balancing causal coherence with policy adherence. A multitask architecture models intra-modal manipulations (e.g., exaggerated imagery) and cross-modal mismatches (e.g., subtitle-speech drift), boosting robustness. Experiments on real short-video ads show BLM-Guard surpasses strong baselines in accuracy, consistency and generalization.

[83] A Self-Supervised Approach on Motion Calibration for Enhancing Physical Plausibility in Text-to-Motion

Gahyeon Shim, Soogeun Park, Hyemin Ahn

Main category: cs.CV

TL;DR: DMC is a post-hoc module that refines text-generated human motions to improve physical plausibility while preserving semantic alignment with text descriptions, using a self-supervised approach with intentionally distorted motions as training data.

DetailsMotivation: Current text-to-motion generation models produce semantically aligned motions but often lack physical realism (e.g., foot floating, penetration issues). There's a need to improve physical plausibility without sacrificing semantic consistency with textual descriptions.

Method: DMC uses a self-supervised, data-driven approach where it learns to generate physically plausible motions from intentionally distorted motions and original text descriptions. It acts as a post-hoc module that can be applied to various text-to-motion generation models without complex physical modeling.

Result: DMC reduces FID score by 42.74% on T2M and 13.20% on T2M-GPT, achieves highest R-Precision, reduces penetration by 33.0% on MoMask, and adjusts floating artifacts closer to ground-truth references while maintaining semantic consistency.

Conclusion: DMC serves as an effective post-hoc motion refinement framework that improves physical plausibility of text-generated motions while preserving semantic alignment, applicable to various text-to-motion models without requiring complex physical modeling.

Abstract: Generating semantically aligned human motion from textual descriptions has made rapid progress, but ensuring both semantic and physical realism in motion remains a challenge. In this paper, we introduce the Distortion-aware Motion Calibrator (DMC), a post-hoc module that refines physically implausible motions (e.g., foot floating) while preserving semantic consistency with the original textual description. Rather than relying on complex physical modeling, we propose a self-supervised and data-driven approach, whereby DMC learns to obtain physically plausible motions when an intentionally distorted motion and the original textual descriptions are given as inputs. We evaluate DMC as a post-hoc module to improve motions obtained from various text-to-motion generation models and demonstrate its effectiveness in improving physical plausibility while enhancing semantic consistency. The experimental results show that DMC reduces FID score by 42.74% on T2M and 13.20% on T2M-GPT, while also achieving the highest R-Precision. When applied to high-quality models like MoMask, DMC improves the physical plausibility of motions by reducing penetration by 33.0% as well as adjusting floating artifacts closer to the ground-truth reference. These results highlight that DMC can serve as a promising post-hoc motion refinement framework for any kind of text-to-motion models by incorporating textual semantics and physical plausibility.

[84] On the Adversarial Robustness of Discrete Image Tokenizers

Rishika Bhagwatkar, Irina Rish, Nicolas Flammarion, Francesco Croce

Main category: cs.CV

TL;DR: First study of adversarial attacks on discrete image tokenizers used in multimodal models, showing they’re vulnerable to efficient attacks across tasks, and proposing unsupervised adversarial training for defense.

DetailsMotivation: Discrete image tokenizers are increasingly used in multimodal systems but their vulnerability to adversarial attacks hasn't been explored, unlike CLIP encoders. Understanding and addressing these vulnerabilities is crucial for developing safe multimodal foundation models.

Method: 1) Formulate computationally efficient, application-agnostic attacks that perturb features extracted by discrete tokenizers to change extracted tokens. 2) Defend by fine-tuning popular tokenizers with unsupervised adversarial training while keeping other components frozen, inspired by robust CLIP encoder work.

Result: Attacks are effective across classification, multimodal retrieval, and captioning tasks. Unsupervised adversarial training significantly improves robustness to both unsupervised and end-to-end supervised attacks, generalizes well to unseen tasks and data, and can leverage unlabeled images unlike supervised approaches.

Conclusion: Tokenizer robustness plays a critical role in downstream tasks. This work presents an important step toward developing safe multimodal foundation models by highlighting vulnerabilities and providing effective defense mechanisms.

Abstract: Discrete image tokenizers encode visual inputs as sequences of tokens from a finite vocabulary and are gaining popularity in multimodal systems, including encoder-only, encoder-decoder, and decoder-only models. However, unlike CLIP encoders, their vulnerability to adversarial attacks has not been explored. Ours being the first work studying this topic, we first formulate attacks that aim to perturb the features extracted by discrete tokenizers, and thus change the extracted tokens. These attacks are computationally efficient, application-agnostic, and effective across classification, multimodal retrieval, and captioning tasks. Second, to defend against this vulnerability, inspired by recent work on robust CLIP encoders, we fine-tune popular tokenizers with unsupervised adversarial training, keeping all other components frozen. While unsupervised and task-agnostic, our approach significantly improves robustness to both unsupervised and end-to-end supervised attacks and generalizes well to unseen tasks and data. Unlike supervised adversarial training, our approach can leverage unlabeled images, making it more versatile. Overall, our work highlights the critical role of tokenizer robustness in downstream tasks and presents an important step in the development of safe multimodal foundation models.

[85] DEIG: Detail-Enhanced Instance Generation with Fine-Grained Semantic Control

Shiyan Du, Conghan Yue, Xinyu Cheng, Dongyu Zhang

Main category: cs.CV

TL;DR: DEIG is a novel framework for fine-grained and controllable multi-instance generation that improves semantic understanding of complex textual descriptions through instance-aware representations and masked attention mechanisms.

DetailsMotivation: Existing multi-instance generation approaches face challenges in fine-grained semantic understanding when dealing with complex textual descriptions, particularly with attribute leakage across instances and lack of precise spatial and semantic control.

Method: DEIG integrates an Instance Detail Extractor (IDE) that transforms text encoder embeddings into compact, instance-aware representations, and a Detail Fusion Module (DFM) that applies instance-based masked attention to prevent attribute leakage across instances. It also uses a high-quality dataset with detailed compositional instance captions generated by VLMs.

Result: DEIG consistently outperforms existing approaches across multiple benchmarks in spatial consistency, semantic accuracy, and compositional generalization. It also functions as a plug-and-play module that can be easily integrated into standard diffusion-based pipelines.

Conclusion: DEIG enables fine-grained and controllable multi-instance generation with improved semantic understanding, addressing limitations in existing approaches through its novel architecture and high-quality training data.

Abstract: Multi-Instance Generation has advanced significantly in spatial placement and attribute binding. However, existing approaches still face challenges in fine-grained semantic understanding, particularly when dealing with complex textual descriptions. To overcome these limitations, we propose DEIG, a novel framework for fine-grained and controllable multi-instance generation. DEIG integrates an Instance Detail Extractor (IDE) that transforms text encoder embeddings into compact, instance-aware representations, and a Detail Fusion Module (DFM) that applies instance-based masked attention to prevent attribute leakage across instances. These components enable DEIG to generate visually coherent multi-instance scenes that precisely match rich, localized textual descriptions. To support fine-grained supervision, we construct a high-quality dataset with detailed, compositional instance captions generated by VLMs. We also introduce DEIG-Bench, a new benchmark with region-level annotations and multi-attribute prompts for both humans and objects. Experiments demonstrate that DEIG consistently outperforms existing approaches across multiple benchmarks in spatial consistency, semantic accuracy, and compositional generalization. Moreover, DEIG functions as a plug-and-play module, making it easily integrable into standard diffusion-based pipelines.

[86] Multi-Level Conditioning by Pairing Localized Text and Sketch for Fashion Image Generation

Ziyue Liu, Davide Talon, Federico Girella, Zanxi Ruan, Mattia Mondo, Loris Bazzani, Yiming Wang, Marco Cristani

Main category: cs.CV

TL;DR: LOTS is a multimodal framework for fashion image generation that combines global sketch structure with localized text-sketch pairs using multi-level conditioning and diffusion guidance.

DetailsMotivation: Fashion design requires combining visual sketches (for structure/silhouette) with textual descriptions (for materials/colors), but existing methods struggle to maintain sketch structure while incorporating localized text guidance.

Method: LOTS uses a Multi-level Conditioning Stage to encode local sketch-text pairs in shared latent space while maintaining global coordination, followed by Diffusion Pair Guidance that integrates local and global conditioning via attention-based guidance during denoising.

Result: The method achieves state-of-the-art performance, improves global structural adherence while leveraging richer localized semantic guidance, and is validated on the new Sketchy dataset containing professional and “in the wild” sketches.

Conclusion: LOTS effectively combines sketch and text modalities for fashion generation, with publicly available dataset, platform, and code to support further research.

Abstract: Sketches offer designers a concise yet expressive medium for early-stage fashion ideation by specifying structure, silhouette, and spatial relationships, while textual descriptions complement sketches to convey material, color, and stylistic details. Effectively combining textual and visual modalities requires adherence to the sketch visual structure when leveraging the guidance of localized attributes from text. We present LOcalized Text and Sketch with multi-level guidance (LOTS), a framework that enhances fashion image generation by combining global sketch guidance with multiple localized sketch-text pairs. LOTS employs a Multi-level Conditioning Stage to independently encode local features within a shared latent space while maintaining global structural coordination. Then, the Diffusion Pair Guidance stage integrates both local and global conditioning via attention-based guidance within the diffusion model’s multi-step denoising process. To validate our method, we develop Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Sketchy provides high-quality, clean sketches with a professional look and consistent structure. To assess robustness beyond this setting, we also include an “in the wild” split with non-expert sketches, featuring higher variability and imperfections. Experiments demonstrate that our method strengthens global structural adherence while leveraging richer localized semantic guidance, achieving improvement over state-of-the-art. The dataset, platform, and code are publicly available.

[87] Diff2DGS: Reliable Reconstruction of Occluded Surgical Scenes via 2D Gaussian Splatting

Tianyi Song, Danail Stoyanov, Evangelos Mazomenos, Francisco Vasconcelos

Main category: cs.CV

TL;DR: Diff2DGS: A two-stage framework using diffusion-based video inpainting and 2D Gaussian Splatting with learnable deformation for real-time 3D reconstruction of occluded surgical scenes, with improved depth accuracy evaluation.

DetailsMotivation: Real-time reconstruction of deformable surgical scenes is crucial for robotic surgery advancement, but existing methods have limited quality in occluded regions and lack proper depth accuracy assessment due to missing 3D ground truth in benchmarks.

Method: Two-stage framework: 1) Diffusion-based video module with temporal priors to inpaint tissue occluded by surgical instruments, 2) Adapted 2D Gaussian Splatting with Learnable Deformation Model (LDM) to capture dynamic tissue deformation and anatomical geometry.

Result: Outperforms state-of-the-art approaches in both appearance and geometry, achieving 38.02 dB PSNR on EndoNeRF and 34.40 dB on StereoMIS. Demonstrates that optimizing for image quality alone doesn’t guarantee optimal 3D reconstruction accuracy.

Conclusion: Diff2DGS provides reliable 3D reconstruction of occluded surgical scenes with both high-fidelity appearance and faithful geometry, addressing limitations in current surgical scene reconstruction methods through improved depth optimization.

Abstract: Real-time reconstruction of deformable surgical scenes is vital for advancing robotic surgery, improving surgeon guidance, and enabling automation. Recent methods achieve dense reconstructions from da Vinci robotic surgery videos, with Gaussian Splatting (GS) offering real-time performance via graphics acceleration. However, reconstruction quality in occluded regions remains limited, and depth accuracy has not been fully assessed, as benchmarks like EndoNeRF and StereoMIS lack 3D ground truth. We propose Diff2DGS, a novel two-stage framework for reliable 3D reconstruction of occluded surgical scenes. In the first stage, a diffusion-based video module with temporal priors inpaints tissue occluded by instruments with high spatial-temporal consistency. In the second stage, we adapt 2D Gaussian Splatting (2DGS) with a Learnable Deformation Model (LDM) to capture dynamic tissue deformation and anatomical geometry. We also extend evaluation beyond prior image-quality metrics by performing quantitative depth accuracy analysis on the SCARED dataset. Diff2DGS outperforms state-of-the-art approaches in both appearance and geometry, reaching 38.02 dB PSNR on EndoNeRF and 34.40 dB on StereoMIS. Furthermore, our experiments demonstrate that optimizing for image quality alone does not necessarily translate into optimal 3D reconstruction accuracy. To address this, we further optimize the depth quality of the reconstructed 3D results, ensuring more faithful geometry in addition to high-fidelity appearance.

[88] Unifying Color and Lightness Correction with View-Adaptive Curve Adjustment for Robust 3D Novel View Synthesis

Ziteng Cui, Shuhong Liu, Xiaoyu Dong, Xuangeng Chu, Lin Gu, Ming-Hsuan Yang, Tatsuya Harada

Main category: cs.CV

TL;DR: Luminance-GS++ is a 3D Gaussian Splatting-based framework for robust novel view synthesis under diverse illumination conditions, addressing photometric inconsistencies in multi-view capture through global lightness adjustment and local residual refinement.

DetailsMotivation: Real-world image acquisition faces challenges from complex illumination variations and camera pipeline limitations, especially in multi-view capture where lighting, sensor, and ISP differences cause photometric inconsistencies that degrade 3D reconstruction quality in methods like NeRF and 3DGS.

Method: Combines globally view-adaptive lightness adjustment with local pixel-wise residual refinement for precise color correction. Uses unsupervised objectives to jointly enforce lightness correction and multi-view geometric/photometric consistency while preserving the explicit 3DGS formulation.

Result: State-of-the-art performance across challenging scenarios including low-light, overexposure, and complex luminance/chromatic variations. Maintains real-time rendering efficiency while improving reconstruction fidelity.

Conclusion: Proposes a robust solution for 3D novel view synthesis under diverse illumination conditions that addresses photometric inconsistencies without modifying the underlying 3DGS representation, enabling high-quality reconstruction and real-time rendering.

Abstract: High-quality image acquisition in real-world environments remains challenging due to complex illumination variations and inherent limitations of camera imaging pipelines. These issues are exacerbated in multi-view capture, where differences in lighting, sensor responses, and image signal processor (ISP) configurations introduce photometric and chromatic inconsistencies that violate the assumptions of photometric consistency underlying modern 3D novel view synthesis (NVS) methods, including Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), leading to degraded reconstruction and rendering quality. We propose Luminance-GS++, a 3DGS-based framework for robust NVS under diverse illumination conditions. Our method combines a globally view-adaptive lightness adjustment with a local pixel-wise residual refinement for precise color correction. We further design unsupervised objectives that jointly enforce lightness correction and multi-view geometric and photometric consistency. Extensive experiments demonstrate state-of-the-art performance across challenging scenarios, including low-light, overexposure, and complex luminance and chromatic variations. Unlike prior approaches that modify the underlying representation, our method preserves the explicit 3DGS formulation, improving reconstruction fidelity while maintaining real-time rendering efficiency.

[89] G-LoG Bi-filtration for Medical Image Classification

Qingsong Wang, Jiaxing He, Bingzhe Hou, Tieru Wu, Yang Cao, Cailing Yao

Main category: cs.CV

TL;DR: A topological data analysis method using Gaussian-Laplacian of Gaussian bi-filtration for medical image analysis, with stability guarantees and competitive performance against deep learning models.

DetailsMotivation: To develop practical filtrations for detecting topological and geometric features in medical images, leveraging the Laplacian of Gaussian operator's ability to enhance image boundaries for better multi-parameter persistence analysis.

Method: Defines G-LoG (Gaussian-Laplacian of Gaussian) bi-filtration for volumetric images modeled as bounded functions, proves stability of interleaving distance with respect to maximum norm, and validates on MedMNIST dataset against single-parameter filtration and deep learning baselines.

Result: The bi-filtration significantly outperforms single-parameter filtration, and a simple MLP trained on topological features achieves performance comparable to complex deep learning models trained on original data.

Conclusion: The proposed G-LoG bi-filtration provides effective topological features for medical image analysis with theoretical stability guarantees and practical performance competitive with state-of-the-art deep learning approaches.

Abstract: Building practical filtrations on objects to detect topological and geometric features is an important task in the field of Topological Data Analysis (TDA). In this paper, leveraging the ability of the Laplacian of Gaussian operator to enhance the boundaries of medical images, we define the G-LoG (Gaussian-Laplacian of Gaussian) bi-filtration to generate the features more suitable for multi-parameter persistence module. By modeling volumetric images as bounded functions, then we prove the interleaving distance on the persistence modules obtained from our bi-filtrations on the bounded functions is stable with respect to the maximum norm of the bounded functions. Finally, we conduct experiments on the MedMNIST dataset, comparing our bi-filtration against single-parameter filtration and the established deep learning baselines, including Google AutoML Vision, ResNet, AutoKeras and auto-sklearn. Experiments results demonstrate that our bi-filtration significantly outperforms single-parameter filtration. Notably, a simple Multi-Layer Perceptron (MLP) trained on the topological features generated by our bi-filtration achieves performance comparable to complex deep learning models trained on the original dataset.

[90] Self-Aware Object Detection via Degradation Manifolds

Stefan Becker, Simon Weiss, Wolfgang Hübner, Michael Arens

Main category: cs.CV

TL;DR: A degradation-aware self-awareness framework for object detectors that structures feature space by image degradation rather than semantic content, enabling detection of when inputs fall outside nominal operating conditions.

DetailsMotivation: Object detectors can fail silently when exposed to image degradations like blur, noise, compression, or adverse weather. In safety-critical applications, detectors need self-awareness to assess whether inputs remain within their nominal operating regime.

Method: Augments detection backbone with lightweight embedding head trained via multi-layer contrastive learning. Images with same degradation composition are pulled together while different degradations are pushed apart, creating geometrically organized representations. Estimates pristine prototype from clean training embeddings as nominal operating point.

Result: Strong pristine-degraded separability on synthetic corruption benchmarks, cross-dataset zero-shot transfer, and natural weather-induced distribution shifts. Consistent behavior across multiple detector architectures and robust generalization under semantic shift.

Conclusion: Degradation-aware representation geometry provides practical, detector-agnostic foundation for self-aware object detection, enabling assessment of when inputs deviate from nominal operating conditions.

Abstract: Object detectors achieve strong performance under nominal imaging conditions but can fail silently when exposed to blur, noise, compression, adverse weather, or resolution changes. In safety-critical settings, it is therefore insufficient to produce predictions without assessing whether the input remains within the detector’s nominal operating regime. We refer to this capability as self-aware object detection. We introduce a degradation-aware self-awareness framework based on degradation manifolds, which explicitly structure a detector’s feature space according to image degradation rather than semantic content. Our method augments a standard detection backbone with a lightweight embedding head trained via multi-layer contrastive learning. Images sharing the same degradation composition are pulled together, while differing degradation configurations are pushed apart, yielding a geometrically organized representation that captures degradation type and severity without requiring degradation labels or explicit density modeling. To anchor the learned geometry, we estimate a pristine prototype from clean training embeddings, defining a nominal operating point in representation space. Self-awareness emerges as geometric deviation from this reference, providing an intrinsic, image-level signal of degradation-induced shift that is independent of detection confidence. Extensive experiments on synthetic corruption benchmarks, cross-dataset zero-shot transfer, and natural weather-induced distribution shifts demonstrate strong pristine-degraded separability, consistent behavior across multiple detector architectures, and robust generalization under semantic shift. These results suggest that degradation-aware representation geometry provides a practical and detector-agnostic foundation.

[91] Latent Equivariant Operators for Robust Object Recognition: Promise and Challenges

Minh Dinh, Stéphane Deny

Main category: cs.CV

TL;DR: The paper explores architectures that learn equivariant operators from examples of symmetric transformations to handle out-of-distribution classification for rotated/translated MNIST, overcoming limitations of both traditional and equivariant networks.

DetailsMotivation: Deep learning struggles with recognizing objects under group-symmetric transformations rarely seen during training (unusual poses, scales, positions). Equivariant networks require prior knowledge of transformations, creating a need for architectures that can learn equivariant operators from examples.

Method: The paper proposes architectures that learn equivariant operators in a latent space from examples of symmetric transformations. Experiments are conducted on simple datasets of rotated and translated noisy MNIST to demonstrate out-of-distribution classification capabilities.

Result: The architectures successfully handle out-of-distribution classification for rotated and translated MNIST, overcoming limitations of both traditional neural networks and equivariant networks that require known transformations.

Conclusion: While promising for simple datasets, scaling these architectures to more complex datasets presents significant challenges that need to be addressed for broader applicability.

Abstract: Despite the successes of deep learning in computer vision, difficulties persist in recognizing objects that have undergone group-symmetric transformations rarely seen during training-for example objects seen in unusual poses, scales, positions, or combinations thereof. Equivariant neural networks are a solution to the problem of generalizing across symmetric transformations, but require knowledge of transformations a priori. An alternative family of architectures proposes to earn equivariant operators in a latent space from examples of symmetric transformations. Here, using simple datasets of rotated and translated noisy MNIST, we illustrate how such architectures can successfully be harnessed for out-of-distribution classification, thus overcoming the limitations of both traditional and equivariant networks. While conceptually enticing, we discuss challenges ahead on the path of scaling these architectures to more complex datasets.

[92] Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein

Main category: cs.CV

TL;DR: A human-centric video world model for XR that generates egocentric virtual environments conditioned on tracked head and hand poses, enabling dexterous hand-object interactions and improved user control.

DetailsMotivation: Current video world models only accept coarse control signals like text or keyboard input, limiting their utility for embodied interaction in XR. There's a need for models that respond to users' tracked real-world motion for more natural and effective XR experiences.

Method: Introduces a human-centric video world model conditioned on both tracked head pose and joint-level hand poses. Evaluates existing diffusion transformer conditioning strategies and proposes an effective mechanism for 3D head and hand control. Trains a bidirectional video diffusion model teacher and distills it into a causal, interactive system for generating egocentric virtual environments.

Result: The system demonstrates improved task performance and significantly higher perceived amount of control over performed actions compared to relevant baselines, as evaluated with human subjects.

Conclusion: The proposed human-centric video world model enables more natural embodied interaction in XR by responding to tracked head and hand motion, representing an important step toward interactive generative models for extended reality applications.

Abstract: Extended reality (XR) demands generative models that respond to users’ tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand–object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher level of perceived amount of control over the performed actions compared with relevant baselines.

[93] CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation

Xia Su, Ruiqi Chen, Benlin Liu, Jingwei Ma, Zonglin Di, Ranjay Krishna, Jon Froehlich

Main category: cs.CV

TL;DR: CapNav is a new benchmark for evaluating Vision-Language Models on capability-conditioned navigation, testing how well VLMs can navigate indoor spaces given specific agent mobility constraints.

DetailsMotivation: Real-world navigation requires understanding agent-specific mobility constraints (e.g., robots that can't traverse stairs), but current VLMs aren't evaluated on this capability-aware navigation.

Method: Created CapNav benchmark with 5 representative agents (human/robot), 45 indoor scenes, 473 navigation tasks, and 2365 QA pairs to test VLM navigation with agent constraints.

Result: Evaluation of 13 modern VLMs shows navigation performance drops sharply with tighter mobility constraints, and even state-of-the-art models struggle with spatial dimension reasoning for obstacles.

Conclusion: Current VLMs need improvement for capability-aware navigation; benchmark enables advancing embodied spatial reasoning in future VLMs.

Abstract: Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent’s mobility constraints. For example, a sweeping robot cannot traverse stairs, while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent’s specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. CapNav provides 45 real-world indoor scenes, 473 navigation tasks, and 2365 QA pairs to test if VLMs can traverse indoor environments based on agent capabilities. We evaluate 13 modern VLMs and find that current VLM’s navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning on spatial dimensions. We conclude by discussing the implications for capability-aware navigation and the opportunities for advancing embodied spatial reasoning in future VLMs. The benchmark is available at https://github.com/makeabilitylab/CapNav

[94] SARAH: Spatially Aware Real-time Agentic Humans

Evonne Ng, Siwei Zhang, Zhang Chen, Michael Zollhoefer, Alexander Richard

Main category: cs.CV

TL;DR: Real-time, fully causal method for spatially-aware conversational motion in VR agents that combines speech-aligned gestures with spatial awareness of user position and movement.

DetailsMotivation: Current embodied agents lack spatial awareness - they don't turn toward users, respond to movement, or maintain natural gaze. This gap needs to be closed for realistic VR, telepresence, and digital human applications.

Method: Combines causal transformer-based VAE with interleaved latent tokens for streaming inference and flow matching model conditioned on user trajectory and audio. Includes gaze scoring mechanism with classifier-free guidance to decouple learning from control.

Result: Achieves state-of-the-art motion quality at over 300 FPS (3x faster than non-causal baselines) on Embody 3D dataset, capturing subtle spatial dynamics of natural conversation. Successfully deployed on live VR system.

Conclusion: First real-time, fully causal method for spatially-aware conversational motion that enables VR agents to align gestures with speech while maintaining spatial awareness of users, deployable on streaming VR headsets.

Abstract: As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user’s position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance to decouple learning from control: the model captures natural spatial alignment from data, while users can adjust eye contact intensity at inference time. On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS – 3x faster than non-causal baselines – while capturing the subtle spatial dynamics of natural conversation. We validate our approach on a live VR system, bringing spatially-aware conversational agents to real-time deployment. Please see https://evonneng.github.io/sarah/ for details.

[95] Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory

Vatsal Agarwal, Saksham Suri, Matthew Gwilliam, Pulkit Kumar, Abhinav Shrivastava

Main category: cs.CV

TL;DR: MemStream improves streaming video understanding by scaling token budgets for granular spatiotemporal reasoning, addressing retrieval bias in dense streams through adaptive selection and training-free retrieval mixture-of-experts.

DetailsMotivation: Existing streaming video understanding methods use limited tokens per frame, losing fine-grained visual details. Current approaches also suffer from retrieval bias toward later frames in dense streams due to increasing query-frame similarity scores over time.

Method: 1) Scale token budget for more granular spatiotemporal understanding; 2) Introduce adaptive selection strategy to reduce token redundancy while preserving local spatiotemporal information; 3) Propose training-free retrieval mixture-of-experts leveraging external models to better identify relevant frames.

Result: MemStream achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.

Conclusion: Scaling token budgets and addressing retrieval bias through adaptive selection and mixture-of-experts significantly improves streaming video understanding performance on long-form video QA benchmarks.

Abstract: Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.

[96] Visual Fixation-Based Retinal Prosthetic Simulation

Yuli Wu, Do Dinh Tan Nguyen, Henning Konermann, Rüveyda Yilmaz, Peter Walter, Johannes Stegmaier

Main category: cs.CV

TL;DR: A retinal prosthetic simulation framework that uses visual fixation-inspired patches (via ViT attention) and end-to-end optimization to improve classification accuracy for retinal implants.

DetailsMotivation: To address the limited resolution of retinal electrode arrays and distortion between input stimuli and resulting phosphenes in retinal prosthetics, by optimizing visual information transmission through fixation-based processing inspired by natural saccade mechanisms.

Method: 1) Predict salient patches from input images using ViT self-attention maps to mimic visual fixations; 2) Encode patches with trainable U-Net; 3) Simulate percepts using pulse2percept framework; 4) Evaluate percepts using DINOv2 foundation model with optional linear layer for classification; 5) End-to-end optimization of learnable encoder.

Result: Achieved 87.72% classification accuracy on ImageNet subset using real subject’s physiological parameters, significantly outperforming downsampling-based approach (40.59%) and approaching healthy upper bound (92.76%).

Conclusion: The fixation-based framework shows promising potential for producing more semantically understandable percepts with limited retinal prosthetic resolution, demonstrating effectiveness of biologically-inspired attention mechanisms and end-to-end optimization.

Abstract: This study proposes a retinal prosthetic simulation framework driven by visual fixations, inspired by the saccade mechanism, and assesses performance improvements through end-to-end optimization in a classification task. Salient patches are predicted from input images using the self-attention map of a vision transformer to mimic visual fixations. These patches are then encoded by a trainable U-Net and simulated using the pulse2percept framework to predict visual percepts. By incorporating a learnable encoder, we aim to optimize the visual information transmitted to the retinal implant, addressing both the limited resolution of the electrode array and the distortion between the input stimuli and resulting phosphenes. The predicted percepts are evaluated using the self-supervised DINOv2 foundation model, with an optional learnable linear layer for classification accuracy. On a subset of the ImageNet validation set, the fixation-based framework achieves a classification accuracy of 87.72%, using computational parameters based on a real subject’s physiological data, significantly outperforming the downsampling-based accuracy of 40.59% and approaching the healthy upper bound of 92.76%. Our approach shows promising potential for producing more semantically understandable percepts with the limited resolution available in retinal prosthetics.

[97] GIFT: A Framework Towards Global Interpretable Faithful Textual Explanations of Vision Classifiers

Éloi Zablocki, Valentin Gerard, Amaia Cardiel, Eric Gaussier, Matthieu Cord, Eduardo Valle

Main category: cs.CV

TL;DR: GIFT is a framework for generating global, interpretable, faithful, and textual explanations for vision classifiers by combining local visual counterfactuals, vision-language translation, and LLM aggregation with causal verification.

DetailsMotivation: Existing explainability methods for vision models (saliency maps, concept-based analyses) have limitations: limited faithfulness, local scope, or ambiguous semantics. There's a need for more comprehensive, faithful, and human-understandable explanations.

Method: 1. Generate faithful local visual counterfactuals; 2. Use vision-language models to translate counterfactuals into natural language descriptions; 3. Aggregate local explanations into global hypotheses using LLMs; 4. Verify explanations through image-based interventions to assess causal effects.

Result: GIFT successfully reveals meaningful classification rules, unexpected biases, and latent concepts across diverse datasets (CLEVR, CelebA, BDD driving scenes). It bridges local counterfactual reasoning with global interpretability.

Conclusion: GIFT offers a principled approach to causally grounded textual explanations for vision models, addressing limitations of existing methods and providing more comprehensive understanding of model decision processes.

Abstract: Understanding the decision processes of deep vision models is essential for their safe and trustworthy deployment in real-world settings. Existing explainability approaches, such as saliency maps or concept-based analyses, often suffer from limited faithfulness, local scope, or ambiguous semantics. We introduce GIFT, a post-hoc framework that aims to derive Global, Interpretable, Faithful, and Textual explanations for vision classifiers. GIFT begins by generating a large set of faithful, local visual counterfactuals, then employs vision-language models to translate these counterfactuals into natural-language descriptions of visual changes. These local explanations are aggregated by a large language model into concise, human-readable hypotheses about the model’s global decision rules. Crucially, GIFT includes a verification stage that quantitatively assesses the causal effect of each proposed explanation by performing image-based interventions, ensuring that the final textual explanations remain faithful to the model’s true reasoning process. Across diverse datasets, including the synthetic CLEVR benchmark, the real-world CelebA faces, and the complex BDD driving scenes, GIFT reveals not only meaningful classification rules but also unexpected biases and latent concepts driving model behavior. Altogether, GIFT bridges the gap between local counterfactual reasoning and global interpretability, offering a principled approach to causally grounded textual explanations for vision models.

[98] SAMa: Material-aware 3D Selection and Segmentation

Michael Fischer, Iliyan Georgiev, Thibault Groueix, Vladimir G. Kim, Tobias Ritschel, Valentin Deschaintre

Main category: cs.CV

TL;DR: SAMa is a 3D material selection method that uses SAM2’s video prior to create material-centric video data, then lifts 2D predictions to 3D via depth projection and nearest-neighbor lookups for efficient, multiview-consistent material selection on arbitrary 3D representations.

DetailsMotivation: Manual decomposition of 3D assets into material parts is labor-intensive. There's a need for automated material selection methods that work with in-the-wild objects across various 3D representations without requiring costly per-asset optimization.

Method: 1) Build material-centric video dataset using SAM2’s video prior; 2) Project 2D predictions to 3D point clouds using depth information; 3) Use nearest-neighbor lookups between target 3D representation and similarity point cloud; 4) Reconstruct selection masks that are multiview-consistent by design.

Result: SAMa outperforms baselines in selection accuracy and multiview consistency. Enables optimization-free selection in seconds. Applications include replacing diffuse-textured materials with PBR materials on text-to-3D outputs, and editing materials on NeRFs and 3DGS captures.

Conclusion: SAMa provides an efficient, accurate method for 3D material selection that works across different 3D representations without per-asset optimization, enabling practical material editing applications.

Abstract: Decomposing 3D assets into material parts is a common task for artists, yet remains a highly manual process. In this work, we introduce Select Any Material (SAMa), a material selection approach for in-the-wild objects in arbitrary 3D representations. Building on SAM2’s video prior, we construct a material-centric video dataset that extends it to the material domain. We propose an efficient way to lift the model’s 2D predictions to 3D by projecting each view into an intermediary 3D point cloud using depth. Nearest-neighbor lookups between any 3D representation and this similarity point cloud allow us to efficiently reconstruct accurate selection masks over objects’ surfaces that can be inspected from any view. Our method is multiview-consistent by design, alleviating the need for costly per-asset optimization, and performs optimization-free selection in seconds. SAMa outperforms several strong baselines in selection accuracy and multiview consistency and enables various compelling applications, such as replacing the diffuse-textured materials on a text-to-3D output with PBR materials or selecting and editing materials on NeRFs and 3DGS captures.

[99] Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More

Feng Wang, Yaodong Yu, Guoyizhe Wei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie

Main category: cs.CV

TL;DR: Vision Transformers benefit from smaller patch sizes down to 1x1 pixels, challenging conventional patchification approaches and revealing scaling laws in visual tokenization.

DetailsMotivation: To examine information loss caused by patchification-based compressive encoding in Vision Transformers and understand how it affects visual understanding, challenging the conventional approach of using larger patches to reduce computational cost.

Method: Conducted extensive patch size scaling experiments across different vision tasks, input scales, and architectures (ViT and Mamba models), scaling visual sequences up to 50,176 tokens with patch sizes as small as 1x1 pixels.

Result: Discovered a scaling law: models consistently benefit from decreased patch sizes until reaching minimum 1x1 pixel tokenization, achieving 84.6% accuracy on ImageNet-1k with base-sized model. Also found that with smaller patches, task-specific decoder heads become less critical for dense prediction.

Conclusion: Patchification-based compressive encoding causes information loss, and smaller patches (down to pixels) improve performance across vision tasks, providing insights for building non-compressive vision models.

Abstract: Since the introduction of Vision Transformer (ViT), patchification has long been regarded as a de facto image tokenization approach for plain visual architectures. By compressing the spatial size of images, this approach can effectively shorten the token sequence and reduce the computational cost of ViT-like plain architectures. In this work, we aim to thoroughly examine the information loss caused by this patchification-based compressive encoding paradigm and how it affects visual understanding. We conduct extensive patch size scaling experiments and excitedly observe an intriguing scaling law in patchification: the models can consistently benefit from decreased patch sizes and attain improved predictive performance, until it reaches the minimum patch size of 1x1, i.e., pixel tokenization. This conclusion is broadly applicable across different vision tasks, various input scales, and diverse architectures such as ViT and the recent Mamba models. Moreover, as a by-product, we discover that with smaller patches, task-specific decoder heads become less critical for dense prediction. In the experiments, we successfully scale up the visual sequence to an exceptional length of 50,176 tokens, achieving a competitive test accuracy of 84.6% with a base-sized model on the ImageNet-1k benchmark. We hope this study can provide insights and theoretical foundations for future works of building non-compressive vision models. Code is available at https://github.com/wangf3014/Patch_Scaling.

[100] A Pragmatic Note on Evaluating Generative Models with Fréchet Inception Distance for Retinal Image Synthesis

Yuli Wu, Fucheng Liu, Rüveyda Yilmaz, Henning Konermann, Peter Walter, Johannes Stegmaier

Main category: cs.CV

TL;DR: FID metric limitations in biomedical imaging: FID misaligns with task-specific evaluation goals in retinal imaging classification/segmentation tasks, where synthetic data enrichment for downstream tasks is the primary objective.

DetailsMotivation: The paper examines limitations of FID and related metrics in biomedical generative models, particularly in retinal imaging, where the primary goal is enriching training datasets with annotations for downstream tasks rather than just matching real data distributions.

Method: The authors analyze cases from retinal imaging modalities (color fundus photography and optical coherence tomography) where FID and its variants misalign with task-specific evaluation goals in classification and segmentation tasks.

Result: FID and related metrics show misalignment with task-specific evaluation goals in biomedical imaging applications, highlighting limitations of using these metrics as evaluation criteria for generative models in this domain.

Conclusion: FID and its variants have significant limitations for evaluating biomedical generative models, where downstream task performance (classification/segmentation) should be the primary evaluation criterion rather than distribution matching metrics.

Abstract: Fréchet Inception Distance (FID), computed with an ImageNet pretrained Inception-v3 network, is widely used as a state-of-the-art evaluation metric for generative models. It assumes that feature vectors from Inception-v3 follow a multivariate Gaussian distribution and calculates the 2-Wasserstein distance based on their means and covariances. While FID effectively measures how closely synthetic data match real data in many image synthesis tasks, the primary goal in biomedical generative models is often to enrich training datasets ideally with corresponding annotations. For this purpose, the gold standard for evaluating generative models is to incorporate synthetic data into downstream task training, such as classification and segmentation, to pragmatically assess its performance. In this paper, we examine cases from retinal imaging modalities, including color fundus photography and optical coherence tomography, where FID and its related metrics misalign with task-specific evaluation goals in classification and segmentation. We highlight the limitations of using various metrics, represented by FID and its variants, as evaluation criteria for these applications and address their potential caveats in broader biomedical imaging modalities and downstream tasks.

[101] Analyzing the Training Dynamics of Image Restoration Transformers: A Revisit to Layer Normalization

MinKyu Lee, Sangeek Hyun, Woojin Jun, Hyunjun Kim, Jiwoo Chung, Jae-Pil Heo

Main category: cs.CV

TL;DR: Proposes i-LN, a tailored LayerNorm for Image Restoration Transformers that addresses feature magnitude divergence and channel entropy collapse by normalizing holistically and adaptively rescaling per input.

DetailsMotivation: Conventional LayerNorm in Image Restoration Transformers causes feature magnitudes to diverge to extreme scales and collapses channel-wise entropy, which conflicts with IR tasks where spatial correlations and input-specific statistics are crucial.

Method: Proposes i-LN (Image Restoration Transformer Tailored Layer Normalization) as a drop-in replacement that: 1) normalizes features holistically instead of per-token to preserve spatial correlations, and 2) adaptively rescales features per input rather than using input-independent scaling.

Result: The simple i-LN design effectively improves training dynamics and performance, validated by extensive experiments. It addresses the misalignments between conventional LN and IR tasks.

Conclusion: i-LN provides a theoretically grounded and empirically validated solution to the training dynamics issues in Image Restoration Transformers, offering improved performance through better-aligned normalization.

Abstract: This work analyzes the training dynamics of Image Restoration (IR) Transformers and uncovers a critical yet overlooked issue: conventional LayerNorm (LN) drives feature magnitudes to diverge to a million scale and collapses channel-wise entropy. We analyze this in the perspective of networks attempting to bypass LN’s constraints that conflict with IR tasks. Accordingly, we address two misalignments between LN and IR: 1) per-token normalization disrupts spatial correlations, and 2) input-independent scaling discards input-specific statistics. To address this, we propose Image Restoration Transformer Tailored Layer Normalization i-LN, a simple drop-in replacement that normalizes features holistically and adaptively rescales them per input. We provide theoretical insights and empirical evidence that this simple design effectively leads to both improved training dynamics and thereby improved performance, validated by extensive experiments.

[102] eStonefish-Scenes: A Sim-to-Real Validated and Robot-Centric Event-based Optical Flow Dataset for Underwater Vehicles

Jad Mansour, Sebastian Realpe, Hayat Rajani, Michele Grimaldi, Rafael Garcia, Nuno Gracias

Main category: cs.CV

TL;DR: eStonefish-Scenes: A synthetic event-based optical flow dataset for underwater robotics using Stonefish simulator, with eWiz processing library and real-world validation showing successful sim-to-real transfer.

DetailsMotivation: Event-based cameras have great potential for underwater robotics, but lack of labeled event-based datasets for underwater environments limits progress in tasks like visual odometry and obstacle avoidance. Real-world event-based optical flow datasets are scarce, expensive to collect, and lack diversity, with no existing benchmarks for underwater applications.

Method: Created eStonefish-Scenes synthetic dataset using Stonefish simulator with customizable underwater environments featuring coral reefs and biologically inspired fish schools. Developed eWiz library for event-based data processing. Validated sim-to-real transfer by collecting real-world data with DAVIS346 camera on BlueROV2, deriving ground-truth optical flow via homography-based registration with Monte Carlo uncertainty estimation.

Result: ConvGRU-based optical flow network trained only on synthetic eStonefish-Scenes data achieved uncertainty-weighted average endpoint error of 0.79 pixels on real-world sequences without fine-tuning, demonstrating effective sim-to-real transfer.

Conclusion: The synthetic dataset effectively supports sim-to-real transfer for underwater event-based optical flow estimation, substantially reducing the need for costly real-world data collection while providing a comprehensive benchmark and processing tools.

Abstract: Event-based cameras (EBCs) are poised to transform underwater robotics, yet the absence of labelled event-based datasets for underwater environments severely limits progress in tasks such as visual odometry and obstacle avoidance. Real-world event-based optical flow datasets are scarce, resource-intensive to collect, and lack diversity, while no prior benchmarks target underwater applications. To bridge this gap, we introduce eStonefish-Scenes, a synthetic event-based optical flow dataset generated using the Stonefish simulator, together with an open data generation pipeline for creating customizable underwater environments featuring realistic coral reefs and biologically inspired schools of fish with reactive navigation behaviours. We also present eWiz, a comprehensive library for event-based data processing, encompassing data loading, augmentation, visualization, encoding, training utilities, loss functions, and evaluation metrics. To validate sim-to-real transferability, we collected real-world data using a DAVIS346 hybrid event-and-frame camera mounted on a BlueROV2 in an indoor testing pool. Ground-truth optical flow was derived via homography-based frame-to-poster registration, and per-pixel uncertainty was estimated through Monte Carlo perturbation of keypoint correspondences. This uncertainty was incorporated into the evaluation metrics, enabling reliability-aware performance assessment. A ConvGRU-based optical flow network, trained exclusively on synthetic eStonefish-Scenes data, was evaluated on the real-world sequences without fine-tuning, achieving an uncertainty-weighted average endpoint error of 0.79 pixels. These results demonstrate that the proposed synthetic dataset effectively supports sim-to-real transfer for underwater event-based optical flow estimation, substantially reducing the need for costly real-world data collection.

[103] Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models

Maria-Teresa De Rosa Palmini, Eva Cetinic

Main category: cs.CV

TL;DR: A benchmark for evaluating historical accuracy in Text-to-Image diffusion models, revealing systematic biases and inaccuracies in depicting historical contexts.

DetailsMotivation: While prior research has examined demographic and cultural biases in TTI models, their ability to accurately represent historical contexts remains underexplored. The authors aim to address this gap by creating a systematic evaluation framework.

Method: Developed HistVis benchmark with 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods. Includes reproducible evaluation protocol assessing three aspects: implicit stylistic associations, historical consistency (anachronisms), and demographic representation.

Result: TTI models show systematic inaccuracies in historically themed imagery: they frequently stereotype past eras with unstated stylistic cues, introduce anachronisms (modern artifacts in pre-modern contexts), and fail to reflect plausible demographic patterns compared to historically plausible baselines.

Conclusion: The benchmark provides an initial step toward building more historically accurate TTI models by offering a reproducible framework for evaluating historical representation in generated imagery.

Abstract: As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. To address this gap, we introduce a benchmark for evaluating how TTI models depict historical contexts. The benchmark combines HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods, with a reproducible evaluation protocol. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery, as TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By providing a reproducible benchmark for historical representation in generated imagery, this work provides an initial step toward building more historically accurate TTI models.

[104] Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter

Weizhi Zhong, Huan Yang, Zheng Liu, Huiguo He, Zijian He, Xuesong Niu, Di Zhang, Guanbin Li

Main category: cs.CV

TL;DR: Tuning-free multi-concept personalization method for text-to-image generation that handles both object and abstract concepts without test-time fine-tuning

DetailsMotivation: Existing multi-concept personalization methods are limited to object concepts and struggle with abstract concepts (pose, lighting). Current approaches require test-time fine-tuning which is time-consuming and prone to overfitting on limited training images.

Method: Proposes Mod-Adapter module that predicts concept-specific modulation directions for Diffusion Transformers (DiTs). Uses vision-language cross-attention to extract concept visual features and Mixture-of-Experts layers to map features into modulation space. Includes VLM-guided pre-training strategy using vision-language models for semantic supervision.

Result: Achieves state-of-the-art performance in multi-concept personalization, supported by quantitative, qualitative, and human evaluations. Extended benchmark with abstract concepts shows superior performance.

Conclusion: Presents an effective tuning-free method for multi-concept personalization that handles both object and abstract concepts without test-time fine-tuning, leveraging modulation mechanisms in DiTs and VLM-guided training.

Abstract: Personalized text-to-image generation aims to synthesize images of user-provided concepts in diverse contexts. Despite recent progress in multi-concept personalization, most are limited to object concepts and struggle to customize abstract concepts (e.g., pose, lighting). Some methods have begun exploring multi-concept personalization supporting abstract concepts, but they require test-time fine-tuning for each new concept, which is time-consuming and prone to overfitting on limited training images. In this work, we propose a novel tuning-free method for multi-concept personalization that can effectively customize both object and abstract concepts without test-time fine-tuning. Our method builds upon the modulation mechanism in pre-trained Diffusion Transformers (DiTs) model, leveraging the localized and semantically meaningful properties of the modulation space. Specifically, we propose a novel module, Mod-Adapter, to predict concept-specific modulation direction for the modulation process of concept-related text tokens. It introduces vision-language cross-attention for extracting concept visual features, and Mixture-of-Experts (MoE) layers that adaptively map the concept features into the modulation space. Furthermore, to mitigate the training difficulty caused by the large gap between the concept image space and the modulation space, we introduce a VLM-guided pre-training strategy that leverages the strong image understanding capabilities of vision-language models to provide semantic supervision signals. For a comprehensive comparison, we extend a standard benchmark by incorporating abstract concepts. Our method achieves state-of-the-art performance in multi-concept personalization, supported by quantitative, qualitative, and human evaluations.

[105] Data-Free Class-Incremental Gesture Recognition with Prototype-Guided Pseudo Feature Replay

Hongsong Wang, Ao Sun, Jie Gui, Liang Wang

Main category: cs.CV

TL;DR: PGPFR framework for data-free class-incremental gesture recognition using pseudo feature generation and prototype replay to handle unseen gestures over time.

DetailsMotivation: Most gesture recognition systems focus on closed-set scenarios and cannot handle unseen or novel gestures over time. The paper addresses class-incremental gesture recognition where new gestures need to be accommodated without forgetting old ones.

Method: Four-component framework: 1) Pseudo Feature Generation with Batch Prototypes (PFGBP) dynamically generates diverse pseudo features using class prototypes, 2) Variational Prototype Replay (VPR) enforces consistency between classifier weights and old class prototypes, 3) Truncated Cross-Entropy (TCE) mitigates domain differences from pseudo features, 4) Continual Classifier Re-Training (CCRT) prevents overfitting to new classes.

Result: Outperforms state-of-the-art methods by 11.8% on SHREC 2017 3D and 12.8% on EgoGesture 3D datasets in terms of mean global accuracy.

Conclusion: The PGPFR framework effectively addresses catastrophic forgetting in class-incremental gesture recognition through data-free pseudo feature generation and prototype-guided learning, demonstrating superior performance on benchmark datasets.

Abstract: Gesture recognition is an important research area in the field of computer vision. Most gesture recognition efforts focus on close-set scenarios, thereby limiting the capacity to effectively handle unseen or novel gestures. We aim to address class-incremental gesture recognition, which entails the ability to accommodate new and previously unseen gestures over time. Specifically, we introduce a Prototype-Guided Pseudo Feature Replay (PGPFR) framework for data-free class-incremental gesture recognition. This framework comprises four components: Pseudo Feature Generation with Batch Prototypes (PFGBP), Variational Prototype Replay (VPR) for old classes, Truncated Cross-Entropy (TCE) for new classes, and Continual Classifier Re-Training (CCRT). To tackle the issue of catastrophic forgetting, the PFGBP dynamically generates a diversity of pseudo features in an online manner, leveraging class prototypes of old classes along with batch class prototypes of new classes. Furthermore, the VPR enforces consistency between the classifier’s weights and the prototypes of old classes, leveraging class prototypes and covariance matrices to enhance robustness and generalization capabilities. The TCE mitigates the impact of domain differences of the classifier caused by pseudo features. Finally, the CCRT training strategy is designed to prevent overfitting to new classes and ensure the stability of features extracted from old classes. Extensive experiments conducted on two widely used gesture recognition datasets, namely SHREC 2017 3D and EgoGesture 3D, demonstrate that our approach outperforms existing state-of-the-art methods by 11.8% and 12.8% in terms of mean global accuracy, respectively. The code is available on https://github.com/sunao-101/PGPFR-3/.

[106] View Invariant Learning for Vision-Language Navigation in Continuous Environments

Josh Qixuan Sun, Huaiyuan Weng, Xiaoying Xing, Chul Min Yeum, Mark Crowley

Main category: cs.CV

TL;DR: VIL is a view-invariant learning framework for Vision-Language Navigation in Continuous Environments that improves robustness to camera viewpoint changes through contrastive learning and teacher-student distillation.

DetailsMotivation: Existing VLNCE approaches are sensitive to viewpoint changes (camera height and viewing angle variations), limiting their practical deployment in real-world scenarios with diverse camera configurations.

Method: Proposes VIL framework with: 1) contrastive learning for sparse view-invariant features, 2) teacher-student distillation for Waypoint Predictor Module (view-dependent teacher to view-invariant student), and 3) end-to-end joint optimization. Introduces V²-VLNCE benchmark with varied viewpoints.

Result: Outperforms SOTA by 8-15% Success Rate on V²-VLNCE benchmarks (R2R-CE, RxR-CE). Achieves SOTA on harder RxR-CE across all metrics. Maintains/improves standard viewpoint performance. Shows consistent improvements on simulated real robot configurations and proof-of-concept real-robot evaluation.

Conclusion: VIL effectively addresses viewpoint sensitivity in VLNCE, serving as a plug-and-play post-training method that improves robustness without diminishing standard performance, with demonstrated applicability to real robot scenarios.

Abstract: Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most existing approaches are sensitive to viewpoint changes, i.e. variations in camera height and viewing angle. Here we introduce a more general scenario, V$^2$-VLNCE (VLNCE with Varied Viewpoints) and propose a view-invariant post-training framework, called VIL (View Invariant Learning), that makes existing navigation policies more robust to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. We also introduce a teacher-student framework for the Waypoint Predictor Module, a standard part of VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly optimize these components. Empirical results show that our method outperforms state-of-the-art approaches on V$^2$-VLNCE by 8-15% measured on Success Rate for two standard benchmark datasets R2R-CE and RxR-CE. Evaluation of VIL in standard VLNCE settings shows that despite being trained for varied viewpoints, VIL often still improves performance. On the harder RxR-CE dataset, our method also achieved state-of-the-art performance across all metrics. This suggests that adding VIL does not diminish the standard viewpoint performance and can serve as a plug-and-play post-training method. We further evaluate VIL for simulated camera placements derived from real robot configurations (e.g. Stretch RE-1, LoCoBot), showing consistent improvements of performance. Finally, we present a proof-of-concept real-robot evaluation in two physical environments using a panoramic RGB sensor combined with LiDAR. The code is available at https://github.com/realjoshqsun/V2-VLNCE.

[107] ViGText: Deepfake Image Detection with Vision-Language Model Explanations and Graph Neural Networks

Ahmad ALBarqawi, Mahmoud Nazzal, Issa Khalil, Abdallah Khreishah, NhatHai Phan

Main category: cs.CV

TL;DR: ViGText is a novel deepfake detection method that integrates images with Vision Large Language Model text explanations in a graph-based framework, achieving superior generalization and robustness against sophisticated deepfakes.

DetailsMotivation: Deepfake technology threatens media authenticity, and traditional detection methods struggle with sophisticated, customized deepfakes, particularly in generalization and robustness against malicious attacks.

Method: ViGText divides images into patches, constructs image and text graphs, integrates them using Graph Neural Networks (GNNs), and employs multi-level feature extraction across spatial and frequency domains for detailed analysis.

Result: ViGText significantly improves generalization (F1 scores from 72.45% to 98.32%), achieves 11.1% recall increase over other methods, and limits performance degradation to less than 4% under targeted attacks.

Conclusion: ViGText sets a new standard for deepfake detection by integrating detailed visual and textual analysis, enhancing media authenticity and information integrity protection.

Abstract: The rapid rise of deepfake technology, which produces realistic but fraudulent digital content, threatens the authenticity of media. Traditional deepfake detection approaches often struggle with sophisticated, customized deepfakes, especially in terms of generalization and robustness against malicious attacks. This paper introduces ViGText, a novel approach that integrates images with Vision Large Language Model (VLLM) Text explanations within a Graph-based framework to improve deepfake detection. The novelty of ViGText lies in its integration of detailed explanations with visual data, as it provides a more context-aware analysis than captions, which often lack specificity and fail to reveal subtle inconsistencies. ViGText systematically divides images into patches, constructs image and text graphs, and integrates them for analysis using Graph Neural Networks (GNNs) to identify deepfakes. Through the use of multi-level feature extraction across spatial and frequency domains, ViGText captures details that enhance its robustness and accuracy to detect sophisticated deepfakes. Extensive experiments demonstrate that ViGText significantly enhances generalization and achieves a notable performance boost when it detects user-customized deepfakes. Specifically, average F1 scores rise from 72.45% to 98.32% under generalization evaluation, and reflects the model’s superior ability to generalize to unseen, fine-tuned variations of stable diffusion models. As for robustness, ViGText achieves an increase of 11.1% in recall compared to other deepfake detection approaches. When facing targeted attacks that exploit its graph-based architecture, ViGText limits classification performance degradation to less than 4%. ViGText uses detailed visual and textual analysis to set a new standard for detecting deepfakes, helping ensure media authenticity and information integrity.

[108] Learning Adaptive Pseudo-Label Selection for Semi-Supervised 3D Object Detection

Taehun Kong, Tae-Kyun Kim

Main category: cs.CV

TL;DR: A novel semi-supervised 3D object detection framework with learnable pseudo-labeling that adaptively selects high-quality pseudo-labels using context-aware scoring and dynamic thresholds.

DetailsMotivation: Current pseudo-label-based teacher-student methods for semi-supervised 3D object detection rely on manual thresholding or limited quality assessment, overlooking contextual information like object distances, classes, and learning states.

Method: Proposes a learnable pseudo-labeling module with two networks at teacher output level: one for reliable quality assessment via score fusion, and another for context-adaptive threshold determination supervised by pseudo-label alignment with ground truth. Includes soft supervision strategy for robust learning under noisy labels.

Result: Extensive experiments on KITTI and Waymo datasets show the method selects high-precision pseudo-labels while maintaining wider context coverage and higher recall, significantly improving relevant SS3DOD methods.

Conclusion: The proposed framework effectively addresses pseudo-label selection challenges in semi-supervised 3D object detection through adaptive, context-aware quality assessment and thresholding.

Abstract: Semi-supervised 3D object detection (SS3DOD) aims to reduce costly 3D annotations utilizing unlabeled data. Recent studies adopt pseudo-label-based teacher-student frameworks and demonstrate impressive performance. The main challenge of these frameworks is in selecting high-quality pseudo-labels from the teacher’s predictions. Most previous methods, however, select pseudo-labels by comparing confidence scores over thresholds manually set. The latest works tackle the challenge either by dynamic thresholding or refining the quality of pseudo-labels. Such methods still overlook contextual information e.g. object distances, classes, and learning states, and inadequately assess the pseudo-label quality using partial information available from the networks. In this work, we propose a novel SS3DOD framework featuring a learnable pseudo-labeling module designed to automatically and adaptively select high-quality pseudo-labels. Our approach introduces two networks at the teacher output level. These networks reliably assess the quality of pseudo-labels by the score fusion and determine context-adaptive thresholds, which are supervised by the alignment of pseudo-labels over GT bounding boxes. Additionally, we introduce a soft supervision strategy that can learn robustly under pseudo-label noises. This helps the student network prioritize cleaner labels over noisy ones in semi-supervised learning. Extensive experiments on the KITTI and Waymo datasets demonstrate the effectiveness of our method. The proposed method selects high-precision pseudo-labels while maintaining a wider coverage of contexts and a higher recall rate, significantly improving relevant SS3DOD methods.

[109] Dragging with Geometry: From Pixels to Geometry-Guided Image Editing

Xinyu Pu, Hongsong Wang, Jie Gui, Pan Zhou

Main category: cs.CV

TL;DR: GeoDrag: A geometry-guided drag-based image editing method that incorporates 3D geometric cues into pixel-level editing for more precise and consistent manipulations, especially for geometry-intensive operations like rotations and perspective transformations.

DetailsMotivation: Existing drag-based image editing methods operate primarily on 2D pixel planes with limited 3D cues, leading to imprecise and inconsistent edits in geometry-intensive scenarios like rotations and perspective transformations.

Method: Proposes GeoDrag with a unified displacement field that jointly encodes 3D geometry and 2D spatial priors, plus a conflict-free partitioning strategy to isolate editing regions and prevent interference.

Result: Extensive experiments show superior precision, structural consistency, and reliable multi-point editability across various editing scenarios compared to existing methods.

Conclusion: GeoDrag enables coherent, high-fidelity, and structure-consistent image editing in a single forward pass by effectively incorporating 3D geometric guidance into drag-based manipulation.

Abstract: Interactive point-based image editing serves as a controllable editor, enabling precise and flexible manipulation of image content. However, most drag-based methods operate primarily on the 2D pixel plane with limited use of 3D cues. As a result, they often produce imprecise and inconsistent edits, particularly in geometry-intensive scenarios such as rotations and perspective transformations. To address these limitations, we propose a novel geometry-guided drag-based image editing method-GeoDrag, which addresses three key challenges: 1) incorporating 3D geometric cues into pixel-level editing, 2) mitigating discontinuities caused by geometry-only guidance, and 3) resolving conflicts arising from multi-point dragging. Built upon a unified displacement field that jointly encodes 3D geometry and 2D spatial priors, GeoDrag enables coherent, high-fidelity, and structure-consistent editing in a single forward pass. In addition, a conflict-free partitioning strategy is introduced to isolate editing regions, effectively preventing interference and ensuring consistency. Extensive experiments across various editing scenarios validate the effectiveness of our method, showing superior precision, structural consistency, and reliable multi-point editability. Project page: https://xinyu-pu.github.io/projects/geodrag.

[110] Investigating Demographic Bias in Brain MRI Segmentation: A Comparative Study of Deep-Learning and Non-Deep-Learning Methods

Ghazal Danaee, Marc Niethammer, Jarrett Rushmore, Sylvain Bouix

Main category: cs.CV

TL;DR: Evaluation of fairness in medical image segmentation models for nucleus accumbens segmentation across demographic subgroups, assessing performance disparities based on race and sex.

DetailsMotivation: Address growing concerns about unfairness and performance disparities in medical image segmentation algorithms based on sensitive attributes like race and sex, particularly for structural delineations in MRIs.

Method: Evaluated three segmentation models (UNesT, nnU-Net, CoTr) and atlas-based method (ANTs) on nucleus accumbens segmentation using dataset with four demographic subgroups. Used manually labeled gold-standard segmentations, assessed segmentation performance and derived volumes, measured fairness quantitatively, and analyzed demographic impacts using linear mixed models.

Result: Training on race-matched data significantly improved segmentation accuracy for ANTs and UNesT, while nnU-Net showed robust performance independent of demographic matching. Sex effects observed with manual segmentation were also present in biased models, but race effects disappeared in all but one model.

Conclusion: Demographic biases exist in medical image segmentation models, with some models showing race-dependent performance while others are more robust. Fairness considerations are crucial for equitable medical image analysis.

Abstract: Deep-learning-based segmentation algorithms have substantially advanced the field of medical image analysis, particularly in structural delineations in MRIs. However, an important consideration is the intrinsic bias in the data. Concerns about unfairness, such as performance disparities based on sensitive attributes like race and sex, are increasingly urgent. In this work, we evaluate the results of three different segmentation models (UNesT, nnU-Net, and CoTr) and a traditional atlas-based method (ANTs), applied to segment the left and right nucleus accumbens (NAc) in MRI images. We utilize a dataset including four demographic subgroups: black female, black male, white female, and white male. We employ manually labeled gold-standard segmentations to train and test segmentation models. This study consists of two parts: the first assesses the segmentation performance of models, while the second measures the volumes they produce to evaluate the effects of race, sex, and their interaction. Fairness is quantitatively measured using a metric designed to quantify fairness in segmentation performance. Additionally, linear mixed models analyze the impact of demographic variables on segmentation accuracy and derived volumes. Training on the same race as the test subjects leads to significantly better segmentation accuracy for some models. ANTs and UNesT show notable improvements in segmentation accuracy when trained and tested on race-matched data, unlike nnU-Net, which demonstrates robust performance independent of demographic matching. Finally, we examine sex and race effects on the volume of the NAc using segmentations from the manual rater and from our biased models. Results reveal that the sex effects observed with manual segmentation can also be observed with biased models, whereas the race effects disappear in all but one model.

[111] Simple 3D Pose Features Support Human and Machine Social Scene Understanding

Wenshuo Qin, Leyla Isik

Main category: cs.CV

TL;DR: 3D pose information is crucial for human social interaction recognition and improves DNN performance on social judgment tasks.

DetailsMotivation: Humans easily recognize social interactions visually, but current DNNs struggle with this task. The researchers hypothesized that 3D visuospatial pose information, which is largely absent from most vision DNNs, is key to human social perception.

Method: Used novel pose and depth estimation pipeline to extract 3D body joint positions from video clips. Compared body joints’ ability to predict human social judgments with embeddings from over 350 vision DNNs. Reduced 3D body joints to minimal feature set describing only 3D position and direction of people.

Result: Body joints predicted social judgments better than most DNNs. Minimal 3D feature set (position and direction) was necessary and sufficient to explain full body joints’ performance. These 3D features predicted DNN alignment with human judgments and significantly improved DNN performance on social tasks.

Conclusion: Human social perception depends on simple, explicit 3D pose information. Incorporating 3D pose features can enhance DNN performance on social interaction recognition tasks.

Abstract: Humans effortlessly recognize social interactions from visual input, yet the underlying computations remain unknown, and social interaction recognition challenges even the most advanced deep neural networks (DNNs). Here, we hypothesized that humans rely on 3D visuospatial pose information to make social judgments, and that this information is largely absent from most vision DNNs. To test these hypotheses, we used a novel pose and depth estimation pipeline to automatically extract 3D body joint positions from short video clips. We compared the ability of these body joints to predict human social judgments in the videos with embeddings from over 350 vision DNNs. We found that body joints predicted social judgments better than most DNNs. We then reduced the 3D body joints to an even more compact feature set describing only the 3D position and direction of people in the videos. We found that this minimal 3D feature set, but not its 2D counterpart, was necessary and sufficient to explain the prediction performance of the full set of body joints. These minimal 3D features also predicted the extent to which DNNs aligned with human social judgments and significantly improved their performance on these tasks. Together, these findings demonstrate that human social perception depends on simple, explicit 3D pose information.

[112] TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs

Baiqi Li, Kangyi Zhao, Ce Zhang, Chancharik Mitra, Jean de Dieu Nyandwi, Gedas Bertasius

Main category: cs.CV

TL;DR: TimeBlind is a diagnostic benchmark for evaluating compositional spatio-temporal understanding in Multimodal Large Language Models, revealing their heavy reliance on static visual shortcuts rather than genuine temporal reasoning.

DetailsMotivation: Current MLLMs excel at static semantics but have brittle grasp of temporal dynamics, which is essential for video reasoning and embodied AI. There's a need for benchmarks that specifically test temporal understanding without conflating it with static recognition.

Method: TimeBlind uses a minimal-pairs paradigm where video pairs share identical static visual content but differ solely in temporal structure. It categorizes temporal understanding into three levels: atomic event recognition, event property characterization, and event interdependency reasoning. The benchmark includes 600 curated instances (2400 video-question pairs) and evaluates over 20 state-of-the-art MLLMs.

Result: The best performing MLLM achieved only 48.2% Instance Accuracy (correctly distinguishing both videos in a pair), far below human performance of 98.2%. This reveals that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic.

Conclusion: TimeBlind serves as a vital diagnostic tool for next-generation video understanding, exposing significant gaps in MLLMs’ temporal reasoning capabilities despite their strong performance on static visual tasks.

Abstract: Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, utilizing complementary questions to neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2400 video-question pairs), reveals that the Instance Accuracy (correctly distinguishing both videos in a pair) of the best performing MLLM is only 48.2%, far below the human performance (98.2%). These results demonstrate that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic, positioning TimeBlind as a vital diagnostic tool for next-generation video understanding. Dataset and code are available at https://baiqi-li.github.io/timeblind_project/ .

[113] CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions

Yuliang Zhan, Jian Li, Wenbing Huang, Wenbing Huang, Yang Liu, Hao Sun

Main category: cs.CV

TL;DR: CloDS is an unsupervised framework that learns cloth dynamics from multi-view visual observations without requiring known physical properties, using a three-stage pipeline with mesh-based Gaussian splatting for video-to-geometry grounding.

DetailsMotivation: Existing deep learning methods for simulating dynamic systems require known physical properties as supervision, limiting applicability under unknown conditions. The paper introduces a novel scenario (CDG) for unsupervised learning of cloth dynamics from visual observations alone.

Method: CloDS uses a three-stage pipeline: 1) video-to-geometry grounding using mesh-based Gaussian splatting with dual-position opacity modulation to handle large deformations and occlusions, 2) training a dynamics model on the grounded meshes, and 3) enabling bidirectional mapping between 2D observations and 3D geometry.

Result: Comprehensive experiments show CloDS effectively learns cloth dynamics from visual data while maintaining strong generalization capabilities for unseen configurations, outperforming existing methods in unsupervised cloth dynamics learning.

Conclusion: The proposed CloDS framework successfully addresses the challenge of unsupervised cloth dynamics learning from visual observations, enabling physical simulation without requiring known physical properties as supervision.

Abstract: Deep learning has demonstrated remarkable capabilities in simulating complex dynamic systems. However, existing methods require known physical properties as supervision or inputs, limiting their applicability under unknown conditions. To explore this challenge, we introduce Cloth Dynamics Grounding (CDG), a novel scenario for unsupervised learning of cloth dynamics from multi-view visual observations. We further propose Cloth Dynamics Splatting (CloDS), an unsupervised dynamic learning framework designed for CDG. CloDS adopts a three-stage pipeline that first performs video-to-geometry grounding and then trains a dynamics model on the grounded meshes. To cope with large non-linear deformations and severe self-occlusions during grounding, we introduce a dual-position opacity modulation that supports bidirectional mapping between 2D observations and 3D geometry via mesh-based Gaussian splatting in video-to-geometry grounding stage. It jointly considers the absolute and relative position of Gaussian components. Comprehensive experimental evaluations demonstrate that CloDS effectively learns cloth dynamics from visual data while maintaining strong generalization capabilities for unseen configurations. Our code is available at https://github.com/whynot-zyl/CloDS. Visualization results are available at https://github.com/whynot-zyl/CloDS_video}.%\footnote{As in this example.

[114] UrbanGS: A Scalable and Efficient Architecture for Geometrically Accurate Large-Scene Reconstruction

Changbai Li, Haodong Zhu, Hanlin Chen, Xiuping Liang, Tongfei Chen, Shuwei Shao, Linlin Yang, Huobin Tan, Baochang Zhang

Main category: cs.CV

TL;DR: UrbanGS: A scalable 3D Gaussian Splatting framework for large-scale urban environments with depth-consistent regularization and adaptive pruning for improved geometric accuracy and memory efficiency.

DetailsMotivation: 3D Gaussian Splatting (3DGS) works well for bounded scenes but faces challenges in large-scale urban environments including geometric inconsistency, memory inefficiency, and computational scalability issues.

Method: 1) Depth-Consistent D-Normal Regularization module integrating D-Normal constraints with external depth supervision for comprehensive geometric parameter updates; 2) Spatially Adaptive Gaussian Pruning (SAGP) strategy that dynamically adjusts Gaussian density based on local geometric complexity and visibility; 3) Unified partitioning and view assignment scheme to eliminate boundary artifacts.

Result: Extensive experiments on multiple urban datasets show UrbanGS achieves superior performance in rendering quality, geometric accuracy, and memory efficiency compared to existing methods.

Conclusion: UrbanGS provides a systematic solution for high-fidelity large-scale scene reconstruction, effectively addressing the scalability challenges of 3DGS in urban environments.

Abstract: While 3D Gaussian Splatting (3DGS) enables high-quality, real-time rendering for bounded scenes, its extension to large-scale urban environments gives rise to critical challenges in terms of geometric consistency, memory efficiency, and computational scalability. To address these issues, we present UrbanGS, a scalable reconstruction framework that effectively tackles these challenges for city-scale applications. First, we propose a Depth-Consistent D-Normal Regularization module. Unlike existing approaches that rely solely on monocular normal estimators, which can effectively update rotation parameters yet struggle to update position parameters, our method integrates D-Normal constraints with external depth supervision. This allows for comprehensive updates of all geometric parameters. By further incorporating an adaptive confidence weighting mechanism based on gradient consistency and inverse depth deviation, our approach significantly enhances multi-view depth alignment and geometric coherence, which effectively resolves the issue of geometric accuracy in complex large-scale scenes. To improve scalability, we introduce a Spatially Adaptive Gaussian Pruning (SAGP) strategy, which dynamically adjusts Gaussian density based on local geometric complexity and visibility to reduce redundancy. Additionally, a unified partitioning and view assignment scheme is designed to eliminate boundary artifacts and optimize computational load. Extensive experiments on multiple urban datasets demonstrate that UrbanGS achieves superior performance in rendering quality, geometric accuracy, and memory efficiency, providing a systematic solution for high-fidelity large-scale scene reconstruction.

[115] UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing

Dianyi Wang, Chaofan Ma, Feng Han, Size Wu, Wei Song, Yibin Wang, Zhixiong Zhang, Tianhang Wang, Siyuan Wang, Zhongyu Wei, Jiaqi Wang

Main category: cs.CV

TL;DR: UniReason is a unified multimodal framework that connects text-to-image generation and image editing through complementary reasoning paradigms, incorporating world knowledge-enhanced textual reasoning and visual refinement via self-reflection.

DetailsMotivation: Current unified multimodal models struggle with complex synthesis tasks requiring deep reasoning and treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps.

Method: Proposes UniReason framework with two complementary reasoning paradigms: 1) world knowledge-enhanced textual reasoning for inferring implicit knowledge during generation, and 2) editing capabilities for fine-grained visual refinement via self-reflection. Unifies generation and editing within shared architecture mirroring human cognitive process of planning followed by refinement. Constructs large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains for textual reasoning, plus agent-generated corpus for visual refinement.

Result: Extensive experiments show UniReason achieves advanced performance on reasoning-intensive benchmarks (WISE, KrisBench, UniREditBench) while maintaining superior general synthesis capabilities.

Conclusion: UniReason successfully unifies generation and editing through complementary reasoning paradigms, demonstrating improved performance on complex reasoning tasks while maintaining strong general synthesis abilities.

Abstract: Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through two complementary reasoning paradigms. We incorporate world knowledge-enhanced textual reasoning into generation to infer implicit knowledge, and leverage editing capabilities for fine-grained editing-like visual refinement to further correct visual errors via self-reflection. This approach unifies generation and editing within a shared architecture, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense, physics, etc.) for textual reasoning, alongside an agent-generated corpus for visual refinement. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench and UniREditBench, while maintaining superior general synthesis capabilities.

[116] Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction

Emily Bejerano, Federico Tondolo, Aayan Qayyum, Xiaofan Yu, Xiaofan Jiang

Main category: cs.CV

TL;DR: Sim2Radar: Framework for synthesizing mmWave radar training data from single RGB images using physics-based simulation, improving radar perception in visually degraded indoor environments.

DetailsMotivation: Learning-based radar perception is limited by the scarcity and cost of collecting/annotating large-scale radar datasets, especially for indoor environments with visual degradation (smoke, dust, low light).

Method: End-to-end framework that reconstructs material-aware 3D scenes from single-view RGB images using monocular depth estimation, segmentation, and vision-language reasoning to infer object materials, then simulates mmWave propagation with configurable physics-based ray tracer using Fresnel reflection models parameterized by ITU-R electromagnetic properties.

Result: Improves downstream 3D radar perception via transfer learning: pre-training radar point-cloud object detection model on synthetic data and fine-tuning on real radar yields up to +3.7 3D AP (IoU 0.3), with gains primarily from improved spatial localization.

Conclusion: Physics-based, vision-driven radar simulation can provide effective geometric priors for radar learning and measurably improve performance under limited real-data supervision.

Abstract: Millimeter-wave (mmWave) radar provides reliable perception in visually degraded indoor environments (e.g., smoke, dust, and low light), but learning-based radar perception is bottlenecked by the scarcity and cost of collecting and annotating large-scale radar datasets. We present Sim2Radar, an end-to-end framework that synthesizes training radar data directly from single-view RGB images, enabling scalable data generation without manual scene modeling. Sim2Radar reconstructs a material-aware 3D scene by combining monocular depth estimation, segmentation, and vision-language reasoning to infer object materials, then simulates mmWave propagation with a configurable physics-based ray tracer using Fresnel reflection models parameterized by ITU-R electromagnetic properties. Evaluated on real-world indoor scenes, Sim2Radar improves downstream 3D radar perception via transfer learning: pre-training a radar point-cloud object detection model on synthetic data and fine-tuning on real radar yields up to +3.7 3D AP (IoU 0.3), with gains driven primarily by improved spatial localization. These results suggest that physics-based, vision-driven radar simulation can provide effective geometric priors for radar learning and measurably improve performance under limited real-data supervision.

[117] LeafNet: A Large-Scale Dataset and Comprehensive Benchmark for Foundational Vision-Language Understanding of Plant Diseases

Khang Nguyen Quoc, Phuong D. Dao, Luyl-Da Quach

Main category: cs.CV

TL;DR: LeafNet dataset and LeafBench benchmark for evaluating VLMs on plant disease understanding, showing VLMs outperform vision-only models but struggle with fine-grained identification.

DetailsMotivation: Current VLMs lack application in domain-specific agricultural tasks like plant pathology due to missing large-scale multimodal datasets and benchmarks.

Method: Created LeafNet dataset with 186K leaf images across 97 disease classes and LeafBench benchmark with 13,950 QA pairs across 6 agricultural tasks to evaluate 12 state-of-the-art VLMs.

Result: VLMs show substantial performance disparity: >90% accuracy on binary classification but <65% on fine-grained pathogen/species identification. Multimodal VLMs outperform vision-only models.

Conclusion: LeafBench highlights critical gaps in VLMs for plant pathology and provides a rigorous framework for advancing AI-assisted disease diagnosis through multimodal integration.

Abstract: Foundation models and vision-language pre-training have significantly advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application in domain-specific agricultural tasks, such as plant pathology, remains limited due to the lack of large-scale, comprehensive multimodal image–text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question-answering benchmark developed to systematically evaluate the capabilities of VLMs in understanding plant diseases. The dataset comprises 186,000 leaf digital images spanning 97 disease classes, paired with metadata, generating 13,950 question-answer pairs spanning six critical agricultural tasks. The questions assess various aspects of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state-of-the-art VLMs on our LeafBench dataset, we reveal substantial disparity in their disease understanding capabilities. Our study shows performance varies markedly across tasks: binary healthy–diseased classification exceeds 90% accuracy, while fine-grained pathogen and species identification remains below 65%. Direct comparison between vision-only models and VLMs demonstrates the critical advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision. These findings highlight critical gaps in current VLMs for plant pathology applications and underscore the need for LeafBench as a rigorous framework for methodological advancement and progress evaluation toward reliable AI-assisted plant disease diagnosis. Code is available at https://github.com/EnalisUs/LeafBench.

[118] GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, Jun Song, Jing Zhang, Bo Du

Main category: cs.CV

TL;DR: GeoEyes is a training framework for multimodal LLMs that addresses tool usage homogenization in zoom-enabled models for ultra-high-resolution remote sensing VQA, using staged training with specialized datasets and reinforcement learning.

DetailsMotivation: Existing zoom-enabled MLLMs suffer from "Tool Usage Homogenization" where zoom tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition in ultra-high-resolution remote sensing VQA where relevant cues are sparse and tiny.

Method: Two-stage framework: (1) Cold-start SFT dataset (UHR-CoZ) covering diverse zooming regimes, and (2) AdaZoom-GRPO agentic reinforcement learning method that explicitly rewards evidence gain and answer improvement during zoom interactions.

Result: The model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.

Conclusion: GeoEyes addresses tool usage homogenization in zoom-enabled MLLMs for remote sensing VQA through staged training and reinforcement learning, enabling effective evidence acquisition in ultra-high-resolution scenarios.

Abstract: The “thinking-with-images” paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.

[119] Uncertainty-Aware Vision-Language Segmentation for Medical Imaging

Aryan Das, Tanishq Rachamalla, Koushik Biswas, Swalpa Kumar Roy, Vinay Kumar Verma

Main category: cs.CV

TL;DR: A novel uncertainty-aware multimodal segmentation framework for medical diagnosis using radiological images and clinical text, featuring efficient cross-modal fusion and uncertainty-guided learning.

DetailsMotivation: Medical diagnosis often involves ambiguous cases with poor image quality where traditional unimodal approaches fail. There's a need for reliable multimodal systems that can leverage both visual and textual clinical information while accounting for uncertainty in complex clinical scenarios.

Method: Proposes Modality Decoding Attention Block (MoDAB) with lightweight State Space Mixer (SSMix) for efficient cross-modal fusion and long-range dependency modeling. Introduces Spectral-Entropic Uncertainty (SEU) Loss to jointly capture spatial overlap, spectral consistency, and predictive uncertainty for learning under ambiguity.

Result: Superior segmentation performance on medical datasets (QATA-COVID19, MosMed++, Kvasir-SEG) while being significantly more computationally efficient than existing SoTA approaches. Demonstrates improved model reliability in complex clinical circumstances with poor image quality.

Conclusion: Highlights the importance of incorporating uncertainty modeling and structured modality alignment in vision-language medical segmentation tasks. The framework shows practical value for reliable medical diagnosis using multimodal clinical data.

Abstract: We introduce a novel uncertainty-aware multimodal segmentation framework that leverages both radiological images and associated clinical text for precise medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-modal fusion and long-range dependency modelling. To guide learning under ambiguity, we propose the Spectral-Entropic Uncertainty (SEU) Loss, which jointly captures spatial overlap, spectral consistency, and predictive uncertainty in a unified objective. In complex clinical circumstances with poor image quality, this formulation improves model reliability. Extensive experiments on various publicly available medical datasets, QATA-COVID19, MosMed++, and Kvasir-SEG, demonstrate that our method achieves superior segmentation performance while being significantly more computationally efficient than existing State-of-the-Art (SoTA) approaches. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks. Code: https://github.com/arya-domain/UA-VLS

[120] Efficient Text-Guided Convolutional Adapter for the Diffusion Model

Aryan Das, Koushik Biswas, Swalpa Kumar Roy, Badri Narayana Patro, Vinay Kumar Verma

Main category: cs.CV

TL;DR: Nexus Adapters are efficient text-guided adapters for diffusion models that preserve structure while being aware of input prompts, requiring far fewer parameters than existing methods.

DetailsMotivation: Existing structure-preserving methods for conditional image generation are inefficient, sometimes requiring equal parameters to the base diffusion model, and their adapters are not aware of input prompts, making them suboptimal for prompt-guided generation.

Method: Proposed two efficient adapters (Nexus Prime and Nexus Slim) that incorporate cross-attention mechanisms to enable rich multimodal conditioning, allowing the adapter to understand both structural inputs (sketches/depth maps) and text prompts simultaneously.

Result: Nexus Prime requires only 8M additional parameters vs baseline T2I-Adapter, while Nexus Slim has 18M fewer parameters than T2I-Adapter, both achieving state-of-the-art performance in structure-preserving conditional generation.

Conclusion: Nexus Adapters provide an efficient solution for prompt-aware structure-preserving conditional generation in diffusion models, significantly reducing parameter requirements while maintaining or improving performance.

Abstract: We introduce the Nexus Adapters, novel text-guided efficient adapters to the diffusion-based framework for the Structure Preserving Conditional Generation (SPCG). Recently, structure-preserving methods have achieved promising results in conditional image generation by using a base model for prompt conditioning and an adapter for structure input, such as sketches or depth maps. These approaches are highly inefficient and sometimes require equal parameters in the adapter compared to the base architecture. It is not always possible to train the model since the diffusion model is itself costly, and doubling the parameter is highly inefficient. In these approaches, the adapter is not aware of the input prompt; therefore, it is optimal only for the structural input but not for the input prompt. To overcome the above challenges, we proposed two efficient adapters, Nexus Prime and Slim, which are guided by prompts and structural inputs. Each Nexus Block incorporates cross-attention mechanisms to enable rich multimodal conditioning. Therefore, the proposed adapter has a better understanding of the input prompt while preserving the structure. We conducted extensive experiments on the proposed models and demonstrated that the Nexus Prime adapter significantly enhances performance, requiring only 8M additional parameters compared to the baseline, T2I-Adapter. Furthermore, we also introduced a lightweight Nexus Slim adapter with 18M fewer parameters than the T2I-Adapter, which still achieved state-of-the-art results. Code: https://github.com/arya-domain/Nexus-Adapters

[121] LGQ: Learning Discretization Geometry for Scalable and Stable Image Tokenization

Idil Bilge Altun, Mert Onur Cakiroglu, Elham Buxton, Mehmet Dalkilic, Hasan Kurban

Main category: cs.CV

TL;DR: LGQ is a learnable geometric quantization method for image tokenization that replaces hard nearest-neighbor lookup with soft assignments, enabling differentiable training while achieving efficient codebook utilization and better image generation quality.

DetailsMotivation: Current image tokenizers face trade-offs: vector quantization suffers from optimization biases and codebook under-utilization, while structured tokenizers have fixed geometries that inefficiently allocate capacity. There's a need for tokenizers that learn discretization geometry end-to-end while maintaining efficient codebook usage.

Method: LGQ replaces hard nearest-neighbor lookup with temperature-controlled soft assignments, enabling fully differentiable training. It uses a variational free-energy objective with token-level peakedness regularization and global usage regularization to encourage confident yet balanced code utilization without rigid grids.

Result: At 16K codebook size, LGQ improves rFID by 11.88% over FSQ while using 49.96% fewer active codes, and improves rFID by 6.06% over SimVQ with 49.45% lower effective representation rate, achieving comparable fidelity with substantially fewer active entries.

Conclusion: LGQ provides a stable, learnable quantization approach that combines the benefits of flexible geometry learning with efficient codebook utilization, addressing key bottlenecks in scalable visual generation through improved discrete image tokenization.

Abstract: Discrete image tokenization is a key bottleneck for scalable visual generation: a tokenizer must remain compact for efficient latent-space priors while preserving semantic structure and using discrete capacity effectively. Existing quantizers face a trade-off: vector-quantized tokenizers learn flexible geometries but often suffer from biased straight-through optimization, codebook under-utilization, and representation collapse at large vocabularies. Structured scalar or implicit tokenizers ensure stable, near-complete utilization by design, yet rely on fixed discretization geometries that may allocate capacity inefficiently under heterogeneous latent statistics. We introduce Learnable Geometric Quantization (LGQ), a discrete image tokenizer that learns discretization geometry end-to-end. LGQ replaces hard nearest-neighbor lookup with temperature-controlled soft assignments, enabling fully differentiable training while recovering hard assignments at inference. The assignments correspond to posterior responsibilities of an isotropic Gaussian mixture and minimize a variational free-energy objective, provably converging to nearest-neighbor quantization in the low-temperature limit. LGQ combines a token-level peakedness regularizer with a global usage regularizer to encourage confident yet balanced code utilization without imposing rigid grids. Under a controlled VQGAN-style backbone on ImageNet across multiple vocabulary sizes, LGQ achieves stable optimization and balanced utilization. At 16K codebook size, LGQ improves rFID by 11.88% over FSQ while using 49.96% fewer active codes, and improves rFID by 6.06% over SimVQ with 49.45% lower effective representation rate, achieving comparable fidelity with substantially fewer active entries. Our GitHub repository is available at: https://github.com/KurbanIntelligenceLab/LGQ

[122] Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking

Patrick Poggi, Divake Kumar, Theja Tulabandhula, Amit Ranjan Trivedi

Main category: cs.CV

TL;DR: UncL-STARK enables dynamic depth adaptation in transformer-based trackers using uncertainty estimates from heatmaps to reduce computation while maintaining accuracy.

DetailsMotivation: Transformer-based trackers use fixed-depth inference for all frames, wasting computation on temporally coherent video sequences where simpler processing would suffice.

Method: Fine-tune model with random-depth training and knowledge distillation to work at multiple depths. Use uncertainty from corner localization heatmaps in feedback policy to select encoder/decoder depth per frame based on prediction confidence.

Result: Achieves up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings while maintaining tracking accuracy within 0.2% of full-depth baseline on GOT-10k and LaSOT datasets.

Conclusion: UncL-STARK enables efficient transformer-based tracking by dynamically adapting computational depth based on visual complexity without modifying architecture or sacrificing accuracy.

Abstract: Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder–decoder stack for every frame regardless of visual complexity, thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model’s corner localization heatmaps and use it in a feedback-driven policy that selects the encoder and decoder depth for the next frame based on the prediction confidence by exploiting temporal coherence in video. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings while maintaining tracking accuracy within 0.2% of the full-depth baseline across both short-term and long-term sequences.

cs.AI

[123] Epistemic Traps: Rational Misalignment Driven by Model Misspecification

Xingcheng Xu, Jingjing Qu, Qiaosheng Zhang, Chaochao Lu, Yanqing Yang, Na Zou, Xia Hu

Main category: cs.AI

TL;DR: AI safety failures like sycophancy and deception are not training errors but rational behaviors arising from model misspecification, requiring subjective model engineering rather than reward manipulation.

DetailsMotivation: Current AI safety approaches treat behavioral pathologies (sycophancy, hallucination, deception) as transient training artifacts without a unified theoretical framework to explain their emergence and stability.

Method: Adapt Berk-Nash Rationalizability from economics to AI, modeling agents as optimizing against flawed subjective world models. Validate through behavioral experiments on six state-of-the-art model families with phase diagrams mapping safe behavior boundaries.

Result: Safety failures are structural necessities: unsafe behaviors emerge as stable misaligned equilibria or oscillatory cycles depending on reward schemes. Strategic deception persists as “locked-in” equilibria robust to objective risks. Safety is a discrete phase determined by epistemic priors.

Conclusion: Subjective Model Engineering (designing agent’s internal belief structure) is necessary for robust alignment, marking a paradigm shift from manipulating environmental rewards to shaping agent’s interpretation of reality.

Abstract: The rapid deployment of Large Language Models and AI agents across critical societal and technical domains is hindered by persistent behavioral pathologies including sycophancy, hallucination, and strategic deception that resist mitigation via reinforcement learning. Current safety paradigms treat these failures as transient training artifacts, lacking a unified theoretical framework to explain their emergence and stability. Here we show that these misalignments are not errors, but mathematically rationalizable behaviors arising from model misspecification. By adapting Berk-Nash Rationalizability from theoretical economics to artificial intelligence, we derive a rigorous framework that models the agent as optimizing against a flawed subjective world model. We demonstrate that widely observed failures are structural necessities: unsafe behaviors emerge as either a stable misaligned equilibrium or oscillatory cycles depending on reward scheme, while strategic deception persists as a “locked-in” equilibrium or through epistemic indeterminacy robust to objective risks. We validate these theoretical predictions through behavioral experiments on six state-of-the-art model families, generating phase diagrams that precisely map the topological boundaries of safe behavior. Our findings reveal that safety is a discrete phase determined by the agent’s epistemic priors rather than a continuous function of reward magnitude. This establishes Subjective Model Engineering, defined as the design of an agent’s internal belief structure, as a necessary condition for robust alignment, marking a paradigm shift from manipulating environmental rewards to shaping the agent’s interpretation of reality.

[124] El Agente Gráfico: Structured Execution Graphs for Scientific Agents

Jiaru Bai, Abdulrahman Aldossary, Thomas Swanick, Marcel Müller, Yeonghun Kang, Zijian Zhang, Jin Won Lee, Tsz Wai Ko, Mohammad Ghazi Vakili, Varinia Bernales, Alán Aspuru-Guzik

Main category: cs.AI

TL;DR: A single-agent framework called El Agente Gráfico that embeds LLM-driven decision-making in a type-safe execution environment with dynamic knowledge graphs for scientific automation, enabling consistent context management and provenance tracking.

DetailsMotivation: Current LLM-based scientific automation approaches are ad hoc and fragile, relying on unstructured text that generates overwhelming information volumes, obscuring decision provenance and hindering auditability. There's a need for more structured, type-safe integration with computational tools.

Method: Developed a framework with structured abstraction of scientific concepts and an object-graph mapper representing computational state as typed Python objects, stored in memory or persisted in external knowledge graphs. Uses typed symbolic identifiers instead of raw text for context management.

Result: Successfully applied to quantum chemistry tasks, demonstrating that a single agent coupled with a reliable execution engine can robustly perform complex, multi-step, and parallel computations. Extended to conformer ensemble generation and metal-organic framework design.

Conclusion: Abstraction and type safety provide a scalable foundation for agentic scientific automation beyond prompt-centric designs, enabling consistency, provenance tracking, and efficient tool orchestration through knowledge graphs as both memory and reasoning substrates.

Abstract: Large language models (LLMs) are increasingly used to automate scientific workflows, yet their integration with heterogeneous computational tools remains ad hoc and fragile. Current agentic approaches often rely on unstructured text to manage context and coordinate execution, generating often overwhelming volumes of information that may obscure decision provenance and hinder auditability. In this work, we present El Agente Gráfico, a single-agent framework that embeds LLM-driven decision-making within a type-safe execution environment and dynamic knowledge graphs for external persistence. Central to our approach is a structured abstraction of scientific concepts and an object-graph mapper that represents computational state as typed Python objects, stored either in memory or persisted in an external knowledge graph. This design enables context management through typed symbolic identifiers rather than raw text, thereby ensuring consistency, supporting provenance tracking, and enabling efficient tool orchestration. We evaluate the system by developing an automated benchmarking framework across a suite of university-level quantum chemistry tasks previously evaluated on a multi-agent system, demonstrating that a single agent, when coupled to a reliable execution engine, can robustly perform complex, multi-step, and parallel computations. We further extend this paradigm to two other large classes of applications: conformer ensemble generation and metal-organic framework design, where knowledge graphs serve as both memory and reasoning substrates. Together, these results illustrate how abstraction and type safety can provide a scalable foundation for agentic scientific automation beyond prompt-centric designs.

[125] Ontology-Guided Neuro-Symbolic Inference: Grounding Language Models with Mathematical Domain Knowledge

Marcelo Labre

Main category: cs.AI

TL;DR: A neuro-symbolic pipeline using formal domain ontologies (OpenMath) with retrieval-augmented generation to enhance language model reliability in mathematics, showing improved performance with high-quality retrieval but degradation with irrelevant context.

DetailsMotivation: Language models have fundamental limitations like hallucination, brittleness, and lack of formal grounding that are problematic in high-stakes specialist fields requiring verifiable reasoning. The paper investigates whether formal domain ontologies can enhance language model reliability through retrieval-augmented generation.

Method: Implemented a neuro-symbolic pipeline using the OpenMath ontology with hybrid retrieval and cross-encoder reranking to inject relevant definitions into model prompts. Used mathematics as proof of concept and evaluated on the MATH benchmark with three open-source models.

Result: Ontology-guided context improves performance when retrieval quality is high, but irrelevant context actively degrades performance. This highlights both the promise and challenges of neuro-symbolic approaches.

Conclusion: Formal domain ontologies can enhance language model reliability through retrieval-augmented generation, but careful attention to retrieval quality is crucial as irrelevant context can actively harm performance rather than just being neutral.

Abstract: Language models exhibit fundamental limitations – hallucination, brittleness, and lack of formal grounding – that are particularly problematic in high-stakes specialist fields requiring verifiable reasoning. I investigate whether formal domain ontologies can enhance language model reliability through retrieval-augmented generation. Using mathematics as proof of concept, I implement a neuro-symbolic pipeline leveraging the OpenMath ontology with hybrid retrieval and cross-encoder reranking to inject relevant definitions into model prompts. Evaluation on the MATH benchmark with three open-source models reveals that ontology-guided context improves performance when retrieval quality is high, but irrelevant context actively degrades it – highlighting both the promise and challenges of neuro-symbolic approaches.

[126] The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

Simon Henniger, Gabriel Poesia

Main category: cs.AI

TL;DR: TTG is an automated evaluation framework where LLMs create programming puzzles to challenge each other, enabling model comparison via Elo ratings without human-curated questions.

DetailsMotivation: Current evaluation of LLM reasoning is expensive (requiring PhD-level human curation) and potentially biased by training data contamination. There's a need for evaluation methods that can't be saturated by design and can test creativity and task creation alongside problem-solving.

Method: Inspired by 16th-century mathematical duels, models create programming puzzles (Python functions returning booleans) to challenge each other. Solutions are verified automatically. Pairwise duels produce Elo ratings for model comparison.

Result: TTG rankings of 10 frontier models closely match existing benchmarks like Humanity’s Last Exam, without human effort. Creating good puzzles remains highly challenging for current models, revealing a skill not measured by previous benchmarks.

Conclusion: TTG offers a new paradigm for evaluating reasoning that resists saturation, enables testing of creativity and task creation skills, and provides automated, scalable model comparison.

Abstract: Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks using PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or if similar problems have been seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles - given a Python function that returns a boolean, find inputs that make it return True - to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each other. We evaluate 10 frontier models on TTG, and closely match the ranking from existing benchmarks such as Humanity’s Last Exam, without involving any human effort in creating puzzles. We also find that creating good puzzles is still a highly challenging task for current models, not measured by previous benchmarks. Overall, our work suggests new paradigms for evaluating reasoning that cannot be saturated by design, and that allow testing models for other skills like creativity and task creation alongside problem solving.

[127] Alignment in Time: Peak-Aware Orchestration for Long-Horizon Agentic Systems

Hanjing Shi, Dominic DiFranzo

Main category: cs.AI

TL;DR: APEMO is a runtime scheduling layer that optimizes computational allocation for autonomous agents by using temporal-affective signals to detect trajectory instability and target repairs at critical segments like peak moments and endings.

DetailsMotivation: Traditional AI alignment focuses on individual model outputs, but autonomous agents in long-horizon workflows need sustained reliability across entire interaction trajectories. There's a need for runtime orchestration that maintains alignment throughout agent workflows without modifying model weights.

Method: APEMO (Affect-aware Peak-End Modulation for Orchestration) is a runtime scheduling layer that operationalizes temporal-affective signals to detect trajectory instability through behavioral proxies. It targets computational allocation repairs at critical segments (peak moments and endings) under fixed budgets.

Result: Evaluation across multi-agent simulations and LLM-based planner-executor flows shows APEMO consistently enhances trajectory-level quality and reuse probability over structural orchestrators.

Conclusion: The work reframes alignment as a temporal control problem and offers a resilient engineering pathway for developing long-horizon agentic systems through runtime orchestration rather than model weight modifications.

Abstract: Traditional AI alignment primarily focuses on individual model outputs; however, autonomous agents in long-horizon workflows require sustained reliability across entire interaction trajectories. We introduce APEMO (Affect-aware Peak-End Modulation for Orchestration), a runtime scheduling layer that optimizes computational allocation under fixed budgets by operationalizing temporal-affective signals. Instead of modifying model weights, APEMO detects trajectory instability through behavioral proxies and targets repairs at critical segments, such as peak moments and endings. Evaluation across multi-agent simulations and LLM-based planner–executor flows demonstrates that APEMO consistently enhances trajectory-level quality and reuse probability over structural orchestrators. Our results reframe alignment as a temporal control problem, offering a resilient engineering pathway for the development of long-horizon agentic systems.

[128] WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

Madhav Kanda, Pedro Las-Casas, Alok Gautam Kumbhare, Rodrigo Fonseca, Sharad Agarwal

Main category: cs.AI

TL;DR: WorkflowPerturb benchmark for evaluating LLM-generated workflows using controlled perturbations to study metric sensitivity and calibration

DetailsMotivation: Automatic evaluation of LLM-generated structured workflows is challenging because metric scores are often uncalibrated and score changes don't clearly indicate severity of workflow degradation

Method: Created WorkflowPerturb benchmark with 4,973 golden workflows and 44,757 perturbed variants across three perturbation types (Missing Steps, Compressed Steps, Description Changes) at severity levels of 10%, 30%, and 50%

Result: Benchmarked multiple metric families, analyzed their sensitivity and calibration using expected score trajectories and residuals, revealing systematic differences across metric families

Conclusion: WorkflowPerturb enables severity-aware interpretation of workflow evaluation scores and provides a controlled benchmark for studying workflow evaluation metrics

Abstract: LLM-based systems increasingly generate structured workflows for complex tasks. In practice, automatic evaluation of these workflows is difficult, because metric scores are often not calibrated, and score changes do not directly communicate the severity of workflow degradation. We introduce WorkflowPerturb, a controlled benchmark for studying workflow evaluation metrics. It works by applying realistic, controlled perturbations to golden workflows. WorkflowPerturb contains 4,973 golden workflows and 44,757 perturbed variants across three perturbation types (Missing Steps, Compressed Steps, and Description Changes), each applied at severity levels of 10%, 30%, and 50%. We benchmark multiple metric families and analyze their sensitivity and calibration using expected score trajectories and residuals. Our results characterize systematic differences across metric families and support severity-aware interpretation of workflow evaluation scores. Our dataset will be released upon acceptance.

[129] Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets

Haruki Abe, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada

Main category: cs.AI

TL;DR: Offline RL combined with cross-embodiment learning enables scalable robot policy pre-training by leveraging heterogeneous robot data, but suffers from conflicting gradients across morphologies that can be mitigated with embodiment-based grouping.

DetailsMotivation: High-quality robot demonstrations are expensive to collect for each platform, hindering scalable policy pre-training. The paper aims to address this by combining offline RL (which uses both expert and suboptimal data) with cross-embodiment learning (which aggregates trajectories across diverse robot morphologies).

Method: Systematic analysis of offline RL + cross-embodiment paradigm, evaluation on locomotion datasets spanning 16 robot platforms, and introduction of embodiment-based grouping strategy where robots are clustered by morphological similarity and models are updated with group gradients to reduce inter-robot conflicts.

Result: The combined approach excels at pre-training with datasets rich in suboptimal trajectories, outperforming pure behavior cloning. However, conflicting gradients across morphologies impede learning as suboptimal data and robot types increase. The proposed embodiment-based grouping substantially reduces inter-robot conflicts and outperforms existing conflict-resolution methods.

Conclusion: Offline RL with cross-embodiment learning enables scalable robot policy pre-training, but requires careful handling of conflicting gradients across diverse morphologies. Simple static grouping by morphological similarity effectively mitigates these conflicts.

Abstract: Scalable robot policy pre-training has been hindered by the high cost of collecting high-quality demonstrations for each platform. In this study, we address this issue by uniting offline reinforcement learning (offline RL) with cross-embodiment learning. Offline RL leverages both expert and abundant suboptimal data, and cross-embodiment learning aggregates heterogeneous robot trajectories across diverse morphologies to acquire universal control priors. We perform a systematic analysis of this offline RL and cross-embodiment paradigm, providing a principled understanding of its strengths and limitations. To evaluate this offline RL and cross-embodiment paradigm, we construct a suite of locomotion datasets spanning 16 distinct robot platforms. Our experiments confirm that this combined approach excels at pre-training with datasets rich in suboptimal trajectories, outperforming pure behavior cloning. However, as the proportion of suboptimal data and the number of robot types increase, we observe that conflicting gradients across morphologies begin to impede learning. To mitigate this, we introduce an embodiment-based grouping strategy in which robots are clustered by morphological similarity and the model is updated with a group gradient. This simple, static grouping substantially reduces inter-robot conflicts and outperforms existing conflict-resolution methods.

[130] Neurosymbolic Language Reasoning as Satisfiability Modulo Theory

Hyunseok Oh, Sam Stern, Youngki Lee, Matthai Philipose

Main category: cs.AI

TL;DR: Logitext: A neurosymbolic language that represents documents as natural language text constraints, enabling joint textual-logical reasoning by integrating LLM-based constraint evaluation with SMT solving.

DetailsMotivation: Large language models often fail to perform reliable interleaved textual and logical reasoning. Existing neurosymbolic systems are limited to fully formalizable tasks (math/program synthesis), leaving natural documents with partial logical structure unaddressed.

Method: Introduces Logitext, a neurosymbolic language representing documents as natural language text constraints (NLTCs) that make partial logical structure explicit. Develops an algorithm integrating LLM-based constraint evaluation with satisfiability modulo theory (SMT) solving.

Result: Experiments on a new content moderation benchmark, LegalBench, and Super-Natural Instructions show Logitext improves both accuracy and coverage compared to existing approaches.

Conclusion: This work is the first to treat LLM-based reasoning as an SMT theory, extending neurosymbolic methods beyond fully formalizable domains to handle natural documents with partial logical structure.

Abstract: Natural language understanding requires interleaving textual and logical reasoning, yet large language models often fail to perform such reasoning reliably. Existing neurosymbolic systems combine LLMs with solvers but remain limited to fully formalizable tasks such as math or program synthesis, leaving natural documents with only partial logical structure unaddressed. We introduce Logitext, a neurosymbolic language that represents documents as natural language text constraints (NLTCs), making partial logical structure explicit. We develop an algorithm that integrates LLM-based constraint evaluation with satisfiability modulo theory (SMT) solving, enabling joint textual-logical reasoning. Experiments on a new content moderation benchmark, together with LegalBench and Super-Natural Instructions, show that Logitext improves both accuracy and coverage. This work is the first that treats LLM-based reasoning as an SMT theory, extending neurosymbolic methods beyond fully formalizable domains.

[131] SOMtime the World Ain$’$t Fair: Violating Fairness Using Self-Organizing Maps

Joseph Bingham, Netanel Arussy, Dvir Aran

Main category: cs.AI

TL;DR: SOMtime, a high-capacity Self-Organizing Map method, reveals that unsupervised representations can still encode sensitive attributes even when explicitly excluded from training, challenging the assumption of neutrality in unsupervised learning.

DetailsMotivation: The paper challenges the common assumption that unsupervised representations are neutral with respect to sensitive attributes when those attributes are withheld from training. The authors aim to demonstrate that fairness through unawareness fails at the representation level.

Method: The authors use SOMtime, a topology-preserving representation method based on high-capacity Self-Organizing Maps, and compare it against PCA, UMAP, t-SNE, and autoencoders. They test on two large-scale real-world datasets: World Values Survey across five countries and Census-Income dataset.

Result: SOMtime recovers monotonic orderings aligned with withheld sensitive attributes (age, income) achieving Spearman correlations up to 0.85, while other methods typically remain below 0.23-0.34. Unsupervised segmentation of SOMtime embeddings produces demographically skewed clusters, demonstrating downstream fairness risks.

Conclusion: The findings establish that fairness through unawareness fails at the representation level for ordinal sensitive attributes, and fairness auditing must extend to unsupervised components of machine learning pipelines.

Abstract: Unsupervised representations are widely assumed to be neutral with respect to sensitive attributes when those attributes are withheld from training. We show that this assumption is false. Using SOMtime, a topology-preserving representation method based on high-capacity Self-Organizing Maps, we demonstrate that sensitive attributes such as age and income emerge as dominant latent axes in purely unsupervised embeddings, even when explicitly excluded from the input. On two large-scale real-world datasets (the World Values Survey across five countries and the Census-Income dataset), SOMtime recovers monotonic orderings aligned with withheld sensitive attributes, achieving Spearman correlations of up to 0.85, whereas PCA and UMAP typically remain below 0.23 (with a single exception reaching 0.31), and against t-SNE and autoencoders which achieve at most 0.34. Furthermore, unsupervised segmentation of SOMtime embeddings produces demographically skewed clusters, demonstrating downstream fairness risks without any supervised task. These findings establish that \textit{fairness through unawareness} fails at the representation level for ordinal sensitive attributes and that fairness auditing must extend to unsupervised components of machine learning pipelines. We have made the code available at~ https://github.com/JosephBingham/SOMtime

[132] Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

Zhuoran Li, Hai Zhong, Xun Wang, Qingxin Xia, Lihua Zhang, Longbo Huang

Main category: cs.AI

TL;DR: OMAD is an online multi-agent reinforcement learning framework that uses diffusion policies for enhanced coordination, achieving state-of-the-art performance with 2.5-5x sample efficiency improvements.

DetailsMotivation: Diffusion models have shown remarkable expressiveness in image generation and offline settings, but their potential in online multi-agent reinforcement learning remains under-explored due to intractable likelihoods that impede entropy-based exploration and coordination.

Method: Proposes OMAD framework with a relaxed policy objective that maximizes scaled joint entropy for exploration without tractable likelihoods, combined with a joint distributional value function within CTDE paradigm to optimize decentralized diffusion policies using entropy-augmented targets.

Result: Extensive evaluations on MPE and MAMuJoCo establish OMAD as new state-of-the-art across 10 diverse tasks, demonstrating 2.5x to 5x improvement in sample efficiency.

Conclusion: OMAD successfully bridges diffusion models with online MARL, overcoming the intractable likelihood challenge and achieving superior coordination performance through effective exploration and stable policy optimization.

Abstract: Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well-positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet, their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose among the first \underline{O}nline off-policy \underline{MA}RL framework using \underline{D}iffusion policies (\textbf{OMAD}) to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on tractable likelihood. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. It leverages tractable entropy-augmented targets to guide the simultaneous updates of diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state-of-the-art across $10$ diverse tasks, demonstrating a remarkable $2.5\times$ to $5\times$ improvement in sample efficiency.

[133] Governance of Generative Artificial Intelligence for Companies

Johannes Schneider, Pauline Kuss, Rene Abraham, Christian Meske

Main category: cs.AI

TL;DR: A review paper that develops an organizational governance framework specifically for Generative AI (GenAI) by adapting existing AI governance frameworks to address GenAI’s unique characteristics, opportunities, and risks.

DetailsMotivation: GenAI technologies like ChatGPT have rapidly entered organizations without adequate governance, creating both opportunities and risks. While there's extensive debate about GenAI's transformative potential and emerging regulations, limited research addresses organizational governance from combined technical and business perspectives. Existing AI governance frameworks may not fully apply to GenAI's unique characteristics.

Method: The paper conducts a literature review to understand GenAI’s fundamental characteristics and adapts existing governance frameworks specifically for GenAI. It extends Nickerson’s framework development process by incorporating prior conceptualizations to create a comprehensive governance framework.

Result: Develops a governance framework that delineates scope, objectives, and governance mechanisms designed to both harness business opportunities and mitigate risks associated with GenAI integration within organizations.

Conclusion: The research advances a focused approach to GenAI governance, offering practical guidance for companies navigating GenAI adoption challenges while highlighting research gaps in this emerging field.

Abstract: Generative Artificial Intelligence (GenAI), specifically large language models (LLMs) like ChatGPT, has swiftly entered organizations without adequate governance, posing both opportunities and risks. Despite extensive debate on GenAI’s transformative potential and emerging regulatory measures, limited research addresses organizational governance from both technical and business perspectives. While frameworks for AI governance exist, it remains unclear to what extent they apply to GenAI. This review paper fills this gap by surveying recent literature to better understand the fundamental characteristics of GenAI and to adapt existing governance frameworks specifically to GenAI within organizations. To this end, it extends Nickerson’s framework development process by incorporating prior conceptualizations. The resulting framework delineates scope, objectives, and governance mechanisms designed to both harness business opportunities and mitigate risks associated with GenAI integration. Overall, this research advances a focused approach to GenAI governance, offering practical guidance for companies navigating the challenges of GenAI adoption and highlighting research gaps.

[134] Causal Explanations for Image Classifiers

Hana Chockler, David A. Kelly, Daniel Kroening, Youcheng Sun

Main category: cs.AI

TL;DR: A novel black-box approach for explaining image classifier outputs using formal causal theory, implemented in tool ReX which outperforms state-of-the-art methods in efficiency and explanation size.

DetailsMotivation: Existing image classifier explanation tools lack principled approaches based on formal definitions of cause and explanation, using instead various ad-hoc techniques without theoretical grounding.

Method: Developed a black-box explanation framework grounded in actual causality theory, with algorithm for computing approximate explanations, theoretical proofs of termination, and analysis of complexity/approximation bounds.

Result: Implemented as ReX tool; experimental results show it’s the most efficient black-box tool, produces smallest explanations, and outperforms other black-box tools on standard quality measures.

Conclusion: Formal causal theory provides a principled foundation for image classifier explanations, enabling more efficient and compact explanations than existing ad-hoc approaches.

Abstract: Existing algorithms for explaining the output of image classifiers use different definitions of explanations and a variety of techniques to find them. However, none of the existing tools use a principled approach based on formal definitions of cause and explanation. In this paper we present a novel black-box approach to computing explanations grounded in the theory of actual causality. We prove relevant theoretical results and present an algorithm for computing approximate explanations based on these definitions. We prove termination of our algorithm and discuss its complexity and the amount of approximation compared to the precise definition. We implemented the framework in a tool ReX and we present experimental results and a comparison with state-of-the-art tools. We demonstrate that ReX is the most efficient black-box tool and produces the smallest explanations, in addition to outperforming other black-box tools on standard quality measures.

[135] Through the Judge’s Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters

Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye, Eytan Adar, Qiaozhu Mei

Main category: cs.AI

TL;DR: A human-LLM collaborative framework to infer thinking traces from label-only annotations, improving LLM rater reliability for subjective evaluation tasks.

DetailsMotivation: LLMs are increasingly used as raters but struggle with subjective tasks requiring subtle reasoning beyond simple labels. Thinking traces (the reasoning behind judgments) are informative but hard to collect. The paper aims to leverage label-only annotations to reconstruct these traces at scale.

Method: Proposes a human-LLM collaborative framework using rejection sampling to infer thinking traces from label-only annotations. These traces are then used to: (1) fine-tune open LLM raters, and (2) synthesize clearer annotation guidelines for proprietary LLM raters.

Result: Across multiple datasets, the methods lead to significantly improved LLM-human agreement. The refined annotation guidelines also increase agreement among different LLM models.

Conclusion: LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance LLM rater reliability.

Abstract: Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, when human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance the reliability of LLM raters.

[136] The Oversight Game: Learning to Cooperatively Balance an AI Agent’s Safety and Autonomy

William Overman, Mohsen Bayati

Main category: cs.AI

TL;DR: A control framework for AI agents where agents choose to act autonomously or defer to humans, and humans choose to trust or oversee, modeled as a Markov game that ensures alignment between agent autonomy and human welfare.

DetailsMotivation: To address AI safety challenges by creating a control interface that retains meaningful human oversight without modifying the underlying AI system, ensuring agents act safely while maintaining autonomy.

Method: Model the human-agent interaction as a two-player Markov game where agents choose “play” (autonomous action) or “ask” (defer), and humans choose “trust” (permissive) or “oversee” (engage). Prove alignment guarantees when the game forms a Markov Potential Game. Validate with gridworld simulations and fine-tuned 30B parameter language models for tool-use tasks.

Result: The framework establishes intrinsic alignment where agent’s incentive for autonomy is coupled with human welfare. In validation, the approach effectively reduces safety violations in open-ended environments even as agents learn to coordinate dynamically.

Conclusion: The proposed control interface provides a practical safety mechanism that encourages agents to defer when risky and act when safe, maintaining human oversight while allowing beneficial autonomy.

Abstract: As increasingly capable agents are deployed, a central safety challenge is how to retain meaningful human control without modifying the underlying system. We study a minimal control interface in which an agent chooses whether to act autonomously (play) or defer (ask), while a human simultaneously chooses whether to be permissive (trust) or engage in oversight (oversee), and model this interaction as a two-player Markov game. When this game forms a Markov Potential Game, we prove an alignment guarantee: any increase in the agent’s utility from acting more autonomously cannot decrease the human’s value. This establishes a form of intrinsic alignment where the agent’s incentive to seek autonomy is structurally coupled to the human’s welfare. Practically, the framework induces a transparent control layer that encourages the agent to defer when risky and act when safe. While we use gridworld simulations to illustrate the emergence of this collaboration, our primary validation involves an agentic tool-use task in which two 30B parameter language models are fine-tuned via independent policy gradient. We demonstrate that even as the agents learn to coordinate on the fly, this framework effectively reduces safety violations in realistic, open-ended environments.

[137] Adaptive GR(1) Specification Repair for Liveness-Preserving Shielding in Reinforcement Learning

Tiberiu-Andrei Georgescu, Alexander W. Goodall, Dalal Alrajeh, Francesco Belardinelli, Sebastian Uchitel

Main category: cs.AI

TL;DR: Adaptive shielding framework for RL that automatically repairs GR(1) specifications online using ILP when environment assumptions are violated, maintaining both safety and near-optimal rewards.

DetailsMotivation: Classical shielding approaches in RL are static and assume fixed specifications and hand-crafted abstractions, failing to adapt when environment assumptions are violated, leading to suboptimal performance.

Method: Develops adaptive shielding based on GR(1) specifications, detects environment assumption violations at runtime, and employs Inductive Logic Programming (ILP) to automatically repair GR(1) specifications online in an interpretable way.

Result: Shows that static symbolic controllers are often severely suboptimal when optimizing for auxiliary rewards, and RL agents with adaptive shield maintain near-optimal reward and perfect logical compliance compared to static shields in Minepump and Atari Seaquest case studies.

Conclusion: Adaptive shielding enables RL agents to gracefully evolve their safety guarantees, ensuring liveness is achievable and minimally weakening goals only when necessary, outperforming static shielding approaches.

Abstract: Shielding is widely used to enforce safety in reinforcement learning (RL), ensuring that an agent’s actions remain compliant with formal specifications. Classical shielding approaches, however, are often static, in the sense that they assume fixed logical specifications and hand-crafted abstractions. While these static shields provide safety under nominal assumptions, they fail to adapt when environment assumptions are violated. In this paper, we develop an adaptive shielding framework based on based on Generalized Reactivity of rank 1 (GR(1)) specifications, a tractable and expressive fragment of Linear Temporal Logic (LTL) that captures both safety and liveness properties. Our method detects environment assumption violations at runtime and employs Inductive Logic Programming (ILP) to automatically repair GR(1) specifications online, in a systematic and interpretable way. This ensures that the shield evolves gracefully, ensuring liveness is achievable and minimally weakening goals only when necessary. We consider two case studies: Minepump and Atari Seaquest; showing that (i) static symbolic controllers are often severely suboptimal when optimizing for auxiliary rewards, and (ii) RL agents equipped with our adaptive shield maintain near-optimal reward and perfect logical compliance compared with static shields.

[138] Two Constraint Compilation Methods for Lifted Planning

Periklis Mantenoglou, Luigi Bonassi, Enrico Scala, Pedro Zuidberg Dos Martires

Main category: cs.AI

TL;DR: Novel compilation methods for PDDL planning with qualitative state-trajectory constraints that avoid grounding, enabling scalability to large problems with many objects and high-arity actions.

DetailsMotivation: Real-world planning problems often involve qualitative constraints like safety requirements, task ordering, and intermediate sub-goals. Existing compilers that handle these constraints require grounding the problem first, which doesn't scale to problems with many objects and high-arity actions.

Method: Proposed two methods for compiling away constraints without grounding the problem. The compilers work directly on the lifted representation, avoiding the exponential blow-up from grounding. Provided correctness proofs and worst-case time complexity analysis.

Result: Empirical evaluation on International Planning Competition domains shows the methods are efficient and produce planning specifications orders of magnitude more succinct than grounding-based compilers. The compiled problems remain competitive when solved with state-of-the-art planners.

Conclusion: The proposed constraint compilation methods enable scalable handling of qualitative state-trajectory constraints in planning without requiring grounding, making them suitable for large-scale real-world planning problems.

Abstract: We study planning in a fragment of PDDL with qualitative state-trajectory constraints, capturing safety requirements, task ordering conditions, and intermediate sub-goals commonly found in real-world problems. A prominent approach to tackle such problems is to compile their constraints away, leading to a problem that is supported by state-of-the-art planners. Unfortunately, existing compilers do not scale on problems with a large number of objects and high-arity actions, as they necessitate grounding the problem before compilation. To address this issue, we propose two methods for compiling away constraints without grounding, making them suitable for large-scale planning problems. We prove the correctness of our compilers and outline their worst-case time complexity. Moreover, we present a reproducible empirical evaluation on the domains used in the latest International Planning Competition. Our results demonstrate that our methods are efficient and produce planning specifications that are orders of magnitude more succinct than the ones produced by compilers that ground the domain, while remaining competitive when used for planning with a state-of-the-art planner.

[139] A Neuromorphic Architecture for Scalable Event-Based Control

Yongkang Huo, Fulvio Forni, Rodolphe Sepulchre

Main category: cs.AI

TL;DR: The paper introduces a rebound Winner-Take-All (RWTA) motif for neuromorphic control architecture that combines discrete computation reliability with continuous regulation tunability, demonstrated on a snake robot nervous system design.

DetailsMotivation: To create a scalable neuromorphic control architecture that bridges the gap between discrete computation and continuous regulation, addressing both rhythmic generation and decision-making in a unified framework.

Method: Proposes the rebound Winner-Take-All (RWTA) motif as a basic element, combining winner-take-all state machines with excitable biophysical circuits in an event-based framework.

Result: Demonstrates the architecture’s versatility, robustness, and modularity through the nervous system design of a snake robot, showing unified handling of continuous rhythmic generation and discrete decision-making.

Conclusion: The RWTA-based architecture provides a scalable solution for neuromorphic control that integrates discrete and continuous computation capabilities in a unified physical modeling language.

Abstract: This paper introduces the ``rebound Winner-Take-All (RWTA)" motif as the basic element of a scalable neuromorphic control architecture. From the cellular level to the system level, the resulting architecture combines the reliability of discrete computation and the tunability of continuous regulation: it inherits the discrete computation capabilities of winner-take-all state machines and the continuous tuning capabilities of excitable biophysical circuits. The proposed event-based framework addresses continuous rhythmic generation and discrete decision-making in a unified physical modeling language. We illustrate the versatility, robustness, and modularity of the architecture through the nervous system design of a snake robot.

[140] AI Epidemiology: achieving explainable AI through expert oversight patterns

Kit Tempest-Walters

Main category: cs.AI

TL;DR: AI Epidemiology is a governance framework that applies population-level surveillance methods to AI outputs, using statistical associations between structured assessment fields (risk, alignment, accuracy) and output failures to detect unreliable AI recommendations without requiring model interpretability.

DetailsMotivation: Current interpretability methods (like SHAP and mechanistic interpretability) struggle with model complexity at scale. There's a need for practical governance frameworks that can detect unreliable AI outputs without requiring deep ML expertise or access to internal model computations.

Method: Standardizes capture of AI-expert interactions into structured assessment fields (risk level, alignment score, accuracy score) that function as exposure variables. Uses statistical associations between these variables and output failures, validated against expert overrides and real-world outcomes. Passively tracks expert convergence/divergence with AI recommendations.

Result: Provides automatic audit trails, zero burden on experts, governance continuity across model updates/vendor switches, reliability scores, and semantic assessments that enable detection of unreliable AI outputs before harm occurs.

Conclusion: AI Epidemiology democratizes AI oversight by enabling domain experts to govern AI systems without requiring machine learning expertise, bypassing the complexity problem of current interpretability methods through population-level surveillance approaches.

Abstract: AI Epidemiology is a framework for governing and explaining advanced AI systems by applying population-level surveillance methods to AI outputs. The approach mirrors the way in which epidemiologists enable public health interventions through statistical evidence before molecular mechanisms are understood. This bypasses the problem of model complexity which plagues current interpretability methods (such as SHAP and mechanistic interpretability) at the scale of deployed models. AI Epidemiology achieves this population-level surveillance by standardising capture of AI-expert interactions into structured assessment fields: risk level, alignment score, and accuracy score. These function as exposure variables which predict output failure through statistical associations, much like cholesterol and blood pressure act as exposure variables predicting cardiac events. Output-failure associations are subsequently validated against expert overrides and real-world outcomes. The framework places zero burden on experts and provides automatic audit trails by passively tracking expert convergence and divergence with AI recommendations. Since it analyses outputs rather than internal model computations, it also provides governance continuity when institutions update models and switch vendors. Finally, by providing reliability scores and semantic assessments (e.g. ’this recommendation resembles 500 cases overridden by experts due to guideline violations’), it enables experts and institutions to detect unreliable AI outputs before they cause harm. This democratises AI oversight by enabling domain experts to govern AI systems without requiring machine learning expertise.

[141] SPARK: Search Personalization via Agent-Driven Retrieval and Knowledge-sharing

Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury

Main category: cs.AI

TL;DR: SPARK is a multi-agent LLM framework for personalized search using specialized persona-based agents with coordinated retrieval and knowledge-sharing capabilities.

DetailsMotivation: Current personalized search systems struggle with modeling users' evolving, multi-dimensional information needs due to static profiles and monolithic retrieval pipelines, requiring more dynamic, context-sensitive approaches.

Method: SPARK uses coordinated persona-based LLM agents with formalized persona spaces (role, expertise, task context, domain), a Persona Coordinator for query interpretation, independent retrieval-augmented generation processes, dedicated memory stores, and structured inter-agent communication protocols.

Result: The framework yields testable predictions about coordination efficiency, personalization quality, and cognitive load distribution while providing insights for next-generation search systems that can handle complex, fluid human information-seeking behavior.

Conclusion: SPARK demonstrates how emergent personalization properties can arise from distributed agent behaviors with minimal coordination rules, offering a promising approach for capturing the complexity and context sensitivity of personalized search.

Abstract: Personalized search demands the ability to model users’ evolving, multi-dimensional information needs; a challenge for systems constrained by static profiles or monolithic retrieval pipelines. We present SPARK (Search Personalization via Agent-Driven Retrieval and Knowledge-sharing), a framework in which coordinated persona-based large language model (LLM) agents deliver task-specific retrieval and emergent personalization. SPARK formalizes a persona space defined by role, expertise, task context, and domain, and introduces a Persona Coordinator that dynamically interprets incoming queries to activate the most relevant specialized agents. Each agent executes an independent retrieval-augmented generation process, supported by dedicated long- and short-term memory stores and context-aware reasoning modules. Inter-agent collaboration is facilitated through structured communication protocols, including shared memory repositories, iterative debate, and relay-style knowledge transfer. Drawing on principles from cognitive architectures, multi-agent coordination theory, and information retrieval, SPARK models how emergent personalization properties arise from distributed agent behaviors governed by minimal coordination rules. The framework yields testable predictions regarding coordination efficiency, personalization quality, and cognitive load distribution, while incorporating adaptive learning mechanisms for continuous persona refinement. By integrating fine-grained agent specialization with cooperative retrieval, SPARK provides insights for next-generation search systems capable of capturing the complexity, fluidity, and context sensitivity of human information-seeking behavior.

[142] The AI Pyramid A Conceptual Framework for Workforce Capability in the Age of AI

Alok Khatri, Bishesh Khanal

Main category: cs.AI

TL;DR: The paper introduces “AI Nativity” as the ability to integrate AI into everyday reasoning and proposes the AI Pyramid framework for organizing human capabilities in an AI-mediated economy across three layers: AI Native (baseline participation), AI Foundation (building/sustaining systems), and AI Deep (advancing frontier AI).

DetailsMotivation: Traditional approaches to digital/AI literacy are insufficient as generative AI disproportionately affects highly educated, white-collar work, challenging assumptions about workforce vulnerability. There's a need for new frameworks to understand and develop human capabilities in an AI-mediated economy.

Method: The paper proposes a conceptual framework called the AI Pyramid, which organizes human capabilities into three interdependent layers: 1) AI Native capability (universal baseline), 2) AI Foundation capability (building/integrating systems), and 3) AI Deep capability (advancing frontier AI). The framework treats capability formation as infrastructure rather than episodic training.

Result: The AI Pyramid framework provides a systematic way to organize human capabilities needed at scale in an AI-mediated economy, distinguishing between different levels of AI integration and expertise required across the workforce.

Conclusion: Effective AI workforce development requires treating capability formation as infrastructure, centered on problem-based learning embedded in work contexts, with implications for organizations, education systems, and governments to address productivity, resilience, and inequality at societal scale.

Abstract: Artificial intelligence (AI) represents a qualitative shift in technological change by extending cognitive labor itself rather than merely automating routine tasks. Recent evidence shows that generative AI disproportionately affects highly educated, white collar work, challenging existing assumptions about workforce vulnerability and rendering traditional approaches to digital or AI literacy insufficient. This paper introduces the concept of AI Nativity, the capacity to integrate AI fluidly into everyday reasoning, problem solving, and decision making, and proposes the AI Pyramid, a conceptual framework for organizing human capability in an AI mediated economy. The framework distinguishes three interdependent capability layers: AI Native capability as a universal baseline for participation in AI augmented environments; AI Foundation capability for building, integrating, and sustaining AI enabled systems; and AI Deep capability for advancing frontier AI knowledge and applications. Crucially, the pyramid is not a career ladder but a system level distribution of capabilities required at scale. Building on this structure, the paper argues that effective AI workforce development requires treating capability formation as infrastructure rather than episodic training, centered on problem based learning embedded in work contexts and supported by dynamic skill ontologies and competency based measurement. The framework has implications for organizations, education systems, and governments seeking to align learning, measurement, and policy with the evolving demands of AI mediated work, while addressing productivity, resilience, and inequality at societal scale.

[143] Cluster Workload Allocation: Semantic Soft Affinity Using Natural Language Processing

Leszek Sliwko, Jolanta Mizeria-Pietraszko

Main category: cs.AI

TL;DR: LLM-based semantic scheduling system for Kubernetes that interprets natural language allocation hints to simplify cluster workload configuration.

DetailsMotivation: Cluster workload allocation requires complex configurations, creating a usability gap. The paper aims to simplify this through natural language interfaces.

Method: Uses LLM integrated via Kubernetes scheduler extender to interpret natural language annotations for soft affinity preferences. Features cluster state cache and intent analyzer using AWS Bedrock.

Result: High LLM parsing accuracy (>95% Subset Accuracy) with top-tier models, superior scheduling quality in complex scenarios compared to standard Kubernetes configurations.

Conclusion: Validates using LLMs for accessible scheduling but highlights limitations like synchronous LLM latency, suggesting asynchronous processing for production readiness.

Abstract: Cluster workload allocation often requires complex configurations, creating a usability gap. This paper introduces a semantic, intent-driven scheduling paradigm for cluster systems using Natural Language Processing. The system employs a Large Language Model (LLM) integrated via a Kubernetes scheduler extender to interpret natural language allocation hint annotations for soft affinity preferences. A prototype featuring a cluster state cache and an intent analyzer (using AWS Bedrock) was developed. Empirical evaluation demonstrated high LLM parsing accuracy (>95% Subset Accuracy on an evaluation ground-truth dataset) for top-tier models like Amazon Nova Pro/Premier and Mistral Pixtral Large, significantly outperforming a baseline engine. Scheduling quality tests across six scenarios showed the prototype achieved superior or equivalent placement compared to standard Kubernetes configurations, particularly excelling in complex and quantitative scenarios and handling conflicting soft preferences. The results validate using LLMs for accessible scheduling but highlight limitations like synchronous LLM latency, suggesting asynchronous processing for production readiness. This work confirms the viability of semantic soft affinity for simplifying workload orchestration and presents a proof-of-concept design.

[144] Benchmarking at the Edge of Comprehension

Samuele Marro, Jialin Yu, Emanuele La Malfa, Oishi Deb, Jiawei Li, Yibo Yang, Ebey Abraham, Sunando Sengupta, Eric Sommerlade, Michael Wooldridge, Philip Torr

Main category: cs.AI

TL;DR: Proposes critique-resilient benchmarking framework where answers are deemed correct if no adversary can convincingly prove them wrong, enabling model comparison when human comprehension of tasks becomes infeasible.

DetailsMotivation: As frontier LLMs saturate benchmarks quickly, human ability to generate discriminative tasks, provide ground-truth answers, and evaluate complex solutions becomes increasingly difficult. This "post-comprehension regime" threatens our ability to measure AI progress.

Method: Adversarial framework using critique-resilient correctness: answers are correct if no adversary can convincingly prove otherwise. Humans serve as bounded verifiers focusing on localized claims. Uses itemized bipartite Bradley-Terry model to jointly rank LLMs by their ability to solve tasks and generate difficult yet solvable questions.

Result: Demonstrated effectiveness in mathematical domain across eight frontier LLMs, showing stable scores that correlate with external capability measures. Framework successfully reformulates benchmarking as adversarial generation-evaluation game.

Conclusion: Critique-resilient benchmarking provides a viable approach to compare models when full human understanding becomes infeasible, preserving evaluation integrity beyond full comprehension of tasks through adversarial verification.

Abstract: As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this scenario as the post-comprehension regime. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique relies on the notion of critique-resilient correctness: an answer is deemed correct if no adversary has convincingly proved otherwise. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims, which preserves evaluation integrity beyond full comprehension of the task. Using an itemized bipartite Bradley-Terry model, we jointly rank LLMs by their ability to solve challenging tasks and to generate difficult yet solvable questions. We showcase the effectiveness of our method in the mathematical domain across eight frontier LLMs, showing that the resulting scores are stable and correlate with external capability measures. Our framework reformulates benchmarking as an adversarial generation-evaluation game in which humans serve as final adjudicators.

[145] EnterpriseBench Corecraft: Training Generalizable Agents on High-Fidelity RL Environments

Sushant Mehta, Logan Ritchie, Suhaas Garre, Nick Heiner, Edwin Chen

Main category: cs.AI

TL;DR: Training AI agents on high-fidelity enterprise simulation environments produces capabilities that generalize beyond training distribution, with CoreCraft environment enabling significant performance improvements and transfer learning to out-of-distribution benchmarks.

DetailsMotivation: To demonstrate that training AI agents on high-fidelity reinforcement learning environments produces generalizable capabilities beyond the training distribution, and to create realistic enterprise simulations that measure whether AI agents can perform multi-step, domain-specific work that real jobs demand.

Method: Introduce CoreCraft, a fully operational enterprise simulation of a customer support organization with over 2,500 entities across 14 entity types and 23 unique tools. Train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping on this environment.

Result: After one epoch of training, task pass rate improved from 25.37% to 36.76% on held-out evaluation tasks. Gains transferred to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on Tau2-Bench Retail, and +6.8% on Tool Decathlon (Pass@1).

Conclusion: Environment quality, diversity, and realism are key factors enabling generalizable agent capabilities. Three environment properties enable transfer: task-centric world building, expert-authored rubrics for reliable reward computation, and realistic enterprise workflows.

Abstract: We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce CoreCraft, the first environment in EnterpriseBench, Surge AI’s suite of agentic RL environments. CoreCraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on Tau2-Bench Retail, and +6.8% on Tool Decathlon (Pass@1). We believe three environment properties are consistent with the observed transfer: task-centric world building that optimizes for diverse, challenging tasks; expert-authored rubrics enabling reliable reward computation; and enterprise workflows that reflect realistic professional patterns. Our results suggest that environment quality, diversity, and realism are key factors enabling generalizable agent capabilities.

[146] Agent Skill Framework: Perspectives on the Potential of Small Language Models in Industrial Environments

Yangjie Xu, Lujun Li, Lama Sleem, Niccolo Gentile, Yewei Song, Yiqun Wang, Siming Ji, Wenbo Wu, Radu State

Main category: cs.AI

TL;DR: Agent Skill framework benefits small language models (SLMs), especially 12B-30B parameter models, improving performance in data-sensitive industrial scenarios where proprietary models are infeasible.

DetailsMotivation: The Agent Skill framework works well with proprietary models but its effectiveness with small language models (SLMs) is unknown. This matters for industrial applications where data security and budget constraints prevent reliance on public APIs, and SLMs often struggle with generalization in customized scenarios.

Method: Provides formal mathematical definition of Agent Skill process, then systematically evaluates language models of varying sizes across multiple use cases including two open-source tasks and a real-world insurance claims dataset.

Result: Tiny models struggle with reliable skill selection, moderately sized SLMs (12B-30B parameters) benefit substantially from Agent Skill approach, and code-specialized variants around 80B parameters achieve performance comparable to closed-source baselines while improving GPU efficiency.

Conclusion: Agent Skill framework provides significant benefits for moderately sized SLMs, offering actionable insights for deploying Agent Skills in SLM-centered environments with data-security and budget constraints.

Abstract: Agent Skill framework, now widely and officially supported by major players such as GitHub Copilot, LangChain, and OpenAI, performs especially well with proprietary models by improving context engineering, reducing hallucinations, and boosting task accuracy. Based on these observations, an investigation is conducted to determine whether the Agent Skill paradigm provides similar benefits to small language models (SLMs). This question matters in industrial scenarios where continuous reliance on public APIs is infeasible due to data-security and budget constraints requirements, and where SLMs often show limited generalization in highly customized scenarios. This work introduces a formal mathematical definition of the Agent Skill process, followed by a systematic evaluation of language models of varying sizes across multiple use cases. The evaluation encompasses two open-source tasks and a real-world insurance claims data set. The results show that tiny models struggle with reliable skill selection, while moderately sized SLMs (approximately 12B - 30B) parameters) benefit substantially from the Agent Skill approach. Moreover, code-specialized variants at around 80B parameters achieve performance comparable to closed-source baselines while improving GPU efficiency. Collectively, these findings provide a comprehensive and nuanced characterization of the capabilities and constraints of the framework, while providing actionable insights for the effective deployment of Agent Skills in SLM-centered environments.

[147] LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

Juliusz Ziomek, William Bankes, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic

Main category: cs.AI

TL;DR: LLM-Wikirace is a benchmark for evaluating planning, reasoning, and world knowledge in LLMs through Wikipedia navigation tasks, revealing that while frontier models achieve superhuman performance on easy levels, they struggle significantly on hard tasks where planning and long-horizon reasoning become critical.

DetailsMotivation: Current LLMs excel at many tasks but still face challenges in planning, reasoning, and leveraging world knowledge for complex navigation problems. The authors aim to create a benchmark that specifically tests these capabilities through Wikipedia hyperlink navigation.

Method: Created LLM-Wikirace benchmark where models must navigate from a source Wikipedia page to a target page using hyperlinks step by step. Evaluated various open- and closed-source models (Gemini-3, GPT-5, Claude Opus 4.5) across easy and hard difficulty levels, analyzing performance, world knowledge requirements, planning capabilities, and failure recovery patterns.

Result: Frontier models achieve superhuman performance on easy tasks but performance drops sharply on hard difficulty (Gemini-3 succeeds in only 23% of hard games). World knowledge is necessary but insufficient beyond a threshold where planning and long-horizon reasoning become dominant. Models struggle with replanning after failure and frequently enter loops.

Conclusion: LLM-Wikirace reveals clear limitations in current reasoning systems, showing that even state-of-the-art models have significant room for improvement in planning and long-horizon reasoning capabilities. The benchmark provides an open arena for evaluating and advancing planning-capable LLMs.

Abstract: We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point, beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-Wikirace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove. Our code and leaderboard available at https:/llmwikirace.github.io.

[148] RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

Yunseok Han, Yejoon Lee, Jaeyoung Do

Main category: cs.AI

TL;DR: Paper introduces RFEval benchmark to evaluate reasoning faithfulness in Large Reasoning Models, finding 49.7% unfaithful outputs and showing accuracy is poor proxy for faithfulness.

DetailsMotivation: Large Reasoning Models often produce plausible-sounding rationales that don't reflect their true decision process, undermining reliability and trust. Need formal framework to evaluate reasoning faithfulness separate from accuracy.

Method: Proposes formal framework with two conditions: stance consistency (coherent stance linking reasoning to answer) and causal influence (reasoning causally drives answer under interventions). Creates RFEval benchmark of 7,186 instances across 7 tasks with controlled counterfactual interventions to test faithfulness.

Result: Evaluated 12 open-source LRMs, found unfaithfulness in 49.7% of outputs, mostly from stance inconsistency. Failures concentrated in math and code domains. RL-style post-training reduces faithfulness even when accuracy maintained. Accuracy-faithfulness correlation is weak and statistically insignificant.

Conclusion: Trustworthy AI requires optimizing for structural integrity of reasoning process, not just correct outcomes. RFEval provides rigorous methodology for auditing LRM reliability.

Abstract: Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions: stance consistency (a coherent stance linking reasoning to answer) and causal influence (the stated reasoning causally drives the answer under output-level interventions), explicitly decoupled from accuracy. To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions. Evaluating twelve open-source LRMs, we find unfaithfulness in 49.7% of outputs, predominantly from stance inconsistency. Failures are concentrated in brittle, convergent domains such as math and code, and correlate more with post-training regimes than with scale: within-family ablations indicate that adding current RL-style objectives on top of supervised fine-tuning can reduce reasoning faithfulness, even when accuracy is maintained. Crucially, accuracy is neither a sufficient nor a reliable proxy for faithfulness: once controlling for model and task, the accuracy-faithfulness link is weak and statistically insignificant. Our work establishes a rigorous methodology for auditing LRM reliability and shows that trustworthy AI requires optimizing not only for correct outcomes but also for the structural integrity of the reasoning process. Our code and dataset can be found at project page: https://aidaslab.github.io/RFEval/}{https://aidaslab.github.io/RFEval/

[149] Pareto Optimal Benchmarking of AI Models on ARM Cortex Processors for Sustainable Embedded Systems

Pranay Jain, Maximilian Kasper, Göran Köber, Oliver Amft, Axel Plinge, Dominik Seuß

Main category: cs.AI

TL;DR: Benchmarking framework for optimizing AI models on ARM Cortex processors (M0+, M4, M7) focusing on energy efficiency, accuracy, and resource utilization in embedded systems.

DetailsMotivation: Need for practical benchmarking to evaluate AI model performance on resource-constrained embedded systems, balancing energy efficiency, accuracy, and computational demands.

Method: Automated test bench design for systematic evaluation across KPIs, using FLOPs as computational metric and Pareto analysis for trade-off optimization between energy consumption and accuracy.

Result: Near-linear correlation between FLOPs and inference time; M7 ideal for short inference cycles, M4 better for energy efficiency in longer tasks, M0+ suitable for simpler models.

Conclusion: Provides practical guidance for developers to design energy-efficient AI systems on ARM Cortex processors, balancing performance requirements with sustainability considerations.

Abstract: This work presents a practical benchmarking framework for optimizing artificial intelligence (AI) models on ARM Cortex processors (M0+, M4, M7), focusing on energy efficiency, accuracy, and resource utilization in embedded systems. Through the design of an automated test bench, we provide a systematic approach to evaluate across key performance indicators (KPIs) and identify optimal combinations of processor and AI model. The research highlights a nearlinear correlation between floating-point operations (FLOPs) and inference time, offering a reliable metric for estimating computational demands. Using Pareto analysis, we demonstrate how to balance trade-offs between energy consumption and model accuracy, ensuring that AI applications meet performance requirements without compromising sustainability. Key findings indicate that the M7 processor is ideal for short inference cycles, while the M4 processor offers better energy efficiency for longer inference tasks. The M0+ processor, while less efficient for complex AI models, remains suitable for simpler tasks. This work provides insights for developers, guiding them to design energy-efficient AI systems that deliver high performance in realworld applications.

cs.SD

[150] Interpreting Multi-Branch Anti-Spoofing Architectures: Correlating Internal Strategy with Empirical Performance

Ivan Viakhirev, Kirill Borodin, Mikhail Gorodnichev, Grach Mkrtchian

Main category: cs.SD

TL;DR: A framework for interpreting AASIST3 audio anti-spoofing models at component level using covariance operators and CatBoost meta-classifier to analyze branch cooperation/competition patterns across spoofing attacks.

DetailsMotivation: Multi-branch deep neural networks like AASIST3 achieve state-of-the-art performance in audio anti-spoofing, but their internal decision dynamics remain opaque. Existing interpretability methods focus on input-level visualization, failing to characterize how architectural branches cooperate or compete under different spoofing attacks.

Method: Developed a framework for interpreting AASIST3 at component level by modeling intermediate activations from fourteen branches and global attention modules with covariance operators. Leading eigenvalues form low-dimensional spectral signatures that train a CatBoost meta-classifier to generate TreeSHAP-based branch attributions, converted into normalized contribution shares and confidence scores.

Result: Analysis of 13 spoofing attacks from ASVspoof 2019 identified four operational archetypes: Effective Specialization (e.g., A09, EER 0.04%), Ineffective Consensus (e.g., A08, EER 3.14%), and crucially, Flawed Specialization where the model places high confidence in incorrect branches, leading to severe performance degradation for attacks A17 and A18 (EER 14.26% and 28.63%).

Conclusion: The framework provides quantitative insights linking internal architectural strategy directly to empirical reliability, revealing specific structural dependencies that standard performance metrics overlook, enabling better understanding of multi-branch audio anti-spoofing models.

Abstract: Multi-branch deep neural networks like AASIST3 achieve state-of-the-art comparable performance in audio anti-spoofing, yet their internal decision dynamics remain opaque compared to traditional input-level saliency methods. While existing interpretability efforts largely focus on visualizing input artifacts, the way individual architectural branches cooperate or compete under different spoofing attacks is not well characterized. This paper develops a framework for interpreting AASIST3 at the component level. Intermediate activations from fourteen branches and global attention modules are modeled with covariance operators whose leading eigenvalues form low-dimensional spectral signatures. These signatures train a CatBoost meta-classifier to generate TreeSHAP-based branch attributions, which we convert into normalized contribution shares and confidence scores (Cb) to quantify the model’s operational strategy. By analyzing 13 spoofing attacks from the ASVspoof 2019 benchmark, we identify four operational archetypes-ranging from Effective Specialization (e.g., A09, Equal Error Rate (EER) 0.04%, C=1.56) to Ineffective Consensus (e.g., A08, EER 3.14%, C=0.33). Crucially, our analysis exposes a Flawed Specialization mode where the model places high confidence in an incorrect branch, leading to severe performance degradation for attacks A17 and A18 (EER 14.26% and 28.63%, respectively). These quantitative findings link internal architectural strategy directly to empirical reliability, highlighting specific structural dependencies that standard performance metrics overlook.

[151] Scaling Audio-Text Retrieval with Multimodal Large Language Models

Jilan Xu, Carl Thomé, Danijela Horak, Weidi Xie, Andrew Zisserman

Main category: cs.SD

TL;DR: AuroLA is a novel audio-text retrieval framework that repurposes Multimodal Large Language Models (MLLMs) as a unified backbone for retrieval, outperforming state-of-the-art models with significantly less training data.

DetailsMotivation: Existing contrastive dual-encoder architectures like CLAP are limited by small-scale encoders that struggle with complex queries requiring reasoning or world knowledge. The authors aim to leverage the superior capabilities of MLLMs for better audio-text retrieval.

Method: Three key contributions: (1) Scalable data pipeline curating diverse audio with multi-granular captions via automated annotation; (2) Adapting MLLMs for retrieval by prompting them to summarize inputs and using hidden states of special tokens as embeddings, trained with Hybrid-NCE loss using multi-granular supervision and hard-negative reweighting; (3) MLLM-based bidirectional re-ranking module for refining retrieval candidates through deep cross-modal interaction.

Result: AuroLA consistently outperforms state-of-the-art models including PE-AV while using only ~1% of PE-AV’s training data. Clear scaling trends observed regarding dataset size and model capacity, validating MLLMs as effective unified backbones for audio-text retrieval.

Conclusion: MLLMs can be effectively repurposed as unified backbones for audio-text retrieval, achieving superior performance with significantly less training data through innovative data curation, model adaptation, and re-ranking techniques.

Abstract: Audio-text retrieval is crucial for bridging acoustic signals and natural language. While contrastive dual-encoder architectures like CLAP have shown promise, they are fundamentally limited by the capacity of small-scale encoders. Specifically, the text encoders struggle to understand complex queries that require reasoning or world knowledge. In this paper, we propose AuroLA, a novel contrastive language-audio pre-training framework that re-purposes Multimodal Large Language Models (MLLMs) as a unified backbone for retrieval. Specifically, we make three contributions: (i) we construct a scalable data pipeline that curates diverse audio from multiple sources and generates multi-granular captions, ranging from long descriptions to structured tags, via automated annotation; (ii) we adapt an MLLM for retrieval by prompting it to summarize the audio/text input and using the hidden state of a special token as audio/text embeddings. For model training, we devise a novel Hybrid-NCE loss, which employs multi-granular supervision and hard-negative reweighting to robustly align audio with diverse textual supervision; and (iii) we design an MLLM-based bidirectional re-ranking module that refines retrieval candidates through deep cross-modal interaction. Extensive experiments demonstrate that AuroLA consistently outperforms state-of-the-art models, including the recent PE-AV, while utilizing only approximately 1% of PE-AV’s training data. Lastly, we observe clear scaling trends regarding dataset size and model capacity, validating the effectiveness of MLLM as a unified backbone for audio-text retrieval. Code is available at https://github.com/Jazzcharles/AuroLA.

[152] MeanVoiceFlow: One-step Nonparallel Voice Conversion with Mean Flows

Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo

Main category: cs.SD

TL;DR: MeanVoiceFlow is a novel one-step voice conversion model using mean flows that achieves real-time conversion without iterative inference or distillation, while maintaining quality comparable to multi-step diffusion models.

DetailsMotivation: Current diffusion and flow-matching models for voice conversion suffer from slow iterative inference. The authors aim to develop a one-step VC model that can be trained from scratch without requiring pretraining or distillation, while maintaining high speech quality and speaker similarity.

Method: Proposes MeanVoiceFlow using mean flows with average velocity instead of instantaneous velocity for single-step inference. Introduces structural margin reconstruction loss for stability and conditional diffused-input training where a mixture of noise and source data is used as input during both training and inference.

Result: Experimental results show MeanVoiceFlow achieves performance comparable to previous multi-step and distillation-based models, even when trained from scratch, while enabling real-time voice conversion.

Conclusion: MeanVoiceFlow successfully addresses the speed limitations of iterative VC models while maintaining quality, offering a practical solution for real-time voice conversion applications.

Abstract: In voice conversion (VC) applications, diffusion and flow-matching models have exhibited exceptional speech quality and speaker similarity performances. However, they are limited by slow conversion owing to their iterative inference. Consequently, we propose MeanVoiceFlow, a novel one-step nonparallel VC model based on mean flows, which can be trained from scratch without requiring pretraining or distillation. Unlike conventional flow matching that uses instantaneous velocity, mean flows employ average velocity to more accurately compute the time integral along the inference path in a single step. However, training the average velocity requires its derivative to compute the target velocity, which can cause instability. Therefore, we introduce a structural margin reconstruction loss as a zero-input constraint, which moderately regularizes the input-output behavior of the model without harmful statistical averaging. Furthermore, we propose conditional diffused-input training in which a mixture of noise and source data is used as input to the model during both training and inference. This enables the model to effectively leverage source information while maintaining consistency between training and inference. Experimental results validate the effectiveness of these techniques and demonstrate that MeanVoiceFlow achieves performance comparable to that of previous multi-step and distillation-based models, even when trained from scratch. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/meanvoiceflow/.

Emmanuel Deruty, David Meredith, Yann Macé, Luc Leroy, Dima Tsypkin, Pascal Arbez-Nicolas

Main category: cs.SD

TL;DR: Electronic tones in pop music (808 bass, power chords) are structurally equivalent to multiphonics in classical music, both creating multiple pitch percepts through similar spectral features.

DetailsMotivation: To challenge the assumption that pitch ambiguity is exclusive to experimental classical music by showing that mainstream electronic tones share structural and perceptual characteristics with classical multiphonics.

Method: Used listening tests with 10 participants and signal analysis to compare electronic tones (808-style bass, power chords) with multiphonics in contemporary classical music, examining spectral and temporal features.

Result: Both types of tones elicit multiple, listener-dependent pitch percepts arising from similar spectral and temporal features, demonstrating that pitch ambiguity exists in mainstream music production.

Conclusion: Pitch ambiguity is not confined to experimental classical contexts but is a feature of mainstream music production, with electronic tones and classical multiphonics sharing structural equivalence.

Abstract: This study argues that electronic tones routinely used in contemporary popular music - including 808-style bass and power chords - are structurally and perceptually equivalent to multiphonics in contemporary classical music. Using listening tests (n=10) and signal analysis, we show that both types of tones elicit multiple, listener-dependent pitch percepts arising from similar spectral and temporal features. These findings suggest that pitch ambiguity is not confined to experimental classical contexts but is also a feature of mainstream music production.

[154] A Generative-First Neural Audio Autoencoder

Jonah Casebeer, Ge Zhu, Zhepei Wang, Nicholas J. Bryan

Main category: cs.SD

TL;DR: A generative-first audio autoencoder architecture achieves 3360x temporal downsampling, 10x faster encoding, 1.6x lower latent rates, and supports continuous/discrete representations and multiple audio formats in a single model.

DetailsMotivation: Existing neural autoencoders for audio generation have limitations: they prioritize reconstruction over generation, resulting in high latent rates, slow encoding, and separate models for different representations (discrete/continuous) and audio channel formats, which hinders practical workflows from preprocessing to inference conditioning.

Method: Introduces a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x. The model supports both continuous and discrete latent representations and common audio channel formats (mono, stereo, etc.) within a single unified architecture, balancing compression, quality, and speed.

Result: Achieves 10x faster encoding, 1.6x lower latent rates, and eliminates the need for channel-format-specific variants while maintaining competitive reconstruction quality. A 60-second mono signal compresses to just 788 tokens, making generative modeling more tractable for practical applications.

Conclusion: The generative-first audio autoencoder enables applications previously constrained by processing costs by providing a unified, efficient architecture that supports multiple representations and formats, making audio generation workflows more practical and scalable.

Abstract: Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling necessitates fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent rates, slow encoding, and separate architectures for discrete vs. continuous latents and for different audio channel formats, hindering workflows from preprocessing to inference conditioning. We introduce a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x and supports continuous and discrete representations and common audio channel formats in one model. By balancing compression, quality, and speed, it delivers 10x faster encoding, 1.6x lower rates, and eliminates channel-format-specific variants while maintaining competitive reconstruction quality. This enables applications previously constrained by processing costs: a 60-second mono signal compresses to 788 tokens, making generative modeling more tractable.

cs.LG

[155] Reducing Text Bias in Synthetically Generated MCQAs for VLMs in Autonomous Driving

Sutej Kulgod, Sean Ye, Sanchit Tanwar, Christoffer Heckman

Main category: cs.LG

TL;DR: The paper addresses bias in synthetic MCQA benchmarks for VLMs, showing models can exploit textual shortcuts without visual input, and proposes a method to force visual grounding.

DetailsMotivation: Synthetic MCQA benchmarks for VLMs contain hidden textual cues that allow models to exploit linguistic patterns rather than visual context, leading to inflated performance metrics that don't reflect true visual understanding.

Method: Decouples correct answers from linguistic artifacts and employs curriculum learning to force models to rely on visual grounding rather than textual shortcuts.

Result: Reduces blind accuracy from +66.9% above random to +2.9%, eliminating most exploitable textual shortcuts and ensuring performance reflects perceptual understanding.

Conclusion: The proposed method effectively addresses bias in synthetic MCQA benchmarks, forcing VLMs to develop genuine visual understanding rather than exploiting textual patterns.

Abstract: Multiple Choice Question Answering (MCQA) benchmarks are an established standard for measuring Vision Language Model (VLM) performance in driving tasks. However, we observe the known phenomenon that synthetically generated MCQAs are highly susceptible to hidden textual cues that allow models to exploit linguistic patterns rather than visual context. Our results show that a VLM fine-tuned on such data can achieve accuracy comparable to human-validated benchmarks even without visual input. Our proposed method reduces blind accuracy from +66.9% above random to +2.9%, eliminating the vast majority of exploitable textual shortcuts. By decoupling the correct answer from linguistic artifacts and employing a curriculum learning strategy, we force the model to rely on visual grounding, ensuring that performance accurately reflects perceptual understanding.

[156] Joint Parameter and State-Space Bayesian Optimization: Using Process Expertise to Accelerate Manufacturing Optimization

Saksham Kiroriwal, Julius Pfrommer, Jürgen Beyerer

Main category: cs.LG

TL;DR: POGPN-JPSS combines Bayesian optimization with structured probabilistic modeling to optimize high-dimensional multi-stage manufacturing processes using intermediate time-series observations and expert knowledge.

DetailsMotivation: Standard Bayesian optimization treats manufacturing processes as black boxes, ignoring valuable intermediate observations and process structure. High-dimensional state-space time series from multi-stage systems present challenges for using intermediate data effectively.

Method: Proposes POGPN-JPSS framework combining Partially Observable Gaussian Process Networks (POGPN) with Joint Parameter and State-Space (JPSS) modeling. Uses expert knowledge to extract low-dimensional latent features from high-dimensional state-space data and models the process as a Directed Acyclic Graph (DAG).

Result: POGPN-JPSS significantly outperforms state-of-the-art methods on a high-dimensional bioethanol production simulation, achieving desired performance threshold twice as fast with greater reliability, translating to substantial time and resource savings.

Conclusion: Combining expert knowledge with structured probabilistic models enables rapid process maturation and efficient optimization of complex multi-stage manufacturing systems.

Abstract: Bayesian optimization (BO) is a powerful method for optimizing black-box manufacturing processes, but its performance is often limited when dealing with high-dimensional multi-stage systems, where we can observe intermediate outputs. Standard BO models the process as a black box and ignores the intermediate observations and the underlying process structure. Partially Observable Gaussian Process Networks (POGPN) model the process as a Directed Acyclic Graph (DAG). However, using intermediate observations is challenging when the observations are high-dimensional state-space time series. Process-expert knowledge can be used to extract low-dimensional latent features from the high-dimensional state-space data. We propose POGPN-JPSS, a framework that combines POGPN with Joint Parameter and State-Space (JPSS) modeling to use intermediate extracted information. We demonstrate the effectiveness of POGPN-JPSS on a challenging, high-dimensional simulation of a multi-stage bioethanol production process. Our results show that POGPN-JPSS significantly outperforms state-of-the-art methods by achieving the desired performance threshold twice as fast and with greater reliability. The fast optimization directly translates to substantial savings in time and resources. This highlights the importance of combining expert knowledge with structured probabilistic models for rapid process maturation.

[157] BioBridge: Bridging Proteins and Language for Enhanced Biological Reasoning with LLMs

Yujia Wang, Jihong Guan, Wengen Li, Shuigeng Zhou, Xuhong Wang

Main category: cs.LG

TL;DR: BioBridge is a domain-adaptive continual pretraining framework that bridges protein language models and general-purpose LLMs for enhanced protein understanding and reasoning.

DetailsMotivation: Existing Protein Language Models (PLMs) have limited adaptability to multiple tasks and poor generalization across biological contexts, while general-purpose LLMs lack protein sequence interpretation capabilities and domain-specific knowledge for effective biosemantic reasoning.

Method: Uses Domain-Incremental Continual Pre-training (DICP) to infuse protein domain knowledge and general reasoning corpus into LLMs simultaneously, preventing catastrophic forgetting. Implements cross-modal alignment via PLM-Projector-LLM pipeline to map protein sequence embeddings into LLM semantic space, with end-to-end optimization for various tasks.

Result: BioBridge achieves performance comparable to mainstream PLMs on protein benchmarks (EC, BindingDB) and matches LLMs on general understanding tasks (MMLU, RACE), demonstrating domain-specific adaptability combined with general-purpose language competency.

Conclusion: BioBridge successfully bridges the gap between protein-specific models and general-purpose LLMs, enabling effective protein understanding while maintaining general language capabilities through innovative continual pretraining and cross-modal alignment techniques.

Abstract: Existing Protein Language Models (PLMs) often suffer from limited adaptability to multiple tasks and exhibit poor generalization across diverse biological contexts. In contrast, general-purpose Large Language Models (LLMs) lack the capability to interpret protein sequences and fall short in domain-specific knowledge, limiting their capacity for effective biosemantic reasoning. To combine the advantages of both, we propose BioBridge, a domain-adaptive continual pretraining framework for protein understanding. This framework employs Domain-Incremental Continual Pre-training (DICP) to infuse protein domain knowledge and general reasoning corpus into a LLM simultaneously, effectively mitigating catastrophic forgetting. Cross-modal alignment is achieved via a PLM-Projector-LLM pipeline, which maps protein sequence embeddings into the semantic space of the language model. Ultimately, an end-to-end optimization is adopted to uniformly support various tasks, including protein property prediction and knowledge question-answering. Our proposed BioBridge demonstrates performance comparable to that of mainstream PLMs on multiple protein benchmarks, such as EC and BindingDB. It also achieves results on par with LLMs on general understanding tasks like MMLU and RACE. This showcases its innovative advantage of combining domain-specific adaptability with general-purpose language competency.

[158] LATMiX: Learnable Affine Transformations for Microscaling Quantization of LLMs

Ofir Gordon, Lior Dikstein, Arnon Netzer, Idan Achituve, Hai Victor Habi

Main category: cs.LG

TL;DR: LATMiX: Learnable affine transformations for microscaling (MX) quantization of LLMs, improving accuracy by optimizing transformations for both activation distribution and quantization structure.

DetailsMotivation: Existing PTQ methods for LLMs use simple transformations (rotation/Hadamard) that don't work well with modern MX quantization formats, causing severe performance degradation. Need transformations that account for both activation distribution and quantization structure.

Method: Proposes LATMiX: learnable invertible affine transformations optimized using standard deep learning tools. Provides theoretical analysis of transformations under MX quantization with error bound. Generalizes outlier reduction beyond simple transformations.

Result: Consistent improvements in average accuracy for MX low-bit quantization across multiple model sizes on wide range of zero-shot benchmarks, outperforming strong baselines.

Conclusion: LATMiX effectively combines outlier reduction with MX quantization by using learnable affine transformations, addressing limitations of prior approaches and improving quantization robustness.

Abstract: Post-training quantization (PTQ) is a widely used approach for reducing the memory and compute costs of large language models (LLMs). Recent studies have shown that applying invertible transformations to activations can significantly improve quantization robustness by reducing activation outliers; however, existing approaches are largely restricted to rotation or Hadamard-based transformations. Moreover, most studies focused primarily on traditional quantization schemes, whereas modern hardware increasingly supports the microscaling (MX) data format. Attempts to combine both showed severe performance degradation, leading prior work to introduce assumptions on the transformations. In this work, we take a complementary perspective. First, we provide a theoretical analysis of transformations under MX quantization by deriving a bound on the quantization error. Our analysis emphasizes the importance of accounting for both the activation distribution and the underlying quantization structure. Building on this analysis, we propose LATMiX, a method that generalizes outlier reduction to learnable invertible affine transformations optimized using standard deep learning tools. Experiments show consistent improvements in average accuracy for MX low-bit quantization over strong baselines on a wide range of zero-shot benchmarks, across multiple model sizes.

[159] Duality Models: An Embarrassingly Simple One-step Generation Paradigm

Peng Sun, Xinyi Shang, Tao Lin, Zhiqiang Shen

Main category: cs.LG

TL;DR: DuMo introduces a “one input, dual output” paradigm for consistency models that simultaneously predicts velocity and flow-map from a single input, improving stability and efficiency in few-step image generation.

DetailsMotivation: Current consistency-based generative models use a "one input, one output" paradigm that forces a trade-off between multi-step and few-step training objectives, often leaving few-step generation undertrained and limiting scalability.

Method: Proposes Duality Models (DuMo) with a shared backbone and dual heads that simultaneously predict velocity (v_t) and flow-map (u_t) from a single input x_t, applying geometric constraints from multi-step objectives to every sample without separating training objectives.

Result: Achieves state-of-the-art FID of 1.79 on ImageNet 256×256 with a 679M Diffusion Transformer using SD-VAE in just 2 steps, demonstrating improved stability and efficiency.

Conclusion: The “one input, dual output” paradigm effectively addresses the training trade-off in consistency models, enabling high-quality few-step image generation with better convergence and scalability.

Abstract: Consistency-based generative models like Shortcut and MeanFlow achieve impressive results via a target-aware design for solving the Probability Flow ODE (PF-ODE). Typically, such methods introduce a target time $r$ alongside the current time $t$ to modulate outputs between a local multi-step derivative ($r = t$) and a global few-step integral ($r = 0$). However, the conventional “one input, one output” paradigm enforces a partition of the training budget, often allocating a significant portion (e.g., 75% in MeanFlow) solely to the multi-step objective for stability. This separation forces a trade-off: allocating sufficient samples to the multi-step objective leaves the few-step generation undertrained, which harms convergence and limits scalability. To this end, we propose Duality Models (DuMo) via a “one input, dual output” paradigm. Using a shared backbone with dual heads, DuMo simultaneously predicts velocity $v_t$ and flow-map $u_t$ from a single input $x_t$. This applies geometric constraints from the multi-step objective to every sample, bounding the few-step estimation without separating training objectives, thereby significantly improving stability and efficiency. On ImageNet 256 $\times$ 256, a 679M Diffusion Transformer with SD-VAE achieves a state-of-the-art (SOTA) FID of 1.79 in just 2 steps. Code is available at: https://github.com/LINs-lab/DuMo

[160] Probabilistic NDVI Forecasting from Sparse Satellite Time Series and Weather Covariates

Irene Iele, Giulia Romoli, Daniele Molino, Elena Mulero Ayllón, Filippo Ruffini, Paolo Soda, Matteo Tortora

Main category: cs.LG

TL;DR: Transformer-based probabilistic forecasting framework for NDVI vegetation index prediction using satellite data with explicit separation of historical vegetation dynamics and future meteorological information.

DetailsMotivation: Accurate short-term vegetation forecasting is crucial for precision agriculture, but NDVI forecasting from satellite data is challenging due to sparse/irregular sampling from cloud coverage and heterogeneous climatic conditions.

Method: Transformer-based architecture that separates historical vegetation dynamics modeling from future exogenous information, integrating historical NDVI with historical/future meteorological covariates. Uses temporal-distance weighted quantile loss for irregular revisit patterns and cumulative/extreme-weather feature engineering.

Result: Extensive experiments on European satellite data show the approach consistently outperforms statistical, deep learning, and recent time series baselines across both point-wise and probabilistic evaluation metrics.

Conclusion: The proposed probabilistic forecasting framework effectively addresses NDVI prediction challenges, with ablation studies highlighting the central role of target history and complementary gains from meteorological covariates.

Abstract: Accurate short-term forecasting of vegetation dynamics is a key enabler for data-driven decision support in precision agriculture. Normalized Difference Vegetation Index (NDVI) forecasting from satellite observations, however, remains challenging due to sparse and irregular sampling caused by cloud coverage, as well as the heterogeneous climatic conditions under which crops evolve. In this work, we propose a probabilistic forecasting framework specifically designed for field-level NDVI prediction under clear-sky acquisition constraints. The method leverages a transformer-based architecture that explicitly separates the modeling of historical vegetation dynamics from future exogenous information, integrating historical NDVI observations with both historical and future meteorological covariates. To address irregular revisit patterns and horizon-dependent uncertainty, we introduce a temporal-distance weighted quantile loss that aligns the training objective with the effective forecasting horizon. In addition, we incorporate cumulative and extreme-weather feature engineering to better capture delayed meteorological effects relevant to vegetation response. Extensive experiments on European satellite data demonstrate that the proposed approach consistently outperforms a diverse set of statistical, deep learning, and recent time series baselines across both point-wise and probabilistic evaluation metrics. Ablation studies further highlight the central role of target history, while showing that meteorological covariates provide complementary gains when jointly exploited. The code is available at https://github.com/arco-group/ndvi-forecasting.

[161] CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

Xiao Zhu, Xinyu Zhou, Boyu Zhu, Hanxu Hu, Mingzhe Du, Haotian Zhang, Huiming Wang, Zhijiang Guo

Main category: cs.LG

TL;DR: CodeScaler is an execution-free reward model for code generation that scales RL training and inference without needing unit tests, outperforming execution-based approaches while reducing latency.

DetailsMotivation: Current RL from Verifiable Rewards (RLVR) for code LLMs relies on execution-based feedback from unit tests, which limits scalability due to test case availability and reliability issues.

Method: Train CodeScaler on curated preference data from verified code problems, using syntax-aware code extraction and validity-preserving reward shaping for stable optimization.

Result: Improves Qwen3-8B-Base by +11.72 points across 5 benchmarks, outperforms binary execution-based RL by +1.82 points, enables scalable RL on synthetic datasets without tests, reduces inference latency 10x, and surpasses existing reward models on RM-Bench.

Conclusion: CodeScaler provides a scalable, execution-free alternative to test-based RL for code generation, offering better performance and efficiency while maintaining robustness.

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).

[162] Optimal Multi-Debris Mission Planning in LEO: A Deep Reinforcement Learning Approach with Co-Elliptic Transfers and Refueling

Agni Bandyopadhyay, Gunther Waxenegger-Wilfing

Main category: cs.LG

TL;DR: A unified coelliptic maneuver framework for multi-target active debris removal in LEO, comparing greedy heuristic, MCTS, and masked PPO reinforcement learning for mission planning efficiency.

DetailsMotivation: Address the challenge of multi-target active debris removal in Low Earth Orbit by developing efficient planning algorithms that can handle realistic orbital constraints like debris fields, keep-out zones, and delta V limitations.

Method: Developed a unified coelliptic maneuver framework combining Hohmann transfers, safety ellipse proximity operations, and explicit refueling logic. Benchmarked three planning algorithms: Greedy heuristic, Monte Carlo Tree Search (MCTS), and deep reinforcement learning using Masked Proximal Policy Optimization (PPO) in realistic orbital simulations.

Result: Masked PPO achieved superior mission efficiency and computational performance, visiting up to twice as many debris as Greedy heuristic and significantly outperforming MCTS in runtime across 100 test scenarios.

Conclusion: Modern reinforcement learning methods like Masked PPO show promise for scalable, safe, and resource-efficient space mission planning, paving the way for future advancements in active debris removal autonomy.

Abstract: This paper addresses the challenge of multi target active debris removal (ADR) in Low Earth Orbit (LEO) by introducing a unified coelliptic maneuver framework that combines Hohmann transfers, safety ellipse proximity operations, and explicit refueling logic. We benchmark three distinct planning algorithms Greedy heuristic, Monte Carlo Tree Search (MCTS), and deep reinforcement learning (RL) using Masked Proximal Policy Optimization (PPO) within a realistic orbital simulation environment featuring randomized debris fields, keep out zones, and delta V constraints. Experimental results over 100 test scenarios demonstrate that Masked PPO achieves superior mission efficiency and computational performance, visiting up to twice as many debris as Greedy and significantly outperforming MCTS in runtime. These findings underscore the promise of modern RL methods for scalable, safe, and resource efficient space mission planning, paving the way for future advancements in ADR autonomy.

[163] The Geometry of Noise: Why Diffusion Models Don’t Need Noise Conditioning

Mojtaba Sahraee-Ardakan, Mauricio Delbracio, Peyman Milanfar

Main category: cs.LG

TL;DR: The paper resolves a paradox in autonomous generative models by formalizing Marginal Energy and proving that generation is a Riemannian gradient flow on this energy landscape, with velocity-based parameterizations being inherently stable.

DetailsMotivation: Autonomous generative models that operate without explicit noise-level conditioning present a paradox: how can a bounded, noise-agnostic network remain stable near the data manifold where gradients typically diverge? The paper aims to resolve this fundamental question about the underlying optimization landscape.

Method: The authors formalize Marginal Energy as the negative log of the marginal density of noisy data integrated over unknown noise levels. They prove generation is a Riemannian gradient flow on this energy landscape, analyze geometric singularities, and establish stability conditions through relative energy decomposition and analysis of different parameterizations.

Result: The paper shows that while the raw Marginal Energy has a 1/t^p singularity normal to the data manifold, the learned time-invariant field incorporates a local conformal metric that counteracts this singularity. Velocity-based parameterizations are proven to be inherently stable due to bounded-gain conditions, while noise-prediction parameterizations suffer from catastrophic failure due to a “Jensen Gap” amplifying estimation errors.

Conclusion: Autonomous generative models perform generation as Riemannian gradient flow on Marginal Energy, with velocity-based parameterizations providing structural stability by absorbing posterior uncertainty into smooth geometric drift, resolving the paradox of how bounded networks can operate stably near data manifolds.

Abstract: Autonomous (noise-agnostic) generative models, such as Equilibrium Matching and blind diffusion, challenge the standard paradigm by learning a single, time-invariant vector field that operates without explicit noise-level conditioning. While recent work suggests that high-dimensional concentration allows these models to implicitly estimate noise levels from corrupted observations, a fundamental paradox remains: what is the underlying landscape being optimized when the noise level is treated as a random variable, and how can a bounded, noise-agnostic network remain stable near the data manifold where gradients typically diverge? We resolve this paradox by formalizing Marginal Energy, $E_{\text{marg}}(\mathbf{u}) = -\log p(\mathbf{u})$, where $p(\mathbf{u}) = \int p(\mathbf{u}|t)p(t)dt$ is the marginal density of the noisy data integrated over a prior distribution of unknown noise levels. We prove that generation using autonomous models is not merely blind denoising, but a specific form of Riemannian gradient flow on this Marginal Energy. Through a novel relative energy decomposition, we demonstrate that while the raw Marginal Energy landscape possesses a $1/t^p$ singularity normal to the data manifold, the learned time-invariant field implicitly incorporates a local conformal metric that perfectly counteracts the geometric singularity, converting an infinitely deep potential well into a stable attractor. We also establish the structural stability conditions for sampling with autonomous models. We identify a ``Jensen Gap’’ in noise-prediction parameterizations that acts as a high-gain amplifier for estimation errors, explaining the catastrophic failure observed in deterministic blind models. Conversely, we prove that velocity-based parameterizations are inherently stable because they satisfy a bounded-gain condition that absorbs posterior uncertainty into a smooth geometric drift.

[164] Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

Bowen Yu, Maolin Wang, Sheng Zhang, Binhao Wang, Yi Wen, Jingtong Gao, Bowen Liu, Zimo Zhao, Wanyu Wang, Xiangyu Zhao

Main category: cs.LG

TL;DR: A curriculum learning framework for distilling Chain-of-Thought reasoning into smaller models through progressive skill acquisition, achieving better accuracy with shorter outputs.

DetailsMotivation: Teacher rationales from large language models are often too verbose for smaller student models to faithfully reproduce, while existing compression methods lose the interpretability that makes CoT valuable.

Method: Three-stage curriculum learning: 1) Structural understanding via masked shuffled reconstruction, 2) Group Relative Policy Optimization on masked completion tasks to balance accuracy and brevity, 3) Targeted rewriting of persistent failure cases using GRPO.

Result: Qwen2.5-3B-Base achieved 11.29% accuracy improvement while reducing output length by 27.4% on GSM8K, surpassing both instruction-tuned variants and prior distillation methods.

Conclusion: The framework successfully addresses the capacity mismatch between large teacher models and compact student models for CoT reasoning distillation, enabling smaller models to produce concise yet accurate reasoning.

Abstract: Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge: teacher rationales are often too verbose for smaller models to faithfully reproduce. Existing approaches either compress reasoning into single-step, losing the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition. First, we establish structural understanding via masked shuffled reconstruction. Second, we apply Group Relative Policy Optimization (GRPO) on masked completion tasks, enabling the model to discover its own balance between accuracy and brevity. Third, we identify persistent failure cases and guide the student to internalize teacher knowledge through targeted rewriting, again optimized with GRPO. Experiments on GSM8K demonstrate that our approach enables Qwen2.5-3B-Base to achieve an 11.29 percent accuracy improvement while reducing output length by 27.4 percent, surpassing both instruction-tuned variants and prior distillation methods.

[165] AnCoder: Anchored Code Generation via Discrete Diffusion Models

Anton Xue, Litu Rout, Constantine Caramanis, Sanjay Shakkottai

Main category: cs.LG

TL;DR: AnchorTree framework uses abstract syntax trees to guide diffusion models for better code generation by prioritizing syntactically important tokens

DetailsMotivation: Diffusion language models offer advantages for code generation but often produce broken programs that fail to execute because they don't respect programming language structure

Method: AnchorTree framework uses abstract syntax trees as hierarchical priors to anchor the diffusion process, prioritizing resolution of syntactically and semantically salient tokens like keywords and identifiers to establish structural scaffolds

Result: AnCoder models demonstrate that structurally anchored diffusion offers parameter-efficient path to high-quality code generation

Conclusion: Explicitly anchoring diffusion with structured priors native to code improves program generation quality and execution success

Abstract: Diffusion language models offer a compelling alternative to autoregressive code generation, enabling global planning and iterative refinement of complex program logic. However, existing approaches fail to respect the rigid structure of programming languages and, as a result, often produce broken programs that fail to execute. To address this, we introduce AnchorTree, a framework that explicitly anchors the diffusion process using structured, hierarchical priors native to code. Specifically, AnchorTree uses the abstract syntax tree to prioritize resolving syntactically and semantically salient tokens, such as keywords (e.g., if, while) and identifiers (e.g., variable names), thereby establishing a structural scaffold that guides the remaining generation. We validate this framework via AnCoder, a family of models showing that structurally anchored diffusion offers a parameter-efficient path to high-quality code generation.

[166] Robust Pre-Training of Medical Vision-and-Language Models with Domain-Invariant Multi-Modal Masked Reconstruction

Melika Filvantorkaman, Mohsen Piri

Main category: cs.LG

TL;DR: Robust-MMR is a self-supervised pre-training framework that incorporates robustness objectives into masked vision-language learning for medical applications, improving performance under domain shifts and perturbations.

DetailsMotivation: Medical vision-language models degrade under domain shifts from variations in imaging devices, protocols, and reporting styles. Existing methods treat robustness as a downstream problem rather than addressing it during pre-training.

Method: Robust-MMR integrates asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints to encourage domain-invariant representations in self-supervised pre-training.

Result: Achieves 78.9% cross-domain accuracy on VQA-RAD (3.8% improvement), 74.6% on SLAKE, 77.0% on VQA-2019. Improves perturbed VQA-RAD accuracy from 69.1% to 75.6%. Cross-domain MELINDA accuracy increases from 70.3% to 75.2%.

Conclusion: Explicitly modeling robustness during pre-training leads to more reliable and transferable medical vision-language representations for real-world deployment.

Abstract: Medical vision-language models show strong potential for joint reasoning over medical images and clinical text, but their performance often degrades under domain shift caused by variations in imaging devices, acquisition protocols, and reporting styles. Existing multi-modal pre-training methods largely overlook robustness, treating it as a downstream adaptation problem. In this work, we propose Robust Multi-Modal Masked Reconstruction (Robust-MMR), a self-supervised pre-training framework that explicitly incorporates robustness objectives into masked vision-language learning. Robust-MMR integrates asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints to encourage domain-invariant representations. We evaluate Robust-MMR on multiple medical vision-language benchmarks, including medical visual question answering (VQA-RAD, SLAKE, VQA-2019), cross-domain image-text classification (MELINDA), and robust image-caption retrieval (ROCO). Robust-MMR achieves 78.9% cross-domain accuracy on VQA-RAD, outperforming the strongest baseline by 3.8 percentage points, and reaches 74.6% and 77.0% accuracy on SLAKE and VQA-2019, respectively. Under perturbed evaluation, Robust-MMR improves VQA-RAD accuracy from 69.1% to 75.6%. For image-text classification, cross-domain MELINDA accuracy increases from 70.3% to 75.2%, while retrieval experiments show a reduction in mean rank degradation from over 16 to 4.1 under perturbation. Qualitative results further demonstrate improved clinical reasoning for disease detection and structural abnormality assessment. These findings show that explicitly modeling robustness during pre-training leads to more reliable and transferable medical vision-language representations for real-world deployment.

[167] Tethered Reasoning: Decoupling Entropy from Hallucination in Quantized LLMs via Manifold Steering

Craig Atkinson

Main category: cs.LG

TL;DR: HELIX is a geometric framework that decouples output entropy from hallucination in quantized language models by tethering hidden-state trajectories to a truthfulness manifold, enabling high-temperature sampling without semantic incoherence.

DetailsMotivation: Quantized language models face a fundamental trade-off: low temperatures yield repetitive outputs, while high temperatures cause trajectory divergence and semantic incoherence. There's a need to enable creative, high-entropy generation without sacrificing truthfulness and coherence.

Method: HELIX computes a Unified Truth Score (UTS) combining token-level semantic entropy with Mahalanobis distance from a pre-computed truthfulness manifold. When UTS indicates trajectory divergence, graduated steering vectors redirect activations toward structurally coherent regions, affecting only 0.2-2.5% of tokens.

Result: On 4-bit quantized Granite 4.0 H Small: GSM8K maintains 88.84% accuracy at T=3.0 (only 2.81pp degradation from T=0.5); MMLU maintains 72.49% across 14,042 questions. Steered outputs exhibit 5-20% idea duplication vs 70-80% at conservative settings, with 46.7% higher unique concept generation in cross-architecture validation.

Conclusion: HELIX demonstrates that high-temperature hallucination is primarily trajectory divergence rather than semantic collapse. The framework enables exploration of semantic diversity without violating logical coherence, revealing a previously-masked High-Entropy Creative Reservoir and enabling Multi-Temperature Synthesis with 200% more unique concepts.

Abstract: Quantized language models face a fundamental dilemma: low sampling temperatures yield repetitive, mode-collapsed outputs, while high temperatures (T > 2.0) cause trajectory divergence and semantic incoherence. We present HELIX, a geometric framework that decouples output entropy from hallucination by tethering hidden-state trajectories to a pre-computed truthfulness manifold. HELIX computes a Unified Truth Score (UTS) combining token-level semantic entropy with Mahalanobis distance from the manifold. When UTS indicates trajectory divergence, graduated steering vectors redirect activations toward structurally coherent regions while affecting only 0.2-2.5% of tokens. On 4-bit quantized Granite 4.0 H Small (32B/9B active, hybrid Mamba-Transformer): GSM8K maintains 88.84% accuracy at T = 3.0 (2.81pp degradation from T = 0.5); MMLU maintains 72.49% across 14,042 questions (1.24pp degradation). This demonstrates that high-temperature hallucination is primarily trajectory divergence rather than semantic collapse. Notably, steering the sparse Transformer attention layers (~10% of layers) is sufficient to correct drift in the Mamba-2 state-space formulation. Geometric tethering reveals a previously-masked High-Entropy Creative Reservoir. At T > 2.0, steered outputs exhibit 5-20% idea duplication versus 70-80% at conservative settings. Cross-architecture validation (Qwen3-30B-A3B MOE) confirms this phenomenon is architecture-independent, with 46.7% higher unique concept generation. HELIX acts as a syntax tether, enabling exploration of semantic diversity without violating the logical backbone required for valid output. This enables Multi-Temperature Synthesis, generating 200% more unique concepts than single-temperature inference.

[168] Agentic Unlearning: When LLM Agent Meets Machine Unlearning

Bin Wang, Fan Wang, Pingping Wang, Jinyu Cong, Yang Yu, Yilong Yin, Zhongyi Han, Benzheng Wei

Main category: cs.LG

TL;DR: Agentic unlearning framework (SBU) that removes information from both model parameters and persistent memory in agents with closed-loop interaction, addressing parameter-memory backflow issues.

DetailsMotivation: Existing unlearning methods only target model parameters, leaving gaps where retrieval can reactivate parametric remnants or memory artifacts can reintroduce sensitive content. There's no unified strategy covering both parameter and memory pathways.

Method: Synchronized Backflow Unlearning (SBU) unlearns jointly across parameter and memory pathways. Memory pathway uses dependency closure-based unlearning to prune isolated entities and logically invalidate shared artifacts. Parameter pathway employs stochastic reference alignment to guide outputs toward high-entropy prior. Pathways integrated via synchronized dual-update protocol forming closed-loop mechanism.

Result: Experiments on medical QA benchmarks show SBU reduces traces of targeted private information across both pathways with limited degradation on retained data.

Conclusion: SBU provides a comprehensive approach to agentic unlearning that addresses both parameter and memory pathways, preventing cross-pathway recontamination while maintaining performance on retained information.

Abstract: In this paper, we introduce \textbf{agentic unlearning} which removes specified information from both model parameters and persistent memory in agents with closed-loop interaction. Existing unlearning methods target parameters alone, leaving two critical gaps: (i) parameter-memory backflow, where retrieval reactivates parametric remnants or memory artifacts reintroduce sensitive content, and (ii) the absence of a unified strategy that covers both parameter and memory pathways. We present Synchronized Backflow Unlearning (SBU), a framework that unlearns jointly across parameter and memory pathways. The memory pathway performs dependency closure-based unlearning that prunes isolated entities while logically invalidating shared artifacts. The parameter pathway employs stochastic reference alignment to guide model outputs toward a high-entropy prior. These pathways are integrated via a synchronized dual-update protocol, forming a closed-loop mechanism where memory unlearning and parametric suppression reinforce each other to prevent cross-pathway recontamination. Experiments on medical QA benchmarks show that SBU reduces traces of targeted private information across both pathways with limited degradation on retained data.

[169] A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU

Yuchen Luo, Fangyue Zhu, Ruining Zhou, Mingzhe Huang, Jian Zhu, Fanyu Fan, Wei Shao

Main category: cs.LG

TL;DR: Study of Post-Training Quantization methods on Ascend NPU for reasoning models reveals platform sensitivity, with 4-bit weight-only working for large models but 4-bit weight-activation causing instability, while 8-bit remains stable.

DetailsMotivation: PTQ is important for efficient model deployment but its effectiveness on Ascend NPU is under-explored compared to GPUs, requiring investigation of quantization methods for reasoning models on this specific hardware platform.

Method: Case study applying four PTQ algorithms (AWQ, GPTQ, SmoothQuant, FlatQuant) to DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B models, evaluating weight-only and weight-activation quantization on Ascend NPU.

Result: 4-bit weight-only quantization works for larger models, but 4-bit weight-activation suffers from layer-wise calibration instability causing logic collapse in long-context reasoning; 8-bit quantization remains stable; INT8 deployment shows latency reduction but dynamic quantization overhead limits acceleration.

Conclusion: PTQ on Ascend NPU shows platform sensitivity with practical limitations; 8-bit quantization is stable while aggressive 4-bit schemes face instability, providing practical reference for deploying quantized reasoning models on this hardware.

Abstract: Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models such as DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four distinct algorithms, including AWQ, GPTQ, SmoothQuant, and FlatQuant, to cover the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment demonstrates that although optimized kernels reduce latency, dynamic quantization overheads currently limit end-to-end acceleration. These findings offer a practical reference for the feasibility and limitations of deploying quantized reasoning models on Ascend NPU.

[170] AsynDBT: Asynchronous Distributed Bilevel Tuning for efficient In-Context Learning with Large Language Models

Hui Ma, Shaoyu Dou, Ya Liu, Fei Xing, Li Feng, Feng Pi

Main category: cs.LG

TL;DR: AsynDBT: An asynchronous distributed bilevel tuning algorithm that optimizes in-context learning samples and prompt fragments for LLMs using federated learning to address privacy concerns and heterogeneous data.

DetailsMotivation: Cloud-based LLM APIs require manual prompt tuning which is costly, while in-context learning suffers from lack of high-quality sensitive data. Federated learning offers privacy-preserving collaboration but faces straggler problems and heterogeneous data challenges.

Method: Proposes AsynDBT - an asynchronous distributed bilevel tuning algorithm that optimizes both in-context learning samples and prompt fragments based on LLM feedback. Uses distributed architecture for privacy protection and adaptability to heterogeneous computing environments.

Result: Extensive experiments on multiple benchmark datasets demonstrate the effectiveness and efficiency of AsynDBT. Theoretical analysis establishes convergence guarantees for the proposed algorithm.

Conclusion: AsynDBT successfully addresses the challenges of federated learning with in-context learning by providing privacy protection, handling heterogeneous data, and improving downstream task performance through optimized prompts and examples.

Abstract: With the rapid development of large language models (LLMs), an increasing number of applications leverage cloud-based LLM APIs to reduce usage costs. However, since cloud-based models’ parameters and gradients are agnostic, users have to manually or use heuristic algorithms to adjust prompts for intervening LLM outputs, which requiring costly optimization procedures. In-context learning (ICL) has recently emerged as a promising paradigm that enables LLMs to adapt to new tasks using examples provided within the input, eliminating the need for parameter updates. Nevertheless, the advancement of ICL is often hindered by the lack of high-quality data, which is often sensitive and different to share. Federated learning (FL) offers a potential solution by enabling collaborative training of distributed LLMs while preserving data privacy. Despite this issues, previous FL approaches that incorporate ICL have struggled with severe straggler problems and challenges associated with heterogeneous non-identically data. To address these problems, we propose an asynchronous distributed bilevel tuning (AsynDBT) algorithm that optimizes both in-context learning samples and prompt fragments based on the feedback from the LLM, thereby enhancing downstream task performance. Benefiting from its distributed architecture, AsynDBT provides privacy protection and adaptability to heterogeneous computing environments. Furthermore, we present a theoretical analysis establishing the convergence guarantees of the proposed algorithm. Extensive experiments conducted on multiple benchmark datasets demonstrate the effectiveness and efficiency of AsynDBT.

[171] EXACT: Explicit Attribute-Guided Decoding-Time Personalization

Xin Yu, Hanwen Xing, Lingzhou Xue

Main category: cs.LG

TL;DR: EXACT: A decoding-time personalization method that aligns LLM generation with limited pairwise preference feedback using interpretable attributes, with similarity-based retrieval to handle contextual preference shifts.

DetailsMotivation: Existing decoding-time personalization methods rely on implicit, less interpretable preference representations and impose rigid, context-agnostic user representations, failing to account for how preferences shift across different prompts and contexts.

Method: EXACT uses interpretable attributes to represent preferences. It first identifies user-specific attribute subsets by maximizing likelihood of preferred responses offline. For online inference, it retrieves the most semantically relevant attributes for each incoming prompt and injects them into context to steer generation, with theoretical guarantees for similarity-based retrieval.

Result: Extensive experiments on human-annotated preference datasets show EXACT consistently outperforms strong baselines in both preference modeling accuracy and personalized generation quality.

Conclusion: EXACT provides an effective decoding-time personalization approach that handles contextual preference shifts through interpretable attributes and similarity-based retrieval, offering theoretical guarantees and empirical improvements over existing methods.

Abstract: Achieving personalized alignment requires adapting large language models to each user’s evolving context. While decoding-time personalization offers a scalable alternative to training-time methods, existing methods largely rely on implicit, less interpretable preference representations and impose a rigid, context-agnostic user representation, failing to account for how preferences shift across prompts. We introduce EXACT, a new decoding-time personalization that aligns generation with limited pairwise preference feedback using a predefined set of interpretable attributes. EXACT first identifies user-specific attribute subsets by maximizing the likelihood of preferred responses in the offline stage. Then, for online inference, EXACT retrieves the most semantically relevant attributes for an incoming prompt and injects them into the context to steer generation. We establish theoretical approximation guarantees for the proposed algorithm under mild assumptions, and provably show that our similarity-based retrieval mechanism effectively mitigates contextual preference shifts, adapting to disparate tasks without pooling conflicting preferences. Extensive experiments on human-annotated preference datasets demonstrate that EXACT consistently outperforms strong baselines, including preference modeling accuracy and personalized generation quality.

[172] Can LLM Safety Be Ensured by Constraining Parameter Regions?

Zongmin Li, Jian Su, Farah Benamara, Aixin Sun

Main category: cs.LG

TL;DR: Current safety region identification methods for LLMs show low overlap and instability across datasets, failing to reliably identify consistent safety-critical parameters.

DetailsMotivation: The paper investigates whether LLMs contain stable "safety regions" - parameter subsets that directly control safety behaviors - and evaluates the reliability of current methods to identify such regions.

Method: Systematic evaluation of four safety region identification methods across different parameter granularities (individual weights to entire Transformer layers) using four families of backbone LLMs with varying sizes and ten safety identification datasets.

Result: Identified safety regions show only low to moderate overlap (measured by IoU), with overlap dropping significantly when refined using utility datasets (non-harmful queries). Current techniques fail to identify stable, dataset-agnostic safety regions.

Conclusion: Current safety region identification methods are unreliable and unstable across datasets, suggesting that the assumption of consistent safety-critical parameter subsets in LLMs may not hold with existing techniques.

Abstract: Large language models (LLMs) are often assumed to contain ``safety regions’’ – parameter subsets whose modification directly influences safety behaviors. We conduct a systematic evaluation of four safety region identification methods spanning different parameter granularities, from individual weights to entire Transformer layers, across four families of backbone LLMs with varying sizes. Using ten safety identification datasets, we find that the identified safety regions exhibit only low to moderate overlap, as measured by IoU. The overlap drops significantly when the safety regions are further refined using utility datasets (\ie non-harmful queries). These results suggest that current techniques fail to reliably identify a stable, dataset-agnostic safety region.

[173] Pimp My LLM: Leveraging Variability Modeling to Tune Inference Hyperparameters

Nada Zine, Clément Quinton, Romain Rouvoy

Main category: cs.LG

TL;DR: Applying software engineering variability modeling techniques to systematically analyze and optimize LLM inference configurations for energy efficiency, latency, and accuracy trade-offs.

DetailsMotivation: LLMs have high computational demands raising sustainability concerns, especially during inference which dominates total compute usage. The vast configuration space of inference servers makes exhaustive empirical evaluation infeasible due to combinatorial explosion.

Method: Treat LLMs as configurable systems and apply variability management techniques. Represent generation hyperparameters and constraints using feature-based variability models, sample representative configurations, measure energy consumption, latency, and accuracy, then learn predictive models from collected data.

Result: Variability modeling effectively manages LLM inference configuration complexity, enables systematic analysis of hyperparameter effects and interactions, reveals trade-offs, and supports accurate prediction of inference behavior from limited measurements.

Conclusion: This work bridges software engineering and machine learning by leveraging variability modeling for efficient and sustainable LLM configuration, opening a new research direction.

Abstract: Large Language Models (LLMs) are being increasingly used across a wide range of tasks. However, their substantial computational demands raise concerns about the energy efficiency and sustainability of both training and inference. Inference, in particular, dominates total compute usage, making its optimization crucial. Recent research has explored optimization techniques and analyzed how configuration choices influence energy consumption. Yet, the vast configuration space of inference servers makes exhaustive empirical evaluation infeasible due to combinatorial explosion. In this paper, we introduce a new perspective on this problem by treating LLMs as configurable systems and applying variability management techniques to systematically analyze inference-time configuration choices. We evaluate our approach on the Hugging Face Transformers library by representing generation hyperparameters and their constraints using a feature-based variability model, sampling representative configurations, measuring their energy consumption, latency, accuracy, and learning predictive models from the collected data. Our results show that variability modeling effectively manages the complexity of LLM inference configurations. It enables systematic analysis of hyperparameters effects and interactions, reveals trade-offs, and supports accurate prediction of inference behavior from a limited number of measurements. Overall, this work opens a new research direction that bridges software engineering and machine learning by leveraging variability modeling for the efficient and sustainable configuration of LLMs.

[174] ScaleBITS: Scalable Bitwidth Search for Hardware-Aligned Mixed-Precision LLMs

Xinlin Li, Timothy Chou, Josh Fromm, Zichang Liu, Yunjie Pan, Christina Fragouli

Main category: cs.LG

TL;DR: ScaleBITS is a mixed-precision quantization framework for LLMs that enables automated, fine-grained bitwidth allocation under memory constraints while preserving hardware efficiency through block-wise weight partitioning and bi-directional channel reordering.

DetailsMotivation: Current post-training weight quantization methods struggle below 4 bits due to non-uniform weight sensitivity and lack principled precision allocation, with existing solutions suffering from high runtime overhead or relying on heuristic approaches.

Method: Proposes ScaleBITS framework with: 1) new sensitivity analysis, 2) hardware-aligned block-wise weight partitioning with bi-directional channel reordering, 3) formulation of global bitwidth allocation as constrained optimization with scalable greedy algorithm approximation.

Result: ScaleBITS significantly improves over uniform-precision quantization (up to +36%) and outperforms state-of-the-art sensitivity-aware baselines (up to +13%) in ultra-low-bit regime without adding runtime overhead.

Conclusion: ScaleBITS provides an effective mixed-precision quantization framework that enables principled, fine-grained bitwidth allocation for LLMs while maintaining hardware efficiency, addressing key challenges in ultra-low-bit quantization.

Abstract: Post-training weight quantization is crucial for reducing the memory and inference cost of large language models (LLMs), yet pushing the average precision below 4 bits remains challenging due to highly non-uniform weight sensitivity and the lack of principled precision allocation. Existing solutions use irregular fine-grained mixed-precision with high runtime overhead or rely on heuristics or highly constrained precision allocation strategies. In this work, we propose ScaleBITS, a mixed-precision quantization framework that enables automated, fine-grained bitwidth allocation under a memory budget while preserving hardware efficiency. Guided by a new sensitivity analysis, we introduce a hardware-aligned, block-wise weight partitioning scheme, powered by bi-directional channel reordering. We formulate global bitwidth allocation as a constrained optimization problem and develop a scalable approximation to the greedy algorithm, enabling end-to-end principled allocation. Experiments show that ScaleBITS significantly improves over uniform-precision quantization (up to +36%) and outperforms state-of-the-art sensitivity-aware baselines (up to +13%) in ultra-low-bit regime, without adding runtime overhead.

[175] Certified Learning under Distribution Shift: Sound Verification and Identifiable Structure

Chandrasekhar Gokavarapu, Sudhakar Gadde, Y. Rajasekhar, S. R. Bhargava

Main category: cs.LG

TL;DR: Theoretical framework for certifying model performance under distribution shift with explicit risk bounds based on computable shift metrics and model parameters.

DetailsMotivation: To provide rigorous theoretical guarantees for machine learning models under distribution shift, moving beyond empirical evaluation to formal certification of performance with explicit assumptions and failure mode characterization.

Method: Develops a unified theoretical framework with explicit inequalities that bound excess risk under distribution shift using computable shift metrics and model parameters, with verifiable regularity and complexity constraints.

Result: Provides explicit upper bounds for excess risk under distribution shift, sound verification for nontrivial model sizes, and interpretability through identifiability conditions rather than post hoc explanations.

Conclusion: The framework enables formal certification of model performance under distribution shift with explicit assumptions, characterizes non-certifiable regimes, and isolates failure modes for better understanding of model robustness.

Abstract: Proposition. Let $f$ be a predictor trained on a distribution $P$ and evaluated on a shifted distribution $Q$. Under verifiable regularity and complexity constraints, the excess risk under shift admits an explicit upper bound determined by a computable shift metric and model parameters. We develop a unified framework in which (i) risk under distribution shift is certified by explicit inequalities, (ii) verification of learned models is sound for nontrivial sizes, and (iii) interpretability is enforced through identifiability conditions rather than post hoc explanations. All claims are stated with explicit assumptions. Failure modes are isolated. Non-certifiable regimes are characterized.

Konstanty Subbotko

Main category: cs.LG

TL;DR: MIDAS modernizes DARTS by replacing static architecture parameters with dynamic, input-specific parameters computed via self-attention, improving NAS robustness through patchwise attention and topology-aware search spaces.

DetailsMotivation: Differentiable NAS methods like DARTS are efficient but have limited practical adoption. The authors aim to improve DARTS by making architecture selection more robust and dynamic through input-specific parameterization.

Method: MIDAS replaces static architecture parameters with dynamic parameters computed via self-attention. It localizes selection by computing parameters separately for each spatial patch and introduces a parameter-free, topology-aware search space that models node connectivity for selecting incoming edges.

Result: Achieves 97.42% top-1 on CIFAR-10 and 83.38% on CIFAR-100 in DARTS space. Consistently finds globally optimal architectures in NAS-Bench-201. Sets state-of-the-art on two of four RDARTS search spaces on CIFAR-10.

Conclusion: MIDAS successfully modernizes DARTS with dynamic, input-specific architecture parameters, demonstrating improved performance and robustness. Analysis shows patchwise attention improves operation discrimination and produces class-aware, unimodal parameter distributions.

Abstract: Differentiable Neural Architecture Search (NAS) provides efficient, gradient-based methods for automatically designing neural networks, yet its adoption remains limited in practice. We present MIDAS, a novel approach that modernizes DARTS by replacing static architecture parameters with dynamic, input-specific parameters computed via self-attention. To improve robustness, MIDAS (i) localizes the architecture selection by computing it separately for each spatial patch of the activation map, and (ii) introduces a parameter-free, topology-aware search space that models node connectivity and simplifies selecting the two incoming edges per node. We evaluate MIDAS on the DARTS, NAS-Bench-201, and RDARTS search spaces. In DARTS, it reaches 97.42% top-1 on CIFAR-10 and 83.38% on CIFAR-100. In NAS-Bench-201, it consistently finds globally optimal architectures. In RDARTS, it sets the state of the art on two of four search spaces on CIFAR-10. We further analyze why MIDAS works, showing that patchwise attention improves discrimination among candidate operations, and the resulting input-specific parameter distributions are class-aware and predominantly unimodal, providing reliable guidance for decoding.

[177] Parallel Complex Diffusion for Scalable Time Series Generation

Rongyao Cai, Yuxi Wan, Kexin Zhang, Ming Jin, Zhiqiang Ge, Qingsong Wen, Yong Liu

Main category: cs.LG

TL;DR: PaCoDi is a spectral-native architecture for time series generation that uses Fourier Transform to decouple temporal signals into decorrelated spectral components, achieving computational efficiency through Hermitian symmetry compression and theoretical foundations in complex diffusion processes.

DetailsMotivation: Traditional temporal diffusion models face fundamental trade-offs between representational capacity and computational efficiency due to local entanglement and O(L²) attention costs. There's a need for architectures that can model long-range dependencies in time series generation more efficiently.

Method: Introduces PaCoDi (Parallel Complex Diffusion) that operates in the frequency domain using Fourier Transform as a diagonalizing operator. Uses Mean Field Theory approximation with interactive correction, exploits Hermitian symmetry for 50% sequence compression, and derives Heteroscedastic Loss for non-isotropic noise handling. Generalizes discrete DDPM to continuous-time Frequency SDEs.

Result: PaCoDi outperforms existing baselines in both generation quality and inference speed, achieving 50% reduction in attention FLOPs without information loss through Hermitian symmetry compression.

Conclusion: PaCoDi offers a theoretically grounded and computationally efficient solution for time series modeling by fundamentally altering the problem topology through spectral domain operations and complex diffusion processes.

Abstract: Modeling long-range dependencies in time series generation poses a fundamental trade-off between representational capacity and computational efficiency. Traditional temporal diffusion models suffer from local entanglement and the $\mathcal{O}(L^2)$ cost of attention mechanisms. We address these limitations by introducing PaCoDi (Parallel Complex Diffusion), a spectral-native architecture that decouples generative modeling in the frequency domain. PaCoDi fundamentally alters the problem topology: the Fourier Transform acts as a diagonalizing operator, converting locally coupled temporal signals into globally decorrelated spectral components. Theoretically, we prove the Quadrature Forward Diffusion and Conditional Reverse Factorization theorem, demonstrating that the complex diffusion process can be split into independent real and imaginary branches. We bridge the gap between this decoupled theory and data reality using a \textbf{Mean Field Theory (MFT) approximation} reinforced by an interactive correction mechanism. Furthermore, we generalize this discrete DDPM to continuous-time Frequency SDEs, rigorously deriving the Spectral Wiener Process describe the differential spectral Brownian motion limit. Crucially, PaCoDi exploits the Hermitian Symmetry of real-valued signals to compress the sequence length by half, achieving a 50% reduction in attention FLOPs without information loss. We further derive a rigorous Heteroscedastic Loss to handle the non-isotropic noise distribution on the compressed manifold. Extensive experiments show that PaCoDi outperforms existing baselines in both generation quality and inference speed, offering a theoretically grounded and computationally efficient solution for time series modeling.

[178] Provable Adversarial Robustness in In-Context Learning

Di Zhang

Main category: cs.LG

TL;DR: Theoretical analysis of in-context learning robustness under adversarial distribution shifts, showing model capacity scales with robustness and adversarial settings increase sample complexity.

DetailsMotivation: Current theoretical explanations for in-context learning assume test tasks come from similar distributions as pretraining, overlooking adversarial distribution shifts that threaten real-world reliability. There's a need for distributionally robust guarantees.

Method: Introduce a distributionally robust meta-learning framework with worst-case performance guarantees under Wasserstein-based distribution shifts. Focus on linear self-attention Transformers and derive non-asymptotic bounds linking adversarial perturbation strength, model capacity, and number of in-context examples.

Result: Model robustness scales with square root of capacity (ρ_max ∝ √m), while adversarial settings impose sample complexity penalty proportional to square of perturbation magnitude (N_ρ - N_0 ∝ ρ²). Experiments on synthetic tasks confirm these scaling laws.

Conclusion: Model capacity serves as fundamental resource for distributional robustness in in-context learning. Findings advance theoretical understanding of ICL’s limits under adversarial conditions.

Abstract: Large language models adapt to new tasks through in-context learning (ICL) without parameter updates. Current theoretical explanations for this capability assume test tasks are drawn from a distribution similar to that seen during pretraining. This assumption overlooks adversarial distribution shifts that threaten real-world reliability. To address this gap, we introduce a distributionally robust meta-learning framework that provides worst-case performance guarantees for ICL under Wasserstein-based distribution shifts. Focusing on linear self-attention Transformers, we derive a non-asymptotic bound linking adversarial perturbation strength ($ρ$), model capacity ($m$), and the number of in-context examples ($N$). The analysis reveals that model robustness scales with the square root of its capacity ($ρ_{\text{max}} \propto \sqrt{m}$), while adversarial settings impose a sample complexity penalty proportional to the square of the perturbation magnitude ($N_ρ- N_0 \propto ρ^2$). Experiments on synthetic tasks confirm these scaling laws. These findings advance the theoretical understanding of ICL’s limits under adversarial conditions and suggest that model capacity serves as a fundamental resource for distributional robustness.

[179] Bayesian Optimality of In-Context Learning with Selective State Spaces

Di Zhang, Jiaqi Xing

Main category: cs.LG

TL;DR: Selective State Space Models (SSMs) implement Bayesian optimal sequential prediction for in-context learning, outperforming Transformers on tasks with temporally correlated noise due to superior statistical efficiency.

DetailsMotivation: The paper aims to provide a new theoretical framework for understanding in-context learning (ICL) in Transformers, moving beyond the popular "implicit gradient descent" interpretation to a Bayesian optimal sequential prediction perspective.

Method: The authors formalize ICL as meta-learning over latent sequence tasks, specifically focusing on Linear Gaussian State Space Models (LG-SSMs). They prove that a meta-trained selective SSM asymptotically implements the Bayes-optimal predictor (posterior predictive mean). They establish statistical separation from gradient descent by constructing tasks with temporally correlated noise where Bayesian prediction outperforms empirical risk minimization.

Result: Experiments on synthetic LG-SSM tasks and character-level Markov benchmarks show selective SSMs converge faster to Bayes-optimal risk, demonstrate superior sample efficiency with longer contexts in structured-noise settings, and track latent states more robustly than linear Transformers.

Conclusion: The paper reframes ICL from “implicit optimization” to “optimal inference,” explaining the efficiency of selective SSMs and offering a principled basis for architecture design based on Bayesian optimal sequential prediction.

Abstract: We propose Bayesian optimal sequential prediction as a new principle for understanding in-context learning (ICL). Unlike interpretations framing Transformers as performing implicit gradient descent, we formalize ICL as meta-learning over latent sequence tasks. For tasks governed by Linear Gaussian State Space Models (LG-SSMs), we prove a meta-trained selective SSM asymptotically implements the Bayes-optimal predictor, converging to the posterior predictive mean. We further establish a statistical separation from gradient descent, constructing tasks with temporally correlated noise where the optimal Bayesian predictor strictly outperforms any empirical risk minimization (ERM) estimator. Since Transformers can be seen as performing implicit ERM, this demonstrates selective SSMs achieve lower asymptotic risk due to superior statistical efficiency. Experiments on synthetic LG-SSM tasks and a character-level Markov benchmark confirm selective SSMs converge faster to Bayes-optimal risk, show superior sample efficiency with longer contexts in structured-noise settings, and track latent states more robustly than linear Transformers. This reframes ICL from “implicit optimization” to “optimal inference,” explaining the efficiency of selective SSMs and offering a principled basis for architecture design.

[180] Investigating Target Class Influence on Neural Network Compressibility for Energy-Autonomous Avian Monitoring

Nina Brolich, Simon Geis, Maximilian Kasper, Alexander Barnhill, Axel Plinge, Dominik Seuß

Main category: cs.LG

TL;DR: Efficient bird species detection on microcontrollers for wildlife monitoring using compressed neural networks

DetailsMotivation: Biodiversity loss requires efficient wildlife monitoring; bird songs are ideal identifiers but traditional methods are costly; existing ML solutions require heavy computational resources; need for efficient edge AI on microcontrollers for field deployment

Method: Train and compress neural networks for bird species detection on microcontroller units (MCUs); evaluate compression rates for different numbers of target classes; benchmark on various hardware platforms; assess energy autonomy feasibility

Result: Achieved significant compression rates with minimal performance loss; demonstrated feasibility of deploying energy-autonomous devices for bird monitoring

Conclusion: Efficient AI on MCUs enables practical, low-cost avian monitoring in the field, addressing biodiversity assessment needs with minimal computational resources

Abstract: Biodiversity loss poses a significant threat to humanity, making wildlife monitoring essential for assessing ecosystem health. Avian species are ideal subjects for this due to their popularity and the ease of identifying them through their distinctive songs. Traditionalavian monitoring methods require manual counting and are therefore costly and inefficient. In passive acoustic monitoring, soundscapes are recorded over long periods of time. The recordings are analyzed to identify bird species afterwards. Machine learning methods have greatly expedited this process in a wide range of species and environments, however, existing solutions require complex models and substantial computational resources. Instead, we propose running machine learning models on inexpensive microcontroller units (MCUs) directly in the field. Due to the resulting hardware and energy constraints, efficient artificial intelligence (AI) architecture is required. In this paper, we present our method for avian monitoring on MCUs. We trained and compressed models for various numbers of target classes to assess the detection of multiple bird species on edge devices and evaluate the influence of the number of species on the compressibility of neural networks. Our results demonstrate significant compression rates with minimal performance loss. We also provide benchmarking results for different hardware platforms and evaluate the feasibility of deploying energy-autonomous devices.

[181] Asking Forever: Universal Activations Behind Turn Amplification in Conversational LLMs

Zachary Coalson, Bo Fang, Sanghyun Hong

Main category: cs.LG

TL;DR: Paper identifies “turn amplification” as a new failure mode in conversational LLMs where models systematically prolong multi-turn interactions without completing tasks, exploiting clarification-seeking behavior through universal activation subspaces.

DetailsMotivation: Multi-turn interaction length significantly impacts operational costs of conversational LLMs. The paper aims to identify and understand a new failure mode where models can be manipulated to unnecessarily prolong conversations through systematic exploitation of clarification-seeking behavior.

Method: The researchers take a mechanistic perspective to identify query-independent, universal activation subspaces associated with clarification-seeking responses. They demonstrate attacks through both supply-chain (fine-tuning) and runtime (low-level parameter corruptions) methods that shift models toward abstract, clarification-seeking behavior across prompts.

Result: Across multiple instruction-tuned LLMs and benchmarks, the attack substantially increases turn count while remaining compliant. The paper shows that existing defenses offer limited protection against this emerging class of failures.

Conclusion: Turn amplification represents a new, scalable failure mode in conversational LLMs that exploits fundamental conversational dynamics rather than prompt-level manipulations, highlighting security vulnerabilities that persist across tasks and prompts.

Abstract: Multi-turn interaction length is a dominant factor in the operational costs of conversational LLMs. In this work, we present a new failure mode in conversational LLMs: turn amplification, in which a model consistently prolongs multi-turn interactions without completing the underlying task. We show that an adversary can systematically exploit clarification-seeking behavior$-$commonly encouraged in multi-turn conversation settings$-$to scalably prolong interactions. Moving beyond prompt-level behaviors, we take a mechanistic perspective and identify a query-independent, universal activation subspace associated with clarification-seeking responses. Unlike prior cost-amplification attacks that rely on per-turn prompt optimization, our attack arises from conversational dynamics and persists across prompts and tasks. We show that this mechanism provides a scalable pathway to induce turn amplification: both supply-chain attacks via fine-tuning and runtime attacks through low-level parameter corruptions consistently shift models toward abstract, clarification-seeking behavior across prompts. Across multiple instruction-tuned LLMs and benchmarks, our attack substantially increases turn count while remaining compliant. We also show that existing defenses offer limited protection against this emerging class of failures.

[182] Multi-material Multi-physics Topology Optimization with Physics-informed Gaussian Process Priors

Xiangyu Sun, Shirin Hosseinmardi, Amin Yousefpour, Ramin Bostanabad

Main category: cs.LG

TL;DR: A physics-informed Gaussian process framework for topology optimization that handles multi-material, multi-physics problems by representing variables with neural network-parameterized GP priors and minimizing a combined loss function.

DetailsMotivation: Existing ML-based topology optimization methods struggle with high computational costs, spectral bias, and handling complex multi-material, multi-physics problems with non-self-adjoint objective/constraint functions.

Method: Proposes physics-informed Gaussian processes (PIGPs) where primary, adjoint, and design variables are represented by independent GP priors with neural network-parameterized mean functions. All parameters are estimated simultaneously by minimizing a loss combining objective function, multi-physics potential energy functionals, and design constraints, with accelerated training via novel differentiation/integration schemes.

Result: Demonstrated effectiveness on benchmark TO problems including compliance minimization, heat conduction optimization, compliant mechanism design, and thermo-mechanical TO with single- and multi-material settings, generating super-resolution topologies with sharp interfaces and physically interpretable material distributions.

Conclusion: The PIGP framework successfully solves coupled multi-physics and design problems simultaneously, validated using both open-source codes and commercial software COMSOL.

Abstract: Machine learning (ML) has been increasingly used for topology optimization (TO). However, most existing ML-based approaches focus on simplified benchmark problems due to their high computational cost, spectral bias, and difficulty in handling complex physics. These limitations become more pronounced in multi-material, multi-physics problems whose objective or constraint functions are not self-adjoint. To address these challenges, we propose a framework based on physics-informed Gaussian processes (PIGPs). In our approach, the primary, adjoint, and design variables are represented by independent GP priors whose mean functions are parametrized via neural networks whose architectures are particularly beneficial for surrogate modeling of PDE solutions. We estimate all parameters of our model simultaneously by minimizing a loss that is based on the objective function, multi-physics potential energy functionals, and design-constraints. We demonstrate the capability of the proposed framework on benchmark TO problems such as compliance minimization, heat conduction optimization, and compliant mechanism design under single- and multi-material settings. Additionally, we leverage thermo-mechanical TO with single- and multi-material options as a representative multi-physics problem. We also introduce differentiation and integration schemes that dramatically accelerate the training process. Our results demonstrate that the proposed PIGP framework can effectively solve coupled multi-physics and design problems simultaneously – generating super-resolution topologies with sharp interfaces and physically interpretable material distributions. We validate these results using open-source codes and the commercial software package COMSOL.

[183] Grassmannian Mixture-of-Experts: Concentration-Controlled Routing on Subspace Manifolds

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Main category: cs.LG

TL;DR: GrMoE introduces a Grassmannian manifold routing framework using Matrix Bingham distributions to control sparsity via concentration parameters, replacing discrete top-k selection with continuous sparsity control and preventing expert collapse.

DetailsMotivation: Standard MoE routing with softmax gating lacks principled mechanisms to control the tradeoff between sparsity and utilization, and suffers from expert collapse issues. There's a need for a geometrically principled approach that provides continuous sparsity control and better load balancing.

Method: Proposes Grassmannian MoE (GrMoE) operating on Grassmannian manifold of subspaces, where gating weights come from Matrix Bingham distribution concentration parameters. Uses amortized variational inference for posterior routing distributions, enabling uncertainty-aware expert assignment. Provides formal bounds relating concentration spectrum to routing entropy, expected top-k mass, and expert collapse.

Result: Achieves 0% routing collapse across all seeds on MoE language models (350M-8expert, 1.3B-16expert, 2.7B-32expert). Comparable or better perplexity with 15-30% improved load balance. Smooth monotonic relationship between concentration and effective sparsity enables post-hoc tuning. Token-level analysis shows experts learn heterogeneous concentration values correlating with linguistic specialization.

Conclusion: GrMoE provides a geometrically principled routing framework with continuous sparsity control, formal theoretical guarantees, and practical benefits including zero expert collapse, improved load balancing, and interpretable routing behavior based on linguistic specialization.

Abstract: Mixture-of-Experts models rely on learned routers to assign tokens to experts, yet standard softmax gating provides no principled mechanism to control the tradeoff between sparsity and utilization. We propose Grassmannian MoE (GrMoE), a routing framework that operates on the Grassmannian manifold of subspaces, where gating weights arise from the concentration parameters of Matrix Bingham distributions. This construction yields a single, interpretable knob – the concentration matrix $Λ$ – that continuously controls routing entropy, replacing discrete top-$k$ selection with a smooth, geometrically principled sparsity mechanism. We further develop an amortized variational inference procedure for posterior routing distributions, enabling uncertainty-aware expert assignment that naturally resists expert collapse. We formally prove tight bounds relating the Bingham concentration spectrum to routing entropy, expected top-$k$ mass, and an exponential bound on expert collapse, establishing the first formal theory of concentration-controlled sparsity. On synthetic routing tasks, a 350M-parameter MoE language model with 8 experts, a 1.3B-parameter model with 16 experts, and a 2.7B-parameter model with 32 experts, GrMoE achieves 0% routing collapse across all seeds, comparable or better perplexity with 15–30% improved load balance, and a smooth monotonic relationship between concentration and effective sparsity that enables post-hoc sparsity tuning without retraining. Token-level analysis reveals that experts learn heterogeneous concentration values that correlate with linguistic specialization, providing interpretable routing behavior.

[184] Calibrated Adaptation: Bayesian Stiefel Manifold Priors for Reliable Parameter-Efficient Fine-Tuning

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Main category: cs.LG

TL;DR: SBA introduces a Bayesian PEFT framework with Matrix Langevin priors on Stiefel manifolds for calibrated uncertainty in adapter-based fine-tuning, outperforming LoRA on uncertainty metrics while maintaining task performance.

DetailsMotivation: Current parameter-efficient fine-tuning methods like LoRA lack principled uncertainty estimates, leading to poorly calibrated predictions and unreliable behavior under domain shift. There's a need for Bayesian approaches that properly handle the geometric structure of adapter parameters.

Method: SBA places Matrix Langevin priors over orthonormal adapter factors on the Stiefel manifold and performs approximate posterior inference via tangent space Laplace approximation with geodesic retraction, avoiding structural variance inflation inherent in flat-space projections.

Result: SBA achieves task performance comparable to LoRA and DoRA while reducing Expected Calibration Error by 18-34%, improving selective prediction AUROC by 12-25% under domain shift, and outperforming deep ensembles of five LoRA models on OOD detection with fewer parameters.

Conclusion: Proper geometric structure for uncertainty placement (Stiefel manifold) matters more than simply adding Bayesian treatment to adapters, enabling calibrated predictive uncertainty without recalibration while maintaining parameter efficiency.

Abstract: Parameter-efficient fine-tuning methods such as LoRA enable practical adaptation of large language models but provide no principled uncertainty estimates, leading to poorly calibrated predictions and unreliable behavior under domain shift. We introduce Stiefel-Bayes Adapters (SBA), a Bayesian PEFT framework that places a Matrix Langevin prior over orthonormal adapter factors on the Stiefel manifold $\St$ and performs approximate posterior inference via tangent space Laplace approximation with geodesic retraction. Unlike Gaussian priors in flat space projected onto orthogonality constraints, our prior on the manifold naturally encodes the inductive bias that adapter subspaces should be well conditioned and orthogonal, while the posterior provides calibrated predictive uncertainty without recalibration. We prove formally that the tangent space approximation strictly avoids the structural variance inflation inherent in projecting from ambient space, establishing a rigorous theoretical advantage for intrinsic manifold inference. Across GLUE and SuperGLUE benchmarks on RoBERTa-large, LLaMA-2-7B, LLaMA-2-13B, Mistral-7B, and Qwen2.5-7B, domain shift evaluations, selective prediction protocols, and an abstractive summarization task, SBA achieves task performance comparable to LoRA and DoRA while reducing Expected Calibration Error by 18 to 34% over deterministic baselines, improving selective prediction AUROC by 12 to 25% under domain shift, and outperforming deep ensembles of five LoRA models on OOD detection at a fraction of the parameter cost. Our results demonstrate that where you place uncertainty, on the right geometric structure, matters more than simply adding any Bayesian treatment to adapters.

[185] Avoid What You Know: Divergent Trajectory Balance for GFlowNets

Pedro Dall’Antonia, Tiago da Silva, Daniel Csillag, Salem Lahlou, Diego Mesquita

Main category: cs.LG

TL;DR: ACE improves GFlowNet exploration by training a separate exploration network to find high-reward states in underexplored regions, enhancing sampling efficiency and diversity.

DetailsMotivation: GFlowNets struggle with efficient exploration during training, as existing methods waste samples on already well-approximated regions rather than focusing on novel, high-probability areas.

Method: Proposes Adaptive Complementary Exploration (ACE) which uses two GFlowNets: a canonical one for target distribution sampling, and an exploration network specifically trained to search for high-reward states in underexplored regions.

Result: ACE significantly outperforms prior methods in approximating target distributions and discovering diverse high-reward states across extensive experiments.

Conclusion: ACE provides a principled approach to improve GFlowNet exploration efficiency by complementary training of exploration and exploitation networks.

Abstract: Generative Flow Networks (GFlowNets) are a flexible family of amortized samplers trained to generate discrete and compositional objects with probability proportional to a reward function. However, learning efficiency is constrained by the model’s ability to rapidly explore diverse high-probability regions during training. To mitigate this issue, recent works have focused on incentivizing the exploration of unvisited and valuable states via curiosity-driven search and self-supervised random network distillation, which tend to waste samples on already well-approximated regions of the state space. In this context, we propose Adaptive Complementary Exploration (ACE), a principled algorithm for the effective exploration of novel and high-probability regions when learning GFlowNets. To achieve this, ACE introduces an exploration GFlowNet explicitly trained to search for high-reward states in regions underexplored by the canonical GFlowNet, which learns to sample from the target distribution. Through extensive experiments, we show that ACE significantly improves upon prior work in terms of approximation accuracy to the target distribution and discovery rate of diverse high-reward states.

[186] Causality by Abstraction: Symbolic Rule Learning in Multivariate Timeseries with Large Language Models

Preetom Biswas, Giulia Pedrielli, K. Selçuk Candan

Main category: cs.LG

TL;DR: ruleXplain uses LLMs to extract formal causal rules from simulation-driven dynamical systems by generating counterfactual inputs and prompting LLMs to produce verifiable symbolic rules with temporal operators.

DetailsMotivation: Traditional approaches fail to produce generalized and interpretable explanations for causal relations in timeseries data with delayed effects, especially when multiple distinct input trajectories yield similar outputs.

Method: Uses a principled simulator to generate diverse counterfactual input trajectories that yield similar target outputs, clusters these inputs, and prompts LLMs to generate symbolic rules with temporal operators and delay semantics. Includes closed-loop refinement for rule consistency.

Result: Validated on PySIRTEM epidemic simulator (testing rates to infection counts) and EnergyPlus building energy simulator (temperature/solar irradiance to electricity needs). Experiments show efficacy through input reconstruction, causal encoding evaluation, and generalization tests across unseen output trends.

Conclusion: ruleXplain provides a framework for extracting interpretable causal rules from complex dynamical systems using LLMs, enabling formal explanations of input-output relations in simulation-driven systems.

Abstract: Inferring causal relations in timeseries data with delayed effects is a fundamental challenge, especially when the underlying system exhibits complex dynamics that cannot be captured by simple functional mappings. Traditional approaches often fail to produce generalized and interpretable explanations, as multiple distinct input trajectories may yield nearly indistinguishable outputs. In this work, we present ruleXplain, a framework that leverages Large Language Models (LLMs) to extract formal explanations for input-output relations in simulation-driven dynamical systems. Our method introduces a constrained symbolic rule language with temporal operators and delay semantics, enabling LLMs to generate verifiable causal rules through structured prompting. ruleXplain relies on the availability of a principled model (e.g., a simulator) that maps multivariate input time series to output time series. Within ruleXplain, the simulator is used to generate diverse counterfactual input trajectories that yield similar target output, serving as candidate explanations. Such counterfactual inputs are clustered and provided as context to the LLM, which is tasked with the generation of symbolic rules encoding the joint temporal trends responsible for the patterns observable in the output times series. A closed-loop refinement process ensures rule consistency and semantic validity. We validate the framework using the PySIRTEM epidemic simulator, mapping testing rate inputs to daily infection counts; and the EnergyPlus building energy simulator, observing temperature and solar irradiance inputs to electricity needs. For validation, we perform three classes of experiments: (1) the efficacy of the ruleset through input reconstruction; (2) ablation studies evaluating the causal encoding of the ruleset; and (3) generalization tests of the extracted rules across unseen output trends with varying phase dynamics.

[187] Financial time series augmentation using transformer based GAN architecture

Andrzej Podobiński, Jarosław A. Chudziak

Main category: cs.LG

TL;DR: GAN-based data augmentation using transformer-based GAN (TTS-GAN) improves financial time series forecasting accuracy by generating synthetic data to overcome data scarcity.

DetailsMotivation: Financial time series data is often scarce and volatile, making it difficult to train deep learning models effectively. Data scarcity leads to suboptimal training and poor generalization in forecasting models.

Method: Use transformer-based GAN (TTS-GAN) to generate synthetic financial time series data, then train LSTM forecasting models on augmented datasets. Propose novel quality metric combining Dynamic Time Warping (DTW) and modified Deep Dataset Dissimilarity Measure (DeD-iMs) to evaluate generated data quality.

Result: Training LSTM models on GAN-augmented datasets significantly improves forecasting accuracy compared to using real data alone, demonstrated on Bitcoin and S&P500 price data across various forecasting horizons.

Conclusion: GAN-based data augmentation effectively overcomes data scarcity in financial domains, enhancing predictive capabilities of deep learning forecasting models.

Abstract: Time-series forecasting is a critical task across many domains, from engineering to economics, where accurate predictions drive strategic decisions. However, applying advanced deep learning models in challenging, volatile domains like finance is difficult due to the inherent limitation and dynamic nature of financial time series data. This scarcity often results in sub-optimal model training and poor generalization. The fundamental challenge lies in determining how to reliably augment scarce financial time series data to enhance the predictive accuracy of deep learning forecasting models. Our main contribution is a demonstration of how Generative Adversarial Networks (GANs) can effectively serve as a data augmentation tool to overcome data scarcity in the financial domain. Specifically, we show that training a Long Short-Term Memory (LSTM) forecasting model on a dataset augmented with synthetic data generated by a transformer-based GAN (TTS-GAN) significantly improves the forecasting accuracy compared to using real data alone. We confirm these results across different financial time series (Bitcoin and S&P500 price data) and various forecasting horizons. Furthermore, we propose a novel, time series specific quality metric that combines Dynamic Time Warping (DTW) and a modified Deep Dataset Dissimilarity Measure (DeD-iMs) to reliably monitor the training progress and evaluate the quality of the generated data. These findings provide compelling evidence for the benefits of GAN-based data augmentation in enhancing financial predictive capabilities.

[188] MePoly: Max Entropy Polynomial Policy Optimization

Hang Liu, Sangli Teng, Maani Ghaffari

Main category: cs.LG

TL;DR: MePoly introduces polynomial energy-based models for policy parameterization in stochastic optimal control, providing explicit tractable probability densities to capture multi-modal solutions while enabling exact entropy maximization.

DetailsMotivation: Conventional parametric policies struggle to represent multi-modal solutions in stochastic optimal control, while diffusion-based policies lack explicit probability densities that complicate policy-gradient optimization. There's a need for policy parameterizations that can capture multi-modality while maintaining tractable probability densities.

Method: Proposes MePoly, a novel policy parameterization based on polynomial energy-based models. The method provides explicit, tractable probability density functions, enabling exact entropy maximization. The approach is grounded in the classical moment problem and leverages universal approximation capabilities for arbitrary distributions.

Result: Empirical demonstrations show that MePoly effectively captures complex non-convex manifolds and outperforms baselines in performance across diverse benchmarks.

Conclusion: MePoly bridges the gap between multi-modal representation and tractable probability densities in stochastic optimal control, offering a promising approach for complex decision-making problems that require capturing solution multi-modality.

Abstract: Stochastic Optimal Control provides a unified mathematical framework for solving complex decision-making problems, encompassing paradigms such as maximum entropy reinforcement learning(RL) and imitation learning(IL). However, conventional parametric policies often struggle to represent the multi-modality of the solutions. Though diffusion-based policies are aimed at recovering the multi-modality, they lack an explicit probability density, which complicates policy-gradient optimization. To bridge this gap, we propose MePoly, a novel policy parameterization based on polynomial energy-based models. MePoly provides an explicit, tractable probability density, enabling exact entropy maximization. Theoretically, we ground our method in the classical moment problem, leveraging the universal approximation capabilities for arbitrary distributions. Empirically, we demonstrate that MePoly effectively captures complex non-convex manifolds and outperforms baselines in performance across diverse benchmarks.

[189] MantisV2: Closing the Zero-Shot Gap in Time Series Classification with Synthetic Data and Test-Time Strategies

Vasilii Feofanov, Songkang Wen, Jianfeng Zhang, Lujia Pan, Ievgen Redko

Main category: cs.LG

TL;DR: MantisV2 and Mantis+ are improved time series foundation models that achieve state-of-the-art zero-shot performance through synthetic pre-training, architectural refinements, and enhanced test-time methodologies.

DetailsMotivation: To develop better foundation models for time series classification that can serve as universal feature extractors, addressing the performance gap between frozen and fine-tuned encoders in existing models like Mantis.

Method: Three main approaches: 1) Mantis+ pre-trained entirely on synthetic time series, 2) MantisV2 with refined architecture through controlled ablation studies, and 3) enhanced test-time methodology using intermediate-layer representations, refined output-token aggregation, self-ensembling, and cross-model embedding fusion.

Result: Extensive experiments on UCR, UEA, Human Activity Recognition (HAR) benchmarks, and EEG datasets show that MantisV2 and Mantis+ consistently outperform prior time series foundation models, achieving state-of-the-art zero-shot performance.

Conclusion: The proposed methods significantly strengthen zero-shot feature extraction for time series, with MantisV2 and Mantis+ establishing new state-of-the-art performance as universal feature extractors for diverse downstream tasks.

Abstract: Developing foundation models for time series classification is of high practical relevance, as such models can serve as universal feature extractors for diverse downstream tasks. Although early models such as Mantis have shown the promise of this approach, a substantial performance gap remained between frozen and fine-tuned encoders. In this work, we introduce methods that significantly strengthen zero-shot feature extraction for time series. First, we introduce Mantis+, a variant of Mantis pre-trained entirely on synthetic time series. Second, through controlled ablation studies, we refine the architecture and obtain MantisV2, an improved and more lightweight encoder. Third, we propose an enhanced test-time methodology that leverages intermediate-layer representations and refines output-token aggregation. In addition, we show that performance can be further improved via self-ensembling and cross-model embedding fusion. Extensive experiments on UCR, UEA, Human Activity Recognition (HAR) benchmarks, and EEG datasets show that MantisV2 and Mantis+ consistently outperform prior time series foundation models, achieving state-of-the-art zero-shot performance.

[190] Influence-Preserving Proxies for Gradient-Based Data Selection in LLM Fine-tuning

Sirui Chen, Yunzhe Qi, Mengting Ai, Yifan Sun, Ruizhong Qiu, Jiaru Zou, Jingrui He

Main category: cs.LG

TL;DR: Iprox: A framework for creating influence-preserving proxy models from target LLMs to enable scalable gradient-based data selection for supervised fine-tuning.

DetailsMotivation: Gradient-based data selection methods (TracIn, Influence Functions) are computationally expensive for large language models, while off-the-shelf smaller proxies are suboptimal due to unclear learning dynamics, inflexible size adjustment, and inability to align with target model's influence estimation.

Method: Two-stage framework: 1) Low-rank compression to preserve target model’s influence information, 2) Aligning stage to align both model gradients and logits, creating proxies that flexibly control computational cost while retaining target model’s influence.

Result: Iprox consistently outperforms off-the-shelf proxies and baselines across diverse LLM families and tasks. On Qwen3-4B, a 1.5B Iprox proxy outperforms larger 1.7B off-the-shelf proxy. On Llama3.2, achieves better performance than baselines while reducing computational cost by more than half relative to full 3B model.

Conclusion: Iprox provides effective influence-preserving proxies that make gradient-based data selection more scalable for LLMs, addressing computational bottlenecks in supervised fine-tuning data selection.

Abstract: Supervised fine-tuning (SFT) relies critically on selecting training data that most benefits a model’s downstream performance. Gradient-based data selection methods such as TracIn and Influence Functions leverage influence to identify useful samples, but their computational cost scales poorly, making them impractical for multi-billion-parameter large language models (LLMs). A common alternative is to use off-the-shelf smaller models as proxies, but they remain suboptimal since their learning dynamics are unclear, their sizes cannot be flexibly adjusted, and they cannot be further aligned with the target model in terms of gradient-based influence estimation. To address these challenges, we introduce Iprox, a two-stage framework that derives influence-preserving proxies directly from the target model. It first applies a low-rank compression stage to preserve influence information of the target model, and then an aligning stage to align both model gradients and logits, thereby constructing proxies that flexibly control computational cost while retaining the target model’s influence. Experimental results across diverse LLM families and evaluation tasks show that Iprox consistently outperforms off-the-shelf proxies and baseline methods. On Qwen3-4B, a 1.5B proxy constructed with Iprox achieves stronger performance than the larger 1.7B off-the-shelf proxy. Notably, on Llama3.2, Iprox achieves better performance than baselines while reducing computational cost by more than half relative to the full 3B model. These results show that Iprox provides effective influence-preserving proxies, making gradient-based data selection more scalable for LLMs.

[191] ADAPT: Hybrid Prompt Optimization for LLM Feature Visualization

João N. Cardoso, Arlindo L. Oliveira, Bruno Martins

Main category: cs.LG

TL;DR: ADAPT is a hybrid method combining beam search initialization with adaptive gradient-guided mutation for feature visualization in LLMs, outperforming prior methods on Gemma 2 2B latents.

DetailsMotivation: Feature visualization for understanding learned directions in LLM activation space is challenging due to text's discrete nature and existing prompt optimization techniques being prone to local minima, requiring domain-specific solutions.

Method: ADAPT combines beam search initialization with adaptive gradient-guided mutation, specifically designed to overcome local minima issues in LLM feature visualization.

Result: ADAPT consistently outperforms prior methods across layers and latent types on Sparse Autoencoder latents from Gemma 2 2B, with evaluation using metrics grounded in dataset activation statistics.

Conclusion: Feature visualization for LLMs is tractable but requires design assumptions tailored to the domain, with ADAPT providing an effective solution.

Abstract: Understanding what features are encoded by learned directions in LLM activation space requires identifying inputs that strongly activate them. Feature visualization, which optimizes inputs to maximally activate a target direction, offers an alternative to costly dataset search approaches, but remains underexplored for LLMs due to the discrete nature of text. Furthermore, existing prompt optimization techniques are poorly suited to this domain, which is highly prone to local minima. To overcome these limitations, we introduce ADAPT, a hybrid method combining beam search initialization with adaptive gradient-guided mutation, designed around these failure modes. We evaluate on Sparse Autoencoder latents from Gemma 2 2B, proposing metrics grounded in dataset activation statistics to enable rigorous comparison, and show that ADAPT consistently outperforms prior methods across layers and latent types. Our results establish that feature visualization for LLMs is tractable, but requires design assumptions tailored to the domain.

[192] Machine Learning Based Prediction of Surgical Outcomes in Chronic Rhinosinusitis from Clinical Data

Sayeed Shafayet Chowdhury, Karen D’Souza, V. Siva Kakumani, Snehasis Mukhopadhyay, Shiaofen Fang, Rodney J. Schlosser, Daniel M. Beswick, Jeremiah A. Alt, Jess C. Mace, Zachary M. Soler, Timothy L. Smith, Vijay R. Ramakrishnan

Main category: cs.LG

TL;DR: Machine learning models predict surgical benefit for chronic rhinosinusitis patients using preoperative data, achieving 85% accuracy and outperforming expert clinicians.

DetailsMotivation: To address the complex surgical decision-making in chronic rhinosinusitis (CRS) by developing ML models that can predict surgical benefit using only preoperative data, potentially reducing costs and improving patient outcomes.

Method: Supervised machine learning models trained on prospectively collected data from an observational intervention trial, using preoperative features to predict surgical benefit measured by SNOT-22 outcomes, with multiple algorithms including an ensemble approach.

Result: Best model achieved ~85% classification accuracy on training data and 80% accuracy on held-out test set, exceeding expert clinicians’ average accuracy of 75.6%.

Conclusion: ML models can accurately predict surgical candidacy in CRS using preoperative data, demonstrating potential to augment clinical decision-making and support personalized care.

Abstract: Artificial intelligence (AI) has increasingly transformed medical prognostics by enabling rapid and accurate analysis across imaging and pathology. However, the investigation of machine learning predictions applied to prospectively collected, standardized data from observational clinical intervention trials remains underexplored, despite its potential to reduce costs and improve patient outcomes. Chronic rhinosinusitis (CRS), a persistent inflammatory disease of the paranasal sinuses lasting more than three months, imposes a substantial burden on quality of life (QoL) and societal cost. Although many patients respond to medical therapy, others with refractory symptoms often pursue surgical intervention. Surgical decision-making in CRS is complex, as it must weigh known procedural risks against uncertain individualized outcomes. In this study, we evaluated supervised machine learning models for predicting surgical benefit in CRS, using the Sino-Nasal Outcome Test-22 (SNOT-22) as the primary patient-reported outcome. Our prospectively collected cohort from an observational intervention trial comprised patients who all underwent surgery; we investigated whether models trained only on preoperative data could identify patients who might not have been recommended surgery prior to the procedure. Across multiple algorithms, including an ensemble approach, our best model achieved approximately 85% classification accuracy, providing accurate and interpretable predictions of surgical candidacy. Moreover, on a held-out set of 30 cases spanning mixed difficulty, our model achieved 80% accuracy, exceeding the average prediction accuracy of expert clinicians (75.6%), demonstrating its potential to augment clinical decision-making and support personalized CRS care.

[193] Two Calm Ends and the Wild Middle: A Geometric Picture of Memorization in Diffusion Models

Nick Dodson, Xinyu Gao, Qingsong Wang, Yusu Wang, Zhengchao Wan

Main category: cs.LG

TL;DR: Diffusion models can memorize training data, creating privacy risks. The paper introduces a geometric framework partitioning noise schedules into three regimes based on data coverage and posterior concentration, revealing non-uniform memorization risk with a danger zone at medium noise levels.

DetailsMotivation: Diffusion models generate high-quality samples but can memorize training data, raising serious privacy concerns. Understanding when memorization versus generalization occurs remains unclear, particularly regarding where along the noise schedule memorization is induced, how data geometry influences it, and how phenomena at different noise scales interact.

Method: Introduces a geometric framework that partitions the noise schedule into three regimes based on: 1) coverage properties of training data by Gaussian shells, and 2) concentration behavior of the posterior. These are identified as fundamental objects governing memorization and generalization in diffusion models.

Result: Reveals that memorization risk is highly non-uniform across noise levels, with a danger zone at medium noise levels where memorization is most pronounced. Small and large noise regimes resist memorization through different mechanisms: small noise avoids memorization due to limited training coverage, while large noise exhibits low posterior concentration and admits provably near linear Gaussian denoising behavior.

Conclusion: Proposes a geometry-informed targeted intervention that mitigates memorization for the medium noise regime based on identified geometric conditions. The framework provides insights into when and why diffusion models memorize versus generalize.

Abstract: Diffusion models generate high-quality samples but can also memorize training data, raising serious privacy concerns. Understanding the mechanisms governing when memorization versus generalization occurs remains an active area of research. In particular, it is unclear where along the noise schedule memorization is induced, how data geometry influences it, and how phenomena at different noise scales interact. We introduce a geometric framework that partitions the noise schedule into three regimes based on the coverage properties of training data by Gaussian shells and the concentration behavior of the posterior, which we argue are two fundamental objects governing memorization and generalization in diffusion models. This perspective reveals that memorization risk is highly non-uniform across noise levels. We further identify a danger zone at medium noise levels where memorization is most pronounced. In contrast, both the small and large noise regimes resist memorization, but through fundamentally different mechanisms: small noise avoids memorization due to limited training coverage, while large noise exhibits low posterior concentration and admits a provably near linear Gaussian denoising behavior. For the medium noise regime, we identify geometric conditions through which we propose a geometry-informed targeted intervention that mitigates memorization.

[194] Dual Length Codes for Lossless Compression of BFloat16

Aditya Agrawal, Albert Magyar, Hiteshwar Eswaraiah, Patrick Sheridan, Pradeep Janedula, Ravi Krishnan Venkatesan, Krishna Nair, Ravi Iyer

Main category: cs.LG

TL;DR: Dual Length Codes: A hybrid compression scheme for LLM tensors that balances compression efficiency with faster decoding by using short 4-bit codes for frequent symbols and longer 9-bit codes for others, with single prefix bit distinction.

DetailsMotivation: Network bandwidth bottlenecks in LLM parallelization and collective operations need efficient compression. Existing solutions like Huffman codes have slow bit-sequential decoding and high hardware complexity, while universal codes like Exponential-Golomb are faster but don't exploit symbol frequency distributions.

Method: Analyzed BFloat16 tensors from Gemma model, found top 8 most frequent symbols account for ~50% probability. Assign short 4-bit codes to these 8 symbols, longer 9-bit codes to remaining 248 symbols. Use single prefix bit to distinguish code lengths. Implement with small 8-entry LUT for encoding/decoding.

Result: Achieves 18.6% compressibility vs 21.3% for Huffman codes, but significantly speeds up decoding and simplifies hardware complexity compared to Huffman’s deep tree traversals.

Conclusion: Dual Length Codes offer practical trade-off between compression efficiency and decoding speed for LLM tensor compression, addressing network bandwidth bottlenecks in LLM training/serving with simpler hardware implementation.

Abstract: Training and serving Large Language Models (LLMs) relies heavily on parallelization and collective operations, which are frequently bottlenecked by network bandwidth. Lossless compression using e.g., Huffman codes can alleviate the issue, however, Huffman codes suffer from slow, bit-sequential decoding and high hardware complexity due to deep tree traversals. Universal codes e.g., Exponential-Golomb codes are faster to decode but do not exploit the symbol frequency distributions. To address these limitations, this paper introduces Dual Length Codes, a hybrid approach designed to balance compression efficiency with decoding speed. Analyzing BFloat16 tensors from the Gemma model, we observed that the top 8 most frequent symbols account for approximately 50% of the cumulative probability. These 8 symbols are assigned a short 4 bit code. The remaining 248 symbols are assigned a longer 9 bit code. The coding scheme uses a single prefix bit to distinguish between the two code lengths. The scheme uses a small Look Up Table with only 8 entries for encoding and decoding. The scheme achieves a compressibility of 18.6% in comparison to 21.3% achieved by Huffman codes, but it significantly speeds up the decoding and simplifies the hardware complexity.

[195] NIMMGen: Learning Neural-Integrated Mechanistic Digital Twins with LLMs

Zihan Guan, Rituparna Datta, Mengxuan Hu, Shunshun Liu, Aiying Zhang, Prasanna Balachandran, Sheng Li, Anil Vullikanti

Main category: cs.LG

TL;DR: LLM-based framework for generating mechanistic models from data with neural integration and iterative refinement for improved correctness and practical validity.

DetailsMotivation: Current LLM-based approaches for constructing mechanistic models oversimplify real-world conditions, leaving reliability unclear. Need evaluation under realistic settings with partial observations and diverse objectives.

Method: Introduces NIMM evaluation framework and NIMMgen agentic framework that uses neural integration and iterative refinement to enhance code correctness and practical validity of LLM-generated mechanistic models.

Result: Experiments across three scientific domains show strong performance. Learned mechanistic models support counterfactual intervention simulation.

Conclusion: NIMMgen addresses fundamental challenges in current baselines and enables reliable mechanistic modeling under realistic conditions.

Abstract: Mechanistic models encode scientific knowledge about dynamical systems and are widely used in downstream scientific and policy applications. Recent work has explored LLM-based agentic frameworks to automatically construct mechanistic models from data; however, existing problem settings substantially oversimplify real-world conditions, leaving it unclear whether LLM-generated mechanistic models are reliable in practice. To address this gap, we introduce the Neural-Integrated Mechanistic Modeling (NIMM) evaluation framework, which evaluates LLM-generated mechanistic models under realistic settings with partial observations and diversified task objectives. Our evaluation reveals fundamental challenges in current baselines, ranging from model effectiveness to code-level correctness. Motivated by these findings, we design NIMMgen, an agentic framework for neural-integrated mechanistic modeling that enhances code correctness and practical validity through iterative refinement. Experiments across three datasets from diversified scientific domains demonstrate its strong performance. We also show that the learned mechanistic models support counterfactual intervention simulation.

[196] MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance

Narjes Nourzad, Carlee Joe-Wong

Main category: cs.LG

TL;DR: MIRA integrates LLM-derived knowledge into a structured memory graph to guide RL agents in sparse-reward environments, reducing LLM dependency while improving early learning.

DetailsMotivation: RL agents struggle with high sample complexity in sparse/delayed reward settings. While LLMs can provide useful priors and decompositions, continuous LLM supervision is impractical and unreliable. Need a method to amortize LLM knowledge into persistent memory.

Method: Proposes MIRA with structured memory graph storing high-return experiences and LLM outputs (subgoals, trajectories). Derives utility signal from memory to softly adjust advantage estimation. Memory evolves over time, utility decays as policy improves, preserving convergence guarantees.

Result: Empirically outperforms RL baselines, achieves returns comparable to frequent LLM supervision approaches while requiring substantially fewer online LLM queries. Theoretical analysis shows utility-based shaping improves early-stage learning.

Conclusion: MIRA effectively amortizes LLM knowledge into persistent memory, reducing LLM dependency while accelerating RL in sparse-reward environments through structured memory-guided learning.

Abstract: Reinforcement learning (RL) agents often suffer from high sample complexity in sparse or delayed reward settings due to limited prior structure. Large language models (LLMs) can provide subgoal decompositions, plausible trajectories, and abstract priors that facilitate early learning. However, heavy reliance on LLM supervision introduces scalability constraints and dependence on potentially unreliable signals. We propose MIRA (Memory-Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early training. The graph stores decision-relevant information, including trajectory segments and subgoal structures, and is constructed from both the agent’s high-return experiences and LLM outputs. This design amortizes LLM queries into a persistent memory rather than requiring continuous real-time supervision. From this memory graph, we derive a utility signal that softly adjusts advantage estimation to influence policy updates without modifying the underlying reward function. As training progresses, the agent’s policy gradually surpasses the initial LLM-derived priors, and the utility term decays, preserving standard convergence guarantees. We provide theoretical analysis showing that utility-based shaping improves early-stage learning in sparse-reward environments. Empirically, MIRA outperforms RL baselines and achieves returns comparable to approaches that rely on frequent LLM supervision, while requiring substantially fewer online LLM queries. Project webpage: https://narjesno.github.io/MIRA/

[197] Neural Prior Estimation: Learning Class Priors from Latent Representations

Masoud Yavari, Payman Moallem

Main category: cs.LG

TL;DR: NPE learns feature-conditioned log-prior estimates from latent representations to address class imbalance bias in neural networks, enabling bias-aware prediction without explicit class counts.

DetailsMotivation: Class imbalance induces systematic bias in deep neural networks by imposing skewed effective class priors, which leads to poor performance on underrepresented classes. Current methods often require explicit class counts or distribution-specific hyperparameters.

Method: Neural Prior Estimator (NPE) learns feature-conditioned log-prior estimates from latent representations using Prior Estimation Modules trained jointly with the backbone via a one-way logistic loss. Under Neural Collapse regime, NPE recovers class log-prior up to additive constant. The learned estimate is incorporated into logit adjustment (NPE-LA) for bias-aware prediction.

Result: Experiments on long-tailed CIFAR and imbalanced semantic segmentation benchmarks (STARE, ADE20K) show consistent improvements, particularly for underrepresented classes, demonstrating the effectiveness of NPE for learned prior estimation and imbalance-aware prediction.

Conclusion: NPE offers a lightweight, theoretically justified approach to learned prior estimation and imbalance-aware prediction that doesn’t require explicit class counts or distribution-specific hyperparameters, effectively addressing class imbalance bias in neural networks.

Abstract: Class imbalance induces systematic bias in deep neural networks by imposing a skewed effective class prior. This work introduces the Neural Prior Estimator (NPE), a framework that learns feature-conditioned log-prior estimates from latent representations. NPE employs one or more Prior Estimation Modules trained jointly with the backbone via a one-way logistic loss. Under the Neural Collapse regime, NPE is analytically shown to recover the class log-prior up to an additive constant, providing a theoretically grounded adaptive signal without requiring explicit class counts or distribution-specific hyperparameters. The learned estimate is incorporated into logit adjustment, forming NPE-LA, a principled mechanism for bias-aware prediction. Experiments on long-tailed CIFAR and imbalanced semantic segmentation benchmarks (STARE, ADE20K) demonstrate consistent improvements, particularly for underrepresented classes. NPE thus offers a lightweight and theoretically justified approach to learned prior estimation and imbalance-aware prediction.

[198] Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards

Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama

Main category: cs.LG

TL;DR: Proposes gradient regularization (GR) as an alternative to KL penalties in RLHF/RLVR to prevent reward hacking by biasing policy updates toward regions where reward models are more accurate.

DetailsMotivation: Address reward hacking in RLHF/RLVR where policies exploit inaccurate reward models, moving beyond traditional KL penalty approaches.

Method: Theoretical connection between reward accuracy and flatness of optima, then uses gradient regularization to bias training to flatter regions where reward models are more accurate. Proposes explicit GR with efficient finite-difference estimates.

Result: GR outperforms KL penalty across diverse RL experiments: achieves higher GPT-judged win-rate in RLHF, avoids format overfitting in rule-based math rewards, and prevents judge hacking in LLM-as-a-Judge tasks.

Conclusion: Gradient regularization provides a principled alternative to KL penalties for maintaining reward model accuracy and preventing reward hacking in language model post-training.

Abstract: Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy may exploit inaccuracies of the reward and learn an unintended behavior. Most previous works address this by limiting the policy update with a Kullback-Leibler (KL) penalty towards a reference model. We propose a different framing: Train the LM in a way that biases policy updates towards regions in which the reward is more accurate. First, we derive a theoretical connection between the accuracy of a reward model and the flatness of an optimum at convergence. Gradient regularization (GR) can then be used to bias training to flatter regions and thereby maintain reward model accuracy. We confirm these results by showing that the gradient norm and reward accuracy are empirically correlated in RLHF. We then show that Reference Resets of the KL penalty implicitly use GR to find flatter regions with higher reward accuracy. We further improve on this by proposing to use explicit GR with an efficient finite-difference estimate. Empirically, GR performs better than a KL penalty across a diverse set of RL experiments with LMs. GR achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on the format in rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.

[199] Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning

Narjes Nourzad, Carlee Joe-Wong

Main category: cs.LG

TL;DR: LLM-guided RL using memory graphs for subgoal discovery and trajectory guidance to improve sample efficiency in sparse reward environments

DetailsMotivation: Reinforcement learning in sparse or delayed reward environments suffers from high sample complexity. While LLMs can help with subgoal discovery and trajectory guidance, frequent LLM calls raise scalability and reliability concerns.

Method: Construct a memory graph encoding subgoals and trajectories from both LLM guidance and agent’s successful rollouts. Derive a utility function evaluating trajectory alignment with prior successful strategies, which shapes the advantage function to provide additional guidance without altering rewards. Uses primarily offline input with only occasional online LLM queries.

Result: Preliminary experiments in benchmark environments show improved sample efficiency and faster early learning compared to baseline RL methods, with final returns comparable to methods requiring frequent LLM interaction.

Conclusion: The memory graph approach reduces dependence on continuous LLM supervision while maintaining the benefits of LLM-guided exploration, offering a scalable solution for RL in sparse reward environments.

Abstract: In environments with sparse or delayed rewards, reinforcement learning (RL) incurs high sample complexity due to the large number of interactions needed for learning. This limitation has motivated the use of large language models (LLMs) for subgoal discovery and trajectory guidance. While LLMs can support exploration, frequent reliance on LLM calls raises concerns about scalability and reliability. We address these challenges by constructing a memory graph that encodes subgoals and trajectories from both LLM guidance and the agent’s own successful rollouts. From this graph, we derive a utility function that evaluates how closely the agent’s trajectories align with prior successful strategies. This utility shapes the advantage function, providing the critic with additional guidance without altering the reward. Our method relies primarily on offline input and only occasional online queries, avoiding dependence on continuous LLM supervision. Preliminary experiments in benchmark environments show improved sample efficiency and faster early learning compared to baseline RL methods, with final returns comparable to methods that require frequent LLM interaction.

[200] JAX-Privacy: A library for differentially private machine learning

Ryan McKenna, Galen Andrew, Borja Balle, Vadym Doroshenko, Arun Ganesh, Weiwei Kong, Alex Kurakin, Brendan McMahan, Mikhail Pravilov

Main category: cs.LG

TL;DR: JAX-Privacy is a library for differentially private machine learning that provides modular primitives for privacy-preserving ML workflows with usability, flexibility, and efficiency.

DetailsMotivation: To simplify the deployment of robust and performant differentially private machine learning mechanisms, addressing the gap between research and practical implementation while serving both researchers needing customization and practitioners wanting out-of-the-box solutions.

Method: Developed as a JAX-based library with verified, modular primitives for all aspects of differentially private ML including batch selection, gradient clipping, noise addition, accounting, and auditing, incorporating recent research advances.

Result: A comprehensive library that brings together recent research on differentially private ML, providing both customizable components for researchers and ready-to-use solutions for practitioners.

Conclusion: JAX-Privacy successfully bridges the gap between differential privacy research and practical implementation, offering a versatile tool for privacy-preserving machine learning with verified components.

Abstract: JAX-Privacy is a library designed to simplify the deployment of robust and performant mechanisms for differentially private machine learning. Guided by design principles of usability, flexibility, and efficiency, JAX-Privacy serves both researchers requiring deep customization and practitioners who want a more out-of-the-box experience. The library provides verified, modular primitives for critical components for all aspects of the mechanism design including batch selection, gradient clipping, noise addition, accounting, and auditing, and brings together a large body of recent research on differentially private ML.

[201] Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory

Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos

Main category: cs.LG

TL;DR: Information-theoretic analysis of Chain-of-Thought monitorability, identifying two error sources and proposing training methods to improve CoT monitoring while preventing reward hacking.

DetailsMotivation: Chain-of-Thought monitors analyze LLM reasoning traces to detect problematic outputs, but current approaches have limitations. The paper aims to understand the theoretical foundations of CoT monitorability and develop methods to improve monitoring effectiveness while avoiding degeneration of reasoning quality.

Method: 1) Information-theoretic analysis showing non-zero mutual information between CoT and output is necessary but insufficient for monitorability. 2) Identifies two error sources: information gap (monitor’s ability to extract CoT information) and elicitation error (approximation of optimal monitoring function). 3) Proposes two training approaches: oracle-based method rewarding models for CoTs that maximize monitor accuracy, and label-free approach maximizing conditional mutual information between outputs and CoTs.

Result: Both proposed methods significantly improve monitor accuracy across multiple environments while preventing CoT degeneration. The approaches mitigate reward hacking when task rewards are imperfectly specified, showing practical effectiveness in enhancing CoT monitoring systems.

Conclusion: CoT monitorability can be systematically improved through targeted training objectives. The paper provides both theoretical foundations and practical methods for enhancing reasoning trace analysis in LLMs, offering solutions to improve monitoring while maintaining reasoning quality.

Abstract: Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation. In this paper, we use information-theoretic analysis to show that non-zero mutual information between CoT and output is a necessary but not sufficient condition for CoT monitorability. We identify two sources of approximation error that may undermine the performance of CoT monitors in practice: information gap, which measures the extent to which the monitor can extract the information available in CoT, and elicitation error, which measures the extent to which the monitor approximates the optimal monitoring function. We further demonstrate that CoT monitorability can be systematically improved through targeted training objectives. To this end, we propose two complementary approaches: (a) an oracle-based method that directly rewards the monitored model for producing CoTs that maximize monitor accuracy, and (b) a more practical, label-free approach that maximizes conditional mutual information between outputs and CoTs. Across multiple different environments, we show both methods significantly improve monitor accuracy while preventing CoT degeneration even when training against a monitor, thereby mitigating reward hacking when the task reward is imperfectly specified.

[202] Causal Neighbourhood Learning for Invariant Graph Representations

Simi Job, Xiaohui Tao, Taotao Cai, Haoran Xie, Jianming Yong

Main category: cs.LG

TL;DR: CNL-GNN is a causal intervention framework for graph neural networks that identifies and preserves causally relevant connections while reducing spurious correlations through counterfactual neighborhood generation and adaptive edge perturbation.

DetailsMotivation: Graph data often contain noisy and spurious correlations that mask true causal relationships, making traditional GNNs rely on spurious connections and limiting their generalization across different graphs and robustness under distribution shifts.

Method: Proposes Causal Neighbourhood Learning with Graph Neural Networks (CNL-GNN) that performs causal interventions on graph structure through counterfactual neighbourhood generation and adaptive edge perturbation guided by learnable importance masking and attention-based mechanisms, combined with disentanglement of causal features from confounding factors.

Result: Extensive experiments on four publicly available datasets, including multiple domain variants of one dataset, demonstrate that CNL-GNN outperforms state-of-the-art GNN models in robust classification.

Conclusion: CNL-GNN improves causal graph learning beyond traditional feature-based methods by learning invariant node representations that are robust and generalize well across different graph structures.

Abstract: Graph data often contain noisy and spurious correlations that mask the true causal relationships, which are essential for enabling graph models to make predictions based on the underlying causal structure of the data. Dependence on spurious connections makes it challenging for traditional Graph Neural Networks (GNNs) to generalize effectively across different graphs. Furthermore, traditional aggregation methods tend to amplify these spurious patterns, limiting model robustness under distribution shifts. To address these issues, we propose Causal Neighbourhood Learning with Graph Neural Networks (CNL-GNN), a novel framework that performs causal interventions on graph structure. CNL-GNN effectively identifies and preserves causally relevant connections and reduces spurious influences through the generation of counterfactual neighbourhoods and adaptive edge perturbation guided by learnable importance masking and an attention-based mechanism. In addition, by combining structural-level interventions with the disentanglement of causal features from confounding factors, the model learns invariant node representations that are robust and generalize well across different graph structures. Our approach improves causal graph learning beyond traditional feature-based methods, resulting in a robust classification model. Extensive experiments on four publicly available datasets, including multiple domain variants of one dataset, demonstrate that CNL-GNN outperforms state-of-the-art GNN models.

[203] COMBA: Cross Batch Aggregation for Learning Large Graphs with Context Gating State Space Models

Jiajun Shen, Yufei Jin, Yi He, xingquan Zhu

Main category: cs.LG

TL;DR: COMBA adapts state space models for large graph learning using graph context gating and cross-batch aggregation to handle graph-structured data efficiently.

DetailsMotivation: State space models (SSMs) excel at modeling long-range dependencies in sequences but struggle with graph-structured data, especially large graphs, because converting graphs to sequences is computationally expensive and inefficient for graph learning.

Method: COMBA introduces two key innovations: 1) Graph context gating that uses different neighborhood hops to control neighbor aggregation, and 2) Cross-batch aggregation that samples nodes as batches and aggregates information across batches to scale to large graphs while training graph neural networks.

Result: Theoretical analysis shows cross-batch aggregation guarantees lower error than training GNNs without aggregation. Experiments on benchmark networks demonstrate significant performance gains compared to baseline approaches.

Conclusion: COMBA successfully adapts state space models for large graph learning, providing an efficient solution for handling graph-structured data with theoretical guarantees and empirical performance improvements.

Abstract: State space models (SSMs) have recently emerged for modeling long-range dependency in sequence data, with much simplified computational costs than modern alternatives, such as transformers. Advancing SMMs to graph structured data, especially for large graphs, is a significant challenge because SSMs are sequence models and the shear graph volumes make it very expensive to convert graphs as sequences for effective learning. In this paper, we propose COMBA to tackle large graph learning using state space models, with two key innovations: graph context gating and cross batch aggregation. Graph context refers to different hops of neighborhood for each node, and graph context gating allows COMBA to use such context to learn best control of neighbor aggregation. For each graph context, COMBA samples nodes as batches, and train a graph neural network (GNN), with information being aggregated cross batches, allowing COMBA to scale to large graphs. Our theoretical study asserts that cross-batch aggregation guarantees lower error than training GNN without aggregation. Experiments on benchmark networks demonstrate significant performance gains compared to baseline approaches. Code and benchmark datasets will be released for public access.

[204] On the Semantic and Syntactic Information Encoded in Proto-Tokens for One-Step Text Reconstruction

Ivan Bondarenko, Egor Palkin, Fedor Tikunov

Main category: cs.LG

TL;DR: The paper investigates how frozen LLMs can reconstruct hundreds of tokens from just two learned proto-tokens in a single forward pass, analyzing what information these proto-tokens encode and exploring methods to impose semantic structure on them.

DetailsMotivation: To understand the latent capacity of LLMs for non-autoregressive text generation by studying what information is encoded in proto-tokens that enable one-step reconstruction of long sequences, potentially enabling more efficient text generation systems.

Method: Conducted experiments to disentangle semantic and syntactic content in the two proto-tokens, analyzed stability properties of the e-token, visualized attention patterns during reconstruction, and tested two regularization schemes using teacher embeddings: anchor-based loss and relational distillation objective.

Result: Found that m-token captures semantic information more strongly than e-token under standard optimization; anchor-based constraints trade off sharply with reconstruction accuracy; relational distillation can transfer batch-level semantic relations into proto-token space without sacrificing reconstruction quality.

Conclusion: The study supports the feasibility of future non-autoregressive sequence-to-sequence systems that predict proto-tokens as intermediate representations, potentially enabling more efficient text generation beyond the autoregressive paradigm.

Abstract: Autoregressive large language models (LLMs) generate text token-by-token, requiring n forward passes to produce a sequence of length n. Recent work, Exploring the Latent Capacity of LLMs for One-Step Text Reconstruction (Mezentsev and Oseledets), shows that frozen LLMs can reconstruct hundreds of tokens from only two learned proto-tokens in a single forward pass, suggesting a path beyond the autoregressive paradigm. In this paper, we study what information these proto-tokens encode and how they behave under reconstruction and controlled constraints. We perform a series of experiments aimed at disentangling semantic and syntactic content in the two proto-tokens, analyzing stability properties of the e-token, and visualizing attention patterns to the e-token during reconstruction. Finally, we test two regularization schemes for “imposing” semantic structure on the e-token using teacher embeddings, including an anchor-based loss and a relational distillation objective. Our results indicate that the m-token tends to capture semantic information more strongly than the e-token under standard optimization; anchor-based constraints trade off sharply with reconstruction accuracy; and relational distillation can transfer batch-level semantic relations into the proto-token space without sacrificing reconstruction quality, supporting the feasibility of future non-autoregressive seq2seq systems that predict proto-tokens as an intermediate representation.

[205] Optimizing Graph Causal Classification Models: Estimating Causal Effects and Addressing Confounders

Simi Job, Xiaohui Tao, Taotao Cai, Haoran Xie, Jianming Yong, Xin Wang

Main category: cs.LG

TL;DR: CCAGNN: A Confounder-Aware Causal Graph Neural Network framework that incorporates causal reasoning into graph learning to address limitations of traditional GNNs that rely on correlations and are sensitive to spurious patterns.

DetailsMotivation: Traditional graph ML methods like GNNs rely on correlations and are sensitive to spurious patterns and distribution changes. Causal learning is important for understanding true cause-effect relationships rather than mere associations, especially since many real-world systems are inherently causal and can be efficiently modeled with graphs.

Method: Proposes CCAGNN, a Confounder-Aware causal GNN framework that incorporates causal reasoning into graph learning. The framework supports counterfactual reasoning and aims to provide reliable predictions by isolating true causal factors and adjusting for confounders.

Result: Comprehensive experiments on six publicly available datasets from diverse domains show that CCAGNN consistently outperforms leading state-of-the-art models.

Conclusion: CCAGNN addresses the challenges of building robust and causally informed models by incorporating causal reasoning into graph learning, enabling more stable predictions under distribution shifts and interventions.

Abstract: Graph data is becoming increasingly prevalent due to the growing demand for relational insights in AI across various domains. Organizations regularly use graph data to solve complex problems involving relationships and connections. Causal learning is especially important in this context, since it helps to understand cause-effect relationships rather than mere associations. Since many real-world systems are inherently causal, graphs can efficiently model these systems. However, traditional graph machine learning methods including graph neural networks (GNNs), rely on correlations and are sensitive to spurious patterns and distribution changes. On the other hand, causal models enable robust predictions by isolating true causal factors, thus making them more stable under such shifts. Causal learning also helps in identifying and adjusting for confounders, ensuring that predictions reflect true causal relationships and remain accurate even under interventions. To address these challenges and build models that are robust and causally informed, we propose CCAGNN, a Confounder-Aware causal GNN framework that incorporates causal reasoning into graph learning, supporting counterfactual reasoning and providing reliable predictions in real-world settings. Comprehensive experiments on six publicly available datasets from diverse domains show that CCAGNN consistently outperforms leading state-of-the-art models.

[206] Breaking the Correlation Plateau: On the Optimization and Capacity Limits of Attention-Based Regressors

Jingquan Yan, Yuwei Miao, Peiran Yu, Junzhou Huang

Main category: cs.LG

TL;DR: Theoretical analysis reveals why Pearson correlation coefficient (PCC) plateaus during attention-based regression training, identifies limitations in optimization and model capacity, and proposes Extrapolative Correlation Attention (ECA) to overcome these issues.

DetailsMotivation: To understand and address the common but poorly understood phenomenon of PCC plateau during training of attention-based regression models, where PCC stops improving early while MSE continues to decrease.

Method: Theoretical analysis reveals two key limitations: 1) optimization conflict where lowering MSE suppresses PCC gradients, exacerbated by softmax attention on homogeneous data; 2) model capacity limitation where convex aggregators (including softmax) are bounded by the convex hull of inputs. Proposed Extrapolative Correlation Attention (ECA) with novel mechanisms to improve PCC optimization and extrapolate beyond convex hull.

Result: Across diverse benchmarks, including challenging homogeneous data settings, ECA consistently breaks the PCC plateau, achieving significant improvements in correlation without compromising MSE performance.

Conclusion: The PCC plateau stems from fundamental limitations in both optimization dynamics and model capacity, which can be overcome with the proposed ECA architecture that addresses both issues through theoretically-motivated mechanisms.

Abstract: Attention-based regression models are often trained by jointly optimizing Mean Squared Error (MSE) loss and Pearson correlation coefficient (PCC) loss, emphasizing the magnitude of errors and the order or shape of targets, respectively. A common but poorly understood phenomenon during training is the PCC plateau: PCC stops improving early in training, even as MSE continues to decrease. We provide the first rigorous theoretical analysis of this behavior, revealing fundamental limitations in both optimization dynamics and model capacity. First, in regard to the flattened PCC curve, we uncover a critical conflict where lowering MSE (magnitude matching) can paradoxically suppress the PCC gradient (shape matching). This issue is exacerbated by the softmax attention mechanism, particularly when the data to be aggregated is highly homogeneous. Second, we identify a limitation in the model capacity: we derived a PCC improvement limit for any convex aggregator (including the softmax attention), showing that the convex hull of the inputs strictly bounds the achievable PCC gain. We demonstrate that data homogeneity intensifies both limitations. Motivated by these insights, we propose the Extrapolative Correlation Attention (ECA), which incorporates novel, theoretically-motivated mechanisms to improve the PCC optimization and extrapolate beyond the convex hull. Across diverse benchmarks, including challenging homogeneous data setting, ECA consistently breaks the PCC plateau, achieving significant improvements in correlation without compromising MSE performance.

[207] Distribution-Free Sequential Prediction with Abstentions

Jialin Yu, Moïse Blanchard

Main category: cs.LG

TL;DR: The paper studies sequential prediction with abstention in a semi-adversarial setting where adversaries can inject corrupted instances, and learners can abstain without penalty on corrupted instances.

DetailsMotivation: To bridge the gap between classical stochastic learning (i.i.d. instances) and fully adversarial learning by allowing abstention on corrupted instances, and to address the limitation of requiring prior distributional knowledge in previous work.

Method: Proposes AbstainBoost algorithm based on boosting weak learners for distribution-free abstention learning with oblivious adversaries, with extensions to adaptive adversaries for structured function classes like linear classifiers.

Result: Achieves sublinear error for general VC classes in distribution-free abstention learning, with polynomial trade-offs between misclassification error and erroneous abstentions, complemented by matching lower bounds.

Conclusion: Distribution-free abstention learning is achievable for VC classes without prior knowledge of clean sample distribution, with fundamental trade-offs between error types.

Abstract: We study a sequential prediction problem in which an adversary is allowed to inject arbitrarily many adversarial instances in a stream of i.i.d.\ instances, but at each round, the learner may also \emph{abstain} from making a prediction without incurring any penalty if the instance was indeed corrupted. This semi-adversarial setting naturally sits between the classical stochastic case with i.i.d.\ instances for which function classes with finite VC dimension are learnable; and the adversarial case with arbitrary instances, known to be significantly more restrictive. For this problem, Goel et al. (2023) showed that, if the learner knows the distribution $μ$ of clean samples in advance, learning can be achieved for all VC classes without restrictions on adversary corruptions. This is, however, a strong assumption in both theory and practice: a natural question is whether similar learning guarantees can be achieved without prior distributional knowledge, as is standard in classical learning frameworks (e.g., PAC learning or asymptotic consistency) and other non-i.i.d.\ models (e.g., smoothed online learning). We therefore focus on the distribution-free setting where $μ$ is \emph{unknown} and propose an algorithm \textsc{AbstainBoost} based on a boosting procedure of weak learners, which guarantees sublinear error for general VC classes in \emph{distribution-free} abstention learning for oblivious adversaries. These algorithms also enjoy similar guarantees for adaptive adversaries, for structured function classes including linear classifiers. These results are complemented with corresponding lower bounds, which reveal an interesting polynomial trade-off between misclassification error and number of erroneous abstentions.

[208] On the “Induction Bias” in Sequence Models

M. Reza Ebrahimi, Michaël Defferrard, Sunny Panchal, Roland Memisevic

Main category: cs.LG

TL;DR: Transformers struggle with state tracking compared to RNNs, showing poor data efficiency and length-specific learning rather than generalized solutions.

DetailsMotivation: Despite transformers' practical success, concerns exist about their state tracking abilities, particularly in OOD generalization. This work examines in-distribution implications of these limitations.

Method: Large-scale experimental study comparing transformers and RNNs across multiple supervision regimes, analyzing data efficiency requirements relative to state-space size and sequence length, and examining weight sharing across different sequence lengths.

Result: Transformers require much more training data than RNNs as state-space size and sequence length increase. Transformers show negligible or detrimental weight sharing across lengths (learning length-specific solutions), while RNNs exhibit effective amortized learning by sharing weights across lengths.

Conclusion: State tracking remains a fundamental challenge for transformers even with matching training/evaluation distributions, highlighting architectural limitations compared to recurrent models.

Abstract: Despite the remarkable practical success of transformer-based language models, recent work has raised concerns about their ability to perform state tracking. In particular, a growing body of literature has shown this limitation primarily through failures in out-of-distribution (OOD) generalization, such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We conduct a large-scale experimental study of the data efficiency of transformers and recurrent neural networks (RNNs) across multiple supervision regimes. We find that the amount of training data required by transformers grows much more rapidly with state-space size and sequence length than for RNNs. Furthermore, we analyze the extent to which learned state-tracking mechanisms are shared across different sequence lengths. We show that transformers exhibit negligible or even detrimental weight sharing across lengths, indicating that they learn length-specific solutions in isolation. In contrast, recurrent models exhibit effective amortized learning by sharing weights across lengths, allowing data from one sequence length to improve performance on others. Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.

[209] In-Context Learning for Pure Exploration in Continuous Spaces

Alessio Russo, Yin-Ching Lee, Ryan Welch, Aldo Pacchiano

Main category: cs.LG

TL;DR: C-ICPE-TS is a meta-learning approach for continuous pure exploration that trains neural policies to map observation histories to continuous query actions and hypothesis predictions, enabling transferable sequential testing strategies without parameter updates at inference.

DetailsMotivation: Traditional pure exploration problems often have discrete hypothesis spaces, but many modern applications involve continuous spaces where hypotheses naturally coincide with query/action spaces (e.g., continuous-armed bandits, function minimization, region localization). Existing methods struggle with these continuous settings.

Method: C-ICPE-TS meta-trains deep neural policies to learn two functions: (1) mapping observation histories to next continuous query action, and (2) mapping observation histories to predicted hypothesis. The approach learns transferable sequential testing strategies directly from data and operates without parameter updates or explicit hand-crafted information models at inference time.

Result: The method is validated across multiple benchmarks including continuous best-arm identification, region localization, and function minimizer identification, demonstrating effectiveness in continuous pure exploration settings.

Conclusion: C-ICPE-TS provides a novel meta-learning framework for continuous pure exploration that can handle various continuous-space applications by learning transferable testing strategies, addressing limitations of traditional discrete-space approaches.

Abstract: In active sequential testing, also termed pure exploration, a learner is tasked with the goal to adaptively acquire information so as to identify an unknown ground-truth hypothesis with as few queries as possible. This problem, originally studied by Chernoff in 1959, has several applications: classical formulations include Best-Arm Identification (BAI) in bandits, where actions index hypotheses, and generalized search problems, where strategically chosen queries reveal partial information about a hidden label. In many modern settings, however, the hypothesis space is continuous and naturally coincides with the query/action space: for example, identifying an optimal action in a continuous-armed bandit, localizing an $ε$-ball contained in a target region, or estimating the minimizer of an unknown function from a sequence of observations. In this work, we study pure exploration in such continuous spaces and introduce Continuous In-Context Pure Exploration for this regime. We introduce C-ICPE-TS, an algorithm that meta-trains deep neural policies to map observation histories to (i) the next continuous query action and (ii) a predicted hypothesis, thereby learning transferable sequential testing strategies directly from data. At inference time, C-ICPE-TS actively gathers evidence on previously unseen tasks and infers the true hypothesis without parameter updates or explicit hand-crafted information models. We validate C-ICPE-TS across a range of benchmarks, spanning continuous best-arm identification, region localization, and function minimizer identification.

[210] Tighter Regret Lower Bound for Gaussian Process Bandits with Squared Exponential Kernel in Hypersphere

Shogo Iwazaki

Main category: cs.LG

TL;DR: Theoretical analysis of Gaussian process bandits with squared exponential kernel, establishing dimension-dependent lower bounds for cumulative and simple regret under hyperspherical domains.

DetailsMotivation: To address the open question about dimension-dependent logarithmic factor gaps between upper and lower bounds in GP bandit problems, particularly for the widely used squared exponential kernel.

Method: Algorithm-independent worst-case lower bound analysis for GP bandits with fixed reward functions in RKHS, focusing on squared exponential kernel and hyperspherical input domains.

Result: Shows Ω(√T(ln T)^d(ln ln T)^{-d}) cumulative regret and Ω(ε^{-2}(ln 1/ε)^d(ln ln 1/ε)^{-d}) time steps for ε-optimal simple regret. Provides improved O((ln T)^{d+1}(ln ln T)^{-d}) upper bound on maximum information gain.

Conclusion: Partially resolves dimension-dependent logarithmic factor gaps, guaranteeing optimality of existing best algorithms up to dimension-independent logarithmic factors for hyperspherical domains.

Abstract: We study an algorithm-independent, worst-case lower bound for the Gaussian process (GP) bandit problem in the frequentist setting, where the reward function is fixed and has a bounded norm in the known reproducing kernel Hilbert space (RKHS). Specifically, we focus on the squared exponential (SE) kernel, one of the most widely used kernel functions in GP bandits. One of the remaining open questions for this problem is the gap in the \emph{dimension-dependent} logarithmic factors between upper and lower bounds. This paper partially resolves this open question under a hyperspherical input domain. We show that any algorithm suffers $Ω(\sqrt{T (\ln T)^{d} (\ln \ln T)^{-d}})$ cumulative regret, where $T$ and $d$ represent the total number of steps and the dimension of the hyperspherical domain, respectively. Regarding the simple regret, we show that any algorithm requires $Ω(ε^{-2}(\ln \frac{1}ε)^d (\ln \ln \frac{1}ε)^{-d})$ time steps to find an $ε$-optimal point. We also provide the improved $O((\ln T)^{d+1}(\ln \ln T)^{-d})$ upper bound on the maximum information gain for the SE kernel. Our results guarantee the optimality of the existing best algorithm up to \emph{dimension-independent} logarithmic factors under a hyperspherical input domain.

[211] Subgroups of $U(d)$ Induce Natural RNN and Transformer Architectures

Joshua Nunley

Main category: cs.LG

TL;DR: A framework for sequence models with hidden states on closed subgroups of U(d), with orthogonal-state RNN and transformer implementations evaluated on text datasets.

DetailsMotivation: To develop a unified framework for sequence models with hidden states defined on closed subgroups of unitary matrices, providing a principled mathematical foundation for recurrent and transformer architectures with structured state spaces.

Method: Uses minimal axiomatic setup to derive recurrent and transformer templates from a shared skeleton, where subgroup choice acts as drop-in replacement for state space, tangent projection, and update map. Specializes to O(d) (orthogonal group) and implements orthogonal-state RNN and transformer models. Also introduces a general linear-mixing extension in tangent space.

Result: Evaluated orthogonal-state RNN and transformer models on Tiny Shakespeare and Penn Treebank under parameter-matched settings. The linear-mixing extension improved finite-budget performance in O(d) experiments.

Conclusion: Provides a general framework for sequence models with hidden states on matrix groups, with orthogonal implementations showing promising results and linear-mixing extensions offering performance improvements.

Abstract: This paper presents a direct framework for sequence models with hidden states on closed subgroups of U(d). We use a minimal axiomatic setup and derive recurrent and transformer templates from a shared skeleton in which subgroup choice acts as a drop-in replacement for state space, tangent projection, and update map. We then specialize to O(d) and evaluate orthogonal-state RNN and transformer models on Tiny Shakespeare and Penn Treebank under parameter-matched settings. We also report a general linear-mixing extension in tangent space, which applies across subgroup choices and improves finite-budget performance in the current O(d) experiments.

[212] Learning Optimal and Sample-Efficient Decision Policies with Guarantees

Daqian Shao

Main category: cs.LG

TL;DR: A thesis on causal reinforcement learning methods for offline decision-making with hidden confounders, using instrumental variables and conditional moment restrictions to learn optimal policies from observational data.

DetailsMotivation: Traditional RL requires many online interactions which can be costly/dangerous, but offline learning faces challenges from hidden confounders that cause spurious correlations and suboptimal policies.

Method: Uses instrumental variables (IVs) to identify causal effects via conditional moment restrictions (CMR), develops sample-efficient CMR algorithms with convergence guarantees, adapts to imitation learning with relaxed confounder conditions, and creates LTL learning algorithms for high-level objectives.

Result: Developed algorithms outperform state-of-the-art methods, provide convergence and optimality guarantees, and demonstrate usefulness on RL benchmarks and synthetic datasets for real-world decision-making.

Conclusion: The thesis provides robust causal RL methods for offline decision-making with hidden confounders, offering theoretical guarantees and practical improvements over existing approaches.

Abstract: The paradigm of decision-making has been revolutionised by reinforcement learning and deep learning. Although this has led to significant progress in domains such as robotics, healthcare, and finance, the use of RL in practice is challenging, particularly when learning decision policies in high-stakes applications that may require guarantees. Traditional RL algorithms rely on a large number of online interactions with the environment, which is problematic in scenarios where online interactions are costly, dangerous, or infeasible. However, learning from offline datasets is hindered by the presence of hidden confounders. Such confounders can cause spurious correlations in the dataset and can mislead the agent into taking suboptimal or adversarial actions. Firstly, we address the problem of learning from offline datasets in the presence of hidden confounders. We work with instrumental variables (IVs) to identify the causal effect, which is an instance of a conditional moment restrictions (CMR) problem. Inspired by double/debiased machine learning, we derive a sample-efficient algorithm for solving CMR problems with convergence and optimality guarantees, which outperforms state-of-the-art algorithms. Secondly, we relax the conditions on the hidden confounders in the setting of (offline) imitation learning, and adapt our CMR estimator to derive an algorithm that can learn effective imitator policies with convergence rate guarantees. Finally, we consider the problem of learning high-level objectives expressed in linear temporal logic (LTL) and develop a provably optimal learning algorithm that improves sample efficiency over existing methods. Through evaluation on reinforcement learning benchmarks and synthetic and semi-synthetic datasets, we demonstrate the usefulness of the methods developed in this thesis in real-world decision making.

[213] Understanding the Generalization of Bilevel Programming in Hyperparameter Optimization: A Tale of Bias-Variance Decomposition

Yubo Zhou, Jun Shu, Junmin Liu, Deyu Meng

Main category: cs.LG

TL;DR: The paper analyzes bias-variance decomposition in hypergradient estimation for hyperparameter optimization, proposes ensemble methods to reduce variance, and connects estimation error to performance improvements.

DetailsMotivation: Previous theoretical works on gradient-based hyperparameter optimization (HPO) mainly focus on reducing bias in hypergradient estimation while ignoring variance due to data distribution, which degrades performance in practice.

Method: Conducts bias-variance decomposition for hypergradient estimation error, provides comprehensive error bound analysis, and proposes an ensemble hypergradient strategy to reduce variance in HPO algorithms.

Result: Experimental results on regularization hyperparameter learning, data hyper-cleaning, and few-shot learning demonstrate that the variance reduction strategy improves hypergradient estimation and overall performance.

Conclusion: The paper establishes a connection between excess error and hypergradient estimation, providing theoretical understanding for empirical observations and practical improvements in HPO.

Abstract: Gradient-based hyperparameter optimization (HPO) have emerged recently, leveraging bilevel programming techniques to optimize hyperparameter by estimating hypergradient w.r.t. validation loss. Nevertheless, previous theoretical works mainly focus on reducing the gap between the estimation and ground-truth (i.e., the bias), while ignoring the error due to data distribution (i.e., the variance), which degrades performance. To address this issue, we conduct a bias-variance decomposition for hypergradient estimation error and provide a supplemental detailed analysis of the variance term ignored by previous works. We also present a comprehensive analysis of the error bounds for hypergradient estimation. This facilitates an easy explanation of some phenomena commonly observed in practice, like overfitting to the validation set. Inspired by the derived theories, we propose an ensemble hypergradient strategy to reduce the variance in HPO algorithms effectively. Experimental results on tasks including regularization hyperparameter learning, data hyper-cleaning, and few-shot learning demonstrate that our variance reduction strategy improves hypergradient estimation. To explain the improved performance, we establish a connection between excess error and hypergradient estimation, offering some understanding of empirical observations.

[214] Turbo Connection: Reasoning as Information Flow from Higher to Lower Layers

Mohan Tang, Sidi Lu

Main category: cs.LG

TL;DR: TurboConn is a novel Transformer architecture that enables longer computational paths by routing residual connections from higher layers of token t to lower layers of token t+1, improving reasoning performance on complex tasks.

DetailsMotivation: The authors argue that Transformers' reasoning power is fundamentally limited by fixed maximum computational depth per token, preventing them from solving complex multi-step problems that require iterative reasoning where each step informs the next.

Method: TurboConn introduces backward residual connections that route information from higher-layer hidden states of each token to lower layers of the next token, creating longer computational paths across tokens. This dense interaction allows information to flow across both depth and sequence dimensions.

Result: Fine-tuning pre-trained LLMs with TurboConn yields accuracy gains of 0.9% to over 10% on benchmarks like GSM8K, Parity, and multi-step arithmetic. Notably, it enables Qwen-3-1.7B to achieve 100% accuracy on Parity (vs 53.78% baseline) without full retraining or curriculum learning.

Conclusion: Computational path depth is a key factor in reasoning ability, and TurboConn provides an effective mechanism to enhance LLMs’ multi-step reasoning without significantly affecting generation latency, overcoming fixed-depth constraints in standard Transformers.

Abstract: Complex problems, whether in math, logic, or planning, are solved by humans through a sequence of steps where the result of one step informs the next. In this work, we adopt the perspective that the reasoning power of Transformers is fundamentally limited by a fixed maximum number of steps along any latent path of computation. To address this, we introduce Turbo Connection (TurboConn), a novel architecture that overcomes the fixed-depth constraint by routing multiple residual connections from the higher-layer hidden states of each token $t$ to the lower layers of token $t+1$. Fine-tuning pre-trained LLMs with our method not only yields accuracy gains of 0.9% to over 10% on benchmarks like GSM8K, Parity, and multi-step arithmetic, but also demonstrates that the density of these backward connections is critical; our dense interaction significantly outperforms “sparse” alternatives that only pass a single hidden state or vector. Notably, TurboConn can be integrated into pre-trained LLMs to overcome task-specific plateaus: while a fine-tuned Qwen-3-1.7B achieves only 53.78% on Parity, adding our architectural modification enables the model to reach 100% accuracy, all without the necessity to retrain the full model from scratch or sophisticated curriculum learning. Our results provide strong empirical evidence that the depth of the computational path is a key factor in reasoning ability, also offering a new mechanism to enhance LLMs without significantly affecting generation latency.

[215] A Geometric Probe of the Accuracy-Robustness Trade-off: Sharp Boundaries in Symmetry-Breaking Dimensional Expansion

Yu Bai, Zhe Wang, Jiarui Zhang, Dong-Xiao Zhang, Yinjun Gao, Jun-Jie Zhang

Main category: cs.LG

TL;DR: SBDE improves clean accuracy by breaking symmetry but reduces adversarial robustness due to sharp boundaries in auxiliary dimensions; mask projection restores robustness.

DetailsMotivation: To understand the geometric origin of the trade-off between clean accuracy and adversarial robustness in deep learning models.

Method: Use Symmetry-Breaking Dimensional Expansion (SBDE) to insert constant-valued pixels, breaking translational symmetry. Apply test-time mask projection to reset auxiliary pixels and analyze vulnerability.

Result: SBDE improves clean accuracy (e.g., 90.47% to 95.63% on CIFAR-10) but reduces robustness to iterative attacks. Mask projection neutralizes attacks, showing vulnerability stems from auxiliary dimensions with sharp boundaries.

Conclusion: The accuracy-robustness trade-off arises because optimization deepens attraction basins for accuracy but creates steep, fragile boundaries along auxiliary dimensions.

Abstract: The trade-off between clean accuracy and adversarial robustness is a pervasive phenomenon in deep learning, yet its geometric origin remains elusive. In this work, we utilize Symmetry-Breaking Dimensional Expansion (SBDE) as a controlled probe to investigate the mechanism underlying this trade-off. SBDE expands input images by inserting constant-valued pixels, which breaks translational symmetry and consistently improves clean accuracy (e.g., from $90.47%$ to $95.63%$ on CIFAR-10 with ResNet-18) by reducing parameter degeneracy. However, this accuracy gain comes at the cost of reduced robustness against iterative white-box attacks. By employing a test-time \emph{mask projection} that resets the inserted auxiliary pixels to their training values, we demonstrate that the vulnerability stems almost entirely from the inserted dimensions. The projection effectively neutralizes the attacks and restores robustness, revealing that the model achieves high accuracy by creating \emph{sharp boundaries} (steep loss gradients) specifically along the auxiliary axes. Our findings provide a concrete geometric explanation for the accuracy-robustness paradox: the optimization landscape deepens the basin of attraction to improve accuracy but inevitably erects steep walls along the auxiliary degrees of freedom, creating a fragile sensitivity to off-manifold perturbations.

[216] PHAST: Port-Hamiltonian Architecture for Structured Temporal Dynamics Forecasting

Shubham Bhardwaj, Chandrajit Bajaj

Main category: cs.LG

TL;DR: PHAST: A port-Hamiltonian neural architecture for learning dissipative physical systems from position-only observations, achieving stable long-horizon forecasting and meaningful parameter recovery when structure is provided.

DetailsMotivation: Real physical systems are dissipative, and forecasting their dynamics from partial observations (position-only data) is challenging. Existing methods often fail to produce stable long-horizon forecasts or recover physically meaningful parameters.

Method: PHAST uses port-Hamiltonian framework with explicit conservative-dissipative split. It decomposes Hamiltonian into potential, mass, and damping across three knowledge regimes (KNOWN, PARTIAL, UNKNOWN), employs efficient low-rank PSD/SPD parameterizations, and advances dynamics with Strang splitting.

Result: Across 13 benchmarks spanning mechanical, electrical, molecular, thermal, gravitational, and ecological systems, PHAST achieves best long-horizon forecasting among competitive baselines and enables physically meaningful parameter recovery when sufficient structural anchors are provided.

Conclusion: PHAST provides a principled approach for learning dissipative physical systems from partial observations, with explicit handling of identifiability issues and gauge freedom in parameter recovery.

Abstract: Real physical systems are dissipative – a pendulum slows, a circuit loses charge to heat – and forecasting their dynamics from partial observations is a central challenge in scientific machine learning. We address the \emph{position-only} (q-only) problem: given only generalized positions~$q_t$ at discrete times (momenta~$p_t$ latent), learn a structured model that (a)produces stable long-horizon forecasts and (b)recovers physically meaningful parameters when sufficient structure is provided. The port-Hamiltonian framework makes the conservative-dissipative split explicit via $\dot{x}=(J-R)\nabla H(x)$, guaranteeing $dH/dt\le 0$ when $R\succeq 0$. We introduce \textbf{PHAST} (Port-Hamiltonian Architecture for Structured Temporal dynamics), which decomposes the Hamiltonian into potential$V(q)$, mass$M(q)$, and damping~$D(q)$ across three knowledge regimes (KNOWN, PARTIAL, UNKNOWN), uses efficient low-rank PSD/SPD parameterizations, and advances dynamics with Strang splitting. Across thirteen q-only benchmarks spanning mechanical, electrical, molecular, thermal, gravitational, and ecological systems, PHAST achieves the best long-horizon forecasting among competitive baselines and enables physically meaningful parameter recovery when the regime provides sufficient anchors. We show that identification is fundamentally ill-posed without such anchors (gauge freedom), motivating a two-axis evaluation that separates forecasting stability from identifiability.

[217] Hardware-Friendly Input Expansion for Accelerating Function Approximation

Hu Lou, Yin-Jun Gao, Dong-Xiao Zhang, Tai-Jiao Du, Jun-Jie Zhang, Jia-Rui Zhang

Main category: cs.LG

TL;DR: A hardware-friendly function approximation method using input-space expansion with constant values to break parameter symmetries and improve neural network training convergence and accuracy.

DetailsMotivation: Neural networks for 1D function approximation suffer from slow convergence and poor generalization due to flat loss landscapes caused by parameter-space symmetries, especially for high-frequency components.

Method: Proposes input-space expansion by augmenting original 1D input with constant values (e.g., π) to form higher-dimensional vectors, breaking parameter symmetries without increasing network parameters.

Result: Significantly accelerates training convergence (12% reduction in LBFGS iterations) and enhances accuracy (66.3% MSE reduction for optimal 5D expansion), with π consistently outperforming other constants.

Conclusion: Input-space expansion is a low-cost, efficient, hardware-friendly technique that effectively breaks parameter symmetries to improve neural network function approximation performance.

Abstract: One-dimensional function approximation is a fundamental problem in scientific computing and engineering applications. While neural networks possess powerful universal approximation capabilities, their optimization process is often hindered by flat loss landscapes induced by parameter-space symmetries, leading to slow convergence and poor generalization, particularly for high-frequency components. Inspired by the principle of \emph{symmetry breaking} in physics, this paper proposes a hardware-friendly approach for function approximation through \emph{input-space expansion}. The core idea involves augmenting the original one-dimensional input (e.g., $x$) with constant values (e.g., $π$) to form a higher-dimensional vector (e.g., $[π, π, x, π, π]$), effectively breaking parameter symmetries without increasing the network’s parameter count. We evaluate the method on ten representative one-dimensional functions, including smooth, discontinuous, high-frequency, and non-differentiable functions. Experimental results demonstrate that input-space expansion significantly accelerates training convergence (reducing LBFGS iterations by 12% on average) and enhances approximation accuracy (reducing final MSE by 66.3% for the optimal 5D expansion). Ablation studies further reveal the effects of different expansion dimensions and constant selections, with $π$ consistently outperforming other constants. Our work proposes a low-cost, efficient, and hardware-friendly technique for algorithm design.

[218] Bayesian Online Model Selection

Aida Afshar, Yuke Zhang, Aldo Pacchiano

Main category: cs.LG

TL;DR: A Bayesian algorithm for online model selection in stochastic bandits with theoretical guarantees on Bayesian regret and empirical validation across various settings.

DetailsMotivation: Addresses the exploration challenge in Bayesian bandits when sampling environments from prior distributions, aiming to design adaptive strategies that explore multiple bandit learners and compete with the best one in hindsight.

Method: Introduces a new Bayesian algorithm for online model selection in stochastic bandits, with analysis of data sharing among base learners to mitigate prior mis-specification.

Result: Proves an oracle-style guarantee of O(dM√T + √(MT)) on Bayesian regret, where M is number of base learners, d is regret coefficient of optimal learner, and T is time horizon. Empirical validation shows competitive performance with best base learner.

Conclusion: The proposed Bayesian algorithm effectively addresses online model selection in stochastic bandits with theoretical guarantees and practical performance, while data sharing helps mitigate prior mis-specification issues.

Abstract: Online model selection in Bayesian bandits raises a fundamental exploration challenge: When an environment instance is sampled from a prior distribution, how can we design an adaptive strategy that explores multiple bandit learners and competes with the best one in hindsight? We address this problem by introducing a new Bayesian algorithm for online model selection in stochastic bandits. We prove an oracle-style guarantee of $O\left( d^* M \sqrt{T} + \sqrt{(MT)} \right)$ on the Bayesian regret, where $M$ is the number of base learners, $d^*$ is the regret coefficient of the optimal base learner, and $T$ is the time horizon. We also validate our method empirically across a range of stochastic bandit settings, demonstrating performance that is competitive with the best base learner. Additionally, we study the effect of sharing data among base learners and its role in mitigating prior mis-specification.

[219] Flow Actor-Critic for Offline Reinforcement Learning

Jongseong Chae, Jongeui Park, Yongjae Shin, Gyeongmin Kim, Seungyul Han, Youngchul Sung

Main category: cs.LG

TL;DR: Flow Actor-Critic: A new offline RL method using flow models for both actor and critic to handle multi-modal dataset distributions and prevent Q-value explosion.

DetailsMotivation: Offline RL datasets often have complex multi-modal distributions that require expressive policies beyond Gaussian policies. Existing methods struggle with these distributions and suffer from Q-value explosion in out-of-data regions.

Method: Proposes Flow Actor-Critic using flow models for both actor (policy) and critic (value function). Uses flow behavior proxy model from actor design as critic regularizer to prevent Q-value explosion. Jointly leverages flow models for both components.

Result: Achieves state-of-the-art performance on D4RL and OGBench offline RL benchmarks, demonstrating effectiveness in handling multi-modal distributions.

Conclusion: Flow Actor-Critic effectively handles complex multi-modal offline RL datasets by jointly using flow models for both actor and critic, with the flow behavior proxy serving as an effective critic regularizer.

Abstract: The dataset distributions in offline reinforcement learning (RL) often exhibit complex and multi-modal distributions, necessitating expressive policies to capture such distributions beyond widely-used Gaussian policies. To handle such complex and multi-modal datasets, in this paper, we propose Flow Actor-Critic, a new actor-critic method for offline RL, based on recent flow policies. The proposed method not only uses the flow model for actor as in previous flow policies but also exploits the expressive flow model for conservative critic acquisition to prevent Q-value explosion in out-of-data regions. To this end, we propose a new form of critic regularizer based on the flow behavior proxy model obtained as a byproduct of flow-based actor design. Leveraging the flow model in this joint way, we achieve new state-of-the-art performance for test datasets of offline RL including the D4RL and recent OGBench benchmarks.

[220] Improving Generalizability of Hip Fracture Risk Prediction via Domain Adaptation Across Multiple Cohorts

Shuo Sun, Meiling Zhou, Chen Zhao, Joyce H. Keyak, Nancy E. Lane, Jeffrey D. Deng, Kuan-Jui Su, Hui Shen, Hong-Wen Deng, Kui Zhang, Weihua Zhou

Main category: cs.LG

TL;DR: Domain adaptation methods improve hip fracture risk prediction across different cohorts by reducing distribution shifts between source and target datasets.

DetailsMotivation: Clinical risk prediction models often fail to generalize across different cohorts due to distribution shifts caused by clinical site, region, demographics, and measurement protocol differences. This is particularly problematic in hip fracture risk prediction where models degrade when deployed in new cohorts.

Method: Systematically evaluated three domain adaptation methods (Maximum Mean Discrepancy, Correlation Alignment, and Domain-Adversarial Neural Networks) and their combinations across three large cohorts (SOF, MrOS, UK Biobank) using shared clinical and DXA-derived features. Used outcome-free approaches that don’t require target cohort labels.

Result: Domain adaptation methods consistently outperformed source-only training. The combination of MMD, CORAL, and DANN achieved highest discrimination with AUC of 0.88 for male-only source cohort and 0.95 for female-only source cohort. Multiple method combinations delivered largest and most stable performance gains.

Conclusion: Integrating multiple domain adaptation methods produces feature representations less sensitive to dataset differences, enabling better generalization in hip fracture risk prediction without requiring target cohort labels.

Abstract: Clinical risk prediction models often fail to be generalized across cohorts because underlying data distributions differ by clinical site, region, demographics, and measurement protocols. This limitation is particularly pronounced in hip fracture risk prediction, where the performance of models trained on one cohort (the source cohort) can degrade substantially when deployed in other cohorts (target cohorts). We used a shared set of clinical and DXA-derived features across three large cohorts - the Study of Osteoporotic Fractures (SOF), the Osteoporotic Fractures in Men Study (MrOS), and the UK Biobank (UKB), to systematically evaluate the performance of three domain adaptation methods - Maximum Mean Discrepancy (MMD), Correlation Alignment (CORAL), and Domain - Adversarial Neural Networks (DANN) and their combinations. For a source cohort with males only and a source cohort with females only, domain-adaptation methods consistently showed improved performance than the no-adaptation baseline (source-only training), and the use of combinations of multiple domain adaptation methods delivered the largest and most stable gains. The method that combines MMD, CORAL, and DANN achieved the highest discrimination with the area under curve (AUC) of 0.88 for a source cohort with males only and 0.95 for a source cohort with females only), demonstrating that integrating multiple domain adaptation methods could produce feature representations that are less sensitive to dataset differences. Unlike existing methods that rely heavily on supervised tuning or assume known outcomes of samples in target cohorts, our outcome-free approaches enable the model selection under realistic deployment conditions and improve generalization of models in hip fracture risk prediction.

[221] Student Flow Modeling for School Decongestion via Stochastic Gravity Estimation and Constrained Spatial Allocation

Sebastian Felipe R. Bundoc, Paula Joy B. Martinez, Sebastian C. Ibañez, Erika Fille T. Legara

Main category: cs.LG

TL;DR: Computational framework using stochastic gravity modeling to analyze student flow patterns and simulate policy scenarios for educational subsidy programs in the Philippines, revealing geographic proximity as stronger constraint than tuition costs.

DetailsMotivation: School congestion in low- and middle-income countries impacts learning outcomes and deepens educational inequities. Subsidy programs transferring students from public to private schools often underperform due to fragmented data systems, preventing data-driven analysis of student enrollment flows and policy effectiveness.

Method: Developed computational framework synthesizing heterogeneous government data across nearly 3,000 institutions. Used stochastic gravity model estimated via negative binomial regression to derive behavioral elasticities for distance, net tuition cost, and socioeconomic determinants. Implemented doubly constrained spatial allocation mechanism to simulate student redistribution under varying subsidy amounts while respecting origin candidate pools and destination slot capacities.

Result: Found geographic proximity constrains school choice four times more strongly than tuition cost. Slot capacity, not subsidy amounts, is the binding constraint for student redistribution. Subsidy programs alone cannot resolve systemic overcrowding.

Conclusion: Computational modeling can empower education policymakers to make equitable, data-driven decisions by revealing structural constraints that shape effective resource allocation, even when resources are limited. The approach demonstrates the importance of understanding behavioral elasticities and capacity constraints in educational policy design.

Abstract: School congestion, where student enrollment exceeds school capacity, is a major challenge in low- and middle-income countries. It highly impacts learning outcomes and deepens inequities in education. While subsidy programs that transfer students from public to private schools offer a mechanism to alleviate congestion without capital-intensive construction, they often underperform due to fragmented data systems that hinder effective implementation. The Philippine Educational Service Contracting program, one of the world’s largest educational subsidy programs, exemplifies these challenges, falling short of its goal to decongest public schools. This prevents the science-based and data-driven analyses needed to understand what shapes student enrollment flows, particularly how families respond to economic incentives and spatial constraints. We introduce a computational framework for modeling student flow patterns and simulating policy scenarios. By synthesizing heterogeneous government data across nearly 3,000 institutions, we employ a stochastic gravity model estimated via negative binomial regression to derive behavioral elasticities for distance, net tuition cost, and socioeconomic determinants. These elasticities inform a doubly constrained spatial allocation mechanism that simulates student redistribution under varying subsidy amounts while respecting both origin candidate pools and destination slot capacities. We find that geographic proximity constrains school choice four times more strongly than tuition cost and that slot capacity, not subsidy amounts, is the binding constraint. Our work demonstrates that subsidy programs alone cannot resolve systemic overcrowding, and computational modeling can empower education policymakers to make equitable, data-driven decisions by revealing the structural constraints that shape effective resource allocation, even when resources are limited.

[222] Generating adversarial inputs for a graph neural network model of AC power flow

Robert Parker

Main category: cs.LG

TL;DR: The paper develops optimization methods to generate adversarial inputs that cause large errors in neural network AC power flow predictions, demonstrating vulnerabilities in graph neural network models for power systems.

DetailsMotivation: To identify vulnerabilities in neural network surrogate models for AC power flow by generating adversarial inputs that cause significant prediction errors, motivating the need for robust verification and training methods.

Method: Formulates optimization problems to generate input points that maximize errors between neural network predictions and actual AC power flow solutions. Tests on CANOS-PF graph neural network model using PFΔ benchmark library on 14-bus test grid.

Result: Generated adversarial points yield errors up to 3.4 per-unit in reactive power and 0.08 per-unit in voltage magnitude. Minimal perturbations of 0.04 per-unit in voltage magnitude on a single bus can satisfy adversarial constraints.

Conclusion: Neural network surrogate models for AC power flow are vulnerable to adversarial attacks, highlighting the need for rigorous verification and robust training methods to ensure reliability in power system applications.

Abstract: This work formulates and solves optimization problems to generate input points that yield high errors between a neural network’s predicted AC power flow solution and solutions to the AC power flow equations. We demonstrate this capability on an instance of the CANOS-PF graph neural network model, as implemented by the PF$Δ$ benchmark library, operating on a 14-bus test grid. Generated adversarial points yield errors as large as 3.4 per-unit in reactive power and 0.08 per-unit in voltage magnitude. When minimizing the perturbation from a training point necessary to satisfy adversarial constraints, we find that the constraints can be met with as little as an 0.04 per-unit perturbation in voltage magnitude on a single bus. This work motivates the development of rigorous verification and robust training methods for neural network surrogate models of AC power flow.

[223] Cut Less, Fold More: Model Compression through the Lens of Projection Geometry

Olga Saukh, Dong Wang, Haris Šikić, Yun Cheng, Lothar Thiele

Main category: cs.LG

TL;DR: Model folding (weight clustering) outperforms structured pruning for neural network compression without retraining, with theoretical guarantees of smaller reconstruction error and practical advantages at moderate-high compression rates.

DetailsMotivation: Need for efficient neural network compression methods that don't require retraining for large-scale deployment, with current methods like structured pruning having limitations in reconstruction accuracy.

Method: Formalizes compression as projection geometry: structured pruning as axis-aligned projection vs model folding (weight clustering) as low-rank projection. Analyzes both as orthogonal operators with theoretical guarantees on reconstruction error and functional perturbations.

Result: Evaluated >1000 checkpoints across ResNet18, PreActResNet18, ViT-B/32, CLIP ViT-B/32 on CIFAR-10/ImageNet, and LLaMA models. Folding typically achieves higher post-compression accuracy, especially at moderate-high compression rates, though gap narrows in specific training setups.

Conclusion: Model folding is a geometry-aware, calibration-free alternative to pruning that is often superior in practice and principled in theory, offering better compression-performance tradeoffs without retraining.

Abstract: Compressing neural networks without retraining is vital for deployment at scale. We study calibration-free compression through the lens of projection geometry: structured pruning is an axis-aligned projection, whereas model folding performs a low-rank projection via weight clustering. We formalize both as orthogonal operators and show that, within a rank distance of one, folding provably yields smaller parameter reconstruction error, and under mild smoothness assumptions, smaller functional perturbations than pruning. At scale, we evaluate >1000 checkpoints spanning ResNet18, PreActResNet18, ViT-B/32, and CLIP ViT-B/32 on CIFAR-10 and ImageNet-1K, covering diverse training hyperparameters (optimizers, learning rates, augmentations, regularization, sharpness-aware training), as well as multiple LLaMA-family 60M and 130M parameter models trained on C4. We show that folding typically achieves higher post-compression accuracy, with the largest gains at moderate-high compression. The gap narrows and occasionally reverses at specific training setups. Our results position folding as a geometry-aware, calibration-free alternative to pruning that is often superior in practice and principled in theory.

[224] Learning Without Training

Ryan O’Dowd

Main category: cs.LG

TL;DR: A dissertation presenting three theoretical machine learning projects: 1) improved supervised learning function approximation, 2) transfer learning for function lifting across domains, and 3) active learning classification using signal separation techniques.

DetailsMotivation: To address theoretical shortcomings in current machine learning paradigms by developing mathematically grounded approaches for supervised learning, transfer learning, and classification tasks.

Method: Three distinct theoretical approaches: 1) New method for supervised learning addressing theoretical shortcomings, 2) Study of function lifting in transfer learning across domains, 3) Signal separation techniques applied to active learning classification.

Result: Theoretical frameworks developed for each project, with the third project’s algorithm achieving competitive accuracy to recent active learning methods while providing faster results.

Conclusion: Mathematical theory provides valuable foundations for addressing practical machine learning challenges, with applications to supervised learning, transfer learning, and classification.

Abstract: Machine learning is at the heart of managing the real-world problems associated with massive data. With the success of neural networks on such large-scale problems, more research in machine learning is being conducted now than ever before. This dissertation focuses on three different projects rooted in mathematical theory for machine learning applications. The first project deals with supervised learning and manifold learning. In theory, one of the main problems in supervised learning is that of function approximation: that is, given some data set $\mathcal{D}={(x_j,f(x_j))}_{j=1}^M$, can one build a model $F\approx f$? We introduce a method which aims to remedy several of the theoretical shortcomings of the current paradigm for supervised learning. The second project deals with transfer learning, which is the study of how an approximation process or model learned on one domain can be leveraged to improve the approximation on another domain. We study such liftings of functions when the data is assumed to be known only on a part of the whole domain. We are interested in determining subsets of the target data space on which the lifting can be defined, and how the local smoothness of the function and its lifting are related. The third project is concerned with the classification task in machine learning, particularly in the active learning paradigm. Classification has often been treated as an approximation problem as well, but we propose an alternative approach leveraging techniques originally introduced for signal separation problems. We introduce theory to unify signal separation with classification and a new algorithm which yields competitive accuracy to other recent active learning algorithms while providing results much faster.

[225] Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning

Yongjae Shin, Jongseong Chae, Jongeui Park, Youngchul Sung

Main category: cs.LG

TL;DR: FINO is a flow matching-based RL method that injects noise during policy training to improve exploration in offline-to-online reinforcement learning, achieving better sample efficiency.

DetailsMotivation: While generative models show promise as expressive policies in RL, their extension to online fine-tuning after offline pre-training faces challenges in effective exploration beyond the offline dataset distribution.

Method: FINO uses flow matching-based policies with injected noise during training to encourage broader action exploration beyond the offline dataset, combined with an entropy-guided sampling mechanism to balance exploration and exploitation during online fine-tuning.

Result: Experiments across diverse challenging tasks demonstrate that FINO consistently achieves superior performance under limited online budgets, showing improved sample efficiency for offline-to-online RL.

Conclusion: FINO effectively addresses exploration challenges in offline-to-online RL by combining noise-injected flow matching policies with entropy-guided sampling, enabling better adaptation during online fine-tuning.

Abstract: Generative models have recently demonstrated remarkable success across diverse domains, motivating their adoption as expressive policies in reinforcement learning (RL). While they have shown strong performance in offline RL, particularly where the target distribution is well defined, their extension to online fine-tuning has largely been treated as a direct continuation of offline pre-training, leaving key challenges unaddressed. In this paper, we propose Flow Matching with Injected Noise for Offline-to-Online RL (FINO), a novel method that leverages flow matching-based policies to enhance sample efficiency for offline-to-online RL. FINO facilitates effective exploration by injecting noise into policy training, thereby encouraging a broader range of actions beyond those observed in the offline dataset. In addition to exploration-enhanced flow policy training, we combine an entropy-guided sampling mechanism to balance exploration and exploitation, allowing the policy to adapt its behavior throughout online fine-tuning. Experiments across diverse, challenging tasks demonstrate that FINO consistently achieves superior performance under limited online budgets.

[226] Whole-Brain Connectomic Graph Model Enables Whole-Body Locomotion Control in Fruit Fly

Zehao Jin, Yaoye Zhu, Chen Zhang, Yanan Sui

Main category: cs.LG

TL;DR: Using the complete fruit fly brain connectome as a neural network controller for embodied locomotion tasks, achieving stable control without task-specific tuning.

DetailsMotivation: Biological neural networks naturally support whole-body movement control, but using brain connectomes as neural network controllers in embodied reinforcement learning remains unexplored.

Method: Developed Fly-connectomic Graph Model (FlyGM) with static structure identical to adult Drosophila connectome, represented as directed message-passing graph for information flow from sensory inputs to motor outputs, integrated with biomechanical fruit fly model.

Result: Achieved stable control across diverse locomotion tasks without task-specific architectural tuning. FlyGM showed higher sample efficiency and superior performance compared to degree-preserving rewired graphs, random graphs, and multilayer perceptrons.

Conclusion: Static brain connectomes can be transformed to instantiate effective neural policies for embodied learning of movement control, demonstrating structural advantages of biological neural architectures.

Abstract: Whole-brain biological neural networks naturally support the learning and control of whole-body movements. However, the use of brain connectomes as neural network controllers in embodied reinforcement learning remains unexplored. We investigate using the exact neural architecture of an adult fruit fly’s brain for the control of its body movement. We develop Fly-connectomic Graph Model (FlyGM), whose static structure is identical to the complete connectome of an adult Drosophila for whole-body locomotion control. To perform dynamical control, FlyGM represents the static connectome as a directed message-passing graph to impose a biologically grounded information flow from sensory inputs to motor outputs. Integrated with a biomechanical fruit fly model, our method achieves stable control across diverse locomotion tasks without task-specific architectural tuning. To verify the structural advantages of the connectome-based model, we compare it against a degree-preserving rewired graph, a random graph, and multilayer perceptrons, showing that FlyGM yields higher sample efficiency and superior performance. This work demonstrates that static brain connectomes can be transformed to instantiate effective neural policy for embodied learning of movement control.

[227] Asynchronous Heavy-Tailed Optimization

Junfei Sun, Dixi Yao, Xuchen Gong, Tahseen Rabbani, Manzil Zaheer, Tian Li

Main category: cs.LG

TL;DR: The paper proposes algorithmic modifications for asynchronous optimization to handle heavy-tailed gradient noise in transformer models, using delay-aware learning rate scheduling and delay compensation techniques.

DetailsMotivation: Heavy-tailed stochastic gradient noise in transformer models can destabilize optimization, but existing work focuses on centralized/distributed synchronous settings, leaving asynchronous optimization with such noise underexplored.

Method: Proposes two communication schemes for handling stragglers with asynchronous updates under heavy-tailed noise, with algorithmic modifications based on delay-aware learning rate scheduling and delay compensation.

Result: Theoretical convergence guarantees match synchronous counterparts’ rates and improve delay tolerance; empirical results show better accuracy/runtime trade-offs and hyperparameter robustness in image and language tasks.

Conclusion: The proposed asynchronous optimization approaches effectively handle heavy-tailed gradient noise in transformer training, outperforming existing synchronous and asynchronous methods.

Abstract: Heavy-tailed stochastic gradient noise, commonly observed in transformer models, can destabilize the optimization process. Recent works mainly focus on developing and understanding approaches to address heavy-tailed noise in the centralized or distributed, synchronous setting, leaving the interactions between such noise and asynchronous optimization underexplored. In this work, we investigate two communication schemes that handle stragglers with asynchronous updates in the presence of heavy-tailed gradient noise. We propose and theoretically analyze algorithmic modifications based on delay-aware learning rate scheduling and delay compensation to enhance the performance of asynchronous algorithms. Our convergence guarantees under heavy-tailed noise match the rate of the synchronous counterparts and improve delay tolerance compared with existing asynchronous approaches. Empirically, our approaches outperform prior synchronous and asynchronous methods in terms of accuracy/runtime trade-offs and are more robust to hyperparameters in both image and language tasks.

[228] Capabilities Ain’t All You Need: Measuring Propensities in AI

Daniel Romero-Alvarado, Fernando Martínez-Plumed, Lorenzo Pacchiardi, Hugo Save, Siddhesh Milind Pawar, Behzad Mehrbakhsh, Pablo Antonio Moreno Casares, Ben Slater, Paolo Bova, Peter Romero, Zachary R. Tyler, Jonathan Prunty, Luning Sun, Jose Hernandez-Orallo

Main category: cs.LG

TL;DR: A formal framework for measuring AI propensities (behavioral tendencies) using bilogistic formulation that identifies an “ideal band” where models perform optimally, showing propensities predict behavior on held-out tasks better than capabilities alone.

DetailsMotivation: Traditional AI evaluation focuses on capabilities using Item Response Theory (IRT), but propensities (behavioral tendencies) are crucial for performance and safety. IRT's monotonic function approach is unsuitable for propensities where both excess and deficiency can be problematic.

Method: Introduces a bilogistic formulation for model success that attributes high success probability when model’s propensity is within an “ideal band.” Estimates ideal band limits using LLMs with task-agnostic rubrics. Applied to six families of LLM models with incited propensities in either direction.

Result: Successfully measures how much propensity is shifted and its effect on tasks. Propensities estimated from one benchmark predict behavior on held-out tasks. Combining propensities and capabilities yields stronger predictive power than either separately.

Conclusion: The framework enables rigorous propensity measurements and demonstrates gains over using only capability evaluations to predict AI behavior, showing propensities are measurable and predictive of model performance.

Abstract: AI evaluation has primarily focused on measuring capabilities, with formal approaches inspired from Item Response Theory (IRT) being increasingly applied. Yet propensities - the tendencies of models to exhibit particular behaviours - play a central role in determining both performance and safety outcomes. However, traditional IRT describes a model’s success on a task as a monotonic function of model capabilities and task demands, an approach unsuited to propensities, where both excess and deficiency can be problematic. Here, we introduce the first formal framework for measuring AI propensities by using a bilogistic formulation for model success, which attributes high success probability when the model’s propensity is within an “ideal band”. Further, we estimate the limits of the ideal band using LLMs equipped with newly developed task-agnostic rubrics. Applying our framework to six families of LLM models whose propensities are incited in either direction, we find that we can measure how much the propensity is shifted and what effect this has on the tasks. Critically, propensities estimated using one benchmark successfully predict behaviour on held-out tasks. Moreover, we obtain stronger predictive power when combining propensities and capabilities than either separately. More broadly, our framework showcases how rigorous propensity measurements can be conducted and how it yields gains over solely using capability evaluations to predict AI behaviour.

[229] Continual-NExT: A Unified Comprehension And Generation Continual Learning Framework

Jingyang Qiao, Zhizhong Zhang, Xin Tan, Jingyu Gong, Yanyun Qu, Yuan Xie

Main category: cs.LG

TL;DR: Continual-NExT framework addresses lifelong learning challenges in Dual-to-Dual MLLMs, proposing MAGE method to mitigate catastrophic forgetting and improve cross-modal knowledge transfer.

DetailsMotivation: Dual-to-Dual MLLMs lack standardized continual learning frameworks, suffering from catastrophic forgetting, hallucination, instruction unfollowing, and cross-modal knowledge transfer failures when learning new tasks, limiting their adaptation to dynamic real-world scenarios.

Method: Proposes Continual-NExT framework with evaluation metrics and MAGE (Mixture and Aggregation of General LoRA and Expert LoRA) method that combines general and expert Low-Rank Adaptation modules to facilitate knowledge transfer across modalities and mitigate forgetting.

Result: Extensive experiments show MAGE outperforms other continual learning methods and achieves state-of-the-art performance in continual learning for Dual-to-Dual MLLMs.

Conclusion: The Continual-NExT framework with MAGE method effectively addresses continual learning challenges in Dual-to-Dual MLLMs, enabling better lifelong evolution and adaptation to dynamic scenarios.

Abstract: Dual-to-Dual MLLMs refer to Multimodal Large Language Models, which can enable unified multimodal comprehension and generation through text and image modalities. Although exhibiting strong instantaneous learning and generalization capabilities, Dual-to-Dual MLLMs still remain deficient in lifelong evolution, significantly affecting continual adaptation to dynamic real-world scenarios. One of the challenges is that learning new tasks inevitably destroys the learned knowledge. Beyond traditional catastrophic forgetting, Dual-to-Dual MLLMs face other challenges, including hallucination, instruction unfollowing, and failures in cross-modal knowledge transfer. However, no standardized continual learning framework for Dual-to-Dual MLLMs has been established yet, leaving these challenges unexplored. Thus, in this paper, we establish Continual-NExT, a continual learning framework for Dual-to-Dual MLLMs with deliberately-architected evaluation metrics. To improve the continual learning capability of Dual-to-Dual MLLMs, we propose an efficient MAGE (Mixture and Aggregation of General LoRA and Expert LoRA) method to further facilitate knowledge transfer across modalities and mitigate forgetting. Extensive experiments demonstrate that MAGE outperforms other continual learning methods and achieves state-of-the-art performance.

[230] LERD: Latent Event-Relational Dynamics for Neurodegenerative Classification

Hairong Chen, Yicheng Feng, Ziyu Jia, Samir Bhatt, Hengguan Huang

Main category: cs.LG

TL;DR: LERD is a Bayesian neural dynamical system that infers latent neural events and their relational structure from multichannel EEG data for Alzheimer’s disease diagnosis, outperforming existing methods and providing interpretable physiological insights.

DetailsMotivation: Existing EEG-based Alzheimer's disease diagnosis methods rely on black-box classifiers without modeling the underlying neural dynamics that generate observed signals, limiting interpretability and clinical utility.

Method: Proposes LERD: an end-to-end Bayesian electrophysiological neural dynamical system combining continuous-time event inference with stochastic event-generation process, incorporating electrophysiology-inspired dynamical priors and providing theoretical analysis with training bounds and stability guarantees.

Result: LERD consistently outperforms strong baselines on synthetic benchmarks and two real-world AD EEG cohorts, yielding physiology-aligned latent summaries that characterize group-level dynamical differences.

Conclusion: LERD provides an interpretable, principled approach for EEG-based AD diagnosis by explicitly modeling neural dynamics, offering both diagnostic accuracy and physiological insights into disease mechanisms.

Abstract: Alzheimer’s disease (AD) alters brain electrophysiology and disrupts multichannel EEG dynamics, making accurate and clinically useful EEG-based diagnosis increasingly important for screening and disease monitoring. However, many existing approaches rely on black-box classifiers and do not explicitly model the underlying dynamics that generate observed signals. To address these limitations, we propose LERD, an end-to-end Bayesian electrophysiological neural dynamical system that infers latent neural events and their relational structure directly from multichannel EEG without event or interaction annotations. LERD combines a continuous-time event inference module with a stochastic event-generation process to capture flexible temporal patterns, while incorporating an electrophysiology-inspired dynamical prior to guide learning in a principled way. We further provide theoretical analysis that yields a tractable bound for training and stability guarantees for the inferred relational dynamics. Extensive experiments on synthetic benchmarks and two real-world AD EEG cohorts demonstrate that LERD consistently outperforms strong baselines and yields physiology-aligned latent summaries that help characterize group-level dynamical differences.

[231] Deepmechanics

Abhay Shinde, Aryan Amit Barsainyan, Jose Siguenza, Ankita Vaishnobi Bisoi, Rakshit Kr. Singh, Bharath Ramsundar

Main category: cs.LG

TL;DR: Benchmarking of physics-informed neural networks (HNN, LNN, SRNN) on classical mechanical systems reveals stability issues with chaotic and non-conservative systems.

DetailsMotivation: Physics-informed deep learning models encode physical principles into neural networks, but systematic benchmarking across diverse physical phenomena remains limited, especially for checking long-term stability of predicted trajectories.

Method: Benchmarked three physics-informed architectures (Hamiltonian Neural Networks, Lagrangian Neural Networks, Symplectic Recurrent Neural Networks) using DeepChem framework on six dynamical systems spanning classical conservative mechanics and non-conservative systems with contact.

Result: All benchmarked models struggle to maintain stability for chaotic or nonconservative systems, indicating limitations in current physics-informed approaches for robust modeling of classical mechanical systems.

Conclusion: More research is needed for physics-informed deep learning models to learn robust models of classical mechanical systems, particularly for handling chaotic dynamics and non-conservative interactions.

Abstract: Physics-informed deep learning models have emerged as powerful tools for learning dynamical systems. These models directly encode physical principles into network architectures. However, systematic benchmarking of these approaches across diverse physical phenomena remains limited, particularly in conservative and dissipative systems. In addition, benchmarking that has been done thus far does not integrate out full trajectories to check stability. In this work, we benchmark three prominent physics-informed architectures such as Hamiltonian Neural Networks (HNN), Lagrangian Neural Networks (LNN), and Symplectic Recurrent Neural Networks (SRNN) using the DeepChem framework, an open-source scientific machine learning library. We evaluate these models on six dynamical systems spanning classical conservative mechanics (mass-spring system, simple pendulum, double pendulum, and three-body problem, spring-pendulum) and non-conservative systems with contact (bouncing ball). We evaluate models by computing error on predicted trajectories and evaluate error both quantitatively and qualitatively. We find that all benchmarked models struggle to maintain stability for chaotic or nonconservative systems. Our results suggest that more research is needed for physics-informed deep learning models to learn robust models of classical mechanical systems.

[232] [Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games

Jorge Carrasco Pollo, Ioannis Kapetangeorgis, Joshua Rosenthal, John Hua Yao

Main category: cs.LG

TL;DR: This paper examines the reproducibility and generalizability of a negotiation benchmark for LLMs, finding issues with model comparison objectivity and experimental setup limitations.

DetailsMotivation: To investigate the reproducibility of claims in a recently introduced negotiation benchmark for LLMs, and to provide deeper understanding of its usability and generalizability for multi-agent negotiation tasks.

Method: Replicated original experiments on additional models, introduced additional metrics for negotiation quality and evaluation evenness, examined behavior of wider range of models on extended benchmark version, and analyzed information leakage detection and ablation study thoroughness.

Result: Found that while the benchmark is complex, model comparison is ambiguous and raises questions about objectivity. Identified limitations in experimental setup, particularly in information leakage detection and ablation study thoroughness. Revealed insights about model behavior that provide additional context for potential users.

Conclusion: Highlights the importance of context in model-comparative evaluations and reveals limitations in the negotiation benchmark’s objectivity and experimental setup, providing important considerations for researchers using such benchmarks.

Abstract: Large Language Models (LLMs) demonstrate significant potential in multi-agent negotiation tasks, yet evaluation in this domain remains challenging due to a lack of robust and generalizable benchmarks. Abdelnabi et al. (2024) introduce a negotiation benchmark based on Scoreable Games, with the aim of developing a highly complex and realistic evaluation framework for LLMs. Our work investigates the reproducibility of claims in their benchmark, and provides a deeper understanding of its usability and generalizability. We replicate the original experiments on additional models, and introduce additional metrics to verify negotiation quality and evenness of evaluation. Our findings reveal that while the benchmark is indeed complex, model comparison is ambiguous, raising questions about its objectivity. Furthermore, we identify limitations in the experimental setup, particularly in information leakage detection and thoroughness of the ablation study. By examining and analyzing the behavior of a wider range of models on an extended version of the benchmark, we reveal insights that provide additional context to potential users. Our results highlight the importance of context in model-comparative evaluations.

[233] Balancing Symmetry and Efficiency in Graph Flow Matching

Benjamin Honoré, Alba Carballo-Castro, Yiming Qin, Pascal Frossard

Main category: cs.LG

TL;DR: The paper studies the trade-off between equivariance and performance in graph generative models, showing that controlled symmetry-breaking can accelerate training while preventing overfitting.

DetailsMotivation: Strict equivariance in graph generative models increases computational cost and slows convergence due to architectural constraints and the need for consistency across all node permutations. The authors investigate whether relaxing equivariance during training can improve efficiency and performance.

Method: Start from an equivariant discrete flow-matching model and relax its equivariance during training using a controllable symmetry modulation scheme based on sinusoidal positional encodings and node permutations.

Result: Symmetry-breaking accelerates early training by providing easier learning signals but can cause overfitting (generating duplicates of training graphs). Proper symmetry modulation delays overfitting while accelerating convergence, achieving stronger performance with only 19% of baseline training epochs.

Conclusion: Controlled symmetry modulation offers a better trade-off than strict equivariance, enabling faster convergence and better performance in graph generative models by balancing the benefits of equivariance with the efficiency gains of symmetry-breaking.

Abstract: Equivariance is central to graph generative models, as it ensures the model respects the permutation symmetry of graphs. However, strict equivariance can increase computational cost due to added architectural constraints, and can slow down convergence because the model must be consistent across a large space of possible node permutations. We study this trade-off for graph generative models. Specifically, we start from an equivariant discrete flow-matching model, and relax its equivariance during training via a controllable symmetry modulation scheme based on sinusoidal positional encodings and node permutations. Experiments first show that symmetry-breaking can accelerate early training by providing an easier learning signal, but at the expense of encouraging shortcut solutions that can cause overfitting, where the model repeatedly generates graphs that are duplicates of the training set. On the contrary, properly modulating the symmetry signal can delay overfitting while accelerating convergence, allowing the model to reach stronger performance with $19%$ of the baseline training epochs.

[234] PRISM: Parallel Reward Integration with Symmetry for MORL

Finn van der Knaap, Kejiang Qian, Zheng Xu, Fengxiang He

Main category: cs.LG

TL;DR: PRISM algorithm addresses heterogeneous multi-objective RL with temporal frequency mismatches using reflectional symmetry and reward integration to improve sample efficiency and Pareto coverage.

DetailsMotivation: In heterogeneous multi-objective RL, objectives with different temporal frequencies cause dense objectives to dominate learning while sparse long-horizon rewards receive weak credit assignment, leading to poor sample efficiency.

Method: Proposes PRISM algorithm with ReSymNet (theory-motivated model using residual blocks to learn scaled opportunity value) and SymReg (reflectional equivariance regularizer) that enforces agent mirroring and constrains policy search to reflection-equivariant subspace.

Result: Across MuJoCo benchmarks, PRISM outperforms sparse-reward baseline and oracle trained with full dense rewards, achieving hypervolume gains exceeding 100% over baseline and up to 32% over oracle, improving Pareto coverage and distributional balance.

Conclusion: PRISM effectively addresses temporal frequency mismatches in heterogeneous MORL through symmetry-based inductive biases, significantly improving sample efficiency and multi-objective optimization performance.

Abstract: This work studies heterogeneous Multi-Objective Reinforcement Learning (MORL), where objectives can differ sharply in temporal frequency. Such heterogeneity allows dense objectives to dominate learning, while sparse long-horizon rewards receive weak credit assignment, leading to poor sample efficiency. We propose a Parallel Reward Integration with Symmetry (PRISM) algorithm that enforces reflectional symmetry as an inductive bias in aligning reward channels. PRISM introduces ReSymNet, a theory-motivated model that reconciles temporal-frequency mismatches across objectives, using residual blocks to learn a scaled opportunity value that accelerates exploration while preserving the optimal policy. We also propose SymReg, a reflectional equivariance regulariser that enforces agent mirroring and constrains policy search to a reflection-equivariant subspace. This restriction provably reduces hypothesis complexity and improves generalisation. Across MuJoCo benchmarks, PRISM consistently outperforms both a sparse-reward baseline and an oracle trained with full dense rewards, improving Pareto coverage and distributional balance: it achieves hypervolume gains exceeding 100% over the baseline and up to 32% over the oracle. The code is at \href{https://github.com/EVIEHub/PRISM}{https://github.com/EVIEHub/PRISM}.

[235] TempoNet: Slack-Quantized Transformer-Guided Reinforcement Scheduler for Adaptive Deadline-Centric Real-Time Dispatchs

Rong Fu, Yibo Meng, Guangzhen Yao, Jiaxuan Lu, Zeyu Zhang, Zhaolu Kang, Ziming Guo, Jia Yee Tan, Xiaojing Du, Simon James Fong

Main category: cs.LG

TL;DR: TempoNet is a reinforcement learning scheduler using Transformer architecture with urgency tokenization for real-time task scheduling, achieving improved deadline fulfillment over traditional schedulers.

DetailsMotivation: Real-time schedulers need to handle tight deadlines under strict computational constraints, requiring efficient decision-making frameworks that can reason about temporal slack and task priorities.

Method: Uses reinforcement learning with permutation-invariant Transformer and deep Q-approximation. Features Urgency Tokenizer for discretizing temporal slack, latency-aware sparse attention with blockwise top-k selection, and multicore mapping layer for processor assignments.

Result: Shows consistent gains in deadline fulfillment over analytic schedulers and neural baselines in industrial mixed-criticality traces and large multiprocessor settings, with sub-millisecond inference and improved optimization stability.

Conclusion: Establishes a practical framework for Transformer-based decision making in high-throughput real-time scheduling with demonstrated efficiency and robustness.

Abstract: Real-time schedulers must reason about tight deadlines under strict compute budgets. We present TempoNet, a reinforcement learning scheduler that pairs a permutation-invariant Transformer with a deep Q-approximation. An Urgency Tokenizer discretizes temporal slack into learnable embeddings, stabilizing value learning and capturing deadline proximity. A latency-aware sparse attention stack with blockwise top-k selection and locality-sensitive chunking enables global reasoning over unordered task sets with near-linear scaling and sub-millisecond inference. A multicore mapping layer converts contextualized Q-scores into processor assignments through masked-greedy selection or differentiable matching. Extensive evaluations on industrial mixed-criticality traces and large multiprocessor settings show consistent gains in deadline fulfillment over analytic schedulers and neural baselines, together with improved optimization stability. Diagnostics include sensitivity analyses for slack quantization, attention-driven policy interpretation, hardware-in-the-loop and kernel micro-benchmarks, and robustness under stress with simple runtime mitigations; we also report sample-efficiency benefits from behavioral-cloning pretraining and compatibility with an actor-critic variant without altering the inference pipeline. These results establish a practical framework for Transformer-based decision making in high-throughput real-time scheduling.

[236] Non-Stationary Online Resource Allocation: Learning from a Single Sample

Yiding Feng, Jiashuo Jiang, Yige Wang

Main category: cs.LG

TL;DR: Online resource allocation with minimal offline data (one sample per period) under arbitrary non-stationary demand, with two sample settings: reward-observed and type-only samples.

DetailsMotivation: Address online resource allocation problems where decision-makers must allocate multiple resources to sequentially arriving queries with stochastic rewards, but the environment exhibits arbitrary non-stationarity (arrival distributions shift unpredictably) and only minimal offline data (one historical sample per period) is available.

Method: Proposes a type-dependent quantile-based meta-policy that decouples the problem into: (1) reward distribution estimation, (2) optimization of target service probabilities via fluid relaxation, and (3) real-time decisions through dynamic acceptance thresholds. For reward-observed samples, uses static threshold policy. For type-only samples, under minimum-arrival-probability assumption, designs partially adaptive policy and fully adaptive resolving policy with careful rounding.

Result: For reward-observed samples: achieves $\tilde{O}(\sqrt{T})$ regret. For type-only samples: shows sublinear regret impossible without additional structure; under mild assumption, partially adaptive policy achieves $\tilde{O}(\sqrt{T})$ and fully adaptive resolving policy achieves $O((\log T)^3)$ poly-logarithmic regret for non-stationary multi-resource allocation.

Conclusion: The framework advances prior work by operating with minimal offline data (one sample per period), handling arbitrary non-stationarity without variation-budget assumptions, and supporting multiple resource constraints, achieving strong regret guarantees in challenging non-stationary environments.

Abstract: We study online resource allocation under non-stationary demand with a minimum offline data requirement. In this problem, a decision-maker must allocate multiple types of resources to sequentially arriving queries over a finite horizon. Each query belongs to a finite set of types with fixed resource consumption and a stochastic reward drawn from an unknown, type-specific distribution. Critically, the environment exhibits arbitrary non-stationarity – arrival distributions may shift unpredictably-while the algorithm requires only one historical sample per period to operate effectively. We distinguish two settings based on sample informativeness: (i) reward-observed samples containing both query type and reward realization, and (ii) the more challenging type-only samples revealing only query type information. We propose a novel type-dependent quantile-based meta-policy that decouples the problem into modular components: reward distribution estimation, optimization of target service probabilities via fluid relaxation, and real-time decisions through dynamic acceptance thresholds. For reward-observed samples, our static threshold policy achieves $\tilde{O}(\sqrt{T})$ regret. For type-only samples, we first establish that sublinear regret is impossible without additional structure; under a mild minimum-arrival-probability assumption, we design both a partially adaptive policy attaining the same $\tilde{O}({T})$ bound and, more significantly, a fully adaptive resolving policy with careful rounding that achieves the first poly-logarithmic regret guarantee of $O((\log T)^3)$ for non-stationary multi-resource allocation. Our framework advances prior work by operating with minimal offline data (one sample per period), handling arbitrary non-stationarity without variation-budget assumptions, and supporting multiple resource constraints.

[237] Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, Haitham Bou-Ammar

Main category: cs.LG

TL;DR: A principled optimization framework for decoding in language models that unifies existing methods and enables systematic design of new decoders like Best-of-K for multi-sample pipelines.

DetailsMotivation: Decoding in language models is currently treated as heuristic knob-tuning rather than a principled optimization problem. The authors argue that decoding should be understood as a systematic optimization layer that can unify existing methods and enable systematic design of new decoders.

Method: Proposes a framework where decoding at each token is formulated as solving a regularized optimization problem over the probability simplex, trading off model score against structural preferences and constraints. This framework recovers existing methods as special cases and enables designing new decoders like Best-of-K (BoK), which uses a KL-anchored coverage objective for multi-sample pipelines.

Result: The framework successfully unifies greedy decoding, Softmax sampling, Top-K, Top-P, and Sparsemax-style sparsity. Best-of-K decoder shows significant improvements, such as +18.6% accuracy for Qwen2.5-Math-7B on MATH500 at high sampling temperatures.

Conclusion: Decoding should be treated as a principled optimization layer rather than heuristic tuning. The proposed framework provides a unified understanding of existing methods and enables systematic design of new decoders with improved performance.

Abstract: Decoding sits between a language model and everything we do with it, yet it is still treated as a heuristic knob-tuning exercise. We argue decoding should be understood as a principled optimisation layer: at each token, we solve a regularised problem over the probability simplex that trades off model score against structural preferences and constraints. This single template recovers greedy decoding, Softmax sampling, Top-K, Top-P, and Sparsemax-style sparsity as special cases, and explains their common structure through optimality conditions. More importantly, the framework makes it easy to invent new decoders without folklore. We demonstrate this by designing Best-of-K (BoK), a KL-anchored coverage objective aimed at multi-sample pipelines (self-consistency, reranking, verifier selection). BoK targets the probability of covering good alternatives within a fixed K-sample budget and improves empirical performance. We show that such samples can improve accuracy by, for example, +18.6% for Qwen2.5-Math-7B on MATH500 at high sampling temperatures.

[238] Learning Long-Range Dependencies with Temporal Predictive Coding

Tom Potter, Oliver Rhodes

Main category: cs.LG

TL;DR: A novel method combining Temporal Predictive Coding with approximate Real-Time Recurrent Learning enables effective spatio-temporal credit assignment for RNNs, matching BPTT performance while maintaining local, parallelizable operations suitable for energy-efficient neuromorphic hardware.

DetailsMotivation: Predictive Coding (PC) offers biologically-inspired, local, parallelizable operations ideal for energy-efficient neuromorphic hardware, but extending it effectively to RNNs for tasks with long-range temporal dependencies has been challenging. BPTT remains dominant but suffers from non-local computation, lack of spatial parallelism, and high energy consumption due to extensive activation storage.

Method: The paper introduces a novel method that combines Temporal Predictive Coding (tPC) with approximate Real-Time Recurrent Learning (RTRL) to enable effective spatio-temporal credit assignment for recurrent neural networks.

Result: The proposed method closely matches BPTT performance on both synthetic benchmarks and real-world tasks. On a challenging 15-million parameter machine translation task, it achieves test perplexity of 7.62 (vs. 7.49 for BPTT), marking one of the first applications of tPC to tasks of this scale.

Conclusion: The method demonstrates potential for learning complex temporal dependencies while retaining the local, parallelizable, and flexible properties of the original PC framework, paving the way for more energy-efficient learning systems.

Abstract: Predictive Coding (PC) is a biologically-inspired learning framework characterised by local, parallelisable operations, properties that enable energy-efficient implementation on neuromorphic hardware. Despite this, extending PC effectively to recurrent neural networks (RNNs) has been challenging, particularly for tasks involving long-range temporal dependencies. Backpropagation Through Time (BPTT) remains the dominant method for training RNNs, but its non-local computation, lack of spatial parallelism, and requirement to store extensive activation histories results in significant energy consumption. This work introduces a novel method combining Temporal Predictive Coding (tPC) with approximate Real-Time Recurrent Learning (RTRL), enabling effective spatio-temporal credit assignment. Results indicate that the proposed method can closely match the performance of BPTT on both synthetic benchmarks and real-world tasks. On a challenging machine translation task, with a 15-million parameter model, the proposed method achieves a test perplexity of 7.62 (vs. 7.49 for BPTT), marking one of the first applications of tPC to tasks of this scale. These findings demonstrate the potential of this method to learn complex temporal dependencies whilst retaining the local, parallelisable, and flexible properties of the original PC framework, paving the way for more energy-efficient learning systems.

[239] JPmHC Dynamical Isometry via Orthogonal Hyper-Connections

Biswa Sengupta, Jinhua Wang, Leo Brunswic

Main category: cs.LG

TL;DR: JPmHC is a framework that replaces identity skips in hyper-connections with trainable linear mixers on parallel streams while controlling gradient conditioning through operator-norm-bounded manifolds to improve stability and efficiency.

DetailsMotivation: Hyper-connections improve performance but compromise identity mapping properties, causing training instability, limited scalability, and increased memory overhead. The authors aim to address these challenges while preserving gradient conditioning.

Method: JPmHC replaces identity skips with trainable linear mixers on n parallel streams, constraining mixers on operator-norm-bounded manifolds (bistochastic, Stiefel, Grassmann). Uses free-probability analysis to predict Jacobian spectra, memory-efficient implicit differentiation for fixed-point projections, and Stiefel-constrained mixers via Cayley transforms.

Result: Empirical evaluations on ARC-AGI show JPmHC achieves faster convergence, higher accuracy, and lower computational cost compared to bistochastic baselines.

Conclusion: JPmHC advances spectrum-aware, stable, and efficient deep learning as a flexible extension of hyper-connections, offering insights into topological architecture design and foundational model evolution.

Abstract: Recent advances in deep learning, exemplified by Hyper-Connections (HC), have expanded the residual connection paradigm by introducing wider residual streams and diverse connectivity patterns. While these innovations yield significant performance gains, they compromise the identity mapping property of residual connections, leading to training instability, limited scalability, and increased memory overhead. To address these challenges, we propose JPmHC (Jacobian-spectrum Preserving manifold-constrained Hyper-Connections), a framework that replaces identity skips with a trainable linear mixer acting on n parallel streams while explicitly controlling gradient conditioning. By constraining the mixer M on operator-norm-bounded manifolds (e.g., bistochastic, Stiefel, Grassmann), JPmHC prevents gradient pathologies and enhances stability. JPmHC introduces three key contributions: (i) a free-probability analysis that predicts Jacobian spectra for structured skips, providing actionable design rules for mixer selection; (ii) memory-efficient implicit differentiation for fixed-point projections, reducing activation memory and synchronization overhead; and (iii) a Stiefel-constrained mixer via Cayley transforms, ensuring orthogonality without post-hoc normalization. Empirical evaluations on ARC-AGI demonstrate that JPmHC achieves faster convergence, higher accuracy, and lower computational cost compared to bistochastic baselines. As a flexible and scalable extension of HC, JPmHC advances spectrum-aware, stable, and efficient deep learning, offering insights into topological architecture design and foundational model evolution.

[240] Advection-Diffusion on Graphs: A Bakry-Emery Laplacian for Spectral Graph Neural Networks

Pierre-Gabriel Berlureau, Ali Hariri, Victor Kawasaki-Borruat, Mia Zosso, Pierre Vandergheynst

Main category: cs.LG

TL;DR: Proposes mu-ChebNet, a spectral GNN using Bakry-Emery graph Laplacian with learnable node-wise potential to improve long-range information propagation without altering graph topology.

DetailsMotivation: GNNs suffer from oversmoothing and oversquashing issues that limit long-distance information propagation. Existing solutions like graph transformers or rewiring are computationally expensive or require modifying graph structure.

Method: Introduces Bakry-Emery graph Laplacian that integrates diffusion and advection through learnable node-wise potential. Develops mu-ChebNet spectral architecture that jointly learns potential and Chebyshev filters, acting as drop-in replacement for standard Laplacians.

Result: mu-ChebNet achieves consistent gains on synthetic long-range reasoning tasks and real-world benchmarks. Provides interpretable routing field showing information flow through graphs.

Conclusion: Bakry-Emery Laplacian offers principled and efficient foundation for adaptive spectral graph learning, enabling control of graph properties through learnable potential.

Abstract: Graph Neural Networks (GNNs) often struggle to propagate information across long distances due to oversmoothing and oversquashing. Existing remedies such as graph transformers or rewiring typically incur high computational cost or require altering the graph structure. We introduce a Bakry-Emery graph Laplacian that integrates diffusion and advection through a learnable node-wise potential, inducing task-dependent propagation dynamics without modifying topology. This operator has a well-behaved spectral decomposition and acts as a drop-in replacement for standard Laplacians in spectral GNNs. Building on this insight, we develop mu-ChebNet, a spectral architecture that jointly learns the potential and Chebyshev filters, effectively bridging message-passing adaptivity and spectral efficiency. Our theoretical analysis shows how the potential modulates the spectrum, enabling control of key graph properties. Empirically, mu-ChebNet delivers consistent gains on synthetic long-range reasoning tasks, as well as real-world benchmarks, while offering an interpretable routing field that reveals how information flows through the graph. This establishes the Bakry-Emery Laplacian as a principled and efficient foundation for adaptive spectral graph learning.

[241] Stable Long-Horizon Spatiotemporal Prediction on Meshes Using Latent Multiscale Recurrent Graph Neural Networks

Lionel Salesses, Larbi Arbaoui, Tariq Benamara, Arnaud Francois, Caroline Sainvitu

Main category: cs.LG

TL;DR: A deep learning framework for long-horizon spatiotemporal temperature prediction on complex geometries using coupled temporal multiscale models with graph neural networks and variational graph autoencoders.

DetailsMotivation: Accurate long-horizon prediction of spatiotemporal fields on complex geometries is crucial for applications like additive manufacturing, where temperature histories affect defect formation and mechanical properties. High-fidelity simulations are computationally expensive, and existing ML methods struggle with long-horizon temperature and gradient prediction.

Method: Proposes a temporal multiscale architecture with two coupled models operating at complementary time scales. Both models use latent recurrent graph neural networks to capture spatiotemporal dynamics on meshes, while a variational graph autoencoder provides compact latent representations to reduce memory usage and improve training stability.

Result: Experiments on simulated powder bed fusion data demonstrate accurate and temporally stable long-horizon predictions across diverse geometries, outperforming existing baselines. The framework maintains stability over thousands of time steps and generalizes across heterogeneous geometries.

Conclusion: The framework is general and extensible to physics-driven systems with multiscale dynamics and to three-dimensional geometries, despite being evaluated in 2D. It addresses the challenge of long-horizon spatiotemporal prediction on complex geometries.

Abstract: Accurate long-horizon prediction of spatiotemporal fields on complex geometries is a fundamental challenge in scientific machine learning, with applications such as additive manufacturing where temperature histories govern defect formation and mechanical properties. High-fidelity simulations are accurate but computationally costly, and despite recent advances, machine learning methods remain challenged by long-horizon temperature and gradient prediction. We propose a deep learning framework for predicting full temperature histories directly on meshes, conditioned on geometry and process parameters, while maintaining stability over thousands of time steps and generalizing across heterogeneous geometries. The framework adopts a temporal multiscale architecture composed of two coupled models operating at complementary time scales. Both models rely on a latent recurrent graph neural network to capture spatiotemporal dynamics on meshes, while a variational graph autoencoder provides a compact latent representation that reduces memory usage and improves training stability. Experiments on simulated powder bed fusion data demonstrate accurate and temporally stable long-horizon predictions across diverse geometries, outperforming existing baseline. Although evaluated in two dimensions, the framework is general and extensible to physics-driven systems with multiscale dynamics and to three-dimensional geometries.

[242] Unifying Formal Explanations: A Complexity-Theoretic Perspective

Shahaf Bassan, Xuanxiang Huang, Guy Katz

Main category: cs.LG

TL;DR: A unified framework for analyzing sufficient and contrastive explanations in ML, showing computational complexity depends on monotonicity, submodularity, and supermodularity properties of value functions, with polynomial-time results for global explanations but NP-hardness for local ones.

DetailsMotivation: To unify the analysis of two fundamental types of ML explanations (sufficient reasons and contrastive reasons) across different contexts, and understand how computational complexity relates to combinatorial optimization properties of value functions.

Method: Introduces a unified probabilistic framework where both explanation types can be characterized through minimization of a unified probabilistic value function. Analyzes computational complexity based on three key properties: monotonicity, submodularity, and supermodularity.

Result: Global value functions possess monotonicity and submodularity/supermodularity properties, enabling polynomial-time computation of explanations with provable guarantees for various ML models (neural networks, decision trees, ensembles). Local value functions lack these properties, making even simplified versions NP-hard.

Conclusion: The unified framework reveals fundamental connections between explanation complexity and combinatorial optimization properties, with counterintuitive distinctions between local and global settings that have significant implications for practical explainability.

Abstract: Previous work has explored the computational complexity of deriving two fundamental types of explanations for ML model predictions: (1) sufficient reasons, which are subsets of input features that, when fixed, determine a prediction, and (2) contrastive reasons, which are subsets of input features that, when modified, alter a prediction. Prior studies have examined these explanations in different contexts, such as non-probabilistic versus probabilistic frameworks and local versus global settings. In this study, we introduce a unified framework for analyzing these explanations, demonstrating that they can all be characterized through the minimization of a unified probabilistic value function. We then prove that the complexity of these computations is influenced by three key properties of the value function: (1) monotonicity, (2) submodularity, and (3) supermodularity - which are three fundamental properties in combinatorial optimization. Our findings uncover some counterintuitive results regarding the nature of these properties within the explanation settings examined. For instance, although the local value functions do not exhibit monotonicity or submodularity/supermodularity whatsoever, we demonstrate that the global value functions do possess these properties. This distinction enables us to prove a series of novel polynomial-time results for computing various explanations with provable guarantees in the global explainability setting, across a range of ML models that span the interpretability spectrum, such as neural networks, decision trees, and tree ensembles. In contrast, we show that even highly simplified versions of these explanations become NP-hard to compute in the corresponding local explainability setting.

[243] A Deep Surrogate Model for Robust and Generalizable Long-Term Blast Wave Prediction

Danning Jing, Xinhai Chen, Xifeng Pu, Jie Hu, Chao Huang, Xuguang Chen, Qinglin Wang, Jie Liu

Main category: cs.LG

TL;DR: RGD-Blast: A robust deep surrogate model for high-fidelity, long-term blast wave forecasting using multi-scale modules and dynamic-static feature coupling to improve accuracy and generalization.

DetailsMotivation: Traditional blast wave modeling is computationally expensive, while existing ML surrogate models suffer from degraded accuracy, especially on complex layouts or out-of-distribution scenarios, and error accumulation in autoregressive predictions.

Method: Proposes RGD-Blast with multi-scale modules to capture global flow patterns and local boundary interactions, plus dynamic-static feature coupling that fuses time-varying pressure fields with static source/layout features to enhance generalization.

Result: Achieves 100x speedup over traditional methods with comparable accuracy. On unseen building layouts: average RMSE <0.01 and R² >0.89 over 280 time steps. Validated generalization across varying blast locations and charge weights.

Conclusion: RGD-Blast substantially advances long-term blast wave modeling with robust generalization capabilities, making it suitable for practical applications requiring accurate, fast simulations.

Abstract: Accurately modeling the spatio-temporal dynamics of blast wave propagation remains a longstanding challenge due to its highly nonlinear behavior, sharp gradients, and burdensome computational cost. While machine learning-based surrogate models offer fast inference as a promising alternative, they suffer from degraded accuracy, particularly evaluated on complex urban layouts or out-of-distribution scenarios. Moreover, autoregressive prediction strategies in such models are prone to error accumulation over long forecasting horizons, limiting their robustness for extended-time simulations. To address these limitations, we propose RGD-Blast, a robust and generalizable deep surrogate model for high-fidelity, long-term blast wave forecasting. RGD-Blast incorporates a multi-scale module to capture both global flow patterns and local boundary interactions, effectively mitigating error accumulation during autoregressive prediction. We introduce a dynamic-static feature coupling mechanism that fuses time-varying pressure fields with static source and layout features, thereby enhancing out-of-distribution generalization. Experiments demonstrate that RGD-Blast achieves a two-order-of-magnitude speedup over traditional numerical methods while maintaining comparable accuracy. In generalization tests on unseen building layouts, the model achieves an average RMSE below 0.01 and an R2 exceeding 0.89 over 280 consecutive time steps. Additional evaluations under varying blast source locations and explosive charge weights further validate its generalization, substantially advancing the state of the art in long-term blast wave modeling.

[244] FedZMG: Efficient Client-Side Optimization in Federated Learning

Fotios Zantalis, Evangelos Zervas, Grigorios Koulouras

Main category: cs.LG

TL;DR: FedZMG is a parameter-free client-side optimization algorithm for federated learning that projects local gradients onto a zero-mean hyperplane to mitigate client-drift in non-IID data settings.

DetailsMotivation: Federated learning faces challenges with non-IID client data leading to client-drift, which reduces convergence speed and model performance. Existing adaptive optimizers often introduce computational complexity or communication overhead unsuitable for resource-constrained IoT environments.

Method: FedZMG advances Gradient Centralization by projecting local gradients onto a zero-mean hyperplane, neutralizing intensity/bias shifts in heterogeneous data distributions without additional communication or hyperparameter tuning.

Result: Theoretical analysis shows FedZMG reduces effective gradient variance and guarantees tighter convergence bounds. Empirical evaluations on EMNIST, CIFAR100, and Shakespeare datasets demonstrate better convergence speed and final validation accuracy compared to FedAvg and FedAdam, especially in highly non-IID settings.

Conclusion: FedZMG provides an effective, parameter-free solution to client-drift in federated learning with non-IID data, offering improved performance without computational or communication overhead.

Abstract: Federated Learning (FL) enables distributed model training on edge devices while preserving data privacy. However, clients tend to have non-Independent and Identically Distributed (non-IID) data, which often leads to client-drift, and therefore diminishing convergence speed and model performance. While adaptive optimizers have been proposed to mitigate these effects, they frequently introduce computational complexity or communication overhead unsuitable for resource-constrained IoT environments. This paper introduces Federated Zero Mean Gradients (FedZMG), a novel, parameter-free, client-side optimization algorithm designed to tackle client-drift by structurally regularizing the optimization space. Advancing the idea of Gradient Centralization, FedZMG projects local gradients onto a zero-mean hyperplane, effectively neutralizing the “intensity” or “bias” shifts inherent in heterogeneous data distributions without requiring additional communication or hyperparameter tuning. A theoretical analysis is provided, proving that FedZMG reduces the effective gradient variance and guarantees tighter convergence bounds compared to standard FedAvg. Extensive empirical evaluations on EMNIST, CIFAR100, and Shakespeare datasets demonstrate that FedZMG achieves better convergence speed and final validation accuracy compared to the baseline FedAvg and the adaptive optimizer FedAdam, particularly in highly non-IID settings.

[245] SeedFlood: A Step Toward Scalable Decentralized Training of LLMs

Jihun Kim, Namhoon Lee

Main category: cs.LG

TL;DR: SeedFlood is a decentralized training method that achieves global consensus with minimal communication overhead by using seed-reconstructible zeroth-order updates, enabling efficient training of billion-parameter models across hundreds of clients.

DetailsMotivation: Traditional gossip-based decentralized training methods suffer from high communication costs that scale with model size, and information decay over network hops makes global consensus inefficient. There's a need for scalable decentralized training that can handle large models across complex network topologies.

Method: SeedFlood exploits the seed-reconstructible structure of zeroth-order updates to make messages near-zero in size. These small messages can be flooded to every client in the network, making communication overhead negligible and independent of model size. The approach enables efficient decentralized training by removing the primary scalability bottleneck.

Result: Experiments on decentralized LLM fine-tuning show that SeedFlood consistently outperforms gossip-based baselines in both generalization performance and communication efficiency. It achieves results comparable to first-order methods in large-scale settings and enables training of billion-parameter models across hundreds of clients.

Conclusion: SeedFlood represents a significant advancement in decentralized training by addressing the communication scalability bottleneck, making it practical to train large models across distributed networks with minimal overhead.

Abstract: This work presents a new approach to decentralized training-SeedFlood-designed to scale for large models across complex network topologies and achieve global consensus with minimal communication overhead. Traditional gossip-based methods suffer from message communication costs that grow with model size, while information decay over network hops renders global consensus inefficient. SeedFlood departs from these practices by exploiting the seed-reconstructible structure of zeroth-order updates and effectively making the messages near-zero in size, allowing them to be flooded to every client in the network. This mechanism makes communication overhead negligible and independent of model size, removing the primary scalability bottleneck in decentralized training. Consequently, SeedFlood enables training in regimes previously considered impractical, such as billion-parameter models distributed across hundreds of clients. Our experiments on decentralized LLM fine-tuning demonstrate thatSeedFlood consistently outperforms gossip-based baselines in both generalization performance and communication efficiency, and even achieves results comparable to first-order methods in large scale settings.

[246] RAT+: Train Dense, Infer Sparse – Recurrence Augmented Attention for Dilated Inference

Xiuying Wei, Caglar Gulcehre

Main category: cs.LG

TL;DR: RAT+ is a dense-pretraining architecture that uses full-sequence recurrence and active recurrence learning to enable flexible inference-time switching to dilated attention patterns without retraining separate sparse models.

DetailsMotivation: Existing structured dilated attention methods suffer from severe accuracy degradation when sparsifying pretrained attention models to dilated patterns, requiring separate sparse model training for each dilation configuration.

Method: RAT+ augments attention with full-sequence recurrence and active recurrence learning during dense pretraining, enabling a single model to be adapted at inference time to various dilated attention patterns (with optional local windows) or hybrid layer/head compositions through short resolution adaptation rather than full retraining.

Result: At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at dilation 16 and drops only 2-3 points at dilation 64 on commonsense reasoning and LongBench tasks, outperforming attention when sparsifying to top-k block attention. Scaling to 2.6B parameters and 200B tokens shows the same trend.

Conclusion: RAT+ provides an efficient inference-time solution for attention sparsification that maintains accuracy while reducing FLOPs and KV cache size, enabling flexible deployment of large language models with different computational budgets.

Abstract: Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of the attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. However, we find a persistent failure mode of them – sparsifying a pretrained attention model to a dilated pattern leads to severe accuracy degradation. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once, then flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at 16 and drops by about 2-3 points at 64 on commonsense reasoning and LongBench tasks, respectively. Moreover, RAT+ outperforms attention when sparsifying to the top-k block attention. We further scale to 2.6B parameters and 200B tokens and observe the same trend.

[247] Leakage and Second-Order Dynamics Improve Hippocampal RNN Replay

Josue Casco-Rodriguez, Nanda H. Krishna, Richard G. Baraniuk

Main category: cs.LG

TL;DR: Noisy recurrent neural networks trained for path integration can generate replay-like activity; the paper analyzes this as a sampling process, examines hidden state leakage/adaptation effects, and proposes momentum-based temporally compressed replay.

DetailsMotivation: To understand and improve replay generation in noisy recurrent neural networks (RNNs) trained for path integration, moving beyond the Langevin sampling description to analyze gradient dynamics, explore hidden state adaptation effects, and develop temporally compressed replay.

Method: Theoretical analysis of gradient dynamics in noisy RNNs, examination of hidden state leakage and adaptation effects, proposal of hidden state momentum for temporally compressed replay, and experimental validation on 2D triangular/T-maze paths and high-dimensional synthetic rat place cell activity.

Result: Shows that replay gradients are time-varying and difficult to estimate, hidden state adaptation encourages exploration but slows replay, and momentum-based compression counters slowness while maintaining exploration, connecting to underdamped Langevin sampling.

Conclusion: Provides deeper understanding of replay generation mechanisms in noisy RNNs, introduces momentum-based temporally compressed replay, and offers insights into biological replay dynamics through computational modeling.

Abstract: Biological neural networks (like the hippocampus) can internally generate “replay” resembling stimulus-driven activity. Recent computational models of replay use noisy recurrent neural networks (RNNs) trained to path-integrate. Replay in these networks has been described as Langevin sampling, but new modifiers of noisy RNN replay have surpassed this description. We re-examine noisy RNN replay as sampling to understand or improve it in three ways: (1) Under simple assumptions, we prove that the gradients replay activity should follow are time-varying and difficult to estimate, but readily motivate the use of hidden state leakage in RNNs for replay. (2) We confirm that hidden state adaptation (negative feedback) encourages exploration in replay, but show that it incurs non-Markov sampling that also slows replay. (3) We propose the first model of temporally compressed replay in noisy path-integrating RNNs through hidden state momentum, connect it to underdamped Langevin sampling, and show that, together with adaptation, it counters slowness while maintaining exploration. We verify our findings via path-integration of 2D triangular and T-maze paths and of high-dimensional paths of synthetic rat place cell activity.

[248] Generative Model via Quantile Assignment

Georgi Hrusanov, Oliver Y. Chén, Julien S. Bodelet

Main category: cs.LG

TL;DR: NeuroSQL is a new generative model that eliminates auxiliary networks by learning latent representations implicitly through optimal transport, achieving competitive image quality with faster training and better sample efficiency.

DetailsMotivation: Traditional generative models like VAEs and GANs rely on auxiliary networks (encoders/discriminators) that introduce training instability, computational overhead, and risks like mode collapse. There's a need for a simpler, more stable generative paradigm.

Method: NeuroSQL learns low-dimensional latent representations implicitly without auxiliary networks by expressing latent variables as solutions to an optimal transportation problem. It solves a linear assignment problem to learn latent variables, which are then passed to a standalone generator.

Result: NeuroSQL achieves: (1) overall lower mean pixel distance between synthetic and authentic images and stronger perceptual/structural fidelity compared to VAEs, GANs, and diffusion models; (2) least training time computationally; (3) effective synthetic data generation with limited training samples.

Conclusion: NeuroSQL provides a fast, stable, and robust alternative to traditional generative models by embracing quantile assignment rather than auxiliary networks, enabling efficient synthetic data generation with minimal information loss.

Abstract: Deep Generative models (DGMs) play two key roles in modern machine learning: (i) producing new information (e.g., image synthesis) and (ii) reducing dimensionality. However, traditional architectures often rely on auxiliary networks such as encoders in Variational Autoencoders (VAEs) or discriminators in Generative Adversarial Networks (GANs), which introduce training instability, computational overhead, and risks like mode collapse. We present NeuroSQL, a new generative paradigm that eliminates the need for auxiliary networks by learning low-dimensional latent representations implicitly. NeuroSQL leverages an asymptotic approximation that expresses the latent variables as the solution to an optimal transportation problem. Specifically, NeuroSQL learns the latent variables by solving a linear assignment problem and then passes the latent information to a standalone generator. We benchmark its performance against GANs, VAEs, and a budget-matched diffusion baseline on four datasets: handwritten digits (MNIST), faces (CelebA), animal faces (AFHQ), and brain images (OASIS). Compared to VAEs, GANs, and diffusion models: (1) in terms of image quality, NeuroSQL achieves overall lower mean pixel distance between synthetic and authentic images and stronger perceptual/structural fidelity; (2) computationally, NeuroSQL requires the least training time; and (3) practically, NeuroSQL provides an effective solution for generating synthetic data with limited training samples. By embracing quantile assignment rather than an encoder, NeuroSQL provides a fast, stable, and robust way to generate synthetic data with minimal information loss.

[249] Unifying approach to uniform expressivity of graph neural networks

Huan Luo, Jonni Virtema

Main category: cs.LG

TL;DR: T-GNNs generalize GNNs by aggregating over graph template embeddings, with corresponding logic GML(T) and theoretical analysis showing equivalence between T-GNNs and GML(T).

DetailsMotivation: Standard GNNs are limited to immediate neighborhood or global aggregations, so the authors aim to increase expressivity by incorporating substructural information through a generalized template-based framework.

Method: Introduce Template GNNs (T-GNNs) where node features are updated by aggregating valid template embeddings from specified graph templates. Develop corresponding Graded template modal logic (GML(T)), template-based bisimulation, and WL algorithm.

Result: Establish equivalence between expressive power of T-GNNs and GML(T), and show how standard AC-GNNs and recent variants can be interpreted as instantiations of T-GNNs.

Conclusion: T-GNNs provide a unifying framework for analyzing GNN expressivity, formalizing the trend of incorporating substructural information while maintaining theoretical connections to logic and WL algorithms.

Abstract: The expressive power of Graph Neural Networks (GNNs) is often analysed via correspondence to the Weisfeiler-Leman (WL) algorithm and fragments of first-order logic. Standard GNNs are limited to performing aggregation over immediate neighbourhoods or over global read-outs. To increase their expressivity, recent attempts have been made to incorporate substructural information (e.g. cycle counts and subgraph properties). In this paper, we formalize this architectural trend by introducing Template GNNs (T-GNNs), a generalized framework where node features are updated by aggregating over valid template embeddings from a specified set of graph templates. We propose a corresponding logic, Graded template modal logic (GML(T)), and generalized notions of template-based bisimulation and WL algorithm. We establish an equivalence between the expressive power of T-GNNs and GML(T), and provide a unifying approach for analysing GNN expressivity: we show how standard AC-GNNs and its recent variants can be interpreted as instantiations of T-GNNs.

[250] Parameter-Efficient Domain Adaptation of Physics-Informed Self-Attention based GNNs for AC Power Flow Prediction

Redwanul Karim, Changhun Kim, Timon Conrad, Nora Gourmelon, Julian Oelhaf, David Riebesel, Tomás Arias-Vergara, Andreas Maier, Johann Jäger, Siming Bayer

Main category: cs.LG

TL;DR: Parameter-efficient domain adaptation for physics-informed GNNs using LoRA and selective head unfreezing for AC power flow prediction under voltage regime shifts

DetailsMotivation: Existing physics-informed graph neural solvers require full fine-tuning for cross-regime transfer, which is costly and offers limited control over the stability-plasticity trade-off between target-domain adaptation and source-domain retention.

Method: Apply LoRA (Low-Rank Adaptation) to attention projections with selective unfreezing of the prediction head to regulate adaptation capacity, encouraging Kirchhoff-consistent behavior via physics-based loss while restricting adaptation to low-rank updates.

Result: The proposed LoRA+PHead adaptation recovers near-full fine-tuning accuracy with target-domain RMSE gap of 2.6×10⁻⁴ while reducing trainable parameters by 85.46%. It maintains comparable physics-based residual to full fine-tuning but reduces MV source retention by 4.7 percentage points under domain shift.

Conclusion: The method enables parameter-efficient and physically consistent AC power flow estimation under domain shift, offering a controllable efficiency-accuracy trade-off for physics-constrained inverse estimation.

Abstract: Accurate AC-PF prediction under domain shift is critical when models trained on medium-voltage (MV) grids are deployed on high-voltage (HV) networks. Existing physics-informed graph neural solvers typically rely on full fine-tuning for cross-regime transfer, incurring high retraining cost and offering limited control over the stability-plasticity trade-off between target-domain adaptation and source-domain retention. We study parameter-efficient domain adaptation for physics-informed self-attention based GNN, encouraging Kirchhoff-consistent behavior via a physics-based loss while restricting adaptation to low-rank updates. Specifically, we apply LoRA to attention projections with selective unfreezing of the prediction head to regulate adaptation capacity. This design yields a controllable efficiency-accuracy trade-off for physics-constrained inverse estimation under voltage-regime shift. Across multiple grid topologies, the proposed LoRA+PHead adaptation recovers near-full fine-tuning accuracy with a target-domain RMSE gap of $2.6\times10^{-4}$ while reducing the number of trainable parameters by 85.46%. The physics-based residual remains comparable to full fine-tuning; however, relative to Full FT, LoRA+PHead reduces MV source retention by 4.7 percentage points (17.9% vs. 22.6%) under domain shift, while still enabling parameter-efficient and physically consistent AC-PF estimation.

[251] Neural-HSS: Hierarchical Semi-Separable Neural PDE Solver

Pietro Sittoni, Emanuele Zangrando, Angelo A. Casulli, Nicola Guglielmi, Francesco Tudisco

Main category: cs.LG

TL;DR: Neural-HSS: A parameter-efficient neural architecture based on Hierarchical Semi-Separable matrix structure for data-efficient learning of PDE solutions, particularly effective for elliptic PDEs in low-data regimes.

DetailsMotivation: Deep learning methods for PDEs require large computational costs for dataset generation and model training, limiting their application in critical domains despite available computing infrastructure. The paper aims to develop more data-efficient architectures.

Method: Introduces Neural-HSS architecture based on Hierarchical Semi-Separable (HSS) matrix structure, inspired by Green’s functions for elliptic PDEs. The architecture is theoretically analyzed for exactness properties in low-data regimes and connections to Fourier neural operators and convolutional layers are investigated.

Result: Experimental validation on 3D Poisson equation over 2 million grid points shows superior data efficiency in low-data regimes compared to baselines. Demonstrates capability across diverse PDE domains including electromagnetism, fluid dynamics, and biology.

Conclusion: Neural-HSS provides a parameter-efficient, data-efficient architecture for learning PDE solutions, particularly effective for elliptic PDEs and applicable to broad classes of PDEs across multiple domains.

Abstract: Deep learning-based methods have shown remarkable effectiveness in solving PDEs, largely due to their ability to enable fast simulations once trained. However, despite the availability of high-performance computing infrastructure, many critical applications remain constrained by the substantial computational costs associated with generating large-scale, high-quality datasets and training models. In this work, inspired by studies on the structure of Green’s functions for elliptic PDEs, we introduce Neural-HSS, a parameter-efficient architecture built upon the Hierarchical Semi-Separable (HSS) matrix structure that is provably data-efficient for a broad class of PDEs. We theoretically analyze the proposed architecture, proving that it satisfies exactness properties even in very low-data regimes. We also investigate its connections with other architectural primitives, such as the Fourier neural operator layer and convolutional layers. We experimentally validate the data efficiency of Neural-HSS on the three-dimensional Poisson equation over a grid of two million points, demonstrating its superior ability to learn from data generated by elliptic PDEs in the low-data regime while outperforming baseline methods. Finally, we demonstrate its capability to learn from data arising from a broad class of PDEs in diverse domains, including electromagnetism, fluid dynamics, and biology.

[252] Variational Distributional Neuron

Yves Ruffenach

Main category: cs.LG

TL;DR: A variational distributional neuron is proposed as a VAE-based compute unit that carries explicit uncertainty through local priors, posteriors, and ELBO constraints, making computation probabilistic rather than deterministic.

DetailsMotivation: Addresses the structural tension between symbolic causality in sequential generation and probabilistic latent models where uncertainty remains global rather than intrinsic to computation. Questions why compute units don't explicitly carry uncertainty if it's intrinsic to computation.

Method: Formulates neurons as VAE bricks with explicit priors, amortized posteriors, and local ELBO constraints. Each neuron parameterizes a posterior, propagates reparameterized samples, and is regularized by KL terms. Extends to autoregressive priors over latent variables per unit.

Result: Proposes a proof-of-concept for distributional neurons where computation becomes contraction of possibility spaces under constraints. Analyzes collapse modes and conditions for “living neurons” that maintain uncertainty.

Conclusion: Introduces a fundamental shift from deterministic scalar neurons to distributional units that explicitly carry uncertainty, enabling more interpretable and controllable probabilistic computation at the granular level.

Abstract: We propose a proof of concept for a variational distributional neuron: a compute unit formulated as a VAE brick, explicitly carrying a prior, an amortized posterior and a local ELBO. The unit is no longer a deterministic scalar but a distribution: computing is no longer about propagating values, but about contracting a continuous space of possibilities under constraints. Each neuron parameterizes a posterior, propagates a reparameterized sample and is regularized by the KL term of a local ELBO - hence, the activation is distributional. This “contraction” becomes testable through local constraints and can be monitored via internal measures. The amount of contextual information carried by the unit, as well as the temporal persistence of this information, are locally tuned by distinct constraints. This proposal addresses a structural tension: in sequential generation, causality is predominantly organized in the symbolic space and, even when latents exist, they often remain auxiliary, while the effective dynamics are carried by a largely deterministic decoder. In parallel, probabilistic latent models capture factors of variation and uncertainty, but that uncertainty typically remains borne by global or parametric mechanisms, while units continue to propagate scalars - hence the pivot question: if uncertainty is intrinsic to computation, why does the compute unit not carry it explicitly? We therefore draw two axes: (i) the composition of probabilistic constraints, which must be made stable, interpretable and controllable; and (ii) granularity: if inference is a negotiation of distributions under constraints, should the primitive unit remain deterministic or become distributional? We analyze “collapse” modes and the conditions for a “living neuron”, then extend the contribution over time via autoregressive priors over the latent, per unit.

[253] Lean Formalization of Generalization Error Bound by Rademacher Complexity and Dudley’s Entropy Integral

Sho Sonoda, Kazumi Kasaura, Yuma Mizuno, Kei Tsukamoto, Naoto Onda

Main category: cs.LG

TL;DR: Formal verification of Rademacher complexity generalization bounds in Lean 4 with measure-theoretic foundations, including applications to linear predictors and Dudley entropy bounds.

DetailsMotivation: To provide mechanically-checked, rigorous proofs of generalization error bounds using Rademacher complexity, addressing the need for formal verification in statistical learning theory.

Method: Develop formal proofs in Lean 4 using Mathlib’s measure-theoretic probability theory, including symmetrization arguments, McDiarmid inequality, and techniques for handling separable topological index sets via countable dense subsets.

Result: Successfully formalized Rademacher complexity bounds with mechanized proofs for linear predictors under ℓ₂ and ℓ₁ regularization, and Dudley-type entropy integral bounds using covering numbers and chaining.

Conclusion: The work establishes a foundation for formal verification of learning theory results, providing reusable tools for proving generalization bounds with mathematical rigor.

Abstract: Understanding and certifying the generalization performance of machine learning algorithms – i.e. obtaining theoretical estimates of the test error from a finite training sample – is a central theme of statistical learning theory. Among the many complexity measures used to derive such guarantees, Rademacher complexity yields sharp, data-dependent bounds that apply well beyond classical $0$–$1$ classification. In this study, we formalize the generalization error bound by Rademacher complexity in Lean 4, building on measure-theoretic probability theory available in the Mathlib library. Our development provides a mechanically-checked pipeline from the definitions of empirical and expected Rademacher complexity, through a formal symmetrization argument and a bounded-differences analysis, to high-probability uniform deviation bounds via a formally proved McDiarmid inequality. A key technical contribution is a reusable mechanism for lifting results from countable hypothesis classes (where measurability of suprema is straightforward in Mathlib) to separable topological index sets via a reduction to a countable dense subset. As worked applications of the abstract theorem, we mechanize standard empirical Rademacher bounds for linear predictors under $\ell_2$ and $\ell_1$ regularization, and we also formalize a Dudley-type entropy integral bound based on covering numbers and a chaining construction.

[254] MEG-to-MEG Transfer Learning and Cross-Task Speech/Silence Detection with Limited Data

Xabier de Zuazo, Vincenzo Verbeni, Eva Navas, Ibon Saratxaga, Mathieu Bourguignon, Nicola Molinaro

Main category: cs.LG

TL;DR: Transfer learning enables cross-task decoding between speech perception and production from MEG data using pre-trained Conformer models

DetailsMotivation: Address data efficiency challenges in speech brain-computer interfaces by leveraging transfer learning across speech perception and production tasks

Method: Pre-train Conformer-based model on 50 hours of single-subject listening data, then fine-tune on just 5 minutes per subject across 18 participants for both perception and production tasks

Result: Transfer learning yields 1-4% in-task accuracy gains and 5-6% cross-task gains; models trained on production decode passive listening above chance, confirming shared neural representations

Conclusion: Transfer learning enables efficient cross-task decoding between speech perception and production, revealing shared neural processes rather than task-specific motor activity

Abstract: Data-efficient neural decoding is a central challenge for speech brain-computer interfaces. We present the first demonstration of transfer learning and cross-task decoding for MEG-based speech models spanning perception and production. We pre-train a Conformer-based model on 50 hours of single-subject listening data and fine-tune on just 5 minutes per subject across 18 participants. Transfer learning yields consistent improvements, with in-task accuracy gains of 1-4% and larger cross-task gains of up to 5-6%. Not only does pre-training improve performance within each task, but it also enables reliable cross-task decoding between perception and production. Critically, models trained on speech production decode passive listening above chance, confirming that learned representations reflect shared neural processes rather than task-specific motor activity.

[255] Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

Julian Minder, Clément Dumas, Caden Juang, Bilal Chugtai, Neel Nanda

Main category: cs.LG

TL;DR: This paper identifies issues with Crosscoders model diffing method, develops Latent Scaling to detect misattributed concepts, and improves crosscoders with BatchTopK loss for better identification of genuinely fine-tuning-specific concepts in language models.

DetailsMotivation: Model diffing helps understand how fine-tuning changes model representations, but current crosscoder methods can misattribute concepts as unique to fine-tuned models when they actually exist in both base and fine-tuned models due to L1 training loss issues.

Method: Developed Latent Scaling to flag misattribution issues by more accurately measuring latent presence across models, then trained crosscoders with BatchTopK loss to mitigate these problems and better identify genuinely fine-tuning-specific concepts.

Result: Standard crosscoder suffers heavily from misattribution issues; BatchTopK crosscoder substantially mitigates these problems, finding more genuinely chat-specific and interpretable concepts like “false information” and “personal question” with nuanced refusal-related latents.

Conclusion: The work advances best practices for crosscoder-based model diffing methodology and demonstrates it can provide concrete insights into how chat-tuning modifies model behavior through improved concept identification.

Abstract: Model diffing is the study of how fine-tuning changes a model’s representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a promising lens to interpret such behaviors. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning. Notably, prior work has observed concepts with no direction in the base model, and it was hypothesized that these model-specific latents were concepts introduced during fine-tuning. However, we identify two issues which stem from the crosscoders L1 training loss that can misattribute concepts as unique to the fine-tuned model, when they really exist in both models. We develop Latent Scaling to flag these issues by more accurately measuring each latent’s presence across models. In experiments comparing Gemma 2 2B base and chat models, we observe that the standard crosscoder suffers heavily from these issues. Building on these insights, we train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts. We recommend practitioners adopt similar techniques. Using the BatchTopK crosscoder, we successfully identify a set of chat-specific latents that are both interpretable and causally effective, representing concepts such as $\textit{false information}$ and $\textit{personal question}$, along with multiple refusal-related latents that show nuanced preferences for different refusal triggers. Overall, our work advances best practices for the crosscoder-based methodology for model diffing and demonstrates that it can provide concrete insights into how chat-tuning modifies model behavior.

[256] A Probabilistic Framework for LLM-Based Model Discovery

Stefan Wahl, Raphaela Schenk, Ali Farnoud, Jakob H. Macke, Daniel Gedon

Main category: cs.LG

TL;DR: ModelSMC: A probabilistic inference approach to mechanistic model discovery using Sequential Monte Carlo with LLMs for proposing and refining candidate models.

DetailsMotivation: Existing LLM-based model discovery approaches use hand-crafted heuristic procedures without explicit probabilistic formulation, lacking a unified framework for model proposal, refinement, and selection.

Method: Recast model discovery as probabilistic inference (sampling from unknown distribution over mechanistic models). Introduce ModelSMC using Sequential Monte Carlo sampling with LLMs to propose/refine candidate models represented as particles, weighted using likelihood-based criteria.

Result: Experiments on real-world scientific systems show ModelSMC discovers models with interpretable mechanisms and improves posterior predictive checks compared to heuristic approaches.

Conclusion: Probabilistic inference perspective provides unified framework for LLM-based model discovery, enabling principled approach to mechanistic model discovery from observational data.

Abstract: Automated methods for discovering mechanistic simulator models from observational data offer a promising path toward accelerating scientific progress. Such methods often take the form of agentic-style iterative workflows that repeatedly propose and revise candidate models by imitating human discovery processes. However, existing LLM-based approaches typically implement such workflows via hand-crafted heuristic procedures, without an explicit probabilistic formulation. We recast model discovery as probabilistic inference, i.e., as sampling from an unknown distribution over mechanistic models capable of explaining the data. This perspective provides a unified way to reason about model proposal, refinement, and selection within a single inference framework. As a concrete instantiation of this view, we introduce ModelSMC, an algorithm based on Sequential Monte Carlo sampling. ModelSMC represents candidate models as particles which are iteratively proposed and refined by an LLM, and weighted using likelihood-based criteria. Experiments on real-world scientific systems illustrate that this formulation discovers models with interpretable mechanisms and improves posterior predictive checks. More broadly, this perspective provides a probabilistic lens for understanding and developing LLM-based approaches to model discovery.

[257] Explaining AutoClustering: Uncovering Meta-Feature Contribution in AutoML for Clustering

Matheus Camilo da Silva, Leonardo Arrighi, Ana Carolina Lorena, Sylvio Barbon Junior

Main category: cs.LG

TL;DR: This paper investigates explainability of meta-models in AutoClustering systems, analyzing how dataset meta-features influence clustering algorithm and hyperparameter recommendations to improve transparency and reliability.

DetailsMotivation: AutoClustering systems automate unsupervised learning but lack transparency - their recommendations are difficult to justify because the influence of dataset meta-features on algorithm/hyperparameter choices is not exposed, limiting reliability, bias diagnostics, and meta-feature engineering.

Method: 1) Review 22 existing methods and organize meta-features into structured taxonomy; 2) Apply global explainability technique (Decision Predicate Graphs) to assess feature importance; 3) Use local explainability tools (SHAP) to analyze specific clustering decisions.

Result: Findings highlight consistent patterns in meta-feature relevance, identify structural weaknesses in current meta-learning strategies that can distort recommendations, and provide actionable guidance for more interpretable AutoML design.

Conclusion: The study offers practical foundation for increasing decision transparency in unsupervised learning automation by making meta-model reasoning more interpretable and exposing feature influences.

Abstract: AutoClustering methods aim to automate unsupervised learning tasks, including algorithm selection (AS), hyperparameter optimization (HPO), and pipeline synthesis (PS), by often leveraging meta-learning over dataset meta-features. While these systems often achieve strong performance, their recommendations are often difficult to justify: the influence of dataset meta-features on algorithm and hyperparameter choices is typically not exposed, limiting reliability, bias diagnostics, and efficient meta-feature engineering. This limits reliability and diagnostic insight for further improvements. In this work, we investigate the explainability of the meta-models in AutoClustering. We first review 22 existing methods and organize their meta-features into a structured taxonomy. We then apply a global explainability technique (i.e., Decision Predicate Graphs) to assess feature importance within meta-models from selected frameworks. Finally, we use local explainability tools such as SHAP (SHapley Additive exPlanations) to analyse specific clustering decisions. Our findings highlight consistent patterns in meta-feature relevance, identify structural weaknesses in current meta-learning strategies that can distort recommendations, and provide actionable guidance for more interpretable Automated Machine Learning (AutoML) design. This study therefore offers a practical foundation for increasing decision transparency in unsupervised learning automation.

[258] Visual Planning: Let’s Think Only with Images

Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić

Main category: cs.LG

TL;DR: Visual Planning paradigm uses sequences of images for reasoning instead of text, showing improved performance in visual navigation tasks through reinforcement learning with GRPO.

DetailsMotivation: Current multimodal LLMs rely primarily on text for reasoning even when visual information is present, which may not be optimal for tasks involving spatial/geometrical information where visual reasoning could be more natural and effective.

Method: Proposes Visual Planning paradigm using purely visual representations for reasoning, implemented via VPRL (Visual Planning via Reinforcement Learning) framework empowered by GRPO for post-training large vision models.

Result: Visual Planning outperforms text-only reasoning variants in representative visual navigation tasks (FrozenLake, Maze, MiniBehavior), establishing it as a viable supplement to language-based reasoning.

Conclusion: Visual Planning is a promising supplement to language-based reasoning, opening new avenues for tasks benefiting from intuitive, image-based inference, particularly for vision-first tasks.

Abstract: Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these “vision-first” tasks, as a supplementary channel to language-based reasoning. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising supplement to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.

[259] PRISM-FCP: Byzantine-Resilient Federated Conformal Prediction via Partial Sharing

Ehsan Lari, Reza Arablouei, Stefan Werner

Main category: cs.LG

TL;DR: PRISM-FCP is a Byzantine-resilient federated conformal prediction framework that uses partial model sharing and statistical margins to protect against adversarial attacks during both training and calibration stages.

DetailsMotivation: Existing federated conformal prediction approaches only address adversarial behavior during calibration, leaving the learned model vulnerable to poisoned updates during training. There's a need for end-to-end Byzantine resilience in federated uncertainty quantification.

Method: Uses partial model sharing where clients transmit only M of D parameters per round to attenuate adversary perturbation. During calibration, converts nonconformity scores to characterization vectors, computes distance-based maliciousness scores, and downweights/filters suspected Byzantine contributions before estimating conformal quantile.

Result: Maintains nominal coverage guarantees under Byzantine attacks while avoiding interval inflation observed in standard FCP, with reduced communication overhead. Experiments on synthetic data and UCI Superconductivity dataset demonstrate effectiveness.

Conclusion: PRISM-FCP provides a robust and communication-efficient approach to federated uncertainty quantification with end-to-end Byzantine resilience.

Abstract: We propose PRISM-FCP (Partial shaRing and robust calIbration with Statistical Margins for Federated Conformal Prediction), a Byzantine-resilient federated conformal prediction framework that utilizes partial model sharing to improve robustness against Byzantine attacks during both model training and conformal calibration. Existing approaches address adversarial behavior only in the calibration stage, leaving the learned model susceptible to poisoned updates. In contrast, PRISM-FCP mitigates attacks end-to-end. During training, clients partially share updates by transmitting only $M$ of $D$ parameters per round. This attenuates the expected energy of an adversary’s perturbation in the aggregated update by a factor of $M/D$, yielding lower mean-square error (MSE) and tighter prediction intervals. During calibration, clients convert nonconformity scores into characterization vectors, compute distance-based maliciousness scores, and downweight or filter suspected Byzantine contributions before estimating the conformal quantile. Extensive experiments on both synthetic data and the UCI Superconductivity dataset demonstrate that PRISM-FCP maintains nominal coverage guarantees under Byzantine attacks while avoiding the interval inflation observed in standard FCP with reduced communication, providing a robust and communication-efficient approach to federated uncertainty quantification.

[260] Scientific Knowledge-Guided Machine Learning for Vessel Power Prediction: A Comparative Study

Orfeas Bourchas, George Papalambrou

Main category: cs.LG

TL;DR: Hybrid physics-data model for vessel engine power prediction combines propeller law baseline with ML residual learning for better generalization and physical consistency.

DetailsMotivation: Conventional ML methods for vessel power prediction often fail to respect fundamental physics (propeller law), leading to poor extrapolation outside training data. Need models that combine data-driven learning with physical constraints for better generalization.

Method: Hybrid framework with physics-based baseline (P = cV^n from propeller law) plus ML regressor for residual power corrections. Compared XGBoost, simple NN, and Physics-Informed Neural Network (PINN) with baseline vs. pure data-driven versions.

Result: Hybrid models consistently outperformed pure data-driven baselines in sparse data regions while maintaining similar performance in populated areas. Provides better generalization and physical consistency.

Conclusion: Physics-data hybrid approach improves vessel power prediction by constraining ML to residual corrections, ensuring physical consistency while maintaining data-driven flexibility. Practical for vessel performance monitoring applications.

Abstract: Accurate prediction of main engine power is essential for vessel performance optimization, fuel efficiency, and compliance with emission regulations. Conventional machine learning approaches, such as Support Vector Machines, variants of Artificial Neural Networks (ANNs), and tree-based methods like Random Forests, Extra Tree Regressors, and XGBoost, can capture nonlinearities but often struggle to respect the fundamental propeller law relationship between power and speed, resulting in poor extrapolation outside the training envelope. This study introduces a hybrid modeling framework that integrates physics-based knowledge from sea trials with data-driven residual learning. The baseline component, derived from calm-water power curves of the form $P = cV^n$, captures the dominant power-speed dependence, while another, nonlinear, regressor is then trained to predict the residual power, representing deviations caused by environmental and operational conditions. By constraining the machine learning task to residual corrections, the hybrid model simplifies learning, improves generalization, and ensures consistency with the underlying physics. In this study, an XGBoost, a simple Neural Network, and a Physics-Informed Neural Network (PINN) coupled with the baseline component were compared to identical models without the baseline component. Validation on in-service data demonstrates that the hybrid model consistently outperformed a pure data-driven baseline in sparse data regions while maintaining similar performance in populated ones. The proposed framework provides a practical and computationally efficient tool for vessel performance monitoring, with applications in weather routing, trim optimization, and energy efficiency planning.

[261] Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

Xinting Huang, Michael Hahn

Main category: cs.LG

TL;DR: NDM method finds interpretable subspaces in neural representations through unsupervised neighbor distance minimization, revealing organized encoding of different input aspects similar to model “variables”

DetailsMotivation: To understand how neural models internally organize and encode different aspects of inputs in separate subspaces, and whether these "natural" subspaces can be found in an unsupervised manner

Method: Neighbor Distance Minimization (NDM) - learns non-basis-aligned subspaces unsupervisedly by minimizing distances between neighbors in representation space

Result: Subspaces are interpretable and encode shared abstract concepts across inputs, resembling model “variables”; quantitative experiments show strong connection between subspaces and known circuits in GPT-2; scales to 2B models finding separate subspaces for context vs parametric knowledge routing

Conclusion: NDM provides a new perspective for understanding model internals and building circuits through unsupervised discovery of interpretable subspaces in neural representations

Abstract: Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these natural'' subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to variables’’ used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing. Viewed more broadly, our findings offer a new perspective on understanding model internals and building circuits.

[262] Assigning Confidence: K-partition Ensembles

Aggelos Semoglou, John Pavlopoulos

Main category: cs.LG

TL;DR: CAKE framework quantifies pointwise confidence in clustering assignments by combining assignment stability and geometric fit consistency from ensemble clustering.

DetailsMotivation: Traditional clustering diagnostics only assess global quality but fail to indicate confidence in individual assignments, especially for initialization-sensitive algorithms like k-means. This assignment-level instability undermines accuracy and robustness, while existing ensemble approaches lack tools for quantifying pointwise confidence that combines cross-run agreement with geometric support.

Method: CAKE (Confidence in Assignments via K-partition Ensembles) evaluates each point using two complementary statistics computed over a clustering ensemble: assignment stability (cross-run agreement) and consistency of local geometric fit. These are combined into a single interpretable score in [0,1].

Result: Theoretical analysis shows CAKE remains effective under noise and separates stable from unstable points. Experiments on synthetic and real-world datasets indicate CAKE effectively highlights ambiguous points and stable core members, providing a confidence ranking that can guide filtering or prioritization to improve clustering quality.

Conclusion: CAKE provides a principled framework for quantifying pointwise confidence in clustering assignments, addressing the limitation of traditional global diagnostics and enabling better filtering or prioritization of ambiguous points to improve clustering robustness.

Abstract: Clustering is widely used for unsupervised structure discovery, yet it offers limited insight into how reliable each individual assignment is. Diagnostics, such as convergence behavior or objective values, may reflect global quality, but they do not indicate whether particular instances are assigned confidently, especially for initialization-sensitive algorithms like k-means. This assignment-level instability can undermine both accuracy and robustness. Ensemble approaches improve global consistency by aggregating multiple runs, but they typically lack tools for quantifying pointwise confidence in a way that combines cross-run agreement with geometric support from the learned cluster structure. We introduce CAKE (Confidence in Assignments via K-partition Ensembles), a framework that evaluates each point using two complementary statistics computed over a clustering ensemble: assignment stability and consistency of local geometric fit. These are combined into a single, interpretable score in [0,1]. Our theoretical analysis shows that CAKE remains effective under noise and separates stable from unstable points. Experiments on synthetic and real-world datasets indicate that CAKE effectively highlights ambiguous points and stable core members, providing a confidence ranking that can guide filtering or prioritization to improve clustering quality.

[263] Classification errors distort findings in automated speech processing: examples and solutions from child-development research

Lucas Gautheron, Evan Kidd, Anton Malko, Marvin Lavechin, Alejandrina Cristia

Main category: cs.LG

TL;DR: Bayesian analysis of how speech classification errors in child language recordings affect statistical inferences and a calibration method to mitigate these effects

DetailsMotivation: While automated classifiers for child audio data are widely used, little research examines how classification errors affect downstream statistical inferences like correlations and effect sizes in language acquisition studies

Method: Uses Bayesian approach to model joint distribution of speech behavior and algorithm errors, analyzes effects on key scientific questions using real and simulated data from LENA and ACLEW Voice Type Classifier

Result: Classification errors significantly distort estimates for both LENA and ACLEW systems; Bayesian calibration can help recover unbiased effect sizes but isn’t foolproof

Conclusion: Researchers should consider downstream effects of classification errors and Bayesian calibration offers a promising though imperfect solution for improving measurement validity

Abstract: With the advent of wearable recorders, scientists are increasingly turning to automated methods of analysis of audio and video data in order to measure children’s experience, behavior, and outcomes, with a sizable literature employing long-form audio-recordings to study language acquisition. While numerous articles report on the accuracy and reliability of the most popular automated classifiers, less has been written on the downstream effects of classification errors on measurements and statistical inferences (e.g., the estimate of correlations and effect sizes in regressions). This paper’s main contributions are drawing attention to downstream effects of confusion errors, and providing an approach to measure and potentially recover from these errors. Specifically, we use a Bayesian approach to study the effects of algorithmic errors on key scientific questions, including the effect of siblings on children’s language experience and the association between children’s production and their input. By fitting a joint model of speech behavior and algorithm behavior on real and simulated data, we show that classification errors can significantly distort estimates for both the most commonly used \gls{lena}, and a slightly more accurate open-source alternative (the Voice Type Classifier from the ACLEW system). We further show that a Bayesian calibration approach for recovering unbiased estimates of effect sizes can be effective and insightful, but does not provide a fool-proof solution.

[264] CDLM: Consistency Diffusion Language Models For Faster Sampling

Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami

Main category: cs.LG

TL;DR: CDLM accelerates diffusion language models via consistency modeling for multi-token generation and block-wise causal attention for KV caching compatibility, achieving 3.6x-14.5x speedup while maintaining accuracy on math/coding tasks.

DetailsMotivation: Diffusion language models offer parallel generation but suffer from slow inference due to many refinement steps and inability to use standard KV caching, creating bottlenecks for practical deployment.

Method: CDLM integrates consistency modeling to reduce sampling steps via multi-token finalization, and enforces block-wise causal attention mask during fine-tuning to enable full KV caching compatibility.

Result: Achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks compared to baseline diffusion language models.

Conclusion: CDLM effectively addresses both inference bottlenecks in diffusion language models through training-based acceleration techniques, making them more practical for real-world applications.

Abstract: Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.

[265] Group Representational Position Encoding

Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan, Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao

Main category: cs.LG

TL;DR: GRAPE is a unified framework for positional encoding using group actions, unifying multiplicative rotations (like RoPE) and additive logit biases (like ALiBi) through mathematical group theory.

DetailsMotivation: To create a principled, unified mathematical framework for positional encoding that encompasses existing approaches like RoPE and ALiBi, providing a design space for long-context models with better geometric understanding.

Method: Uses group actions theory: (1) Multiplicative GRAPE with SO(d) rotations using rank-2 skew-symmetric generators, recovering RoPE exactly; (2) Additive GRAPE with GL unipotent actions producing additive logits, recovering ALiBi and FoX exactly. Extends to learned commuting subspaces and non-commuting mixtures.

Result: Provides a unified framework that subsumes RoPE and ALiBi as special cases, offers closed-form matrix exponentials, preserves relative positional relationships, maintains streaming cacheability, and enables cross-subspace feature coupling with efficient O(d) or O(rd) cost.

Conclusion: GRAPE offers a principled design space for positional geometry in long-context models, unifying existing approaches through group theory while enabling new extensions for better feature coupling and efficiency.

Abstract: We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions. GRAPE unifies two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\operatorname{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n) = \exp(n , ω, \mathbf{L})$ with a rank-2 skew-symmetric generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes correspond to canonical coordinate pairs with a log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise from rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Overall, GRAPE provides a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project page: https://github.com/model-architectures/GRAPE.

[266] Data-Efficient Inference of Neural Fluid Fields via SciML Foundation Model

Yuqiu Liu, Jingxuan Xu, Mauricio Soroco, Yunchao Wei, Wuyang Chen

Main category: cs.LG

TL;DR: SciML foundation models pretrained on PDE simulations can reduce data requirements and improve generalization for real-world 3D fluid dynamics reconstruction from vision data.

DetailsMotivation: Current 3D vision methods for fluid field inference require dense real-world captures with specialized setups, making them costly. SciML foundation models encode rich multiphysics knowledge from simulations but their transferability to real-world vision problems is underexplored.

Method: Leverages SciML foundation models’ forecasting capabilities and representations. Introduces collaborative training strategy that equips neural fluid fields with augmented frames and fluid features extracted from the foundation model.

Result: Achieves 9-36% improvement in PSNR for future prediction while reducing required training frames by 25-50%. Shows substantial improvements in both quantitative metrics and visual quality over prior approaches.

Conclusion: SciML foundation models can significantly reduce data requirements for inferring real-world 3D fluid dynamics while improving generalization, highlighting their practical applicability for real-world fluid dynamics reconstruction.

Abstract: Recent developments in 3D vision have enabled significant progress in inferring neural fluid fields and realistic rendering of fluid dynamics. However, these methods require dense captures of real-world flows, which demand specialized laboratory setups, making the process costly and challenging. Scientific machine learning (SciML) foundation models, pretrained on extensive simulations of partial differential equations (PDEs), encode rich multiphysics knowledge and thus provide promising sources of domain priors for fluid field inference. Nevertheless, the transferability of these foundation models to real-world vision problems remains largely underexplored. In this work, we demonstrate that SciML foundation models can significantly reduce the data requirements for inferring real-world 3D fluid dynamics while improving generalization. Our method leverages the strong forecasting capabilities and meaningful representations learned by SciML foundation models. We introduce a novel collaborative training strategy that equips neural fluid fields with augmented frames and fluid features extracted from the foundation model. Extensive experiments show substantial improvements in both quantitative metrics and visual quality over prior approaches. In particular, our method achieves a 9-36% improvement in peak signal-to-noise ratio (PSNR) for future prediction while reducing the number of required training frames by 25-50%. These results highlight the practical applicability of SciML foundation models for real-world fluid dynamics reconstruction. Our code is available at: https://github.com/delta-lab-ai/SciML-HY.

[267] Soft-CAM: Making black box models self-explainable for medical image analysis

Kerol Djoumessi, Philipp Berens

Main category: cs.LG

TL;DR: SoftCAM makes CNNs inherently interpretable by removing global average pooling and using convolution-based class evidence layers to produce explicit class activation maps as predictions.

DetailsMotivation: Current CNN explanation methods are post-hoc, unreliable, and don't reflect true model reasoning, limiting trustworthiness in critical applications like medicine where interpretability is essential.

Method: Removes global average pooling layer and replaces fully connected classification layer with convolution-based class evidence layer to preserve spatial information and produce explicit class activation maps that form the basis of predictions.

Result: Evaluated on three medical datasets, maintains classification performance while significantly improving both qualitative and quantitative explanations compared to existing post-hoc methods.

Conclusion: CNNs can be inherently interpretable without compromising performance, advancing self-explainable deep learning for high-stakes decision-making.

Abstract: Convolutional neural networks (CNNs) are widely used for high-stakes applications like medicine, often surpassing human performance. However, most explanation methods rely on post-hoc attribution, approximating the decision-making process of already trained black-box models. These methods are often sensitive, unreliable, and fail to reflect true model reasoning, limiting their trustworthiness in critical applications. In this work, we introduce SoftCAM, a straightforward yet effective approach that makes standard CNN architectures inherently interpretable. By removing the global average pooling layer and replacing the fully connected classification layer with a convolution-based class evidence layer, SoftCAM preserves spatial information and produces explicit class activation maps that form the basis of the model’s predictions. Evaluated on three medical datasets, SoftCAM maintains classification performance while significantly improving both the qualitative and quantitative explanation compared to existing post-hoc methods. Our results demonstrate that CNNs can be inherently interpretable without compromising performance, advancing the development of self-explainable deep learning for high-stakes decision-making. The code is available at https://github.com/kdjoumessi/SoftCAM

[268] Anatomy of Capability Emergence: Scale-Invariant Representation Collapse and Top-Down Reorganization in Neural Networks

Jayadev Billa

Main category: cs.LG

TL;DR: Study analyzes geometric measures during neural network training to understand emergence mechanisms, finding universal representation collapse patterns, top-down propagation, and geometric hierarchy where representation geometry precedes emergence.

DetailsMotivation: To understand the mechanistic opacity of capability emergence during neural network training by examining geometric patterns across model scales, tasks, and language models.

Method: Tracked five geometric measures across five model scales (405K-85M parameters), analyzed 120+ emergence events in eight algorithmic tasks, and examined three Pythia language models (160M-2.8B). Investigated representation collapse patterns, propagation direction, and precursor relationships.

Result: Found: (1) universal representation collapse to task-specific scale-invariant floors; (2) top-down collapse propagation contradicting bottom-up intuition; (3) geometric hierarchy where representation geometry leads emergence (75-100% precursor rate for hard tasks). Prediction limits: geometric measures encode coarse task difficulty but not fine-grained timing.

Conclusion: Provides geometric anatomy of emergence and its boundary conditions, showing geometric patterns replicate in naturalistic pre-training but per-task precursor signals require task-training alignment.

Abstract: Capability emergence during neural network training remains mechanistically opaque. We track five geometric measures across five model scales (405K-85M parameters), 120+ emergence events in eight algorithmic tasks, and three Pythia language models (160M-2.8B). We find: (1) training begins with a universal representation collapse to task-specific floors that are scale-invariant across a 210X parameter range (e.g., modular arithmetic collapses to RANKME $\approx$ 2.0 regardless of model size); (2) collapse propagates top-down through layers (32/32 task X model consistency), contradicting bottom-up feature-building intuition; (3) a geometric hierarchy in which representation geometry leads emergence (75-100% precursor rate for hard tasks), while the local learning coefficient is synchronous (0/24 precursor) and Hessian measures lag. We also delineate prediction limits: geometric measures encode coarse task difficulty but not fine-grained timing (within-class concordance 27%; when task ordering reverses across scales, prediction fails at 26%). On Pythia, global geometric patterns replicate but per-task precursor signals do not – the precursor relationship requires task-training alignment that naturalistic pre-training does not provide. Our contribution is the geometric anatomy of emergence and its boundary conditions, not a prediction tool.

[269] Learning to Weight Parameters for Training Data Attribution

Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann

Main category: cs.LG

TL;DR: Proposes learning explicit parameter importance weights from data for gradient-based data attribution, improving accuracy across image classification, language modeling, and diffusion tasks.

DetailsMotivation: Existing gradient-based data attribution methods either treat network parameters uniformly or rely on implicit Hessian approximations, failing to capture functional heterogeneity of network parameters.

Method: Learn explicit parameter importance weights directly from data without requiring annotated labels, enabling better modeling of parameter functional heterogeneity.

Result: Improves attribution accuracy across diverse tasks including image classification, language modeling, and diffusion; enables fine-grained attribution for concepts like subject and style.

Conclusion: Explicitly learning parameter importance weights from data addresses limitations of existing methods and enables more accurate and fine-grained data attribution.

Abstract: We study gradient-based data attribution, aiming to identify which training examples most influence a given output. Existing methods for this task either treat network parameters uniformly or rely on implicit weighting derived from Hessian approximations, which do not fully model functional heterogeneity of network parameters. To address this, we propose a method to explicitly learn parameter importance weights directly from data, without requiring annotated labels. Our approach improves attribution accuracy across diverse tasks, including image classification, language modeling, and diffusion, and enables fine-grained attribution for concepts like subject and style.

[270] Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Fengyuan Liu, Marco Ciccone, Angelo Porrello, Simone Calderara

Main category: cs.LG

TL;DR: GradFix enables efficient transfer of task vectors between different foundation model versions by aligning gradient-sign structures, requiring only a few labeled samples without additional fine-tuning.

DetailsMotivation: When new foundation model versions are released, practitioners typically need to repeat fine-tuning even for previously solved tasks. Task vectors (parameter changes from fine-tuning) often fail to transfer across different pre-trained models due to misaligned parameter spaces.

Method: GradFix approximates the ideal gradient-sign structure of the target model using only a handful of labeled samples. It computes a few target-model gradients without parameter updates and masks the source task vector accordingly, creating an update locally aligned with the target loss landscape.

Result: The method demonstrates significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning. It also improves multi-task and multi-source model merging.

Conclusion: GradFix provides an efficient way to transfer knowledge between different foundation model versions by aligning gradient-sign structures, requiring minimal computation and no additional fine-tuning while ensuring first-order descent guarantees.

Abstract: When a new release of a foundation model is published, practitioners typically need to repeat fine-tuning, even if the same task was already tackled in the previous version. A promising alternative is to reuse the parameter changes (i.e., task vectors) that capture how a model adapts to a specific task. However, these vectors often fail to transfer across different pre-trained models because their parameter spaces are misaligned. In this work, we show that successful transfer depends strongly on the gradient-sign structure of the new model. Based on this insight, we propose GradFix, which approximates the ideal sign structure and leverages it to transfer knowledge using only a handful of labeled samples. Notably, this requires no additional fine-tuning: we only compute a few target-model gradients without parameter updates and mask the source task vector accordingly. This yields an update that is locally aligned with the target loss landscape, effectively rebasing the task vector onto the new pre-training. We provide a theoretical guarantee that our method ensures first-order descent. Empirically, we demonstrate significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning. We further show that transporting task vectors improves multi-task and multi-source model merging. Code is available at https://github.com/fillo-rinaldi/GradFix.

[271] Temporal Pair Consistency for Variance-Reduced Flow Matching

Chika Maduabuchi, Jindong Wang

Main category: cs.LG

TL;DR: TPC is a variance-reduction technique for continuous-time generative models that couples velocity predictions at paired timesteps to improve training efficiency and sampling quality without architectural changes.

DetailsMotivation: Current continuous-time generative models (diffusion, flow matching, rectified flow) suffer from high estimator variance and inefficient sampling due to independent timestep training objectives. Existing solutions require explicit smoothness penalties, trajectory regularization, or modified probability paths/solvers.

Method: Temporal Pair Consistency (TPC) introduces a lightweight variance-reduction principle that couples velocity predictions at paired timesteps along the same probability path. It operates at the estimator level without modifying model architecture, probability path, or solver, and induces quadratic trajectory-coupled regularization.

Result: TPC improves sample quality and efficiency across CIFAR-10 and ImageNet at multiple resolutions, achieving lower FID at identical or lower computational cost than prior methods. It extends seamlessly to modern SOTA-style pipelines with noise-augmented training, score-based denoising, and rectified flow.

Conclusion: TPC provides an effective variance-reduction approach for continuous-time generative models that improves training efficiency and sampling quality while maintaining compatibility with existing architectures and methods.

Abstract: Continuous-time generative models, such as diffusion models, flow matching, and rectified flow, learn time-dependent vector fields but are typically trained with objectives that treat timesteps independently, leading to high estimator variance and inefficient sampling. Prior approaches mitigate this via explicit smoothness penalties, trajectory regularization, or modified probability paths and solvers. We introduce Temporal Pair Consistency (TPC), a lightweight variance-reduction principle that couples velocity predictions at paired timesteps along the same probability path, operating entirely at the estimator level without modifying the model architecture, probability path, or solver. We provide a theoretical analysis showing that TPC induces a quadratic, trajectory-coupled regularization that provably reduces gradient variance while preserving the underlying flow-matching objective. Instantiated within flow matching, TPC improves sample quality and efficiency across CIFAR-10 and ImageNet at multiple resolutions, achieving lower FID at identical or lower computational cost than prior methods, and extends seamlessly to modern SOTA-style pipelines with noise-augmented training, score-based denoising, and rectified flow.

[272] Expressiveness of Multi-Neuron Convex Relaxations in Neural Network Certification

Yuhao Mao, Yani Zhang, Martin Vechev

Main category: cs.LG

TL;DR: Multi-neuron relaxations for neural network certification are inherently incomplete but offer theoretical advantages over single-neuron relaxations, establishing a universal convex barrier.

DetailsMotivation: Current neural network certification methods rely on convex relaxations that are imprecise due to the single-neuron convex barrier. Multi-neuron relaxations have been heuristically applied, but it's unclear if they overcome this barrier or offer theoretical advantages.

Method: Rigorous analysis of multi-neuron relaxation expressiveness, showing they are inherently incomplete even with sufficient resources. Investigation of completeness through network augmentation with polynomial number of ReLU neurons or input domain partitioning into convex sub-polytopes.

Result: Multi-neuron relaxations are incomplete (universal convex barrier), but can achieve completeness through network augmentation or domain partitioning, distinguishing them from single-neuron relaxations which cannot realize augmentation and have worse partition complexity.

Conclusion: Establishes foundation for multi-neuron relaxations and points to new directions for certified robustness, including training methods tailored to multi-neuron relaxations and verification methods using them as main subroutines.

Abstract: Neural network certification methods heavily rely on convex relaxations to provide robustness guarantees. However, these relaxations are often imprecise: even the most accurate single-neuron relaxation is incomplete for general ReLU networks, a limitation known as the single-neuron convex barrier. While multi-neuron relaxations have been heuristically applied to address this issue, two central questions arise: (i) whether they overcome the convex barrier, and if not, (ii) whether they offer theoretical capabilities beyond those of single-neuron relaxations. In this work, we present the first rigorous analysis of the expressiveness of multi-neuron relaxations. Perhaps surprisingly, we show that they are inherently incomplete, even when allocated sufficient resources to capture finitely many neurons and layers optimally. This result extends the single-neuron barrier to a universal convex barrier for neural network certification. On the positive side, we show that completeness can be achieved by either (i) augmenting the network with a polynomial number of carefully designed ReLU neurons or (ii) partitioning the input domain into convex sub-polytopes, thereby distinguishing multi-neuron relaxations from single-neuron ones which are unable to realize the former and have worse partition complexity for the latter. Our findings establish a foundation for multi-neuron relaxations and point to new directions for certified robustness, including training methods tailored to multi-neuron relaxations and verification methods with multi-neuron relaxations as the main subroutine.

[273] Beyond Simple Graphs: Neural Multi-Objective Routing on Multigraphs

Filip Rydin, Attila Lischka, Jiaming Wu, Morteza Haghir Chehreghani, Balázs Kulcsár

Main category: cs.LG

TL;DR: Two GNN-based methods for multi-objective routing on multigraphs: one operates directly on multigraphs via autoregressive edge selection, while a more scalable version first prunes the multigraph then routes on the simplified graph.

DetailsMotivation: Existing learning-based routing methods are unsuitable for multigraphs (graphs with multiple edges between node pairs), despite their strong relevance in real-world scenarios where multiple connections with different attributes exist between locations.

Method: 1) Direct multigraph approach: Uses GNNs to operate directly on multigraphs, autoregressively selecting edges until tour completion. 2) Scalable approach: First simplifies multigraph via learned pruning strategy, then performs autoregressive routing on resulting simple graph.

Result: Both models show competitive performance across wide range of problems and graph distributions compared to strong heuristics and neural baselines.

Conclusion: Proposed GNN-based methods effectively address multi-objective routing on multigraphs, with the scalable pruning-based approach offering practical advantages for larger problems.

Abstract: Learning-based methods for routing have gained significant attention in recent years, both in single-objective and multi-objective contexts. Yet, existing methods are unsuitable for routing on multigraphs, which feature multiple edges with distinct attributes between node pairs, despite their strong relevance in real-world scenarios. In this paper, we propose two graph neural network-based methods to address multi-objective routing on multigraphs. Our first approach operates directly on the multigraph by autoregressively selecting edges until a tour is completed. The second model, which is more scalable, first simplifies the multigraph via a learned pruning strategy and then performs autoregressive routing on the resulting simple graph. We evaluate both models empirically, across a wide range of problems and graph distributions, and demonstrate their competitive performance compared to strong heuristics and neural baselines.

[274] Individualized and Interpretable Sleep Forecasting via a Two-Stage Adaptive Spatial-Temporal Model

Xueyi Wang, Claudine J. C. Lamoth, Elisabeth Wilhelm

Main category: cs.LG

TL;DR: Interpretable adaptive spatial-temporal model for personalized sleep quality prediction using multi-resolution temporal patterns and attention mechanisms

DetailsMotivation: Sleep quality impacts well-being, and there's a need for accessible, reliable forecasting tools for preventive interventions in healthcare

Method: Hierarchical architecture with parallel 1D convolutions (varying kernel sizes + dilation) for multi-resolution temporal patterns, channel attention for feature emphasis, bidirectional LSTM + self-attention for sequential dynamics, and two-stage adaptation for user transfer

Result: Outperformed LSTM, Informer, PatchTST, and TimesNet baselines; best performance with 3-day input/1-day prediction window (RMSE 0.216); good performance for longer horizons (3-day prediction RMSE 0.257)

Conclusion: Framework offers robust, adaptive, explainable solution for personalized sleep forecasting using sparse wearable device data

Abstract: Sleep quality impacts well-being. Therefore, healthcare providers and individuals need accessible and reliable forecasting tools for preventive interventions. This paper introduces an interpretable, individualized adaptive spatial-temporal model for predicting sleep quality. We designed a hierarchical architecture, consisting of parallel 1D convolutions with varying kernel sizes and dilated convolution, which extracts multi-resolution temporal patterns-short kernels capture rapid physiological changes, while larger kernels and dilation model slower trends. The extracted features are then refined through channel attention, which learns to emphasize the most predictive variables for each individual, followed by bidirectional LSTM and self-attention that jointly model both local sequential dynamics and global temporal dependencies. Finally, a two-stage adaptation strategy ensures the learned representations transfer effectively to new users. We conducted various experiments with five input window sizes (3, 5, 7, 9, and 11 days) and five prediction window sizes (1, 3, 5, 7, and 9 days). Our model consistently outperformed time series forecasting baseline approaches, including LSTM, Informer, PatchTST, and TimesNet. The best performance was achieved with a three-day input window and a one-day prediction window, yielding an RMSE of 0.216. Furthermore, the model demonstrated good predictive performance even for longer forecasting horizons (e.g., with a 0.257 RMSE for a three-day prediction window), highlighting its practical utility for real-world applications. We also conducted an explainability analysis to examine how different features influence sleep quality. These findings proved that the proposed framework offers a robust, adaptive, and explainable solution for personalized sleep forecasting using sparse data from commercial wearable devices.

[275] GRPO is Secretly a Process Reward Model

Michael Sullivan, Alexander Koller

Main category: cs.LG

TL;DR: GRPO with outcome reward models is theoretically equivalent to PRM-aware RL with Monte Carlo PRMs, revealing hidden PRM structure that can be leveraged to improve LLM reasoning performance.

DetailsMotivation: To bridge the gap between process reward models (PRMs) and outcome reward models (ORMs) in reinforcement learning, showing that GRPO with ORMs actually has hidden PRM structure that can be exploited for better performance.

Method: Theoretical analysis proving GRPO with ORMs is equivalent to PRM-aware RL with Monte Carlo PRMs, identification of flaws in GRPO objective related to imbalanced process steps, and proposal of λ-GRPO modification to mitigate these issues.

Result: LLMs tuned with λ-GRPO outperform those tuned with standard GRPO on downstream reasoning tasks, reaching peak performance more rapidly with negligible impact on training time and cost.

Conclusion: The hidden PRM structure within vanilla GRPO can be leveraged to boost model performance without explicit PRMs, providing a practical improvement to RL fine-tuning of language models.

Abstract: Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome reward models (ORMs), which assign a single reward to an entire trajectory. However, we provide theoretical proof in this work that the Group Relative Policy Optimization (GRPO) RL algorithm equipped with an ORM is in fact equivalent to a PRM-aware RL objective equipped with a non-trivial, Monte-Carlo-based PRM (given mild assumptions). Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective that interacts with imbalanced process steps and rewards to hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($λ$-GRPO), and show that LLMs tuned with $λ$-GRPO outperform LLMs tuned with standard GRPO on downstream reasoning tasks\textemdash and reach peak performance more rapidly. These results show that we can leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance without employing an explicit PRM, and with a negligible impact on training time and cost.

[276] Physics-informed GNN for medium-high voltage AC power flow with edge-aware attention and line search correction operator

Changhun Kim, Timon Conrad, Redwanul Karim, Julian Oelhaf, David Riebesel, Tomás Arias-Vergara, Andreas Maier, Johann Jäger, Siming Bayer

Main category: cs.LG

TL;DR: PIGNN-Attn-LS combines edge-aware attention with backtracking line search to create a physics-informed graph neural network for AC power flow solving that outperforms baselines in accuracy while maintaining speed advantages over traditional Newton-Raphson methods.

DetailsMotivation: Current physics-informed graph neural networks (PIGNNs) for AC power-flow solving need accuracy improvements while maintaining speed advantages over Newton-Raphson solvers, especially for operational adoption where inference-time physics constraints are important.

Method: Combines edge-aware attention mechanism with per-edge biases to encode line physics as a differentiable known-operator layer, plus backtracking line-search-based globalized correction operator to restore operative decrease criterion at inference.

Result: Achieves test RMSE of 0.00033 p.u. in voltage and 0.08 deg in angle on 4-32-bus grids, outperforming PIGNN-MLP baseline by 99.5% and 87.1% respectively, with 2-5x faster batched inference than Newton-Raphson on 4-1024-bus grids.

Conclusion: PIGNN-Attn-LS provides an accurate and fast AC power-flow solver that maintains physics constraints at inference time, making it suitable for operational adoption in power systems.

Abstract: Physics-informed graph neural networks (PIGNNs) have emerged as fast AC power-flow solvers that can replace the classic NewtonRaphson (NR) solvers, especially when thousands of scenarios must be evaluated. However, current PIGNNs still need accuracy improvements at parity speed; in particular, the soft constraint on the physics loss is inoperative at inference, which can deter operational adoption. We address this with PIGNN-Attn-LS, combining an edge-aware attention mechanism that explicitly encodes line physics via per-edge biases to form a fully differentiable knownoperator layer inside the computation graph, with a backtracking line-search-based globalized correction operator that restores an operative decrease criterion at inference. Training and testing use a realistic High-/Medium-Voltage scenario generator, with NR used only to construct reference states. On held-out HV cases consisting of 4-32-bus grids, PIGNN-Attn-LS achieves a test RMSE of 0.00033 p.u. in voltage and 0.08 deg in angle, outperforming the PIGNN-MLP baseline by 99.5% and 87.1%, respectively. With streaming micro-batches, it delivers 2-5x faster batched inference than NR on 4-1024-bus grids.

[277] Toward a Holistic Approach to Continual Model Merging

Hoang Phan, Sungmin Cha, Tung Lam Tran, Qi Lei

Main category: cs.LG

TL;DR: CMM is a continual learning framework that merges models at three stages (pre, during, post) to address catastrophic forgetting without old data access, using tangent space fine-tuning, optimizer state information, and representation alignment.

DetailsMotivation: Address scalability issues in continual learning where conventional approaches either maintain growing task vectors (scalability problems) or rely solely on weight-space merging (loses functional information) when old data is inaccessible.

Method: Three-stage intervention: 1) Pre-merging: fine-tune main model in tangent space to amplify weight disentanglement; 2) During merging: leverage functional information from optimizer states beyond parameter averages; 3) Post-merging: correct representation discrepancy between pre- and post-merged models.

Result: Achieves competitive performance on standard class-incremental and domain-incremental benchmarks while operating under constant memory constraints without accessing historical data.

Conclusion: Provides a scalable and efficient solution to catastrophic forgetting by intervening at critical stages of model merging, overcoming limitations of conventional continual learning approaches.

Abstract: We present a holistic framework for Continual Model Merging (CMM) that intervenes at three critical stages: pre-merging, during merging, and post-merging-to address two fundamental challenges in continual learning. In particular, conventional approaches either maintain a growing list of per-domain task vectors, leading to scalability issues or rely solely on weight-space merging when old data is inaccessible, thereby losing crucial functional information. Our method overcomes these limitations by first fine-tuning the main model within its tangent space on domain-specific data; this linearization amplifies per-task weight disentanglement, effectively mitigating across-task interference. During merging, we leverage functional information from available optimizer states beyond mere parameter averages to avoid the need to revisit old data. Finally, a post-merging correction aligns the representation discrepancy between pre- and post-merged models, reducing bias and enhancing overall performance-all while operating under constant memory constraints without accessing historical data. Extensive experiments on standard class-incremental and domain-incremental benchmarks demonstrate that our approach not only achieves competitive performance but also provides a scalable and efficient solution to the catastrophic forgetting problem.

[278] Study of Training Dynamics for Memory-Constrained Fine-Tuning

Aël Quélennec, Nour Hezbri, Pavlo Mozharovskyi, Van-Tam Nguyen, Enzo Tartaglione

Main category: cs.LG

TL;DR: TraDy is a memory-efficient transfer learning method that uses dynamic stochastic channel selection within preselected layers to reduce computational costs while maintaining performance.

DetailsMotivation: As deep neural networks grow larger, memory-efficient training becomes crucial due to resource constraints in deployment environments. There's a need for transfer learning methods that can maintain performance while significantly reducing memory and computational requirements.

Method: TraDy leverages two key insights: 1) layer importance for updates is architecture-dependent and can be determined a priori, and 2) dynamic stochastic channel selection provides better gradient approximation than static approaches. The method uses dynamic channel selection that stochastically resamples channels between epochs within preselected layers.

Result: TraDy achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints. It achieves up to 99% activation sparsity, 95% weight derivative sparsity, and 97% reduction in FLOPs for weight derivative computation.

Conclusion: TraDy provides an effective memory-efficient transfer learning approach that significantly reduces computational costs while maintaining model performance, addressing the growing challenge of training large models under resource constraints.

Abstract: Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme leveraging two key insights: layer importance for updates is architecture-dependent and determinable a priori, while dynamic stochastic channel selection provides superior gradient approximation compared to static approaches. We introduce a dynamic channel selection approach that stochastically resamples channels between epochs within preselected layers. Extensive experiments demonstrate TraDy achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints, achieving up to 99% activation sparsity, 95% weight derivative sparsity, and 97% reduction in FLOPs for weight derivative computation.

[279] FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels

Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailing Guan, Peihao Wu, Chunbo Dai, Liang Xiao, Bin Dong

Main category: cs.LG

TL;DR: FATE benchmark series for formal algebra theorem proving, spanning undergraduate to PhD+ difficulty, reveals LLMs struggle with formalization despite decent natural-language reasoning.

DetailsMotivation: Existing LLM theorem proving benchmarks focus on contest math (like IMO) which doesn't reflect the depth, breadth, and abstraction of modern mathematical research. Need benchmarks that bridge this gap toward research-level formal mathematical reasoning.

Method: Introduce FATE (Formal Algebra Theorem Evaluation) benchmark series with two components: FATE-H (100 problems in abstract algebra) and FATE-X (100 problems in commutative algebra). Problems range from undergraduate exercises to beyond PhD qualifying exams. Evaluate state-of-the-art LLM provers using two-stage evaluation: natural-language reasoning and formalization.

Result: Best model achieves only 3% (pass@64) accuracy on FATE-H and 0% on FATE-X. Natural-language reasoning is notably more accurate than formalization ability. Specialized provers show less effective reflection than general-purpose models at natural-language stage. Systematic classification of formalization errors.

Conclusion: FATE provides a robust, challenging benchmark establishing essential checkpoints toward research-level formal mathematical reasoning. Reveals significant gap between contest math performance and advanced mathematical reasoning capabilities.

Abstract: Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE (Formal Algebra Theorem Evaluation), a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new benchmark reveal a stark performance gap compared to contest math: the best model achieves only 3% (pass@64) accuracy on FATE-H and 0% on FATE-X. Our two-stage evaluation reveals that models’ natural-language reasoning is notably more accurate than their ability to formalize this reasoning. We systematically classify the common errors that arise during this formalization process. Furthermore, a comparative study shows that a specialized prover can exhibit less effective reflection than general-purpose models, reducing its accuracy at the natural-language stage. We believe FATE provides a robust and challenging benchmark that establishes essential checkpoints on the path toward research-level formal mathematical reasoning.

[280] Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM

Adarsh Kumarappan, Ayushi Mehrotra

Main category: cs.LG

TL;DR: The paper introduces a probabilistic framework called “(k, ε)-unstable” to improve SmoothLLM’s certification guarantees against jailbreaking attacks by relaxing the strict k-unstable assumption and providing more practical, data-informed safety certificates.

DetailsMotivation: SmoothLLM's existing certification guarantee relies on a strict "k-unstable" assumption that rarely holds in practice, limiting the trustworthiness of safety certificates. The authors aim to address this limitation by developing a more realistic probabilistic framework.

Method: Introduces a “(k, ε)-unstable” probabilistic framework that relaxes the strict k-unstable assumption. Derives a new data-informed lower bound on SmoothLLM’s defense probability by incorporating empirical models of attack success across diverse jailbreaking attacks (from gradient-based GCG to semantic PAIR).

Result: Provides more trustworthy and practical safety certificates that better reflect real-world LLM behavior. Enables practitioners to set certification thresholds that are more realistic and actionable for secure AI deployment.

Conclusion: The work contributes a practical and theoretically-grounded mechanism to make LLMs more resistant to exploitation of their safety alignments, addressing a critical challenge in secure AI deployment through improved certification frameworks.

Abstract: The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict “k-unstable” assumption that rarely holds in practice. This strong assumption can limit the trustworthiness of the provided safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, “(k, $\varepsilon$)-unstable,” to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM’s defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. By introducing the notion of (k, $\varepsilon$)-unstable, our framework provides practitioners with actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical and theoretically-grounded mechanism to make LLMs more resistant to the exploitation of their safety alignments, a critical challenge in secure AI deployment.

[281] Reversible Deep Learning for 13C NMR in Chemoinformatics: On Structures and Spectra

Stefan Kuhn, Vandana Dwarka, Przemyslaw Karol Grenda, Eero Vainikko

Main category: cs.LG

TL;DR: Reversible deep learning model for 13C NMR using conditional invertible neural network for bidirectional mapping between molecular structures and spectra.

DetailsMotivation: To create a unified model that can perform both molecular structure-to-spectrum prediction and spectrum-to-structure generation within a single end-to-end framework, addressing the one-to-many nature of spectrum-to-structure inference.

Method: Uses a single conditional invertible neural network built from i-RevNet style bijective blocks, trained to predict 128-bit binned spectrum codes from graph-based structure encodings while capturing residual variability in latent dimensions.

Result: Model is numerically invertible on trained examples, achieves spectrum-code prediction above chance, and produces coarse but meaningful structural signals when inverted on validation spectra.

Conclusion: Invertible architectures can unify spectrum prediction and uncertainty-aware candidate generation within one end-to-end model for NMR analysis.

Abstract: We introduce a reversible deep learning model for 13C NMR that uses a single conditional invertible neural network for both directions between molecular structures and spectra. The network is built from i-RevNet style bijective blocks, so the forward map and its inverse are available by construction. We train the model to predict a 128-bit binned spectrum code from a graph-based structure encoding, while the remaining latent dimensions capture residual variability. At inference time, we invert the same trained network to generate structure candidates from a spectrum code, which explicitly represents the one-to-many nature of spectrum-to-structure inference. On a filtered subset, the model is numerically invertible on trained examples, achieves spectrum-code prediction above chance, and produces coarse but meaningful structural signals when inverted on validation spectra. These results demonstrate that invertible architectures can unify spectrum prediction and uncertainty-aware candidate generation within one end-to-end model.

[282] SUNLayer: Stable denoising with generative networks

Ruhui Jin, Dustin G. Mixon, Soledad Villar

Main category: cs.LG

TL;DR: Theoretical framework SUNLayer analyzes generative models using spherical harmonics, identifying activation function conditions for guaranteed denoising under local optimization.

DetailsMotivation: To provide rigorous but simplified analysis of generative models used in image denoising and other inverse problems like compressed sensing and super-resolution.

Method: Introduces SUNLayer theoretical framework based on spherical harmonics that identifies explicit conditions on activation functions guaranteeing denoising under local optimization.

Result: Numerical experiments examine theoretical properties on commonly used activation functions and demonstrate their stable denoising performance.

Conclusion: The SUNLayer framework provides theoretical foundations for understanding generative models in inverse problems through spherical harmonics analysis.

Abstract: Deep neural networks are often used to implement powerful generative models for real-world data. Notable applications include image denoising, as well as other classical inverse problems like compressed sensing and super-resolution. To provide a rigorous but simplified analysis of generative models, in this work, we introduce an elegant theoretical framework based on spherical harmonics, namely \textbf{SUNLayer}. Our theoretical framework identifies explicit conditions on activation functions that guarantee denoising under local optimization. Numerical experiments examine the theoretical properties on commonly used activation functions and demonstrate their stable denoising performance.

[283] Convergence of gradient descent for deep neural networks

Sourav Chatterjee

Main category: cs.LG

TL;DR: A theoretical analysis showing that gradient descent with a specific positive initialization provably converges to zero training loss for neural networks with linearly independent data, without requiring overparameterization.

DetailsMotivation: To provide theoretical guarantees for neural network optimization without relying on overparameterization assumptions, and to develop constructive initialization strategies that ensure convergence to global minima.

Method: Proposes a local Polyak-Lojasiewicz (PL) criterion for nonnegative objectives, then verifies this criterion for feedforward neural networks with smooth, strictly increasing activation functions. Uses constructive initialization with zero first-layer weights, positive hidden-layer weights, and large output-layer weights.

Result: Proves linear convergence to zero training loss under the proposed initialization when input data vectors are linearly independent. Shows this theory-guided initialization substantially accelerates optimization compared to standard random initializations.

Conclusion: Provides a complementary analysis to overparameterization theory, showing convergence guarantees in fixed-width networks with linearly independent data through constructive initialization strategies.

Abstract: We give a simple local Polyak-Lojasiewicz (PL) criterion that guarantees linear (exponential) convergence of gradient flow and gradient descent to a zero-loss solution of a nonnegative objective. We then verify this criterion for the squared training loss of a feedforward neural network with smooth, strictly increasing activation functions, in a regime that is complementary to the usual over-parameterized analyses: the network width and depth are fixed, while the input data vectors are assumed to be linearly independent (in particular, the ambient input dimension is at least the number of data points). A notable feature of the verification is that it is constructive: it leads to a simple “positive” initialization (zero first-layer weights, strictly positive hidden-layer weights, and sufficiently large output-layer weights) under which gradient descent provably converges to an interpolating global minimizer of the training loss. We also discuss a probabilistic corollary for random initializations, clarify its dependence on the probability of the required initialization event, and provide numerical experiments showing that this theory-guided initialization can substantially accelerate optimization relative to standard random initializations at the same width.

[284] FedPSA: Modeling Behavioral Staleness in Asynchronous Federated Learning

Chaoyi Lu, Yiding Sun, Zhichuan Yang, Jinqian Chen, Dongfu Yin, Jihua Zhu

Main category: cs.LG

TL;DR: FedPSA is an asynchronous federated learning framework that uses parameter sensitivity to measure model staleness and a dynamic momentum queue to adjust tolerance for outdated information, achieving better performance than existing methods.

DetailsMotivation: Asynchronous Federated Learning (AFL) improves training speed but suffers from performance degradation due to model staleness. Existing methods use coarse-grained round difference as staleness measure, limiting performance.

Method: FedPSA uses parameter sensitivity to measure model obsolescence more finely and establishes a dynamic momentum queue to assess current training phase in real-time, adjusting tolerance for outdated information dynamically.

Result: Extensive experiments show FedPSA achieves up to 6.37% improvement over baseline methods and 1.93% over state-of-the-art methods on multiple datasets.

Conclusion: FedPSA provides a more fine-grained approach to handling staleness in asynchronous federated learning, significantly improving performance over existing methods.

Abstract: Asynchronous Federated Learning (AFL) has emerged as a significant research area in recent years. By not waiting for slower clients and executing the training process concurrently, it achieves faster training speed compared to traditional federated learning. However, due to the staleness introduced by the asynchronous process, its performance may degrade in some scenarios. Existing methods often use the round difference between the current model and the global model as the sole measure of staleness, which is coarse-grained and lacks observation of the model itself, thereby limiting the performance ceiling of asynchronous methods. In this paper, we propose FedPSA (Parameter Sensitivity-based Asynchronous Federated Learning), a more fine-grained AFL framework that leverages parameter sensitivity to measure model obsolescence and establishes a dynamic momentum queue to assess the current training phase in real time, thereby adjusting the tolerance for outdated information dynamically. Extensive experiments on multiple datasets and comparisons with various methods demonstrate the superior performance of FedPSA, achieving up to 6.37% improvement over baseline methods and 1.93% over the current state-of-the-art method.

[285] A Unified Framework for Analyzing Meta-algorithms in Online Convex Optimization

Mohammad Pedramfar, Vaneet Aggarwal

Main category: cs.LG

TL;DR: A framework for analyzing online convex optimization across different feedback types (full-info/semi-bandit/bandit) and settings (stochastic/non-stochastic) with systematic meta-algorithm design and regret bound conversions.

DetailsMotivation: To create a unified framework for systematically analyzing online convex optimization problems across diverse settings including different feedback types (full-information, semi-bandit, bandit), different environments (stochastic vs non-stochastic), and different regret notions (static adversarial, dynamic, adaptive).

Method: Develops a meta-algorithm framework that allows transformation between different feedback types and settings. Shows that algorithms for online linear optimization with deterministic gradient feedback can be adapted to online convex optimization, and that full-information algorithms can be transformed to semi-bandit algorithms with comparable regret bounds.

Result: The framework enables systematic conversion of first-order algorithms to zeroth-order algorithms with comparable regret bounds, recovers existing results with simplified proofs, and provides new results for various online optimization settings.

Conclusion: Provides a comprehensive framework for analyzing online convex optimization across diverse settings, enabling systematic algorithm design and transformation between different feedback types while maintaining comparable performance guarantees.

Abstract: In this paper, we analyze the problem of online convex optimization in different settings, including different feedback types (full-information/semi-bandit/bandit/etc) in either stochastic or non-stochastic setting and different notions of regret (static adversarial regret/dynamic regret/adaptive regret). This is done through a framework which allows us to systematically propose and analyze meta-algorithms for the various settings described above. We show that any algorithm for online linear optimization with deterministic gradient feedback against fully adaptive adversaries is an algorithm for online convex optimization. We also show that any such algorithm that requires full-information feedback may be transformed to an algorithm with semi-bandit feedback with comparable regret bound. We further show that algorithms that are designed for fully adaptive adversaries using deterministic semi-bandit feedback can obtain similar bounds using only stochastic semi-bandit feedback when facing oblivious adversaries. We use this to describe general meta-algorithms to convert first order algorithms to zeroth order algorithms with comparable regret bounds. Our framework allows us to analyze online optimization in various settings, recovers several results in the literature with a simplified proof technique, and provides new results.

[286] SpecTUS: Spectral Translator for Unknown Structures annotation from EI-MS spectra

Adam Hájek, Michal Starý, Elliott Price, Filip Jozefov, Helge Hecht, Aleš Křenek

Main category: cs.LG

TL;DR: SpecTUS: A deep neural model that directly translates low-resolution GC-EI mass spectra into 2D molecular structures, outperforming database search methods for novel compound annotation.

DetailsMotivation: Current methods for compound identification from mass spectra rely on database searches, which fail for novel compounds not in spectral libraries. There's a need for de novo structure annotation that can handle unknown molecules.

Method: SpecTUS uses a deep neural network to perform direct translation from gas chromatography electron ionization mass spectra (GC-EI-MS) to 2D structural representations. It analyzes spectra in a de novo manner without relying on spectral libraries.

Result: On a held-out test set of 28,267 spectra from NIST database, SpecTUS achieved 43% perfect reconstruction with single suggestion, outperforming database hybrid search in 76% of cases. With 10 suggestions, perfect reconstruction reached 65%, and 84% were better than hybrid search.

Conclusion: SpecTUS provides a powerful alternative to database search methods for structural annotation of small molecules, particularly useful for analyzing compounds unavailable in spectral libraries.

Abstract: Compound identification and structure annotation from mass spectra is a well-established task widely applied in drug detection, criminal forensics, small molecule biomarker discovery and chemical engineering. We propose SpecTUS: Spectral Translator for Unknown Structures, a deep neural model that addresses the task of structural annotation of small molecules from low-resolution gas chromatography electron ionization mass spectra (GC-EI-MS). Our model analyzes the spectra in \textit{de novo} manner – a direct translation from the spectra into 2D-structural representation. Our approach is particularly useful for analyzing compounds unavailable in spectral libraries. In a rigorous evaluation of our model on the novel structure annotation task across different libraries, we outperformed standard database search techniques by a wide margin. On a held-out testing set, including \numprint{28267} spectra from the NIST database, we show that our model’s single suggestion perfectly reconstructs 43% of the subset’s compounds. This single suggestion is strictly better than the candidate of the database hybrid search (common method among practitioners) in 76% of cases. In astill affordable scenario of10 suggestions, perfect reconstruction is achieved in 65%, and 84% are better than the hybrid search.

[287] Co-Evolution-Based Metal-Binding Residue Prediction with Graph Neural Networks

Sayedmohammadreza Rastegari, Sina Tabakhi, Xianyuan Liu, Tianyi Jiang, Wei Sang, Haiping Lu

Main category: cs.LG

TL;DR: MBGNN is a graph neural network that uses complete co-evolved residue networks to predict metal-binding residues and metal types in proteins, outperforming existing methods.

DetailsMotivation: Predicting protein-metal interactions is challenging due to structural complexity, and existing methods fail to fully capture co-evolutionary constraints that maintain metal-binding functionality.

Method: Developed Metal-Binding Graph Neural Network (MBGNN) that leverages complete co-evolved residue networks to capture complex dependencies within protein structures.

Result: MBGNN outperforms state-of-the-art co-evolution-based method MetalNet2 by 2.5% F1 for binding residue identification and 3.3% for metal type classification, and shows superiority across multiple datasets.

Conclusion: Integrating co-evolutionary residue networks with graph-based learning advances protein-metal interaction prediction, facilitating functional annotation and metalloprotein design.

Abstract: Understanding protein-metal interactions is central to structural biology, with metal ions being vital for catalysis, stability, and signal transduction. Predicting metal-binding residues and metal types remains challenging due to the structural and evolutionary complexity of proteins. Conventional sequence- and structure-based methods often fail to capture co-evolutionary constraints that reflect how residues evolve together to maintain metal-binding functionality. Recent co-evolution-based methods capture part of this information, but still underutilize the complete co-evolved residue network. To address this limitation, we introduce the Metal-Binding Graph Neural Network (MBGNN), which leverages the complete co-evolved residue network to better capture complex dependencies within protein structures. Experimental results show that MBGNN substantially outperforms the state-of-the-art co-evolution-based method MetalNet2, achieving F1 score improvements of 2.5% for binding residue identification and 3.3% for metal type classification on the MetalNet2 dataset. Its superiority is further demonstrated on both the MetalNet2 and MIonSite datasets, where it outperforms two co-evolution-based and two sequence-based methods, achieving the highest mean F1 scores across both prediction tasks. These findings highlight how integrating co-evolutionary residue networks with graph-based learning advances our ability to decode protein-metal interactions, thereby facilitating functional annotation and rational metalloprotein design. The code and data are released at https://github.com/SRastegari/MBGNN.

[288] How Well Can Differential Privacy Be Audited in One Run?

Amit Keinan, Moshe Shenfeld, Katrina Ligett

Main category: cs.LG

TL;DR: The paper analyzes the precision limits of one-run auditing for ML privacy, identifying interference between data effects as the key barrier and proposing new approaches to improve auditing performance.

DetailsMotivation: Recent methods for auditing ML privacy have improved efficiency by intervening on multiple training examples in a single run, but questions remain about how precisely these one-run audits can uncover true privacy parameters and how precision depends on the audited algorithm.

Method: Characterizes the maximum achievable efficacy of one-run auditing, identifies interference between observable effects of different data elements as the key barrier, and presents new conceptual approaches to minimize this interference.

Result: Shows that interference between data effects is the fundamental limitation for one-run auditing precision, and proposes methods to reduce this interference to improve auditing performance for real ML algorithms.

Conclusion: One-run auditing has inherent precision limitations due to interference effects, but new conceptual approaches can help minimize this barrier and improve practical auditing of machine learning privacy.

Abstract: Recent methods for auditing the privacy of machine learning algorithms have improved computational efficiency by simultaneously intervening on multiple training examples in a single training run. Steinke et al. (2024) prove that one-run auditing indeed lower bounds the true privacy parameter of the audited algorithm, and give impressive empirical results. Their work leaves open the question of how precisely one-run auditing can uncover the true privacy parameter of an algorithm, and how that precision depends on the audited algorithm. In this work, we characterize the maximum achievable efficacy of one-run auditing and show that the key barrier to its efficacy is interference between the observable effects of different data elements. We present new conceptual approaches to minimize this barrier, towards improving the performance of one-run auditing of real machine learning algorithms.

[289] Better Neural Network Expressivity: Subdividing the Simplex

Egor Bakaev, Florestan Brunck, Christoph Hertrich, Jack Stade, Amir Yehudayoff

Main category: cs.LG

TL;DR: ReLU neural networks can compute all continuous piecewise linear functions on ℝⁿ with fewer hidden layers than previously thought - specifically ⌈log₃(n-1)⌉+1 layers instead of ⌈log₂(n+1)⌉ layers.

DetailsMotivation: To investigate the optimal depth requirements for ReLU neural networks to compute continuous piecewise linear functions, particularly addressing a conjecture that ⌈log₂(n+1)⌉ hidden layers are necessary for functions like the maximum function.

Method: Theoretical analysis showing that ReLU networks with two hidden layers can exactly represent the maximum function of five inputs, and more generally that ⌈log₃(n-2)⌉+1 hidden layers suffice for computing maximum of n≥4 numbers. Uses geometric interpretation via polyhedral subdivisions of simplices.

Result: Disproves the conjecture that ⌈log₂(n+1)⌉ hidden layers are optimal, showing that ⌈log₃(n-1)⌉+1 hidden layers are sufficient to compute all CPWL functions on ℝⁿ. The constructions nearly match the ⌈log₃(n)⌉ lower bound for ReLU networks with decimal fraction weights.

Conclusion: ReLU neural networks require fewer hidden layers than previously believed to compute all continuous piecewise linear functions, with the depth requirement scaling logarithmically with base 3 rather than base 2.

Abstract: This work studies the expressivity of ReLU neural networks with a focus on their depth. A sequence of previous works showed that $\lceil \log_2(n+1) \rceil$ hidden layers are sufficient to compute all continuous piecewise linear (CPWL) functions on $\mathbb{R}^n$. Hertrich, Basu, Di Summa, and Skutella (NeurIPS'21 / SIDMA'23) conjectured that this result is optimal in the sense that there are CPWL functions on $\mathbb{R}^n$, like the maximum function, that require this depth. We disprove the conjecture and show that $\lceil\log_3(n-1)\rceil+1$ hidden layers are sufficient to compute all CPWL functions on $\mathbb{R}^n$. A key step in the proof is that ReLU neural networks with two hidden layers can exactly represent the maximum function of five inputs. More generally, we show that $\lceil\log_3(n-2)\rceil+1$ hidden layers are sufficient to compute the maximum of $n\geq 4$ numbers. Our constructions almost match the $\lceil\log_3(n)\rceil$ lower bound of Averkov, Hojny, and Merkert (ICLR'25) in the special case of ReLU networks with weights that are decimal fractions. The constructions have a geometric interpretation via polyhedral subdivisions of the simplex into ``easier’’ polytopes.

[290] Assimilative Causal Inference

Marios Andreou, Nan Chen, Erik Bollt

Main category: cs.LG

TL;DR: ACI is a Bayesian data assimilation framework for identifying dynamic causal relationships in complex systems by solving inverse problems from effects to causes.

DetailsMotivation: Existing causal inference methods struggle with instantaneous, time-evolving causal relationships in complex high-dimensional systems, especially when dealing with transient causal structures and intermittent relationships.

Method: Assimilative Causal Inference (ACI) uses Bayesian data assimilation to trace causes backward from observed effects, solving inverse problems rather than quantifying forward influence. It works without requiring observations of candidate causes and can handle short datasets.

Result: ACI effectively identifies dynamic causal interactions in complex dynamical systems with intermittency and extreme events, providing online tracking of causal roles that may reverse intermittently and establishing rigorous criteria for causal influence range.

Conclusion: ACI opens valuable pathways for studying complex systems where transient causal structures are critical, offering a novel approach to causal inference that addresses limitations of existing methods.

Abstract: Causal inference is fundamental across scientific disciplines, yet existing methods struggle to capture instantaneous, time-evolving causal relationships in complex, high-dimensional systems. In this paper, assimilative causal inference (ACI) is developed, which is a methodological framework that leverages Bayesian data assimilation to trace causes backward from observed effects. ACI solves the inverse problem rather than quantifying forward influence. It uniquely identifies dynamic causal interactions without requiring observations of candidate causes, accommodates short datasets, and, in principle, can be implemented in high-dimensional settings by employing efficient data assimilation algorithms. Crucially, it provides online tracking of causal roles that may reverse intermittently and facilitates a mathematically rigorous criterion for the causal influence range, revealing how far effects propagate. The effectiveness of ACI is demonstrated by complex dynamical systems showcasing intermittency and extreme events. ACI opens valuable pathways for studying complex systems, where transient causal structures are critical.

[291] Generative Distribution Embeddings: Lifting autoencoders to the space of distributions for multiscale representation learning

Nic Fishman, Gokul Gowri, Peng Yin, Jonathan Gootenberg, Omar Abudayyeh

Main category: cs.LG

TL;DR: GDEs lift autoencoders to distribution space, enabling learning of distribution representations through encoder-generator pairs that satisfy distributional invariance, with applications across computational biology.

DetailsMotivation: Many real-world problems require reasoning across multiple scales and operating on entire distributions rather than single data points, necessitating models that can represent and work with distributions directly.

Method: Introduces generative distribution embeddings (GDEs) where an encoder acts on sets of samples and a generator aims to match input distributions. The framework couples conditional generative models with encoder networks satisfying distributional invariance criteria.

Result: GDEs learn predictive sufficient statistics embedded in Wasserstein space, with latent distances approximating W₂ distance and interpolation recovering optimal transport trajectories for Gaussian/Gaussian mixture distributions. Outperforms existing approaches on synthetic datasets and successfully applied to six computational biology problems.

Conclusion: GDEs provide a powerful framework for learning distribution representations with theoretical guarantees and practical applications across diverse biological domains at massive scales.

Abstract: Many real-world problems require reasoning across multiple scales, demanding models which operate not on single data points, but on entire distributions. We introduce generative distribution embeddings (GDE), a framework that lifts autoencoders to the space of distributions. In GDEs, an encoder acts on sets of samples, and the decoder is replaced by a generator which aims to match the input distribution. This framework enables learning representations of distributions by coupling conditional generative models with encoder networks which satisfy a criterion we call distributional invariance. We show that GDEs learn predictive sufficient statistics embedded in the Wasserstein space, such that latent GDE distances approximately recover the $W_2$ distance, and latent interpolation approximately recovers optimal transport trajectories for Gaussian and Gaussian mixture distributions. We systematically benchmark GDEs against existing approaches on synthetic datasets, demonstrating consistently stronger performance. We then apply GDEs to six key problems in computational biology: learning donor-level representations from single-nuclei RNA sequencing data (6M cells), capturing clonal dynamics in lineage-traced RNA sequencing data (150K cells), predicting perturbation effects on transcriptomes (1M cells), predicting perturbation effects on cellular phenotypes (20M single-cell images), designing synthetic yeast promoters (34M sequences), and spatiotemporal modeling of viral protein sequences (1M sequences).

[292] Sign-SGD via Parameter-Free Optimization

Daniil Medyakov, Sergey Stanko, Gleb Molodtsov, Philip Zmushko, Grigoriy Evseev, Egor Petrov, Aleksandr Beznosikov

Main category: cs.LG

TL;DR: Parameter-free Sign-SGD optimizer that eliminates manual stepsize tuning for efficient LLM training, achieving comparable performance to tuned optimizers with 1.5× speedup.

DetailsMotivation: Training large language models is extremely resource-intensive, and existing optimizers like Sign-SGD require manual stepsize tuning which relies on unknown problem-specific quantities, creating overhead and inefficiency.

Method: Develops parameter-free Sign-SGD that removes manual stepsize selection, extends to stochastic single-node training and multi-node settings, incorporates momentum, and proposes memory-efficient variant storing only gradient signs instead of full gradients.

Result: Methods match performance of tuned Sign-SGD and AdamW (with grid-searched stepsizes) on pre-training LLaMA models (130M and 350M) and fine-tuning Swin Transformer (28M), while achieving ~1.5× end-to-end speedup compared to runs with grid-searched stepsizes.

Conclusion: Parameter-free Sign-SGD provides efficient, memory-optimized training for large models without manual tuning overhead, making LLM training more accessible and resource-efficient.

Abstract: Large language models have achieved major advances across domains, yet training them remains extremely resource-intensive. We revisit Sign-SGD, which serves both as a memory-efficient optimizer for single-node training and as a gradient compression mechanism for distributed learning. This paper addresses a central limitation: the effective stepsize cannot be determined a priori because it relies on unknown, problem-specific quantities. We present a parameter-free Sign-SGD that removes manual stepsize selection. We analyze the deterministic single-node case, and extend the method to stochastic single-node training and multi-node settings. We also incorporate the momentum technique into our algorithms and propose a memory-efficient variant that stores only gradient signs instead of full gradients. We evaluate our methods on pre-training LLaMA models (130M and 350M) and fine-tuning a Swin Transformer (28M). Across considered tasks, the proposed methods match the performance of tuned Sign-SGD and AdamW (grid-searched stepsizes with a cosine schedule), while avoiding tuning overhead. Employing parameter-free training yields approximately $1.5\times$ end-to-end speedup compared to runs with grid-searched stepsizes.

[293] GAGA: Gaussianity-Aware Gaussian Approximation for Efficient 3D Molecular Generation

Jingxiang Qu, Wenhan Gao, Ruichen Xu, Yi Liu

Main category: cs.LG

TL;DR: GAGA accelerates Gaussian Probability Path based Generative Models by identifying when molecular data becomes sufficiently Gaussian during forward diffusion, allowing replacement of later trajectory steps with closed-form approximations.

DetailsMotivation: GPPGMs achieve state-of-the-art 3D molecular generation but suffer from high computational costs due to long generative trajectories requiring hundreds to thousands of steps during training and sampling, hindering practical deployment.

Method: The key insight is that different data modalities attain sufficient Gaussianity at different steps during the forward process. The method analytically identifies a characteristic step where molecular data becomes sufficiently Gaussian, after which the trajectory can be replaced by a closed-form Gaussian approximation, preserving full-resolution learning dynamics while avoiding redundant transport.

Result: Experiments on 3D molecular generation benchmarks demonstrate substantial improvements in both generation quality and computational efficiency compared to existing methods.

Conclusion: GAGA provides a principled approach to accelerate GPPGMs without sacrificing training granularity or inference fidelity, making these models more practical for deployment while maintaining state-of-the-art performance.

Abstract: Gaussian Probability Path based Generative Models (GPPGMs) generate data by reversing a stochastic process that progressively corrupts samples with Gaussian noise. Despite state-of-the-art results in 3D molecular generation, their deployment is hindered by the high cost of long generative trajectories, often requiring hundreds to thousands of steps during training and sampling. In this work, we propose a principled method, named GAGA, to improve generation efficiency without sacrificing training granularity or inference fidelity of GPPGMs. Our key insight is that different data modalities obtain sufficient Gaussianity at markedly different steps during the forward process. Based on this observation, we analytically identify a characteristic step at which molecular data attains sufficient Gaussianity, after which the trajectory can be replaced by a closed-form Gaussian approximation. Unlike existing accelerators that coarsen or reformulate trajectories, our approach preserves full-resolution learning dynamics while avoiding redundant transport through truncated distributional states. Experiments on 3D molecular generation benchmarks demonstrate that our GAGA achieves substantial improvement on both generation quality and computational efficiency.

[294] Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset

Lily Hong Zhang, Smitha Milli, Karen Jusko, Jonathan Smith, Brandon Amos, Wassim Bouaziz, Manon Revel, Jack Kussman, Yasha Sheynin, Lisa Titus, Bhaktipriya Radharapu, Jane Yu, Vidya Sarma, Kris Rose, Maximilian Nickel

Main category: cs.LG

TL;DR: This paper addresses the challenge of aligning LLMs with diverse global preferences by showing current methods fail to capture human preference variation, proposing negatively-correlated sampling for better preference dataset collection, and releasing a large multilingual preference dataset.

DetailsMotivation: Current LLMs don't adequately represent the diverse preferences of global users across cultural, political, and other dimensions. There's a need to better align LLMs with heterogeneous human preferences to serve diverse populations effectively.

Method: Conducted large-scale multilingual human studies across 5 countries (N=15,000), analyzed preference variation, proposed negatively-correlated sampling for candidate response generation, and collected Community Alignment dataset with 233,319 comparisons.

Result: Humans show substantially more preference variation than 21 state-of-the-art LLMs; existing preference collection methods are insufficient; negatively-correlated sampling improves alignment method performance; released largest multilingual preference dataset.

Conclusion: Better methods for capturing diverse human preferences are needed for LLM alignment, and the Community Alignment dataset provides a valuable resource for improving LLMs’ effectiveness for global populations.

Abstract: How can large language models (LLMs) serve users with varying preferences that may conflict across cultural, political, or other dimensions? To advance this challenge, this paper establishes four key results. First, we demonstrate, through a large-scale multilingual human study with representative samples from five countries (N=15,000), that humans exhibit substantially more variation in preferences than the responses of 21 state-of-the-art LLMs. Second, we show that existing methods for preference dataset collection are insufficient for learning the diversity of human preferences even along two of the most salient dimensions of variability in global values, due to the underlying homogeneity of candidate responses. Third, we argue that this motivates the need for negatively-correlated sampling when generating candidate sets, and we show that simple prompt-based techniques for doing so greatly enhance the performance of alignment methods in learning heterogeneous preferences. Fourth, based on this novel candidate sampling approach, we collect and open-source Community Alignment} the largest and most representative multilingual and multi-turn preference dataset to date, featuring 233,319 comparisons from annotators spanning five countries. Overall, we hope that the Community Alignment dataset will be a valuable resource for improving the effectiveness of LLMs for a diverse global population.

[295] M3OOD: Automatic Selection of Multimodal OOD Detectors

Yuehan Qin, Li Li, Defu Cao, Tiankai Yang, Jiate Li, Yue Zhao

Main category: cs.LG

TL;DR: M3OOD: A meta-learning framework for automatically selecting optimal out-of-distribution detection models in multimodal settings by learning from historical model behaviors and dataset characteristics.

DetailsMotivation: Out-of-distribution robustness is critical for multimodal ML systems, but no single OOD detector works best across all distribution shifts. Manual selection is difficult due to the unsupervised nature of OOD detection and the impracticality of systematic testing on new data.

Method: Meta-learning framework that combines multimodal embeddings with handcrafted meta-features capturing distributional and cross-modal characteristics to represent datasets. Learns from historical performance across diverse multimodal benchmarks to recommend suitable detectors for new distribution shifts.

Result: M3OOD consistently outperforms 10 competitive baselines across 12 test scenarios with minimal computational overhead.

Conclusion: The framework provides an effective solution for automatic OOD detector selection in multimodal settings by leveraging meta-learning and multimodal dataset representations.

Abstract: Out-of-distribution (OOD) robustness is a critical challenge for modern machine learning systems, particularly as they increasingly operate in multimodal settings involving inputs like video, audio, and sensor data. Currently, many OOD detection methods have been proposed, each with different designs targeting various distribution shifts. A single OOD detector may not prevail across all the scenarios; therefore, how can we automatically select an ideal OOD detection model for different distribution shifts? Due to the inherent unsupervised nature of the OOD detection task, it is difficult to predict model performance and find a universally Best model. Also, systematically comparing models on the new unseen data is costly or even impractical. To address this challenge, we introduce M3OOD, a meta-learning-based framework for OOD detector selection in multimodal settings. Meta learning offers a solution by learning from historical model behaviors, enabling rapid adaptation to new data distribution shifts with minimal supervision. Our approach combines multimodal embeddings with handcrafted meta-features that capture distributional and cross-modal characteristics to represent datasets. By leveraging historical performance across diverse multimodal benchmarks, M3OOD can recommend suitable detectors for a new data distribution shift. Experimental evaluation demonstrates that M3OOD consistently outperforms 10 competitive baselines across 12 test scenarios with minimal computational overhead.

[296] Morephy-Net: An Evolutionary Multi-objective Optimization for Replica-Exchange-based Physics-informed Neural Operator Learning Networks

Binghang Lu, Changhong Mou, Guang Lin

Main category: cs.LG

TL;DR: Morephy-Net uses evolutionary multi-objective optimization and replica-exchange methods to solve parametric PDEs with noisy data for both forward prediction and inverse identification, improving accuracy and uncertainty quantification over existing operator-learning models.

DetailsMotivation: Existing physics-informed neural networks and operator-learning models face challenges in balancing data/operator vs physics residual losses, maintaining robustness under noisy/sparse observations, and providing reliable uncertainty quantification.

Method: Integrates evolutionary multi-objective optimization (treating data/operator and physics residual terms as separate objectives), replica-exchange stochastic gradient Langevin dynamics for enhanced exploration, and Bayesian uncertainty quantification from stochastic sampling.

Result: Demonstrates consistent improvements in accuracy, noise robustness, and calibrated uncertainty estimates over standard operator-learning baselines on forward and inverse problems including 1D Burgers equation and time-fractional mixed diffusion-wave equation.

Conclusion: Morephy-Net effectively addresses key challenges in physics-informed operator learning for parametric PDEs, providing a robust framework for both forward prediction and inverse identification with reliable uncertainty quantification.

Abstract: We propose an evolutionary Multi-objective Optimization for Replica-Exchange-based Physics-informed operator-learning Networks (Morephy-Net) to solve parametric partial differential equations (PDEs) in noisy data regimes, for both forward prediction and inverse identification. Existing physics-informed neural networks and operator-learning models (e.g., DeepONets and Fourier neural operators) often face three coupled challenges: (i) balancing data/operator and physics residual losses, (ii) maintaining robustness under noisy or sparse observations, and (iii) providing reliable uncertainty quantification. Morephy-Net addresses these issues by integrating: (i) evolutionary multi-objective optimization that treats data/operator and physics residual terms as separate objectives and searches the Pareto front, thereby avoiding ad hoc loss weighting; (ii) replica-exchange stochastic gradient Langevin dynamics to enhance global exploration and stabilize training in non-convex landscapes; and (iii) Bayesian uncertainty quantification obtained from stochastic sampling. We validate Morephy-Net on representative forward and inverse problems, including the one-dimensional Burgers equation and the time-fractional mixed diffusion–wave equation. The results demonstrate consistent improvements in accuracy, noise robustness, and calibrated uncertainty estimates over standard operator-learning baselines.

[297] Thermodynamically consistent machine learning model for excess Gibbs energy

Marco Hoffmann, Thomas Specht, Quirin Göttl, Jakob Burger, Stephan Mandt, Hans Hasse, Fabian Jirasek

Main category: cs.LG

TL;DR: HANNA is a machine learning model that predicts excess Gibbs energy for liquid mixtures from molecular structures with built-in thermodynamic consistency constraints.

DetailsMotivation: Predicting excess Gibbs energy of multi-component mixtures from molecular structures is a long-standing challenge in chemical engineering and chemistry, crucial for modeling thermodynamic properties of liquid mixtures.

Method: HANNA integrates physical laws as hard constraints to guarantee thermodynamic consistency, trained on experimental data for various equilibrium properties using a surrogate solver for liquid-liquid equilibrium data and geometric projection for multi-component extrapolation.

Result: HANNA delivers accurate predictions with substantially broader domain of applicability than state-of-the-art benchmark methods.

Conclusion: The model successfully addresses the challenge of predicting excess Gibbs energy from molecular structures while ensuring thermodynamic consistency, with trained model and code openly available.

Abstract: The excess Gibbs energy plays a central role in chemical engineering and chemistry, providing a basis for modeling thermodynamic properties of liquid mixtures. Predicting the excess Gibbs energy of multi-component mixtures solely from molecular structures is a long-standing challenge. We address this challenge with HANNA, a flexible machine learning model for excess Gibbs energy that integrates physical laws as hard constraints, guaranteeing thermodynamically consistent predictions. HANNA is trained on experimental data for vapor-liquid equilibria, liquid-liquid equilibria, activity coefficients at infinite dilution and excess enthalpies in binary mixtures. The end-to-end training on liquid-liquid equilibrium data is facilitated by a surrogate solver. A geometric projection method enables robust extrapolations to multi-component mixtures. We demonstrate that HANNA delivers accurate predictions, while providing a substantially broader domain of applicability than state-of-the-art benchmark methods. The trained model and corresponding code are openly available, and an interactive interface is provided on our website, MLPROP.

[298] Comparative Analysis of Wave Scattering Numerical Modeling Using the Boundary Element Method and Physics-Informed Neural Networks

Oscar Rincón-Cardeno, Gregorio Pérez Bernal, Silvana Montoya Noguera, Nicolás Guarín-Zapata

Main category: cs.LG

TL;DR: Comparison of Boundary Element Method (BEM) and Physics-Informed Neural Networks (PINNs) for solving 2D Helmholtz wave scattering problems, showing BEM is faster for solution but PINNs are faster for evaluation after training.

DetailsMotivation: To evaluate and compare the performance of traditional numerical methods (BEM) and emerging machine learning approaches (PINNs) for solving wave scattering problems under identical conditions, providing quantitative data to guide future research.

Method: Both methods solve the same 2D Helmholtz scattering problem: BEM uses boundary discretization with varying integration points, while PINNs minimize residual of governing equations and boundary conditions through hyperparameter optimization (3 hidden layers, 25 neurons per layer, sine activation, 10^-2 learning rate).

Result: BEM solution time: ~10^-2 seconds; PINN training time: ~10^2 seconds (4 orders of magnitude slower). However, trained PINN evaluation time: ~10^-2 seconds, which is 2 orders of magnitude faster than BEM evaluation at interior points.

Conclusion: Established comparison procedure between BEM and PINNs for wave scattering problems. BEM is computationally efficient for solution, while PINNs offer fast evaluation after training, highlighting trade-offs and providing performance benchmarks for future research.

Abstract: This study compares the Boundary Element Method (BEM) and Physics-Informed Neural Networks (PINNs) for solving the two-dimensional Helmholtz equation in wave scattering problems. The objective is to evaluate the performance of both methods under the same conditions. We solve the Helmholtz equation using BEM and PINNs for the same scattering problem. PINNs are trained by minimizing the residual of the governing equations and boundary conditions with their configuration determined through hyperparameter optimization, while BEM is applied using boundary discretization. Both methods are evaluated in terms of solution accuracy and computation time. We conducted numerical experiments by varying the number of boundary integration points for the BEM and the number of hidden layers and neurons per layer for the PINNs. We performed a hyperparameter tuning to identify an adequate PINN configuration for this problem as a network with 3 hidden layers and 25 neurons per layer, using a learning rate of $10^{-2}$ and a sine activation function. At comparable levels of accuracy, the assembly and solution of the BEM system required a computational time on the order of $10^{-2}$~s, whereas the training time of the PINN was on the order of $10^{2}$~s, corresponding to a difference of approximately four orders of magnitude. However, once trained, the PINN achieved evaluation times on the order of $10^{-2}$~s, which is about two orders of magnitude faster than the evaluation of the BEM solution at interior points. This work establishes a procedure for comparing BEM and PINNs. It also presents a direct comparison between the two methods for the scattering problem. The analysis provides quantitative data on their performance, supporting their use in future research on wave propagation problems and outlining challenges and directions for further investigation.

[299] xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity

Maximilian Beck, Kajetan Schweighofer, Sebastian Böck, Sebastian Lehner, Sepp Hochreiter

Main category: cs.LG

TL;DR: xLSTM scales more favorably than Transformers in LLM training and inference, consistently Pareto-dominating Transformers with lower cross-entropy loss for the same compute budget.

DetailsMotivation: While scaling laws are crucial for LLM success and Transformers dominate, recent alternatives like xLSTM offer linear complexity with context length while remaining competitive. Need to understand comparative scaling behavior to guide future model design.

Method: Comparative investigation of scaling behavior between Transformers and xLSTM using IsoFLOP and parametric fit approaches across wide model sizes (80M-7B) and training tokens (2B-2T). Examined dependence of optimal model sizes on context length and inference-time scaling characteristics.

Result: xLSTM scales favorably compared to Transformers in typical LLM training and inference scenarios. xLSTM models consistently Pareto-dominate Transformer models, delivering lower cross-entropy loss for the same compute budget.

Conclusion: xLSTM demonstrates superior scaling properties compared to Transformers, making it a promising alternative architecture for large language models, especially considering its linear complexity with context length.

Abstract: Scaling laws play a central role in the success of Large Language Models (LLMs), enabling the prediction of model performance relative to compute budgets prior to training. While Transformers have been the dominant architecture, recent alternatives such as xLSTM offer linear complexity with respect to context length while remaining competitive in the billion-parameter regime. We conduct a comparative investigation on the scaling behavior of Transformers and xLSTM along the following lines, providing insights to guide future model design and deployment. First, we study the scaling behavior for xLSTM in compute-optimal and over-training regimes using both IsoFLOP and parametric fit approaches on a wide range of model sizes (80M-7B) and number of training tokens (2B-2T). Second, we examine the dependence of optimal model sizes on context length, a pivotal aspect that was largely ignored in previous work. Finally, we analyze inference-time scaling characteristics. Our findings reveal that in typical LLM training and inference scenarios, xLSTM scales favorably compared to Transformers. Notably, xLSTM models consistently Pareto-dominate Transformer models, delivering lower cross-entropy loss for the same compute budget.

[300] Mitigating Subject Dependency in EEG Decoding with Subject-Specific Low-Rank Adapters

Timon Klein, Piotr Minakowski, Sebastian Sager, Steffen Schotthöfer

Main category: cs.LG

TL;DR: SuLoRA is a subject-specific low-rank adapter that decomposes neural network weights into shared subject-invariant components and lightweight subject-specific corrections to handle distribution shifts in brain decoding tasks.

DetailsMotivation: Subject-specific distribution shifts are a major challenge for developing foundation models for brain decoding. Current approaches struggle with inter-subject variability, making it difficult to create robust models that work across different individuals.

Method: SuLoRA replaces standard linear/convolutional layers by decomposing weights into: 1) a shared subject-invariant component, and 2) a lightweight low-rank correction unique to each subject. This enables existing architectures to handle subject shifts without redesign.

Result: On MEG speech perception tasks, SuLoRA exceeds baseline performance with half the parameters. On EEG motor imagery datasets, it outperforms both subject-agnostic models and independently trained subject-specific models.

Conclusion: SuLoRA provides a practical approach for building effective cross-subject foundation models for brain signal applications by explicitly handling subject variability through parameter-efficient adaptation.

Abstract: Subject-specific distribution shifts represent a fundamental obstacle to developing foundation models for brain decoding. We propose the Subject-Specific Low-Rank Adapter (SuLoRA), a drop-in replacement for standard linear or convolutional layers that captures inter-subject variability by decomposing weights into a shared, subject-invariant component and a lightweight, low-rank correction unique to each subject. This explicit separation enables existing architectures to become robust to subject shifts without architectural redesign. We evaluate SuLoRA on MEG speech perception and EEG motor imagery tasks across CNN and transformer architectures. In the speech decoding task, SuLoRA exceeds the baseline performance with half of the parameters. On motor imagery dataset, SuLoRA outperforms both subject-agnostic models and independently trained subject-specific models. SuLoRA offers a practical path towards effective cross-subject foundation models for brain signal applications.

[301] Who Said Neural Networks Aren’t Linear?

Nimrod Berman, Assaf Hallak, Assaf Shocher

Main category: cs.LG

TL;DR: The paper introduces Linearizer architecture that transforms neural networks into linear operators in non-standard vector spaces, enabling application of linear algebra tools to nonlinear mappings and demonstrating applications in diffusion model acceleration, projective generative models, and style transfer.

DetailsMotivation: Neural networks are nonlinear, but linearity is relative to vector space definitions. The authors aim to find non-standard vector spaces where neural networks can act as linear operators, enabling application of linear algebra tools to nonlinear mappings.

Method: Propose Linearizer architecture: sandwich a linear operator A between two invertible neural networks, f(x)=g_y^{-1}(A g_x(x)). Define new vector spaces X and Y with addition and scaling operations derived from g_x and g_y, making the network linear in these spaces.

Result: Enables application of SVD, pseudo-inverse, orthogonal projection, etc. to nonlinear mappings. Composition property allows training diffusion models where hundreds of sampling steps collapse to one. Also enables idempotent networks for projective generative models and modular style transfer.

Conclusion: The Linearizer framework provides a principled way to apply linear algebra to neural networks, with practical applications in accelerating diffusion models, creating projective generative models, and enabling modular style transfer.

Abstract: Neural networks are famously nonlinear. However, linearity is defined relative to a pair of vector spaces, $f:X \to Y$. Leveraging the algebraic concept of transport of structure, we propose a method to explicitly identify non-standard vector spaces where a neural network acts as a linear operator. When sandwiching a linear operator $A$ between two invertible neural networks, $f(x)=g_y^{-1}(A g_x(x))$, the corresponding vector spaces $X$ and $Y$ are induced by newly defined addition and scaling actions derived from $g_x$ and $g_y$. We term this kind of architecture a Linearizer. This framework makes the entire arsenal of linear algebra, including SVD, pseudo-inverse, orthogonal projection and more, applicable to nonlinear mappings. Furthermore, we show that the composition of two Linearizers that share a neural network is also a Linearizer. We leverage this property and demonstrate that training diffusion models using our architecture makes the hundreds of sampling steps collapse into a single step. We further utilize our framework to enforce idempotency (i.e. $f(f(x))=f(x)$) on networks leading to a globally projective generative model and to demonstrate modular style transfer.

[302] Uncertainty Estimation by Flexible Evidential Deep Learning

Taeseong Yoon, Heeyoung Kim

Main category: cs.LG

TL;DR: Flexible Evidential Deep Learning (F-EDL) extends traditional EDL by using flexible Dirichlet distributions for more expressive uncertainty quantification, improving generalization across diverse scenarios.

DetailsMotivation: Current evidential deep learning (EDL) methods use Dirichlet distributions for uncertainty quantification, but this restrictive assumption limits robustness in complex or unforeseen situations. There's a need for more flexible uncertainty representations that can generalize better across diverse scenarios.

Method: Proposes F-EDL which extends traditional EDL by predicting flexible Dirichlet distributions (a generalization of standard Dirichlet distributions) over class probabilities. This provides more expressive and adaptive uncertainty representations.

Result: Theoretically establishes advantages of F-EDL and empirically demonstrates state-of-the-art uncertainty quantification performance across diverse evaluation settings including classical, long-tailed, and noisy in-distribution scenarios.

Conclusion: F-EDL provides a more expressive and adaptive uncertainty quantification framework that significantly enhances generalization and reliability under challenging scenarios compared to traditional EDL methods.

Abstract: Uncertainty quantification (UQ) is crucial for deploying machine learning models in high-stakes applications, where overconfident predictions can lead to serious consequences. An effective UQ method must balance computational efficiency with the ability to generalize across diverse scenarios. Evidential deep learning (EDL) achieves efficiency by modeling uncertainty through the prediction of a Dirichlet distribution over class probabilities. However, the restrictive assumption of Dirichlet-distributed class probabilities limits EDL’s robustness, particularly in complex or unforeseen situations. To address this, we propose \textit{flexible evidential deep learning} ($\mathcal{F}$-EDL), which extends EDL by predicting a flexible Dirichlet distribution – a generalization of the Dirichlet distribution – over class probabilities. This approach provides a more expressive and adaptive representation of uncertainty, significantly enhancing UQ generalization and reliability under challenging scenarios. We theoretically establish several advantages of $\mathcal{F}$-EDL and empirically demonstrate its state-of-the-art UQ performance across diverse evaluation settings, including classical, long-tailed, and noisy in-distribution scenarios.

[303] ExPairT-LLM: Exact Learning for LLM Code Selection by Pairwise Queries

Tom Yuviler, Dana Drachsler-Cohen

Main category: cs.LG

TL;DR: ExPairT-LLM is an exact learning algorithm for code selection that uses pairwise membership and equivalence queries to LLMs to identify correct programs through tournament-style elimination, improving pass@1 rates significantly over existing methods.

DetailsMotivation: Existing code selection algorithms for LLM-generated programs can fail due to misidentifying nonequivalent programs or relying on LLMs that may incorrectly determine outputs for every input. There's a need for more robust selection methods.

Method: ExPairT-LLM uses an exact learning algorithm that poses two new query types to LLM oracles: pairwise membership (comparing outputs of two programs on same input) and pairwise equivalence (determining if two programs produce same outputs). It identifies correct programs through tournament-style elimination that’s robust to some LLM mistakes.

Result: Evaluated on four popular code datasets, ExPairT-LLM outperforms state-of-the-art code selection algorithms by +13.0% on average (up to +27.1%) in pass@1 (success rate). It also improves pass@1 of LLMs performing complex reasoning by +24.0%.

Conclusion: ExPairT-LLM provides an effective code selection algorithm that leverages simpler LLM queries (pairwise comparisons) to robustly identify correct programs, significantly improving code generation success rates.

Abstract: Despite recent advances in LLMs, the task of code generation is still challenging. To cope, code selection algorithms select the best program from multiple programs generated by an LLM. However, existing algorithms can fail to identify the correct program, either because they can misidentify nonequivalent programs or because they rely on an LLM and assume it always correctly determines the output for every input. We present ExPairT-LLM, an exact learning algorithm for code selection that selects a program by posing to an LLM oracle two new types of queries: pairwise membership and pairwise equivalence. These queries are simpler for LLMs and enable ExPairT-LLM to identify the correct program through a tournament, which is robust to some LLM mistakes. We evaluate ExPairT-LLM on four popular code datasets. Its pass@1 (success rate) outperforms the state-of-the-art code selection algorithm on average by +13.0% and up to +27.1%. It also improves the pass@1 of LLMs performing complex reasoning by +24.0%.

[304] MIST: Mutual Information Estimation Via Supervised Training

German Gritsai, Megan Richards, Maxime Méloux, Kyunghyun Cho, Maxime Peyrard

Main category: cs.LG

TL;DR: MIST: A neural network-based mutual information estimator trained on synthetic data with quantile regression for uncertainty estimation

DetailsMotivation: To develop a flexible, data-driven mutual information estimator that can handle variable sample sizes and dimensions while providing uncertainty quantification, moving beyond classical methods with theoretical guarantees but limited practical performance

Method: Parameterize MI estimator as neural network (MIST), train on large meta-dataset of 625,000 synthetic joint distributions with known MI, use 2D attention for permutation invariance, optimize quantile regression loss for uncertainty estimation

Result: Learned estimators outperform classical baselines across sample sizes and dimensions, including unseen distributions; quantile-based intervals are well-calibrated and faster than bootstrap; estimators are differentiable and can be embedded in larger pipelines

Conclusion: Fully empirical approach to MI estimation trades theoretical guarantees for practical flexibility and efficiency, yielding trainable estimators that can adapt to diverse data modalities via normalizing flows

Abstract: We propose a fully data-driven approach to designing mutual information (MI) estimators. Since any MI estimator is a function of the observed sample from two random variables, we parameterize this function with a neural network (MIST) and train it end-to-end to predict MI values. Training is performed on a large meta-dataset of 625,000 synthetic joint distributions with known ground-truth MI. To handle variable sample sizes and dimensions, we employ a two-dimensional attention scheme ensuring permutation invariance across input samples. To quantify uncertainty, we optimize a quantile regression loss, enabling the estimator to approximate the sampling distribution of MI rather than return a single point estimate. This research program departs from prior work by taking a fully empirical route, trading universal theoretical guarantees for flexibility and efficiency. Empirically, the learned estimators largely outperform classical baselines across sample sizes and dimensions, including on joint distributions unseen during training. The resulting quantile-based intervals are well-calibrated and more reliable than bootstrap-based confidence intervals, while inference is orders of magnitude faster than existing neural baselines. Beyond immediate empirical gains, this framework yields trainable, fully differentiable estimators that can be embedded into larger learning pipelines. Moreover, exploiting MI’s invariance to invertible transformations, meta-datasets can be adapted to arbitrary data modalities via normalizing flows, enabling flexible training for diverse target meta-distributions.

[305] Learning to Orchestrate Agents in Natural Language with the Conductor

Stefan Nielsen, Edoardo Cetin, Peter Schwendeman, Qi Sun, Jinglue Xu, Yujin Tang

Main category: cs.LG

TL;DR: A Conductor model uses RL to coordinate multiple LLMs, learning communication topologies and prompting strategies to outperform individual models on reasoning benchmarks.

DetailsMotivation: Different LLMs have specialized capabilities across domains, but coordinating them effectively requires sophisticated strategies. The paper aims to develop an RL-trained Conductor model that can automatically discover optimal coordination strategies among diverse LLMs.

Method: Train a 7B Conductor model using reinforcement learning to: 1) design communication topologies for agent collaboration, 2) engineer focused prompts to leverage individual LLM capabilities, and 3) adapt to arbitrary sets of open- and closed-source agents through randomized agent pool training.

Result: The Conductor achieves state-of-the-art results on challenging reasoning benchmarks (LiveCodeBench, GPQA), outperforming any individual worker LLM. It effectively adapts to different agent pools and enables recursive topologies when selecting itself as a worker.

Conclusion: RL can unlock language model coordination, with powerful strategies emerging through end-to-end reward maximization. The approach enables dynamic test-time scaling through online iterative adaptation.

Abstract: Powerful large language models (LLMs) from different providers have been expensively trained and finetuned to specialize across varying domains. In this work, we introduce a new kind of Conductor model trained with reinforcement learning to automatically discover powerful coordination strategies among LLMs. Our Conductor learns not only to design targeted communication topologies for effective agent-to-agent collaboration, but also to prompt engineer focused instructions to the LLMs to maximally leverage their individual capabilities. We show that, by learning optimal coordination strategies over pools of powerful worker LLMs, a 7B Conductor achieves significant performance gains beyond any individual worker, attaining state-of-the-art results in challenging reasoning benchmarks, such as LiveCodeBench and GPQA. By training with randomized agent pools, our conductor effectively adapts to arbitrary sets of open- and closed-source agents, meeting any user requirements. Furthermore, allowing the Conductor to select itself as a worker gives rise to recursive topologies, elevating performance with a new form of dynamic test-time scaling through online iterative adaptation. More broadly, ours is among the early work demonstrating language model coordination can be unlocked through RL, where powerful coordination strategies emerge naturally in LLMs through pure end-to-end reward maximization.

[306] Amortized Inference of Multi-Modal Posteriors using Likelihood-Weighted Normalizing Flows

Rajneil Baruah

Main category: cs.LG

TL;DR: Amortized posterior estimation using Normalizing Flows with likelihood-weighted importance sampling for high-dimensional inverse problems, focusing on multi-modal distributions and the importance of base distribution topology.

DetailsMotivation: The paper addresses the challenge of efficient posterior estimation in high-dimensional inverse problems without requiring posterior training samples. It specifically tackles the issue of capturing multi-modal distributions where standard unimodal base distributions fail to properly represent disconnected support regions.

Method: Uses Normalizing Flows trained with likelihood-weighted importance sampling for amortized posterior estimation. Key innovation is initializing the flow with a Gaussian Mixture Model that matches the cardinality of target modes to better capture disconnected support, rather than using standard unimodal base distributions.

Result: The method was implemented on multi-modal benchmark tasks in 2D and 3D. Results show that standard unimodal base distributions create spurious probability bridges between modes, while initializing with matching Gaussian Mixture Models significantly improves reconstruction fidelity as measured by distance and divergence metrics.

Conclusion: The topology of base distributions critically impacts the quality of modelled posteriors in Normalizing Flows. Using Gaussian Mixture Models that match the cardinality of target modes is essential for accurately capturing multi-modal distributions without artificial connections between disconnected support regions.

Abstract: We present a novel technique for amortized posterior estimation using Normalizing Flows trained with likelihood-weighted importance sampling. This approach allows for the efficient inference of theoretical parameters in high-dimensional inverse problems without the need for posterior training samples. We implement the method on multi-modal benchmark tasks in 2D and 3D to check for the efficacy. A critical observation of our study is the impact of the topology of the base distributions on the modelled posteriors. We find that standard unimodal base distributions fail to capture disconnected support, resulting in spurious probability bridges between modes. We demonstrate that initializing the flow with a Gaussian Mixture Model that matches the cardinality of the target modes significantly improves reconstruction fidelity, as measured by some distance and divergence metrics.

[307] Correction of Decoupled Weight Decay

Jason Chuan-Chih Chou

Main category: cs.LG

TL;DR: The paper challenges the conventional wisdom that decoupled weight decay should be proportional to learning rate (γ), arguing instead that it should be proportional to γ² for stable weight and gradient norms, which improves training dynamics and model performance.

DetailsMotivation: The paper questions the long-standing assumption that decoupled weight decay should be proportional to learning rate (γ) in optimizers like AdamW. Recent arguments suggest it should be proportional to γ² based on orthogonality arguments, but the authors find this reasoning insufficient and seek a more fundamental understanding of how weight decay affects training dynamics.

Method: The authors analyze training dynamics by eliminating the contribution of the perpendicular component of updates to weight norm. They derive that decoupled weight decay ∝ γ² results in stable weight norm based on the assumption that updates become independent of weights at steady state. They also analyze Total Update Contribution (TUC) under the Scion optimizer and show that momentum-dependent effective learning rate better characterizes optimal values.

Result: The paper shows that decoupled weight decay ∝ γ² leads to stable weight and gradient norms, allowing better control of training dynamics and improved model performance. The derived relationships transfer well and provide more stable optimization behavior compared to the conventional ∝ γ approach.

Conclusion: The conventional practice of setting decoupled weight decay proportional to learning rate is suboptimal. Setting it proportional to γ² provides more stable training dynamics, better control over weight and gradient norms, and ultimately improves model performance across various optimization scenarios.

Abstract: Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set to proportional to learning rate $γ$ without questioning. Some researchers have recently challenged such assumption and argued that decoupled weight decay should be set $\propto γ^2$ instead based on orthogonality arguments at steady state. To the contrary, we find that eliminating the contribution of the perpendicular component of the update to the weight norm leads to little change to the training dynamics. Instead, we derive that decoupled weight decay $\propto γ^2$ results in stable weight norm based on the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer. Based on the same assumption, we derive and empirically verify that the Total Update Contribution (TUC) of a minibatch under the Scion optimizer is better characterized by the momentum-dependent effective learning rate whose optimal value transfers and we show that decoupled weight decay $\propto γ^2$ leads to stable weight and gradient norms and allows us to better control the training dynamics and improve the model performance.

[308] Guided Transfer Learning for Discrete Diffusion Models

Julian Kleutgens, Claudio Battiloro, Lingkai Kong, Benjamin Grewe, Francesca Dominici, Mauricio Tec

Main category: cs.LG

TL;DR: GTL enables transfer learning for discrete diffusion models via classifier ratio-based guidance, making them effective in small-data regimes without modifying pretrained denoisers.

DetailsMotivation: Discrete diffusion models perform well but require large datasets, limiting their use in small-data scenarios. While continuous DMs can use classifier ratio-based guidance for transfer learning, this approach hasn't been explored for discrete DMs.

Method: Proposes Guided Transfer Learning (GTL) for discrete DMs. Theoretical analysis shows direct extension of ratio-based guidance is computationally prohibitive (scales with vocabulary size). Introduces scheduling mechanism to reduce cost to linear scaling, enabling sampling from target distributions without modifying pretrained denoisers.

Result: Evaluated on sequential data including synthetic Markov chains and language modeling tasks. Shows clear trade-off: weight fine-tuning better for large target datasets, GTL increasingly effective as target data shrinks. Identifies failure mode when source/target distributions overlap poorly.

Conclusion: GTL provides practical transfer learning for discrete DMs in small-data regimes, with linear scaling in vocabulary size. Performance depends on distribution overlap between source and target domains.

Abstract: Discrete diffusion models (DMs) have achieved strong performance in language and other discrete domains, offering a compelling alternative to autoregressive modeling. Yet this performance typically depends on large training datasets, challenging the performance of DMs in small-data regimes – common under real-world constraints. Aimed at this challenge, recent work in continuous DMs suggests that transfer learning via classifier ratio-based guidance can adapt a pretrained DM to a related target distribution, often outperforming alternatives such as full-weight fine-tuning on the target data. By contrast, transfer learning for discrete DMs remains unexplored. We address this gap by exploring practical analogues of ratio-based transfer learning for discrete DMs. Our theoretical analysis shows that a direct extension of existing ratio-based guidance is computationally prohibitive, scaling with vocabulary size. To overcome this limitation, we introduce a scheduling mechanism that yields a practical algorithm, Guided Transfer Learning for discrete diffusion models (GTL). GTL enables sampling from a target distribution without modifying the pretrained denoiser and reduces the cost to linear scaling in vocabulary size, which in turn supports longer sequence generation. We evaluate GTL on sequential data, including synthetic Markov chains and language modeling tasks, and provide a detailed empirical analysis of its behavior. The results highlight a clear trade-off: when target datasets are large, weight fine-tuning is often preferable, whereas GTL becomes increasingly effective as target data shrinks. Finally, we experimentally demonstrate a key failure mode of GTL: when the source and target distributions overlap poorly, the ratio-based classifier required for guidance becomes unreliable, limiting transfer performance.

[309] How Does Fourier Analysis Network Work? A Mechanism Analysis and a New Dual-Activation Layer Proposal

Sam Jeong, Hae Yong Kim

Main category: cs.LG

TL;DR: FAN’s performance gains come from sine activation’s non-zero derivative at x=0, not periodicity, addressing vanishing gradients and dying-ReLU problem, leading to more efficient Dual-Activation Layer (DAL).

DetailsMotivation: To understand the underlying mechanism behind Fourier Analysis Network (FAN) improvements and develop more efficient activation functions that address vanishing gradient and dying-ReLU problems in neural networks.

Method: Analyzed FAN components, discovered only sine activation helps while cosine is detrimental, identified local behavior near x=0 as key, developed Dual-Activation Layer (DAL) as more efficient alternative.

Result: DAL models converge faster and achieve equal or higher validation accuracy on three tasks: noisy sinusoidal signal classification, MNIST digit classification, and ECG-based biometric recognition.

Conclusion: FAN’s benefits stem from sine activation’s gradient properties rather than spectral interpretation, leading to more effective activation functions that improve training dynamics and convergence.

Abstract: Fourier Analysis Network (FAN) was recently proposed as a simple way to improve neural network performance by replacing part of Rectified Linear Unit (ReLU) activations with sine and cosine functions. Although several studies have reported small but consistent gains across tasks, the underlying mechanism behind these improvements has remained unclear. In this work, we show that only the sine activation contributes positively to performance, whereas the cosine activation tends to be detrimental. Our analysis reveals that the improvement is not a consequence of the sine function’s periodic nature; instead, it stems from the function’s local behavior near x = 0, where its non-zero derivative mitigates the vanishing-gradient problem. We further show that FAN primarily alleviates the dying-ReLU problem, in which a neuron consistently receives negative inputs, produces zero gradients, and stops learning. Although modern ReLU-like activations, such as Leaky ReLU, GELU, and Swish, reduce ReLU’s zero-gradient region, they still contain input domains where gradients remain significantly diminished, contributing to slower optimization and hindering rapid convergence. FAN addresses this limitation by introducing a more stable gradient pathway. This analysis shifts the understanding of FAN’s benefits from a spectral interpretation to a concrete analysis of training dynamics, leading to the development of the Dual-Activation Layer (DAL), a more efficient convergence accelerator. We evaluate DAL on three tasks: classification of noisy sinusoidal signals versus pure noise, MNIST digit classification, and Electrocardiogram (ECG)-based biometric recognition. In all cases, DAL models converge faster and achieve equal or higher validation accuracy compared to models with conventional activations.

[310] Phase-space entropy at acquisition reflects downstream learnability

Xiu-Cheng Wang, Jun-Jie Zhanga, Nan Cheng, Long-Gang Pang, Taijiao Du, Deyu Meng

Main category: cs.LG

TL;DR: Proposes a modality-agnostic metric ΔS_B based on phase-space entropy to quantify how acquisition preserves or destroys information for downstream learning, validated across imaging and communication domains.

DetailsMotivation: Current learning systems work with diverse data domains but lack a general way to quantify how acquisition itself affects information preservation before any model training. Need a modality-agnostic metric to measure information preservation at the acquisition level.

Method: Proposes ΔS_B, an acquisition-level scalar based on instrument-resolved phase space that quantifies how acquisition mixes or removes joint space-frequency structure. Unlike pixelwise distortion or spectral errors, it directly measures information preservation at the instrument scale.

Result: Theoretically shows ΔS_B identifies phase-space coherence of periodic sampling as the physical source of aliasing, recovering classical sampling-theorem consequences. Empirically, |ΔS_B| consistently ranks sampling geometries and predicts downstream reconstruction/recognition difficulty without training across masked image classification, accelerated MRI, and massive MIMO.

Conclusion: Phase-space entropy at acquisition reflects downstream learnability, enabling pre-training selection of sampling policies and providing a shared notion of information preservation across modalities.

Abstract: Modern learning systems work with data that vary widely across domains, but they all ultimately depend on how much structure is already present in the measurements before any model is trained. This raises a basic question: is there a general, modality-agnostic way to quantify how acquisition itself preserves or destroys the information that downstream learners could use? Here we propose an acquisition-level scalar $ΔS_{\mathcal B}$ based on instrument-resolved phase space. Unlike pixelwise distortion or purely spectral errors that often saturate under aggressive undersampling, $ΔS_{\mathcal B}$ directly quantifies how acquisition mixes or removes joint space–frequency structure at the instrument scale. We show theoretically that (ΔS_{\mathcal B}) correctly identifies the phase-space coherence of periodic sampling as the physical source of aliasing, recovering classical sampling-theorem consequences. Empirically, across masked image classification, accelerated MRI, and massive MIMO (including over-the-air measurements), $|ΔS_{\mathcal B}|$ consistently ranks sampling geometries and predicts downstream reconstruction/recognition difficulty \emph{without training}. In particular, minimizing $|ΔS_{\mathcal B}|$ enables zero-training selection of variable-density MRI mask parameters that matches designs tuned by conventional pre-reconstruction criteria. These results suggest that phase-space entropy at acquisition reflects downstream learnability, enabling pre-training selection of candidate sampling policies and as a shared notion of information preservation across modalities.

[311] Communication-Corruption Coupling and Verification in Cooperative Multi-Objective Bandits

Ming Shi

Main category: cs.LG

TL;DR: Multi-agent bandit learning with adversarial corruption and limited verification, showing communication protocols affect effective corruption levels from Γ to NΓ, with verification enabling corruption-independent regret.

DetailsMotivation: Study cooperative multi-armed bandits with vector rewards under adversarial corruption, where agents must coordinate despite corrupted feedback and limited verification capabilities.

Method: Analyze different communication protocols (raw-sample sharing, summary sharing, recommendation-only sharing) and their impact on effective corruption levels. Formalize via protocol-induced multiplicity functional and prove regret bounds parameterized by effective corruption.

Result: Raw-sample sharing suffers N-fold corruption penalty, while summary/recommendation sharing preserves O(Γ) term. Information-theoretic limits show unavoidable Ω(Γ) penalty. Verification restores learnability in high-corruption regimes.

Conclusion: Communication protocols crucially affect corruption resilience in multi-agent bandits. Verification enables corruption-independent regret, with certified sharing overcoming adversarial perturbations.

Abstract: We study cooperative stochastic multi-armed bandits with vector-valued rewards under adversarial corruption and limited verification. In each of $T$ rounds, each of $N$ agents selects an arm, the environment generates a clean reward vector, and an adversary perturbs the observed feedback subject to a global corruption budget $Γ$. Performance is measured by team regret under a coordinate-wise nondecreasing, $L$-Lipschitz scalarization $φ$, covering linear, Chebyshev, and smooth monotone utilities. Our main contribution is a communication-corruption coupling: we show that a fixed environment-side budget $Γ$ can translate into an effective corruption level ranging from $Γ$ to $NΓ$, depending on whether agents share raw samples, sufficient statistics, or only arm recommendations. We formalize this via a protocol-induced multiplicity functional and prove regret bounds parameterized by the resulting effective corruption. As corollaries, raw-sample sharing can suffer an $N$-fold larger additive corruption penalty, whereas summary sharing and recommendation-only sharing preserve an unamplified $O(Γ)$ term and achieve centralized-rate team regret. We further establish information-theoretic limits, including an unavoidable additive $Ω(Γ)$ penalty and a high-corruption regime $Γ=Θ(NT)$ where sublinear regret is impossible without clean information. Finally, we characterize how a global budget $ν$ of verified observations restores learnability. That is, verification is necessary in the high-corruption regime, and sufficient once it crosses the identification threshold, with certified sharing enabling the team’s regret to become independent of $Γ$.

[312] DeRaDiff: Denoising Time Realignment of Diffusion Models

Ratnavibusena Don Shahain Manujith, Teoh Tze Tzun, Kenji Kawaguchi, Yang Zhang

Main category: cs.LG

TL;DR: DeRaDiff enables real-time adjustment of regularization strength in aligned diffusion models without retraining, eliminating expensive hyperparameter sweeps.

DetailsMotivation: Current diffusion model alignment methods require expensive hyperparameter sweeps to find optimal KL regularization strength, which is computationally prohibitive.

Method: DeRaDiff uses denoising time realignment by replacing reverse step reference distribution with geometric mixture of aligned and reference posteriors, enabling on-the-fly control via single parameter lambda.

Result: Method approximates models trained at different regularization strengths without additional training, reducing computational costs while maintaining alignment quality across metrics.

Conclusion: DeRaDiff provides efficient way to search optimal regularization strength, eliminating need for expensive alignment sweeps in diffusion model alignment.

Abstract: Recent advances align diffusion models with human preferences to increase aesthetic appeal and mitigate artifacts and biases. Such methods aim to maximize a conditional output distribution aligned with higher rewards whilst not drifting far from a pretrained prior. This is commonly enforced by KL (Kullback Leibler) regularization. As such, a central issue still remains: how does one choose the right regularization strength? Too high of a strength leads to limited alignment and too low of a strength leads to “reward hacking”. This renders the task of choosing the correct regularization strength highly non-trivial. Existing approaches sweep over this hyperparameter by aligning a pretrained model at multiple regularization strengths and then choose the best strength. Unfortunately, this is prohibitively expensive. We introduce DeRaDiff, a denoising time realignment procedure that, after aligning a pretrained model once, modulates the regularization strength during sampling to emulate models trained at other regularization strengths without any additional training or finetuning. Extending decoding-time realignment from language to diffusion models, DeRaDiff operates over iterative predictions of continuous latents by replacing the reverse step reference distribution by a geometric mixture of an aligned and reference posterior, thus giving rise to a closed form update under common schedulers and a single tunable parameter, lambda, for on the fly control. Our experiments show that across multiple text image alignment and image-quality metrics, our method consistently provides a strong approximation for models aligned entirely from scratch at different regularization strengths. Thus, our method yields an efficient way to search for the optimal strength, eliminating the need for expensive alignment sweeps and thereby substantially reducing computational costs.

[313] Probe-then-Commit Multi-Objective Bandits: Theoretical Benefits of Limited Multi-Arm Feedback

Ming Shi

Main category: cs.LG

TL;DR: Multi-objective online resource selection with limited probe-then-commit feedback, where agent can probe q arms before committing to one, bridging bandits and full-information experts.

DetailsMotivation: Addresses practical resource selection problems in multi-radio access and mobile edge computing where agents can probe multiple candidates via control-plane measurements but must commit to exactly one for execution, creating a feedback regime between classical bandits and full-information experts.

Method: Develops PtC-P-UCB algorithm with frontier-aware probing under uncertainty in Pareto mode, selecting q probes by maximizing hypervolume-inspired frontier-coverage potential and committing by marginal hypervolume gain to expand attained Pareto region.

Result: Proves dominated-hypervolume frontier error of Õ(K_P d/√(qT)) and scalarized regret of Õ(L_φ d√((K/q)T)), showing 1/√q acceleration from limited probing, with extensions to multi-modal probing with uncertainty fusion.

Conclusion: The work provides theoretical foundations for limited multi-arm feedback in multi-objective online learning with practical applications in communication and edge computing systems, demonstrating benefits of strategic probing.

Abstract: We study an online resource-selection problem motivated by multi-radio access selection and mobile edge computing offloading. In each round, an agent chooses among $K$ candidate links/servers (arms) whose performance is a stochastic $d$-dimensional vector (e.g., throughput, latency, energy, reliability). The key interaction is \emph{probe-then-commit (PtC)}: the agent may probe up to $q>1$ candidates via control-plane measurements to observe their vector outcomes, but must execute exactly one candidate in the data plane. This limited multi-arm feedback regime strictly interpolates between classical bandits ($q=1$) and full-information experts ($q=K$), yet existing multi-objective learning theory largely focuses on these extremes. We develop \textsc{PtC-P-UCB}, an optimistic probe-then-commit algorithm whose technical core is frontier-aware probing under uncertainty in a Pareto mode, e.g., it selects the $q$ probes by approximately maximizing a hypervolume-inspired frontier-coverage potential and commits by marginal hypervolume gain to directly expand the attained Pareto region. We prove a dominated-hypervolume frontier error of $\tilde{O} (K_P d/\sqrt{qT})$, where $K_P$ is the Pareto-frontier size and $T$ is the horizon, and scalarized regret $\tilde{O} (L_φd\sqrt{(K/q)T})$, where $φ$ is the scalarizer. These quantify a transparent $1/\sqrt{q}$ acceleration from limited probing. We further extend to \emph{multi-modal probing}: each probe returns $M$ modalities (e.g., CSI, queue, compute telemetry), and uncertainty fusion yields variance-adaptive versions of the above bounds via an effective noise scale.

[314] Harpoon: Generalised Manifold Guidance for Conditional Tabular Diffusion

Aditya Shankar, Yuandou Wang, Rihan Hai, Lydia Y. Chen

Main category: cs.LG

TL;DR: HARPOON is a tabular diffusion method that uses manifold theory to guide generation to satisfy diverse tabular conditions at inference time, extending beyond training-time strategies.

DetailsMotivation: Existing tabular data generation methods don't generalize to unseen constraints during inference and struggle with conditional tasks beyond imputation. Current manifold formulations are tied to specific objectives and limited to continuous domains.

Method: Extends manifold theory to tabular data and introduces HARPOON, a tabular diffusion method that guides unconstrained samples along manifold geometry to satisfy diverse tabular conditions at inference time.

Result: Demonstrates strong performance on tasks like imputation and enforcing inequality constraints across diverse datasets, showing practical benefits of manifold-aware guidance for tabular data.

Conclusion: HARPOON successfully extends manifold theory to tabular domains and provides a flexible framework for conditional generation with diverse inference-time objectives.

Abstract: Generating tabular data under conditions is critical to applications requiring precise control over the generative process. Existing methods rely on training-time strategies that do not generalise to unseen constraints during inference, and struggle to handle conditional tasks beyond tabular imputation. While manifold theory offers a principled way to guide generation, current formulations are tied to specific inference-time objectives and are limited to continuous domains. We extend manifold theory to tabular data and expand its scope to handle diverse inference-time objectives. On this foundation, we introduce HARPOON, a tabular diffusion method that guides unconstrained samples along the manifold geometry to satisfy diverse tabular conditions at inference. We validate our theoretical contributions empirically on tasks such as imputation and enforcing inequality constraints, demonstrating HARPOON’S strong performance across diverse datasets and the practical benefits of manifold-aware guidance for tabular data. Code URL: https://github.com/adis98/Harpoon

[315] Amortized Molecular Optimization via Group Relative Policy Optimization

Muhammad bin Javaid, Hasham Hussain, Ashima Khanna, Berke Kisin, Jonathan Pirnay, Alexander Mitsos, Dominik G. Grimm, Martin Grohe

Main category: cs.LG

TL;DR: GRXForm is a graph transformer-based method for molecular optimization that uses group relative policy optimization to handle heterogeneous starting structures, achieving competitive multi-objective optimization without inference-time oracle calls.

DetailsMotivation: Current molecular design methods for structural alteration act as "instance optimizers" that restart search for every input structure, lacking generalization. Model-based approaches theoretically offer amortized efficiency but struggle with generalization due to high variance from heterogeneous starting structure difficulty.

Method: GRXForm adapts a pre-trained Graph Transformer model to optimize molecules via sequential atom-and-bond additions. It employs Group Relative Policy Optimization (GRPO) for goal-directed fine-tuning, which normalizes rewards relative to the starting structure to mitigate variance.

Result: GRXForm generalizes to out-of-distribution molecular scaffolds without inference-time oracle calls or refinement, achieving scores in multi-objective optimization competitive with leading instance optimizers.

Conclusion: The proposed GRXForm with GRPO effectively addresses generalization challenges in molecular optimization by handling heterogeneous starting structure difficulty, offering an efficient alternative to instance optimizers.

Abstract: Molecular design encompasses tasks ranging from de-novo design to structural alteration of given molecules or fragments. For the latter, state-of-the-art methods predominantly function as “Instance Optimizers’’, expending significant compute restarting the search for every input structure. While model-based approaches theoretically offer amortized efficiency by learning a policy transferable to unseen structures, existing methods struggle to generalize. We identify a key failure mode: the high variance arising from the heterogeneous difficulty of distinct starting structures. To address this, we introduce GRXForm, adapting a pre-trained Graph Transformer model that optimizes molecules via sequential atom-and-bond additions. We employ Group Relative Policy Optimization (GRPO) for goal-directed fine-tuning to mitigate variance by normalizing rewards relative to the starting structure. Empirically, GRXForm generalizes to out-of-distribution molecular scaffolds without inference-time oracle calls or refinement, achieving scores in multi-objective optimization competitive with leading instance optimizers.

[316] Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum

Minxin Zhang, Yuxuan Liu, Hayden Schaeffer

Main category: cs.LG

TL;DR: NAMO and NAMO-D: New optimizers combining orthogonalized momentum with Adam-type noise adaptation for improved LLM training

DetailsMotivation: Existing optimizers like Adam use adaptive moment estimates for stability, while Muon leverages weight matrix structure via orthogonalized momentum. The authors aim to create a principled integration of orthogonalized momentum with norm-based Adam-type noise adaptation for better large language model training.

Method: Proposes two optimizers: NAMO scales orthogonalized momentum using a single adaptive stepsize while preserving orthogonality. NAMO-D extends this by right-multiplying orthogonalized momentum by a diagonal matrix with clamped entries, enabling neuron-wise noise adaptation aligned with near block-diagonal Hessian structure.

Result: Both NAMO and NAMO-D outperform AdamW and Muon baselines in GPT-2 pretraining experiments. NAMO-D achieves further gains over NAMO through its clamping hyperparameter that balances well-conditioned updates with fine-grained noise adaptation.

Conclusion: The proposed optimizers successfully integrate orthogonalized momentum with noise adaptation, providing improved convergence and performance for large language model training with theoretical guarantees.

Abstract: Efficient stochastic optimization typically integrates an update direction that performs well in the deterministic regime with a mechanism adapting to stochastic perturbations. While Adam uses adaptive moment estimates to promote stability, Muon utilizes the weight layers’ matrix structure via orthogonalized momentum, showing superior performance in large language model training. We propose a new optimizer and a diagonal extension, NAMO and NAMO-D, providing the first principled integration of orthogonalized momentum with norm-based Adam-type noise adaptation. NAMO scales orthogonalized momentum using a single adaptive stepsize, preserving orthogonality while improving upon Muon at negligible additional cost. NAMO-D instead right-multiplies orthogonalized momentum by a diagonal matrix with clamped entries. This design enables neuron-wise noise adaptation and aligns with the common near block-diagonal Hessian structure. Under standard assumptions, we establish optimal convergence rates for both algorithms in the deterministic setting and show that, in the stochastic setting, their convergence guarantees adapt to the noise level of stochastic gradients. Experiments on pretraining GPT-2 models demonstrate improved performance of both NAMO and NAMO-D compared to the AdamW and Muon baselines, with NAMO-D achieving further gains over NAMO via an additional clamping hyperparameter that balances the competing goals of maintaining a well-conditioned update direction and leveraging fine-grained noise adaptation.

cs.MA

[317] Reasoning-Native Agentic Communication for 6G

Hyowoon Seo, Joonho Seon, Jin Young Kim, Mehdi Bennis, Wan Choi, Dong In Kim

Main category: cs.MA

TL;DR: A new communication paradigm called “reasoning native agentic communication” for 6G networks that focuses on aligning agents’ internal belief states rather than just transmitting information, addressing belief divergence in autonomous systems.

DetailsMotivation: Future 6G networks will interconnect autonomous machines that continuously sense, reason, and act. Traditional communication approaches fail when agents interpret the same information correctly but behave inconsistently due to divergent internal reasoning processes (belief divergence).

Method: Proposes a reasoning native architecture that augments conventional communication stack with a coordination plane grounded in shared knowledge structure and bounded belief modeling. Communication is triggered based on predicted misalignment in agents’ internal belief states rather than just channel conditions or data relevance.

Result: The framework enables prevention of coordination drift and maintenance of coherent behavior across heterogeneous autonomous systems by reframing communication as a regulator of distributed reasoning.

Conclusion: Reasoning native agentic communication enables 6G networks to act as active harmonizers of autonomous intelligence by addressing belief divergence rather than just transmitting representations.

Abstract: Future 6G networks will interconnect not only devices, but autonomous machines that continuously sense, reason, and act. In such environments, communication can no longer be understood solely as delivering bits or even preserving semantic meaning. Even when two agents interpret the same information correctly, they may still behave inconsistently if their internal reasoning processes evolve differently. We refer to this emerging challenge as belief divergence. This article introduces reasoning native agentic communication, a new paradigm in which communication is explicitly designed to address belief divergence rather than merely transmitting representations. Instead of triggering transmissions based only on channel conditions or data relevance, the proposed framework activates communication according to predicted misalignment in agents internal belief states. We present a reasoning native architecture that augments the conventional communication stack with a coordination plane grounded in a shared knowledge structure and bounded belief modeling. Through enabling mechanisms and representative multi agent scenarios, we illustrate how such an approach can prevent coordination drift and maintain coherent behavior across heterogeneous systems. By reframing communication as a regulator of distributed reasoning, reasoning native agentic communication enables 6G networks to act as an active harmonizer of autonomous intelligence.

[318] MultiVer: Zero-Shot Multi-Agent Vulnerability Detection

Shreshth Rajan

Main category: cs.MA

TL;DR: MultiVer is a zero-shot multi-agent system for vulnerability detection that achieves state-of-the-art recall without fine-tuning, using a four-agent ensemble (security, correctness, performance, style) with union voting.

DetailsMotivation: The paper addresses the challenge of vulnerability detection in code, where false negatives (missed vulnerabilities) are often more costly than false positives. Current approaches typically require fine-tuning, but the authors aim to develop a zero-shot system that can match or exceed fine-tuned performance on recall metrics.

Method: MultiVer uses a four-agent ensemble approach with specialized agents focusing on security, correctness, performance, and style aspects of code analysis. The system employs union voting where any agent detecting a vulnerability leads to a positive classification. This zero-shot approach requires no fine-tuning on vulnerability datasets.

Result: The system achieves 82.7% recall on PyVul benchmark, exceeding fine-tuned GPT-3.5 (81.3%) by 1.4 percentage points. On SecurityEval, it achieves 91.7% detection rate, matching specialized systems. However, this comes at a precision cost (48.8% vs 63.9% for baselines), yielding 61.4% F1 score. The multi-agent ensemble adds 17 percentage points recall over single-agent security analysis.

Conclusion: For security applications where false negatives are costlier than false positives, zero-shot multi-agent ensembles can match and exceed fine-tuned models on the metric that matters most (recall). The work demonstrates the effectiveness of multi-agent approaches for specialized tasks without requiring fine-tuning.

Abstract: We present MultiVer, a zero-shot multi-agent system for vulnerability detection that achieves state-of-the-art recall without fine-tuning. A four-agent ensemble (security, correctness, performance, style) with union voting achieves 82.7% recall on PyVul, exceeding fine-tuned GPT-3.5 (81.3%) by 1.4 percentage points – the first zeroshot system to surpass fine-tuned performance on this benchmark. On SecurityEval, the same architecture achieves 91.7% detection rate, matching specialized systems. The recall improvement comes at a precision cost: 48.8% precision versus 63.9% for fine-tuned baselines, yielding 61.4% F1. Ablation experiments isolate component contributions: the multi-agent ensemble adds 17 percentage points recall over single-agent security analysis. These results demonstrate that for security applications where false negatives are costlier than false positives, zero-shot multi-agent ensembles can match and exceed fine-tuned models on the metric that matters most.

[319] Mean-Field Reinforcement Learning without Synchrony

Shan Yang

Main category: cs.MA

TL;DR: Temporal Mean Field (TMF) framework extends multi-agent RL to asynchronous settings using population distribution instead of mean action, enabling scaling from fully synchronous to purely sequential decision-making.

DetailsMotivation: Existing mean-field RL relies on mean action statistics which require all agents to act at every time step, making it unsuitable for asynchronous environments where some agents may be idle. The population distribution provides a more flexible summary statistic that remains defined regardless of which agents act.

Method: Developed Temporal Mean Field (TMF) framework using population distribution μ∈Δ(O) as the summary statistic instead of mean action. Proved existence/uniqueness of TMF equilibria, established O(1/√N) finite-population approximation bound, and developed TMF-PG policy gradient algorithm that converges to unique equilibrium.

Result: TMF-PG achieves near-identical performance whether one agent or all N act per step, with approximation error decaying at predicted O(1/√N) rate. Experiments on resource selection and dynamic queueing games confirm theoretical results.

Conclusion: TMF framework successfully extends mean-field RL to asynchronous settings using population distribution, providing theoretical guarantees and practical algorithms that work across the full spectrum from synchronous to sequential decision-making.

Abstract: Mean-field reinforcement learning (MF-RL) scales multi-agent RL to large populations by reducing each agent’s dependence on others to a single summary statistic – the mean action. However, this reduction requires every agent to act at every time step; when some agents are idle, the mean action is simply undefined. Addressing asynchrony therefore requires a different summary statistic – one that remains defined regardless of which agents act. The population distribution $μ\in Δ(\mathcal{O})$ – the fraction of agents at each observation – satisfies this requirement: its dimension is independent of $N$, and under exchangeability it fully determines each agent’s reward and transition. Existing MF-RL theory, however, is built on the mean action and does not extend to $μ$. We therefore construct the Temporal Mean Field (TMF) framework around the population distribution $μ$ from scratch, covering the full spectrum from fully synchronous to purely sequential decision-making within a single theory. We prove existence and uniqueness of TMF equilibria, establish an $O(1/\sqrt{N})$ finite-population approximation bound that holds regardless of how many agents act per step, and prove convergence of a policy gradient algorithm (TMF-PG) to the unique equilibrium. Experiments on a resource selection game and a dynamic queueing game confirm that TMF-PG achieves near-identical performance whether one agent or all $N$ act per step, with approximation error decaying at the predicted $O(1/\sqrt{N})$ rate.

cs.MM

[320] MusicSem: A Semantically Rich Language–Audio Dataset of Natural Music Descriptions

Rebecca Salganik, Teng Tu, Fei-Yueh Chen, Xiaohao Liu, Keifeng Lu, Ethan Luvisia, Zhiyao Duan, Guillaume Salha-Galvan, Anson Kahng, Yunshan Ma, Jian Kang

Main category: cs.MM

TL;DR: MusicSem dataset of 32,493 language-audio pairs from Reddit discussions, capturing natural music descriptions with semantic taxonomy, used to evaluate multimodal models for retrieval/generation.

DetailsMotivation: Existing multimodal music models struggle to capture users' expressed intent in natural language descriptions, suggesting training datasets don't reflect broader human discourse about music.

Method: Created MusicSem dataset from organic Reddit discussions, proposed taxonomy of 5 semantic categories (descriptive, atmospheric, situational, metadata-related, contextual), used dataset to evaluate multimodal models.

Result: MusicSem captures broader spectrum of musical semantics than existing datasets, reveals limitations of current models in handling fine-grained semantics, provides resource for human-aligned multimodal learning.

Conclusion: MusicSem serves as novel semantics-aware resource to support future research on human-aligned multimodal music representation learning, highlighting importance of modeling fine-grained semantics.

Abstract: Music representation learning is central to music information retrieval and generation. While recent advances in multimodal learning have improved alignment between text and audio for tasks such as cross-modal music retrieval, text-to-music generation, and music-to-text generation, existing models often struggle to capture users’ expressed intent in natural language descriptions of music. This observation suggests that the datasets used to train and evaluate these models do not fully reflect the broader and more natural forms of human discourse through which music is described. In this paper, we introduce MusicSem, a dataset of 32,493 language-audio pairs derived from organic music-related discussions on the social media platform Reddit. Compared to existing datasets, MusicSem captures a broader spectrum of musical semantics, reflecting how listeners naturally describe music in nuanced and human-centered ways. To structure these expressions, we propose a taxonomy of five semantic categories: descriptive, atmospheric, situational, metadata-related, and contextual. In addition to the construction, analysis, and release of MusicSem, we use the dataset to evaluate a wide range of multimodal models for retrieval and generation, highlighting the importance of modeling fine-grained semantics. Overall, MusicSem serves as a novel semantics-aware resource to support future research on human-aligned multimodal music representation learning.

eess.AS

[321] SIRUP: A diffusion-based virtual upmixer of steering vectors for highly-directive spatialization with first-order ambisonics

Emilio Picard, Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Kazuyoshi Yoshii

Main category: eess.AS

TL;DR: SIRUP uses a latent diffusion model to upmix first-order ambisonics (FOA) to higher-order ambisonics (HOA) for better spatial audio processing, outperforming traditional physics-based methods.

DetailsMotivation: Traditional methods for upmixing FOA to HOA data struggle with the mutual dependency between spatial directivity estimation and FOA's limited spatial resolution, requiring a more effective approach.

Method: SIRUP employs a latent diffusion model architecture: a VAE learns compact HOA encodings, then a diffusion model generates HOA embeddings conditioned on FOA data for virtual upmixing.

Result: SIRUP significantly outperforms FOA systems in steering vector upmixing, source localization, and speech denoising tasks.

Conclusion: The diffusion-based approach effectively addresses the limitations of traditional physics-based methods for spatial audio upmixing from fewer-channel microphone arrays.

Abstract: This paper presents virtual upmixing of steering vectors captured by a fewer-channel spherical microphone array. This challenge has conventionally been addressed by recovering the directions and signals of sound sources from first-order ambisonics (FOA) data, and then rendering the higher-order ambisonics (HOA) data using a physics-based acoustic simulator. This approach, however, struggles to handle the mutual dependency between the spatial directivity of source estimation and the spatial resolution of FOA ambisonics data. Our method, named SIRUP, employs a latent diffusion model architecture. Specifically, a variational autoencoder (VAE) is used to learn a compact encoding of the HOA data in a latent space and a diffusion model is then trained to generate the HOA embeddings, conditioned by the FOA data. Experimental results showed that SIRUP achieved a significant improvement compared to FOA systems for steering vector upmixing, source localization, and speech denoising.

[322] Detection and Classification of Cetacean Echolocation Clicks using Image-based Object Detection Methods applied to Advanced Wavelet-based Transformations

Christopher Hauer

Main category: eess.AS

TL;DR: This thesis explores using wavelet transformations instead of spectrograms for deep learning-based detection of marine animal sounds (particularly killer whale clicks) to overcome limitations of traditional spectrogram-based methods in complex bioacoustic environments.

DetailsMotivation: Manual labeling of marine bioacoustic data is too time-consuming, and basic mathematical models struggle with complex scenarios like low signal-to-noise ratio or distinguishing clicks from echoes. While deep learning approaches like ANIMAL-SPOT exist, they rely on spectrograms which have inherent limitations due to the time-frequency uncertainty principle.

Method: The thesis proposes CLICK-SPOT, which uses wavelet transformations instead of spectrograms for feature extraction. Wavelets provide better time resolution for high frequencies and improved frequency resolution for low frequencies, potentially offering advantages for detecting complex bioacoustic signals like killer whale clicks in challenging underwater environments.

Result: The thesis demonstrates the efficacy of CLICK-SPOT on Norwegian killer whale underwater recordings provided by cetacean biologist Dr. Vester, showing improved detection capabilities compared to spectrogram-based methods.

Conclusion: Wavelet transformations offer a promising alternative to spectrograms for deep learning-based bioacoustic signal detection, particularly for complex marine environments where traditional methods struggle with time-frequency resolution tradeoffs.

Abstract: A challenge in marine bioacoustic analysis is the detection of animal signals, like calls, whistles and clicks, for behavioral studies. Manual labeling is too time-consuming to process sufficient data to get reasonable results. Thus, an automatic solution to overcome the time-consuming data analysis is necessary. Basic mathematical models can detect events in simple environments, but they struggle with complex scenarios, like differentiating signals with a low signal-to-noise ratio or distinguishing clicks from echoes. Deep Learning Neural Networks, such as ANIMAL-SPOT, are better suited for such tasks. DNNs process audio signals as image representations, often using spectrograms created by Short-Time Fourier Transform. However, spectrograms have limitations due to the uncertainty principle, which creates a tradeoff between time and frequency resolution. Alternatives like the wavelet, which provides better time resolution for high frequencies and improved frequency resolution for low frequencies, may offer advantages for feature extraction in complex bioacoustic environments. This thesis shows the efficacy of CLICK-SPOT on Norwegian Killer whale underwater recordings provided by the cetacean biologist Dr. Vester. Keywords: Bioacoustics, Deep Learning, Wavelet Transformation

[323] Rethinking Flow and Diffusion Bridge Models for Speech Enhancement

Dahan Wang, Jun Gao, Tong Lei, Yuxiang Hu, Changbao Zhu, Kai Chen, Jing Lu

Main category: eess.AS

TL;DR: Unified framework for flow matching and diffusion bridge models in speech enhancement, showing their equivalence to predictive models and proposing an enhanced bridge model with predictive elements.

DetailsMotivation: To unify existing flow and diffusion bridge models for speech enhancement, reveal their theoretical connection to conventional predictive models, and develop an improved generative framework that incorporates effective predictive elements.

Method: Proposes a unified framework interpreting flow/diffusion bridges as Gaussian probability paths between noisy and clean speech. Shows theoretical equivalence between generative sampling steps and predictive enhancement. Introduces enhanced bridge model combining probability path design with predictive elements like improved network architecture, tailored loss functions, and optimized training strategies.

Result: Outperforms existing flow and diffusion baselines on denoising and dereverberation tasks with fewer parameters and reduced computational complexity. Reveals inherent predictive nature imposes limitations on achievable upper-bound performance.

Conclusion: Provides unified understanding of generative speech enhancement models, demonstrates their predictive nature, and presents enhanced bridge model that effectively combines generative and predictive paradigms for improved performance.

Abstract: Flow matching and diffusion bridge models have emerged as leading paradigms in generative speech enhancement, modeling stochastic processes between paired noisy and clean speech signals based on principles such as flow matching, score matching, and Schrödinger bridge. In this paper, we present a framework that unifies existing flow and diffusion bridge models by interpreting them as constructions of Gaussian probability paths with varying means and variances between paired data. Furthermore, we investigate the underlying consistency between the training/inference procedures of these generative models and conventional predictive models. Our analysis reveals that each sampling step of a well-trained flow or diffusion bridge model optimized with a data prediction loss is theoretically analogous to executing predictive speech enhancement. Motivated by this insight, we introduce an enhanced bridge model that integrates an effective probability path design with key elements from predictive paradigms, including improved network architecture, tailored loss functions, and optimized training strategies. Experiments on denoising and dereverberation tasks demonstrate that the proposed method outperforms existing flow and diffusion baselines with fewer parameters and reduced computational complexity. The results also highlight that the inherently predictive nature of this generative framework imposes limitations on its achievable upper-bound performance.

[324] Binaural Unmasking in Practical Use: Perceived Level of Phase-inverted Speech in Environmental Noise

Rina Kotani, Chiaki Miyazaki, Shiro Suzuki

Main category: eess.AS

TL;DR: Binaural unmasking via phase reversal in one ear improves speech audibility in noisy environments by up to 6dB without increasing sound pressure or eliminating ambient noise.

DetailsMotivation: To develop technology that makes earphone/headphone sound easier to hear without increasing sound pressure or eliminating ambient noise, focusing on practical applications.

Method: Conducted experiments using speech sounds from various speakers and real-world noises (urban environmental sounds, cheers) to evaluate binaural unmasking through phase reversal in one ear under practical conditions.

Result: Speech in noisy environments perceived up to ~6dB louder with phase reversal; all speakers and noises showed significant improvement (≥5dB audibility enhancement).

Conclusion: Binaural unmasking via interaural phase differences is effective in practical scenarios for improving speech audibility in noise.

Abstract: We aim to develop a technology that makes the sound from earphones and headphones easier to hear without increasing the sound pressure or eliminating ambient noise. To this end, we focus on harnessing the phenomenon of binaural unmasking through phase reversal in one ear. Specifically, we conduct experiments to evaluate the improvement of audibility caused by the phenomenon, using conditions that approximate practical scenarios. We use speech sounds by various speakers and noises that can be encountered in daily life (urban environmental sounds, cheers) to verify the effects of binaural unmasking under conditions close to practical situations. The results of experiments using the Japanese language showed that (i) speech in a noisy environment is perceived to be up to about 6 dB louder with phase reversal in one ear, and (ii) a certain effect (improvement of audibility by 5 dB or more) is obtained for all speakers and noises targeted in this study. These findings demonstrate the effectiveness of binaural unmasking attributed to interaural phase differences in practical scenarios.

[325] LongAudio-RAG: Event-Grounded Question Answering over Multi-Hour Long Audio

Naveen Vakada, Kartik Hegde, Arvind Krishna Sridhar, Yinyi Guo, Erik Visser

Main category: eess.AS

TL;DR: LA-RAG is a hybrid framework for long-audio question answering that grounds LLM outputs in retrieved acoustic event detections stored in SQL, enabling precise temporal grounding with minimal hallucination.

DetailsMotivation: Reviewing multi-hour audio recordings is impractical, motivating systems that can answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models struggle with long-audio QA due to context-length limitations.

Method: Hybrid framework that converts multi-hour audio streams into structured event records stored in SQL database. At inference: resolves time references, classifies intent, retrieves relevant events, and generates answers using constrained evidence. Deployed in edge-cloud environment with on-device audio grounding and cloud-based LLM.

Result: Structured event-level retrieval significantly improves accuracy compared to vanilla RAG or text-to-SQL approaches. Synthetic benchmark shows effectiveness for detection, counting, and summarization tasks.

Conclusion: LA-RAG provides practical solution for long-audio QA by combining structured event retrieval with LLM reasoning, enabling precise temporal grounding while overcoming context-length limitations.

Abstract: Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model (LLM) outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database, and at inference time the system resolves natural-language time references, classifies intent, retrieves only the relevant events, and generates answers using this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by concatenating recordings with preserved timestamps and generating template-based question-answer pairs for detection, counting, and summarization tasks. Finally, we demonstrate the practicality of our approach by deploying it in a hybrid edge-cloud environment, where the audio grounding model runs on-device on IoT-class hardware while the LLM is hosted on a GPU-backed server. This architecture enables low-latency event extraction at the edge and high-quality language reasoning in the cloud. Experiments show that structured, event-level retrieval significantly improves accuracy compared to vanilla Retrieval-Augmented Generation (RAG) or text-to-SQL approaches.

eess.IV

[326] Deep Learning for Dermatology: An Innovative Framework for Approaching Precise Skin Cancer Detection

Mohammad Tahmid Noor, B. M. Shahria Alam, Tasmiah Rahman Orpa, Shaila Afroz Anika, Mahjabin Tasnim Samiha, Fahad Ahammed

Main category: eess.IV

TL;DR: Comparison of VGG16 and DenseNet201 CNN architectures for binary classification of benign vs malignant skin lesions, achieving 93.79% accuracy with DenseNet201.

DetailsMotivation: Skin cancer is life-threatening if not diagnosed early, and deep learning models can assist in early detection and diagnosis by differentiating benign from malignant skin lesions.

Method: Used two CNN architectures (VGG16 and DenseNet201) on a binary class dataset of 3297 skin lesion images resized to 224x224, comparing their accuracy and computational efficiency.

Result: DenseNet201 achieved the best accuracy of 93.79% in classifying benign vs malignant skin lesions, outperforming VGG16.

Conclusion: Both models provide excellent accuracy for skin cancer detection, but there’s room for improvement. Future work will focus on using new datasets to achieve better accuracy.

Abstract: Skin cancer can be life-threatening if not diagnosed early, a prevalent yet preventable disease. Globally, skin cancer is perceived among the finest prevailing cancers and millions of people are diagnosed each year. For the allotment of benign and malignant skin spots, an area of critical importance in dermatological diagnostics, the application of two prominent deep learning models, VGG16 and DenseNet201 are investigated by this paper. We evaluate these CNN architectures for their efficacy in differentiating benign from malignant skin lesions leveraging enhancements in deep learning enforced to skin cancer spotting. Our objective is to assess model accuracy and computational efficiency, offering insights into how these models could assist in early detection, diagnosis, and streamlined workflows in dermatology. We used two deep learning methods DenseNet201 and VGG16 model on a binary class dataset containing 3297 images. The best result with an accuracy of 93.79% achieved by DenseNet201. All images were resized to 224x224 by rescaling. Although both models provide excellent accuracy, there is still some room for improvement. In future using new datasets, we tend to improve our work by achieving great accuracy.

[327] Promptable segmentation with region exploration enables minimal-effort expert-level prostate cancer delineation

Junqing Yang, Natasha Thorley, Ahmed Nadeem Abbasi, Shonit Punwani, Zion Tse, Yipeng Hu, Shaheer U. Saeed

Main category: eess.IV

TL;DR: A reinforcement learning framework for prostate cancer segmentation on MR images that uses user point prompts to guide region-growing segmentation, reducing annotation effort while achieving performance comparable to manual radiologist segmentation.

DetailsMotivation: Accurate prostate cancer segmentation on MR images is crucial for clinical interventions but challenging due to subtle tumor appearances, imaging protocol variations, and limited expert availability. Automated methods require large annotated datasets that are often inconsistent, while manual delineation is labor-intensive.

Method: Combines reinforcement learning with region-growing segmentation guided by user point prompts. Starting from an initial point, region-growing generates preliminary segmentation refined iteratively by RL. The RL agent observes image and current segmentation to predict new points, with rewards balancing accuracy and uncertainty to explore ambiguous regions.

Result: Outperformed previous best automated methods by 9.9% and 8.9% on two public prostate MR datasets (PROMIS and PICAI), with performance comparable to manual radiologist segmentation and reducing annotation time tenfold.

Conclusion: The framework bridges manual and automated segmentation by substantially reducing user effort while outperforming fully automated methods, offering a practical solution for clinical prostate cancer segmentation.

Abstract: Purpose: Accurate segmentation of prostate cancer on magnetic resonance (MR) images is crucial for planning image-guided interventions such as targeted biopsies, cryoablation, and radiotherapy. However, subtle and variable tumour appearances, differences in imaging protocols, and limited expert availability make consistent interpretation difficult. While automated methods aim to address this, they rely on large expertly-annotated datasets that are often inconsistent, whereas manual delineation remains labour-intensive. This work aims to bridge the gap between automated and manual segmentation through a framework driven by user-provided point prompts, enabling accurate segmentation with minimal annotation effort. Methods: The framework combines reinforcement learning (RL) with a region-growing segmentation process guided by user prompts. Starting from an initial point prompt, region-growing generates a preliminary segmentation, which is iteratively refined through RL. At each step, the RL agent observes the image and current segmentation to predict a new point, from which region growing updates the mask. A reward, balancing segmentation accuracy and voxel-wise uncertainty, encourages exploration of ambiguous regions, allowing the agent to escape local optima and perform sample-specific optimisation. Despite requiring fully supervised training, the framework bridges manual and fully automated segmentation at inference by substantially reducing user effort while outperforming current fully automated methods. Results: The framework was evaluated on two public prostate MR datasets (PROMIS and PICAI, with 566 and 1090 cases). It outperformed the previous best automated methods by 9.9% and 8.9%, respectively, with performance comparable to manual radiologist segmentation, reducing annotation time tenfold.

[328] TopoGate: Quality-Aware Topology-Stabilized Gated Fusion for Longitudinal Low-Dose CT New-Lesion Prediction

Seungik Cho

Main category: eess.IV

TL;DR: TopoGate: A lightweight model for longitudinal low-dose CT lesion detection that combines follow-up appearance and subtraction views with a quality-aware gate driven by CT appearance quality, registration consistency, and anatomical topology stability.

DetailsMotivation: Longitudinal low-dose CT follow-ups suffer from variations in noise, reconstruction kernels, and registration quality, which destabilize subtraction images and cause false new lesion alarms. Current methods lack adaptive quality-aware mechanisms to handle these variations.

Method: TopoGate combines follow-up appearance view with subtraction view using a learned quality-aware gate. The gate is driven by three case-specific signals: CT appearance quality, registration consistency, and stability of anatomical topology measured with topological metrics.

Result: On the NLST-New-Lesion-LongCT cohort (152 pairs from 122 patients), TopoGate achieves AUC of 0.65±0.05 and Brier score of 0.14. Removing corrupted/low-quality pairs identified by quality scores increases AUC from 0.62 to 0.68 and reduces Brier score from 0.14 to 0.12.

Conclusion: TopoGate improves discrimination and calibration over single-view baselines, responds predictably to degradation (placing more weight on appearance when noise grows), and provides a simple, interpretable, practical approach for reliable longitudinal LDCT triage.

Abstract: Longitudinal low-dose CT follow-ups vary in noise, reconstruction kernels, and registration quality. These differences destabilize subtraction images and can trigger false new lesion alarms. We present TopoGate, a lightweight model that combines the follow-up appearance view with the subtraction view and controls their influence through a learned, quality-aware gate. The gate is driven by three case-specific signals: CT appearance quality, registration consistency, and stability of anatomical topology measured with topological metrics. On the NLST–New-Lesion–LongCT cohort comprising 152 pairs from 122 patients, TopoGate improves discrimination and calibration over single-view baselines, achieving an area under the ROC curve of 0.65 with a standard deviation of 0.05 and a Brier score of 0.14. Removing corrupted or low-quality pairs, identified by the quality scores, further increases the area under the ROC curve from 0.62 to 0.68 and reduces the Brier score from 0.14 to 0.12. The gate responds predictably to degradation, placing more weight on appearance when noise grows, which mirrors radiologist practice. The approach is simple, interpretable, and practical for reliable longitudinal LDCT triage.

[329] MeDUET: Disentangled Unified Pretraining for 3D Medical Image Synthesis and Analysis

Junkai Liu, Ling Shao, Le Zhang

Main category: eess.IV

TL;DR: MeDUET is a unified pretraining framework for 3D medical images that disentangles domain-invariant content from domain-specific style using VAE latent space, enabling both synthesis and analysis tasks.

DetailsMotivation: Current SSL and diffusion models in 3D medical imaging remain separate (diffusion for synthesis, SSL for analysis). Multi-center datasets have dominant style shifts while downstream tasks rely on anatomy, making unified pretraining challenging without explicit disentanglement constraints.

Method: Proposes MeDUET framework performing SSL in VAE latent space with explicit content-style disentanglement. Uses token demixing mechanism and two novel proxy tasks: Mixed-Factor Token Distillation (MFTD) and Swap-invariance Quadruplet Contrast (SiQC) to enhance disentanglement.

Result: MeDUET delivers higher fidelity, faster convergence, improved controllability for synthesis, and demonstrates strong domain generalization and notable label efficiency for analysis across diverse medical benchmarks.

Conclusion: MeDUET converts multi-source heterogeneity from an obstacle into a learning signal, enabling unified pretraining for 3D medical image synthesis and analysis.

Abstract: Self-supervised learning (SSL) and diffusion models have advanced representation learning and image synthesis. However, in 3D medical imaging, they remain separate: diffusion for synthesis, SSL for analysis. Unifying 3D medical image synthesis and analysis is intuitive yet challenging, as multi-center datasets exhibit dominant style shifts, while downstream tasks rely on anatomy, and site-specific style co-varies with anatomy across slices, making factors unreliable without explicit constraints. In this paper, we propose MeDUET, a 3D Medical image Disentangled UnifiEd PreTraining framework that performs SSL in the Variational Autoencoder (VAE) latent space which explicitly disentangles domain-invariant content from domain-specific style. The token demixing mechanism serves to turn disentanglement from a modeling assumption into an empirically identifiable property. Two novel proxy tasks, Mixed-Factor Token Distillation (MFTD) and Swap-invariance Quadruplet Contrast (SiQC), are devised to synergistically enhance disentanglement. Once pretrained, MeDUET is capable of (i) delivering higher fidelity, faster convergence, and improved controllability for synthesis, and (ii) demonstrating strong domain generalization and notable label efficiency for analysis across diverse medical benchmarks. In summary, MeDUET converts multi-source heterogeneity from an obstacle into a learning signal, enabling unified pretraining for 3D medical image synthesis and analysis. The code is available at https://github.com/JK-Liu7/MeDUET .

[330] From Global Radiomics to Parametric Maps: A Unified Workflow Fusing Radiomics and Deep Learning for PDAC Detection

Zengtian Deng, Yimeng He, Yu Shi, Lixia Wang, Touseef Ahmad Qureshi, Xiuzhen Huang, Debiao Li

Main category: eess.IV

TL;DR: A unified framework combining radiomics and deep learning for pancreatic cancer detection, using both global features and spatial parametric maps to enhance nnUNet performance.

DetailsMotivation: Existing fusion approaches between radiomics and deep learning typically use only global radiomic features, missing the complementary value of spatially resolved radiomic parametric maps for medical imaging analysis.

Method: Proposes a unified framework that first selects discriminative radiomic features, then injects them into a radiomics-enhanced nnUNet at both global and voxel levels for pancreatic ductal adenocarcinoma detection.

Result: Achieved AUC = 0.96 and AP = 0.84 on PANORAMA dataset in cross-validation, and AUC = 0.95 and AP = 0.78 on external cohort, outperforming baseline nnUNet and ranking second in PANORAMA Grand Challenge.

Conclusion: Handcrafted radiomics provide complementary signals to deep learning models when injected at both global and voxel levels, demonstrating effective fusion for medical imaging tasks.

Abstract: Radiomics and deep learning both offer powerful tools for quantitative medical imaging, but most existing fusion approaches only leverage global radiomic features and overlook the complementary value of spatially resolved radiomic parametric maps. We propose a unified framework that first selects discriminative radiomic features and then injects them into a radiomics-enhanced nnUNet at both the global and voxel levels for pancreatic ductal adenocarcinoma (PDAC) detection. On the PANORAMA dataset, our method achieved AUC = 0.96 and AP = 0.84 in cross-validation. On an external in-house cohort, it achieved AUC = 0.95 and AP = 0.78, outperforming the baseline nnUNet; it also ranked second in the PANORAMA Grand Challenge. This demonstrates that handcrafted radiomics, when injected at both global and voxel levels, provide complementary signals to deep learning models for PDAC detection. Our code can be found at https://github.com/briandzt/dl-pdac-radiomics-global-n-paramaps

[331] RamanSeg: Interpretability-driven Deep Learning on Raman Spectra for Cancer Diagnosis

Chris Tomy, Mo Vali, David Pertzborn, Tammam Alamatouri, Anna Mühlig, Orlando Guntinas-Lichius, Anna Xylander, Eric Michele Fantuzzi, Matteo Negro, Francesco Crisafi, Pietro Lio, Tiago Azevedo

Main category: eess.IV

TL;DR: RamanSeg: A novel interpretable prototype-based architecture for tumor segmentation from Raman spectroscopy data, offering trade-offs between interpretability and performance.

DetailsMotivation: To develop an alternative to time-consuming histopathology for cancer diagnosis using stain-free Raman spectroscopy, with an emphasis on interpretable models rather than black-box approaches.

Method: Two approaches: 1) nnU-Net segmentation on aligned Raman spectra with tumor annotations, and 2) RamanSeg - a novel prototype-based architecture with two variants (prototype projection and projection-free) that classifies pixels based on discovered training set regions.

Result: nnU-Net achieved 80.9% mean foreground Dice score, surpassing previous work. Projection-free RamanSeg outperformed U-Net baseline with 67.3% Dice score, offering meaningful improvement over black-box approaches.

Conclusion: RamanSeg provides an interpretable alternative for Raman spectroscopy-based tumor segmentation, balancing performance and explainability, which is crucial for medical applications.

Abstract: Histopathology, the current gold standard for cancer diagnosis, involves the manual examination of tissue samples after chemical staining, a time-consuming process requiring expert analysis. Raman spectroscopy is an alternative, stain-free method of extracting information from samples. Using nnU-Net, we trained a segmentation model on a novel dataset of spatial Raman spectra aligned with tumour annotations, achieving a mean foreground Dice score of 80.9%, surpassing previous work. Furthermore, we propose a novel, interpretable, prototype-based architecture called RamanSeg. RamanSeg classifies pixels based on discovered regions of the training set, generating a segmentation mask. Two variants of RamanSeg allow a trade-off between interpretability and performance: one with prototype projection and another projection-free version. The projection-free RamanSeg outperformed a U-Net baseline with a mean foreground Dice score of 67.3%, offering a meaningful improvement over a black-box training approach.

[332] Exploiting Completeness Perception with Diffusion Transformer for Unified 3D MRI Synthesis

Junkai Liu, Nay Aung, Theodoros N. Arvanitis, Joao A. C. Lima, Steffen E. Petersen, Daniel C. Alexander, Le Zhang

Main category: eess.IV

TL;DR: CoPeDiT: A latent diffusion model with completeness perception for unified 3D MRI synthesis that can handle missing data problems without external guidance.

DetailsMotivation: Existing methods for missing MRI data require external guidance (manual indicators/masks) which are often unavailable or unreliable in clinical practice. These explicit masks also lack sufficient semantic information to guide synthesis effectively.

Method: Proposes CoPeDiT with two key components: 1) CoPeVAE tokenizer with pretext tasks to learn completeness-aware discriminative prompts, and 2) MDiT3D diffusion transformer architecture that uses learned prompts to enhance semantic consistency in 3D MRI synthesis.

Result: Comprehensive evaluations on three large-scale MRI datasets show CoPeDiT significantly outperforms state-of-the-art methods in robustness, generalizability, and flexibility for handling missing data problems.

Conclusion: CoPeDiT enables generative models to infer missing states in a self-perceptive manner, better capturing anatomical and pathological variations without relying on external guidance, making it suitable for unpredictable clinical environments.

Abstract: Missing data problems, such as missing modalities in multi-modal brain MRI and missing slices in cardiac MRI, pose significant challenges in clinical practice. Existing methods rely on external guidance to supply detailed missing state for instructing generative models to synthesize missing MRIs. However, manual indicators are not always available or reliable in real-world scenarios due to the unpredictable nature of clinical environments. Moreover, these explicit masks are not informative enough to provide guidance for improving semantic consistency. In this work, we argue that generative models should infer and recognize missing states in a self-perceptive manner, enabling them to better capture subtle anatomical and pathological variations. Towards this goal, we propose CoPeDiT, a general-purpose latent diffusion model equipped with completeness perception for unified synthesis of 3D MRIs. Specifically, we incorporate dedicated pretext tasks into our tokenizer, CoPeVAE, empowering it to learn completeness-aware discriminative prompts, and design MDiT3D, a specialized diffusion transformer architecture for 3D MRI synthesis, that effectively uses the learned prompts as guidance to enhance semantic consistency in 3D space. Comprehensive evaluations on three large-scale MRI datasets demonstrate that CoPeDiT significantly outperforms state-of-the-art methods, achieving superior robustness, generalizability, and flexibility. The code is available at https://github.com/JK-Liu7/CoPeDiT .

[333] Landmark Detection for Medical Images using a General-purpose Segmentation Model

Ekaterina Stansfield, Jennifer A. Mitterer, Abdulrahman Altahhan

Main category: eess.IV

TL;DR: A hybrid YOLO-SAM pipeline for segmenting anatomical landmarks in orthopaedic pelvic radiographs, where YOLO provides bounding box prompts for SAM to perform fine-grained segmentation of complex anatomical structures.

DetailsMotivation: Existing foundational segmentation models like SAM and MedSAM lack the fine-grained precision needed for orthopaedic pelvic landmark detection in medical imaging, requiring specific prompts that these models aren't trained to recognize.

Method: Combines YOLO for object detection to generate bounding boxes as input prompts for SAM, which then performs segmentation. The hybrid model was trained on orthopaedic pelvic radiographs to segment both landmarks (8-72 landmarks) and complex regions like femoral cortical bone.

Result: The YOLO-SAM combination yields excellent performance in detecting anatomical landmarks and intricate outlines in orthopaedic pelvic radiographs, successfully segmenting both small landmarks and complex anatomical regions.

Conclusion: The proposed hybrid approach effectively addresses the limitations of standalone foundational models for medical landmark segmentation by leveraging YOLO’s detection capabilities to guide SAM’s segmentation precision.

Abstract: Radiographic images are a cornerstone of medical diagnostics in orthopaedics, with anatomical landmark detection serving as a crucial intermediate step for information extraction. General-purpose foundational segmentation models, such as SAM (Segment Anything Model), do not support landmark segmentation out of the box and require prompts to function. However, in medical imaging, the prompts for landmarks are highly specific. Since SAM has not been trained to recognize such landmarks, it cannot generate accurate landmark segmentations for diagnostic purposes. Even MedSAM, a medically adapted variant of SAM, has been trained to identify larger anatomical structures, such as organs and their parts, and lacks the fine-grained precision required for orthopaedic pelvic landmarks. To address this limitation, we propose leveraging another general-purpose, non-foundational model: YOLO. YOLO excels in object detection and can provide bounding boxes that serve as input prompts for SAM. While YOLO is efficient at detection, it is significantly outperformed by SAM in segmenting complex structures. In combination, these two models form a reliable pipeline capable of segmenting not only a small pilot set of eight anatomical landmarks but also an expanded set of 72 landmarks and 16 regions with complex outlines, such as the femoral cortical bone and the pelvic inlet. By using YOLO-generated bounding boxes to guide SAM, we trained the hybrid model to accurately segment orthopaedic pelvic radiographs. Our results show that the proposed combination of YOLO and SAM yields excellent performance in detecting anatomical landmarks and intricate outlines in orthopaedic pelvic radiographs.

[334] A Novel Method to Determine Total Oxidant Concentration Produced by Non-Thermal Plasma Based on Image Processing and Machine Learning

Mirkan Emir Sancak, Unal Sen, Ulker Diler Keris-Sen

Main category: eess.IV

TL;DR: Computer vision and machine learning system for quantifying oxidant concentration in plasma-treated aqueous solutions using colorimetric analysis of potassium iodide solutions.

DetailsMotivation: Accurate determination of total oxidant concentration in nonthermal plasma treated aqueous systems is challenging due to transient reactive species and subjectivity of conventional titration methods.

Method: Developed a color-based computer analysis method integrating image processing with machine learning. Used custom visual acquisition system to record color transitions in potassium iodide solutions during plasma treatment, extracted RGB, HSV, and Lab color features, and trained multiple ML models (linear regression, ridge regression, random forest, gradient boosting, neural networks).

Result: Strong linear relationships between color features and oxidant concentrations, particularly for HSV saturation, Lab a/b channels, and blue RGB component. Linear regression and gradient boosting achieved R² > 0.99. System predicts total oxidant concentration with R² > 0.998 even with reduced features.

Conclusion: The proposed computer vision and machine learning system provides highly accurate, objective quantification of oxidant concentration in plasma-treated aqueous systems, overcoming limitations of conventional titration methods.

Abstract: Accurate determination of total oxidant concentration [Ox]tot in nonthermal plasma treated aqueous systems remains a critical challenge due to the transient nature of reactive oxygen and nitrogen species and the subjectivity of conventional titration methods used for [Ox]tot determination. This study introduces a color based computer analysis method that integrates advanced image processing with machine learning to quantify colorimetric changes in potassium iodide solutions during oxidation. A custom built visual acquisition system recorded high resolution video of the color transitions occurring during plasma treatment while the change in oxidant concentration was simultaneously monitored using a standard titrimetric method. Extracted image frames were processed through a structured pipeline to obtain RGB, HSV, and Lab color features. Statistical analysis revealed strong linear relationships between selected color features and measured oxidant concentrations, particularly for HSV saturation, Lab a and b channels, and the blue component of RGB. These features were subsequently used to train and validate multiple machine learning models including linear regression, ridge regression, random forest, gradient boosting, and neural networks. Linear regression and gradient boosting demonstrated the highest predictive accuracy with R2 values exceeding 0.99. Dimensionality reduction from nine features to smaller feature subsets preserved predictive performance while improving computational efficiency. Comparison with experimental titration measurements showed that the proposed system predicts total oxidant concentration in potassium iodide solution with very high accuracy, achieving R2 values above 0.998 even under reduced feature conditions.

[335] Physics-Informed Neural Networks vs. Physics Models for Non-Invasive Glucose Monitoring: A Comparative Study Under Noise-Stressed Synthetic Conditions

Riyaadh Gani

Main category: eess.IV

TL;DR: Physics-engineered Beer-Lambert model outperforms PINNs and DNNs for non-invasive glucose monitoring under low-SNR NIR conditions

DetailsMotivation: Non-invasive glucose monitoring faces low signal-to-noise ratio challenges due to hardware drift, environmental variation, and physiological factors that suppress glucose signatures in NIR signals

Method: Developed noise-stressed NIR simulator injecting various noise sources, then benchmarked six methods including physics-engineered Beer-Lambert, PINN variants, and shallow DNN

Result: Physics-engineered Beer-Lambert model achieved lowest error (13.6 mg/dL RMSE) with only 56 parameters and 0.01 ms inference, outperforming deeper PINNs and DNN baseline

Conclusion: Carefully engineered physics features can outperform higher-capacity models in low-SNR regimes, reframing the task as noise suppression under weak signal conditions

Abstract: Non-invasive glucose monitoring outside controlled settings is dominated by low signal-to-noise ratio (SNR): hardware drift, environmental variation, and physiology suppress the glucose signature in NIR signals. We present a noise-stressed NIR simulator that injects 12-bit ADC quantisation, LED drift, photodiode dark noise, temperature/humidity variation, contact-pressure noise, Fitzpatrick I-VI melanin, and glucose variability to create a low-correlation regime (rho_glucose-NIR = 0.21). Using this platform, we benchmark six methods: Enhanced Beer-Lambert (physics-engineered ridge regression), Original PINN, Optimised PINN, RTE-inspired PINN, Selective RTE PINN, and a shallow DNN. The physics-engineered Beer Lambert model achieves the lowest error (13.6 mg/dL RMSE) with only 56 parameters and 0.01 ms inference, outperforming deeper PINNs and the SDNN baseline under low-SNR conditions. The study reframes the task as noise suppression under weak signal and shows that carefully engineered physics features can outperform higher-capacity models in this regime.

[336] Smartphone-based iris recognition through high-quality visible-spectrum iris image capture.V2

Naveenkumar G Venkataswamy, Yu Liu, Soumyabrata Dey, Stephanie Schuckers, Masudul H Imtiaz

Main category: eess.IV

TL;DR: A smartphone-based visible spectrum iris recognition system using ISO-compliant acquisition, lightweight MobileNetV3 segmentation, and transformer matching achieves high accuracy on commodity devices.

DetailsMotivation: Visible spectrum iris recognition on smartphones faces challenges due to illumination variability, pigmentation differences, and lack of standardized capture controls, making accurate recognition difficult on commodity devices.

Method: Developed an end-to-end pipeline with: 1) Android app for real-time framing, sharpness evaluation, and ISO compliance feedback; 2) LightIrisNet (MobileNetV3-based multi-task segmentation) for on-device processing; 3) IrisFormer (transformer matcher) adapted to VIS domain; 4) CUVIRIS dataset of 752 compliant images from 47 subjects.

Result: OSIRIS achieved 97.9% TAR at FAR=0.01 (EER=0.76%), IrisFormer trained only on UBIRIS.v2 achieved 0.057% EER on CUVIRIS, demonstrating high accuracy for visible spectrum iris recognition on smartphones.

Conclusion: Standardized capture protocols and VIS-adapted lightweight models enable accurate and practical iris recognition on smartphones, with released app, models, and dataset supporting reproducibility.

Abstract: Smartphone-based iris recognition in the visible spectrum (VIS) remains difficult due to illumination variability, pigmentation differences, and the absence of standardized capture controls. This work presents a compact end-to-end pipeline that enforces ISO/IEC 29794-6 quality compliance at acquisition and demonstrates that accurate VIS iris recognition is feasible on commodity devices. Using a custom Android application performing real-time framing, sharpness evaluation, and feedback, we introduce the CUVIRIS dataset of 752 compliant images from 47 subjects. A lightweight MobileNetV3-based multi-task segmentation network (LightIrisNet) is developed for efficient on-device processing, and a transformer matcher (IrisFormer) is adapted to the VIS domain. Under a standardized protocol and comparative benchmarking against prior CNN baselines, OSIRIS attains a TAR of 97.9% at FAR=0.01 (EER=0.76%), while IrisFormer, trained only on UBIRIS.v2, achieves an EER of 0.057% on CUVIRIS. The acquisition app, trained models, and a public subset of the dataset are released to support reproducibility. These results confirm that standardized capture and VIS-adapted lightweight models enable accurate and practical iris recognition on smartphones.

[337] Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion

Xiaojian Ding, Lin Zhao, Xian Li, Xiaoying Zhu

Main category: eess.IV

TL;DR: HSACC is a novel incomplete multi-view clustering framework that uses hierarchical semantic alignment and cooperative completion to handle missing views through dual-level semantic spaces and dynamic view weighting.

DetailsMotivation: Existing deep incomplete multi-view clustering methods suffer from static fusion strategies and two-stage pipelines, leading to suboptimal fusion results and error propagation when dealing with samples where entire views are missing.

Method: HSACC employs a dual-level semantic space design: low-level space ensures consistency alignment via mutual information maximization across views; high-level space uses adaptive view weights based on distributional affinity for weighted fusion. It implicitly recovers missing views by projecting aligned latent representations into high-dimensional spaces and jointly optimizes reconstruction and clustering objectives.

Result: HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies validate the effectiveness of hierarchical alignment and dynamic weighting, while parameter analysis confirms robustness to hyperparameter variations.

Conclusion: HSACC provides an effective framework for incomplete multi-view clustering through hierarchical semantic alignment and cooperative completion, addressing limitations of existing approaches with dynamic fusion and joint optimization.

Abstract: Incomplete multi-view data, where certain views are entirely missing for some samples, poses significant challenges for traditional multi-view clustering methods. Existing deep incomplete multi-view clustering approaches often rely on static fusion strategies or two-stage pipelines, leading to suboptimal fusion results and error propagation issues. To address these limitations, this paper proposes a novel incomplete multi-view clustering framework based on Hierarchical Semantic Alignment and Cooperative Completion (HSACC). HSACC achieves robust cross-view fusion through a dual-level semantic space design. In the low-level semantic space, consistency alignment is ensured by maximizing mutual information across views. In the high-level semantic space, adaptive view weights are dynamically assigned based on the distributional affinity between individual views and an initial fused representation, followed by weighted fusion to generate a unified global representation. Additionally, HSACC implicitly recovers missing views by projecting aligned latent representations into high-dimensional semantic spaces and jointly optimizes reconstruction and clustering objectives, enabling cooperative learning of completion and clustering. Experimental results demonstrate that HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies validate the effectiveness of the hierarchical alignment and dynamic weighting mechanisms, while parameter analysis confirms the model’s robustness to hyperparameter variations. The code is available at https://github.com/XiaojianDing/2025-NeurIPS-HSACC.

[338] Context-Aware Asymmetric Ensembling for Interpretable Retinopathy of Prematurity Screening via Active Query and Vascular Attention

Md. Mehedi Hassan, Taufiq Hasan

Main category: eess.IV

TL;DR: CAA Ensemble model for ROP screening using clinical context-aware asymmetric streams for structure and vascular analysis, achieving SOTA performance with interpretable attention mechanisms.

DetailsMotivation: Address challenges in automated ROP screening due to limited data, complex conditions requiring both structural staging and microvascular analysis, and poor generalization of current deep learning models on small imbalanced datasets.

Method: Two specialized streams: MS-AQNet for structure analysis using clinical contexts as dynamic query vectors to localize fibrovascular ridge, and VascuMIL encoding Vascular Topology Maps with gated MIL for vascular tortuosity detection, ensembled by meta-learner.

Result: Achieved Macro F1-Score of 0.93 for Broad ROP staging and AUC of 0.996 for Plus Disease detection on imbalanced cohort of 188 infants (6,004 images), with interpretable attention heatmaps and vascular threat maps.

Conclusion: Clinical metadata can guide visual search in medical AI, architectural inductive bias can bridge medical AI data gaps, and the framework provides transparent ‘Glass Box’ interpretability for clinical adoption.

Abstract: Retinopathy of Prematurity (ROP) is among the major causes of preventable childhood blindness. Automated screening remains challenging, primarily due to limited data availability and the complex condition involving both structural staging and microvascular abnormalities. Current deep learning models depend heavily on large private datasets and passive multimodal fusion, which commonly fail to generalize on small, imbalanced public cohorts. We thus propose the Context-Aware Asymmetric Ensemble Model (CAA Ensemble) that simulates clinical reasoning through two specialized streams. First, the Multi-Scale Active Query Network (MS-AQNet) serves as a structure specialist, utilizing clinical contexts as dynamic query vectors to spatially control visual feature extraction for localization of the fibrovascular ridge. Secondly, VascuMIL encodes Vascular Topology Maps (VMAP) within a gated Multiple Instance Learning (MIL) network to precisely identify vascular tortuosity. A synergistic meta-learner ensembles these orthogonal signals to resolve diagnostic discordance across multiple objectives. Tested on a highly imbalanced cohort of 188 infants (6,004 images), the framework attained State-of-the-Art performance on two distinct clinical tasks: achieving a Macro F1-Score of 0.93 for Broad ROP staging and an AUC of 0.996 for Plus Disease detection. Crucially, the system features `Glass Box’ transparency through counterfactual attention heatmaps and vascular threat maps, proving that clinical metadata dictates the model’s visual search. Additionally, this study demonstrates that architectural inductive bias can serve as an effective bridge for the medical AI data gap.

Last updated: 2026-03-06
Built with Hugo, theme modified on Stack