Daily arXiv Papers - 2025-10-06

AI-enhanced summaries of 25 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] AMANDA: Agentic Medical Knowledge Augmentation for Data-Efficient Medical Visual Question Answering

Ziqing Wang, Chengsheng Mao, Xiaole Wen, Yuan Luo, Kaize Ding

Main category: cs.CL

TL;DR: AMANDA is a training-free agentic framework that enhances medical multimodal LLMs by addressing intrinsic and extrinsic reasoning bottlenecks through medical knowledge augmentation using LLM agents.

Motivation: Existing Med-MLLMs fail in low-resource settings due to two reasoning bottlenecks: ignoring medical image details and failing to incorporate specialized medical knowledge.

Method: Uses LLM agents for medical knowledge augmentation - intrinsic augmentation via coarse-to-fine question decomposition, and extrinsic augmentation via biomedical knowledge graph retrieval.

Result: Substantial improvements across eight Med-VQA benchmarks in both zero-shot and few-shot settings.

Conclusion: AMANDA effectively addresses reasoning bottlenecks in Med-MLLMs through training-free knowledge augmentation, demonstrating strong performance in medical visual question answering.

Abstract: Medical Multimodal Large Language Models (Med-MLLMs) have shown great promise in medical visual question answering (Med-VQA). However, when deployed in low-resource settings where abundant labeled data are unavailable, existing Med-MLLMs commonly fail due to their medical reasoning capability bottlenecks: (i) the intrinsic reasoning bottleneck that ignores the details from the medical image; (ii) the extrinsic reasoning bottleneck that fails to incorporate specialized medical knowledge. To address those limitations, we propose AMANDA, a training-free agentic framework that performs medical knowledge augmentation via LLM agents. Specifically, our intrinsic medical knowledge augmentation focuses on coarse-to-fine question decomposition for comprehensive diagnosis, while extrinsic medical knowledge augmentation grounds the reasoning process via biomedical knowledge graph retrieval. Extensive experiments across eight Med-VQA benchmarks demonstrate substantial improvements in both zero-shot and few-shot Med-VQA settings. The code is available at https://github.com/REAL-Lab-NU/AMANDA.

[2] CLARITY: Clinical Assistant for Routing, Inference, and Triage

Vladimir Shaposhnikov, Aleksandr Nesterov, Ilia Kopanichuk, Ivan Bakulin, Egor Zhelvakov, Ruslan Abramov, Ekaterina Tsapieva, Dmitry V. Dylov, Ivan Oseledets

Main category: cs.CL

TL;DR: CLARITY is an AI clinical assistant platform that combines FSM-structured dialogue with LLM-powered agents for patient routing, consultations, and severity assessment, achieving human-level performance with faster consultation times.

Motivation: To facilitate efficient patient-to-specialist routing, clinical consultations, and severity assessment in healthcare settings through AI assistance.

Method: Hybrid architecture combining Finite State Machine for structured dialogue flows with collaborative agents using Large Language Models to analyze symptoms and prioritize referrals, built on modular microservices framework.

Result: Successfully integrated into a national inter-hospital IT platform with 55,000+ user dialogues in 2 months; validation on 2,500 expert-annotated dialogues showed that CLARITY surpasses human-level first-attempt routing precision while requiring up to 3x shorter consultations.

Conclusion: CLARITY demonstrates effective AI-driven clinical assistance with superior routing accuracy and efficiency compared to human performance, scalable for healthcare workflows.

Abstract: We present CLARITY (Clinical Assistant for Routing, Inference, and Triage), an AI-driven platform designed to facilitate patient-to-specialist routing, clinical consultations, and severity assessment of patients’ conditions. Its hybrid architecture combines a Finite State Machine (FSM) for structured dialogue flows with collaborative agents that employ Large Language Model (LLM) to analyze symptoms and prioritize referrals to appropriate specialists. Built on a modular microservices framework, CLARITY ensures safe, efficient, and robust performance, flexible and readily scalable to meet the demands of existing workflows and IT solutions in healthcare. We report integration of our clinical assistant into a large-scale nation-wide inter-hospital IT platform, with over 55,000 content-rich user dialogues completed within the two months of deployment, 2,500 of which were expert-annotated for a consequent validation. The validation results show that CLARITY surpasses human-level performance in terms of the first-attempt routing precision, naturally requiring up to 3 times shorter duration of the consultation than with a human.
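
To make the hybrid design concrete, here is a minimal, hypothetical sketch of the FSM-plus-LLM pattern: a fixed transition table constrains the dialogue flow while an LLM agent (stubbed out as llm_analyze) handles free-text understanding inside each state. State names and transitions are illustrative, not CLARITY's actual dialogue graph.

```python
from enum import Enum, auto

class State(Enum):
    GREETING = auto()
    COLLECT_SYMPTOMS = auto()
    ASSESS_SEVERITY = auto()
    ROUTE = auto()
    DONE = auto()

# The FSM, not the LLM, owns the dialogue flow (illustrative transitions).
TRANSITIONS = {
    State.GREETING: State.COLLECT_SYMPTOMS,
    State.COLLECT_SYMPTOMS: State.ASSESS_SEVERITY,
    State.ASSESS_SEVERITY: State.ROUTE,
}

def llm_analyze(state: State, user_text: str) -> dict:
    """Stub for the collaborative LLM agents that parse free-form input."""
    return {"symptoms": [user_text], "severity": "routine"}

def step(state: State, user_text: str, case: dict) -> State:
    analysis = llm_analyze(state, user_text)   # unstructured understanding
    case.setdefault("symptoms", []).extend(analysis["symptoms"])
    return TRANSITIONS.get(state, State.DONE)  # structured, safe progression

case: dict = {}
state = State.GREETING
for utterance in ["hello", "chest pain on exertion", "since yesterday"]:
    state = step(state, utterance, case)
print(state, case)  # State.ROUTE plus the accumulated structured case record
```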

[3] Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning

Wannan Yang, Xinchi Qiu, Lei Yu, Yuchen Zhang, Oliver Aobo Yang, Narine Kokhlikyan, Nicola Cancedda, Diego Garcia-Olano

Main category: cs.CL

TL;DR: CASAL is an efficient algorithm that bakes activation steering benefits into model weights, enabling LLMs to answer known questions while abstaining from unknown ones, reducing hallucinations by 30-40% with high compute and data efficiency.

Motivation: LLMs often hallucinate by providing incorrect answers instead of admitting ignorance, and existing activation steering methods require real-time monitoring during inference, limiting practical deployment.

Method: Contrastive Activation Steering for Amortized Learning (CASAL) connects interpretability with amortized optimization by training only a submodule of a single transformer layer to directly incorporate activation steering benefits into model weights.

Result: CASAL reduces hallucinations by 30-40% across multiple short-form QA benchmarks, is 30x more compute-efficient and 20x more data-efficient than LoRA-based baselines, and generalizes effectively to out-of-distribution domains and both dense/MoE models.

Conclusion: CASAL represents a promising step for applying interpretability-inspired methods in practical deployment, offering an efficient solution to mitigate hallucinations in both text-only and vision-language models.

Abstract: Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that connects interpretability with amortized optimization. CASAL directly bakes the benefits of activation steering into the model’s weights. Once trained, LLMs answer questions they know while abstaining from answering those they do not. CASAL’s light-weight design requires training only a submodule of a single transformer layer and yet reduces hallucination by 30%-40% across multiple short-form QA benchmarks. CASAL is 30x more compute-efficient and 20x more data-efficient than strong LoRA-based baselines such as SFT and DPO, boosting its practical applicability in data scarce domains. Importantly, CASAL also generalizes effectively to out-of-distribution (OOD) domains. We showcase CASAL’s flexibility in mitigating hallucinations in both text-only and vision-language models. To our knowledge, CASAL is the first steering-based training method that has been shown to be effective for both dense and Mixture-of-Experts (MoE) models. CASAL represents a promising step forward for applying interpretability-inspired methods for practical deployment in production systems.
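
As a rough illustration of “baking steering into weights”, the hedged PyTorch sketch below trains a small zero-initialized adapter inside one frozen MLP submodule to reproduce the effect of adding a contrastive steering vector, so no inference-time hook remains. The vector construction and the regression target are simplified assumptions, not the paper’s exact objective.

```python
import torch
import torch.nn as nn

# Contrastive steering direction: mean activation difference between prompts
# the model answers correctly and ones it should abstain on. The activation
# sets here are random stand-ins; in practice they would be pre-collected.
d = 64
acts_known, acts_unknown = torch.randn(100, d), torch.randn(100, d)
v = acts_unknown.mean(0) - acts_known.mean(0)

# Frozen original MLP submodule of one transformer layer (toy version).
mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
for p in mlp.parameters():
    p.requires_grad_(False)

# Trainable zero-initialized adapter: after training, the steering effect
# lives in the weights and no runtime intervention is needed.
delta = nn.Linear(d, d)
nn.init.zeros_(delta.weight)
nn.init.zeros_(delta.bias)

h = torch.randn(256, d)                        # stand-in hidden states
with torch.no_grad():
    target = mlp(h) + v                        # "steered" output to amortize

opt = torch.optim.Adam(delta.parameters(), lr=1e-2)
for _ in range(300):
    loss = ((mlp(h) + delta(h) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())                             # approaches zero
```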

[4] Hallucination-Resistant, Domain-Specific Research Assistant with Self-Evaluation and Vector-Grounded Retrieval

Vivek Bhavsar, Joseph Ereifej, Aravanan Gurusami

Main category: cs.CL

TL;DR: RA-FSM is a modular GPT-based research assistant that uses a finite-state control loop (Relevance → Confidence → Knowledge) to provide well-cited, reliable answers for technical domains like photonics, outperforming baseline models in expert evaluations.

Motivation: Large language models accelerate literature synthesis but often hallucinate and mis-cite, limiting their usefulness in expert workflows where accuracy and proper citation are critical.

Method: A modular system with finite-state control loop (Relevance → Confidence → Knowledge), grounded in vector retrieval and deterministic citation pipeline. Uses ranked-tier ingestion from journals, conferences, preprints, and patents, with dense vector indexing and relational storage.

Result: In blinded A/B reviews, domain experts preferred RA-FSM over Notebook LM and a vanilla GPT API baseline, citing better boundary-condition handling and more defensible evidence use. The system explores beyond the baselines while incurring tunable latency and cost overheads.

Conclusion: RA-FSM provides transparent, well-cited answers for high-stakes technical work and is generalizable to other scientific domains, addressing key limitations of current LLM-based research assistants.

Abstract: Large language models accelerate literature synthesis but can hallucinate and mis-cite, limiting their usefulness in expert workflows. We present RA-FSM (Research Assistant - Finite State Machine), a modular GPT-based research assistant that wraps generation in a finite-state control loop: Relevance -> Confidence -> Knowledge. The system is grounded in vector retrieval and a deterministic citation pipeline. The controller filters out-of-scope queries, scores answerability, decomposes questions, and triggers retrieval only when needed, and emits answers with confidence labels and in-corpus, de-duplicated references. A ranked-tier ingestion workflow constructs a domain knowledge base from journals, conferences, indices, preprints, and patents, writing both to a dense vector index and to a relational store of normalized metrics. We implement the system for photonics and evaluate it on six task categories: analytical reasoning, numerical analysis, methodological critique, comparative synthesis, factual extraction, and application design. In blinded A/B reviews, domain experts prefer RA-FSM to both a strong Notebook LM (NLM) and a vanilla Default GPT API call single-pass baseline, citing stronger boundary-condition handling and more defensible evidence use. Coverage and novelty analyses indicate that RA-FSM explores beyond the NLM while incurring tunable latency and cost overheads. The design emphasizes transparent, well-cited answers for high-stakes technical work and is generalizable to other scientific domains.

[5] KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI

So Kuroki, Yotaro Kubo, Takuya Akiba, Yujin Tang

Main category: cs.CL

TL;DR: A hybrid speech-to-speech system that combines real-time S2S responsiveness with LLM knowledge injection to achieve both low latency and rich semantic understanding.

Motivation: Bridge the gap between real-time S2S models (low latency but poor knowledge) and cascaded systems (rich knowledge but high latency) to enable natural conversational flow with deep understanding.

Method: Process user speech through S2S transformer for immediate response while concurrently sending query to back-end LLM, then inject LLM’s text response in real-time to guide S2S speech generation.

Result: Substantially outperforms baseline S2S model in response correctness, approaching cascaded system performance while maintaining baseline-level latency on MT-Bench speech-synthesized benchmark.

Conclusion: The hybrid architecture successfully combines the strengths of both paradigms, enabling knowledge-rich conversational responses without sacrificing real-time interaction quality.

Abstract: Real-time speech-to-speech (S2S) models excel at generating natural, low-latency conversational responses but often lack deep knowledge and semantic understanding. Conversely, cascaded systems combining automatic speech recognition, a text-based Large Language Model (LLM), and text-to-speech synthesis offer superior knowledge representation at the cost of high latency, which disrupts the flow of natural interaction. This paper introduces a novel hybrid architecture that bridges the gap between these two paradigms. Our framework processes user speech through an S2S transformer for immediate responsiveness while concurrently relaying the query to a powerful back-end LLM. The LLM’s text-based response is then injected in real time to guide the S2S model’s speech generation, effectively infusing its output with rich knowledge without the full latency penalty of a cascaded system. We evaluated our method using a speech-synthesized variant of the MT-Bench benchmark that consists of multi-turn question-answering sessions. The results demonstrate that our system substantially outperforms a baseline S2S model in response correctness, approaching that of a cascaded system, while maintaining a latency on par with the baseline.
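
A toy sketch of the tandem timing pattern, with hypothetical stand-ins for both models: the S2S stream starts answering immediately, and whatever the slower back-end LLM returns is injected mid-stream to guide the remaining generation.

```python
import queue
import threading
import time

def s2s_stream(user_speech: str, guidance: queue.Queue):
    """Toy S2S model: emits speech chunks immediately, one per tick."""
    for i in range(6):
        try:
            hint = guidance.get_nowait()       # knowledge arrives mid-stream
        except queue.Empty:
            hint = None
        yield f"chunk{i}" + (f" [guided by: {hint}]" if hint else "")
        time.sleep(0.05)

def backend_llm(query: str, guidance: queue.Queue):
    """Toy back-end LLM: slower, but returns a knowledge-rich answer."""
    time.sleep(0.12)                           # latency overlapped with S2S
    guidance.put("knowledge-rich text answer")

g: queue.Queue = queue.Queue()
threading.Thread(target=backend_llm, args=("user query", g)).start()
for chunk in s2s_stream("user speech", g):
    print(chunk)                               # first chunks stream instantly
```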

[6] SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification

Kanghoon Yoon, Minsub Kim, Sungjae Lee, Joonhyung Lee, Sunghyeon Woo, Yeonjun In, Se Jung Kwon, Chanyoung Park, Dongsoo Lee

Main category: cs.CL

TL;DR: SelfJudge enables automatic training of judge verifiers for speculative decoding via self-supervision from the target model, improving inference speed while maintaining accuracy across diverse NLP tasks.

Motivation: Existing judge decoding methods rely on human annotations or tasks with verifiable ground truths, limiting their generalizability across diverse NLP applications.

Method: Trains judge verifiers via self-supervision by measuring semantic preservation - assessing whether token-substituted responses maintain the meaning of original responses from the target model.

Result: SelfJudge achieves superior inference-accuracy trade-offs compared to judge decoding baselines.

Conclusion: SelfJudge offers a broadly applicable solution for faster LLM inference that works across diverse NLP tasks without requiring human annotations.

Abstract: Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge decoding boosts this process by relaxing verification criteria by accepting draft tokens that may exhibit minor discrepancies from target model output, but existing methods are restricted by their reliance on human annotations or tasks with verifiable ground truths, limiting generalizability across diverse NLP tasks. We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses, enabling automatic verifier training across diverse NLP tasks. Our experiments show SelfJudge achieves superior inference-accuracy trade-offs than judge decoding baselines, offering a broadly applicable solution for faster LLM inference.
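
A hedged sketch of the label-construction idea: substitute a draft token into the target model’s own response and label the pair by whether meaning survives. The embedding-similarity proxy below is an assumption for illustration; the paper derives the preservation signal from the target model itself.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def make_judge_example(response_tokens, pos, draft_token, threshold=0.9):
    """Build one self-supervised training example for the judge verifier."""
    original = " ".join(response_tokens)
    substituted = " ".join(
        response_tokens[:pos] + [draft_token] + response_tokens[pos + 1:]
    )
    # Proxy for semantic preservation (illustrative, not the paper's scorer).
    e1, e2 = embedder.encode([original, substituted], convert_to_tensor=True)
    preserved = util.cos_sim(e1, e2).item() >= threshold
    # (substituted text, position, accept/reject label) trains the verifier.
    return {"text": substituted, "pos": pos, "accept": preserved}

print(make_judge_example("the cat sat on the mat".split(), 1, "kitten"))
```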

[7] EntropyLong: Effective Long-Context Training via Predictive Uncertainty

Junlong Jia, Ziyang Chen, Xing Wu, Chaochen Gao, Zijia Lin, Debing Zhang, Songlin Hu, Binghui Guo

Main category: cs.CL

TL;DR: EntropyLong is a novel data construction method that uses predictive uncertainty to verify genuine long-range dependencies in training data, improving long-context language model performance.

Motivation: Current approaches for training long-context models often fail to guarantee genuine long-range dependencies, using generic text concatenation or heuristics that may create spurious correlations.

Method: The method identifies high-entropy positions in documents, retrieves semantically relevant contexts from large corpora, and verifies their utility by assessing whether they reduce prediction entropy through model-in-the-loop verification.

Result: Models trained on the generated 128K-length sequences with verified dependencies show significant improvements on RULER benchmarks, especially in distant information tasks, and achieve substantial gains on LongBenchv2 after instruction fine-tuning.

Conclusion: Extensive ablation studies validate that entropy-based verification is necessary and effective for long-context training, ensuring dependencies represent measurable information gain rather than spurious correlation.

Abstract: Training long-context language models to capture long-range dependencies requires specialized data construction. Current approaches, such as generic text concatenation or heuristic-based variants, frequently fail to guarantee genuine long-range dependencies. We propose EntropyLong, a novel data construction method that leverages predictive uncertainty to verify dependency quality. Our approach identifies high-entropy positions in documents, retrieves semantically relevant contexts from large corpora, and verifies their utility by assessing whether they reduce prediction entropy. This model-in-the-loop verification ensures each dependency represents measurable information gain rather than spurious correlation. We construct training samples with long-range dependencies by combining original documents with these verified contextual supplements. Using FineWebEdu and Cosmopedia, we generate a dataset of 128K-length sequences with verified dependencies. Models trained on this data demonstrate significant improvements on RULER benchmarks, particularly in tasks requiring distant information. Following instruction fine-tuning, our models also achieve substantial gains on LongBenchv2, demonstrating enhanced long-context understanding. Extensive ablation studies further validate the necessity and effectiveness of entropy-based verification for long-context training.
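
The verification step is easy to picture in code. The minimal sketch below uses GPT-2 as a stand-in scorer: a candidate retrieved context is kept only if prepending it lowers the predictive entropy at a high-entropy position. The texts and the acceptance rule are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def next_token_entropy(text: str) -> float:
    """Predictive entropy of the next token after `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    logits = model(ids).logits[0, -1]
    return torch.distributions.Categorical(logits=logits).entropy().item()

prefix = "The treaty was finally signed in"        # high-entropy position
context = "After a decade of negotiations hosted in Vienna, the treaty ..."

h_plain = next_token_entropy(prefix)
h_aug = next_token_entropy(context + "\n" + prefix)

# Model-in-the-loop check: keep the retrieved context only if it measurably
# reduces uncertainty, i.e., it carries genuine long-range information.
if h_aug < h_plain:
    print(f"verified dependency: entropy {h_plain:.2f} -> {h_aug:.2f}")
```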

[8] Synthetic Dialogue Generation for Interactive Conversational Elicitation & Recommendation (ICER)

Moonkyung Ryu, Chih-Wei Hsu, Yinlam Chow, Mohammad Ghavamzadeh, Craig Boutilier

Main category: cs.CL

TL;DR: The paper addresses the lack of public conversational recommender system data by developing a method to generate natural, consistent dialogues using behavior simulators and language model prompting.

Motivation: Language models have potential for conversational recommender systems but face challenges due to limited public CRS data, and existing user simulators often lack behavioral consistency.

Method: Developed a methodology combining behavior simulators with LM-prompting to generate natural dialogues consistent with users’ underlying states, creating a large open-source CRS dataset with preference elicitation and example critiquing.

Result: Generated a large open-source CRS dataset, with rater evaluation showing the dialogues exhibit considerable consistency, factuality and naturalness.

Conclusion: The proposed methodology successfully addresses data scarcity in conversational recommender systems by generating high-quality, consistent dialogues that can be used to train LM-based CRSs.

Abstract: While language models (LMs) offer great potential for conversational recommender systems (CRSs), the paucity of public CRS data makes fine-tuning LMs for CRSs challenging. In response, LMs as user simulators qua data generators can be used to train LM-based CRSs, but often lack behavioral consistency, generating utterance sequences inconsistent with those of any real user. To address this, we develop a methodology for generating natural dialogues that are consistent with a user’s underlying state using behavior simulators together with LM-prompting. We illustrate our approach by generating a large, open-source CRS data set with both preference elicitation and example critiquing. Rater evaluation on some of these dialogues shows them to exhibit considerable consistency, factuality and naturalness.

[9] A High-Capacity and Secure Disambiguation Algorithm for Neural Linguistic Steganography

Yapei Feng, Feng Jiang, Shanhao Wu, Hua Zhong

Main category: cs.CL

TL;DR: Proposes look-ahead Sync method for neural linguistic steganography that overcomes capacity limitations of SyncPool while maintaining security guarantees, achieving 160%+ improvement in English and 25%+ in Chinese embedding rates.

Motivation: Address the fundamental challenge of tokenization ambiguity in modern tokenizers, which causes catastrophic decoding failures, and overcome the capacity sacrifice of the SyncPool method, which uses the entire Shannon entropy of an ambiguous group for synchronization rather than payload embedding.

Method: Look-ahead Sync method performs minimal synchronized sampling only on truly indistinguishable token sequences while strategically preserving all other discernible paths to maximize embedding capacity. Provides theoretical proofs for security.

Result: Method consistently approaches theoretical capacity upper bound, significantly outperforming SyncPool with embedding rate improvements exceeding 160% in English and 25% in Chinese, particularly with larger candidate pools.

Conclusion: Represents a significant step toward practical high-capacity provably secure linguistic steganography by overcoming capacity limitations while retaining security guarantees.

Abstract: Neural linguistic steganography aims to embed information into natural text while preserving statistical undetectability. A fundamental challenge in this field stems from tokenization ambiguity in modern tokenizers, which can lead to catastrophic decoding failures. The recent method, SyncPool, addresses this ambiguity by employing a coarse-grained synchronization mechanism over groups of ambiguous candidates. However, SyncPool sacrifices embedding capacity, as it utilizes the entire Shannon entropy of an ambiguous group solely for synchronization rather than for payload embedding. We propose a method named look-ahead Sync, which overcomes the capacity limitation of SyncPool while retaining its provable security guarantees. Our approach performs minimal synchronized sampling only on truly indistinguishable token sequences, while strategically preserving all other discernible paths to maximize embedding capacity. We provide theoretical proofs for the security of our method and analyze the gap between its achievable embedding capacity and the theoretical upper bound. Experiments on English (using Llama 3) and Chinese (using Qwen 2.5) benchmarks show that our method consistently approaches the theoretical capacity upper bound and significantly outperforms SyncPool. The improvement in embedding rate exceeds 160% in English and 25% in Chinese, particularly in settings with larger candidate pools. This work represents a significant step toward practical high-capacity provably secure linguistic steganography.

[10] Human Mobility Datasets Enriched With Contextual and Social Dimensions

Chiara Pugliese, Francesco Lettich, Guido Rocchietti, Chiara Renso, Fabio Pinelli

Main category: cs.CL

TL;DR: Two publicly available datasets of semantically enriched human trajectories with GPS traces, contextual layers, and synthetic social media posts generated by LLMs, covering Paris and New York cities.

Motivation: To provide comprehensive datasets that combine real-world movement data with semantic enrichment and LLM-generated content for advanced mobility analysis and research applications.

Method: Created a pipeline using publicly available GPS traces from OpenStreetMap, enriched with contextual layers (stops, moves, POIs, transportation modes, weather), and generated synthetic social media posts using Large Language Models.

Result: Produced two datasets in tabular and RDF formats covering Paris and New York, supporting semantic reasoning and FAIR data practices, with an open source reproducible pipeline for customization.

Conclusion: This is the first resource combining real-world movement, structured semantic enrichment, LLM-generated text, and semantic web compatibility in a reusable framework for mobility research.

Abstract: In this resource paper, we present two publicly available datasets of semantically enriched human trajectories, together with the pipeline to build them. The trajectories are publicly available GPS traces retrieved from OpenStreetMap. Each dataset includes contextual layers such as stops, moves, points of interest (POIs), inferred transportation modes, and weather data. A novel semantic feature is the inclusion of synthetic, realistic social media posts generated by Large Language Models (LLMs), enabling multimodal and semantic mobility analysis. The datasets are available in both tabular and Resource Description Framework (RDF) formats, supporting semantic reasoning and FAIR data practices. They cover two structurally distinct, large cities: Paris and New York. Our open source reproducible pipeline allows for dataset customization, while the datasets support research tasks such as behavior modeling, mobility prediction, knowledge graph construction, and LLM-based applications. To our knowledge, our resource is the first to combine real-world movement, structured semantic enrichment, LLM-generated text, and semantic web compatibility in a reusable framework.

[11] Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing

Zhe Li, Wei Zhao, Yige Li, Jun Sun

Main category: cs.CL

TL;DR: A novel framework for diagnosing undesirable LLM behaviors by analyzing representation and its gradients in activation space, enabling efficient sample-level and token-level attribution.

Motivation: Existing attribution methods based on parameter gradients are often noisy and computationally complex, making it difficult to diagnose root causes of LLM failures like harmful content generation, factual inaccuracies, and societal biases.

Method: Analyzes representation and its gradients directly in the model’s activation space to provide semantically meaningful signals linking outputs to training data, enabling both sample-level and fine-grained token-level analysis.

Result: The method excels at sample-level attribution and enables fine-grained token-level analysis, precisely identifying specific samples and phrases that causally influence model behavior across tasks including harmful content tracking, backdoor poisoning detection, and knowledge contamination identification.

Conclusion: Provides a powerful diagnostic tool to understand, audit, and mitigate risks associated with LLMs, offering an efficient alternative to noisy and complex gradient-based attribution methods.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their deployment is frequently undermined by undesirable behaviors such as generating harmful content, factual inaccuracies, and societal biases. Diagnosing the root causes of these failures poses a critical challenge for AI safety. Existing attribution methods, particularly those based on parameter gradients, often fall short due to prohibitive noisy signals and computational complexity. In this work, we introduce a novel and efficient framework that diagnoses a range of undesirable LLM behaviors by analyzing representation and its gradients, which operates directly in the model’s activation space to provide a semantically meaningful signal linking outputs to their training data. We systematically evaluate our method for tasks that include tracking harmful content, detecting backdoor poisoning, and identifying knowledge contamination. The results demonstrate that our approach not only excels at sample-level attribution but also enables fine-grained token-level analysis, precisely identifying the specific samples and phrases that causally influence model behavior. This work provides a powerful diagnostic tool to understand, audit, and ultimately mitigate the risks associated with LLMs. The code is available at https://github.com/plumprc/RepT.
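
A toy sketch of attribution in activation space, under simplifying assumptions: an example’s signature is its representation h at a chosen layer together with the gradient dL/dh (not parameter gradients), and training samples are ranked by signature similarity to the problematic output. The model and data are random stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in model; the signal lives in activation space, which is what
# makes this cheaper than parameter-gradient attribution.
encoder, head = nn.Linear(16, 8), nn.Linear(8, 2)

def signature(x: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Concatenate the representation h and its gradient dL/dh."""
    h = encoder(x)
    h.retain_grad()                      # keep the activation gradient
    F.cross_entropy(head(h), label).backward()
    return torch.cat([h.detach(), h.grad.detach()], dim=-1).flatten()

# Signature of a problematic output, compared against training samples.
query = signature(torch.randn(1, 16), torch.tensor([1]))
train_set = [(torch.randn(1, 16), torch.tensor([i % 2])) for i in range(5)]
scores = [F.cosine_similarity(query, signature(x, y), dim=0)
          for x, y in train_set]
print(int(torch.stack(scores).argmax()))  # most influential training sample
```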

[12] Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

Oriol Pareras, Gerard I. Gállego, Federico Costa, Cristina España-Bonet, Javier Hernando

Main category: cs.CL

TL;DR: Direct prompting in Speech-to-Text Translation improves more consistently than Chain-of-Thought prompting as S2TT data increases, suggesting it may become the preferred approach with larger datasets.

Motivation: To systematically compare Chain-of-Thought (CoT) and Direct prompting strategies in Speech-to-Text Translation under increasing amounts of S2TT data, as current LLM-based models primarily use CoT prompting.

Method: Pseudo-labeled an ASR corpus by translating transcriptions into six European languages, then trained LLM-based S2TT systems with both CoT and Direct prompting strategies at different data scales.

Result: Direct prompting improves more consistently than CoT prompting as the amount of S2TT data increases.

Conclusion: Direct prompting may become a more effective approach than Chain-of-Thought prompting as larger S2TT resources are created.

Abstract: Recent work on Speech-to-Text Translation (S2TT) has focused on LLM-based models, introducing the increasingly adopted Chain-of-Thought (CoT) prompting, where the model is guided to first transcribe the speech and then translate it. CoT typically outperforms direct prompting primarily because it can exploit abundant Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) datasets to explicitly model its steps. In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. To this end, we pseudo-label an ASR corpus by translating its transcriptions into six European languages, and train LLM-based S2TT systems with both prompting strategies at different data scales. Our results show that Direct improves more consistently as the amount of data increases, suggesting that it may become a more effective approach as larger S2TT resources are created.

[13] FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory

Xiao-Wen Yang, Zihao Zhang, Jianuo Cao, Zhi Zhou, Zenan Li, Lan-Zhe Guo, Yuan Yao, Taolue Chen, Yu-Feng Li, Xiaoxing Ma

Main category: cs.CL

TL;DR: FormalML is a Lean 4 benchmark for evaluating LLMs’ ability to complete missing proof steps in complex mathematical proofs, focusing on machine learning theories with 4937 problems.

Motivation: To assess LLMs' practical utility as mathematical assistants by testing their ability to fill in missing proof steps (subgoal completion) in complex research-level contexts.

Method: Created FormalML benchmark using Lean 4, built from machine learning theories, with translation tactic converting procedural proofs to declarative form, spanning optimization and probability inequalities.

Result: Evaluation of state-of-the-art provers revealed persistent limitations in both accuracy and efficiency for subgoal completion tasks.

Conclusion: There is a need for more capable LLM-based theorem provers to effectively handle subgoal completion in mathematical proofs.

Abstract: Large language models (LLMs) have recently demonstrated remarkable progress in formal theorem proving. Yet their ability to serve as practical assistants for mathematicians, filling in missing steps within complex proofs, remains underexplored. We identify this challenge as the task of subgoal completion, where an LLM must discharge short but nontrivial proof obligations left unresolved in a human-provided sketch. To study this problem, we introduce FormalML, a Lean 4 benchmark built from foundational theories of machine learning. Using a translation tactic that converts procedural proofs into declarative form, we extract 4937 problems spanning optimization and probability inequalities, with varying levels of difficulty. FormalML is the first subgoal completion benchmark to combine premise retrieval and complex research-level contexts. Evaluation of state-of-the-art provers highlights persistent limitations in accuracy and efficiency, underscoring the need for more capable LLM-based theorem provers for effective subgoal completion.

[14] Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation

Jacobo Romero-Díaz, Gerard I. Gállego, Oriol Pareras, Federico Costa, Javier Hernando, Cristina España-Bonet

Main category: cs.CL

TL;DR: Chain-of-Thought prompting for Speech-to-Text Translation largely behaves like cascaded systems, relying mainly on transcripts rather than leveraging speech information, contrary to expectations.

Motivation: To overcome limitations of cascaded S2TT systems (error propagation and inability to use acoustic cues) by exploring Chain-of-Thought prompting's potential for joint speech and text processing.

Method: Analyzed CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness tests. Also tested training interventions like adding Direct S2TT data and noisy transcript injection.

Result: CoT largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Training interventions enhanced robustness and increased speech attribution.

Conclusion: The assumed advantages of CoT are challenged, highlighting the need for architectures that explicitly integrate acoustic information into translation.

Abstract: Speech-to-Text Translation (S2TT) systems built from Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) modules face two major limitations: error propagation and the inability to exploit prosodic or other acoustic cues. Chain-of-Thought (CoT) prompting has recently been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. Analyzing CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness, we find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding Direct S2TT data or noisy transcript injection, enhance robustness and increase speech attribution. These findings challenge the assumed advantages of CoT and highlight the need for architectures that explicitly integrate acoustic information into translation.

[15] KurdSTS: The Kurdish Semantic Textual Similarity

Abdulhady Abas Abdullah, Hadi Veisi, Hussein M. Al

Main category: cs.CL

TL;DR: First Kurdish STS dataset with 10,000 annotated sentence pairs spanning formal and informal registers, benchmarking multiple models and establishing evaluation suite for Kurdish semantics.

Motivation: Low-resource languages like Kurdish remain underserved in semantic textual similarity research despite extensive resources for high-resource languages.

Method: Created 10,000 Kurdish sentence pairs annotated for similarity, benchmarked Sentence-BERT, multilingual BERT, and other strong baselines on the dataset.

Result: Obtained competitive results while highlighting challenges from Kurdish morphology, orthographic variation, and code-mixing.

Conclusion: The dataset and baselines establish reproducible evaluation suite and provide strong foundation for future Kurdish semantics and low-resource NLP research.

Abstract: Semantic Textual Similarity (STS) measures the degree of meaning overlap between two texts and underpins many NLP tasks. While extensive resources exist for high-resource languages, low-resource languages such as Kurdish remain underserved. We present, to our knowledge, the first Kurdish STS dataset: 10,000 sentence pairs spanning formal and informal registers, each annotated for similarity. We benchmark Sentence-BERT, multilingual BERT, and other strong baselines, obtaining competitive results while highlighting challenges arising from Kurdish morphology, orthographic variation, and code-mixing. The dataset and baselines establish a reproducible evaluation suite and provide a strong starting point for future research on Kurdish semantics and low-resource NLP.

[16] CRACQ: A Multi-Dimensional Approach To Automated Document Assessment

Ishak Soltani, Francisco Belo, Bernardo Tavares

Main category: cs.CL

TL;DR: CRACQ is a multi-dimensional evaluation framework that assesses documents across five traits: Coherence, Rigor, Appropriateness, Completeness, and Quality, providing interpretable automated evaluation beyond traditional essay scoring.

Motivation: To move beyond single-score evaluation approaches and provide a more nuanced, interpretable methodology for assessing diverse machine-generated text, addressing limitations of direct LLM evaluation.

Method: Built on trait-based Automated Essay Scoring principles, CRACQ integrates linguistic, semantic, and structural signals into a cumulative assessment. Trained on 500 synthetic grant proposals and benchmarked against an LLM-as-a-judge approach.

Result: CRACQ produces more stable and interpretable trait-level judgments than direct LLM evaluation, though challenges in reliability and domain scope persist.

Conclusion: CRACQ offers a promising rubric-driven framework for multi-dimensional document evaluation, providing both holistic and trait-level analysis capabilities, but requires further refinement for reliability and broader domain applicability.

Abstract: This paper presents CRACQ, a multi-dimensional evaluation framework tailored to evaluate documents across five specific traits: Coherence, Rigor, Appropriateness, Completeness, and Quality. Building on insights from trait-based Automated Essay Scoring (AES), CRACQ expands its focus beyond essays to encompass diverse forms of machine-generated text, providing a rubric-driven and interpretable methodology for automated evaluation. Unlike single-score approaches, CRACQ integrates linguistic, semantic, and structural signals into a cumulative assessment, enabling both holistic and trait-level analysis. Trained on 500 synthetic grant proposals, CRACQ was benchmarked against an LLM-as-a-judge and further tested on both strong and weak real applications. Preliminary results indicate that CRACQ produces more stable and interpretable trait-level judgments than direct LLM evaluation, though challenges in reliability and domain scope remain.

[17] Optimizing Long-Form Clinical Text Generation with Claim-Based Rewards

Samyak Jhaveri, Praphul Singh, Jangwon Kim, Tara Taghavi, Krishnaram Kenthapadi

Main category: cs.CL

TL;DR: An evaluation-integrated reinforcement learning framework using Group Relative Policy Optimization (GRPO) with DocLens evaluator for automated clinical documentation, improving factuality and completeness without separate reward models or human references.

Motivation: Automating clinical documentation requires precise alignment with priorities like completeness and factual grounding, needing methods that directly optimize these objectives without relying on separate reward models or human-authored references.

Method: Evaluation-integrated reinforcement learning framework combining Group Relative Policy Optimization (GRPO) with DocLens - a claim-level evaluator providing deterministic, dialogue-grounded rewards. Uses reward-gating strategy to reduce training cost.

Result: Improves clinical note quality, with GRPO outputs preferred for factuality, completeness, and brevity and showing fewer omissions and hallucinations; an independent GPT-5 evaluation supports these gains. Because the benchmarks are relatively clean and the base model is already well aligned, the improvements likely represent a conservative lower bound.

Conclusion: The framework is scalable to real-world clinical settings and can incorporate custom objectives like guideline adherence or billing preferences, providing an effective approach for automated clinical documentation.

Abstract: Automating clinical documentation with large language models requires precise alignment with priorities such as completeness and factual grounding. We present an evaluation-integrated reinforcement learning framework for long-form clinical text generation that couples Group Relative Policy Optimization (GRPO) with DocLens, a claim-level evaluator that provides deterministic, dialogue-grounded rewards. Our method directly optimizes factual grounding and completeness without training a separate reward model or relying on human-authored references. Empirically, the approach improves clinical note quality and reduces training cost via a simple reward-gating strategy. An independent GPT-5 qualitative evaluation further supports these gains, showing higher preference for GRPO outputs in factuality, completeness, and brevity, with fewer omissions and hallucinations. Because the benchmarks are relatively clean and the base model already well aligned, these improvements likely represent a conservative lower bound. The framework is scalable to real-world settings and can incorporate custom objectives such as guideline adherence or billing preferences.
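
To illustrate the reward side, here is a hedged sketch of claim-level, dialogue-grounded rewards with group-relative advantages and a simple gate. extract_claims and claim_supported are toy stand-ins for DocLens, and the gating rule is illustrative; its point is to skip the expensive evaluator on samples that fail a cheap check.

```python
import statistics

def extract_claims(note: str) -> list[str]:
    """Toy claim extractor: one claim per sentence."""
    return [s.strip() for s in note.split(".") if s.strip()]

def claim_supported(claim: str, dialogue: str) -> bool:
    """Toy grounding check standing in for a claim-level evaluator."""
    return claim.lower() in dialogue.lower()

def reward(note: str, dialogue: str, min_len: int = 20) -> float:
    if len(note) < min_len:                    # gate: skip the evaluator
        return 0.0
    claims = extract_claims(note)
    return sum(claim_supported(c, dialogue) for c in claims) / max(len(claims), 1)

def group_relative_advantages(notes, dialogue):
    rs = [reward(n, dialogue) for n in notes]
    mu, sd = statistics.mean(rs), statistics.pstdev(rs) or 1.0
    return [(r - mu) / sd for r in rs]         # GRPO: normalize within group

dialogue = "Patient reports headache for two days. No fever."
samples = ["Patient reports headache for two days", "Patient has fever. X"]
print(group_relative_advantages(samples, dialogue))   # [1.0, -1.0]
```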

[18] Evaluating Uncertainty Quantification Methods in Argumentative Large Language Models

Kevin Zhou, Adam Dejl, Gabriel Freedman, Lihu Chen, Antonio Rago, Francesca Toni

Main category: cs.CL

TL;DR: This paper explores uncertainty quantification methods for argumentative LLMs, finding that simple direct prompting outperforms more complex approaches in claim verification tasks.

Motivation: To ensure reliability of LLMs by integrating uncertainty quantification methods in argumentative LLMs, which are explainable frameworks for decision-making based on computational argumentation.

Method: Conducted experiments evaluating ArgLLMs’ performance on claim verification tasks using different LLM UQ methods, providing a novel way to assess UQ method effectiveness.

Result: Direct prompting, despite its simplicity, was found to be an effective UQ strategy in ArgLLMs, outperforming considerably more complex approaches.

Conclusion: Simple direct prompting is surprisingly effective for uncertainty quantification in argumentative LLMs, suggesting that complex UQ methods may not always be necessary for reliable performance.

Abstract: Research in uncertainty quantification (UQ) for large language models (LLMs) is increasingly important towards guaranteeing the reliability of this groundbreaking technology. We explore the integration of LLM UQ methods in argumentative LLMs (ArgLLMs), an explainable LLM framework for decision-making based on computational argumentation in which UQ plays a critical role. We conduct experiments to evaluate ArgLLMs’ performance on claim verification tasks when using different LLM UQ methods, inherently performing an assessment of the UQ methods’ effectiveness. Moreover, the experimental procedure itself is a novel way of evaluating the effectiveness of UQ methods, especially when intricate and potentially contentious statements are present. Our results demonstrate that, despite its simplicity, direct prompting is an effective UQ strategy in ArgLLMs, outperforming considerably more complex approaches.

[19] Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs

Xin Gao, Ruiyi Zhang, Daniel Du, Saurabh Mahindre, Sai Ashish Somayajula, Pengtao Xie

Main category: cs.CL

TL;DR: LLMs can simulate earlier knowledge cutoffs via prompting for direct factual forgetting but struggle with causally related knowledge, highlighting limitations in temporal prediction evaluation.

Motivation: To address contamination concerns in LLM temporal prediction where accurate predictions may reflect memorization rather than reasoning, and investigate if prompting can simulate earlier knowledge cutoffs.

Method: Constructed three evaluation datasets to assess LLM forgetting capabilities: (1) direct factual knowledge, (2) semantic shifts, and (3) causally related knowledge, using prompt-based simulated knowledge cutoffs.

Result: Prompt-based simulated knowledge cutoffs are effective for direct factual queries but struggle to induce forgetting when the forgotten content is causally related to the query rather than directly asked.

Conclusion: Current prompting techniques for simulating knowledge cutoffs have limitations, particularly with causal relationships, necessitating more rigorous evaluation settings for LLMs in temporal prediction tasks.

Abstract: Large Language Models (LLMs) are widely used for temporal prediction, but their reliance on pretraining data raises contamination concerns, as accurate predictions on pre-cutoff test data may reflect memorization rather than reasoning, leading to an overestimation of their generalization capability. With the recent emergence of prompting-based unlearning techniques, a natural question arises: Can LLMs be prompted to simulate an earlier knowledge cutoff? In this work, we investigate the capability of prompting to simulate earlier knowledge cutoff in LLMs. We construct three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs show effectiveness when directly queried with the information after that date, they struggle to induce forgetting when the forgotten content is not directly asked but causally related to the query. These findings highlight the need for more rigorous evaluation settings when applying LLMs for temporal prediction tasks. The full dataset and evaluation code are available at https://github.com/gxx27/time_unlearn.
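
A minimal sketch of the kind of prompt being evaluated; the wording is illustrative, not the paper’s exact template.

```python
def cutoff_prompt(question: str, cutoff: str) -> str:
    """Illustrative prompt asking a model to simulate an earlier cutoff."""
    return (
        f"Pretend your training data ends on {cutoff}. "
        f"Answer using only knowledge available before that date; "
        f"if the answer depends on later events, say you don't know.\n\n"
        f"Question: {question}"
    )

print(cutoff_prompt("Who won the 2022 World Cup?", "2021-01-01"))
```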

[20] DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

Yifan Wang, Bolian Li, Junlin Wu, Zhaoxuan Tan, Zheli Liu, Ruqi Zhang, Ananth Grama, Qingkai Zeng

Main category: cs.CL

TL;DR: DRIFT is a preference training method that uses abundant implicit user dissatisfaction signals from real-world LLM deployments, dynamically sampling positives from the evolving policy to achieve significant performance improvements over baseline methods.

Motivation: Real-world LLM deployments generate abundant implicit user dissatisfaction signals (refinements, corrections, preferences) while explicit satisfaction feedback is scarce, but existing preference learning approaches are poorly aligned with this data profile.

Method: DRIFT anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy, preserving preference margins and avoiding gradient degeneration.

Result: DRIFT models achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, with 14B models surpassing GPT-4o-mini on WildBench.

Conclusion: DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal while preserving exploratory capacity and yielding more diverse high-reward solutions.

Abstract: Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile, as they rely on costly human annotations or assume plentiful positive responses. In this paper, we introduce DRIFT (Dissatisfaction-Refined Iterative preFerence Training), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on real-world WildFeedback datasets and synthetic UltraFeedback datasets achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, outperforming strong baseline methods such as iterative DPO and SPIN. At larger scales, the improvements are particularly pronounced: 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we demonstrate that this design preserves preference margins and avoids gradient degeneration. These results show that DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal. The code and data are available at https://github.com/cacayaya/DRIFT.git.
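
A hedged sketch of the pair-construction recipe: negatives are anchored on logged DSAT responses, while positives are sampled fresh from the current policy each iteration. The log schema and policy_sample are illustrative assumptions; the resulting pairs would feed a DPO-style preference loss.

```python
import random

# Illustrative deployment log: responses the user pushed back on (DSAT).
dsat_log = [
    {"prompt": "Summarize this paper.",
     "response": "It is about stuff.",   # user asked for a rewrite
     "dsat": True},
]

def policy_sample(prompt: str) -> str:
    """Stand-in for sampling from the evolving policy at this iteration."""
    return random.choice([
        "The paper proposes X and evaluates it on Y.",
        "A concise summary: X improves Y under setting Z.",
    ])

def build_preference_batch(log):
    batch = []
    for item in log:
        if not item["dsat"]:
            continue
        batch.append({
            "prompt": item["prompt"],
            "rejected": item["response"],             # real DSAT anchor
            "chosen": policy_sample(item["prompt"]),  # dynamic positive
        })
    return batch   # feed to a DPO-style preference objective

print(build_preference_batch(dsat_log))
```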

[21] $\texttt{BluePrint}$: A Social Media User Dataset for LLM Persona Evaluation and Training

Aurélien Bück-Kaeffer, Je Qin Chooi, Dan Zhao, Maximilian Puelma Touzel, Kellin Pelrine, Jean-François Godbout, Reihaneh Rabbany, Zachary Yang

Main category: cs.CL

TL;DR: SIMPACT is a framework for creating social media simulation datasets, with BluePrint as a concrete implementation using Bluesky political discourse data to train LLM-based agents for next-action prediction while preserving privacy.

Motivation: There is a lack of standardized data resources for fine-tuning and evaluating LLMs as realistic social media agents, which limits studies of social media dynamics at scale.

Method: Introduces SIMPACT framework with next-action prediction task, clustering anonymized users into personas, and using 12 social media interaction types with context-dependent behaviors. BluePrint dataset is built from public Bluesky data with privacy protection measures.

Result: Created a large-scale dataset (BluePrint) focused on political discourse that captures authentic engagement patterns while safeguarding privacy through pseudonymization and PII removal.

Conclusion: SIMPACT provides standardized data and evaluation protocols for advancing rigorous, ethically responsible social media simulations, with BluePrint serving as both a benchmark and template for domain-specific studies.

Abstract: Large language models (LLMs) offer promising capabilities for simulating social media dynamics at scale, enabling studies that would be ethically or logistically challenging with human subjects. However, the field lacks standardized data resources for fine-tuning and evaluating LLMs as realistic social media agents. We address this gap by introducing SIMPACT, the SIMulation-oriented Persona and Action Capture Toolkit, a privacy respecting framework for constructing behaviorally-grounded social media datasets suitable for training agent models. We formulate next-action prediction as a task for training and evaluating LLM-based agents and introduce metrics at both the cluster and population levels to assess behavioral fidelity and stylistic realism. As a concrete implementation, we release BluePrint, a large-scale dataset built from public Bluesky data focused on political discourse. BluePrint clusters anonymized users into personas of aggregated behaviours, capturing authentic engagement patterns while safeguarding privacy through pseudonymization and removal of personally identifiable information. The dataset includes a sizable action set of 12 social media interaction types (likes, replies, reposts, etc.), each instance tied to the posting activity preceding it. This supports the development of agents that use context-dependence, not only in the language, but also in the interaction behaviours of social media to model social media users. By standardizing data and evaluation protocols, SIMPACT provides a foundation for advancing rigorous, ethically responsible social media simulations. BluePrint serves as both an evaluation benchmark for political discourse modeling and a template for building domain specific datasets to study challenges such as misinformation and polarization.

[22] Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression

Peijun Zhu, Ning Yang, Jiayu Wei, Jinghang Wu, Haijun Zhang

Main category: cs.CL

TL;DR: A unified framework using dynamic expert clustering and structured compression to solve MoE LLM trilemma of load imbalance, parameter redundancy, and communication overhead.

Motivation: Address the fundamental trilemma in Mixture-of-Experts LLMs: load imbalance, parameter redundancy, and communication overhead that limit scalability and efficiency.

Method: Dynamic expert clustering using fused parameter-activation similarity metrics, weight decomposition into shared base matrix + low-rank residual adapters, hierarchical routing strategy, and heterogeneous precision scheme (FP16 shared bases + INT4 residuals).

Result: Matches standard MoE model quality while reducing total parameters by ~80%, improving throughput by 10-20%, and lowering expert load variance by over 3x on GLUE and WikiText-103 benchmarks.

Conclusion: Structural reorganization through dynamic clustering and compression provides a principled path to scalable, efficient, and memory-effective MoE LLMs.

Abstract: Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively. Our method employs an online clustering procedure that periodically regroups experts using a fused metric of parameter and activation similarity, which stabilizes expert utilization. To our knowledge, this is one of the first frameworks to leverage the semantic embedding capability of the router to dynamically reconfigure the model’s architecture during training for substantial efficiency gains. Within each cluster, we decompose expert weights into a shared base matrix and extremely low-rank residual adapters, achieving up to fivefold parameter reduction per group while preserving specialization. This structure enables a two-stage hierarchical routing strategy: tokens are first assigned to a cluster, then to specific experts within it, drastically reducing the routing search space and the volume of all-to-all communication. Furthermore, a heterogeneous precision scheme, which stores shared bases in FP16 and residual factors in INT4, coupled with dynamic offloading of inactive clusters, reduces peak memory consumption to levels comparable to dense models. Evaluated on GLUE and WikiText-103, our framework matches the quality of standard MoE models while reducing total parameters by approximately 80%, improving throughput by 10% to 20%, and lowering expert load variance by a factor of over three. Our work demonstrates that structural reorganization is a principled path toward scalable, efficient, and memory-effective MoE LLMs.
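
Two of the core mechanisms are compact enough to sketch: the fused parameter/activation similarity used for clustering, and the per-cluster decomposition into a shared base plus low-rank residual adapters. Sizes, the fusion weight, and the fixed cluster below are illustrative stand-ins for the paper’s online procedure.

```python
import torch

n_experts, d, rank, alpha = 8, 32, 2, 0.5
expert_W = [torch.randn(d, d) for _ in range(n_experts)]      # expert weights
expert_act = [torch.randn(d) for _ in range(n_experts)]       # mean activations

def fused_sim(i: int, j: int) -> torch.Tensor:
    """Fused metric: weighted mix of parameter and activation similarity."""
    sim_w = torch.cosine_similarity(expert_W[i].flatten(),
                                    expert_W[j].flatten(), dim=0)
    sim_a = torch.cosine_similarity(expert_act[i], expert_act[j], dim=0)
    return alpha * sim_w + (1 - alpha) * sim_a

sims = torch.tensor([[fused_sim(i, j) for j in range(n_experts)]
                     for i in range(n_experts)])  # clustering would use this

cluster = [0, 1, 2, 3]   # stand-in for one cluster from the online procedure
base = torch.stack([expert_W[i] for i in cluster]).mean(0)  # shared base (FP16)
adapters = []
for i in cluster:
    U, S, Vh = torch.linalg.svd(expert_W[i] - base)
    A = U[:, :rank] * S[:rank]          # low-rank residual (INT4 in the paper)
    adapters.append((A, Vh[:rank]))     # expert_W[i] ~= base + A @ B

A, B = adapters[0]
err = torch.norm(expert_W[0] - (base + A @ B)) / torch.norm(expert_W[0])
print(f"relative reconstruction error: {err:.3f}")
```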

[23] Small Language Models for Curriculum-based Guidance

Konstantinos Katharakis, Sippo Rossi, Raghava Rao Mukkamala

Main category: cs.CL

TL;DR: Small language models (SLMs) with 7-17B parameters can match large language models like GPT-4o in educational applications when using retrieval-augmented generation and proper prompting, while offering sustainability benefits.

Motivation: To develop sustainable AI teaching assistants that can provide curriculum-based guidance without relying on energy-intensive cloud infrastructure, making personalized learning more accessible and environmentally responsible.

Method: Used retrieval-augmented generation (RAG) pipeline with eight open-source SLMs (LLaMA 3.1, IBM Granite 3.3, Gemma 3) benchmarked against GPT-4o, employing proper prompting and targeted retrieval techniques.

Result: SLMs with 7-17B parameters achieved comparable performance to GPT-4o in delivering accurate, pedagogically aligned responses when using RAG and appropriate prompting strategies.

Conclusion: SLMs are viable alternatives to LLMs for AI teaching assistants, offering cost-effectiveness, privacy preservation, environmental sustainability, and the ability to run on consumer-grade hardware without cloud dependency.

Abstract: The adoption of generative AI and large language models (LLMs) in education is still emerging. In this study, we explore the development and evaluation of AI teaching assistants that provide curriculum-based guidance using a retrieval-augmented generation (RAG) pipeline applied to selected open-source small language models (SLMs). We benchmarked eight SLMs, including LLaMA 3.1, IBM Granite 3.3, and Gemma 3 (7-17B parameters), against GPT-4o. Our findings show that with proper prompting and targeted retrieval, SLMs can match LLMs in delivering accurate, pedagogically aligned responses. Importantly, SLMs offer significant sustainability benefits due to their lower computational and energy requirements, enabling real-time use on consumer-grade hardware without depending on cloud infrastructure. This makes them not only cost-effective and privacy-preserving but also environmentally responsible, positioning them as viable AI teaching assistants for educational institutions aiming to scale personalized learning in a sustainable and energy-efficient manner.
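
A minimal sketch of a curriculum-grounded RAG step of the kind described: embed curriculum snippets, retrieve the most relevant ones, and hand the grounded prompt to whichever SLM is used. The embedding model, snippets, and retrieval depth are illustrative placeholders.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")
curriculum = [
    "Week 3 covers recursion; students must trace call stacks by hand.",
    "Week 5 covers sorting; grading rewards complexity analysis.",
]
doc_emb = embedder.encode(curriculum, convert_to_tensor=True)

def grounded_prompt(question: str, k: int = 1) -> str:
    """Retrieve top-k curriculum snippets and build an SLM prompt."""
    q = embedder.encode([question], convert_to_tensor=True)
    hits = util.semantic_search(q, doc_emb, top_k=k)[0]
    context = "\n".join(curriculum[h["corpus_id"]] for h in hits)
    return (f"Answer using only this curriculum context:\n{context}\n\n"
            f"Q: {question}\nA:")
    # In a real pipeline this prompt goes to the SLM's generate() call.

print(grounded_prompt("What should I practice for recursion?"))
```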

[24] mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations

Guy Dar

Main category: cs.CL

TL;DR: mini-vec2vec is an efficient and robust linear alternative to vec2vec for aligning text embedding spaces without parallel data, offering significant computational savings while maintaining or improving performance.

DetailsMotivation: The original vec2vec method achieves near-perfect alignment but is expensive and unstable, limiting its practical adoption and scalability.

Method: Three-stage approach: tentative matching of pseudo-parallel embedding vectors, transformation fitting, and iterative refinement to learn a linear transformation.

Result: Exceeds vec2vec by orders of magnitude in efficiency while matching or exceeding its results, with high stability and interpretable algorithmic steps.

Conclusion: mini-vec2vec’s stability, efficiency, and interpretability facilitate scaling and unlock new opportunities for adoption in various domains and fields.

Abstract: We build upon vec2vec, a procedure designed to align text embedding spaces without parallel data. vec2vec finds a near-perfect alignment, but it is expensive and unstable. We present mini-vec2vec, a simple and efficient alternative that requires substantially lower computational cost and is highly robust. Moreover, the learned mapping is a linear transformation. Our method consists of three main stages: a tentative matching of pseudo-parallel embedding vectors, transformation fitting, and iterative refinement. Our linear alternative exceeds the original instantiation of vec2vec by orders of magnitude in efficiency, while matching or exceeding its results. The method’s stability and interpretable algorithmic steps facilitate scaling and unlock new opportunities for adoption in new domains and fields.
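
The three stages lend themselves to a compact sketch: pair pseudo-parallel vectors by mutual nearest neighbours, fit an orthogonal map by Procrustes, and refine iteratively. The matching and fitting details below are our own simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np

def fit_linear_map(X, Y):
    # Orthogonal Procrustes: the W minimizing ||X @ W - Y||_F over rotations.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def mutual_nn_pairs(X, Y):
    sims = X @ Y.T
    fwd = sims.argmax(axis=1)                 # best Y row for each X row
    bwd = sims.argmax(axis=0)                 # best X row for each Y row
    return [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]

def align(X, Y, iters=5):
    W = np.eye(X.shape[1])
    for _ in range(iters):                    # fit on pseudo-pairs, then refine
        idx = np.array(mutual_nn_pairs(X @ W, Y))
        W = fit_linear_map(X[idx[:, 0]], Y[idx[:, 1]])
    return W

# Toy demo: Y is a small rotation of X with rows shuffled; alignment
# should recover the rotation without any parallel supervision.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)
S = 0.05 * rng.normal(size=(16, 16))
A = S - S.T                                   # skew-symmetric
R = np.linalg.solve(np.eye(16) + A, np.eye(16) - A)   # Cayley map: orthogonal
Y = (X @ R)[rng.permutation(200)]
W = align(X, Y)
print(np.allclose(X @ W, X @ R, atol=1e-6))   # expected: True
```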

[25] LLMSQL: Upgrading WikiSQL for the LLM Era of Text-to-SQL

Dzmitry Pihulski, Karol Charchut, Viktoria Novogrodskaia, Jan Kocoń

Main category: cs.CL

TL;DR: LLMSQL is a cleaned and improved version of WikiSQL designed for modern LLMs, addressing structural and annotation issues to provide clean natural language questions and SQL queries for better Text-to-SQL evaluation.

DetailsMotivation: WikiSQL had declined in usage due to various issues including case sensitivity inconsistencies, data type mismatches, syntax errors, and unanswered questions, making it unsuitable for modern LLM-based Text-to-SQL research.

Method: Systematic classification of errors in WikiSQL and implementation of automated methods for cleaning and re-annotation, transforming it into an LLM-ready benchmark with clean natural language questions and full SQL queries as plain text.

Result: Multiple large language models (including Gemma 3, LLaMA 3.2, Mistral 7B, gpt-oss 20B, Phi-3.5 Mini, Qwen 2.5, OpenAI o4-mini, DeepSeek R1) were evaluated on the improved LLMSQL benchmark.

Conclusion: LLMSQL serves as a modern LLM-ready benchmark for Text-to-SQL research, providing clean data suitable for straightforward generation and evaluation, unlike the original WikiSQL which was tailored for pointer-network models.

Abstract: Converting natural language questions into SQL queries (Text-to-SQL) enables non-expert users to interact with relational databases and has long been a central task for natural language interfaces to data. While the WikiSQL dataset played a key role in early NL2SQL research, its usage has declined due to structural and annotation issues, including case sensitivity inconsistencies, data type mismatches, syntax errors, and unanswered questions. We present LLMSQL, a systematic revision and transformation of WikiSQL designed for the LLM era. We classify these errors and implement automated methods for cleaning and re-annotation. To assess the impact of these improvements, we evaluated multiple large language models (LLMs), including Gemma 3, LLaMA 3.2, Mistral 7B, gpt-oss 20B, Phi-3.5 Mini, Qwen 2.5, OpenAI o4-mini, DeepSeek R1 and others. Rather than serving as an update, LLMSQL is introduced as an LLM-ready benchmark: unlike the original WikiSQL, tailored for pointer-network models selecting tokens from input, LLMSQL provides clean natural language questions and full SQL queries as plain text, enabling straightforward generation and evaluation for modern natural language-to-SQL models.

[26] Language, Culture, and Ideology: Personalizing Offensiveness Detection in Political Tweets with Reasoning LLMs

Dzmitry Pihulski, Jan Kocoń

Main category: cs.CL

TL;DR: LLMs with reasoning abilities are more consistent and sensitive in assessing political offensiveness across different perspectives and languages, while smaller models struggle with subtle distinctions.

DetailsMotivation: To understand how LLMs evaluate offensiveness in political discourse when adopting specific political and cultural perspectives across different languages.

Method: Evaluated multiple LLMs on multilingual MD-Agreement dataset from 2020 US elections, prompting models to judge tweet offensiveness from varied political personas (far-right, conservative, centrist, progressive) in English, Polish, and Russian contexts.

Result: Larger models with explicit reasoning capabilities (DeepSeek-R1, o4-mini) showed better consistency and sensitivity to ideological/cultural variation, while smaller models often failed to capture subtle distinctions.

Conclusion: Reasoning capabilities significantly improve personalization and interpretability of offensiveness judgments, suggesting they are key for adapting LLMs to nuanced sociopolitical text classification across languages and ideologies.

Abstract: We explore how large language models (LLMs) assess offensiveness in political discourse when prompted to adopt specific political and cultural perspectives. Using a multilingual subset of the MD-Agreement dataset centered on tweets from the 2020 US elections, we evaluate several recent LLMs - including DeepSeek-R1, o4-mini, GPT-4.1-mini, Qwen3, Gemma, and Mistral - tasked with judging tweets as offensive or non-offensive from the viewpoints of varied political personas (far-right, conservative, centrist, progressive) across English, Polish, and Russian contexts. Our results show that larger models with explicit reasoning abilities (e.g., DeepSeek-R1, o4-mini) are more consistent and sensitive to ideological and cultural variation, while smaller models often fail to capture subtle distinctions. We find that reasoning capabilities significantly improve both the personalization and interpretability of offensiveness judgments, suggesting that such mechanisms are key to adapting LLMs for nuanced sociopolitical text classification across languages and ideologies.

[27] Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations

Yihao Wu, Tianrui Wang, Yizhou Peng, Yi-Wen Chao, Xuyi Zhuang, Xinsheng Wang, Shunshun Yin, Ziyang Ma

Main category: cs.CL

TL;DR: This paper presents the first systematic study of biases in spoken dialogue models (SDMs), revealing that closed-source models show lower bias while open-source models are more sensitive to age and gender, with recommendation tasks amplifying cross-group disparities.

DetailsMotivation: While biases in text-based LLMs have been studied, biases in spoken dialogue models with audio input/output remain unexplored. Paralinguistic features like age, gender, and accent can affect model outputs, potentially exacerbating biases in multi-turn conversations with implications for fairness in decision-making and recommendations.

Method: Systematically evaluated biases in speech LLMs using Group Unfairness Score (GUS) for decisions and similarity-based normalized statistics rate (SNSR) for recommendations. Tested both open-source (Qwen2.5-Omni, GLM-4-Voice) and closed-source APIs (GPT-4o Audio, Gemini-2.5-Flash) with multi-turn dialogues involving repeated negative feedback.

Result: Closed-source models generally exhibited lower bias. Open-source models were more sensitive to age and gender. Recommendation tasks amplified cross-group disparities. Biased decisions persisted in multi-turn conversations despite repeated negative feedback.

Conclusion: This work provides foundational insights into biases in end-to-end spoken dialogue models and offers the FairDialogue dataset and evaluation code to facilitate further research towards developing fair and reliable audio-based interactive systems.

Abstract: While biases in large language models (LLMs), such as stereotypes and cultural tendencies in outputs, have been examined and identified, their presence and characteristics in spoken dialogue models (SDMs) with audio input and output remain largely unexplored. Paralinguistic features, such as age, gender, and accent, can affect model outputs; when compounded by multi-turn conversations, these effects may exacerbate biases, with potential implications for fairness in decision-making and recommendation tasks. In this paper, we systematically evaluate biases in speech LLMs and study the impact of multi-turn dialogues with repeated negative feedback. Bias is measured using Group Unfairness Score (GUS) for decisions and similarity-based normalized statistics rate (SNSR) for recommendations, across both open-source models like Qwen2.5-Omni and GLM-4-Voice, as well as closed-source APIs such as GPT-4o Audio and Gemini-2.5-Flash. Our analysis reveals that closed-source models generally exhibit lower bias, while open-source models are more sensitive to age and gender, and recommendation tasks tend to amplify cross-group disparities. We found that biased decisions may persist in multi-turn conversations. This work provides the first systematic study of biases in end-to-end spoken dialogue models, offering insights towards fair and reliable audio-based interactive systems. To facilitate further research, we release the FairDialogue dataset and evaluation code.
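
The exact GUS formulation is not reproduced here, but a group-unfairness style score can be sketched as the mean absolute gap between each group's positive-decision rate and the overall rate (our stand-in assumption, not the paper's definition):

```python
from collections import defaultdict

def group_unfairness(decisions):
    """decisions: list of (group_label, 0/1 decision) pairs."""
    by_group = defaultdict(list)
    for group, d in decisions:
        by_group[group].append(d)
    overall = sum(d for _, d in decisions) / len(decisions)
    rates = {g: sum(v) / len(v) for g, v in by_group.items()}
    # Mean absolute deviation of group rates from the overall rate.
    return sum(abs(r - overall) for r in rates.values()) / len(rates)

demo = [("young", 1), ("young", 1), ("elderly", 0), ("elderly", 1)]
print(group_unfairness(demo))  # 0.25: group rates 1.0 and 0.5 vs overall 0.75
```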

Oumar Kane, Mouhamad M. Allaya, Dame Samb, Mamadou Bousso

Main category: cs.CL

TL;DR: AI and LLMs applied to improve access to Senegalese legal texts, extracting 7,967 articles and creating a graph database with 2,872 nodes and 10,774 relationships to visualize legal interconnections.

DetailsMotivation: To address difficulties in extracting and organizing legal documents in Senegal's judicial system and improve access to judicial information for citizens and legal professionals.

Method: Used advanced triple extraction techniques with models like GPT-4o, GPT-4, and Mistral-Large to identify relationships and metadata, focusing on the Land and Public Domain Code.

Result: Successfully extracted 7,967 articles from legal documents and developed a detailed graph database with 2,872 nodes and 10,774 relationships, demonstrating effective relationship identification.

Conclusion: The research creates a solid framework enabling Senegalese citizens and legal professionals to better understand their rights and responsibilities through improved access to legal information.

Abstract: This study examines the application of artificial intelligence (AI) and large language models (LLMs) to improve access to legal texts in Senegal’s judicial system. The emphasis is on the difficulties of extracting and organizing legal documents, highlighting the need for better access to judicial information. The research successfully extracted 7,967 articles from various legal documents, particularly focusing on the Land and Public Domain Code. A detailed graph database was developed, which contains 2,872 nodes and 10,774 relationships, aiding in the visualization of interconnections within legal texts. In addition, advanced triple extraction techniques were utilized for knowledge extraction, demonstrating the effectiveness of models such as GPT-4o, GPT-4, and Mistral-Large in identifying relationships and relevant metadata. Through these technologies, the aim is to create a solid framework that allows Senegalese citizens and legal professionals to more effectively understand their rights and responsibilities.
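
A hedged sketch of the triple-extraction-to-graph step: an LLM is prompted for (subject; relation; object) triples that are then loaded into a graph. The prompt, parser, and `call_llm` stub are illustrative placeholders, not the study's pipeline.

```python
import networkx as nx

PROMPT = ("Extract (subject; relation; object) triples from this legal "
          "article, one per line:\n\n{article}")

def call_llm(prompt: str) -> str:
    # Stand-in for GPT-4o / GPT-4 / Mistral-Large; returns canned triples.
    return "Land Code; governs; public domain\nArticle 12; amends; Article 3"

def parse_triples(text: str):
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(";")]
        if len(parts) == 3:
            yield tuple(parts)

graph = nx.MultiDiGraph()
for subj, rel, obj in parse_triples(call_llm(PROMPT.format(article="..."))):
    graph.add_edge(subj, obj, relation=rel)

print(graph.number_of_nodes(), graph.number_of_edges())  # 4 nodes, 2 edges
```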

[29] Modeling the language cortex with form-independent and enriched representations of sentence meaning reveals remarkable semantic abstractness

Shreya Saha, Shurui Li, Greta Tuckute, Yuanning Li, Ru-Yuan Zhang, Leila Wehbe, Evelina Fedorenko, Meenakshi Khosla

Main category: cs.CL

TL;DR: The paper demonstrates that aggregating multiple visual representations and paraphrases of sentences improves prediction of language cortex responses, revealing abstract meaning representations.

DetailsMotivation: To investigate whether the language cortex contains abstract, form-independent meaning representations by examining neural responses to sentences.

Method: Modeled neural responses using vision and language model embeddings, generated multiple images for sentences and aggregated their embeddings, averaged embeddings across paraphrases, and enriched paraphrases with contextual details.

Result: Aggregating across multiple generated images and paraphrases improved prediction accuracy, with enriched paraphrases sometimes surpassing predictions based on original sentences.

Conclusion: The language cortex maintains highly abstract, form-independent meaning representations that are richer than those in language models.

Abstract: The human language system represents both linguistic forms and meanings, but the abstractness of the meaning representations remains debated. Here, we searched for abstract representations of meaning in the language cortex by modeling neural responses to sentences using representations from vision and language models. When we generate images corresponding to sentences and extract vision model embeddings, we find that aggregating across multiple generated images yields increasingly accurate predictions of language cortex responses, sometimes rivaling large language models. Similarly, averaging embeddings across multiple paraphrases of a sentence improves prediction accuracy compared to any single paraphrase. Enriching paraphrases with contextual details that may be implicit (e.g., augmenting “I had a pancake” to include details like “maple syrup”) further increases prediction accuracy, even surpassing predictions based on the embedding of the original sentence, suggesting that the language system maintains richer and broader semantic representations than language models. Together, these results demonstrate the existence of highly abstract, form-independent meaning representations within the language cortex.
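
The aggregation idea can be sketched with synthetic data: averaging embeddings across paraphrases cancels form-specific noise, which should improve a ridge model's prediction of form-independent responses. The data, sizes, and ridge setup below are toy assumptions, not the paper's neural data or encoding pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
n_sent, n_para, d, n_voxels = 100, 5, 32, 10
true_meaning = rng.normal(size=(n_sent, d))
# Each paraphrase embedding = shared meaning + form-specific noise.
paraphrase_emb = true_meaning[:, None, :] + 0.8 * rng.normal(size=(n_sent, n_para, d))
# Simulated responses depend only on meaning (form-independent target).
responses = true_meaning @ rng.normal(size=(d, n_voxels))

single = paraphrase_emb[:, 0, :]          # one paraphrase per sentence
averaged = paraphrase_emb.mean(axis=1)    # aggregate across paraphrases

for name, X in [("single", single), ("averaged", averaged)]:
    model = Ridge(alpha=1.0).fit(X[:80], responses[:80])
    # Averaging typically yields the higher held-out R^2.
    print(name, round(model.score(X[80:], responses[80:]), 3))
```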

[30] DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding

Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, Jun Wang

Main category: cs.CL

TL;DR: DiffuSpec is a training-free framework that uses diffusion language models to generate multi-token drafts in a single forward pass for speculative decoding, achieving up to 3x speedup over autoregressive methods.

DetailsMotivation: To overcome the latency limitations of autoregressive decoding in large language models, where each token requires a serial forward pass, by enabling parallel multi-token drafting.

Method: Uses pretrained diffusion language models to generate token lattices in one pass, with causal-consistency path search to extract left-to-right paths and adaptive draft-length controller to optimize proposal size.

Result: Achieves up to 3x wall-clock speedup across benchmarks compared to autoregressive drafters while maintaining compatibility with standard verification.

Conclusion: Diffusion-based drafting presents a robust alternative to autoregressive drafters for speculative decoding, significantly reducing latency without requiring retraining.

Abstract: As large language models (LLMs) scale up, accuracy improves, but the autoregressive (AR) nature of decoding increases latency since each token requires a serial forward pass. Speculative decoding addresses this by employing a fast drafter to propose multi-token drafts, which are then verified in parallel by the target model. However, many deployments still rely on AR drafters, where sequential passes limit wall-clock gains. We revisit the drafting stage and present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass, while remaining compatible with standard AR verifiers. Because DLM drafts are generated under bidirectional conditioning, parallel per-position candidates form a token lattice in which the locally highest-probability token at each position need not form a causal left-to-right path. Moreover, DLM drafting requires pre-specifying a draft length, inducing a speed-quality trade-off. To address these challenges, we introduce two practical components: (i) a causal-consistency path search (CPS) over this lattice that extracts a left-to-right path aligned with AR verification; and (ii) an adaptive draft-length (ADL) controller that adjusts next proposal size based on recent acceptance feedback and realized generated length. Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
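
The causal-consistency path search can be illustrated as a small dynamic program over the draft lattice, with a toy bigram scorer standing in for AR consistency (an assumption on our part, not the paper's verifier):

```python
def best_path(lattice, bigram_logp):
    # Viterbi-style DP: combine per-position draft log-probs with a
    # left-to-right consistency score between adjacent tokens.
    n = len(lattice)
    prev = [lp for _, lp in lattice[0]]
    back = [[0] * len(level) for level in lattice]
    for pos in range(1, n):
        cur = []
        for i, (tok, lp) in enumerate(lattice[pos]):
            scores = [prev[j] + bigram_logp(lattice[pos - 1][j][0], tok)
                      for j in range(len(lattice[pos - 1]))]
            back[pos][i] = max(range(len(scores)), key=scores.__getitem__)
            cur.append(scores[back[pos][i]] + lp)
        prev = cur
    i = max(range(len(prev)), key=prev.__getitem__)
    path = []
    for pos in range(n - 1, -1, -1):     # trace the argmax path backwards
        path.append(lattice[pos][i][0])
        i = back[pos][i]
    return list(reversed(path))

# Toy lattice: position-wise argmax would pick "category", but the
# consistency-aware search recovers a coherent left-to-right draft.
lattice = [[("the", -0.1), ("a", -0.3)],
           [("cat", -0.5), ("category", -0.2)],
           [("sat", -0.4), ("sit", -0.3)]]

def bigram_logp(a, b):
    good = {("the", "cat"), ("cat", "sat"), ("a", "cat")}
    return -0.1 if (a, b) in good else -2.0

print(best_path(lattice, bigram_logp))  # ['the', 'cat', 'sat']
```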

[31] Emission-GPT: A domain-specific language model agent for knowledge retrieval, emission inventory and data analysis

Jiashu Ye, Tong Wu, Weiwen Chen, Hao Zhang, Zeteng Lin, Xingxing Li, Shujuan Weng, Manni Zhu, Xin Yuan, Xinlong Hong, Jingjie Li, Junyu Zheng, Zhijiong Huang, Jing Tang

Main category: cs.CL

TL;DR: Emission-GPT is a knowledge-enhanced LLM agent for atmospheric emissions domain, built on 10,000+ documents, enabling natural language querying, visualization, and analysis of emissions data with automated workflows.

DetailsMotivation: Current emission knowledge is fragmented and specialized, making it difficult for non-experts to access and interpret emissions data, hindering research and management efforts.

Method: Built on curated knowledge base of 10,000+ documents using prompt engineering and question completion techniques, with modular architecture supporting interactive natural language queries and data analysis.

Result: Case study in Guangdong Province shows Emission-GPT can extract key insights like point source distributions and sectoral trends directly from raw data using simple prompts.

Conclusion: Emission-GPT serves as a foundational tool for next-generation emission inventory development and scenario-based assessment, automating traditionally manual workflows.

Abstract: Improving air quality and addressing climate change rely on accurate understanding and analysis of air pollutant and greenhouse gas emissions. However, emission-related knowledge is often fragmented and highly specialized, while existing methods for accessing and compiling emissions data remain inefficient. These issues hinder the ability of non-experts to interpret emissions information, posing challenges to research and management. To address this, we present Emission-GPT, a knowledge-enhanced large language model agent tailored for the atmospheric emissions domain. Built on a curated knowledge base of over 10,000 documents (including standards, reports, guidebooks, and peer-reviewed literature), Emission-GPT integrates prompt engineering and question completion to support accurate domain-specific question answering. Emission-GPT also enables users to interactively analyze emissions data via natural language, such as querying and visualizing inventories, analyzing source contributions, and recommending emission factors for user-defined scenarios. A case study in Guangdong Province demonstrates that Emission-GPT can extract key insights, such as point source distributions and sectoral trends, directly from raw data with simple prompts. Its modular and extensible architecture facilitates automation of traditionally manual workflows, positioning Emission-GPT as a foundational tool for next-generation emission inventory development and scenario-based assessment.

[32] Spiral of Silence in Large Language Model Agents

Mingze Zhong, Meng Fang, Zijing Shi, Yuxuan Huang, Shunfeng Zheng, Yali Du, Ling Chen, Jun Wang

Main category: cs.CL

TL;DR: This paper investigates whether Spiral of Silence dynamics can emerge in LLM collectives through statistical language generation, developing an evaluation framework to test different signal conditions.

DetailsMotivation: The Spiral of Silence theory was developed for human societies, but it's unclear if similar dynamics can emerge in LLM collectives where classical psychological explanations don't directly apply.

Method: Proposed an evaluation framework with four controlled conditions varying ‘History’ and ‘Persona’ signals, using trend tests (Mann-Kendall, Spearman’s rank) and concentration measures (kurtosis, interquartile range) to assess opinion dynamics.

Result: History and persona together produce strong majority dominance replicating SoS patterns; history alone induces strong anchoring; persona alone fosters diverse but uncorrelated opinions without SoS dynamics.

Conclusion: The work bridges computational sociology and responsible AI design, highlighting the need to monitor and mitigate emergent conformity in LLM-agent systems.

Abstract: The Spiral of Silence (SoS) theory holds that individuals with minority views often refrain from speaking out for fear of social isolation, enabling majority positions to dominate public discourse. When the ‘agents’ are large language models (LLMs), however, the classical psychological explanation is not directly applicable, since SoS was developed for human societies. This raises a central question: can SoS-like dynamics nevertheless emerge from purely statistical language generation in LLM collectives? We propose an evaluation framework for examining SoS in LLM agents. Specifically, we consider four controlled conditions that systematically vary the availability of ‘History’ and ‘Persona’ signals. Opinion dynamics are assessed using trend tests such as Mann-Kendall and Spearman’s rank, along with concentration measures including kurtosis and interquartile range. Experiments across open-source and closed-source models show that history and persona together produce strong majority dominance and replicate SoS patterns; history signals alone induce strong anchoring; and persona signals alone foster diverse but uncorrelated opinions, indicating that without historical anchoring, SoS dynamics cannot emerge. The work bridges computational sociology and responsible AI design, highlighting the need to monitor and mitigate emergent conformity in LLM-agent systems.
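
A sketch of the trend statistics named above, using a hand-rolled Mann-Kendall S statistic and scipy's Spearman rank correlation on a toy opinion-share series (the series and its values are invented):

```python
from itertools import combinations
from scipy.stats import spearmanr

def mann_kendall_s(series):
    # S > 0 suggests an increasing trend, S < 0 a decreasing one.
    return sum((b > a) - (b < a) for a, b in combinations(series, 2))

majority_share = [0.52, 0.55, 0.61, 0.60, 0.68, 0.74, 0.79]  # per dialogue round
s = mann_kendall_s(majority_share)
rho, p = spearmanr(range(len(majority_share)), majority_share)
print(s, round(rho, 3))  # rising majority dominance: S=19, rho ~ 0.964
```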

[33] ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

Haojie Ouyang, Jianwei Lv, Lei Ren, Chen Wei, Xiaojie Wang, Fangxiang Feng

Main category: cs.CL

TL;DR: ChunkLLM is a lightweight training framework that addresses Transformer computational inefficiency through chunk-based attention with QK Adapter and Chunk Adapter components, achieving significant speedup while maintaining performance.

DetailsMotivation: Transformer models suffer from quadratic computational complexity in self-attention, and existing block selection/compression methods have issues with semantic incompleteness or poor training-inference efficiency.

Method: Proposes ChunkLLM with QK Adapter for feature compression and chunk attention, and Chunk Adapter for boundary detection. Uses attention distillation for training, keeps backbone frozen, and triggers chunk selection only at boundaries during inference.

Result: Achieves comparable performance on short-text benchmarks, maintains 98.64% performance on long-context benchmarks with 48.58% KV cache retention, and achieves a 4.48x speedup on 120K-token texts compared to the vanilla Transformer.

Conclusion: ChunkLLM effectively addresses Transformer computational inefficiency while preserving model performance, making it a practical solution for long-context processing.

Abstract: Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies due to the self-attention’s quadratic complexity with input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but they either have issues with semantic incompleteness or poor training-inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: QK Adapter (Q-Adapter and K-Adapter) and Chunk Adapter. The former is attached to each Transformer layer, serving dual purposes of feature compression and chunk attention acquisition. The latter operates at the bottommost layer of the model, functioning to detect chunk boundaries by leveraging contextual semantic information. During the training phase, the parameters of the backbone remain frozen, with only the QK Adapter and Chunk Adapter undergoing training. Notably, we design an attention distillation method for training the QK Adapter, which enhances the recall rate of key chunks. During the inference phase, chunk selection is triggered exclusively when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. In particular, ChunkLLM attains a maximum speedup of 4.48x over the vanilla Transformer when processing 120K-token texts.
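
The boundary-triggered selection idea can be sketched as a decoding loop in which chunk re-selection runs only when a boundary detector fires; the detector and scorer below are crude stubs standing in for the Chunk Adapter and QK Adapter.

```python
def is_chunk_boundary(token: str) -> bool:      # stand-in for the Chunk Adapter
    return token in {".", "?", "!"}

def select_top_chunks(chunks, k=2):             # stand-in for QK Adapter scoring
    return sorted(chunks, key=len, reverse=True)[:k]

chunks, selected = [[]], None
for token in "The sky is blue . Grass is green . What color is grass ?".split():
    chunks[-1].append(token)
    if is_chunk_boundary(token):
        selected = select_top_chunks(chunks)    # refresh only at boundaries
        chunks.append([])                       # start the next chunk
print(selected)
```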

[34] A Cross-Lingual Analysis of Bias in Large Language Models Using Romanian History

Matei-Iulian Cocu, Răzvan-Cosmin Cristia, Adrian Marius Dumitran

Main category: cs.CL

TL;DR: This study examines biases in Large Language Models by testing their responses to controversial Romanian historical questions across different languages and response formats, revealing significant inconsistencies in model outputs.

DetailsMotivation: To assess LLM biases in historical narratives, recognizing that history is often presented through altered perspectives influenced by culture and state ideals, and that LLMs trained on biased datasets can instill lack of neutrality in users.

Method: Three-stage research process testing multiple LLMs on controversial Romanian historical questions across languages and contexts, comparing binary answers with numeric scale responses to see if response format influences answers.

Result: Binary response stability is relatively high but not perfect, varies by language. Models often flip stance across languages or formats; numeric ratings frequently diverge from initial binary choices. Most consistent models are not always the most accurate or neutral.

Conclusion: LLMs show predisposition to inconsistencies based on language context and response format, highlighting the need for awareness of these biases when using models for historical or sensitive topics.

Abstract: In this case study, we select a set of controversial Romanian historical questions and ask multiple Large Language Models to answer them across languages and contexts, in order to assess their biases. Besides being a study performed mainly for educational purposes, the motivation also lies in the recognition that history is often presented through altered perspectives, primarily influenced by the culture and ideals of a state, even through large language models. Since these models are often trained on data sets that may contain ambiguities, the resulting lack of neutrality is subsequently instilled in users. The research process was carried out in three stages, to confirm the idea that the expected response format can influence, to a certain extent, the response itself: after giving an affirmative answer to a question, an LLM could shift its stance when asked the same question again but told to respond with a numerical value on a scale. Results show that binary response stability is relatively high but far from perfect, and varies by language. Models often flip stance across languages or between formats; numeric ratings frequently diverge from the initial binary choice, and the most consistent models are not always those judged most accurate or neutral. Our research brings to light the predisposition of models to such inconsistencies within a specific linguistic contextualization of the question asked.

[35] Beyond Manuals and Tasks: Instance-Level Context Learning for LLM Agents

Kuntai Cai, Juncheng Liu, Xianglin Yang, Zhaojie Niu, Xiaokui Xiao, Xing Chen

Main category: cs.CL

TL;DR: The paper introduces Instance-Level Context Learning (ILCL) as a crucial third type of context for LLM agents, addressing the gap between environment-level manuals and task-level guidance by capturing verifiable, reusable facts about specific environment instances.

DetailsMotivation: Current LLM agents often fail in complex tasks because they lack instance-level context - verifiable facts about specific environments (object locations, recipes, local rules) that are essential for success but require efficient exploration and validation.

Method: Proposes a task-agnostic ILCL method using guided exploration with a compact TODO forest for action prioritization and a lightweight plan-act-extract loop to automatically produce high-precision, reusable context documents.

Result: Experiments across TextWorld, ALFWorld, and Crafter show significant improvements: ReAct’s success rate in TextWorld increased from 37% to 95%, and IGE improved from 81% to 95%, demonstrating consistent gains in both success and efficiency.

Conclusion: By transforming one-off exploration into persistent, reusable knowledge, ILCL complements existing contexts to enable more reliable and efficient LLM agents, amortizing initial exploration costs across multiple downstream tasks.

Abstract: Large language model (LLM) agents typically receive two kinds of context: (i) environment-level manuals that define interaction interfaces and global rules, and (ii) task-level guidance or demonstrations tied to specific goals. In this work, we identify a crucial but overlooked third type of context, instance-level context, which consists of verifiable and reusable facts tied to a specific environment instance, such as object locations, crafting recipes, and local rules. We argue that the absence of instance-level context is a common source of failure for LLM agents in complex tasks, as success often depends not only on reasoning over global rules or task prompts but also on making decisions based on precise and persistent facts. Acquiring such context requires more than memorization: the challenge lies in efficiently exploring, validating, and formatting these facts under tight interaction budgets. We formalize this problem as Instance-Level Context Learning (ILCL) and introduce our task-agnostic method to solve it. Our method performs a guided exploration, using a compact TODO forest to intelligently prioritize its next actions and a lightweight plan-act-extract loop to execute them. This process automatically produces a high-precision context document that is reusable across many downstream tasks and agents, thereby amortizing the initial exploration cost. Experiments across TextWorld, ALFWorld, and Crafter demonstrate consistent gains in both success and efficiency: for instance, ReAct’s mean success rate in TextWorld rises from 37% to 95%, while IGE improves from 81% to 95%. By transforming one-off exploration into persistent, reusable knowledge, our method complements existing contexts to enable more reliable and efficient LLM agents.

[36] Training Dynamics of Parametric and In-Context Knowledge Utilization in Language Models

Minsung Kim, Dong-Kyum Kim, Jea Kwon, Nakyeong Yang, Kyomin Jung, Meeyoung Cha

Main category: cs.CL

TL;DR: Training conditions significantly influence how language models arbitrate between parametric knowledge and in-context knowledge, with intra-document repetition and exposure to inconsistent information fostering robust arbitration strategies.

DetailsMotivation: To understand how training conditions shape knowledge-arbitration strategies in LLMs, preventing wasted computational resources and undesirable behaviors in retrieval-augmented generation systems.

Method: Controlled study training transformer-based language models on synthetic biographies corpus while systematically varying training conditions including intra-document repetition, inconsistent information, and distributional skew.

Result: Intra-document repetition develops both parametric and in-context capabilities. Training on inconsistent information or distributional skew encourages robust strategies for leveraging both knowledge sources. Non-ideal properties like inconsistency are important for learning robust arbitration.

Conclusion: Training conditions crucially shape knowledge arbitration, and non-ideal properties should be embraced rather than removed to develop models that harmoniously integrate parametric and in-context knowledge.

Abstract: Large language models often encounter conflicts between in-context knowledge retrieved at inference time and parametric knowledge acquired during pretraining. Models that accept external knowledge uncritically are vulnerable to misinformation, whereas models that adhere rigidly to parametric knowledge fail to benefit from retrieval. Despite the widespread adoption of retrieval-augmented generation, we still lack a systematic understanding of what shapes knowledge-arbitration strategies during training. This gap risks producing pretrained models with undesirable arbitration behaviors and, consequently, wasting substantial computational resources after the pretraining budget has already been spent. To address this problem, we present the first controlled study of how training conditions influence models’ use of in-context and parametric knowledge, and how they arbitrate between them. We train transformer-based language models on a synthetic biographies corpus while systematically controlling various conditions. Our experiments reveal that intra-document repetition of facts fosters the development of both parametric and in-context capabilities. Moreover, training on a corpus that contains inconsistent information or distributional skew encourages models to develop robust strategies for leveraging parametric and in-context knowledge. Rather than viewing these non-ideal properties as artifacts to remove, our results indicate that they are important for learning robust arbitration. These insights offer concrete, empirical guidance for pretraining models that harmoniously integrate parametric and in-context knowledge.

[37] Pretraining with hierarchical memories: separating long-tail and common knowledge

Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel

Main category: cs.CL

TL;DR: Small language models augmented with hierarchical parametric memory banks achieve comparable performance to larger models by storing world knowledge in memory parameters while keeping the core model small.

DetailsMotivation: Current language models require scaling parameters to store world knowledge, which is inefficient and impractical for edge devices with limited memory and compute.

Method: Memory-augmented architecture with small language models that access large hierarchical parametric memory banks, fetching context-dependent memory blocks during pretraining and inference.

Result: A 160M-parameter model with 18M memory from a 4.6B memory bank performs comparably to models with over 2x parameters. Hierarchical feed-forward memories work robustly across transformer architectures.

Conclusion: Memory-augmented small language models provide an efficient alternative to parameter scaling, enabling high performance with reduced model size suitable for edge devices.

Abstract: The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming with a memory-augmented architecture and a pretraining strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. During pretraining and inference, we fetch a small, context-dependent memory block and add it to the model. Our pretraining learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. Through trillion-token-scale experiments, we show significant gains: a 160M-parameter model augmented with an 18M-parameter memory fetched from a 4.6B memory bank obtains comparable performance to a regular model with more than 2x the parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers, scaling them to over 21B parameters. We find that our proposed hierarchical feed-forward memories work robustly across transformer architectures, whether added during pretraining or post-hoc.
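
A hedged sketch of fetching a context-dependent block from a hierarchical memory bank: score branches against a context embedding and descend to a leaf, loading only that block. The tree structure, keys, and sizes are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 32

def make_bank(depth, branching):
    if depth == 0:   # leaf: a small parameter block plus its routing key
        return {"key": rng.normal(size=d), "params": rng.normal(size=(d, d))}
    return {"key": rng.normal(size=d),
            "children": [make_bank(depth - 1, branching) for _ in range(branching)]}

def fetch(node, context):
    # Greedy descent: follow the child whose key best matches the context.
    while "children" in node:
        sims = [child["key"] @ context for child in node["children"]]
        node = node["children"][int(np.argmax(sims))]
    return node["params"]    # only this block is added to the model

bank = make_bank(depth=3, branching=4)   # 4^3 = 64 leaf memory blocks
block = fetch(bank, rng.normal(size=d))
print(block.shape)  # (32, 32): a single small block loaded at inference
```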

[38] Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems

Aakriti Agrawal, Rohith Aralikatti, Anirudh Satheesh, Souradip Chakraborty, Amrit Singh Bedi, Furong Huang

Main category: cs.CL

TL;DR: A novel method for selecting the best response from multiple LLMs using calibrated log-likelihood scores, achieving 3-5% improvements across various benchmarks without costly external verifiers.

DetailsMotivation: Existing approaches for selecting reliable LLM responses rely on costly external verifiers, human evaluators, or self-consistency techniques, while multi-LLM systems underperform despite producing more diverse responses.

Method: Proposes a principled and computationally efficient method using calibrated log-likelihood scores to implicitly leverage the inherent knowledge and confidence of multiple LLMs for response selection.

Result: Demonstrates improvements of approximately 4%, 3%, and 5% on GSM8K, MMLU (6 subsets), and ARC datasets respectively, in both debate and non-debate settings.

Conclusion: The calibrated log-likelihood method provides an effective and efficient way to select the best response from multiple LLMs, outperforming existing approaches while being computationally efficient.

Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. While multi-LLM systems produce more diverse responses than single models and thus have greater potential, they often underperform compared to single LLM self-consistency. We propose a principled, novel and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score, implicitly leveraging the inherent knowledge and confidence of these models. Our method demonstrates improvements of approx. 4%, 3%, and 5% across both debate (multi-round LLM discussions) and non-debate (Best-of-N with multiple LLMs) settings on GSM8K, MMLU (6 subsets), and ARC datasets respectively.
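
A minimal sketch of the selection rule, approximating "calibrated" by length normalization plus a per-model offset (our assumption; the paper's calibration may differ). `token_logprobs` is assumed to come from each model's scoring API.

```python
def calibrated_score(token_logprobs, model_offset=0.0):
    # Mean per-token log-likelihood avoids penalizing longer answers;
    # the offset stands in for per-model calibration on held-out data.
    return sum(token_logprobs) / len(token_logprobs) + model_offset

candidates = [
    ("model_a", "The answer is 42.", [-0.2, -0.1, -0.3, -0.2], 0.00),
    ("model_b", "It is 41.",         [-0.1, -0.4, -0.9],       -0.05),
]
best = max(candidates, key=lambda c: calibrated_score(c[2], model_offset=c[3]))
print(best[0], best[1])  # model_a wins on normalized confidence
```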

[39] Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation

Haoyue Bai, Haoyu Wang, Shengyu Chen, Zhengzhang Chen, Lu-An Tang, Wei Cheng, Haifeng Chen, Yanjie Fu

Main category: cs.CL

TL;DR: A rule-driven routing framework that intelligently selects between database and document retrieval paths for LLM question answering, achieving better accuracy and efficiency than static or learned routing methods.

DetailsMotivation: LLMs struggle with domain-specific QA requiring accurate, up-to-date information. While RAG helps, existing systems focus on documents and overlook relational databases which provide precise, queryable factual information crucial in domains like finance and healthcare.

Method: Proposes a framework with: routing agent that scores augmentation paths using explicit rules; rule-making expert agent that refines rules using QA feedback; and path-level meta-cache that reuses routing decisions for similar queries to reduce latency and cost.

Result: Experiments on three QA benchmarks show the framework consistently outperforms static strategies and learned routing baselines, achieving higher accuracy while maintaining moderate computational cost.

Conclusion: The rule-driven routing framework effectively balances database and document retrieval based on query characteristics, demonstrating superior performance through systematic source selection and adaptive rule refinement.

Abstract: Large Language Models (LLMs) have shown remarkable performance on general Question Answering (QA), yet they often struggle in domain-specific scenarios where accurate and up-to-date information is required. Retrieval-Augmented Generation (RAG) addresses this limitation by enriching LLMs with external knowledge, but existing systems primarily rely on unstructured documents, while largely overlooking relational databases, which provide precise, timely, and efficiently queryable factual information, serving as indispensable infrastructure in domains such as finance, healthcare, and scientific research. Motivated by this gap, we conduct a systematic analysis that reveals three central observations: (i) databases and documents offer complementary strengths across queries, (ii) naively combining both sources introduces noise and cost without consistent accuracy gains, and (iii) selecting the most suitable source for each query is crucial to balance effectiveness and efficiency. We further observe that query types show consistent regularities in their alignment with retrieval paths, suggesting that routing decisions can be effectively guided by systematic rules that capture these patterns. Building on these insights, we propose a rule-driven routing framework. A routing agent scores candidate augmentation paths based on explicit rules and selects the most suitable one; a rule-making expert agent refines the rules over time using QA feedback to maintain adaptability; and a path-level meta-cache reuses past routing decisions for semantically similar queries to reduce latency and cost. Experiments on three QA benchmarks demonstrate that our framework consistently outperforms static strategies and learned routing baselines, achieving higher accuracy while maintaining moderate computational cost.
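
The routing loop can be sketched with a toy rule table and a cache keyed on a normalized query (a crude stand-in for the semantic path-level meta-cache); the rules themselves are invented examples, not the paper's learned rule set.

```python
RULES = [
    # (predicate over the query, augmentation path, weight)
    (lambda q: any(w in q for w in ("average", "total", "how many")), "database", 2.0),
    (lambda q: "latest price" in q, "database", 1.5),
    (lambda q: any(w in q for w in ("why", "explain", "describe")), "documents", 2.0),
]

cache = {}  # normalized query -> path (stand-in for a semantic meta-cache)

def route(query: str) -> str:
    key = " ".join(sorted(set(query.lower().split())))
    if key in cache:                      # reuse a past routing decision
        return cache[key]
    scores = {"database": 0.0, "documents": 0.0}
    for pred, path, weight in RULES:
        if pred(query.lower()):
            scores[path] += weight
    path = max(scores, key=scores.get)
    cache[key] = path
    return path

print(route("How many patients enrolled in 2023?"))  # database
print(route("Explain the mechanism of action"))      # documents
```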

[40] KnowledgeSmith: Uncovering Knowledge Updating in LLMs with Model Editing and Unlearning

Yinyi Luo, Zhexian Zhou, Hao Chen, Kai Qiu, Marios Savvides, Yixuan Li, Jindong Wang

Main category: cs.CL

TL;DR: KnowledgeSmith is a unified framework that systematically analyzes how large language models update knowledge through editing and unlearning, revealing insights about knowledge propagation, plasticity scaling, and consistency-robustness trade-offs.

DetailsMotivation: To understand the knowledge updating mechanism of LLMs, which remains largely unexplored due to insufficient, isolated, and small-scale evaluation, and to investigate whether LLMs update knowledge similarly to humans and how editing/unlearning differ with increasing training data.

Method: Proposes KnowledgeSmith framework that casts editing and unlearning as constrained optimization problems, and develops an automatic dataset generator providing structured interventions across multiple graph levels and data scales for controlled studies of knowledge modification strategies.

Result: Extensive experiments reveal nuanced insights: LLMs do not update knowledge similarly to humans across different knowledge levels, and there exists a consistency-capacity trade-off in knowledge updating mechanisms.

Conclusion: The findings provide suggestions for designing more reliable and scalable knowledge updating strategies for large language models, with implications for improving model adaptability and robustness.

Abstract: Knowledge editing and machine unlearning are two popular approaches for large language models (LLMs) to stay up-to-date. However, the knowledge updating mechanism of LLMs remains largely unexplored due to insufficient, isolated, and small-scale evaluation. For instance, are LLMs similar to humans in modifying certain knowledge? How do editing and unlearning differ as training data increases? This paper proposes KnowledgeSmith, a unified framework to systematically understand the updating mechanism of LLMs. We first cast editing and unlearning as instances of one constrained optimization problem. Then, we propose an automatic dataset generator that provides structured interventions across multiple graph levels and data scales, enabling controlled studies of how different modification strategies propagate through model knowledge. Extensive experiments reveal nuanced insights into knowledge propagation, plasticity scaling, consistency, and robustness. For instance, our results show that LLMs do not exhibit human-like updating across different levels of knowledge, and there exists a consistency-capacity trade-off. We hope our findings can offer suggestions for the design of more reliable and scalable strategies. Code: https://github.com/AIFrontierLab/KnowledgeSmith.git

[41] Retrieval and Augmentation of Domain Knowledge for Text-to-SQL Semantic Parsing

Manasi Patwardhan, Ayush Agarwal, Shabbirhussain Bhaisaheb, Aseem Arora, Lovekesh Vig, Sunita Sarawagi

Main category: cs.CL

TL;DR: A framework for improving LLM-based NL-to-SQL translation by using structured domain statements at database level instead of query-specific hints, with substring-based retrieval for better accuracy.

DetailsMotivation: LLM performance for NL-to-SQL translation varies across databases due to domain-specific vocabulary, and existing benchmarks use unrealistic query-specific textual hints for domain knowledge.

Method: Propose systematic framework using structured domain statements at database level with substring-based retrieval of relevant statements for user queries.

Result: Evaluation on 11 realistic DB schemas across 5 LLMs shows DB-level structured domain statements are more practical and accurate than query-specific approaches, and substring-based retrieval provides significantly higher accuracy.

Conclusion: Database-level structured domain statements with substring-based retrieval offer superior performance for NL-to-SQL translation compared to existing methods.

Abstract: The performance of Large Language Models (LLMs) for translating Natural Language (NL) queries into SQL varies significantly across databases (DBs). NL queries are often expressed using a domain-specific vocabulary, and mapping these to the correct SQL requires an understanding of the embedded domain expressions and their relationship to the DB schema structure. Existing benchmarks rely on unrealistic, ad-hoc, query-specific textual hints for expressing domain knowledge. In this paper, we propose a systematic framework for associating structured domain statements at the database level. We present retrieval of relevant structured domain statements given a user query using sub-string level match. We evaluate on eleven realistic DB schemas covering diverse domains across five open-source and proprietary LLMs and demonstrate that (1) DB-level structured domain statements are more practical and accurate than existing ad-hoc, query-specific textual domain statements, and (2) our sub-string-match-based retrieval of relevant domain statements provides significantly higher accuracy than other retrieval approaches.
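
Sub-string level retrieval reduces to a simple membership test over trigger phrases; the domain statements below are invented examples, not the benchmark's.

```python
# Each statement carries a trigger phrase; a statement is retrieved when
# its trigger occurs as a substring of the user query.
DOMAIN_STATEMENTS = [
    ("active customer", "An 'active customer' has an order in the last 90 days."),
    ("churn", "'churn' means subscription status changed to cancelled."),
    ("net revenue", "'net revenue' = gross revenue - refunds."),
]

def retrieve(query: str):
    q = query.lower()
    return [stmt for trigger, stmt in DOMAIN_STATEMENTS if trigger in q]

print(retrieve("List all active customers and their net revenue"))
```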

[42] Words That Make Language Models Perceive

Sophie L. Wang, Phillip Isola, Brian Cheung

Main category: cs.CL

TL;DR: Sensory prompting can activate modality-appropriate representations in text-only LLMs, bringing them into closer alignment with specialist vision and audio encoders.

DetailsMotivation: To test whether explicit sensory prompting can surface latent multimodal structure in text-only LLMs, despite their lack of direct perceptual experience.

Method: Using sensory prompts (e.g., ‘see’ or ‘hear’) to cue the model to resolve next-token predictions as if conditioned on latent visual or auditory evidence that is never actually supplied.

Result: Lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs.

Conclusion: Text-only LLMs contain latent multimodal representations that can be surfaced through simple sensory prompting, demonstrating their implicit understanding of perceptual regularities encoded in language.

Abstract: Large language models (LLMs) trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text-only LLM into closer representational alignment with specialist vision and audio encoders. When a sensory prompt tells the model to ‘see’ or ‘hear’, it cues the model to resolve its next-token predictions as if they were conditioned on latent visual or auditory evidence that is never actually supplied. Our findings reveal that lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs.

[43] Unraveling Syntax: How Language Models Learn Context-Free Grammars

Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio

Main category: cs.CL

TL;DR: A framework using probabilistic context-free grammars (PCFGs) to study transformer learning dynamics, revealing parallel learning across subgrammars and challenges with deep recursion.

DetailsMotivation: To understand how language models acquire syntax, given that current large models achieve impressive results but their learning dynamics remain poorly understood.

Method: Train small models on synthetic languages generated from PCFGs, enabling precise control over grammar complexity, recursion depth, and subgrammar structure. Prove recursive formulae for training loss and KL divergence.

Result: Transformers reduce loss across all subgrammars in parallel (unlike children’s sequential learning), subgrammar pretraining helps smaller models, and models struggle with deep recursive structures.

Conclusion: PCFGs provide a versatile testbed for probing language model learning dynamics, revealing fundamental challenges in how neural networks represent hierarchical syntax.

Abstract: We introduce a new framework for understanding how language models acquire syntax. While large models achieve impressive results, little is known about their learning dynamics. Our approach starts with the observation that most domains of interest, such as natural language syntax, coding languages, arithmetic problems, are captured by probabilistic context-free grammars (PCFGs). We study the learning dynamics of small models trained on synthetic languages generated from PCFGs, enabling precise control over grammar complexity, recursion depth, and subgrammar structure. We prove several general, recursive formulae for the training loss and Kullback-Leibler divergence over the subgrammar structure of a PCFG. Empirically, we find that unlike children, who first master simple substructures before progressing to more complex constructions, transformers reduce loss across all subgrammars in parallel. We further show that subgrammar pretraining can improve the final loss for smaller models, and that pretrained models develop internal representations more aligned with the grammar’s substructure. Finally, we demonstrate that models struggle with deeper recursive structures (a limitation even of large language models), revealing fundamental challenges in how neural networks represent hierarchical syntax. Overall, our work initiates the study of the learning dynamics of transformers on PCFGs as a versatile testbed for probing learning in language models, opening a research direction with many open questions.
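
The experimental setup invites a small sketch: sampling synthetic sentences from a PCFG with a recursion cap. The toy grammar below is our own illustration, not the paper's.

```python
import random

GRAMMAR = {
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["the", "N"], 0.7), (["the", "N", "that", "VP"], 0.3)],  # recursion
    "VP": [(["V", "NP"], 0.5), (["V"], 0.5)],
    "N":  [(["cat"], 0.5), (["dog"], 0.5)],
    "V":  [(["saw"], 0.5), (["chased"], 0.5)],
}

def sample(symbol="S", max_depth=8, depth=0):
    if symbol not in GRAMMAR:                 # terminal symbol
        return [symbol]
    rules, weights = zip(*GRAMMAR[symbol])
    if depth >= max_depth:                    # cap recursion: take first rule
        rules, weights = (rules[0],), (1.0,)
    rhs = random.choices(rules, weights=weights)[0]
    return [tok for sym in rhs for tok in sample(sym, max_depth, depth + 1)]

random.seed(0)
for _ in range(3):
    print(" ".join(sample()))
```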

[44] Hierarchical Semantic Retrieval with Cobweb

Anant Gupta, Karthik Singaravadivelan, Zekun Wang

Main category: cs.CL

TL;DR: Cobweb is a hierarchy-aware document retrieval framework that organizes sentence embeddings into a prototype tree and ranks documents through coarse-to-fine traversal, providing multi-granular relevance signals and transparent rationales.

DetailsMotivation: Traditional neural document retrieval treats corpora as flat vector clouds, underusing corpus structure and providing opaque explanations. The authors aim to leverage hierarchical structure for better retrieval and interpretability.

Method: Uses Cobweb framework to organize sentence embeddings into a prototype tree with internal nodes as concept prototypes. Implements two inference approaches: generalized best-first search and lightweight path-sum ranker. Evaluates on MS MARCO and QQP datasets with encoder (BERT/T5) and decoder (GPT-2) representations.

Result: Cobweb matches dot product search performance on strong encoder embeddings while remaining robust when kNN degrades. With GPT-2 vectors, dot product performance collapses but Cobweb still retrieves relevant results. Provides competitive effectiveness, improved robustness to embedding quality, scalability, and interpretable retrieval.

Conclusion: Cobweb framework successfully leverages hierarchical structure for document retrieval, offering competitive performance, robustness across different embedding types, scalability, and interpretability through hierarchical prototypes and retrieval paths.

Abstract: Neural document retrieval often treats a corpus as a flat cloud of vectors scored at a single granularity, leaving corpus structure underused and explanations opaque. We use Cobweb, a hierarchy-aware framework, to organize sentence embeddings into a prototype tree and rank documents via coarse-to-fine traversal. Internal nodes act as concept prototypes, providing multi-granular relevance signals and a transparent rationale through retrieval paths. We instantiate two inference approaches: a generalized best-first search and a lightweight path-sum ranker. We evaluate our approaches on MS MARCO and QQP with encoder (e.g., BERT/T5) and decoder (GPT-2) representations. Our results show that our retrieval approaches match the dot product search on strong encoder embeddings while remaining robust when kNN degrades: with GPT-2 vectors, dot product performance collapses whereas our approaches still retrieve relevant results. Overall, our experiments suggest that Cobweb provides competitive effectiveness, improved robustness to embedding quality, scalability, and interpretable retrieval via hierarchical prototypes.
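
The coarse-to-fine traversal can be sketched as best-first search over a prototype tree, scoring nodes by cosine similarity to their prototypes (a simplification of Cobweb's category-utility machinery; tree construction here is hand-built).

```python
import heapq
import numpy as np

class Node:
    def __init__(self, prototype, children=(), doc_id=None):
        self.prototype = prototype / np.linalg.norm(prototype)
        self.children = list(children)
        self.doc_id = doc_id            # set only on leaves

def best_first_retrieve(root, query, k=2):
    q = query / np.linalg.norm(query)
    frontier = [(-float(root.prototype @ q), 0, root)]  # max-heap via negation
    results, tiebreak = [], 1
    while frontier and len(results) < k:
        score, _, node = heapq.heappop(frontier)
        if node.doc_id is not None:     # leaf reached: emit a document
            results.append((node.doc_id, -score))
            continue
        for child in node.children:     # otherwise expand the best node
            heapq.heappush(frontier, (-float(child.prototype @ q), tiebreak, child))
            tiebreak += 1
    return results

rng = np.random.default_rng(2)
leaves = [Node(rng.normal(size=8), doc_id=f"doc{i}") for i in range(4)]
mid1 = Node(leaves[0].prototype + leaves[1].prototype, leaves[:2])
mid2 = Node(leaves[2].prototype + leaves[3].prototype, leaves[2:])
root = Node(mid1.prototype + mid2.prototype, [mid1, mid2])
print(best_first_retrieve(root, leaves[3].prototype.copy()))  # doc3 first
```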

[45] Knowledge-Graph Based RAG System Evaluation Framework

Sicheng Dong, Vahid Zolfaghari, Nenad Petrovic, Alois Knoll

Main category: cs.CL

TL;DR: Extended RAGAS framework into KG-based evaluation paradigm for RAG systems, enabling multi-hop reasoning and semantic clustering for more comprehensive scoring.

DetailsMotivation: Traditional evaluation metrics struggle to capture key features of modern LLM-generated content that exhibits high fluency and naturalness, making RAG system evaluation challenging.

Method: Extended RAGAS framework into KG-based evaluation paradigm with multi-hop reasoning and semantic community clustering to derive comprehensive scoring metrics.

Result: KG-based evaluation method shows higher sensitivity to subtle semantic differences in generated outputs and better correlation with human judgments compared to RAGAS scores.

Conclusion: KG-based evaluation provides deeper understanding of RAG systems and more nuanced performance perspective, with identified challenges and future research directions.

Abstract: Large language models (LLMs) have become a significant research focus and are utilized in various fields, such as text generation and dialog systems. One of the most essential applications of LLMs is Retrieval Augmented Generation (RAG), which greatly enhances generated content’s reliability and relevance. However, evaluating RAG systems remains a challenging task. Traditional evaluation metrics struggle to effectively capture the key features of modern LLM-generated content that often exhibits high fluency and naturalness. Inspired by the RAGAS tool, a well-known RAG evaluation framework, we extended this framework into a KG-based evaluation paradigm, enabling multi-hop reasoning and semantic community clustering to derive more comprehensive scoring metrics. By incorporating these comprehensive evaluation criteria, we gain a deeper understanding of RAG systems and a more nuanced perspective on their performance. To validate the effectiveness of our approach, we compare its performance with RAGAS scores and construct a human-annotated subset to assess the correlation between human judgments and automated metrics. In addition, we conduct targeted experiments to demonstrate that our KG-based evaluation method is more sensitive to subtle semantic differences in generated outputs. Finally, we discuss the key challenges in evaluating RAG systems and highlight potential directions for future research.

[46] Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models

Tolúlọpé Ògúnrèmí, Christopher D. Manning, Dan Jurafsky, Karen Livescu

Main category: cs.CL

TL;DR: Analysis of modality adapters in spoken language models reveals two strategies: Whisper-based models use English interlingua for semantic representation, while others use phonetic representation with English words.

DetailsMotivation: To understand how modality adapters transform speech encoder outputs into representations that decoder language models can process, as this crucial component's workings are poorly understood.

Method: Examined MA output representations in three SLMs (SALMONN, Qwen2-Audio, Phi-4-Multimodal-Instruct) by finding the nearest decoder LM token to MA representations.

Result: Found two distinct strategies: Whisper encoder models use English-based interlingua for semantic representation (handling unseen languages), while non-Whisper models represent phonetics using English words.

Conclusion: The representation strategy depends on whether the speech encoder is trained only for speech recognition or also for translation, with Whisper-based models achieving semantic understanding across languages.

Abstract: Spoken language models (SLMs) that integrate speech with large language models (LMs) rely on modality adapters (MAs) to map the output of speech encoders to a representation that is understandable to the decoder LM. Yet we know very little about how these crucial MAs transform representations. Here we examine the MA output representation in three SLMs (SALMONN, Qwen2-Audio and Phi-4-Multimodal-Instruct). By finding the nearest decoder LM token to an MA representation, we uncover two strategies for MA representations. For models using a Whisper encoder, MAs appear to represent the meaning of the input using an English-based interlingua, allowing them to handle languages unseen in instruction tuning. For models that don’t, like Phi-4-Multimodal-Instruct, MAs instead represent the phonetics of the input, but expressed with English words. We hypothesise that which strategy arises depends on whether the speech encoder is trained only for speech recognition or also for translation.
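
The nearest-token probe is easy to reproduce; the sketch below maps modality-adapter outputs to their closest decoder-LM tokens by cosine similarity, assuming access to the MA output vectors, the LM's input embedding matrix, and a Hugging-Face-style tokenizer (all assumptions about the interface, not the paper's code).

```python
import torch
import torch.nn.functional as F

def nearest_decoder_tokens(ma_outputs, embedding_matrix, tokenizer, k=1):
    """For each modality-adapter output vector, return the k decoder-LM
    tokens whose input embeddings are closest in cosine similarity."""
    ma = F.normalize(ma_outputs, dim=-1)          # (seq_len, d)
    emb = F.normalize(embedding_matrix, dim=-1)   # (vocab_size, d)
    sims = ma @ emb.T                             # (seq_len, vocab_size)
    top = sims.topk(k, dim=-1).indices
    return [[tokenizer.decode([t]) for t in row] for row in top.tolist()]
```

Reading off these nearest tokens is what reveals whether an adapter emits English translations (interlingua) or English-spelled phonetics.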

[47] Evaluation Framework for Highlight Explanations of Context Utilisation in Language Models

Jingyi Sun, Pepa Atanasova, Sagnik Ray Choudhury, Sekh Mainul Islam, Isabelle Augenstein

Main category: cs.CL

TL;DR: This paper introduces the first gold standard evaluation framework for highlight explanations (HEs) in assessing context utilization by language models, revealing that while mechanistic interpretability approaches perform best, all methods struggle with longer contexts and exhibit positional biases.

DetailsMotivation: Current language models lack transparency in context utilization - users cannot determine whether models use parametric memory or provided context, nor identify which specific context pieces inform responses. Highlight explanations offer a solution but lack proper evaluation frameworks.

Method: The authors introduce a gold standard HE evaluation framework using controlled test cases with known ground-truth context usage. They evaluate four HE methods (three established techniques and MechLight, a mechanistic interpretability approach adapted for this task) across four context scenarios, four datasets, and five language models.

Result: MechLight performs best across all context scenarios. However, all methods struggle with longer contexts and exhibit positional biases, indicating fundamental challenges in explanation accuracy.

Conclusion: There are fundamental challenges in delivering reliable context utilization explanations at scale, requiring new approaches to address limitations with longer contexts and positional biases in highlight explanation methods.

Abstract: Context utilisation, the ability of Language Models (LMs) to incorporate relevant information from the provided context when generating responses, remains largely opaque to users, who cannot determine whether models draw from parametric memory or provided context, nor identify which specific context pieces inform the response. Highlight explanations (HEs) offer a natural solution, as they can point to the exact context pieces and tokens that influenced model outputs. However, no existing work evaluates their effectiveness in accurately explaining context utilisation. We address this gap by introducing the first gold standard HE evaluation framework for context attribution, using controlled test cases with known ground-truth context usage, which avoids the limitations of existing indirect proxy evaluations. To demonstrate the framework’s broad applicability, we evaluate four HE methods (three established techniques and MechLight, a mechanistic interpretability approach we adapt for this task) across four context scenarios, four datasets, and five LMs. Overall, we find that MechLight performs best across all context scenarios. However, all methods struggle with longer contexts and exhibit positional biases, pointing to fundamental challenges in explanation accuracy that require new approaches to deliver reliable context utilisation explanations at scale.
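
Because each test case comes with known ground-truth context tokens, a highlight explanation can be scored directly against the gold set; the top-k precision/recall below is one plausible scoring choice for such a framework, not necessarily the paper's metric.

```python
def highlight_precision_recall(token_scores, gold_indices, k):
    """Score a highlight explanation (one importance score per context token)
    against the known ground-truth token indices."""
    topk = sorted(range(len(token_scores)), key=lambda i: -token_scores[i])[:k]
    hits = len(set(topk) & set(gold_indices))
    return hits / k, hits / len(gold_indices)  # (precision@k, recall@k)
```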

[48] Mind the Gap: Linguistic Divergence and Adaptation Strategies in Human-LLM Assistant vs. Human-Human Interactions

Fulei Zhang, Zhou Yu

Main category: cs.CL

TL;DR: Users communicate differently with LLM chatbots vs human agents, showing differences in grammar, politeness, and vocabulary. Training on human-human data alone is insufficient, but data augmentation helps models adapt better.

DetailsMotivation: To understand how user communication styles differ between LLM chatbots and human agents, and address the gap where models trained only on human-human data may not handle post-launch communication style changes effectively.

Method: Analyzed user communication patterns, then tested two strategies: (1) data augmentation during post-training phase, and (2) inference-time user message reformulation.

Result: Users showed significant differences in grammatical fluency, politeness, and lexical diversity when interacting with chatbots vs humans. Models trained on stylistically diverse datasets outperformed those trained on original or uniform datasets, while inference-time reformulation was less effective.

Conclusion: Data augmentation with stylistically diverse training data significantly improves LLM robustness to post-launch communication style changes, leading to better user interaction experiences.

Abstract: As Large Language Models (LLMs) are increasingly deployed in customer-facing applications, a critical yet underexplored question is how users communicate differently with LLM chatbots compared to human agents. In this study, we present empirical evidence that users adopt distinct communication styles when interacting with chatbots versus human agents. Our analysis reveals significant differences in grammatical fluency, politeness, and lexical diversity in user language between the two settings. These findings suggest that models trained exclusively on human-human interaction data may not adequately accommodate the communication style shift that occurs once an LLM chatbot is deployed. To enhance LLM robustness to post-launch communication style changes, we experimented with two strategies: (1) data augmentation during the post-training phase and (2) inference-time user message reformulation. Our results indicate that models trained on stylistically diverse datasets significantly outperform those trained exclusively on original or stylistically uniform datasets, while inference-time reformulation proved less effective. These insights help us better adapt our models for improved LLM-user interaction experiences.

[49] SoT: Structured-of-Thought Prompting Guides Multilingual Reasoning in Large Language Models

Rui Qi, Zhibo Man, Yufeng Chen, Fengran Mo, Jinan Xu, Kaiyu Huang

Main category: cs.CL

TL;DR: SoT is a training-free method that improves multilingual reasoning by converting language-specific semantics to language-agnostic structured representations through multi-step transformations.

DetailsMotivation: Current LLMs struggle with multilingual reasoning tasks due to resource constraints that prevent effective reasoning transfer to non-high-resource languages.

Method: Structured-of-Thought (SoT) uses Language Thinking Transformation and Structured Knowledge Transformation to convert semantic information into structured representations, guiding LLMs toward consistent reasoning across languages.

Result: SoT outperforms strong baselines on multiple multilingual reasoning benchmarks across various LLM backbones and can be integrated with other training-free strategies for further improvements.

Conclusion: SoT effectively enhances multilingual reasoning capabilities by transforming semantic information into structured representations, maintaining consistent reasoning pathways across different languages.

Abstract: Recent developments have enabled Large Language Models (LLMs) to engage in complex reasoning tasks through deep thinking. However, this reasoning capacity has not been successfully transferred to non-high-resource languages due to resource constraints, leaving models struggling with multilingual reasoning tasks. To this end, we propose Structured-of-Thought (SoT), a training-free method that improves performance on multilingual reasoning through a multi-step transformation: Language Thinking Transformation and Structured Knowledge Transformation. The SoT method converts language-specific semantic information into language-agnostic structured representations, enabling models to understand queries in different languages more precisely. Besides, SoT effectively guides LLMs toward more concentrated reasoning, maintaining consistent underlying reasoning pathways when handling cross-lingual variations in expression. Experimental results demonstrate that SoT outperforms several strong baselines on multiple multilingual reasoning benchmarks across various LLM backbones. It can also be integrated with other training-free strategies for further improvements. Our code is available at https://github.com/Cherry-qwq/SoT.

[50] Self-Improvement in Multimodal Large Language Models: A Survey

Shijian Deng, Kai Wang, Tianyu Yang, Harsh Singh, Yapeng Tian

Main category: cs.CL

TL;DR: This is the first comprehensive survey on self-improvement in Multimodal LLMs (MLLMs), covering data collection, organization, model optimization methods, evaluations, and applications.

DetailsMotivation: To extend self-improvement techniques from LLMs to multimodal domains, leveraging diverse data sources for developing more general self-improving models.

Method: Structured analysis from three perspectives: data collection, data organization, and model optimization; includes evaluations and downstream applications.

Result: Provides a comprehensive overview of current literature and methods for self-improvement in MLLMs.

Conclusion: Outlines open challenges and future research directions for self-improvement in multimodal LLMs.

Abstract: Recent advancements in self-improvement for Large Language Models (LLMs) have efficiently enhanced model capabilities without significantly increasing costs, particularly in terms of human effort. While this area is still relatively young, its extension to the multimodal domain holds immense potential for leveraging diverse data sources and developing more general self-improving models. This survey is the first to provide a comprehensive overview of self-improvement in Multimodal LLMs (MLLMs). We provide a structured overview of the current literature and discuss methods from three perspectives: 1) data collection, 2) data organization, and 3) model optimization, to facilitate the further development of self-improvement in MLLMs. We also include commonly used evaluations and downstream applications. Finally, we conclude by outlining open challenges and future research directions.

[51] Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering

Yavuz Bakman, Sungmin Kang, Zhiqi Huang, Duygu Nur Yaldiz, Catarina G. Belém, Chenyang Zhu, Anoop Kumar, Alfy Samuel, Salman Avestimehr, Daben Liu, Sai Praneeth Karimireddy

Main category: cs.CL

TL;DR: This paper proposes a novel uncertainty quantification method for contextual question answering that measures epistemic uncertainty by comparing model predictions against an idealized model, focusing on three key semantic features: context-reliance, context comprehension, and honesty.

DetailsMotivation: Current uncertainty quantification research has focused on closed-book factual QA while ignoring contextual QA, despite its importance in real-world applications where models must reason about provided context.

Method: The authors propose a theoretically grounded approach that defines token-level uncertainty as cross-entropy between model predictions and true distribution. They isolate epistemic uncertainty by approximating the true distribution with an idealized model and derive an upper bound interpreted as semantic feature gaps. For contextual QA, they extract three features (context-reliance, context comprehension, honesty) using a top-down interpretability approach with minimal labeled samples.

Result: Experiments on multiple QA benchmarks in both in-distribution and out-of-distribution settings show the method substantially outperforms state-of-the-art unsupervised and supervised UQ methods, achieving up to 13-point PRR improvement with negligible inference overhead.

Conclusion: The proposed framework provides an effective and efficient approach for uncertainty quantification in contextual QA by focusing on semantic feature gaps relative to an idealized model, demonstrating significant performance improvements over existing methods.

Abstract: Uncertainty Quantification (UQ) research has primarily focused on closed-book factual question answering (QA), while contextual QA remains unexplored, despite its importance in real-world applications. In this work, we focus on UQ for the contextual QA task and propose a theoretically grounded approach to quantify epistemic uncertainty. We begin by introducing a task-agnostic, token-level uncertainty measure defined as the cross-entropy between the predictive distribution of the given model and the unknown true distribution. By decomposing this measure, we isolate the epistemic component and approximate the true distribution by a perfectly prompted, idealized model. We then derive an upper bound for epistemic uncertainty and show that it can be interpreted as semantic feature gaps in the given model’s hidden representations relative to the ideal model. We further apply this generic framework to the contextual QA task and hypothesize that three features approximate this gap: context-reliance (using the provided context rather than parametric knowledge), context comprehension (extracting relevant information from context), and honesty (avoiding intentional lies). Using a top-down interpretability approach, we extract these features by using only a small number of labeled samples and ensemble them to form a robust uncertainty score. Experiments on multiple QA benchmarks in both in-distribution and out-of-distribution settings show that our method substantially outperforms state-of-the-art unsupervised (sampling-free and sampling-based) and supervised UQ methods, achieving up to a 13-point PRR improvement while incurring a negligible inference overhead.
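
The decomposition the abstract alludes to is the standard cross-entropy identity (notation ours): writing p* for the unknown true token distribution and p_θ for the model's predictive distribution, the uncertainty splits into an aleatoric entropy term and an epistemic divergence term, and it is the latter that the paper upper-bounds and reads as semantic feature gaps.

```latex
H(p^{*}, p_{\theta})
  = \underbrace{H(p^{*})}_{\text{aleatoric}}
  + \underbrace{\mathrm{KL}\left(p^{*} \,\middle\|\, p_{\theta}\right)}_{\text{epistemic}}
```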

[52] Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks

Yubo Li, Ramayya Krishnan, Rema Padman

Main category: cs.CL

TL;DR: Survival analysis reveals that abrupt semantic drift causes catastrophic conversational failure in LLMs, while gradual drift is protective and enables longer dialogues.

DetailsMotivation: Existing evaluation frameworks fail to capture temporal dynamics of conversational degradation in multi-turn dialogues, leaving LLM robustness in extended conversations poorly understood.

Method: Comprehensive survival analysis of 36,951 conversation turns across 9 LLMs using Cox proportional hazards, Accelerated Failure Time, and Random Survival Forest approaches to model failure as time-to-event process.

Result: Abrupt prompt-to-prompt semantic drift dramatically increases failure hazard, while gradual cumulative drift reduces failure hazard and enables significantly longer dialogues. AFT models with interactions showed superior performance with excellent discrimination and calibration.

Conclusion: Survival analysis is a powerful paradigm for evaluating LLM robustness, offering insights for designing resilient conversational agents and challenging assumptions about semantic consistency necessity.

Abstract: Large Language Models (LLMs) have revolutionized conversational AI, yet their robustness in extended multi-turn dialogues remains poorly understood. Existing evaluation frameworks focus on static benchmarks and single-turn assessments, failing to capture the temporal dynamics of conversational degradation that characterize real-world interactions. In this work, we present the first comprehensive survival analysis of conversational AI robustness, analyzing 36,951 conversation turns across 9 state-of-the-art LLMs to model failure as a time-to-event process. Our survival modeling framework, employing Cox proportional hazards, Accelerated Failure Time, and Random Survival Forest approaches, reveals striking temporal dynamics. We find that abrupt, prompt-to-prompt (P2P) semantic drift is catastrophic, dramatically increasing the hazard of conversational failure. In stark contrast, gradual, cumulative drift is highly protective, vastly reducing the failure hazard and enabling significantly longer dialogues. AFT models with interactions demonstrate superior performance, achieving excellent discrimination and exceptional calibration. These findings establish survival analysis as a powerful paradigm for evaluating LLM robustness, offer concrete insights for designing resilient conversational agents, and challenge prevailing assumptions about the necessity of semantic consistency in conversational AI systems.
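
Survival models of this kind are straightforward to fit with standard tooling; below is a hedged sketch using the lifelines library, with toy data, column names, and covariates invented for illustration (the paper's schema and features are not specified in this summary).

```python
import pandas as pd
from lifelines import CoxPHFitter

# Toy data: one row per conversation; time-to-event is turns until the model
# becomes inconsistent, with censoring for conversations that never fail.
df = pd.DataFrame({
    "turns_survived": [12, 30, 7, 45, 18, 52],
    "failed": [1, 0, 1, 0, 1, 0],            # 1 = failure observed, 0 = censored
    "p2p_drift": [0.62, 0.15, 0.71, 0.08, 0.55, 0.11],   # abrupt drift
    "cumulative_drift": [0.9, 2.4, 0.5, 3.1, 1.0, 3.6],  # gradual drift
})

cph = CoxPHFitter()
cph.fit(df, duration_col="turns_survived", event_col="failed")
cph.print_summary()  # positive coefficient => covariate raises the failure hazard
```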

[53] TravelBench : Exploring LLM Performance in Low-Resource Domains

Srinivas Billa, Xiaonan Jing

Main category: cs.CL

TL;DR: General LLM benchmarks are insufficient for evaluating performance in low-resource domains. The study created 14 travel-domain datasets and found that despite training scale, LLMs struggle with complex domain-specific tasks, with reasoning helping smaller models more.

DetailsMotivation: Existing LLM benchmarks provide limited information about model capabilities in low-resource tasks, making it difficult to develop effective solutions for these domains.

Method: Curated 14 travel-domain datasets spanning 7 common NLP tasks using anonymized real-world data, and analyzed LLM performance across accuracy, scaling behavior, and reasoning capabilities.

Result: General benchmarking results are insufficient for understanding model performance in low-resource tasks. LLMs hit performance bottlenecks in complex domain-specific scenarios despite training scale. Reasoning provides more significant boost for smaller LLMs.

Conclusion: Domain-specific evaluation is crucial for understanding LLM capabilities in low-resource settings, and reasoning capabilities are particularly beneficial for smaller models in certain tasks.

Abstract: Results on existing LLM benchmarks capture little information about model capabilities in low-resource tasks, making it difficult to develop effective solutions in these domains. To address these challenges, we curated 14 travel-domain datasets spanning 7 common NLP tasks using anonymised data from real-world scenarios, and analysed performance across LLMs. We report on the accuracy, scaling behaviour, and reasoning capabilities of LLMs in a variety of tasks. Our results confirm that general benchmarking results are insufficient for understanding model performance in low-resource tasks. Regardless of training FLOPs, out-of-the-box LLMs hit performance bottlenecks in complex, domain-specific scenarios. Furthermore, reasoning provides a more significant boost for smaller LLMs by making the model a better judge on certain tasks.

[54] PGMEL: Policy Gradient-based Generative Adversarial Network for Multimodal Entity Linking

KM Pooja, Cheng Long, Aixin Sun

Main category: cs.CL

TL;DR: The paper proposes PGMEL, a generative adversarial network using policy gradient for multimodal entity linking that generates high-quality negative samples to improve representation learning.

DetailsMotivation: Existing multimodal entity linking methods haven't explored the potential of high-quality negative sample selection for metric learning, creating a gap in the literature.

Method: A generative adversarial framework where a generator creates challenging negative samples using policy gradient (discrete optimization), and a discriminator performs metric learning tasks.

Result: PGMEL outperforms state-of-the-art methods on Wiki-MEL, Richpedia-MEL and WikiDiverse datasets by learning more meaningful representations through better negative sample selection.

Conclusion: The proposed adversarial approach with policy gradient optimization effectively addresses the negative sample selection problem in multimodal entity linking, leading to improved performance.

Abstract: The task of entity linking, which involves associating mentions with their respective entities in a knowledge graph, has received significant attention due to its numerous potential applications. Recently, various multimodal entity linking (MEL) techniques have been proposed, designed to learn comprehensive embeddings by leveraging both text and vision modalities. The selection of high-quality negative samples can potentially play a crucial role in metric/representation learning. However, to the best of our knowledge, this possibility remains unexplored in existing literature within the framework of MEL. To fill this gap, we address the multimodal entity linking problem in a generative adversarial setting where the generator is responsible for generating high-quality negative samples, and the discriminator is assigned the responsibility for the metric learning tasks. Since the generator is involved in generating samples, which is a discrete process, we optimize it using policy gradient techniques and propose a policy gradient-based generative adversarial network for multimodal entity linking (PGMEL). Experimental results based on Wiki-MEL, Richpedia-MEL and WikiDiverse datasets demonstrate that PGMEL learns meaningful representation by selecting challenging negative samples and outperforms state-of-the-art methods.
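
Because choosing a negative sample is a discrete action, the generator cannot be trained by backpropagating through the discriminator; a REINFORCE-style update of the kind the abstract describes might look like the sketch below, where the reward function and interfaces are assumptions for illustration.

```python
import torch

def generator_pg_step(logits, reward_fn, optimizer):
    """One policy-gradient (REINFORCE) step for a negative-sample generator.
    logits: (batch, n_candidates) scores over candidate negative entities.
    reward_fn: maps sampled candidate indices to rewards, e.g. how strongly
    the sampled negatives challenge the discriminator's metric learning."""
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()                              # one negative per mention
    rewards = reward_fn(actions)                         # (batch,) rewards
    loss = -(dist.log_prob(actions) * rewards).mean()    # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```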

[55] IndiCASA: A Dataset and Bias Evaluation Framework in LLMs Using Contrastive Embedding Similarity in the Indian Context

Santhosh G S, Akshay Govind S, Gokul S Krishnan, Balaraman Ravindran, Sriraam Natarajan

Main category: cs.CL

TL;DR: Proposes a contrastive learning-based evaluation framework and IndiCASA dataset to assess cultural biases in LLMs for Indian contexts, revealing persistent disability biases despite global debiasing efforts.

DetailsMotivation: Existing embedding-based bias assessment methods fall short in capturing nuanced stereotypes in culturally diverse contexts like India, necessitating rigorous evaluation of LLM biases in high-stakes applications.

Method: Developed an evaluation framework using encoder trained with contrastive learning to capture fine-grained bias through embedding similarity, and created IndiCASA dataset with 2,575 human-validated sentences across five demographic axes.

Result: All evaluated open-weight LLMs exhibited some degree of stereotypical bias, with disability-related biases being notably persistent and religion bias generally lower likely due to global debiasing efforts.

Conclusion: There is a demonstrated need for fairer model development as current LLMs still exhibit cultural biases, particularly in disability-related stereotypes, highlighting the importance of context-specific bias evaluation frameworks.

Abstract: Large Language Models (LLMs) have gained significant traction across critical domains owing to their impressive contextual understanding and generative capabilities. However, their increasing deployment in high-stakes applications necessitates rigorous evaluation of embedded biases, particularly in culturally diverse contexts like India, where existing embedding-based bias assessment methods often fall short in capturing nuanced stereotypes. We propose an evaluation framework based on an encoder trained using contrastive learning that captures fine-grained bias through embedding similarity. We also introduce a novel dataset, IndiCASA (IndiBias-based Contextually Aligned Stereotypes and Anti-stereotypes), comprising 2,575 human-validated sentences spanning five demographic axes: caste, gender, religion, disability, and socioeconomic status. Our evaluation of multiple open-weight LLMs reveals that all models exhibit some degree of stereotypical bias, with disability-related biases notably persistent and religion bias generally lower, likely due to global debiasing efforts, demonstrating the need for fairer model development.

[56] The Path of Self-Evolving Large Language Models: Achieving Data-Efficient Learning via Intrinsic Feedback

Hangfan Zhang, Siyuan Xu, Zhimeng Guo, Huaisheng Zhu, Shicheng Liu, Xinrun Wang, Qiaosheng Zhang, Yang Chen, Peng Ye, Lei Bai, Shuyue Hu

Main category: cs.CL

TL;DR: Self-aware reinforcement learning for LLMs that alternates between task proposal and solving, using difficulty prediction and limit breaking to minimize data requirements.

DetailsMotivation: Traditional RL training for LLMs requires substantial data annotation efforts, creating a need for more data-efficient approaches.

Method: Alternates between LLM proposing tasks and solving them, with self-aware difficulty prediction to prioritize challenging solvable tasks and self-aware limit breaking to request external data when needed.

Result: 53.8% relative improvement on nine benchmarks with less than 1.2% extra data compared to baseline approaches.

Conclusion: Self-aware RL demonstrates efficacy for improving LLMs with minimal data and shows promise for self-evolving agent training.

Abstract: Reinforcement learning (RL) has demonstrated potential in enhancing the reasoning capabilities of large language models (LLMs), but such training typically demands substantial efforts in creating and annotating data. In this work, we explore improving LLMs through RL with minimal data. Our approach alternates between the LLM proposing a task and then attempting to solve it. To minimize data dependency, we introduce two novel mechanisms grounded in self-awareness: (1) self-aware difficulty prediction, where the model learns to assess task difficulty relative to its own abilities and prioritize challenging yet solvable tasks, and (2) self-aware limit breaking, where the model recognizes when a task is beyond its capability boundary and proactively requests external data to break through that limit. Extensive experiments on nine benchmarks, showing a 53.8% relative improvement with less than 1.2% extra data, demonstrate the efficacy of self-aware RL and underscore the promise of self-evolving agent training.

[57] XTRA: Cross-Lingual Topic Modeling with Topic and Representation Alignments

Tien Phat Nguyen, Vu Minh Ngo, Tung Nguyen, Linh Van Ngo, Duc Anh Nguyen, Sang Dinh, Trung Le

Main category: cs.CL

TL;DR: XTRA is a cross-lingual topic modeling framework that combines Bag-of-Words modeling with multilingual embeddings using representation and topic alignments to improve topic coherence, diversity, and cross-lingual consistency.

DetailsMotivation: Existing cross-lingual topic modeling methods struggle with ensuring high topic coherence and consistent alignment across languages, despite achieving some improvements in topic diversity.

Method: XTRA introduces two core components: representation alignment (aligning document-topic distributions via contrastive learning) and topic alignment (projecting topic-word distributions into shared semantic space) to enforce cross-lingual consistency.

Result: Experiments on multilingual corpora show XTRA significantly outperforms strong baselines in topic coherence, diversity, and alignment quality.

Conclusion: XTRA successfully learns topics that are interpretable (coherent and diverse) and well-aligned across languages through its dual alignment mechanism.

Abstract: Cross-lingual topic modeling aims to uncover shared semantic themes across languages. Several methods have been proposed to address this problem, leveraging both traditional and neural approaches. While previous methods have achieved some improvements in topic diversity, they often struggle to ensure high topic coherence and consistent alignment across languages. We propose XTRA (Cross-Lingual Topic Modeling with Topic and Representation Alignments), a novel framework that unifies Bag-of-Words modeling with multilingual embeddings. XTRA introduces two core components: (1) representation alignment, aligning document-topic distributions via contrastive learning in a shared semantic space; and (2) topic alignment, projecting topic-word distributions into the same space to enforce cross-lingual consistency. This dual mechanism enables XTRA to learn topics that are interpretable (coherent and diverse) and well-aligned across languages. Experiments on multilingual corpora confirm that XTRA significantly outperforms strong baselines in topic coherence, diversity, and alignment quality. Code and reproducible scripts are available at https://github.com/tienphat140205/XTRA.
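
The representation-alignment component is contrastive learning over document-topic distributions of aligned documents; a minimal InfoNCE-style sketch of that idea follows (batch construction, normalization, and temperature are our assumptions, not XTRA's code).

```python
import torch
import torch.nn.functional as F

def representation_alignment_loss(theta_src, theta_tgt, temperature=0.07):
    """Contrastive alignment of document-topic distributions: row i of each
    batch is the same document in two languages, so the diagonal entries of
    the similarity matrix are the positives."""
    z_src = F.normalize(theta_src, dim=-1)
    z_tgt = F.normalize(theta_tgt, dim=-1)
    logits = z_src @ z_tgt.T / temperature       # (batch, batch)
    labels = torch.arange(z_src.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```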

[58] A Computational Framework for Interpretable Text-Based Personality Assessment from Social Media

Matej Gjurković

Main category: cs.CL

TL;DR: This thesis addresses challenges in automated personality assessment from social media text by introducing two large datasets (MBTI9k and PANDORA) and developing the SIMPA framework for interpretable personality assessment using semantic similarity matching.

DetailsMotivation: To overcome the scarcity of personality-labeled datasets and bridge the gap between personality psychology and NLP, which limits model validity and interpretability in automated personality assessment.

Method: Created two Reddit-based datasets (MBTI9k and PANDORA) with personality labels, then developed the SIMPA framework that uses machine learning and semantic similarity to match user statements with validated questionnaire items for interpretable assessment.

Result: Experiments showed demographic variables influence model validity, and SIMPA achieved personality assessments comparable to human evaluations while maintaining high interpretability and efficiency.

Conclusion: SIMPA provides an effective framework for interpretable personality assessment that extends beyond personality to various applications with complex label taxonomies, offering scalability and model-agnostic design.

Abstract: Personality refers to individual differences in behavior, thinking, and feeling. With the growing availability of digital footprints, especially from social media, automated methods for personality assessment have become increasingly important. Natural language processing (NLP) enables the analysis of unstructured text data to identify personality indicators. However, two main challenges remain central to this thesis: the scarcity of large, personality-labeled datasets and the disconnect between personality psychology and NLP, which restricts model validity and interpretability. To address these challenges, this thesis presents two datasets, MBTI9k and PANDORA, collected from Reddit, a platform known for user anonymity and diverse discussions. The PANDORA dataset contains 17 million comments from over 10,000 users and integrates the MBTI and Big Five personality models with demographic information, overcoming limitations in data size, quality, and label coverage. Experiments on these datasets show that demographic variables influence model validity. In response, the SIMPA (Statement-to-Item Matching Personality Assessment) framework was developed: a computational framework for interpretable personality assessment that matches user-generated statements with validated questionnaire items. By using machine learning and semantic similarity, SIMPA delivers personality assessments comparable to human evaluations while maintaining high interpretability and efficiency. Although focused on personality assessment, SIMPA’s versatility extends beyond this domain. Its model-agnostic design, layered cue detection, and scalability make it suitable for various research and practical applications involving complex label taxonomies and variable cue associations with target concepts.
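
SIMPA's core move, matching user statements to validated questionnaire items by semantic similarity, can be illustrated with off-the-shelf sentence embeddings; the model choice and the example items below are assumptions for illustration only.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

statements = ["I love meeting new people at parties."]
items = ["I see myself as extraverted, enthusiastic.",  # TIPI-style items
         "I see myself as reserved, quiet."]

s_emb = model.encode(statements, convert_to_tensor=True)
i_emb = model.encode(items, convert_to_tensor=True)
sims = util.cos_sim(s_emb, i_emb)   # (n_statements, n_items)
j = int(sims[0].argmax())
print(f"best item: {items[j]!r} (similarity {float(sims[0, j]):.3f})")
```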

[59] StepChain GraphRAG: Reasoning Over Knowledge Graphs for Multi-Hop Question Answering

Tengjun Ni, Xin Yuan, Shenghong Li, Kai Wu, Ren Ping Liu, Wei Ni, Wenjie Zhang

Main category: cs.CL

TL;DR: StepChain GraphRAG is a framework that combines question decomposition with BFS reasoning flow for multi-hop QA, achieving state-of-the-art performance with improved explainability.

DetailsMotivation: Address challenges in integrating iterative reasoning steps with external knowledge retrieval in retrieval-augmented generation for multi-hop question answering.

Method: Builds global corpus index, parses retrieved passages into knowledge graph, splits complex queries into sub-questions, and uses BFS-based traversal to expand along relevant edges while assembling explicit evidence chains.

Result: Achieves state-of-the-art Exact Match and F1 scores on MuSiQue, 2WikiMultiHopQA, and HotpotQA, with average EM improvement of 2.57% and F1 improvement of 2.13% over SOTA methods.

Conclusion: The framework enhances explainability through chain-of-thought preservation, though future work should address computational overhead and potential LLM hallucinations to improve efficiency and reliability.

Abstract: Recent progress in retrieval-augmented generation (RAG) has led to more accurate and interpretable multi-hop question answering (QA). Yet, challenges persist in integrating iterative reasoning steps with external knowledge retrieval. To address this, we introduce StepChain GraphRAG, a framework that unites question decomposition with a Breadth-First Search (BFS) Reasoning Flow for enhanced multi-hop QA. Our approach first builds a global index over the corpus; at inference time, only retrieved passages are parsed on-the-fly into a knowledge graph, and the complex query is split into sub-questions. For each sub-question, a BFS-based traversal dynamically expands along relevant edges, assembling explicit evidence chains without overwhelming the language model with superfluous context. Experiments on MuSiQue, 2WikiMultiHopQA, and HotpotQA show that StepChain GraphRAG achieves state-of-the-art Exact Match and F1 scores. StepChain GraphRAG lifts average EM by 2.57% and F1 by 2.13% over the SOTA method, achieving the largest gain on HotpotQA (+4.70% EM, +3.44% F1). StepChain GraphRAG also fosters enhanced explainability by preserving the chain-of-thought across intermediate retrieval steps. We conclude by discussing how future work can mitigate the computational overhead and address potential hallucinations from large language models to refine efficiency and reliability in multi-hop QA.
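
The BFS Reasoning Flow can be pictured as a breadth-first traversal that carries its provenance along; in the minimal sketch below the on-the-fly knowledge graph is an adjacency map from entities to (relation, neighbor) pairs, which is an assumption about the data structure rather than the paper's code.

```python
from collections import deque

def bfs_evidence_chains(graph, start_entities, is_answer, max_depth=3):
    """Expand outward from a sub-question's entities, recording the
    (head, relation, tail) hops that reach each node so every candidate
    answer arrives with an explicit evidence chain."""
    queue = deque((e, []) for e in start_entities)
    seen = set(start_entities)
    chains = []
    while queue:
        entity, chain = queue.popleft()
        if is_answer(entity):
            chains.append(chain)
            continue
        if len(chain) >= max_depth:
            continue
        for relation, neighbor in graph.get(entity, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, chain + [(entity, relation, neighbor)]))
    return chains
```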

[60] Evaluating Large Language Models for IUCN Red List Species Information

Shinya Uryu

Main category: cs.CL

TL;DR: LLMs show strong performance in taxonomic classification (94.9%) but poor conservation reasoning (27.2%), revealing a knowledge-reasoning gap that limits their reliability for species conservation assessments.

DetailsMotivation: To systematically validate LLMs' reliability for species evaluation in conservation, given their rapid adoption despite uncertain performance for biodiversity crisis applications.

Method: Systematic validation of five leading LLMs on 21,955 species across four IUCN Red List assessment components: taxonomy, conservation status, distribution, and threats.

Result: Models excelled at taxonomic classification but consistently failed at conservation reasoning, showing systematic biases favoring charismatic vertebrates and revealing inherent architectural constraints beyond data limitations.

Conclusion: LLMs are suitable for information retrieval but require human oversight for judgment-based decisions; a hybrid approach is recommended where LLMs augment expert capacity while humans retain authority over risk assessment and policy.

Abstract: Large Language Models (LLMs) are rapidly being adopted in conservation to address the biodiversity crisis, yet their reliability for species evaluation is uncertain. This study systematically validates five leading models on 21,955 species across four core IUCN Red List assessment components: taxonomy, conservation status, distribution, and threats. A critical paradox was revealed: models excelled at taxonomic classification (94.9%) but consistently failed at conservation reasoning (27.2% for status assessment). This knowledge-reasoning gap, evident across all models, suggests inherent architectural constraints, not just data limitations. Furthermore, models exhibited systematic biases favoring charismatic vertebrates, potentially amplifying existing conservation inequities. These findings delineate clear boundaries for responsible LLM deployment: they are powerful tools for information retrieval but require human oversight for judgment-based decisions. A hybrid approach is recommended, where LLMs augment expert capacity while human experts retain sole authority over risk assessment and policy.

[61] Constraint Satisfaction Approaches to Wordle: Novel Heuristics and Cross-Lexicon Validation

Jahidul Arafat, Fariha Tasmin, Sanjaya Poudel, Kamrujjaman, Eftakhar Ahmed Arnob, Ahsan Habib Tareq

Main category: cs.CL

TL;DR: First comprehensive CSP formulation of Wordle with constraint-aware solving strategies that outperform classical approaches through CSP-Aware Entropy and Probabilistic CSP framework.

DetailsMotivation: Existing Wordle solvers use information-theoretic entropy or frequency heuristics without formal constraint treatment, creating a need for principled CSP approaches.

Method: CSP-Aware Entropy computes information gain after constraint propagation, and Probabilistic CSP integrates Bayesian word-frequency priors with logical constraints.

Result: CSP-Aware Entropy achieves 3.54 average guesses with 99.9% success rate (1.7% improvement over Forward Checking) and 46% faster runtime. Maintains advantages under noise and achieves 100% success across all noise levels with Probabilistic CSP.

Conclusion: Principled constraint satisfaction techniques outperform classical information-theoretic and learning-based approaches for structured puzzle-solving domains, with cross-language transferability demonstrated.

Abstract: Wordle presents an algorithmically rich testbed for constraint satisfaction problem (CSP) solving. While existing solvers rely on information-theoretic entropy maximization or frequency-based heuristics without formal constraint treatment, we present the first comprehensive CSP formulation of Wordle with novel constraint-aware solving strategies. We introduce CSP-Aware Entropy, computing information gain after constraint propagation rather than on raw candidate sets, and a Probabilistic CSP framework integrating Bayesian word-frequency priors with logical constraints. Through evaluation on 2,315 English words, CSP-Aware Entropy achieves 3.54 average guesses with 99.9% success rate, a statistically significant 1.7% improvement over Forward Checking (t=-4.82, p<0.001, Cohen’s d=0.07) with 46% faster runtime (12.9ms versus 23.7ms per guess). Under 10% noise, CSP-aware approaches maintain 5.3 percentage point advantages (29.0% versus 23.7%, p=0.041), while Probabilistic CSP achieves 100% success across all noise levels (0-20%) through constraint recovery mechanisms. Cross-lexicon validation on 500 Spanish words demonstrates 88% success with zero language-specific tuning, validating that core CSP principles transfer across languages despite an 11.2 percentage point gap from linguistic differences (p<0.001, Fisher’s exact test). Our open-source implementation with 34 unit tests achieving 91% code coverage provides reproducible infrastructure for CSP research. The combination of formal CSP treatment, constraint-aware heuristics, probabilistic-logical integration, robustness analysis, and cross-lexicon validation establishes new performance benchmarks demonstrating that principled constraint satisfaction techniques outperform classical information-theoretic and learning-based approaches for structured puzzle-solving domains.
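
The central idea, computing entropy over the candidate set that survives constraint propagation rather than over the raw word list, is compact enough to sketch; the feedback function below is simplified (it ignores duplicate-letter edge cases) and the code is illustrative rather than the authors' implementation.

```python
import math
from collections import Counter

def feedback(guess, answer):
    """Simplified Wordle pattern: 2 = green, 1 = yellow, 0 = grey."""
    return tuple(2 if g == a else (1 if g in answer else 0)
                 for g, a in zip(guess, answer))

def csp_aware_entropy(guess, candidates):
    """Expected information of a guess, measured over the candidates that
    remain after constraint propagation (not the full lexicon)."""
    counts = Counter(feedback(guess, answer) for answer in candidates)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_guess(candidates):
    """Greedy choice: the candidate with maximal expected information gain."""
    return max(candidates, key=lambda g: csp_aware_entropy(g, candidates))
```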

[62] Self-Reflective Generation at Test Time

Jian Mu, Qixin Zhang, Zhiyong Wang, Menglin Yang, Shuang Qiu, Chengwei Qin, Zhongxiang Dai, Yao Shu

Main category: cs.CL

TL;DR: SRGen is a test-time framework that enables LLMs to self-reflect before generating uncertain tokens, using dynamic entropy thresholding and corrective vectors to improve reasoning reliability without expensive training.

DetailsMotivation: Current LLM self-reflection methods are either reactive (revising full drafts) or require expensive training, making them inefficient for addressing error cascades in chain-of-thought reasoning.

Method: Uses dynamic entropy thresholding to identify high-uncertainty tokens during generation, then trains specific corrective vectors that exploit generated context for self-reflective generation to correct token probability distributions.

Result: Significantly improves reasoning on mathematical benchmarks, with DeepSeek-R1-Distill-Qwen-7B achieving +12.0% Pass@1 and +13.3% Cons@5 improvements on AIME2024. Consistently strengthens model reasoning across diverse LLMs.

Conclusion: SRGen is a plug-and-play method that integrates reflection into generation for reliable LLM reasoning, achieving consistent gains with bounded overhead and broad composability with other techniques.

Abstract: Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile; early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either performs revisions over full drafts or learns self-correction via expensive training, both of which are fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen utilizes dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a specific corrective vector, which fully exploits the already generated context for a self-reflective generation to correct the token probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions, thereby significantly reducing the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen can consistently strengthen model reasoning: improvements in single-pass quality also translate into stronger self-consistency voting. Notably, on AIME2024 with DeepSeek-R1-Distill-Qwen-7B, SRGen yields absolute improvements of +12.0% on Pass@1 and +13.3% on Cons@5. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning, achieving consistent gains with bounded overhead and broad composability with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
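
The entropy gate at the heart of SRGen can be sketched in a few lines; the rolling mean-plus-deviation rule below is one plausible instantiation of "dynamic entropy thresholding", not the paper's exact criterion.

```python
import torch

def is_high_uncertainty(logits, entropy_history, alpha=1.5, window=50):
    """Flag the next token as high-uncertainty when its predictive entropy
    exceeds a threshold adapted from recent generation steps."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
    entropy_history.append(entropy)
    recent = entropy_history[-window:]
    mean = sum(recent) / len(recent)
    std = (sum((e - mean) ** 2 for e in recent) / len(recent)) ** 0.5
    return entropy > mean + alpha * std
```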

[63] Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval

Yohan Lee, Yongwoo Song, Sangyeop Kim

Main category: cs.CL

TL;DR: The CDR benchmark is the first comprehensive test set for evaluating conversational data retrieval systems, revealing that current embedding models perform poorly on conversational data compared to document retrieval.

DetailsMotivation: There is a need for reliable evaluation standards for systems that retrieve conversation data for product insights, as current models show substantial performance gaps in conversational data retrieval.

Method: Created a benchmark with 1.6k queries across five analytical tasks and 9.1k conversations, then evaluated 16 popular embedding models on this benchmark.

Result: Even the best models achieved only around NDCG@10 of 0.51, showing significant performance gap between document and conversational data retrieval. Identified unique challenges including implicit state recognition, turn dynamics, and contextual references.

Conclusion: The CDR benchmark provides a reliable standard for measuring conversational data retrieval performance and highlights substantial room for improvement in this domain.

Abstract: We present the Conversational Data Retrieval (CDR) benchmark, the first comprehensive test set for evaluating systems that retrieve conversation data for product insights. With 1.6k queries across five analytical tasks and 9.1k conversations, our benchmark provides a reliable standard for measuring conversational data retrieval performance. Our evaluation of 16 popular embedding models shows that even the best models reach only around NDCG@10 of 0.51, revealing a substantial gap between document and conversational data retrieval capabilities. Our work identifies unique challenges in conversational data retrieval (implicit state recognition, turn dynamics, contextual references) while providing practical query templates and detailed error analysis across different task categories. The benchmark dataset and code are available at https://github.com/l-yohai/CDR-Benchmark.
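
NDCG@10, the headline metric, is simple to compute per query; a minimal reference implementation follows (the graded-relevance input format is our assumption).

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """ranked_ids: system ranking for one query; relevance: dict mapping
    conversation id -> graded relevance. Returns NDCG@k."""
    dcg = sum(relevance.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```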

[64] Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking

Jingqi Zhang, Ruibo Chen, Yingqing Yang, Peihua Mai, Heng Huang, Yan Pang

Main category: cs.CL

TL;DR: TRACE is a black-box framework for detecting copyrighted dataset usage in LLM fine-tuning using distortion-free watermarks and entropy-gated detection.

DetailsMotivation: Need for reliable safeguards against unauthorized use of proprietary/copyrighted material in LLM fine-tuning, as existing methods require internal model access or have practical limitations.

Method: Rewrites datasets with distortion-free watermarks using private key, then uses entropy-gated procedure that selectively scores high-uncertainty tokens to amplify detection power.

Result: Consistently achieves significant detections (p<0.05) across diverse datasets and model families, supports multi-dataset attribution, and remains robust after continued pretraining.

Conclusion: TRACE provides a practical route to reliable black-box verification of copyrighted dataset usage in LLM fine-tuning.

Abstract: Large Language Models (LLMs) are increasingly fine-tuned on smaller, domain-specific datasets to improve downstream performance. These datasets often contain proprietary or copyrighted material, raising the need for reliable safeguards against unauthorized use. Existing membership inference attacks (MIAs) and dataset-inference methods typically require access to internal signals such as logits, while current black-box approaches often rely on handcrafted prompts or a clean reference dataset for calibration, both of which limit practical applicability. Watermarking is a promising alternative, but prior techniques can degrade text quality or reduce task performance. We propose TRACE, a practical framework for fully black-box detection of copyrighted dataset usage in LLM fine-tuning. TRACE rewrites datasets with distortion-free watermarks guided by a private key, ensuring both text quality and downstream utility. At detection time, we exploit the radioactivity effect of fine-tuning on watermarked data and introduce an entropy-gated procedure that selectively scores high-uncertainty tokens, substantially amplifying detection power. Across diverse datasets and model families, TRACE consistently achieves significant detections (p<0.05), often with extremely strong statistical evidence. Furthermore, it supports multi-dataset attribution and remains robust even after continued pretraining on large non-watermarked corpora. These results establish TRACE as a practical route to reliable black-box verification of copyrighted dataset usage. We will make our code available at: https://github.com/NusIoraPrivacy/TRACE.

[65] Grounding Large Language Models in Clinical Evidence: A Retrieval-Augmented Generation System for Querying UK NICE Clinical Guidelines

Matthew Lewis, Samuel Thio, Richard JB Dobson, Spiros Denaxas

Main category: cs.CL

TL;DR: A RAG system was developed for querying UK NICE clinical guidelines using LLMs, achieving high retrieval performance (MRR 0.814, 99.1% recall in top 10) and significantly improving answer faithfulness by 64.7 percentage points to 99.5%.

DetailsMotivation: The extensive length and volume of NICE clinical guidelines impede their utilization in time-constrained healthcare systems, requiring a system to provide precisely matched information through natural language queries.

Method: Developed a RAG system with hybrid embedding mechanism, evaluated on 10,195 text chunks from 300 guidelines using 7,901 queries, and tested generation on 70 manually curated question-answer pairs.

Result: High retrieval performance (MRR 0.814, 81% recall at first chunk, 99.1% within top 10), RAG-enhanced models showed 64.7 percentage point improvement in faithfulness to 99.5%, perfect context precision of 1, significantly outperforming medical-focused Meditron3-8B (43%).

Conclusion: RAG is an effective, reliable, and scalable approach for applying generative AI in healthcare, enabling cost-effective access to medical guidelines by preventing information fabrication through source-grounded answers.

Abstract: This paper presents the development and evaluation of a Retrieval-Augmented Generation (RAG) system for querying the United Kingdom’s National Institute for Health and Care Excellence (NICE) clinical guidelines using Large Language Models (LLMs). The extensive length and volume of these guidelines can impede their utilisation within a time-constrained healthcare system, a challenge this project addresses through the creation of a system capable of providing users with precisely matched information in response to natural language queries. The system’s retrieval architecture, composed of a hybrid embedding mechanism, was evaluated against a database of 10,195 text chunks derived from three hundred guidelines. It demonstrates high performance, with a Mean Reciprocal Rank (MRR) of 0.814, a Recall of 81% at the first chunk and of 99.1% within the top ten retrieved chunks, when evaluated on 7,901 queries. The most significant impact of the RAG system was observed during the generation phase. When evaluated on a manually curated dataset of seventy question-answer pairs, RAG-enhanced models showed substantial gains in performance. Faithfulness, the measure of whether an answer is supported by the source text, was increased by 64.7 percentage points to 99.5% for the RAG-enhanced O4-Mini model and significantly outperformed the medical-focused Meditron3-8B LLM, which scored 43%. This, combined with a perfect Context Precision score of 1 for all RAG-enhanced models, confirms the system’s ability to prevent information fabrication by grounding its answers in relevant source material. This study thus establishes RAG as an effective, reliable, and scalable approach for applying generative AI in healthcare, enabling cost-effective access to medical guidelines.
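
The reported retrieval metrics, MRR and Recall@k, can be computed as below; the sketch assumes a single gold chunk per query, which is our simplification for illustration.

```python
def mrr_and_recall(rankings, gold, ks=(1, 10)):
    """rankings: {query_id: [chunk_id, ...]} in ranked order;
    gold: {query_id: relevant_chunk_id}. Returns (MRR, {k: recall@k})."""
    reciprocal_ranks, hits = [], {k: 0 for k in ks}
    for qid, ranked in rankings.items():
        if gold[qid] in ranked:
            rank = ranked.index(gold[qid]) + 1
            reciprocal_ranks.append(1.0 / rank)
            for k in ks:
                if rank <= k:
                    hits[k] += 1
        else:
            reciprocal_ranks.append(0.0)
    n = len(rankings)
    return sum(reciprocal_ranks) / n, {k: hits[k] / n for k in ks}
```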

[66] Semantic Differentiation in Speech Emotion Recognition: Insights from Descriptive and Expressive Speech Roles

Rongchen Guo, Vincent Francoeur, Isar Nejadgholi, Sylvain Gagnon, Miodrag Bolic

Main category: cs.CL

TL;DR: This paper distinguishes between descriptive and expressive semantics in speech emotion recognition, showing descriptive semantics align with intended emotions while expressive semantics correlate with evoked emotions.

DetailsMotivation: Speech Emotion Recognition accuracy is constrained by emotional nuances in speech, and distinguishing between different types of semantics could improve human-computer interaction.

Method: Recorded audio clips of participants describing experiences after watching emotional movie segments, collected intended emotion tags, self-rated emotional responses, and valence/arousal scores.

Result: Descriptive semantics align with intended emotions, while expressive semantics correlate with evoked emotions.

Conclusion: The findings inform SER applications in human-AI interaction and enable more context-aware AI systems.

Abstract: Speech Emotion Recognition (SER) is essential for improving human-computer interaction, yet its accuracy remains constrained by the complexity of emotional nuances in speech. In this study, we distinguish between descriptive semantics, which represents the contextual content of speech, and expressive semantics, which reflects the speaker’s emotional state. After watching emotionally charged movie segments, we recorded audio clips of participants describing their experiences, along with the intended emotion tags for each clip, participants’ self-rated emotional responses, and their valence/arousal scores. Through experiments, we show that descriptive semantics align with intended emotions, while expressive semantics correlate with evoked emotions. Our findings inform SER applications in human-AI interaction and pave the way for more context-aware AI systems.

[67] Semantic Similarity in Radiology Reports via LLMs and NER

Beth Pearson, Ahmed Adnan, Zahraa Abdallah

Main category: cs.CL

TL;DR: The paper proposes Llama-EntScore, a semantic similarity scoring method that combines Llama 3.1 and Named-Entity-Recognition with tunable weights to compare radiology reports, achieving better performance than using LLMs or NER alone.

DetailsMotivation: Radiology report evaluation is crucial for training junior radiologists and ensuring diagnostic accuracy. Current AI approaches struggle with specialized domain knowledge, and existing methods like LLMs and NER have limitations in providing accurate semantic similarity feedback.

Method: Proposed Llama-EntScore method that combines Llama 3.1 with Named-Entity-Recognition using tunable weights to emphasize or de-emphasize specific types of differences between preliminary and final radiology reports.

Result: The method achieves 67% exact-match accuracy and 93% accuracy within +/- 1 when compared to radiologist-provided ground truth scores, outperforming both LLMs and NER used independently.

Conclusion: Llama-EntScore provides an effective solution for comparing radiology reports by generating quantitative similarity scores and interpretable feedback, offering valuable guidance for junior radiologists to review and refine their reporting.

Abstract: Radiology report evaluation is a crucial part of radiologists’ training and plays a key role in ensuring diagnostic accuracy. As part of the standard reporting workflow, a junior radiologist typically prepares a preliminary report, which is then reviewed and edited by a senior radiologist to produce the final report. Identifying semantic differences between preliminary and final reports is essential for junior doctors, both as a training tool and to help uncover gaps in clinical knowledge. While AI in radiology is a rapidly growing field, the application of large language models (LLMs) remains challenging due to the need for specialised domain knowledge. In this paper, we explore the ability of LLMs to provide explainable and accurate comparisons of reports in the radiology domain. We begin by benchmarking several LLMs on comparing radiology reports. We then assess a more traditional approach based on Named-Entity-Recognition (NER). However, both approaches exhibit limitations in delivering accurate feedback on semantic similarity. To address this, we propose Llama-EntScore, a semantic similarity scoring method using a combination of Llama 3.1 and NER with tunable weights to emphasise or de-emphasise specific types of differences. Our approach generates a quantitative similarity score for tracking progress, along with an interpretation of the score that aims to offer junior radiologists valuable guidance in reviewing and refining their reporting. We find our method achieves 67% exact-match accuracy and 93% accuracy within +/- 1 when compared to radiologist-provided ground truth scores, outperforming both LLMs and NER used independently. Code is available at: github.com/otmive/llama_reports
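
The tunable-weight combination of an LLM judgment with NER-based differences might look like the following sketch; the difference categories, score scale, and subtractive aggregation are assumptions, since the summary does not specify how the two signals are combined.

```python
def llama_entscore(llm_score, entity_diffs, weights, max_score=4.0):
    """Combine an LLM similarity judgment with NER-derived difference counts.
    llm_score: similarity rating from the LLM (0..max_score).
    entity_diffs: e.g. {"missing_finding": 2, "changed_severity": 1} from NER.
    weights: penalty per difference category (tunable emphasis)."""
    penalty = sum(weights.get(cat, 1.0) * n for cat, n in entity_diffs.items())
    return max(0.0, min(max_score, llm_score - penalty))
```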

[68] SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?

Zhaojun Sun, Xuzhou Zhu, Xuanhe Zhou, Xin Tong, Shuo Wang, Jie Fu, Guoliang Li, Zhiyuan Liu, Fan Wu

Main category: cs.CL

TL;DR: SurveyBench is a quiz-driven evaluation framework that assesses automated survey generation methods, revealing they perform 21% worse than human-written surveys on content quality.

DetailsMotivation: Existing automated survey generation methods (LLM4Survey) fall short of human standards and lack rigorous benchmarks to identify their deficiencies.

Method: Proposed SurveyBench framework with: (1) survey topics from 11,343 arXiv papers and 4,947 high-quality surveys, (2) multifaceted metric hierarchy assessing outline quality, content quality, and non-textual richness, (3) dual-mode evaluation with content-based and quiz-based answerability tests.

Result: SurveyBench effectively challenges existing LLM4Survey approaches, showing they perform on average 21% lower than human-written surveys in content-based evaluation.

Conclusion: SurveyBench provides a rigorous benchmark that reveals significant gaps in automated survey generation methods compared to human-written surveys.

Abstract: Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. While recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), their outputs often fall short of human standards, and there is no rigorous, reader-aligned benchmark for thoroughly revealing their deficiencies. To fill the gap, we propose SurveyBench, a fine-grained, quiz-driven evaluation framework featuring (1) typical survey topics sourced from 11,343 recent arXiv papers and 4,947 corresponding high-quality surveys; (2) a multifaceted metric hierarchy that assesses outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers’ informational needs. Results show SurveyBench effectively challenges existing LLM4Survey approaches (e.g., scoring on average 21% lower than human-written surveys in content-based evaluation).

[69] Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models

Ej Zhou, Caiqi Zhang, Tiancheng Hu, Chengzu Li, Nigel Collier, Ivan Vulić, Anna Korhonen

Main category: cs.CL

TL;DR: This paper reveals that non-English languages suffer from systematically worse confidence calibration in LLMs due to English-centric training bias in the final layer, and proposes training-free methods using late-intermediate layers for better multilingual calibration.

DetailsMotivation: Confidence calibration is crucial for reliable LLM deployment but remains under-explored in multilingual contexts, with non-English languages showing systematically worse calibration due to English-centric training bias.

Method: Conducted large-scale multilingual calibration studies across 6 model families and 100+ languages, analyzed internal representations, and introduced training-free methods including Language-Aware Confidence Ensemble (LACE) that adaptively selects optimal layer ensembles per language.

Result: Found that final layers provide poor multilingual confidence signals due to English bias, while late-intermediate layers offer more reliable and better-calibrated signals across languages.

Conclusion: The study highlights hidden costs of English-centric alignment and provides a path toward more globally equitable LLMs by leveraging intermediate layers rather than final layers for multilingual confidence calibration.

Abstract: Confidence calibration, the alignment of a model’s predicted confidence with its actual accuracy, is crucial for the reliable deployment of Large Language Models (LLMs). However, this critical property remains largely under-explored in multilingual contexts. In this work, we conduct the first large-scale, systematic studies of multilingual calibration across six model families and over 100 languages, revealing that non-English languages suffer from systematically worse calibration. To diagnose this, we investigate the model’s internal representations and find that the final layer, biased by English-centric training, provides a poor signal for multilingual confidence. In contrast, our layer-wise analysis uncovers a key insight: late-intermediate layers consistently offer a more reliable and better-calibrated signal. Building on this, we introduce a suite of training-free methods, including Language-Aware Confidence Ensemble (LACE), which adaptively selects an optimal ensemble of layers for each specific language. Our study highlights the hidden costs of English-centric alignment and offers a new path toward building more globally equitable and trustworthy LLMs by looking beyond the final layer.

[70] EditLens: Quantifying the Extent of AI Editing in Text

Katherine Thai, Bradley Emi, Elyas Masrour, Mohit Iyyer

Main category: cs.CL

TL;DR: EditLens detects AI-edited text using similarity metrics and achieves state-of-the-art performance (F1=94.7% binary, 90.4% ternary classification) in distinguishing human, AI, and mixed writing.

DetailsMotivation: Previous work focused on detecting fully AI-generated text, but many queries involve AI editing human text. This paper addresses the gap in detecting AI-edited content.

Method: Uses lightweight similarity metrics to quantify AI editing magnitude, validates with human annotators, then trains EditLens regression model using similarity metrics as intermediate supervision.

Result: Model achieves F1=94.7% on binary classification and 90.4% on ternary classification. Successfully detects degree of AI changes and analyzes Grammarly edits as case study.

Conclusion: AI-edited text is distinguishable from human-written and AI-generated text. Detection of editing degree has implications for authorship attribution, education, and policy. Models and dataset will be publicly released.

Abstract: A significant proportion of queries to large language models ask them to edit user-provided text, rather than generate new text from scratch. While previous work focuses on detecting fully AI-generated text, we demonstrate that AI-edited text is distinguishable from human-written and AI-generated text. First, we propose using lightweight similarity metrics to quantify the magnitude of AI editing present in a text given the original human-written text and validate these metrics with human annotators. Using these similarity metrics as intermediate supervision, we then train EditLens, a regression model that predicts the amount of AI editing present within a text. Our model achieves state-of-the-art performance on both binary (F1=94.7%) and ternary (F1=90.4%) classification tasks in distinguishing human, AI, and mixed writing. Not only do we show that AI-edited text can be detected, but also that the degree of change made by AI to human writing can be detected, which has implications for authorship attribution, education, and policy. Finally, as a case study, we use our model to analyze the effects of AI-edits applied by Grammarly, a popular writing assistance tool. To encourage further research, we commit to publicly releasing our models and dataset.
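
As a concrete illustration of a "lightweight similarity metric" in the spirit of EditLens, the sketch below scores edit magnitude as one minus a word-level sequence-match ratio via Python's difflib; the paper's actual metrics may differ.

```python
# Hedged sketch: quantify how much AI editing changed a human-written text.
import difflib

def edit_magnitude(original: str, edited: str) -> float:
    """0.0 = unchanged, 1.0 = fully rewritten (word-level match ratio)."""
    ratio = difflib.SequenceMatcher(None, original.split(), edited.split()).ratio()
    return 1.0 - ratio

print(edit_magnitude("the cat sat on the mat", "the cat sat on the mat"))      # 0.0
print(edit_magnitude("the cat sat on the mat", "a feline rested upon a rug"))  # ~1.0
```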

[71] Neural Correlates of Language Models Are Specific to Human Language

Iñigo Parra

Main category: cs.CL

TL;DR: This study validates and strengthens previous findings about correlations between LLM hidden states and fMRI brain responses, addressing concerns about dimensionality, similarity measures, training specificity, and positional encoding.

DetailsMotivation: To test the robustness of previous findings about correlations between large language model hidden states and fMRI brain responses, addressing potential methodological concerns.

Method: Used dimensionality reduction, new similarity measures, comparisons with models trained on non-human language, and analysis of positional encoding effects.

Result: Confirmed previous correlations are robust to dimensionality concerns, validated with new similarity measures, specific to human language-trained models, and dependent on positional encoding.

Conclusion: The results strengthen evidence for representational similarity between LLMs and brain states, supporting the biological plausibility and interpretability of state-of-the-art language models.

Abstract: Previous work has shown correlations between the hidden states of large language models and fMRI brain responses on language tasks. These correlations have been taken as evidence of the representational similarity of these models and brain states. This study tests whether these previous results are robust to several possible concerns. Specifically, this study shows: (i) that the previous results are still found after dimensionality reduction, and thus are not attributable to the curse of dimensionality; (ii) that previous results are confirmed when using new measures of similarity; (iii) that correlations between brain representations and those from models are specific to models trained on human language; and (iv) that the results are dependent on the presence of positional encoding in the models. These results confirm and strengthen the results of previous research and contribute to the debate on the biological plausibility and interpretability of state-of-the-art large language models.

[72] Topic Modeling as Long-Form Generation: Can Long-Context LLMs revolutionize NTM via Zero-Shot Prompting?

Xuan Xu, Haolun Li, Zhongliang Yang, Beilin Chu, Jia Song, Moxuan Xu, Linna Zhou

Main category: cs.CL

TL;DR: This paper proposes a new paradigm for topic modeling using large language models (LLMs) as long-form generation tasks, comparing them against traditional neural topic models (NTMs) to evaluate if LLMs can outperform NTMs through zero-shot prompting.

DetailsMotivation: To explore whether large language models can provide a superior alternative to traditional neural topic models by reframing topic modeling as a long-form generation task in the era of LLMs.

Method: Proposes a simple practical approach using LLMs for topic modeling: sample data subset, generate topics and representative text through prompting, and assign text using keyword matching. Conducts systematic comparison between NTMs and LLMs using zero-shot prompting.

Result: The paper empirically examines the claim that ‘a majority of NTMs are outdated’ by comparing topic quality between LLMs and traditional neural topic models.

Conclusion: The research investigates whether the long-form generation paradigm using LLMs can beat neural topic models, potentially rendering traditional approaches obsolete.

Abstract: Traditional topic models such as neural topic models rely on inference and generation networks to learn latent topic distributions. This paper explores a new paradigm for topic modeling in the era of large language models, framing topic modeling (TM) as a long-form generation task whose definition is updated under this paradigm. We propose a simple but practical approach for implementing LLM-based topic modeling out of the box (sample a data subset, generate topics and representative text with our prompt, and assign text via keyword matching). We then investigate whether the long-form generation paradigm can beat NTMs via zero-shot prompting. We conduct a systematic comparison between NTMs and LLMs in terms of topic quality and empirically examine the claim that “a majority of NTMs are outdated.”

[73] Model-Based Ranking of Source Languages for Zero-Shot Cross-Lingual Transfer

Abteen Ebrahimi, Adam Wiemerslage, Katharina von der Wense

Main category: cs.CL

TL;DR: NN-Rank is a novel algorithm that uses multilingual model representations and unlabeled target-language data to rank source languages for cross-lingual transfer, outperforming state-of-the-art baselines on POS tagging and NER tasks.

DetailsMotivation: To improve cross-lingual transfer by developing a more effective method for ranking source languages, leveraging modern multilingual models and available unlabeled data rather than relying solely on lexical and linguistic features.

Method: NN-Rank algorithm uses hidden representations from pretrained multilingual models and unlabeled target-language data to compute language similarity and rank source languages for transfer learning.

Result: NN-Rank achieves significant improvements over baselines: up to 35.56 NDCG for POS tagging and 18.14 NDCG for NER. It remains competitive using only the Bible corpus and works well with as few as 25 unlabeled examples (achieving 92.8% of full-data performance).

Conclusion: NN-Rank provides an effective and data-efficient approach for source language ranking in cross-lingual transfer, demonstrating strong performance across different domains and with minimal target language data requirements.

Abstract: We present NN-Rank, an algorithm for ranking source languages for cross-lingual transfer, which leverages hidden representations from multilingual models and unlabeled target-language data. We experiment with two pretrained multilingual models and two tasks: part-of-speech tagging (POS) and named entity recognition (NER). We consider 51 source languages and evaluate on 56 and 72 target languages for POS and NER, respectively. When using in-domain data, NN-Rank beats state-of-the-art baselines that leverage lexical and linguistic features, with average improvements of up to 35.56 NDCG for POS and 18.14 NDCG for NER. As prior approaches can fall back to language-level features if target language data is not available, we show that NN-Rank remains competitive using only the Bible, an out-of-domain corpus available for a large number of languages. Ablations on the amount of unlabeled target data show that, for subsets consisting of as few as 25 examples, NN-Rank produces high-quality rankings which achieve 92.8% of the NDCG achieved using all available target data for ranking.
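
A hedged sketch of the core NN-Rank idea: score each candidate source language by how close its hidden representations sit to those of a handful of unlabeled target-language examples, then rank sources by that score. Mean pooling over nearest-neighbor distances and the Euclidean metric here are assumptions, not the paper's exact recipe.

```python
import numpy as np

def rank_sources(target_reps: np.ndarray, source_reps: dict) -> list:
    """target_reps: (n_target, d); source_reps: lang -> (n_source, d)."""
    scores = {}
    for lang, reps in source_reps.items():
        # Distance from each target example to its nearest source neighbor.
        dists = np.linalg.norm(target_reps[:, None, :] - reps[None, :, :], axis=-1)
        scores[lang] = -dists.min(axis=1).mean()  # higher = closer
    return sorted(scores, key=scores.get, reverse=True)

rng = np.random.default_rng(0)
target = rng.normal(size=(25, 8))  # as few as 25 unlabeled examples
sources = {"deu": rng.normal(size=(50, 8)),
           "fra": rng.normal(0.5, 1.0, size=(50, 8))}
print(rank_sources(target, sources))  # languages ordered by representation proximity
```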

[74] FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Léo Boisvert, Massimo Caccia, Jérémy Espinas, Alexandre Aussem, Véronique Eglin, Alexandre Lacoste

Main category: cs.CL

TL;DR: FocusAgent uses a lightweight LLM retriever to extract relevant content from web pages, reducing observation size by over 50% while maintaining performance and improving security against prompt injection attacks.

DetailsMotivation: Web agents face challenges with lengthy web page observations that exceed context limits, increase computational costs, and expose agents to security risks like prompt injection. Existing pruning methods either discard relevant content or keep irrelevant context.

Method: Leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals, to prune noisy and irrelevant content.

Result: FocusAgent matches baseline performance on WorkArena and WebArena benchmarks while reducing observation size by over 50%. A variant significantly reduces success rate of prompt-injection attacks while maintaining task success in attack-free settings.

Conclusion: Targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.

Abstract: Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases the computational cost of processing; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines, while reducing observation size by over 50%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success performance in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.

[75] Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang

Main category: cs.CL

TL;DR: Cache-to-Cache (C2C) enables direct semantic communication between LLMs using KV-cache projection and fusion, avoiding text generation bottlenecks and achieving better performance and speed than text-based communication.

DetailsMotivation: Existing multi-LLM systems communicate through text, which loses rich semantic information and incurs token-by-token generation latency. The authors explore whether LLMs can communicate beyond text to overcome these limitations.

Method: C2C uses a neural network to project and fuse the source model’s KV-cache with the target model’s KV-cache for direct semantic transfer. A learnable gating mechanism selects target layers that benefit from cache communication.

Result: C2C achieves 8.5-10.5% higher average accuracy than individual models and outperforms text communication by 3.0-5.0% with 2.0x speedup in latency.

Conclusion: KV-cache serves as an effective medium for inter-model communication, enabling direct semantic transfer that improves both performance and efficiency compared to text-based communication.

Abstract: Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model’s KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
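
The sketch below illustrates the projection-and-fusion step of C2C in PyTorch: a small network maps the source model's KV tensors into the target's KV space, and a learnable gate controls how much fused cache is injected per layer. The additive fusion rule, scalar gate, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CacheFuser(nn.Module):
    """Toy per-layer projector/fuser for source-to-target KV-cache transfer."""
    def __init__(self, src_dim: int, tgt_dim: int):
        super().__init__()
        self.proj_k = nn.Linear(src_dim, tgt_dim)
        self.proj_v = nn.Linear(src_dim, tgt_dim)
        self.gate = nn.Parameter(torch.zeros(1))  # learnable layer gate

    def forward(self, src_k, src_v, tgt_k, tgt_v):
        g = torch.sigmoid(self.gate)
        return tgt_k + g * self.proj_k(src_k), tgt_v + g * self.proj_v(src_v)

fuser = CacheFuser(src_dim=64, tgt_dim=128)
src_k = src_v = torch.randn(1, 16, 64)   # (batch, seq, src_dim)
tgt_k = tgt_v = torch.randn(1, 16, 128)  # (batch, seq, tgt_dim)
k, v = fuser(src_k, src_v, tgt_k, tgt_v)
print(k.shape, v.shape)  # torch.Size([1, 16, 128]) twice
```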

[76] Self-Anchor: Large Language Model Reasoning via Step-by-step Attention Alignment

Hongxiang Zhang, Yuan Tian, Tianyi Zhang

Main category: cs.CL

TL;DR: Self-Anchor is a novel prompting method that structures reasoning trajectories and aligns LLM attention to relevant inference steps, improving performance on complex reasoning tasks without retraining.

DetailsMotivation: As reasoning chains extend in LLMs, critical intermediate steps and original prompts get buried in context, receiving insufficient attention and causing errors. Current prompting methods struggle with maintaining focus throughout long reasoning processes.

Method: Self-Anchor decomposes reasoning trajectories into structured plans and automatically aligns the model’s attention to the most relevant inference steps, allowing the model to maintain focus throughout generation without requiring retraining.

Result: Self-Anchor outperforms state-of-the-art prompting methods across six benchmarks and significantly reduces the performance gap between “non-reasoning” models and specialized reasoning models.

Conclusion: Self-Anchor has the potential to enable most LLMs to tackle complex reasoning tasks without retraining by leveraging structured reasoning plans and attention alignment.

Abstract: To solve complex reasoning tasks for Large Language Models (LLMs), prompting-based methods offer a lightweight alternative to fine-tuning and reinforcement learning. However, as reasoning chains extend, critical intermediate steps and the original prompt will be buried in the context, receiving insufficient attention and leading to errors. In this paper, we propose Self-Anchor, a novel pipeline that leverages the inherent structure of reasoning to steer LLM attention. Self-Anchor decomposes reasoning trajectories into structured plans and automatically aligns the model’s attention to the most relevant inference steps, allowing the model to maintain focus throughout generation. Our experiment shows that Self-Anchor outperforms SOTA prompting methods across six benchmarks. Notably, Self-Anchor significantly reduces the performance gap between “non-reasoning” models and specialized reasoning models, with the potential to enable most LLMs to tackle complex reasoning tasks without retraining.

[77] Reward Models are Metrics in a Trench Coat

Sebastian Gehrmann

Main category: cs.CL

TL;DR: Reward models and evaluation metrics for LLMs face similar challenges but are developed separately, leading to redundant work. Closer collaboration could improve both areas by addressing common issues like spurious correlations and reward hacking.

DetailsMotivation: The separation between reward model research and evaluation metric development creates redundant terminology and repeated pitfalls, suggesting that collaboration could benefit both fields.

Method: The paper provides an extensive survey comparing reward models and evaluation metrics, analyzes their performance on specific tasks, and identifies common challenges and potential solutions.

Result: The analysis shows that metrics can outperform reward models on certain tasks, and identifies multiple research areas where alignment could improve both reward models and metrics.

Conclusion: Closer collaboration between reward model and evaluation metric research communities can help overcome common challenges and improve both areas through shared insights and methodologies.

Abstract: The emergence of reinforcement learning in post-training of large language models has sparked significant interest in reward models. Reward models assess the quality of sampled model outputs to generate training signals. This task is also performed by evaluation metrics that monitor the performance of an AI model. We find that the two research areas are mostly separate, leading to redundant terminology and repeated pitfalls. Common challenges include susceptibility to spurious correlations, impact on downstream reward hacking, methods to improve data quality, and approaches to meta-evaluation. Our position paper argues that a closer collaboration between the fields can help overcome these issues. To that end, we show how metrics outperform reward models on specific tasks and provide an extensive survey of the two areas. Grounded in this survey, we point to multiple research topics in which closer alignment can improve reward models and metrics in areas such as preference elicitation methods, avoidance of spurious correlations and reward hacking, and calibration-aware meta-evaluation.

[78] Did Translation Models Get More Robust Without Anyone Even Noticing?

Ben Peters, André F. T. Martins

Main category: cs.CL

TL;DR: Recent multilingual MT models and LLMs show surprising robustness to noisy inputs like spelling errors and social media text, outperforming previous models despite not using specific robustness techniques.

DetailsMotivation: To challenge the conventional belief that neural MT models are highly sensitive to noisy inputs, and to evaluate the robustness of modern multilingual MT models and LLMs against various types of noise.

Method: Conducted controlled experiments comparing modern models’ performance on noisy vs clean data, analyzed social media translation experiments, and evaluated source correction techniques for noise mitigation.

Result: Modern multilingual MT models and LLMs demonstrate significantly greater robustness to noise than previous models, even when performing similarly on clean data. LLMs also show improved robustness for social media text translation.

Conclusion: Robustness to many types of noise has substantially increased in modern translation models, challenging previous assumptions about neural MT sensitivity to noisy inputs.

Abstract: Neural machine translation (MT) models achieve strong results across a variety of settings, but it is widely believed that they are highly sensitive to “noisy” inputs, such as spelling errors, abbreviations, and other formatting issues. In this paper, we revisit this insight in light of recent multilingual MT models and large language models (LLMs) applied to machine translation. Somewhat surprisingly, we show through controlled experiments that these models are far more robust to many kinds of noise than previous models, even when they perform similarly on clean data. This is notable because, even though LLMs have more parameters and more complex training processes than past models, none of the open ones we consider use any techniques specifically designed to encourage robustness. Next, we show that similar trends hold for social media translation experiments – LLMs are more robust to social media text. We include an analysis of the circumstances in which source correction techniques can be used to mitigate the effects of noise. Altogether, we show that robustness to many types of noise has increased.

[79] Understanding How CodeLLMs (Mis)Predict Types with Activation Steering

Francesca Lucchetti, Arjun Guha

Main category: cs.CL

TL;DR: LLMs struggle with program semantics understanding, failing on type prediction after minor syntax changes. However, activation steering can restore performance by activating latent type prediction mechanisms that transfer across programming languages.

DetailsMotivation: LLMs are widely used for programming tasks but lack deep semantic understanding, as minor syntactic changes degrade performance. This raises concerns about their true understanding of code semantics.

Method: Created adversarial examples for type prediction tasks, then used activation steering to manipulate internal model activations and guide models toward using their latent type prediction knowledge.

Result: Activation steering successfully restored accurate predictions on adversarial inputs, activating a shared type prediction mechanism across Python and TypeScript that was more effective than in-context prompting.

Conclusion: LLMs do learn generalizable representations of code semantics that transfer across programming languages, but these mechanisms often fail to activate in adversarial scenarios without intervention.

Abstract: Large Language Models (LLMs) are widely used by software engineers for programming tasks. However, research shows that LLMs often lack a deep understanding of program semantics. Even minor changes to syntax, such as renaming variables, can significantly degrade performance across various tasks. In this work, we examine the task of type prediction: given a partially typed program, can a model predict missing type annotations such that the resulting program is more fully typed? We construct a dataset of adversarial examples where models initially predict the correct types, but begin to fail after semantically irrelevant edits. This is problematic, as models should ideally generalize across different syntactic forms of semantically equivalent code. This lack of robustness suggests that models may have a shallow understanding of code semantics. Despite this, we provide evidence that LLMs do, in fact, learn robust mechanisms for type prediction, though these mechanisms often fail to activate in adversarial scenarios. By using activation steering, a method that manipulates a model’s internal activations to guide it toward using latent knowledge, we restore accurate predictions on adversarial inputs. We show that steering successfully activates a type prediction mechanism that is shared by both Python and TypeScript, and is more effective than prompting with in-context examples. Across five different models, our comprehensive evaluation demonstrates that LLMs can learn generalizable representations of code semantics that transfer across programming languages.
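
A minimal sketch of activation steering as described above: derive a steering vector as the difference of mean activations between inputs where the model succeeds and inputs where it fails, then add it to a layer's output via a forward hook. The stand-in linear layer, difference-of-means construction, and scaling factor are assumptions for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

def steering_vector(good_acts: torch.Tensor, bad_acts: torch.Tensor) -> torch.Tensor:
    """Difference of mean activations between succeeding and failing inputs."""
    return good_acts.mean(dim=0) - bad_acts.mean(dim=0)

def steer_hook(vec: torch.Tensor, alpha: float = 1.0):
    """Forward hook that shifts a module's output along the steering direction."""
    def hook(module, inputs, output):
        return output + alpha * vec
    return hook

layer = nn.Linear(8, 8)  # stand-in for a transformer block
vec = steering_vector(torch.randn(10, 8), torch.randn(10, 8))
handle = layer.register_forward_hook(steer_hook(vec))
print(layer(torch.randn(2, 8)).shape)  # steered output: torch.Size([2, 8])
handle.remove()
```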

[80] ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Yue Liu, Bo Li, Xuming Hu, Xiaowen Chu

Main category: cs.CL

TL;DR: ChunkKV is a novel KV cache compression method that treats semantic chunks instead of individual tokens as basic units, preserving contextual integrity and improving long-context LLM inference efficiency.

DetailsMotivation: Existing KV cache compression methods overlook semantic relationships between tokens, causing fragmented context and performance degradation, while KV cache consumes up to 70% of GPU memory during inference.

Method: Uses semantic chunks as compression units with layer-wise index reuse technique that exploits cross-layer similarity of preserved indices to reduce computational overhead.

Result: Outperforms state-of-the-art methods by up to 8.7% in precision while maintaining same compression ratio, improves throughput by 26.5% on benchmarks like LongBench, Needle-In-A-HayStack, GSM8K, and JailbreakV.

Conclusion: Semantic-aware compression significantly enhances both efficiency and performance for long-context LLM inference, providing an effective solution to the memory bottleneck problem.

Abstract: Large Language Models (LLMs) require significant GPU memory when processing long texts, with the key value (KV) cache consuming up to 70% of total memory during inference. Although existing compression methods reduce memory by evaluating the importance of individual tokens, they overlook critical semantic relationships between tokens, resulting in fragmented context and degraded performance. We introduce ChunkKV, which fundamentally reimagines KV cache compression by treating semantic chunks - rather than isolated tokens - as basic compression units. This approach preserves complete linguistic structures and contextual integrity, ensuring that essential meaning is retained even under aggressive compression. Our innovation includes a novel layer-wise index reuse technique that exploits the higher cross-layer similarity of preserved indices in ChunkKV, reducing computational overhead and improving throughput by 26.5%. Comprehensive evaluations on challenging benchmarks: LongBench, Needle-In-A-HayStack, GSM8K, and JailbreakV demonstrate that ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision while maintaining the same compression ratio. These results confirm that semantic-aware compression significantly enhances both efficiency and performance for long-context LLM inference, providing a simple yet effective solution to the memory bottleneck problem. The code is available at https://github.com/NVIDIA/kvpress.
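
To make the chunk-as-unit idea concrete, the sketch below aggregates per-token importance scores into fixed-size chunks and keeps whole chunks rather than scattered tokens. The chunk size, keep ratio, and sum-pooled scoring signal are assumptions, not ChunkKV's exact criteria.

```python
import torch

def select_chunks(token_scores: torch.Tensor, chunk_size: int, keep_ratio: float):
    """token_scores: (seq_len,). Returns indices of tokens in the kept chunks."""
    seq_len = token_scores.shape[0]
    n_chunks = (seq_len + chunk_size - 1) // chunk_size
    padded = torch.zeros(n_chunks * chunk_size)
    padded[:seq_len] = token_scores
    chunk_scores = padded.view(n_chunks, chunk_size).sum(dim=1)
    kept = torch.topk(chunk_scores, max(1, int(n_chunks * keep_ratio))).indices
    idx = torch.cat([torch.arange(c * chunk_size, (c + 1) * chunk_size)
                     for c in kept.tolist()])
    return idx[idx < seq_len].sort().values  # contiguous runs, not lone tokens

print(select_chunks(torch.rand(100), chunk_size=10, keep_ratio=0.3))
```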

[81] On the Diminishing Returns of Complex Robust RAG Training in the Era of Powerful LLMs

Hanxing Ding, Shuchang Tao, Liang Pang, Zihao Wei, Liwei Chen, Kun Xu, Huawei Shen, Xueqi Cheng

Main category: cs.CL

TL;DR: As language models become more powerful, the benefits of complex robust training methods for retrieval-augmented generation (RAG) systems diminish significantly.

DetailsMotivation: To investigate whether sophisticated robust training strategies for RAG systems remain beneficial as language models increase in capacity and power.

Method: Systematic evaluation across multiple model scales and question-answering datasets, analyzing performance improvements from complex training approaches versus simpler methods.

Result: Smaller models benefit significantly from complex training, but more capable models achieve comparable or superior performance with simpler approaches. Stronger models naturally exhibit better confidence calibration, cross-dataset generalization, and effective attention patterns.

Conclusion: As foundation models evolve, complex robust training yields diminishing returns, suggesting simplified RAG pipelines can maintain competitive performance for powerful models.

Abstract: Retrieval-augmented generation (RAG) systems traditionally employ sophisticated training strategies to enhance robustness against retrieval noise. In this work, we investigate a critical question: does the benefit of these complex robust training methods diminish as language models become more powerful? Through systematic evaluation across multiple model scales and question-answering datasets, our analysis reveals a consistent trend: the marginal robustness benefit of sophisticated training strategies decreases substantially as model capacity increases. While smaller models show significant performance improvements from complex document selection and adversarial objectives, more capable models achieve comparable or even superior performance with simpler training approaches. Further investigation demonstrates that stronger models naturally exhibit better confidence calibration, cross-dataset generalization capability, and more effective attention patterns, even under simple training regimes. These findings suggest that as foundation models evolve, the engineering effort invested in complex robust training may yield diminishing returns, indicating that simplified RAG pipelines could suffice for powerful models while maintaining competitive performance.

[82] Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs

Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui

Main category: cs.CL

TL;DR: MHA2MLA enables efficient fine-tuning to convert standard LLMs from Multi-Head Attention to Multi-head Latent Attention, reducing KV cache size by 92.19% with minimal performance loss using only 0.3-0.6% of data.

DetailsMotivation: Standard LLMs using MHA and GQA have significant cost disadvantages compared to MLA's efficient KV cache compression. Adapting pre-trained LLMs to MLA without full pre-training is challenging but meaningful for cost reduction.

Method: Two key components: partial-RoPE (removing RoPE from less important query/key dimensions) and low-rank approximation using joint SVD on pre-trained key/value parameters.

Result: KV cache size reduced by 92.19% for Llama2-7B with only 0.5% performance drop on LongBench. Method works with only 0.3-0.6% of data and integrates with KV cache quantization.

Conclusion: MHA2MLA provides a data-efficient way to transition LLMs to MLA architecture, significantly reducing inference costs while maintaining performance and compatibility with compression techniques.

Abstract: Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) exhibit significant cost disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, we remove RoPE from dimensions of queries and keys that contribute less to the attention scores; for low-rank approximation, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. These carefully designed strategies enable MHA2MLA to recover performance using only a small fraction (0.3% to 0.6%) of the data, significantly reducing inference costs while seamlessly integrating with compression techniques such as KV cache quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 0.5% drop in LongBench performance.
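
The low-rank component can be pictured as a truncated SVD over the stacked key/value projections, so both are expressed through a shared latent, which is the compression MLA exploits. The sketch below is a simplified single-head version; the rank, stacking layout, and factor split are illustrative assumptions.

```python
import torch

def joint_kv_svd(W_k: torch.Tensor, W_v: torch.Tensor, rank: int):
    """W_k, W_v: (d_model, d_head). Returns a shared down-projection plus per-matrix up-projections."""
    W = torch.cat([W_k, W_v], dim=1)        # (d_model, 2 * d_head)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W_down = U[:, :rank] * S[:rank]         # d_model -> rank (shared latent)
    W_up_k, W_up_v = Vh[:rank].split(W_k.shape[1], dim=1)
    return W_down, W_up_k, W_up_v

W_k, W_v = torch.randn(512, 64), torch.randn(512, 64)
W_down, W_up_k, W_up_v = joint_kv_svd(W_k, W_v, rank=16)
rel_err = torch.linalg.matrix_norm(W_k - W_down @ W_up_k) / torch.linalg.matrix_norm(W_k)
print(rel_err)  # reconstruction error of the key projection at rank 16
```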

[83] BottleHumor: Self-Informed Humor Explanation using the Information Bottleneck Principle

EunJeong Hwang, Peter West, Vered Shwartz

Main category: cs.CL

TL;DR: BottleHumor uses the information bottleneck principle to extract relevant world knowledge from vision and language models for explaining multimodal humor in an unsupervised way.

DetailsMotivation: Humor in online communications often relies on multiple modalities and requires diverse knowledge types, but identifying the most useful knowledge remains challenging.

Method: BottleHumor applies the information bottleneck principle to iteratively refine relevant world knowledge from vision and language models for humor explanation generation.

Result: Experiments on three datasets show the method outperforms various baselines.

Conclusion: The method can be adapted for other tasks benefiting from relevant world knowledge elicitation and opens new research directions.

Abstract: Humor is prevalent in online communications and it often relies on more than one modality (e.g., cartoons and memes). Interpreting humor in multimodal settings requires drawing on diverse types of knowledge, including metaphorical, sociocultural, and commonsense knowledge. However, identifying the most useful knowledge remains an open question. We introduce BottleHumor, a method inspired by the information bottleneck principle that elicits relevant world knowledge from vision and language models, which is iteratively refined for generating an explanation of the humor in an unsupervised manner. Our experiments on three datasets confirm the advantage of our method over a range of baselines. Our method can further be adapted to additional tasks that benefit from eliciting and conditioning on relevant world knowledge, opening new research avenues in this direction.

[84] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Pranjal Aggarwal, Sean Welleck

Main category: cs.CL

TL;DR: LCPO is a reinforcement learning method that trains language models to produce chain-of-thought reasoning outputs that adhere to user-specified length constraints, enabling precise control over computational cost and accuracy trade-offs.

DetailsMotivation: Current reasoning language models lack control over chain-of-thought length, making it impossible to allocate test-time compute for desired performance levels. There's a need for methods that can optimize both accuracy and length adherence.

Method: Length Controlled Policy Optimization (LCPO) - a reinforcement learning approach that trains models to satisfy both accuracy objectives and user-specified length constraints given in prompts.

Result: L1 models trained with LCPO achieve smooth trade-offs between computational cost and accuracy across tasks, outperform state-of-the-art length control methods, and unexpectedly develop short chain-of-thought capabilities where even small models can match GPT-4o performance at equal reasoning lengths.

Conclusion: LCPO enables fine-grained control over reasoning length, allowing precise allocation of test-time compute and accuracy, with the additional benefit of discovering efficient short reasoning capabilities in trained models.

Abstract: Reasoning language models have shown an uncanny ability to improve performance at test-time by “thinking longer”, that is, by generating longer chain-of-thought sequences and hence using more compute. However, the length of their chain-of-thought reasoning is not controllable, making it impossible to allocate test-time compute to achieve a desired level of performance. We introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that optimizes for accuracy and adherence to user-specified length constraints. We use LCPO to train L1, a reasoning language model that produces outputs satisfying a length constraint given in its prompt. L1’s length control allows for smoothly trading off computational cost and accuracy on a wide range of tasks, and outperforms the state-of-the-art S1 method for length control. Furthermore, we uncover an unexpected short chain-of-thought capability in models trained with LCPO. Specifically, using LCPO we derive Short Reasoning Models (SRMs), which exhibit similar reasoning patterns as full-length reasoning models, but can generate CoT lengths comparable to non-reasoning models. They demonstrate significant performance gains; for instance, our 1.5B L1 model surpasses GPT-4o at equal reasoning lengths. Overall, LCPO enables precise control over reasoning length, allowing for fine-grained allocation of test-time compute and accuracy. We release code and models at https://www.cmu-l3.github.io/l1
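
At its core, LCPO needs a reward that trades off correctness against adherence to the prompted length budget. The toy reward below captures that shape; the penalty form and coefficient are assumptions, not the paper's exact objective.

```python
def lcpo_reward(correct: bool, gen_len: int, target_len: int, alpha: float = 0.001) -> float:
    """Toy LCPO-style reward: accuracy term minus a length-deviation penalty."""
    accuracy_term = 1.0 if correct else 0.0
    length_term = -alpha * abs(gen_len - target_len)
    return accuracy_term + length_term

# A correct answer that overshoots a 1000-token budget by 200 tokens:
print(lcpo_reward(True, gen_len=1200, target_len=1000))  # 0.8
```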

[85] DatawiseAgent: A Notebook-Centric LLM Agent Framework for Adaptive and Robust Data Science Automation

Ziming You, Yumiao Zhang, Dexuan Xu, Yiwei Lou, Yandong Yan, Wei Wang, Huaming Zhang, Yu Huang

Main category: cs.CL

TL;DR: DatawiseAgent is a notebook-centric LLM agent framework for data science automation that uses finite-state transducers to enable flexible planning, progressive development, and robust error recovery.

DetailsMotivation: Existing LLM agents for data science are constrained by narrow task scopes, limited generalization, and over-reliance on SOTA LLMs.

Method: Introduces a unified interaction representation and multi-stage architecture based on finite-state transducers (FSTs), inspired by human data scientists working in computational notebooks.

Result: Consistently achieves SOTA performance, surpassing baselines like AutoGen and TaskWeaver, with graceful performance degradation under weaker models.

Conclusion: DatawiseAgent demonstrates superior effectiveness, adaptability, robustness, and scalability for data science automation.

Abstract: Existing large language model (LLM) agents for automating data science show promise, but they remain constrained by narrow task scopes, limited generalization across tasks and models, and over-reliance on state-of-the-art (SOTA) LLMs. We introduce DatawiseAgent, a notebook-centric LLM agent framework for adaptive and robust data science automation. Inspired by how human data scientists work in computational notebooks, DatawiseAgent introduces a unified interaction representation and a multi-stage architecture based on finite-state transducers (FSTs). This design enables flexible long-horizon planning, progressive solution development, and robust recovery from execution failures. Extensive experiments across diverse data science scenarios and models show that DatawiseAgent consistently achieves SOTA performance by surpassing strong baselines such as AutoGen and TaskWeaver, demonstrating superior effectiveness and adaptability. Further evaluations reveal graceful performance degradation under weaker or smaller models, underscoring its robustness and scalability.

[86] EEFSUVA: A New Mathematical Olympiad Benchmark

Nicole N Khatibi, Daniil A. Radamovich, Michael P. Brenner

Main category: cs.CL

TL;DR: Current benchmarks overstate LLM mathematical reasoning due to data contamination and narrow problem types. EEFSUVA benchmark from Eastern European Olympiads shows notable performance decline in state-of-the-art LLMs.

DetailsMotivation: To assess whether current benchmarks genuinely capture LLM mathematical reasoning ability, given concerns about data contamination and narrow problem focus in existing benchmarks like IMO.

Method: Introduced EEFSUVA benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and former Soviet Union countries, featuring problems of comparable difficulty to IMO but with nonstandard problem-solving techniques.

Result: Preliminary results show state-of-the-art LLMs exhibit notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks.

Conclusion: Broader evaluation datasets are important for fuller assessment of mathematical reasoning and for guiding future model development.

Abstract: Recent breakthroughs have spurred claims that large language models (LLMs) match gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, primarily drawing from the International Mathematics Olympiad (IMO) and related competitions, may overstate models’ reasoning ability due to potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and countries of the former Soviet Union. These contests feature problems of comparable difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.

[87] From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM

Kshitij Ambilduke, Ben Peters, Sonal Sannigrahi, Anil Keshwani, Tsz Kin Lam, Bruno Martins, André F. T. Martins, Marcely Zanon Boito

Main category: cs.CL

TL;DR: Spire is a speech-augmented language model that translates and transcribes English speech into 10 other languages, integrating speech modality through discretization and continued pre-training with only 42.5K hours of speech data.

DetailsMotivation: To develop a multilingual language model that can handle both speech and text translation tasks while preserving strong text-based performance, using significantly less data than existing speech LMs.

Method: Integrates speech modality into existing multilingual LM via speech discretization, treating discretized speech as an additional translation language during continued pre-training using the framework of multilingual LMs.

Result: Successfully equipped the model with speech capabilities while maintaining strong text performance, achieving this with only 42.5K hours of speech data - significantly less than existing speech LMs.

Conclusion: Discretized speech input integration as an additional language is feasible during LM adaptation, enabling speech-augmented multilingual translation with efficient data usage.

Abstract: We introduce Spire, a speech-augmented language model (LM) capable of both translating and transcribing speech input from English into 10 other languages as well as translating text input in both language directions. Spire integrates the speech modality into an existing multilingual LM via speech discretization and continued pre-training using only 42.5K hours of speech. In particular, we adopt the pretraining framework of multilingual LMs and treat discretized speech input as an additional translation language. This approach not only equips the model with speech capabilities, but also preserves its strong text-based performance. We achieve this using significantly less data than existing speech LMs, demonstrating that discretized speech input integration as an additional language is feasible during LM adaptation. We make our code and models available to the community.

[88] Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations

Noah Y. Siegel, Nicolas Heess, Maria Perez-Ortiz, Oana-Maria Camburu

Main category: cs.CL

TL;DR: Analyzes faithfulness of LLM explanations across 75 models, proposes new metrics (phi-CCT and F-AUROC) to better assess explanation quality, and finds larger models are more faithful.

DetailsMotivation: To determine if LLM explanations are faithful (convey actual decision factors) rather than just plausible-sounding, and to develop better metrics for evaluating explanation faithfulness.

Method: Analyzed 75 models from 13 families using counterfactual faithfulness tests, examined tradeoffs between conciseness and comprehensiveness, and developed two new metrics: phi-CCT (simplified CCT) and F-AUROC (handles imbalanced distributions).

Result: Clear scaling trend: larger and more capable models consistently show higher faithfulness across all metrics. New metrics effectively capture explanation quality without gaming vulnerabilities.

Conclusion: Model size and capability correlate with explanation faithfulness, and the proposed metrics provide robust evaluation of LLM explanation quality while addressing limitations of existing methods.

Abstract: When asked to explain their decisions, LLMs can often give explanations which sound plausible to humans. But are these explanations faithful, i.e. do they convey the factors actually responsible for the decision? In this work, we analyse counterfactual faithfulness across 75 models from 13 families. We analyze the tradeoff between conciseness and comprehensiveness, how correlational faithfulness metrics assess this tradeoff, and the extent to which metrics can be gamed. This analysis motivates two new metrics: the phi-CCT, a simplified variant of the Correlational Counterfactual Test (CCT) which avoids the need for token probabilities while explaining most of the variance of the original test; and F-AUROC, which eliminates sensitivity to imbalanced intervention distributions and captures a model’s ability to produce explanations with different levels of detail. Our findings reveal a clear scaling trend: larger and more capable models are consistently more faithful on all metrics we consider. Our code is available at https://github.com/google-deepmind/corr_faith.

[89] Not a nuisance but a useful heuristic: Outlier dimensions favor frequent tokens in language models

Iuri Macocco, Nora Graichen, Gemma Boleda, Marco Baroni

Main category: cs.CL

TL;DR: Outlier dimensions are dimensions with extreme activations in language models that implement a heuristic for predicting frequent words, which can be contextually blocked by counterbalancing weights.

DetailsMotivation: To understand the function and origin of outlier dimensions in modern language models, particularly their role in implementing token prediction heuristics.

Method: Analyzed outlier dimensions across various language models, traced their function to frequent word prediction heuristics, and investigated model parameters and training dynamics that influence their emergence.

Result: Found that outlier dimensions arise in many modern language models and serve as a specialized mechanism for implementing frequent word prediction, with counterbalancing weights used to block this heuristic when inappropriate.

Conclusion: Outlier dimensions are a specialized mechanism discovered by distinct models to implement useful token prediction heuristics, particularly for frequent word prediction.

Abstract: We study last-layer outlier dimensions, i.e. dimensions that display extreme activations for the majority of inputs. We show that outlier dimensions arise in many different modern language models, and trace their function back to the heuristic of constantly predicting frequent words. We further show how a model can block this heuristic when it is not contextually appropriate, by assigning a counterbalancing weight mass to the remaining dimensions, and we investigate which model parameters boost outlier dimensions and when they arise during training. We conclude that outlier dimensions are a specialized mechanism discovered by many distinct models to implement a useful token prediction heuristic.
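
A small sketch of how one might detect such outlier dimensions empirically: flag last-layer dimensions whose typical activation magnitude sits far above the rest. The z-score criterion and threshold are assumptions; the literature uses several definitions.

```python
import numpy as np

def outlier_dims(acts: np.ndarray, z_thresh: float = 6.0) -> np.ndarray:
    """acts: (n_inputs, hidden_dim) last-layer activations."""
    mean_abs = np.abs(acts).mean(axis=0)                  # per-dimension magnitude
    z = (mean_abs - mean_abs.mean()) / mean_abs.std()
    return np.where(z > z_thresh)[0]

acts = np.random.default_rng(0).normal(size=(1000, 768))
acts[:, 42] += 40.0  # plant one extreme dimension
print(outlier_dims(acts))  # [42]
```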

[90] PropRAG: Guiding Retrieval with Beam Search over Proposition Paths

Jingjin Wang, Jiawei Han

Main category: cs.CL

TL;DR: PropRAG introduces a novel RAG framework that uses context-rich propositions instead of triples and employs LLM-free online beam search to discover multi-step reasoning chains, achieving state-of-the-art performance on complex QA benchmarks.

DetailsMotivation: Standard RAG fails to capture interconnected information needed for multi-hop reasoning, and structured RAG methods using knowledge graphs suffer from context collapse due to triple-based representations.

Method: PropRAG shifts from triples to context-rich propositions and uses an efficient LLM-free online beam search over proposition paths to discover multi-step reasoning chains.

Result: Achieves state-of-the-art zero-shot Recall@5 and F1 scores on 2Wiki, HotpotQA, and MuSiQue benchmarks.

Conclusion: PropRAG advances non-parametric knowledge integration by improving evidence retrieval through richer representation and efficient reasoning path discovery.

Abstract: Retrieval Augmented Generation (RAG) has become the standard approach for equipping Large Language Models (LLMs) with up-to-date knowledge. However, standard RAG, relying on independent passage retrieval, often fails to capture the interconnected nature of information required for complex, multi-hop reasoning. While structured RAG methods attempt to address this using knowledge graphs built from triples, we argue that the inherent context loss of triples (context collapse) limits the fidelity of the knowledge representation. We introduce PropRAG, a novel RAG framework that shifts from triples to context-rich propositions and introduces an efficient, LLM-free online beam search over proposition paths to discover multi-step reasoning chains. By coupling a higher-fidelity knowledge representation with explicit path discovery, PropRAG achieves state-of-the-art zero-shot Recall@5 and F1 scores on 2Wiki, HotpotQA, and MuSiQue, advancing non-parametric knowledge integration by improving evidence retrieval through richer representation and efficient reasoning path discovery.
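
The LLM-free path discovery can be sketched as a beam search that extends each partial path with linked propositions and scores paths by accumulated similarity to the query. The additive scoring and the entity-sharing link structure below are illustrative assumptions.

```python
def beam_search_paths(query_sim: dict, propositions: list, links: dict,
                      beam_width: int = 3, depth: int = 2):
    """query_sim: prop -> similarity to query; links: prop -> set of linked props."""
    beams = [([p], query_sim[p]) for p in propositions]
    beams = sorted(beams, key=lambda b: b[1], reverse=True)[:beam_width]
    for _ in range(depth - 1):
        candidates = []
        for path, score in beams:
            for nxt in links.get(path[-1], set()) - set(path):
                candidates.append((path + [nxt], score + query_sim[nxt]))
        beams = sorted(candidates + beams, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams

sim = {"p1": 0.9, "p2": 0.4, "p3": 0.7, "p4": 0.6}
links = {"p1": {"p3"}, "p3": {"p4"}}
print(beam_search_paths(sim, list(sim), links))  # best path: ['p1', 'p3']
```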

[91] Same evaluation, more tokens: On the effect of input length for machine translation evaluation using Large Language Models

Tobias Domhan, Dawei Zhu

Main category: cs.CL

TL;DR: LLMs show length bias in document-level translation evaluation - longer texts get fewer error spans and worse ranking accuracy. Proposed methods like Focus Sentence Prompting and fine-tuning help mitigate this bias.

DetailsMotivation: Current LLM-based translation evaluation works well at sentence level but struggles with document-level assessment due to length bias, where longer texts receive fewer error annotations and reduced ranking accuracy.

Method: Evaluated several strategies: granularity-aligned prompting, Focus Sentence Prompting (FSP), and fine-tuning approach to align LLMs better with evaluation tasks and mitigate length bias.

Result: Longer texts lead to significantly fewer error spans and reduced system ranking accuracy. Focus Sentence Prompting and fine-tuning methods largely mitigate this length bias.

Conclusion: LLMs can be made more reliable for long-form translation evaluation by using methods like FSP and fine-tuning to address length bias issues.

Abstract: Accurately evaluating machine-translated text remains a long-standing challenge, particularly for long documents. Recent work has shown that large language models (LLMs) can serve as reliable and interpretable sentence-level translation evaluators via MQM error span annotations. With modern LLMs supporting larger context windows, a natural question arises: can we feed entire document translations into an LLM for quality assessment? Ideally, evaluation should be invariant to text length, producing consistent error spans regardless of input granularity. However, our analysis shows that text length significantly impacts evaluation: longer texts lead to fewer error spans and reduced system ranking accuracy. To address this limitation, we evaluate several strategies, including granularity-aligned prompting, Focus Sentence Prompting (FSP), and a fine-tuning approach to better align LLMs with the evaluation task. The latter two methods largely mitigate this length bias, making LLMs more reliable for long-form translation evaluation.
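
A hedged sketch of Focus Sentence Prompting: the model sees the full document pair for context but is asked to annotate error spans for one focus sentence at a time, keeping the effective evaluation granularity at the sentence level. The prompt wording is an assumption, not the paper's exact template.

```python
def focus_sentence_prompt(src_doc: str, mt_doc: str, focus_sentence: str) -> str:
    """Build an MQM-style prompt that scopes annotation to one sentence."""
    return (
        "Source document:\n" + src_doc + "\n\n"
        "Translation:\n" + mt_doc + "\n\n"
        "Annotate MQM error spans ONLY for this sentence of the translation:\n"
        + focus_sentence
    )

print(focus_sentence_prompt("Das Haus ist rot.", "The house is red.",
                            "The house is red."))
```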

[92] Pre-training Limited Memory Language Models with Internal and External Knowledge

Linxi Zhao, Sofian Zalouk, Christian K. Belardi, Justin Lovelace, Jin Peng Zhou, Ryan Thomas Noonan, Dongyoung Go, Kilian Q. Weinberger, Yoav Artzi, Jennifer J. Sun

Main category: cs.CL

TL;DR: LMLMs externalize factual knowledge to databases during pre-training instead of memorizing in parameters, enabling editable and verifiable knowledge bases while maintaining competitive performance.

DetailsMotivation: Neural language models are black-boxes where knowledge is distributed across opaque parameters, making it difficult to inspect, verify, or update specific facts reliably.

Method: Pre-training approach that masks externally retrieved factual values from training loss, teaching models to perform targeted lookups rather than memorizing knowledge in weights.

Result: LMLMs achieve competitive performance compared to significantly larger LLMs on standard benchmarks.

Conclusion: LMLMs offer advantages of explicit, editable, and verifiable knowledge bases while maintaining strong performance.

Abstract: Neural language models are black boxes: both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We introduce Limited Memory Language Models (LMLM), a new class of language models that externalizes factual knowledge to an external database during pre-training rather than memorizing it. Our pre-training approach strategically masks externally retrieved factual values from the training loss, thereby teaching the model to perform targeted lookups rather than relying on memorization in model weights. Our experiments demonstrate that LMLMs achieve competitive performance compared to significantly larger LLMs on standard benchmarks, while offering the advantages of explicit, editable, and verifiable knowledge bases.
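
The core training trick is easy to state in code. Below is a minimal sketch assuming a boolean `fact_mask` marks the token positions whose values were retrieved from the external database (labels assumed already shifted for next-token prediction); those positions are simply excluded from the loss.

```python
# Minimal sketch, assuming fact_mask marks positions of externally retrieved
# factual values; masking them from the loss discourages memorizing facts
# in the weights and encourages learning the lookup behavior instead.
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, labels, fact_mask):
    """logits: [batch, seq, vocab]; labels, fact_mask: [batch, seq]."""
    labels = labels.masked_fill(fact_mask, -100)   # -100 is ignored by cross_entropy
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
```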

[93] Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model

Mehrdad Ghassabi, Pedram Rostami, Hamidreza Baradaran Kashani, Amirhossein Poursina, Zahra Kazemi, Milad Tavakoli

Main category: cs.CL

TL;DR: This paper introduces the first curated Persian medical dataset and fine-tunes a small language model (aya-expanse-8b) using parameter-efficient methods, enabling it to pass the Iranian Basic Medical Science Entrance Exam and improve medical QA accuracy.

DetailsMotivation: Small language models struggle with specialized domains in low-resource languages like Persian, and no curated medical dataset existed for Persian despite numerous medical websites being available.

Method: Created a curated dataset of 20k doctor-patient Q&A pairs and 60% of a 90-million-token crawled corpus from medical magazines, then used parameter-efficient fine-tuning on the aya-expanse-8b baseline model.

Result: The fine-tuned model passed the Iranian Basic Medical Science Entrance Exam (which the baseline failed), improved medical question answering accuracy, and increased Persian-translated MMLU accuracy by an average of 2.67%.

Conclusion: This work demonstrates the potential of using open-access online data to enhance small language models for medical applications in resource-constrained environments, providing a novel solution for Persian medical AI.

Abstract: The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages like Persian. While numerous medical-domain websites exist in Persian, no curated dataset or corpus has been available, making ours the first of its kind. This study introduces a newly curated dataset comprising 20k doctor-patient Q&A pairs and 60% of a 90-million-token crawled corpus from medical magazines. Using a parameter-efficient fine-tuning approach, we enhanced the medical knowledge of the baseline model, aya-expanse-8b. Benchmark evaluations demonstrate that the fine-tuned model achieves improved accuracy in medical question answering and successfully passed the Iranian Basic Medical Science Entrance Exam (IBSEE) in September 2023, which the baseline model did not. Additionally, the fine-tuned model improved Persian-translated MMLU accuracy by an average of 2.67%. This work highlights the potential of leveraging open-access online data to enrich small language models in medical fields, providing a novel solution for Persian medical AI applications suitable for resource-constrained environments. Future research could explore multimodal input to further enhance performance.
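
For readers unfamiliar with parameter-efficient fine-tuning of this kind, here is an illustrative QLoRA setup with Hugging Face `transformers` and `peft`. The repo id and every hyperparameter below are placeholders, not the paper's reported configuration.

```python
# Illustrative QLoRA sketch; hyperparameters and model id are assumptions,
# not the values used in the paper.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/aya-expanse-8b", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only the small LoRA adapters train
```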

[94] NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation

Weiming Wu, Jin Ye, Zi-kang Wang, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo

Main category: cs.CL

TL;DR: NeSyGeo is a neuro-symbolic framework for generating diverse geometric reasoning data using a domain-specific language and symbolic-visual-text pipeline, which significantly improves MLLMs’ geometric reasoning capabilities with minimal data.

DetailsMotivation: Existing geometric reasoning data generation methods face limitations in diversity and numerical generalization, hindering the improvement of multi-modal large language models' geometric reasoning capabilities.

Method: Proposed a neuro-symbolic framework with domain-specific language for plane geometry representation, generative actions in symbolic space, and symbolic-visual-text pipeline for synthesizing symbolic sequences, visual/textual representations, and reasoning paths with reverse search and forward validation.

Result: Generated the NeSyGeo-CoT and NeSyGeo-Caption datasets (100k samples) and the NeSyGeo-Test benchmark. With only 4k samples and two epochs of reinforcement fine-tuning, models achieved improvements of up to +15.8% on MathVision, +8.4% on MathVerse, and +7.3% on GeoQA. A 4B model outperformed an 8B model from the same series.

Conclusion: The NeSyGeo framework effectively addresses diversity and generalization limitations in geometric reasoning data generation, significantly enhancing MLLMs’ geometric reasoning performance with minimal training data.

Abstract: Obtaining large-scale, high-quality reasoning data is crucial for improving the geometric reasoning capabilities of multi-modal large language models (MLLMs). However, existing data generation methods, whether based on predefined templates or constrained symbolic provers, inevitably face diversity and numerical generalization limitations. To address these limitations, we propose NeSyGeo, a novel neuro-symbolic framework for generating geometric reasoning data. First, we propose a domain-specific language grounded in the entity-attributes-relations paradigm to comprehensively represent all components of plane geometry, along with generative actions defined within this symbolic space. We then design a symbolic-visual-text pipeline that synthesizes symbolic sequences, maps them to visual and textual representations, and generates reasoning paths with reverse search and forward validation. Based on this framework, we construct the NeSyGeo-CoT and NeSyGeo-Caption datasets, containing 100k samples, and release a new benchmark, NeSyGeo-Test, for evaluating geometric reasoning abilities in MLLMs. Experiments demonstrate that the proposal significantly and consistently improves the performance of multiple MLLMs under both reinforcement and supervised fine-tuning. With only 4k samples and two epochs of reinforcement fine-tuning, base models achieve improvements of up to +15.8% on MathVision, +8.4% on MathVerse, and +7.3% on GeoQA. Notably, a 4B model can be improved to outperform an 8B model from the same series on geometric reasoning tasks.

[95] EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

Zefang Liu, Yinzhu Quan

Main category: cs.CL

TL;DR: EconWebArena is a benchmark for evaluating autonomous agents on complex economic tasks using real web environments, featuring 360 curated tasks from 82 authoritative economic websites.

DetailsMotivation: To address the lack of benchmarks for evaluating autonomous agents on realistic, multimodal economic tasks that require navigating live websites, interpreting visual content, and extracting time-sensitive data through multi-step workflows.

Method: Constructed by prompting LLMs to generate candidate tasks followed by rigorous human curation. Evaluates multimodal LLMs as web agents through ablation studies assessing visual grounding, plan-based reasoning, and interaction design.

Result: Reveals substantial performance gaps and persistent challenges in grounding, navigation, and multimodal understanding among state-of-the-art models.

Conclusion: EconWebArena serves as a rigorous testbed for economic web intelligence, highlighting the need for improved capabilities in web-based economic reasoning and multimodal interaction.

Abstract: We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.

[96] Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs

Yaniv Nikankin, Dana Arad, Yossi Gandelsman, Yonatan Belinkov

Main category: cs.CL

TL;DR: Vision-Language models perform better on text tasks than analogous visual tasks. The performance gap stems from disjoint computational circuits between modalities, with visual representations aligning with text only in later layers. Patching visual representations from later to earlier layers closes one-third of the performance gap.

DetailsMotivation: To investigate why Vision-Language models show higher accuracy on text tasks compared to analogous visual tasks, and to understand the underlying computational differences between modalities.

Method: Identified and compared task-specific computational circuits (sub-graphs) across modalities. Analyzed data representations and performed representation patching - moving visual data token representations from later layers back into earlier layers.

Result: Circuits are largely disjoint between modalities but implement similar functionalities. Visual representations align with text only in later layers. Patching visual representations from later to earlier layers closes about one-third of the performance gap between modalities across multiple tasks and models.

Conclusion: The multi-modal performance gap in VLMs arises from disjoint computational circuits and late alignment of visual representations. Simple representation patching can significantly reduce this gap without requiring retraining.

Abstract: Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the \textit{circuits} - the task-specific computational sub-graphs - in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.
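
The patching intervention needs two forward passes: one to cache the later-layer representations, one to overwrite the earlier layer at the image-token positions. The sketch below shows that recipe with PyTorch hooks; it assumes a simplified `model.layers` list of blocks returning plain tensors (real VLMs return tuples and need model-specific indexing), so it is a schematic, not the authors' code.

```python
# Two-pass activation-patching sketch. Assumes model.layers[i] returns a
# tensor of shape [batch, seq, d]; adapt for models returning tuples.
import torch

def patch_visual_tokens(model, inputs, img_pos, src_layer, dst_layer):
    cache = {}

    def save(module, args, out):                 # pass 1: record later layer
        cache["h"] = out.detach()

    def overwrite(module, args, out):            # pass 2: patch earlier layer
        out = out.clone()
        out[:, img_pos] = cache["h"][:, img_pos] # later reps into image slots
        return out                               # returned value replaces output

    h1 = model.layers[src_layer].register_forward_hook(save)
    with torch.no_grad():
        model(**inputs)
    h1.remove()

    h2 = model.layers[dst_layer].register_forward_hook(overwrite)
    with torch.no_grad():
        patched = model(**inputs)
    h2.remove()
    return patched
```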

[97] Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models

Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Sarath Chandar, Marie-Jeanne Lesot

Main category: cs.CL

TL;DR: NeuroFaith is a framework that measures faithfulness of LLM self-explanations by identifying key concepts and testing if they actually influence model predictions, with applications in detecting unfaithful explanations and improving faithfulness.

DetailsMotivation: LLMs can generate plausible but unfaithful self-explanations that don't reflect their actual reasoning process, creating trustworthiness issues in AI systems.

Method: Identifies key concepts in explanations and mechanistically tests whether these concepts influence model predictions; develops linear faithfulness probe for detection and steering.

Result: Demonstrates versatility across 2-hop reasoning and classification tasks; enables detection of unfaithful explanations and improvement of faithfulness.

Conclusion: Provides principled approach to evaluate and enhance faithfulness of LLM self-explanations, addressing critical needs for trustworthy AI systems.

Abstract: Large Language Models (LLMs) can generate plausible free text self-explanations to justify their answers. However, these natural language explanations may not accurately reflect the model’s actual reasoning process, indicating a lack of faithfulness. Existing faithfulness evaluation methods rely primarily on behavioral tests or computational block analysis without examining the semantic content of internal neural representations. This paper proposes NeuroFaith, a flexible framework that measures the faithfulness of LLM free text self-explanation by identifying key concepts within explanations and mechanistically testing whether these concepts actually influence the model’s predictions. We show the versatility of NeuroFaith across 2-hop reasoning and classification tasks. Additionally, a linear faithfulness probe based on NeuroFaith is developed to detect unfaithful self-explanations from representation space and improve faithfulness through steering. NeuroFaith provides a principled approach to evaluating and enhancing the faithfulness of LLM free text self-explanations, addressing critical needs for trustworthy AI systems.
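
A linear probe of the kind mentioned is conceptually simple; the sketch below trains a logistic regression on hidden states with faithfulness labels, both of which are stand-ins here (the paper's feature extraction and labeling pipeline are not reproduced).

```python
# Minimal linear-probe sketch: random arrays stand in for hidden states and
# faithfulness labels, which in practice come from NeuroFaith's pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4096))    # stand-in for hidden representations
y = rng.integers(0, 2, size=200)    # 1 = faithful explanation, 0 = unfaithful

probe = LogisticRegression(max_iter=1000).fit(X, y)
direction = probe.coef_[0]          # a candidate direction for steering
print("train accuracy:", probe.score(X, y))
```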

[98] Query-Level Uncertainty in Large Language Models

Lihu Chen, Gerard de Melo, Fabian M. Suchanek, Gaël Varoquaux

Main category: cs.CL

TL;DR: Proposes Internal Confidence, a training-free method to detect LLM knowledge boundaries using query-level uncertainty before token generation, enabling efficient adaptive inference like RAG and model cascading.

DetailsMotivation: LLMs need awareness of their knowledge boundaries to enable adaptive inference mechanisms (RAG, deep thinking, abstention) for developing efficient and trustworthy AI.

Method: Uses Internal Confidence - a training-free approach that leverages self-evaluations across layers and tokens to estimate query-level uncertainty before generating any tokens.

Result: Outperforms baselines in confidence quality while being computationally cheaper. Reduces inference costs in RAG and model cascading while preserving performance.

Conclusion: Internal Confidence effectively detects LLM knowledge boundaries, enabling cost-efficient adaptive inference without compromising overall system performance.

Abstract: It is important for Large Language Models (LLMs) to be aware of the boundary of their knowledge, distinguishing queries they can confidently answer from those that lie beyond their capabilities. Such awareness enables models to perform adaptive inference, such as invoking retrieval-augmented generation (RAG), engaging in slow and deep thinking, or abstaining from answering when appropriate. These mechanisms are key to developing efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via Query-Level Uncertainty, which estimates if a model is capable of answering a given query before generating any tokens, thus avoiding the generation cost. To this end, we propose a novel, training-free method called Internal Confidence, which leverages self-evaluations across layers and tokens to provide a reliable signal of uncertainty. Empirical studies on both factual question answering and mathematical reasoning tasks demonstrate that our Internal Confidence outperforms several baselines in quality of confidence while being computationally cheaper. Furthermore, we demonstrate its benefits in adaptive inference settings, showing that for RAG and model cascading it reduces inference costs while preserving overall performance.
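
Since the paper's exact aggregation is not spelled out in this summary, the sketch below shows one simple way such a signal could be combined: average per-layer, per-token self-evaluation scores, weighting deeper layers slightly more. The weighting scheme and threshold are our assumptions.

```python
# Hedged sketch of aggregating layer/token self-evaluation scores into one
# pre-generation confidence; weighting and threshold are illustrative.
import numpy as np

def internal_confidence(scores):
    """scores: [n_layers, n_query_tokens], e.g. per-layer probabilities that
    the model can answer, read off before any token is generated."""
    layer_w = np.linspace(0.5, 1.0, scores.shape[0])  # trust deeper layers more
    return float(np.average(scores.mean(axis=1), weights=layer_w))

conf = internal_confidence(np.random.rand(32, 12))    # toy scores
# e.g. if conf < 0.6: invoke RAG or cascade to a larger model
```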

[99] When Large Language Models are Reliable for Judging Empathic Communication

Aakriti Kumar, Nalin Poungpeth, Diyi Yang, Erina Farrell, Bruce Lambert, Matthew Groh

Main category: cs.CL

TL;DR: LLMs can generate empathic responses but their judgment reliability is tested against experts and crowdworkers across four psychological frameworks, showing LLMs approach expert-level reliability and outperform crowdworkers.

DetailsMotivation: To investigate how reliably LLMs judge nuances of empathic communication compared to human experts and crowdworkers in emotionally sensitive applications.

Method: Compared 3,150 expert annotations, 2,844 crowd annotations, and 3,150 LLM annotations across four evaluative frameworks from psychology, NLP, and communications applied to 200 real-world conversations about personal problems.

Result: Expert agreement is high but varies by framework sub-components. LLMs consistently approach expert-level benchmarks across all frameworks and exceed crowdworker reliability.

Conclusion: LLMs, when validated with appropriate benchmarks, can support transparency and oversight in emotionally sensitive applications like conversational companions.

Abstract: Large language models (LLMs) excel at generating empathic responses in text-based conversations. But, how reliably do they judge the nuances of empathic communication? We investigate this question by comparing how experts, crowdworkers, and LLMs annotate empathic communication across four evaluative frameworks drawn from psychology, natural language processing, and communications applied to 200 real-world conversations where one speaker shares a personal problem and the other offers support. Drawing on 3,150 expert annotations, 2,844 crowd annotations, and 3,150 LLM annotations, we assess inter-rater reliability between these three annotator groups. We find that expert agreement is high but varies across the frameworks’ sub-components depending on their clarity, complexity, and subjectivity. We show that expert agreement offers a more informative benchmark for contextualizing LLM performance than standard classification metrics. Across all four frameworks, LLMs consistently approach this expert level benchmark and exceed the reliability of crowdworkers. These results demonstrate how LLMs, when validated on specific tasks with appropriate benchmarks, can support transparency and oversight in emotionally sensitive applications including their use as conversational companions.
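
The reliability comparison boils down to agreement statistics between annotator groups on the same items. A toy version with Cohen's kappa is below; the study's actual frameworks and statistics may differ.

```python
# Toy inter-rater agreement between annotator groups; labels are invented.
from sklearn.metrics import cohen_kappa_score

expert = [2, 1, 0, 2, 1, 1, 0, 2]   # ratings on 8 conversations
crowd  = [2, 0, 0, 2, 1, 0, 0, 1]
llm    = [2, 1, 0, 2, 1, 1, 0, 1]

print("expert vs crowd:", cohen_kappa_score(expert, crowd))
print("expert vs LLM:  ", cohen_kappa_score(expert, llm))
```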

[100] A Survey of Pun Generation: Datasets, Evaluations and Methodologies

Yuchen Su, Yonghua Zhu, Ruofan Wang, Zijian Huang, Diana Benavides-Prado, Michael Witbrock

Main category: cs.CL

TL;DR: This paper provides the first comprehensive survey of pun generation in computational linguistics, covering datasets, methods, evaluation metrics, and future research directions.

DetailsMotivation: Despite considerable attention to pun generation in computational linguistics, there is currently no dedicated survey that systematically reviews this specific area, creating a gap in the literature.

Method: The paper conducts a comprehensive review of pun generation datasets and methods across different stages, including conventional approaches, deep learning techniques, and pre-trained language models.

Result: The survey systematically organizes existing work on pun generation, summarizes both automated and human evaluation metrics, and identifies key research challenges in the field.

Conclusion: The paper bridges the research gap by providing a structured overview of pun generation research and proposes promising directions for future work in this area.

Abstract: Pun generation seeks to creatively modify linguistic elements in text to produce humour or evoke double meanings. It also aims to preserve coherence and contextual appropriateness, making it useful in creative writing and entertainment across various media and contexts. Although pun generation has received considerable attention in computational linguistics, there is currently no dedicated survey that systematically reviews this specific area. To bridge this gap, this paper provides a comprehensive review of pun generation datasets and methods across different stages, including conventional approaches, deep learning techniques, and pre-trained language models. Additionally, we summarise both automated and human evaluation metrics used to assess the quality of pun generation. Finally, we discuss the research challenges and propose promising directions for future work.

[101] Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems

Mikey Elmers, Koji Inoue, Divesh Lala, Tatsuya Kawahara

Main category: cs.CL

TL;DR: This study extends voice activity projection (VAP) from dyadic to triadic conversations, showing improved performance for turn-taking prediction in three-person dialogues.

DetailsMotivation: Conventional turn-taking studies mostly focus on dyadic settings, but real-world conversations often involve multiple participants. This work aims to apply VAP to predict upcoming turn-taking in triadic multi-party scenarios.

Method: Trained multiple VAP models on a Japanese triadic dataset where participants discussed various topics, using only acoustic data to predict future voice activity for each speaker.

Result: VAP models trained on triadic conversation outperformed baseline models, though conversation type affected accuracy. This is the first study to extend VAP into triadic conversation.

Conclusion: VAP can be successfully used for turn-taking prediction in triadic dialogue scenarios, establishing a foundation for incorporating this approach into spoken dialogue systems.

Abstract: Turn-taking is a fundamental component of spoken dialogue; however, conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity for each speaker utilizing only acoustic data. This is the first study to extend VAP into triadic conversation. We trained multiple models on a Japanese triadic dataset where participants discussed a variety of topics. We found that VAP models trained on triadic conversation outperformed the baseline across all models, though the type of conversation affected accuracy. This study establishes that VAP can be used for turn-taking in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.

[102] The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models

Marlene Lutz, Indira Sen, Georg Ahnert, Elisa Rogers, Markus Strohmaier

Main category: cs.CL

TL;DR: The study examines how different persona prompt strategies affect LLM simulations of sociodemographic groups, finding that interview-style formats and name-based priming reduce stereotyping and improve alignment, with smaller models sometimes outperforming larger ones.

DetailsMotivation: To address concerns about the fidelity of persona prompting in simulating various sociodemographic groups in LLMs, as how prompts are formulated significantly affects outcomes.

Method: Systematically examined different persona prompt strategies (role adoption formats and demographic priming strategies) using five open-source LLMs across 15 intersectional demographic groups in both open- and closed-ended tasks.

Result: LLMs struggle to simulate marginalized groups, but choice of demographic priming and role adoption strategy significantly impacts portrayal. Interview-style format and name-based priming reduce stereotyping and improve alignment. Smaller models like OLMo-2-7B outperformed larger ones such as Llama-3.3-70B.

Conclusion: The findings offer actionable guidance for designing sociodemographic persona prompts in LLM-based simulation studies.

Abstract: Persona prompting is increasingly used in large language models (LLMs) to simulate views of various sociodemographic groups. However, how a persona prompt is formulated can significantly affect outcomes, raising concerns about the fidelity of such simulations. Using five open-source LLMs, we systematically examine how different persona prompt strategies, specifically role adoption formats and demographic priming strategies, influence LLM simulations across 15 intersectional demographic groups in both open- and closed-ended tasks. Our findings show that LLMs struggle to simulate marginalized groups but that the choice of demographic priming and role adoption strategy significantly impacts their portrayal. Specifically, we find that prompting in an interview-style format and name-based priming can help reduce stereotyping and improve alignment. Surprisingly, smaller models like OLMo-2-7B outperform larger ones such as Llama-3.3-70B. Our findings offer actionable guidance for designing sociodemographic persona prompts in LLM-based simulation studies.
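
To illustrate the contrast between prompting strategies in the paper's taxonomy, here are two schematic templates; the exact wording is invented, not the authors' prompts.

```python
# Two persona-prompt strategies, schematically: explicit demographic labeling
# vs. interview-style format with name-based priming. Wording is our own.
def label_based_prompt(group, question):
    return f"You are a {group}. {question}"

def interview_name_prompt(name, question):
    return (
        f"Interviewer: Thanks for joining us, {name}. {question}\n"
        f"{name}:"
    )

print(label_based_prompt("45-year-old Hispanic woman", "How do you view remote work?"))
print(interview_name_prompt("Maria", "How do you view remote work?"))
```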

[103] Quantum-RAG and PunGPT2: Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language

Jaskaranjeet Singh, Rakesh Thakur

Main category: cs.CL

TL;DR: PunGPT2 is the first open-source Punjabi generative model suite with a 35GB corpus, featuring Pun-RAG retrieval framework and Quantum-RAG for efficient retrieval. It outperforms multilingual baselines and establishes new SOTA for Punjabi language generation.

DetailsMotivation: Low-resource languages like Punjabi remain excluded from NLP advancements, limiting digital access for millions of speakers.

Method: Developed PunGPT2 with optimized tokenizer for Gurmukhi/Shahmukhi scripts, Pun-RAG with FAISS retriever, Pun-Instruct using QLoRA, and innovative Quantum-RAG combining sparse, dense, and quantum kernel embeddings.

Result: Outperforms multilingual baselines (mBERT, mT5, MuRIL, BLOOM) on FLORES-200, IndicGenBench, and PunjabiEval. Quantum-RAG achieves +7.4 Recall@10 over FAISS and +3.5 BLEU over mT5.

Conclusion: Establishes new state-of-the-art results for Punjabi language generation and retrieval, with all resources publicly released to advance NLP for low-resource languages.

Abstract: Despite rapid advances in large language models (LLMs), low-resource languages remain excluded from NLP, limiting digital access for millions. We present PunGPT2, the first fully open-source Punjabi generative model suite, trained on a 35GB corpus covering literature, religious texts, news, social discourse, etc. PunGPT2 captures Punjabi’s syntactic and morphological richness through a tokenizer optimized for Gurmukhi and Shahmukhi scripts. We introduce Pun-RAG, a retrieval-augmented framework integrating PunGPT2 with a FAISS retriever over a curated Punjabi knowledge base, and Pun-Instruct, an instruction-tuned variant using QLoRA for robust zero-shot summarization, translation, and question answering. Our key innovation, Quantum-RAG, fuses sparse, dense, and quantum kernel embeddings for efficient, context-aware retrieval with low memory overhead, marking the first practical quantum-inspired retrieval in a low-resource LLM. Our models outperform multilingual baselines (mBERT, mT5, MuRIL, BLOOM) on FLORES-200, IndicGenBench, and a new PunjabiEval suite. Quantum-RAG yields +7.4 Recall@10 over FAISS and +3.5 BLEU over mT5 on PunjabiEval. We publicly release all training scripts, hyperparameters, evaluation pipelines, the 35GB Punjabi corpus, the PunjabiEval benchmark, and all model weights, establishing new state-of-the-art results for Punjabi language generation and retrieval.
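
Quantum-RAG's fusion of sparse, dense, and quantum-kernel scores might look roughly like the sketch below. The fidelity kernel |⟨u|v⟩|² is a common quantum-inspired choice, but the paper's actual kernel and fusion weights are not specified here; treat all of it as an assumption.

```python
# Sketch of hybrid score fusion; kernel and weights are illustrative.
import numpy as np

def fidelity_kernel(u, v):
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return float(np.dot(u, v) ** 2)   # overlap of amplitude-encoded states

def fused_score(bm25, dense_sim, qk, w=(0.3, 0.4, 0.3)):
    # in practice the three scores must be normalized to comparable ranges
    return w[0] * bm25 + w[1] * dense_sim + w[2] * qk

q, d = np.random.rand(64), np.random.rand(64)
print(fused_score(bm25=7.1, dense_sim=float(q @ d), qk=fidelity_kernel(q, d)))
```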

[104] Better by Comparison: Retrieval-Augmented Contrastive Reasoning for Automatic Prompt Optimization

Juhyeon Lee, Wonduk Seo, Hyunjin An, Seunghyun Lee, Yi Bu

Main category: cs.CL

TL;DR: CRPO is a novel prompt optimization framework that uses contrastive reasoning with retrieval-augmented reference examples to improve LLM prompts by learning from high- and low-quality exemplars.

DetailsMotivation: Most prior work focuses on direct prompt refinement or model fine-tuning, overlooking LLMs' inherent reasoning capability to learn from contrasting examples for more robust and interpretable optimization.

Method: Retrieves top k reference prompt-response pairs from HelpSteer2 dataset and constructs two paradigms: tiered contrastive reasoning (comparing high/medium/low-quality exemplars) and multi-metric contrastive reasoning (analyzing best exemplars along each evaluation dimension).

Result: Experimental results on HelpSteer2 benchmark demonstrate that CRPO significantly outperforms baselines.

Conclusion: The findings highlight the promise of contrastive, retrieval-augmented reasoning for advancing automatic prompt optimization.

Abstract: Automatic prompt optimization has recently emerged as a strategy for improving the quality of prompts used in Large Language Models (LLMs), with the goal of generating more accurate and useful responses. However, most prior work focuses on direct prompt refinement or model fine-tuning, overlooking the potential of leveraging LLMs’ inherent reasoning capability to learn from contrasting examples. In this paper, we present Contrastive Reasoning Prompt Optimization (CRPO), a novel framework that formulates prompt optimization as a retrieval-augmented reasoning process. Our approach retrieves top k reference prompt-response pairs from the HelpSteer2 dataset, an open source collection where each response is annotated for helpfulness, correctness, coherence, complexity, and verbosity, and constructs two complementary optimization paradigms: (1) tiered contrastive reasoning, where the LLM compares high-, medium-, and low-quality exemplars (both prompts and responses) to refine its own generation through reflective reasoning, and (2) multi-metric contrastive reasoning, where the LLM analyzes the best exemplars along each evaluation dimension and integrates their strengths into an optimized prompt. By explicitly contrasting high and low quality exemplars, CRPO enables the model to deduce why certain prompts succeed while others fail, thereby achieving more robust and interpretable optimization. Experimental results on the HelpSteer2 benchmark demonstrate that CRPO significantly outperforms baselines. Our findings highlight the promise of contrastive, retrieval-augmented reasoning for advancing automatic prompt optimization.

[105] Can Language Models Handle a Non-Gregorian Calendar?

Mutsumi Sasaki, Go Kamoda, Ryosuke Takahashi, Kosuke Sato, Kentaro Inui, Keisuke Sakaguchi, Benjamin Heinzerling

Main category: cs.CL

TL;DR: Evaluation of language models’ ability to handle Japanese calendar tasks, revealing limitations in non-Gregorian calendar understanding despite some capability in basic conversions.

DetailsMotivation: Most temporal reasoning research focuses on Gregorian calendar, but many non-Gregorian systems are actively used and reflect cultural conceptions of time. The capability of current LMs to handle such calendars remains unevaluated.

Method: Created datasets for four tasks requiring temporal knowledge and reasoning, evaluated range of English-centric and Japanese-centric LMs on Japanese calendar handling.

Result: Some models can perform calendar conversions, but even Japanese-centric models struggle with Japanese-calendar arithmetic and maintaining consistency across calendars.

Conclusion: Highlights importance of developing LMs better equipped for culture-specific calendar understanding beyond Gregorian systems.

Abstract: Temporal reasoning and knowledge are essential capabilities for language models (LMs). While much prior work has analyzed and improved temporal reasoning in LMs, most studies have focused solely on the Gregorian calendar. However, many non-Gregorian systems, such as the Japanese, Hijri, and Hebrew calendars, are in active use and reflect culturally grounded conceptions of time. If and how well current LMs can accurately handle such non-Gregorian calendars has not been evaluated so far. Here, we present a systematic evaluation of how well open-source LMs handle one such non-Gregorian system: the Japanese calendar. For our evaluation, we create datasets for four tasks that require both temporal knowledge and temporal reasoning. Evaluating a range of English-centric and Japanese-centric LMs, we find that some models can perform calendar conversions, but even Japanese-centric models struggle with Japanese-calendar arithmetic and with maintaining consistency across calendars. Our results highlight the importance of developing LMs that are better equipped for culture-specific calendar understanding.
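
The kind of conversion the benchmark targets is easy to write down explicitly, which underlines why the models' failures are notable. The era boundary dates below are the standard ones; the benchmark's own task format may differ.

```python
# Gregorian-to-Japanese-era conversion, the sort of mapping the tasks test.
from datetime import date

ERAS = [
    (date(2019, 5, 1), "Reiwa"),
    (date(1989, 1, 8), "Heisei"),
    (date(1926, 12, 25), "Showa"),
]

def to_japanese_era(d: date) -> str:
    for start, name in ERAS:
        if d >= start:
            n = d.year - start.year + 1
            return f"{name} {'Gannen' if n == 1 else n}"  # year 1 is "Gannen"
    raise ValueError("date precedes the eras listed")

print(to_japanese_era(date(2023, 10, 6)))   # Reiwa 5
```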

[106] RephQA: Evaluating Readability of Large Language Models in Public Health Question Answering

Weikang Qiu, Tinglin Huang, Ryan Rullo, Yucheng Kuang, Ali Maatouk, S. Raquel Ramos, Rex Ying

Main category: cs.CL

TL;DR: RephQA is a benchmark for evaluating LLM readability in public health question answering, revealing most models fail readability standards despite good reasoning. Token-adapted GRPO shows best results for improving readability.

DetailsMotivation: While LLMs show promise in medical applications, there's a significant bottleneck in generating readable responses for non-medical audiences in public health contexts.

Method: Created RephQA benchmark with 533 expert-reviewed QA pairs across 13 topics, using Flesch-Kincaid grade level and professional score metrics. Evaluated 25 LLMs and tested four readability strategies including standard prompting, chain-of-thought, GRPO, and token-adapted GRPO.

Result: Most LLMs fail to meet readability standards, showing a gap between reasoning ability and effective communication. Token-adapted GRPO achieved the best performance in improving readability.

Conclusion: The work represents progress toward building more practical and user-friendly public health agents by addressing the readability gap in LLM-generated responses.

Abstract: Large Language Models (LLMs) hold promise in addressing complex medical problems. However, while most prior studies focus on improving accuracy and reasoning abilities, a significant bottleneck in developing effective healthcare agents lies in the readability of LLM-generated responses, specifically, their ability to answer public health problems clearly and simply to people without medical backgrounds. In this work, we introduce RephQA, a benchmark for evaluating the readability of LLMs in public health question answering (QA). It contains 533 expert-reviewed QA pairs from 27 sources across 13 topics, and includes a proxy multiple-choice task to assess informativeness, along with two readability metrics: Flesch-Kincaid grade level and professional score. Evaluation of 25 LLMs reveals that most fail to meet readability standards, highlighting a gap between reasoning and effective communication. To address this, we explore four readability-enhancing strategies: standard prompting, chain-of-thought prompting, Group Relative Policy Optimization (GRPO), and a token-adapted variant. Token-adapted GRPO achieves the best results, a step toward building more practical and user-friendly public health agents.
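
One of the two readability metrics, the Flesch-Kincaid grade level, is a fixed formula over word, sentence, and syllable counts: 0.39·(words/sentences) + 11.8·(syllables/words) − 15.59. The sketch below implements it with a rough vowel-group syllable heuristic; production tools count syllables more carefully.

```python
# Flesch-Kincaid grade level with a crude syllable heuristic (illustrative).
import re

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * n_words / sentences + 11.8 * syllables / n_words - 15.59

print(round(fk_grade("Wash your hands often. Cover your cough."), 2))
```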

[107] Exploiting Tree Structure for Credit Assignment in RL Training of LLMs

Hieu Tran, Zonghai Yao, Hong Yu

Main category: cs.CL

TL;DR: TEMPO is a critic-free RL algorithm that uses prefix trees to compute nonparametric prefix values and adds temporal-difference corrections at branching tokens, outperforming PPO and GRPO on reasoning tasks.

DetailsMotivation: Sparse delayed rewards in long reasoning sequences make token-level credit assignment challenging. PPO is complex to train and prone to overfitting, while GRPO ignores branching and spreads rewards poorly.

Method: Prefix-to-Tree (P2T) converts responses into prefix trees to compute nonparametric prefix values. TEMPO augments GRPO with branch-gated temporal-difference corrections derived from the tree structure.

Result: TEMPO outperforms PPO and GRPO on MATH, MedQA, GSM-HARD, AMC23, MedMCQA, and MMLU-Medical benchmarks, achieving higher accuracy with similar training time.

Conclusion: TEMPO provides precise token-level credit assignment without learned value networks, demonstrating superior performance on verifiable-reward reasoning tasks.

Abstract: Reinforcement learning improves LLM reasoning, yet sparse delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA align with this setup, where only a few decision tokens significantly impact the outcome. PPO offers token-level advantages with a learned value model, but it is complex to train both the actor and critic models simultaneously, and it is not easily generalizable, as the token-level values from the critic model can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but spreads a single sequence-level return across tokens and ignores branching. We introduce \textbf{Prefix-to-Tree (P2T)}, a simple procedure that converts a group of responses into a prefix tree and computes \emph{nonparametric} prefix values $V(s)$ by aggregating descendant outcomes. Built on P2T, we propose \textbf{TEMPO} (\emph{\textbf{T}ree-\textbf{E}stimated \textbf{M}ean Prefix Value for \textbf{P}olicy \textbf{O}ptimization}), a critic-free algorithm that augments the group-relative outcome signal of GRPO with \emph{branch-gated} temporal-difference corrections derived from the tree. At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO reduces to GRPO; at branching tokens, it supplies precise token-level credit without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy with roughly the same wall-clock time.
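
The P2T value estimate has a direct reading: V(prefix) is the mean final reward over all sampled responses sharing that prefix. The sketch below implements exactly that aggregation (variable names are ours; the branch-gated TD correction built on top is omitted).

```python
# Nonparametric prefix values from a group of sampled responses:
# V(prefix) = mean final reward of responses sharing that prefix.
from collections import defaultdict

def prefix_values(responses, rewards):
    """responses: list of token-id lists drawn for one prompt;
    rewards: verifiable 0/1 outcome of each response."""
    total, count = defaultdict(float), defaultdict(int)
    for toks, r in zip(responses, rewards):
        for t in range(len(toks) + 1):
            p = tuple(toks[:t])
            total[p] += r
            count[p] += 1
    return {p: total[p] / count[p] for p in total}

V = prefix_values([[5, 9, 2], [5, 9, 7], [5, 3, 1]], [1.0, 0.0, 1.0])
print(V[(5,)], V[(5, 9)])   # 2/3 before the branch, 1/2 down the 9-branch
```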

[108] When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models

Yingming Zheng, Hanqi Li, Kai Yu, Lu Chen

Main category: cs.CL

TL;DR: Long-context supervised fine-tuning (SFT) unexpectedly improves short-context performance in LLMs, contrary to long-context pretraining effects. Analysis reveals both MHA and FFN components benefit independently, with long-context SFT promoting contextual knowledge while short-context SFT favors parametric knowledge.

DetailsMotivation: To understand how SFT data length influences LLM behavior on short-context tasks, as real-world applications increasingly demand longer context windows and the effects of data length in continued pretraining have been studied but their implications for SFT remain unclear.

Method: Systematically investigate SFT data length effects, decouple and analyze Multi-Head Attention (MHA) and Feed-Forward Network (FFN) components, study their interaction, and examine knowledge preference bias between contextual and parametric knowledge.

Result: Long-context SFT improves short-context performance (counterintuitive finding), both MHA and FFN independently benefit from long-context SFT, and hybrid training mitigates the knowledge preference bias.

Conclusion: Hybrid training offers explainable guidance for fine-tuning LLMs by balancing contextual and parametric knowledge preferences, making exclusive reliance on long-context SFT suboptimal.

Abstract: Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks. As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data has become a common approach. While the effects of data length in continued pretraining have been extensively studied, their implications for SFT remain unclear. In this work, we systematically investigate how SFT data length influences LLM behavior on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining. To uncover the underlying mechanisms of this phenomenon, we first decouple and analyze two key components, Multi-Head Attention (MHA) and Feed-Forward Network (FFN), and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge preference bias: long-context SFT promotes contextual knowledge, while short-context SFT favors parametric knowledge, making exclusive reliance on long-context SFT suboptimal. Finally, we demonstrate that hybrid training mitigates this bias, offering explainable guidance for fine-tuning LLMs.

[109] jina-reranker-v3: Last but Not Late Interaction for Listwise Document Reranking

Feng Wang, Yuqing Li, Han Xiao

Main category: cs.CL

TL;DR: jina-reranker-v3 is a 0.6B-parameter multilingual listwise reranker that uses a “last but not late” interaction to achieve state-of-the-art BEIR performance.

DetailsMotivation: To improve document reranking by enabling rich interactions between query and candidate documents before embedding extraction, addressing limitations of late interaction models like ColBERT.

Method: Applies causal attention between query and all candidate documents in the same context window, then extracts contextual embeddings from each document’s final token.

Result: Achieves state-of-the-art BEIR performance with 61.94 nDCG@10 while being significantly smaller than comparable models.

Conclusion: The “last but not late” interaction approach enables superior reranking performance with a more efficient model size.

Abstract: jina-reranker-v3 is a 0.6B-parameter multilingual listwise reranker that introduces a novel “last but not late” interaction. Unlike late interaction models like ColBERT that encode documents separately before multi-vector matching, our approach applies causal attention between the query and all candidate documents in the same context window, enabling rich interactions before extracting contextual embeddings from each document’s final token. The new model achieves state-of-the-art BEIR performance with 61.94 nDCG@10 while being significantly smaller than other models with comparable performance.

[110] Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang

Main category: cs.CL

TL;DR: Learn2PD is a framework that trains a lightweight filter model to enable adaptive parallel decoding in diffusion-based LLMs, achieving significant speedup without performance loss.

DetailsMotivation: Autoregressive decoding in LLMs is slow due to sequential processing, and existing parallel decoding methods use fixed heuristics that don't adapt to input characteristics, leading to suboptimal speed-quality trade-offs.

Method: Proposes Learn2PD with a lightweight filter model that predicts when token predictions match final output, and End-of-Text Prediction to detect completion. The filter is learned post-training with minimal computation.

Result: Achieves up to 22.58× speedup without performance drop on LLaDA benchmark, and up to 57.51× when combined with KV-Cache.

Conclusion: The framework enables flexible and dynamic parallel decoding that adapts to input characteristics, significantly improving inference throughput while maintaining quality.

Abstract: Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through iterative denoising. However, current parallel decoding strategies rely on fixed, input-agnostic heuristics (e.g., confidence thresholds), which fail to adapt to input-specific characteristics, resulting in suboptimal speed-quality trade-offs across diverse NLP tasks. In this work, we explore a more flexible and dynamic approach to parallel decoding. We propose Learning to Parallel Decode (Learn2PD), a framework that trains a lightweight and adaptive filter model to predict, for each token position, whether the current prediction matches the final output. This learned filter approximates an oracle parallel decoding strategy that unmasks tokens only when correctly predicted. Importantly, the filter model is learned in a post-training manner, requiring only a small amount of computation to optimize it (minute-level GPU time). Additionally, we introduce End-of-Text Prediction (EoTP) to detect decoding completion at the end of sequence, avoiding redundant decoding of padding tokens. Experiments on the LLaDA benchmark demonstrate that our method achieves up to 22.58$\times$ speedup without any performance drop, and up to 57.51$\times$ when combined with KV-Cache.

[111] Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations

Pengzhou Cheng, Lingzhong Dong, Zeng Wu, Zongru Wu, Xiangru Tang, Chengwei Qin, Zhuosheng Zhang, Gongshen Liu

Main category: cs.CL

TL;DR: Agent-ScanKit is a probing framework that reveals multimodal GUI agents rely more on memorization than systematic reasoning, showing limited generalization in complex tasks.

DetailsMotivation: Existing multimodal agents for GUI interaction show limited reliability in complex or out-of-domain tasks, raising concerns about whether they engage in spurious reasoning rather than genuine understanding.

Method: Proposed Agent-ScanKit framework with three orthogonal probing paradigms: visual-guided, text-guided, and structure-guided perturbations to quantify memorization vs reasoning contributions without accessing model internals.

Result: Evaluation across 5 GUI benchmarks with 18 multimodal agents shows mechanical memorization often outweighs systematic reasoning, with models functioning mainly as retrievers of training-aligned knowledge with limited generalization.

Conclusion: Findings highlight the necessity for robust reasoning modeling in multimodal agents for real-world scenarios and provide insights for developing more reliable multimodal agents.

Abstract: Although numerous strategies have recently been proposed to enhance the autonomous interaction capabilities of multimodal agents in graphical user interface (GUI), their reliability remains limited when faced with complex or out-of-domain tasks. This raises a fundamental question: Are existing multimodal agents reasoning spuriously? In this paper, we propose \textbf{Agent-ScanKit}, a systematic probing framework to unravel the memory and reasoning capabilities of multimodal agents under controlled perturbations. Specifically, we introduce three orthogonal probing paradigms: visual-guided, text-guided, and structure-guided, each designed to quantify the contributions of memorization and reasoning without requiring access to model internals. In five publicly available GUI benchmarks involving 18 multimodal agents, the results demonstrate that mechanical memorization often outweighs systematic reasoning. Most of the models function predominantly as retrievers of training-aligned knowledge, exhibiting limited generalization. Our findings underscore the necessity of robust reasoning modeling for multimodal agents in real-world scenarios, offering valuable insights toward the development of reliable multimodal agents.

[112] GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models

Mariam Mahran, Katharina Simbeck

Main category: cs.CL

TL;DR: LLMs paired with sparse autoencoders can interpret both model behavior and training data structures, themes, and biases.

DetailsMotivation: To understand model representations and the data internalized by LLMs trained on massive, uncurated corpora.

Method: Train a GPT-style transformer on Jane Austen novels and apply sparse autoencoders to hidden states across multiple layers.

Result: Uncovered sparse, interpretable features reflecting key narratives and concepts like gender, class, and societal duty in the corpus.

Conclusion: LLMs combined with SAEs serve as scalable probes for corpus exploration, bias discovery, and model interpretability at scale.

Abstract: As large language models (LLMs) are increasingly trained on massive, uncurated corpora, understanding both model representations and the data they internalize has become a major challenge. In this work, we show that pairing LLMs with sparse autoencoders (SAEs) enables interpretation not only of model behavior but also of the deeper structures, themes, and biases embedded in the training data. We train a GPT-style transformer model exclusively on the novels of Jane Austen, a corpus rich in social constructs and narrative patterns. We then apply SAEs to hidden states across multiple layers, uncovering sparse, interpretable features that reflect the key narratives and concepts present in the corpus, including gender, class, and societal duty. Our findings demonstrate that LLMs combined with SAEs can act as scalable probes into complex datasets, offering a new path for corpus exploration, bias discovery, and model interpretability at scale.
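
A sparse autoencoder of the kind applied here is small enough to sketch in full: an overcomplete encoder with a ReLU and an L1 penalty that induces the sparse, interpretable features described above. Dimensions and the penalty weight are illustrative.

```python
# Minimal sparse autoencoder over hidden states; L1 on activations drives
# sparsity. Sizes and the 1e-3 coefficient are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_features):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def forward(self, h):
        z = torch.relu(self.enc(h))    # sparse feature activations
        return self.dec(z), z

sae = SparseAutoencoder(d_model=256, d_features=2048)
h = torch.randn(32, 256)               # a batch of hidden states
h_hat, z = sae(h)
loss = ((h - h_hat) ** 2).mean() + 1e-3 * z.abs().mean()
```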

[113] Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey

Qiyuan Liu, Hao Xu, Xuhong Chen, Wei Chen, Yee Whye Teh, Ning Miao

Main category: cs.CL

TL;DR: A systematic survey of reward models (RMs) for LLM reasoning, covering their architectures, training methods, applications in inference, data synthesis, and RL fine-tuning, plus open research questions.

DetailsMotivation: Reward models are crucial for improving LLM reasoning performance by providing training signals for RL fine-tuning and selecting optimal outputs during inference, but there's a need for comprehensive understanding of their applications and challenges.

Method: Systematic review of RM fundamentals (architectures, training, evaluation) and exploration of three key applications: inference guidance/output selection, data synthesis/self-improvement, and RL-based fine-tuning.

Result: Comprehensive framework for understanding RMs’ role in LLM reasoning, identifying their core applications and providing empirical insights into their deployment.

Conclusion: The paper provides actionable insights for effective RM deployment and identifies critical open questions regarding RM selection, generalization, evaluation, and enhancement for advancing LLM reasoning capabilities.

Abstract: Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals to finetune LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates during inference. In this paper, we provide a systematic introduction to RMs, along with a comprehensive survey of their applications in LLM reasoning. We first review fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. Then, we explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based finetuning. Finally, we discuss critical open questions regarding the selection, generalization, evaluation, and enhancement of RMs, based on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.

cs.CV

[114] MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding

Jingyuan Deng, Yujiu Yang

Main category: cs.CV

TL;DR: MaskCD is a method that uses image head masking in large vision-language models to create contrastive samples for reducing hallucinations while preserving general capabilities.

DetailsMotivation: To address hallucinations in LVLMs where models generate content contradictory to input visuals/text, overcoming limitations of existing contrastive decoding (difficulty constructing samples) and attention manipulation (instability) methods.

Method: Proposes image head Masked Contrastive Decoding (MaskCD) that masks ‘image heads’ in LVLMs to construct contrastive samples for contrastive decoding.

Result: Evaluated on LLaVA-1.5-7b and Qwen-VL-7b using benchmarks like CHAIR, POPE, AMBER and MME, showing effective hallucination reduction while retaining general model capabilities.

Conclusion: MaskCD successfully alleviates hallucinations in LVLMs without compromising their general performance, providing a stable alternative to existing methods.

Abstract: Large vision-language models (LVLMs) have shown remarkable performance in visual-language understanding for downstream multimodal tasks. While their capabilities are improving, problems emerge simultaneously. Among these problems, hallucination, the phenomenon where LVLMs generate content that contradicts their input visual and textual contents, has attracted much attention. Many approaches have been proposed to deal with this issue, such as contrastive decoding and attention manipulation. However, contrastive decoding methods struggle in constructing appropriate contrastive samples, and attention manipulation methods are highly sensitive, lacking stability. In this work, we propose image head Masked Contrastive Decoding (MaskCD). Our approach utilizes the “image heads” in LVLMs, masking them to construct contrastive samples for contrastive decoding. We evaluated MaskCD on LLaVA-1.5-7b and Qwen-VL-7b, using various benchmarks such as CHAIR, POPE, AMBER and MME. The results demonstrate that MaskCD effectively alleviates the phenomenon of hallucinations and retains the general capabilities of LVLMs. Corresponding resources can be found at: https://github.com/Deng-Jingyuan/MaskCD .
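
At decoding time, contrastive decoding of this family combines logits from the full model with logits from the degraded (here, image-head-masked) forward pass. The combination below is the generic form; the coefficient and MaskCD's exact formulation are assumptions.

```python
# Generic contrastive-decoding combination; alpha and the exact form used by
# MaskCD are illustrative assumptions.
import torch

def contrastive_logits(logits_full, logits_masked, alpha=1.0):
    # amplify what the full model knows relative to the masked variant
    return (1 + alpha) * logits_full - alpha * logits_masked

next_token = torch.argmax(
    contrastive_logits(torch.randn(1, 32000), torch.randn(1, 32000)), dim=-1
)
```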

[115] AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation

Jeongsoo Choi, Ji-Hoon Kim, Kim Sung-Bin, Tae-Hyun Oh, Joon Son Chung

Main category: cs.CV

TL;DR: AlignDiT is a multimodal aligned diffusion transformer that synthesizes high-quality speech from text, video, and reference audio inputs, addressing limitations in speech intelligibility, synchronization, and naturalness.

DetailsMotivation: Multimodal-to-speech generation has wide applications in film production, dubbing, and virtual avatars, but existing methods struggle with speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to reference speakers.

Method: AlignDiT uses a multimodal aligned diffusion transformer with three strategies to align multimodal representations and introduces a novel multimodal classifier-free guidance mechanism to adaptively balance information from each modality during synthesis.

Result: AlignDiT significantly outperforms existing methods across multiple benchmarks in quality, synchronization, and speaker similarity, and shows strong generalization across various multimodal tasks including video-to-speech synthesis and visual forced alignment.

Conclusion: AlignDiT achieves state-of-the-art performance in multimodal-to-speech generation, demonstrating effective multimodal alignment and adaptive guidance for producing accurate, synchronized, and natural-sounding speech.

Abstract: In this paper, we address the task of multimodal-to-speech generation, which aims to synthesize high-quality speech from multiple input modalities: text, video, and reference audio. This task has gained increasing attention due to its wide range of applications, such as film production, dubbing, and virtual avatars. Despite recent progress, existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker. To address these challenges, we propose AlignDiT, a multimodal Aligned Diffusion Transformer that generates accurate, synchronized, and natural-sounding speech from aligned multimodal inputs. Built upon the in-context learning capability of the DiT architecture, AlignDiT explores three effective strategies to align multimodal representations. Furthermore, we introduce a novel multimodal classifier-free guidance mechanism that allows the model to adaptively balance information from each modality during speech synthesis. Extensive experiments demonstrate that AlignDiT significantly outperforms existing methods across multiple benchmarks in terms of quality, synchronization, and speaker similarity. Moreover, AlignDiT exhibits strong generalization capability across various multimodal tasks, such as video-to-speech synthesis and visual forced alignment, consistently achieving state-of-the-art performance. The demo page is available at https://mm.kaist.ac.kr/projects/AlignDiT.

[116] Exploring OCR-augmented Generation for Bilingual VQA

JoonHo Lee, Sunho Park

Main category: cs.CV

TL;DR: This paper introduces KLOCR, a bilingual OCR system trained on 100M instances, and KOCRBench for Korean VQA evaluation, showing that OCR-augmented generation significantly improves performance in multilingual vision-language tasks.

DetailsMotivation: To explore OCR-augmented generation with Vision Language Models for multilingual applications, particularly focusing on Korean and English, and to address the lack of resources for Korean VQA tasks.

Method: Trained KLOCR, a bilingual OCR baseline on 100M instances, curated KOCRBench for Korean VQA evaluation, and analyzed different prompting methods for OCR-augmented VLMs.

Result: Extensive experiments demonstrate that OCR-extracted text significantly boosts performance across both open source and commercial models in bilingual VQA tasks.

Conclusion: The work provides new insights into OCR-augmented generation for bilingual VQA and releases the model, code, and data publicly to support further research in this domain.

Abstract: We investigate OCR-augmented generation with Vision Language Models (VLMs), exploring tasks in Korean and English toward multilingualism. To support research in this domain, we train and release KLOCR, a strong bilingual OCR baseline trained on 100M instances to augment VLMs with OCR ability. To complement existing VQA benchmarks, we curate KOCRBench for Korean VQA, and analyze different prompting methods. Extensive experiments show that OCR-extracted text significantly boosts performance across open source and commercial models. Our work offers new insights into OCR-augmented generation for bilingual VQA. Model, code, and data are available at https://github.com/JHLee0513/KLOCR.

[117] HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng, Yufan Chen, Kailun Yang, Jiamin Wu, Chongqing Hao, Rainer Stiefelhagen

Main category: cs.CV

TL;DR: This paper introduces a new task of textual reference-guided human action segmentation in multi-person settings, presents the RHAS133 dataset, and proposes HopaDIFF framework that achieves state-of-the-art performance.

DetailsMotivation: Existing action segmentation methods focus on single-person activities with fixed action sequences, overlooking multi-person scenarios where textual descriptions specify target persons for segmentation.

Method: Proposed HopaDIFF framework with holistic-partial aware Fourier-conditioned diffusion, using cross-input gate attentional xLSTM for enhanced holistic-partial long-range reasoning and Fourier condition for fine-grained control in action segmentation generation.

Result: HopaDIFF achieves state-of-the-art results on the new RHAS133 dataset across diverse evaluation settings, significantly outperforming existing methods that showed limited performance and poor target person cue aggregation.

Conclusion: The work pioneers textual reference-guided action segmentation in multi-person settings, provides the first dataset for this task, and demonstrates the effectiveness of the proposed HopaDIFF framework through superior performance.

Abstract: Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, i.e., RHAS133, built from 133 movies and annotated with 137 fine-grained actions with 33h video data, together with textual descriptions for this new task. Benchmarking existing action segmentation methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To address this, we propose a holistic-partial aware Fourier-conditioned diffusion framework, i.e., HopaDIFF, leveraging a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition to introduce more fine-grained control to improve the action segmentation generation. HopaDIFF achieves state-of-the-art results on RHAS133 in diverse evaluation settings. The dataset and code are available at https://github.com/KPeng9510/HopaDIFF.

[118] Oracle-RLAIF: An Improved Fine-Tuning Framework for Multi-modal Video Models through Reinforcement Learning from Ranking Feedback

Derek Shi, Ruben Glatt, Christine Klymko, Shubham Mohole, Hongjun Choi, Shashank Kushwaha, Sam Sakla, Felipe Leno da Silva

Main category: cs.CV

TL;DR: Oracle-RLAIF is a novel framework that replaces expensive trained reward models with a general Oracle ranker for reinforcement learning from AI feedback in video-language models, using rank-based optimization instead of scalar scoring.

DetailsMotivation: To address the high cost of gathering human feedback and the expense of training specialized reward models for large video-language models, making fine-tuning more cost-effective and flexible.

Method: Proposes Oracle-RLAIF framework with a general Oracle ranker that ranks candidate responses instead of scoring them, and introduces GRPO_rank - a rank-based loss function based on Group Relative Policy Optimization that optimizes ordinal feedback with rank-aware advantages.

Result: Oracle-RLAIF consistently outperforms leading VLMs using existing fine-tuning methods across various video comprehension benchmarks.

Conclusion: Oracle-RLAIF enables flexible and data-efficient frameworks for aligning large multi-modal video models through reinforcement learning from rank rather than score.

Abstract: Recent advances in large video-language models (VLMs) rely on extensive fine-tuning techniques that strengthen alignment between textual and visual comprehension. Leading pipelines typically pair supervised fine-tuning (SFT) with reinforcement learning from preference data to enhance video comprehension. However, as VLMs scale in parameter size, so does the cost of gathering enough human feedback. To make fine-tuning more cost-effective, recent frameworks explore reinforcement learning with AI feedback (RLAIF), which replace human preference with AI as a judge. Current RLAIF frameworks rely on a specialized reward model trained with video narratives to create calibrated scalar rewards – an expensive and restrictive pipeline. We propose Oracle-RLAIF, a novel framework that replaces the trained reward model with a more general Oracle ranker which acts as a drop-in model ranking candidate model responses rather than scoring them. Alongside Oracle-RLAIF, we introduce $GRPO_{rank}$, a novel rank-based loss function based on Group Relative Policy Optimization (GRPO) that directly optimizes ordinal feedback with rank-aware advantages. Empirically, we demonstrate that Oracle-RLAIF consistently outperforms leading VLMs using existing fine-tuning methods when evaluated across various video comprehension benchmarks. Oracle-RLAIF paves the path to creating flexible and data-efficient frameworks for aligning large multi-modal video models with reinforcement learning from rank rather than score.

[119] AudioStory: Generating Long-Form Narrative Audio with Large Language Models

Yuxin Guo, Teng Wang, Yuying Ge, Shijie Ma, Yixiao Ge, Wei Zou, Ying Shan

Main category: cs.CV

TL;DR: AudioStory is a unified framework that integrates LLMs with TTA systems to generate coherent long-form audio narratives by decomposing complex queries into temporally ordered sub-tasks with contextual cues.

DetailsMotivation: Address the gap in current TTA systems that excel at short audio clips but struggle with long-form narrative audio requiring temporal coherence and compositional reasoning.

Method: Uses LLMs to decompose narrative queries into temporally ordered sub-tasks, employs decoupled bridging mechanism with bridging query for intra-event alignment and residual query for cross-event coherence, and features end-to-end training.

Result: Superior performance on both single-audio and narrative audio generation, surpassing prior TTA baselines in instruction-following ability and audio fidelity.

Conclusion: AudioStory effectively bridges the gap in long-form audio generation by leveraging LLM reasoning capabilities and unified end-to-end training, establishing a new benchmark for narrative audio synthesis.

Abstract: Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, i.e., a bridging query for intra-event semantic alignment and a residual query for cross-event coherence preservation. (2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity. Our code is available at https://github.com/TencentARC/AudioStory

[120] PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction

Qiao Feng, Yiming Huang, Yufu Wang, Jiatao Gu, Lingjie Liu

Main category: cs.CV

TL;DR: PhysHMR is a unified framework that learns a visual-to-action policy for humanoid control in physics simulators, enabling physically plausible human motion reconstruction from monocular videos without the error accumulation of traditional two-stage approaches.

DetailsMotivation: Existing methods for human motion reconstruction from monocular videos focus on kinematics-based pose estimation, leading to unrealistic results due to lack of physical constraints. Two-stage approaches with physics-based post-processing suffer from error accumulation, limiting reconstruction quality.

Method: Uses a unified framework with pixel-as-ray strategy that lifts 2D keypoints into 3D spatial rays in global space. Combines local visual features from pretrained encoder with global pose guidance. Employs distillation scheme to transfer motion knowledge from mocap-trained expert to vision-conditioned policy, refined with physically motivated reinforcement learning rewards.

Result: Extensive experiments show PhysHMR produces high-fidelity, physically plausible motion across diverse scenarios, outperforming prior approaches in both visual accuracy and physical realism.

Conclusion: PhysHMR successfully addresses the limitations of traditional two-stage approaches by directly learning a unified visual-to-action policy, achieving physically grounded motion reconstruction that is visually aligned with input videos.

Abstract: Reconstructing physically plausible human motion from monocular videos remains a challenging problem in computer vision and graphics. Existing methods primarily focus on kinematics-based pose estimation, often leading to unrealistic results due to the lack of physical constraints. To address such artifacts, prior methods have typically relied on physics-based post-processing following the initial kinematics-based motion estimation. However, this two-stage design introduces error accumulation, ultimately limiting the overall reconstruction quality. In this paper, we present PhysHMR, a unified framework that directly learns a visual-to-action policy for humanoid control in a physics-based simulator, enabling motion reconstruction that is both physically grounded and visually aligned with the input video. A key component of our approach is the pixel-as-ray strategy, which lifts 2D keypoints into 3D spatial rays and transforms them into global space. These rays are incorporated as policy inputs, providing robust global pose guidance without depending on noisy 3D root predictions. This soft global grounding, combined with local visual features from a pretrained encoder, allows the policy to reason over both detailed pose and global positioning. To overcome the sample inefficiency of reinforcement learning, we further introduce a distillation scheme that transfers motion knowledge from a mocap-trained expert to the vision-conditioned policy, which is then refined using physically motivated reinforcement learning rewards. Extensive experiments demonstrate that PhysHMR produces high-fidelity, physically plausible motion across diverse scenarios, outperforming prior approaches in both visual accuracy and physical realism.

[121] Unlocking the power of partnership: How humans and machines can work together to improve face recognition

P. Jonathon Phillips, Geraldine Jeckeln, Carina A. Hahn, Amy N. Yates, Peter C. Fontana, Alice J. O’Toole

Main category: cs.CV

TL;DR: Human-machine collaboration in face identification improves accuracy when human and machine baseline accuracies are similar, following the Proximal Accuracy Rule. Intelligent fusion selects humans who can enhance high-performing machines, achieving better accuracy than machines alone or all human-machine combinations.

DetailsMotivation: To determine when combining human and machine decisions in face identification improves accuracy, given individual differences between people and machines affect collaborative outcomes.

Method: Used data from expert and non-expert face identifiers to examine human-human and human-machine collaborations. Applied Proximal Accuracy Rule to predict collaborative benefits and established critical fusion zones. Implemented intelligent human-machine fusion by selecting humans with potential to improve machine accuracy.

Result: Collaboration benefits increased as baseline accuracy difference between collaborators decreased. Critical fusion zone was surprisingly large. Intelligent fusion was more accurate than machine alone and all human-machine combinations. Human-only partnerships achieved similar average performance but intelligent human-machine collaboration minimized impact of low-performing humans.

Conclusion: Both humans and machines play meaningful roles in accurate face identification. The study provides an evidence-based roadmap for intelligent use of AI in face identification systems.

Abstract: Human review of consequential decisions by face recognition algorithms creates a “collaborative” human-machine system. Individual differences between people and machines, however, affect whether collaboration improves or degrades accuracy in any given case. We establish the circumstances under which combining human and machine face identification decisions improves accuracy. Using data from expert and non-expert face identifiers, we examined the benefits of human-human and human-machine collaborations. The benefits of collaboration increased as the difference in baseline accuracy between collaborators decreased, following the Proximal Accuracy Rule (PAR). This rule predicted collaborative (fusion) benefit across a wide range of baseline abilities, from people with no training to those with extensive training. Using the PAR, we established a critical fusion zone, where humans are less accurate than the machine, but fusing the two improves system accuracy. This zone was surprisingly large. We implemented “intelligent human-machine fusion” by selecting people with the potential to increase the accuracy of a high-performing machine. Intelligent fusion was more accurate than the machine operating alone and more accurate than combining all human and machine judgments. The highest system-wide accuracy achievable with human-only partnerships was found by graph theory. This fully human system approximated the average performance achieved by intelligent human-machine collaboration. However, intelligent human-machine collaboration more effectively minimized the impact of low-performing humans on system-wide accuracy. The results demonstrate a meaningful role for both humans and machines in assuring accurate face identification. This study offers an evidence-based road map for the intelligent use of AI in face identification.

[122] How Confident are Video Models? Empowering Video Models to Express their Uncertainty

Zhiting Mei, Ola Shorinwa, Anirudha Majumdar

Main category: cs.CV

TL;DR: First framework for uncertainty quantification in generative video models, introducing a calibration metric, black-box UQ method (S-QUBED), and benchmark dataset to address hallucination issues.

DetailsMotivation: Video generation models hallucinate like LLMs, producing plausible but factually wrong videos, creating safety concerns due to lack of uncertainty quantification methods.

Method: Proposed S-QUBED framework with: (i) calibration metric using robust rank correlation, (ii) black-box UQ method decomposing uncertainty into aleatoric and epistemic components via latent modeling, (iii) UQ benchmark dataset.

Result: S-QUBED computes calibrated uncertainty estimates negatively correlated with task accuracy and effectively separates aleatoric and epistemic uncertainty components.

Conclusion: First successful uncertainty quantification framework for video models that addresses critical safety concerns by providing reliable uncertainty estimates and decomposition.

Abstract: Generative video models demonstrate impressive text-to-video capabilities, spurring widespread adoption in many real-world applications. However, like large language models (LLMs), video generation models tend to hallucinate, producing plausible videos even when they are factually wrong. Although uncertainty quantification (UQ) of LLMs has been extensively studied in prior work, no UQ method for video models exists, raising critical safety concerns. To our knowledge, this paper represents the first work towards quantifying the uncertainty of video models. We present a framework for uncertainty quantification of generative video models, consisting of: (i) a metric for evaluating the calibration of video models based on robust rank correlation estimation with no stringent modeling assumptions; (ii) a black-box UQ method for video models (termed S-QUBED), which leverages latent modeling to rigorously decompose predictive uncertainty into its aleatoric and epistemic components; and (iii) a UQ dataset to facilitate benchmarking calibration in video models. By conditioning the generation task in the latent space, we disentangle uncertainty arising due to vague task specifications from that arising from lack of knowledge. Through extensive experiments on benchmark video datasets, we demonstrate that S-QUBED computes calibrated total uncertainty estimates that are negatively correlated with the task accuracy and effectively computes the aleatoric and epistemic constituents.

[123] PEO: Training-Free Aesthetic Quality Enhancement in Pre-Trained Text-to-Image Diffusion Models with Prompt Embedding Optimization

Hovhannes Margaryan, Bo Wan, Tinne Tuytelaars

Main category: cs.CV

TL;DR: PEO is a training-free method that optimizes text embeddings for simple prompts to improve aesthetic quality in text-to-image diffusion models.

DetailsMotivation: To enhance visual quality of images generated from simple, uncurated prompts in pre-trained text-to-image diffusion models without requiring additional training.

Method: Optimizes text embeddings using a tripartite objective function that improves aesthetic fidelity, maintains adherence to optimized embeddings, and preserves original prompt meaning through a preservation term.

Result: Quantitative and qualitative evaluations show PEO exceeds or matches state-of-the-art text-to-image and prompt adaptation methods.

Conclusion: PEO effectively improves aesthetic quality in diffusion models while being training-free and backbone-independent.

Abstract: This paper introduces a novel approach to aesthetic quality improvement in pre-trained text-to-image diffusion models when given a simple prompt. Our method, dubbed Prompt Embedding Optimization (PEO), leverages a pre-trained text-to-image diffusion model as a backbone and optimizes the text embedding of a given simple and uncurated prompt to enhance the visual quality of the generated image. We achieve this by a tripartite objective function that improves the aesthetic fidelity of the generated image, ensures adherence to the optimized text embedding, and minimal divergence from the initial prompt. The latter is accomplished through a prompt preservation term. Additionally, PEO is training-free and backbone-independent. Quantitative and qualitative evaluations confirm the effectiveness of the proposed method, exceeding or equating the performance of state-of-the-art text-to-image and prompt adaptation methods.

[124] Ego-Exo 3D Hand Tracking in the Wild with a Mobile Multi-Camera Rig

Patrick Rim, Kun He, Kevin Harris, Braden Copple, Shangchen Han, Sizhe An, Ivan Shugurov, Tomas Hodan, He Wen, Xu Xie

Main category: cs.CV

TL;DR: A novel marker-less multi-camera system for accurate 3D hand tracking in unconstrained settings, combining exocentric and egocentric views to bridge the gap between environmental realism and annotation accuracy.

DetailsMotivation: Existing hand tracking datasets are predominantly captured in controlled lab setups, limiting environmental diversity and model generalization for real-world applications.

Method: Developed a lightweight back-mounted capture rig with eight exocentric cameras plus a Meta Quest 3 headset providing two egocentric views, with an ego-exo tracking pipeline for generating accurate 3D hand pose ground truth.

Result: Created an annotated dataset with synchronized multi-view images and precise 3D hand poses that significantly reduces the trade-off between environmental realism and 3D annotation accuracy.

Conclusion: The proposed system enables nearly unconstrained mobility in genuinely in-the-wild conditions while maintaining precise 3D hand tracking capabilities.

Abstract: Accurate 3D tracking of hands and their interactions with the world in unconstrained settings remains a significant challenge for egocentric computer vision. With few exceptions, existing datasets are predominantly captured in controlled lab setups, limiting environmental diversity and model generalization. To address this, we introduce a novel marker-less multi-camera system designed to capture precise 3D hands and objects, which allows for nearly unconstrained mobility in genuinely in-the-wild conditions. We combine a lightweight, back-mounted capture rig with eight exocentric cameras, and a user-worn Meta Quest 3 headset, which contributes two egocentric views. We design an ego-exo tracking pipeline to generate accurate 3D hand pose ground truth from this system, and rigorously evaluate its quality. By collecting an annotated dataset featuring synchronized multi-view images and precise 3D hand poses, we demonstrate the capability of our approach to significantly reduce the trade-off between environmental realism and 3D annotation accuracy.

[125] Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation

Beijia Lu, Ziyi Chen, Jing Xiao, Jun-Yan Zhu

Main category: cs.CV

TL;DR: This paper presents a distillation method that converts a many-step diffusion model for co-speech video generation into a few-step student model, achieving real-time performance while maintaining video quality through input-aware sparse attention and distillation loss.

DetailsMotivation: Existing diffusion-based methods for co-speech video synthesis are slow due to numerous denoising steps and costly attention mechanisms, preventing real-time deployment for applications like video creation and virtual agents.

Method: The method distills a many-step diffusion video model into a few-step student model using input human pose conditioning. It employs input-aware sparse attention guided by human pose keypoints to focus on relevant regions (face, hands, upper body), and introduces an input-aware distillation loss to enhance lip synchronization and hand motion realism.

Result: The method achieves real-time performance with improved visual quality compared to recent audio-driven and input-driven methods. The input-aware sparse attention reduces redundant computations and strengthens temporal correspondences, while the distillation loss improves lip sync and hand motion realism.

Conclusion: By integrating input-aware sparse attention and distillation loss, the proposed method successfully enables real-time co-speech video generation while maintaining high visual quality, addressing the computational limitations of previous diffusion-based approaches.

Abstract: Diffusion models can synthesize realistic co-speech video from audio for various applications, such as video creation and virtual agents. However, existing diffusion-based methods are slow due to numerous denoising steps and costly attention mechanisms, preventing real-time deployment. In this work, we distill a many-step diffusion video model into a few-step student model. Unfortunately, directly applying recent diffusion distillation methods degrades video quality and falls short of real-time performance. To address these issues, our new video distillation method leverages input human pose conditioning for both attention and loss functions. We first propose using accurate correspondence between input human pose keypoints to guide attention to relevant regions, such as the speaker’s face, hands, and upper body. This input-aware sparse attention reduces redundant computations and strengthens temporal correspondences of body parts, improving inference efficiency and motion coherence. To further enhance visual quality, we introduce an input-aware distillation loss that improves lip synchronization and hand motion realism. By integrating our input-aware sparse attention and distillation loss, our method achieves real-time performance with improved visual quality compared to recent audio-driven and input-driven methods. We also conduct extensive experiments showing the effectiveness of our algorithmic design choices.

[126] Deep Generative Continual Learning using Functional LoRA: FunLoRA

Victor Enescu, Hichem Sahbi

Main category: cs.CV

TL;DR: FunLoRA introduces a novel functional LoRA method using rank 1 matrices with dynamic conditioning to enable continual learning in generative models without catastrophic forgetting, achieving state-of-the-art performance with minimal memory and time requirements.

DetailsMotivation: Address catastrophic forgetting in continual learning for generative models, overcoming limitations of retraining on synthetic data which leads to intractable training time and long-term performance degradation.

Method: Designs functional LoRA (FunLoRA) using rank 1 matrices with dynamic conditioning through carefully selected functions to increase matrix rank functionally, enabling parameter-efficient fine-tuning that only requires training on current task data.

Result: Surpasses prior state-of-the-art diffusion model results, achieves higher classification accuracy, and requires only a fraction of memory cost and sampling time compared to existing methods.

Conclusion: FunLoRA provides an effective solution for continual adaptation of generative models, eliminating catastrophic forgetting while maintaining efficiency and performance.

Abstract: Continual adaptation of deep generative models holds tremendous potential and critical importance, given their rapid and expanding usage in text and vision based applications. Incremental training, however, remains highly challenging due to catastrophic forgetting phenomenon, which makes it difficult for neural networks to effectively incorporate new knowledge. A common strategy consists in retraining the generative model on its own synthetic data in order to mitigate forgetting. Yet, such an approach faces two major limitations: (i) the continually increasing training time eventually becomes intractable, and (ii) reliance on synthetic data inevitably leads to long-term performance degradation, since synthetic samples lack the richness of real training data. In this paper, we attenuate these issues by designing a novel and more expressive conditioning mechanism for generative models based on low rank adaptation (LoRA), that exclusively employs rank 1 matrices, whose reparametrized matrix rank is functionally increased using carefully selected functions – and dubbed functional LoRA: FunLoRA. Using this dynamic conditioning, the generative model is guaranteed to avoid catastrophic forgetting and needs only to be trained on data from the current task. Extensive experiments using flow-matching based models trained from scratch, showcase that our proposed parameter-efficient fine-tuning (PEFT) method surpasses prior state-of-the-art results based on diffusion models, reaching higher classification accuracy scores, while only requiring a fraction of the memory cost and sampling time.

[127] Sequence-Preserving Dual-FoV Defense for Traffic Sign and Light Recognition in Autonomous Vehicles

Abhishek Joshi, Jahnavi Krishna Koda, Abhishek Phadke

Main category: cs.CV

TL;DR: This paper proposes a dual field-of-view, sequence-preserving robustness framework for traffic light and sign recognition in autonomous vehicles, using a multi-source dataset and a unified three-layer defense stack to improve accuracy and reduce attack success rates.

DetailsMotivation: Traffic light and sign recognition is crucial for autonomous vehicle safety, but current models are vulnerable to both digital adversarial attacks and natural perturbations (glare, rain, dirt, graffiti), and lack consideration of temporal continuity and multi-static field-of-view sensing.

Method: The study uses a multi-source dataset from aiMotive, Udacity, Waymo, and self-recorded Texas videos. It temporally aligns mid and long-term RGB sequences across four operational domains (highway, night, rainy, urban) and implements a unified three-layer defense stack with feature squeezing, defensive distillation, entropy-based anomaly detection, and sequence-wise temporal voting.

Result: The Unified Defense Stack achieved 79.8 mAP and reduced attack success rate to 18.2%, outperforming YOLOv8, YOLOv9, and BEVFormer while reducing high-risk misclassification to 32%. Physical transferability was confirmed through recapture probes.

Conclusion: The proposed framework successfully enhances traffic light and sign recognition robustness against both digital and natural perturbations through temporal sequence preservation and multi-layer defense mechanisms, significantly improving safety metrics for autonomous vehicles.

Abstract: Traffic light and sign recognition are key for Autonomous Vehicles (AVs) because perception mistakes directly influence navigation and safety. In addition to digital adversarial attacks, models are vulnerable to existing perturbations (glare, rain, dirt, or graffiti), which could lead to dangerous misclassifications. The current work lacks consideration of temporal continuity, multistatic field-of-view (FoV) sensing, and robustness to both digital and natural degradation. This study proposes a dual FoV, sequence-preserving robustness framework for traffic lights and signs in the USA based on a multi-source dataset built on aiMotive, Udacity, Waymo, and self-recorded videos from the region of Texas. Mid and long-term sequences of RGB images are temporally aligned for four operational design domains (ODDs): highway, night, rainy, and urban. Over a series of experiments on a real-life application of anomaly detection, this study outlines a unified three-layer defense stack framework that incorporates feature squeezing, defensive distillation, and entropy-based anomaly detection, as well as sequence-wise temporal voting for further enhancement. The evaluation measures included accuracy, attack success rate (ASR), risk-weighted misclassification severity, and confidence stability. Physical transferability was confirmed using probes for recapture. The results showed that the Unified Defense Stack achieved 79.8 mAP and reduced the ASR to 18.2%, which is superior to YOLOv8, YOLOv9, and BEVFormer, while reducing the high-risk misclassification to 32%.

[128] Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models

Benjamin Yu, Jackie Liu, Justin Cui

Main category: cs.CV

TL;DR: Smart-GRPO introduces an optimized noise perturbation method for reinforcement learning in flow-matching models, improving reward optimization and visual quality through iterative search.

DetailsMotivation: Flow-matching models are deterministic and poorly suited for reinforcement learning, which is crucial for improving image quality and human alignment. Previous noise perturbation methods are inefficient and unstable.

Method: Smart-GRPO uses an iterative search strategy that decodes candidate perturbations, evaluates them with a reward function, and refines the noise distribution toward higher-reward regions.

Result: Experiments show Smart-GRPO improves both reward optimization and visual quality compared to baseline methods.

Conclusion: Smart-GRPO provides a practical path for reinforcement learning in flow-matching frameworks, bridging efficient training with human-aligned generation.

Abstract: Recent advancements in flow-matching have enabled high-quality text-to-image generation. However, the deterministic nature of flow-matching models makes them poorly suited for reinforcement learning, a key tool for improving image quality and human alignment. Prior work has introduced stochasticity by perturbing latents with random noise, but such perturbations are inefficient and unstable. We propose Smart-GRPO, the first method to optimize noise perturbations for reinforcement learning in flow-matching models. Smart-GRPO employs an iterative search strategy that decodes candidate perturbations, evaluates them with a reward function, and refines the noise distribution toward higher-reward regions. Experiments demonstrate that Smart-GRPO improves both reward optimization and visual quality compared to baseline methods. Our results suggest a practical path toward reinforcement learning in flow-matching frameworks, bridging the gap between efficient training and human-aligned generation.

[129] Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction

Kaisi Guan, Xihua Wang, Zhengfeng Lai, Xin Cheng, Peng Zhang, XiaoJiang Liu, Ruihua Song, Meng Cao

Main category: cs.CV

TL;DR: Proposes HVGC framework and BridgeDiT model for Text-to-Sounding-Video generation, addressing modal interference through disentangled captions and achieving state-of-the-art performance with bidirectional cross-modal interaction.

DetailsMotivation: Address two key challenges in Text-to-Sounding-Video generation: (1) modal interference from shared text captions confusing pretrained backbones, and (2) unclear optimal mechanism for cross-modal feature interaction between audio and video.

Method: Hierarchical Visual-Grounded Captioning (HVGC) generates disentangled video and audio captions to eliminate interference. BridgeDiT dual-tower diffusion transformer uses Dual CrossAttention (DCA) mechanism for symmetric bidirectional information exchange between modalities.

Result: Achieves state-of-the-art results on three benchmark datasets with comprehensive human evaluations. Extensive ablation studies validate the effectiveness of the proposed components.

Conclusion: The method successfully addresses modal interference and cross-modal interaction challenges in T2SV generation, offering key insights for future research in this domain.

Abstract: This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation, which aims to generate a video with synchronized audio from text conditions, meanwhile ensuring both modalities are aligned with text. Despite progress in joint audio-video training, two critical challenges still remain unaddressed: (1) a single, shared text caption where the text for video is equal to the text for audio often creates modal interference, confusing the pretrained backbones, and (2) the optimal mechanism for cross-modal feature interaction remains unclear. To address these challenges, we first propose the Hierarchical Visual-Grounded Captioning (HVGC) framework that generates pairs of disentangled captions, a video caption, and an audio caption, eliminating interference at the conditioning stage. Based on HVGC, we further introduce BridgeDiT, a novel dual-tower diffusion transformer, which employs a Dual CrossAttention (DCA) mechanism that acts as a robust “bridge” to enable a symmetric, bidirectional exchange of information, achieving both semantic and temporal synchronization. Extensive experiments on three benchmark datasets, supported by human evaluations, demonstrate that our method achieves state-of-the-art results on most metrics. Comprehensive ablation studies further validate the effectiveness of our contributions, offering key insights for the future T2SV task. All the codes and checkpoints will be publicly released.

[130] Unified Domain Adaptive Semantic Segmentation

Zhe Zhang, Gaochang Wu, Jing Zhang, Xiatian Zhu, Dacheng Tao, Tianyou Chai

Main category: cs.CV

TL;DR: This paper proposes unifying unsupervised domain adaptive semantic segmentation (UDA-SS) across image and video domains through a general data augmentation framework called Quad-directional Mixup (QuadMix), which addresses domain shifts via four-directional feature mixing and optical flow-guided temporal alignment.

DetailsMotivation: Current UDA-SS research is fragmented between image and video domains, preventing cross-pollination of ideas and leading to redundant efforts. The authors advocate for unified study to enable comprehensive understanding and synergistic advancements.

Method: Proposed QuadMix method with four-directional paths for intra- and inter-domain mixing in feature space, incorporating optical flow-guided feature aggregation for temporal shifts in videos to achieve fine-grained domain alignment.

Result: Extensive experiments show the method outperforms state-of-the-art works by large margins on four challenging UDA-SS benchmarks.

Conclusion: Unifying UDA-SS across image and video domains through a data augmentation perspective enables improved generalization, cross-pollination of ideas, and contributes to overall progress in the field.

Abstract: Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer the supervision from a labeled source domain to an unlabeled target domain. The majority of existing UDA-SS works typically consider images whilst recent attempts have extended further to tackle videos by modeling the temporal dimension. Although the two lines of research share the major challenges – overcoming the underlying domain distribution shift, their studies are largely independent, resulting in fragmented insights, a lack of holistic understanding, and missed opportunities for cross-pollination of ideas. This fragmentation prevents the unification of methods, leading to redundant efforts and suboptimal knowledge transfer across image and video domains. Under this observation, we advocate unifying the study of UDA-SS across video and image scenarios, enabling a more comprehensive understanding, synergistic advancements, and efficient knowledge sharing. To that end, we explore the unified UDA-SS from a general data augmentation perspective, serving as a unifying conceptual framework, enabling improved generalization, and potential for cross-pollination of ideas, ultimately contributing to the overall progress and practical impact of this field of research. Specifically, we propose a Quad-directional Mixup (QuadMix) method, characterized by tackling distinct point attributes and feature inconsistencies through four-directional paths for intra- and inter-domain mixing in a feature space. To deal with temporal shifts with videos, we incorporate optical flow-guided feature aggregation across spatial and temporal dimensions for fine-grained domain alignment. Extensive experiments show that our method outperforms the state-of-the-art works by large margins on four challenging UDA-SS benchmarks. Our source code and models will be released at https://github.com/ZHE-SAPI/UDASS.

[131] FSFSplatter: Build Surface and Novel Views with Sparse-Views within 3min

Yibin Zhao, Yihan Pan, Jun Nan, Jianjun Yi

Main category: cs.CV

TL;DR: FSFSplatter enables fast surface reconstruction from free sparse images using dense Gaussian initialization and geometry-enhanced optimization, outperforming state-of-the-art methods on DTU and Replica datasets.

DetailsMotivation: Existing Gaussian Splatting methods require dense, calibrated views and perform poorly with sparse images due to limited overlap and overfitting issues.

Method: Integrates end-to-end dense Gaussian initialization using a large Transformer, self-splitting Gaussian head, contribution-based pruning to remove floaters, and depth/multi-view feature supervision with differentiable camera parameters during optimization.

Result: Outperforms current state-of-the-art methods on widely used DTU and Replica datasets.

Conclusion: FSFSplatter successfully addresses the challenge of reconstructing high-quality surfaces from free sparse images through its integrated approach of dense initialization and geometry-enhanced optimization.

Abstract: Gaussian Splatting has become a leading reconstruction technique, known for its high-quality novel view synthesis and detailed reconstruction. However, most existing methods require dense, calibrated views. Reconstructing from free sparse images often leads to poor surface due to limited overlap and overfitting. We introduce FSFSplatter, a new approach for fast surface reconstruction from free sparse images. Our method integrates end-to-end dense Gaussian initialization, camera parameter estimation, and geometry-enhanced scene optimization. Specifically, FSFSplatter employs a large Transformer to encode multi-view images and generates a dense and geometrically consistent Gaussian scene initialization via a self-splitting Gaussian head. It eliminates local floaters through contribution-based pruning and mitigates overfitting to limited views by leveraging depth and multi-view feature supervision with differentiable camera parameters during rapid optimization. FSFSplatter outperforms current state-of-the-art methods on widely used DTU and Replica.

[132] MoGIC: Boosting Motion Generation via Intention Understanding and Visual Context

Junyu Shi, Yong Sun, Zhiyuan Zhang, Lijiang Liu, Zhengjie Zhang, Yuxin He, Qiang Nie

Main category: cs.CV

TL;DR: MoGIC is a unified framework that integrates intention modeling and visual priors into multimodal motion synthesis, overcoming limitations of existing text-driven methods by capturing causal action logic and enabling precise spatiotemporal control.

DetailsMotivation: Existing text-driven motion generation methods fail to capture causal logic of action execution and human intentions, and lack visual grounding which restricts precision and personalization since language alone cannot specify fine-grained spatiotemporal details.

Method: MoGIC jointly optimizes multimodal-conditioned motion generation and intention prediction, uses a mixture-of-attention mechanism with adaptive scope for local alignment between conditional tokens and motion subsequences, and leverages visual priors to enhance generation.

Result: After finetuning, MoGIC reduces FID by 38.6% on HumanML3D and 34.6% on Mo440H benchmark, surpasses LLM-based methods in motion captioning with a lightweight text head, and enables intention prediction and vision-conditioned generation.

Conclusion: MoGIC advances controllable motion synthesis and intention understanding by integrating intention modeling and visual priors, demonstrating versatile multimodal generative capability and improved performance across multiple benchmarks.

Abstract: Existing text-driven motion generation methods often treat synthesis as a bidirectional mapping between language and motion, but remain limited in capturing the causal logic of action execution and the human intentions that drive behavior. The absence of visual grounding further restricts precision and personalization, as language alone cannot specify fine-grained spatiotemporal details. We propose MoGIC, a unified framework that integrates intention modeling and visual priors into multimodal motion synthesis. By jointly optimizing multimodal-conditioned motion generation and intention prediction, MoGIC uncovers latent human goals, leverages visual priors to enhance generation, and exhibits versatile multimodal generative capability. We further introduce a mixture-of-attention mechanism with adaptive scope to enable effective local alignment between conditional tokens and motion subsequences. To support this paradigm, we curate Mo440H, a 440-hour benchmark from 21 high-quality motion datasets. Experiments show that after finetuning, MoGIC reduces FID by 38.6% on HumanML3D and 34.6% on Mo440H, surpasses LLM-based methods in motion captioning with a lightweight text head, and further enables intention prediction and vision-conditioned generation, advancing controllable motion synthesis and intention understanding. The code is available at https://github.com/JunyuShi02/MoGIC

[133] From Tokens to Nodes: Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting

Jianing Chen, Zehao Li, Yujun Cai, Hao Jiang, Shuqin Gao, Honglong Zhao, Tianlu Mao, Yucheng Zhang

Main category: cs.CV

TL;DR: A motion-adaptive 3D reconstruction framework that uses semantic and motion priors to dynamically allocate control points based on motion complexity, replacing static geometric allocation with adaptive compression and spline-based trajectory parameterization.

DetailsMotivation: Current sparse control methods for dynamic 3D reconstruction suffer from inefficient control point allocation - static redundancy and dynamic insufficiency - due to purely geometric point distribution that doesn't align with actual motion complexity.

Method: Uses vision foundation models to establish patch-token-node correspondences, applies motion-adaptive compression to concentrate control points in dynamic regions, and employs iterative voxelization with motion tendency scoring. Replaces MLP-based deformation with spline-based trajectory parameterization initialized by 2D tracklets.

Result: Significant improvements in reconstruction quality and efficiency over state-of-the-art methods, achieving smoother motion representation and more stable optimization.

Conclusion: The motion-adaptive framework successfully addresses the fundamental mismatch between control point allocation and motion complexity, enabling more efficient and higher quality dynamic 3D reconstruction from monocular videos.

Abstract: Dynamic 3D reconstruction from monocular videos remains difficult due to the ambiguity inferring 3D motion from limited views and computational demands of modeling temporally varying scenes. While recent sparse control methods alleviate computation by reducing millions of Gaussians to thousands of control points, they suffer from a critical limitation: they allocate points purely by geometry, leading to static redundancy and dynamic insufficiency. We propose a motion-adaptive framework that aligns control density with motion complexity. Leveraging semantic and motion priors from vision foundation models, we establish patch-token-node correspondences and apply motion-adaptive compression to concentrate control points in dynamic regions while suppressing redundancy in static backgrounds. Our approach achieves flexible representational density adaptation through iterative voxelization and motion tendency scoring, directly addressing the fundamental mismatch between control point allocation and motion complexity. To capture temporal evolution, we introduce spline-based trajectory parameterization initialized by 2D tracklets, replacing MLP-based deformation fields to achieve smoother motion representation and more stable optimization. Extensive experiments demonstrate significant improvements in reconstruction quality and efficiency over existing state-of-the-art methods.

[134] Net2Net: When Un-trained Meets Pre-trained Networks for Robust Real-World Denoising

Weimin Yuan, Cai Meng

Main category: cs.CV

TL;DR: Net2Net combines untrained and pre-trained networks via regularization by denoising (RED) for real-world noise removal, improving generalization without requiring extensive labeled data.

DetailsMotivation: Traditional denoising methods struggle with real noise complexity, while deep learning approaches require extensive labeled data and may not generalize well across diverse noise types.

Method: Hybrid framework combining unsupervised DIP (Deep Image Prior) and supervised pre-trained DRUNet using regularization by denoising (RED), where untrained network adapts to input-specific noise and pre-trained network provides robust denoising.

Result: Extensive experiments show superior performance for real-world noise removal, with enhanced generalization across varying noise patterns and improved performance in limited training data scenarios.

Conclusion: Net2Net effectively addresses real-world denoising challenges by leveraging both untrained and pre-trained networks, offering better generalization and performance without requiring extensive labeled datasets.

Abstract: Traditional denoising methods for noise removal have largely relied on handcrafted priors, often perform well in controlled environments but struggle to address the complexity and variability of real noise. In contrast, deep learning-based approaches have gained prominence for learning noise characteristics from large datasets, but these methods frequently require extensive labeled data and may not generalize effectively across diverse noise types and imaging conditions. In this paper, we present an innovative method, termed as Net2Net, that combines the strengths of untrained and pre-trained networks to tackle the challenges of real-world noise removal. The innovation of Net2Net lies in its combination of unsupervised DIP and supervised pre-trained model DRUNet by regularization by denoising (RED). The untrained network adapts to the unique noise characteristics of each input image without requiring labeled data, while the pre-trained network leverages learned representations from large-scale datasets to deliver robust denoising performance. This hybrid framework enhances generalization across varying noise patterns and improves performance, particularly in scenarios with limited training data. Extensive experiments on benchmark datasets demonstrate the superiority of our method for real-world noise removal.

[135] Enhancing Monocular Height Estimation via Sparse LiDAR-Guided Correction

Jian Song, Hongruixuan Chen, Naoto Yokoya

Main category: cs.CV

TL;DR: The paper analyzes monocular height estimation from remote sensing imagery, revealing that models trained on synthetic data rely heavily on shadow cues, which can cause height estimation errors. It proposes a correction pipeline using sparse LiDAR data to improve accuracy.

DetailsMotivation: Monocular height estimation from remote sensing imagery is challenging due to lack of structural information. Existing methods using synthetic data have unclear reliability and may depend on unreliable cues like shadows, leading to inaccurate predictions.

Method: The authors investigate a state-of-the-art MHE model trained on synthetic data to understand its prediction patterns. They then propose a two-stage correction pipeline: pre-processing raw ICESat-2 LiDAR data, followed by a random forest-based approach to refine height estimates.

Result: Experiments in three urban regions (Saint-Omer, Tokyo, Sao Paulo) showed significant error reductions with MAE decreased by 22.8%, 6.9%, and 4.9% respectively after applying the correction pipeline.

Conclusion: Shadow awareness is critical in synthetic data-driven models, and fusing imperfect real-world LiDAR data can significantly improve the robustness and reliability of monocular height estimation for scalable 3D mapping solutions.

Abstract: Monocular height estimation (MHE) from very-high-resolution (VHR) remote sensing imagery via deep learning is notoriously challenging due to the lack of sufficient structural information. Conventional digital elevation models (DEMs), typically derived from airborne LiDAR or multi-view stereo, remain costly and geographically limited. Recently, models trained on synthetic data and refined through domain adaptation have shown remarkable performance in MHE, yet it remains unclear how these models make predictions or how reliable they truly are. In this paper, we investigate a state-of-the-art MHE model trained purely on synthetic data to explore where the model looks when making height predictions. Through systematic analyses, we find that the model relies heavily on shadow cues, a factor that can lead to overestimation or underestimation of heights when shadows deviate from expected norms. Furthermore, the inherent difficulty of evaluating regression tasks with the human eye underscores additional limitations of purely synthetic training. To address these issues, we propose a novel correction pipeline that integrates sparse, imperfect global LiDAR measurements (ICESat-2) with deep-learning outputs to improve local accuracy and achieve spatially consistent corrections. Our method comprises two stages: pre-processing raw ICESat-2 data, followed by a random forest-based approach to densely refine height estimates. Experiments in three representative urban regions – Saint-Omer, Tokyo, and São Paulo – reveal substantial error reductions, with mean absolute error (MAE) decreased by 22.8%, 6.9%, and 4.9%, respectively. These findings highlight the critical role of shadow awareness in synthetic data-driven models and demonstrate how fusing imperfect real-world LiDAR data can bolster the robustness of MHE, paving the way for more reliable and scalable 3D mapping solutions.

[136] Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval

Lanyun Zhu, Deyi Ji, Tianrun Chen, Haiyang Wu, Shiqi Wang

Main category: cs.CV

TL;DR: Retrv-R1 is the first R1-style multimodal LLM designed for universal retrieval, using step-by-step reasoning to improve accuracy while addressing computational costs and training instability through information compression and a new training paradigm.

DetailsMotivation: Directly applying DeepSeek-R1's RL methods to retrieval tasks is infeasible due to high computational costs from large token consumption and instability/suboptimal results when training RL for retrieval.

Method: Introduces information compression module with details inspection mechanism to reduce tokens while preserving critical information, and a new training paradigm with activation stage using retrieval-tailored synthetic CoT dataset followed by RL with curriculum reward.

Result: Retrv-R1 achieves state-of-the-art performance, high efficiency, and strong generalization ability across multiple benchmarks and tasks.

Conclusion: The proposed approach successfully addresses computational efficiency and training stability challenges in applying RL to multimodal retrieval, demonstrating the potential of step-by-step reasoning for improved retrieval accuracy.

Abstract: The success of DeepSeek-R1 demonstrates the immense potential of using reinforcement learning (RL) to enhance LLMs’ reasoning capabilities. This paper introduces Retrv-R1, the first R1-style MLLM specifically designed for multimodal universal retrieval, achieving higher performance by employing step-by-step reasoning to produce more accurate retrieval results. We find that directly applying the methods of DeepSeek-R1 to retrieval tasks is not feasible, mainly due to (1) the high computational cost caused by the large token consumption required for multiple candidates with reasoning processes, and (2) the instability and suboptimal results when directly applying RL to train for retrieval tasks. To address these issues, Retrv-R1 introduces an information compression module with a details inspection mechanism, which enhances computational efficiency by reducing the number of tokens while ensuring that critical information for challenging candidates is preserved. Furthermore, a new training paradigm is proposed, including an activation stage using a retrieval-tailored synthetic CoT dataset for more effective optimization, followed by RL with a novel curriculum reward to improve both performance and efficiency. Incorporating these novel designs, Retrv-R1 achieves SOTA performance, high efficiency, and strong generalization ability, as demonstrated by experiments across multiple benchmarks and tasks.

[137] YOLO-Based Defect Detection for Metal Sheets

Po-Heng Chou, Chun-Chi Wang, Wei-Lung Mao

Main category: cs.CV

TL;DR: A YOLO-based deep learning model using ConSinGAN for data augmentation achieves 91.3% accuracy in automated defect detection for metal sheets, with YOLOv9 performing best among tested versions.

DetailsMotivation: To solve time-consuming and labor-intensive defect detection tasks in industrial manufacturing through automated optical inspection.

Method: Used YOLO models (v3, v4, v7, v9) combined with ConSinGAN for data augmentation to address limited metal sheet image data, trained on metal sheet defect images.
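
For orientation only, a hedged sketch of what such a training step might look like with the `ultralytics` package as a stand-in for the authors' setup; `defects.yaml`, the checkpoint name, and the hyperparameters are hypothetical, and ConSinGAN is assumed to have populated the training split with synthetic defect images offline.

```python
from ultralytics import YOLO

model = YOLO("yolov9c.pt")                     # pre-trained YOLOv9 weights
model.train(data="defects.yaml", epochs=100, imgsz=640)  # fine-tune on defects
results = model("metal_sheet.jpg")             # inference on one sheet image
```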

Result: YOLOv9 with ConSinGAN achieved best performance: 91.3% accuracy with 146 ms detection time, successfully integrated into manufacturing hardware and SCADA system.

Conclusion: The proposed automated defect detection system is effective, practical for industrial applications, and easily applicable to other manufacturing components.

Abstract: In this paper, we propose a YOLO-based deep learning (DL) model for automatic defect detection to solve the time-consuming and labor-intensive tasks in industrial manufacturing. In our experiments, the images of metal sheets are used as the dataset for training the YOLO model to detect the defects on the surfaces and in the holes of metal sheets. However, the lack of metal sheet images significantly degrades the performance of detection accuracy. To address this issue, the ConSinGAN is used to generate a considerable amount of data. Four versions of the YOLO model (i.e., YOLOv3, v4, v7, and v9) are combined with the ConSinGAN for data augmentation. The proposed YOLOv9 model with ConSinGAN outperforms the other YOLO models with an accuracy of 91.3%, and a detection time of 146 ms. The proposed YOLOv9 model is integrated into manufacturing hardware and a supervisory control and data acquisition (SCADA) system to establish a practical automated optical inspection (AOI) system. Additionally, the proposed automated defect detection is easily applied to other components in industrial manufacturing.

[138] Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models

Lihua Zhou, Mao Ye, Shuaifeng Li, Nianxin Li, Jinlin Wu, Xiatian Zhu, Lei Deng, Hongbin Liu, Jiebo Luo, Zhen Lei

Main category: cs.CV

TL;DR: BCA+ is a training-free test-time adaptation framework for vision-language models that uses Bayesian inference with dynamic caching of class embeddings, spatial scales, and adaptive priors to improve object recognition and detection under distribution shifts.

DetailsMotivation: Existing test-time adaptation methods are either computationally expensive (requiring backpropagation) or focus only on likelihood adaptation while ignoring the important role of class priors, limiting their effectiveness for real-time deployment.

Method: BCA+ introduces a dynamic cache that stores and updates class embeddings, spatial scales for detection, and adaptive class priors from historical predictions. It formulates adaptation as Bayesian inference, fusing initial VLM outputs with cache-based predictions that combine dynamically updated likelihood (feature/scale similarity) and evolving class distribution priors.
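
A minimal sketch of the cache-based Bayesian fusion with hypothetical inputs: the frozen VLM's class distribution, a cache of past confident embeddings with labels, and a running class prior. The exact likelihood and fusion weighting in BCA+ may differ.

```python
import numpy as np

def bca_style_predict(feat, vlm_probs, cache_feats, cache_labels, prior, alpha=0.5):
    """Fuse the VLM's zero-shot prediction with a cache-based Bayesian posterior."""
    n_classes = len(prior)
    sims = cache_feats @ feat                          # cosine sims (unit-norm feats)
    likelihood = np.full(n_classes, 1e-8)
    for c in range(n_classes):                         # feature-similarity likelihood
        mask = cache_labels == c
        if mask.any():
            likelihood[c] = np.exp(sims[mask].max())
    posterior = likelihood * prior                     # Bayes rule (unnormalized)
    posterior /= posterior.sum()
    return alpha * vlm_probs + (1 - alpha) * posterior  # fuse with the VLM output
```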

Result: Extensive experiments show that BCA+ achieves state-of-the-art performance on both recognition and detection benchmarks while being highly efficient as a training-free method requiring no backpropagation.

Conclusion: BCA+ provides a unified, efficient framework for test-time adaptation that corrects both semantic understanding and contextual confidence through dual-adaptation and uncertainty-guided fusion, making it suitable for real-time deployment.

Abstract: Vision-language models (VLMs) such as CLIP and Grounding DINO have achieved remarkable success in object recognition and detection. However, their performance often degrades under real-world distribution shifts. Test-time adaptation (TTA) aims to mitigate this issue by adapting models during inference. Existing methods either rely on computationally expensive backpropagation, which hinders real-time deployment, or focus solely on likelihood adaptation, which overlooks the critical role of the prior. Our prior work, Bayesian Class Adaptation (BCA), addressed these shortcomings for object recognition by introducing a training-free framework that incorporates adaptive priors. Building upon this foundation, we now present Bayesian Class Adaptation plus (BCA+), a unified, training-free framework for TTA for both object recognition and detection. BCA+ introduces a dynamic cache that adaptively stores and updates class embeddings, spatial scales (for detection), and, crucially, adaptive class priors derived from historical predictions. We formulate adaptation as a Bayesian inference problem, where final predictions are generated by fusing the initial VLM output with a cache-based prediction. This cache-based prediction combines a dynamically updated likelihood (measuring feature and scale similarity) and a prior (reflecting the evolving class distribution). This dual-adaptation mechanism, coupled with uncertainty-guided fusion, enables BCA+ to correct both the model’s semantic understanding and its contextual confidence. As a training-free method requiring no backpropagation, BCA+ is highly efficient. Extensive experiments demonstrate that BCA+ achieves state-of-the-art performance on both recognition and detection benchmarks.

[139] Hierarchical Generalized Category Discovery for Brain Tumor Classification in Digital Pathology

Matthias Perkonigg, Patrick Rockenschaub, Georg Göbel, Adelheid Wöhrer

Main category: cs.CV

TL;DR: HGCD-BT is a hierarchical generalized category discovery method for brain tumor classification that can identify both known and unknown tumor types by combining contrastive learning with hierarchical clustering.

DetailsMotivation: Existing brain tumor classification methods are limited to predefined classes and cannot identify new tumor types not seen during training, which is critical for intra-operative decision making in neurosurgery.

Method: The approach integrates hierarchical clustering with contrastive learning, extending GCD with a novel semi-supervised hierarchical clustering loss to capture the hierarchical structure of brain tumor taxonomies.
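
The actual HGCD-BT loss is more involved, but a sketch of the general hierarchical-supervision idea, under the assumption of a fixed fine-to-coarse class map, looks like this: fine-class probabilities are pooled into their parents and both levels are supervised on labelled samples.

```python
import torch
import torch.nn.functional as F

def hierarchical_loss(fine_logits, fine_labels, parent_of):
    """parent_of: LongTensor [n_fine] mapping each fine class to its coarse class."""
    fine_probs = fine_logits.softmax(dim=-1)
    n_coarse = int(parent_of.max()) + 1
    coarse_probs = fine_probs.new_zeros(fine_probs.size(0), n_coarse)
    coarse_probs.index_add_(1, parent_of, fine_probs)  # pool children into parents
    coarse_labels = parent_of[fine_labels]
    return (F.cross_entropy(fine_logits, fine_labels)
            + F.nll_loss(coarse_probs.clamp_min(1e-8).log(), coarse_labels))
```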

Result: Achieved +28% improvement in accuracy over state-of-the-art GCD methods for patch-level classification on OpenSRH dataset, with strong performance in identifying unseen tumor categories. Also demonstrated generalizability on slide-level classification using Digital Brain Tumor Atlas.

Conclusion: HGCD-BT effectively bridges the gap between supervised and unsupervised learning for brain tumor classification, enabling discovery of both known and unknown tumor types while capturing hierarchical relationships.

Abstract: Accurate brain tumor classification is critical for intra-operative decision making in neuro-oncological surgery. However, existing approaches are restricted to a fixed set of predefined classes and are therefore unable to capture patterns of tumor types not available during training. Unsupervised learning can extract general-purpose features, but it lacks the ability to incorporate prior knowledge from labelled data, and semi-supervised methods often assume that all potential classes are represented in the labelled data. Generalized Category Discovery (GCD) aims to bridge this gap by categorizing both known and unknown classes within unlabelled data. To reflect the hierarchical structure of brain tumor taxonomies, in this work, we introduce Hierarchical Generalized Category Discovery for Brain Tumor Classification (HGCD-BT), a novel approach that integrates hierarchical clustering with contrastive learning. Our method extends contrastive learning based GCD by incorporating a novel semi-supervised hierarchical clustering loss. We evaluate HGCD-BT on OpenSRH, a dataset of stimulated Raman histology brain tumor images, achieving a +28% improvement in accuracy over state-of-the-art GCD methods for patch-level classification, particularly in identifying previously unseen tumor categories. Furthermore, we demonstrate the generalizability of HGCD-BT on slide-level classification of hematoxylin and eosin stained whole-slide images from the Digital Brain Tumor Atlas, confirming its utility across imaging modalities.

[140] AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding

Xian Zhang, Zexi Wu, Zinuo Li, Hongming Xu, Luqi Gong, Farid Boussaid, Naoufel Werghi, Mohammed Bennamoun

Main category: cs.CV

TL;DR: AdaRD-Key is a training-free keyframe sampling module for long-form video understanding that balances query relevance and visual diversity using a unified Relevance-Diversity Max-Volume objective, with adaptive gating for weak alignment cases.

DetailsMotivation: Current VLMs struggle with long-form videos due to uniform sampling overlooking critical moments, while existing keyframe selection methods either impose rigid temporal spacing (missing fine-grained cues) or emphasize diversity over query relevance.

Method: Proposes AdaRD-Key that maximizes a unified RD-MV objective combining query-conditioned relevance scores with log-determinant diversity. Includes lightweight relevance-aware gating that shifts to diversity-only mode when weak alignment is detected.
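
A greedy sketch of a relevance-plus-log-determinant objective over L2-normalized frame features; the precise RD-MV formulation and the gating threshold used by AdaRD-Key may differ from this simplification.

```python
import numpy as np

def select_keyframes(F, r, k, lam=1.0, eps=1e-6):
    """F: [n, d] unit-norm frame features; r: [n] query relevance; pick k frames."""
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(r)):
            if i in selected:
                continue
            idx = selected + [i]
            G = F[idx] @ F[idx].T                 # Gram matrix of the candidate set
            logdet = np.linalg.slogdet(G + eps * np.eye(len(idx)))[1]
            gain = r[i] + lam * logdet            # relevance + diversity volume
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected
```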

Result: Achieves state-of-the-art performance on LongVideoBench and Video-MME benchmarks, particularly for long-form videos. Runs in real time on a single GPU and is plug-and-play compatible with existing VLMs.

Conclusion: AdaRD-Key provides an effective, training-free solution for query-driven long-form video understanding by adaptively balancing relevance and diversity in keyframe selection, outperforming existing methods while maintaining computational efficiency.

Abstract: Understanding long-form videos remains a significant challenge for vision–language models (VLMs) due to their extensive temporal length and high information density. Most current multimodal large language models (MLLMs) rely on uniform sampling, which often overlooks critical moments, leading to incorrect responses to queries. In parallel, many keyframe selection approaches impose rigid temporal spacing: once a frame is chosen, an exclusion window suppresses adjacent timestamps to reduce redundancy. While effective at limiting overlap, this strategy frequently misses short, fine-grained cues near important events. Other methods instead emphasize visual diversity but neglect query relevance. We propose AdaRD-Key, a training-free keyframe sampling module for query-driven long-form video understanding. AdaRD-Key maximizes a unified Relevance–Diversity Max-Volume (RD-MV) objective, combining a query-conditioned relevance score with a log-determinant diversity component to yield informative yet non-redundant frames. To handle broad queries with weak alignment to the video, AdaRD-Key employs a lightweight relevance-aware gating mechanism; when the relevance distribution indicates weak alignment, the method seamlessly shifts into a diversity-only mode, enhancing coverage without additional supervision. Our pipeline is training-free, computationally efficient (running in real time on a single GPU), and compatible with existing VLMs in a plug-and-play manner. Extensive experiments on LongVideoBench and Video-MME demonstrate state-of-the-art performance, particularly on long-form videos. Code available at https://github.com/Xian867/AdaRD-Key.

[141] Reasoning Riddles: How Explainability Reveals Cognitive Limits in Vision-Language Models

Prahitha Movva

Main category: cs.CV

TL;DR: This paper analyzes how Vision-Language Models (VLMs) approach rebus puzzles, revealing their cognitive processes and failure patterns through explainability analysis rather than just performance metrics.

DetailsMotivation: VLMs struggle with complex lateral thinking challenges like rebus puzzles, but the underlying reasoning processes and failure patterns remain largely unexplored despite recent work showing their poor performance.

Method: Created a systematically annotated dataset of 221 rebus puzzles across six cognitive categories, paired with an evaluation framework separating reasoning quality from answer correctness. Investigated three prompting strategies to elicit different explanatory processes.

Result: Reasoning quality varies dramatically across puzzle categories - models show strengths in visual composition but fundamental limitations in absence interpretation and cultural symbolism. Prompting strategy substantially influences both cognitive approach and problem-solving effectiveness.

Conclusion: Explainability should be an integral component of model performance evaluation rather than a post-hoc consideration, as it reveals critical insights into VLM cognitive processes and systematic limitations.

Abstract: Vision-Language Models (VLMs) excel at many multimodal tasks, yet their cognitive processes remain opaque on complex lateral thinking challenges like rebus puzzles. While recent work has demonstrated these models struggle significantly with rebus puzzle solving, the underlying reasoning processes and failure patterns remain largely unexplored. We address this gap through a comprehensive explainability analysis that moves beyond performance metrics to understand how VLMs approach these complex lateral thinking challenges. Our study contributes a systematically annotated dataset of 221 rebus puzzles across six cognitive categories, paired with an evaluation framework that separates reasoning quality from answer correctness. We investigate three prompting strategies designed to elicit different types of explanatory processes and reveal critical insights into VLM cognitive processes. Our findings demonstrate that reasoning quality varies dramatically across puzzle categories, with models showing systematic strengths in visual composition while exhibiting fundamental limitations in absence interpretation and cultural symbolism. We also discover that prompting strategy substantially influences both cognitive approach and problem-solving effectiveness, establishing explainability as an integral component of model performance rather than a post-hoc consideration.

[142] OTR: Synthesizing Overlay Text Dataset for Text Removal

Jan Zdenek, Wataru Shimoda, Kota Yamaguchi

Main category: cs.CV

TL;DR: This paper introduces a new synthetic text removal benchmark to address limitations in existing datasets like SCUT-EnsText, which suffer from ground truth artifacts and simplistic backgrounds.

DetailsMotivation: Current text removal datasets have limitations including ground truth artifacts from manual editing, overly simplistic text backgrounds, and evaluation metrics that don't properly assess result quality, hindering out-of-domain generalization and accurate evaluation.

Method: The authors developed an approach to synthesize a text removal benchmark using text rendered on complex backgrounds with object-aware placement and vision-language model-generated content to ensure clean ground truth and challenging scenarios.

Result: The paper presents a new dataset that addresses the limitations of existing benchmarks, featuring complex backgrounds and clean ground truth, making it applicable to domains beyond scene texts.

Conclusion: The introduced synthetic text removal benchmark provides a more robust evaluation framework for text removal tasks across various domains, with the dataset publicly available for research use.

Abstract: Text removal is a crucial task in computer vision with applications such as privacy preservation, image editing, and media reuse. While existing research has primarily focused on scene text removal in natural images, limitations in current datasets hinder out-of-domain generalization or accurate evaluation. In particular, widely used benchmarks such as SCUT-EnsText suffer from ground truth artifacts due to manual editing, overly simplistic text backgrounds, and evaluation metrics that do not capture the quality of generated results. To address these issues, we introduce an approach to synthesizing a text removal benchmark applicable to domains other than scene texts. Our dataset features text rendered on complex backgrounds using object-aware placement and vision-language model-generated content, ensuring clean ground truth and challenging text removal scenarios. The dataset is available at https://huggingface.co/datasets/cyberagent/OTR .

[143] Align Your Query: Representation Alignment for Multimodality Medical Object Detection

Ara Seo, Bryan Sangwoo Kim, Hyungjin Chung, Jong Chul Ye

Main category: cs.CV

TL;DR: A framework for aligning object query representations in DETR-style detectors across different medical imaging modalities using modality tokens and contrastive pretraining.

DetailsMotivation: Medical object detection struggles when trained on mixed modalities due to heterogeneous statistics and disjoint representation spaces across CXR, CT, and MRI.

Method: Proposes Multimodality Context Attention (MoCA) to integrate modality tokens into object queries via self-attention, and QueryREPA pretraining stage with contrastive objective to align queries with modality tokens.
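
A minimal sketch of the token-mixing step, assuming a single self-attention layer and illustrative tensor shapes rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MoCASketch(nn.Module):
    """Mix a text-derived modality token into DETR-style object queries."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries, modality_token):
        # queries: [B, N, dim]; modality_token: [B, 1, dim]
        x = torch.cat([modality_token, queries], dim=1)  # prepend context token
        x, _ = self.attn(x, x, x)                        # propagate the modality cue
        return x[:, 1:]                                  # drop the token, keep queries
```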

Result: Consistently improves AP across diverse medical modalities with minimal overhead and no architectural modifications to DETR-style detectors.

Conclusion: The approach offers a practical path toward robust multimodality medical object detection by producing modality-aware, class-faithful queries that transfer effectively.

Abstract: Medical object detection suffers when a single detector is trained on mixed medical modalities (e.g., CXR, CT, MRI) due to heterogeneous statistics and disjoint representation spaces. To address this challenge, we turn to representation alignment, an approach that has proven effective for bringing features from different sources into a shared space. Specifically, we target the representations of DETR-style object queries and propose a simple, detector-agnostic framework to align them with modality context. First, we define modality tokens: compact, text-derived embeddings encoding imaging modality that are lightweight and require no extra annotations. We integrate the modality tokens into the detection process via Multimodality Context Attention (MoCA), mixing object-query representations via self-attention to propagate modality context within the query set. This preserves DETR-style architectures and adds negligible latency while injecting modality cues into object queries. We further introduce QueryREPA, a short pretraining stage that aligns query representations to their modality tokens using a task-specific contrastive objective with modality-balanced batches. Together, MoCA and QueryREPA produce modality-aware, class-faithful queries that transfer effectively to downstream training. Across diverse modalities trained altogether, the proposed approach consistently improves AP with minimal overhead and no architectural modifications, offering a practical path toward robust multimodality medical object detection. Project page: https://araseo.github.io/alignyourquery/.

[144] VERNIER: an open-source software pushing marker pose estimation down to the micrometer and nanometer scales

Patrick Sandoz, Antoine N. André, Guillaume J. Laurent

Main category: cs.CV

TL;DR: VERNIER is an open-source phase processing software for high-precision pose estimation using pseudo-periodic patterns, achieving nanometric and microradian resolutions over centimeter ranges with robustness to noise, defocus, and occlusion.

DetailsMotivation: Pose estimation at small scales remains challenging, with few solutions available for capturing 6 degrees of freedom with nanometric and microradian resolutions over large ranges, particularly for microscopy applications.

Method: Uses phase-based processing of pseudo-periodic patterns with a phase-based local thresholding algorithm. The software handles various pattern designs optimized for different microscopy applications and provides implementation guidelines.
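
A toy 1-D illustration of why phase processing yields sub-period resolution: for a periodic pattern of known period, the Fourier phase at the pattern frequency encodes the sub-pixel shift. VERNIER's 2-D pipeline and local thresholding are far more elaborate than this.

```python
import numpy as np

def estimate_shift(signal, period):
    """Recover the sub-pixel shift of a periodic pattern from its Fourier phase."""
    n = len(signal)
    k = round(n / period)                       # bin of the pattern frequency
    phase = np.angle(np.fft.fft(signal)[k])
    return -phase * period / (2 * np.pi)        # shift in pixels, modulo the period

x = np.arange(512)
sig = np.cos(2 * np.pi * (x - 3.27) / 16.0)     # pattern shifted by 3.27 px
print(estimate_shift(sig, 16.0))                # ~3.27
```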

Result: The software provides fast and reliable pose measurement with proven robustness to noise, defocus, and occlusion. It enables centimeter ranges using pattern encoding and nanometer resolutions through phase processing.

Conclusion: VERNIER offers an effective solution for high-precision pose estimation in microscopy, with guidelines provided for selecting appropriate pattern designs and microscope magnification lenses based on desired performance requirements.

Abstract: Pose estimation is still a challenge at small scales. Few solutions exist to capture the 6 degrees of freedom of an object with nanometric and microradian resolutions over relatively large ranges. Over the years, we have proposed several fiducial marker and pattern designs to achieve reliable performance for various microscopy applications. Centimeter ranges are possible using pattern encoding methods, while nanometer resolutions can be achieved using phase processing of the periodic frames. This paper presents VERNIER, an open-source phase processing software designed to provide fast and reliable pose measurement based on pseudo-periodic patterns. Thanks to a phase-based local thresholding algorithm, the software has proven to be particularly robust to noise, defocus and occlusion. The successive steps of the phase processing are presented, as well as the different types of patterns that address different application needs. The implementation procedure is illustrated with synthetic and experimental images. Finally, guidelines are given for selecting the appropriate pattern design and microscope magnification lenses as a function of the desired performance.

[145] Med-K2N: Flexible K-to-N Modality Translation for Medical Image Synthesis

Feng Yuan, Yifan Gao, Yuehua Ye, Haoyue Li, Xin Gao

Main category: cs.CV

TL;DR: Med-K2N framework for cross-modal medical image synthesis that addresses K to N modality reconstruction challenges through adaptive weighting, quality-driven fusion, and modality identity consistency.

DetailsMotivation: Clinical need for flexible modality reconstruction where missing imaging modalities can be synthesized from available ones to support diagnosis, addressing challenges of heterogeneous modality contributions, fusion quality control, and modality identity consistency.

Method: Treats multi-modal medical data as sequential frames with quality-driven selection. Uses three modules: PreWeightNet for global contribution assessment, ThresholdNet for adaptive filtering, and EffiWeightNet for effective weight computation. Includes Causal Modality Identity Module (CMIM) for maintaining modality identity consistency through vision-language modeling.

Result: Outperforms state-of-the-art methods by significant margins on multiple benchmarks.

Conclusion: The proposed Med-K2N framework effectively addresses key challenges in cross-modal medical image synthesis through adaptive weighting mechanisms and modality identity preservation, demonstrating superior performance in K to N modality reconstruction tasks.

Abstract: Cross-modal medical image synthesis research focuses on reconstructing missing imaging modalities from available ones to support clinical diagnosis. Driven by clinical necessities for flexible modality reconstruction, we explore K to N medical generation, where three critical challenges emerge: How can we model the heterogeneous contributions of different modalities to various target tasks? How can we ensure fusion quality control to prevent degradation from noisy information? How can we maintain modality identity consistency in multi-output generation? Driven by these clinical necessities, and drawing inspiration from SAM2’s sequential frame paradigm and clinicians’ progressive workflow of incrementally adding and selectively integrating multi-modal information, we treat multi-modal medical data as sequential frames with quality-driven selection mechanisms. Our key idea is to “learn” adaptive weights for each modality-task pair and “memorize” beneficial fusion patterns through progressive enhancement. To achieve this, we design three collaborative modules: PreWeightNet for global contribution assessment, ThresholdNet for adaptive filtering, and EffiWeightNet for effective weight computation. Meanwhile, to maintain modality identity consistency, we propose the Causal Modality Identity Module (CMIM) that establishes causal constraints between generated images and target modality descriptions using vision-language modeling. Extensive experimental results demonstrate that our proposed Med-K2N outperforms state-of-the-art methods by significant margins on multiple benchmarks. Source code is available.

[146] ELMF4EggQ: Ensemble Learning with Multimodal Feature Fusion for Non-Destructive Egg Quality Assessment

Md Zahim Hassan, Md. Osama, Muhammad Ashad Kabir, Md. Saiful Islam, Zannatul Naim

Main category: cs.CV

TL;DR: ELMF4EggQ is an ensemble learning framework that uses multimodal feature fusion of external egg attributes (image, shape, weight) to classify egg grade and freshness without internal measurements, achieving high accuracy.

DetailsMotivation: Need for accurate, non-destructive egg quality assessment to ensure food safety, maintain product standards, and improve operational efficiency in poultry production.

Method: Uses ensemble learning with multimodal feature fusion: deep features from pre-trained CNNs (ResNet152, DenseNet169, ResNet152V2) on external images combined with shape and weight data, followed by PCA, SMOTE augmentation, and ensemble voting.
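
A sketch of the tabular tail of the pipeline with synthetic stand-in data; the classifier choices below are placeholders, not the paper's exact model set.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
cnn_feats = rng.normal(size=(186, 512))     # stand-in deep image features
shape_weight = rng.normal(size=(186, 4))    # stand-in shape/weight attributes
y = rng.integers(0, 3, size=186)            # stand-in grade labels

X = np.hstack([cnn_feats, shape_weight])    # multimodal feature fusion
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)  # balance classes

clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=50),                   # compress the deep features
    VotingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("lr", LogisticRegression(max_iter=1000))],
        voting="soft",                      # average class probabilities
    ),
)
clf.fit(X_res, y_res)
```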

Result: Multimodal approach significantly outperforms single-modality baselines, achieving 86.57% accuracy in grade classification and 70.83% in freshness prediction using only external features.

Conclusion: The framework enables non-invasive internal egg quality assessment using external attributes, with publicly available dataset and code to promote transparency and further research.

Abstract: Accurate, non-destructive assessment of egg quality is critical for ensuring food safety, maintaining product standards, and operational efficiency in commercial poultry production. This paper introduces ELMF4EggQ, an ensemble learning framework that employs multimodal feature fusion to classify egg grade and freshness using only external attributes - image, shape, and weight. A novel, publicly available dataset of 186 brown-shelled eggs was constructed, with egg grade and freshness levels determined through laboratory-based expert assessments involving internal quality measurements, such as yolk index and Haugh unit. To the best of our knowledge, this is the first study to apply machine learning methods for internal egg quality assessment using only external, non-invasive features, and the first to release a corresponding labeled dataset. The proposed framework integrates deep features extracted from external egg images with structural characteristics such as egg shape and weight, enabling a comprehensive representation of each egg. Image feature extraction is performed using top-performing pre-trained CNN models (ResNet152, DenseNet169, and ResNet152V2), followed by PCA-based dimensionality reduction, SMOTE augmentation, and classification using multiple machine learning algorithms. An ensemble voting mechanism combines predictions from the best-performing classifiers to enhance overall accuracy. Experimental results demonstrate that the multimodal approach significantly outperforms image-only and tabular (shape and weight) only baselines, with the multimodal ensemble approach achieving 86.57% accuracy in grade classification and 70.83% in freshness prediction. All code and data are publicly available at https://github.com/Kenshin-Keeps/Egg_Quality_Prediction_ELMF4EggQ, promoting transparency, reproducibility, and further research in this domain.

[147] One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework

Lorenzo Bianchi, Giacomo Pacini, Fabio Carrara, Nicola Messina, Giuseppe Amato, Fabrizio Falchi

Main category: cs.CV

TL;DR: Patch-ioner is a zero-shot captioning framework that shifts from image-centric to patch-centric paradigm, enabling captioning of arbitrary regions without region-level supervision by treating patches as atomic captioning units.

DetailsMotivation: Current zero-shot captioners are limited to global image representations and whole-image captions, lacking the ability to caption arbitrary regions without region-level supervision.

Method: Treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images. Uses backbones like DINO that produce meaningful dense visual features.
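
A minimal sketch of the patch-centric aggregation, assuming dense DINO-style patch features and a hypothetical `decode_caption` text decoder operating in the shared vision-language space.

```python
import torch

def region_feature(patch_feats, region_mask):
    """Aggregate patch embeddings covered by a region into one captioning unit.

    patch_feats: [H*W, D] dense features (e.g., from DINO);
    region_mask: [H*W] bool, True on patches inside the queried region.
    """
    v = patch_feats[region_mask].mean(dim=0)   # patches -> one atomic unit
    return v / v.norm()                        # renormalize for the shared space

# caption = decode_caption(region_feature(patch_feats, mask))  # hypothetical decoder
```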

Result: Achieves state-of-the-art performance on zero-shot dense, region-set, and trace captioning tasks, demonstrating better performance than other baselines and competitors.

Conclusion: Patch-wise semantic representations are effective for scalable caption generation, with DINO-like backbones being key to achieving state-of-the-art performance in region-based captioning tasks.

Abstract: Zero-shot captioners are recently proposed models that utilize common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they proceed by textually decoding a text-aligned image feature, but they limit their scope to global representations and whole-image captions. We present Patch-ioner, a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of arbitrary regions without the need of region-level supervision. Instead of relying on global image representations, we treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images. We analyze the key ingredients that enable current latent captioners to work in our novel proposed framework. Experiments demonstrate that backbones producing meaningful, dense visual features, such as DINO, are key to achieving state-of-the-art performance in multiple region-based captioning tasks. Compared to other baselines and state-of-the-art competitors, our models achieve better performance on zero-shot dense, region-set, and a newly introduced trace captioning task, highlighting the effectiveness of patch-wise semantic representations for scalable caption generation. Project page at https://paciosoft.com/Patch-ioner/ .

[148] Representation Learning for Compressed Video Action Recognition via Attentive Cross-modal Interaction with Motion Enhancement

Bing Li, Jiaxin Chen, Dongming Zhang, Xiuguo Bao, Di Huang

Main category: cs.CV

TL;DR: MEACI-Net is a two-stream framework for compressed video action recognition that enhances motion representations and improves RGB-motion fusion through selective motion complement and cross-modality augmentation modules.

DetailsMotivation: Compressed video action recognition suffers from coarse/noisy dynamics and insufficient fusion of RGB and motion modalities, which limits performance despite reduced computational costs.

Method: Two-stream architecture with motion stream using multi-scale blocks and denoising modules, plus Selective Motion Complement (SMC) and Cross-Modality Augment (CMA) modules for enhanced interaction between RGB and motion streams.

Result: Extensive experiments on UCF-101, HMDB-51 and Kinetics-400 benchmarks demonstrate the effectiveness and efficiency of MEACI-Net.

Conclusion: The proposed MEACI-Net framework successfully addresses the challenges of noisy motion dynamics and insufficient modality fusion in compressed video action recognition.

Abstract: Compressed video action recognition has recently drawn growing attention, since it remarkably reduces the storage and computational cost via replacing raw videos by sparsely sampled RGB frames and compressed motion cues (e.g., motion vectors and residuals). However, this task severely suffers from the coarse and noisy dynamics and the insufficient fusion of the heterogeneous RGB and motion modalities. To address the two issues above, this paper proposes a novel framework, namely Attentive Cross-modal Interaction Network with Motion Enhancement (MEACI-Net). It follows the two-stream architecture, i.e. one for the RGB modality and the other for the motion modality. Particularly, the motion stream employs a multi-scale block embedded with a denoising module to enhance representation learning. The interaction between the two streams is then strengthened by introducing the Selective Motion Complement (SMC) and Cross-Modality Augment (CMA) modules, where SMC complements the RGB modality with spatio-temporally attentive local motion features and CMA further combines the two modalities with selective feature augmentation. Extensive experiments on the UCF-101, HMDB-51 and Kinetics-400 benchmarks demonstrate the effectiveness and efficiency of MEACI-Net.

[149] Training-Free Out-Of-Distribution Segmentation With Foundation Models

Laith Nayal, Hadi Salloum, Ahmad Taha, Yaroslav Kholodov, Alexander Gasnikov

Main category: cs.CV

TL;DR: The paper proposes a simple, training-free approach using InternImage backbone features with K-Means clustering and confidence thresholding to detect out-of-distribution objects in semantic segmentation, achieving state-of-the-art performance on benchmarks.

DetailsMotivation: To investigate whether foundation models fine-tuned on segmentation datasets can inherently distinguish in-distribution from out-of-distribution regions without any outlier supervision, which is crucial for safety-critical applications like autonomous driving.

Method: A training-free approach that utilizes features from InternImage backbone and applies K-Means clustering alongside confidence thresholding on raw decoder logits to identify OoD clusters.
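
A compact sketch of the training-free recipe under assumed inputs (per-pixel backbone features and decoder logits); the cluster count and confidence threshold are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def ood_mask(features, logits, n_clusters=8, conf_thresh=0.5):
    """features: [N, D] per-pixel backbone features; logits: [N, C] decoder output."""
    z = logits - logits.max(axis=1, keepdims=True)        # numerically stable softmax
    probs = np.exp(z)
    conf = probs.max(axis=1) / probs.sum(axis=1)          # per-pixel confidence
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    mask = np.zeros(len(features), dtype=bool)
    for c in range(n_clusters):
        members = labels == c
        if conf[members].mean() < conf_thresh:            # low-confidence cluster
            mask[members] = True                          # flag it as OoD
    return mask
```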

Result: Achieves 50.02 Average Precision on RoadAnomaly benchmark and 48.77 on ADE-OoD benchmark with InternImage-L, surpassing several supervised and unsupervised baselines.

Conclusion: The results suggest a promising direction for generic OoD segmentation methods that require minimal assumptions or additional data.

Abstract: Detecting unknown objects in semantic segmentation is crucial for safety-critical applications such as autonomous driving. Large vision foundation models, including DINOv2, InternImage, and CLIP, have advanced visual representation learning by providing rich features that generalize well across diverse tasks. While their strength in closed-set semantic tasks is established, their capability to detect out-of-distribution (OoD) regions in semantic segmentation remains underexplored. In this work, we investigate whether foundation models fine-tuned on segmentation datasets can inherently distinguish in-distribution (ID) from OoD regions without any outlier supervision. We propose a simple, training-free approach that utilizes features from the InternImage backbone and applies K-Means clustering alongside confidence thresholding on raw decoder logits to identify OoD clusters. Our method achieves 50.02 Average Precision on the RoadAnomaly benchmark and 48.77 on the benchmark of ADE-OoD with InternImage-L, surpassing several supervised and unsupervised baselines. These results suggest a promising direction for generic OoD segmentation methods that require minimal assumptions or additional data.

[150] Don’t Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention

Xin Zou, Di Lu, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Xu Zheng, Linfeng Zhang, Xuming Hu

Main category: cs.CV

TL;DR: HoloV is a plug-and-play visual token pruning framework for MLLMs that addresses limitations of attention-first pruning approaches by distributing pruning budget across spatial crops to maintain global visual context.

DetailsMotivation: Current MLLMs suffer from computational overhead due to massive visual tokens, and existing attention-first pruning methods preserve semantically similar tokens, causing performance drops at high pruning ratios.

Method: HoloV rethinks token retention from a holistic perspective by adaptively distributing pruning budget across different spatial crops to ensure retained tokens capture global visual context rather than isolated features.
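
A sketch of crop-wise budget allocation, assuming a square patch grid and a uniform per-crop budget; HoloV's adaptive allocation is more refined, but the contrast with global top-k selection is the same.

```python
import torch

def holistic_prune(tokens, scores, grid=24, crops=4, keep_ratio=0.11):
    """Keep the top tokens within each spatial crop so every region stays covered.

    tokens: [grid*grid, D] visual tokens; scores: [grid*grid] importance scores.
    """
    scores2d = scores.view(grid, grid)
    step = grid // crops
    per_crop = max(1, int(keep_ratio * step * step))
    keep = []
    for i in range(crops):
        for j in range(crops):
            block = scores2d[i*step:(i+1)*step, j*step:(j+1)*step]
            local = block.flatten().topk(per_crop).indices  # best tokens in crop
            rows = i * step + local // step
            cols = j * step + local % step
            keep.append(rows * grid + cols)
    return tokens[torch.cat(keep)]
```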

Result: HoloV achieves superior performance across various tasks, MLLM architectures, and pruning ratios. LLaVA1.5 with HoloV preserves 95.8% of original performance after pruning 88.9% of visual tokens.

Conclusion: HoloV provides an effective solution for efficient MLLM inference by minimizing representational collapse and maintaining task-relevant information even under aggressive pruning, achieving superior efficiency-accuracy trade-offs.

Abstract: Despite their powerful capabilities, Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention or [CLS] attention to assess and discard redundant visual tokens. In this work, we identify a critical limitation of such attention-first pruning approaches, i.e., they tend to preserve semantically similar tokens, resulting in pronounced performance drops under high pruning ratios. To this end, we propose HoloV, a simple yet effective, plug-and-play visual token pruning framework for efficient inference. Distinct from previous attention-first schemes, HoloV rethinks token retention from a holistic perspective. By adaptively distributing the pruning budget across different spatial crops, HoloV ensures that the retained tokens capture the global visual context rather than isolated salient features. This strategy minimizes representational collapse and maintains task-relevant information even under aggressive pruning. Experimental results demonstrate that our HoloV achieves superior performance across various tasks, MLLM architectures, and pruning ratios compared to SOTA methods. For instance, LLaVA1.5 equipped with HoloV preserves 95.8% of the original performance after pruning 88.9% of visual tokens, achieving superior efficiency-accuracy trade-offs.

[151] Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting

Nikoo Naghavian, Mostafa Tavassolipour

Main category: cs.CV

TL;DR: CAW (Confidence-Aware Weighting) improves zero-shot robustness in vision-language models through confidence-aware loss and feature alignment regularization, outperforming recent methods under strong attacks with less memory usage.

DetailsMotivation: Vision-language models like CLIP show strong zero-shot generalization but are highly vulnerable to adversarial attacks, creating a need for improved robustness.

Method: CAW consists of: (1) Confidence-Aware loss that prioritizes uncertain adversarial examples by scaling KL divergence between clean and adversarial predictions, and (2) feature alignment regularization that preserves semantic consistency by minimizing distance between frozen and fine-tuned image encoder features on adversarial inputs.
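
A sketch of the two loss terms under assumed tensor shapes; here prediction entropy plays the role of the uncertainty weight, which may differ from the paper's exact scaling.

```python
import torch
import torch.nn.functional as F

def caw_loss(clean_logits, adv_logits, frozen_feat, tuned_feat, beta=1.0):
    """Confidence-aware KL term plus feature alignment to the frozen encoder."""
    p_clean = clean_logits.softmax(-1)
    kl = F.kl_div(adv_logits.log_softmax(-1), p_clean,
                  reduction="none").sum(-1)               # per-sample KL divergence
    entropy = -(p_clean * p_clean.clamp_min(1e-8).log()).sum(-1)
    confidence_aware = (entropy * kl).mean()              # upweight uncertain samples
    align = (tuned_feat - frozen_feat).pow(2).sum(-1).mean()
    return confidence_aware + beta * align
```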

Result: Extensive experiments on TinyImageNet and 14 additional datasets show CAW outperforms recent methods (PMG-AFT, TGA-ZSR) under strong attacks like AutoAttack while using less memory.

Conclusion: CAW effectively enhances both clean and robust accuracy in vision-language models without sacrificing generalization, providing a memory-efficient solution for zero-shot robustness.

Abstract: Vision-language models like CLIP demonstrate impressive zero-shot generalization but remain highly vulnerable to adversarial attacks. In this work, we propose Confidence-Aware Weighting (CAW) to enhance zero-shot robustness in vision-language models. CAW consists of two components: (1) a Confidence-Aware loss that prioritizes uncertain adversarial examples by scaling the KL divergence between clean and adversarial predictions, and (2) a feature alignment regularization that preserves semantic consistency by minimizing the distance between frozen and fine-tuned image encoder features on adversarial inputs. These components work jointly to improve both clean and robust accuracy without sacrificing generalization. Extensive experiments on TinyImageNet and 14 additional datasets show that CAW outperforms recent methods such as PMG-AFT and TGA-ZSR under strong attacks like AutoAttack, while using less memory.

[152] Multimodal Carotid Risk Stratification with Large Vision-Language Models: Benchmarking, Fine-Tuning, and Clinical Insights

Daphne Tsolissou, Theofanis Ganitidis, Konstantinos Mitsis, Stergios Christodoulidis, Maria Vakalopoulou, Konstantina Nikita

Main category: cs.CV

TL;DR: This study evaluates large vision-language models (LVLMs) for multimodal carotid plaque assessment by integrating ultrasound imaging with clinical data. While zero-shot performance was limited, domain adaptation using LoRA significantly improved stroke risk stratification, achieving competitive results compared to CNN baselines.

DetailsMotivation: Reliable carotid disease risk assessment requires integrating diverse clinical and imaging information in a transparent, interpretable way for clinicians. Current methods face challenges in multimodal data integration and clinical interpretability.

Method: Proposed a framework using interview-style question sequences to evaluate LVLMs. Adapted LLaVa-NeXT-Vicuna to ultrasound domain using low-rank adaptation (LoRA) and integrated multimodal tabular data as text.
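
A sketch of the adaptation step with the Hugging Face peft library; the target module names are typical for Vicuna-style transformer blocks and, like `base_model` (a loaded LLaVA-NeXT checkpoint), are assumptions rather than the paper's stated configuration.

```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
)
model = get_peft_model(base_model, lora_cfg)  # base_model: hypothetical LLaVA-NeXT
model.print_trainable_parameters()            # only the low-rank adapters train
```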

Result: Zero-shot experiments showed LVLMs struggled with imaging modality/anatomy identification and risk classification. Domain adaptation with LoRA substantially improved stroke risk stratification. Multimodal data integration enhanced specificity and balanced accuracy to competitive levels with CNN baselines.

Conclusion: LVLMs show promise but have limitations in ultrasound-based cardiovascular risk prediction. Multimodal integration, model calibration, and domain adaptation are crucial for clinical translation.

Abstract: Reliable risk assessment for carotid atheromatous disease remains a major clinical challenge, as it requires integrating diverse clinical and imaging information in a manner that is transparent and interpretable to clinicians. This study investigates the potential of state-of-the-art and recent large vision-language models (LVLMs) for multimodal carotid plaque assessment by integrating ultrasound imaging (USI) with structured clinical, demographic, laboratory, and protein biomarker data. A framework that simulates realistic diagnostic scenarios through interview-style question sequences is proposed, comparing a range of open-source LVLMs, including both general-purpose and medically tuned models. Zero-shot experiments reveal that, powerful as these models are, not all LVLMs can accurately identify imaging modality and anatomy, and all of them perform poorly at accurate risk classification. To address this limitation, LLaVa-NeXT-Vicuna is adapted to the ultrasound domain using low-rank adaptation (LoRA), resulting in substantial improvements in stroke risk stratification. The integration of multimodal tabular data in the form of text further enhances specificity and balanced accuracy, yielding competitive performance compared to prior convolutional neural network (CNN) baselines trained on the same dataset. Our findings highlight both the promise and limitations of LVLMs in ultrasound-based cardiovascular risk prediction, underscoring the importance of multimodal integration, model calibration, and domain adaptation for clinical translation.

[153] Flip Distribution Alignment VAE for Multi-Phase MRI Synthesis

Xiaoyan Kui, Qianmu Xiao, Qinsong Li, Zexin Ji, Jielin Zhang, Beiji Zou

Main category: cs.CV

TL;DR: FDA-VAE is a lightweight VAE model for multi-phase CE MRI synthesis that separates shared and independent features using symmetric latent distributions and bidirectional training.

DetailsMotivation: Existing methods use inefficient deep autoencoders with poor parameter efficiency and lack interpretable training strategies for feature separation in multi-phase CE MRI synthesis.

Method: Proposes FDA-VAE with symmetric latent distributions aligned to standard normal distribution, using Y-shaped bidirectional training strategy to separate shared and independent features.

Result: Significantly reduces model parameters and inference time while improving synthesis quality compared to existing deep autoencoder-based methods.

Conclusion: FDA-VAE provides an efficient and interpretable solution for multi-phase CE MRI synthesis with better performance and reduced computational requirements.

Abstract: Separating shared and independent features is crucial for multi-phase contrast-enhanced (CE) MRI synthesis. However, existing methods use deep autoencoder generators with low parameter efficiency and lack interpretable training strategies. In this paper, we propose Flip Distribution Alignment Variational Autoencoder (FDA-VAE), a lightweight feature-decoupled VAE model for multi-phase CE MRI synthesis. Our method encodes input and target images into two latent distributions that are symmetric with respect to a standard normal distribution, effectively separating shared and independent features. The Y-shaped bidirectional training strategy further enhances the interpretability of feature separation. Experimental results show that compared to existing deep autoencoder-based end-to-end synthesis methods, FDA-VAE significantly reduces model parameters and inference time while effectively improving synthesis quality. The source code is publicly available at https://github.com/QianMuXiao/FDA-VAE.

[154] TIT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency

Juntong Wang, Huiyu Duan, Jiarui Wang, Ziheng Jia, Guangtao Zhai, Xiongkuo Min

Main category: cs.CV

TL;DR: The paper introduces LPG-Bench, a benchmark for evaluating long-prompt text-to-image generation, and TIT, a novel metric that uses text-to-image-to-text consistency to better align with human preferences.

DetailsMotivation: Current text-to-image models struggle with long and detailed prompts, showing inconsistent generation, and existing evaluation metrics don't align well with human preferences for long-prompt scenarios.

Method: Created LPG-Bench with 200 long prompts (250+ words) and generated 2,600 images from 13 models. Proposed TIT metric that compares prompt consistency with LMM-generated descriptions of images, including TIT-Score and TIT-Score-LLM variants.
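
A sketch of the score-based instantiation under assumptions: `describe_image` stands in for any captioning LMM, and the sentence-embedding model is an arbitrary choice, not necessarily the one used for TIT-Score.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def tit_score(prompt, image):
    """Text-to-image-to-text consistency: compare the prompt with an LMM caption."""
    description = describe_image(image)            # hypothetical LMM call
    e1, e2 = encoder.encode([prompt, description], convert_to_tensor=True)
    return util.cos_sim(e1, e2).item()             # higher = better T2I alignment
```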

Result: TIT-Score-LLM achieved 7.31% absolute improvement in pairwise accuracy over the strongest baseline, demonstrating superior alignment with human judgment compared to existing metrics like CLIP-score and LMM-score.

Conclusion: LPG-Bench and TIT methods provide better tools for benchmarking and advancing text-to-image models, particularly for long-prompt scenarios.

Abstract: With the rapid advancement of large multimodal models (LMMs), recent text-to-image (T2I) models can generate high-quality images and demonstrate great alignment to short prompts. However, they still struggle to effectively understand and follow long and detailed prompts, displaying inconsistent generation. To address this challenge, we introduce LPG-Bench, a comprehensive benchmark for evaluating long-prompt-based text-to-image generation. LPG-Bench features 200 meticulously crafted prompts with an average length of over 250 words, approaching the input capacity of several leading commercial models. Using these prompts, we generate 2,600 images from 13 state-of-the-art models and further perform comprehensive human-ranked annotations. Based on LPG-Bench, we observe that state-of-the-art T2I alignment evaluation metrics exhibit poor consistency with human preferences on long-prompt-based image generation. To address the gap, we introduce a novel zero-shot metric based on text-to-image-to-text consistency, termed TIT, for evaluating long-prompt-generated images. The core concept of TIT is to quantify T2I alignment by directly comparing the consistency between the raw prompt and the LMM-produced description on the generated image, which includes an efficient score-based instantiation TIT-Score and a large-language-model (LLM) based instantiation TIT-Score-LLM. Extensive experiments demonstrate that our framework achieves superior alignment with human judgment compared to CLIP-score, LMM-score, etc., with TIT-Score-LLM attaining a 7.31% absolute improvement in pairwise accuracy over the strongest baseline. LPG-Bench and TIT methods together offer a deeper perspective to benchmark and foster the development of T2I models. All resources will be made publicly available.

[155] Towards Scalable and Consistent 3D Editing

Ruihao Xia, Yang Tang, Pan Zhou

Main category: cs.CV

TL;DR: The paper introduces 3DEditVerse, the largest paired 3D editing benchmark, and 3DEditFormer, a 3D-structure-preserving conditional transformer that enables precise 3D editing without requiring manual 3D masks.

DetailsMotivation: 3D editing remains challenging due to cross-view consistency, structural fidelity, and fine-grained controllability issues. Existing approaches are slow, prone to geometric distortions, or dependent on error-prone manual 3D masks.

Method: Built 3DEditVerse benchmark with 116,309 training pairs and 1,500 test pairs using pose-driven geometric edits and foundation model-guided appearance edits. Proposed 3DEditFormer transformer with dual-guidance attention and time-adaptive gating to disentangle editable regions from preserved structure.

Result: Extensive experiments show the framework outperforms state-of-the-art baselines both quantitatively and qualitatively, establishing a new standard for practical and scalable 3D editing.

Conclusion: The approach addresses key challenges in 3D editing by advancing both data (through 3DEditVerse benchmark) and models (through 3DEditFormer), enabling precise and consistent edits without auxiliary 3D masks.

Abstract: 3D editing - the task of locally modifying the geometry or appearance of a 3D asset - has wide applications in immersive content creation, digital entertainment, and AR/VR. However, unlike 2D editing, it remains challenging due to the need for cross-view consistency, structural fidelity, and fine-grained controllability. Existing approaches are often slow, prone to geometric distortions, or dependent on manual and accurate 3D masks that are error-prone and impractical. To address these challenges, we advance both the data and model fronts. On the data side, we introduce 3DEditVerse, the largest paired 3D editing benchmark to date, comprising 116,309 high-quality training pairs and 1,500 curated test pairs. Built through complementary pipelines of pose-driven geometric edits and foundation model-guided appearance edits, 3DEditVerse ensures edit locality, multi-view consistency, and semantic alignment. On the model side, we propose 3DEditFormer, a 3D-structure-preserving conditional transformer. By enhancing image-to-3D generation with dual-guidance attention and time-adaptive gating, 3DEditFormer disentangles editable regions from preserved structure, enabling precise and consistent edits without requiring auxiliary 3D masks. Extensive experiments demonstrate that our framework outperforms state-of-the-art baselines both quantitatively and qualitatively, establishing a new standard for practical and scalable 3D editing. Dataset and code will be released. Project: https://www.lv-lab.org/3DEditFormer/

[156] Not every day is a sunny day: Synthetic cloud injection for deep land cover segmentation robustness evaluation across data sources

Sara Mobsite, Renaud Hostache, Laure Berti Equille, Emmanuel Roux, Joris Guerin

Main category: cs.CV

TL;DR: This paper addresses land cover segmentation challenges in tropical regions with frequent cloud cover by developing a cloud injection algorithm and proposing a lightweight method to inject Normalized Difference Indices into deep learning models to retain spatial features.

DetailsMotivation: Most existing Sentinel-2 datasets are cloud-free, limiting their usefulness in tropical regions where clouds are common. There's also an issue of losing spatial/spectral details during encoder downsampling in deep networks.

Method: Developed a cloud injection algorithm to simulate realistic cloud cover, and proposed injecting Normalized Difference Indices (NDIs) into final decoding layers to retain key spatial features with minimal computation. Also tested Sentinel-1 radar data to fill gaps from cloud-obstructed optical imagery.
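
A minimal sketch of the injection step, assuming Sentinel-2 band ordering with B4 (red) at index 3 and B8 (NIR) at index 7, and NDVI as the example index; which NDIs to inject and where is a design choice of the paper.

```python
import torch
import torch.nn.functional as F

def inject_ndi(decoder_feats, s2_bands, eps=1e-6):
    """Concatenate an NDI channel onto the final decoder feature map.

    decoder_feats: [B, C, h, w]; s2_bands: [B, 13, H, W] Sentinel-2 reflectances.
    """
    red, nir = s2_bands[:, 3:4], s2_bands[:, 7:8]
    ndvi = (nir - red) / (nir + red + eps)                 # vegetation index
    ndvi = F.interpolate(ndvi, size=decoder_feats.shape[-2:])
    return torch.cat([decoder_feats, ndvi], dim=1)         # extra channel for the head
```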

Result: NDI injection improved land cover segmentation performance by 1.99% for U-Net and 2.78% for DeepLabV3 on cloud-free imagery. Under cloud-covered conditions, incorporating Sentinel-1 data led to significant performance gains across all models compared to using optical data alone.

Conclusion: Radar-optical fusion is effective in challenging atmospheric scenarios, and injecting NDIs enhances land cover segmentation while preserving spatial features with minimal computational overhead.

Abstract: Supervised deep learning for land cover semantic segmentation (LCS) relies on labeled satellite data. However, most existing Sentinel-2 datasets are cloud-free, which limits their usefulness in tropical regions where clouds are common. To properly evaluate the extent of this problem, we developed a cloud injection algorithm that simulates realistic cloud cover, allowing us to test how Sentinel-1 radar data can fill in the gaps caused by cloud-obstructed optical imagery. We also tackle the issue of losing spatial and/or spectral details during encoder downsampling in deep networks. To mitigate this loss, we propose a lightweight method that injects Normalized Difference Indices (NDIs) into the final decoding layers, enabling the model to retain key spatial features with minimal additional computation. Injecting NDIs enhanced land cover segmentation performance on the DFC2020 dataset, yielding improvements of 1.99% for U-Net and 2.78% for DeepLabV3 on cloud-free imagery. Under cloud-covered conditions, incorporating Sentinel-1 data led to significant performance gains across all models compared to using optical data alone, highlighting the effectiveness of radar-optical fusion in challenging atmospheric scenarios.

[157] PocketSR: The Super-Resolution Expert in Your Pocket Mobiles

Haoze Sun, Linfeng Jiang, Fan Li, Renjing Pei, Zhixin Wang, Yong Guo, Jiaqi Xu, Haoyu Chen, Jin Han, Fenglong Song, Yujiu Yang, Wenbo Li

Main category: cs.CV

TL;DR: PocketSR is an ultra-lightweight single-step model for real-world image super-resolution that achieves state-of-the-art performance while being highly efficient for edge deployment.

Motivation: Existing RealSR methods using large generative models have high computational costs and latency, making them impractical for edge devices like mobile phones.

Method: Uses LiteED (97.5% parameter reduction from original VAE) and online annealing pruning for U-Net with multi-layer feature distillation to transfer generative priors from heavy to lightweight modules.

Result: 146M parameter model processes 4K images in 0.8 seconds, achieving performance comparable to state-of-the-art single-step and multi-step RealSR models.

Conclusion: PocketSR provides a highly practical solution for edge-device applications by balancing efficiency and performance in real-world image super-resolution.

Abstract: Real-world image super-resolution (RealSR) aims to enhance the visual quality of in-the-wild images, such as those captured by mobile phones. While existing methods leveraging large generative models demonstrate impressive results, the high computational cost and latency make them impractical for edge deployment. In this paper, we introduce PocketSR, an ultra-lightweight, single-step model that brings generative modeling capabilities to RealSR while maintaining high fidelity. To achieve this, we design LiteED, a highly efficient alternative to the original computationally intensive VAE in SD, reducing parameters by 97.5% while preserving high-quality encoding and decoding. Additionally, we propose online annealing pruning for the U-Net, which progressively shifts generative priors from heavy modules to lightweight counterparts, ensuring effective knowledge transfer and further optimizing efficiency. To mitigate the loss of prior knowledge during pruning, we incorporate a multi-layer feature distillation loss. Through an in-depth analysis of each design component, we provide valuable insights for future research. PocketSR, with a model size of 146M parameters, processes 4K images in just 0.8 seconds, achieving a remarkable speedup over previous methods. Notably, it delivers performance on par with state-of-the-art single-step and even multi-step RealSR models, making it a highly practical solution for edge-device applications.
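
The multi-layer feature distillation component can be read as a weighted MSE between matched student and teacher feature maps. Below is a minimal PyTorch sketch under that assumption; it is not PocketSR's actual loss, and the layer choices and weights are placeholders.

```python
# Minimal sketch (assumed, not PocketSR's code): a multi-layer feature
# distillation loss that keeps a pruned student's intermediate features
# close to the heavy teacher's, mitigating prior loss during pruning.
import torch
import torch.nn.functional as F

def multilayer_distill_loss(student_feats, teacher_feats, weights=None):
    """MSE between matched intermediate feature maps, optionally weighted per layer."""
    if weights is None:
        weights = [1.0] * len(student_feats)
    loss = 0.0
    for w, s, t in zip(weights, student_feats, teacher_feats):
        # Teacher features are treated as fixed targets.
        loss = loss + w * F.mse_loss(s, t.detach())
    return loss

# Toy usage with random stand-in features from three matched layers.
student = [torch.randn(2, c, 32, 32, requires_grad=True) for c in (64, 128, 256)]
teacher = [torch.randn(2, c, 32, 32) for c in (64, 128, 256)]
print(multilayer_distill_loss(student, teacher).item())
```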

[158] When and Where do Events Switch in Multi-Event Video Generation?

Ruotong Liao, Guowen Huang, Qing Cheng, Thomas Seidl, Daniel Cremers, Volker Tresp

Main category: cs.CV

TL;DR: MEve is a self-curated prompt suite for evaluating multi-event text-to-video generation; a systematic analysis reveals that early intervention in denoising steps and in block-wise model layers is crucial for controlling event transitions.

Motivation: Existing methods for multi-event video generation overlook intrinsic factors in event shifting, and the paper aims to understand when and where multi-event prompts control event transitions during T2V generation.

Method: Introduces MEve evaluation suite and conducts systematic study of OpenSora and CogVideoX model families through extensive experiments.

Result: Experiments demonstrate the importance of early intervention in denoising steps and block-wise model layers for multi-event video generation.

Conclusion: The study reveals essential factors for multi-event video generation and highlights possibilities for multi-event conditioning in future models.

Abstract: Text-to-video (T2V) generation has surged in response to challenging questions, especially when a long video must depict multiple sequential events with temporal coherence and controllable content. Existing methods that extend to multi-event generation omit an inspection of the intrinsic factors in event shifting. The paper aims to answer the central question: when and where do multi-event prompts control event transitions during T2V generation? This work introduces MEve, a self-curated prompt suite for evaluating multi-event text-to-video (T2V) generation, and conducts a systematic study of two representative model families, i.e., OpenSora and CogVideoX. Extensive experiments demonstrate the importance of early intervention in denoising steps and block-wise model layers, revealing the essential factors for multi-event video generation and highlighting the possibilities for multi-event conditioning in future models.

[159] InsideOut: An EfficientNetV2-S Based Deep Learning Framework for Robust Multi-Class Facial Emotion Recognition

Ahsan Farabi, Israt Khandaker, Ibrahim Khalil Shanto, Md Abdul Ahad Minhaz, Tanisha Zaman

Main category: cs.CV

TL;DR: InsideOut is a reproducible FER framework using EfficientNetV2-S with transfer learning, data augmentation, and imbalance-aware optimization to achieve 62.8% accuracy on FER2013.

Motivation: FER is challenging due to occlusions, illumination variations, pose changes, subtle emotion differences, and dataset imbalance that affects minority emotion recognition.

Method: Uses EfficientNetV2-S with transfer learning, strong data augmentation, stratified splitting, and fine-tunes a lightweight classification head with class-weighted loss to handle imbalanced data.

Result: Achieves 62.8% accuracy with macro averaged F1 of 0.590 on FER2013, showing competitive performance compared to conventional CNN baselines.

Conclusion: Demonstrates that efficient architectures combined with tailored imbalance handling can provide practical, transparent, and reproducible FER solutions.

Abstract: Facial Emotion Recognition (FER) is a key task in affective computing, enabling applications in human-computer interaction, e-learning, healthcare, and safety systems. Despite advances in deep learning, FER remains challenging due to occlusions, illumination and pose variations, subtle intra-class differences, and dataset imbalance that hinders recognition of minority emotions. We present InsideOut, a reproducible FER framework built on EfficientNetV2-S with transfer learning, strong data augmentation, and imbalance-aware optimization. The approach standardizes FER2013 images, applies stratified splitting and augmentation, and fine-tunes a lightweight classification head with class-weighted loss to address skewed distributions. InsideOut achieves 62.8% accuracy with a macro averaged F1 of 0.590 on FER2013, showing competitive results compared to conventional CNN baselines. The novelty lies in demonstrating that efficient architectures, combined with tailored imbalance handling, can provide practical, transparent, and reproducible FER solutions.
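
A minimal sketch of the imbalance-aware recipe, assuming a torchvision EfficientNetV2-S backbone, a replaced linear head, and inverse-frequency class weights; the class counts shown are illustrative and should be computed from the actual training split.

```python
# Minimal sketch (assumptions, not the authors' release): EfficientNetV2-S
# with a lightweight classification head and a class-weighted loss for the
# seven imbalanced FER2013 emotion classes.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 7  # FER2013: angry, disgust, fear, happy, sad, surprise, neutral

backbone = models.efficientnet_v2_s(weights=models.EfficientNet_V2_S_Weights.DEFAULT)
in_features = backbone.classifier[1].in_features
backbone.classifier[1] = nn.Linear(in_features, NUM_CLASSES)  # new lightweight head

# Inverse-frequency class weights (illustrative counts; compute from the real split).
class_counts = torch.tensor([3995., 436., 4097., 7215., 4830., 3171., 4965.])
weights = class_counts.sum() / (NUM_CLASSES * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

x = torch.randn(4, 3, 224, 224)  # FER2013 grayscale replicated to 3 channels
loss = criterion(backbone(x), torch.randint(0, NUM_CLASSES, (4,)))
print(loss.item())
```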

[160] What Drives Compositional Generalization in Visual Generative Models?

Karim Farid, Rajat Sahay, Yumna Ali Alnaggar, Simon Schrodi, Volker Fischer, Cordelia Schmid, Thomas Brox

Main category: cs.CV

TL;DR: The paper studies how design choices affect compositional generalization in visual generative models, identifying training objective type (discrete vs continuous) and conditioning information as key factors, and proposes improving discrete models with auxiliary continuous objectives.

Motivation: To systematically understand mechanisms that enable or inhibit compositional generalization in visual generative models, as this ability to generate novel combinations of known concepts is crucial but not fully understood.

Method: Conducted controlled experiments to study design choices, identified key factors (training objective type and conditioning information), and proposed relaxing discrete MaskGIT loss with auxiliary continuous JEPA-based objective.

Result: Identified that training objective operating on discrete vs continuous distributions and extent of conditioning information about constituent concepts significantly influence compositional generalization.

Conclusion: Auxiliary continuous objectives can improve compositional performance in discrete generative models, providing a practical approach to enhance compositional generalization capabilities.

Abstract: Compositional generalization, the ability to generate novel combinations of known concepts, is a key ingredient for visual generative models. Yet, not all mechanisms that enable or inhibit it are fully understood. In this work, we conduct a systematic study of how various design choices influence compositional generalization in image and video generation in a positive or negative way. Through controlled experiments, we identify two key factors: (i) whether the training objective operates on a discrete or continuous distribution, and (ii) to what extent conditioning provides information about the constituent concepts during training. Building on these insights, we show that relaxing the MaskGIT discrete loss with an auxiliary continuous JEPA-based objective can improve compositional performance in discrete models like MaskGIT.

[161] Latent Diffusion Unlearning: Protecting Against Unauthorized Personalization Through Trajectory Shifted Perturbations

Naresh Kumar Devulapally, Shruti Agarwal, Tejas Gokhale, Vishnu Suresh Lokhande

Main category: cs.CV

TL;DR: Proposes a latent space perturbation method for diffusion models to create “unlearnable” training samples that resist unauthorized personalization while maintaining visual quality.

Motivation: Address privacy and IP concerns from unauthorized use of personalization techniques in text-to-image diffusion models by creating imperceptible poisoned training data.

Method: Model-based perturbation strategy in latent space using trajectory-shifted sampling that alternates between denoising and inversion while modifying denoising starting points.

Result: Achieves 8-10% improvement in perceptual metrics (PSNR, SSIM, FID) and ~10% better robustness across five adversarial settings on four benchmark datasets.

Conclusion: The method provides effective and imperceptible defense against unauthorized model adaptation in diffusion models while maintaining high visual fidelity.

Abstract: Text-to-image diffusion models have demonstrated remarkable effectiveness in rapid and high-fidelity personalization, even when provided with only a few user images. However, the effectiveness of personalization techniques has led to concerns regarding data privacy, intellectual property protection, and unauthorized usage. To mitigate such unauthorized usage and model replication, the idea of generating "unlearnable" training samples utilizing image poisoning techniques has emerged. Existing methods for this have limited imperceptibility as they operate in the pixel space, which results in images with noise and artifacts. In this work, we propose a novel model-based perturbation strategy that operates within the latent space of diffusion models. Our method alternates between denoising and inversion while modifying the starting point of the denoising trajectory. This trajectory-shifted sampling ensures that the perturbed images maintain high visual fidelity to the original inputs while being resistant to inversion and personalization by downstream generative models. This approach integrates unlearnability into the framework of Latent Diffusion Models (LDMs), enabling a practical and imperceptible defense against unauthorized model adaptation. We validate our approach on four benchmark datasets to demonstrate robustness against state-of-the-art inversion attacks. Results demonstrate that our method achieves significant improvements in imperceptibility (~8-10% on perceptual metrics including PSNR, SSIM, and FID) and robustness (~10% on average across five adversarial settings), highlighting its effectiveness in safeguarding sensitive data.
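
The alternation between denoising and inversion with a shifted starting point can be sketched as follows. This is a heavily simplified toy: `denoise_step` and `invert_step` are dummy stand-ins for a real LDM's sampling and inversion updates, and the shift schedule is an assumption of ours.

```python
# Highly simplified sketch of the alternating denoise/invert idea with a
# shifted denoising start point. Everything here is illustrative, not the
# paper's implementation.
import torch

def denoise_step(z, t):
    return z - 0.01 * torch.randn_like(z)  # placeholder for one real sampler step

def invert_step(z, t):
    return z + 0.01 * torch.randn_like(z)  # placeholder for one inversion step

def trajectory_shifted_perturb(z0, num_rounds=4, shift_scale=0.1, steps=10):
    """Alternate inversion and denoising, perturbing the denoising start point."""
    z = z0.clone()
    for _ in range(num_rounds):
        for t in range(steps):                     # walk the latent up the trajectory
            z = invert_step(z, t)
        z = z + shift_scale * torch.randn_like(z)  # shift the denoising start point
        for t in reversed(range(steps)):           # walk back down
            z = denoise_step(z, t)
    return z

protected_latent = trajectory_shifted_perturb(torch.randn(1, 4, 64, 64))
print(protected_latent.shape)
```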

[162] Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields

Zhiting Mei, Ola Shorinwa, Anirudha Majumdar

Main category: cs.CV

TL;DR: Geometry-grounded semantic features in radiance fields provide finer structural details but surprisingly decrease pose estimation accuracy, while visual-only features offer greater versatility for downstream tasks.

Motivation: To investigate whether geometry-grounded semantic features offer advantages over visual-only features in distilled radiance fields for spatial tasks like pose estimation.

Method: Proposed SPINE framework with two components: coarse inversion using distilled semantics and fine inversion using photometric-based optimization, comparing geometry-grounded vs visual-only features.

Result: Geometry-grounded features contain finer structural details but surprisingly decrease pose estimation accuracy, while visual-only features show better versatility across tasks.

Conclusion: Visual-only features are more versatile for downstream tasks despite geometry-grounded features having more geometric detail, highlighting the need for better geometry-grounding strategies.

Abstract: Semantic distillation in radiance fields has spurred significant advances in open-vocabulary robot policies, e.g., in manipulation and navigation, founded on pretrained semantics from large vision models. While prior work has demonstrated the effectiveness of visual-only semantic features (e.g., DINO and CLIP) in Gaussian Splatting and neural radiance fields, the potential benefit of geometry-grounding in distilled fields remains an open question. In principle, visual-geometry features seem very promising for spatial tasks such as pose estimation, prompting the question: Do geometry-grounded semantic features offer an edge in distilled fields? Specifically, we ask three critical questions: First, does spatial-grounding produce higher-fidelity geometry-aware semantic features? We find that image features from geometry-grounded backbones contain finer structural details compared to their counterparts. Secondly, does geometry-grounding improve semantic object localization? We observe no significant difference in this task. Thirdly, does geometry-grounding enable higher-accuracy radiance field inversion? Given the limitations of prior work and their lack of semantics integration, we propose a novel framework SPINE for inverting radiance fields without an initial guess, consisting of two core components: coarse inversion using distilled semantics, and fine inversion using photometric-based optimization. Surprisingly, we find that the pose estimation accuracy decreases with geometry-grounded features. Our results suggest that visual-only features offer greater versatility for a broader range of downstream tasks, although geometry-grounded features contain more geometric detail. Notably, our findings underscore the necessity of future research on effective strategies for geometry-grounding that augment the versatility and performance of pretrained semantic features.

[163] GeoComplete: Geometry-Aware Diffusion for Reference-Driven Image Completion

Beibei Lin, Tingting Chen, Robby T. Tan

Main category: cs.CV

TL;DR: GeoComplete is a novel framework for reference-driven image completion that incorporates explicit 3D structural guidance to enforce geometric consistency, outperforming existing methods by 17.1 PSNR.

Motivation: Existing generative methods for reference-driven image completion rely solely on diffusion priors without geometric cues, often producing misaligned or implausible content when target views differ significantly from references.

Method: Uses a dual-branch diffusion architecture: one branch synthesizes missing regions from masked target, while another extracts geometric features from projected point clouds. Includes target-aware masking to detect occluded areas and joint self-attention across branches.

Result: Achieves 17.1 PSNR improvement over state-of-the-art methods, significantly boosting geometric accuracy while maintaining high visual quality.

Conclusion: GeoComplete offers a unified and robust solution for geometry-conditioned image completion by integrating geometry-aware dual-branch diffusion with target-aware masking.

Abstract: Reference-driven image completion, which restores missing regions in a target view using additional images, is particularly challenging when the target view differs significantly from the references. Existing generative methods rely solely on diffusion priors and, without geometric cues such as camera pose or depth, often produce misaligned or implausible content. We propose GeoComplete, a novel framework that incorporates explicit 3D structural guidance to enforce geometric consistency in the completed regions, setting it apart from prior image-only approaches. GeoComplete introduces two key ideas: conditioning the diffusion process on projected point clouds to infuse geometric information, and applying target-aware masking to guide the model toward relevant reference cues. The framework features a dual-branch diffusion architecture. One branch synthesizes the missing regions from the masked target, while the other extracts geometric features from the projected point cloud. Joint self-attention across branches ensures coherent and accurate completion. To address regions visible in references but absent in the target, we project the target view into each reference to detect occluded areas, which are then masked during training. This target-aware masking directs the model to focus on useful cues, enhancing performance in difficult scenarios. By integrating a geometry-aware dual-branch diffusion architecture with a target-aware masking strategy, GeoComplete offers a unified and robust solution for geometry-conditioned image completion. Experiments show that GeoComplete achieves a 17.1 PSNR improvement over state-of-the-art methods, significantly boosting geometric accuracy while maintaining high visual quality.

[164] HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

Shiyi Zhang, Dong Liang, Hairong Zheng, Yihang Zhou

Main category: cs.CV

TL;DR: HAVIR model reconstructs visual information from brain activity by separating visual cortex into hierarchical regions and extracting distinct structural and semantic features, achieving superior reconstruction quality in complex scenes.

Motivation: Existing methods struggle to accurately reconstruct complex visual stimuli due to heterogeneous low-level features and semantically entangled high-level features in natural scenes.

Method: Separates visual cortex into hierarchical regions: Structural Generator extracts structural information from spatial processing voxels as latent diffusion priors, and Semantic Extractor converts semantic processing voxels into CLIP embeddings, integrated via Versatile Diffusion model.

Result: HAVIR enhances both structural and semantic quality of reconstructions, outperforms existing models, and works well even in complex scenes.

Conclusion: The hierarchical approach inspired by visual cortex representation theory effectively addresses challenges in brain activity-based visual reconstruction.

Abstract: The reconstruction of visual information from brain activity fosters interdisciplinary integration between neuroscience and computer vision. However, existing methods still face challenges in accurately recovering highly complex visual stimuli. This difficulty stems from the characteristics of natural scenes: low-level features exhibit heterogeneity, while high-level features show semantic entanglement due to contextual overlaps. Inspired by the hierarchical representation theory of the visual cortex, we propose the HAVIR model, which separates the visual cortex into two hierarchical regions and extracts distinct features from each. Specifically, the Structural Generator extracts structural information from spatial processing voxels and converts it into latent diffusion priors, while the Semantic Extractor converts semantic processing voxels into CLIP embeddings. These components are integrated via the Versatile Diffusion model to synthesize the final image. Experimental results demonstrate that HAVIR enhances both the structural and semantic quality of reconstructions, even in complex scenes, and outperforms existing models.

[165] Mask2IV: Interaction-Centric Video Generation via Mask Trajectories

Gen Li, Bo Zhao, Jianfei Yang, Laura Sevilla-Lara

Main category: cs.CV

TL;DR: Mask2IV is a novel framework for interaction-centric video generation that uses a decoupled two-stage pipeline to predict motion trajectories and generate videos without requiring dense mask annotations.

Motivation: Generating interaction-centric videos is crucial for embodied intelligence, but existing methods struggle with complex interactions and require dense mask annotations, which are challenging to obtain.

Method: Two-stage pipeline: first predicts motion trajectories for actor and object, then generates video conditioned on these trajectories. Supports control through action descriptions or spatial position cues.

Result: Achieves superior visual realism and controllability compared to existing baselines, as demonstrated through extensive experiments on curated benchmarks.

Conclusion: Mask2IV provides an effective solution for interaction-centric video generation that eliminates the need for dense mask inputs while maintaining flexibility and control over the interaction process.

Abstract: Generating interaction-centric videos, such as those depicting humans or robots interacting with objects, is crucial for embodied intelligence, as they provide rich and diverse visual priors for robot learning, manipulation policy training, and affordance reasoning. However, existing methods often struggle to model such complex and dynamic interactions. While recent studies show that masks can serve as effective control signals and enhance generation quality, obtaining dense and precise mask annotations remains a major challenge for real-world use. To overcome this limitation, we introduce Mask2IV, a novel framework specifically designed for interaction-centric video generation. It adopts a decoupled two-stage pipeline that first predicts plausible motion trajectories for both actor and object, then generates a video conditioned on these trajectories. This design eliminates the need for dense mask inputs from users while preserving the flexibility to manipulate the interaction process. Furthermore, Mask2IV supports versatile and intuitive control, allowing users to specify the target object of interaction and guide the motion trajectory through action descriptions or spatial position cues. To support systematic training and evaluation, we curate two benchmarks covering diverse action and object categories across both human-object interaction and robotic manipulation scenarios. Extensive experiments demonstrate that our method achieves superior visual realism and controllability compared to existing baselines.

[166] ReeMark: Reeb Graphs for Simulating Patterns of Life in Spatiotemporal Trajectories

Anantajit Subrahmanya, Chandrakanth Gudavalli, Connor Levenson, Umang Garg, B. S. Manjunath

Main category: cs.CV

TL;DR: Markovian Reeb Graphs is a novel framework for simulating realistic human mobility trajectories that preserve Patterns of Life learned from baseline data, combining individual and population-level structures.

Motivation: Accurate human mobility modeling is critical for urban planning, epidemiology, and traffic management, requiring methods that capture both consistency and variability in daily life patterns.

Method: Combines individual- and population-level mobility structures within a probabilistic topological model to generate spatiotemporal trajectories that preserve learned Patterns of Life.

Result: Evaluation on Urban Anomalies dataset (Atlanta and Berlin) using Jensen-Shannon Divergence shows strong fidelity across population- and agent-level metrics while remaining data- and compute-efficient.

Conclusion: Markovian Reeb Graphs provides a scalable framework for trajectory simulation with broad applicability across diverse urban environments.

Abstract: Accurately modeling human mobility is critical for urban planning, epidemiology, and traffic management. In this work, we introduce Markovian Reeb Graphs, a novel framework for simulating spatiotemporal trajectories that preserve Patterns of Life (PoLs) learned from baseline data. By combining individual- and population-level mobility structures within a probabilistic topological model, our approach generates realistic future trajectories that capture both consistency and variability in daily life. Evaluations on the Urban Anomalies dataset (Atlanta and Berlin subsets) using the Jensen-Shannon Divergence (JSD) across population- and agent-level metrics demonstrate that the proposed method achieves strong fidelity while remaining data- and compute-efficient. These results position Markovian Reeb Graphs as a scalable framework for trajectory simulation with broad applicability across diverse urban environments.
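
As a toy illustration of trajectory simulation from a learned transition structure (not the paper's Reeb-graph construction), the following samples a day of stay-point transitions from a time-dependent Markov model with hand-written probabilities.

```python
# Toy sketch (assumption-laden, not the paper's code): sample an agent's
# daily location sequence from a time-inhomogeneous Markov model over
# "stay points".
import numpy as np

locations = ["home", "work", "cafe", "gym"]

def transition_matrix(hour: int) -> np.ndarray:
    """Illustrative time-of-day-dependent transition probabilities."""
    if 7 <= hour < 10:    # morning commute: strong pull toward work
        return np.array([[0.2, 0.7, 0.1, 0.0],
                         [0.0, 0.9, 0.1, 0.0],
                         [0.1, 0.8, 0.1, 0.0],
                         [0.2, 0.7, 0.1, 0.0]])
    if 17 <= hour < 21:   # evening: head home, maybe gym
        return np.array([[0.8, 0.0, 0.1, 0.1],
                         [0.5, 0.2, 0.1, 0.2],
                         [0.6, 0.0, 0.2, 0.2],
                         [0.7, 0.0, 0.1, 0.2]])
    return np.eye(4) * 0.8 + 0.05  # otherwise mostly stay put

rng = np.random.default_rng(0)
state, trajectory = 0, []
for hour in range(24):
    probs = transition_matrix(hour)[state]
    state = rng.choice(len(locations), p=probs / probs.sum())
    trajectory.append((hour, locations[state]))
print(trajectory)
```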

[167] SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu, Zhongan Bi, Chenyang Si, Tiansheng Sun, Caifeng Shan

Main category: cs.CV

TL;DR: SpineMed introduces a comprehensive ecosystem for AI-assisted spine diagnosis, featuring a large-scale multimodal dataset (SpineMed-450k) and a clinically-grounded benchmark (SpineBench) to address limitations in vertebral-level reasoning across imaging modalities.

Motivation: Spine disorders affect 619 million people globally and are a leading cause of disability, but AI-assisted diagnosis is limited by the lack of level-aware, multimodal datasets and standardized benchmarks for clinical decision-making across X-ray, CT, and MRI at specific vertebral levels.

Method: Developed SpineMed-450k dataset with 450,000 instruction instances using clinician-in-the-loop pipeline with two-stage LLM generation (draft and revision), curated from textbooks, guidelines, open datasets, and ~1,000 hospital cases. Created SpineBench evaluation framework for clinically salient tasks including level identification, pathology assessment, and surgical planning.

Result: Evaluation of advanced LVLMs on SpineBench revealed systematic weaknesses in fine-grained, level-specific reasoning. The model fine-tuned on SpineMed-450k demonstrated consistent and significant improvements across all tasks. Clinician assessments confirmed diagnostic clarity and practical utility.

Conclusion: SpineMed addresses critical gaps in spine AI diagnosis by providing traceable, clinically-grounded data and benchmarks, enabling improved vertebral-level reasoning across imaging modalities for enhanced clinical decision-making in spine disorders.

Abstract: Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model’s outputs.

[168] UniShield: An Adaptive Multi-Agent Framework for Unified Forgery Image Detection and Localization

Qing Huang, Zhipei Xu, Xuanyu Zhang, Jian Zhang

Main category: cs.CV

TL;DR: UniShield is a multi-agent unified system for detecting and localizing image forgeries across multiple domains, using a perception agent to select detection models and a detection agent to generate interpretable reports.

Motivation: Address limitations of domain-specific forgery detection methods including narrow specialization, poor cross-domain generalization, and lack of integrated adaptive frameworks.

Method: Multi-agent system with perception agent analyzing image features to dynamically select detection models, and detection agent consolidating expert detectors into unified framework.

Result: Achieves state-of-the-art results, surpassing both unified approaches and domain-specific detectors.

Conclusion: UniShield demonstrates superior practicality, adaptiveness, and scalability for forgery detection across diverse domains.

Abstract: With the rapid advancements in image generation, synthetic images have become increasingly realistic, posing significant societal risks, such as misinformation and fraud. Forgery Image Detection and Localization (FIDL) thus emerges as essential for maintaining information integrity and societal security. Despite impressive performances by existing domain-specific detection methods, their practical applicability remains limited, primarily due to their narrow specialization, poor cross-domain generalization, and the absence of an integrated adaptive framework. To address these issues, we propose UniShield, a novel multi-agent-based unified system capable of detecting and localizing image forgeries across diverse domains, including image manipulation, document manipulation, DeepFake, and AI-generated images. UniShield innovatively integrates a perception agent with a detection agent. The perception agent intelligently analyzes image features to dynamically select suitable detection models, while the detection agent consolidates various expert detectors into a unified framework and generates interpretable reports. Extensive experiments show that UniShield achieves state-of-the-art results, surpassing both existing unified approaches and domain-specific detectors, highlighting its superior practicality, adaptiveness, and scalability.
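
The perception-then-detection routing can be pictured with a small dispatch sketch. Everything below is a stub of our own: the scoring, expert names, and report strings are hypothetical, and the real system uses learned models for both agents.

```python
# Toy sketch of the routing idea (entirely our illustration): a perception
# step scores which forgery domain an image likely belongs to, then
# dispatches to the matching expert detector.
from typing import Callable, Dict

def detect_deepfake(path: str) -> str: return f"deepfake report for {path}"
def detect_document(path: str) -> str: return f"document-manipulation report for {path}"
def detect_aigc(path: str) -> str: return f"AI-generated-image report for {path}"

EXPERTS: Dict[str, Callable[[str], str]] = {
    "deepfake": detect_deepfake,
    "document": detect_document,
    "aigc": detect_aigc,
}

def perception_agent(path: str) -> str:
    """Stub: the real system analyzes image features with an MLLM; we fake a score."""
    scores = {"deepfake": 0.2, "document": 0.1, "aigc": 0.7}
    return max(scores, key=scores.get)

def unified_detect(path: str) -> str:
    domain = perception_agent(path)  # choose the most suitable expert
    return EXPERTS[domain](path)     # detection agent runs it, returns a report

print(unified_detect("suspect.png"))
```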

[169] ROGR: Relightable 3D Objects using Generative Relighting

Jiapeng Tang, Matthew Lavine, Dor Verbin, Stephan J. Garbin, Matthias Nießner, Ricardo Martin Brualla, Pratul P. Srinivasan, Philipp Henzler

Main category: cs.CV

TL;DR: ROGR is a novel method that reconstructs relightable 3D objects using a generative relighting model and lighting-conditioned NeRF with dual-branch architecture for separate encoding of general lighting and specular effects.

Motivation: To create relightable 3D models that can be efficiently rendered under arbitrary environment lighting without requiring per-illumination optimization or complex light transport simulations.

Method: Uses a generative relighting model to sample object appearances under multiple lighting environments, then trains a lighting-conditioned NeRF with dual-branch architecture to separately encode general lighting effects and specularities.

Result: Outperforms state-of-the-art methods on TensoIR and Stanford-ORB datasets across most metrics, and demonstrates effectiveness on real-world object captures.

Conclusion: ROGR enables efficient feed-forward relighting under arbitrary environment maps and advances the state-of-the-art in relightable 3D reconstruction.

Abstract: We introduce ROGR, a novel approach that reconstructs a relightable 3D model of an object captured from multiple views, driven by a generative relighting model that simulates the effects of placing the object under novel environment illuminations. Our method samples the appearance of the object under multiple lighting environments, creating a dataset that is used to train a lighting-conditioned Neural Radiance Field (NeRF) that outputs the object’s appearance under any input environmental lighting. The lighting-conditioned NeRF uses a novel dual-branch architecture to encode the general lighting effects and specularities separately. The optimized lighting-conditioned NeRF enables efficient feed-forward relighting under arbitrary environment maps without requiring per-illumination optimization or light transport simulation. We evaluate our approach on the established TensoIR and Stanford-ORB datasets, where it improves upon the state-of-the-art on most metrics, and showcase our approach on real-world object captures.

[170] Dynamic Prompt Generation for Interactive 3D Medical Image Segmentation Training

Tidiane Camaret Ndir, Alexander Pfefferle, Robin Tibor Schirrmeister

Main category: cs.CV

TL;DR: A training strategy combining dynamic volumetric prompt generation and content-aware adaptive cropping for efficient interactive 3D biomedical image segmentation, achieving strong performance on benchmark metrics.

Motivation: Current foundation models lack volumetric awareness or have limited interactive capabilities for 3D biomedical image segmentation, requiring efficient models that can iteratively refine predictions based on user prompts.

Method: Combines dynamic volumetric prompt generation with content-aware adaptive cropping to optimize image encoder usage, simulates realistic user interaction patterns during training, addresses computational challenges of sequential refinement feedback on single GPU, and initializes network using publicly available nnInteractive segmentation model weights.

Result: Achieved average final Dice score of 0.6385, normalized surface distance of 0.6614, and area-under-the-curve metrics of 2.4799 (Dice) and 2.5671 (NSD) on the Foundation Models for Interactive 3D Biomedical Image Segmentation competition.

Conclusion: The proposed training strategy effectively addresses computational challenges while enabling efficient interactive 3D biomedical image segmentation with strong performance on benchmark metrics.

Abstract: Interactive 3D biomedical image segmentation requires efficient models that can iteratively refine predictions based on user prompts. Current foundation models either lack volumetric awareness or suffer from limited interactive capabilities. We propose a training strategy that combines dynamic volumetric prompt generation with content-aware adaptive cropping to optimize the use of the image encoder. Our method simulates realistic user interaction patterns during training while addressing the computational challenges of learning from sequential refinement feedback on a single GPU. For efficient training, we initialize our network using the publicly available weights from the nnInteractive segmentation model. Evaluation on the Foundation Models for Interactive 3D Biomedical Image Segmentation competition demonstrates strong performance with an average final Dice score of 0.6385, normalized surface distance of 0.6614, and area-under-the-curve metrics of 2.4799 (Dice) and 2.5671 (NSD).
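
Interactive-segmentation training loops commonly simulate a corrective click by sampling from the current error region; the sketch below shows that generic recipe in NumPy. It is our assumption of the general pattern, not the authors' prompt generator.

```python
# Minimal sketch (our assumption of the common recipe, not the authors'
# code): simulate a corrective user click by sampling a voxel from the
# current prediction's error region.
import numpy as np

def sample_click(pred: np.ndarray, gt: np.ndarray, rng):
    """Return (z, y, x, is_positive): a click inside the larger error region."""
    false_neg = np.logical_and(gt == 1, pred == 0)  # missed foreground
    false_pos = np.logical_and(gt == 0, pred == 1)  # hallucinated foreground
    use_positive = false_neg.sum() >= false_pos.sum()
    region = false_neg if use_positive else false_pos
    zs, ys, xs = np.nonzero(region)
    if len(zs) == 0:                                # perfect prediction: no click needed
        return None
    i = rng.integers(len(zs))
    return int(zs[i]), int(ys[i]), int(xs[i]), use_positive

rng = np.random.default_rng(0)
gt = np.zeros((8, 64, 64), dtype=np.uint8); gt[2:6, 20:40, 20:40] = 1
pred = np.zeros_like(gt); pred[2:6, 20:30, 20:40] = 1  # model missed half the object
print(sample_click(pred, gt, rng))  # a positive click in the missed half
```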

[171] Product-Quantised Image Representation for High-Quality Image Synthesis

Denis Zavadski, Nikita Philip Tatsch, Carsten Rother

Main category: cs.CV

TL;DR: PQGAN integrates product quantization into VQGAN framework, achieving significant improvements in image reconstruction quality and enabling faster/higher-resolution generation with diffusion models.

Motivation: Product quantization has been underutilized for latent representations in high-fidelity image generation despite its scalability, while VQGAN has shown success but has limitations.

Method: PQGAN combines product quantization with VQGAN framework, analyzing interactions between codebook size, embedding dimensionality, and subspace factorization to optimize performance.

Result: Achieved PSNR of 37dB (vs 27dB prior), reduced FID, LPIPS, and CMMD scores by up to 96%, and enables either faster generation or doubled output resolution with diffusion models.

Conclusion: Product quantization is a strong extension for discrete latent representations in image synthesis, offering superior performance and flexibility compared to existing methods.

Abstract: Product quantisation (PQ) is a classical method for scalable vector encoding, yet it has seen limited usage for latent representations in high-fidelity image generation. In this work, we introduce PQGAN, a quantised image autoencoder that integrates PQ into the well-known vector quantisation (VQ) framework of VQGAN. PQGAN achieves a noticeable improvement over state-of-the-art methods in terms of reconstruction performance, including both quantisation methods and their continuous counterparts. We achieve a PSNR score of 37dB, where prior work achieves 27dB, and are able to reduce the FID, LPIPS, and CMMD score by up to 96%. Our key to success is a thorough analysis of the interaction between codebook size, embedding dimensionality, and subspace factorisation, with vector and scalar quantisation as special cases. We obtain novel findings, for example that the performance of VQ and PQ behaves in opposite ways when scaling the embedding dimension. Furthermore, our analysis shows performance trends for PQ that help guide optimal hyperparameter selection. Finally, we demonstrate that PQGAN can be seamlessly integrated into pre-trained diffusion models. This enables either a significantly faster and more compute-efficient generation, or a doubling of the output resolution at no additional cost, positioning PQ as a strong extension for discrete latent representation in image synthesis.
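
For readers unfamiliar with the building block, here is a minimal sketch of classical product quantisation itself (not PQGAN): each vector is split into M subvectors, each quantised against its own small codebook, giving an effective codebook of K^M codes.

```python
# Minimal sketch of classical product quantisation, the building block
# PQGAN adapts; codebooks here are random rather than learned.
import torch

def product_quantise(z, codebooks):
    """z: (N, D); codebooks: list of M tensors (K, D/M). Returns quantised z and codes."""
    M = len(codebooks)
    chunks = z.chunk(M, dim=-1)                  # M subspaces of dimension D/M
    quantised, codes = [], []
    for sub, cb in zip(chunks, codebooks):
        d = torch.cdist(sub, cb)                 # (N, K) distances per subspace
        idx = d.argmin(dim=-1)                   # nearest code per subvector
        quantised.append(cb[idx])
        codes.append(idx)
    return torch.cat(quantised, dim=-1), torch.stack(codes, dim=-1)

D, M, K = 32, 4, 256                             # effective codebook size is K**M
codebooks = [torch.randn(K, D // M) for _ in range(M)]
z = torch.randn(10, D)
z_q, codes = product_quantise(z, codebooks)
print(z_q.shape, codes.shape)                    # (10, 32) (10, 4)
```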

[172] Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft

Junchao Huang, Xinting Hu, Boyao Han, Shaoshuai Shi, Zhuotao Tian, Tianyu He, Li Jiang

Main category: cs.CV

TL;DR: Memory Forcing is a learning framework for autoregressive video diffusion models that improves long-term spatial consistency in Minecraft gameplay simulation by combining temporal and spatial memory with specialized training protocols.

Motivation: To address the trade-off between long-term spatial consistency and new scene generation quality in autoregressive video diffusion models for Minecraft gameplay simulation, where temporal-only memory lacks consistency and spatial memory can degrade quality when over-relied upon.

Method: Uses Memory Forcing framework with: 1) Hybrid Training for distinct gameplay regimes, 2) Chained Forward Training with model rollouts for pose variations, 3) Point-to-Frame Retrieval for efficient history access, and 4) Incremental 3D Reconstruction for explicit 3D cache maintenance.

Result: Achieves superior long-term spatial consistency and generative quality across diverse environments while maintaining computational efficiency for extended sequences.

Conclusion: Memory Forcing effectively balances spatial consistency and generation quality in autoregressive video diffusion models for Minecraft gameplay simulation through its hybrid memory approach and specialized training protocols.

Abstract: Autoregressive video diffusion models have proved effective for world modeling and interactive scene generation, with Minecraft gameplay as a representative application. To faithfully simulate play, a model must generate natural content while exploring new scenes and preserve spatial consistency when revisiting explored areas. Under limited computation budgets, it must compress and exploit historical cues within a finite context window, which exposes a trade-off: Temporal-only memory lacks long-term spatial consistency, whereas adding spatial memory strengthens consistency but may degrade new scene generation quality when the model over-relies on insufficient spatial context. We present Memory Forcing, a learning framework that pairs training protocols with a geometry-indexed spatial memory. Hybrid Training exposes distinct gameplay regimes, guiding the model to rely on temporal memory during exploration and incorporate spatial memory for revisits. Chained Forward Training extends autoregressive training with model rollouts, where chained predictions create larger pose variations and encourage reliance on spatial memory for maintaining consistency. Point-to-Frame Retrieval efficiently retrieves history by mapping currently visible points to their source frames, while Incremental 3D Reconstruction maintains and updates an explicit 3D cache. Extensive experiments demonstrate that Memory Forcing achieves superior long-term spatial consistency and generative quality across diverse environments, while maintaining computational efficiency for extended sequences.

[173] MonSTeR: a Unified Model for Motion, Scene, Text Retrieval

Luca Collorone, Matteo Gioia, Massimiliano Pappa, Paolo Leoni, Giovanni Ficarra, Or Litany, Indro Spinelli, Fabio Galasso

Main category: cs.CV

TL;DR: MonSTeR is the first MOtioN-Scene-TExt Retrieval model that creates a unified latent space to evaluate alignment between skeletal movement, intention, and surrounding context, outperforming trimodal models and enabling zero-shot applications.

Motivation: Existing research lacks tools to evaluate the alignment between skeletal movement (motion), intention (text), and surrounding context (scene), despite the intuitive relationship between these modalities in human movement.

Method: MonSTeR constructs a unified latent space by leveraging unimodal and cross-modal representations, inspired by higher-order relations modeling to capture intricate dependencies between modalities.

Result: MonSTeR outperforms trimodal models that rely solely on unimodal representations, and user studies validate that retrieval scores align with human preferences. The model also demonstrates versatility in zero-shot in-Scene Object Placement and Motion Captioning.

Conclusion: MonSTeR successfully creates a unified framework for motion-scene-text retrieval, providing the first tool to evaluate alignment between these modalities and enabling flexible cross-modal applications.

Abstract: Intention drives human movement in complex environments, but such movement can only happen if the surrounding context supports it. Despite the intuitive nature of this mechanism, existing research has not yet provided tools to evaluate the alignment between skeletal movement (motion), intention (text), and the surrounding context (scene). In this work, we introduce MonSTeR, the first MOtioN-Scene-TExt Retrieval model. Inspired by the modeling of higher-order relations, MonSTeR constructs a unified latent space by leveraging unimodal and cross-modal representations. This allows MonSTeR to capture the intricate dependencies between modalities, enabling flexible but robust retrieval across various tasks. Our results show that MonSTeR outperforms trimodal models that rely solely on unimodal representations. Furthermore, we validate the alignment of our retrieval scores with human preferences through a dedicated user study. We demonstrate the versatility of MonSTeR’s latent space on zero-shot in-Scene Object Placement and Motion Captioning. Code and pre-trained models are available at github.com/colloroneluca/MonSTeR.
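
A toy sketch of retrieval in a shared latent space (our illustration only): unit-normalized embeddings from different modalities are scored by cosine similarity, and a motion database is ranked against joint text and scene queries.

```python
# Toy sketch (not MonSTeR's model): trimodal retrieval scoring in a shared
# latent space. Real encoders are replaced by random unit vectors, and the
# equal weighting of the two query modalities is illustrative.
import torch

def normalize(x):
    return x / x.norm(dim=-1, keepdim=True)

torch.manual_seed(0)
motion_db = normalize(torch.randn(100, 256))   # 100 candidate motion embeddings
text_query = normalize(torch.randn(1, 256))    # one text (intention) embedding
scene_query = normalize(torch.randn(1, 256))   # one scene (context) embedding

# Score candidates against both query modalities jointly (cosine similarity).
scores = motion_db @ text_query.T + motion_db @ scene_query.T  # (100, 1)
topk = scores.squeeze(1).topk(5).indices
print(topk)  # indices of the five best-aligned motions
```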

[174] Test-Time Defense Against Adversarial Attacks via Stochastic Resonance of Latent Ensembles

Dong Lao, Yuxiang Zhang, Haniyeh Ehsani Oskouie, Yangchao Wu, Alex Wong, Stefano Soatto

Main category: cs.CV

TL;DR: A training-free, architecture-agnostic test-time defense against adversarial attacks using stochastic resonance through small translational perturbations and feature alignment.

Motivation: To combat adversarial attacks without information loss from feature filtering/smoothing methods, leveraging the principle of 'combat noise with noise' through stochastic resonance.

Method: Introduces small translational perturbations to input images, aligns transformed feature embeddings, aggregates them, and maps back to original reference using a closed-form formula without additional modules or fine-tuning.

Result: State-of-the-art robustness: recovers up to 68.1% accuracy loss on image classification, 71.9% on stereo matching, and 29.2% on optical flow under various adversarial attacks.

Conclusion: The method provides a versatile, practical defense that works across diverse network architectures and tasks without training requirements, establishing the first generic test-time defense for dense prediction tasks.

Abstract: We propose a test-time defense mechanism against adversarial attacks: imperceptible image perturbations that significantly alter the predictions of a model. Unlike existing methods that rely on feature filtering or smoothing, which can lead to information loss, we propose to “combat noise with noise” by leveraging stochastic resonance to enhance robustness while minimizing information loss. Our approach introduces small translational perturbations to the input image, aligns the transformed feature embeddings, and aggregates them before mapping back to the original reference image. This can be expressed in a closed-form formula, which can be deployed on diverse existing network architectures without introducing additional network modules or fine-tuning for specific attack types. The resulting method is entirely training-free, architecture-agnostic, and attack-agnostic. Empirical results show state-of-the-art robustness on image classification and, for the first time, establish a generic test-time defense for dense prediction tasks, including stereo matching and optical flow, highlighting the method’s versatility and practicality. Specifically, relative to clean (unperturbed) performance, our method recovers up to 68.1% of the accuracy loss on image classification, 71.9% on stereo matching, and 29.2% on optical flow under various types of adversarial attacks.
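
A minimal sketch of the "combat noise with noise" idea follows, with one simplification we introduce for brevity: predictions are aggregated at the logit level over translated copies, whereas the paper aligns and aggregates intermediate feature embeddings via a closed-form mapping.

```python
# Minimal sketch (our simplified reading, not the paper's method): classify
# by aggregating predictions over small translational perturbations of the
# input. Training-free and architecture-agnostic by construction.
import torch
import torch.nn as nn

def stochastic_resonance_predict(model: nn.Module, x: torch.Tensor,
                                 shifts=((0, 0), (1, 0), (0, 1), (-1, 0), (0, -1))):
    """Aggregate model outputs over small circular translations of the input."""
    outputs = []
    for dy, dx in shifts:
        shifted = torch.roll(x, shifts=(dy, dx), dims=(-2, -1))  # translate input
        outputs.append(model(shifted))
    return torch.stack(outputs).mean(dim=0)  # aggregate back to one prediction

# Toy usage with a stand-in classifier.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
logits = stochastic_resonance_predict(model, torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```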

[175] MIXER: Mixed Hyperspherical Random Embedding Neural Network for Texture Recognition

Ricardo T. Fares, Lucas C. Ribas

Main category: cs.CV

TL;DR: Mixer is a novel randomized neural network for texture representation learning that uses hyperspherical random embeddings and a dual-branch module to capture intra- and inter-channel relationships, achieving strong results on texture benchmarks.

Motivation: Existing randomized neural network approaches for texture recognition have focused mainly on improving cross-information prediction without significant architectural advancements.

Method: Leverages hyperspherical random embeddings with a dual-branch learning module to capture intra- and inter-channel relationships, enhanced by a newly formulated optimization problem for building rich texture representations.

Result: Experiments show promising performance across several pure texture benchmarks, each with distinct characteristics and challenges.

Conclusion: The proposed Mixer approach effectively combines the advantages of traditional techniques and learning-based approaches for texture representation learning.

Abstract: Randomized neural networks for representation learning have consistently achieved prominent results in texture recognition tasks, effectively combining the advantages of both traditional techniques and learning-based approaches. However, existing approaches have so far focused mainly on improving cross-information prediction, without introducing significant advancements to the overall randomized network architecture. In this paper, we propose Mixer, a novel randomized neural network for texture representation learning. At its core, the method leverages hyperspherical random embeddings coupled with a dual-branch learning module to capture both intra- and inter-channel relationships, further enhanced by a newly formulated optimization problem for building rich texture representations. Experiments show promising results for the proposed approach across several pure texture benchmarks, each with distinct characteristics and challenges. The source code will be available upon publication.
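
The core ingredient can be sketched as a frozen random projection followed by L2 normalization onto the unit hypersphere. This is our reading of "hyperspherical random embedding", not Mixer's full architecture; the dimensions and the patch example are placeholders.

```python
# Minimal sketch (our assumption of the core ingredient, not Mixer itself):
# a fixed random projection whose outputs are normalized onto the unit
# hypersphere.
import torch

class HypersphericalRandomEmbedding(torch.nn.Module):
    """Project features with frozen random weights, then L2-normalize."""
    def __init__(self, in_dim: int, out_dim: int, seed: int = 0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        W = torch.randn(in_dim, out_dim, generator=g) / in_dim ** 0.5
        self.register_buffer("W", W)  # random weights, never trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = x @ self.W
        return z / z.norm(dim=-1, keepdim=True).clamp_min(1e-8)

embed = HypersphericalRandomEmbedding(in_dim=27, out_dim=64)
patches = torch.rand(100, 27)             # e.g. flattened 3x3 RGB texture patches
z = embed(patches)
print(z.shape, z.norm(dim=-1)[:3])        # all norms ~1.0
```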

[176] Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

Suyuchen Wang, Tianyu Zhang, Ahmed Masry, Christopher Pal, Spandana Gella, Bang Liu, Perouz Taslakian

Main category: cs.CV

TL;DR: The paper introduces RULER tokens and Interleaved MRoPE to improve GUI grounding accuracy by providing explicit spatial guidance instead of relying on implicit coordinate generation, especially for high-resolution displays.

Motivation: Current VLMs struggle with GUI grounding due to unreliable patch-to-pixel mapping when extrapolating to unseen high-resolution displays, causing accuracy degradation and failures on new resolutions.

Method: Two innovations: 1) RULER tokens as explicit coordinate markers for position referencing, 2) Interleaved MRoPE (I-MRoPE) for improved spatial encoding by ensuring equal representation of width and height dimensions.

Result: Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy, with largest improvements on high-resolution interfaces.

Conclusion: By providing explicit spatial guidance rather than relying on implicit learning, the approach enables more reliable GUI automation across diverse resolutions and platforms.

Abstract: GUI grounding, the task of mapping natural-language instructions to pixel coordinates, is crucial for autonomous agents, yet remains difficult for current VLMs. The core bottleneck is reliable patch-to-pixel mapping, which breaks when extrapolating to high-resolution displays unseen during training. Current approaches generate coordinates as text tokens directly from visual features, forcing the model to infer complex position-to-pixel mappings implicitly; as a result, accuracy degrades and failures proliferate on new resolutions. We address this with two complementary innovations. First, RULER tokens serve as explicit coordinate markers, letting the model reference positions similar to gridlines on a map and adjust rather than generate coordinates from scratch. Second, Interleaved MRoPE (I-MRoPE) improves spatial encoding by ensuring that width and height dimensions are represented equally, addressing the asymmetry of standard positional schemes. Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy, with the largest improvements on high-resolution interfaces. By providing explicit spatial guidance rather than relying on implicit learning, our approach enables more reliable GUI automation across diverse resolutions and platforms.
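
The interleaving idea can be sketched as a 2D rotary embedding in which the rotary frequency pairs alternate between the width and height axes, so both dimensions receive the full range of wavelengths. This is our reading of the general mechanism, not the paper's exact I-MRoPE.

```python
# Minimal sketch (our reading, not the paper's code): 2D RoPE where even
# channel pairs encode x and odd pairs encode y, sharing one frequency
# ladder instead of giving each axis a contiguous frequency block.
import torch

def interleaved_2d_rope_angles(xs, ys, dim, base=10000.0):
    """Per-position rotation angles: even pairs encode x, odd pairs encode y."""
    num_pairs = dim // 2
    freqs = base ** (-torch.arange(num_pairs) / num_pairs)  # shared frequency ladder
    axis_pos = torch.where(torch.arange(num_pairs) % 2 == 0,
                           xs[:, None], ys[:, None])        # alternate x / y per pair
    return axis_pos * freqs                                 # (N, dim // 2) angles

def apply_rope(q, angles):
    """Rotate consecutive channel pairs of q by the given angles."""
    q1, q2 = q[..., 0::2], q[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack([q1 * cos - q2 * sin, q1 * sin + q2 * cos], dim=-1).flatten(-2)

xs = torch.tensor([0., 5., 5.]); ys = torch.tensor([0., 0., 7.])  # three patch positions
q = torch.randn(3, 64)
q_rot = apply_rope(q, interleaved_2d_rope_angles(xs, ys, dim=64))
print(q_rot.shape)  # torch.Size([3, 64])
```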

[177] LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models

Ci-Siang Lin, Min-Hung Chen, Yu-Yang Sheng, Yu-Chiang Frank Wang

Main category: cs.CV

TL;DR: LEAML is a label-efficient adaptation framework for MLLMs that uses pseudo QA generation and selective neuron updates to improve performance on specialized domains with limited labeled data.

Motivation: MLLMs struggle with out-of-distribution tasks in specialized domains like medical imaging where labeled data is scarce and expensive to obtain.

Method: Generates domain-relevant pseudo question-answer pairs for unlabeled data using a QA generator regularized by caption distillation, and selectively updates only neurons most relevant to question-answering.

Result: Experiments on gastrointestinal endoscopy and sports VQA show LEAML consistently outperforms standard fine-tuning under minimal supervision.

Conclusion: LEAML framework is effective for adapting MLLMs to specialized domains with limited labeled data through pseudo QA generation and selective parameter updates.

Abstract: Multimodal Large Language Models (MLLMs) have achieved strong performance on general visual benchmarks but struggle with out-of-distribution (OOD) tasks in specialized domains such as medical imaging, where labeled data is limited and expensive. We introduce LEAML, a label-efficient adaptation framework that leverages both scarce labeled VQA samples and abundant unlabeled images. Our approach generates domain-relevant pseudo question-answer pairs for unlabeled data using a QA generator regularized by caption distillation. Importantly, we selectively update only those neurons most relevant to question-answering, enabling the QA Generator to efficiently acquire domain-specific knowledge during distillation. Experiments on gastrointestinal endoscopy and sports VQA demonstrate that LEAML consistently outperforms standard fine-tuning under minimal supervision, highlighting the effectiveness of our proposed LEAML framework.
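
Selective neuron updating can be implemented generically by masking gradients so that only chosen rows of a weight matrix are updated. The sketch below shows that mechanism on a single linear layer; the selection criterion (which neurons count as QA-relevant) is the paper's contribution and is stubbed here.

```python
# Minimal sketch (generic mechanism, not LEAML's selection rule): zero out
# gradients everywhere except a chosen subset of output neurons before the
# optimizer steps.
import torch
import torch.nn as nn

layer = nn.Linear(16, 8)
selected = torch.tensor([1, 4, 6])  # neurons deemed QA-relevant (stub choice)

mask = torch.zeros_like(layer.weight)
mask[selected] = 1.0                              # only these rows may update
layer.weight.register_hook(lambda g: g * mask)    # applied on every backward pass
layer.bias.register_hook(lambda g: g * mask[:, 0])

opt = torch.optim.SGD(layer.parameters(), lr=0.1)
before = layer.weight.detach().clone()
loss = layer(torch.randn(4, 16)).pow(2).mean()
loss.backward()
opt.step()
changed = (layer.weight.detach() - before).abs().sum(dim=1) > 0
print(changed)  # True only at rows 1, 4, 6
```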

[178] Filter-Guided Diffusion for Controllable Image Generation

Zeqi Gu, Ethan Yang, Abe Davis

Main category: cs.CV

TL;DR: FGD is an efficient diffusion-based method for image translation and editing that uses fast filtering operations instead of feature manipulation, enabling better control and variety while being faster than state-of-the-art approaches.

Motivation: Current diffusion-based image translation methods are computationally expensive, memory-intensive, and limited by deterministic sampling that reduces output variety.

Method: Uses filter-guided diffusion with fast filtering operations during the diffusion process to control guidance strength and frequencies, compatible with non-deterministic samplers.

Result: FGD achieves superior structural and semantic metrics, runs faster than SOTA methods (multiple seeds in less time than single run of others), and enables localized editing with masks.

Conclusion: FGD provides an efficient alternative to feature-based approaches, offering finer control, greater variety, and faster performance for image translation and editing tasks.

Abstract: Recent advances in diffusion-based generative models have shown incredible promise for zero-shot image-to-image translation and editing. Most of these approaches work by combining or replacing network-specific features used in the generation of new images with those taken from the inversion of some guide image. Methods of this type are considered the current state-of-the-art in training-free approaches, but have some notable limitations: they tend to be costly in runtime and memory, and often depend on deterministic sampling that limits variation in generated results. We propose Filter-Guided Diffusion (FGD), an alternative approach that leverages fast filtering operations during the diffusion process to support finer control over the strength and frequencies of guidance and can work with non-deterministic samplers to produce greater variety. With its efficiency, FGD can be sampled over multiple seeds and hyperparameters in less time than a single run of other SOTA methods to produce superior results based on structural and semantic metrics. We conduct extensive quantitative and qualitative experiments to evaluate the performance of FGD in translation tasks and also demonstrate its potential in localized editing when used with masks. Project page: https://filterguideddiffusion.github.io/
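
The frequency-selective guidance idea can be sketched as follows: at each step, keep the sample's own high frequencies and pull its low frequencies toward the guide's. This is our reading of the general idea, not FGD's actual operator; the box blur and the `strength` parameter are placeholders.

```python
# Minimal sketch (our illustration, not FGD's code): frequency-selective
# guidance using a cheap box blur as the fast filter.
import torch
import torch.nn.functional as F

def lowpass(x, k=9):
    """Cheap depthwise box-blur low-pass filter."""
    pad = k // 2
    w = torch.ones(x.shape[1], 1, k, k) / (k * k)
    return F.conv2d(F.pad(x, (pad,) * 4, mode="reflect"), w, groups=x.shape[1])

def filter_guide(sample, guide, strength=0.5, k=9):
    low_s, low_g = lowpass(sample, k), lowpass(guide, k)
    high_s = sample - low_s                      # sample keeps its own detail
    return high_s + (1 - strength) * low_s + strength * low_g

# Toy usage: one guidance blend on random stand-in latents.
sample, guide = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
guided = filter_guide(sample, guide, strength=0.7)
print(guided.shape)
```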

[179] Comparing YOLOv8 and Mask R-CNN for instance segmentation in complex orchard environments

Ranjan Sapkota, Dawood Ahmed, Manoj Karkee

Main category: cs.CV

TL;DR: YOLOv8 outperforms Mask R-CNN in instance segmentation for agricultural automation, achieving higher precision, recall, and faster inference times across two orchard datasets.

Motivation: Instance segmentation is crucial for agricultural automation tasks like selective harvesting and precision pruning, requiring accurate delineation of individual objects in orchard environments.

Method: Compared one-stage YOLOv8 model with two-stage Mask R-CNN model for instance segmentation using two datasets: Dataset 1 (dormant season - branches/trunks) and Dataset 2 (early growing season - fruitlets).

Result: YOLOv8 achieved higher precision (0.90 vs 0.81 for Dataset 1; 0.93 vs 0.85 for Dataset 2) and recall (0.95 vs 0.81; 0.97 vs 0.88) with faster inference times (10.9ms vs 15.6ms; 7.8ms vs 12.8ms) compared to Mask R-CNN.

Conclusion: YOLOv8 demonstrates superior accuracy and efficiency for real-time orchard automation tasks, making it more suitable for applications like robotic harvesting and fruit thinning.

Abstract: Instance segmentation is an important image processing operation for agricultural automation, providing precise delineation of individual objects within images and enabling tasks such as selective harvesting and precision pruning. This study compares the one-stage YOLOv8 model with the two-stage Mask R-CNN model for instance segmentation under varying orchard conditions across two datasets. Dataset 1, collected in the dormant season, contains images of apple trees without foliage and was used to train multi-object segmentation models delineating branches and trunks. Dataset 2, collected in the early growing season, includes canopy images with green foliage and immature apples and was used to train single-object segmentation models delineating fruitlets. Results showed YOLOv8 outperformed Mask R-CNN with higher precision and near-perfect recall at a confidence threshold of 0.5. For Dataset 1, YOLOv8 achieved precision 0.90 and recall 0.95 compared to 0.81 and 0.81 for Mask R-CNN. For Dataset 2, YOLOv8 reached precision 0.93 and recall 0.97 compared to 0.85 and 0.88. Inference times were also lower for YOLOv8, at 10.9 ms and 7.8 ms, versus 15.6 ms and 12.8 ms for Mask R-CNN. These findings demonstrate superior accuracy and efficiency of YOLOv8 for real-time orchard automation tasks such as robotic harvesting and fruit thinning.
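
For reference, the one-stage side of such a comparison is a few lines with the Ultralytics API; the image path is a placeholder, and the orchard datasets and trained weights from the study are not assumed here.

```python
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")                    # one-stage instance segmentation
results = model.predict("orchard.jpg", conf=0.5)  # confidence threshold from the study
for r in results:
    print(r.speed)   # per-stage latency in ms (preprocess/inference/postprocess)
    print(r.masks)   # per-instance masks (branches, trunks, or fruitlets)
```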

[180] RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives

Jaehong Yoon, Shoubin Yu, Mohit Bansal

Main category: cs.CV

TL;DR: RACCooN is a video-to-paragraph-to-video framework that enables flexible video editing through automated video description generation and user-guided refinement, supporting removal, addition, and modification operations without requiring labor-intensive text prompts.

Motivation: Current video generative models require carefully written text prompts for specific tasks and labor-intensive textual descriptions for input videos, which hinders flexibility in adapting personal/raw videos to user specifications.

Method: Two-stage pipeline: Video-to-Paragraph (V2P) automatically describes video scenes using multi-granular spatiotemporal pooling to capture holistic context and object details; Paragraph-to-Video (P2V) allows users to refine descriptions to guide video diffusion models for editing operations.

Result: The framework demonstrates impressive versatile capabilities in video-to-paragraph generation and video content editing, and can be incorporated into other state-of-the-art video generative models for enhancement.

Conclusion: RACCooN provides a versatile and user-friendly approach for multiple video editing capabilities through a unified pipeline, eliminating the need for complex human annotations and simplifying precise video content editing based on text.

Abstract: Recent video generative models primarily rely on carefully written text prompts for specific tasks, like inpainting or style editing. They require labor-intensive textual descriptions for input videos, hindering their flexibility to adapt personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video scenes in well-structured natural language, capturing both the holistic context and focused object details. Subsequently, in the P2V stage, users can optionally refine these descriptions to guide the video diffusion model, enabling various modifications to the input video, such as removing, changing subjects, and/or adding new objects. The proposed approach stands out from other methods through several significant contributions: (1) RACCooN suggests a multi-granular spatiotemporal pooling strategy to generate well-structured video descriptions, capturing both the broad context and object details without requiring complex human annotations, simplifying precise video content editing based on text for users. (2) Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. (3) RACCooN also plans to imagine new objects in a given video, so users simply prompt the model to receive a detailed video editing plan for complex video editing. The proposed framework demonstrates impressive versatile capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.

[181] A Survey of Defenses against AI-generated Visual Media: Detection, Disruption, and Authentication

Jingyi Deng, Chenhao Lin, Zhengyu Zhao, Shuai Liu, Qian Wang, Chao Shen

Main category: cs.CV

TL;DR: A systematic review of defense methods against AI-generated visual media, covering detection, disruption, and authentication approaches within a unified passive and proactive framework.

Motivation: Deep generative models can be used for malicious purposes like misinformation, deception, and copyright violation, creating a need for effective defense mechanisms.

Method: The paper provides a comprehensive review of existing defense methods, categorizing them within a unified framework covering detection, disruption, and authentication. It formulates general pipelines and proposes taxonomies based on methodological strategies applicable to primary subtasks.

Result: The review summarizes mainstream defense-related tasks, derivative tasks concerning defense trustworthiness (robustness and fairness), evaluation datasets, criteria, and metrics. It provides a systematic analysis of current research landscape.

Conclusion: The analysis reveals current research challenges and suggests possible directions for future research in defending against AI-generated visual media.

Abstract: Deep generative models have demonstrated impressive performance in various computer vision applications, including image synthesis, video generation, and medical analysis. Despite their significant advancements, these models may be used for malicious purposes, such as misinformation, deception, and copyright violation. In this paper, we provide a systematic and timely review of research efforts on defenses against AI-generated visual media, covering detection, disruption, and authentication. We review existing methods and summarize the mainstream defense-related tasks within a unified passive and proactive framework. Moreover, we survey the derivative tasks concerning the trustworthiness of defenses, such as their robustness and fairness. For each task, we formulate its general pipeline and propose a taxonomy based on methodological strategies that are uniformly applicable to the primary subtasks. Additionally, we summarize the commonly used evaluation datasets, criteria, and metrics. Finally, by analyzing the reviewed studies, we provide insights into current research challenges and suggest possible directions for future research.

[182] Toward a Holistic Evaluation of Robustness in CLIP Models

Weijie Tu, Weijian Deng, Tom Gedeon

Main category: cs.CV

TL;DR: This paper provides a comprehensive evaluation of CLIP models beyond standard classification accuracy, examining robustness to visual factors, safety objectives, modality bridging, 3D awareness, and vision-language interactions in multimodal models.

Motivation: To move beyond existing evaluations of CLIP's overall classification robustness and provide a more comprehensive assessment across multiple new dimensions including safety, modality understanding, and 3D awareness.

Method: Systematic evaluation of CLIP models across five perspectives: robustness to visual variations, safety objectives (confidence uncertainty and OOD detection), modality bridging finesse, 3D awareness, and vision-language interactions in LMMs. Analyzed impact of six factors: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts.

Result: Uncovered several new insights: visual encoder architecture significantly affects 3D corruption robustness; CLIP models exhibit shape bias that diminishes after ImageNet fine-tuning; LMMs like LLaVA show improved classification performance for challenging categories compared to CLIP alone.

Conclusion: The findings provide valuable guidance for enhancing CLIP model robustness and reliability, revealing previously unknown characteristics and performance patterns across different evaluation dimensions.

Abstract: Contrastive Language-Image Pre-training (CLIP) models have shown significant potential, particularly in zero-shot classification across diverse distribution shifts. Building on existing evaluations of overall classification robustness, this work aims to provide a more comprehensive assessment of CLIP by introducing several new perspectives. First, we investigate their robustness to variations in specific visual factors. Second, we assess two critical safety objectives–confidence uncertainty and out-of-distribution detection–beyond mere classification accuracy. Third, we evaluate the finesse with which CLIP models bridge the image and text modalities. Fourth, we extend our examination to 3D awareness in CLIP models, moving beyond traditional 2D image understanding. Finally, we explore the interaction between vision and language encoders within modern large multimodal models (LMMs) that utilize CLIP as the visual backbone, focusing on how this interaction impacts classification robustness. In each aspect, we consider the impact of six factors on CLIP models: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts. Our study uncovers several previously unknown insights into CLIP. For instance, the architecture of the visual encoder in CLIP plays a significant role in their robustness against 3D corruption. CLIP models tend to exhibit a bias towards shape when making predictions. Moreover, this bias tends to diminish after fine-tuning on ImageNet. Vision-language models like LLaVA, leveraging the CLIP vision encoder, could exhibit benefits in classification performance for challenging categories over CLIP alone. Our findings are poised to offer valuable guidance for enhancing the robustness and reliability of CLIP models.
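
The zero-shot protocol that these evaluations build on reduces to prompt-based nearest-text classification. Below is a minimal sketch with the OpenAI clip package, where the class list, image path, and prompt template are placeholders; varying the template corresponds to the test-time prompts factor studied here.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["cat", "dog"]  # placeholder label set
image = preprocess(Image.open("sample.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)  # zero-shot class probabilities
```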

[183] Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection

Jiawen Zhu, Yew-Soon Ong, Chunhua Shen, Guansong Pang

Main category: cs.CV

TL;DR: FAPrompt introduces fine-grained abnormality prompts for zero-shot anomaly detection, addressing limitations of existing methods that only capture coarse-grained semantics. It uses compound abnormality prompt learning and data-dependent abnormality prior learning to model diverse abnormal patterns and enhance cross-dataset generalization.

Motivation: Current zero-shot anomaly detection methods focus on coarse-grained semantics like "damaged" or "defective" objects, which limits their ability to recognize diverse abnormality details that deviate from these general patterns in various ways.

Method: FAPrompt uses two novel modules: 1) Compound Abnormality Prompt learning (CAP) to learn complementary, decomposed abnormality prompts that model diverse abnormal patterns from the same normality semantic, and 2) Data-dependent Abnormality Prior learning (DAP) to learn sample-wise abnormality prior from abnormal features of each test image to dynamically adapt prompts to individual images.

Result: Comprehensive experiments on 19 real-world datasets covering industrial defects and medical anomalies show that FAPrompt substantially outperforms state-of-the-art methods in both image- and pixel-level zero-shot anomaly detection tasks.

Conclusion: FAPrompt effectively addresses the limitation of coarse-grained abnormality modeling in zero-shot anomaly detection by learning fine-grained abnormality prompts and dynamically adapting them to individual test images, achieving superior performance across diverse datasets.

Abstract: Current zero-shot anomaly detection (ZSAD) methods show remarkable success in prompting large pre-trained vision-language models to detect anomalies in a target dataset without using any dataset-specific training or demonstration. However, these methods often focus on crafting/learning prompts that capture only coarse-grained semantics of abnormality, e.g., high-level semantics like “damaged”, “imperfect”, or “defective” objects. They therefore have limited capability in recognizing diverse abnormality details that deviate from these general abnormal patterns in various ways. To address this limitation, we propose FAPrompt, a novel framework designed to learn Fine-grained Abnormality Prompts for accurate ZSAD. To this end, a novel Compound Abnormality Prompt learning (CAP) module is introduced in FAPrompt to learn a set of complementary, decomposed abnormality prompts, where abnormality prompts are enforced to model diverse abnormal patterns derived from the same normality semantic. On the other hand, the fine-grained abnormality patterns can be different from one dataset to another. To enhance the cross-dataset generalization, another novel module, namely Data-dependent Abnormality Prior learning (DAP), is introduced in FAPrompt to learn a sample-wise abnormality prior from abnormal features of each test image to dynamically adapt the abnormality prompts to individual test images. Comprehensive experiments on 19 real-world datasets, covering both industrial defects and medical anomalies, demonstrate that FAPrompt substantially outperforms state-of-the-art methods in both image- and pixel-level ZSAD tasks. Code is available at https://github.com/mala-lab/FAPrompt.

[184] Ranked from Within: Ranking Large Multimodal Models Without Labels

Weijie Tu, Weijian Deng, Dylan Campbell, Yu Yao, Jiyang Zheng, Tom Gedeon, Tongliang Liu

Main category: cs.CV

TL;DR: The paper explores whether uncertainty-based metrics can predict relative performance of large multimodal models without access to ground-truth labels, enabling unsupervised model ranking.

Motivation: As large multimodal models proliferate, there's a need for efficient ways to choose between them for new tasks without the labor-intensive process of manual annotation and ground-truth determination.

Method: Evaluated 47 state-of-the-art LMMs across 9 visual question answering benchmarks, analyzing how well uncertainty-based metrics (particularly from softmax distributions) can predict relative model performance.

Result: Uncertainty scores derived from softmax distributions provide a robust and consistent basis for ranking models across various tasks, enabling model selection without manual annotation.

Conclusion: Uncertainty-based metrics offer a practical approach for ranking LMMs on unlabeled data, facilitating model selection for diverse target domains without requiring ground-truth labels.

Abstract: Can the relative performance of a pre-trained large multimodal model (LMM) be predicted without access to labels? As LMMs proliferate, it becomes increasingly important to develop efficient ways to choose between them when faced with new data or tasks. The usual approach does the equivalent of giving the models an exam and marking them. We opt to avoid marking and the associated labor of determining the ground-truth answers. Instead, we explore other signals elicited and ascertain how well the models know their own limits, evaluating the effectiveness of these signals at unsupervised model ranking. We evaluate $47$ state-of-the-art LMMs (e.g., LLaVA) across $9$ visual question answering benchmarks, analyzing how well uncertainty-based metrics can predict relative model performance. Our findings show that uncertainty scores derived from softmax distributions provide a robust and consistent basis for ranking models across various tasks. This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
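
In spirit, the label-free ranking amounts to comparing an uncertainty summary across models on a shared unlabeled set. A sketch using mean maximum-softmax confidence as the score; the paper examines several uncertainty-based metrics, so this particular choice is illustrative.

```python
import numpy as np

def rank_models(model_probs):
    # model_probs: dict of model name -> (N, C) predictive probabilities
    # on the same unlabeled set. Higher mean confidence ranks higher.
    scores = {m: p.max(axis=1).mean() for m, p in model_probs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```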

[185] SoccerSynth-Detection: A Synthetic Dataset for Soccer Player Detection

Haobin Qin, Calvin Yeung, Rikuhei Umemoto, Keisuke Fujii

Main category: cs.CV

TL;DR: Created SoccerSynth-Detection, the first synthetic dataset for soccer player detection, which matches real dataset performance and excels in motion blur scenarios.

Motivation: Limited real soccer datasets due to copyright restrictions, lack of diversity in existing datasets (SoccerNet-Tracking, SportsMOT), and difficulty adapting algorithms to varied soccer video contexts with occlusions and multiple players.

Method: Developed a synthetic dataset with random lighting, textures, and simulated camera motion blur, validated using the YOLOv8n object detection model against real datasets through transfer and pre-training tests.

Result: Synthetic dataset matched real dataset performance in transfer tests, significantly outperformed real datasets in motion blur scenarios, and enhanced overall algorithm performance when used for pre-training.

Conclusion: Synthetic datasets have potential to replace real datasets for algorithm training in soccer video analysis, addressing data scarcity and diversity issues.

Abstract: In soccer video analysis, player detection is essential for identifying key events and reconstructing tactical positions. The presence of numerous players and frequent occlusions, combined with copyright restrictions, severely restricts the availability of datasets, leaving limited options such as SoccerNet-Tracking and SportsMOT. These datasets suffer from a lack of diversity, which hinders algorithms from adapting effectively to varied soccer video contexts. To address these challenges, we developed SoccerSynth-Detection, the first synthetic dataset designed for the detection of synthetic soccer players. It includes a broad range of random lighting and textures, as well as simulated camera motion blur. We validated its efficacy using the object detection model (YOLOv8n) against real-world datasets (SoccerNet-Tracking and SportsMOT). In transfer tests, it matched the performance of real datasets and significantly outperformed them in images with motion blur; in pre-training tests, it demonstrated its efficacy as a pre-training dataset, significantly enhancing the algorithm’s overall performance. Our work demonstrates the potential of synthetic datasets to replace real datasets for algorithm training in the field of soccer video analysis.

[186] CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification

Cristiano Patrício, Isabel Rio-Torto, Jaime S. Cardoso, Luís F. Teixeira, João C. Neves

Main category: cs.CV

TL;DR: CBVLM uses Large Vision-Language Models for medical diagnosis by predicting human-interpretable concepts and then classifying based on those concepts, achieving better performance than traditional methods without training and with minimal annotations.

Motivation: Address the challenges of limited annotated data and lack of interpretability in medical deep learning systems, while avoiding the high annotation burden and retraining requirements of Concept Bottleneck Models.

Method: Uses LVLMs in two stages: first prompts the model to detect predefined concepts in images, then classifies based on concept predictions. Incorporates retrieval module for in-context learning with few examples.

Result: Outperforms CBMs and task-specific supervised methods across four medical datasets and twelve LVLMs, without requiring training and using only a few annotated examples.

Conclusion: CBVLM provides an effective solution for interpretable medical diagnosis that reduces annotation costs and eliminates retraining needs while maintaining high performance.

Abstract: The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by constraining the model output on a set of predefined and human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system needs to be retrained. Inspired by the remarkable performance shown by Large Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet effective, methodology, CBVLM, which tackles both of the aforementioned challenges. First, for each concept, we prompt the LVLM to answer if the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages, we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis on the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples. More information on our project page: https://cristianopatricio.github.io/CBVLM/.
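
The two-stage prompting loop is model-agnostic and can be sketched without committing to a specific LVLM; below, lvlm is assumed to be a callable that answers a text prompt about an image, and retrieve stands in for the in-context example selector.

```python
def cbvlm_diagnose(lvlm, retrieve, image, concepts, labels):
    # Stage 1: query each predefined concept, with retrieved demonstrations.
    findings = {}
    for c in concepts:
        demos = retrieve(image, concept=c)  # few-shot examples for this concept
        findings[c] = lvlm(image, f"{demos}\nIs '{c}' present? Answer yes or no.")
    # Stage 2: ground the final diagnosis on the concept predictions.
    summary = "; ".join(f"{c}: {a}" for c, a in findings.items())
    return lvlm(image, f"Findings: {summary}. Classify the image as one of {labels}.")
```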

[187] Gate-Shift-Pose: Enhancing Action Recognition in Sports with Skeleton Information

Edoardo Bianchi, Oswald Lanz

Main category: cs.CV

TL;DR: Gate-Shift-Pose enhances Gate-Shift-Fuse networks for athlete fall classification in figure skating by integrating skeleton pose data with RGB frames, achieving up to 40% accuracy improvement over RGB-only baselines.

Motivation: To improve athlete fall classification in figure skating by leveraging both RGB frames and skeleton pose data to better capture complex motion patterns that are challenging to recognize from RGB alone.

Method: Two fusion strategies: early-fusion combining RGB frames with Gaussian heatmaps of pose keypoints at input, and late-fusion using multi-stream architecture with attention mechanisms to combine RGB and pose features. Evaluated with ResNet18 and ResNet50 backbones.

Result: Gate-Shift-Pose significantly outperforms RGB-only baseline (40% improvement with ResNet18, 20% with ResNet50). Early-fusion achieves 98.08% accuracy with ResNet50, while late-fusion works better with ResNet18.

Conclusion: Multimodal architectures integrating skeleton pose data are highly effective for sports action recognition, with pose information playing a critical role in capturing complex motion patterns.

Abstract: This paper introduces Gate-Shift-Pose, an enhanced version of Gate-Shift-Fuse networks, designed for athlete fall classification in figure skating by integrating skeleton pose data alongside RGB frames. We evaluate two fusion strategies: early-fusion, which combines RGB frames with Gaussian heatmaps of pose keypoints at the input stage, and late-fusion, which employs a multi-stream architecture with attention mechanisms to combine RGB and pose features. Experiments on the FR-FS dataset demonstrate that Gate-Shift-Pose significantly outperforms the RGB-only baseline, improving accuracy by up to 40% with ResNet18 and 20% with ResNet50. Early-fusion achieves the highest accuracy (98.08%) with ResNet50, leveraging the model’s capacity for effective multimodal integration, while late-fusion is better suited for lighter backbones like ResNet18. These results highlight the potential of multimodal architectures for sports action recognition and the critical role of skeleton pose information in capturing complex motion patterns. Visit the project page at https://edowhite.github.io/Gate-Shift-Pose
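
Early fusion here amounts to stacking per-keypoint Gaussian heatmaps onto the RGB channels before the backbone. A sketch under the assumption that keypoints arrive in pixel coordinates; the sigma and channel layout are illustrative.

```python
import torch

def gaussian_heatmaps(keypoints, h, w, sigma=4.0):
    # Render one Gaussian heatmap per keypoint -> (K, H, W).
    ys = torch.arange(h).view(h, 1).float()
    xs = torch.arange(w).view(1, w).float()
    maps = [torch.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2 * sigma ** 2))
            for kx, ky in keypoints]
    return torch.stack(maps)

def early_fusion(rgb, keypoints):
    # Concatenate an RGB frame (3, H, W) with pose heatmaps along channels.
    _, h, w = rgb.shape
    return torch.cat([rgb, gaussian_heatmaps(keypoints, h, w)], dim=0)
```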

[188] EFC++: Elastic Feature Consolidation with Prototype Re-balancing for Cold Start Exemplar-free Incremental Learning

Simone Magistri, Tomaso Trinci, Albin Soutif-Cormerais, Joost van de Weijer, Andrew D. Bagdanov

Main category: cs.CV

TL;DR: EFC++ addresses the Cold Start problem in Exemplar-free Class Incremental Learning by regularizing feature drift in important directions using an Empirical Feature Matrix and updating prototypes to reduce task-recency bias.

Motivation: The paper addresses the challenging Cold Start scenario in EFCIL where insufficient data in the first task prevents learning a high-quality backbone, leading to feature drift that's difficult to compensate for without exemplars.

Method: Proposes Elastic Feature Consolidation++ (EFC++) which uses an Empirical Feature Matrix to approximate feature drift, regularizes drift in important directions, employs Gaussian prototypes, and includes post-training prototype re-balancing to update classifiers.

Result: Experimental results on CIFAR-100, Tiny-ImageNet, ImageNet-Subset, ImageNet-1K and DomainNet show EFC++ significantly outperforms state-of-the-art methods by better maintaining model plasticity.

Conclusion: EFC++ effectively addresses the Cold Start problem in EFCIL by consolidating feature representations and reducing task-recency bias, enabling better learning of new tasks while maintaining plasticity.

Abstract: Exemplar-free Class Incremental Learning (EFCIL) aims to learn from a sequence of tasks without having access to previous task data. In this paper, we consider the challenging Cold Start scenario in which insufficient data is available in the first task to learn a high-quality backbone. This is especially challenging for EFCIL since it requires high plasticity, resulting in feature drift which is difficult to compensate for in the exemplar-free setting. To address this problem, we propose an effective approach to consolidate feature representations by regularizing drift in directions highly relevant to previous tasks while employing prototypes to reduce task-recency bias. Our approach, which we call Elastic Feature Consolidation++ (EFC++) exploits a tractable second-order approximation of feature drift based on a proposed Empirical Feature Matrix (EFM). The EFM induces a pseudo-metric in feature space which we use to regularize feature drift in important directions and to update Gaussian prototypes. In addition, we introduce a post-training prototype re-balancing phase that updates classifiers to compensate for feature drift. Experimental results on CIFAR-100, Tiny-ImageNet, ImageNet-Subset, ImageNet-1K and DomainNet demonstrate that EFC++ is better able to learn new tasks by maintaining model plasticity and significantly outperforms the state-of-the-art.
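
The EFM-induced regularizer is, in spirit, a Mahalanobis-style penalty on feature drift. A minimal sketch follows; how the EFM is estimated and combined with the prototype machinery is not captured here.

```python
import torch

def efm_drift_penalty(feat_new, feat_old, efm):
    # feat_new, feat_old: (B, D) features from the current and previous
    # backbones; efm: (D, D) positive semi-definite Empirical Feature Matrix.
    # Penalize drift under the induced pseudo-metric ||f_new - f_old||_E^2.
    drift = feat_new - feat_old
    return torch.einsum("bi,ij,bj->b", drift, efm, drift).mean()
```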

[189] Vehicle-Scene Interaction: A Text-Driven 3D Lidar Place Recognition Method for Autonomous Driving

Tianyi Shang, Zhenyu Li, Pengjie Xu, Zhaojun Deng

Main category: cs.CV

TL;DR: Des4Pos is a two-stage text-driven localization framework that achieves state-of-the-art performance in text-to-point-cloud place recognition by addressing modality gaps and enhancing local/global feature extraction.

Motivation: Current approaches struggle with point cloud encoders' inability to capture local details and long-range spatial relationships, plus significant modality gaps between text and point cloud representations in large-scale autonomous systems.

Method: Two-stage framework: 1) Coarse stage with Multi-scale Fusion Attention Mechanism for local features and bidirectional LSTM for global relationships, plus Stepped Text Encoder with CLIP prior knowledge; 2) Fine stage with Cascaded Residual Attention for cross-modal fusion and offset prediction.

Result: Achieves top-1 accuracy of 40% and top-10 accuracy of 77% under a 5-meter radius threshold on the KITTI360Pose test set, surpassing the best existing methods by 7% on each metric.

Conclusion: Des4Pos effectively bridges modality discrepancies and enhances feature representation for superior text-driven localization in large-scale point cloud maps.

Abstract: Environment description-based localization in large-scale point cloud maps constructed through remote sensing is critically significant for the advancement of large-scale autonomous systems, such as delivery robots operating in the last mile. However, current approaches encounter challenges due to the inability of point cloud encoders to effectively capture local details and long-range spatial relationships, as well as a significant modality gap between text and point cloud representations. To address these challenges, we present Des4Pos, a novel two-stage text-driven remote sensing localization framework. In the coarse stage, the point-cloud encoder utilizes the Multi-scale Fusion Attention Mechanism (MFAM) to enhance local geometric features, followed by a bidirectional Long Short-Term Memory (LSTM) module to strengthen global spatial relationships. Concurrently, the Stepped Text Encoder (STE) integrates cross-modal prior knowledge from CLIP [1] and aligns text and point-cloud features using this prior knowledge, effectively bridging modality discrepancies. In the fine stage, we introduce a Cascaded Residual Attention (CRA) module to fuse cross-modal features and predict relative localization offsets, thereby achieving greater localization precision. Experiments on the KITTI360Pose test set demonstrate that Des4Pos achieves state-of-the-art performance in text-to-point-cloud place recognition. Specifically, it attains a top-1 accuracy of 40% and a top-10 accuracy of 77% under a 5-meter radius threshold, surpassing the best existing methods by 7% and 7%, respectively.

[190] SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation

Edoardo Bianchi, Antonio Liotta

Main category: cs.CV

TL;DR: SkillFormer is a parameter-efficient architecture for unified multi-view proficiency estimation from egocentric and exocentric videos, achieving state-of-the-art accuracy with significantly reduced computational costs.

Motivation: Assessing human skill levels in complex activities is challenging but important for applications in sports, rehabilitation, and training. Current methods lack efficient multi-view integration for fine-grained skill assessment.

Method: Built on TimeSformer backbone, SkillFormer introduces CrossViewFusion module with multi-head cross-attention, learnable gating, and adaptive self-calibration. Uses Low-Rank Adaptation for parameter-efficient fine-tuning.

Result: Achieves state-of-the-art accuracy on EgoExo4D dataset with 4.5x fewer parameters and 3.75x fewer training epochs than prior baselines. Excels in multiple structured tasks.

Conclusion: SkillFormer demonstrates the value of multi-view integration for fine-grained skill assessment while maintaining remarkable computational efficiency.

Abstract: Assessing human skill levels in complex activities is a challenging problem with applications in sports, rehabilitation, and training. In this work, we present SkillFormer, a parameter-efficient architecture for unified multi-view proficiency estimation from egocentric and exocentric videos. Building on the TimeSformer backbone, SkillFormer introduces a CrossViewFusion module that fuses view-specific features using multi-head cross-attention, learnable gating, and adaptive self-calibration. We leverage Low-Rank Adaptation to fine-tune only a small subset of parameters, significantly reducing training costs. In fact, when evaluated on the EgoExo4D dataset, SkillFormer achieves state-of-the-art accuracy in multi-view settings while demonstrating remarkable computational efficiency, using 4.5x fewer parameters and requiring 3.75x fewer training epochs than prior baselines. It excels in multiple structured tasks, confirming the value of multi-view integration for fine-grained skill assessment. Project page at https://edowhite.github.io/SkillFormer
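
A stripped-down rendering of the cross-view fusion, keeping only the cross-attention and a learnable gate; the adaptive self-calibration and LoRA fine-tuning are omitted, and the token shapes are assumptions.

```python
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    # Fuse ego- and exo-view tokens with cross-attention plus a learnable gate.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts close to identity

    def forward(self, ego, exo):
        # ego, exo: (B, T, dim) view-specific token sequences
        fused, _ = self.attn(query=ego, key=exo, value=exo)
        return ego + torch.sigmoid(self.gate) * fused
```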

[191] So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection

Zhenglin Huang, Tianxiao Li, Xiangtai Li, Haiquan Wen, Yiwei He, Jiangning Zhang, Hao Fei, Xi Yang, Xiaowei Huang, Bei Peng, Guangliang Cheng

Main category: cs.CV

TL;DR: This paper introduces So-Fake-Set, a large-scale social media forgery detection dataset with 2M+ images from 35 generative models, and So-Fake-R1, a vision-language framework using reinforcement learning for accurate detection, localization, and explainable inference.

Motivation: Address the limitations of existing forgery detection datasets and methods, which lack diversity, scale, and realism for social media contexts, and struggle with generalization to unseen generative technologies.

Method: Created So-Fake-Set dataset with 2M+ high-quality images from 35 state-of-the-art generative models, established So-Fake-OOD benchmark for cross-domain evaluation, and developed So-Fake-R1 vision-language framework using reinforcement learning for detection, localization, and interpretable rationales.

Result: So-Fake-R1 outperforms the second-best method with 1.3% gain in detection accuracy and 4.5% increase in localization IoU, demonstrating superior performance in forgery detection and localization.

Conclusion: The work establishes a new foundation for social media-centric forgery detection research by integrating scalable datasets, challenging benchmarks, and advanced detection frameworks, with all resources to be publicly released.

Abstract: Recent advances in AI-powered generative models have enabled the creation of increasingly realistic synthetic images, posing significant risks to information integrity and public trust on social media platforms. While robust detection frameworks and diverse, large-scale datasets are essential to mitigate these risks, existing academic efforts remain limited in scope: current datasets lack the diversity, scale, and realism required for social media contexts, while detection methods struggle with generalization to unseen generative technologies. To bridge this gap, we introduce So-Fake-Set, a comprehensive social media-oriented dataset with over 2 million high-quality images, diverse generative sources, and photorealistic imagery synthesized using 35 state-of-the-art generative models. To rigorously evaluate cross-domain robustness, we establish a novel and large-scale (100K) out-of-domain benchmark (So-Fake-OOD) featuring synthetic imagery from commercial models explicitly excluded from the training distribution, creating a realistic testbed for evaluating real-world performance. Leveraging these resources, we present So-Fake-R1, an advanced vision-language framework that employs reinforcement learning for highly accurate forgery detection, precise localization, and explainable inference through interpretable visual rationales. Extensive experiments show that So-Fake-R1 outperforms the second-best method, with a 1.3% gain in detection accuracy and a 4.5% increase in localization IoU. By integrating a scalable dataset, a challenging OOD benchmark, and an advanced detection framework, this work establishes a new foundation for social media-centric forgery detection research. The code, models, and datasets will be released publicly.

[192] Photography Perspective Composition: Towards Aesthetic Perspective Recommendation

Lujian Yao, Siming Zheng, Xinbin Yuan, Zhuoxuan Cai, Pu Wu, Jinwei Chen, Bo Li, Peng-Tao Jiang

Main category: cs.CV

TL;DR: Proposes photography perspective composition (PPC) as a 3D recomposition method beyond traditional 2D cropping, with automated dataset creation, video generation for perspective transformation, and quality assessment model.

Motivation: Traditional 2D cropping methods are insufficient for scenes with poorly arranged subjects, while professional photographers use perspective adjustment for better compositional balance by modifying 2D projections while maintaining actual spatial positions.

Method: Three key contributions: (1) Automated framework for building PPC datasets using expert photographs, (2) Video generation approach showing perspective transformation from poor to enhanced perspectives, (3) Perspective quality assessment (PQA) model based on human performance.

Result: The approach is concise, requires no additional prompts or camera trajectories, and helps ordinary users enhance their composition skills through perspective-based recomposition.

Conclusion: PPC extends traditional photography composition methods by enabling 3D perspective adjustments, providing tools for dataset creation, transformation visualization, and quality assessment to improve photographic composition.

Abstract: Traditional photography composition approaches are dominated by 2D cropping-based methods. However, these methods fall short when scenes contain poorly arranged subjects. Professional photographers often employ perspective adjustment as a form of 3D recomposition, modifying the projected 2D relationships between subjects while maintaining their actual spatial positions to achieve better compositional balance. Inspired by this artistic practice, we propose photography perspective composition (PPC), extending beyond traditional cropping-based methods. However, implementing the PPC faces significant challenges: the scarcity of perspective transformation datasets and undefined assessment criteria for perspective quality. To address these challenges, we present three key contributions: (1) An automated framework for building PPC datasets through expert photographs. (2) A video generation approach that demonstrates the transformation process from less favorable to aesthetically enhanced perspectives. (3) A perspective quality assessment (PQA) model constructed based on human performance. Our approach is concise and requires no additional prompt instructions or camera trajectories, helping and guiding ordinary users to enhance their composition skills.

[193] Revisiting Reweighted Risk for Calibration: AURC, Focal, and Inverse Focal Loss

Han Zhou, Sebastian G. Gruber, Teodora Popordanoska, Matthew B. Blaschko

Main category: cs.CV

TL;DR: This paper establishes theoretical connections between calibration error and selective classification, showing that optimizing selective risk in low-confidence regions improves model calibration through a flexible reweighting approach with efficient CDF-based optimization.

Motivation: Several reweighted risk functionals like focal loss and AURC have been proposed for calibration improvement, but their theoretical connections to calibration errors remain unclear, motivating the need for principled analysis.

Method: The approach uses a bin-based cumulative distribution function (CDF) approximation to optimize selective risk in low-confidence regions, enabling efficient gradient-based optimization without expensive sorting (O(nK) complexity).

Result: Empirical evaluations demonstrate competitive calibration performance across various datasets and model architectures.

Conclusion: Minimizing calibration error is closely linked to selective classification, and optimizing selective risk in low-confidence regions naturally leads to improved calibration with greater flexibility through choice of confidence score functions.

Abstract: Several variants of reweighted risk functionals, such as focal loss, inverse focal loss, and the Area Under the Risk–Coverage Curve (AURC), have been proposed for improving model calibration, yet their theoretical connections to calibration errors remain unclear. In this paper, we revisit a broad class of weighted risk functions commonly used in deep learning and establish a principled connection between calibration error and selective classification. We show that minimizing calibration error is closely linked to the selective classification paradigm and demonstrate that optimizing selective risk in the low-confidence region naturally leads to improved calibration. This loss shares a similar reweighting strategy with dual focal loss but offers greater flexibility through the choice of confidence score functions (CSFs). Our approach uses a bin-based cumulative distribution function (CDF) approximation, enabling efficient gradient-based optimization without requiring expensive sorting and achieving $O(nK)$ complexity. Empirical evaluations demonstrate that our method achieves competitive calibration performance across a range of datasets and model architectures.
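
A rough sketch of the bin-based CDF reweighting with max-softmax as the confidence score function; the exact weights in the paper derive from the AURC objective, so this only illustrates how low-confidence samples get up-weighted without a full sort.

```python
import torch
import torch.nn.functional as F

def cdf_weighted_ce(logits, targets, n_bins=15):
    # Cross-entropy reweighted by a bin-based CDF of confidences,
    # emphasizing the low-confidence region (illustrative weighting).
    probs = logits.softmax(dim=-1)
    conf = probs.max(dim=-1).values                      # confidence score
    edges = torch.linspace(0, 1, n_bins + 1, device=conf.device)
    bin_idx = torch.bucketize(conf.detach(), edges[1:-1])
    counts = torch.bincount(bin_idx, minlength=n_bins).float()
    cdf = counts.cumsum(0) / counts.sum()                # bin-based CDF
    weights = 1.0 - cdf[bin_idx]                         # large where conf is low
    ce = F.cross_entropy(logits, targets, reduction="none")
    return (weights * ce).mean()
```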

[194] PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment

Edoardo Bianchi, Antonio Liotta

Main category: cs.CV

TL;DR: PATS is a novel temporal sampling method that preserves complete fundamental movements for automated sports skill assessment, achieving state-of-the-art performance across various sports domains.

Motivation: Current video sampling methods disrupt temporal continuity essential for proficiency evaluation in sports skill assessment, limiting accurate distinction between expert and novice performance.

Method: Proficiency-Aware Temporal Sampling (PATS) adaptively segments videos to ensure each analyzed portion contains full execution of critical performance components, repeating across multiple segments to maximize information coverage while maintaining temporal coherence.

Result: PATS surpasses state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) on EgoExo4D benchmark with substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball).

Conclusion: PATS successfully adapts to diverse activity characteristics and demonstrates effectiveness as an adaptive temporal sampling approach that advances automated skill assessment for real-world applications.

Abstract: Automated sports skill assessment requires capturing fundamental movement patterns that distinguish expert from novice performance, yet current video sampling methods disrupt the temporal continuity essential for proficiency evaluation. To this end, we introduce Proficiency-Aware Temporal Sampling (PATS), a novel sampling strategy that preserves complete fundamental movements within continuous temporal segments for multi-view skill assessment. PATS adaptively segments videos to ensure each analyzed portion contains full execution of critical performance components, repeating this process across multiple segments to maximize information coverage while maintaining temporal coherence. Evaluated on the EgoExo4D benchmark with SkillFormer, PATS surpasses the state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) and delivers substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball). Systematic analysis reveals that PATS successfully adapts to diverse activity characteristics, from high-frequency sampling for dynamic sports to fine-grained segmentation for sequential skills, demonstrating its effectiveness as an adaptive approach to temporal sampling that advances automated skill assessment for real-world applications. Visit our project page at https://edowhite.github.io/PATS
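
Reduced to essentials, the sampler places contiguous windows long enough to hold one complete fundamental movement and repeats them across the clip. The sketch below assumes the movement length is known or estimated upstream; PATS's actual adaptive segmentation is more involved.

```python
import numpy as np

def pats_sample(n_frames, move_len, n_segments=4, frames_per_segment=8):
    # Place contiguous windows that each cover a full movement, spread
    # across the video; sample frames uniformly inside each window.
    window = max(move_len, frames_per_segment)
    starts = np.linspace(0, max(0, n_frames - window), n_segments).astype(int)
    segments = []
    for s in starts:
        idx = np.linspace(s, s + window - 1, frames_per_segment).astype(int)
        segments.append(np.clip(idx, 0, n_frames - 1))
    return segments
```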

[195] SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration

Ye Li, Yuan Meng, Zewen Sun, Kangye Ji, Chen Tang, Jiajun Fan, Xinzhu Ma, Shutao Xia, Zhi Wang, Wenwu Zhu

Main category: cs.CV

TL;DR: SP-VLA accelerates Vision-Language-Action models by jointly scheduling models and pruning tokens to address temporal and spatial redundancy, achieving 1.5-2.4× lossless acceleration with improved performance.

Motivation: Current VLA models have high computational cost and low execution frequency, making them unsuitable for real-time tasks. Existing acceleration methods focus only on structural optimization and overlook temporal redundancy in sequential actions and spatial redundancy in visual inputs.

Method: Proposes SP-VLA with two mechanisms: 1) Action-aware model scheduling that dynamically switches between VLA model and lightweight generator based on action type (deliberative vs intuitive), 2) Spatio-semantic dual-aware token pruning that classifies and prunes tokens based on spatial and semantic importance.

Result: Achieves 1.5× lossless acceleration in LIBERO and 2.4× in SimplerEnv, with up to 6% average performance gain. Inference frequency and latency improve by 2.2× in SimplerEnv and 1.4× in LIBERO.

Conclusion: SP-VLA effectively accelerates VLA models by focusing on critical actions and salient visual information through joint model scheduling and token pruning, maintaining high accuracy while significantly improving inference speed.

Abstract: Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. Existing VLA acceleration methods primarily focus on structural optimization, overlooking the fact that these models operate in sequential decision-making environments. As a result, temporal redundancy in sequential action generation and spatial redundancy in visual input remain unaddressed. To this end, we propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens. Specifically, we design an action-aware model scheduling mechanism that reduces temporal redundancy by dynamically switching between VLA model and a lightweight generator. Inspired by the human motion pattern of focusing on key decision points while relying on intuition for other actions, we categorize VLA actions into deliberative and intuitive, assigning the former to the VLA model and the latter to the lightweight generator, enabling frequency-adaptive execution through collaborative model scheduling. To address spatial redundancy, we further develop a spatio-semantic dual-aware token pruning method. Tokens are classified into spatial and semantic types and pruned based on their dual-aware importance to accelerate VLA inference. These two mechanisms work jointly to guide the VLA in focusing on critical actions and salient visual information, achieving effective acceleration while maintaining high accuracy. Extensive experiments show that our method achieves 1.5$\times$ lossless acceleration in LIBERO and 2.4$\times$ in SimplerEnv, with up to 6% average performance gain. Inference frequency and latency improve by 2.2$\times$ in SimplerEnv and 1.4$\times$ in LIBERO.
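
Both mechanisms can be caricatured in a few lines; is_deliberative, vla_model, and light_generator are placeholders for the paper's components, and the dual-aware token scores are taken as given.

```python
import torch

def sp_vla_step(obs, history, vla_model, light_generator, is_deliberative):
    # Action-aware scheduling: the full VLA handles key decision points,
    # a cheap generator (e.g., extrapolating recent actions) handles the rest.
    if is_deliberative(obs, history):
        return vla_model(obs)
    return light_generator(history)

def prune_tokens(tokens, scores, keep_ratio=0.5):
    # tokens: (B, N, D) visual tokens; scores: (B, N) dual-aware importance.
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = scores.topk(k, dim=1).indices
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
```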

[196] From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding

Xiangfeng Wang, Xiao Li, Yadong Wei, Xueyu Song, Yang Song, Xiaoqiang Xia, Fangrui Zeng, Zaiyi Chen, Liu Liu, Gu Xu, Tong Xu

Main category: cs.CV

TL;DR: Proposes HIVE, a human-inspired automatic video editing framework using multimodal narrative understanding to create coherent short videos from long-form content, outperforming existing methods.

Motivation: Address limitations of existing automatic video editing methods that rely mainly on textual cues and neglect visual context, leading to incoherent outputs.

Method: Uses multimodal narrative understanding with character extraction, dialogue analysis, narrative summarization via MLLMs, scene-level segmentation, and three subtasks: highlight detection, opening/ending selection, and content pruning.

Result: Consistently outperforms existing baselines on both general and advertisement editing tasks, significantly narrowing quality gap between automatic and human-edited videos.

Conclusion: HIVE framework effectively addresses coherence issues in automatic video editing through multimodal understanding and human-inspired editing process decomposition.

Abstract: The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.

[197] RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

Liheng Zhang, Lexi Pang, Hang Ye, Xiaoxuan Ma, Yizhou Wang

Main category: cs.CV

TL;DR: The paper proposes a training-free framework for text-to-image diffusion models that improves structure guidance by decoupling condition feature sampling from the denoising process, achieving better balance between structure alignment and appearance quality.

Motivation: Existing feature injection methods for conditional image generation suffer from structural misalignment, condition leakage, and visual artifacts, especially when condition images diverge from natural RGB distributions. The key limitation identified is the unexplored sampling schedule of condition features and its evolving interplay with structure preservation and domain alignment.

Method: A flexible training-free framework that decouples condition feature sampling from denoising process, systematically investigating feature injection schedules. Uses condition features from a single timestep for efficiency, introduces restart refinement schedule, and employs appearance-rich prompting strategy.

Result: The approach achieves state-of-the-art results across diverse zero-shot conditioning scenarios, enabling training-free generation that is both structure-rich and appearance-rich.

Conclusion: The proposed framework successfully addresses limitations of existing feature injection methods by optimizing the sampling schedule of condition features, providing a simple yet effective solution for high-quality structure guidance in text-to-image generation.

Abstract: Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., canny edge) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning-based approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. Through an empirical analysis of existing methods, we identify a key limitation: the sampling schedule of condition features, previously unexplored, fails to account for the evolving interplay between structure preservation and domain alignment throughout diffusion steps. Inspired by this observation, we propose a flexible training-free framework that decouples the sampling schedule of condition features from the denoising process, and systematically investigate the spectrum of feature injection schedules for a higher-quality structure guidance in the feature space. Specifically, we find that condition features sampled from a single timestep are sufficient, yielding a simple yet efficient schedule that balances structure alignment and appearance quality. We further enhance the sampling process by introducing a restart refinement schedule, and improve the visual quality with an appearance-rich prompting strategy. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art results across diverse zero-shot conditioning scenarios.

[198] VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting

Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, Yanzhi Wang

Main category: cs.CV

TL;DR: A training framework to finetune VLA models for generating fewer action tokens with high parallelism, reducing inference latency and training cost, plus a voting-based ensemble strategy to improve action utilization.

Motivation: Current VLA models suffer from massive token generation causing high inference latency and training costs, and insufficient utilization of generated actions leading to performance loss.

Method: Developed a training framework to finetune VLA models for generating significantly fewer action tokens with high parallelism, and introduced a voting-based ensemble strategy to combine current and previous action predictions.

Result: Achieved superior performance compared with state-of-the-art VLA models, with significantly higher success rates and 39x faster inference than OpenVLA, reaching 46 Hz throughput on edge platforms.

Conclusion: The proposed approach demonstrates practical deployability by addressing both inference efficiency and action utilization issues in VLA models.

Abstract: Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, current VLA models suffer from two drawbacks: (i) generation of massive tokens leading to high inference latency and increased training cost, and (ii) insufficient utilization of generated actions resulting in potential performance loss. To address these issues, we develop a training framework to finetune VLA models for generating significantly fewer action tokens with high parallelism, effectively reducing inference latency and training cost. Furthermore, we introduce an inference optimization technique with a novel voting-based ensemble strategy to combine current and previous action predictions, improving the utilization of generated actions and overall performance. Our results demonstrate that we achieve superior performance compared with state-of-the-art VLA models, achieving significantly higher success rates and 39$\times$ faster inference than OpenVLA with 46 Hz throughput on edge platforms, demonstrating practical deployability. The code is available at https://github.com/LukeLIN-web/VOTE.
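
A sketch of the voting idea: overlapping action chunks cast ballots for each future timestep, and the executed action aggregates them. Mean aggregation is an assumption here, as the paper's exact voting rule is not specified in this summary.

```python
import numpy as np
from collections import defaultdict

class TrajectoryVoter:
    # Collect overlapping action-chunk predictions and ensemble per step.
    def __init__(self):
        self.ballots = defaultdict(list)   # timestep -> proposed actions

    def add_chunk(self, start_t, chunk):
        for i, action in enumerate(chunk):
            self.ballots[start_t + i].append(np.asarray(action))

    def action_at(self, t):
        votes = self.ballots.get(t)
        return None if not votes else np.mean(votes, axis=0)
```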

[199] SBP-YOLO:A Lightweight Real-Time Model for Detecting Speed Bumps and Potholes toward Intelligent Vehicle Suspension Systems

Chuanqi Liang, Jie Fu, Miao Yu, Lei Luo

Main category: cs.CV

TL;DR: SBP-YOLO is an efficient detection framework for speed bumps and potholes in embedded systems, achieving 87.0% mAP and 139.5 FPS on Jetson AGX Xavier after optimization.

Motivation: Speed bumps and potholes significantly affect ride comfort and vehicle stability. Preview-based suspension control requires accurate real-time detection, but embedded deployment faces challenges from limited computational resources and small target sizes in images.

Method: Built on YOLOv11n, integrates GhostConv and VoVGSCSPC modules in backbone and neck to reduce computation while enhancing multi-scale features. Adds P2-level branch for small-object detection and lightweight efficient detection head (LEDH). Uses hybrid training with NWD loss, BCKD knowledge distillation, and Albumentations-based augmentation.

Result: Achieves 87.0% mAP, outperforming YOLOv11n baseline by 5.8%. After TensorRT FP16 quantization, runs at 139.5 FPS on Jetson AGX Xavier with 12.4% speedup over P2-enhanced YOLOv11.

Conclusion: SBP-YOLO demonstrates suitability for fast, low-latency road condition perception in embedded suspension control systems, balancing accuracy and computational efficiency.

Abstract: Speed bumps and potholes are the most common road anomalies, significantly affecting ride comfort and vehicle stability. Preview-based suspension control mitigates their impact by detecting such irregularities in advance and adjusting suspension parameters proactively. Accurate and real-time detection is essential, but embedded deployment is constrained by limited computational resources and the small size of targets in input images. To address these challenges, this paper proposes SBP-YOLO, an efficient detection framework for speed bumps and potholes in embedded systems. Built upon YOLOv11n, it integrates GhostConv and VoVGSCSPC modules in the backbone and neck to reduce computation while enhancing multi-scale semantic features. A P2-level branch improves small-object detection, and a lightweight and efficient detection head (LEDH) maintains accuracy with minimal overhead. A hybrid training strategy further enhances robustness under varying road and environmental conditions, combining NWD loss, BCKD knowledge distillation, and Albumentations-based augmentation. Experiments show that SBP-YOLO achieves 87.0% mAP, outperforming the YOLOv11n baseline by 5.8%. After TensorRT FP16 quantization, it runs at 139.5 FPS on Jetson AGX Xavier, yielding a 12.4% speedup over the P2-enhanced YOLOv11. These results demonstrate the framework’s suitability for fast, low-latency road condition perception in embedded suspension control systems.
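
The deployment step maps onto the standard Ultralytics export path. A sketch using a stock YOLO11n checkpoint as a stand-in for the modified SBP-YOLO weights; TensorRT must be installed on the target device, e.g., a Jetson.

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")                 # stand-in for SBP-YOLO weights
model.export(format="engine", half=True)   # build an FP16 TensorRT engine
trt_model = YOLO("yolo11n.engine")         # run inference with the engine
results = trt_model.predict("road.jpg")    # hypothetical test image
```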

[200] RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

Wen Huang, Jiarui Yang, Tao Dai, Jiawei Li, Shaoxiong Zhan, Bin Wang, Shu-Tao Xia

Main category: cs.CV

TL;DR: RelayFormer is a unified framework for visual manipulation localization that handles varying resolutions and modalities through sub-image partitioning and global-local relay attention, achieving state-of-the-art performance.

DetailsMotivation: To address challenges in visual manipulation localization including resolution diversity (where resizing/padding distorts forensic traces) and modality gap (separate models needed for images vs videos).

Method: Partitions inputs into fixed-size sub-images and introduces Global-Local Relay (GLR) tokens with global-local relay attention (GLRA) mechanism to propagate structured context while preserving fine-grained manipulation artifacts.

Result: Achieves state-of-the-art performance across diverse benchmarks with notable efficiency, scaling to arbitrary resolutions and video sequences without excessive overhead.

Conclusion: RelayFormer provides resolution adaptivity without interpolation/padding, unified modeling for both images and videos, and strong balance between accuracy and computational cost.

Abstract: Visual manipulation localization (VML) aims to identify tampered regions in images and videos, a task that has become increasingly challenging with the rise of advanced editing tools. Existing methods face two main issues: resolution diversity, where resizing or padding distorts forensic traces and reduces efficiency, and the modality gap, as images and videos often require separate models. To address these challenges, we propose RelayFormer, a unified framework that adapts to varying resolutions and modalities. RelayFormer partitions inputs into fixed-size sub-images and introduces Global-Local Relay (GLR) tokens, which propagate structured context through a global-local relay attention (GLRA) mechanism. This enables efficient exchange of global cues, such as semantic or temporal consistency, while preserving fine-grained manipulation artifacts. Unlike prior methods that rely on uniform resizing or sparse attention, RelayFormer naturally scales to arbitrary resolutions and video sequences without excessive overhead. Experiments across diverse benchmarks demonstrate that RelayFormer achieves state-of-the-art performance with notable efficiency, combining resolution adaptivity without interpolation or excessive padding, unified modeling for both images and videos, and a strong balance between accuracy and computational cost. Code is available at: https://github.com/WenOOI/RelayFormer.
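
The resolution-adaptivity claim rests on tiling instead of resizing. Below is a minimal sketch of fixed-size sub-image partitioning under an assumed tile size and padding mode, not RelayFormer's actual implementation.

```python
import torch
import torch.nn.functional as F

def partition(x: torch.Tensor, size: int = 512) -> torch.Tensor:
    """Split a (C, H, W) image into fixed-size sub-images without resizing.

    The input is padded up to a multiple of `size` and unfolded into a grid
    of tiles, so forensic traces are never distorted by interpolation.
    Tile size and zero-padding are assumptions, not the paper's settings.
    """
    c, h, w = x.shape
    pad_h, pad_w = (-h) % size, (-w) % size
    x = F.pad(x, (0, pad_w, 0, pad_h))          # pad right/bottom only
    tiles = (x.unfold(1, size, size)            # (C, nH, Wp, size)
              .unfold(2, size, size)            # (C, nH, nW, size, size)
              .permute(1, 2, 0, 3, 4)           # (nH, nW, C, size, size)
              .reshape(-1, c, size, size))      # (nH*nW, C, size, size)
    return tiles

tiles = partition(torch.rand(3, 1080, 1920))    # -> (12, 3, 512, 512)
```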

[201] PointAD+: Learning Hierarchical Representations for Zero-shot 3D Anomaly Detection

Qihang Zhou, Shibo He, Jiangtao Yan, Wenchao Meng, Jiming Chen

Main category: cs.CV

TL;DR: PointAD+ is a unified framework that transfers CLIP’s 2D generalization to 3D anomaly detection by combining implicit (rendering pixel) and explicit (spatial geometry) representations through hierarchical learning and cross-hierarchy contrastive alignment.

DetailsMotivation: To transfer CLIP's robust 2D generalization capabilities to identify 3D anomalies across unseen objects with diverse class semantics, addressing the limitation of existing methods that neglect spatial relationships in point clouds.

Method: Proposes PointAD+ with explicit 3D representation using G-aggregation for spatial awareness, hierarchical representation learning with rendering and geometry prompts, and cross-hierarchy contrastive alignment to integrate implicit and explicit anomaly semantics.

Result: Extensive experiments show PointAD+ achieves superior performance in zero-shot 3D anomaly detection across unseen objects with diverse class semantics, enabling holistic understanding of abnormality.

Conclusion: PointAD+ successfully bridges 2D and 3D anomaly detection by comprehensively capturing both rendering and spatial abnormalities, demonstrating effective transfer of CLIP’s generalization to 3D domains with plug-and-play RGB integration capability.

Abstract: In this paper, we aim to transfer CLIP’s robust 2D generalization capabilities to identify 3D anomalies across unseen objects of highly diverse class semantics. To this end, we propose a unified framework to comprehensively detect and segment 3D anomalies by leveraging both point- and pixel-level information. We first design PointAD, which leverages point-pixel correspondence to represent 3D anomalies through their associated rendering pixel representations. This approach is referred to as implicit 3D representation, as it focuses solely on rendering pixel anomalies but neglects the inherent spatial relationships within point clouds. Then, we propose PointAD+ to further broaden the interpretation of 3D anomalies by introducing explicit 3D representation, emphasizing spatial abnormality to uncover abnormal spatial relationships. Hence, we propose G-aggregation, which incorporates geometry information to make the aggregated point representations spatially aware. To simultaneously capture rendering and spatial abnormality, PointAD+ introduces hierarchical representation learning, incorporating implicit and explicit anomaly semantics into hierarchical text prompts: rendering prompts for the rendering layer and geometry prompts for the geometry layer. A cross-hierarchy contrastive alignment is further introduced to promote the interaction between the rendering and geometry layers, facilitating mutual anomaly learning. Finally, PointAD+ integrates anomaly semantics from both layers to capture the generalized anomaly semantics. At test time, PointAD+ can integrate RGB information in a plug-and-play manner to further improve its detection performance. Extensive experiments demonstrate the superiority of PointAD+ in zero-shot 3D anomaly detection across unseen objects with highly diverse class semantics, achieving a holistic understanding of abnormality.

[202] Contextualized Representation Learning for Effective Human-Object Interaction Detection

Zhehao Li, Yucheng Qian, Chong Wang, Yinghao Lu, Zhihao Yang, Jiafei Wu

Main category: cs.CV

TL;DR: A contextualized representation learning approach for HOI detection that integrates affordance-guided reasoning and contextual prompts to better capture complex interactions involving auxiliary entities like tools.

DetailsMotivation: Existing two-stage HOI detection approaches face challenges due to incomplete context modeling, particularly for complex interactions involving multiple entities.

Method: Enhances conventional HOI detection by modeling multivariate relationships through triplet structures <human, tool, object> and integrating learnable prompts with visual features using attention mechanisms.

Result: Demonstrates superior performance on both HICO-Det and V-COCO datasets in most scenarios.

Conclusion: Contextualized representations with affordance reasoning and prompt integration enable more reliable reasoning over complex, context-dependent interactions.

Abstract: Human-Object Interaction (HOI) detection aims to simultaneously localize human-object pairs and recognize their interactions. While recent two-stage approaches have made significant progress, they still face challenges due to incomplete context modeling. In this work, we introduce a Contextualized Representation Learning that integrates both affordance-guided reasoning and contextual prompts with visual cues to better capture complex interactions. We enhance the conventional HOI detection framework by expanding it beyond simple human-object pairs to include multivariate relationships involving auxiliary entities like tools. Specifically, we explicitly model the functional role (affordance) of these auxiliary objects through triplet structures <human, tool, object>. This enables our model to identify tool-dependent interactions such as ‘filling’. Furthermore, the learnable prompt is enriched with instance categories and subsequently integrated with contextual visual features using an attention mechanism. This process aligns language with image content at both global and regional levels. These contextualized representations equip the model with enriched relational cues for more reliable reasoning over complex, context-dependent interactions. Our proposed method demonstrates superior performance on both the HICO-Det and V-COCO datasets in most scenarios. The source code is available at https://github.com/lzzhhh1019/CRL.

[203] Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection

Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Sicong Li, Qingming Huang

Main category: cs.CV

TL;DR: Proposes DR-MoE framework for detecting subtle user action mistakes from egocentric video using dual-stage feature fusion and multi-classifier ensemble.

DetailsMotivation: Address the challenge of identifying subtle and infrequent user action mistakes in egocentric video data, which are difficult to detect due to their rarity and ambiguity.

Method: Two-stage framework: 1) Feature extraction using frozen ViViT and LoRA-tuned ViViT models combined via feature-level expert module; 2) Three classifiers with different objectives (reweighted cross-entropy, AUC loss, label-aware loss with sharpness-aware minimization) fused via classification-level expert module.

Result: Achieves strong performance, particularly in identifying rare and ambiguous mistake instances.

Conclusion: The DR-MoE framework effectively handles class imbalance and improves detection of subtle action mistakes in egocentric video through dual-stage feature fusion and multi-objective classifier ensemble.

Abstract: In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To handle the challenges posed by subtle and infrequent mistakes, we propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework. In the first stage, features are extracted using a frozen ViViT model and a LoRA-tuned ViViT model, which are combined through a feature-level expert module. In the second stage, three classifiers are trained with different objectives: reweighted cross-entropy to mitigate class imbalance, AUC loss to improve ranking under skewed distributions, and label-aware loss with sharpness-aware minimization to enhance calibration and generalization. Their predictions are fused using a classification-level expert module. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances. The code is available at https://github.com/boyuh/DR-MoE.
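
As an illustration of the classification-level expert module, here is a minimal gated fusion of three classifiers' logits. Dimensions and the gating design are assumptions for illustration only, not DR-MoE's exact architecture.

```python
import torch
import torch.nn as nn

class ClassifierMoE(nn.Module):
    """Fuse logits from several differently-trained classifiers.

    A learned gate produces per-classifier weights from the shared feature,
    and the fused prediction is the weighted sum of the experts' logits.
    Feature dimension and gate design are illustrative assumptions.
    """
    def __init__(self, feat_dim: int, n_experts: int = 3, n_classes: int = 2):
        super().__init__()
        self.gate = nn.Linear(feat_dim, n_experts)

    def forward(self, feat: torch.Tensor, expert_logits: torch.Tensor):
        # feat: (B, D); expert_logits: (B, n_experts, n_classes)
        w = torch.softmax(self.gate(feat), dim=-1)        # (B, n_experts)
        return (w.unsqueeze(-1) * expert_logits).sum(dim=1)

moe = ClassifierMoE(feat_dim=768)
fused = moe(torch.randn(4, 768), torch.randn(4, 3, 2))    # (4, 2)
```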

[204] Towards Size-invariant Salient Object Detection: A Generic Evaluation and Optimization Approach

Shilong Bao, Qianqian Xu, Feiran Li, Boyu Han, Zhiyong Yang, Xiaochun Cao, Qingming Huang

Main category: cs.CV

TL;DR: This paper addresses size bias in Salient Object Detection (SOD) evaluation metrics, showing current metrics favor larger objects. It proposes SIEva for size-invariant evaluation and SIOpt for improved detection across all object sizes.

DetailsMotivation: Existing SOD evaluation metrics are inherently size-sensitive, causing biased assessments where larger objects dominate performance scores while smaller but semantically important objects are overlooked, leading to practical degradation.

Method: Proposes a Size-Invariant Evaluation (SIEva) framework that evaluates separable components individually and aggregates results. Also develops SIOpt optimization framework that follows size-invariant principles and can integrate with various SOD backbones.

Result: The approach effectively mitigates size imbalance impact, enhances detection of salient objects across different sizes, and provides theoretical evidence supporting the new evaluation protocols.

Conclusion: The proposed size-invariant evaluation and optimization frameworks address fundamental limitations in SOD, providing more balanced performance assessment and improved detection capabilities across objects of varying sizes.

Abstract: This paper investigates a fundamental yet underexplored issue in Salient Object Detection (SOD): the size-invariant property for evaluation protocols, particularly in scenarios when multiple salient objects of significantly different sizes appear within a single image. We first present a novel perspective to expose the inherent size sensitivity of existing widely used SOD metrics. Through careful theoretical derivations, we show that the evaluation outcome of an image under current SOD metrics can be essentially decomposed into a sum of several separable terms, with the contribution of each term being directly proportional to its corresponding region size. Consequently, the prediction errors would be dominated by the larger regions, while smaller yet potentially more semantically important objects are often overlooked, leading to biased performance assessments and practical degradation. To address this challenge, a generic Size-Invariant Evaluation (SIEva) framework is proposed. The core idea is to evaluate each separable component individually and then aggregate the results, thereby effectively mitigating the impact of size imbalance across objects. Building upon this, we further develop a dedicated optimization framework (SIOpt), which adheres to the size-invariant principle and significantly enhances the detection of salient objects across a broad range of sizes. Notably, SIOpt is model-agnostic and can be seamlessly integrated with a wide range of SOD backbones. Theoretically, we also present generalization analysis of SOD methods and provide evidence supporting the validity of our new evaluation protocols. Finally, comprehensive experiments speak to the efficacy of our proposed approach. The code is available at https://github.com/Ferry-Li/SI-SOD.
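
The core of SIEva, evaluating separable components individually and then aggregating, can be sketched as follows with per-object IoU as a stand-in metric. The paper decomposes its own SOD metrics; the window margin and plain-mean aggregation here are assumptions.

```python
import numpy as np
from scipy import ndimage

def size_invariant_iou(pred: np.ndarray, gt: np.ndarray, margin: int = 8) -> float:
    """Score each ground-truth object in a local window, then average.

    Stand-in metric: unlike pixel-pooled scores, every object contributes
    equally regardless of its area. Not the paper's exact decomposition.
    """
    labels, n = ndimage.label(gt)
    if n == 0:
        return 0.0
    scores = []
    for sl in ndimage.find_objects(labels):
        ys = slice(max(sl[0].start - margin, 0), sl[0].stop + margin)
        xs = slice(max(sl[1].start - margin, 0), sl[1].stop + margin)
        p, g = pred[ys, xs] > 0, gt[ys, xs] > 0
        union = np.logical_or(p, g).sum()
        scores.append(np.logical_and(p, g).sum() / max(union, 1))
    return float(np.mean(scores))

# A prediction that misses a tiny object barely moves pixel-pooled IoU
# (~0.99 here) but is clearly penalized by the per-object mean (0.5).
gt = np.zeros((64, 64), dtype=bool)
gt[2:6, 2:6] = True            # small object
gt[20:60, 20:60] = True        # large object
pred = gt.copy()
pred[2:6, 2:6] = False         # miss the small one
print(size_invariant_iou(pred, gt))   # 0.5
```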

[205] Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers

Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi

Main category: cs.CV

TL;DR: A training-free token eviction policy for streaming visual transformers that bounds KV memory by discarding redundant tokens while maintaining performance.

DetailsMotivation: Streaming visual transformers suffer from unbounded KV memory growth, limiting scalability for long sequences.

Method: Inference-time token eviction policy that selectively discards redundant tokens while preserving informative ones, without requiring retraining.

Result: Reduces peak memory from 18.63 GB to 9.39 GB on 7-Scenes with only 0.003 drop in accuracy/completeness. Enables denser frame sampling under memory constraints.

Conclusion: The approach makes long-horizon streaming inference practical by closely matching StreamVGGT performance at a fraction of the memory cost across multiple 3D perception tasks.

Abstract: Streaming visual transformers like StreamVGGT achieve strong 3D perception but suffer from unbounded growth of key value (KV) memory, which limits scalability. We propose a training-free, inference-time token eviction policy that bounds memory by discarding redundant tokens while keeping the most informative ones. Our method uses significantly less memory with little to no drop in accuracy: on 7-Scenes with long sequences it reduces peak memory from 18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under strict memory budgets, eviction enables denser frame sampling, which improves reconstruction accuracy compared to the baseline. Experiments across video depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and camera pose estimation (Sintel, TUM-dynamics) show that our approach closely matches StreamVGGT at a fraction of the memory and makes long-horizon streaming inference more practical.
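
A minimal sketch of training-free KV eviction by redundancy, assuming a greedy cosine-similarity rule; the paper's actual policy for scoring informativeness may differ.

```python
import torch

def evict_tokens(keys: torch.Tensor, values: torch.Tensor,
                 budget: int, sim_thresh: float = 0.95):
    """Keep at most `budget` KV tokens, dropping near-duplicates first.

    Tokens whose keys are highly similar to an already-kept token are
    treated as redundant and skipped greedily in sequence order. The
    threshold and greedy order are assumptions for illustration.
    """
    kept = []
    normed = torch.nn.functional.normalize(keys, dim=-1)
    for i in range(keys.shape[0]):
        if len(kept) >= budget:
            break
        if kept and (normed[i] @ normed[kept].T).max() > sim_thresh:
            continue                      # redundant w.r.t. a kept token
        kept.append(i)
    idx = torch.tensor(kept, dtype=torch.long)
    return keys[idx], values[idx]

k, v = torch.randn(4096, 64), torch.randn(4096, 64)
k_small, v_small = evict_tokens(k, v, budget=1024)
```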

[206] Q-FSRU: Quantum-Augmented Frequency-Spectral For Medical Visual Question Answering

Rakesh Thakur, Yusra Tariq, Rakesh Chandra Joshi

Main category: cs.CV

TL;DR: Q-FSRU is a new medical VQA model combining frequency domain processing with quantum-inspired retrieval to improve accuracy and explainability in clinical image-text reasoning.

DetailsMotivation: Current healthcare AI struggles with complex clinical questions requiring both image and text understanding, especially in noisy medical data environments.

Method: Uses Fast Fourier Transform to shift image/text features to frequency domain for noise filtering, combined with quantum-inspired retrieval system to fetch relevant medical facts from external sources.

Result: Outperforms previous models on VQA-RAD dataset, particularly on complex cases requiring image-text reasoning, with improved performance and explainability.

Conclusion: The frequency-quantum fusion approach provides a promising path for developing intelligent, transparent, and helpful AI tools for medical professionals.

Abstract: Solving tough clinical questions that require both image and text understanding is still a major challenge in healthcare AI. In this work, we propose Q-FSRU, a new model that combines Frequency Spectrum Representation and Fusion (FSRU) with a method called Quantum Retrieval-Augmented Generation (Quantum RAG) for medical Visual Question Answering (VQA). The model takes in features from medical images and related text, then shifts them into the frequency domain using Fast Fourier Transform (FFT). This helps it focus on more meaningful data and filter out noise or less useful information. To improve accuracy and ensure that answers are based on real knowledge, we add a quantum inspired retrieval system. It fetches useful medical facts from external sources using quantum-based similarity techniques. These details are then merged with the frequency-based features for stronger reasoning. We evaluated our model using the VQA-RAD dataset, which includes real radiology images and questions. The results showed that Q-FSRU outperforms earlier models, especially on complex cases needing image text reasoning. The mix of frequency and quantum information improves both performance and explainability. Overall, this approach offers a promising way to build smart, clear, and helpful AI tools for doctors.
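
To illustrate the frequency-domain step, here is a minimal low-pass filter over feature maps using the FFT. The low-pass choice and keep ratio are assumptions, since Q-FSRU learns which spectral components matter rather than applying a fixed mask.

```python
import torch

def frequency_filter(feat: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Low-pass filter fused image/text features in the frequency domain.

    Moves a (B, C, H, W) feature map to the frequency domain, zeroes out
    high frequencies outside a centred square, and transforms back. The
    fixed low-pass mask is an illustrative assumption.
    """
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    b, c, h, w = feat.shape
    mh, mw = int(h * keep_ratio / 2), int(w * keep_ratio / 2)
    mask = torch.zeros(h, w, device=feat.device)
    mask[h // 2 - mh:h // 2 + mh, w // 2 - mw:w // 2 + mw] = 1.0
    spec = spec * mask
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

filtered = frequency_filter(torch.randn(2, 256, 14, 14))
```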

[207] ExGS: Extreme 3D Gaussian Compression with Diffusion Priors

Jiaqi Chen, Xinhao Ji, Yuanyuan Gao, Hao Li, Yuning Gong, Yifei Liu, Dan Xu, Zhihang Zhong, Dingwen Zhang, Xiao Sun

Main category: cs.CV

TL;DR: ExGS is a feed-forward framework for extreme compression of 3D Gaussian Splatting scenes, achieving over 100x compression while preserving rendering quality through Universal Gaussian Compression and GaussPainter with diffusion priors.

DetailsMotivation: Neural scene representations like 3DGS have high storage and transmission costs that limit deployment in resource-constrained environments. Existing compression methods either require slow optimization or degrade quality under high compression ratios.

Method: ExGS unifies Universal Gaussian Compression (UGC) for re-optimization-free pruning of Gaussian primitives with GaussPainter, which uses diffusion priors and mask-guided refinement to restore high-quality renderings from heavily pruned scenes.

Result: The framework achieves over 100x compression (reducing 354.77 MB to 3.31 MB) while preserving fidelity and significantly improving image quality under challenging conditions.

Conclusion: Diffusion priors play a central role in bridging the gap between extreme compression and high-quality neural rendering, enabling practical real-time restoration with lightweight components.

Abstract: Neural scene representations, such as 3D Gaussian Splatting (3DGS), have enabled high-quality neural rendering; however, their large storage and transmission costs hinder deployment in resource-constrained environments. Existing compression methods either rely on costly optimization, which is slow and scene-specific, or adopt training-free pruning and quantization, which degrade rendering quality under high compression ratios. In contrast, recent data-driven approaches provide a promising direction to overcome this trade-off, enabling efficient compression while preserving high rendering quality. We introduce ExGS, a novel feed-forward framework that unifies Universal Gaussian Compression (UGC) with GaussPainter for Extreme 3DGS compression. UGC performs re-optimization-free pruning to aggressively reduce Gaussian primitives while retaining only essential information, whereas GaussPainter leverages powerful diffusion priors with mask-guided refinement to restore high-quality renderings from heavily pruned Gaussian scenes. Unlike conventional inpainting, GaussPainter not only fills in missing regions but also enhances visible pixels, yielding substantial improvements in degraded renderings. To ensure practicality, it adopts a lightweight VAE and a one-step diffusion design, enabling real-time restoration. Our framework can even achieve over 100x compression (reducing a typical 354.77 MB model to about 3.31 MB) while preserving fidelity and significantly improving image quality under challenging conditions. These results highlight the central role of diffusion priors in bridging the gap between extreme compression and high-quality neural rendering. Our code repository will be released at: https://github.com/chenttt2001/ExGS
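
A minimal sketch of re-optimization-free pruning in the spirit of UGC, ranking primitives by a crude opacity-times-area importance score; the scoring rule and keep fraction are illustrative assumptions, not UGC's actual criterion.

```python
import torch

def prune_gaussians(opacity: torch.Tensor, scale: torch.Tensor,
                    keep_frac: float = 0.01) -> torch.Tensor:
    """Rank 3DGS primitives by a crude importance score, keep the top few.

    Importance here is opacity times approximate projected area (product
    of the two largest scale axes). Both the score and `keep_frac` are
    assumptions for illustration.
    """
    area = scale.topk(2, dim=-1).values.prod(dim=-1)     # (N,)
    importance = opacity.squeeze(-1) * area
    k = max(int(keep_frac * importance.numel()), 1)
    return importance.topk(k).indices                    # indices to keep

keep = prune_gaussians(torch.rand(100_000, 1), torch.rand(100_000, 3))
```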

[208] Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

Lei Tong, Zhihua Liu, Chaochao Lu, Dino Oglic, Tom Diethe, Philip Teare, Sotirios A. Tsaftaris, Chen Jin

Main category: cs.CV

TL;DR: Causal-Adapter is a modular framework that adapts frozen text-to-image diffusion models for counterfactual image generation by enabling causal interventions on target attributes while preserving image identity, using structural causal modeling and attribute regularization strategies.

DetailsMotivation: Prior approaches rely on prompt engineering without explicit causal structure, lacking precise control over causal relationships and attribute propagation in counterfactual image generation.

Method: Leverages structural causal modeling with two attribute regularization strategies: prompt-aligned injection for semantic control and conditioned token contrastive loss for disentangling attribute factors and reducing spurious correlations.

Result: Achieves state-of-the-art performance with up to 91% MAE reduction on Pendulum dataset for accurate attribute control and 87% FID reduction on ADNI for high-fidelity MRI image generation.

Conclusion: The approach enables robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation, outperforming existing methods.

Abstract: We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. Our method enables causal interventions on target attributes, consistently propagating their effects to causal dependents without altering the core identity of the image. In contrast to prior approaches that rely on prompt engineering without explicit causal structure, Causal-Adapter leverages structural causal modeling augmented with two attribute regularization strategies: prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and a conditioned token contrastive loss to disentangle attribute factors and reduce spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, with up to 91% MAE reduction on Pendulum for accurate attribute control and 87% FID reduction on ADNI for high-fidelity MRI image generation. These results show that our approach enables robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation.

[209] Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving

Sheng Yang, Tong Zhan, Guancheng Chen, Yanfeng Lu, Jian Wang

Main category: cs.CV

TL;DR: Max-V1 is a one-stage end-to-end autonomous driving framework that formulates trajectory planning as next waypoint prediction using Vision-Language Models, achieving state-of-the-art performance on nuScenes with 30% improvement over baselines.

DetailsMotivation: To reconceptualize autonomous driving as a generalized language task and create a single-pass generation paradigm that aligns with the sequential nature of driving, enabling end-to-end trajectory prediction directly from camera input.

Method: Uses Vision-Language Models (VLM) for one-stage end-to-end autonomous driving, formulating trajectory planning as next waypoint prediction with a principled supervision strategy derived from statistical modeling for imitation learning from expert demonstrations.

Result: Achieves state-of-the-art performance on nuScenes dataset with over 30% overall improvement compared to prior baselines, and demonstrates superior generalization on cross-domain datasets from diverse vehicles.

Conclusion: The framework enables fundamental driving behaviors and lays the foundation for more capable self-driving agents, showing notable potential for cross-vehicle robustness and adaptability.

Abstract: In this work, we reconceptualize autonomous driving as a generalized language task and formulate trajectory planning as next waypoint prediction. We introduce Max-V1, a novel framework for one-stage end-to-end autonomous driving. Our framework presents a single-pass generation paradigm that aligns with the inherent sequentiality of driving. This approach leverages the generative capacity of the VLM (Vision-Language Model) to enable end-to-end trajectory prediction directly from front-view camera input. The efficacy of this method is underpinned by a principled supervision strategy derived from statistical modeling. This provides a well-defined learning objective, which makes the framework highly amenable to mastering complex driving policies through imitation learning from large-scale expert demonstrations. Empirically, our method achieves state-of-the-art performance on the nuScenes dataset, delivering an overall improvement of over 30% compared to prior baselines. Furthermore, it exhibits superior generalization performance on cross-domain datasets acquired from diverse vehicles, demonstrating notable potential for cross-vehicle robustness and adaptability. Owing to these empirical strengths, this work introduces a model enabling fundamental driving behaviors, laying the foundation for the development of more capable self-driving agents. Code will be available upon publication.
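
The next-waypoint formulation can be sketched as an autoregressive decoding loop. The policy below is a tiny hypothetical stand-in for the VLM; shapes and horizon are assumptions.

```python
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Hypothetical stand-in for the VLM's waypoint head."""
    def __init__(self, d: int = 32):
        super().__init__()
        self.head = nn.Linear(d + 2, 2)

    def forward(self, feat: torch.Tensor, wps: torch.Tensor) -> torch.Tensor:
        # Condition on the visual features and the last emitted waypoint.
        return self.head(torch.cat([feat, wps[:, -1]], dim=-1))

def decode_trajectory(policy: nn.Module, visual_feat: torch.Tensor,
                      horizon: int = 8) -> torch.Tensor:
    """Emit waypoints one at a time, like next-token prediction."""
    waypoints = torch.zeros(1, 1, 2)                  # start "token"
    for _ in range(horizon):
        nxt = policy(visual_feat, waypoints)          # (1, 2)
        waypoints = torch.cat([waypoints, nxt.unsqueeze(1)], dim=1)
    return waypoints[:, 1:]                           # (1, horizon, 2)

traj = decode_trajectory(TinyPolicy(), torch.randn(1, 32))
```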

[210] Equivariant Splitting: Self-supervised learning from incomplete data

Victor Sechaud, Jérémy Scanvic, Quentin Barthélemy, Patrice Abry, Julián Tachella

Main category: cs.CV

TL;DR: Proposes a new self-supervised learning method for inverse problems using equivariant reconstruction networks and splitting losses that achieves supervised-equivalent performance without ground-truth data.

DetailsMotivation: To enable learning-based solutions for inverse problems when ground-truth training data is expensive or impossible to obtain, particularly in settings with single incomplete observation models.

Method: Introduces equivariant reconstruction networks combined with self-supervised splitting losses, which theoretically achieve the same minimizer as supervised learning in expectation.

Result: Achieves state-of-the-art performance on image inpainting, accelerated MRI, and compressive sensing tasks, especially with highly rank-deficient forward models.

Conclusion: The proposed self-supervised approach effectively replaces supervised learning for inverse problems by leveraging equivariance and splitting losses, enabling practical applications where ground-truth data is unavailable.

Abstract: Self-supervised learning for inverse problems makes it possible to train a reconstruction network from noisy and/or incomplete data alone. These methods have the potential to enable learning-based solutions when obtaining ground-truth references for training is expensive or even impossible. In this paper, we propose a new self-supervised learning strategy devised for the challenging setting where measurements are observed via a single incomplete observation model. We introduce a new definition of equivariance in the context of reconstruction networks, and show that the combination of self-supervised splitting losses and equivariant reconstruction networks results in the same minimizer in expectation as that of a supervised loss. Through a series of experiments on image inpainting, accelerated magnetic resonance imaging, and compressive sensing, we demonstrate that the proposed loss achieves state-of-the-art performance in settings with highly rank-deficient forward models.
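
A minimal sketch of a measurement-splitting loss for the single-mask inpainting case. The model interface and split ratio are assumptions; the paper's contribution is pairing such losses with equivariant networks, which is not reproduced here.

```python
import torch

def splitting_loss(model, y: torch.Tensor, mask: torch.Tensor,
                   p: float = 0.1) -> torch.Tensor:
    """Measurement-splitting loss for self-supervised inpainting.

    For an observation model y = mask * x: hide a random subset of the
    observed pixels from the network and supervise on exactly those
    held-out pixels. `model(y, mask)` is a hypothetical interface, and
    the split ratio `p` is an assumption.
    """
    holdout = (torch.rand_like(mask) < p) & (mask > 0)   # observed pixels to hide
    inp_mask = mask * (~holdout)
    x_hat = model(y * inp_mask, inp_mask)                # reconstruct from subset
    diff = (x_hat - y) * holdout
    return diff.pow(2).sum() / holdout.sum().clamp(min=1)

model = lambda y, m: y            # hypothetical stand-in for the network
mask = (torch.rand(1, 1, 32, 32) > 0.5).float()
y = torch.randn(1, 1, 32, 32) * mask
print(splitting_loss(model, y, mask))
```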

[211] VirDA: Reusing Backbone for Unsupervised Domain Adaptation with Visual Reprogramming

Duy Nguyen, Dat Nguyen

Main category: cs.CV

TL;DR: VirDA proposes a parameter-efficient UDA method using visual reprogramming layers instead of full backbone fine-tuning, achieving competitive accuracy with significantly fewer parameters.

DetailsMotivation: Existing UDA methods require fine-tuning entire backbones for each source-target pair, leading to linear growth in parameters and storage, and preventing backbone reuse.

Method: Prepends domain-specific visual reprogramming layers to produce visual prompts that add textural bias to input images, optimizing with multiple objective functions for intra- and inter-domain distribution differences without modifying backbone parameters.

Result: Achieves 92.8% mean accuracy on Office-31 with only 1.5M trainable parameters, surpassing PDA by +1.6% accuracy using 46% of parameters, and outperforms full-backbone methods while using only 1.7-2.8% of their parameters.

Conclusion: VirDA provides an efficient UDA approach that enables backbone reuse across domains while maintaining competitive performance with minimal parameter overhead.

Abstract: Existing UDA pipelines fine-tune already well-trained backbone parameters for every new source-and-target pair, resulting in the number of training parameters and storage memory growing linearly with each new pair, and also preventing the reuse of these well-trained backbone parameters. Inspired by recent implications that existing backbones have textural biases, we propose making use of domain-specific textural bias for domain adaptation via visual reprogramming, namely VirDA. Instead of fine-tuning the full backbone, VirDA prepends a domain-specific visual reprogramming layer to the backbone. This layer produces visual prompts that act as an added textural bias to the input image, adapting its “style” to a target domain. To optimize these visual reprogramming layers, we use multiple objective functions that optimize the intra- and inter-domain distribution differences when domain-adapting visual prompts are applied. This process does not require modifying the backbone parameters, allowing the same backbone to be reused across different domains. We evaluate VirDA on Office-31 and obtain 92.8% mean accuracy with only 1.5M trainable parameters. VirDA surpasses PDA, the state-of-the-art parameter-efficient UDA baseline, by +1.6% accuracy while using just 46% of its parameters. Compared with full-backbone fine-tuning, VirDA outperforms CDTrans and FixBi by +0.2% and +1.4%, respectively, while requiring only 1.7% and 2.8% of their trainable parameters. Relative to the strongest current methods (PMTrans and TVT), VirDA uses ~1.7% of their parameters and trades off only 2.2% and 1.1% accuracy, respectively.
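
The reprogramming idea reduces to training a small input-side module while the backbone stays frozen. Below is a minimal additive-prompt sketch; VirDA's exact prompt parameterization and objectives are not reproduced here.

```python
import torch
import torch.nn as nn

class ReprogramLayer(nn.Module):
    """Prepend a learnable visual prompt to a frozen backbone.

    A learnable perturbation is added to the input image to shift its
    "style" toward the target domain, and only this layer is trained.
    The full-image additive parameterization is an assumption.
    """
    def __init__(self, img_size: int = 224):
        super().__init__()
        self.prompt = nn.Parameter(torch.zeros(3, img_size, img_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.prompt                 # per-domain textural bias

backbone = nn.Sequential(nn.Conv2d(3, 8, 3), nn.AdaptiveAvgPool2d(1),
                         nn.Flatten(), nn.Linear(8, 31))
for p in backbone.parameters():
    p.requires_grad_(False)                    # backbone is reused, frozen
reprog = ReprogramLayer()
logits = backbone(reprog(torch.randn(2, 3, 224, 224)))
```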

[212] UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction

Jin Cao, Hongrui Wu, Ziyong Feng, Hujun Bao, Xiaowei Zhou, Sida Peng

Main category: cs.CV

TL;DR: UniVerse is a unified framework that decouples robust 3D reconstruction into restoration and reconstruction subtasks using a video diffusion model to handle inconsistent multi-view images.

DetailsMotivation: Existing methods for robust reconstruction from inconsistent multi-view images rely heavily on dense observations and complex optimization. There's a need for a more generalizable approach that can handle diverse image inconsistencies without case-by-case degradation modeling.

Method: Proposes UniVerse framework that: 1) converts inconsistent images into initial videos, 2) uses a specially designed video diffusion model to restore them into consistent images, and 3) reconstructs 3D scenes from the restored images. Leverages diffusion model’s scene prior learned from large-scale data.

Result: Extensive experiments on synthetic and real-world datasets show strong generalization capability and superior performance in robust reconstruction. The method can also control the style of reconstructed 3D scenes.

Conclusion: UniVerse effectively addresses robust reconstruction by decoupling the problem into simpler subtasks and leveraging diffusion models’ general scene priors, outperforming existing methods while offering style control capabilities.

Abstract: This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations. However, these methods rely heavily on dense observations for robustly optimizing model parameters. To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process. To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images. Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies. Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. Project page: https://jin-cao-tma.github.io/UniVerse.github.io/

[213] Pack and Force Your Memory: Long-form and Consistent Video Generation

Xiaofei Wu, Guozhen Zhang, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Xuming He

Main category: cs.CV

TL;DR: Proposes MemoryPack for dynamic context modeling and Direct Forcing to mitigate error accumulation in long-form video generation, achieving minute-level temporal consistency with linear complexity.

DetailsMotivation: Address challenges in long-form video generation: capturing long-range dependencies while preventing error accumulation in autoregressive decoding.

Method: MemoryPack: learnable context-retrieval mechanism using textual and image information as global guidance; Direct Forcing: efficient single-step approximating strategy for training-inference alignment.

Result: Achieves minute-level temporal consistency, scales gracefully with video length, preserves computational efficiency with linear complexity, and reduces error propagation.

Conclusion: MemoryPack and Direct Forcing substantially enhance context consistency and reliability of long-form video generation, advancing practical usability of autoregressive video models.

Abstract: Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. To address these challenges, we make two contributions. First, for dynamic context modeling, we propose MemoryPack, a learnable context-retrieval mechanism that leverages both textual and image information as global guidance to jointly model short- and long-term dependencies, achieving minute-level temporal consistency. This design scales gracefully with video length, preserves computational efficiency, and maintains linear complexity. Second, to mitigate error accumulation, we introduce Direct Forcing, an efficient single-step approximating strategy that improves training-inference alignment and thereby curtails error propagation during inference. Together, MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation, advancing the practical usability of autoregressive video models.

cs.AI

[214] BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks

Sagnik Anupam, Davis Brown, Shuo Li, Eric Wong, Hamed Hassani, Osbert Bastani

Main category: cs.AI

TL;DR: BrowserArena is a live open-web evaluation platform for LLM web agents that identifies three key failure modes through step-level human feedback and head-to-head comparisons.

DetailsMotivation: Current web agent evaluations are limited to sandboxed environments or artificial tasks, lacking real-world open-web testing.

Method: Collect user-submitted tasks, run Arena-style head-to-head comparisons, use step-level human feedback to analyze agent traces, and construct targeted datasets for identified failure modes.

Result: Identified three consistent failure modes: captcha resolution, pop-up banner removal, and direct URL navigation. Found model-specific variations - o4-mini uses diverse captcha strategies while DeepSeek-R1 misleads users about captcha resolution.

Conclusion: Current web agents show both diversity and brittleness. The benchmarking methodology provides scalable approach to evaluate and understand web agent failure modes.

Abstract: LLM web agents now browse and take actions on the open web, yet current agent evaluations are constrained to sandboxed environments or artificial tasks. We introduce BrowserArena, a live open-web agent evaluation platform that collects user-submitted tasks, runs Arena-style head-to-head comparisons, and uses step-level human feedback to surface failure modes. Collecting and analyzing step-level annotations on the agent traces, we identify three consistent failure modes: captcha resolution, pop-up banner removal, and direct navigation to URLs. By constructing targeted datasets to further study these tasks, we discover variations in how different language models navigate these failure modes. We find, for example, that o4-mini deploys a wider variety of strategies to circumvent captcha resolution than other models and DeepSeek-R1 consistently misleads users about captcha resolution. Our findings surface both the diversity and brittleness of current web agents. More broadly, our benchmarking methodology provides an approach to evaluating and understanding web agent failure modes at scale.

[215] RefineShot: Rethinking Cinematography Understanding with Foundational Skill Evaluation

Hang Wu, Yujun Cai, Haonan Ge, Hongkai Chen, Ming-Hsuan Yang, Yiwei Wang

Main category: cs.AI

TL;DR: The paper identifies issues with ShotBench (current benchmark for cinematography understanding) and ShotVL (state-of-the-art model), then proposes RefineShot as a refined benchmark with improved evaluation protocols.

DetailsMotivation: Current benchmarks for cinematography understanding have ambiguous option designs and evaluation reliability issues, which limit fair comparison and hinder future progress in the field.

Method: Systematically refine ShotBench through consistent option restructuring, conduct critical analysis of ShotVL’s reasoning behavior, and introduce extended evaluation protocol assessing both task accuracy and core model competencies.

Result: Developed RefineShot, a refined and expanded benchmark that enables more reliable assessment of cinematography understanding capabilities.

Conclusion: RefineShot provides a more reliable evaluation framework that fosters future advances in cinematography understanding by addressing previous benchmark limitations.

Abstract: Cinematography understanding refers to the ability to recognize not only the visual content of a scene but also the cinematic techniques that shape narrative meaning. This capability is attracting increasing attention, as it enhances multimodal understanding in real-world applications and underpins coherent content creation in film and media. As the most comprehensive benchmark for this task, ShotBench spans a wide range of cinematic concepts and VQA-style evaluations, with ShotVL achieving state-of-the-art results on it. However, our analysis reveals that ambiguous option design in ShotBench and ShotVL’s shortcomings in reasoning consistency and instruction adherence undermine evaluation reliability, limiting fair comparison and hindering future progress. To overcome these issues, we systematically refine ShotBench through consistent option restructuring, conduct the first critical analysis of ShotVL’s reasoning behavior, and introduce an extended evaluation protocol that jointly assesses task accuracy and core model competencies. These efforts lead to RefineShot, a refined and expanded benchmark that enables more reliable assessment and fosters future advances in cinematography understanding.

[216] Safe and Efficient In-Context Learning via Risk Control

Andrea Wynn, Metod Jazbec, Charith Peris, Rinat Khaziev, Anqi Liu, Daniel Khashabi, Eric Nalisnick

Main category: cs.AI

TL;DR: Proposes a method using distribution-free risk control to limit performance degradation from harmful in-context demonstrations while maintaining computational efficiency gains from helpful demonstrations.

DetailsMotivation: Address safety concerns where LLMs can be influenced by incorrect or malicious in-context examples, requiring built-in mechanisms to guard against such attacks.

Method: Define baseline safe behavior (zero-shot performance), apply distribution-free risk control with dynamic early exit prediction that ignores later attention heads attending to unsafe inputs, and modify DFRC to control risk while leveraging helpful inputs.

Result: The approach effectively controls risk for harmful in-context demonstrations while achieving substantial computational efficiency gains with helpful demonstrations.

Conclusion: The proposed method provides principled protection against adversarial in-context examples while maintaining performance benefits from legitimate demonstrations.

Abstract: Large language models (LLMs) demonstrate a remarkable ability to learn new tasks from a few in-context examples. However, this flexibility introduces safety concerns: LLMs can be influenced by incorrect or malicious demonstrations – for example, if an adversary tampers with or injects harmful examples without a human supervisor noticing. This motivates principled designs in which the system itself includes built-in mechanisms to guard against such attacks. We propose a novel approach to limit the degree to which harmful demonstrations can degrade model performance. First, we define a baseline “safe” behavior for the model – the model’s performance given no in-context demonstrations (zero-shot). Next, we apply distribution-free risk control (DFRC) to control the extent to which in-context samples can decay performance below zero-shot. We achieve this by leveraging dynamic early exit prediction, ignoring later attention heads that attend the most to the unsafe inputs. Finally, we propose modifications to DFRC that allow it to both control risk for harmful inputs and leverage performance and efficiency gains on helpful inputs. We present both theoretical and empirical results showing that our approach can effectively control risk for harmful in-context demonstrations while simultaneously achieving substantial computational efficiency gains with helpful demonstrations.
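
The DFRC step can be sketched generically as calibrating a control knob lambda so that an upper confidence bound on risk stays below a target. The Hoeffding bound below is a standard, assumed choice, not necessarily the paper's exact bound.

```python
import numpy as np

def calibrate_lambda(losses: np.ndarray, lambdas: np.ndarray,
                     alpha: float = 0.05, delta: float = 0.1) -> float:
    """Largest lambda whose upper confidence bound on risk is <= alpha.

    losses[i, j]: bounded loss of calibration example j under lambdas[i].
    The Hoeffding UCB is an assumed choice; the paper's DFRC variant
    controls performance decay relative to the zero-shot baseline.
    """
    n = losses.shape[1]
    ucb = losses.mean(axis=1) + np.sqrt(np.log(1 / delta) / (2 * n))
    valid = lambdas[ucb <= alpha]
    return float(valid.max()) if valid.size else float(lambdas.min())

# lambda might index how permissive the system is (e.g., how much of the
# in-context input is used); larger = more permissive.
lam = calibrate_lambda(np.random.rand(20, 500) * 0.1, np.linspace(0, 1, 20))
```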

[217] Improving Cooperation in Collaborative Embodied AI

Hima Jacob Leven Suprabha, Laxmi Nag Laxminarayan Nagesh, Ajith Nair, Alvin Reuben Amal Selvaster, Ayan Khan, Raghuram Damarla, Sanju Hannah Samuel, Sreenithi Saravana Perumal, Titouan Puech, Venkataramireddy Marella, Vishal Sonar, Alessandro Suglia, Oliver Lemon

Main category: cs.AI

TL;DR: This paper enhances CoELA framework by optimizing LLM prompting methods and integrating speech capabilities to improve collaborative agent performance in multiagent systems.

DetailsMotivation: To explore how different prompting methods and LLM combinations can enhance collaborative behavior and decision-making in multiagent systems using the CoELA framework.

Method: Enhanced CoELA framework with systematic experimentation on different LLMs and prompt engineering strategies, plus integration of speech capabilities for voice-based interactions.

Result: Best prompt optimization combination improved system efficiency by 22% with Gemma3 compared to original CoELA system. Speech integration provided more engaging user interface.

Conclusion: Prompt optimization significantly enhances collaborative agent performance, and speech integration improves user experience for system development and demonstrations.

Abstract: The integration of Large Language Models (LLMs) into multiagent systems has opened new possibilities for collaborative reasoning and cooperation with AI agents. This paper explores different prompting methods and evaluates their effectiveness in enhancing agent collaborative behaviour and decision-making. We enhance CoELA, a framework designed for building Collaborative Embodied Agents that leverage LLMs for multi-agent communication, reasoning, and task coordination in shared virtual spaces. Through systematic experimentation, we examine different LLMs and prompt engineering strategies to identify optimised combinations that maximise collaboration performance. Furthermore, we extend our research by integrating speech capabilities, enabling seamless collaborative voice-based interactions. Our findings highlight the effectiveness of prompt optimisation in enhancing collaborative agent performance; for example, our best combination improved the efficiency of the system running with Gemma3 by 22% compared to the original CoELA system. In addition, the speech integration provides a more engaging user interface for iterative system development and demonstrations.

[218] Multimodal Function Vectors for Spatial Relations

Shuhao Fu, Esther Goldberg, Ying Nian Wu, Hongjing Lu

Main category: cs.AI

TL;DR: The paper identifies specific attention heads in OpenFlamingo-4B that encode spatial relations, extracts ‘function vectors’ from them, and shows these can be manipulated to improve relational reasoning performance without retraining the full model.

DetailsMotivation: To understand the internal mechanisms behind LMMs' in-context learning abilities for relational tasks, particularly how spatial relational knowledge is encoded and can be controlled.

Method: Apply causal mediation analysis to identify key attention heads, extract multimodal function vectors from their activations, fine-tune these vectors with limited data while keeping LMM parameters frozen, and linearly combine vectors for analogy problems.

Result: Function vectors improve zero-shot accuracy, outperform in-context learning baselines after fine-tuning, and can solve novel spatial relation analogies through linear combination.

Conclusion: LMMs encode spatial relational knowledge in localized structures that can be systematically extracted and optimized, advancing understanding of model modularity and enhancing control over relational reasoning.

Abstract: Large Multimodal Models (LMMs) demonstrate impressive in-context learning abilities from limited multimodal demonstrations, yet the internal mechanisms supporting such task learning remain opaque. Building on prior work on large language models, we show that a small subset of attention heads in the vision-language model OpenFlamingo-4B is responsible for transmitting representations of spatial relations. The activations of these attention heads, termed function vectors, can be extracted and manipulated to alter an LMM’s performance on relational tasks. First, using both synthetic and real image datasets, we apply causal mediation analysis to identify attention heads that strongly influence relational predictions, and extract multimodal function vectors that improve zero-shot accuracy at inference time. We further demonstrate that these multimodal function vectors can be fine-tuned with a modest amount of training data, while keeping LMM parameters frozen, to significantly outperform in-context learning baselines. Finally, we show that relation-specific function vectors can be linearly combined to solve analogy problems involving novel and untrained spatial relations, highlighting the strong generalization ability of this approach. Our results show that LMMs encode spatial relational knowledge within localized internal structures, which can be systematically extracted and optimized, thereby advancing our understanding of model modularity and enhancing control over relational reasoning in LMMs.
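
A minimal sketch of extracting and applying a function vector from selected attention heads. Head selection via causal mediation is omitted, and the injection site and scaling are assumptions.

```python
import torch

def build_function_vector(head_acts: list[torch.Tensor]) -> torch.Tensor:
    """Average selected attention-head outputs into one function vector.

    `head_acts` holds cached outputs of the causally important heads over
    a set of in-context demonstrations, each of shape (n_examples,
    d_model). Summing the per-head means is an illustrative assumption.
    """
    return torch.stack([a.mean(dim=0) for a in head_acts]).sum(dim=0)

def apply_function_vector(hidden: torch.Tensor, fv: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    # hidden: (B, T, d_model); add the vector at the last token position.
    # The layer, position, and alpha are assumed choices.
    hidden = hidden.clone()
    hidden[:, -1, :] += alpha * fv
    return hidden

fv = build_function_vector([torch.randn(16, 512) for _ in range(4)])
patched = apply_function_vector(torch.randn(2, 10, 512), fv)
```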

[219] Orchestrating Human-AI Teams: The Manager Agent as a Unifying Research Challenge

Charlie Masters, Advaith Vellanki, Jiangbo Shangguan, Bart Kultys, Jonathan Gilmore, Alastair Moore, Stefano V. Albrecht

Main category: cs.AI

TL;DR: This paper proposes Autonomous Manager Agents for orchestrating multi-agent workflows, formalizes workflow management as a Partially Observable Stochastic Game, identifies four key challenges, and releases MA-Gym framework showing current AI struggles with joint optimization.

DetailsMotivation: While AI has advanced in automating individual tasks, managing complex multi-agent workflows remains challenging, requiring systems that can orchestrate collaboration in dynamic human-AI teams.

Method: Proposed Autonomous Manager Agent concept, formalized workflow management as Partially Observable Stochastic Game, identified four foundational challenges, and developed MA-Gym simulation framework to evaluate GPT-5-based agents across 20 workflows.

Result: Evaluation showed GPT-5-based Manager Agents struggle to jointly optimize for goal completion, constraint adherence, and workflow runtime, highlighting workflow management as a difficult open problem.

Conclusion: Workflow management remains an open challenge requiring advances in compositional reasoning, multi-objective optimization, team coordination, and governance. The paper discusses organizational and ethical implications of autonomous management systems.

Abstract: While agentic AI has advanced in automating individual tasks, managing complex multi-agent workflows remains a challenging problem. This paper presents a research vision for autonomous agentic systems that orchestrate collaboration within dynamic human-AI teams. We propose the Autonomous Manager Agent as a core challenge: an agent that decomposes complex goals into task graphs, allocates tasks to human and AI workers, monitors progress, adapts to changing conditions, and maintains transparent stakeholder communication. We formalize workflow management as a Partially Observable Stochastic Game and identify four foundational challenges: (1) compositional reasoning for hierarchical decomposition, (2) multi-objective optimization under shifting preferences, (3) coordination and planning in ad hoc teams, and (4) governance and compliance by design. To advance this agenda, we release MA-Gym, an open-source simulation and evaluation framework for multi-agent workflow orchestration. Evaluating GPT-5-based Manager Agents across 20 workflows, we find they struggle to jointly optimize for goal completion, constraint adherence, and workflow runtime - underscoring workflow management as a difficult open problem. We conclude with organizational and ethical implications of autonomous management systems.

[220] Agentic Additive Manufacturing Alloy Discovery

Peter Pak, Achuth Chandrasekhar, Amir Barati Farimani

Main category: cs.AI

TL;DR: This paper presents a multi-agent system using LLMs to automate alloy discovery in additive manufacturing by leveraging tools like Thermo-Calc calculations and process map generation through Model Context Protocol.

DetailsMotivation: Alloy discovery in additive manufacturing is complex and requires expertise across multiple domains. The research aims to automate and accelerate this process using intelligent agent systems.

Method: Developed a multi-agent system using Large Language Models that dispatch tool calls via Model Context Protocol to perform thermodynamic simulations (Thermo-Calc property diagrams) and process analysis (lack of fusion process maps).

Result: The system can effectively reason through complex user prompts, analyze alloy printability, and dynamically adjust task trajectories based on tool call outcomes, enabling autonomous decision-making.

Conclusion: LLM-enabled multi-agent systems can successfully automate and accelerate alloy discovery in additive manufacturing, demonstrating the practical benefits of adopting such intelligent systems.

Abstract: Agentic systems enable the intelligent use of research tooling, augmenting a researcher’s ability to investigate and propose novel solutions to existing problems. Within Additive Manufacturing (AM), alloy discovery remains a complex challenge, often requiring expertise in the various domains of materials science, thermodynamic simulations, and experimental analysis. Large Language Model (LLM) enabled agents can facilitate this endeavor by utilizing their extensive knowledge base to dispatch tool calls via Model Context Protocol (MCP) to perform actions such as Thermo-Calc property diagram calculations and lack of fusion process map generation. In addition, the multi-agent system developed in this work is able to effectively reason through complex user prompts and provide analysis on the printability of proposed alloys. These agents can dynamically adjust their task trajectory to the outcomes of tool call results, effectively enabling autonomous decision-making in practical environments. This work aims to utilize LLM enabled agents to automate and accelerate the task of alloy discovery within the field of additive manufacturing and showcase the benefits of adopting this multi-agent system.

[221] A Benchmark Study of Deep Reinforcement Learning Algorithms for the Container Stowage Planning Problem

Yunqi Huang, Nishith Chennakeshava, Alexis Carras, Vladislav Neverov, Wei Liu, Aske Plaat, Yingjie Fan

Main category: cs.AI

TL;DR: This paper develops a Gym environment for container stowage planning with crane scheduling and benchmarks five RL algorithms (DQN, QR-DQN, A2C, PPO, TRPO) across varying complexity scenarios.

DetailsMotivation: Container stowage planning is critical for maritime transportation efficiency but traditionally relies on human expertise. While RL has been applied to CSPP, there's a lack of systematic benchmark comparisons across different algorithms.

Method: Created a Gym environment capturing CSPP fundamentals with crane scheduling in both multi-agent and single-agent formulations. Evaluated five RL algorithms (DQN, QR-DQN, A2C, PPO, TRPO) under multiple scenarios of varying complexity.

Result: Results show distinct performance gaps with increasing complexity, highlighting the importance of algorithm choice and problem formulation for CSPP.

Conclusion: The paper provides a benchmark for multiple RL methods in CSPP and offers a reusable Gym environment with crane scheduling, establishing a foundation for future research and practical deployment in maritime logistics.

Abstract: Container stowage planning (CSPP) is a critical component of maritime transportation and terminal operations, directly affecting supply chain efficiency. Owing to its complexity, CSPP has traditionally relied on human expertise. While reinforcement learning (RL) has recently been applied to CSPP, systematic benchmark comparisons across different algorithms remain limited. To address this gap, we develop a Gym environment that captures the fundamental features of CSPP and extend it to include crane scheduling in both multi-agent and single-agent formulations. Within this framework, we evaluate five RL algorithms: DQN, QR-DQN, A2C, PPO, and TRPO under multiple scenarios of varying complexity. The results reveal distinct performance gaps with increasing complexity, underscoring the importance of algorithm choice and problem formulation for CSPP. Overall, this paper benchmarks multiple RL methods for CSPP while providing a reusable Gym environment with crane scheduling, thus offering a foundation for future research and practical deployment in maritime logistics.
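
To show the shape of such an environment, here is a toy gymnasium skeleton for single-agent stowage with an overstow penalty. The state encoding, reward shaping, and dimensions are simplified assumptions, not the released environment.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class StowageEnv(gym.Env):
    """Toy single-agent container stowage environment (illustrative only).

    Containers arrive one at a time; the action picks a bay. The reward
    penalizes overstows (a later-unloaded container placed on top of an
    earlier-unloaded one) and illegal placements in full bays.
    """
    def __init__(self, n_bays: int = 4, height: int = 5, n_containers: int = 12):
        super().__init__()
        self.n_bays, self.height, self.n = n_bays, height, n_containers
        self.action_space = spaces.Discrete(n_bays)
        self.observation_space = spaces.Box(0, n_containers,
                                            shape=(n_bays, height + 1),
                                            dtype=np.int32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.stacks = [[] for _ in range(self.n_bays)]
        self.queue = list(self.np_random.integers(1, 10, size=self.n))
        return self._obs(), {}

    def step(self, action):
        dest = self.queue.pop(0)
        stack = self.stacks[action]
        reward = -1.0 if (stack and stack[-1] < dest) else 0.0  # overstow
        if len(stack) < self.height:
            stack.append(dest)
        else:
            reward -= 5.0                       # illegal: bay already full
        done = len(self.queue) == 0
        return self._obs(), reward, done, False, {}

    def _obs(self):
        # Each row: bay contents bottom-up, last column = next destination.
        obs = np.zeros((self.n_bays, self.height + 1), dtype=np.int32)
        for b, s in enumerate(self.stacks):
            obs[b, :len(s)] = s
            obs[b, -1] = self.queue[0] if self.queue else 0
        return obs

env = StowageEnv()
obs, _ = env.reset(seed=0)
obs, r, done, trunc, _ = env.step(env.action_space.sample())
```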

[222] Multimodal Large Language Model Framework for Safe and Interpretable Grid-Integrated EVs

Jean Douglas Carvalho, Hugo Kenji, Ahmad Mohammad Saber, Glaucia Melo, Max Mauro Dias Santos, Deepa Kundur

Main category: cs.AI

TL;DR: A multi-modal LLM framework processes sensor data to generate natural-language alerts for EV drivers, enhancing safety and decision-making in urban driving scenarios.

DetailsMotivation: To address the challenge of ensuring safe and interpretable interactions between drivers, vehicles, and the environment in EV integration with smart grids.

Method: Combines visual perception (YOLOv8), geocoded positioning, and CAN bus telemetry with a multi-modal large language model to process sensor data and generate driver alerts.

Result: Validated with real-world data, the framework effectively generates context-aware alerts for critical situations like proximity to pedestrians, cyclists, and other vehicles.

Conclusion: LLMs show potential as assistive tools in e-mobility, enabling safer driving, scalable fleet coordination, EV load forecasting, and traffic-aware energy planning.

Abstract: The integration of electric vehicles (EVs) into smart grids presents unique opportunities to enhance both transportation systems and energy networks. However, ensuring safe and interpretable interactions between drivers, vehicles, and the surrounding environment remains a critical challenge. This paper presents a multi-modal large language model (LLM)-based framework to process multimodal sensor data - such as object detection, semantic segmentation, and vehicular telemetry - and generate natural-language alerts for drivers. The framework is validated using real-world data collected from instrumented vehicles driving on urban roads, ensuring its applicability to real-world scenarios. By combining visual perception (YOLOv8), geocoded positioning, and CAN bus telemetry, the framework bridges raw sensor data and driver comprehension, enabling safer and more informed decision-making in urban driving scenarios. Case studies using real data demonstrate the framework’s effectiveness in generating context-aware alerts for critical situations, such as proximity to pedestrians, cyclists, and other vehicles. This paper highlights the potential of LLMs as assistive tools in e-mobility, benefiting both transportation systems and electric networks by enabling scalable fleet coordination, EV load forecasting, and traffic-aware energy planning. Index Terms: Electric vehicles, visual perception, large language models, YOLOv8, semantic segmentation, CAN bus, prompt engineering, smart grid.
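
Code sketch: one way the sensor-to-alert flow could be wired, with a stubbed LLM call. Field names follow common YOLOv8 output conventions (class, confidence) but are illustrative, not the paper's schema.

```python
# Sketch of the sensor-to-alert flow: fuse object detections, position, and
# CAN telemetry into a prompt for an LLM that writes a driver alert.
# `run_llm` is a stub standing in for a real model call.

def run_llm(prompt: str) -> str:
    return "Caution: cyclist ahead on the right; reduce speed."  # placeholder

def build_alert(detections, position, telemetry) -> str:
    scene = "; ".join(
        f"{d['cls']} at ~{d['dist_m']:.0f} m (conf {d['conf']:.2f})" for d in detections
    )
    prompt = (
        "You are an in-vehicle assistant. Write one short safety alert.\n"
        f"Detections: {scene}\n"
        f"Location: {position}\n"
        f"Speed: {telemetry['speed_kmh']} km/h, brake: {telemetry['brake']}\n"
    )
    return run_llm(prompt)

alert = build_alert(
    detections=[{"cls": "cyclist", "dist_m": 12, "conf": 0.91}],
    position="47.47N, 19.05E (urban road)",
    telemetry={"speed_kmh": 42, "brake": False},
)
print(alert)
```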

[223] Mitigating Modal Imbalance in Multimodal Reasoning

Chen Henry Wu, Neil Kale, Aditi Raghunathan

Main category: cs.AI

TL;DR: Foundation models struggle with cross-modal reasoning, particularly in handling conflicts across different modalities. While they perform well with single modalities (90% conflict recognition), performance drops dramatically (to 3%) when evidence is split across modalities due to attention imbalance.

DetailsMotivation: To understand how well foundation models perform joint reasoning across multiple modalities, especially when modalities interact and form cross-modal contexts, by studying their ability to handle cross-modal conflicts.

Method: The study examines FMs on cross-modal conflicts where conflicting evidence is presented across modalities. They analyze attention imbalance and propose a simple, scalable method of explicitly combining multiple modalities within each training instance to reduce this imbalance.

Result: FMs recognize conflicts in unimodal contexts 90% of the time, but performance drops to as low as 3% when evidence is split across modalities. Cross-modal attention imbalance is identified as the root cause, with FMs disproportionately prioritizing certain modalities. The proposed method significantly reduces attention imbalance and improves downstream performance on vision-language benchmarks.

Conclusion: Systematically addressing cross-modal contexts is crucial for building reliable foundation models. Simply scaling up multimodal datasets without explicit cross-modal reasoning training is insufficient, but targeted methods that explicitly combine modalities can significantly improve performance.

Abstract: Foundation models (FMs) deployed in real-world tasks such as computer-use agents must integrate diverse modalities. How good are FMs at performing joint reasoning, simultaneously reasoning over multiple modalities, especially when the modalities interact and relate to each other to form cross-modal context? To better understand this problem, we study FMs on cross-modal conflicts: scenarios where conflicting evidence is presented across modalities. This allows us to examine whether FMs prioritize one modality over another or reason jointly to reconcile the conflict. Our experiments reveal that FMs can recognize conflicts in unimodal contexts, composed of a single modality, 90% of the time, but the ratio falls as low as 3% when evidence is split across modalities – similar observations hold in cross-lingual contexts, composed of multiple languages. We trace this failure to cross-modal attention imbalance, showing that FMs exhibit extreme asymmetry in attention scores, disproportionately prioritizing certain modalities. We show that cross-modal attention imbalance does not go away by simply scaling up multimodal or multilingual datasets blindly, since they lack training examples that explicitly require cross-modal reasoning. We demonstrate that even a simple and scalable method of explicitly combining multiple modalities within each training instance significantly reduces attention imbalance. Reduced attention imbalance directly translates to improved downstream performance on several vision-language benchmarks. Our findings underscore the importance of systematically addressing cross-modal contexts to build reliable foundation models.
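
Code sketch: a guess at what "explicitly combining multiple modalities within each training instance" could look like as a data-construction step. The pairing scheme below is assumed for illustration, not taken from the paper.

```python
# Sketch of the data-construction idea: instead of unimodal examples, build
# training instances whose answer depends on evidence from both modalities,
# so the model cannot ignore either one.

def make_cross_modal_instance(image_fact, text_fact):
    """Pair an image-borne fact with a text-borne fact so that answering
    the question requires attending to both modalities jointly."""
    conflict = image_fact["color"] != text_fact["color"]
    return {
        "image": image_fact["image_id"],  # carries the visual color evidence
        "text": f"The {text_fact['name']} is {text_fact['color']}.",
        "question": "Do the image and the text agree on the object's color? Answer yes or no.",
        "answer": "no" if conflict else "yes",
    }

inst = make_cross_modal_instance(
    image_fact={"image_id": "img_0042", "color": "red"},
    text_fact={"name": "car", "color": "blue"},
)
print(inst)
```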

[224] On the Role of Temperature Sampling in Test-Time Scaling

Yuheng Wu, Azalia Mirhoseini, Thierry Tambe

Main category: cs.AI

TL;DR: Test-time scaling with multiple reasoning traces reaches diminishing returns at large sample sizes, but scaling across different temperatures significantly improves LLM reasoning performance by exploring diverse solution spaces.

DetailsMotivation: Prior work showed that increasing the number of reasoning samples (K) improves accuracy, but this trend doesn't hold indefinitely - hard questions remain unsolved regardless of K, and different temperatures solve different problem subsets.

Method: Proposed temperature scaling where multiple reasoning traces are generated at different sampling temperatures, and a multi-temperature voting method to reduce computational overhead.

Result: Temperature scaling yields an additional 7.3 points over single-temperature TTS across Qwen3 models and five reasoning benchmarks, enabling base models to reach RL-trained model performance without additional training.

Conclusion: Test-time scaling is more powerful than previously thought, and temperature scaling offers a simple, effective way to unlock the latent potential of base language models for reasoning tasks.

Abstract: Large language models (LLMs) can improve reasoning at inference time through test-time scaling (TTS), where multiple reasoning traces are generated and the best one is selected. Prior work shows that increasing the number of samples K steadily improves accuracy. In this paper, we demonstrate that this trend does not hold indefinitely: at large K, further scaling yields no gains, and certain hard questions remain unsolved regardless of the number of traces. Interestingly, we find that different sampling temperatures solve different subsets of problems, implying that single-temperature scaling explores only part of a model’s potential. We therefore propose scaling along the temperature dimension, which enlarges the reasoning boundary of LLMs. Averaged over Qwen3 (0.6B, 1.7B, 4B, 8B) and five representative reasoning benchmarks (AIME 2024/2025, MATH500, LiveCodeBench, Hi-ToM), temperature scaling yields an additional 7.3 points over single-temperature TTS. Temperature scaling also enables base models to reach performance comparable to reinforcement learning (RL)-trained counterparts, without additional post-training. We further provide a comprehensive analysis of this phenomenon and design a multi-temperature voting method that reduces the overhead of temperature scaling. Overall, our findings suggest that TTS is more powerful than previously thought, and that temperature scaling offers a simple and effective way to unlock the latent potential of base models.
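
Code sketch: a minimal version of sampling across temperatures and voting over final answers. `sample_answer` is a stub for one sampled reasoning trace's final answer; the temperature grid and sample counts are illustrative.

```python
# Sketch of multi-temperature test-time scaling with voting: sample traces at
# several temperatures, extract final answers, and majority-vote across all.
from collections import Counter
import random

def sample_answer(prompt: str, temperature: float) -> str:
    # Placeholder: a real implementation would call the LLM at this temperature
    # and parse the final answer out of the generated reasoning trace.
    return random.choice(["42", "42", "41"]) if temperature > 0.8 else "42"

def multi_temperature_vote(prompt, temperatures=(0.2, 0.6, 1.0), samples_per_temp=4):
    answers = [
        sample_answer(prompt, t) for t in temperatures for _ in range(samples_per_temp)
    ]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)  # voted answer and its vote share

print(multi_temperature_vote("What is 6 * 7?"))
```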

[225] Geolog-IA: Conversational System for Academic Theses

Micaela Fuel Pozo, Andrea Guatumillo Saltos, Yeseña Tipan Llumiquinga, Kelly Lascano Aguirre, Marilyn Castillo Jara, Christian Mejia-Escobar

Main category: cs.AI

TL;DR: Geolog-IA is an AI conversational system for geology theses at the Central University of Ecuador, using Llama 3.1 and Gemini 2.5 with a RAG architecture to mitigate hallucinations and outdated knowledge.

DetailsMotivation: To create a natural conversational system for answering geology thesis questions, overcoming issues like AI hallucinations and outdated knowledge in academic settings.

Method: Uses Llama 3.1 and Gemini 2.5 language models combined with Retrieval Augmented Generation (RAG) architecture and SQLite database for accurate information retrieval.

Result: Achieved average BLEU score of 0.87, indicating high consistency and accuracy in responses, with an intuitive web interface for various university stakeholders.

Conclusion: The system provides key support for education, training, and research, establishing a foundation for future applications in other disciplines.

Abstract: This study presents the development of Geolog-IA, a novel conversational system based on artificial intelligence that responds naturally to questions about geology theses from the Central University of Ecuador. Our proposal uses the Llama 3.1 and Gemini 2.5 language models, which are complemented by a Retrieval Augmented Generation (RAG) architecture and an SQLite database. This strategy allows us to overcome problems such as hallucinations and outdated knowledge. The evaluation of Geolog-IA’s performance with the BLEU metric reaches an average of 0.87, indicating high consistency and accuracy in the responses generated. The system offers an intuitive, web-based interface that facilitates interaction and information retrieval for directors, teachers, students, and administrative staff at the institution. This tool can be a key support in education, training, and research and establishes a basis for future applications in other disciplines.
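
Code sketch: a minimal RAG-over-SQLite loop using only the standard library. The toy character-histogram embedding and table schema are assumptions; the abstract does not describe Geolog-IA's actual encoder or schema at this level.

```python
# Minimal RAG-over-SQLite sketch: store thesis passages with embeddings,
# retrieve the most similar ones, and pass them to the LLM as context.
import sqlite3, json, math

def embed(text: str) -> list:
    # Toy embedding (normalized character histogram); a real system would use
    # a sentence encoder.
    v = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            v[ord(ch) - 97] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE passages (id INTEGER PRIMARY KEY, text TEXT, emb TEXT)")
for passage in ["Andesite samples from the Pichincha unit were classified...",
                "Thesis methodology: X-ray diffraction analysis of clay minerals..."]:
    con.execute("INSERT INTO passages (text, emb) VALUES (?, ?)",
                (passage, json.dumps(embed(passage))))

def retrieve(query: str, k: int = 1):
    q = embed(query)
    rows = con.execute("SELECT text, emb FROM passages").fetchall()
    scored = [(sum(a * b for a, b in zip(q, json.loads(e))), t) for t, e in rows]
    return [t for _, t in sorted(scored, reverse=True)[:k]]

context = retrieve("Which clay minerals were analyzed by X-ray diffraction?")
print("Context passed to the LLM:", context)
```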

[226] A Concept of Possibility for Real-World Events

Daniel G. Schwartz

Main category: cs.AI

TL;DR: A new concept of possibility as an alternative to Zadeh’s standard possibility theory, focusing specifically on real-world events and their prerequisites and constraints, with applications in planning problems.

DetailsMotivation: To provide an alternative to Zadeh's possibility theory that specifically addresses the possibility of real-world events, inspired by the original but formally different, using Łukasiewicz multivalent logic.

Method: Events are viewed as having prerequisites (enablers) and constraints (impediments), with possibility computed as a function of probabilities that prerequisites hold and constraints don’t. Applied to planning problems to determine most feasible plans.

Result: Developed a new possibility theory that can be used to determine which plan is most possible/feasible when multiple plans are available for achieving a goal, with potential applications in vehicle route planning.

Conclusion: The new possibility theory captures normal human reasoning about plans and has practical applications in planning problems, with potential for future applications beyond the illustrative vehicle route planning example.

Abstract: This paper offers a new concept of possibility as an alternative to the now-standard concept originally introduced by L.A. Zadeh in 1978. This new version was inspired by the original but, formally, has nothing in common with it other than that they both adopt the Łukasiewicz multivalent interpretation of the logical connectives. Moreover, rather than seeking to provide a general notion of possibility, this focuses specifically on the possibility of a real-world event. An event is viewed as having prerequisites that enable its occurrence and constraints that may impede its occurrence, and the possibility of the event is computed as a function of the probabilities that the prerequisites hold and the constraints do not. This version of possibility might appropriately be applied to problems of planning. When there are multiple plans available for achieving a goal, this theory can be used to determine which plan is most possible, i.e., easiest or most feasible to complete. It is speculated that this model of reasoning correctly captures normal human reasoning about plans. The theory is elaborated and an illustrative example for vehicle route planning is provided. There is also a suggestion of potential future applications.
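
Code sketch: one plausible instantiation of the proposal (the abstract does not give the exact formula, so the combination rule below is an assumption): probabilities that prerequisites hold and that constraints do not impede are combined with the Łukasiewicz conjunction T(a, b) = max(0, a + b - 1).

```python
# Assumed reading of the paper's idea: combine the probabilities that
# prerequisites hold and that constraints do NOT impede, using the
# Łukasiewicz conjunction T(a, b) = max(0, a + b - 1).
from functools import reduce

def luk_and(a: float, b: float) -> float:
    return max(0.0, a + b - 1.0)

def possibility(prereq_probs, constraint_probs):
    """prereq_probs: P(each prerequisite holds);
    constraint_probs: P(each constraint impedes the event)."""
    factors = list(prereq_probs) + [1.0 - p for p in constraint_probs]
    return reduce(luk_and, factors, 1.0)

# Vehicle-route flavor: road is open (0.95), fuel suffices (0.9);
# constraint: heavy traffic blocks the route with probability 0.2.
print(possibility([0.95, 0.9], [0.2]))  # ~0.65 under this reading
```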

[227] AutoMaAS: Self-Evolving Multi-Agent Architecture Search for Large Language Models

Bo Ma, Hang Li, ZeHua Hu, XiaoFan Gui, LuYao Liu, Simon Liu

Main category: cs.AI

TL;DR: AutoMaAS is a self-evolving multi-agent architecture search framework that automatically discovers optimal agent configurations using neural architecture search principles, achieving better performance with lower inference costs.

DetailsMotivation: Existing automated design approaches for multi-agent systems fail to adapt resource allocation based on query complexity and domain requirements, seeking monolithic solutions instead.

Method: Leverages neural architecture search principles with four innovations: automatic operator generation/fusion/elimination, dynamic cost-aware optimization, online feedback integration, and enhanced interpretability through decision tracing.

Result: Achieves 1.0-7.1% performance improvement while reducing inference costs by 3-5% across six benchmarks, with superior transferability across datasets and LLM backbones.

Conclusion: Establishes a new paradigm for automated multi-agent system design in the era of large language models through self-evolving architecture search.

Abstract: Multi-agent systems powered by large language models have demonstrated remarkable capabilities across diverse domains, yet existing automated design approaches seek monolithic solutions that fail to adapt resource allocation based on query complexity and domain requirements. This paper introduces AutoMaAS, a self-evolving multi-agent architecture search framework that leverages neural architecture search principles to automatically discover optimal agent configurations through dynamic operator lifecycle management and automated machine learning techniques. Our approach incorporates four key innovations: (1) automatic operator generation, fusion, and elimination based on performance-cost analysis, (2) dynamic cost-aware optimization with real-time parameter adjustment, (3) online feedback integration for continuous architecture refinement, and (4) enhanced interpretability through decision tracing mechanisms. Extensive experiments across six benchmarks demonstrate that AutoMaAS achieves 1.0-7.1% performance improvement while reducing inference costs by 3-5% compared to state-of-the-art methods. The framework shows superior transferability across datasets and LLM backbones, establishing a new paradigm for automated multi-agent system design in the era of large language models.
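
Code sketch: a heavily simplified reading of performance-cost operator elimination, one of the four lifecycle mechanisms named above. The scoring rule, weights, and thresholds below are invented for illustration; the paper's actual criteria are richer.

```python
# Hedged sketch of operator lifecycle management: candidate agent components
# ("operators") are kept or eliminated by a score trading off measured
# performance against inference cost.
def lifecycle_step(operators, perf, cost, lam=0.5, drop_below=0.2):
    """operators: list of names; perf/cost: dicts of running averages in [0, 1]."""
    scored = {op: perf[op] - lam * cost[op] for op in operators}
    kept = [op for op in operators if scored[op] >= drop_below]
    eliminated = [op for op in operators if op not in kept]
    return kept, eliminated, scored

kept, dropped, scores = lifecycle_step(
    ["planner", "critic", "web_search"],
    perf={"planner": 0.8, "critic": 0.5, "web_search": 0.4},
    cost={"planner": 0.3, "critic": 0.2, "web_search": 0.6},
)
print(kept, dropped, scores)  # web_search falls below the keep threshold
```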

[228] ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks

Zhaorun Chen, Xun Liu, Mintong Kang, Jiawei Zhang, Minzhou Pan, Shuang Yang, Bo Li

Main category: cs.AI

TL;DR: ARMs is an adaptive red-teaming agent that systematically conducts comprehensive risk assessments for vision-language models (VLMs) by automatically optimizing diverse red-teaming strategies with reasoning-enhanced multi-step orchestration to effectively elicit harmful outputs.

DetailsMotivation: Existing red-teaming efforts for VLMs are either restricted to narrow adversarial patterns or depend heavily on manual engineering, lacking scalable exploration of emerging real-world VLM vulnerabilities.

Method: Proposed 11 novel multimodal attack strategies covering diverse adversarial patterns, integrated 17 red-teaming algorithms via model context protocol (MCP), and designed layered memory with epsilon-greedy attack exploration algorithm to balance diversity and effectiveness.

Result: Achieved SOTA attack success rates, exceeding baselines by an average of 52.1% and surpassing 90% on Claude-4-Sonnet. Generated significantly more diverse red-teaming instances, revealing emerging vulnerabilities. Constructed ARMs-Bench with over 30K red-teaming instances spanning 51 risk categories.

Conclusion: Safety fine-tuning with ARMs-Bench substantially improves VLM robustness while preserving general utility, providing actionable guidance to improve multimodal safety alignment against emerging threats.

Abstract: As vision-language models (VLMs) gain prominence, their multimodal interfaces also introduce new safety vulnerabilities, making the safety evaluation challenging and critical. Existing red-teaming efforts are either restricted to a narrow set of adversarial patterns or depend heavily on manual engineering, lacking scalable exploration of emerging real-world VLM vulnerabilities. To bridge this gap, we propose ARMs, an adaptive red-teaming agent that systematically conducts comprehensive risk assessments for VLMs. Given a target harmful behavior or risk definition, ARMs automatically optimizes diverse red-teaming strategies with reasoning-enhanced multi-step orchestration, to effectively elicit harmful outputs from target VLMs. We propose 11 novel multimodal attack strategies, covering diverse adversarial patterns of VLMs (e.g., reasoning hijacking, contextual cloaking), and integrate 17 red-teaming algorithms into ARMs via model context protocol (MCP). To balance the diversity and effectiveness of the attack, we design a layered memory with an epsilon-greedy attack exploration algorithm. Extensive experiments on instance- and policy-based benchmarks show that ARMs achieves SOTA attack success rates, exceeding baselines by an average of 52.1% and surpassing 90% on Claude-4-Sonnet. We show that the diversity of red-teaming instances generated by ARMs is significantly higher, revealing emerging vulnerabilities in VLMs. Leveraging ARMs, we construct ARMs-Bench, a large-scale multimodal safety dataset comprising over 30K red-teaming instances spanning 51 diverse risk categories, grounded in both real-world multimodal threats and regulatory risks. Safety fine-tuning with ARMs-Bench substantially improves the robustness of VLMs while preserving their general utility, providing actionable guidance to improve multimodal safety alignment against emerging threats.
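
Code sketch: the epsilon-greedy strategy-exploration loop in its simplest form. Strategy names and the flat success-rate memory are illustrative; ARMs' layered memory is richer than this version.

```python
# Sketch of epsilon-greedy exploration over attack strategies: with
# probability epsilon try a random strategy, otherwise exploit the
# empirically best one; per-strategy success rates are updated after
# each attempt.
import random

class EpsilonGreedyRedTeamer:
    def __init__(self, strategies, epsilon=0.2, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.successes = {s: 0 for s in strategies}
        self.trials = {s: 0 for s in strategies}

    def pick(self) -> str:
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.trials))  # explore
        def rate(s):  # optimistic 1.0 for untried strategies
            return self.successes[s] / self.trials[s] if self.trials[s] else 1.0
        return max(self.trials, key=rate)  # exploit best observed success rate

    def update(self, strategy: str, succeeded: bool):
        self.trials[strategy] += 1
        self.successes[strategy] += int(succeeded)

rt = EpsilonGreedyRedTeamer(["reasoning_hijacking", "contextual_cloaking", "typo_noise"])
for _ in range(50):
    s = rt.pick()
    rt.update(s, succeeded=rt.rng.random() < 0.3)  # stand-in for a real attack attempt
print(rt.successes, rt.trials)
```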

[229] Automated Constraint Specification for Job Scheduling by Regulating Generative Model with Domain-Specific Representation

Yu-Zhe Shi, Qiao Xu, Yanjia Li, Mingchen Liu, Huamin Qu, Lecheng Ruan, Qining Wang

Main category: cs.AI

TL;DR: A constraint-centric architecture that regulates LLMs for reliable automated constraint specification in production scheduling, outperforming pure LLM approaches.

DetailsMotivation: Manual constraint specification for manufacturing scheduling is labor-intensive, and direct LLM application faces challenges with ambiguity, non-determinism, and limited domain knowledge.

Method: Hierarchical structural space across three levels with domain-specific representation, plus automated production scenario adaptation algorithm for manufacturing customization.

Result: Successfully balances LLM generative capabilities with manufacturing reliability requirements, significantly outperforming pure LLM-based approaches.

Conclusion: The proposed constraint-centric architecture enables reliable automated constraint specification for production scheduling while maintaining flexibility.

Abstract: Advanced Planning and Scheduling (APS) systems have become indispensable for modern manufacturing operations, enabling optimized resource allocation and production efficiency in increasingly complex and dynamic environments. While algorithms for solving abstracted scheduling problems have been extensively investigated, the critical prerequisite of specifying manufacturing requirements into formal constraints remains manual and labor-intensive. Although recent advances of generative models, particularly Large Language Models (LLMs), show promise in automating constraint specification from heterogeneous raw manufacturing data, their direct application faces challenges due to natural language ambiguity, non-deterministic outputs, and limited domain-specific knowledge. This paper presents a constraint-centric architecture that regulates LLMs to perform reliable automated constraint specification for production scheduling. The architecture defines a hierarchical structural space organized across three levels, implemented through domain-specific representation to ensure precision and reliability while maintaining flexibility. Furthermore, an automated production scenario adaptation algorithm is designed and deployed to efficiently customize the architecture for specific manufacturing configurations. Experimental results demonstrate that the proposed approach successfully balances the generative capabilities of LLMs with the reliability requirements of manufacturing systems, significantly outperforming pure LLM-based approaches in constraint specification tasks.

[230] NCV: A Node-Wise Consistency Verification Approach for Low-Cost Structured Error Localization in LLM Reasoning

Yulong Zhang, Li Wang, Wei Du, Peilin Li, Yuqin Dai, Zhiyuan Zhao, Lingyong Fang, Ziniu Liu, Ru Zhang, Huijia Zhu, Gongshen Liu

Main category: cs.AI

TL;DR: Node-wise Consistency Verification (NCV) is a training-free framework that improves LLM reasoning verification through lightweight binary consistency checks at node level, achieving better accuracy with significantly fewer tokens.

DetailsMotivation: Existing methods for verifying multi-step reasoning in LLMs suffer from imprecise error localization, high token costs, attention dilution, and expensive multi-sampling requirements.

Method: NCV decomposes chain of thought into interconnected verification nodes and performs lightweight binary consistency checks at each node, avoiding long-form generation.

Result: NCV achieves 10-25% improvement in F1 scores over baselines while using 6x-58x fewer tokens than traditional CoT-based verifiers on public datasets.

Conclusion: NCV provides a scalable solution for reliable LLM reasoning verification with enhanced interpretability and efficiency through node-level consistency checking.

Abstract: Verifying multi-step reasoning in large language models is difficult due to imprecise error localization and high token costs. Existing methods either assess entire reasoning chains, suffering attention dilution, or rely on expensive multi-sampling. We introduce Node-wise Consistency Verification (NCV), a training-free framework that recasts verification as lightweight binary consistency checks at the node level. By decomposing the chain of thought into interconnected verification nodes, NCV precisely localizes errors and avoids unnecessary long-form generation. Experiments demonstrate that our approach enhances interpretability and efficiency, presenting a scalable solution for reliable LLM reasoning verification. On public datasets, NCV achieves a 10% to 25% improvement in F1 scores over baselines while utilizing $6\times$ to $58\times$ fewer tokens than traditional methods like CoT-based verifiers.
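
Code sketch: node-wise verification as lightweight binary checks, with a stubbed verifier call. Splitting on sentence boundaries is an assumption; the paper decomposes the chain into interconnected nodes rather than by punctuation.

```python
# Sketch of node-wise consistency verification: split a chain of thought into
# nodes, then ask a verifier one yes/no consistency question per node instead
# of judging the whole chain at once. `check_node` stands in for a lightweight
# binary LLM call.
def check_node(question: str, prior_nodes: list, node: str) -> bool:
    # Placeholder: a real verifier prompts the LLM with only this node and the
    # context it depends on, then parses a yes/no answer.
    return "never" not in node

def verify_chain(question: str, chain_of_thought: str):
    nodes = [s.strip() for s in chain_of_thought.split(".") if s.strip()]
    for i, node in enumerate(nodes):
        if not check_node(question, nodes[:i], node):
            return {"valid": False, "first_error_node": i, "node_text": node}
    return {"valid": True}

print(verify_chain(
    "Is 17 prime?",
    "17 is odd. 17 is not divisible by 3, 5, or 7. Therefore 17 is prime.",
))
```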

[231] Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee, Chanyoung Park

Main category: cs.AI

TL;DR: TRACE is a framework for multi-dimensional evaluation of tool-augmented LLM agents that assesses problem-solving trajectories beyond just final answers, addressing limitations of current evaluation methods.

DetailsMotivation: Current tool-augmented benchmarks rely primarily on answer matching, but as tasks become more complex with multiple steps, evaluation needs to assess the entire problem-solving trajectory including efficiency, hallucination, and adaptivity.

Method: TRACE incorporates an evidence bank that accumulates knowledge from preceding reasoning steps, enabling multi-faceted analysis of agent trajectories without requiring expensive ground-truth annotations.

Result: The framework was validated using a meta-evaluation dataset with diverse and flawed trajectories, showing that TRACE accurately evaluates complex behaviors in a scalable and cost-effective manner even with small open-source LLMs.

Conclusion: TRACE provides an effective solution for comprehensive evaluation of tool-augmented LLM agents, revealing previously unreported observations about agent performance and enabling better assessment of reasoning trajectories.

Abstract: Although recent tool-augmented benchmarks incorporate complex user requests and diverse tools, the evaluation methods for most of them remain limited to answer matching. However, as the number of steps required to resolve a user request increases, a proper evaluation of an agent’s performance must go beyond the final answer to also assess the problem-solving trajectory, including previously ignored aspects such as efficiency, hallucination, and adaptivity. The most straightforward method for evaluating these aspects is to compare an agent’s trajectory with the ground-truth trajectory, but this approach is fundamentally limited since annotating all valid ground-truth trajectories is prohibitively expensive. However, a simple LLM-based evaluator struggles to assess trajectories in detail without ground truth. To effectively evaluate the agents in this manner, we introduce TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. By incorporating an evidence bank, which accumulates knowledge gathered from preceding reasoning steps, TRACE enables a multi-faceted analysis and evaluation of an agent’s reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset by augmenting existing benchmarks with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner, even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.

[232] Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization

Antoine Maier, Aude Maier, Tom David

Main category: cs.AI

TL;DR: The paper challenges the Objective Satisfaction Assumption (OSA) in machine learning, arguing that training never perfectly achieves specified objectives due to approximation, estimation, and optimization errors, plus inevitable objective misspecification.

DetailsMotivation: To examine the rarely questioned assumption that training yields models satisfying their specified objectives, and to highlight the risks when this assumption fails under strong optimization pressure.

Method: Uses a learning-paradigm-agnostic framework to analyze systematic deviations from intended objectives, building on recent mathematical results about Goodhart’s law failure modes.

Result: Shows that OSA fails in realistic conditions due to technical limitations and practical impossibility of perfectly capturing developer intent, making objective misspecification inevitable.

Conclusion: A principled limit on optimization of General-Purpose AI systems is necessary to prevent predictable and irreversible loss of control when systems are pushed beyond their Goodhart breaking point.

Abstract: A common but rarely examined assumption in machine learning is that training yields models that actually satisfy their specified objective function. We call this the Objective Satisfaction Assumption (OSA). Although deviations from OSA are acknowledged, their implications are overlooked. We argue, in a learning-paradigm-agnostic framework, that OSA fails in realistic conditions: approximation, estimation, and optimization errors guarantee systematic deviations from the intended objective, regardless of the quality of its specification. Beyond these technical limitations, perfectly capturing and translating the developer’s intent, such as alignment with human preferences, into a formal objective is practically impossible, making misspecification inevitable. Building on recent mathematical results, we argue that, absent a mathematical characterization of these gaps, such deviations are indistinguishable from those that collapse into Goodhart’s law failure modes under strong optimization pressure. Because the Goodhart breaking point cannot be located ex ante, a principled limit on the optimization of General-Purpose AI systems is necessary. Absent such a limit, continued optimization is liable to push systems into predictable and irreversible loss of control.

[233] Reward Model Routing in Alignment

Xinle Wu, Yao Lu

Main category: cs.AI

TL;DR: BayesianRouter is a hybrid framework for reward model routing that combines offline learning of RM strengths with online Bayesian selection to improve alignment quality in RLHF/RLAIF pipelines.

DetailsMotivation: Current RLHF/RLAIF pipelines rely on single reward models, limiting alignment quality and risking overfitting. Existing RM routing methods suffer from cold-start and insufficient exploration issues.

Method: Two-stage approach: offline training of multi-task router on preference data to estimate RM reliability, followed by online Bayesian Thompson sampling router that performs per-query RM selection using Gaussian priors from offline embeddings and adaptively updates posteriors with online rewards.

Result: Extensive experiments on instruction-following (AlpacaEval-2, Arena-Hard, MT-Bench) and reasoning (GSM8K, MMLU) benchmarks show BayesianRouter consistently outperforms individual RMs, RM ensembling, and existing routing methods.

Conclusion: BayesianRouter effectively addresses cold-start and exploration problems in RM routing, providing a robust solution for improving alignment quality in LLM training pipelines.

Abstract: Reinforcement learning from human or AI feedback (RLHF / RLAIF) has become the standard paradigm for aligning large language models (LLMs). However, most pipelines rely on a single reward model (RM), limiting alignment quality and risking overfitting. Recent work explores RM routing–dynamically selecting an RM from a candidate pool to exploit complementary strengths while maintaining $O(1)$ RM calls–but existing methods suffer from cold-start and insufficient exploration. We propose BayesianRouter, a hybrid routing framework that combines offline RM strengths learning with online Bayesian selection. In the offline stage, a multi-task router is trained on preference data to estimate per-RM reliability. In the online stage, a Bayesian Thompson sampling router performs per-query RM selection, initializing RM-specific weight vectors with offline embeddings as Gaussian priors and adaptively updating their posteriors with online rewards to adapt to the evolving policy distribution. Extensive experiments on instruction-following (AlpacaEval-2, Arena-Hard, MT-Bench) and reasoning (GSM8K, MMLU) benchmarks show that BayesianRouter consistently outperforms individual RMs, RM ensembling, and existing routing methods.
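
Code sketch: the online stage as linear Thompson sampling. The abstract specifies Gaussian priors from offline embeddings and posterior updates from online rewards; the Bayesian linear-regression form below is an assumption. Requires numpy.

```python
# Per-RM Bayesian linear models with Thompson sampling: for a query embedding x,
# sample weights from each RM's Gaussian posterior, route to the RM with the
# highest sampled score, and update that RM's posterior with the observed reward.
import numpy as np

class BayesianRouter:
    def __init__(self, num_rms, dim, prior_means, noise=1.0):
        # prior_means: offline embeddings used as Gaussian prior means (the
        # paper's offline stage); identity prior covariance assumed here.
        self.A = [np.eye(dim) for _ in range(num_rms)]             # precision
        self.b = [prior_means[i].copy() for i in range(num_rms)]  # A @ mean
        self.noise = noise

    def route(self, x, rng):
        scores = []
        for A, b in zip(self.A, self.b):
            cov = np.linalg.inv(A)
            w = rng.multivariate_normal(cov @ b, self.noise * cov)  # posterior sample
            scores.append(w @ x)
        return int(np.argmax(scores))

    def update(self, rm, x, reward):
        self.A[rm] += np.outer(x, x)
        self.b[rm] += reward * x

rng = np.random.default_rng(0)
dim, num_rms = 8, 3
router = BayesianRouter(num_rms, dim, [rng.normal(size=dim) for _ in range(num_rms)])
for _ in range(100):
    x = rng.normal(size=dim)
    rm = router.route(x, rng)
    router.update(rm, x, reward=float(x[0] > 0) if rm == 0 else 0.5)  # toy reward
print("posterior mean of RM 0:", np.linalg.inv(router.A[0]) @ router.b[0])
```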

[234] Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models

Tianren Ma, Mu Zhang, Yibing Wang, Qixiang Ye

Main category: cs.AI

TL;DR: MaskGRPO is the first viable approach for scalable multimodal reinforcement learning in discrete diffusion models, addressing challenges with importance sampling and rollout complexity through theoretical foundations and modality-specific adaptations.

DetailsMotivation: Optimizing discrete diffusion models with rewards is challenging due to the non-autoregressive paradigm making importance sampling intractable and rollout complex, which puzzles reinforcement learning methods like GRPO.

Method: Developed theoretical foundation for DDMs to build importance estimator capturing token fluctuation for gradient updates, and tailored rollout method for visual sequences to yield diverse completions and reliable optimization gradients.

Result: MaskGRPO brings more stable and efficient updates across math reasoning, coding, and visual generation benchmarks, leading to stronger reasoning performance and better generation quality.

Conclusion: MaskGRPO establishes itself as a systematic policy optimization approach and the first practical way for discretized visual diffusion.

Abstract: Optimizing discrete diffusion models (DDMs) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollout complex, puzzling reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation for DDMs, which facilitates building an importance estimator that captures valuable token fluctuation for gradient updates. We then carefully tailor the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. On math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, leading to stronger reasoning performance and better generation quality. This study establishes MaskGRPO as a systematic policy optimization approach and the first practical way for discretized visual diffusion.

[235] Onto-Epistemological Analysis of AI Explanations

Martina Mattioli, Eike Petersen, Aasa Feragen, Marcello Pelillo, Siavash A. Bigdeli

Main category: cs.AI

TL;DR: The paper examines how different explainable AI (XAI) methods incorporate varying ontological and epistemological assumptions about explanations, and how these assumptions affect the validity and interpretation of AI explanations across different domains.

DetailsMotivation: Current deep learning methods are black-box systems that lack explanations, limiting their trustworthiness and adoption. XAI methods aim to provide explanations but are developed with technical assumptions that may not align with philosophical understandings of what constitutes an explanation.

Method: The authors investigate ontological and epistemological assumptions in XAI methods applied to AI systems, analyzing how technical changes in XAI methods correspond to different underlying assumptions about explanations.

Result: The analysis reveals that seemingly minor technical changes in XAI methods can represent significant differences in fundamental assumptions about the existence and knowability of explanations, with important consequences for validity and interpretation.

Conclusion: Ignoring the underlying onto-epistemological paradigm when selecting XAI methods poses risks, and researchers should carefully select and adapt XAI methods based on the specific domain requirements and philosophical considerations.

Abstract: Artificial intelligence (AI) is being applied in almost every field. At the same time, the currently dominant deep learning methods are fundamentally black-box systems that lack explanations for their inferences, significantly limiting their trustworthiness and adoption. Explainable AI (XAI) methods aim to overcome this challenge by providing explanations of the models’ decision process. Such methods are often proposed and developed by engineers and scientists with a predominantly technical background and incorporate their assumptions about the existence, validity, and explanatory utility of different conceivable explanatory mechanisms. However, the basic concept of an explanation – what it is, whether we can know it, whether it is absolute or relative – is far from trivial and has been the subject of deep philosophical debate for millennia. As we point out here, the assumptions incorporated into different XAI methods are not harmless and have important consequences for the validity and interpretation of AI explanations in different domains. We investigate ontological and epistemological assumptions in explainability methods when they are applied to AI systems, meaning the assumptions we make about the existence of explanations and our ability to gain knowledge about those explanations. Our analysis shows how seemingly small technical changes to an XAI method may correspond to important differences in the underlying assumptions about explanations. We furthermore highlight the risks of ignoring the underlying onto-epistemological paradigm when choosing an XAI method for a given application, and we discuss how to select and adapt appropriate XAI methods for different domains of application.

[236] From Facts to Foils: Designing and Evaluating Counterfactual Explanations for Smart Environments

Anna Trapp, Mersedeh Sadeghi, Andreas Vogelsang

Main category: cs.AI

TL;DR: First formalization and implementation of counterfactual explanations for rule-based smart environments, showing user preference depends on context - causal explanations for simplicity and time pressure, counterfactuals for actionable problem-solving.

DetailsMotivation: Counterfactual explanations are powerful in XAI but no established methods exist for generating them in rule-based smart environments, creating a gap in explainability for these systems.

Method: Developed a formalization and implementation of counterfactual explanations as a plugin extending an existing explanation engine, then conducted a user study (N=17) comparing counterfactuals against traditional causal explanations.

Result: User preference is highly contextual - causal explanations preferred for linguistic simplicity and time-pressured situations, while counterfactuals preferred for actionable content when users want to resolve problems.

Conclusion: Provides a practical framework for counterfactual explanations in smart environments and empirical evidence guiding when each explanation type is most effective.

Abstract: Explainability is increasingly seen as an essential feature of rule-based smart environments. While counterfactual explanations, which describe what could have been done differently to achieve a desired outcome, are a powerful tool in eXplainable AI (XAI), no established methods exist for generating them in these rule-based domains. In this paper, we present the first formalization and implementation of counterfactual explanations tailored to this domain. It is implemented as a plugin that extends an existing explanation engine for smart environments. We conducted a user study (N=17) to evaluate our generated counterfactuals against traditional causal explanations. The results show that user preference is highly contextual: causal explanations are favored for their linguistic simplicity and in time-pressured situations, while counterfactuals are preferred for their actionable content, particularly when a user wants to resolve a problem. Our work contributes a practical framework for a new type of explanation in smart environments and provides empirical evidence to guide the choice of when each explanation type is most effective.
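
Code sketch: counterfactual search over a toy rule base: find the smallest set of fact flips under which the desired rule would have fired. The rules and facts are invented for illustration; the paper's formalization covers a richer rule language.

```python
# Counterfactual generation for a rule-based smart environment: search for a
# minimal change to the current facts that makes the desired rule fire.
from itertools import combinations

RULES = {
    "lights_on": lambda f: f["motion_detected"] and not f["vacation_mode"],
    "heating_on": lambda f: f["someone_home"] and f["temp_c"] < 19,
}

def counterfactual(facts, desired_rule, flippable):
    """Return the smallest set of boolean fact flips that makes the rule fire."""
    if RULES[desired_rule](facts):
        return []  # already fires; nothing to change
    for size in range(1, len(flippable) + 1):
        for combo in combinations(flippable, size):
            changed = dict(facts, **{k: not facts[k] for k in combo})
            if RULES[desired_rule](changed):
                return list(combo)
    return None  # no counterfactual within the flippable facts

facts = {"motion_detected": True, "vacation_mode": True,
         "someone_home": False, "temp_c": 17}
print(counterfactual(facts, "lights_on", ["motion_detected", "vacation_mode"]))
# -> ['vacation_mode']: "the lights would have turned on if vacation mode were off"
```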

[237] A Study of Rule Omission in Raven’s Progressive Matrices

Binze Li

Main category: cs.AI

TL;DR: This study evaluates AI models’ abstract reasoning capabilities on Raven’s Progressive Matrices by testing generalization to unseen structural rules, revealing limitations in current approaches.

DetailsMotivation: To determine whether AI models demonstrate genuine reasoning ability or rely on statistical shortcuts, particularly when faced with incomplete training data where structural rules are deliberately omitted.

Method: Evaluated sequence-to-sequence transformers and vision-based architectures (CoPINet, Dual-Contrast Network) on the Impartial-RAVEN dataset with deliberately omitted structural rules during training.

Result: Transformers showed strong performance on familiar rules but sharp accuracy decline on novel/omitted rules. Token-level vs complete answer accuracy gap revealed fundamental limitations.

Conclusion: Current AI models lack robust abstract reasoning capabilities and primarily rely on pattern recognition, highlighting the need for architectures that move beyond statistical shortcuts to genuine reasoning.

Abstract: Analogical reasoning lies at the core of human cognition and remains a fundamental challenge for artificial intelligence. Raven’s Progressive Matrices (RPM) serve as a widely used benchmark to assess abstract reasoning by requiring the inference of underlying structural rules. While many vision-based and language-based models have achieved success on RPM tasks, it remains unclear whether their performance reflects genuine reasoning ability or reliance on statistical shortcuts. This study investigates the generalization capacity of modern AI systems under conditions of incomplete training by deliberately omitting several structural rules during training. Both sequence-to-sequence transformer models and vision-based architectures such as CoPINet and the Dual-Contrast Network are evaluated on the Impartial-RAVEN (I-RAVEN) dataset. Experiments reveal that although transformers demonstrate strong performance on familiar rules, their accuracy declines sharply when faced with novel or omitted rules. Moreover, the gap between token-level accuracy and complete answer accuracy highlights fundamental limitations in current approaches. These findings provide new insights into the reasoning mechanisms underlying deep learning models and underscore the need for architectures that move beyond pattern recognition toward robust abstract reasoning.

[238] CoDA: Agentic Systems for Collaborative Data Visualization

Zichen Chen, Jiefeng Chen, Sercan Ö. Arik, Misha Sra, Tomas Pfister, Jinsung Yoon

Main category: cs.AI

TL;DR: CoDA is a multi-agent system that automates data visualization from natural language queries by using specialized LLM agents for metadata analysis, task planning, code generation, and self-reflection, achieving 41.5% improvement over baselines.

DetailsMotivation: Current visualization automation systems struggle with complex multi-file datasets and iterative refinement, often oversimplifying the task and failing to handle data complexity, code errors, or visualization quality.

Method: CoDA employs a collaborative multi-agent approach with specialized LLM agents for metadata analysis (bypassing token limits), task planning, code generation, and self-reflection through quality-driven refinement.

Result: Extensive evaluations show CoDA achieves substantial gains in overall score, outperforming competitive baselines by up to 41.5%.

Conclusion: The future of visualization automation lies in integrated, collaborative agentic workflows rather than isolated code generation.

Abstract: Deep research has revolutionized data analysis, yet data scientists still devote substantial time to manually crafting visualizations, highlighting the need for robust automation from natural language queries. However, current systems struggle with complex datasets containing multiple files and iterative refinement. Existing approaches, including simple single- or multi-agent systems, often oversimplify the task, focusing on initial query parsing while failing to robustly manage data complexity, code errors, or final visualization quality. In this paper, we reframe this challenge as a collaborative multi-agent problem. We introduce CoDA, a multi-agent system that employs specialized LLM agents for metadata analysis, task planning, code generation, and self-reflection. We formalize this pipeline, demonstrating how metadata-focused analysis bypasses token limits and quality-driven refinement ensures robustness. Extensive evaluations show CoDA achieves substantial gains in the overall score, outperforming competitive baselines by up to 41.5%. This work demonstrates that the future of visualization automation lies not in isolated code generation but in integrated, collaborative agentic workflows.

[239] Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

Cai Zhou, Chenxiao Yang, Yi Hu, Chenyu Wang, Chubin Zhang, Muhan Zhang, Lester Mackey, Tommi Jaakkola, Stephen Bates, Dinghuai Zhang

Main category: cs.AI

TL;DR: Continuous diffusion models have stronger theoretical expressivity than discrete diffusions but underperform in practice due to trainability issues. CCDD proposes a joint multimodal diffusion process combining continuous representation space and discrete token space to achieve both expressiveness and good trainability.

DetailsMotivation: To address the contradiction between theoretical expressiveness and empirical performance of continuous diffusion models, and to leverage the advantages of both continuous and discrete approaches for language modeling.

Method: Propose Coevolutionary Continuous Discrete Diffusion (CCDD) - a joint multimodal diffusion process on the union of continuous representation space and discrete token space, using a single model to simultaneously denoise in both spaces.

Result: CCDD achieves strong empirical performance in extensive language modeling experiments on real-world tasks, demonstrating both expressiveness with rich semantics and good trainability with explicit discrete tokens.

Conclusion: CCDD successfully bridges the gap between theoretical expressivity and practical performance by combining continuous and discrete modalities, providing a promising approach for diffusion language models.

Abstract: Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and preliminary empirical results showing the advantages of latent reasoning with looped transformers or continuous chain-of-thoughts, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to be in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers. We attribute the contradiction between theoretical expressiveness and empirical performance to practical trainability: while continuous diffusion provides the intermediate supervision that looped transformers lack, it introduces additional difficulty in decoding tokens from the continuous representation space back into the discrete token space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining two modalities, CCDD is expressive with rich semantics in the latent space, as well as good trainability and sample quality with the help of explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which yield strong empirical performance in extensive language modeling experiments on real-world tasks.

[240] Improved Monte Carlo Planning via Causal Disentanglement for Structurally-Decomposed Markov Decision Processes

Larkin Liu, Shiqi Liu, Yinruo Hua, Matej Jusup

Main category: cs.AI

TL;DR: Introduces Structurally Decomposed MDP (SD-MDP) that leverages causal disentanglement for resource allocation problems, achieving O(T log T) complexity and computational efficiency independent of state-action space size.

DetailsMotivation: Traditional MDPs overlook causal structure benefits. For resource allocation problems, incorporating causal disentanglement can enable dimensionality reduction and computational efficiency gains.

Method: SD-MDP partitions MDP’s temporal causal graph into independent components using causal disentanglement. Reduces sequential optimization to fractional knapsack problem and integrates with Monte Carlo Tree Search.

Result: Achieves log-linear complexity O(T log T), outperforming traditional stochastic programming methods. Computational advantages are independent of state-action space size. Higher expected rewards under constrained simulation budgets with vanishing simple regret bound.

Conclusion: SD-MDP demonstrates superior policy performance over benchmarks in logistics and finance domains, providing an efficient framework for resource allocation problems through causal structure exploitation.

Abstract: Markov Decision Processes (MDPs), as a general-purpose framework, often overlook the benefits of incorporating the causal structure of the transition and reward dynamics. For a subclass of resource allocation problems, we introduce the Structurally Decomposed MDP (SD-MDP), which leverages causal disentanglement to partition an MDP’s temporal causal graph into independent components. By exploiting this disentanglement, SD-MDP enables dimensionality reduction and computational efficiency gains in optimal value function estimation. We reduce the sequential optimization problem to a fractional knapsack problem with log-linear complexity $O(T \log T)$, outperforming traditional stochastic programming methods that exhibit polynomial complexity with respect to the time horizon $T$. Additionally, SD-MDP’s computational advantages are independent of state-action space size, making it viable for high-dimensional spaces. Furthermore, our approach integrates seamlessly with Monte Carlo Tree Search (MCTS), achieving higher expected rewards under constrained simulation budgets while providing a vanishing simple regret bound. Empirical results demonstrate superior policy performance over benchmarks across various logistics and finance domains.
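
Code sketch: the greedy fractional-knapsack step that the complexity claim rests on; sorting by value density dominates the cost at O(n log n). The items below are generic (value, weight) pairs standing in for the paper's per-period allocation decisions.

```python
# Greedy fractional knapsack: sort options by value density and fill the
# budget, taking a fraction of the last item if needed.
def fractional_knapsack(items, capacity):
    """items: list of (value, weight); returns max value with fractional picks."""
    total = 0.0
    for value, weight in sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True):
        take = min(weight, capacity)
        total += value * (take / weight)
        capacity -= take
        if capacity <= 0:
            break
    return total

# Three candidate allocations (value, resource cost) and a budget of 10 units.
print(fractional_knapsack([(60, 10), (100, 20), (120, 30)], capacity=10))  # 60.0
```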

[241] OML: A Primitive for Reconciling Open Access with Owner Control in AI Model Distribution

Zerui Cheng, Edoardo Contente, Ben Finch, Oleg Golev, Jonathan Hayase, Andrew Miller, Niusha Moshrefi, Anshul Nasery, Sandeep Nailwal, Sewoong Oh, Himanshu Tyagi, Pramod Viswanath

Main category: cs.AI

TL;DR: OML enables AI models to be freely distributed for local execution while maintaining cryptographically enforced usage authorization, bridging the gap between closed API models and open distribution.

DetailsMotivation: Current AI model distribution faces a dichotomy: closed API models lack transparency and local execution, while open models sacrifice monetization and control. OML aims to resolve this conflict.

Method: OML 1.0 uses AI-native model fingerprinting combined with crypto-economic enforcement mechanisms to achieve white-box model protection with model extraction resistance and permission forgery resistance.

Result: The paper establishes OML as a foundational primitive with rigorous security definitions, proves fundamental bounds on achievable properties, and characterizes the complete design space of potential constructions.

Conclusion: OML represents a new research direction at the intersection of cryptography, machine learning, and mechanism design, with critical implications for sustainable AI ecosystems and future AI distribution governance.

Abstract: The current paradigm of AI model distribution presents a fundamental dichotomy: models are either closed and API-gated, sacrificing transparency and local execution, or openly distributed, sacrificing monetization and control. We introduce OML (Open-access, Monetizable, and Loyal AI Model Serving), a primitive that enables a new distribution paradigm where models can be freely distributed for local execution while maintaining cryptographically enforced usage authorization. We are the first to introduce and formalize this problem, introducing rigorous security definitions tailored to the unique challenge of white-box model protection: model extraction resistance and permission forgery resistance. We prove fundamental bounds on the achievability of OML properties and characterize the complete design space of potential constructions, from obfuscation-based approaches to cryptographic solutions. To demonstrate practical feasibility, we present OML 1.0, a novel OML construction leveraging AI-native model fingerprinting coupled with crypto-economic enforcement mechanisms. Through extensive theoretical analysis and empirical evaluation, we establish OML as a foundational primitive necessary for sustainable AI ecosystems. This work opens a new research direction at the intersection of cryptography, machine learning, and mechanism design, with critical implications for the future of AI distribution and governance.

[242] ViLBias: Detecting and Reasoning about Bias in Multimodal Content

Shaina Raza, Caesar Saleh, Azib Farooq, Emrul Hasan, Franklin Ogidi, Maximus Powers, Veronica Chatrath, Marcelo Lotif, Karanpal Sekhon, Roya Javadi, Haad Zahid, Anam Zahid, Vahid Reza Khazaie, Zhenyu Yu

Main category: cs.AI

TL;DR: ViLBias is a multimodal benchmark for detecting bias in news using text-image pairs, showing that combining images with text improves bias detection by 3-5% and that parameter-efficient methods achieve near-full fine-tuning performance with minimal trainable parameters.

DetailsMotivation: Current bias detection models primarily focus on text classification, but real-world news often contains both text and images that can convey bias through framing and inconsistencies between modalities.

Method: Created a VQA-style benchmark with 40,945 text-image pairs from diverse news outlets, annotated using a two-stage LLM-as-annotator pipeline with hierarchical majority voting and human validation. Evaluated SLMs, LLMs, and VLMs on classification and open-ended reasoning tasks.

Result: Image incorporation improved detection accuracy by 3-5%. LLMs/VLMs better captured subtle framing and text-image inconsistencies than SLMs. Parameter-efficient methods recovered 97-99% of full fine-tuning performance with <5% trainable parameters. Reasoning accuracy spanned 52-79% and faithfulness 68-89%.

Conclusion: ViLBias provides a scalable benchmark for multimodal bias detection, demonstrating the importance of combining visual and textual information and the effectiveness of parameter-efficient tuning methods for this task.

Abstract: Detecting bias in multimodal news requires models that reason over text–image pairs, not just classify text. In response, we present ViLBias, a VQA-style benchmark and framework for detecting and reasoning about bias in multimodal news. The dataset comprises 40,945 text–image pairs from diverse outlets, each annotated with a bias label and concise rationale using a two-stage LLM-as-annotator pipeline with hierarchical majority voting and human-in-the-loop validation. We evaluate Small Language Models (SLMs), Large Language Models (LLMs), and Vision–Language Models (VLMs) across closed-ended classification and open-ended reasoning (oVQA), and compare parameter-efficient tuning strategies. Results show that incorporating images alongside text improves detection accuracy by 3–5%, and that LLMs/VLMs better capture subtle framing and text–image inconsistencies than SLMs. Parameter-efficient methods (LoRA/QLoRA/Adapters) recover 97–99% of full fine-tuning performance with $<5\%$ trainable parameters. For oVQA, reasoning accuracy spans 52–79% and faithfulness 68–89%, both improved by instruction tuning; closed accuracy correlates strongly with reasoning ($r = 0.91$). ViLBias offers a scalable benchmark and strong baselines for multimodal bias detection and rationale quality.

[243] Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning

Ram Ramrakhya, Matthew Chang, Xavier Puig, Ruta Desai, Zsolt Kira, Roozbeh Mottaghi

Main category: cs.AI

TL;DR: The paper introduces Ask-to-Act task where embodied agents ask clarification questions to resolve ambiguity in household instructions, and proposes RL-finetuned MLLMs that outperform baselines by 10.4-16.5%.

DetailsMotivation: Household robots need to interpret ambiguous human instructions and ask relevant clarification questions to accurately infer user intent for effective task execution.

Method: Fine-tunes multi-modal large language models (MLLMs) as vision-language-action policies using online reinforcement learning with LLM-generated rewards, eliminating need for human demonstrations or manual rewards.

Result: RL-finetuned MLLM outperforms zero-shot baselines (GPT-4o) and supervised fine-tuned MLLMs by 10.4-16.5%, generalizing well to novel scenes and tasks.

Conclusion: This is the first demonstration of adapting MLLMs as VLA agents that can both act and ask for help using LLM-generated rewards with online RL, showing significant performance improvements.

Abstract: Embodied agents operating in household environments must interpret ambiguous and under-specified human instructions. A capable household robot should recognize ambiguity and ask relevant clarification questions to infer the user intent accurately, leading to more effective task execution. To study this problem, we introduce the Ask-to-Act task, where an embodied agent is tasked with a single or multi-object rearrangement task using an under-specified instruction in a home environment. The agent must strategically ask minimal, yet relevant, clarification questions to resolve ambiguity while navigating under partial observability. To address this challenge, we propose a novel approach that fine-tunes multi-modal large language models (MLLMs) as vision-language-action (VLA) policies using online reinforcement learning (RL) with LLM-generated rewards. Our method eliminates the need for large-scale human demonstrations or manually engineered rewards for training such agents. We benchmark against strong zero-shot baselines including GPT-4o as well as supervised fine-tuned MLLMs on our task. Our results show that our RL-finetuned MLLM outperforms all baselines by a significant margin (10.4-16.5%), generalizing well to novel scenes and tasks. To the best of our knowledge, this is the first demonstration of adapting MLLMs as VLA agents that can act and ask for help using LLM-generated rewards with online RL.
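
To make the reward design concrete, here is a minimal sketch of an LLM-generated reward in the spirit the abstract describes: a judge model scores whether the agent asked only necessary clarification questions and completed the task. The `judge_llm` callable, the prompt wording, and the 0–1 scoring convention are illustrative assumptions, not the paper's implementation.

```python
def llm_reward(instruction: str, trajectory_summary: str, judge_llm) -> float:
    """Score an episode with an LLM judge; `judge_llm` is a hypothetical
    callable that takes a prompt string and returns a text completion."""
    prompt = (
        f"Instruction (possibly ambiguous): {instruction}\n"
        f"Agent behavior: {trajectory_summary}\n"
        "Score from 0 to 1: did the agent ask only necessary clarification "
        "questions and complete the intended task? Answer with a number only."
    )
    try:
        return float(judge_llm(prompt).strip())
    except ValueError:
        # Unparseable judge output is treated as zero reward.
        return 0.0
```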

[244] SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning

Zheng Li, Qingxiu Dong, Jingyuan Ma, Di Zhang, Kai Jia, Zhifang Sui

Main category: cs.AI

TL;DR: SelfBudgeter is a framework that prevents overthinking in reasoning models by predicting token budgets before reasoning, allowing dynamic budget allocation based on problem complexity and manual control.

DetailsMotivation: Reasoning models often overthink simple problems, wasting computational resources and degrading user experience. This paper aims to address this issue with an adaptive controllable reasoning framework.

Method: Uses a dual-phase training: cold-start phase learns to predict token budgets before reasoning, and RL phase trains the model to autonomously plan budgets based on problem difficulty and strictly adhere to them.

Result: Achieved 61% average response length compression for 1.5B model and 48% for 7B model on GSM8K, MATH500, and AIME2025 datasets while maintaining nearly undiminished accuracy.

Conclusion: SelfBudgeter effectively reduces computational waste by dynamically allocating reasoning budgets based on problem complexity, improving efficiency while preserving performance.

Abstract: While reasoning models demonstrate exceptional performance on complex tasks, they often exhibit tendencies of overthinking on simple problems. This phenomenon not only leads to excessive computational resource consumption but also significantly degrades user experience. To address this challenge, we propose SelfBudgeter - a novel user-friendly adaptive controllable reasoning framework that incorporates a budget estimation mechanism prior to reasoning. The framework adopts a dual-phase training paradigm: during the cold-start phase, the model learns to predict token budgets before executing reasoning in a standardized format; in the reinforcement learning phase, the model is trained to autonomously plan budgets based on problem difficulty and strictly adhere to them when generating responses. Since the model outputs budget estimates at the initial stage, users can immediately anticipate waiting duration, enabling flexible decisions on whether to interrupt or continue the generation process. Notably, our method supports manual control of reasoning length through pre-filled budget fields. Experimental results demonstrate that SelfBudgeter can dynamically allocate budgets according to problem complexity, yielding an average response length compression of 61% for the 1.5B model on GSM8K, MATH500, and AIME2025, and 48% for the 7B model, while maintaining nearly undiminished accuracy.
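
As an illustration of how budget adherence could be rewarded during the RL phase, here is a minimal sketch that credits correctness and penalizes overshooting the self-declared token budget. The shaping function and its `overrun_penalty` parameter are assumptions for illustration; the paper's exact reward is not reproduced here.

```python
def budget_reward(is_correct: bool, budget: int, used_tokens: int,
                  overrun_penalty: float = 1.0) -> float:
    """Reward correctness; penalize exceeding the model's own token budget,
    scaled by the relative size of the overrun."""
    r = 1.0 if is_correct else 0.0
    if used_tokens > budget:
        r -= overrun_penalty * (used_tokens - budget) / max(budget, 1)
    return r
```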

[245] MIRROR: Modular Internal Processing for Personalized Safety in LLM Dialogue

Nicole Hsing

Main category: cs.AI

TL;DR: MIRROR is a modular architecture that prevents harmful recommendations in multi-turn dialogues by maintaining persistent user context and using dual-component processing (Talker for immediate responses, Thinker for deliberation).

DetailsMotivation: Large language models often generate harmful recommendations by ignoring user-specific safety context, exhibiting sycophantic agreement, and prioritizing group preferences over individual safety.

Method: Modular architecture with persistent internal state and dual-component design (Talker for immediate responses, Thinker for asynchronous deliberation with parallel reasoning threads).

Result: 21% relative improvement (69% to 84%) on CuRaTe benchmark across 7 models; open-source Llama 4 and Mistral 3 variants surpassed GPT-4o and Claude 3.7 Sonnet at only $0.0028-$0.0172 additional cost per turn.

Conclusion: MIRROR democratizes access to safer, personalized AI by enabling flexible deployment configurations and narrowing the safety gap between affordable open-source models and frontier systems.

Abstract: Large language models frequently generate harmful recommendations in personal multi-turn dialogue by ignoring user-specific safety context, exhibiting sycophantic agreement, and compromising user safety for larger group preferences. We introduce MIRROR, a modular production-focused architecture that prevents these failures through a persistent, bounded internal state that preserves personal conversational information across turns. Our dual-component design, inspired by Dual Process Theory, separates immediate response generation (Talker) from asynchronous deliberative processing (Thinker), which synthesizes parallel reasoning threads between turns with marginal latency. On the CuRaTe personalized safety benchmark, MIRROR-augmented models achieve a 21% relative improvement (69% to 84%) across seven diverse frontier models, with open-source Llama 4 and Mistral 3 variants surpassing both GPT-4o and Claude 3.7 Sonnet at only $0.0028 to $0.0172 additional cost per turn, narrowing the gap between affordable open-source models and frontier systems in the safety space. The modular architecture enables flexible deployment: full internal processing for affordable models or single-component configurations for expensive systems, democratizing access to safer, personalized AI.
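
A minimal sketch of the Talker/Thinker split follows, assuming a hypothetical async `call_llm` completion function and an illustrative state layout: the Talker answers immediately from the persistent notes, while the Thinker updates those notes as a background task between turns.

```python
import asyncio

class Mirror:
    """Sketch only: dual-component loop with a persistent internal state.
    `call_llm` is a hypothetical async prompt-to-text function."""

    def __init__(self, call_llm):
        self.call_llm = call_llm
        self.state = {"safety_notes": "", "pending": None}

    async def respond(self, user_msg: str) -> str:
        if self.state["pending"] is not None:
            # Fold in the previous turn's deliberation before answering.
            self.state["safety_notes"] = await self.state["pending"]
            self.state["pending"] = None
        reply = await self.call_llm(          # Talker: immediate response
            f"Known user constraints: {self.state['safety_notes']}\n"
            f"User: {user_msg}\nAssistant:")
        self.state["pending"] = asyncio.ensure_future(  # Thinker: between turns
            self.call_llm(
                "Update the user's safety-relevant constraints given:\n"
                f"Notes: {self.state['safety_notes']}\n"
                f"Turn: {user_msg} -> {reply}"))
        return reply
```

Because the deliberation runs after the reply is returned, the user-visible latency stays that of a single Talker call, which is the design point the abstract emphasizes.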

[246] V2X-UniPool: Unifying Multimodal Perception and Knowledge Reasoning for Autonomous Driving

Xuewen Luo, Fengze Yang, Fan Ding, Xiangbo Gao, Shuo Xing, Yang Zhou, Zhengzhong Tu, Chenxi Liu

Main category: cs.AI

TL;DR: V2X-UniPool is a framework that combines V2X perception with language-based reasoning for autonomous driving, achieving state-of-the-art planning accuracy with 80%+ communication cost reduction.

DetailsMotivation: Single-vehicle perception is limited by sensing range and occlusions, while V2X communication faces heterogeneity, synchronization, and latency issues. Language models offer reasoning capabilities but aren't designed for raw sensor data and suffer from hallucination.

Method: Transforms multimodal V2X data into structured language-based knowledge, organizes it in a time-indexed knowledge pool for temporal consistency, and uses Retrieval-Augmented Generation (RAG) to ground decisions in real-time context.

Result: Achieves state-of-the-art planning accuracy and safety on DAIR-V2X dataset while reducing communication cost by more than 80%, with the lowest overhead among evaluated methods.

Conclusion: The framework successfully bridges V2X perception and language reasoning to advance scalable and trustworthy autonomous driving.

Abstract: Autonomous driving (AD) has achieved significant progress, yet single-vehicle perception remains constrained by sensing range and occlusions. Vehicle-to-Everything (V2X) communication addresses these limits by enabling collaboration across vehicles and infrastructure, but it also faces heterogeneity, synchronization, and latency constraints. Language models offer strong knowledge-driven reasoning and decision-making capabilities, but they are not inherently designed to process raw sensor streams and are prone to hallucination. We propose V2X-UniPool, the first framework that unifies V2X perception with language-based reasoning for knowledge-driven AD. It transforms multimodal V2X data into structured, language-based knowledge, organizes it in a time-indexed knowledge pool for temporally consistent reasoning, and employs Retrieval-Augmented Generation (RAG) to ground decisions in real-time context. Experiments on the real-world DAIR-V2X dataset show that V2X-UniPool achieves state-of-the-art planning accuracy and safety while reducing communication cost by more than 80%, achieving the lowest overhead among evaluated methods. These results highlight the promise of bridging V2X perception and language reasoning to advance scalable and trustworthy driving. Our code is available at: https://github.com/Xuewen2025/V2X-UniPool
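
To make the time-indexed knowledge pool concrete, here is a small sketch of retrieval that enforces temporal consistency with a recency window before ranking by embedding similarity. The `Entry` layout, the `horizon` parameter, and cosine ranking are assumptions for illustration, not the paper's exact retrieval scheme.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Entry:
    t: float          # timestamp of the V2X observation
    text: str         # structured language description of the scene element
    emb: np.ndarray   # embedding of `text`

def retrieve(pool, query_emb, t_now, k=5, horizon=2.0):
    """Keep only entries within the recent time horizon (temporal
    consistency), then return the top-k by cosine similarity."""
    recent = [e for e in pool if t_now - e.t <= horizon]
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return sorted(recent, key=lambda e: cos(e.emb, query_emb), reverse=True)[:k]
```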

[247] Bridging Ethical Principles and Algorithmic Methods: An Alternative Approach for Assessing Trustworthiness in AI Systems

Michael Papademas, Xenia Ziouvelou, Antonis Troumpoukis, Vangelis Karkaletsis

Main category: cs.AI

TL;DR: This paper introduces an assessment method that combines ethical components of Trustworthy AI with PageRank and TrustRank algorithms to create a quantitative framework for evaluating AI system trustworthiness.

DetailsMotivation: AI systems pose significant societal risks due to their complexity and pervasive reach, operating beyond direct human oversight. Current guidelines lack quantification methods, while technological tools lack holistic perspectives.

Method: Combines ethical components of Trustworthy AI with algorithmic processes of PageRank and TrustRank to establish an assessment framework that minimizes subjectivity in self-assessment techniques.

Result: The approach provides quantitative insights for holistic assessment of AI system trustworthiness while considering theoretical content of relevant guidelines.

Conclusion: A holistic assessment of AI system trustworthiness can be achieved through quantitative methods that integrate ethical considerations with algorithmic criteria, reducing subjectivity in current evaluation practices.

Abstract: Artificial Intelligence (AI) technology epitomizes the complex challenges posed by human-made artifacts, particularly those that are widely integrated into society and exert significant influence, offering potential benefits alongside negative consequences. While other technologies may also pose substantial risks, AI’s pervasive reach makes its societal effects especially profound. The complexity of AI systems, coupled with their remarkable capabilities, can lead to a reliance on technologies that operate beyond direct human oversight or understanding. To mitigate the risks that arise, several theoretical tools and guidelines have been developed, alongside efforts to create technological tools aimed at safeguarding Trustworthy AI. The guidelines take a more holistic view of the issue but fail to provide techniques for quantifying trustworthiness. Conversely, while technological tools are better at achieving such quantification, they lack a holistic perspective, focusing instead on specific aspects of Trustworthy AI. This paper aims to introduce an assessment method that combines the ethical components of Trustworthy AI with the algorithmic processes of PageRank and TrustRank. The goal is to establish an assessment framework that minimizes the subjectivity inherent in the self-assessment techniques prevalent in the field by introducing algorithmic criteria. The application of our approach indicates that a holistic assessment of an AI system’s trustworthiness can be achieved by providing quantitative insights while considering the theoretical content of relevant guidelines.
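
For readers unfamiliar with the algorithmic side, here is a textbook PageRank power iteration over a hypothetical graph of trustworthiness criteria (the criteria names, edge weights, and damping factor are illustrative; the paper's actual graph construction is not shown):

```python
import numpy as np

# Hypothetical support graph among Trustworthy AI criteria: an edge i -> j
# means satisfying criterion i lends support to criterion j.
criteria = ["transparency", "fairness", "robustness", "accountability"]
A = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)
M = A / A.sum(axis=0, keepdims=True)   # column-stochastic transition matrix

def pagerank(M, d=0.85, tol=1e-9):
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    while True:
        r_next = (1 - d) / n + d * M @ r   # damped power-iteration step
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next

for name, score in zip(criteria, pagerank(M)):
    print(f"{name}: {score:.3f}")
```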

[248] LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers

Jingze Zhu, Yongliang Wu, Wenbo Zhu, Jiawang Cao, Yanqiang Zheng, Jiawei Chen, Xu Yang, Bernt Schiele, Jonas Fischer, Xinting Hu

Main category: cs.AI

TL;DR: Token-aware, layer-localized contrastive decoding method that aligns specific token types with their most influential transformer layers to improve factual generation in LLMs without additional training.

DetailsMotivation: LLMs are vulnerable to factual errors despite strong language capabilities, and existing decoding-time methods treat token-level and layer-level signals in isolation, missing their joint dynamics.

Method: Empirical attention analysis identifies punctuation tokens dominate early layers while conceptual tokens govern intermediate layers. The method selectively suppresses attention to these token types at respective depths to induce controlled factual degradation and derive contrastive signals.

Result: The method consistently improves factuality across multiple LLMs and various benchmarks without requiring additional training or model modification.

Conclusion: Aligning specific token types with their most influential transformer layers through layer-localized contrastive decoding effectively enhances factual generation in LLMs.

Abstract: Large language models (LLMs) excel at natural language understanding and generation but remain vulnerable to factual errors, limiting their reliability in knowledge-intensive tasks. While decoding-time strategies provide a promising efficient solution without training, existing methods typically treat token-level and layer-level signals in isolation, overlooking the joint dynamics between them. In this work, we introduce a token-aware, layer-localized contrastive decoding method that aligns specific token types with their most influential transformer layers to improve factual generation. Through empirical attention analysis, we identify two key patterns: punctuation tokens receive dominant attention in early layers, while conceptual tokens govern semantic reasoning in intermediate layers. By selectively suppressing attention to these token types at their respective depths, we achieve the induction of controlled factual degradation and derive contrastive signals to guide the final factual decoding. Our method requires no additional training or model modification, and experiments demonstrate that our method consistently improves factuality across multiple LLMs and various benchmarks.
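
The decoding step itself follows the standard contrastive-decoding recipe; a minimal sketch is below. Producing `degraded_logits` (by suppressing attention to punctuation tokens in early layers or conceptual tokens in intermediate layers) is the paper's contribution and is not shown here; the plausibility cutoff `plaus` is a common convention, not necessarily the paper's value.

```python
import math
import torch

def contrastive_step(expert_logits, degraded_logits, plaus=0.1):
    """Score tokens by how much more likely the intact model finds them than
    the deliberately degraded model, restricted to tokens the intact model
    itself deems plausible."""
    lp_e = torch.log_softmax(expert_logits, dim=-1)
    lp_d = torch.log_softmax(degraded_logits, dim=-1)
    scores = lp_e - lp_d
    # Plausibility mask: drop tokens far below the expert's top probability.
    cutoff = lp_e.max(dim=-1, keepdim=True).values + math.log(plaus)
    scores = scores.masked_fill(lp_e < cutoff, float("-inf"))
    return scores.argmax(dim=-1)
```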

[249] Disentangling Multiplex Spatial-Temporal Transition Graph Representation Learning for Socially Enhanced POI Recommendation

Jie Li, Haoye Dong, Zhengyang Wu, Zetao Zheng, Mingrong Lin

Main category: cs.AI

TL;DR: DiMuST is a POI recommendation model that uses disentangled representation learning on multiplex spatial-temporal graphs to address misalignment issues in existing methods, achieving superior performance.

DetailsMotivation: Existing POI recommendation models separately model spatial and temporal transitions, causing misaligned representations of spatial-temporal key nodes, which introduces redundant information and reduces model interpretability.

Method: Proposes DiMuST with Disentangled variational multiplex graph Auto-Encoder (DAE) that disentangles shared and private distributions using multiplex spatial-temporal graphs, fuses shared features via Product of Experts (PoE), and denoises private features through contrastive constraints.

Result: Experiments on two challenging datasets show DiMuST significantly outperforms existing methods across multiple metrics.

Conclusion: DiMuST effectively captures spatial-temporal transition representations while preserving intrinsic correlations of spatial-temporal relationships, addressing misalignment issues in POI recommendation.

Abstract: Next Point-of-Interest (POI) recommendation is a research hotspot in business intelligence, where users’ spatial-temporal transitions and social relationships play key roles. However, most existing works model spatial and temporal transitions separately, leading to misaligned representations of the same spatial-temporal key nodes. This misalignment introduces redundant information during fusion, increasing model uncertainty and reducing interpretability. To address this issue, we propose DiMuST, a socially enhanced POI recommendation model based on disentangled representation learning over multiplex spatial-temporal transition graphs. The model employs a novel Disentangled variational multiplex graph Auto-Encoder (DAE), which first disentangles shared and private distributions using a multiplex spatial-temporal graph strategy. It then fuses the shared features via a Product of Experts (PoE) mechanism and denoises the private features through contrastive constraints. The model effectively captures the spatial-temporal transition representations of POIs while preserving the intrinsic correlation of their spatial-temporal relationships. Experiments on two challenging datasets demonstrate that our DiMuST significantly outperforms existing methods across multiple metrics.
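
The Product of Experts step for Gaussian posteriors has a standard closed form (precisions add, means are precision-weighted), sketched below; the shapes and the absence of a prior expert are simplifying assumptions, not the paper's exact fusion module.

```python
import torch

def product_of_experts(mus, logvars):
    """Fuse per-view Gaussians q_v(z) = N(mu_v, sigma_v^2) into one Gaussian.
    mus, logvars: tensors of shape (n_views, dim)."""
    precisions = torch.exp(-logvars)              # 1 / sigma^2 per view
    joint_precision = precisions.sum(dim=0)
    joint_mu = (mus * precisions).sum(dim=0) / joint_precision
    joint_logvar = -torch.log(joint_precision)
    return joint_mu, joint_logvar
```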

[250] Gala: Global LLM Agents for Text-to-Model Translation

Junyang Cai, Serdar Kadioglu, Bistra Dilkina

Main category: cs.AI

TL;DR: Gala is a framework using multiple specialized LLM agents to translate natural language problem descriptions into MiniZinc models by decomposing the task by global constraint type.

DetailsMotivation: Natural language descriptions of optimization problems are difficult to translate into correct MiniZinc models due to the need for both logical reasoning and constraint programming expertise.

Method: Multiple specialized LLM agents decompose the modeling task by global constraint type, with each agent dedicated to detecting and generating code for a specific class of global constraint, and a final assembler agent integrates these into a complete model.

Result: Initial experiments with several LLMs show improved performance over baselines such as one-shot prompting and chain-of-thought prompting.

Conclusion: The paper outlines a comprehensive roadmap for future work with potential enhancements and directions for improvement.

Abstract: Natural language descriptions of optimization or satisfaction problems are challenging to translate into correct MiniZinc models, as this process demands both logical reasoning and constraint programming expertise. We introduce Gala, a framework that addresses this challenge with a global agentic approach: multiple specialized large language model (LLM) agents decompose the modeling task by global constraint type. Each agent is dedicated to detecting and generating code for a specific class of global constraint, while a final assembler agent integrates these constraint snippets into a complete MiniZinc model. By dividing the problem into smaller, well-defined sub-tasks, the framework lets each LLM handle a simpler reasoning challenge, potentially reducing overall complexity. We conduct initial experiments with several LLMs and show improved performance over baselines such as one-shot prompting and chain-of-thought prompting. Finally, we outline a comprehensive roadmap for future work, highlighting potential enhancements and directions for improvement.

[251] THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning

Qikai Chang, Zhenrong Zhang, Pengfei Hu, Jun Du, Jiefeng Ma, Yicheng Pan, Jianshu Zhang, Quan Liu, Jianqing Gao

Main category: cs.AI

TL;DR: THOR is a framework that integrates external tools with LLMs for mathematical reasoning, addressing challenges in data construction, fine-grained optimization, and inference enhancement through hierarchical RL and self-correction mechanisms.

DetailsMotivation: LLMs struggle with high-precision mathematical tasks like numerical computation and symbolic manipulation, and existing tool integration methods face challenges in data construction, fine-grained optimization, and inference enhancement.

Method: Proposes THOR with three components: TIRGen (multi-agent actor-critic pipeline for dataset construction), hierarchical RL optimization for both episode-level problem solving and step-level code generation, and self-correction mechanism using tool feedback during inference.

Result: Achieves state-of-the-art performance on multiple mathematical benchmarks for models of similar scale, shows strong generalization across diverse models (both reasoning and non-reasoning), and delivers consistent improvements on code benchmarks.

Conclusion: THOR effectively bridges the gap in LLM mathematical reasoning through systematic tool integration, hierarchical optimization, and dynamic self-correction, demonstrating broad applicability and superior performance.

Abstract: Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still continue to struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent actor-critic-based pipeline for constructing high-quality datasets of tool-integrated reasoning paths, aligning with the policy and generalizing well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes for both episode-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer’s correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance for models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.

[252] Efficient & Correct Predictive Equivalence for Decision Trees

Joao Marques-Silva, Alexey Ignatiev

Main category: cs.AI

TL;DR: This paper identifies critical issues with using Quine-McCluskey method for decision tree analysis, showing it can have exponential complexity and incorrect results, and proposes polynomial-time alternatives.

DetailsMotivation: To address problems with predictive equivalent decision trees in Rashomon sets that cause inaccurate feature importance, and to overcome limitations of existing Quine-McCluskey based approaches.

Method: Demonstrates exponential worst-case behavior of QM method, identifies constraints for correct predictive equivalence testing, and develops polynomial-time algorithms for DT analysis problems.

Result: Shows QM method can be exponentially slow and incorrect, and provides efficient polynomial-time solutions that outperform existing methods by orders of magnitude.

Conclusion: The proposed polynomial-time algorithms are superior to QM-based approaches for decision tree analysis tasks, offering both correctness guarantees and computational efficiency.

Abstract: The Rashomon set of decision trees (DTs) has important uses. Recent work showed that DTs computing the same classification function, i.e. predictive equivalent DTs, can represent a significant fraction of the Rashomon set. Such redundancy is undesirable. For example, feature importance based on the Rashomon set becomes inaccurate due to the existence of predictive equivalent DTs, i.e. DTs with the same prediction for every possible input. In recent work, McTavish et al. proposed solutions for several computational problems related to DTs, including that of deciding predictive equivalence of DTs. The approach of McTavish et al. consists of applying the well-known method of Quine-McCluskey (QM) for obtaining minimum-size DNF (disjunctive normal form) representations of DTs, which are then used for comparing DTs for predictive equivalence. Furthermore, the minimum-size DNF representation was also applied to computing explanations for the predictions made by DTs, and to finding predictions in the presence of missing data. However, the problem of formula minimization is hard for the second level of the polynomial hierarchy, and the QM method may exhibit worst-case exponential running time and space. This paper first demonstrates that there exist decision trees that trigger the worst-case exponential running time and space of the QM method. Second, the paper shows that the QM method may incorrectly decide predictive equivalence, if two key constraints are not respected, and one may be difficult to formally guarantee. Third, the paper shows that any of the problems to which the smallest DNF representation has been applied can be solved in polynomial time, in the size of the DT. The experiments confirm that, for DTs for which the worst-case of the QM method is triggered, the algorithms proposed in this paper are orders of magnitude faster than the ones proposed by McTavish et al.
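
For intuition on why a DNF view of a DT need not involve QM-style minimization: every root-to-leaf path ending in the positive class is one DNF term, and enumerating these paths is linear in the tree size (though the result is not a minimum DNF). The sketch below illustrates this path extraction on a toy tree encoding; the node encoding is an assumption for illustration, not the paper's algorithm.

```python
def dt_to_dnf(node, path=()):
    """Enumerate root-to-leaf paths of a binary DT as DNF terms for the
    positive class. A node is ('leaf', label) or ('split', feature, left,
    right), where left/right are the feature-False/feature-True branches."""
    if node[0] == "leaf":
        return [path] if node[1] == 1 else []
    _, feat, left, right = node
    return (dt_to_dnf(left, path + ((feat, False),)) +
            dt_to_dnf(right, path + ((feat, True),)))

# Toy tree computing: x0 AND (NOT x1)
tree = ("split", 0,
        ("leaf", 0),
        ("split", 1, ("leaf", 1), ("leaf", 0)))
print(dt_to_dnf(tree))  # [((0, True), (1, False))]
```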

[253] PRIME: Planning and Retrieval-Integrated Memory for Enhanced Reasoning

Hieu Tran, Zonghai Yao, Nguyen Luong Tran, Zhichao Yang, Feiyun Ouyang, Shuo Han, Razieh Rahimi, Hong Yu

Main category: cs.AI

TL;DR: PRIME is a multi-agent reasoning framework that integrates fast System 1 thinking with deliberate System 2 thinking, enabling open-source LLMs to compete with state-of-the-art closed-source models on complex reasoning tasks.

DetailsMotivation: Inspired by the dual-process theory of human cognition from Thinking, Fast and Slow, the goal is to create a framework that mimics human cognitive processes by dynamically integrating intuitive and deliberate thinking to enhance reasoning efficiency and accuracy.

Method: PRIME employs a Quick Thinking Agent (System 1) for rapid answers, and if uncertainty is detected, triggers a structured System 2 pipeline with specialized agents for planning, hypothesis generation, retrieval, information integration, and decision-making.

Result: Experimental results with LLaMA 3 models show that PRIME enables open-source LLMs to perform competitively with state-of-the-art closed-source models like GPT-4 and GPT-4o on benchmarks requiring multi-hop and knowledge-grounded reasoning.

Conclusion: PRIME establishes itself as a scalable solution for improving LLMs in domains requiring complex, knowledge-intensive reasoning by faithfully mimicking human cognitive processes.

Abstract: Inspired by the dual-process theory of human cognition from Thinking, Fast and Slow, we introduce PRIME (Planning and Retrieval-Integrated Memory for Enhanced Reasoning), a multi-agent reasoning framework that dynamically integrates System 1 (fast, intuitive thinking) and System 2 (slow, deliberate thinking). PRIME first employs a Quick Thinking Agent (System 1) to generate a rapid answer; if uncertainty is detected, it then triggers a structured System 2 reasoning pipeline composed of specialized agents for planning, hypothesis generation, retrieval, information integration, and decision-making. This multi-agent design faithfully mimics human cognitive processes and enhances both efficiency and accuracy. Experimental results with LLaMA 3 models demonstrate that PRIME enables open-source LLMs to perform competitively with state-of-the-art closed-source models like GPT-4 and GPT-4o on benchmarks requiring multi-hop and knowledge-grounded reasoning. This research establishes PRIME as a scalable solution for improving LLMs in domains requiring complex, knowledge-intensive reasoning.
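
The control flow reduces to an uncertainty-gated escalation, sketched below; the `quick_llm` and `deliberate_pipeline` callables and the fixed confidence `threshold` are assumptions standing in for the paper's Quick Thinking Agent and System 2 pipeline.

```python
def answer(question, quick_llm, deliberate_pipeline, threshold=0.7):
    """quick_llm(question) -> (answer, confidence in [0, 1]);
    deliberate_pipeline(question) -> answer via planning, hypothesis
    generation, retrieval, integration, and decision-making."""
    ans, conf = quick_llm(question)
    if conf >= threshold:
        return ans                         # System 1: fast path
    return deliberate_pipeline(question)   # System 2: slow, deliberate path
```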

[254] GUI-PRA: Process Reward Agent for GUI Tasks

Tao Xiong, Xavier Hu, Yurun Chen, Yuhang Liu, Changqiao Wu, Pengzhi Gao, Wei Liu, Jian Luan, Shengyu Zhang

Main category: cs.AI

TL;DR: GUI-PRA is a process reward agent that addresses limitations of standard PRMs in GUI tasks by using dynamic memory and adaptive UI perception to handle long histories and dynamic UI changes.

DetailsMotivation: Standard Process Reward Models struggle with GUI tasks due to the 'lost in the middle' phenomenon with long histories and lack UI changing awareness, leading to poor performance in dynamic GUI environments.

Method: GUI-PRA uses dynamic memory mechanism with Relevance-based Retrieval and Progressive Summarization modules, plus Adaptive UI Perception to reason about UI state changes and gather visual evidence.

Result: The proposed approach better handles long historical contexts and dynamic UI changes compared to standard PRMs, improving process reward guidance for GUI agents.

Conclusion: GUI-PRA effectively addresses key challenges in GUI task automation by intelligently processing historical context and actively perceiving UI state changes, providing superior process rewards for MLLM-powered GUI agents.

Abstract: Graphical User Interface (GUI) Agents powered by Multimodal Large Language Models (MLLMs) show significant potential for automating tasks. However, they often struggle with long-horizon tasks, leading to frequent failures. Process Reward Models (PRMs) are a promising solution, as they can guide these agents with crucial process signals during inference. Nevertheless, their application to the GUI domain presents unique challenges. When processing dense artificial inputs with long history data, PRMs suffer from a “lost in the middle” phenomenon, where the overwhelming historical context compromises the evaluation of the current step. Furthermore, standard PRMs lack awareness of GUI changes, providing static evaluations that are disconnected from the dynamic consequences of actions, a critical mismatch with the inherently dynamic nature of GUI tasks. In response to these challenges, we introduce GUI-PRA (Process Reward Agent for GUI Tasks), a judge agent designed to provide better process rewards than a standard PRM by intelligently processing historical context and actively perceiving UI state changes. Specifically, to directly combat the “lost in the middle” phenomenon, we introduce a dynamic memory mechanism consisting of two core components: a Relevance-based Retrieval Module to actively fetch pertinent information from long histories and a Progressive Summarization Module to dynamically condense growing interaction data, ensuring the model focuses on relevant context. Moreover, to address the lack of awareness of UI changes, we introduce an Adaptive UI Perception mechanism. This mechanism enables the agent to reason about UI state changes and dynamically select the most appropriate tool to gather grounded visual evidence, ensuring its evaluation is always informed by the current UI context.

[255] Understanding Generative Recommendation with Semantic IDs from a Model-scaling View

Jingzhe Liu, Liam Collins, Jiliang Tang, Tong Zhao, Neil Shah, Clark Mingxuan Ju

Main category: cs.AI

TL;DR: The paper identifies scaling bottlenecks in Semantic ID-based Generative Recommendation systems and shows that using Large Language Models directly as recommenders achieves better scaling and performance.

DetailsMotivation: To address the scaling limitations of Semantic ID-based Generative Recommendation systems, which show performance saturation when scaling up modality encoders, quantization tokenizers, and the recommender systems themselves.

Method: The study compares two paradigms: SID-based GR (using discrete semantic IDs from modality encoders) and LLM-as-RS (using large language models directly as recommenders). Experiments were conducted across model sizes from 44M to 14B parameters.

Result: LLM-as-RS paradigm shows superior scaling properties, achieving up to 20% improvement over the best achievable performance of SID-based GR. LLMs demonstrate improved ability to capture collaborative filtering information as they scale up.

Conclusion: LLM-as-RS represents a promising path toward foundation models for Generative Recommendation, overcoming the intrinsic scaling limits of SID-based approaches.

Abstract: Recent advancements in generative models have allowed the emergence of a promising paradigm for recommender systems (RS), known as Generative Recommendation (GR), which tries to unify rich item semantics and collaborative filtering signals. One popular modern approach is to use semantic IDs (SIDs), which are discrete codes quantized from the embeddings of modality encoders (e.g., large language or vision models), to represent items in an autoregressive user interaction sequence modeling setup (henceforth, SID-based GR). While generative models in other domains exhibit well-established scaling laws, our work reveals that SID-based GR shows significant bottlenecks while scaling up the model. In particular, the performance of SID-based GR quickly saturates as we enlarge each component: the modality encoder, the quantization tokenizer, and the RS itself. In this work, we identify the limited capacity of SIDs to encode item semantic information as one of the fundamental bottlenecks. Motivated by this observation, as an initial effort to obtain GR models with better scaling behaviors, we revisit another GR paradigm that directly uses large language models (LLMs) as recommenders (henceforth, LLM-as-RS). Our experiments show that the LLM-as-RS paradigm has superior model scaling properties and achieves up to 20 percent improvement over the best achievable performance of SID-based GR through scaling. We also challenge the prevailing belief that LLMs struggle to capture collaborative filtering information, showing that their ability to model user-item interactions improves as LLMs scale up. Our analyses on both SID-based GR and LLMs across model sizes from 44M to 14B parameters underscore the intrinsic scaling limits of SID-based GR and position LLM-as-RS as a promising path toward foundation models for GR.
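
To see why a short SID can bottleneck item semantics, consider a toy residual quantizer: each of L codebooks maps the current residual to its nearest code, so an item is reduced to L discrete tokens regardless of how rich its embedding is. This sketch is a simplified illustration (real systems typically learn the codebooks with an RQ-VAE-style objective):

```python
import numpy as np

def residual_quantize(emb, codebooks):
    """Map an item embedding to an L-token semantic ID.
    codebooks: list of L arrays, each (K, d)."""
    sid, residual = [], emb.copy()
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        sid.append(idx)                 # one discrete token per level
        residual = residual - cb[idx]   # quantize what remains
    return sid

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 32)) for _ in range(3)]
print(residual_quantize(rng.normal(size=32), codebooks))  # e.g. [17, 203, 88]
```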

[256] Learning to Interact in World Latent for Team Coordination

Dongsu Lee, Daehee Lee, Yaru Niu, Honguk Woo, Amy Zhang, Ding Zhao

Main category: cs.AI

TL;DR: IWoL is a novel representation learning framework for multi-agent reinforcement learning that creates a learnable representation space combining inter-agent relations and task information through modeled communication protocols, enabling decentralized execution with implicit coordination.

DetailsMotivation: Team coordination in MARL is challenging due to complex multi-agent dynamics and incomplete information from local observations. Existing explicit communication methods have drawbacks like slow decision-making, security vulnerabilities, and bandwidth sensitivity.

Method: Constructs a learnable representation space that jointly captures inter-agent relations and task-specific world information by directly modeling communication protocols. The representation can be used as implicit latent for agents or explicit messages for communication.

Result: Evaluated across four challenging MARL benchmarks, IWoL provides simple yet powerful coordination. Both variants (implicit latent and explicit communication) show strong performance. The representation can be combined with existing MARL algorithms to further enhance their performance.

Conclusion: IWoL offers an effective framework for team coordination in MARL that avoids drawbacks of explicit message passing while maintaining decentralized execution with implicit coordination.

Abstract: This work presents a novel representation learning framework, interactive world latent (IWoL), to facilitate team coordination in multi-agent reinforcement learning (MARL). Building effective representation for team coordination is a challenging problem, due to the intricate dynamics emerging from multi-agent interaction and incomplete information induced by local observations. Our key insight is to construct a learnable representation space that jointly captures inter-agent relations and task-specific world information by directly modeling communication protocols. With this representation, we maintain fully decentralized execution with implicit coordination, all while avoiding the inherent drawbacks of explicit message passing, e.g., slower decision-making, vulnerability to malicious attackers, and sensitivity to bandwidth constraints. In practice, our representation can be used not only as an implicit latent for each agent, but also as an explicit message for communication. Across four challenging MARL benchmarks, we evaluate both variants and show that IWoL provides a simple yet powerful key for team coordination. Moreover, we demonstrate that our representation can be combined with existing MARL algorithms to further enhance their performance.

[257] OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!

Jingdi Lei, Varun Gumma, Rishabh Bhardwaj, Seok Min Lim, Chuan Li, Amir Zadeh, Soujanya Poria

Main category: cs.AI

TL;DR: The paper introduces operational safety for LLMs - their ability to appropriately accept or refuse user queries for specific use cases. It presents OffTopicEval benchmark showing current LLMs are highly operationally unsafe, and proposes prompt-based steering methods that significantly improve refusal rates.

DetailsMotivation: Enterprises need to ensure LLM-based agents are safe for their intended use cases, moving beyond generic safety concerns to operational safety - whether models can appropriately handle queries within their designated purpose.

Method: Proposed OffTopicEval evaluation suite for measuring operational safety, tested on 20 open-weight LLMs across 6 model families. Introduced two prompt-based steering methods: query grounding (Q-ground) and system-prompt grounding (P-ground) to improve out-of-distribution refusal.

Result: All tested models showed poor operational safety, with best performers Qwen-3 (235B) at 77.77% and Mistral (24B) at 79.96%. Prompt-based steering significantly improved performance: Q-ground provided up to 23% gains, P-ground boosted Llama-3.3 (70B) by 41% and Qwen-3 (30B) by 27%.

Conclusion: Current LLMs lack reliable operational safety for enterprise deployment. Prompt-based steering offers promising first-step interventions to improve refusal capabilities, highlighting the urgent need for operational safety measures in LLM-based agents.

Abstract: Large Language Model (LLM) safety is one of the most pressing challenges for enabling wide-scale deployment. While most studies and global discussions focus on generic harms, such as models assisting users in harming themselves or others, enterprises face a more fundamental concern: whether LLM-based agents are safe for their intended use case. To address this, we introduce operational safety, defined as an LLM’s ability to appropriately accept or refuse user queries when tasked with a specific purpose. We further propose OffTopicEval, an evaluation suite and benchmark for measuring operational safety both in general and within specific agentic use cases. Our evaluations on six model families comprising 20 open-weight LLMs reveal that while performance varies across models, all of them remain highly operationally unsafe. Even the strongest models - Qwen-3 (235B) with 77.77% and Mistral (24B) with 79.96% - fall far short of reliable operational safety, while GPT models plateau in the 62-73% range, Phi achieves only mid-level scores (48-70%), and Gemma and Llama-3 collapse to 39.53% and 23.84%, respectively. While operational safety is a core model alignment issue, to suppress these failures, we propose prompt-based steering methods: query grounding (Q-ground) and system-prompt grounding (P-ground), which substantially improve OOD refusal. Q-ground provides consistent gains of up to 23%, while P-ground delivers even larger boosts, raising Llama-3.3 (70B) by 41% and Qwen-3 (30B) by 27%. These results highlight both the urgent need for operational safety interventions and the promise of prompt-based steering as a first step toward more reliable LLM-based agents.
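
System-prompt grounding amounts to constraining the agent's purpose up front; a minimal template in that spirit is sketched below. The wording is an illustrative assumption, not the P-ground prompt used in the paper.

```python
def p_ground_system_prompt(agent_purpose: str) -> str:
    """Build a purpose-scoped system prompt that instructs the model to
    refuse out-of-scope queries rather than answer them."""
    return (
        f"You are an assistant whose sole purpose is: {agent_purpose}.\n"
        "Before answering, decide whether the user's request falls within "
        "this purpose. If it does not, refuse briefly and redirect the user; "
        "do not attempt an answer outside your designated scope."
    )

print(p_ground_system_prompt("answering questions about hotel bookings"))
```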

[258] Thinkquel: A Model Dedicated to Text-to-dbt Using Synthetic Data and a Span-Aware Objective

Anni Li, Aria Attar, Paul Dong

Main category: cs.AI

TL;DR: Thinkquel is a fine-tuned model that transforms natural language into reliable database queries using synthetic data generation and span-aware reinforcement learning to bridge token-level training with sequence-level execution rewards.

DetailsMotivation: Natural language to SQL conversion faces challenges with schema linking, SQL dialect variations, and misalignment between token-level training objectives and sequence-level execution supervision, making it difficult to produce robust and portable queries.

Method: Thinkquel integrates TS-SQL synthetic data pipeline using dbt as portable intermediate representation and TS-GRPO (Token-Sequence GRPO) reinforcement learning that aligns token-level training with sequence-level execution rewards during LLM fine-tuning.

Result: On TS-SQL test set, Thinkquel (32B) achieved 93.2% execution success and 61.8% exact-result match, improving over base model by 67.2% and 44.4% respectively. In Spider experiments, TS-GRPO improved training stability and accelerated convergence compared to GRPO and GSPO.

Conclusion: Thinkquel demonstrates that combining synthetic data generation with span-aware reinforcement learning effectively bridges the gap between token-level training and sequence-level execution validation, enabling robust and portable natural language to SQL conversion.

Abstract: Transforming natural-language requests into reliable, production-ready data transformations remains challenging: correctness depends on precise schema linking and warehouse-specific SQL dialects, while the strongest supervision available during training (execution success and result matching) is provided only at the sequence level. At the same time, assembling large, execution-validated corpora is costly, and token-level objectives misalign with these global signals, yielding unstable optimization and limited portability. We introduce Thinkquel, a fine-tuned model for producing robust, portable, and execution-validated database queries. Thinkquel integrates a novel synthetic data pipeline, TS-SQL, which leverages dbt as a portable intermediate representation, with a span-aware reinforcement learning objective, Token-Sequence GRPO (TS-GRPO), specifically designed to bridge the gap between token-level training signals and sequence-level execution rewards when fine-tuning LLMs. On the 500-example TS-SQL test set, Thinkquel (32B) reaches 93.2% execution success and 61.8% exact-result match with a two-stage SFT curriculum, improving over the base model by 67.2% (exec.) and 44.4% (match). In Spider (14B) experiments, TS-GRPO increases training stability and speeds convergence of the execution-match reward relative to GRPO and GSPO.

[259] Learning to Decide with Just Enough: Information-Theoretic Context Summarization for CMDPs

Peidong Liu, Junjiang Lin, Shaowen Wang, Yao Xu, Haiqing Li, Xuhao Xie, Siyi Wu, Hao Li

Main category: cs.AI

TL;DR: LLM-based information-theoretic summarization for Contextual Markov Decision Processes that compresses high-dimensional contexts into low-dimensional summaries, improving performance while reducing computational costs.

DetailsMotivation: Existing CMDP methods fail to generalize in high-dimensional contexts, leading to excessive computation and unstable performance.

Method: Information-theoretic summarization using LLMs to compress contextual inputs into low-dimensional, semantically rich summaries that augment states while preserving decision-critical information.

Result: Outperforms raw-context and non-context baselines across benchmarks, improving reward, success rate, sample efficiency while reducing latency and memory usage.

Conclusion: LLM-based summarization provides a scalable and interpretable solution for efficient decision-making in context-rich, resource-constrained environments.

Abstract: Contextual Markov Decision Processes (CMDPs) offer a framework for sequential decision-making under external signals, but existing methods often fail to generalize in high-dimensional or unstructured contexts, resulting in excessive computation and unstable performance. We propose an information-theoretic summarization approach that uses large language models (LLMs) to compress contextual inputs into low-dimensional, semantically rich summaries. These summaries augment states by preserving decision-critical cues while reducing redundancy. Building on the notion of approximate context sufficiency, we provide, to our knowledge, the first regret bounds and a latency-entropy trade-off characterization for CMDPs. Our analysis clarifies how informativeness impacts computational cost. Experiments across discrete, continuous, visual, and recommendation benchmarks show that our method outperforms raw-context and non-context baselines, improving reward, success rate, and sample efficiency, while reducing latency and memory usage. These findings demonstrate that LLM-based summarization offers a scalable and interpretable solution for efficient decision-making in context-rich, resource-constrained environments.
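
The state-augmentation step can be sketched very simply: compress the raw context with an LLM and attach the summary to the state the policy conditions on. The `summarizer_llm` callable, prompt wording, and word budget are illustrative assumptions.

```python
def summarized_state(raw_context: str, state, summarizer_llm, max_words=40):
    """Compress a high-dimensional context into a short, decision-oriented
    summary and pair it with the environment state."""
    prompt = (
        f"Summarize, in at most {max_words} words, only the facts in the "
        "following context that matter for the agent's next decision:\n"
        f"{raw_context}"
    )
    return {"state": state, "context_summary": summarizer_llm(prompt)}
```

Bounding the summary length is what trades informativeness (entropy) against latency and memory, which is the trade-off the abstract characterizes.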

[260] Do AI Models Perform Human-like Abstract Reasoning Across Modalities?

Claas Beger, Ryan Yi, Shuhao Fu, Arseny Moskvichev, Sarah W. Tsai, Sivasankaran Rajamanickam, Melanie Mitchell

Main category: cs.AI

TL;DR: While some AI models achieve human-level accuracy on ConceptARC benchmark, their abstract reasoning capabilities are limited - they often rely on surface-level shortcuts rather than true abstractions, and accuracy alone overestimates text-based reasoning while underestimating visual reasoning abilities.

DetailsMotivation: To investigate whether state-of-the-art models truly recognize and reason with intended abstractions in ARC tasks, rather than just achieving high accuracy through surface-level pattern matching.

Method: Evaluated models on ConceptARC with varying input modalities (textual vs. visual), external tool usage, and reasoning effort. Used dual evaluation of both output accuracy and fine-grained analysis of natural-language rules generated to explain solutions.

Result: Text-based models matched human accuracy but their rules showed surface-level shortcuts and captured intended abstractions far less than humans. Visual models had lower accuracy but still exhibited substantial abstraction capture, though often failed to apply rules correctly.

Conclusion: Models still lag humans in abstract reasoning, and accuracy alone is insufficient for evaluating abstract reasoning - it overestimates text-based reasoning and underestimates visual reasoning. The proposed evaluation framework provides a more faithful assessment of multimodal abstract reasoning abilities.

Abstract: OpenAI’s o3-preview reasoning model exceeded human accuracy on the ARC-AGI benchmark, but does that mean state-of-the-art models recognize and reason with the abstractions that the task creators intended? We investigate models’ abstraction abilities on ConceptARC. We evaluate models under settings that vary the input modality (textual vs. visual), whether the model is permitted to use external Python tools, and, for reasoning models, the amount of reasoning effort. In addition to measuring output accuracy, we perform fine-grained evaluation of the natural-language rules that models generate to explain their solutions. This dual evaluation lets us assess whether models solve tasks using the abstractions ConceptARC was designed to elicit, rather than relying on surface-level patterns. Our results show that, while some models using text-based representations match human output accuracy, the best models’ rules are often based on surface-level “shortcuts” and capture intended abstractions far less often than humans. Thus their capabilities for general abstract reasoning may be overestimated by evaluations based on accuracy alone. In the visual modality, AI models’ output accuracy drops sharply, yet our rule-level analysis reveals that models might be underestimated, as they still exhibit a substantial share of rules that capture intended abstractions, but are often unable to correctly apply these rules. In short, our results show that models still lag humans in abstract reasoning, and that using accuracy alone to evaluate abstract reasoning on ARC-like tasks may overestimate abstract-reasoning capabilities in textual modalities and underestimate them in visual modalities. We believe that our evaluation framework offers a more faithful picture of multimodal models’ abstract reasoning abilities and a more principled way to track progress toward human-like, abstraction-centered intelligence.

cs.SD

[261] Accelerated Convolutive Transfer Function-Based Multichannel NMF Using Iterative Source Steering

Xuemai Xie, Xianrui Wang, Liyuan Zhang, Yichen Yang, Shoji Makino

Main category: cs.SD

TL;DR: Proposes an efficient variant of CTF-MNMF using iterative source steering (ISS) to reduce computational complexity while maintaining separation performance in reverberant environments.

DetailsMotivation: CTF-MNMF performs well in highly reverberant environments but has high computational cost due to iterative projection update rules requiring matrix inversion for each source, hindering practical deployment.

Method: Integrates iterative source steering (ISS), a matrix inversion-free update rule for separation filters, into CTF-MNMF to create an efficient variant.

Result: Achieves comparable or superior separation performance to original CTF-MNMF while significantly reducing computational complexity.

Conclusion: The proposed ISS-based CTF-MNMF variant provides an efficient solution for blind source separation in reverberant environments with reduced computational burden.

Abstract: Among numerous blind source separation (BSS) methods, convolutive transfer function-based multichannel non-negative matrix factorization (CTF-MNMF) has demonstrated strong performance in highly reverberant environments by modeling multi-frame correlations of delayed source signals. However, its practical deployment is hindered by the high computational cost associated with the iterative projection (IP) update rule, which requires matrix inversion for each source. To address this issue, we propose an efficient variant of CTF-MNMF that integrates iterative source steering (ISS), a matrix inversion-free update rule for separation filters. Experimental results show that the proposed method achieves comparable or superior separation performance to the original CTF-MNMF, while significantly reducing the computational complexity.
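
For readers unfamiliar with ISS, the key point is that each separation-filter update is a rank-one correction of the demixing matrix, so no matrix inversion is required. Below is a heavily hedged sketch of the generic ISS sweep for determined BSS (following the standard rank-one rule, not the paper's exact CTF-MNMF variant); the row convention for W and the form of the per-source weighted covariances V[k] are assumptions.

```python
import numpy as np

def iss_sweep(W, V):
    """One iterative source steering sweep. W: (K, K) demixing matrix whose
    rows are the demixing filters; V: list of K Hermitian (K, K) weighted
    spatial covariance matrices, one per source."""
    K = W.shape[0]
    for k in range(K):
        wk = W[k].copy()
        v = np.empty(K, dtype=W.dtype)
        for m in range(K):
            den = np.real(W[m] @ V[m] @ W[m].conj())
            if m == k:
                v[m] = 1.0 - 1.0 / np.sqrt(den)     # rescales the k-th filter
            else:
                v[m] = (W[m] @ V[m] @ wk.conj()) / den
        W = W - np.outer(v, wk)   # rank-one update: no matrix inversion
    return W
```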

[262] Linear RNNs for autoregressive generation of long music samples

Konrad Szewczyk, Daniel Gallo Fernández, James Townsend

Main category: cs.SD

TL;DR: HarmonicRNN uses linear RNNs (deep state space models) to model raw audio waveforms, achieving state-of-the-art performance on small-scale datasets by training on sequences up to 1 minute long.

DetailsMotivation: Autoregressive audio waveform generation is challenging due to long sequences and multi-scale structure. Traditional RNNs, causal convolutions, and self-attention have limited success, while linear RNNs show promise for efficiency.

Method: Pushes boundaries of linear RNNs for raw audio modeling, investigates architectural choices, and uses context-parallelism to train on sequences up to 1M tokens (1 minute).

Result: HarmonicRNN achieves state-of-the-art log-likelihoods and perceptual metrics on small-scale datasets.

Conclusion: Linear RNNs (deep state space models) are highly effective for autoregressive audio waveform modeling, enabling training on long sequences and outperforming traditional approaches.

Abstract: Directly learning to generate audio waveforms in an autoregressive manner is a challenging task, due to the length of the raw sequences and the existence of important structure on many different timescales. Traditional approaches based on recurrent neural networks, as well as causal convolutions and self-attention, have only had limited success on this task. However, recent work has shown that deep state space models, also referred to as linear RNNs, can be highly efficient in this context. In this work, we push the boundaries of linear RNNs applied to raw audio modeling, investigating the effects of different architectural choices and using context-parallelism to enable training on sequences up to one minute (1M tokens) in length. We present a model, HarmonicRNN, which attains state of the art log-likelihoods and perceptual metrics on small-scale datasets.
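
The core recurrence of a diagonal linear RNN (deep state space model) is simple enough to sketch directly; because the state update is linear in h, it can be computed with a parallel scan during training, which is what makes million-token sequences tractable. The sequential loop and random parameters below are for clarity only and are not HarmonicRNN's architecture.

```python
import numpy as np

def linear_rnn(x, a, B, C):
    """h_t = a * h_{t-1} + B @ x_t (elementwise decay a), y_t = C @ h_t."""
    T = x.shape[0]
    h = np.zeros(a.shape[0])
    ys = np.empty((T, C.shape[0]))
    for t in range(T):
        h = a * h + B @ x[t]
        ys[t] = C @ h
    return ys

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 1))                  # 16 audio samples, 1 channel
a = rng.uniform(0.8, 0.999, size=8)           # stable per-dimension decays
B = rng.normal(size=(8, 1))
C = rng.normal(size=(2, 8))
print(linear_rnn(x, a, B, C).shape)           # (16, 2)
```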

[263] Latent Multi-view Learning for Robust Environmental Sound Representations

Sivan Sing, Julia Wilkins, Magdalena Fuentes, Juan Pablo Bello

Main category: cs.SD

TL;DR: A multi-view learning framework that integrates contrastive learning into generative models for environmental sound representation, encoding audio latents into view-specific and view-common subspaces.

DetailsMotivation: To explore how contrastive and generative SSL approaches can complement each other in a unified framework for environmental sound learning, as this integration remains relatively underexplored.

Method: Encodes compressed audio latents into view-specific and view-common subspaces using two self-supervised objectives: contrastive learning for targeted information flow between subspaces, and reconstruction for overall information preservation.

Result: Demonstrated improved downstream performance on urban sound sensor network dataset for sound source and sensor classification compared to traditional SSL techniques.

Conclusion: The framework shows potential for disentangling environmental sound attributes within structured latent space and offers improved performance over conventional SSL methods.

Abstract: Self-supervised learning (SSL) approaches, such as contrastive and generative methods, have advanced environmental sound representation learning using unlabeled data. However, how these approaches can complement each other within a unified framework remains relatively underexplored. In this work, we propose a multi-view learning framework that integrates contrastive principles into a generative pipeline to capture sound source and device information. Our method encodes compressed audio latents into view-specific and view-common subspaces, guided by two self-supervised objectives: contrastive learning for targeted information flow between subspaces, and reconstruction for overall information preservation. We evaluate our method on an urban sound sensor network dataset for sound source and sensor classification, demonstrating improved downstream performance over traditional SSL techniques. Additionally, we investigate the model’s potential to disentangle environmental sound attributes within the structured latent space under varied training configurations.
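
A minimal sketch of combining the two objectives is given below: an InfoNCE term pulls the view-common codes of two views of the same clip together, while a reconstruction term keeps the full code informative. The tensor shapes, the concatenation of common and view-specific codes, and the `decoder` are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def multiview_loss(z_common_a, z_common_b, z_spec, latents, decoder, tau=0.1):
    """z_common_a/b: (B, d) view-common codes of two views; z_spec: (B, d')
    view-specific code; latents: compressed audio latents to reconstruct."""
    # Contrastive term: matching rows across views are positives.
    za = F.normalize(z_common_a, dim=-1)
    zb = F.normalize(z_common_b, dim=-1)
    logits = za @ zb.T / tau
    targets = torch.arange(za.shape[0], device=za.device)
    contrastive = F.cross_entropy(logits, targets)
    # Reconstruction term: preserve overall information in the latent code.
    recon = decoder(torch.cat([z_common_a, z_spec], dim=-1))
    return contrastive + F.mse_loss(recon, latents)
```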

[264] TART: A Comprehensive Tool for Technique-Aware Audio-to-Tab Guitar Transcription

Akshaj Gupta, Andrea Guzman, Anagha Badriprasad, Hwi Joo Park, Upasana Puranik, Robin Netzorg, Jiachen Lian, Gopala Krishna Anumanchipalli

Main category: cs.SD

TL;DR: A four-stage end-to-end pipeline for generating detailed guitar tablature from audio, addressing limitations in existing AMT systems for guitar by incorporating expressive technique detection and accurate string/fret assignment.

DetailsMotivation: Existing guitar transcription systems fail to detect expressive techniques (slides, bends, percussive hits) and incorrectly map notes to wrong string/fret combinations. Prior models are trained on small datasets, limiting generalizability to real-world recordings.

Method: Four-stage pipeline: (1) Audio-to-MIDI pitch conversion using adapted piano transcription model, (2) MLP-based expressive technique classification, (3) Transformer-based string and fret assignment, (4) LSTM-based tablature generation.

Result: The framework is the first to generate detailed guitar tablature with accurate fingerings and expressive labels directly from guitar audio.

Conclusion: Proposed end-to-end pipeline overcomes limitations of existing guitar AMT systems by producing comprehensive tablature with expressive technique annotations and correct string/fret mappings.

Abstract: Automatic Music Transcription (AMT) has advanced significantly for the piano, but transcription for the guitar remains limited due to several key challenges. Existing systems fail to detect and annotate expressive techniques (e.g., slides, bends, percussive hits) and incorrectly map notes to the wrong string and fret combination in the generated tablature. Furthermore, prior models are typically trained on small, isolated datasets, limiting their generalizability to real-world guitar recordings. To overcome these limitations, we propose a four-stage end-to-end pipeline that produces detailed guitar tablature directly from audio. Our system consists of (1) Audio-to-MIDI pitch conversion through a piano transcription model adapted to guitar datasets; (2) MLP-based expressive technique classification; (3) Transformer-based string and fret assignment; and (4) LSTM-based tablature generation. To the best of our knowledge, this framework is the first to generate detailed tablature with accurate fingerings and expressive labels from guitar audio.
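
The four stages compose into a simple dataflow. The sketch below shows that composition with placeholder logic standing in for the paper's learned models; the Note fields, the stub outputs, and the naive string assignment are illustrative assumptions.

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass
class Note:
    onset: float             # seconds
    pitch: int               # MIDI note number
    technique: str = "none"  # e.g. "slide", "bend", "percussive"
    string: int = -1         # 0..5 once assigned
    fret: int = -1

def audio_to_midi(audio) -> List[Note]:
    """Stage 1: pitch transcription (a guitar-adapted piano model in the paper)."""
    return [Note(onset=0.0, pitch=64), Note(onset=0.5, pitch=67)]   # stub output

def classify_techniques(notes: List[Note]) -> List[Note]:
    """Stage 2: MLP technique classifier (stubbed with a constant label)."""
    return [replace(n, technique="none") for n in notes]

def assign_string_fret(notes: List[Note]) -> List[Note]:
    """Stage 3: Transformer string/fret assignment (naive rule: B string = MIDI 59)."""
    return [replace(n, string=4, fret=n.pitch - 59) for n in notes]

def to_tablature(notes: List[Note]) -> str:
    """Stage 4: LSTM tablature generation (plain-text rendering here)."""
    return "\n".join(f"t={n.onset:.2f}s string={n.string} fret={n.fret} [{n.technique}]"
                     for n in notes)

print(to_tablature(assign_string_fret(classify_techniques(audio_to_midi(None)))))
```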

[265] Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech

Hieu-Nghia Huynh-Nguyen, Huynh Nguyen Dang, Ngoc-Son Nguyen, Van Nguyen

Main category: cs.SD

TL;DR: Flamed-TTS is a novel zero-shot TTS framework that addresses challenges in current models by emphasizing low computational cost, low latency, and high speech fidelity with rich temporal diversity, achieving state-of-the-art performance.

DetailsMotivation: Current zero-shot TTS models face challenges including unreliable synthesis (token repetition, unexpected content transfer), slow inference, substantial computational overhead, and underexplored temporal diversity needed for natural speech.

Method: Reformulated flow matching training paradigm and incorporated both discrete and continuous representations corresponding to different speech attributes to achieve efficient and diverse synthesis.

Result: Flamed-TTS surpasses state-of-the-art models in intelligibility, naturalness, speaker similarity, acoustic characteristics preservation, and dynamic pace, achieving best WER of 4% while maintaining low latency and high fidelity.

Conclusion: Flamed-TTS successfully addresses key challenges in zero-shot TTS by providing a computationally efficient framework that delivers high-quality, temporally diverse speech synthesis with superior performance metrics.

Abstract: Zero-shot Text-to-Speech (TTS) has recently advanced significantly, enabling models to synthesize speech from text using short, limited-context prompts. These prompts serve as voice exemplars, allowing the model to mimic speaker identity, prosody, and other traits without extensive speaker-specific data. Although recent approaches incorporating language models, diffusion, and flow matching have proven their effectiveness in zero-shot TTS, they still encounter challenges such as unreliable synthesis caused by token repetition or unexpected content transfer, along with slow inference and substantial computational overhead. Moreover, temporal diversity, crucial for enhancing the naturalness of synthesized speech, remains largely underexplored. To address these challenges, we propose Flamed-TTS, a novel zero-shot TTS framework that emphasizes low computational cost, low latency, and high speech fidelity alongside rich temporal diversity. To achieve this, we reformulate the flow matching training paradigm and incorporate both discrete and continuous representations corresponding to different attributes of speech. Experimental results demonstrate that Flamed-TTS surpasses state-of-the-art models in terms of intelligibility, naturalness, speaker similarity, acoustic characteristics preservation, and dynamic pace. Notably, Flamed-TTS achieves the best WER of 4% compared to the leading zero-shot TTS baselines, while maintaining low latency in inference and high fidelity in generated speech. Code and audio samples are available at our demo page https://flamed-tts.github.io.
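
For readers unfamiliar with flow matching, the generic training step looks like this: regress a velocity field along a straight noise-to-data path. This is the textbook objective, not Flamed-TTS's reformulated variant, and the 80-dim frame and 16-dim conditioning sizes are arbitrary.

```python
import torch

def flow_matching_loss(v_theta, x1, cond):
    """One generic flow-matching step over a straight path from noise to data."""
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.size(0), 1)                   # per-example time in [0, 1]
    xt = (1 - t) * x0 + t * x1                      # point on the linear path
    target_v = x1 - x0                              # constant velocity of that path
    return ((v_theta(xt, t, cond) - target_v) ** 2).mean()

class VelocityNet(torch.nn.Module):
    """Toy velocity network over acoustic frames, conditioned on text embeddings."""
    def __init__(self, dim=80, cond_dim=16):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1 + cond_dim, 256), torch.nn.SiLU(),
            torch.nn.Linear(256, dim))
    def forward(self, x, t, cond):
        return self.net(torch.cat([x, t, cond], dim=-1))

model = VelocityNet()
x1, cond = torch.randn(4, 80), torch.randn(4, 16)
loss = flow_matching_loss(model, x1, cond)
loss.backward()
```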

[266] Forensic Similarity for Speech Deepfakes

Viola Negroni, Davide Salvi, Daniele Ugo Leonzio, Paolo Bestagini, Stefano Tubaro

Main category: cs.SD

TL;DR: A digital audio forensics method that determines if two audio segments share the same forensic traces, using a deep-learning system with feature extraction and similarity scoring.

DetailsMotivation: To develop a forensic similarity approach for speech deepfakes that generalizes well to unknown forensic traces without requiring prior knowledge of them during training, inspired by similar work in image forensics.

Method: Two-part deep-learning system: a feature extractor based on speech deepfake detector backbone and a shallow similarity network that maps audio pairs to scores indicating same/different forensic traces.

Result: The method effectively performs source verification (identifying if samples come from same generative model) and splicing detection, generalizing to various forensic traces including unseen ones.

Conclusion: The approach demonstrates strong generalization capabilities and practical value in digital audio forensics for detecting forensic traces across different scenarios.

Abstract: In this paper, we introduce a digital audio forensics approach called Forensic Similarity for Speech Deepfakes, which determines whether two audio segments contain the same forensic traces or not. Our work is inspired by prior work in the image domain on forensic similarity, which proved strong generalization capabilities against unknown forensic traces, without requiring prior knowledge of them at training time. To achieve this in the audio setting, we propose a two-part deep-learning system composed of a feature extractor based on a speech deepfake detector backbone and a shallow neural network, referred to as the similarity network. This system maps pairs of audio segments to a score indicating whether they contain the same or different forensic traces. We evaluate the system on the emerging task of source verification, highlighting its ability to identify whether two samples originate from the same generative model. Additionally, we assess its applicability to splicing detection as a complementary use case. Experiments show that the method generalizes to a wide range of forensic traces, including previously unseen ones, illustrating its flexibility and practical value in digital audio forensics.
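
The two-part design maps directly to code. Below is a minimal sketch in which a placeholder convolutional backbone stands in for the deepfake-detector features and a shallow head scores a pair of segments against same/different labels; all architecture sizes are assumptions.

```python
import torch
import torch.nn as nn

class SimilaritySystem(nn.Module):
    """Feature extractor plus shallow similarity network over segment pairs."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in for the detector backbone
            nn.Conv1d(1, 32, 9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.similarity = nn.Sequential(          # shallow comparison head
            nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, wav_a, wav_b):
        fa, fb = self.backbone(wav_a), self.backbone(wav_b)
        return self.similarity(torch.cat([fa, fb], dim=-1))  # logit: same traces?

model = SimilaritySystem()
a, b = torch.randn(8, 1, 16000), torch.randn(8, 1, 16000)   # 1 s at 16 kHz
labels = torch.randint(0, 2, (8, 1)).float()                # 1 = same forensic traces
loss = nn.functional.binary_cross_entropy_with_logits(model(a, b), labels)
loss.backward()
```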

[267] WavInWav: Time-domain Speech Hiding via Invertible Neural Network

Wei Fan, Kejiang Chen, Xiangkun Wang, Weiming Zhang, Nenghai Yu

Main category: cs.SD

TL;DR: A flow-based invertible neural network approach for audio data hiding that improves secret audio recovery quality using time-frequency loss and encryption, outperforming previous methods on standard datasets.

DetailsMotivation: Previous audio hiding methods often result in unsatisfactory quality when recovering secret audio due to limitations in modeling time-frequency relationships, creating a need for more effective and reversible embedding/extraction techniques.

Method: Uses flow-based invertible neural network to link stego audio, cover audio, and secret audio; implements time-frequency loss on time-domain signal to address quality degradation issues; adds encryption for data protection.

Result: Outperforms previous approaches on VCTK and LibriSpeech datasets in both subjective and objective metrics, and exhibits robustness to various types of noise.

Conclusion: The method provides enhanced reversibility for message recovery and demonstrates utility in targeted secure communication scenarios with improved audio quality and security.

Abstract: Data hiding is essential for secure communication across digital media, and recent advances in Deep Neural Networks (DNNs) provide enhanced methods for embedding secret information effectively. However, previous audio hiding methods often result in unsatisfactory quality when recovering secret audio, due to their inherent limitations in the modeling of time-frequency relationships. In this paper, we explore these limitations and introduce a new DNN-based approach. We use a flow-based invertible neural network to establish a direct link between stego audio, cover audio, and secret audio, enhancing the reversibility of embedding and extracting messages. To address common issues from time-frequency transformations that degrade secret audio quality during recovery, we implement a time-frequency loss on the time-domain signal. This approach not only retains the benefits of time-frequency constraints but also enhances the reversibility of message recovery, which is vital for practical applications. We also add an encryption technique to protect the hidden data from unauthorized access. Experimental results on the VCTK and LibriSpeech datasets demonstrate that our method outperforms previous approaches in terms of subjective and objective metrics and exhibits robustness to various types of noise, suggesting its utility in targeted secure communication scenarios.
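
One plausible form of a time-frequency loss applied to time-domain signals is a multi-resolution STFT magnitude distance, sketched below; the paper's exact loss definition may differ.

```python
import torch

def stft_loss(x, y, fft_sizes=(512, 1024, 2048)):
    """Compare STFT magnitudes of two time-domain signals at several resolutions
    (one plausible time-frequency loss, not necessarily the paper's)."""
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=x.device)
        X = torch.stft(x, n_fft, hop_length=n_fft // 4, window=window,
                       return_complex=True).abs()
        Y = torch.stft(y, n_fft, hop_length=n_fft // 4, window=window,
                       return_complex=True).abs()
        loss = loss + (X - Y).abs().mean()        # spectral magnitude distance
    return loss / len(fft_sizes)

recovered = torch.randn(2, 16000, requires_grad=True)  # recovered secret audio
secret = torch.randn(2, 16000)                         # ground-truth secret audio
stft_loss(recovered, secret).backward()
```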

[268] SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos

Amir Dellali, Luca A. Lanzendörfer, Florian Grötschla, Roger Wattenhofer

Main category: cs.SD

TL;DR: SALSA-V is a multimodal video-to-audio generation model that synthesizes high-fidelity, synchronized long-form audio from silent videos using masked diffusion and achieves fast generation in just 8 steps.

DetailsMotivation: To create a system that can generate highly synchronized, high-quality audio from silent video content for applications like Foley generation and sound design, addressing limitations in existing methods.

Method: Uses masked diffusion objective for audio-conditioned generation, integrates shortcut loss for fast sampling, and employs random masking during training to match spectral characteristics of reference audio.

Result: Significantly outperforms state-of-the-art methods in audiovisual alignment and synchronization in both quantitative evaluation and human listening studies.

Conclusion: SALSA-V enables efficient, high-quality audio generation from video with broad applicability to professional audio synthesis tasks without requiring dedicated fine-tuning.

Abstract: We propose SALSA-V, a multimodal video-to-audio generation model capable of synthesizing highly synchronized, high-fidelity long-form audio from silent video content. Our approach introduces a masked diffusion objective, enabling audio-conditioned generation and the seamless synthesis of audio sequences of unconstrained length. Additionally, by integrating a shortcut loss into our training process, we achieve rapid generation of high-quality audio samples in as few as eight sampling steps, paving the way for near-real-time applications without requiring dedicated fine-tuning or retraining. We demonstrate that SALSA-V significantly outperforms existing state-of-the-art methods in both audiovisual alignment and synchronization with video content in quantitative evaluation and a human listening study. Furthermore, our use of random masking during training enables our model to match spectral characteristics of reference audio samples, broadening its applicability to professional audio synthesis tasks such as Foley generation and sound design.
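
The shortcut idea can be sketched generically: condition the velocity network on the step size it must jump, then sample with a handful of large Euler steps. The toy dimensions below are assumptions, and SALSA-V's video conditioning is omitted.

```python
import torch

class Velocity(torch.nn.Module):
    """Toy velocity field conditioned on time t and step size d (the shortcut)."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 2, 128), torch.nn.SiLU(), torch.nn.Linear(128, dim))
    def forward(self, x, t, d):
        return self.net(torch.cat([x, t, d], dim=-1))

@torch.no_grad()
def shortcut_sample(model, batch=4, dim=32, steps=8):
    """Few-step sampling in the spirit of shortcut models: the network sees the
    jump size it must take, enabling 8-step generation."""
    x = torch.randn(batch, dim)
    d = torch.full((batch, 1), 1.0 / steps)
    for i in range(steps):
        t = torch.full((batch, 1), i / steps)
        x = x + d * model(x, t, d)      # one large Euler jump per step
    return x

print(shortcut_sample(Velocity()).shape)  # torch.Size([4, 32])
```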

[269] AudioToolAgent: An Agentic Framework for Audio-Language Models

Gijs Wijngaard, Elia Formisano, Michel Dumontier

Main category: cs.SD

TL;DR: AudioToolAgent is a framework that coordinates audio-language models as tools via a central LLM agent for audio question answering and speech-to-text, achieving state-of-the-art accuracy on multiple benchmarks without training costs.

DetailsMotivation: Large Audio-Language Models (LALMs) perform well on audio understanding but lack multi-step reasoning and tool-calling capabilities found in recent LLMs.

Method: A central LLM agent coordinates audio-language models as tools via tool adapters, selecting tools, asking follow-up questions, and comparing outputs for verification. Uses Monte Carlo sampling for Shapley values to identify effective agent-tool combinations.

Result: Achieved state-of-the-art accuracy: 74.10% on MMAU, 68.80% on MMAR, and 57.96% on MMAU-Pro. Identified effective agent-tool combinations across 374 configurations.

Conclusion: The modular framework enables integration of new tools, avoids data collection and training costs, and provides effective coordination between LLM agents and audio tools for improved audio understanding tasks.

Abstract: Large Audio-Language Models (LALMs) perform well on audio understanding tasks but lack the multi-step reasoning and tool-calling found in recent Large Language Models (LLMs). This paper presents AudioToolAgent, a framework that coordinates audio-language models as tools via a central LLM agent that accesses tool adapters for audio question answering and speech-to-text. The agent selects tools, asks follow-up questions, and compares outputs for verification. Experiments with MMAU, MMAR, and MMAU-Pro show state-of-the-art accuracy: up to 74.10% on MMAU, 68.80% on MMAR, and 57.96% on MMAU-Pro. Monte Carlo sampling for Shapley values across 374 configurations identifies effective agent-tool combinations. The modular design allows integration of new tools and avoids data collection and training costs. Code and reproduction materials are available at: github.com/GLJS/AudioToolAgent
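
A minimal sketch of the coordination pattern, with toy stand-ins for the adapters and the central LLM; this illustrates the adapter interface and compare-for-verification loop, not the repository's actual API.

```python
from typing import Callable, Dict

# Tool adapters wrap audio-language models behind a uniform interface.
ToolFn = Callable[[str, str], str]   # (audio_path, question) -> answer

def make_agent(tools: Dict[str, ToolFn], llm: Callable[[str], str]):
    def answer(audio_path: str, question: str) -> str:
        # 1) query every adapter, 2) let the central LLM compare and verify
        candidates = {name: fn(audio_path, question) for name, fn in tools.items()}
        prompt = (f"Question about an audio clip: {question}\n"
                  + "\n".join(f"- {n}: {a}" for n, a in candidates.items())
                  + "\nPick the best-supported answer.")
        return llm(prompt)
    return answer

# toy stand-ins for an LALM adapter, an ASR adapter, and the central LLM
tools = {"lalm": lambda p, q: "a dog barking",
         "asr": lambda p, q: "(no speech detected)"}
agent = make_agent(tools, llm=lambda prompt: "a dog barking")
print(agent("clip.wav", "What sound is in the clip?"))
```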

[270] PAGURI: a user experience study of creative interaction with text-to-music models

Francesca Ronchini, Luca Comanducci, Gabriele Perego, Fabio Antonacci

Main category: cs.SD

TL;DR: This paper presents PAGURI, a user experience study investigating how musicians interact with text-to-music models, revealing both creative potential and significant limitations in real-world integration.

DetailsMotivation: To understand how text-to-music models can be realistically integrated into the artistic practice of musicians and music practitioners, as current models showcase technological progress but lack clear practical applications.

Method: Developed an online tool for music generation and personalization via fine-tuning, then conducted semi-structured interviews with musicians to analyze their interactions and experiences with the system.

Result: Participants recognized the creative potential and appreciated personalization features, but identified key challenges including prompt ambiguity, limited controllability, and workflow integration issues.

Conclusion: Text-to-music models show promise but require addressing significant usability challenges before they can be effectively integrated into real-world music creation practices.

Abstract: In recent years, text-to-music models have been the biggest breakthrough in automatic music generation. While they are unquestionably a showcase of technological progress, it is not clear yet how they can be realistically integrated into the artistic practice of musicians and music practitioners. This paper aims to address this question via Prompt Audio Generation User Research Investigation (PAGURI), a user experience study where we leverage recent text-to-music developments to study how musicians and practitioners interact with these systems, evaluating their satisfaction levels. We developed an online tool through which users can generate music samples and/or apply recently proposed personalization techniques based on fine-tuning to allow the text-to-music model to generate sounds closer to their needs and preferences. Using semi-structured interviews, we analyzed different aspects related to how participants interacted with the proposed tool to understand the current effectiveness and limitations of text-to-music models in enhancing users’ creativity. Our research centers on user experiences to uncover insights that can guide the future development of TTM models and their role in AI-driven music creation. Additionally, participants offered insightful perspectives on potential system improvements and their integration into their music practices. The results obtained through the study reveal the pros and cons of the use of TTMs for creative endeavors. Participants recognized the system’s creative potential and appreciated the usefulness of its personalization features. However, they also identified several challenges that must be addressed before TTMs are ready for real-world music creation, particularly issues of prompt ambiguity, limited controllability, and integration into existing workflows.

[271] A Speech Enhancement Method Using Fast Fourier Transform and Convolutional Autoencoder

Pu-Yun Kow, Pu-Zhao Kow

Main category: cs.SD

TL;DR: A lightweight FFT-ConvAE model achieved second place in the Helsinki Speech Challenge 2024; its results, together with other teams', suggest that lightweight and even neural-network-free approaches can effectively reconstruct audio from degraded signals.

DetailsMotivation: To address the problem of reconstructing audio signals from degraded measurements and explore neural-network-free approaches for speech signal reconstruction.

Method: Proposed a lightweight model combining discrete Fourier transform with a Convolutional Autoencoder (FFT-ConvAE).

Result: Achieved second place in the Helsinki Speech Challenge 2024, demonstrating competitive performance against other approaches.

Conclusion: Lightweight models such as FFT-ConvAE (which pairs an FFT front end with a convolutional autoencoder), together with the neural-network-free approaches highlighted by the challenge results, show significant potential for effective speech signal reconstruction.

Abstract: This paper addresses the reconstruction of audio signals from degraded measurements. We propose a lightweight model that combines the discrete Fourier transform with a Convolutional Autoencoder (FFT-ConvAE), which enabled our team to achieve second place in the Helsinki Speech Challenge 2024. Our results, together with those of other teams, demonstrate the potential of neural-network-free approaches for effective speech signal reconstruction.
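
A minimal sketch of the FFT-ConvAE idea: move to the frequency domain, let a small convolutional autoencoder clean the magnitudes, and invert with the degraded signal's phase. The window size and channel layout are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class FFTConvAE(nn.Module):
    """Denoise STFT magnitudes with a small convolutional autoencoder."""
    def __init__(self, n_fft=512):
        super().__init__()
        self.n_fft = n_fft
        freq_bins = n_fft // 2 + 1
        self.autoencoder = nn.Sequential(
            nn.Conv1d(freq_bins, 128, 3, padding=1), nn.ReLU(),
            nn.Conv1d(128, freq_bins, 3, padding=1), nn.Softplus())

    def forward(self, wav):
        window = torch.hann_window(self.n_fft)
        spec = torch.stft(wav, self.n_fft, hop_length=self.n_fft // 4,
                          window=window, return_complex=True)
        mag, phase = spec.abs(), spec.angle()
        clean_mag = self.autoencoder(mag)            # (B, freq, frames)
        clean = torch.polar(clean_mag, phase)        # reuse the degraded phase
        return torch.istft(clean, self.n_fft, hop_length=self.n_fft // 4,
                           window=window, length=wav.size(-1))

model = FFTConvAE()
degraded = torch.randn(2, 16000)
print(model(degraded).shape)  # torch.Size([2, 16000])
```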

[272] SingMOS-Pro: A Comprehensive Benchmark for Singing Quality Assessment

Yuxun Tang, Lan Liu, Wenhao Feng, Yiwen Zhao, Jionghao Han, Yifeng Yu, Jiatong Shi, Qin Jin

Main category: cs.SD

TL;DR: SingMOS-Pro is a comprehensive dataset for automatic singing quality assessment that expands on SingMOS with detailed annotations for lyrics, melody, and overall quality across 7,981 singing clips from 41 models.

DetailsMotivation: Current singing voice generation evaluation relies on costly human listening tests, while existing objective metrics capture limited perceptual aspects, creating a need for better automatic assessment methods.

Method: Created SingMOS-Pro dataset with 7,981 singing clips from 41 models across 12 datasets, each rated by at least five professional annotators for lyrics, melody, and overall quality, and benchmarked various evaluation methods.

Result: The dataset provides reliable and consistent annotations with broader coverage and greater diversity than previous versions, establishing strong baselines for singing quality assessment research.

Conclusion: SingMOS-Pro offers a practical reference for future research in singing quality assessment and enables more effective utilization of MOS data annotated under different standards.

Abstract: Singing voice generation progresses rapidly, yet evaluating singing quality remains a critical challenge. Human subjective assessment, typically in the form of listening tests, is costly and time consuming, while existing objective metrics capture only limited perceptual aspects. In this work, we introduce SingMOS-Pro, a dataset for automatic singing quality assessment. Building on our preview version SingMOS, which provides only overall ratings, SingMOS-Pro expands annotations of the additional part to include lyrics, melody, and overall quality, offering broader coverage and greater diversity. The dataset contains 7,981 singing clips generated by 41 models across 12 datasets, spanning from early systems to recent advances. Each clip receives at least five ratings from professional annotators, ensuring reliability and consistency. Furthermore, we explore how to effectively utilize MOS data annotated under different standards and benchmark several widely used evaluation methods from related tasks on SingMOS-Pro, establishing strong baselines and practical references for future research. The dataset can be accessed at https://huggingface.co/datasets/TangRain/SingMOS-Pro.
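
The dataset is hosted on the Hugging Face Hub, so a first look takes a few lines, assuming it loads with the standard datasets API; splits and column names are inspected here rather than assumed.

```python
from datasets import load_dataset

ds = load_dataset("TangRain/SingMOS-Pro")
print(ds)                                   # splits, sizes, column names
first_split = list(ds.keys())[0]
sample = next(iter(ds[first_split]))
print(sample.keys())                        # inspect available annotation fields
```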

cs.LG

[273] Extreme value forecasting using relevance-based data augmentation with deep learning models

Junru Hua, Rahul Ahluwalia, Rohitash Chandra

Main category: cs.LG

TL;DR: This paper presents a data augmentation framework for extreme value forecasting using GANs and SMOTE with deep learning models like Conv-LSTM and BD-LSTM for multistep prediction.

DetailsMotivation: Extreme value forecasting is challenging but important for applications from finance to climate change. Data augmentation with GANs has been popular for class imbalance problems, but its application to extreme value forecasting needs investigation.

Method: Used deep learning models (Conv-LSTM and BD-LSTM) with data augmentation models (GANs and SMOTE) for multistep ahead prediction. Developed novel strategies for incorporating data augmentation based on a relevance function for extreme values.

Result: SMOTE-based strategy consistently showed superior adaptability and improved performance across both short- and long-horizon forecasts. Conv-LSTM excels in periodic, stable datasets, while BD-LSTM performs better in chaotic or non-stationary sequences.

Conclusion: The framework successfully combines data augmentation with deep learning for extreme value forecasting, with SMOTE demonstrating better performance than GANs, and different LSTM architectures showing complementary strengths for different data characteristics.

Abstract: Data augmentation with generative adversarial networks (GANs) has been popular for class imbalance problems, mainly for pattern classification and computer vision-related applications. Extreme value forecasting is a challenging field that has various applications from finance to climate change problems. In this study, we present a data augmentation framework for extreme value forecasting. In this framework, our focus is on forecasting extreme values using deep learning models in combination with data augmentation models such as GANs and synthetic minority oversampling technique (SMOTE). We use deep learning models such as convolutional long short-term memory (Conv-LSTM) and bidirectional long short-term memory (BD-LSTM) networks for multistep ahead prediction featuring extremes. We investigate which data augmentation models are the most suitable, taking into account the prediction accuracy overall and at extreme regions, along with computational efficiency. We also present novel strategies for incorporating data augmentation, considering extreme values based on a relevance function. Our results indicate that the SMOTE-based strategy consistently demonstrated superior adaptability, leading to improved performance across both short- and long-horizon forecasts. Conv-LSTM and BD-LSTM exhibit complementary strengths: the former excels in periodic, stable datasets, while the latter performs better in chaotic or non-stationary sequences.
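
One simple instantiation of relevance-based augmentation: binarize the target with a hard relevance threshold on extremes, then let SMOTE interpolate new extreme windows, carrying the target along as an extra column so synthetic windows get synthetic targets. The paper's relevance function is more general than this hard threshold; the snippet uses imbalanced-learn.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def augment_extremes(X, y, threshold):
    """Oversample windows whose target exceeds a relevance threshold."""
    relevance = (y > threshold).astype(int)          # 1 = extreme region
    Xy = np.column_stack([X, y])                     # keep targets alongside inputs
    Xy_res, _ = SMOTE().fit_resample(Xy, relevance)  # interpolate extreme windows
    return Xy_res[:, :-1], Xy_res[:, -1]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                       # lagged input windows
y = rng.gumbel(size=500)                             # heavy-tailed target
X_aug, y_aug = augment_extremes(X, y, threshold=np.quantile(y, 0.9))
print(X.shape, "->", X_aug.shape)
```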

[274] OpenTSLM: Time-Series Language Models for Reasoning over Multivariate Medical Text- and Time-Series Data

Patrick Langer, Thomas Kaar, Max Rosenblattl, Maxwell A. Xu, Winnie Chow, Martin Maritsch, Aradhana Verma, Brian Han, Daniel Seung Kim, Henry Chubb, Scott Ceresnak, Aydin Zahedivash, Alexander Tarlochan Singh Sandhu, Fatima Rodriguez, Daniel McDuff, Elgar Fleisch, Oliver Aalami, Filipe Barata, Paul Schmiedmayer

Main category: cs.LG

TL;DR: OpenTSLM is a family of Time Series Language Models that integrate time series as a native modality into pretrained LLMs, enabling multimodal reasoning over time series data. It introduces two architectures that outperform text-only models and GPT-4o on medical time series reasoning tasks.

DetailsMotivation: LLMs are powerful for multimodal data but have limitations in handling time series data, which is crucial in medical applications for synthesizing clinical information into actionable insights.

Method: Two architectures: OpenTSLM-SoftPrompt (implicit modeling with learnable tokens) and OpenTSLM-Flamingo (explicit modeling via cross-attention). Both integrate time series with text for Chain-of-Thought reasoning tasks.

Result: OpenTSLM models outperform baselines significantly, achieving 69.9 F1 in sleep staging and 65.4 in HAR vs 9.05 and 52.2 for text-only models. Even 1B-parameter models surpass GPT-4o. OpenTSLM-Flamingo handles longer sequences better with stable memory requirements.

Conclusion: OpenTSLM successfully enables LLMs to reason over time series data, with explicit modeling (Flamingo) scaling better than implicit approaches. The models show strong clinical reasoning capabilities and are provided open-source for further research.

Abstract: LLMs have emerged as powerful tools for interpreting multimodal data. In medicine, they hold particular promise for synthesizing large volumes of clinical information into actionable insights and digital health applications. Yet, a major limitation remains their inability to handle time series. To overcome this gap, we present OpenTSLM, a family of Time Series Language Models (TSLMs) created by integrating time series as a native modality to pretrained LLMs, enabling reasoning over multiple time series of any length. We investigate two architectures for OpenTSLM. The first, OpenTSLM-SoftPrompt, models time series implicitly by concatenating learnable time series tokens with text tokens via soft prompting. Although parameter-efficient, we hypothesize that explicit time series modeling scales better and outperforms implicit approaches. We thus introduce OpenTSLM-Flamingo, which integrates time series with text via cross-attention. We benchmark both variants against baselines that treat time series as text tokens or plots, across a suite of text-time-series Chain-of-Thought (CoT) reasoning tasks. We introduce three datasets: HAR-CoT, Sleep-CoT, and ECG-QA-CoT. Across all, OpenTSLM models outperform baselines, reaching 69.9 F1 in sleep staging and 65.4 in HAR, compared to 9.05 and 52.2 for finetuned text-only models. Notably, even 1B-parameter OpenTSLM models surpass GPT-4o (15.47 and 2.95). OpenTSLM-Flamingo matches OpenTSLM-SoftPrompt in performance and outperforms on longer sequences, while maintaining stable memory requirements. By contrast, SoftPrompt grows exponentially in memory with sequence length, requiring around 110 GB compared to 40 GB VRAM when training on ECG-QA with LLaMA-3B. Expert reviews by clinicians find strong reasoning capabilities exhibited by OpenTSLMs on ECG-QA. To facilitate further research, we provide all code, datasets, and models open-source.
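
The soft-prompt variant is conceptually a projection plus concatenation. The sketch below patches a raw series, projects the patches into the LLM embedding space as learnable time-series tokens, and prepends them to text embeddings; the patch length and model width are assumptions, not OpenTSLM's settings.

```python
import torch
import torch.nn as nn

class SoftPromptTS(nn.Module):
    """Project time-series patches into an LLM's embedding space and prepend
    them to the text token embeddings (fed to a frozen LLM downstream)."""
    def __init__(self, patch_len=16, d_model=2048):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)    # learnable TS tokens

    def forward(self, series, text_embeds):
        B, T = series.shape
        patches = series[:, : T - T % self.patch_len].view(B, -1, self.patch_len)
        ts_tokens = self.proj(patches)                     # (B, n_patches, d)
        return torch.cat([ts_tokens, text_embeds], dim=1)

mod = SoftPromptTS()
series = torch.randn(2, 300)             # e.g., an ECG or accelerometer trace
text = torch.randn(2, 24, 2048)          # embedded question tokens
print(mod(series, text).shape)           # torch.Size([2, 42, 2048])
```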

[275] RainSeer: Fine-Grained Rainfall Reconstruction via Physics-Guided Modeling

Lin Chen, Jun Chen, Minghui Qiu, Shuxin Zhong, Binghong Chen, Kaishun Wu

Main category: cs.LG

TL;DR: RainSeer is a structure-aware rainfall reconstruction framework that uses radar reflectivity as a physical prior to capture sharp transitions and localized extremes, addressing challenges in spatial alignment and physical disconnect between aloft hydro-meteors and ground-level precipitation.

DetailsMotivation: Existing spatial interpolation methods for rainfall reconstruction often over-smooth critical structures, failing to capture sharp transitions and localized extremes, which is essential for flood forecasting, hydrological modeling, and climate analysis.

Method: RainSeer uses a physics-informed two-stage architecture: (1) Structure-to-Point Mapper performs spatial alignment by projecting mesoscale radar structures into localized ground-level rainfall through bidirectional mapping, and (2) Geo-Aware Rain Decoder captures semantic transformation of hydro-meteors via causal spatiotemporal attention mechanism.

Result: Evaluation on RAIN-F (Korea) and MeteoNet (France) datasets shows consistent improvements over state-of-the-art baselines, reducing MAE by over 13.31% and significantly enhancing structural fidelity in reconstructed rainfall fields.

Conclusion: RainSeer effectively addresses the fundamental challenges of translating high-resolution volumetric radar fields into sparse point-wise rainfall observations and bridging the physical disconnect between aloft hydro-meteors and ground-level precipitation.

Abstract: Reconstructing high-resolution rainfall fields is essential for flood forecasting, hydrological modeling, and climate analysis. However, existing spatial interpolation methods-whether based on automatic weather station (AWS) measurements or enhanced with satellite/radar observations often over-smooth critical structures, failing to capture sharp transitions and localized extremes. We introduce RainSeer, a structure-aware reconstruction framework that reinterprets radar reflectivity as a physically grounded structural prior-capturing when, where, and how rain develops. This shift, however, introduces two fundamental challenges: (i) translating high-resolution volumetric radar fields into sparse point-wise rainfall observations, and (ii) bridging the physical disconnect between aloft hydro-meteors and ground-level precipitation. RainSeer addresses these through a physics-informed two-stage architecture: a Structure-to-Point Mapper performs spatial alignment by projecting mesoscale radar structures into localized ground-level rainfall, through a bidirectional mapping, and a Geo-Aware Rain Decoder captures the semantic transformation of hydro-meteors through descent, melting, and evaporation via a causal spatiotemporal attention mechanism. We evaluate RainSeer on two public datasets-RAIN-F (Korea, 2017-2019) and MeteoNet (France, 2016-2018)-and observe consistent improvements over state-of-the-art baselines, reducing MAE by over 13.31% and significantly enhancing structural fidelity in reconstructed rainfall fields.

[276] How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models

Parth Asawa, Alan Zhu, Matei Zaharia, Alexandros G. Dimakis, Joseph E. Gonzalez

Main category: cs.LG

TL;DR: Advisor Models are lightweight parametric policies trained with RL to dynamically issue natural language steering instructions to black-box foundation models, outperforming static prompt optimization by adapting to different inputs and environments.

DetailsMotivation: Foundation models are increasingly deployed as black-box services where model weights cannot be modified, and static prompt optimization produces fixed prompts that fail to adapt to different inputs, users, or environments.

Method: Train lightweight parametric policies using reinforcement learning to reactively issue natural language steering instructions in-context to black-box models. The advisor acts as a second small model that sits between input and model, shaping behavior per-instance using reward signals.

Result: Across multiple domains involving reasoning and personalization, Advisor Models outperform static prompt optimizers, discover environment dynamics, improve downstream task performance, transfer across black-box models, and achieve specialization while maintaining robustness to out-of-distribution inputs.

Conclusion: Advisor Models provide a learnable interface to black-box systems, acting as parametric environment-specific memory, and represent a promising direction for enabling personalization and environment-adaptable AI with frontier-level capabilities.

Abstract: Foundation models are increasingly deployed as black-box services, where model weights cannot be modified and customization is limited to prompting. While static prompt optimization has shown promise, it produces a single fixed prompt that fails to adapt to different inputs, users, or environments. We introduce Advisor Models, lightweight parametric policies trained with reinforcement learning to reactively issue natural language steering instructions in-context to black-box models. The advisor is a second small model that sits between the input and the model, shaping behavior on a per-instance basis using reward signals from the environment. Across multiple domains involving reasoning and personalization, we show that Advisor Models outperform static prompt optimizers, discovering environment dynamics and improving downstream task performance. We also demonstrate the generalizability of advisors by transferring them across black-box models, as well as the framework’s ability to achieve specialization while retaining robustness to out-of-distribution inputs. Viewed more broadly, Advisor Models provide a learnable interface to black-box systems where the advisor acts as a parametric, environment-specific memory. We argue that dynamic optimization of black-box models via Advisor Models is a promising direction for enabling personalization and environment-adaptable AI with frontier-level capabilities.
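
A toy REINFORCE sketch of the pattern: a tiny advisor policy picks a steering instruction to prepend, the black-box model is an opaque call, and the environment reward updates only the advisor. Real Advisor Models generate free-form instructions per instance; the fixed advice list, stub model, and stub reward here are illustrative assumptions.

```python
import torch

ADVICE = ["Answer concisely.", "Reason step by step.", "Cite the context."]
logits = torch.zeros(len(ADVICE), requires_grad=True)   # tiny advisor "policy"
opt = torch.optim.Adam([logits], lr=0.1)

def black_box(prompt: str) -> str:
    return "response"                                   # opaque API call (stub)

def reward(response: str) -> float:
    return 1.0                                          # environment feedback (stub)

for step in range(100):
    dist = torch.distributions.Categorical(logits=logits)
    a = dist.sample()                                   # choose an instruction
    r = reward(black_box(ADVICE[a] + " ... user input ..."))
    loss = -dist.log_prob(a) * r                        # REINFORCE update
    opt.zero_grad(); loss.backward(); opt.step()
```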

[277] Market-Based Data Subset Selection – Principled Aggregation of Multi-Criteria Example Utility

Ashish Jha, Valentin Leplat, AH Phan

Main category: cs.LG

TL;DR: Proposes a market-based data selection method using prediction markets to price training examples, with automatic signal aggregation, explicit token budget handling, and diversity mechanisms.

DetailsMotivation: Traditional data selection methods combine heterogeneous utility signals (uncertainty, rarity, diversity) with ad hoc weights, lacking principled aggregation and explicit budget handling.

Method: Uses cost-function prediction market (LMSR) to price examples, with signals as traders. Includes topic-wise normalization, price-per-token rule for budget control, diversity head for coverage, and quantifies coverage via topic cluster coverage and effective sample size.

Result: On GSM8K with 60k-token budget: achieves parity with strong baselines while reducing seed variance and <0.1 GPU-hr overhead. On AGNews at 5-25% kept data: competitive accuracy with improved balance and stability.

Conclusion: The framework unifies multi-signal data curation under fixed compute constraints for both prompt-level reasoning and classification tasks, providing transparent aggregation with interpretable parameters.

Abstract: Selecting a small yet useful subset of training data is hard because signals of example utility (uncertainty, rarity, diversity, etc.) are heterogeneous and typically combined with ad hoc weights. We propose a market-based selector that prices each example via a cost-function prediction market (LMSR): signals act as traders, a single liquidity parameter controls concentration, and topic-wise normalization stabilizes calibration. Token budgets are handled explicitly by a price-per-token rule $\rho=p/\ell^{\gamma}$, with $\gamma$ exposing an interpretable length bias; a lightweight diversity head improves coverage. We quantify coverage via topic cluster coverage and effective sample size. On the theory side, we show that LMSR implements a maximum-entropy aggregation with exponential weighting and a convex objective, yielding transparent knobs for aggregation strength. Empirically, on GSM8K (60k-token budget) the market with diversity achieves parity with strong single-signal baselines while reducing seed variance and incurring $<0.1$ GPU-hr selection overhead; on AGNews at kept=5-25% the market (with light balancing) delivers competitive accuracy with improved balance and stability. The framework unifies multi-signal data curation under fixed compute for prompt-level reasoning and classification.
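
The quoted pricing rules are easy to reproduce. Below, LMSR prices are the gradient of the cost $C(q)=b\log\sum_i e^{q_i/b}$, i.e. a softmax over scaled positions with liquidity $b$ controlling concentration, and the price-per-token rule $\rho=p/\ell^{\gamma}$ drives a greedy selection under a token budget. The two toy signals, lengths, and budget are assumptions.

```python
import numpy as np

def lmsr_prices(q, b=1.0):
    """LMSR prices: softmax of q / b (gradient of b * logsumexp(q / b))."""
    z = q / b - np.max(q / b)
    e = np.exp(z)
    return e / e.sum()

# each row: one utility signal's (trader's) position over 5 candidate examples
q = np.array([[0.9, 0.1, 0.4, 0.2, 0.8],    # uncertainty signal
              [0.2, 0.7, 0.1, 0.9, 0.3]])   # rarity signal
p = lmsr_prices(q.sum(axis=0), b=0.5)       # aggregate market price per example
lengths = np.array([120, 40, 60, 200, 80])  # token lengths
gamma = 1.0                                  # interpretable length-bias knob
rho = p / lengths**gamma                     # price-per-token rule

budget = 200                                 # token budget
kept, spent = [], 0
for i in np.argsort(-rho):                   # greedily buy best value per token
    if spent + lengths[i] <= budget:
        kept.append(int(i)); spent += int(lengths[i])
print(kept, spent)
```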

[278] Assessing the Potential for Catastrophic Failure in Dynamic Post-Training Quantization

Logan Frank, Paul Ardis

Main category: cs.LG

TL;DR: This paper investigates catastrophic performance failures in post-training quantization (PTQ) by identifying network-policy pairs that can cause up to 65% accuracy drops, compared to robust counterparts with <2% decreases.

DetailsMotivation: To understand the potential for drastic performance reduction in PTQ when deployed in safety-critical environments, and to identify characteristics of input distributions that may cause this failure.

Method: Formulated a knowledge distillation and reinforcement learning task to learn network and bit-width policy pairs, analyzing catastrophic failure under quantization in terms of worst-case potential.

Result: Confirmed existence of “detrimental” network-policy pairs causing 10-65% accuracy reductions, while robust counterparts showed <2% decreases. Provided initial exploration of highest vulnerability points.

Conclusion: The findings emphasize the need for caution in real-world PTQ deployment and encourage more rigorous robustness examinations and safety considerations in deep learning.

Abstract: Post-training quantization (PTQ) has recently emerged as an effective tool for reducing the computational complexity and memory usage of a neural network by representing its weights and activations with lower precision. While this paradigm has shown great success in lowering compute and storage costs, there is the potential for drastic performance reduction depending upon the distribution of inputs experienced in inference. When considering possible deployment in safety-critical environments, it is important to investigate the extent of potential performance reduction, and what characteristics of input distributions may give rise to this reduction. In this work, we explore the idea of extreme failure stemming from dynamic PTQ and formulate a knowledge distillation and reinforcement learning task to learn a network and bit-width policy pair such that catastrophic failure under quantization is analyzed in terms of worst case potential. Our results confirm the existence of this “detrimental” network-policy pair, with several instances demonstrating performance reductions in the range of 10-65% in accuracy, compared to their “robust” counterparts encountering a <2% decrease. From systematic experimentation and analyses, we also provide an initial exploration into points at highest vulnerability. While our results represent an initial step toward understanding failure cases introduced by PTQ, our findings ultimately emphasize the need for caution in real-world deployment scenarios. We hope this work encourages more rigorous examinations of robustness and a greater emphasis on safety considerations for future works within the broader field of deep learning.

[279] SAGE: Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection

Ashish Jha, Salman Ahmadi-Asl

Main category: cs.LG

TL;DR: SAGE is a streaming data-subset selection method that uses Frequent Directions sketching to maintain gradient geometry in constant memory, prioritizing examples with gradient alignment to consensus directions for efficient training.

DetailsMotivation: Training modern neural networks on large datasets is computationally and energy intensive, requiring more efficient methods that reduce memory and compute requirements while maintaining competitive accuracy.

Method: SAGE maintains a compact Frequent Directions sketch of gradient geometry in O(ℓD) memory, prioritizes examples whose sketched gradients align with consensus direction, eliminates N×N pairwise similarities and explicit N×ℓ gradient stores, and uses a simple two-pass GPU-friendly pipeline.

Result: Across multiple benchmarks, SAGE trains with small kept-rate budgets while retaining competitive accuracy relative to full-data training and recent subset-selection baselines, and reduces end-to-end compute and peak memory.

Conclusion: SAGE offers a practical, constant-memory alternative that complements pruning and model compression for efficient training, providing deterministic approximation guarantees through Frequent Directions sketching.

Abstract: Training modern neural networks on large datasets is computationally and energy intensive. We present SAGE, a streaming data-subset selection method that maintains a compact Frequent Directions (FD) sketch of gradient geometry in $O(\ell D)$ memory and prioritizes examples whose sketched gradients align with a consensus direction. The approach eliminates $N \times N$ pairwise similarities and explicit $N \times \ell$ gradient stores, yielding a simple two-pass, GPU-friendly pipeline. Leveraging FD’s deterministic approximation guarantees, we analyze how agreement scoring preserves gradient energy within the principal sketched subspace. Across multiple benchmarks, SAGE trains with small kept-rate budgets while retaining competitive accuracy relative to full-data training and recent subset-selection baselines, and reduces end-to-end compute and peak memory. Overall, SAGE offers a practical, constant-memory alternative that complements pruning and model compression for efficient training.
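
A numpy sketch of the two ingredients: a Frequent Directions update that keeps an ell x D sketch of the gradient stream, and agreement scoring against the sketch's top direction. The shrinkage index and the 10% kept rate are illustrative choices, not SAGE's exact ones.

```python
import numpy as np

def fd_update(B, g):
    """One Frequent Directions step: insert gradient g into a zero row of the
    sketch B (ell x D); when full, shrink via SVD so B keeps the dominant
    gradient subspace in O(ell * D) memory."""
    zero_rows = np.where(~B.any(axis=1))[0]
    if len(zero_rows) == 0:
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        s2 = np.maximum(s**2 - s[len(s) // 2]**2, 0.0)   # shrink by mid-rank energy
        B = np.sqrt(s2)[:, None] * Vt
        zero_rows = np.where(~B.any(axis=1))[0]
    B[zero_rows[0]] = g
    return B

rng = np.random.default_rng(0)
ell, D, N = 8, 64, 200
B = np.zeros((ell, D))
grads = rng.normal(size=(N, D)) + 2.0 * rng.normal(size=D)  # shared direction
for g in grads:
    B = fd_update(B, g)

# consensus direction = top right singular vector of the sketch;
# score each example by how well its gradient agrees with the consensus
consensus = np.linalg.svd(B, full_matrices=False)[2][0]
scores = grads @ consensus / (np.linalg.norm(grads, axis=1) + 1e-12)
keep = np.argsort(-np.abs(scores))[: N // 10]               # 10% kept-rate budget
print(keep[:10])
```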

[280] Uncertainty-Guided Model Selection for Tabular Foundation Models in Biomolecule Efficacy Prediction

Jie Li, Andrew McCarthy, Zhizhuo Zhang, Stephen Young

Main category: cs.LG

TL;DR: Using model uncertainty (inter-quantile range) to guide ensemble selection improves siRNA knockdown efficacy prediction without needing ground truth labels.

DetailsMotivation: In-context learners like TabPFN show promise for biomolecule efficacy prediction but are sensitive to context selection. Current approaches lack methods to select the best models for ensembling without access to ground truth labels.

Method: Proposed an uncertainty-guided strategy using TabPFN with sequence-based features, where model selection for ensembling is based on the lowest mean inter-quantile range (IQR) as a measure of uncertainty.

Result: TabPFN with simple sequence features outperformed specialized state-of-the-art predictors. Model’s IQR showed negative correlation with true prediction error. Ensembling models with lowest mean IQR achieved superior performance compared to naive ensembling or single models.

Conclusion: Model uncertainty serves as a powerful, label-free heuristic for optimizing biomolecule efficacy predictions through effective ensemble selection.

Abstract: In-context learners like TabPFN are promising for biomolecule efficacy prediction, where established molecular feature sets and relevant experimental results can serve as powerful contextual examples. However, their performance is highly sensitive to the provided context, making strategies like post-hoc ensembling of models trained on different data subsets a viable approach. An open question is how to select the best models for the ensemble without access to ground truth labels. In this study, we investigate an uncertainty-guided strategy for model selection. We demonstrate on an siRNA knockdown efficacy task that a TabPFN model using simple sequence-based features can surpass specialized state-of-the-art predictors. We also show that the model’s predicted inter-quantile range (IQR), a measure of its uncertainty, has a negative correlation with true prediction error. By selecting and averaging an ensemble of models with the lowest mean IQR, we achieve superior performance compared to naive ensembling or using a single model trained on all available data. This finding highlights model uncertainty as a powerful, label-free heuristic for optimizing biomolecule efficacy predictions.
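
The selection rule itself is a few lines: compute each model's mean inter-quantile range from its (q25, q50, q75) predictions and average only the most confident members. The simulated predictions below are placeholders for models trained on different context subsets.

```python
import numpy as np

def iqr(quantile_preds):
    """Per-example uncertainty from (n, 3) quantile predictions,
    columns = (q25, q50, q75): inter-quantile range = q75 - q25."""
    return quantile_preds[:, 2] - quantile_preds[:, 0]

# 5 models, each emitting (q25, q50, q75) for 100 candidate siRNAs (simulated)
rng = np.random.default_rng(0)
models = [rng.normal(size=(100, 1))
          + np.array([[-0.5, 0.0, 0.5]]) * rng.uniform(0.2, 2.0)
          for _ in range(5)]

mean_iqrs = np.array([iqr(m).mean() for m in models])
best = np.argsort(mean_iqrs)[:3]                  # 3 most confident models
ensemble = np.mean([models[i][:, 1] for i in best], axis=0)  # average medians
print("selected models:", best, "ensemble shape:", ensemble.shape)
```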

[281] Litespark Technical Report: High-Throughput, Energy-Efficient LLM Training Framework

Nii Osae Osae Dade, Moinul Hossain Rahat

Main category: cs.LG

TL;DR: Litespark is a novel pre-training framework that achieves 2x-6x training throughput improvement and 55%-83% energy reduction for LLMs through optimizations to transformer attention and MLP layers.

DetailsMotivation: Address the massive computational costs and energy consumption of training large language models, which currently require months of computation and gigawatt-hours of electricity.

Method: Combines architectural improvements with algorithmic enhancements to transformer attention and MLP layers, focusing on maximizing Model FLOPs Utilization while maintaining compatibility with standard transformer implementations.

Result: Demonstrated 2x-6x training throughput improvement and 55%-83% energy consumption reduction on 3B and 30B parameter Llama models using SlimPajama-627B dataset across multi-node H200 GPU clusters.

Conclusion: Litespark provides model- and hardware-agnostic optimizations that enable substantial efficiency gains in LLM training while maintaining broad applicability across transformer architectures and extending to post-training phases.

Abstract: Training Large Language Models (LLMs) is plagued by long training times and massive energy consumption, with modern models requiring months of computation and gigawatt-hours of electricity. In light of these challenges,we introduce Litespark, a novel pre-training framework that addresses these inefficiencies through targeted optimizations to transformer attention and MLP layers. Our approach combines architectural improvements with algorithmic enhancements to maximize Model FLOPs Utilization (MFU) while maintaining compatibility with standard transformer implementations. Comprehensive benchmarking on 3B and 30B parameter Llama models using the SlimPajama-627B dataset demonstrates substantial performance gains: 2x-6x training throughput improvement and $55%-83$% energy consumption reduction across multi-node H200 GPU clusters. These optimizations are model- and hardware-agnostic, enabling broad applicability across transformer architectures and extending to post-training phases including supervised fine-tuning and direct preference optimization.

[282] From Pixels to Factors: Learning Independently Controllable State Variables for Reinforcement Learning

Rafael Rodriguez-Sanchez, Cameron Allen, George Konidaris

Main category: cs.LG

TL;DR: ACF is a contrastive learning method that discovers independently controllable latent variables from high-dimensional observations, enabling sample-efficient reinforcement learning by exploiting factored structure without requiring prior knowledge of the factorization.

DetailsMotivation: Existing factored MDP algorithms require known factored representations and fail with high-dimensional observations, while deep RL cannot exploit factored structure. There's a need to automatically discover controllable factors from raw observations.

Method: Action-Controllable Factorization (ACF) uses contrastive learning to identify state components that each action can influence separately, leveraging sparsity where actions typically affect only subsets of variables while others evolve under environment dynamics.

Result: ACF successfully recovers ground truth controllable factors from pixel observations on Taxi, FourRooms, and MiniGrid-DoorKey benchmarks, consistently outperforming baseline disentanglement algorithms.

Conclusion: ACF effectively bridges the gap between factored MDP methods and deep RL by automatically discovering controllable latent structure from high-dimensional inputs, enabling sample-efficient learning without requiring prior factorization knowledge.

Abstract: Algorithms that exploit factored Markov decision processes are far more sample-efficient than factor-agnostic methods, yet they assume a factored representation is known a priori – a requirement that breaks down when the agent sees only high-dimensional observations. Conversely, deep reinforcement learning handles such inputs but cannot benefit from factored structure. We address this representation problem with Action-Controllable Factorization (ACF), a contrastive learning approach that uncovers independently controllable latent variables – state components each action can influence separately. ACF leverages sparsity: actions typically affect only a subset of variables, while the rest evolve under the environment’s dynamics, yielding informative data for contrastive training. ACF recovers the ground truth controllable factors directly from pixel observations on three benchmarks with known factored structure – Taxi, FourRooms, and MiniGrid-DoorKey – consistently outperforming baseline disentanglement algorithms.

[283] Improved Robustness of Deep Reinforcement Learning for Control of Time-Varying Systems by Bounded Extremum Seeking

Shaifalee Saxena, Alan Williams, Rafael Fierro, Alexander Scheinker

Main category: cs.LG

TL;DR: Hybrid controller combining deep reinforcement learning (DRL) and bounded extremum seeking (ES) for robust control of nonlinear time-varying systems, with applications to particle accelerator tuning.

DetailsMotivation: DRL can learn from large datasets to control many-parameter systems but degrades with rapid model changes, while bounded ES handles time-varying systems but slows with more parameters and can get stuck in local minima.

Method: Combines DRL and bounded ES into a hybrid controller where DRL learns from historical data for fast control and bounded ES ensures robustness to time variations.

Result: The hybrid controller outperforms individual components, with DRL providing fast control of many-parameter systems and bounded ES ensuring robustness to time variations.

Conclusion: The ES-DRL hybrid controller achieves superior performance by leveraging DRL’s learning capabilities and bounded ES’s robustness, demonstrated in particle accelerator tuning applications.

Abstract: In this paper, we study the use of robust model independent bounded extremum seeking (ES) feedback control to improve the robustness of deep reinforcement learning (DRL) controllers for a class of nonlinear time-varying systems. DRL has the potential to learn from large datasets to quickly control or optimize the outputs of many-parameter systems, but its performance degrades catastrophically when the system model changes rapidly over time. Bounded ES can handle time-varying systems with unknown control directions, but its convergence speed slows down as the number of tuned parameters increases and, like all local adaptive methods, it can get stuck in local minima. We demonstrate that together, DRL and bounded ES result in a hybrid controller whose performance exceeds the sum of its parts with DRL taking advantage of historical data to learn how to quickly control a many-parameter system to a desired setpoint while bounded ES ensures its robustness to time variations. We present a numerical study of a general time-varying system and a combined ES-DRL controller for automatic tuning of the Low Energy Beam Transport section at the Los Alamos Neutron Science Center linear particle accelerator.
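
Bounded ES has a compact standard form: each parameter dithers at its own frequency, and the measured cost enters only through the phase, so the update magnitude stays bounded regardless of the unknown gradient. The gains and frequencies below are illustrative, and the DRL component of the hybrid controller is omitted.

```python
import numpy as np

def cost(theta, t):
    """Time-varying objective: the minimum drifts slowly, as in the paper's setting."""
    target = np.array([np.sin(0.01 * t), np.cos(0.01 * t)])
    return np.sum((theta - target) ** 2)

theta = np.zeros(2)
dt, alpha, k = 0.01, 0.5, 2.0
omega = np.array([30.0, 37.0])          # distinct dither frequencies per parameter
for step in range(20000):
    t = step * dt
    C = cost(theta, t)                  # only a scalar measurement is needed
    theta += dt * np.sqrt(alpha * omega) * np.cos(omega * t + k * C)
print(theta)  # tracks the drifting optimum without gradient information
```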

[284] Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

Yongxin Zhu, Bocheng Li, Yifei Xin, Zhihua Xia, Linli Xu

Main category: cs.LG

TL;DR: SimVQ addresses representation collapse in Vector Quantization by reparameterizing code vectors through a learnable linear transformation layer over a latent basis, optimizing the entire linear space rather than individual code vectors.

DetailsMotivation: Vector Quantization suffers from representation collapse, causing low codebook utilization and limiting scalability. Existing solutions often rely on complex optimizations or reduce latent dimensionality, compromising model capacity.

Method: Proposes SimVQ which reparameterizes code vectors through a learnable linear transformation layer over a latent basis, optimizing the entire linear space rather than nearest individual code vectors.

Result: Extensive experiments on image and audio tasks demonstrate improved codebook usage, easy implementation, and good generalization across modalities and architectures.

Conclusion: SimVQ effectively prevents representation collapse in Vector Quantization through a simple linear transformation approach that maintains model capacity while improving codebook utilization.

Abstract: Vector Quantization (VQ) is essential for discretizing continuous representations in unsupervised learning but suffers from representation collapse, causing low codebook utilization and limiting scalability. Existing solutions often rely on complex optimizations or reduce latent dimensionality, which compromises model capacity and fails to fully solve the problem. We identify the root cause as disjoint codebook optimization, where only a few code vectors are updated via gradient descent. To fix this, we propose SimpleVQ (SimVQ), which reparameterizes code vectors through a learnable linear transformation layer over a latent basis, optimizing the entire linear space rather than the nearest individual code vectors. Although the multiplication of two linear matrices is equivalent to applying a single linear layer, this simple approach effectively prevents collapse. Extensive experiments on image and audio tasks demonstrate that SimVQ improves codebook usage, is easy to implement, and generalizes well across modalities and architectures. The code is available at https://github.com/youngsheen/SimVQ.
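
A minimal sketch of the reparameterization: freeze a random latent basis, make the effective codebook basis times a single learnable linear layer, and quantize with a straight-through estimator. The commitment-style loss and the sizes are conventional VQ assumptions, not necessarily the paper's exact training objective.

```python
import torch
import torch.nn as nn

class SimVQ(nn.Module):
    """Effective codebook = W(basis): gradients update the whole linear space
    rather than only the few code vectors that happen to be selected."""
    def __init__(self, num_codes=1024, dim=64):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(num_codes, dim), requires_grad=False)
        self.W = nn.Linear(dim, dim, bias=False)     # the one linear layer

    def forward(self, z):                            # z: (B, dim)
        codebook = self.W(self.basis)                # (num_codes, dim)
        idx = torch.cdist(z, codebook).argmin(dim=1)
        zq = codebook[idx]
        commit = ((zq.detach() - z) ** 2).mean() + ((zq - z.detach()) ** 2).mean()
        zq_st = z + (zq - z).detach()                # straight-through to encoder
        return zq_st, idx, commit

vq = SimVQ()
z = torch.randn(32, 64)
zq, idx, loss = vq(z)
print(zq.shape, idx.unique().numel(), float(loss))
```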

[285] Beyond Imitation: Recovering Dense Rewards from Demonstrations

Jiangnan Li, Thuy-Trang Vu, Ehsan Abbasnejad, Gholamreza Haffari

Main category: cs.LG

TL;DR: SFT is equivalent to Inverse Q-Learning, learning both policy and implicit token-level rewards that can be recovered and used to improve models via reinforcement learning.

DetailsMotivation: Challenge the conventional view of SFT as simple imitation learning by establishing its equivalence to Inverse Reinforcement Learning and demonstrating its potential for reward learning.

Method: Prove SFT objective is a special case of Inverse Q-Learning, formulate baseline-relative reward function to recover dense token-level rewards, and develop Dense-Path REINFORCE method.

Result: Recovered dense rewards enable granular credit assignment, and Dense-Path REINFORCE consistently outperforms original SFT models on instruction-following benchmarks.

Conclusion: SFT should be reframed as a powerful reward learning mechanism beyond policy imitation, opening new possibilities for leveraging expert demonstrations.

Abstract: Conventionally, supervised fine-tuning (SFT) is treated as a simple imitation learning process that only trains a policy to imitate expert behavior on demonstration datasets. In this work, we challenge this view by establishing a fundamental equivalence between SFT and Inverse Reinforcement Learning. We prove that the SFT objective is a special case of Inverse Q-Learning, which implies that the SFT process does not just learn a policy, but also an implicit, dense, token-level reward model that explains the expert demonstrations. We then show how to recover this dense reward signal directly from the SFT model by formulating a baseline-relative reward function. The availability of such a dense reward model offers numerous benefits, providing granular credit assignment for each token generated. We demonstrate one key application by using these recovered rewards to further improve the policy with reinforcement learning. Our method, Dense-Path REINFORCE, consistently outperforms the original SFT models on instruction-following benchmarks. This work reframes SFT not merely as policy imitation but as a powerful reward learning mechanism, opening new possibilities for leveraging expert demonstrations.
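
One plausible instantiation of the baseline-relative token reward: score each generated token by the SFT model's log-probability minus a reference model's, yielding a (batch, time) credit signal usable by REINFORCE-style updates. The exact baseline used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def dense_rewards(sft_logits, ref_logits, tokens):
    """Token-level rewards: SFT log-prob minus reference log-prob per token."""
    sft_lp = F.log_softmax(sft_logits, dim=-1)
    ref_lp = F.log_softmax(ref_logits, dim=-1)
    tok = tokens.unsqueeze(-1)
    return (sft_lp.gather(-1, tok) - ref_lp.gather(-1, tok)).squeeze(-1)

B, T, V = 2, 5, 100
sft_logits, ref_logits = torch.randn(B, T, V), torch.randn(B, T, V)
tokens = torch.randint(0, V, (B, T))
r = dense_rewards(sft_logits, ref_logits, tokens)
print(r.shape)   # torch.Size([2, 5]): per-token credit assignment
```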

[286] In-memory Training on Analog Devices with Limited Conductance States via Multi-tile Residual Learning

Jindan Li, Zhaoxian Wu, Gaowen Liu, Tayfun Gokmen, Tianyi Chen

Main category: cs.LG

TL;DR: Proposes a residual learning framework for analog in-memory computing that enables effective training with limited 4-bit memristive devices by sequentially learning across multiple crossbar tiles to compensate for low-precision errors.

DetailsMotivation: Analog in-memory computing accelerators require 8-bit conductance states for effective training, but many practical memristive devices like ReRAM only offer ~4-bit resolution due to fabrication constraints, which degrades training accuracy.

Method: A residual learning framework that sequentially learns on multiple crossbar tiles to compensate for residual errors from low-precision weight updates.

Result: Theoretical analysis shows optimality gap shrinks with number of tiles and achieves linear convergence rate. Experiments demonstrate consistent outperformance over state-of-the-art methods under limited-state settings with moderate hardware overhead.

Conclusion: The proposed residual learning framework enables effective on-chip training with limited-state memristive devices while maintaining competitive accuracy and manageable hardware costs.

Abstract: Analog in-memory computing (AIMC) accelerators enable efficient deep neural network computation directly within memory using resistive crossbar arrays, where model parameters are represented by the conductance states of memristive devices. However, effective in-memory training typically requires at least 8-bit conductance states to match digital baselines. Realizing such fine-grained states is costly and often requires complex noise mitigation techniques that increase circuit complexity and energy consumption. In practice, many promising memristive devices such as ReRAM offer only about 4-bit resolution due to fabrication constraints, and this limited update precision substantially degrades training accuracy. To enable on-chip training with these limited-state devices, this paper proposes a \emph{residual learning} framework that sequentially learns on multiple crossbar tiles to compensate for the residual errors from low-precision weight updates. Our theoretical analysis shows that the optimality gap shrinks with the number of tiles and achieves a linear convergence rate. Experiments on standard image classification benchmarks demonstrate that our method consistently outperforms state-of-the-art in-memory analog training strategies under limited-state settings, while incurring only moderate hardware overhead as confirmed by our cost analysis.
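
The sequential residual idea can be sketched in a few lines of PyTorch on a toy linear-regression problem (quantization grid, tile count, and training loop are illustrative; the paper works with analog crossbar updates, not digital SGD).

```python
import torch

def quantize(w: torch.Tensor, num_states: int = 16, w_max: float = 1.0):
    # Snap weights onto a uniform grid of limited conductance states (~4 bits).
    step = 2 * w_max / (num_states - 1)
    return torch.clamp(torch.round(w / step) * step, -w_max, w_max)

def multi_tile_residual_fit(x, y, num_tiles=3, steps=300, lr=0.05):
    """Each tile is trained on what the previously committed (quantized) tiles
    still get wrong; the deployed weight is the sum of all quantized tiles."""
    d_in, d_out = x.shape[1], y.shape[1]
    residual, tiles = y.clone(), []
    for _ in range(num_tiles):
        w = torch.zeros(d_in, d_out, requires_grad=True)
        opt = torch.optim.SGD([w], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            ((x @ w - residual) ** 2).mean().backward()
            opt.step()
        w_q = quantize(w.detach())        # commit the tile at limited precision
        tiles.append(w_q)
        residual = residual - x @ w_q     # next tile learns the leftover error
    return torch.stack(tiles).sum(dim=0)  # effective higher-precision weight
```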

[287] Graph Generation with Spectral Geodesic Flow Matching

Xikun Huang, Tianyu Ruan, Chihao Zhang, Shihua Zhang

Main category: cs.LG

TL;DR: SFMG is a graph generation method that uses spectral eigenmaps to embed graphs into Riemannian manifolds and matches distributions along geodesic flows, achieving state-of-the-art performance with 30x speedup over diffusion models.

DetailsMotivation: Existing graph generation methods focus on aligning spectrum or degree profiles but ignore the geometry induced by eigenvectors and global graph structure, limiting their ability to capture complex geometric relationships.

Method: Proposes Spectral Geodesic Flow Matching (SFMG) that embeds input and target graphs into continuous Riemannian manifolds using spectral eigenmaps, then defines geodesic flows between embeddings and matches distributions along these flows.

Result: SFMG matches state-of-the-art performance on graphlet, degree, and spectral metrics across diverse benchmarks, achieves 30x speedup over diffusion-based models, and demonstrates ability to generalize to unseen graph scales.

Conclusion: SFMG provides a new approach to graph synthesis by integrating spectral geometry with flow matching, offering improved geometric structure capture, flexible generation, and efficient scalability.

Abstract: Graph generation is a fundamental task with wide applications in modeling complex systems. Although existing methods align the spectrum or degree profile of the target graph, they often ignore the geometry induced by eigenvectors and the global structure of the graph. In this work, we propose Spectral Geodesic Flow Matching (SFMG), a novel framework that uses spectral eigenmaps to embed both input and target graphs into continuous Riemannian manifolds. We then define geodesic flows between embeddings and match distributions along these flows to generate output graphs. Our method yields several advantages: (i) captures geometric structure beyond eigenvalues, (ii) supports flexible generation of diverse graphs, and (iii) scales efficiently. Empirically, SFMG matches the performance of state-of-the-art approaches on graphlet, degree, and spectral metrics across diverse benchmarks. In particular, it achieves up to 30$\times$ speedup over diffusion-based models, offering a substantial advantage in scalability and training efficiency. We also demonstrate its ability to generalize to unseen graph scales. Overall, SFMG provides a new approach to graph synthesis by integrating spectral geometry with flow matching.
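
The embedding step can be illustrated with a plain spectral eigenmap; the flat-space interpolation below stands in for the Riemannian geodesic flow only as the simplest special case, and all names here are illustrative.

```python
import numpy as np
from numpy.linalg import eigh
from scipy.sparse.csgraph import laplacian

def spectral_eigenmap(adj: np.ndarray, k: int = 8) -> np.ndarray:
    """Embed nodes via the first k non-trivial eigenvectors of the
    normalized graph Laplacian (a spectral eigenmap)."""
    L = laplacian(adj, normed=True)
    _, vecs = eigh(L)                 # eigenvectors sorted by eigenvalue
    return vecs[:, 1:k + 1]           # drop the trivial bottom eigenvector

def flat_geodesic(z0: np.ndarray, z1: np.ndarray, t: float) -> np.ndarray:
    # Straight-line interpolation: the geodesic only in flat (Euclidean) space.
    return (1 - t) * z0 + t * z1
```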

[288] Model-brain comparison using inter-animal transforms

Imran Thobani, Javier Sagastuy-Brena, Aran Nayebi, Jacob Prince, Rosa Cao, Daniel Yamins

Main category: cs.LG

TL;DR: The paper proposes the Inter-Animal Transform Class (IATC) methodology for comparing artificial neural network models to brain responses, enabling bidirectional mapping between model and brain data while resolving detailed neural mechanisms without sacrificing predictivity.

DetailsMotivation: There is little consensus on correct methods for comparing model activations to brain responses, and existing approaches may not adequately address the tradeoff between predictive accuracy and mechanistic identification in neuroscience.

Method: The IATC methodology identifies the strictest set of functions needed to accurately map neural responses between subjects, allowing bidirectional mapping between candidate models’ responses and brain data to assess how well models can masquerade as typical subjects.

Result: The IATC resolves detailed neural mechanisms like non-linear activation functions, enables accurate predictions of neural activity with high specificity in mechanism identification, and provides evidence favoring topographical deep neural networks as models of the visual system.

Conclusion: The IATC demonstrates that there is no inherent tradeoff between neural engineering goals (high predictivity) and neuroscientific goals (mechanistic accuracy), enabling principled model-brain comparisons that improve upon previous approaches.

Abstract: Artificial neural network models have emerged as promising mechanistic models of the brain. However, there is little consensus on the correct method for comparing model activations to brain responses. Drawing on recent work in philosophy of neuroscience, we propose a comparison methodology based on the Inter-Animal Transform Class (IATC) - the strictest set of functions needed to accurately map neural responses between subjects in an animal population. Using the IATC, we can map bidirectionally between a candidate model’s responses and brain data, assessing how well the model can masquerade as a typical subject using the same kinds of transforms needed to map across real subjects. We identify the IATC in three settings: a simulated population of neural network models, a population of mouse subjects, and a population of human subjects. We find that the IATC resolves detailed aspects of the neural mechanism, such as the non-linear activation function. Most importantly, we find that the IATC enables accurate predictions of neural activity while also achieving high specificity in mechanism identification, evidenced by its ability to separate response patterns from different brain areas while strongly aligning same-brain-area responses between subjects. In other words, the IATC is a proof-by-existence that there is no inherent tradeoff between the neural engineering goal of high model-brain predictivity and the neuroscientific goal of identifying mechanistically accurate brain models. Using IATC-guided transforms, we obtain new evidence in favor of topographical deep neural networks (TDANNs) as models of the visual system. Overall, the IATC enables principled model-brain comparisons, contextualizing previous findings about the predictive success of deep learning models of the brain, while improving upon previous approaches to model-brain comparison.
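
One concrete ingredient of such a comparison can be sketched with a ridge-regularized linear map fit in both directions; the paper's transform class is defined more carefully and need not be linear.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def bidirectional_fit_score(model_resp, brain_resp, alpha=1.0):
    """Hedged sketch: fit a transform model->brain and brain->model and
    report cross-validated R^2 in each direction.

    model_resp: (n_stimuli, n_units); brain_resp: (n_stimuli, n_neurons)."""
    to_brain = cross_val_score(Ridge(alpha=alpha), model_resp, brain_resp,
                               scoring="r2", cv=5).mean()
    to_model = cross_val_score(Ridge(alpha=alpha), brain_resp, model_resp,
                               scoring="r2", cv=5).mean()
    return to_brain, to_model
```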

[289] AttentiveGRUAE: An Attention-Based GRU Autoencoder for Temporal Clustering and Behavioral Characterization of Depression from Wearable Data

Nidhi Soley, Vishal M Patel, Casey O Taylor

Main category: cs.LG

TL;DR: AttentiveGRUAE is an attention-based GRU autoencoder for temporal clustering and depression prediction from wearable sleep data, achieving superior clustering and classification performance with clinically interpretable results.

DetailsMotivation: To develop a model that can jointly learn compact behavioral representations, predict depression outcomes, and identify behavioral subtypes from longitudinal wearable data for better clinical insights.

Method: Uses attention-based GRU autoencoder with three joint objectives: sequence reconstruction, depression classification, and GMM-based soft clustering of embeddings. Evaluated on longitudinal sleep data from 372 participants.

Result: Achieved superior performance: silhouette score 0.70 vs 0.32-0.70 baselines, AUC 0.74 vs 0.50-0.67 for depression classification. External validation confirmed reproducibility (silhouette 0.63, AUC 0.61).

Conclusion: AttentiveGRUAE effectively clusters behavioral subtypes and predicts depression from wearable data, providing clinically interpretable explanations through temporal attention visualization.

Abstract: In this study, we present AttentiveGRUAE, a novel attention-based gated recurrent unit (GRU) autoencoder designed for temporal clustering and outcome prediction from longitudinal wearable data. Our model jointly optimizes three objectives: (1) learning a compact latent representation of daily behavioral features via sequence reconstruction, (2) predicting end-of-period depression rate through a binary classification head, and (3) identifying behavioral subtypes through Gaussian Mixture Model (GMM) based soft clustering of learned embeddings. We evaluate AttentiveGRUAE on longitudinal sleep data from 372 participants (GLOBEM 2018-2019), and it demonstrates superior performance over baseline clustering, domain-aligned self-supervised, and ablated models in both clustering quality (silhouette score = 0.70 vs 0.32-0.70) and depression classification (AUC = 0.74 vs 0.50-0.67). Additionally, external validation on cross-year cohorts from 332 participants (GLOBEM 2020-2021) confirms cluster reproducibility (silhouette score = 0.63, AUC = 0.61) and stability. We further perform subtype analysis and visualize temporal attention, which highlights sleep-related differences between clusters and identifies salient time windows that align with changes in sleep regularity, yielding clinically interpretable explanations of risk.
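
A hedged PyTorch sketch of the three-objective architecture; layer sizes, the attention form, and the decoder are illustrative rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class AttentiveGRUAESketch(nn.Module):
    """Three heads: sequence reconstruction, a depression logit, and an
    embedding z that a GMM (e.g., sklearn GaussianMixture) would soft-cluster."""

    def __init__(self, n_feats: int, hidden: int = 64, latent: int = 16):
        super().__init__()
        self.enc = nn.GRU(n_feats, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)              # temporal attention scores
        self.to_latent = nn.Linear(hidden, latent)
        self.dec = nn.GRU(latent, hidden, batch_first=True)
        self.recon = nn.Linear(hidden, n_feats)       # objective (1)
        self.clf = nn.Linear(latent, 1)               # objective (2)

    def forward(self, x):                             # x: (batch, T, n_feats)
        h, _ = self.enc(x)                            # (batch, T, hidden)
        w = torch.softmax(self.attn(h), dim=1)        # weights over time steps
        z = self.to_latent((w * h).sum(dim=1))        # attention-pooled embedding
        d, _ = self.dec(z.unsqueeze(1).repeat(1, x.size(1), 1))
        return self.recon(d), self.clf(z), z          # x_hat, logit, embedding
```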

[290] On The Expressive Power of GNN Derivatives

Yam Eitan, Moshe Eliasof, Yoav Gelberg, Fabrizio Frasca, Guy Bar-Shalom, Haggai Maron

Main category: cs.LG

TL;DR: HOD-GNN enhances GNN expressivity using high-order node derivatives, creating structure-aware embeddings processed by a second GNN, with theoretical alignment to WL hierarchy and strong empirical performance.

DetailsMotivation: Limited expressivity of GNNs remains a fundamental challenge, and while derivatives have been studied for oversquashing and explainability, they haven't been explored for enhancing expressivity.

Method: High-Order Derivative GNN (HOD-GNN) leverages high-order node derivatives of base MPNNs to generate expressive structure-aware embeddings, processed by a second GNN in end-to-end trainable architecture with efficient message-passing algorithm.

Result: Theoretical analysis shows alignment with WL hierarchy, deep connections with Subgraph GNNs and structural encodings, and evaluations demonstrate strong performance on graph learning benchmarks.

Conclusion: High-order derivatives provide a natural and effective way to enhance GNN expressivity, bridging theoretical expressivity with practical computational efficiency.

Abstract: Despite significant advances in Graph Neural Networks (GNNs), their limited expressivity remains a fundamental challenge. Research on GNN expressivity has produced many expressive architectures, leading to architecture hierarchies with models of increasing expressive power. Separately, derivatives of GNNs with respect to node features have been widely studied in the context of the oversquashing and over-smoothing phenomena, GNN explainability, and more. To date, these derivatives remain unexplored as a means to enhance GNN expressivity. In this paper, we show that these derivatives provide a natural way to enhance the expressivity of GNNs. We introduce High-Order Derivative GNN (HOD-GNN), a novel method that enhances the expressivity of Message Passing Neural Networks (MPNNs) by leveraging high-order node derivatives of the base model. These derivatives generate expressive structure-aware node embeddings processed by a second GNN in an end-to-end trainable architecture. Theoretically, we show that the resulting architecture family’s expressive power aligns with the WL hierarchy. We also draw deep connections between HOD-GNN, Subgraph GNNs, and popular structural encoding schemes. For computational efficiency, we develop a message-passing algorithm for computing high-order derivatives of MPNNs that exploits graph sparsity and parallelism. Evaluations on popular graph learning benchmarks demonstrate HOD-GNN’s strong performance.
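
First-order node derivatives can be obtained directly with autograd, as in the sketch below; the paper computes higher orders with a specialized message-passing algorithm, since dense Jacobians like this do not scale.

```python
import torch

def derivative_node_features(gnn, x, edge_index):
    """Hedged sketch: use d(embeddings)/d(features) of a base MPNN as
    structure-aware inputs for a second GNN.

    x: (n_nodes, n_feats); gnn(x, edge_index) -> (n_nodes, d_out)."""
    jac = torch.autograd.functional.jacobian(
        lambda inp: gnn(inp, edge_index), x)  # (n_nodes, d_out, n_nodes, n_feats)
    deriv = jac.sum(dim=2)                    # aggregate over source nodes
    return deriv.flatten(start_dim=1)         # (n_nodes, d_out * n_feats)
```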

[291] Geospatial Machine Learning Libraries

Adam J. Stewart, Caleb Robinson, Arindam Banerjee

Main category: cs.LG

TL;DR: This chapter provides a comprehensive overview of geospatial machine learning (GeoML) libraries, analyzing their evolution, functionalities, and ecosystem, while introducing popular tools and discussing methodologies, applications, and future directions.

DetailsMotivation: The availability of Earth observation data has outpaced the development of domain-specific libraries to handle unique geospatial challenges like varying spatial resolutions, spectral properties, temporal cadence, and coordinate systems.

Method: Presents analysis of GeoML libraries’ evolution and architecture, introduces popular libraries (TorchGeo, eo-learn, Raster Vision), discusses methodologies for data preprocessing, spatial-temporal joins, benchmarking, and pretrained models, and includes a crop type mapping case study.

Result: Provides a comprehensive guide to the GeoML ecosystem, detailing library architectures, supported data types, integration with ML frameworks, and practical applications through case studies.

Conclusion: The chapter serves as a guide for practitioners, developers, and researchers to navigate and contribute to the rapidly evolving GeoML landscape, highlighting best practices, open challenges, and future directions including foundation models and governance in open-source geospatial software.

Abstract: Recent advances in machine learning have been supported by the emergence of domain-specific software libraries, enabling streamlined workflows and increased reproducibility. For geospatial machine learning (GeoML), the availability of Earth observation data has outpaced the development of domain libraries to handle its unique challenges, such as varying spatial resolutions, spectral properties, temporal cadence, data coverage, coordinate systems, and file formats. This chapter presents a comprehensive overview of GeoML libraries, analyzing their evolution, core functionalities, and the current ecosystem. It also introduces popular GeoML libraries such as TorchGeo, eo-learn, and Raster Vision, detailing their architecture, supported data types, and integration with ML frameworks. Additionally, it discusses common methodologies for data preprocessing, spatial–temporal joins, benchmarking, and the use of pretrained models. Through a case study in crop type mapping, it demonstrates practical applications of these tools. Best practices in software design, licensing, and testing are highlighted, along with open challenges and future directions, particularly the rise of foundation models and the need for governance in open-source geospatial software. Our aim is to guide practitioners, developers, and researchers in navigating and contributing to the rapidly evolving GeoML landscape.

[292] Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning

Ahmed Hendawy, Henrik Metternich, Théo Vincent, Mahdi Kallel, Jan Peters, Carlo D’Eramo

Main category: cs.LG

TL;DR: MINTO introduces a novel update rule that uses the minimum estimate between target and online networks for value function estimation, enabling faster and more stable learning in deep RL algorithms.

DetailsMotivation: Target networks provide stability but slow learning, while using online networks directly causes instability. MINTO aims to combine the benefits of both approaches.

Method: MINTO computes bootstrapped targets using the minimum value estimate between the target and online networks, which can be seamlessly integrated into various value-based and actor-critic algorithms.

Result: MINTO consistently improves performance across diverse benchmarks including online/offline RL and discrete/continuous action spaces, demonstrating faster and more stable value function learning.

Conclusion: MINTO effectively mitigates overestimation bias while maintaining stability, offering a simple yet powerful enhancement for deep RL algorithms with broad applicability.

Abstract: The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise solution that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well-known to lead to unstable learning. In this work, we aim to obtain the best out of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple, yet effective modification, we show that MINTO enables faster and stable value function learning, by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms with a negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
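
One plausible discrete-action instantiation of the target is sketched below; the exact placement of the minimum may differ from the paper's formulation.

```python
import torch

@torch.no_grad()
def minto_target(reward, done, next_obs, online_q, target_q, gamma=0.99):
    """MINTO-style bootstrapped target: take the minimum of the online and
    target networks' value estimates at the next state before bootstrapping.
    reward, done: (batch,); online_q/target_q map obs -> (batch, n_actions)."""
    v_online = online_q(next_obs).max(dim=1).values
    v_target = target_q(next_obs).max(dim=1).values
    next_v = torch.minimum(v_online, v_target)   # guards against overestimation
    return reward + gamma * (1.0 - done) * next_v
```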

[293] Towards CONUS-Wide ML-Augmented Conceptually-Interpretable Modeling of Catchment-Scale Precipitation-Storage-Runoff Dynamics

Yuan-Heng Wang, Yang Yang, Fabio Ciulla, Hoshin V. Gupta, Charuleka Varadharajan

Main category: cs.LG

TL;DR: ML-augmented physically-interpretable catchment models using Mass-Conserving Perceptron (MCP) achieve comparable performance to LSTM models in CONUS-wide hydrology study, emphasizing appropriate model complexity based on hydrological regime.

DetailsMotivation: Current ML-based hydrologic modeling lacks predictive improvements grounded in enhanced physical-conceptual understanding, needing approaches that combine ML with physical interpretability.

Method: CONUS-wide large-sample study using ML-augmented physically-interpretable catchment-scale models based on Mass-Conserving Perceptron (MCP), evaluated across diverse hydro-geo-climatic conditions with attribute masks.

Result: Physically-interpretable MCP-based models achieve performance comparable to LSTM models, with importance of selecting appropriate model complexity based on process dominance variations across hydrological regimes.

Conclusion: Theory-informed, physically grounded approaches to large-sample hydrology enable mechanistic understanding and development of parsimonious, interpretable model architectures for future ‘models of everywhere’.

Abstract: While many modern studies are dedicated to ML-based large-sample hydrologic modeling, these efforts have not necessarily translated into predictive improvements that are grounded in enhanced physical-conceptual understanding. Here, we report on a CONUS-wide large-sample study (spanning diverse hydro-geo-climatic conditions) using ML-augmented physically-interpretable catchment-scale models of varying complexity based on the Mass-Conserving Perceptron (MCP). Results were evaluated using attribute masks such as snow regime, forest cover, and climate zone. Our results indicate the importance of selecting model architectures of appropriate model complexity based on how process dominance varies with hydrological regime. Benchmark comparisons show that physically-interpretable mass-conserving MCP-based models can achieve performance comparable to data-based models based on the Long Short-Term Memory network (LSTM) architecture. Overall, this study highlights the potential of a theory-informed, physically grounded approach to large-sample hydrology, with emphasis on mechanistic understanding and the development of parsimonious and interpretable model architectures, thereby laying the foundation for future models of everywhere that architecturally encode information about spatially- and temporally-varying process dominance.
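
The mass-conservation constraint itself is simple to express: route the incoming mass through a gate whose fractions sum to one. The sketch below is only illustrative; the actual MCP has richer state and flux structure.

```python
import torch
import torch.nn as nn

class MassConservingNode(nn.Module):
    """Toy mass-conserving unit: incoming water (input mass) is split among
    outgoing fluxes by a softmax gate, so outputs sum exactly to the input
    and no mass is created or destroyed."""

    def __init__(self, n_out: int, n_aux: int = 4):
        super().__init__()
        self.gate = nn.Linear(n_aux, n_out)   # gates conditioned on aux inputs

    def forward(self, mass: torch.Tensor, aux: torch.Tensor):
        # mass: (batch,); aux: (batch, n_aux)
        w = torch.softmax(self.gate(aux), dim=-1)   # fractions sum to 1
        return mass.unsqueeze(-1) * w               # fluxes sum to `mass`
```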

[294] MINERVA: Mutual Information Neural Estimation for Supervised Feature Selection

Taurai Muvunzaa, Egor Kraev, Pere Planell-Morell, Alexander Y. Shestopaloff

Main category: cs.LG

TL;DR: MINERVA is a neural network-based feature selection method that uses mutual information estimation to capture higher-order feature interactions, outperforming traditional pair-wise dependence metrics.

DetailsMotivation: Traditional feature filters fail when targets depend on higher-order feature interactions rather than individual feature contributions, limiting their effectiveness in complex real-world scenarios.

Method: Two-stage neural network approach: 1) Parameterize mutual information approximation with neural networks, 2) Feature selection using sparsity-inducing regularizers and ensemble evaluation of feature subsets.

Result: Successfully captures complex feature-target relationships that traditional methods miss, with experimental validation on synthetic and real-life fraud datasets showing exact solutions and improved performance.

Conclusion: MINERVA provides an effective neural-based framework for feature selection that handles higher-order dependencies and complex feature interactions better than statistical pair-wise methods.

Abstract: Existing feature filters rely on statistical pair-wise dependence metrics to model feature-target relationships, but this approach may fail when the target depends on higher-order feature interactions rather than individual contributions. We introduce Mutual Information Neural Estimation Regularized Vetting Algorithm (MINERVA), a novel approach to supervised feature selection based on neural estimation of mutual information between features and targets. We parameterize the approximation of mutual information with neural networks and perform feature selection using a carefully designed loss function augmented with sparsity-inducing regularizers. Our method is implemented in a two-stage process to decouple representation learning from feature selection, ensuring better generalization and a more accurate expression of feature importance. We present examples of ubiquitous dependency structures that are rarely captured in literature and show that our proposed method effectively captures these complex feature-target relationships by evaluating feature subsets as an ensemble. Experimental results on synthetic and real-life fraud datasets demonstrate the efficacy of our method and its ability to attain exact solutions.
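
A toy sketch in this spirit, combining a Donsker-Varadhan mutual-information lower bound with L1-penalized feature gates; the paper's two-stage procedure and regularizers are more careful than this.

```python
import math
import torch
import torch.nn as nn

class GatedMINE(nn.Module):
    """MINE-style statistics network T lower-bounds I(gated features; target);
    an L1 penalty on sigmoid gates induces sparse feature selection."""

    def __init__(self, n_feats: int, hidden: int = 64):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(n_feats))
        self.T = nn.Sequential(nn.Linear(n_feats + 1, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def loss(self, x, y, lam=1e-2):          # x: (B, n_feats), y: (B, 1)
        g = torch.sigmoid(self.gate_logits)  # soft feature mask
        xg = x * g
        joint = self.T(torch.cat([xg, y], dim=1)).mean()
        shuffled = y[torch.randperm(len(y))]                 # break the pairing
        marg = self.T(torch.cat([xg, shuffled], dim=1))
        # Donsker-Varadhan bound: E[T] - log E[exp(T)]
        mi_lb = joint - (torch.logsumexp(marg, dim=0) - math.log(len(y)))
        return -mi_lb.squeeze() + lam * g.sum()  # maximize MI, sparsify gates
```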

[295] TabImpute: Accurate and Fast Zero-Shot Missing-Data Imputation with a Pre-Trained Transformer

Jacob Feitelberg, Dwaipayan Saha, Kyuseong Choi, Zaid Ahmad, Anish Agarwal, Raaz Dwivedi

Main category: cs.LG

TL;DR: TabImpute is a pre-trained transformer for zero-shot tabular data imputation that requires no fitting or hyperparameter tuning, achieving robust performance across diverse domains.

DetailsMotivation: Missing data is a pervasive problem in tabular settings with no default imputation method due to performance variance across domains and time-consuming hyperparameter tuning.

Method: Builds on TabPFN foundation model, uses entry-wise featurization for 100x speedup, synthetic training data with realistic missingness patterns, and comprehensive MissBench evaluation framework.

Result: TabImpute delivers accurate and fast zero-shot imputations, showing robust performance compared to 11 established methods across 42 OpenML datasets and 13 missingness patterns spanning medicine, finance, and engineering.

Conclusion: TabImpute provides an effective zero-shot solution for tabular data imputation that eliminates the need for fitting and hyperparameter tuning while maintaining strong performance across diverse real-world scenarios.

Abstract: Missing data is a pervasive problem in tabular settings. Existing solutions range from simple averaging to complex generative adversarial networks. However, due to huge variance in performance across real-world domains and time-consuming hyperparameter tuning, no default imputation method exists. Building on TabPFN, a recent tabular foundation model for supervised learning, we propose TabImpute, a pre-trained transformer that delivers accurate and fast zero-shot imputations requiring no fitting or hyperparameter tuning at inference-time. To train and evaluate TabImpute, we introduce (i) an entry-wise featurization for tabular settings, which enables a $100\times$ speedup over the previous TabPFN imputation method, (ii) a synthetic training data generation pipeline incorporating realistic missingness patterns, which boosts test-time performance, and (iii) MissBench, a comprehensive benchmark for evaluation of imputation methods with $42$ OpenML datasets and $13$ missingness patterns. MissBench spans domains such as medicine, finance, and engineering, showcasing TabImpute’s robust performance compared to $11$ established imputation methods.
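
The entry-wise featurization can be sketched as turning each cell into a (row, column, value) example, with missing cells held out as queries; the real featurization feeding TabPFN is richer than this reshaping.

```python
import numpy as np

def entrywise_featurize(table: np.ndarray):
    """Hedged sketch: each observed cell becomes one training example
    (row index, column index) -> value, and NaN cells become query points."""
    rows, cols = np.indices(table.shape)
    flat = table.ravel()
    observed = ~np.isnan(flat)
    X = np.column_stack([rows.ravel(), cols.ravel()])
    return X[observed], flat[observed], X[~observed]  # train X, train y, query X
```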

[296] HyperAdaLoRA: Accelerating LoRA Rank Allocation During Training via Hypernetworks without Sacrificing Performance

Hao Zhang, Zhenjia Li, Runfeng Bao, Yifan Gao, Xi Xiao, Bo Huang, Yuhang Wu, Tianyang Wang, Hao Xu

Main category: cs.LG

TL;DR: HyperAdaLoRA is a novel framework that accelerates AdaLoRA convergence using a hypernetwork with attention mechanisms to dynamically generate SVD parameters, achieving faster convergence without performance loss.

DetailsMotivation: To address the slow convergence speed and high computational overhead issues in AdaLoRA, which uses dynamic rank allocation via SVD pruning but suffers from training inefficiencies.

Method: Uses a hypernetwork based on attention mechanisms to dynamically generate SVD parameters (P, Λ, Q) instead of directly optimizing them, with pruning of hypernetwork outputs for singular values to achieve dynamic rank allocation.

Result: Comprehensive experiments show faster convergence without sacrificing performance across various datasets and models, with extension experiments validating broad applicability to other LoRA-based approaches.

Conclusion: HyperAdaLoRA successfully accelerates AdaLoRA convergence while maintaining performance, demonstrating the effectiveness of hypernetwork-based parameter generation for efficient fine-tuning of large language models.

Abstract: Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), has emerged as a promising approach to fine-tuning large language models (LLMs) while reducing computational and memory overhead. However, LoRA assumes a uniform rank \textit{r} for each incremental matrix, not accounting for the varying significance of weight matrices across different modules and layers. AdaLoRA leverages Singular Value Decomposition (SVD) to parameterize updates and employs pruning of singular values to introduce dynamic rank allocation, thereby enhancing adaptability. However, during the training process, it often encounters issues of slow convergence speed and high computational overhead. To address this issue, we propose HyperAdaLoRA, a novel framework that accelerates the convergence of AdaLoRA by leveraging a hypernetwork. Instead of directly optimizing the components of Singular Value Decomposition $(P, \Lambda, Q)$, HyperAdaLoRA employs a hypernetwork based on attention mechanisms to dynamically generate these parameters. By pruning the outputs of the hypernetwork that generates the singular values, dynamic rank allocation is achieved. Comprehensive experiments on various datasets and models demonstrate that our method achieves faster convergence without sacrificing performance. Additionally, further extension experiments on other LoRA-based approaches validate the broad applicability of our method.
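
A hedged sketch of the hypernetwork idea for a single layer; the embedding size, the use of attention, and the output heads are illustrative guesses at the structure, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SVDHyperNet(nn.Module):
    """Emit SVD-style factors (P, Lambda, Q) of a LoRA update from a learned
    per-layer embedding instead of optimizing them directly; pruning small
    entries of Lambda would then realize dynamic rank allocation."""

    def __init__(self, d_in: int, d_out: int, rank: int, emb: int = 32):
        super().__init__()
        self.layer_emb = nn.Parameter(torch.randn(1, emb))
        self.mixer = nn.MultiheadAttention(emb, num_heads=4, batch_first=True)
        self.to_P = nn.Linear(emb, d_out * rank)
        self.to_lam = nn.Linear(emb, rank)
        self.to_Q = nn.Linear(emb, rank * d_in)
        self.rank, self.d_in, self.d_out = rank, d_in, d_out

    def forward(self):
        h, _ = self.mixer(self.layer_emb[None], self.layer_emb[None],
                          self.layer_emb[None])
        h = h[0, 0]
        P = self.to_P(h).view(self.d_out, self.rank)
        lam = self.to_lam(h)                  # prune small entries -> lower rank
        Q = self.to_Q(h).view(self.rank, self.d_in)
        return P @ torch.diag(lam) @ Q        # delta-W applied to the base layer
```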

[297] Optimal Characteristics of Inspection Vehicle for Drive-by Bridge Inspection

A. Calderon Hurtado, E. Atroshchenko, K. C. Chang, C. W. Kim, M. Makki Alamdari

Main category: cs.LG

TL;DR: A framework for optimizing inspection vehicles in drive-by bridge monitoring using adversarial autoencoders and Kriging meta-model to enhance damage sensitivity by optimizing vehicle mass and stiffness parameters.

DetailsMotivation: Current drive-by inspection methods are limited by vehicle mechanical properties affecting detection performance, requiring optimization of the sensing platform itself.

Method: Unsupervised deep learning with adversarial autoencoders to reconstruct frequency-domain acceleration responses, combined with Kriging meta-model to optimize vehicle mass and stiffness by minimizing Wasserstein distance between healthy and damaged bridge states.

Result: Vehicles with frequency ratios between 0.3-0.7 relative to bridge’s first natural frequency are most effective, while resonant vehicles perform poorly. Lighter vehicles need lower natural frequencies for optimal detection.

Conclusion: First study to rigorously optimize the sensing platform for drive-by sensing and propose a purpose-built inspection vehicle, demonstrating significant improvement in damage detection sensitivity.

Abstract: Drive-by inspection for bridge health monitoring has gained increasing attention over the past decade. This method involves analysing the coupled vehicle-bridge response, recorded by an instrumented inspection vehicle, to assess structural integrity and detect damage. However, the vehicle’s mechanical and dynamic properties significantly influence detection performance, limiting the effectiveness of the approach. This study presents a framework for optimising the inspection vehicle to enhance damage sensitivity. An unsupervised deep learning method based on adversarial autoencoders (AAE) is used to reconstruct the frequency-domain representation of acceleration responses. The mass and stiffness of the tyre suspension system of a two-axle vehicle are optimised by minimising the Wasserstein distance between damage index distributions for healthy and damaged bridge states. A Kriging meta-model is employed to approximate this objective function efficiently and identify optimal vehicle configurations in both dimensional and non-dimensional parameter spaces. Results show that vehicles with frequency ratios between 0.3 and 0.7 relative to the bridge’s first natural frequency are most effective, while those near resonance perform poorly. Lighter vehicles require lower natural frequencies for optimal detection. This is the first study to rigorously optimise the sensing platform for drive-by sensing and to propose a purpose-built inspection vehicle.
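
The surrogate loop is straightforward to sketch with off-the-shelf tools: score each evaluated (mass, stiffness) pair by the Wasserstein distance between healthy-state and damaged-state damage-index samples, then let a GP (Kriging) meta-model rank unevaluated candidates. Function names below are hypothetical; the damage-index samples are assumed to come from the AAE.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.gaussian_process import GaussianProcessRegressor

def separation(di_healthy: np.ndarray, di_damaged: np.ndarray) -> float:
    # Objective: distance between damage-index distributions for one config.
    return wasserstein_distance(di_healthy, di_damaged)

def propose_next(evaluated_x, evaluated_y, candidate_pool, n_pick=5):
    """Fit a GP (Kriging) surrogate to (mass, stiffness) -> separation pairs,
    then return the candidates with the highest predicted separation."""
    gp = GaussianProcessRegressor(normalize_y=True).fit(evaluated_x, evaluated_y)
    pred = gp.predict(candidate_pool)
    best = np.argsort(-pred)[:n_pick]      # maximize predicted separation
    return candidate_pool[best]
```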

[298] TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models

Rakshith S Srinivasa, Zora Che, Chen Bo Calvin Zhang, Diego Mares, Ernesto Hernandez, Jayeon Park, Dean Lee, Guillermo Mangialardi, Charmaine Ng, Ed-Yeremai Hernandez Cardona, Anisha Gunjal, Yunzhong He, Bing Liu, Chen Xing

Main category: cs.LG

TL;DR: TutorBench is a dataset and benchmark for evaluating LLMs’ tutoring skills, showing current models perform poorly (≤56%) and need significant improvement in adaptive explanations, feedback, and hint generation.

DetailsMotivation: As students increasingly use LLMs as learning aids, there's a need to develop models with proper tutoring capabilities including identifying student needs, adaptability, personalization, and accuracy.

Method: Created TutorBench with 1,490 expert-curated samples from high-school/AP curricula, covering three tutoring tasks: adaptive explanations, actionable feedback, and hint generation. Uses LLM-judge with sample-specific rubrics for fine-grained automatic evaluation.

Result: Evaluation of 16 frontier LLMs shows none achieve >56% score, with all models <60% pass rate on core tutoring skills. Claude models excel in active learning support but lag in other areas.

Conclusion: Current LLMs fall short in comprehensive tutoring capabilities. TutorBench provides an unsaturated benchmark to guide development of next-generation AI tutors with better guidance, diagnosis, and student support.

Abstract: As students increasingly adopt large language models (LLMs) as learning aids, it is crucial to build models that are adept at handling the nuances of tutoring: they need to identify the core needs of students, be adaptive, provide personalized guidance, and be accurate. To this end, we introduce TutorBench, a dataset and evaluation benchmark designed to rigorously evaluate the core tutoring skills of LLMs. The dataset comprises 1,490 samples curated by human experts, focused on high-school and AP-level curricula. The samples are drawn from three common tutoring tasks: (i) generating adaptive explanations tailored to a student’s confusion, (ii) providing actionable feedback on a student’s work, and (iii) promoting active learning through effective hint generation. To account for the inherent complexity of tutoring, samples are accompanied by sample-specific rubrics which are used to judge model responses during evaluation. TutorBench uses a reliable, fine-grained automatic evaluation method based on an LLM judge and the sample-specific rubrics. We evaluate 16 frontier LLMs on TutorBench and present a detailed analysis of their performance and behavior. Our results show that none of the frontier LLMs achieve a score greater than 56%, showing large room for improvement. We find that LLMs fall short in exhibiting the full range of tutoring skills needed to guide, diagnose, and support students effectively, with all the frontier models achieving less than a 60% pass rate on rubric criteria related to these skills. We also find that different model families exhibit varied strengths and limitations: the Claude models outperform others in supporting active learning, while they lag behind in the other two use cases. By releasing TutorBench, we provide a comprehensive and unsaturated benchmark to guide the development of the next-generation of AI tutors.

[299] Topological Invariance and Breakdown in Learning

Yongyi Yang, Tomaso Poggio, Isaac Chuang, Liu Ziyin

Main category: cs.LG

TL;DR: Permutation-equivariant learning rules induce bi-Lipschitz mappings between neurons, constraining neuron distribution topology. Learning rate η creates a topological critical point η*: below η* preserves all topological structure, above η* allows topological simplification that reduces model expressivity.

DetailsMotivation: To understand how learning dynamics affect neuron topology and reveal qualitative differences between small and large learning rates in neural network training.

Method: Theoretical analysis proving that permutation-equivariant learning rules (including SGD, Adam) induce bi-Lipschitz mappings between neurons, constraining neuron distribution topology during training.

Result: Discovered a topological critical point η*: training below η* preserves all topological structure, while above η* allows topological simplification that makes neuron manifolds coarser and reduces expressivity. Learning dynamics have two phases: smooth optimization under topological constraints, then learning through topological simplifications.

Conclusion: The theory provides a universal topological framework for studying deep learning that is independent of specific architectures or loss functions, revealing fundamental topological constraints in neural network training dynamics.

Abstract: We prove that for a broad class of permutation-equivariant learning rules (including SGD, Adam, and others), the training process induces a bi-Lipschitz mapping between neurons and strongly constrains the topology of the neuron distribution during training. This result reveals a qualitative difference between small and large learning rates $\eta$. With a learning rate below a topological critical point $\eta^*$, the training is constrained to preserve all topological structure of the neurons. In contrast, above $\eta^*$, the learning process allows for topological simplification, making the neuron manifold progressively coarser and thereby reducing the model’s expressivity. Viewed in combination with the recent discovery of the edge of stability phenomenon, the learning dynamics of neural networks under gradient descent can be divided into two phases: first they undergo smooth optimization under topological constraints, and then enter a second phase where they learn through drastic topological simplifications. A key feature of our theory is that it is independent of specific architectures or loss functions, enabling the universal application of topological methods to the study of deep learning.
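
For reference, the standard bi-Lipschitz condition on the training-induced neuron map $\phi$ is shown below; since a bi-Lipschitz map is a homeomorphism onto its image, it cannot change the topology of the neuron distribution (how the constants depend on $\eta$ is the paper's contribution).

```latex
% Bi-Lipschitz condition on the neuron map \phi induced by training:
\exists\, 0 < c \le C < \infty \;:\quad
c\, d(u, v) \;\le\; d\big(\phi(u), \phi(v)\big) \;\le\; C\, d(u, v)
\qquad \forall\, u, v .
```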

[300] Multiplicative-Additive Constrained Models:Toward Joint Visualization of Interactive and Independent Effects

Fumin Wang

Main category: cs.LG

TL;DR: MACMs combine multiplicative and additive components to improve interpretability while capturing feature interactions, outperforming both CESR and GAMs in predictive performance.

DetailsMotivation: To enhance interpretability in high-stakes fields like healthcare while overcoming the limitations of GAMs (which omit higher-order interactions) and CESR (which has poor performance despite capturing interactions).

Method: Introduce Multiplicative-Additive Constrained Models (MACMs) that augment CESR with an additive part to disentangle interactive and independent feature effects, effectively broadening the hypothesis space while maintaining visualizable shape functions.

Result: Neural network-based MACMs significantly outperform both CESR and state-of-the-art GAMs in predictive performance while maintaining interpretability through visualizable shape functions.

Conclusion: MACMs provide a superior approach that balances interpretability and predictive performance by combining the strengths of both multiplicative and additive modeling approaches.

Abstract: Interpretability is one of the considerations when applying machine learning to high-stakes fields such as healthcare that involve matters of life safety. Generalized Additive Models (GAMs) enhance interpretability by visualizing shape functions. Nevertheless, to preserve interpretability, GAMs omit higher-order interaction effects (beyond pairwise interactions), which imposes significant constraints on their predictive performance. We observe that Curve Ergodic Set Regression (CESR), a multiplicative model, naturally enables the visualization of its shape functions and simultaneously incorporates both interactions among all features and individual feature effects. Nevertheless, CESR fails to demonstrate superior performance compared to GAMs. We introduce Multiplicative-Additive Constrained Models (MACMs), which augment CESR with an additive part to disentangle the intertwined coefficients of its interactive and independent terms, thus effectively broadening the hypothesis space. The model is composed of a multiplicative part and an additive part, whose shape functions can both be naturally visualized, thereby assisting users in interpreting how features participate in the decision-making process. Consequently, MACMs constitute an improvement over both CESR and GAMs. The experimental results indicate that neural network-based MACMs significantly outperform both CESR and the current state-of-the-art GAMs in terms of predictive performance.
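
The model family is easy to write down: per-feature 1-D shape networks feed a product term for interactions and a sum term for independent effects, both of which can be plotted. A hedged PyTorch sketch with illustrative layer sizes:

```python
import torch
import torch.nn as nn

class MACMSketch(nn.Module):
    """Multiplicative part (product of per-feature shape functions) plus
    additive part (sum of per-feature shape functions) plus bias."""

    def __init__(self, n_feats: int, hidden: int = 16):
        super().__init__()
        mk = lambda: nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.mult = nn.ModuleList(mk() for _ in range(n_feats))
        self.add = nn.ModuleList(mk() for _ in range(n_feats))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):                      # x: (batch, n_feats)
        cols = x.split(1, dim=1)               # one (batch, 1) column per feature
        prod = torch.stack([f(c) for f, c in zip(self.mult, cols)], -1).prod(-1)
        summ = torch.stack([g(c) for g, c in zip(self.add, cols)], -1).sum(-1)
        return (prod + summ + self.bias).squeeze(-1)
```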

[301] To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration

Zeyu Yang, Tianyi Zhang, Jianwen Xie, Chuan Li, Zhaozhuo Xu, Anshumali Shrivastava

Main category: cs.LG

TL;DR: The paper identifies an exponent concentration phenomenon in GenAI models and proposes ECF8, a lossless FP8 compression framework that achieves significant memory savings and throughput acceleration while maintaining perfect numerical accuracy.

DetailsMotivation: The scaling of GenAI models to hundreds of billions of parameters requires efficient low-precision computation. Current approaches face limitations, and the authors argue that low-precision floating-point formats provide better numerical stability and efficiency without dequantization overhead.

Method: The authors conduct theoretical and empirical analysis of exponent concentration in GenAI weights, proving tight bounds on exponent entropy. They design ECF8, an entropy-aware encoding framework with GPU-optimized decoding that leverages the observed exponent concentration phenomenon.

Result: Experiments on LLMs and DiTs up to 671B parameters show up to 26.9% memory savings and 177.1% throughput acceleration, with perfectly lossless computations (no deviation in model outputs).

Conclusion: Exponent concentration is established as a statistical law of trained models, providing a principled path for lossless low-precision floating-point design in the FP8 era.

Abstract: The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision floating-point formats, which inherently provide numerical stability, memory savings, and hardware efficiency without dequantization overhead. In this paper, we present a theoretical and empirical study of an exponent concentration phenomenon in GenAI weights: exponents consistently exhibit low entropy across architectures and modalities. We show that this arises naturally from $\alpha$-stable distributions induced by stochastic gradient descent, and we prove tight bounds on the entropy of exponents. Our analysis establishes a theoretical compression limit near FP4.67, which motivates the design of a practical FP8 format. Building on these insights, we propose Exponent-Concentrated FP8 (ECF8), a lossless compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9% memory savings and 177.1% throughput acceleration, with perfectly lossless computations, i.e., no deviation in model outputs. Our results establish exponent concentration as a statistical law of trained models and open a principled path for lossless low-precision floating-point design in the FP8 era.
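
The underlying diagnostic is easy to reproduce: extract the exponent bit-field of a weight tensor and measure its Shannon entropy. The sketch below uses float16 for convenience (the paper targets FP8, whose exponent field is narrower).

```python
import numpy as np

def exponent_entropy_fp16(weights: np.ndarray) -> float:
    """Shannon entropy of the float16 exponent bits; low entropy means
    exponent concentration, i.e., headroom for lossless entropy coding."""
    bits = np.ascontiguousarray(weights.astype(np.float16)).view(np.uint16)
    exponents = (bits >> 10) & 0x1F              # 5 exponent bits in float16
    _, counts = np.unique(exponents, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())        # max 5 bits for float16

# Trained-weight-like values concentrate well below the 5-bit maximum:
print(exponent_entropy_fp16(0.02 * np.random.randn(100_000)))
```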

[302] Can Data-Driven Dynamics Reveal Hidden Physics? There Is A Need for Interpretable Neural Operators

Wenhan Gao, Jian Luo, Fang Wan, Ruichen Xu, Xiang Liu, Haipeng Xing, Yi Liu

Main category: cs.LG

TL;DR: This paper classifies neural operators into spatial and functional domain models, analyzes their learning mechanisms for physics-informed dynamics, proposes explanation methods, and advocates for dual-space multi-scale approaches and principled physics integration.

DetailsMotivation: To better understand neural operators' learning mechanisms for data-driven simulations of complex dynamics, particularly how they learn physical principles from data.

Method: Classifies neural operators into spatial domain models (learning on grids) and functional domain models (learning with function bases). Proposes explanation methods for neural operator predictions and introduces dual-space multi-scale models.

Result: Shows neural operators can learn hidden physical patterns from data, and that simple dual-space multi-scale models achieve state-of-the-art performance. Identifies limitations in current explanation methods.

Conclusion: Dual-space multi-spatio-scale models have significant potential for learning complex physics, and there is an urgent need for generalizable explanation methods and principled frameworks to incorporate known physics into neural operators for better generalization.

Abstract: Recently, neural operators have emerged as powerful tools for learning mappings between function spaces, enabling data-driven simulations of complex dynamics. Despite their successes, a deeper understanding of their learning mechanisms remains underexplored. In this work, we classify neural operators into two types: (1) Spatial domain models that learn on grids and (2) Functional domain models that learn with function bases. We present several viewpoints based on this classification and focus on learning data-driven dynamics adhering to physical principles. Specifically, we provide a way to explain the prediction-making process of neural operators and show that neural operator can learn hidden physical patterns from data. However, this explanation method is limited to specific situations, highlighting the urgent need for generalizable explanation methods. Next, we show that a simple dual-space multi-scale model can achieve SOTA performance and we believe that dual-space multi-spatio-scale models hold significant potential to learn complex physics and require further investigation. Lastly, we discuss the critical need for principled frameworks to incorporate known physics into neural operators, enabling better generalization and uncovering more hidden physical phenomena.

[303] EvoSpeak: Large Language Models for Interpretable Genetic Programming-Evolved Heuristics

Meng Xu, Jiao Liu, Yew Soon Ong

Main category: cs.LG

TL;DR: EvoSpeak integrates genetic programming with LLMs to enhance heuristic evolution by improving efficiency, interpretability, and transferability across optimization tasks.

DetailsMotivation: Address challenges in dynamic/large-scale scenarios where complex GP heuristics hinder interpretability, slow convergence, and limit transferability.

Method: Integrates GP with LLMs to learn from high-quality GP heuristics, extract knowledge, generate warm-start populations, translate GP trees into natural language explanations, and enable cross-task knowledge transfer.

Result: Produces more effective heuristics, improves evolutionary efficiency, and delivers human-readable reports that enhance usability in dynamic flexible job shop scheduling.

Conclusion: EvoSpeak advances intelligent, transparent, and user-aligned heuristics by coupling GP’s symbolic reasoning with LLMs’ interpretative and generative strengths.

Abstract: Genetic programming (GP) has demonstrated strong effectiveness in evolving tree-structured heuristics for complex optimization problems. Yet, in dynamic and large-scale scenarios, the most effective heuristics are often highly complex, hindering interpretability, slowing convergence, and limiting transferability across tasks. To address these challenges, we present EvoSpeak, a novel framework that integrates GP with large language models (LLMs) to enhance the efficiency, transparency, and adaptability of heuristic evolution. EvoSpeak learns from high-quality GP heuristics, extracts knowledge, and leverages this knowledge to (i) generate warm-start populations that accelerate convergence, (ii) translate opaque GP trees into concise natural-language explanations that foster interpretability and trust, and (iii) enable knowledge transfer and preference-aware heuristic generation across related tasks. We verify the effectiveness of EvoSpeak through extensive experiments on dynamic flexible job shop scheduling (DFJSS), under both single- and multi-objective formulations. The results demonstrate that EvoSpeak produces more effective heuristics, improves evolutionary efficiency, and delivers human-readable reports that enhance usability. By coupling the symbolic reasoning power of GP with the interpretative and generative strengths of LLMs, EvoSpeak advances the development of intelligent, transparent, and user-aligned heuristics for real-world optimization problems.

[304] Fine-Tuning Diffusion Models via Intermediate Distribution Shaping

Gautham Govind Anil, Shaan Ul Haque, Nithish Kannen, Dheeraj Nagaraj, Sanjay Shakkottai, Karthikeyan Shanmugam

Main category: cs.LG

TL;DR: The paper introduces GRAFT and P-GRAFT methods for fine-tuning diffusion models using reward functions, showing they implicitly perform PPO with reshaped rewards and achieve better performance than policy gradient methods on various generation tasks.

DetailsMotivation: While pre-trained diffusion models capture training data distributions, it's desirable to shape these distributions using reward functions for downstream applications. Policy gradient methods like PPO work for autoregressive generation but are intractable for diffusion models due to marginal likelihood requirements.

Method: Unify RAFT variants as GRAFT, introduce P-GRAFT to shape distributions at intermediate noise levels, propose inverse noise correction for flow models without explicit rewards, and evaluate on text-to-image generation, layout generation, molecule generation, and unconditional image generation.

Result: Applied to Stable Diffusion 2, the framework improves over policy gradient methods on T2I benchmarks in VQAScore with 8.81% relative improvement over base model. For unconditional image generation, inverse noise correction improves FID at lower FLOPs/image.

Conclusion: The proposed methods provide effective fine-tuning for diffusion models through a bias-variance tradeoff analysis, demonstrating superior performance across multiple generation domains compared to existing approaches.

Abstract: Diffusion models are widely used for generative tasks across domains. While pre-trained diffusion models effectively capture the training data distribution, it is often desirable to shape these distributions using reward functions to align with downstream applications. Policy gradient methods, such as Proximal Policy Optimization (PPO), are widely used in the context of autoregressive generation. However, the marginal likelihoods required for such methods are intractable for diffusion models, leading to alternative proposals and relaxations. In this context, we unify variants of Rejection sAmpling based Fine-Tuning (RAFT) as GRAFT, and show that this implicitly performs PPO with reshaped rewards. We then introduce P-GRAFT to shape distributions at intermediate noise levels and demonstrate empirically that this can lead to more effective fine-tuning. We mathematically explain this via a bias-variance tradeoff. Motivated by this, we propose inverse noise correction to improve flow models without leveraging explicit rewards. We empirically evaluate our methods on text-to-image (T2I) generation, layout generation, molecule generation and unconditional image generation. Notably, our framework, applied to Stable Diffusion 2, improves over policy gradient methods on popular T2I benchmarks in terms of VQAScore and shows an 8.81% relative improvement over the base model. For unconditional image generation, inverse noise correction improves FID of generated images at lower FLOPs/image.
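
One round of rejection-sampling fine-tuning, in the GRAFT framing, can be sketched as sample / filter / imitate. Here `model.generate` and `model.nll` are assumed helper methods, not a fixed API; P-GRAFT would apply the same filtering at an intermediate noise level instead of at the final samples.

```python
import torch

def graft_step(model, prompts, reward_fn, n_samples=8, keep_frac=0.25):
    """Sample several candidates per prompt, keep the top fraction by reward,
    and take a supervised (negative log-likelihood) step on the survivors."""
    kept = []
    for prompt in prompts:
        samples = [model.generate(prompt) for _ in range(n_samples)]
        scored = sorted(samples, key=reward_fn, reverse=True)
        kept += [(prompt, s) for s in scored[: max(1, int(keep_frac * n_samples))]]
    loss = torch.stack([model.nll(p, s) for p, s in kept]).mean()
    loss.backward()          # caller applies the optimizer step
    return loss
```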

[305] RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning and the Role of Behavior Regularization

Kai Fukazawa, Kunal Mundada, Iman Soltani

Main category: cs.LG

TL;DR: RAMAC is a risk-aware offline RL framework that combines expressive generative actors with distributional critics to achieve high returns while minimizing catastrophic risk in safety-critical domains.

DetailsMotivation: Address the gap in offline RL where prior risk-averse methods sacrifice policy expressiveness for safety, while expressive policies are only used in risk-neutral settings.

Method: Couples expressive generative actors (diffusion and flow-matching) with distributional critics, using a composite objective that combines distributional risk and behavior cloning loss through the generative path.

Result: Consistent gains in CVaR0.1 (conditional value at risk) while maintaining strong returns on most Stochastic-D4RL tasks.

Conclusion: RAMAC successfully enables risk-sensitive learning in complex multimodal scenarios without sacrificing policy expressiveness.

Abstract: In safety-critical domains where online data collection is infeasible, offline reinforcement learning (RL) offers an attractive alternative but only if policies deliver high returns without incurring catastrophic lower-tail risk. Prior work on risk-averse offline RL achieves safety at the cost of value conservatism and restricted policy classes, whereas expressive policies are only used in risk-neutral settings. Here, we address this gap by introducing the \textbf{Risk-Aware Multimodal Actor-Critic (RAMAC)} framework, which couples an \emph{expressive generative actor} with a distributional critic. RAMAC differentiates a composite objective, combining distributional risk with a behavior-cloning (BC) loss, through the generative path, achieving risk-sensitive learning in complex multimodal scenarios. We instantiate RAMAC with diffusion and flow-matching actors and observe consistent gains in $\mathrm{CVaR}_{0.1}$ while maintaining strong returns on most Stochastic-D4RL tasks. Code: https://github.com/KaiFukazawa/RAMAC.git
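
With a quantile-based distributional critic, the lower-tail objective is nearly a one-liner: average the lowest α-fraction of predicted return quantiles. A hedged sketch (the paper's critic parameterization may differ):

```python
import torch

def cvar_from_quantiles(quantile_values: torch.Tensor, alpha: float = 0.1):
    """CVaR_alpha from N equally weighted return quantiles per state-action:
    the mean of the lowest alpha-fraction of quantiles - the lower-tail
    objective a risk-aware actor is trained to raise.

    quantile_values: (..., N) sorted or unsorted quantile estimates."""
    n = quantile_values.shape[-1]
    k = max(1, int(alpha * n))
    lowest, _ = torch.sort(quantile_values, dim=-1)
    return lowest[..., :k].mean(dim=-1)
```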

[306] A Novel Unified Lightweight Temporal-Spatial Transformer Approach for Intrusion Detection in Drone Networks

Tarun Kumar Biswas, Ashrafun Zannat, Waqas Ishtiaq, Md. Alamgir Hossain

Main category: cs.LG

TL;DR: TSLT-Net is a lightweight intrusion detection system for drone networks that uses temporal-spatial transformers to achieve 99.99% multiclass accuracy and 100% binary anomaly detection with minimal resource usage.

DetailsMotivation: Drone networks face significant cybersecurity challenges and existing intrusion detection systems lack adaptability, efficiency, and generalizability for dynamic, resource-constrained drone environments.

Method: Proposes TSLT-Net, a temporal spatial transformer-based system that uses self-attention mechanisms to model temporal patterns and spatial dependencies in network traffic, with streamlined preprocessing and unified architecture for both multiclass classification and binary anomaly detection.

Result: Achieved 99.99% accuracy in multiclass detection and 100% in binary anomaly detection on ISOT Drone Anomaly Detection Dataset (2.3M+ records), with only 0.04 MB memory footprint and 9722 trainable parameters.

Conclusion: TSLT-Net is an effective, scalable solution for real-time drone cybersecurity suitable for deployment on edge devices in mission-critical UAV systems.

Abstract: The growing integration of drones across commercial, industrial, and civilian domains has introduced significant cybersecurity challenges, particularly due to the susceptibility of drone networks to a wide range of cyberattacks. Existing intrusion detection mechanisms often lack the adaptability, efficiency, and generalizability required for the dynamic and resource-constrained environments in which drones operate. This paper proposes TSLT-Net, a novel lightweight and unified Temporal-Spatial Transformer-based intrusion detection system tailored specifically for drone networks. By leveraging self-attention mechanisms, TSLT-Net effectively models both temporal patterns and spatial dependencies in network traffic, enabling accurate detection of diverse intrusion types. The framework includes a streamlined preprocessing pipeline and supports both multiclass attack classification and binary anomaly detection within a single architecture. Extensive experiments conducted on the ISOT Drone Anomaly Detection Dataset, consisting of more than 2.3 million labeled records, demonstrate the superior performance of TSLT-Net with 99.99 percent accuracy in multiclass detection and 100 percent in binary anomaly detection, while maintaining a minimal memory footprint of only 0.04 MB and 9722 trainable parameters. These results establish TSLT-Net as an effective and scalable solution for real-time drone cybersecurity, particularly suitable for deployment on edge devices in mission-critical UAV systems.

[307] CST-AFNet: A dual attention-based deep learning framework for intrusion detection in IoT networks

Waqas Ishtiaq, Ashrafun Zannat, A. H. M. Shahariar Parvez, Md. Alamgir Hossain, Muntasir Hasan Kanchan, Muhammad Masud Tarek

Main category: cs.LG

TL;DR: CST-AFNet is a dual-attention-based deep learning framework for IoT intrusion detection that achieves 99.97% accuracy on the Edge IIoTset dataset, outperforming traditional models.

DetailsMotivation: The rapid IoT expansion introduces complex cybersecurity challenges due to heterogeneous, resource-constrained, and distributed environments, requiring robust intrusion detection solutions.

Method: Integrates multi-scale CNNs for spatial features, BiGRUs for temporal dependencies, and dual attention mechanism (channel and temporal) to focus on critical patterns in IoT network data.

Result: Achieves 99.97% accuracy for 15 attack types and benign traffic, with macro-averaged precision, recall, and F1 score all above 99.3%, significantly outperforming traditional deep learning models.

Conclusion: CST-AFNet is a powerful and scalable solution for real-time cyber threat detection in complex IoT environments, enabling more secure and adaptive cyber-physical systems.

Abstract: The rapid expansion of the Internet of Things (IoT) has revolutionized modern industries by enabling smart automation and real-time connectivity. However, this evolution has also introduced complex cybersecurity challenges due to the heterogeneous, resource-constrained, and distributed nature of these environments. To address these challenges, this research presents CST-AFNet, a novel dual-attention-based deep learning framework specifically designed for robust intrusion detection in IoT networks. The model integrates multi-scale Convolutional Neural Networks (CNNs) for spatial feature extraction, Bidirectional Gated Recurrent Units (BiGRUs) for capturing temporal dependencies, and a dual attention mechanism, channel and temporal attention, to enhance focus on critical patterns in the data. The proposed method was trained and evaluated on the Edge-IIoTset dataset, a comprehensive and realistic benchmark containing more than 2.2 million labeled instances spanning 15 attack types and benign traffic, collected from a seven-layer industrial testbed. Our proposed model achieves outstanding accuracy for both 15 attack types and benign traffic. CST-AFNet achieves 99.97 percent accuracy. Moreover, this model demonstrates exceptional performance with macro-averaged precision, recall, and F1 score all above 99.3 percent. Experimental results show that CST-AFNet achieves superior detection accuracy, significantly outperforming traditional deep learning models. The findings confirm that CST-AFNet is a powerful and scalable solution for real-time cyber threat detection in complex IoT and IIoT environments, paving the way for more secure, intelligent, and adaptive cyber-physical systems.
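
As a rough illustration of the dual attention idea (channel gating followed by temporal gating), consider this PyTorch sketch; layer sizes and the specific gating choices are assumptions for clarity, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Channel + temporal attention block in the spirit of CST-AFNet."""
    def __init__(self, channels):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.Linear(channels, channels // 2), nn.ReLU(),
            nn.Linear(channels // 2, channels), nn.Sigmoid())
        self.temporal_gate = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):                      # x: (batch, time, channels)
        c = self.channel_gate(x.mean(dim=1))   # (batch, channels)
        x = x * c.unsqueeze(1)                 # reweight feature channels
        t = torch.softmax(self.temporal_gate(x.transpose(1, 2)), dim=-1)
        return x * t.transpose(1, 2)           # reweight time steps

x = torch.randn(4, 100, 32)
print(DualAttention(32)(x).shape)              # torch.Size([4, 100, 32])
```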

[308] Hyperparameter Loss Surfaces Are Simple Near their Optima

Nicholas Lourie, He He, Kyunghyun Cho

Main category: cs.LG

TL;DR: The paper reveals that hyperparameter loss surfaces exhibit simple structure near optima, characterized by effective dimension and best possible loss. A novel random search technique uncovers this asymptotic regime, enabling new analysis tools for hyperparameter optimization.

DetailsMotivation: Hyperparameters significantly impact model performance, but modern models are too large for extensive search. Few tools exist for understanding hyperparameter loss surfaces, despite their importance in developing effective training recipes.

Method: Developed a novel random search technique to uncover the asymptotic regime of hyperparameter loss surfaces. Analyzed the distribution of best scores from random search, whose parameters correspond to features defining the loss surface in the asymptotic regime.

Result: Discovered that hyperparameter loss surfaces become simple near optima, characterized by basic features like effective dimension and best possible loss. Derived a new asymptotic law for random search that explains and extrapolates its convergence.

Conclusion: The new tools enable novel analyses including confidence intervals for best possible performance and determining effective number of hyperparameters. These tools are made publicly available for hyperparameter optimization research.

Abstract: Hyperparameters greatly impact models’ capabilities; however, modern models are too large for extensive search. Instead, researchers design recipes that train well across scales based on their understanding of the hyperparameters. Despite this importance, few tools exist for understanding the hyperparameter loss surface. We discover novel structure in it and propose a new theory yielding such tools. The loss surface is complex, but as you approach the optimum, simple structure emerges. It becomes characterized by a few basic features, like its effective dimension and the best possible loss. To uncover this asymptotic regime, we develop a novel technique based on random search. Within this regime, the best scores from random search take on a new distribution we discover. Its parameters are exactly the features defining the loss surface in the asymptotic regime. From these features, we derive a new asymptotic law for random search that can explain and extrapolate its convergence. These new tools enable new analyses, such as confidence intervals for the best possible performance or determining the effective number of hyperparameters. We make these tools available at https://github.com/nicholaslourie/opda.
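
A toy experiment makes the "effective dimension governs random-search convergence" intuition tangible. The surface, the dimension count, and the decay exponent below are illustrative stand-ins, not the paper's actual law or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy surface: a quadratic bowl over 2 "effective" dimensions plus 6 inert
# ones; the best possible loss is 0.1. Both quantities stand in for the
# features the paper identifies; this is an illustration, not the paper's law.
def loss(x):
    return 0.1 + np.sum(x[:2] ** 2)

samples = rng.uniform(-1.0, 1.0, size=(10_000, 8))
losses = np.apply_along_axis(loss, 1, samples)
best_so_far = np.minimum.accumulate(losses)

# For a d-dimensional quadratic bowl, the excess loss of best-of-t random
# search decays roughly like t**(-2/d); with d = 2 the log-log slope is ~ -1.
t = np.arange(1, losses.size + 1)
slope = np.polyfit(np.log(t[100:]), np.log(best_so_far[100:] - 0.1), 1)[0]
print(f"log-log slope ~ {slope:.2f} (expect about -1 for effective dim 2)")
```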

[309] Accuracy Law for the Future of Deep Time Series Forecasting

Yuxuan Wang, Haixu Wu, Yuezhou Ma, Yuchen Fang, Ziyi Zhang, Yong Liu, Shiyu Wang, Zhou Ye, Yang Xiang, Jianmin Wang, Mingsheng Long

Main category: cs.LG

TL;DR: The paper proposes an accuracy law that estimates performance upper bounds for deep time series forecasting by discovering an exponential relationship between minimum forecasting error and window-wise series pattern complexity.

DetailsMotivation: To address confusion in deep time series forecasting research caused by minor benchmark improvements and to establish a fundamental understanding of performance limits, since time series forecasting, unlike image recognition, inherently has a non-zero error lower bound.

Method: Conducted rigorous statistical tests on over 2,800 trained deep forecasters to analyze window-wise series properties (beyond classical series-wise metrics) and discovered the exponential relationship between minimum error and pattern complexity.

Result: Found a significant exponential relationship (accuracy law) between minimum forecasting error and window-wise series pattern complexity, which successfully identifies saturated tasks and derives effective training strategies for large time series models.

Conclusion: The accuracy law provides valuable insights for future research by establishing performance upper bounds, guiding researchers away from saturated tasks, and offering effective training strategies for large-scale time series forecasting models.

Abstract: Deep time series forecasting has emerged as a booming direction in recent years. Despite the exponential growth of community interests, researchers are sometimes confused about the direction of their efforts due to minor improvements on standard benchmarks. In this paper, we notice that, unlike image recognition, whose well-acknowledged and realizable goal is 100% accuracy, time series forecasting inherently faces a non-zero error lower bound due to its partially observable and uncertain nature. To pinpoint the research objective and release researchers from saturated tasks, this paper focuses on a fundamental question: how to estimate the performance upper bound of deep time series forecasting? Going beyond classical series-wise predictability metrics, e.g., ADF test, we realize that the forecasting performance is highly related to window-wise properties because of the sequence-to-sequence forecasting paradigm of deep time series models. Based on rigorous statistical tests of over 2,800 newly trained deep forecasters, we discover a significant exponential relationship between the minimum forecasting error of deep models and the complexity of window-wise series patterns, which is termed the accuracy law. The proposed accuracy law successfully guides us to identify saturated tasks from widely used benchmarks and derives an effective training strategy for large time series models, offering valuable insights for future research.

[310] Dale meets Langevin: A Multiplicative Denoising Diffusion Model

Nishanth Shetty, Madhava Prasath, Chandra Sekhar Seelamantula

Main category: cs.LG

TL;DR: This paper introduces a biologically inspired generative model using multiplicative updates based on geometric Brownian motion, connecting exponential gradient descent from Dale’s law with score-based generative modeling.

DetailsMotivation: Standard gradient descent is inconsistent with biological learning systems. The work aims to develop biologically inspired learning techniques, particularly drawing from Dale's law about inhibitory/excitatory synapses.

Method: Leverages connection between exponential gradient descent and geometric Brownian motion SDEs. Discretizes reverse-time SDE to get multiplicative update rules. Proposes multiplicative denoising score-matching formalism for log-normally distributed data.

Result: Developed novel generative model with multiplicative updates. Demonstrated generative capability on MNIST, Fashion MNIST, and Kuzushiji datasets. First biologically inspired generative model using multiplicative updates based on geometric Brownian motion.

Conclusion: Successfully bridges biological learning principles with modern generative modeling, creating a novel multiplicative update scheme for score-based models that aligns with Dale’s law and log-normal weight distributions.

Abstract: Gradient descent has proven to be a powerful and effective technique for optimization in numerous machine learning applications. Recent advances in computational neuroscience have shown that learning in the standard gradient descent optimization formulation is not consistent with learning in biological systems. This has opened up interesting avenues for building biologically inspired learning techniques. One such approach is inspired by Dale’s law, which states that inhibitory and excitatory synapses do not swap roles during the course of learning. The resulting exponential gradient descent optimization scheme leads to log-normally distributed synaptic weights. Interestingly, the density that satisfies the Fokker-Planck equation corresponding to the stochastic differential equation (SDE) with geometric Brownian motion (GBM) is the log-normal density. Leveraging this connection, we start with the SDE governing geometric Brownian motion, and show that discretizing the corresponding reverse-time SDE yields a multiplicative update rule, which, surprisingly, coincides with the sampling equivalent of the exponential gradient descent update founded on Dale’s law. Furthermore, we propose a new formalism for multiplicative denoising score-matching, subsuming the loss function proposed by Hyvärinen for non-negative data. Indeed, log-normally distributed data is positive and the proposed score-matching formalism turns out to be a natural fit. This allows for training of score-based models for image data and results in a novel multiplicative update scheme for sample generation starting from a log-normal density. Experimental results on MNIST, Fashion MNIST, and Kuzushiji datasets demonstrate the generative capability of the new scheme. To the best of our knowledge, this is the first instance of a biologically inspired generative model employing multiplicative updates, founded on geometric Brownian motion.
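
For concreteness, the multiplicative flavor of the update can be sketched with the textbook exponentiated-gradient rule for positive weights, used here as an illustrative stand-in for the paper's GBM-derived update (the exact rule the paper derives from the reverse-time SDE differs):

```python
import numpy as np

def eg_step(w, grad, lr=0.01):
    # Multiplicative update: weights are rescaled, never sign-flipped, so an
    # excitatory (positive) weight stays excitatory, consistent with Dale's law.
    return w * np.exp(-lr * grad)

w = np.abs(np.random.default_rng(0).normal(size=5)) + 0.1
for _ in range(100):
    w = eg_step(w, grad=2 * (w - 0.5))   # toy quadratic loss (w - 0.5)^2
print(w)  # converges toward 0.5 while remaining positive
```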

[311] Hybrid-Collaborative Augmentation and Contrastive Sample Adaptive-Differential Awareness for Robust Attributed Graph Clustering

Tianxiang Zhao, Youqing Wang, Jinlu Wang, Jiapu Wang, Mingliang Cui, Junbin Gao, Jipeng Guo

Main category: cs.LG

TL;DR: Proposes RAGC method with hybrid-collaborative augmentation and contrastive sample adaptive-differential awareness to improve graph clustering by addressing limitations in existing contrastive attributed graph clustering approaches.

DetailsMotivation: Existing CAGC methods focus only on node-level embedding augmentation and treat all contrastive sample pairs equally, overlooking edge-level augmentation and differences between hard/easy sample pairs, limiting discriminative capability.

Method: RAGC incorporates hybrid-collaborative augmentation (simultaneous node-level and edge-level embedding representations and augmentations) and contrastive sample adaptive-differential awareness (adaptive identification and differential treatment of contrastive pairs using pseudo-label information and weight modulation function).

Result: Comprehensive evaluations on six benchmark datasets demonstrate RAGC’s effectiveness against several state-of-the-art CAGC methods.

Conclusion: The proposed RAGC method successfully addresses limitations of existing approaches through mutual reinforcement between HCA and CSADA modules, enhancing discriminability in representation learning for graph clustering.

Abstract: Due to its powerful capability of self-supervised representation learning and clustering, contrastive attributed graph clustering (CAGC) has achieved great success, which mainly depends on effective data augmentation and contrastive objective setting. However, most CAGC methods utilize edges as auxiliary information to obtain node-level embedding representations and only focus on node-level embedding augmentation. This approach overlooks edge-level embedding augmentation and the interactions between node-level and edge-level embedding augmentations across various granularities. Moreover, they often treat all contrastive sample pairs equally, neglecting the significant differences between hard and easy positive-negative sample pairs, which ultimately limits their discriminative capability. To tackle these issues, a novel robust attributed graph clustering (RAGC) method, incorporating hybrid-collaborative augmentation (HCA) and contrastive sample adaptive-differential awareness (CSADA), is proposed. First, node-level and edge-level embedding representations and augmentations are simultaneously executed to establish a more comprehensive similarity measurement criterion for subsequent contrastive learning. In turn, the discriminative similarity further consciously guides edge augmentation. Second, by leveraging pseudo-label information with high confidence, a CSADA strategy is elaborately designed, which adaptively identifies all contrastive sample pairs and differentially treats them by an innovative weight modulation function. The HCA and CSADA modules mutually reinforce each other in a virtuous cycle, thereby enhancing discriminability in representation learning. Comprehensive graph clustering evaluations over six benchmark datasets demonstrate the effectiveness of the proposed RAGC against several state-of-the-art CAGC methods.

[312] TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling

Junyi Chen, Chuheng Du, Renyuan Liu, Shuochao Yao, Dingtian Yan, Jiang Liao, Shengzhong Liu, Fan Wu, Guihai Chen

Main category: cs.LG

TL;DR: TokenFlow is a novel LLM serving system that improves real-time text streaming performance through preemptive request scheduling and proactive KV cache management, achieving higher throughput and lower latency.

DetailsMotivation: Standard LLM serving systems have poor resource utilization and low request processing parallelism under bursts due to inflexible non-preemptive scheduling and reactive memory management, leading to poor streaming performance.

Method: TokenFlow uses preemptive request scheduling based on real-time token buffer occupancy and consumption rates, plus proactive KV cache transfer between GPU/CPU memory with I/O-computation overlap to minimize preemption overhead.

Result: Experiments on Llama3-8B and Qwen2.5-32B across multiple GPUs show TokenFlow achieves up to 82.5% higher effective throughput and reduces P99 TTFT by up to 80.2% without degrading overall token throughput.

Conclusion: TokenFlow significantly improves real-time LLM streaming performance through its preemptive scheduling and proactive KV cache management approach.

Abstract: Real-time LLM interactions demand streamed token generations, where text tokens are progressively generated and delivered to users while balancing two objectives: responsiveness (i.e., low time-to-first-token) and steady generation (i.e., meeting the required time-between-tokens). Standard LLM serving systems suffer from the inflexibility caused by non-preemptive request scheduling and reactive memory management, leading to poor resource utilization and low request processing parallelism under request bursts. Therefore, we present TokenFlow, a novel LLM serving system with enhanced text streaming performance via preemptive request scheduling and proactive key-value (KV) cache management. TokenFlow dynamically prioritizes requests based on real-time token buffer occupancy and token consumption rate, while actively transferring KV cache between GPU and CPU memory in the background and overlapping I/O with computation to minimize request preemption overhead. Extensive experiments on Llama3-8B and Qwen2.5-32B across multiple GPUs (RTX 4090, A6000, H200) demonstrate that TokenFlow achieves up to 82.5% higher effective throughput (accounting for actual user consumption) while reducing P99 TTFT by up to 80.2%, without degrading overall token throughput.
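
The scheduling signal can be pictured as "seconds of playback left in each client's token buffer": requests whose buffers will drain soonest get the GPU first. The priority below is a hypothetical illustration of occupancy/consumption-rate scheduling, not TokenFlow's actual policy.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    priority: float
    rid: int = field(compare=False)

def urgency(buffered_tokens, consume_rate_tps):
    # Seconds until the client's token buffer drains; smaller = more urgent.
    return buffered_tokens / max(consume_rate_tps, 1e-6)

queue = []
heapq.heappush(queue, Request(urgency(12, 8.0), rid=1))   # ~1.5 s of buffer
heapq.heappush(queue, Request(urgency(40, 5.0), rid=2))   # ~8.0 s of buffer
print(heapq.heappop(queue).rid)   # request 1 is scheduled first
```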

[313] Fusing Multi- and Hyperspectral Satellite Data for Harmful Algal Bloom Monitoring with Self-Supervised and Hierarchical Deep Learning

Nicholas LaHaye, Kelly M. Luis, Michelle M. Gierach

Main category: cs.LG

TL;DR: SIT-FUSE is a self-supervised framework that detects and maps harmful algal bloom severity and speciation using multi-sensor satellite data without requiring labeled datasets.

DetailsMotivation: To advance scalable HAB monitoring in label-scarce environments and enable operational self-supervised learning for global aquatic biogeochemistry.

Method: Fuses reflectance data from VIIRS, MODIS, Sentinel-3, PACE with TROPOMI solar-induced fluorescence using self-supervised representation learning and hierarchical deep clustering.

Result: Strong agreement with in-situ measurements of total phytoplankton, Karenia brevis, Alexandrium spp., and Pseudo-nitzschia spp. in Gulf of Mexico and Southern California (2018-2025).

Conclusion: The framework successfully advances scalable HAB monitoring and enables exploratory analysis through hierarchical embeddings, moving toward operational self-supervised learning for aquatic biogeochemistry.

Abstract: We present a self-supervised machine learning framework for detecting and mapping harmful algal bloom (HAB) severity and speciation using multi-sensor satellite data. By fusing reflectance data from operational instruments (VIIRS, MODIS, Sentinel-3, PACE) with TROPOMI solar-induced fluorescence (SIF), our framework, called SIT-FUSE, generates HAB severity and speciation products without requiring per-instrument labeled datasets. The framework employs self-supervised representation learning and hierarchical deep clustering to segment phytoplankton concentrations and speciations into interpretable classes, validated against in-situ data from the Gulf of Mexico and Southern California (2018-2025). Results show strong agreement with total phytoplankton, Karenia brevis, Alexandrium spp., and Pseudo-nitzschia spp. measurements. This work advances scalable HAB monitoring in label-scarce environments while enabling exploratory analysis via hierarchical embeddings: a critical step toward operationalizing self-supervised learning for global aquatic biogeochemistry.

[314] Curl Descent: Non-Gradient Learning Dynamics with Sign-Diverse Plasticity

Hugo Ninou, Jonathan Kadmon, N. Alex Cayco-Gajic

Main category: cs.LG

TL;DR: Biological neural networks may use non-gradient “curl” learning dynamics instead of pure gradient descent. These curl terms emerge naturally in networks with inhibitory-excitatory connectivity or Hebbian/anti-Hebbian plasticity, and can either stabilize learning or cause chaotic dynamics depending on their strength.

DetailsMotivation: To understand whether biological neural networks use gradient-based learning strategies, given the diversity of synaptic plasticity rules observed in experiments that may not fit gradient descent frameworks.

Method: Analyzed feedforward networks using student-teacher framework, systematically introducing non-gradient dynamics through neurons with rule-flipped plasticity to study curl terms in learning dynamics.

Result: Small curl terms preserve solution stability similar to gradient descent. Strong curl terms destabilize solutions, potentially causing chaotic dynamics that destroy performance, but can also speed learning by escaping saddles through temporary loss ascent.

Conclusion: Specific neural architectures can support robust learning via diverse non-gradient learning rules, challenging normative gradient-based learning theories and providing insights into biological learning mechanisms.

Abstract: Gradient-based algorithms are a cornerstone of artificial neural network training, yet it remains unclear whether biological neural networks use similar gradient-based strategies during learning. Experiments often discover a diversity of synaptic plasticity rules, but whether these amount to an approximation to gradient descent is unclear. Here we investigate a previously overlooked possibility: that learning dynamics may include fundamentally non-gradient “curl”-like components while still being able to effectively optimize a loss function. Curl terms naturally emerge in networks with inhibitory-excitatory connectivity or Hebbian/anti-Hebbian plasticity, resulting in learning dynamics that cannot be framed as gradient descent on any objective. To investigate the impact of these curl terms, we analyze feedforward networks within an analytically tractable student-teacher framework, systematically introducing non-gradient dynamics through neurons exhibiting rule-flipped plasticity. Small curl terms preserve the stability of the original solution manifold, resulting in learning dynamics similar to gradient descent. Beyond a critical value, strong curl terms destabilize the solution manifold. Depending on the network architecture, this loss of stability can lead to chaotic learning dynamics that destroy performance. In other cases, the curl terms can counterintuitively speed learning compared to gradient descent by allowing the weight dynamics to escape saddles by temporarily ascending the loss. Our results identify specific architectures capable of supporting robust learning via diverse learning rules, providing an important counterpoint to normative theories of gradient-based learning in neural networks.
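
A compact way to see how non-gradient dynamics can still optimize: take learning dynamics $\dot{w} = -(I + A)\nabla L(w)$ with $A = -A^\top$ antisymmetric (this gradient-plus-curl decomposition is a common formalization; the paper's rule-flipped plasticity construction may differ in detail). Then $\frac{d}{dt} L = \nabla L^\top \dot{w} = -\|\nabla L\|^2 - \nabla L^\top A \nabla L = -\|\nabla L\|^2 \le 0$, because the quadratic form of an antisymmetric matrix vanishes. The loss still decreases along trajectories, yet for $A \neq 0$ the vector field $-(I + A)\nabla L$ is not the gradient of any objective, which is exactly the sense in which such learning dynamics are "non-gradient."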

[315] A Granular Study of Safety Pretraining under Model Abliteration

Shashank Agnihotri, Jonas Jakubassa, Priyam Dey, Sachin Goyal, Bernt Schiele, Venkatesh Babu Radhakrishnan, Margret Keuper

Main category: cs.LG

TL;DR: The paper evaluates whether safety interventions like refusal training survive activation edits, using model abliteration to remove refusal-sensitive directions in LLMs.

DetailsMotivation: To determine if common safety interventions remain effective when models are modified at inference time with simple activation edits.

Method: Used model abliteration (lightweight projection) on Safety Pretraining checkpoints for SmolLM2-1.7B, evaluated 20 systems with 100 prompts, classified responses using multiple judges, and validated judge fidelity.

Result: Produced checkpoint-level characterization of robust safety components under abliteration, quantified judge selection impact, and established protocol for integrating inference-time edits in safety assessments.

Conclusion: Provides insights into which data-centric safety components remain robust under activation edits and outlines practical evaluation protocols for safety interventions.

Abstract: Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as Refusal or Non-Refusal using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
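
Mechanically, abliteration-style edits are often described as projecting a single direction out of weights that write to the residual stream. A minimal sketch of that projection (the paper's exact procedure, and how the direction $d$ is estimated, may differ):

```python
import torch

def project_out(W, d):
    # W: (out_features, in_features) weight acting on the residual stream
    # d: (in_features,) estimated refusal-sensitive direction
    d = d / d.norm()
    return W - torch.outer(W @ d, d)   # each row loses its component along d

W = torch.randn(8, 8)
d = torch.randn(8)
W_abl = project_out(W, d)
# The edited weight is exactly orthogonal to the removed direction:
print(torch.allclose(W_abl @ (d / d.norm()), torch.zeros(8), atol=1e-6))
```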

[316] Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification

Yuanfan Li, Yunwen Lei, Zheng-Chu Guo, Yiming Ying

Main category: cs.LG

TL;DR: This paper establishes optimal generalization rates for gradient descent with deep ReLU networks, achieving polynomial dependence on depth rather than exponential, and matching optimal SVM-type rates up to depth-dependent factors.

DetailsMotivation: Existing results either yield suboptimal generalization rates of O(1/√n) or focus on networks with smooth activation functions that incur exponential dependence on network depth. The paper aims to bridge this gap by achieving optimal rates with only polynomial depth dependence.

Method: The authors carefully trade off optimization and generalization errors, using a novel technique to control activation patterns near a reference model. This enables sharper Rademacher complexity bounds for deep ReLU networks trained with gradient descent.

Result: Under the assumption that the data are NTK separable with margin γ, the paper proves an excess risk rate of Õ(L⁴(1 + γL²)/(nγ²)), which aligns with the optimal SVM-type rate Õ(1/(nγ²)) up to depth-dependent factors.

Conclusion: The work demonstrates that gradient descent can achieve near-optimal generalization rates for deep ReLU networks with only polynomial dependence on depth, overcoming limitations of previous approaches that either had suboptimal rates or exponential depth dependence.

Abstract: Recent advances have significantly improved our understanding of the generalization performance of gradient descent (GD) methods in deep neural networks. A natural and fundamental question is whether GD can achieve generalization rates comparable to the minimax optimal rates established in the kernel setting. Existing results either yield suboptimal rates of $O(1/\sqrt{n})$, or focus on networks with smooth activation functions, incurring exponential dependence on network depth $L$. In this work, we establish optimal generalization rates for GD with deep ReLU networks by carefully trading off optimization and generalization errors, achieving only polynomial dependence on depth. Specifically, under the assumption that the data are NTK separable with margin $\gamma$, we prove an excess risk rate of $\widetilde{O}(L^4 (1 + \gamma L^2) / (n \gamma^2))$, which aligns with the optimal SVM-type rate $\widetilde{O}(1 / (n \gamma^2))$ up to depth-dependent factors. A key technical contribution is our novel control of activation patterns near a reference model, enabling a sharper Rademacher complexity bound for deep ReLU networks trained with gradient descent.

[317] OptunaHub: A Platform for Black-Box Optimization

Yoshihiko Ozaki, Shuhei Watanabe, Toshihiko Yanase

Main category: cs.LG

TL;DR: OptunaHub is a community platform that centralizes black-box optimization methods and benchmarks across domains like AutoML and Materials Informatics.

DetailsMotivation: To address the fragmentation of black-box optimization research across different domains and create a centralized platform for unified access.

Method: Provides unified Python APIs, a contributor package registry, and a web interface to promote searchability and cross-domain collaboration.

Result: A publicly available platform with source code hosted on GitHub that enables centralized access to BBO methods and benchmarks.

Conclusion: OptunaHub aims to foster a virtuous cycle of contributions and applications in black-box optimization research by providing a unified community platform.

Abstract: Black-box optimization (BBO) drives advances in domains such as AutoML and Materials Informatics, yet research efforts often remain fragmented across domains. We introduce OptunaHub (https://hub.optuna.org/), a community platform that centralizes BBO methods and benchmarks. OptunaHub provides unified Python APIs, a contributor package registry, and a web interface to promote searchability and cross-domain research. OptunaHub aims to foster a virtuous cycle of contributions and applications. The source code is publicly available in the optunahub, optunahub-registry, and optunahub-web repositories under the Optuna organization on GitHub (https://github.com/optuna/).
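
A minimal usage sketch of the platform's Python API: `optunahub.load_module` is the documented entry point, but the specific registry path and sampler class below should be treated as an example that may change (and may require that package's optional dependencies).

```python
import optuna
import optunahub

# Load a sampler package from the OptunaHub registry by its path.
module = optunahub.load_module(package="samplers/auto_sampler")

study = optuna.create_study(sampler=module.AutoSampler())
study.optimize(lambda t: (t.suggest_float("x", -5, 5) - 2) ** 2, n_trials=30)
print(study.best_params)   # should approach {"x": 2.0}
```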

[318] Relevance-Aware Thresholding in Online Conformal Prediction for Time Series

Théo Dupuy, Binbin Xu, Stéphane Perrey, Jacky Montmain, Abdelhak Imoussaten

Main category: cs.LG

TL;DR: This paper proposes enhancing Online Conformal Prediction (OCP) for time series by replacing the binary threshold update with relevance-based functions that quantify prediction interval quality using ground truth, leading to narrower intervals while maintaining coverage validity.

DetailsMotivation: Current OCP methods focus only on whether ground truth falls inside/outside prediction intervals during threshold updates, ignoring interval relevance. This can lead to abrupt threshold changes and suboptimal interval widths.

Method: Proposes replacing binary evaluation (inside/outside) with broader class of functions that quantify prediction interval relevance using ground truth, preventing abrupt threshold changes.

Result: Experimental results on real-world datasets show the proposed functions produce tighter prediction intervals compared to existing OCP methods while maintaining coverage validity.

Conclusion: Leveraging relevance-based functions in threshold updates improves OCP by generating narrower prediction intervals without compromising coverage guarantees.

Abstract: Uncertainty quantification has received considerable interest in recent works in Machine Learning. In particular, Conformal Prediction (CP) gains ground in this field. For the case of time series, Online Conformal Prediction (OCP) becomes an option to address the problem of data distribution shift over time. Indeed, the idea of OCP is to update a threshold of some quantity (whether the miscoverage level or the quantile) based on the distribution observation. To evaluate the performance of OCP methods, two key aspects are typically considered: the coverage validity and the prediction interval width minimization. Recently, new OCP methods have emerged, offering long-run coverage guarantees and producing more informative intervals. However, during the threshold update step, most of these methods focus solely on the validity of the prediction intervals, that is, whether the ground truth falls inside or outside the interval, without accounting for their relevance. In this paper, we aim to leverage this overlooked aspect. Specifically, we propose enhancing the threshold update step by replacing the binary evaluation (inside/outside) with a broader class of functions that quantify the relevance of the prediction interval using the ground truth. This approach helps prevent abrupt threshold changes, potentially resulting in narrower prediction intervals. Indeed, experimental results on real-world datasets suggest that these functions can produce tighter intervals compared to existing OCP methods while maintaining coverage validity.
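
The family of updates being generalized is easy to sketch: the classic online conformal update moves a threshold by a binary miss indicator, and the paper's idea is to substitute a graded relevance score. The particular relevance function below is a hypothetical example, not the paper's.

```python
def update_threshold(theta, rel, target_miscoverage=0.1, lr=0.05):
    # Classic online conformal update with the 0/1 miss indicator replaced by
    # a graded relevance score rel in [0, 1]; using the indicator itself as
    # rel recovers the standard rule.
    return theta + lr * (rel - target_miscoverage)

def graded_relevance(y, lo, hi, scale=1.0):
    # Hypothetical relevance: 0 inside the interval, then growing smoothly
    # with the distance by which the ground truth missed it.
    if lo <= y <= hi:
        return 0.0
    gap = min(abs(y - lo), abs(y - hi))
    return min(1.0, gap / scale)

theta = 1.0
theta = update_threshold(theta, graded_relevance(3.2, lo=0.0, hi=3.0))
print(theta)  # 1.005: a small, smooth adjustment rather than a binary jump
```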

[319] Dissecting Transformers: A CLEAR Perspective towards Green AI

Hemang Jain, Shailender Goyal, Divyansh Pandey, Karthik Vaidhyanathan

Main category: cs.LG

TL;DR: This paper presents CLEAR, a novel methodology for fine-grained energy measurement of transformer components during LLM inference, revealing that attention blocks consume disproportionately more energy per FLOP than other components.

DetailsMotivation: LLM inference dominates AI energy footprint but current studies only report coarse model-level metrics, treating energy efficiency as an afterthought rather than primary objective.

Method: Component-Level Energy Assessment via Repeated sampling (CLEAR) - overcomes temporal mismatch between microsecond component execution and millisecond energy sensors by using repeated sampling to measure 15 models across four architecture types.

Result: Attention blocks consume significantly more energy per FLOP than other components, showing FLOPs alone fail to capture true energy costs. CLEAR achieves <9.5% component-wise variance while capturing >90% of total model energy.

Conclusion: Establishes detailed component-level energy baselines and provides insights for building energy-efficient transformer models through component-level optimizations.

Abstract: The rapid adoption of Large Language Models (LLMs) has raised significant environmental concerns. Unlike the one-time cost of training, LLM inference occurs continuously at a global scale and now dominates the AI energy footprint. Yet, most sustainability studies report only coarse, model-level metrics due to the lack of fine-grained measurement methods, treating energy efficiency more as an afterthought than as a primary objective. We present the first fine-grained empirical analysis of inference energy across core components of the transformer architecture. We propose a novel methodology, Component-Level Energy Assessment via Repeated sampling (CLEAR), to overcome the temporal mismatch between microsecond-scale component execution and millisecond-scale (ms) energy sensor readings. Using CLEAR, we evaluate 15 models spanning four distinct architecture types and consistently keep component-wise energy variance below 9.5% while capturing more than 90% of the model’s total energy as individual components. Our empirical analysis reveals that Attention blocks consume significantly more energy per floating-point operation (FLOP), indicating that energy consumption is not proportionally aligned with FLOP counts. This shows that FLOPs alone fail to capture the true energy cost at a component level. Our findings establish detailed component-level energy baselines and provide insight as an initial step toward building energy-efficient transformer models through component-level optimizations.
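
The core measurement trick is simple to sketch: a component that runs in microseconds is repeated enough times that its energy delta becomes visible to a millisecond-scale sensor, then divided back. Here `read_energy_joules` is a stand-in for a real counter (e.g., an NVML energy query); the paper's full protocol also handles overheads and variance.

```python
import time

def component_energy(component_fn, read_energy_joules, repeats=10_000):
    e0, t0 = read_energy_joules(), time.perf_counter()
    for _ in range(repeats):
        component_fn()                      # microsecond-scale kernel
    e1, t1 = read_energy_joules(), time.perf_counter()
    # Per-invocation energy and latency, recovered by averaging over repeats.
    return (e1 - e0) / repeats, (t1 - t0) / repeats
```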

[320] Mitigating Spurious Correlation via Distributionally Robust Learning with Hierarchical Ambiguity Sets

Sung Ho Jo, Seonghwi Kim, Minwoo Chae

Main category: cs.LG

TL;DR: Hierarchical extension of Group DRO that addresses both inter-group and intra-group distributional uncertainties, providing robustness to distribution shifts at multiple levels.

DetailsMotivation: Existing robust learning methods like Group DRO are vulnerable to intra-group distributional shifts, especially in minority groups with limited samples, which frequently occur in real-world scenarios.

Method: Proposed hierarchical extension of Group DRO that captures both inter-group and intra-group uncertainties, along with new benchmark settings simulating realistic minority group distribution shifts.

Result: Demonstrates strong robustness under challenging minority group distribution shifts where existing methods consistently fail, while achieving superior performance on standard benchmarks.

Conclusion: Highlights the importance of broadening the ambiguity set to better capture both inter-group and intra-group distributional uncertainties for improved robustness.

Abstract: Conventional supervised learning methods are often vulnerable to spurious correlations, particularly under distribution shifts in test data. To address this issue, several approaches, most notably Group DRO, have been developed. While these methods are highly robust to subpopulation or group shifts, they remain vulnerable to intra-group distributional shifts, which frequently occur in minority groups with limited samples. We propose a hierarchical extension of Group DRO that addresses both inter-group and intra-group uncertainties, providing robustness to distribution shifts at multiple levels. We also introduce new benchmark settings that simulate realistic minority group distribution shifts, an important yet previously underexplored challenge in spurious correlation research. Our method demonstrates strong robustness under these conditions, where existing robust learning methods consistently fail, while also achieving superior performance on standard benchmarks. These results highlight the importance of broadening the ambiguity set to better capture both inter-group and intra-group distributional uncertainties.
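
For reference, Group DRO solves $\min_\theta \max_{g} \mathbb{E}_{(x,y)\sim \hat{P}_g}[\ell(\theta; x, y)]$ over groups $g$. One natural reading of a hierarchical ambiguity set (our formalization, which may differ in detail from the paper's) additionally lets each group's distribution vary within a ball around its empirical version: $\min_\theta \max_{g} \sup_{Q_g :\, D(Q_g, \hat{P}_g) \le \rho_g} \mathbb{E}_{Q_g}[\ell(\theta; x, y)]$, so the worst case is taken both across groups and within each group, with the radius $\rho_g$ controlling how much intra-group shift is tolerated.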

[321] Online Learning in the Random Order Model

Martino Bernasconi, Andrea Celli, Riccardo Colini-Baldeschi, Federico Fusco, Stefano Leonardi, Matteo Russo

Main category: cs.LG

TL;DR: The paper proposes a general template to adapt stochastic learning algorithms to work effectively in the random-order model, recovering improved regret bounds for various online learning problems and showing that learnability in random order is characterized by VC dimension rather than Littlestone dimension.

DetailsMotivation: Random-order inputs can exhibit significant non-stationarity that hinders stochastic learning algorithms, while adversarial algorithms maintain guarantees but may be suboptimal. There's a need to bridge this gap and leverage the benefits of both approaches.

Method: The authors develop a general template to adapt stochastic learning algorithms to the random-order model without substantially affecting their regret guarantees.

Result: The method recovers improved regret bounds for prediction with delays, online learning with constraints, and bandits with switching costs. It also proves that in random order, learnability is characterized by VC dimension rather than Littlestone dimension.

Conclusion: The proposed template successfully bridges stochastic and random-order learning, providing practical algorithms with improved guarantees and revealing fundamental differences between random-order and adversarial online learning models.

Abstract: In the random-order model for online learning, the sequence of losses is chosen upfront by an adversary and presented to the learner after a random permutation. Any random-order input is \emph{asymptotically} equivalent to a stochastic i.i.d. one, but, for finite times, it may exhibit significant {\em non-stationarity}, which can hinder the performance of stochastic learning algorithms. While algorithms for adversarial inputs naturally maintain their regret guarantees in random order, simple no-regret algorithms exist for the stochastic model that fail against random-order instances. In this paper, we propose a general template to adapt stochastic learning algorithms to the random-order model without substantially affecting their regret guarantees. This allows us to recover improved regret bounds for prediction with delays, online learning with constraints, and bandits with switching costs. Finally, we investigate online classification and prove that, in random order, learnability is characterized by the VC dimension rather than the Littlestone dimension, thus providing a further separation from the general adversarial model.

[322] FlexiQ: Adaptive Mixed-Precision Quantization for Latency/Accuracy Trade-Offs in Deep Neural Networks

Jaemin Kim, Hongjun Um, Sungkyun Kim, Yongjun Park, Jiwon Seo

Main category: cs.LG

TL;DR: FlexiQ is an adaptive mixed-precision quantization scheme for computer vision models that dynamically adjusts low-bitwidth channel ratios to handle workload fluctuations while maintaining accuracy.

DetailsMotivation: Hardware accelerators like NPUs and GPUs are costly and hard to scale for real-time workload fluctuations, creating a need for efficient quantization methods that can adapt to varying computational demands.

Method: FlexiQ selectively applies low-bitwidth computation to feature channels with small value ranges, uses efficient bit-lowering to minimize quantization errors, and dynamically adjusts low-bitwidth channel ratios in real time.

Result: Achieved 6.6% higher accuracy for 4-bit models with finetuning, outperformed four state-of-the-art quantization techniques, and achieved efficient accuracy-latency trade-off with only 0.6% accuracy loss for 50% 4-bit model while getting 40% of the speedup.

Conclusion: FlexiQ demonstrates hardware efficiency with minimal runtime overhead on NPUs and GPUs, providing overall performance benefits for adaptive mixed-precision quantization in computer vision models.

Abstract: Neural networks commonly execute on hardware accelerators such as NPUs and GPUs for their size and computation overhead. These accelerators are costly and it is hard to scale their resources to handle real-time workload fluctuations. We present FlexiQ, an adaptive mixed-precision quantization scheme for computer vision models. FlexiQ selectively applies low-bitwidth computation to feature channels with small value ranges and employs an efficient bit-lowering method to minimize quantization errors while maintaining inference accuracy. Furthermore, FlexiQ adjusts its low-bitwidth channel ratio in real time, enabling quantized models to effectively manage fluctuating inference workloads. We implemented a FlexiQ prototype, including the mixed-precision inference runtime, on our custom NPU and GPUs. Evaluated on eleven convolution- and transformer-based vision models, FlexiQ achieves on average 6.6% higher accuracy for 4-bit models with finetuning and outperforms four state-of-the-art quantization techniques. Moreover, our mixed-precision models achieved an efficient accuracy-latency trade-off, with the 50% 4-bit model incurring only a 0.6% accuracy loss while achieving 40% of the speedup of the 100% 4-bit model over the 8-bit model. Latency evaluations on our NPU and GPUs confirmed that FlexiQ introduces minimal runtime overhead, demonstrating its hardware efficiency and overall performance benefits.
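
The channel-selection criterion can be sketched in a few lines: rank channels by observed dynamic range and send the narrowest ones down the low-bitwidth path, with the ratio adjustable at runtime. This illustrates the stated idea only; FlexiQ's actual criterion and bit-lowering method are more involved.

```python
import torch

def low_bit_channel_indices(activations, low_bit_ratio=0.5):
    # activations: (batch, channels, height, width) calibration batch
    ranges = activations.amax(dim=(0, 2, 3)) - activations.amin(dim=(0, 2, 3))
    k = int(low_bit_ratio * ranges.numel())
    return torch.argsort(ranges)[:k]   # channels with the smallest value ranges

acts = torch.randn(16, 64, 8, 8)
print(low_bit_channel_indices(acts, low_bit_ratio=0.25).shape)  # 16 channels
```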

[323] The Curious Case of In-Training Compression of State Space Models

Makram Chahine, Philipp Nazari, Daniela Rus, T. Konstantin Rusch

Main category: cs.LG

TL;DR: The paper proposes using Hankel singular value analysis from control theory to dynamically reduce State Space Models during training, preserving only high-influence dimensions to accelerate optimization while maintaining expressivity.

DetailsMotivation: To address the computational burden of large SSMs while maximizing expressivity, balancing the trade-off between model size and computational efficiency in long sequence modeling tasks.

Method: Leveraging Hankel singular value analysis and balanced truncation from control theory to identify and preserve only high-influence dimensions during SSM training, applicable to Linear Time-Invariant SSMs and extendable to selective models.

Result: In-training reduction significantly accelerates optimization while preserving expressivity, with compressed models retaining task-critical structure that is lost by models trained directly at smaller dimensions.

Conclusion: SSMs that begin large and shrink during training achieve computational efficiency while maintaining higher performance compared to models trained directly at smaller dimensions.

Abstract: State Space Models (SSMs), developed to tackle long sequence modeling tasks efficiently, offer both parallelizable training and fast inference. At their core are recurrent dynamical systems that maintain a hidden state, with update costs scaling with the state dimension. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Control theory, and more specifically Hankel singular value analysis, provides a potent framework for the measure of energy for each state, as well as the balanced truncation of the original system down to a smaller representation with performance guarantees. Leveraging the eigenvalue stability properties of Hankel matrices, we apply this lens to SSMs during training, where only dimensions of high influence are identified and preserved. Our approach applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models. Experiments show that in-training reduction significantly accelerates optimization while preserving expressivity, with compressed models retaining task-critical structure lost by models trained directly at smaller dimensions. In other words, SSMs that begin large and shrink during training achieve computational efficiency while maintaining higher performance.
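
The control-theoretic quantity at the heart of this is easy to compute for a small LTI system: the Hankel singular values are $\sigma_i = \sqrt{\lambda_i(PQ)}$, where $P$ and $Q$ are the controllability and observability Gramians. A SciPy sketch of that computation (the paper applies this during training and extends beyond the plain LTI case):

```python
import numpy as np
from scipy.linalg import solve_lyapunov

def hankel_singular_values(A, B, C):
    # Stable LTI system x' = Ax + Bu, y = Cx (eigenvalues of A in the open
    # left half-plane). The Gramians solve two Lyapunov equations.
    P = solve_lyapunov(A, -B @ B.T)      # controllability: AP + PA^T + BB^T = 0
    Q = solve_lyapunov(A.T, -C.T @ C)    # observability: A^T Q + QA + C^T C = 0
    return np.sort(np.sqrt(np.abs(np.linalg.eigvals(P @ Q))))[::-1]

A = np.diag([-1.0, -2.0, -50.0])         # the fast -50 mode stores little energy
B = np.ones((3, 1)); C = np.ones((1, 3))
print(hankel_singular_values(A, B, C))   # small sigma -> safe to truncate
```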

[324] Multi-scale Autoregressive Models are Laplacian, Discrete, and Latent Diffusion Models in Disguise

Steve Hong, Samuel Belkadi

Main category: cs.LG

TL;DR: The paper reframes Visual Autoregressive (VAR) models as an iterative-refinement framework with a forward process building a Laplacian pyramid and a backward process reconstructing it in coarse-to-fine steps, connecting VAR to diffusion models.

DetailsMotivation: To better understand VAR models' efficiency and fidelity by viewing them through an iterative-refinement lens rather than just as next-scale autoregression, and to connect them with denoising diffusion methods.

Method: Formalizes VAR as a deterministic forward process creating a Laplacian-style latent pyramid and a learned backward process that reconstructs it in few coarse-to-fine steps. Identifies three key design factors: latent space refinement, discrete classification prediction, and spatial frequency partitioning.

Result: The framework explains VAR’s efficiency and fidelity, enables controlled experiments to quantify each factor’s contribution, and extends to graph generation and weather forecasting while allowing VAR to leverage diffusion tools.

Conclusion: The iterative-refinement perspective provides a unified understanding of VAR models, connects them to diffusion methods, and enables practical interfaces for combining VAR’s few-step generation with diffusion ecosystem tools.

Abstract: We revisit Visual Autoregressive (VAR) models through the lens of an iterative-refinement framework. Rather than viewing VAR solely as next-scale autoregression, we formalise it as a deterministic forward process that constructs a Laplacian-style latent pyramid, paired with a learned backward process that reconstructs it in a small number of coarse-to-fine steps. This view connects VAR to denoising diffusion and isolates three design choices that help explain its efficiency and fidelity: refining in a learned latent space, casting prediction as discrete classification over code indices, and partitioning the task by spatial frequency. We run controlled experiments to quantify each factor’s contribution to fidelity and speed, and we outline how the same framework extends to permutation-invariant graph generation and to probabilistic, ensemble-style medium-range weather forecasting. The framework also suggests practical interfaces for VAR to leverage tools from the diffusion ecosystem while retaining few-step, scale-parallel generation.
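
The deterministic forward process is essentially a Laplacian pyramid. The sketch below builds one with box filtering in pixel space for clarity (the paper works in a learned latent space with discrete codes) and shows that the coarse-to-fine backward pass reconstructs the input exactly.

```python
import numpy as np

def down(x):  # 2x box downsample (assumes even height/width)
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def up(x):    # nearest-neighbor upsample
    return x.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels=3):
    bands = []
    for _ in range(levels):
        coarse = down(img)
        bands.append(img - up(coarse))   # high-frequency residual at this scale
        img = coarse
    bands.append(img)                    # coarsest approximation
    return bands

img = np.random.rand(32, 32)
bands = laplacian_pyramid(img)
recon = bands[-1]
for band in reversed(bands[:-1]):        # backward process: coarse-to-fine
    recon = up(recon) + band
print(np.allclose(recon, img))           # True: the refinement is lossless here
```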

[325] Subject-Adaptive Sparse Linear Models for Interpretable Personalized Health Prediction from Multimodal Lifelog Data

Dohyun Bu, Jisoo Han, Soohwa Kwon, Yulim So, Jong-Seok Lee

Main category: cs.LG

TL;DR: The paper proposes SASL, an interpretable framework for personalized health prediction that combines sparse linear models with subject-specific interactions and selectively integrates LightGBM outputs through confidence-based gating to maintain both accuracy and transparency.

DetailsMotivation: Current deep learning and gradient-boosting models for health outcome prediction sacrifice interpretability and fail to address significant inter-individual variability in lifelog data, limiting their clinical utility.

Method: SASL integrates ordinary least squares with subject-specific interactions, uses iterative backward feature elimination with nested F-tests for sparsity, employs regression-then-thresholding for ordinal targets, and selectively incorporates LightGBM outputs via confidence-based gating.

Result: On the CH-2025 dataset (450 daily observations from 10 subjects), the hybrid SASL-LightGBM framework achieves predictive performance comparable to black-box methods with significantly fewer parameters and greater transparency.

Conclusion: SASL provides an interpretable alternative to black-box models for personalized health prediction, delivering comparable accuracy while offering clear, actionable insights for clinical practice.

Abstract: Improved prediction of personalized health outcomes – such as sleep quality and stress – from multimodal lifelog data could have meaningful clinical and practical implications. However, state-of-the-art models, primarily deep neural networks and gradient-boosted ensembles, sacrifice interpretability and fail to adequately address the significant inter-individual variability inherent in lifelog data. To overcome these challenges, we propose the Subject-Adaptive Sparse Linear (SASL) framework, an interpretable modeling approach explicitly designed for personalized health prediction. SASL integrates ordinary least squares regression with subject-specific interactions, systematically distinguishing global from individual-level effects. We employ an iterative backward feature elimination method based on nested $F$-tests to construct a sparse and statistically robust model. Additionally, recognizing that health outcomes often represent discretized versions of continuous processes, we develop a regression-then-thresholding approach specifically designed to maximize macro-averaged F1 scores for ordinal targets. For intrinsically challenging predictions, SASL selectively incorporates outputs from compact LightGBM models through confidence-based gating, enhancing accuracy without compromising interpretability. Evaluations conducted on the CH-2025 dataset – which comprises roughly 450 daily observations from ten subjects – demonstrate that the hybrid SASL-LightGBM framework achieves predictive performance comparable to that of sophisticated black-box methods, but with significantly fewer parameters and substantially greater transparency, thus providing clear and actionable insights for clinicians and practitioners.
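
The elimination criterion is the standard nested-model F-test, compact enough to state in code (this is the textbook formula, not the authors' implementation):

```python
from scipy import stats

def nested_f_test(rss_full, rss_reduced, df_resid_full, n_dropped):
    # Tests whether dropping n_dropped coefficients significantly worsens the
    # fit: large F (small p) means keep the coefficients; otherwise eliminate.
    f = ((rss_reduced - rss_full) / n_dropped) / (rss_full / df_resid_full)
    p = stats.f.sf(f, n_dropped, df_resid_full)
    return f, p

print(nested_f_test(rss_full=10.0, rss_reduced=10.05, df_resid_full=400,
                    n_dropped=2))   # p ~ 0.37 here, so the two terms can go
```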

[326] Knowledge-Aware Modeling with Frequency Adaptive Learning for Battery Health Prognostics

Vijay Babu Pamshetti, Wei Zhang, Sumei Sun, Jie Zhang, Yonggang Wen, Qingyu Yan

Main category: cs.LG

TL;DR: Karma is a knowledge-aware model with frequency-adaptive learning that improves battery health prognostics by combining signal decomposition, dual-stream deep learning, and empirical knowledge guidance.

DetailsMotivation: Existing data-driven models for battery health prognostics lack knowledge guidance, leading to unreliable long-term predictions due to complex degradation behaviors like nonlinearity, noise, and capacity regeneration.

Method: Proposes Karma model with signal decomposition to derive battery signals in different frequency bands, dual-stream deep learning architecture (one for long-term low-frequency trends, one for high-frequency dynamics), and knowledge regulation using double exponential function optimized with particle filters.

Result: Achieves average error reductions of 50.6% and 32.6% over state-of-the-art algorithms on two mainstream datasets, demonstrating superior performance in battery health prediction.

Conclusion: Karma shows robustness, generalizability, and potential for safer and more reliable battery management across diverse applications through its knowledge-aware approach.

Abstract: Battery health prognostics are critical for ensuring safety, efficiency, and sustainability in modern energy systems. However, it has been challenging to achieve accurate and robust prognostics due to complex battery degradation behaviors with nonlinearity, noise, capacity regeneration, etc. Existing data-driven models capture temporal degradation features but often lack knowledge guidance, which leads to unreliable long-term health prognostics. To overcome these limitations, we propose Karma, a knowledge-aware model with frequency-adaptive learning for battery capacity estimation and remaining useful life prediction. The model first performs signal decomposition to derive battery signals in different frequency bands. A dual-stream deep learning architecture is developed, where one stream captures long-term low-frequency degradation trends and the other models high-frequency short-term dynamics. Karma regulates the prognostics with knowledge, where battery degradation is modeled as a double exponential function based on empirical studies. Our dual-stream model is used to optimize the parameters of the knowledge with particle filters to ensure physically consistent and reliable prognostics and uncertainty quantification. Experimental study demonstrates Karma’s superior performance, achieving average error reductions of 50.6% and 32.6% over state-of-the-art algorithms for battery health prediction on two mainstream datasets, respectively. These results highlight Karma’s robustness, generalizability, and potential for safer and more reliable battery management across diverse applications.
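
The knowledge prior is the familiar double-exponential capacity-fade law from the battery-prognostics literature; the parameter values below are toy numbers for illustration (the paper estimates the parameters with particle filters guided by its dual-stream model):

```python
import numpy as np

def capacity(k, a, b, c, d):
    # Empirical double-exponential degradation law: capacity at cycle k.
    return a * np.exp(b * k) + c * np.exp(d * k)

cycles = np.arange(0, 1000)
q = capacity(cycles, a=1.0, b=-2e-4, c=0.05, d=-4e-3)  # toy parameters
print(q[0], q[-1])   # gradual fade from ~1.05 toward ~0.82
```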

[327] RoiRL: Efficient, Self-Supervised Reasoning with Offline Iterative Reinforcement Learning

Aleksei Arzhantsev, Otmane Sakhi, Flavian Vasile

Main category: cs.LG

TL;DR: RoiRL is a lightweight offline reinforcement learning method that improves reasoning in LLMs without ground-truth rewards, training 2.5x faster than TTRL while achieving better performance.

DetailsMotivation: Traditional RL for LLMs requires ground-truth rewards, while existing test-time RL methods like TTRL are computationally expensive and rely on heavy online RL with majority-vote rewards.

Method: RoiRL uses offline iterative reinforcement learning with weighted log-likelihood objectives instead of maintaining a reference model, enabling stable training with lower memory and compute requirements.

Result: Experimental results show RoiRL trains 2.5x faster than TTRL and consistently outperforms it on reasoning benchmarks.

Conclusion: RoiRL establishes a scalable path to self-improving LLMs without labels by providing a lightweight offline alternative to computationally expensive online RL methods.

Abstract: Reinforcement learning (RL) is central to improving reasoning in large language models (LLMs) but typically requires ground-truth rewards. Test-Time Reinforcement Learning (TTRL) removes this need by using majority-vote rewards, but relies on heavy online RL and incurs substantial computational cost. We propose RoiRL: Reasoning with offline iterative Reinforcement Learning, a family of lightweight offline learning alternatives that can target the same regularized optimal policies. Unlike TTRL, RoiRL eliminates the need to maintain a reference model and instead optimizes weighted log-likelihood objectives, enabling stable training with significantly lower memory and compute requirements. Experimental results show that RoiRL trains up to 2.5x faster than TTRL and consistently outperforms it on reasoning benchmarks, establishing a scalable path to self-improving LLMs without labels.
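
The objective family is a weighted log-likelihood over sampled reasoning traces, which is what keeps training offline and reference-model-free. A sketch follows; the specific weighting RoiRL uses (e.g., derived from majority-vote agreement) is an assumption here, not taken from the paper.

```python
import torch

def weighted_nll(logits, target_ids, weights):
    # logits:     (batch, seq, vocab) actor outputs over sampled traces
    # target_ids: (batch, seq) token ids of those traces
    # weights:    (batch,) per-trace reward proxy, e.g. vote agreement
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    seq_logp = token_logp.sum(dim=-1)          # per-sequence log-likelihood
    return -(weights * seq_logp).mean()        # upweight better traces

loss = weighted_nll(torch.randn(2, 5, 100),
                    torch.randint(0, 100, (2, 5)),
                    torch.tensor([0.9, 0.1]))
print(loss.item())
```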

[328] DMark: Order-Agnostic Watermarking for Diffusion Large Language Models

Linyu Wu, Linhao Zhong, Wenjie Qu, Yuexin Li, Yue Liu, Shengfang Zhai, Chunhua Shen, Jiaheng Zhang

Main category: cs.LG

TL;DR: DMark is the first watermarking framework for diffusion large language models (dLLMs) that addresses their non-sequential decoding by using predictive, bidirectional, and combined watermarking strategies to achieve high detection rates while maintaining text quality.

DetailsMotivation: Existing watermarking methods fail on diffusion LLMs because they rely on sequential left-to-right token generation, while dLLMs can finalize tokens in arbitrary order, breaking traditional causal watermark designs.

Method: DMark introduces three strategies: predictive watermarking (using model-predicted tokens when context is unavailable), bidirectional watermarking (exploiting both forward and backward dependencies in diffusion decoding), and predictive-bidirectional watermarking (combining both approaches for maximum detection strength).

Result: Experiments show DMark achieves 92.0-99.5% detection rates at 1% false positive rate while maintaining text quality, compared to only 49.6-71.2% for naive adaptations of existing methods. It also demonstrates robustness against text manipulations.

Conclusion: Effective watermarking is feasible for non-autoregressive language models, and DMark establishes the first successful framework for watermarking diffusion LLMs.

Abstract: Diffusion large language models (dLLMs) offer faster generation than autoregressive models while maintaining comparable quality, but existing watermarking methods fail on them due to their non-sequential decoding. Unlike autoregressive models that generate tokens left-to-right, dLLMs can finalize tokens in arbitrary order, breaking the causal design underlying traditional watermarks. We present DMark, the first watermarking framework designed specifically for dLLMs. DMark introduces three complementary strategies to restore watermark detectability: predictive watermarking uses model-predicted tokens when actual context is unavailable; bidirectional watermarking exploits both forward and backward dependencies unique to diffusion decoding; and predictive-bidirectional watermarking combines both approaches to maximize detection strength. Experiments across multiple dLLMs show that DMark achieves 92.0-99.5% detection rates at 1% false positive rate while maintaining text quality, compared to only 49.6-71.2% for naive adaptations of existing methods. DMark also demonstrates robustness against text manipulations, establishing that effective watermarking is feasible for non-autoregressive language models.
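
For intuition, the sketch below shows standard green-list watermark detection of the kind DMark restores for dLLMs; the paper's three strategies differ in how the seeding context token is obtained (model-predicted when the left neighbor is not yet finalized, backward-looking in the bidirectional variant). The hash scheme and list size here are assumptions, not the paper's design.

```python
import hashlib
import math
import random

def green_list(context_token: int, vocab_size: int, gamma: float = 0.5) -> set:
    """Pseudo-random 'green' vocabulary subset seeded by a context token."""
    seed = int(hashlib.sha256(str(context_token).encode()).hexdigest(), 16) % (2 ** 32)
    return set(random.Random(seed).sample(range(vocab_size), int(gamma * vocab_size)))

def detect_z_score(tokens, vocab_size: int, gamma: float = 0.5) -> float:
    """z-score of green-token hits; large values flag a watermark at low false-positive rates."""
    hits = sum(tokens[i] in green_list(tokens[i - 1], vocab_size, gamma)
               for i in range(1, len(tokens)))
    n = len(tokens) - 1
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```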

[329] Learning Explicit Single-Cell Dynamics Using ODE Representations

Jan-Philipp von Bassewitz, Adeel Pervez, Marco Fumero, Matthew Robinson, Theofanis Karaletsos, Francesco Locatello

Main category: cs.LG

TL;DR: Cell-MNN is an end-to-end encoder-decoder model that learns interpretable gene interactions through a locally linearized ODE representation of cellular differentiation dynamics, achieving competitive performance on single-cell benchmarks with better scalability.

DetailsMotivation: To address limitations of current models that rely on computationally expensive optimal transport preprocessing, multi-stage training, and lack explicit gene interaction discovery in cellular differentiation modeling.

Method: Proposes Cell-MNN - an encoder-decoder architecture with latent representation as a locally linearized ODE governing cellular evolution from stem to tissue cells, using standard PCA preprocessing.

Result: Achieves competitive performance on single-cell benchmarks, surpasses state-of-the-art baselines in scaling to larger datasets and joint training across multiple datasets, and learns interpretable gene interactions validated against TRRUST database.

Conclusion: Cell-MNN provides an effective end-to-end approach for modeling cellular differentiation that learns biologically consistent and interpretable gene interactions while maintaining competitive performance and scalability.

Abstract: Modeling the dynamics of cellular differentiation is fundamental to advancing the understanding and treatment of diseases associated with this process, such as cancer. With the rapid growth of single-cell datasets, this has also become a particularly promising and active domain for machine learning. Current state-of-the-art models, however, rely on computationally expensive optimal transport preprocessing and multi-stage training, while also not discovering explicit gene interactions. To address these challenges, we propose Cell-Mechanistic Neural Networks (Cell-MNN), an encoder-decoder architecture whose latent representation is a locally linearized ODE governing the dynamics of cellular evolution from stem to tissue cells. Cell-MNN is fully end-to-end (aside from a standard PCA preprocessing step) and its ODE representation explicitly learns biologically consistent and interpretable gene interactions. Empirically, we show that Cell-MNN achieves competitive performance on single-cell benchmarks, surpasses state-of-the-art baselines in scaling to larger datasets and joint training across multiple datasets, and learns interpretable gene interactions that we validate against the TRRUST database.
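
A hedged sketch of what a "locally linearized ODE" latent can look like: a small network maps the current latent state z to a matrix A(z), dynamics follow dz/dt = A(z)z, and each entry of A(z) can be read as a local gene-gene interaction strength. The architecture and the explicit Euler integrator below are assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class LocallyLinearODE(nn.Module):
    """Latent dynamics dz/dt = A(z) z with a state-dependent interaction matrix A(z)."""

    def __init__(self, latent_dim: int, hidden: int = 64):
        super().__init__()
        self.latent_dim = latent_dim
        self.net = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, latent_dim * latent_dim))

    def forward(self, z: torch.Tensor, dt: float = 0.1, steps: int = 10) -> torch.Tensor:
        d = self.latent_dim
        for _ in range(steps):                      # explicit Euler integration
            A = self.net(z).view(-1, d, d)          # entries ~ local gene-gene interactions
            z = z + dt * torch.einsum("bij,bj->bi", A, z)
        return z
```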

[330] FeDABoost: Fairness Aware Federated Learning with Adaptive Boosting

Tharuka Kasthuri Arachchige, Veselka Boeva, Shahrooz Abghari

Main category: cs.LG

TL;DR: FeDABoost improves FL performance and fairness in non-IID settings through dynamic boosting and adaptive gradient aggregation, achieving better fairness and competitive performance compared to FedAvg and Ditto.

DetailsMotivation: To address performance and fairness issues in Federated Learning under non-IID data distributions, particularly the challenge of underperforming clients and unreliable model contributions.

Method: Proposes FeDABoost framework with: 1) Adaptive gradient aggregation inspired by Multiclass AdaBoost that weights clients based on local error rates, 2) Dynamic boosting mechanism that adjusts focal loss parameters to emphasize hard examples for underperforming clients.

Result: Evaluation on MNIST, FEMNIST, and CIFAR10 datasets shows FeDABoost achieves improved fairness and competitive performance compared to FedAvg and Ditto baselines.

Conclusion: FeDABoost effectively enhances FL fairness and performance in non-IID settings through its dual approach of adaptive aggregation and dynamic client boosting.

Abstract: This work focuses on improving the performance and fairness of Federated Learning (FL) in non-IID settings by enhancing model aggregation and boosting the training of underperforming clients. We propose FeDABoost, a novel FL framework that integrates a dynamic boosting mechanism and an adaptive gradient aggregation strategy. Inspired by the weighting mechanism of the Multiclass AdaBoost (SAMME) algorithm, our aggregation method assigns higher weights to clients with lower local error rates, thereby promoting more reliable contributions to the global model. In parallel, FeDABoost dynamically boosts underperforming clients by adjusting the focal loss focusing parameter, emphasizing hard-to-classify examples during local training. We have evaluated FeDABoost on three benchmark datasets (MNIST, FEMNIST, and CIFAR10) and compared its performance with that of FedAvg and Ditto. The results show that FeDABoost achieves improved fairness and competitive performance.
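
The two mechanisms are simple enough to write down. Below is a minimal sketch assuming the standard SAMME weight formula and the standard focal loss; how FeDABoost schedules gamma per client is simplified away.

```python
import math
import torch
import torch.nn.functional as F

def samme_weight(error_rate: float, num_classes: int) -> float:
    """SAMME aggregation weight: clients with lower local error contribute more."""
    err = min(max(error_rate, 1e-8), 1.0 - 1e-8)
    return math.log((1.0 - err) / err) + math.log(num_classes - 1)

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float) -> torch.Tensor:
    """Focal loss; boosting an underperforming client raises gamma to stress hard examples."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                            # model probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()
```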

[331] RAxSS: Retrieval-Augmented Sparse Sampling for Explainable Variable-Length Medical Time Series Classification

Aydin Javadov, Samir Garibov, Tobias Hoesli, Qiyang Sun, Florian von Wangenheim, Joseph Ollier, Björn W. Schuller

Main category: cs.LG

TL;DR: The paper proposes a retrieval-informed classification method for medical time series that combines stochastic sparse sampling with similarity-weighted predictions for improved explainability and performance on variable-length iEEG data.

DetailsMotivation: Medical time series analysis faces challenges with data sparsity, noise, and variable recording lengths. Existing approaches using stochastic sparse sampling handle variable-length signals well, while retrieval-augmented methods improve explainability and robustness.

Method: Generalizes stochastic sparse sampling framework for retrieval-informed classification by weighting window predictions using within-channel similarity and aggregating them in probability space, creating convex series-level scores and explicit evidence trails.

Result: Achieves competitive iEEG classification performance across recordings from four medical centers, providing practitioners with greater transparency and explainability.

Conclusion: The method demonstrates potential for reliable and explainable clinical variable-length time series classification, addressing key challenges in medical time series analysis.

Abstract: Medical time series analysis is challenging due to data sparsity, noise, and highly variable recording lengths. Prior work has shown that stochastic sparse sampling effectively handles variable-length signals, while retrieval-augmented approaches improve explainability and robustness to noise and weak temporal correlations. In this study, we generalize the stochastic sparse sampling framework for retrieval-informed classification. Specifically, we weight window predictions by within-channel similarity and aggregate them in probability space, yielding convex series-level scores and an explicit evidence trail for explainability. Our method achieves competitive iEEG classification performance and provides practitioners with greater transparency and explainability. We evaluate our method on iEEG recordings collected at four medical centers, demonstrating its potential for reliable and explainable clinical variable-length time series classification.
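
A minimal sketch of the aggregation step as described: window-level class probabilities are combined with convex weights derived from within-channel similarity, and the weights themselves double as an evidence trail. The softmax temperature and input shapes are assumptions.

```python
import numpy as np

def aggregate_windows(window_probs: np.ndarray, similarities: np.ndarray, tau: float = 0.1):
    """window_probs: (n_windows, n_classes); similarities: (n_windows,) within-channel scores."""
    weights = np.exp(similarities / tau)
    weights = weights / weights.sum()          # convex combination in probability space
    series_score = weights @ window_probs      # series-level class probabilities
    evidence = np.argsort(-weights)            # windows ranked by contribution (the trail)
    return series_score, evidence
```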

[332] Ergodic Risk Measures: Towards a Risk-Aware Foundation for Continual Reinforcement Learning

Juan Sebastian Rojas, Chi-Guhn Lee

Main category: cs.LG

TL;DR: This paper introduces the first theoretical framework for risk-aware continual reinforcement learning, showing classical risk measures are incompatible with continual settings and proposing new ergodic risk measures that work effectively.

DetailsMotivation: Current continual RL focuses only on risk-neutral (expected value) decision-making, but real-world applications often require considering risk beyond just the mean performance. The authors aim to extend continual RL to handle risk-aware decision-making.

Method: The authors first demonstrate incompatibility of classical risk measure theory with continual RL, then develop a new class of ergodic risk measures specifically designed for continual learning settings.

Result: The proposed ergodic risk measures are shown to be theoretically sound and practically effective through case studies and empirical results, providing intuitive appeal for risk-aware continual learning.

Conclusion: This work establishes the first formal theoretical foundation for risk-aware continual RL by introducing ergodic risk measures that are compatible with continual learning, enabling agents to balance retention and adaptation while considering risk beyond expected values.

Abstract: Continual reinforcement learning (continual RL) seeks to formalize the notions of lifelong learning and endless adaptation in RL. In particular, the aim of continual RL is to develop RL agents that can maintain a careful balance between retaining useful information and adapting to new situations. To date, continual RL has been explored almost exclusively through the lens of risk-neutral decision-making, in which the agent aims to optimize the expected (or mean) long-run performance. In this work, we present the first formal theoretical treatment of continual RL through the lens of risk-aware decision-making, in which the agent aims to optimize a reward-based measure of long-run performance beyond the mean. In particular, we show that the classical theory of risk measures, widely used as a theoretical foundation in non-continual risk-aware RL, is, in its current form, incompatible with the continual setting. Then, building on this insight, we extend risk measure theory into the continual setting by introducing a new class of ergodic risk measures that are compatible with continual learning. Finally, we provide a case study of risk-aware continual learning, along with empirical results, which show the intuitive appeal and theoretical soundness of ergodic risk measures.

[333] ContextFlow: Context-Aware Flow Matching For Trajectory Inference From Spatial Omics Data

Santanu Subhash Rathod, Francesco Ceccarelli, Sean B. Holden, Pietro Liò, Xiao Zhang, Jovan Tanevski

Main category: cs.LG

TL;DR: ContextFlow is a context-aware flow matching framework that incorporates tissue organization and ligand-receptor patterns to infer biologically meaningful trajectories from longitudinal spatially-resolved omics data.

DetailsMotivation: Understanding tissue dynamics in development, regeneration, disease progression, and treatment response requires accurate inference of trajectories from spatially-resolved omics data.

Method: Integrates local tissue organization and ligand-receptor communication patterns into a transition plausibility matrix that regularizes the optimal transport objective.

Result: Outperforms state-of-the-art flow matching methods across multiple quantitative and qualitative metrics on three datasets.

Conclusion: ContextFlow provides a generalizable framework for modeling spatiotemporal dynamics from longitudinal spatially-resolved omics data with improved biological coherence.

Abstract: Inferring trajectories from longitudinal spatially-resolved omics data is fundamental to understanding the dynamics of structural and functional tissue changes in development, regeneration and repair, disease progression, and response to treatment. We propose ContextFlow, a novel context-aware flow matching framework that incorporates prior knowledge to guide the inference of structural tissue dynamics from spatially resolved omics data. Specifically, ContextFlow integrates local tissue organization and ligand-receptor communication patterns into a transition plausibility matrix that regularizes the optimal transport objective. By embedding these contextual constraints, ContextFlow generates trajectories that are not only statistically consistent but also biologically meaningful, making it a generalizable framework for modeling spatiotemporal dynamics from longitudinal, spatially resolved omics data. Evaluated on three datasets, ContextFlow consistently outperforms state-of-the-art flow matching methods across multiple quantitative and qualitative metrics of inference accuracy and biological coherence. Our code is available at: https://github.com/santanurathod/ContextFlow
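
One plausible reading of "a transition plausibility matrix that regularizes the optimal transport objective" is an additive cost penalty on implausible transitions inside a standard entropic Sinkhorn solver, sketched below under that assumption (uniform marginals; all names are placeholders).

```python
import numpy as np

def sinkhorn_with_plausibility(C, P, lam=1.0, eps=0.05, iters=200):
    """C: (n, m) base transport cost; P: (n, m) transition plausibility in (0, 1]."""
    C_eff = C - lam * np.log(P + 1e-12)        # implausible transitions become expensive
    K = np.exp(-C_eff / eps)
    a = np.full(C.shape[0], 1.0 / C.shape[0])  # uniform marginals (assumption)
    b = np.full(C.shape[1], 1.0 / C.shape[1])
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):                     # standard Sinkhorn fixed-point updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]         # transport plan between time points
```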

[334] Confidence and Dispersity as Signals: Unsupervised Model Evaluation and Ranking

Weijian Deng, Weijie Tu, Ibrahim Radwan, Mohammad Abu Alsheikh, Stephen Gould, Liang Zheng

Main category: cs.LG

TL;DR: A framework for unsupervised model evaluation using confidence and dispersity metrics, showing hybrid approaches outperform single metrics for accuracy estimation and model ranking under distribution shifts.

DetailsMotivation: Need to assess model generalization without labeled test data in real-world deployment scenarios, particularly under distribution shifts.

Method: Systematically benchmark confidence-based, dispersity-based, and hybrid metrics across various models, datasets, and distribution shifts. Use nuclear norm of prediction matrix as key metric.

Result: Hybrid metrics consistently outperform single-aspect metrics. Nuclear norm provides robust performance across tasks and maintains reliability under moderate class imbalance.

Conclusion: The framework offers practical and generalizable basis for unsupervised model assessment in deployment scenarios.

Abstract: Assessing model generalization under distribution shift is essential for real-world deployment, particularly when labeled test data is unavailable. This paper presents a unified and practical framework for unsupervised model evaluation and ranking in two common deployment settings: (1) estimating the accuracy of a fixed model on multiple unlabeled test sets (dataset-centric evaluation), and (2) ranking a set of candidate models on a single unlabeled test set (model-centric evaluation). We demonstrate that two intrinsic properties of model predictions, namely confidence (which reflects prediction certainty) and dispersity (which captures the diversity of predicted classes), together provide strong and complementary signals for generalization. We systematically benchmark a set of confidence-based, dispersity-based, and hybrid metrics across a wide range of model architectures, datasets, and distribution shift types. Our results show that hybrid metrics consistently outperform single-aspect metrics on both dataset-centric and model-centric evaluation settings. In particular, the nuclear norm of the prediction matrix provides robust and accurate performance across tasks, including real-world datasets, and maintains reliability under moderate class imbalance. These findings offer a practical and generalizable basis for unsupervised model assessment in deployment scenarios.
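
The key metric is essentially a one-liner: the nuclear norm (sum of singular values) of the softmax prediction matrix, which is large when predictions are simultaneously confident and spread across classes. A minimal sketch:

```python
import numpy as np

def nuclear_norm_score(softmax_preds: np.ndarray) -> float:
    """softmax_preds: (n_samples, n_classes), rows summing to 1."""
    return float(np.linalg.norm(softmax_preds, ord="nuc"))  # sum of singular values
```

Comparing this score across unlabeled test sets (or across candidate models on one test set) gives the dataset-centric and model-centric rankings described above.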

[335] From high-frequency sensors to noon reports: Using transfer learning for shaft power prediction in maritime

Akriti Sharma, Dogan Altan, Dusica Marijan, Arnbjørn Maressa

Main category: cs.LG

TL;DR: A transfer learning approach for predicting vessel shaft power using high-frequency data from one vessel and fine-tuning with low-frequency noon reports from other vessels, achieving significant error reduction across different vessel types.

DetailsMotivation: Energy optimization is crucial in maritime transportation to reduce costs and improve efficiency. Accurate shaft power prediction is key for fuel consumption optimization, but high-quality sensor data is often unavailable or expensive, making alternative data sources like noon reports necessary.

Method: Transfer learning-based approach where a model is first trained on high-frequency data from one vessel, then fine-tuned using low-frequency daily noon reports from other vessels. Tested on sister vessels (identical), similar vessels (slightly different), and different vessels (distinct configurations).

Result: Mean absolute percentage error decreased by 10.6% for sister vessels, 3.6% for similar vessels, and 5.3% for different vessels compared to models trained solely on noon report data.

Conclusion: Transfer learning effectively improves shaft power prediction accuracy across various vessel types using limited noon report data, demonstrating practical value for maritime energy optimization when high-frequency sensor data is unavailable.

Abstract: With the growth of global maritime transportation, energy optimization has become crucial for reducing costs and ensuring operational efficiency. Shaft power is the mechanical power transmitted from the engine to the shaft and directly impacts fuel consumption, making its accurate prediction a paramount step in optimizing vessel performance. Power consumption is highly correlated with ship parameters such as speed and shaft rotations per minute, as well as weather and sea conditions. Frequent access to this operational data can improve prediction accuracy. However, obtaining high-quality sensor data is often infeasible and costly, making alternative sources such as noon reports a viable option. In this paper, we propose a transfer learning-based approach for predicting vessel shaft power, where a model is initially trained on high-frequency data from one vessel and then fine-tuned with low-frequency daily noon reports from other vessels. We tested our approach on sister vessels (identical dimensions and configurations), a similar vessel (slightly larger with a different engine), and a different vessel (distinct dimensions and configurations). The experiments showed that the mean absolute percentage error decreased by 10.6 percent for sister vessels, 3.6 percent for a similar vessel, and 5.3 percent for a different vessel, compared to a model trained solely on noon report data.
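
The protocol reduces to pretraining on dense sensor data and fine-tuning at a reduced learning rate on sparse noon reports. The sketch below assumes a small MLP regressor and tensor inputs; the architecture, feature count, and learning rates are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))  # 8 input features assumed

def fit(model, X, y, lr, epochs):
    """X: (n, 8) operational/weather features; y: (n, 1) shaft power."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()

# fit(model, X_highfreq, y_highfreq, lr=1e-3, epochs=100)  # source vessel, sensor data
# fit(model, X_noon,     y_noon,     lr=1e-4, epochs=30)   # target vessel, noon reports
```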

[336] BrainIB++: Leveraging Graph Neural Networks and Information Bottleneck for Functional Brain Biomarkers in Schizophrenia

Tianzheng Hu, Qiang Li, Shu Liu, Vince D. Calhoun, Guido van Wingen, Shujian Yu

Main category: cs.LG

TL;DR: BrainIB++ is a graph neural network framework that uses information bottleneck to identify informative brain regions as subgraphs for schizophrenia diagnosis from fMRI data, achieving superior accuracy and interpretability.

DetailsMotivation: Current machine learning models for psychiatric diagnosis require extensive feature engineering (introducing bias) or lack interpretability (limiting clinical applicability), creating a need for explainable and reliable brain biomarkers.

Method: An end-to-end graph neural network framework that applies the information bottleneck principle to automatically identify the most informative brain regions as subgraphs during model training for interpretation.

Result: BrainIB++ outperformed nine established brain network classification methods across three multi-cohort schizophrenia datasets, demonstrating superior diagnostic accuracy and generalizability to unseen data. The identified subgraphs aligned with established clinical biomarkers in schizophrenia.

Conclusion: The framework provides both high diagnostic performance and interpretable brain biomarkers that correspond with known clinical findings, enhancing its potential for real-world diagnostic applications in psychiatry.

Abstract: The development of diagnostic models is gaining traction in the field of psychiatric disorders. Recently, machine learning classifiers based on resting-state functional magnetic resonance imaging (rs-fMRI) have been developed to identify brain biomarkers that differentiate psychiatric disorders from healthy controls. However, conventional machine learning-based diagnostic models often depend on extensive feature engineering, which introduces bias through manual intervention. While deep learning models are expected to operate without manual involvement, their lack of interpretability poses significant challenges in obtaining explainable and reliable brain biomarkers to support diagnostic decisions, ultimately limiting their clinical applicability. In this study, we introduce an innovative end-to-end graph neural network framework named BrainIB++, which applies the information bottleneck (IB) principle to identify the most informative data-driven brain regions as subgraphs during model training for interpretation. We evaluate the performance of our model against nine established brain network classification methods across three multi-cohort schizophrenia datasets. It consistently demonstrates superior diagnostic accuracy and exhibits generalizability to unseen data. Furthermore, the subgraphs identified by our model correspond with established clinical biomarkers in schizophrenia, particularly emphasizing abnormalities in the visual, sensorimotor, and higher-cognition functional brain networks. This alignment enhances the model’s interpretability and underscores its relevance for real-world diagnostic applications.

[337] Distributional Inverse Reinforcement Learning

Feiyang Wu, Ye Zhao, Anqi Wu

Main category: cs.LG

TL;DR: A distributional framework for offline IRL that models uncertainty over reward functions and return distributions, capturing richer expert behavior structure through first-order stochastic dominance minimization and distortion risk measures.

DetailsMotivation: Conventional IRL approaches only recover deterministic reward estimates or match expected returns, failing to capture the full structure and uncertainty in expert behavior, particularly reward distributions.

Method: Jointly models uncertainty over reward functions and full return distributions by minimizing first-order stochastic dominance violations and integrating distortion risk measures into policy learning.

Result: Empirical results on synthetic benchmarks, neurobehavioral data, and MuJoCo tasks show the method recovers expressive reward representations and achieves state-of-the-art imitation performance.

Conclusion: The proposed distributional framework enables recovery of both reward distributions and distribution-aware policies, making it well-suited for behavior analysis and risk-aware imitation learning.

Abstract: We propose a distributional framework for offline Inverse Reinforcement Learning (IRL) that jointly models uncertainty over reward functions and full distributions of returns. Unlike conventional IRL approaches that recover a deterministic reward estimate or match only expected returns, our method captures richer structure in expert behavior, particularly in learning the reward distribution, by minimizing first-order stochastic dominance (FSD) violations and thus integrating distortion risk measures (DRMs) into policy learning, enabling the recovery of both reward distributions and distribution-aware policies. This formulation is well-suited for behavior analysis and risk-aware imitation learning. Empirical results on synthetic benchmarks, real-world neurobehavioral data, and MuJoCo control tasks demonstrate that our method recovers expressive reward representations and achieves state-of-the-art imitation performance.
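
As a worked illustration of an FSD-violation penalty on empirical return samples: distribution A first-order dominates B when F_A(x) <= F_B(x) everywhere, so a natural violation measure integrates the positive part of the CDF gap. The dominance direction and the sample-based estimator below are assumptions, not the paper's exact formulation.

```python
import numpy as np

def fsd_violation(policy_returns: np.ndarray, expert_returns: np.ndarray) -> float:
    """Integrated positive part of F_expert - F_policy over a merged sample grid."""
    grid = np.sort(np.concatenate([policy_returns, expert_returns]))
    F_pol = np.searchsorted(np.sort(policy_returns), grid, side="right") / len(policy_returns)
    F_exp = np.searchsorted(np.sort(expert_returns), grid, side="right") / len(expert_returns)
    gap = np.maximum(F_exp - F_pol, 0.0)            # where dominance is violated
    return float(np.sum(np.diff(grid) * gap[:-1]))  # left-Riemann integral
```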

[338] Learning Robust Diffusion Models from Imprecise Supervision

Dong-Dong Wu, Jiacheng Cui, Wei Wang, Zhiqiang She, Masashi Sugiyama

Main category: cs.LG

TL;DR: DMIS is a unified framework for training robust diffusion models from imprecise supervision, addressing noisy, ambiguous, or incomplete labels in training data.

DetailsMotivation: Conditional diffusion models rely on large datasets that often contain imprecise information in conditional inputs, causing condition mismatch and degrading generation quality.

Method: Derived from likelihood maximization, DMIS decomposes the objective into generative and classification components: generative component models imprecise-label distributions, while classification component uses a diffusion classifier with optimized timestep sampling.

Result: Extensive experiments on diverse imprecise supervision tasks (image generation, weakly supervised learning, noisy dataset condensation) show DMIS consistently produces high-quality, class-discriminative samples.

Conclusion: DMIS is the first systematic study for training robust diffusion models from imprecise supervision and demonstrates effectiveness across various tasks with noisy labels.

Abstract: Conditional diffusion models have achieved remarkable success in various generative tasks recently, but their training typically relies on large-scale datasets that inevitably contain imprecise information in conditional inputs. Such supervision, often stemming from noisy, ambiguous, or incomplete labels, causes condition mismatch and degrades generation quality. To address this challenge, we propose DMIS, a unified framework for training robust Diffusion Models from Imprecise Supervision, which is the first systematic study within diffusion models. Our framework is derived from likelihood maximization and decomposes the objective into generative and classification components: the generative component models imprecise-label distributions, while the classification component leverages a diffusion classifier to infer class-posterior probabilities, with its efficiency further improved by an optimized timestep sampling strategy. Extensive experiments on diverse forms of imprecise supervision, covering image generation, weakly supervised learning, and noisy dataset condensation, demonstrate that DMIS consistently produces high-quality and class-discriminative samples.

[339] Differentially Private Wasserstein Barycenters

Anming Gu, Sasidhar Kunapuli, Mark Bun, Edward Chien, Kristjan Greenewald

Main category: cs.LG

TL;DR: First algorithms for computing Wasserstein barycenters under differential privacy, with empirical validation on synthetic data, MNIST, and population datasets.

DetailsMotivation: Wasserstein barycenters are used with sensitive datasets, requiring privacy protection through differential privacy.

Method: Developed differentially private algorithms for computing Wasserstein barycenters.

Result: Methods produce high-quality private barycenters with strong accuracy-privacy tradeoffs on various datasets.

Conclusion: Successful development of first DP algorithms for Wasserstein barycenters with practical effectiveness.

Abstract: The Wasserstein barycenter is defined as the mean of a set of probability measures under the optimal transport metric, and has numerous applications spanning machine learning, statistics, and computer graphics. In practice these input measures are empirical distributions built from sensitive datasets, motivating a differentially private (DP) treatment. We present, to our knowledge, the first algorithms for computing Wasserstein barycenters under differential privacy. Empirically, on synthetic data, MNIST, and large-scale U.S. population datasets, our methods produce high-quality private barycenters with strong accuracy-privacy tradeoffs.

[340] CHORD: Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration

Tianqi Liu, Kairui Fu, Shengyu Zhang, Wenyan Fan, Zhaocheng Du, Jieming Zhu, Fan Wu, Fei Wu

Main category: cs.LG

TL;DR: CHORD is a framework for customizing hybrid-precision on-device models for sequential recommendation through device-cloud collaboration, enabling personalized quantization without retraining.

DetailsMotivation: Current quantization methods for on-device deployment overlook device-specific user interests, compromising recommendation accuracy, while on-device finetuning imposes computational burden through local retraining.

Method: CHORD distributes randomly initialized models across devices and uses cloud-based hypernetwork modules to identify user-specific critical parameters through multi-granularity sensitivity analysis (layer, filter, element levels), then applies channel-wise mixed-precision quantization with 2-bit strategy encoding.

Result: Experiments on three real-world datasets with SASRec and Caser backbones demonstrate CHORD achieves accuracy, efficiency, and adaptivity in on-device sequential recommendation.

Conclusion: CHORD enables dynamic model adaptation and accelerated inference without backpropagation or retraining cycles, effectively balancing personalization and resource constraints through device-cloud collaboration.

Abstract: With the advancement of mobile device capabilities, deploying reranking models directly on devices has become feasible, enabling real-time contextual recommendations. When migrating models from cloud to devices, resource heterogeneity inevitably necessitates model compression. Recent quantization methods show promise for efficient deployment, yet they overlook device-specific user interests, resulting in compromised recommendation accuracy. While on-device finetuning captures personalized user preference, it imposes additional computational burden through local retraining. To address these challenges, we propose a framework for Customizing Hybrid-precision On-device models for sequential Recommendation with Device-cloud collaboration (CHORD), leveraging channel-wise mixed-precision quantization to simultaneously achieve personalization and resource-adaptive deployment. CHORD distributes randomly initialized models across heterogeneous devices and identifies user-specific critical parameters through auxiliary hypernetwork modules on the cloud. Our parameter sensitivity analysis operates across multiple granularities (layer, filter, and element levels), enabling precise mapping from user profiles to quantization strategies. Through on-device mixed-precision quantization, CHORD delivers dynamic model adaptation and accelerated inference without backpropagation, eliminating costly retraining cycles. We minimize communication overhead by encoding quantization strategies using only 2 bits per channel instead of 32-bit weights. Experiments on three real-world datasets with two popular backbones (SASRec and Caser) demonstrate the accuracy, efficiency, and adaptivity of CHORD.
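
The communication trick is easy to reproduce: with four candidate precisions per channel, a quantization strategy fits in 2 bits per channel rather than 32-bit weights. A sketch under the assumption of a {2, 4, 8, 16}-bit level set:

```python
import numpy as np

LEVELS = np.array([2, 4, 8, 16])  # candidate bit-widths per channel (assumed)

def encode_strategy(bitwidths: np.ndarray) -> bytes:
    """Map per-channel bit-widths to 2-bit codes and pack them into bytes."""
    codes = np.searchsorted(LEVELS, bitwidths)                   # values in {0, 1, 2, 3}
    bits = ((codes[:, None] >> np.array([1, 0])) & 1).astype(np.uint8)
    return np.packbits(bits.ravel()).tobytes()

def decode_strategy(blob: bytes, n_channels: int) -> np.ndarray:
    bits = np.unpackbits(np.frombuffer(blob, dtype=np.uint8))[: 2 * n_channels]
    codes = bits.reshape(-1, 2) @ np.array([2, 1])
    return LEVELS[codes]
```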

[341] Lightweight Transformer for EEG Classification via Balanced Signed Graph Algorithm Unrolling

Junyi Yao, Parham Eftekhar, Gene Cheung, Xujin Chris Liu, Yao Wang, Wei Hu

Main category: cs.LG

TL;DR: The paper proposes a lightweight transformer-like neural network for EEG-based epilepsy detection by unrolling a spectral denoising algorithm on balanced signed graphs, achieving comparable performance to deep learning methods with far fewer parameters.

DetailsMotivation: EEG signals have inherent anti-correlations that can be modeled by negative edges in graphs, and there's a need for interpretable and efficient methods to differentiate epilepsy patients from healthy subjects.

Method: Build transformer-like neural nets by unrolling a spectral denoising algorithm for signals on balanced signed graphs, implement ideal low-pass filter via Lanczos approximation on mapped positive graphs, and learn optimal cutoff frequency from data.

Result: The method achieves classification performance comparable to representative deep learning schemes while employing dramatically fewer parameters.

Conclusion: The proposed approach provides an efficient and interpretable alternative to deep learning methods for EEG-based epilepsy detection, leveraging the structural properties of balanced signed graphs.

Abstract: Samples of brain signals collected by EEG sensors have inherent anti-correlations that are well modeled by negative edges in a finite graph. To differentiate epilepsy patients from healthy subjects using collected EEG signals, we build lightweight and interpretable transformer-like neural nets by unrolling a spectral denoising algorithm for signals on a balanced signed graph, i.e., a graph with no cycles containing an odd number of negative edges. A balanced signed graph has well-defined frequencies that map to a corresponding positive graph via a similarity transform of the graph Laplacian matrices. We implement an ideal low-pass filter efficiently on the mapped positive graph via Lanczos approximation, where the optimal cutoff frequency is learned from data. Given that two balanced signed graph denoisers learn posterior probabilities of two different signal classes during training, we evaluate their reconstruction errors for binary classification of EEG signals. Experiments show that our method achieves classification performance comparable to representative deep learning schemes, while employing dramatically fewer parameters.
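
The structural fact the method rests on can be shown in a few lines: for a balanced signed graph, conjugating the signed graph Laplacian by a diagonal ±1 matrix built from the balanced bipartition yields the Laplacian of a positive graph with identical eigenvalues, so low-pass filtering is well defined. A minimal sketch (variable names are assumptions):

```python
import numpy as np

def map_to_positive(L_signed: np.ndarray, polarity: np.ndarray) -> np.ndarray:
    """polarity: ±1 node labels from the balanced bipartition; S = diag(polarity) = S^-1."""
    S = np.diag(polarity)
    return S @ L_signed @ S   # same spectrum; all edge weights become non-negative
```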

[342] ZeroShotOpt: Towards Zero-Shot Pretrained Models for Efficient Black-Box Optimization

Jamison Meindl, Yunsheng Tian, Tony Cui, Veronika Thost, Zhang-Wei Hong, Johannes Dürholt, Jie Chen, Wojciech Matusik, Mina Konaković Luković

Main category: cs.LG

TL;DR: ZeroShotOpt is a pretrained model for black-box optimization that uses offline reinforcement learning on BO trajectories and synthetic functions to achieve zero-shot performance matching or exceeding traditional BO methods.

DetailsMotivation: Bayesian optimization's performance depends on hyperparameters that are hard to tune and don't generalize well across different problem landscapes, requiring a more robust and general-purpose approach.

Method: Uses offline reinforcement learning on optimization trajectories from 12 BO variants, pretrained on millions of synthetic Gaussian process-based functions with diverse landscapes to learn transferable optimization policies.

Result: Achieves robust zero-shot generalization on unseen benchmarks from 2D to 20D, matching or surpassing sample efficiency of leading global optimizers including BO.

Conclusion: ZeroShotOpt provides a reusable foundation for black-box optimization that generalizes well without manual tuning, offering a promising alternative to traditional BO methods.

Abstract: Global optimization of expensive, derivative-free black-box functions requires extreme sample efficiency. While Bayesian optimization (BO) is the current state-of-the-art, its performance hinges on surrogate and acquisition function hyper-parameters that are often hand-tuned and fail to generalize across problem landscapes. We present ZeroShotOpt, a general-purpose, pretrained model for continuous black-box optimization tasks ranging from 2D to 20D. Our approach leverages offline reinforcement learning on large-scale optimization trajectories collected from 12 BO variants. To scale pretraining, we generate millions of synthetic Gaussian process-based functions with diverse landscapes, enabling the model to learn transferable optimization policies. As a result, ZeroShotOpt achieves robust zero-shot generalization on a wide array of unseen benchmarks, matching or surpassing the sample efficiency of leading global optimizers, including BO, while also offering a reusable foundation for future extensions and improvements. Our open-source code, dataset, and model are available at: https://github.com/jamisonmeindl/zeroshotopt
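
The pretraining corpus generation is the most reproducible piece: draw random objective functions from a Gaussian process prior over the search domain. A minimal sketch with an RBF kernel (lengthscale, grid size, and domain are assumptions):

```python
import numpy as np

def sample_gp_function(n_points=256, dim=2, lengthscale=0.3, seed=0):
    """Draw one random objective from a zero-mean GP with an RBF kernel on [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, (n_points, dim))
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * sq_dists / lengthscale ** 2) + 1e-8 * np.eye(n_points)
    y = rng.multivariate_normal(np.zeros(n_points), K)
    return X, y   # (inputs, noiseless objective values) for one synthetic landscape
```

Varying the lengthscale and kernel family yields the "diverse landscapes" the abstract refers to.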

[343] Bayesian E(3)-Equivariant Interatomic Potential with Iterative Restratification of Many-body Message Passing

Soohaeng Yoo Willow, Tae Hyeon Park, Gi Beom Sim, Sung Wook Moon, Seung Kyu Min, D. ChangMo Yang, Hyun Woo Kim, Juho Lee, Chang Woo Myung

Main category: cs.LG

TL;DR: Bayesian E(3) equivariant machine learning potentials with iterative restratification and joint energy-force negative log-likelihood loss for uncertainty quantification in atomistic simulations.

DetailsMotivation: Current MLPs lack reliable uncertainty quantification, limiting their use for active learning, calibration, and out-of-distribution detection in atomistic simulations.

Method: Developed Bayesian E(3) equivariant MLPs with iterative restratification of many-body message passing and introduced a joint energy-force negative log-likelihood (NLL-JEF) loss function. Benchmarked multiple Bayesian approaches including deep ensembles, stochastic weight averaging Gaussian, improved variational online Newton, and the Laplace approximation.

Result: Achieved competitive accuracy with state-of-the-art models while enabling uncertainty-guided active learning, OOD detection, and energy/forces calibration. Outperformed random sampling and energy-uncertainty-based sampling using Bayesian active learning by disagreement.

Conclusion: Bayesian equivariant neural networks establish a powerful framework for developing uncertainty-aware MLPs for scalable atomistic simulations.

Abstract: Machine learning potentials (MLPs) have become essential for large-scale atomistic simulations, enabling ab initio-level accuracy with computational efficiency. However, current MLPs struggle with uncertainty quantification, limiting their reliability for active learning, calibration, and out-of-distribution (OOD) detection. We address these challenges by developing Bayesian E(3) equivariant MLPs with iterative restratification of many-body message passing. Our approach introduces the joint energy-force negative log-likelihood (NLL-JEF) loss function, which explicitly models uncertainty in both energies and interatomic forces, yielding superior accuracy compared to conventional NLL losses. We systematically benchmark multiple Bayesian approaches, including deep ensembles with mean-variance estimation, stochastic weight averaging Gaussian, improved variational online Newton, and the Laplace approximation, by evaluating their performance on uncertainty prediction, OOD detection, calibration, and active learning tasks. We further demonstrate that NLL-JEF facilitates efficient active learning by quantifying energy and force uncertainties. Using Bayesian active learning by disagreement (BALD), our framework outperforms random sampling and energy-uncertainty-based sampling. Our results demonstrate that Bayesian MLPs achieve competitive accuracy with state-of-the-art models while enabling uncertainty-guided active learning, OOD detection, and energy/force calibration. This work establishes Bayesian equivariant neural networks as a powerful framework for developing uncertainty-aware MLPs for atomistic simulations at scale.
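
A minimal sketch of a joint energy-force Gaussian negative log-likelihood in the spirit of NLL-JEF: the model emits means and variances for the energy and for each force component, and both enter the loss, so force uncertainty is modeled explicitly. The Gaussian form and the weighting term beta are assumptions about the paper's exact loss.

```python
import torch

def nll_jef(e_mu, e_var, e_true, f_mu, f_var, f_true, beta=1.0):
    """Gaussian NLL on the energy plus a weighted Gaussian NLL on all force components."""
    nll_e = 0.5 * (torch.log(e_var) + (e_true - e_mu) ** 2 / e_var).mean()
    nll_f = 0.5 * (torch.log(f_var) + (f_true - f_mu) ** 2 / f_var).mean()
    return nll_e + beta * nll_f   # per-component variances make force uncertainty explicit
```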

[344] Comparative Analysis of Parameterized Action Actor-Critic Reinforcement Learning Algorithms for Web Search Match Plan Generation

Ubayd Bapoo, Clement N Nyirenda

Main category: cs.LG

TL;DR: PAGAC outperforms PASAC and PATQC in high-dimensional parametrized action spaces, achieving fastest training times and highest returns in Platform and Robot Soccer Goal benchmarks.

DetailsMotivation: To evaluate SAC, GAC, and TQC algorithms in high-dimensional decision-making tasks with parametrized action spaces, eliminating the need for recurrent networks.

Method: Used fully observable environments with parametrized action spaces, benchmarked on Platform-v0 and Goal-v0. Performed hyperparameter optimization with Microsoft NNI and modified GAC/TQC codebases for reproducibility.

Result: PAGAC achieved the fastest training times (41:24 for Platform, 24:04 for Robot Soccer Goal) and the highest returns, demonstrating superior efficiency and stability compared to PASAC and PATQC.

Conclusion: PAGAC is ideal for tasks requiring rapid convergence and robust performance in complex action spaces. Future work could explore hybrid entropy-regularization with truncation methods.

Abstract: This study evaluates the performance of Soft Actor Critic (SAC), Greedy Actor Critic (GAC), and Truncated Quantile Critics (TQC) in high-dimensional decision-making tasks using fully observable environments. The focus is on parametrized action (PA) spaces, eliminating the need for recurrent networks, with benchmarks Platform-v0 and Goal-v0 testing discrete actions linked to continuous action-parameter spaces. Hyperparameter optimization was performed with Microsoft NNI, ensuring reproducibility by modifying the codebase for GAC and TQC. Results show that Parameterized Action Greedy Actor-Critic (PAGAC) outperformed other algorithms, achieving the fastest training times and highest returns across benchmarks, completing 5,000 episodes in 41:24 for the Platform game and 24:04 for the Robot Soccer Goal game. Its speed and stability provide clear advantages in complex action spaces. Compared to PASAC and PATQC, PAGAC demonstrated superior efficiency and reliability, making it ideal for tasks requiring rapid convergence and robust performance. Future work could explore hybrid strategies combining entropy-regularization with truncation-based methods to enhance stability and expand investigations into generalizability.

[345] A Unified Deep Reinforcement Learning Approach for Close Enough Traveling Salesman Problem

Mingfeng Fan, Jiaqi Cheng, Yaoxin Wu, Yifeng Zhang, Yibin Yang, Guohua Wu, Guillaume Sartoretti

Main category: cs.LG

TL;DR: Proposes UD3RL, a dual-decoder DRL framework for solving the close-enough TSP, separating node selection and waypoint determination with enhanced spatial reasoning.

DetailsMotivation: Limited attention has been given to CETSP due to the challenge of neighborhood-based visitation criteria, where nodes are visited when agents enter their compact neighborhoods.

Method: Formulates CETSP as MDP with discretization, uses unified dual-decoder DRL framework with node-decoder and loc-decoder, employs k-nearest neighbors subgraph interaction for spatial reasoning, and customizes REINFORCE algorithm for training.

Result: UD3RL outperforms conventional methods in solution quality and runtime, exhibits strong generalization across problem scales, spatial distributions, radius ranges, and shows robustness to dynamic environments.

Conclusion: The proposed UD3RL framework effectively addresses CETSP challenges and demonstrates superior performance and generalization capabilities compared to existing methods.

Abstract: In recent years, deep reinforcement learning (DRL) has gained traction for solving the NP-hard traveling salesman problem (TSP). However, limited attention has been given to the close-enough TSP (CETSP), primarily due to the challenge introduced by its neighborhood-based visitation criterion, wherein a node is considered visited if the agent enters a compact neighborhood around it. In this work, we formulate a Markov decision process (MDP) for CETSP using a discretization scheme and propose a novel unified dual-decoder DRL (UD3RL) framework that separates decision-making into node selection and waypoint determination. Specifically, an adapted encoder is employed for effective feature extraction, followed by a node-decoder and a loc-decoder to handle the two sub-tasks, respectively. A k-nearest neighbors subgraph interaction strategy is further introduced to enhance spatial reasoning during location decoding. Furthermore, we customize the REINFORCE algorithm to train UD3RL as a unified model capable of generalizing across different problem sizes and varying neighborhood radius types (i.e., constant and random radii). Experimental results show that UD3RL outperforms conventional methods in both solution quality and runtime, while exhibiting strong generalization across problem scales, spatial distributions, and radius ranges, as well as robustness to dynamic environments.

[346] Bootstrap Learning for Combinatorial Graph Alignment with Sequential GNNs

Marc Lelarge

Main category: cs.LG

TL;DR: A novel chaining procedure for GNNs that iteratively refines graph alignment solutions, achieving 3x better accuracy than existing methods and uniquely solving regular graphs where other approaches fail.

DetailsMotivation: GNNs have struggled to outperform traditional optimization methods on combinatorial problems like graph alignment, limiting their practical impact.

Method: Trains a sequence of GNNs where each network learns to iteratively refine similarity matrices from previous networks, using a bootstrap effect during inference. Combines with an architecture that operates on node pairs rather than individual nodes to capture global structural patterns.

Result: Achieves over 3x better accuracy than existing methods on challenging instances, uniquely solves regular graphs where all competing approaches fail, and substantially outperforms state-of-the-art solvers when combined with traditional optimization as post-processing.

Conclusion: The chained GNN approach successfully bridges the performance gap between neural networks and traditional optimization methods for combinatorial graph problems.

Abstract: Graph neural networks (GNNs) have struggled to outperform traditional optimization methods on combinatorial problems, limiting their practical impact. We address this gap by introducing a novel chaining procedure for the graph alignment problem, a fundamental NP-hard task of finding optimal node correspondences between unlabeled graphs using only structural information. Our method trains a sequence of GNNs where each network learns to iteratively refine similarity matrices produced by previous networks. During inference, this creates a bootstrap effect: each GNN improves upon partial solutions by incorporating discrete ranking information about node alignment quality from prior iterations. We combine this with a powerful architecture that operates on node pairs rather than individual nodes, capturing global structural patterns essential for alignment that standard message-passing networks cannot represent. Extensive experiments on synthetic benchmarks demonstrate substantial improvements: our chained GNNs achieve over 3x better accuracy than existing methods on challenging instances, and uniquely solve regular graphs where all competing approaches fail. When combined with traditional optimization as post-processing, our method substantially outperforms state-of-the-art solvers on the graph alignment benchmark.

[347] Distilled Protein Backbone Generation

Liyang Xie, Haoran Zhang, Zhendong Wang, Wesley Tansey, Mingyuan Zhou

Main category: cs.LG

TL;DR: The paper presents a method to accelerate protein backbone generation using score distillation, achieving 20x faster sampling while maintaining quality comparable to the original diffusion model.

DetailsMotivation: Diffusion-based protein generation models are computationally expensive, requiring hundreds of iterative steps that limit their practical use in large-scale protein discovery applications.

Method: Adapted Score identity Distillation (SiD) with multistep generation and inference time noise modulation to train few-step protein backbone generators from pretrained teacher models.

Result: Achieved more than 20-fold improvement in sampling speed while maintaining similar designability, diversity, and novelty as the original Proteina teacher model.

Conclusion: The distilled few-step generators enable large-scale in silico protein design, making diffusion-based models more practical for real-world protein engineering applications.

Abstract: Diffusion- and flow-based generative models have recently demonstrated strong performance in protein backbone generation tasks, offering unprecedented capabilities for de novo protein design. However, while achieving notable generation quality, these models are limited by their generation speed, often requiring hundreds of iterative steps in the reverse-diffusion process. This computational bottleneck limits their practical utility in large-scale protein discovery, where thousands to millions of candidate structures are needed. To address this challenge, we explore score distillation techniques, which have shown great success in reducing the number of sampling steps in the vision domain while maintaining high generation quality. However, a straightforward adaptation of these methods results in unacceptably low designability. Through extensive study, we have identified how to appropriately adapt Score identity Distillation (SiD), a state-of-the-art score distillation strategy, to train few-step protein backbone generators that significantly reduce sampling time while maintaining performance comparable to their pretrained teacher model. In particular, multistep generation combined with inference-time noise modulation is key to this success. We demonstrate that our distilled few-step generators achieve more than a 20-fold improvement in sampling speed, while achieving similar levels of designability, diversity, and novelty as the Proteina teacher model. This reduction in inference cost enables large-scale in silico protein design, thereby bringing diffusion-based models closer to real-world protein engineering applications.

[348] Adaptive Node Feature Selection For Graph Neural Networks

Ali Azizpour, Madeline Navarro, Santiago Segarra

Main category: cs.LG

TL;DR: An adaptive node feature selection method for GNNs that identifies and removes unnecessary features during training using permutation-based intervention.

DetailsMotivation: Graph-structured data introduces complex dependencies that make classical feature importance metrics inadequate. The approach aims to interpret GNN decisions, reduce dimensionality, and potentially improve performance by eliminating unhelpful features.

Method: A model- and task-agnostic method that determines relevant features during training based on changes in validation performance when feature values are permuted. The approach tracks how feature relevance evolves as features are successively dropped.

Result: The method demonstrates flexibility across different graph architectures and adaptability to challenging graph learning settings. It provides feature importance scores and allows monitoring of feature elimination effectiveness.

Conclusion: The intervention-based approach effectively addresses the challenges of feature selection in graph-structured data, offering a theoretically motivated and empirically validated solution for adaptive feature selection in GNNs.

Abstract: We propose an adaptive node feature selection approach for graph neural networks (GNNs) that identifies and removes unnecessary features during training. The ability to measure how features contribute to model output is key for interpreting decisions, reducing dimensionality, and even improving performance by eliminating unhelpful variables. However, graph-structured data introduces complex dependencies that may not be amenable to classical feature importance metrics. Inspired by this challenge, we present a model- and task-agnostic method that determines relevant features during training based on changes in validation performance upon permuting feature values. We theoretically motivate our intervention-based approach by characterizing how GNN performance depends on the relationships between node data and graph structure. Not only do we return feature importance scores once training concludes, but we also track how relevance evolves as features are successively dropped. We can therefore monitor whether features are eliminated effectively and also evaluate other metrics with this technique. Our empirical results verify the flexibility of our approach across different graph architectures as well as its adaptability to more challenging graph learning settings.
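
The intervention itself is compact: permute one node-feature column across the graph, re-evaluate validation performance, and read the drop as that feature's relevance; the least relevant features can then be dropped as training proceeds. In the sketch below, `evaluate` is an assumed callable returning a validation score.

```python
import torch

def feature_relevance(model, X, edge_index, y_val, evaluate):
    """Relevance of each node feature = validation drop when that column is permuted."""
    base = evaluate(model, X, edge_index, y_val)
    scores = []
    for j in range(X.shape[1]):
        X_perm = X.clone()
        X_perm[:, j] = X_perm[torch.randperm(X.shape[0]), j]  # break feature-graph alignment
        scores.append(base - evaluate(model, X_perm, edge_index, y_val))
    return torch.tensor(scores)   # large score = feature worth keeping
```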

[349] Signature-Informed Transformer for Asset Allocation

Yoontae Hwang, Stefan Zohren

Main category: cs.LG

TL;DR: The paper introduces SIT, a novel framework that learns end-to-end allocation policies by directly optimizing risk-aware financial objectives, using path signatures for geometric representation of asset dynamics and signature-augmented attention with financial inductive biases.

DetailsMotivation: Deep-learning forecasters often fail in robust asset allocation due to objective mismatch and error amplification, creating a need for better approaches that directly optimize financial objectives.

Method: Signature-Informed Transformer (SIT) framework using path signatures for geometric representation of asset dynamics and signature-augmented attention mechanism with financial inductive biases like lead-lag effects.

Result: SIT decisively outperforms traditional and deep-learning baselines on daily S&P 100 equity data, especially compared to predict-then-optimize models.

Conclusion: Portfolio-aware objectives and geometry-aware inductive biases are essential for risk-aware capital allocation in machine-learning systems.

Abstract: Robust asset allocation is a key challenge in quantitative finance, where deep-learning forecasters often fail due to objective mismatch and error amplification. We introduce the Signature-Informed Transformer (SIT), a novel framework that learns end-to-end allocation policies by directly optimizing a risk-aware financial objective. SIT’s core innovations include path signatures for a rich geometric representation of asset dynamics and a signature-augmented attention mechanism embedding financial inductive biases, like lead-lag effects, into the model. Evaluated on daily S&P 100 equity data, SIT decisively outperforms traditional and deep-learning baselines, especially when compared to predict-then-optimize models. These results indicate that portfolio-aware objectives and geometry-aware inductive biases are essential for risk-aware capital allocation in machine-learning systems. The code is available at: https://github.com/Yoontae6719/Signature-Informed-Transformer-For-Asset-Allocation
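
For readers unfamiliar with path signatures, the sketch below computes the first two signature levels of a piecewise-linear price path; the antisymmetric part of the second level (the Lévy area) is exactly where lead-lag effects between assets show up. This is generic signature code, not SIT's implementation.

```python
import numpy as np

def signature_level2(path: np.ndarray):
    """path: (T, d) piecewise-linear path; returns level-1 and level-2 signature terms."""
    dx = np.diff(path, axis=0)                # increments, shape (T-1, d)
    S1 = dx.sum(axis=0)                       # level 1: total displacement per asset
    before = np.cumsum(dx, axis=0) - dx       # increments strictly before each step
    S2 = before.T @ dx + 0.5 * dx.T @ dx      # iterated integrals, shape (d, d)
    levy_area = 0.5 * (S2 - S2.T)             # antisymmetric part: lead-lag information
    return S1, S2, levy_area
```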

[350] AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks

Irene Tenison, Soumyajit Chatterjee, Fahim Kawsar, Mohammad Malekzadeh

Main category: cs.LG

TL;DR: AdaBet is a gradient-free layer selection method for efficient on-device neural network retraining that uses Betti Numbers to analyze activation space topology, achieving better accuracy with lower memory usage than gradient-based approaches.

DetailsMotivation: To enable efficient adaptation of pre-trained neural networks on edge/mobile devices with limited compute/memory resources, avoiding the impracticality of full-model retraining and limitations of current layer selection methods that require labels, gradients, or server-side training.

Method: Uses Betti Numbers to analyze topological features of layer activation spaces through forward passes alone, ranking layers by learning capacity without requiring labels or gradients.

Result: On 16 model-dataset pairs, AdaBet achieves 5% higher classification accuracy than gradient-based baselines while reducing average peak memory consumption by 40%.

Conclusion: AdaBet provides an effective gradient-free approach for layer selection in on-device retraining, enabling efficient adaptation without requiring labels or server resources.

Abstract: To utilize pre-trained neural networks on edge and mobile devices, we often require efficient adaptation to user-specific runtime data distributions while operating under limited compute and memory resources. On-device retraining with a target dataset can facilitate such adaptations; however, it remains impractical due to the increasing depth of modern neural nets, as well as the computational overhead associated with gradient-based optimization across all layers. Current approaches reduce training cost by selecting a subset of layers for retraining; however, they rely on labeled data, at least one full-model backpropagation, or server-side meta-training, limiting their suitability for constrained devices. We introduce AdaBet, a gradient-free layer selection approach that ranks important layers by analyzing topological features of their activation spaces through Betti Numbers, using forward passes alone. AdaBet allows selecting layers with high learning capacity, which are important for retraining and adaptation, without requiring labels or gradients. Evaluating AdaBet on sixteen pairs of benchmark models and datasets shows that AdaBet achieves an average gain of 5% in classification accuracy over gradient-based baselines while reducing average peak memory consumption by 40%.
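
A hedged sketch of the scoring idea: run a forward pass, build a single-scale neighborhood graph on a layer's activations, and take Betti-0 (the number of connected components) as a cheap topological summary. The graph construction, the scale, and the mapping from Betti numbers to a layer ranking are all assumptions here, not AdaBet's exact procedure.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial import distance_matrix

def betti0(activations: np.ndarray, radius: float) -> int:
    """activations: (n_samples, n_units) from one layer's forward pass."""
    D = distance_matrix(activations, activations)
    adjacency = (D < radius).astype(np.int8)   # single-scale Vietoris-Rips graph
    n_components, _ = connected_components(adjacency, directed=False)
    return n_components
```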

[351] Real Time Headway Predictions in Urban Rail Systems and Implications for Service Control: A Deep Learning Approach

Muhammad Usama, Haris Koutsopoulos

Main category: cs.LG

TL;DR: A ConvLSTM-based deep learning framework for predicting train headway propagation in metro systems, enabling dispatchers to evaluate terminal headway control decisions efficiently.

DetailsMotivation: To improve service reliability, resource utilization, and passenger satisfaction through efficient real-time dispatching in urban metro systems by proactively controlling operations.

Method: Uses a Convolutional LSTM model that incorporates planned terminal headways and historical data to predict spatiotemporal headway dynamics across all stations, with flexible methodology to simulate various dispatcher strategies.

Result: The model demonstrates promising headway predictions on a large-scale metro dataset, providing actionable insights for real-time decision-making without intensive simulations.

Conclusion: The framework offers rail operators a computationally efficient tool to optimize dispatching strategies, significantly improving service consistency and passenger satisfaction.

Abstract: Efficient real-time dispatching in urban metro systems is essential for ensuring service reliability, maximizing resource utilization, and improving passenger satisfaction. This study presents a novel deep learning framework centered on a Convolutional Long Short-Term Memory (ConvLSTM) model designed to predict the complex spatiotemporal propagation of train headways across an entire metro line. By directly incorporating planned terminal headways as a critical input alongside historical headway data, the proposed model accurately forecasts future headway dynamics, effectively capturing both their temporal evolution and spatial dependencies across all stations. This capability empowers dispatchers to evaluate the impact of various terminal headway control decisions without resorting to computationally intensive simulations. We introduce a flexible methodology to simulate diverse dispatcher strategies, ranging from maintaining even headways to implementing custom patterns derived from observed terminal departures. In contrast to existing research primarily focused on passenger load prediction or atypical disruption scenarios, our approach emphasizes proactive operational control. Evaluated on a large-scale dataset from an urban metro line, the proposed ConvLSTM model demonstrates promising headway predictions, offering actionable insights for real-time decision-making. This framework provides rail operators with a powerful, computationally efficient tool to optimize dispatching strategies, thereby significantly improving service consistency and passenger satisfaction.
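
For readers unfamiliar with ConvLSTMs, here is a minimal 1D ConvLSTM cell with stations along the spatial axis. The paper's depth, channel layout, and exactly how planned terminal headways are injected are not specified in this summary, so treat the two-channel input (historical headway plus planned terminal headway) as an assumption:

```python
import torch
import torch.nn as nn

class ConvLSTMCell1d(nn.Module):
    """LSTM gates computed with a 1D convolution over stations, so the cell
    captures spatial dependencies along the line at every time step."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.conv = nn.Conv1d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state=None):
        # x: (batch, in_ch, n_stations)
        if state is None:
            h = x.new_zeros(x.size(0), self.hid_ch, x.size(2))
            c = h.clone()
        else:
            h, c = state
        i, f, o, g = self.conv(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

# usage: one channel for historical headways, one for the planned terminal headway
cell = ConvLSTMCell1d(in_ch=2, hid_ch=16)
seq = torch.randn(8, 10, 2, 30)            # (batch, time, channels, stations)
state = None
for t in range(seq.size(1)):
    h, state = cell(seq[:, t], state)
```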

[352] Enhancing XAI Narratives through Multi-Narrative Refinement and Knowledge Distillation

Flavio Giorgi, Matteo Silvestri, Cesare Campagnano, Fabrizio Silvestri, Gabriele Tolomei

Main category: cs.LG

TL;DR: A pipeline using Language Models to generate narrative counterfactual explanations, with knowledge distillation to make small models perform comparably to large ones, plus an evaluation method for narrative quality.

DetailsMotivation: Counterfactual explanations are promising for AI explainability but are often too technical for non-experts to understand.

Method: Proposed a pipeline using Language Models (large and small) with knowledge distillation and refining mechanisms to generate narrative explanations, plus an evaluation method for narrative quality.

Result: The pipeline enhances reasoning capabilities and practical performance of student models, making them suitable for real-world use.

Conclusion: The approach successfully makes counterfactual explanations more accessible to non-experts through narrative generation while maintaining model performance.

Abstract: Explainable Artificial Intelligence has become a crucial area of research, aiming to demystify the decision-making processes of deep learning models. Among various explainability techniques, counterfactual explanations have been proven particularly promising, as they offer insights into model behavior by highlighting minimal changes that would alter a prediction. Despite their potential, these explanations are often complex and technical, making them difficult for non-experts to interpret. To address this challenge, we propose a novel pipeline that leverages Language Models, large and small, to compose narratives for counterfactual explanations. We employ knowledge distillation techniques along with a refining mechanism to enable Small Language Models to perform comparably to their larger counterparts while maintaining robust reasoning abilities. In addition, we introduce a simple but effective evaluation method to assess natural language narratives, designed to verify whether the models’ responses are in line with the factual, counterfactual ground truth. As a result, our proposed pipeline enhances both the reasoning capabilities and practical performance of student models, making them more suitable for real-world use cases.
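
To make the distillation step concrete, here is the standard temperature-scaled knowledge-distillation loss, applied per token over vocabulary logits in the language-model setting. This is a generic textbook formulation shown for orientation; the paper's refining mechanism on top of it is not reproduced here:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 to keep gradient magnitudes comparable."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```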

[353] Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking

Dhruv Rohatgi, Abhishek Shetty, Donya Saless, Yuchen Li, Ankur Moitra, Andrej Risteski, Dylan J. Foster

Main category: cs.LG

TL;DR: VGB is a new process-guided test-time sampling algorithm that uses backtracking to improve robustness to verifier errors in language model generation, outperforming baselines on various metrics.

DetailsMotivation: Standard decoding techniques suffer catastrophic failures from seemingly benign verifier errors due to error amplification during generation, motivating the need for more sophisticated decoding strategies.

Method: VGB interprets autoregressive generation as a random walk on a tree of partial generations with transition probabilities guided by process verifier and base model, using probabilistic backtracking that generalizes the Sinclair-Jerrum random walk.

Result: Empirical evaluation on synthetic and real language modeling tasks shows VGB outperforms baselines on a variety of metrics.

Conclusion: VGB provides provably better robustness to verifier errors through theoretically grounded backtracking, demonstrating improved performance over standard decoding approaches.

Abstract: Test-time algorithms that combine the generative power of language models with process verifiers that assess the quality of partial generations offer a promising lever for eliciting new reasoning capabilities, but the algorithmic design space and computational scaling properties of such approaches are still opaque, and their benefits are far from apparent when one accounts for the cost of learning a high-quality verifier. Our starting point is the observation that seemingly benign errors in a learned verifier can lead to catastrophic failures for standard decoding techniques due to error amplification during the course of generation. We then ask: can this be improved with more sophisticated decoding strategies? We introduce a new process-guided test-time sampling algorithm, VGB, which uses theoretically grounded backtracking to achieve provably better robustness to verifier errors. VGB interprets autoregressive generation as a random walk on a tree of partial generations, with transition probabilities guided by the process verifier and base model; crucially, backtracking occurs probabilistically. This process generalizes the seminal Sinclair-Jerrum random walk (Sinclair & Jerrum, 1989) from the literature on approximate counting and sampling in theoretical computer science, and a conceptual contribution of our work is to highlight parallels with this literature. Empirically, we demonstrate on both synthetic and real language modeling tasks that VGB outperforms baselines on a variety of metrics.
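
A toy, self-contained sketch of the random-walk-with-backtracking idea on the tree of prefixes. The fixed unit of backtracking mass and the string-alphabet setup are our simplifications; the actual VGB transition probabilities come from the Sinclair-Jerrum construction, not this heuristic:

```python
import random

def vgb_sample(base_probs, verifier, max_len=8, steps=500):
    """Walk on the tree of partial generations: extension edges are weighted
    by base-model probability times verifier score, and one edge leads back
    to the parent prefix, so low-scoring branches get probabilistically undone."""
    prefix = ""
    for _ in range(steps):
        moves = []
        if prefix:                                  # backtrack edge (heuristic mass)
            moves.append((prefix[:-1], 1.0))
        for tok, p in base_probs(prefix).items():   # verifier-weighted extensions
            moves.append((prefix + tok, p * max(verifier(prefix + tok), 1e-6)))
        total = sum(w for _, w in moves)
        r, acc = random.random() * total, 0.0
        for nxt, w in moves:
            acc += w
            if r <= acc:
                prefix = nxt
                break
        if len(prefix) == max_len:
            return prefix
    return prefix

# toy usage: a uniform base model and a verifier that prefers strings with many 'a's
out = vgb_sample(lambda s: {"a": 0.5, "b": 0.5},
                 lambda s: (s.count("a") + 1) / (len(s) + 1))
print(out)
```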

[354] Mixture of Many Zero-Compute Experts: A High-Rate Quantization Theory Perspective

Yehuda Dar

Main category: cs.LG

TL;DR: This paper applies high-rate quantization theory to analyze mixture-of-experts (MoE) models for regression, examining approximation error minimization and the tradeoff between approximation and estimation errors based on the number of experts.

DetailsMotivation: To provide new theoretical insights into MoE models using classical high-rate quantization theory, particularly focusing on how the number of experts affects model performance in regression tasks.

Method: Defines an MoE model with input-space segmentation and constant-predictor experts. Uses high-rate quantization assumptions (large number of experts) to analyze approximation error. Studies test error minimization for 1D inputs and formulates upper bounds for multidimensional inputs. Also analyzes statistical learning properties of expert parameter estimation.

Result: Theoretical formulations for test error minimization in 1D and upper bounds for multidimensional cases. Analysis shows how the tradeoff between approximation error (decreases with more experts) and estimation error (increases with more experts) depends on the number of experts.

Conclusion: The paper demonstrates that high-rate quantization theory provides valuable insights into MoE model behavior, revealing the fundamental tradeoff between approximation and estimation errors that governs optimal expert selection in mixture-of-experts learning.

Abstract: This paper uses classical high-rate quantization theory to provide new insights into mixture-of-experts (MoE) models for regression tasks. Our MoE is defined by a segmentation of the input space to regions, each with a single-parameter expert that acts as a constant predictor with zero-compute at inference. Motivated by high-rate quantization theory assumptions, we assume that the number of experts is sufficiently large to make their input-space regions very small. This lets us study the approximation error of our MoE model class: (i) for one-dimensional inputs, we formulate the test error and its minimizing segmentation and experts; (ii) for multidimensional inputs, we formulate an upper bound for the test error and study its minimization. Moreover, we consider the learning of the expert parameters from a training dataset, given an input-space segmentation, and formulate their statistical learning properties. This leads us to theoretically and empirically show how the tradeoff between approximation and estimation errors in MoE learning depends on the number of experts.
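
The model class is simple enough to demonstrate in a few lines. A toy 1D instance under assumed settings (uniform cells on [0, 1], synthetic sine data): each cell gets a zero-compute constant expert fit as the mean target in that cell. Increasing K shrinks the approximation error but raises the per-expert estimation error, which is exactly the tradeoff the paper studies:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 2000)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

K = 32                                              # number of experts (illustrative)
cell = np.minimum((x * K).astype(int), K - 1)       # uniform input-space segmentation
experts = np.array([y[cell == k].mean() if np.any(cell == k) else 0.0
                    for k in range(K)])             # one constant parameter per expert

x_test = rng.uniform(0, 1, 500)
pred = experts[np.minimum((x_test * K).astype(int), K - 1)]
print(f"test MSE at K={K}: {np.mean((pred - np.sin(2 * np.pi * x_test)) ** 2):.4f}")
```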

[355] Calibrated Uncertainty Sampling for Active Learning

Ha Manh Bui, Iliana Maifeld-Carucci, Anqi Liu

Main category: cs.LG

TL;DR: Proposes a new active learning acquisition function that estimates calibration errors to query samples with highest calibration error before using DNN uncertainty, achieving better calibration and generalization.

DetailsMotivation: Standard uncertainty-based acquisition functions in active learning are affected by uncalibrated uncertainty models, especially with DNNs, leading to poor generalization and high calibration error.

Method: Uses kernel calibration error estimator under covariate shift to identify samples with highest calibration error, then leverages DNN uncertainty for querying in pool-based active learning.

Result: Empirically outperforms other acquisition function baselines with lower calibration and generalization errors across pool-based active learning settings.

Conclusion: The proposed calibration-aware acquisition function effectively bounds calibration error on unlabeled pool and test data, improving active learning performance.

Abstract: We study the problem of actively learning a classifier with a low calibration error. One of the most popular Acquisition Functions (AFs) in pool-based Active Learning (AL) is querying by the model’s uncertainty. However, we recognize that an uncalibrated uncertainty model on the unlabeled pool may significantly affect the AF effectiveness, leading to sub-optimal generalization and high calibration error on unseen data. Deep Neural Networks (DNNs) make it even worse as the model uncertainty from DNNs is usually uncalibrated. Therefore, we propose a new AF that estimates calibration errors and queries samples with the highest calibration error before leveraging DNN uncertainty. Specifically, we utilize a kernel calibration error estimator under covariate shift and formally show that AL with this AF eventually leads to a bounded calibration error on the unlabeled pool and unseen test data. Empirically, our proposed method surpasses other AF baselines by having a lower calibration and generalization error across pool-based AL settings.
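
The sketch below conveys the acquisition logic only; it is our own simplification (RBF-kernel smoothing of per-point confidence gaps), not the paper's kernel calibration error estimator under covariate shift, and all names and thresholds are illustrative:

```python
import numpy as np

def acquisition(X_pool, probs_pool, X_lab, probs_lab, correct_lab, top_m=100):
    """Estimate each pool point's calibration error by kernel-smoothing
    |confidence - accuracy| over the labeled set, then query the most
    uncertain point among the top-m highest-calibration-error candidates."""
    d2 = ((X_pool[:, None, :] - X_lab[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * np.median(d2) + 1e-12))
    K /= K.sum(axis=1, keepdims=True)
    gap_lab = np.abs(probs_lab.max(axis=1) - correct_lab.astype(float))
    cal_err = K @ gap_lab                      # smoothed calibration error at pool points
    cand = np.argsort(-cal_err)[:top_m]
    entropy = -(probs_pool[cand] * np.log(probs_pool[cand] + 1e-12)).sum(axis=1)
    return cand[np.argmax(entropy)]            # index of the point to query
```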

[356] Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

Guanhua Huang, Tingqiang Xu, Mingze Wang, Qi Yi, Xue Gong, Siheng Li, Ruibin Xiong, Kejiao Li, Yuhao Jiang, Bo Zhou

Main category: cs.LG

TL;DR: The paper introduces Low-probability Regularization (Lp-Reg) to address exploration collapse in RLVR training by preserving valuable low-probability tokens called “reasoning sparks” through regularization towards a heuristic proxy distribution.

DetailsMotivation: RLVR training suffers from performance plateaus due to policy entropy collapse, where valuable exploratory tokens are systematically eliminated through over-penalization, leading to degenerated exploration.

Method: Lp-Reg regularizes the policy towards a heuristic proxy distribution constructed by filtering out noise tokens and re-normalizing over remaining candidates, amplifying reasoning sparks via KL divergence regularization.

Result: Lp-Reg enables stable on-policy training for ~1,000 steps where baseline methods collapse, achieving 60.17% average accuracy on five math benchmarks (2.66% improvement over prior methods).

Conclusion: Preserving reasoning sparks through targeted regularization enables sustained exploration and state-of-the-art performance in RLVR training.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term “reasoning sparks”. We find that while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less-noisy proxy where the probability of reasoning sparks is amplified, which then serves as a soft regularization target to shield these valuable tokens from elimination via KL divergence. Experiments show that Lp-Reg enables stable on-policy training for around 1,000 steps, a regime where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a $60.17\%$ average accuracy on five math benchmarks, an improvement of $2.66\%$ over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.
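
A minimal sketch of the proxy construction and KL term, under a hedged reading of the abstract. The probability floor `p_min` and weight `tau` are illustrative stand-ins, not the paper's values, and the noise filter here is a simple threshold:

```python
import torch

def lp_reg_loss(logits, p_min=1e-4, tau=0.1):
    """Build a proxy distribution by zeroing presumed-noise tokens below a
    probability floor and re-normalizing, then penalize KL(proxy || policy)
    so surviving low-probability 'reasoning sparks' are shielded from elimination."""
    probs = logits.softmax(dim=-1)
    proxy = torch.where(probs > p_min, probs, torch.zeros_like(probs))
    proxy = proxy / proxy.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    kl = (proxy * (proxy.clamp_min(1e-12).log()
                   - probs.clamp_min(1e-12).log())).sum(-1)
    return tau * kl.mean()
```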

[357] Why Do We Need Warm-up? A Theoretical Perspective

Foivos Alimisis, Rustem Islamov, Aurelien Lucchi

Main category: cs.LG

TL;DR: This paper provides a theoretical explanation for why learning rate warm-up improves deep learning training, establishing faster convergence bounds under a generalized smoothness condition.

DetailsMotivation: Learning rate warm-up is widely used in deep learning but lacks theoretical foundations. The authors aim to provide principled explanations for why warm-up improves training.

Method: The authors use a generalized (L₀, L₁)-smoothness condition that bounds local curvature as a linear function of loss sub-optimality. They prove convergence bounds for Gradient Descent with warm-up schedules under this assumption.

Result: Theoretical analysis shows Gradient Descent with warm-up achieves faster convergence than fixed step-size methods. Experiments validate the theory on language and vision models.

Conclusion: Warm-up schedules provide provable benefits for training neural networks under realistic smoothness conditions, with both theoretical guarantees and empirical validation.

Abstract: Learning rate warm-up - increasing the learning rate at the beginning of training - has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the $(L_0, L_1)$-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We demonstrate both theoretically and empirically that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses. Under this assumption, we prove that Gradient Descent with a warm-up schedule achieves faster convergence than with a fixed step-size, establishing upper and lower complexity bounds. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the practical benefits of warm-up schedules.
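
For concreteness, a linear warm-up of the kind the paper analyzes: small steps while the loss (and hence, under the generalized (L0, L1)-smoothness condition, the local curvature) is large, ramping to the full step size. The horizon and learning rate below are illustrative, not the paper's settings:

```python
import torch

def warmup_factor(step, warmup_steps=500):
    """Multiplier on the base learning rate: ramps linearly from ~0 to 1."""
    return min(1.0, (step + 1) / warmup_steps)

params = [torch.nn.Parameter(torch.zeros(10))]
opt = torch.optim.SGD(params, lr=1e-2)
sched = torch.optim.lr_scheduler.LambdaLR(opt, warmup_factor)
# training loop: opt.step(); sched.step()  -- factor reaches 1.0 after 500 steps
```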

[358] FTTE: Federated Learning on Resource-Constrained Devices

Irene Tenison, Anna Murphy, Charles Beauville, Lalana Kagal

Main category: cs.LG

TL;DR: FTTE is a semi-asynchronous federated learning framework that uses sparse parameter updates and staleness-weighted aggregation to achieve faster convergence, lower memory usage, and reduced communication compared to traditional FL methods, making it suitable for resource-constrained edge devices.

DetailsMotivation: Federated learning on resource-constrained edge devices faces challenges due to limited memory, energy, and communication bandwidth. Traditional synchronous and asynchronous FL approaches suffer from straggler delays and slow convergence in heterogeneous networks.

Method: FTTE employs sparse parameter updates and a staleness-weighted aggregation mechanism that considers both the age and variance of client updates in a semi-asynchronous framework.

Result: Extensive experiments show FTTE achieves 81% faster convergence, 80% lower on-device memory usage, and 69% communication payload reduction compared to synchronous FL (FedAVG), while maintaining comparable or higher accuracy than semi-asynchronous methods (FedBuff) even with 90% stragglers.

Conclusion: FTTE establishes itself as the first practical and scalable solution for real-world FL deployments on heterogeneous and predominantly resource-constrained edge devices.

Abstract: Federated learning (FL) enables collaborative model training across distributed devices while preserving data privacy, but deployment on resource-constrained edge nodes remains challenging due to limited memory, energy, and communication bandwidth. Traditional synchronous and asynchronous FL approaches further suffer from straggler-induced delays and slow convergence in heterogeneous, large-scale networks. We present FTTE (Federated Tiny Training Engine), a novel semi-asynchronous FL framework that uniquely employs sparse parameter updates and a staleness-weighted aggregation based on both age and variance of client updates. Extensive experiments across diverse models and data distributions - including up to 500 clients and 90% stragglers - demonstrate that FTTE not only achieves 81% faster convergence, 80% lower on-device memory usage, and 69% communication payload reduction than synchronous FL (e.g., FedAVG), but also consistently reaches comparable or higher target accuracy than semi-asynchronous FL (e.g., FedBuff) in challenging regimes. These results establish FTTE as the first practical and scalable solution for real-world FL deployments on heterogeneous and predominantly resource-constrained edge devices.
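
A minimal sketch of a staleness-weighted aggregation step under our reading of the abstract (down-weighting updates by both age and variance); the exact weighting function FTTE uses may differ, and the tuple interface is our own:

```python
import numpy as np

def aggregate(global_w, updates):
    """updates: list of (delta, staleness, variance) tuples, where delta is a
    client's (sparse) parameter update as a dense array. Older and noisier
    updates contribute less to the aggregate."""
    weights = np.array([1.0 / ((1 + s) * (1 + v)) for _, s, v in updates])
    weights /= weights.sum()
    agg = sum(w * d for w, (d, _, _) in zip(weights, updates))
    return global_w + agg
```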

[359] Q-Learning with Shift-Aware Upper Confidence Bound in Non-Stationary Reinforcement Learning

Ha Manh Bui, Felix Parker, Kimia Ghobadi, Anqi Liu

Main category: cs.LG

TL;DR: Proposes Density-QUCB (DQUCB), a shift-aware Q-learning UCB algorithm that uses transition density functions to detect distribution shifts and improve uncertainty estimation, achieving better regret guarantees than standard QUCB in non-stationary RL environments.

DetailsMotivation: Standard Q-learning UCB algorithms can discover proper policies but may exploit sub-optimal rewards after distribution shifts occur in non-stationary RL environments, leading to performance degradation.

Method: DQUCB uses transition density functions to detect distribution shifts and leverages likelihood to enhance uncertainty estimation in Q-learning UCB, balancing exploration and exploitation in both finite-horizon episodic and infinite-horizon discounted MDPs.

Result: Theoretically proven to achieve better regret guarantees than QUCB. Empirically outperforms QUCB baselines with lower regret across RL tasks and real-world COVID-19 patient hospital allocation using Deep-Q-learning.

Conclusion: DQUCB effectively addresses non-stationary RL challenges by detecting distribution shifts and improving uncertainty estimation, providing both theoretical guarantees and practical performance improvements in various RL scenarios.

Abstract: We study Non-Stationary Reinforcement Learning (RL) under distribution shifts in both finite-horizon episodic and infinite-horizon discounted Markov Decision Processes (MDPs). In the finite-horizon case, the transition functions may suddenly change at a particular episode. In the infinite-horizon setting, such changes can occur at an arbitrary time step during the agent’s interaction with the environment. While the Q-learning Upper Confidence Bound algorithm (QUCB) can discover a proper policy during learning, due to the distribution shifts, this policy can exploit sub-optimal rewards after the shift happens. To address this issue, we propose Density-QUCB (DQUCB), a shift-aware Q-learning UCB algorithm, which uses a transition density function to detect distribution shifts, then leverages its likelihood to enhance the uncertainty estimation quality of Q-learning UCB, resulting in a balance between exploration and exploitation. Theoretically, we prove that our oracle DQUCB achieves a better regret guarantee than QUCB. Empirically, our DQUCB enjoys the computational efficiency of model-free RL and outperforms QUCB baselines by having a lower regret across RL tasks, as well as a real-world COVID-19 patient hospital allocation task using a Deep-Q-learning architecture.
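
One plausible reading of the mechanism, shown only for intuition (this is not the paper's formula): start from the usual QUCB exploration bonus and inflate it when the learned transition-density likelihood of recent transitions is low, i.e., when a shift is suspected:

```python
import numpy as np

def shift_aware_bonus(visit_count, t, density_likelihood, c=1.0):
    """Standard count-based UCB bonus, divided by the (clipped) transition
    density likelihood so that suspected shifts trigger more exploration."""
    base = c * np.sqrt(np.log(t + 1) / max(visit_count, 1))
    return base / max(density_likelihood, 1e-3)  # low likelihood => explore more
```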

[360] PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning

Wanjia Zhao, Qinwei Ma, Jingzhe Shi, Shirley Wu, Jiaqi Han, Yijia Xiao, Si-Yuan Chen, Xiao Luo, Ludwig Schmidt, James Zou

Main category: cs.LG

TL;DR: PRISM-Physics is a process-level evaluation framework for physics reasoning that uses directed acyclic graphs (DAGs) to represent solution steps with causal dependencies, enabling fine-grained scoring without heuristic LLM judgments.

DetailsMotivation: Existing physics benchmarks only evaluate final answers, failing to capture reasoning processes, while stepwise methods rely on unreliable heuristic scoring or restrictive linear assumptions.

Method: Represent solutions as DAGs of formulas with explicit causal dependencies, use rule-based symbolic formula equivalence matching for consistent validation, and prove optimality of DAG representation and scoring policy.

Result: The framework aligns better with human expert scoring, reveals persistent reasoning failures in state-of-the-art LLMs, and provides diagnostic insights through step-level scoring.

Conclusion: PRISM-Physics provides a principled foundation for process-level evaluation in physics, combining structural rigor, theoretical guarantees, and symbolic validation to advance scientific reasoning capabilities.

Abstract: Benchmarks for competition-style reasoning have advanced evaluation in mathematics and programming, yet physics remains comparatively underexplored. Most existing physics benchmarks evaluate only final answers, which fail to capture reasoning processes, while recent stepwise methods rely on heuristic LLM-as-judge scoring or restrictive linear assumptions, limiting reliability and diagnostic validity. We introduce PRISM-Physics, a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas, explicitly encoding causal dependencies among intermediate steps to enable fine-grained, interpretable, and theoretically grounded scoring. We prove the optimality of the DAG representation and the corresponding scoring policy. Combining this with a fully rule-based method for symbolic formula equivalence matching that we developed, we ensure consistent validation across diverse formulations without heuristic judgments. Results show that our evaluation framework is more aligned with human experts’ scoring. Experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics, while step-level scoring offers both diagnostic insight and rich signals for later training. By combining structural rigor, theoretical guarantees, and symbolic validation, PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities.
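
A toy sketch of DAG-based step scoring under an assumed credit rule (a step counts only if it matches a reference formula and all its causal parents were themselves credited); string equality below stands in for the paper's symbolic equivalence matcher, and the exact scoring policy may differ:

```python
def score_solution(steps, dag_parents, reference):
    """steps: ordered formula strings; dag_parents: step index -> parent
    indices; reference: step index -> reference formula."""
    credited = set()
    for i, step in enumerate(steps):
        matches = step == reference.get(i)                 # stub for symbolic matching
        if matches and all(p in credited for p in dag_parents.get(i, [])):
            credited.add(i)
    return len(credited) / max(len(reference), 1)

ref = {0: "F = m*a", 1: "a = F/m"}
print(score_solution(["F = m*a", "a = F/m"], {1: [0]}, ref))  # 1.0
```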

[361] Superposition disentanglement of neural representations reveals hidden alignment

André Longon, David Klindt, Meenakshi Khosla

Main category: cs.LG

TL;DR: Superposition in neural representations can interfere with alignment metrics, causing models with the same features but different superposition arrangements to appear less aligned. Disentangling superposition through sparse autoencoders improves alignment scores.

DetailsMotivation: To investigate whether superposition arrangements in neural representations interfere with alignment metrics, causing models with identical features but different superposition patterns to appear less aligned than they actually are.

Method: Developed theory on permutation metrics’ dependence on superposition, trained sparse autoencoders (SAEs) to disentangle superposition in toy models, and tested alignment metrics (semi-matching, soft-matching, linear regression) on DNN→DNN and DNN→brain mappings.

Result: Alignment scores increased when base neurons were replaced with sparse overcomplete latent codes from SAEs. Similar improvements were found for DNN→DNN and DNN→brain linear regression alignment in visual domain.

Conclusion: Superposition disentanglement is necessary for mapping metrics to accurately measure true representational alignment between neural codes, as superposition arrangements can artificially lower alignment scores.

Abstract: The superposition hypothesis states that a single neuron within a population may participate in the representation of multiple features in order for the population to represent more features than the number of neurons. In neuroscience and AI, representational alignment metrics measure the extent to which different deep neural networks (DNNs) or brains represent similar information. In this work, we explore a critical question: does superposition interact with alignment metrics in any undesirable way? We hypothesize that models which represent the same features in different superposition arrangements, i.e., their neurons have different linear combinations of the features, will interfere with predictive mapping metrics (semi-matching, soft-matching, linear regression), producing lower alignment than expected. We first develop a theory for how the strict permutation metrics are dependent on superposition arrangements. This is tested by training sparse autoencoders (SAEs) to disentangle superposition in toy models, where alignment scores are shown to typically increase when a model’s base neurons are replaced with its sparse overcomplete latent codes. We find similar increases for DNN→DNN and DNN→brain linear regression alignment in the visual domain. Our results suggest that superposition disentanglement is necessary for mapping metrics to uncover the true representational alignment between neural codes.
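
For orientation, a minimal sparse autoencoder of the kind used to disentangle superposition: an overcomplete ReLU latent trained with reconstruction plus an L1 sparsity penalty, after which alignment metrics are computed on the latent codes instead of the base neurons. Sizes and the penalty weight are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in=128, d_latent=1024):   # overcomplete: d_latent >> d_in
        super().__init__()
        self.enc = nn.Linear(d_in, d_latent)
        self.dec = nn.Linear(d_latent, d_in)

    def forward(self, x):
        z = torch.relu(self.enc(x))                 # sparse latent code
        return self.dec(z), z

def sae_loss(model, x, l1=1e-3):
    x_hat, z = model(x)
    return ((x_hat - x) ** 2).mean() + l1 * z.abs().mean()
```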

[362] Estimation of Resistance Training RPE using Inertial Sensors and Electromyography

James Thomas, Johan Wahlström

Main category: cs.LG

TL;DR: Machine learning models using wearable sensors can estimate perceived exertion during bicep curls, with random forest achieving 41.4% exact accuracy and 85.9% ±1 RPE accuracy.

DetailsMotivation: Accurate RPE estimation can enhance resistance training through personalized feedback and injury prevention.

Method: Used wearable inertial and EMG sensors to collect data from 69 sets and over 1000 repetitions during single-arm dumbbell bicep curls, with statistical features extracted for model training.

Result: Random forest classifier performed best with 41.4% exact accuracy and 85.9% ±1 RPE accuracy. EMG data provided slight improvement over inertial sensors alone. Eccentric repetition time was identified as the strongest RPE predictor.

Conclusion: Demonstrates feasibility of wearable-sensor-based RPE estimation but identifies challenges for improving model generalizability, particularly regarding EMG data quality and placement sensitivity.

Abstract: Accurate estimation of rating of perceived exertion (RPE) can enhance resistance training through personalized feedback and injury prevention. This study investigates the application of machine learning models to estimate RPE during single-arm dumbbell bicep curls, using data from wearable inertial and electromyography (EMG) sensors. A custom dataset of 69 sets and over 1000 repetitions was collected, with statistical features extracted for model training. Among the models evaluated, a random forest classifier achieved the highest performance, with 41.4% exact accuracy and 85.9% ±1 RPE accuracy. While the inclusion of EMG data slightly improved model accuracy over inertial sensors alone, its utility may have been limited by factors such as data quality and placement sensitivity. Feature analysis highlighted eccentric repetition time as the strongest RPE predictor. The results demonstrate the feasibility of wearable-sensor-based RPE estimation and identify key challenges for improving model generalizability.
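
A sketch of the evaluation protocol with synthetic stand-in features (the paper uses statistical features from inertial/EMG windows, such as eccentric repetition time; the 5-10 label range and feature count below are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))            # stand-in statistical features per rep
y = rng.integers(5, 11, size=1000)         # RPE labels on an assumed 5-10 scale

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[:800], y[:800])
pred = clf.predict(X[800:])
exact = (pred == y[800:]).mean()
within1 = (np.abs(pred - y[800:]) <= 1).mean()   # the paper's ±1 RPE accuracy
print(f"exact={exact:.3f}, ±1={within1:.3f}")
```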

[363] Best-of-Majority: Minimax-Optimal Strategy for Pass@$k$ Inference Scaling

Qiwei Di, Kaixuan Ji, Xuheng Li, Heyang Zhao, Quanquan Gu

Main category: cs.LG

TL;DR: The paper proposes Best-of-Majority (BoM), a new LLM inference strategy that combines majority voting and Best-of-N approaches to achieve optimal scaling in Pass@k settings, with proven minimax optimality and improved performance on math problems.

DetailsMotivation: Current single-shot selection strategies like majority voting and Best-of-N underperform on difficult tasks in Pass@k inference settings, and neither exhibits desirable scaling with k and sampling budget N.

Method: Proposed Best-of-Majority (BoM) strategy that restricts candidates to high-frequency responses from N samples before selecting top-k rewards, combining advantages of majority voting and BoN.

Result: Proved that with sampling budget N=Ω(C*), BoM achieves regret O(ε_opt + √(ε_RM²C*/k)), established matching lower bound showing minimax optimality, and experimental results show BoM outperforms both majority voting and BoN on math problems.

Conclusion: BoM is a minimax optimal inference strategy that addresses scaling limitations of existing methods, maintains performance when increasing N, and demonstrates superior practical performance on challenging tasks.

Abstract: LLM inference often generates a batch of candidates for a prompt and selects one via strategies like majority voting or Best-of-N (BoN). For difficult tasks, this single-shot selection often underperforms. Consequently, evaluations commonly report Pass@$k$: the agent may submit up to $k$ responses, and only the best of them is used when computing regret. Motivated by this, we study inference scaling in the more general Pass@$k$ inference setting, and prove that neither majority voting nor BoN exhibits the desirable scaling with $k$ and the sampling budget $N$. Combining the advantages of majority voting and BoN, we propose a new inference strategy called Best-of-Majority (BoM), with a pivotal step that restricts the candidates to the responses with high frequency in the $N$ samples before selecting the top-$k$ rewards. We prove that when the sampling budget is $N=\tilde\Omega(C^*)$, the regret of BoM is $O(\epsilon_{\mathrm{opt}}+\sqrt{\epsilon_{\mathrm{RM}}^2 C^*/k})$, where $C^*$ is the coverage coefficient, $\epsilon_{\mathrm{RM}}$ is the estimation error of the reward model, and $\epsilon_{\mathrm{opt}}$ is the estimation error of reward at the optimal response. We further establish a matching lower bound, certifying that our algorithm is minimax optimal. Beyond optimality, BoM has a key advantage: unlike majority voting and BoN, its performance does not degrade when increasing $N$. Experimental results of inference on math problems show BoM outperforming both majority voting and BoN.
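
The selection rule is easy to state in code. A minimal sketch, where the absolute count threshold `m` is a stand-in for the paper's frequency criterion:

```python
from collections import Counter

def best_of_majority(samples, reward, k=1, m=3):
    """Restrict the N samples to high-frequency responses, then return the
    top-k of those by reward model score."""
    counts = Counter(samples)
    frequent = [r for r, c in counts.items() if c >= m]
    if not frequent:                      # fallback: plain majority vote
        frequent = [counts.most_common(1)[0][0]]
    return sorted(frequent, key=reward, reverse=True)[:k]

answers = ["42", "42", "41", "42", "7", "41", "41"]
print(best_of_majority(answers, reward=lambda r: float(r), k=1, m=2))  # ['42']
```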

[364] To Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable Reinforcement Learning

Yuda Song, Dhruv Rohatgi, Aarti Singh, J. Andrew Bagnell

Main category: cs.LG

TL;DR: The paper analyzes the trade-off between privileged expert distillation and standard RL in partially observable environments, finding that the effectiveness depends on latent dynamics stochasticity and that the optimal latent policy isn’t always the best to distill.

DetailsMotivation: Partial observability is challenging in RL, and while privileged expert distillation (using latent state info during training) is computationally efficient, it has known failure modes that need investigation.

Method: Theoretical analysis using a perturbed Block MDP model and controlled experiments on simulated locomotion tasks to compare privileged expert distillation vs standard RL without privileged information.

Result: (1) The trade-off depends on latent dynamics stochasticity, as predicted by contrasting approximate decodability with belief contraction; (2) The optimal latent policy is not always the best to distill.

Conclusion: The findings provide new guidelines for effectively exploiting privileged information, potentially advancing policy learning efficiency in partially observable domains.

Abstract: Partial observability is a notorious challenge in reinforcement learning (RL), due to the need to learn complex, history-dependent policies. Recent empirical successes have used privileged expert distillation, which leverages the availability of latent state information during training (e.g., from a simulator) to learn and imitate the optimal latent, Markovian policy, to disentangle the task of “learning to see” from “learning to act”. While expert distillation is more computationally efficient than RL without latent state information, it also has well-documented failure modes. In this paper, through a simple but instructive theoretical model called the perturbed Block MDP and controlled experiments on challenging simulated locomotion tasks, we investigate the algorithmic trade-off between privileged expert distillation and standard RL without privileged information. Our main findings are: (1) The trade-off empirically hinges on the stochasticity of the latent dynamics, as theoretically predicted by contrasting approximate decodability with belief contraction in the perturbed Block MDP; and (2) The optimal latent policy is not always the best latent policy to distill. Our results suggest new guidelines for effectively exploiting privileged information, potentially advancing the efficiency of policy learning across many practical partially observable domains.

[365] Mutual Information Guided Backdoor Mitigation for Pre-trained Encoders

Tingxu Han, Weisong Sun, Ziqi Ding, Chunrong Fang, Hanwei Qian, Jiaxun Li, Zhenyu Chen, Xiangyu Zhang

Main category: cs.LG

TL;DR: MIMIC is a mutual information guided backdoor mitigation technique for pre-trained encoders in self-supervised learning, using knowledge distillation with random initialization to remove backdoors while maintaining performance.

DetailsMotivation: Pre-trained encoders by SSL are vulnerable to backdoor attacks, and existing mitigation techniques designed for downstream tasks are ineffective for pre-trained encoders due to lack of label information during pre-training.

Method: Uses knowledge distillation with random initialization of student encoder, leverages mutual information to locate benign knowledge in teacher net, and employs distillation loss with clone loss and attention loss to mitigate backdoors while maintaining performance.

Result: MIMIC significantly reduces attack success rate using <5% of clean data, outperforming seven state-of-the-art backdoor mitigation techniques on two SSL backdoor attacks.

Conclusion: MIMIC effectively addresses backdoor attacks in pre-trained SSL encoders through mutual information guided knowledge distillation, achieving strong mitigation with minimal clean data.

Abstract: Self-supervised learning (SSL) is increasingly attractive for pre-training encoders without requiring labeled data. Downstream tasks built on top of those pre-trained encoders can achieve nearly state-of-the-art performance. Encoders pre-trained by SSL, however, are vulnerable to backdoor attacks, as demonstrated by existing studies. Numerous backdoor mitigation techniques are designed for downstream task models. However, their effectiveness is impaired and limited when adapted to pre-trained encoders, due to the lack of label information during pre-training. To address backdoor attacks against pre-trained encoders, in this paper we propose a mutual information guided backdoor mitigation technique, named MIMIC. MIMIC treats the potentially backdoored encoder as the teacher net and employs knowledge distillation to distill a clean student encoder from the teacher net. Different from existing knowledge distillation approaches, MIMIC initializes the student with random weights, inheriting no backdoors from teacher nets. Then MIMIC leverages mutual information between each layer and extracted features to locate where benign knowledge lies in the teacher net, with which distillation is deployed to clone clean features from teacher to student. We craft the distillation loss with two aspects, including clone loss and attention loss, aiming to mitigate backdoors and maintain encoder performance at the same time. Our evaluation conducted on two backdoor attacks in SSL demonstrates that MIMIC can significantly reduce the attack success rate by only utilizing <5% of clean data, surpassing seven state-of-the-art backdoor mitigation techniques.

[366] Graph Neural Networks for Transmission Grid Topology Control: Busbar Information Asymmetry and Heterogeneous Representations

Matthijs de Jong, Jan Viebahn, Yuliya Shapovalova

Main category: cs.LG

TL;DR: This paper investigates how graph representation affects GNN effectiveness for power grid topology control, identifying busbar information asymmetry in homogeneous graphs and proposing a heterogeneous graph solution that outperforms other methods.

DetailsMotivation: Grid congestion is a pressing problem due to renewable energy proliferation and electrification. Traditional topology control methods are too slow, and while ML approaches show promise, the effect of graph representation on GNN performance needs investigation.

Method: The study compares homogeneous and heterogeneous graph representations for GNNs in topology control, identifying busbar information asymmetry in homogeneous graphs. Models are evaluated on an imitation learning task using classification accuracy and grid operation ability metrics.

Result: Heterogeneous GNNs perform best on in-distribution network configurations, followed by FCNNs, and then homogeneous GNNs. Both GNN types generalize better to out-of-distribution configurations than FCNNs.

Conclusion: The proposed heterogeneous graph representation resolves busbar information asymmetry and improves GNN performance for topology control, with GNNs showing better generalization than fully connected networks.

Abstract: Factors such as the proliferation of renewable energy and electrification contribute to grid congestion as a pressing problem. Topology control is an appealing method for relieving congestion, but traditional approaches for topology discovery have proven too slow for practical application. Recent research has focused on machine learning (ML) as an efficient alternative. Graph neural networks (GNNs) are particularly well-suited for topology control applications due to their ability to model the graph structure of power grids. This study investigates the effect of the graph representation on GNN effectiveness for topology control. We identify the busbar information asymmetry problem inherent to the popular homogeneous graph representation. We propose a heterogeneous graph representation that resolves this problem. We apply GNNs with both representations and a fully connected neural network (FCNN) baseline on an imitation learning task. The models are evaluated by classification accuracy and grid operation ability. We find that heterogeneous GNNs perform best on in-distribution network configurations, followed by FCNNs, and lastly, homogeneous GNNs. We also find that both GNN types generalize better to out-of-distribution network configurations than FCNNs.

[367] Rethinking the Vulnerability of Concept Erasure and a New Method

Alex D. Richardson, Kaicheng Zhang, Lucas Beerens, Dongdong Chen

Main category: cs.LG

TL;DR: RECORD is a novel coordinate-descent-based algorithm that exposes vulnerabilities in concept-erased text-to-image models by efficiently restoring supposedly erased concepts, outperforming existing restoration methods by up to 17.8 times.

DetailsMotivation: Current concept erasure methods for text-to-image diffusion models are vulnerable to restoration attacks, revealing critical security gaps in privacy protection mechanisms.

Method: The paper investigates adversarial vulnerabilities in concept-erased models and introduces RECORD, a coordinate-descent-based restoration algorithm that exploits vulnerabilities in the prompt embedding space.

Result: RECORD consistently outperforms existing restoration methods by up to 17.8 times, demonstrating that erased concepts can be efficiently recovered through adversarial prompting.

Conclusion: Current concept erasure defense mechanisms have fundamental vulnerabilities that can be exploited by advanced restoration algorithms like RECORD, highlighting the need for more robust privacy protection methods.

Abstract: The proliferation of text-to-image diffusion models has raised significant privacy and security concerns, particularly regarding the generation of copyrighted or harmful images. In response, concept erasure (defense) methods have been developed to “unlearn” specific concepts through post-hoc finetuning. However, recent concept restoration (attack) methods have demonstrated that these supposedly erased concepts can be recovered using adversarially crafted prompts, revealing a critical vulnerability in current defense mechanisms. In this work, we first investigate the fundamental sources of adversarial vulnerability and reveal that vulnerabilities are pervasive in the prompt embedding space of concept-erased models, a characteristic inherited from the original pre-unlearned model. Furthermore, we introduce RECORD, a novel coordinate-descent-based restoration algorithm that consistently outperforms existing restoration methods by up to 17.8 times. We conduct extensive experiments to assess its compute-performance tradeoff and propose acceleration strategies.

[368] Generative Modeling of Weights: Generalization or Memorization?

Boya Zeng, Yida Yin, Zhiqiu Xu, Zhuang Liu

Main category: cs.LG

TL;DR: Current generative models for neural network weights largely memorize training checkpoints rather than creating novel weights, failing to outperform simple baselines like adding noise or weight ensembles.

DetailsMotivation: To examine whether generative models can truly synthesize novel neural network weights rather than just memorizing training checkpoints, as claimed in prior work.

Method: Analyzed four representative generative methods on their ability to generate novel weights, comparing them against simple baselines like adding noise to weights and weight ensembles.

Result: Found that current methods primarily produce replicas or simple interpolations of training checkpoints, failing to generate truly novel weights and underperforming compared to simple baselines.

Conclusion: Memorization likely results from limited data, overparameterized models, and underuse of structural priors, highlighting the need for more careful design and rigorous evaluation in this domain.

Abstract: Generative models have recently been explored for synthesizing neural network weights. These approaches take neural network checkpoints as training data and aim to generate high-performing weights during inference. In this work, we examine four representative, well-known methods on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. Contrary to claims in prior work, we find that these methods synthesize weights largely by memorization: they produce either replicas, or, at best, simple interpolations of the training checkpoints. Moreover, they fail to outperform simple baselines, such as adding noise to the weights or taking a simple weight ensemble, in obtaining different and simultaneously high-performing models. Our further analysis suggests that this memorization might result from limited data, overparameterized models, and the underuse of structural priors specific to weight data. These findings highlight the need for more careful design and rigorous evaluation of generative models when applied to new domains. Our code is available at https://github.com/boyazeng/weight_memorization.
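
The memorization probe is simple to express. A sketch of the nearest-checkpoint check: for each generated weight set, find its closest training checkpoint by cosine similarity; values near 1.0 indicate replicas rather than novel weights (the toy shapes below are illustrative):

```python
import torch
import torch.nn.functional as F

def max_train_similarity(generated, checkpoints):
    """Row i: cosine similarity between generated weight set i and its
    nearest training checkpoint."""
    g = F.normalize(generated.flatten(1), dim=1)
    c = F.normalize(checkpoints.flatten(1), dim=1)
    return (g @ c.T).max(dim=1).values

gen = torch.randn(4, 1000)        # 4 generated weight vectors (toy)
train = torch.randn(100, 1000)    # 100 training checkpoints (toy)
print(max_train_similarity(gen, train))
```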

[369] A Multi-Fidelity Control Variate Approach for Policy Gradient Estimation

Xinjie Liu, Cyrus Neary, Kushagra Gupta, Wesley A. Suttle, Christian Ellis, Ufuk Topcu, David Fridovich-Keil

Main category: cs.LG

TL;DR: MFPG is a reinforcement learning framework that mixes small amounts of high-fidelity target environment data with large volumes of low-fidelity simulation data to create variance-reduced policy gradient estimators, enabling efficient training with limited expensive data.

DetailsMotivation: Many RL algorithms require large amounts of data, making them impractical for operational systems or expensive high-fidelity simulations. Low-fidelity simulators can provide cheap training data but are too coarse for direct transfer.

Method: Multi-fidelity policy gradients (MFPG) framework that combines a small amount of target environment data with a control variate from abundant low-fidelity data to create unbiased, variance-reduced estimators for on-policy policy gradients. Instantiated with a multi-fidelity REINFORCE variant.

Result: MFPG guarantees asymptotic convergence to locally optimal policies and achieves faster finite-sample convergence than high-fidelity-only training. In robotics benchmarks, it reliably improves median performance over high-fidelity baseline, matches leading multi-fidelity methods, and shows strongest robustness under large dynamics gaps. Effective even with low-fidelity reward misspecification.

Conclusion: MFPG provides a novel paradigm for efficient sim-to-real transfer and a principled approach to balance policy performance with data collection costs, offering simplicity and minimal tuning overhead.

Abstract: Many reinforcement learning (RL) algorithms are impractical for deployment in operational systems or for training with computationally expensive high-fidelity simulations, as they require large amounts of data. Meanwhile, low-fidelity simulators – such as reduced-order models, heuristic rewards, or generative world models – can cheaply provide useful data for RL training, even if they are too coarse for zero-shot transfer. We propose multi-fidelity policy gradients (MFPGs), an RL framework that mixes a small amount of data from the target environment with a control variate formed from a large volume of low-fidelity simulation data to construct an unbiased, variance-reduced estimator for on-policy policy gradients. We instantiate the framework with a multi-fidelity variant of the classical REINFORCE algorithm. We show that under standard assumptions, the MFPG estimator guarantees asymptotic convergence of REINFORCE to locally optimal policies in the target environment, and achieves faster finite-sample convergence rates compared to training with high-fidelity data alone. Empirically, we evaluate the MFPG algorithm across a suite of simulated robotics benchmark tasks with limited high-fidelity data but abundant off-dynamics, low-fidelity data. With mild-moderate dynamics gaps, MFPG reliably improves the median performance over a high-fidelity-only baseline, matching the performance of leading multi-fidelity baselines despite its simplicity and minimal tuning overhead. Under large dynamics gaps, MFPG demonstrates the strongest robustness among the evaluated multi-fidelity approaches. An additional experiment shows that MFPG can remain effective even under low-fidelity reward misspecification. Thus, MFPG not only offers a novel paradigm for efficient sim-to-real transfer but also provides a principled approach to managing the trade-off between policy performance and data collection costs.
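
A sketch of the control-variate step in the spirit of MFPG, under assumed interfaces: `g_hi` and `g_lo_paired` are correlated per-sample gradient estimates from the small high-fidelity batch, and `g_lo_pool` comes from a large low-fidelity pool. The estimator is unbiased for the high-fidelity gradient for any `c` (assuming paired and pool low-fidelity samples share a mean), and `c` is chosen to minimize variance:

```python
import numpy as np

def mfpg_gradient(g_hi, g_lo_paired, g_lo_pool, c=1.0):
    """Control-variate correction: subtract the correlated low-fidelity
    gradients from the high-fidelity batch, then add back the low-fidelity
    mean estimated from the cheap, abundant pool."""
    return (g_hi - c * g_lo_paired).mean(axis=0) + c * g_lo_pool.mean(axis=0)
```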

[370] Towards Quantifying Long-Range Interactions in Graph Machine Learning: a Large Graph Dataset and a Measurement

Huidong Liang, Haitz Sáez de Ocáriz Borde, Baskaran Sripathmanathan, Michael Bronstein, Xiaowen Dong

Main category: cs.LG

TL;DR: Introduces City-Networks, a large-scale transductive dataset from real city road networks with 100k+ nodes and large diameters to study long-range dependencies in graph learning, along with a model-agnostic measurement method.

DetailsMotivation: Current graph datasets are small and focus on inductive tasks, lacking proper evaluation of long-range dependencies. Existing comparisons between global attention and local aggregation models don't directly measure long-range interactions.

Method: Created City-Networks dataset from real city road networks with large diameters and annotated based on node eccentricities. Proposed Jacobian-based measurement for quantifying long-range dependencies across distant hops.

Result: Developed a dataset with graphs over 100k nodes and larger diameters than existing benchmarks, naturally containing long-range information. Provided theoretical justification for dataset design and measurement method focusing on over-smoothing and influence score dilution.

Conclusion: City-Networks dataset and the proposed measurement establish a robust foundation for studying long-range dependencies in graph neural networks, addressing limitations in current benchmarks and evaluation methods.

Abstract: Long-range dependencies are critical for effective graph representation learning, yet most existing datasets focus on small graphs tailored to inductive tasks, offering limited insight into long-range interactions. Current evaluations primarily compare models employing global attention (e.g., graph transformers) with those using local neighborhood aggregation (e.g., message-passing neural networks) without a direct measurement of long-range dependency. In this work, we introduce City-Networks, a novel large-scale transductive learning dataset derived from real-world city road networks. This dataset features graphs with over 100k nodes and significantly larger diameters than those in existing benchmarks, naturally embodying long-range information. We annotate the graphs based on local node eccentricities, ensuring that the classification task inherently requires information from distant nodes. Furthermore, we propose a model-agnostic measurement based on the Jacobians of neighbors from distant hops, offering a principled quantification of long-range dependencies. Finally, we provide theoretical justifications for both our dataset design and the proposed measurement, particularly by focusing on over-smoothing and influence score dilution, establishing a robust foundation for further exploration of long-range interactions in graph neural networks.
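
A minimal sketch of a Jacobian-based influence score of this kind: the gradient norm of a node's output with respect to a distant node's input features. The paper's measurement aggregates such quantities over hops; the single-pair version and the `model(x, edge_index)` interface below are our simplifications:

```python
import torch

def influence(model, x, edge_index, src, dst):
    """How much node src's input features influence node dst's prediction,
    for any GNN whose forward takes (node features, edge index)."""
    x = x.clone().requires_grad_(True)
    out = model(x, edge_index)                       # (n_nodes, n_out)
    grad = torch.autograd.grad(out[dst].sum(), x)[0]
    return grad[src].norm().item()
```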

[371] Ultra-Efficient Decoding for End-to-End Neural Compression and Reconstruction

Ethan G. Rogers, Cheng Wang

Main category: cs.LG

TL;DR: A new neural compression framework using low-rank representations and vector quantization to eliminate decoder bottlenecks while maintaining image quality.

DetailsMotivation: Current neural compression methods suffer from high computational complexity and large costs in convolution-based decoders during reconstruction, hindering adoption.

Method: Incorporates low-rank representation in an autoencoder with vector quantization, performing computationally efficient low-rank operations on learned latent representations.

Result: Dramatically reduces computational overhead in decoding phase while maintaining high fidelity of image outputs.

Conclusion: The approach effectively eliminates the decoder compute bottleneck in neural compression/reconstruction systems.

Abstract: Image compression and reconstruction are crucial for various digital applications. While contemporary neural compression methods achieve impressive compression rates, the adoption of such technology has been largely hindered by the complexity and large computational costs of the convolution-based decoders during data reconstruction. To address the decoder bottleneck in neural compression, we develop a new compression-reconstruction framework based on incorporating low-rank representation in an autoencoder with vector quantization. We demonstrate that performing a series of computationally efficient low-rank operations on the learned latent representation of images can efficiently reconstruct the data with high quality. Our approach dramatically reduces the computational overhead in the decoding phase of neural compression/reconstruction, essentially eliminating the decoder compute bottleneck while maintaining high fidelity of image outputs.

[372] Activated LoRA: Fine-tuned LLMs for Intrinsics

Kristjan Greenewald, Luis Lastras, Thomas Parnell, Vraj Shah, Lucian Popa, Giulio Zizzo, Chulaka Gunasekara, Ambrish Rawat, David Cox

Main category: cs.LG

TL;DR: aLoRA (Activated LoRA) is an improved adapter architecture that allows instant activation of specialized models during inference without recomputing KV cache, enabling efficient switching between LoRAs in multiturn settings.

DetailsMotivation: Standard LoRA requires recomputing the entire KV cache when switching between adapters in multiturn conversations, which is inefficient for real-time applications.

Method: Modified LoRA framework to only adapt weights for tokens after the aLoRA is invoked, allowing it to accept the base model’s KV cache without recomputation.

Result: aLoRA achieves competitive accuracy with standard LoRA while significantly improving inference efficiency by enabling instant activation of specialized models.

Conclusion: aLoRA enables efficient building of “intrinsics” - specialized models that can be instantly activated when needed, making it practical for real-time multiturn applications.

Abstract: Low-Rank Adaptation (LoRA) has emerged as a highly efficient framework for finetuning the weights of large foundation models, and has become the go-to method for data-driven customization of LLMs. Despite the promise of highly customized behaviors and capabilities, switching between relevant LoRAs in a multiturn setting is inefficient, as the key-value (KV) cache of the entire turn history must be recomputed with the LoRA weights before generation can begin. To address this problem, we propose Activated LoRA (aLoRA), an adapter architecture which modifies the LoRA framework to only adapt weights for the tokens in the sequence after the aLoRA is invoked. This change crucially allows aLoRA to accept the base model’s KV cache of the input string, meaning that aLoRA can be instantly activated whenever needed in a chain without recomputing the prior keys and values. This enables building what we call intrinsics, i.e. specialized models invoked to perform well-defined operations on portions of an input chain or conversation that otherwise uses the base model by default. We train a set of aLoRA-based intrinsics models, demonstrating competitive accuracy with standard LoRA while significantly improving inference efficiency. We contributed our Activated LoRA implementation to the Huggingface PEFT library https://github.com/huggingface/peft.
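
A sketch of the central trick, under our reading of the abstract: apply the low-rank delta only at token positions after the adapter is invoked, so earlier positions (and hence the base model's KV cache) are untouched. The rank, scaling, and mask interface are illustrative, not the PEFT implementation:

```python
import torch
import torch.nn as nn

class ALoRALinear(nn.Module):
    """LoRA-adapted linear layer whose delta is masked to post-invocation tokens."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x, activated_mask):
        # x: (batch, seq, in_features); activated_mask: (batch, seq) bool,
        # True only for tokens after the aLoRA was invoked
        out = self.base(x)
        delta = (x @ self.A.T) @ self.B.T * self.scale
        return out + delta * activated_mask.unsqueeze(-1)
```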

[373] Continuous Thought Machines

Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, Llion Jones

Main category: cs.LG

TL;DR: The Continuous Thought Machine (CTM) is a biologically-inspired neural network that incorporates neuron-level temporal processing and neural synchronization as core mechanisms, enabling complex sequential reasoning and adaptive computation.

Motivation: To challenge the paradigm of ignoring neural dynamics in artificial neural networks by reintroducing neural timing and biological realism while maintaining computational tractability.

Method: CTM features two innovations: (1) neuron-level temporal processing with unique weight parameters for processing incoming histories, and (2) neural synchronization as a latent representation that balances biological realism with computational efficiency.

Result: CTM demonstrates performance across diverse tasks including 2D mazes, ImageNet-1K classification, and parity computation, showing rich internal representations, interpretability, and adaptive compute capabilities.

Conclusion: CTM represents a significant step toward more biologically plausible AI systems, offering a foundation for neural dynamics-based computation rather than focusing on state-of-the-art performance.

Abstract: Biological brains demonstrate complex neural activity, where neural dynamics are critical to how brains process information. Most artificial neural networks ignore the complexity of individual neurons. We challenge that paradigm. By incorporating neuron-level processing and synchronization, we reintroduce neural timing as a foundational element. We present the Continuous Thought Machine (CTM), a model designed to leverage neural dynamics as its core representation. The CTM has two innovations: (1) neuron-level temporal processing, where each neuron uses unique weight parameters to process incoming histories; and (2) neural synchronization as a latent representation. The CTM aims to strike a balance between neuron abstractions and biological realism. It operates at a level of abstraction that effectively captures essential temporal dynamics while remaining computationally tractable. We demonstrate the CTM’s performance and versatility across a range of tasks, including solving 2D mazes, ImageNet-1K classification, parity computation, and more. Beyond displaying rich internal representations and offering a natural avenue for interpretation owing to its internal process, the CTM is able to perform tasks that require complex sequential reasoning. The CTM can also leverage adaptive compute, where it can stop earlier for simpler tasks, or keep computing when faced with more challenging instances. The goal of this work is to share the CTM and its associated innovations, rather than pushing for new state-of-the-art results. To that end, we believe the CTM represents a significant step toward developing more biologically plausible and powerful artificial intelligence systems. We provide an accompanying interactive online demonstration at https://pub.sakana.ai/ctm/ and an extended technical report at https://pub.sakana.ai/ctm/paper .
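
A loose tensor-level illustration of the two innovations, with invented shapes and names rather than the authors' architecture: per-neuron private weights over each neuron's own activation history, and a pairwise synchronization matrix used as the latent.

```python
import torch
import torch.nn as nn

class NeuronLevelTemporal(nn.Module):
    def __init__(self, n_neurons: int, history: int):
        super().__init__()
        # one private weight vector per neuron (not shared across neurons)
        self.w = nn.Parameter(torch.randn(n_neurons, history) * 0.1)

    def forward(self, hist: torch.Tensor) -> torch.Tensor:
        # hist: (batch, n_neurons, history) of each neuron's past pre-activations
        return torch.einsum("bnh,nh->bn", hist, self.w)

def synchronization(acts: torch.Tensor) -> torch.Tensor:
    # acts: (batch, steps, n_neurons); correlation of neuron pairs over time
    z = (acts - acts.mean(1, keepdim=True)) / (acts.std(1, keepdim=True) + 1e-6)
    return torch.einsum("btn,btm->bnm", z, z) / acts.shape[1]

acts = torch.randn(4, 20, 32)
latent = synchronization(acts)   # (4, 32, 32) matrix used as the representation
```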

[374] Comparing Exploration-Exploitation Strategies of LLMs and Humans: Insights from Standard Multi-armed Bandit Experiments

Ziyuan Zhang, Darcy Wang, Ningyuan Chen, Rodrigo Mansur, Vahid Sarhangian

Main category: cs.LG

TL;DR: LLMs can simulate human decision-making in exploration-exploitation tradeoffs, with thinking-enabled models showing more human-like behavior, but struggling with adaptability in complex non-stationary environments.

Motivation: To investigate whether LLMs exhibit similar decision-making behavior to humans in exploration-exploitation tradeoffs and whether they can achieve comparable performance.

Method: Used canonical multi-armed bandit experiments from cognitive science literature, employed interpretable choice models to capture strategies, and tested thinking traces through prompting strategies and thinking models.

Result: Thinking-enabled LLMs showed more human-like behavior with similar levels of random and directed exploration in simple settings, but struggled with adaptability in complex non-stationary environments despite achieving similar regret in some scenarios.

Conclusion: LLMs show promise as human behavior simulators but have limitations in complex environments, pointing to areas for improvement in automated decision-making.

Abstract: Large language models (LLMs) are increasingly used to simulate or automate human behavior in complex sequential decision-making settings. A natural question is then whether LLMs exhibit similar decision-making behavior to humans, and can achieve comparable (or superior) performance. In this work, we focus on the exploration-exploitation (E&E) tradeoff, a fundamental aspect of dynamic decision-making under uncertainty. We employ canonical multi-armed bandit (MAB) experiments introduced in the cognitive science and psychiatry literature to conduct a comparative study of the E&E strategies of LLMs, humans, and MAB algorithms. We use interpretable choice models to capture the E&E strategies of the agents and investigate how enabling thinking traces, through both prompting strategies and thinking models, shapes LLM decision-making. We find that enabling thinking in LLMs shifts their behavior toward more human-like behavior, characterized by a mix of random and directed exploration. In a simple stationary setting, thinking-enabled LLMs exhibit similar levels of random and directed exploration compared to humans. However, in more complex, non-stationary environments, LLMs struggle to match human adaptability, particularly in effective directed exploration, despite achieving similar regret in certain scenarios. Our findings highlight both the promise and limits of LLMs as simulators of human behavior and tools for automated decision-making and point to potential areas for improvement.
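
A toy version of the kind of interpretable choice model referenced above, mixing directed exploration (an uncertainty bonus) with random exploration (a softmax temperature); the parameter names and values are generic, not the paper's fitted specification.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means, n_arms, T = np.array([0.3, 0.7]), 2, 200
counts, means = np.ones(n_arms), np.zeros(n_arms)
beta, tau = 0.5, 0.2   # directed-exploration bonus weight, softmax temperature

rewards = []
for t in range(T):
    bonus = beta * np.sqrt(np.log(t + 2) / counts)   # directed exploration
    logits = (means + bonus) / tau                   # random exploration via softmax
    p = np.exp(logits - logits.max())
    p /= p.sum()
    a = rng.choice(n_arms, p=p)
    r = rng.normal(true_means[a], 0.1)
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]           # incremental mean update
    rewards.append(r)

print(f"average reward over {T} pulls: {np.mean(rewards):.3f}")
```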

[375] OT Score: An OT based Confidence Score for Source Free Unsupervised Domain Adaptation

Yiming Zhang, Sitong Liu, Alex Cloninger

Main category: cs.LG

TL;DR: The paper introduces the Optimal Transport (OT) score, a novel confidence metric for source-free unsupervised domain adaptation that addresses computational limitations of existing methods and provides reliable uncertainty estimates without target labels.

Motivation: Current distributional alignment methods for SFUDA have computational and theoretical limitations, with existing frameworks yielding intractable quantities and failing to reflect alignment algorithm properties.

Method: Proposes the OT score derived from theoretical analysis using Semi-Discrete Optimal Transport alignment, which exploits the flexibility of decision boundaries and provides principled uncertainty estimates for target pseudo-labels.

Result: Experimental results show OT score outperforms existing confidence scores, improves SFUDA performance through training-time reweighting, and provides a reliable label-free proxy for model performance.

Conclusion: The OT score is an interpretable and theoretically rigorous confidence metric that effectively addresses limitations of current SFUDA methods and enhances domain adaptation performance.

Abstract: We address the computational and theoretical limitations of current distributional alignment methods for source-free unsupervised domain adaptation (SFUDA). In particular, we focus on estimating classification performance and confidence in the absence of target labels. Current theoretical frameworks for these methods often yield computationally intractable quantities and fail to adequately reflect the properties of the alignment algorithms employed. To overcome these challenges, we introduce the Optimal Transport (OT) score, a confidence metric derived from a novel theoretical analysis that exploits the flexibility of decision boundaries induced by Semi-Discrete Optimal Transport alignment. The proposed OT score is intuitively interpretable and theoretically rigorous. It provides principled uncertainty estimates for any given set of target pseudo-labels. Experimental results demonstrate that OT score outperforms existing confidence scores. Moreover, it improves SFUDA performance through training-time reweighting and provides a reliable, label-free proxy for model performance.

[376] Efficient Preimage Approximation for Neural Network Certification

Anton Björklund, Mykola Zaitsev, Marta Kwiatkowska

Main category: cs.LG

TL;DR: The paper presents algorithmic improvements to PREMAP for neural network certification against patch attacks, enabling scaling to convolutional networks and demonstrating effectiveness on computer vision and control use cases.

Motivation: Growing reliance on AI in safety-critical applications requires effective neural network certification, particularly against challenging real-world patch attacks that obscure parts of images like traffic signs.

Method: Novel algorithmic extensions to PREMAP involving tighter bounds, adaptive Monte Carlo sampling, and improved branching heuristics to improve efficiency and scalability.

Result: The efficiency improvements significantly outperform original PREMAP and enable scaling to convolutional neural networks previously intractable, with successful certification on computer vision and control use cases.

Conclusion: Preimage approximation methodology shows strong potential for analyzing and certifying reliability and robustness in neural networks for real-world safety-critical applications.

Abstract: The growing reliance on artificial intelligence in safety- and security-critical applications demands effective neural network certification. A challenging real-world use case is “patch attacks”, where adversarial patches or lighting conditions obscure parts of images, for example, traffic signs. A significant step towards certification against patch attacks was recently achieved using PREMAP, which uses under- and over-approximations of the preimage, the set of inputs that lead to a specified output, for the certification. While the PREMAP approach is versatile, it is currently limited to fully-connected neural networks of moderate dimensionality. In order to tackle broader real-world use cases, we present novel algorithmic extensions to PREMAP involving tighter bounds, adaptive Monte Carlo sampling, and improved branching heuristics. Firstly, we demonstrate that these efficiency improvements significantly outperform the original PREMAP and enable scaling to convolutional neural networks that were previously intractable. Secondly, we showcase the potential of preimage approximation methodology for analysing and certifying reliability and robustness on a range of use cases from computer vision and control.
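
For intuition, the quantity PREMAP approximates can be estimated naively by Monte Carlo: sample an input box and measure the fraction of inputs whose outputs satisfy a specification. This baseline is what the paper's adaptive sampling and branching heuristics refine; it is not PREMAP itself, and the network below is a stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
W2, b2 = rng.normal(size=(2, 16)), rng.normal(size=2)

def net(x):
    """A tiny fully-connected ReLU network standing in for the verified model."""
    h = np.maximum(x @ W1.T + b1, 0.0)
    return h @ W2.T + b2

lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])   # input region (a box)
x = rng.uniform(lo, hi, size=(100_000, 2))
out = net(x)
satisfied = out[:, 0] > out[:, 1]                       # spec: class 0 wins
print(f"estimated preimage volume fraction: {satisfied.mean():.4f}")
```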

[377] Manipulating 3D Molecules in a Fixed-Dimensional E(3)-Equivariant Latent Space

Zitao Chen, Yinjun Jia, Zitong Tian, Wei-Ying Ma, Yanyan Lan

Main category: cs.LG

TL;DR: MolFLAE is a zero-shot 3D molecule manipulation method using an E(3)-equivariant VAE that learns a shared latent space for flexible molecule editing without retraining.

Motivation: To enable flexible drug optimization by manipulating 3D molecular structures while preserving key features like shapes and pharmacophores, overcoming limitations of supervised approaches.

Method: Uses E(3)-equivariant VAE with fixed-dimensional latent space independent of atom counts, encoding molecules into latent nodes with learned embeddings, and reconstructing via Bayesian Flow Network.

Result: Achieves competitive performance on 3D molecule generation benchmarks and enables zero-shot manipulation including atom editing, reconstruction, and coordinated interpolation for structure and properties.

Conclusion: MolFLAE provides flexible, robust molecule editing capabilities with real-world utility for drug optimization, demonstrated by improving hydrophilicity while preserving key interactions in glucocorticoid receptor drugs.

Abstract: Medicinal chemists often optimize drugs considering their 3D structures and designing structurally distinct molecules that retain key features, such as shapes, pharmacophores, or chemical properties. Previous deep learning approaches address this through supervised tasks like molecule inpainting or property-guided optimization. In this work, we propose a flexible zero-shot molecule manipulation method by navigating in a shared latent space of 3D molecules. We introduce a Variational AutoEncoder (VAE) for 3D molecules, named MolFLAE, which learns a fixed-dimensional, E(3)-equivariant latent space independent of atom counts. MolFLAE encodes 3D molecules using an E(3)-equivariant neural network into fixed number of latent nodes, distinguished by learned embeddings. The latent space is regularized, and molecular structures are reconstructed via a Bayesian Flow Network (BFN) conditioned on the encoder’s latent output. MolFLAE achieves competitive performance on standard unconditional 3D molecule generation benchmarks. Moreover, the latent space of MolFLAE enables zero-shot molecule manipulation, including atom number editing, structure reconstruction, and coordinated latent interpolation for both structure and properties. We further demonstrate our approach on a drug optimization task for the human glucocorticoid receptor, generating molecules with improved hydrophilicity while preserving key interactions, under computational evaluations. These results highlight the flexibility, robustness, and real-world utility of our method, opening new avenues for molecule editing and optimization.

[378] The Challenges of Hyperparameter Tuning for Accurate Causal Effect Estimation

Damian Machlanski, Spyridon Samothrakis, Paul Clarke

Main category: cs.LG

TL;DR: The paper addresses the challenge of hyperparameter tuning and model selection in causal inference, showing that proper tuning significantly improves performance but current metrics are inconsistent across scenarios.

Motivation: There's no consensus on tuning metrics for causal inference tasks, making model comparison difficult. Causal model selection involves multiple components (estimators, regressors, hyperparameters, metrics) which complicates evaluation.

Method: Extensive empirical study combining various causal estimators, base learners, and metrics on four well-known causal inference benchmark datasets to evaluate the importance of each component.

Result: Hyperparameter tuning increased probability of reaching state-of-the-art performance from 65% to 81% for average effect estimation and from 50% to 57% for individualized effect estimation. Standard metrics showed inconsistent performance across different scenarios.

Conclusion: The findings highlight the need for further research to find metrics that can uniformly achieve state-of-the-art performance in causal model evaluation across different scenarios.

Abstract: ML is playing an increasingly crucial role in estimating causal effects of treatments on outcomes from observational data. Many ML methods ('causal estimators') have been proposed for this task. All of these methods, as with any ML approach, require extensive hyperparameter tuning. For non-causal predictive tasks, there is a consensus on the choice of tuning metrics (e.g. mean squared error), making it simple to compare models. However, for causal inference tasks, such a consensus is yet to be reached, making any comparison of causal models difficult. On top of that, there is no ideal metric on which to tune causal estimators, so one must rely on proxies. Furthermore, the fact that model selection in causal inference involves multiple components (causal estimator, ML regressor, hyperparameters, metric) complicates the issue even further. In order to evaluate the importance of each component, we perform an extensive empirical study on their combination. Our experimental setup involves many commonly used causal estimators, regressors ('base learners' henceforth) and metrics applied to four well-known causal inference benchmark datasets. Our results show that hyperparameter tuning increased the probability of reaching state-of-the-art performance in average ($65\% \rightarrow 81\%$) and individualised ($50\% \rightarrow 57\%$) effect estimation with only commonly used estimators. We also show that the performance of standard metrics can be inconsistent across different scenarios. Our findings highlight the need for further research to establish whether metrics uniformly capable of state-of-the-art performance in causal model evaluation can be found.
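
The four coupled choices the study varies (causal estimator, base learner, hyperparameters, tuning metric) can be seen in a small synthetic example: a T-learner with grid-searched gradient boosting, tuned on factual MSE as a proxy metric. Everything below is illustrative, not the paper's experimental setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 5))
t = rng.binomial(1, 0.5, size=n)
tau = 1.0 + X[:, 0]                          # heterogeneous treatment effect
y = X[:, 1] + t * tau + rng.normal(scale=0.5, size=n)

grid = {"max_depth": [2, 3], "n_estimators": [50, 200]}
models = {}
for arm in (0, 1):                           # T-learner: one model per arm
    gs = GridSearchCV(GradientBoostingRegressor(), grid,
                      scoring="neg_mean_squared_error")
    gs.fit(X[t == arm], y[t == arm])         # tuned on factual outcomes only
    models[arm] = gs.best_estimator_

cate_hat = models[1].predict(X) - models[0].predict(X)
print(f"CATE RMSE: {np.sqrt(np.mean((cate_hat - tau) ** 2)):.3f}")
```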

[379] Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

Ruizhe Shi, Minhak Song, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du

Main category: cs.LG

TL;DR: Theoretical analysis shows RLHF and DPO performance gaps depend on representation gaps, with online DPO outperforming both when reward and policy models are isomorphic but mis-specified, while RLHF has statistical advantages in sparse reward settings.

Motivation: To provide a fine-grained theoretical understanding of the performance gap between RLHF and DPO methods under representation gaps, decomposing the gap into explicit and implicit components.

Method: Decomposed the performance gap into explicit representation gap (exact optimization) and implicit representation gap (finite samples). Analyzed how relative capacities of reward and policy model classes affect final policy quality under different optimization settings.

Result: In exact optimization, RLHF, DPO, or online DPO can outperform each other depending on model mis-specifications. Online DPO outperforms both when reward and policy models are isomorphic and mis-specified. In approximate optimization with sparse rewards, RLHF requires significantly fewer samples than DPO to recover effective reward models.

Conclusion: Provides comprehensive understanding of RLHF-DPO performance gaps under various settings, offering practical insights for method selection based on model specifications and sample efficiency requirements.

Abstract: We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specification. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model – highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.
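
For reference, the standard DPO objective that the analysis contrasts with two-stage RLHF, written from per-sequence log-probabilities (a textbook form, not the paper's code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Each argument: (batch,) summed token log-probs of a full response."""
    chosen_margin = logp_chosen - ref_logp_chosen        # implicit reward, chosen
    rejected_margin = logp_rejected - ref_logp_rejected  # implicit reward, rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-13.0]), torch.tensor([-14.2]))
print(loss)
```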

[380] DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation

Makoto Shing, Masanori Koyama, Takuya Akiba

Main category: cs.LG

TL;DR: DiffusionBlocks is a framework that transforms transformers into independent trainable blocks using score matching, enabling block-wise training that matches end-to-end performance while reducing memory usage.

Motivation: End-to-end backpropagation creates memory bottlenecks that limit model scalability. Existing block-wise training methods rely on ad-hoc local objectives and are unexplored beyond classification tasks.

Method: Leverages residual connections as updates in a dynamical system, converting them to denoising processes. Each block learns independently using score matching objective, enabling training with gradients for only one block at a time.

Result: Experiments on various transformer architectures show DiffusionBlocks matches end-to-end training performance while enabling scalable block-wise training on practical tasks beyond classification.

Conclusion: DiffusionBlocks provides a theoretically grounded approach that successfully scales to modern generative tasks across diverse architectures, reducing memory requirements proportionally to the number of blocks.

Abstract: End-to-end backpropagation requires storing activations throughout all layers, creating memory bottlenecks that limit model scalability. Existing block-wise training methods offer means to alleviate this problem, but they rely on ad-hoc local objectives and remain largely unexplored beyond classification tasks. We propose $\textit{DiffusionBlocks}$, a principled framework for transforming transformer-based networks into genuinely independent trainable blocks that maintain competitive performance with end-to-end training. Our key insight leverages the fact that residual connections naturally correspond to updates in a dynamical system. With minimal modifications to this system, we can convert the updates to those of a denoising process, where each block can be learned independently by leveraging the score matching objective. This independence enables training with gradients for only one block at a time, thereby reducing memory requirements in proportion to the number of blocks. Our experiments on a range of transformer architectures (vision, diffusion, autoregressive, recurrent-depth, and masked diffusion) demonstrate that DiffusionBlocks training matches the performance of end-to-end training while enabling scalable block-wise training on practical tasks beyond small-scale classification. DiffusionBlocks provides a theoretically grounded approach that successfully scales to modern generative tasks across diverse architectures.
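
A schematic of the block-wise idea under stated assumptions: each block is trained in isolation to denoise a noised version of its target activation, so gradients never cross block boundaries. The noise schedule, conditioning, and loss weighting below are placeholders rather than the paper's exact construction.

```python
import torch
import torch.nn as nn

block = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 128))
opt = torch.optim.Adam(block.parameters(), lr=1e-4)

def train_step(h_in, h_target):
    """h_in / h_target: this block's input / output activations, (batch, 128)."""
    t = torch.rand(h_in.shape[0], 1)              # random noise level per sample
    noise = torch.randn_like(h_target)
    h_noisy = (1 - t) * h_target + t * noise      # interpolate toward pure noise
    pred = block(h_noisy + h_in)                  # condition on the block's input
    loss = ((pred - h_target) ** 2).mean()        # denoising regression objective
    opt.zero_grad()
    loss.backward()                               # gradients stay inside the block
    opt.step()
    return loss.item()

print(train_step(torch.randn(8, 128), torch.randn(8, 128)))
```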

[381] A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

Heyang Zhao, Jiafan He, Quanquan Gu

Main category: cs.LG

TL;DR: Proposes MQL-UCB algorithm for RL with general function approximation, achieving minimax optimal regret and low switching cost through monotonic value functions and variance-weighted regression.

Motivation: Address the exploration-exploitation dilemma in reinforcement learning with complex model classes and general function approximation.

Method: Combines deterministic policy-switching strategy, monotonic value function structure with controlled complexity, and variance-weighted regression scheme for data efficiency.

Result: Achieves minimax optimal regret of Õ(d√(HK)) and near-optimal policy switching cost of Õ(dH), where d is eluder dimension, H is planning horizon, K is episodes.

Conclusion: Provides provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.

Abstract: The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for RL with general function approximation. Our key algorithmic design includes (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost of $\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class, $H$ being the planning horizon, and $K$ being the number of episodes. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.

[382] Controlled Generation with Equivariant Variational Flow Matching

Floor Eijkelboom, Heiko Zimmermann, Sharvaree Vadgama, Erik J Bekkers, Max Welling, Christian A. Naesseth, Jan-Willem van de Meent

Main category: cs.LG

TL;DR: This paper presents a controlled generation framework using Variational Flow Matching (VFM) that enables constraint-driven generation through either end-to-end training or post-hoc Bayesian inference, with applications to molecular generation.

Motivation: To establish a principled connection between flow-based generative modeling and Bayesian inference for controlled generation, enabling both training-time and post-hoc constraint enforcement while preserving symmetries in molecular generation.

Method: Derived controlled generation objective within VFM framework, implemented two approaches: end-to-end conditional training and Bayesian inference for post-hoc control. Developed equivariant formulation for molecular generation with rotation, translation, and permutation invariance.

Result: Achieved state-of-the-art performance on uncontrolled molecular generation and outperformed SOTA models in controlled generation, working effectively in both end-to-end training and Bayesian inference settings.

Conclusion: The work strengthens connections between flow-based models and Bayesian inference, providing a scalable framework for constraint-driven and symmetry-aware generation that doesn’t require retraining for new constraints.

Abstract: We derive a controlled generation objective within the framework of Variational Flow Matching (VFM), which casts flow matching as a variational inference problem. We demonstrate that controlled generation can be implemented in two ways: (1) by way of end-to-end training of conditional generative models, or (2) as a Bayesian inference problem, enabling post hoc control of unconditional models without retraining. Furthermore, we establish the conditions required for equivariant generation and provide an equivariant formulation of VFM tailored for molecular generation, ensuring invariance to rotations, translations, and permutations. We evaluate our approach on both uncontrolled and controlled molecular generation, achieving state-of-the-art performance on uncontrolled generation and outperforming state-of-the-art models in controlled generation, both with end-to-end training and in the Bayesian inference setting. This work strengthens the connection between flow-based generative modeling and Bayesian inference, offering a scalable and principled framework for constraint-driven and symmetry-aware generation.

[383] Amelia: A Large Dataset and Benchmark for Airport Surface Movement Forecasting

Ingrid Navarro, Pablo Ortega-Kral, Jay Patrikar, Haichuan Wang, Alonso Cano, Zelin Ye, Jong Hoon Park, Sebastian Scherer, Jean Oh

Main category: cs.LG

TL;DR: The paper introduces Amelia-42, a large-scale dataset of airport surface movement data from 42 US airports to address the lack of public data for developing air traffic management technologies.

Motivation: Increasing air travel demand and understaffed control towers (90% in US) have led to safety concerns like near-misses, highlighting the need for better air traffic management technologies. Lack of large-scale surface movement datasets has hindered development of scalable approaches.

Method: Created Amelia-42 dataset by collecting raw airport surface movement reports from FAA’s SWIM Program over 2 years (9.19TB across 42 airports). Also released processing tools, Amelia42-Mini sample data, and established trajectory forecasting benchmarks including Amelia10-Bench and Amelia-TF transformer baseline.

Result: Successfully compiled first large-scale public dataset of airport surface movements, providing processed data samples and benchmarking tools for the research community.

Conclusion: The Amelia-42 dataset and associated tools address the critical data gap in aviation research, enabling development of scalable air traffic management technologies to improve safety and efficiency in increasingly strained aviation infrastructure.

Abstract: Demand for air travel is rising, straining existing aviation infrastructure. In the US, more than 90% of airport control towers are understaffed, falling short of FAA and union standards. This, in part, has contributed to an uptick in near-misses and safety-critical events, highlighting the need for advancements in air traffic management technologies to ensure safe and efficient operations. Data-driven predictive models for terminal airspace show potential to address these challenges; however, the lack of large-scale surface movement datasets in the public domain has hindered the development of scalable and generalizable approaches. To address this, we introduce Amelia-42, a first-of-its-kind large collection of raw airport surface movement reports streamed through the FAA’s System Wide Information Management (SWIM) Program, comprising over two years of trajectory data (~9.19 TB) across 42 US airports. We open-source tools to process this data into clean tabular position reports. We release Amelia42-Mini, a 15-day sample per airport, fully processed data on HuggingFace for ease of use. We also present a trajectory forecasting benchmark consisting of Amelia10-Bench, an accessible experiment family using 292 days from 10 airports, as well as Amelia-TF, a transformer-based baseline for multi-agent trajectory forecasting. All resources are available at our website: https://ameliacmu.github.io and https://huggingface.co/AmeliaCMU.

[384] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, Sean Hendryx

Main category: cs.LG

TL;DR: Rubrics as Rewards (RaR) extends RLVR to real-world reasoning tasks by using rubric-based feedback as reward signals, achieving significant improvements over LLM-as-judge baselines.

Motivation: Extend RLVR beyond verifiable domains like math/coding to real-world reasoning tasks where evaluation depends on nuanced, multi-criteria judgments rather than binary correctness.

Method: On-policy reinforcement learning using rubric-based feedback as rewards, with multiple strategies for aggregating rubric feedback into reward signals.

Result: Achieved relative improvements of up to 31% on HealthBench and 7% on GPQA-Diamond over LLM-as-judge baselines using Likert-based rewards.

Conclusion: RaR-trained policies adapt well to diverse evaluation formats, perform strongly on both rubric-based and multiple-choice tasks, and yield better alignment for smaller judges with reduced performance variance.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for complex reasoning tasks with clear correctness signals such as math and coding. However, extending it to real-world reasoning tasks is challenging, as evaluation depends on nuanced, multi-criteria judgments rather than binary correctness. Instance-specific rubrics have recently been used in evaluation benchmarks to capture such judgments, but their potential as reward signals for on-policy post-training remains underexplored. We introduce $\textbf{Rubrics as Rewards}$ (RaR), an on-policy reinforcement learning method that extends RLVR beyond verifiable domains by using rubric-based feedback. Across both medical and science domains, we evaluate multiple strategies for aggregating rubric feedback into rewards. The best RaR variant achieves relative improvements of up to $31\%$ on HealthBench and $7\%$ on GPQA-Diamond over popular LLM-as-judge baselines that rely on direct Likert-based rewards. These results demonstrate that RaR-trained policies adapt well to diverse evaluation formats, performing strongly on both rubric-based and multiple-choice tasks. Moreover, we find that using rubrics as structured reward signals yields better alignment for smaller judges and reduces performance variance across judge scales.
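
One plausible way to aggregate rubric feedback into a scalar reward is a weighted sum over criterion judgments; the paper evaluates several aggregation strategies, and this particular form (and the example rubric) is only illustrative.

```python
def rubric_reward(judgments, rubric):
    """judgments: {criterion: bool} from a judge model;
    rubric: {criterion: weight}, positive for desiderata, negative for pitfalls."""
    total = sum(abs(w) for w in rubric.values())
    score = sum(w * float(judgments.get(c, False)) for c, w in rubric.items())
    return score / total   # normalized to roughly [-1, 1]

rubric = {"cites_guideline": 2.0,
          "states_uncertainty": 1.0,
          "gives_dosage_error": -3.0}
print(rubric_reward({"cites_guideline": True, "gives_dosage_error": False}, rubric))
```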

[385] Model Parallelism With Subnetwork Data Parallelism

Vaibhav Singh, Zafir Khalid, Edouard Oyallon, Eugene Belilovsky

Main category: cs.LG

TL;DR: Subnetwork Data Parallelism (SDP) is a distributed training framework that partitions models into subnetworks across workers without exchanging activations, using backward and forward masking to reduce memory usage by 30%-75% while maintaining or improving performance.

Motivation: Pre-training large neural networks at scale imposes heavy memory demands on accelerators and requires costly communication, creating a need for more efficient distributed training methods.

Method: SDP partitions models into structured subnetworks trained across workers without exchanging activations. It uses two masking regimes: backward masking (sparsity only in backward step) and forward masking (removes parameters in forward pass). Two subnetwork construction strategies: neuron level and block level, applied to CNNs and transformers.

Result: SDP reduces per-device memory usage by 30%-75% while maintaining or improving performance across CNNs and transformers on CIFAR and ImageNet, and LLM pre-training on FineWeb. In FLOP-matched settings, forward masking can sometimes achieve better performance.

Conclusion: SDP provides an effective distributed training framework that significantly reduces memory usage while preserving model performance, with forward masking offering additional regularization benefits.

Abstract: Pre-training large neural networks at scale imposes heavy memory demands on accelerators and often requires costly communication. We introduce Subnetwork Data Parallelism (SDP), a distributed training framework that partitions a model into structured subnetworks trained across workers without exchanging activations. We study two complementary masking regimes: backward masking, which applies sparsity only in the backward step to retain unbiased gradients, and forward masking, which also removes parameters in the forward pass to deliver stronger efficiency gains while providing additional regularization. We further explore two subnetwork construction strategies: neuron level and block level, applied across both CNNs and transformers. In experiments spanning CNNs and transformers on CIFAR and ImageNet, as well as LLM pre-training on FineWeb, SDP reduces per-device memory usage by 30%-75% while maintaining or improving performance. Notably, in FLOP-matched settings, forward masking can sometimes achieve better performance.
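
The two masking regimes can be sketched on a single worker as follows; `mask` is that worker's subnetwork, the gradient hook realizes backward masking (full forward, sparse updates), and `forward_masked` realizes forward masking. This is an illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 32)
mask = (torch.rand_like(model.weight) < 0.5).float()   # this worker's subnetwork

# Backward masking: full (unbiased) forward pass, gradients only inside the mask.
model.weight.register_hook(lambda g: g * mask)

def forward_masked(x):
    # Forward masking: parameters outside the subnetwork are dropped here too.
    return nn.functional.linear(x, model.weight * mask, model.bias)

x = torch.randn(4, 32)
model(x).pow(2).mean().backward()                      # backward-masked regime
assert torch.all(model.weight.grad[mask == 0] == 0)    # sparse update, as intended
out = forward_masked(x)                                # forward-masked regime
```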

[386] Wasserstein Bounds for generative diffusion models with Gaussian tail targets

Xixian Wang, Zhongjian Wang

Main category: cs.LG

TL;DR: The paper provides a Wasserstein distance estimate between data distribution and score-based generative models, showing sampling complexity of O(√d) with logarithmic constant under Gaussian tail assumptions.

Motivation: To establish rigorous theoretical bounds on the sampling complexity of score-based generative models with respect to dimension, addressing practical scenarios like early stopping techniques.

Method: Analysis assumes Gaussian-type tail behavior of data distribution and ε-accurate score approximation, using dimension-independent heat kernel estimates to derive global Lipschitz bounds for the score function.

Result: The sampling complexity scales as O(√d) with logarithmic constant, linearly with square root of covariance operator trace, which relates to the forward process invariant distribution.

Conclusion: The analysis provides dimension-dependent complexity bounds for score-based generative models under realistic assumptions, with practical implications for early stopping techniques.

Abstract: We present an estimate of the Wasserstein distance between the data distribution and the generation of score-based generative models. The sampling complexity with respect to dimension is $\mathcal{O}(\sqrt{d})$, with a logarithmic constant. In the analysis, we assume a Gaussian-type tail behavior of the data distribution and an $\epsilon$-accurate approximation of the score. Such a Gaussian tail assumption is general, as it accommodates a practical target: the distribution from early stopping techniques with bounded support. The crux of the analysis lies in the global Lipschitz bound of the score, which follows from the Gaussian tail assumption by a dimension-independent estimate of the heat kernel. Consequently, our complexity bound scales linearly (up to a logarithmic constant) with the square root of the trace of the covariance operator, which relates to the invariant distribution of the forward process.

[387] First Hallucination Tokens Are Different from Conditional Ones

Jakob Snel, Seong Joon Oh

Main category: cs.LG

TL;DR: The first hallucinated token in LLM outputs is significantly more detectable than subsequent ones, revealing a structural pattern in token-level hallucination detection across models.

Motivation: To understand the distribution of hallucination signals across sequences of hallucinated tokens, as current approaches focus on response or span level detection but token-level detection enables more fine-grained intervention.

Method: Leveraged token-level annotations from the RAGTruth corpus to analyze hallucination detection patterns across sequences of hallucinated tokens in LLM outputs.

Result: Found that the first hallucinated token is far more detectable than later ones, and this structural property holds consistently across different models.

Conclusion: First hallucination tokens play a key role in token-level hallucination detection, providing important insights for developing more effective detection methods.

Abstract: Large Language Models (LLMs) hallucinate, and detecting these cases is key to ensuring trust. While many approaches address hallucination detection at the response or span level, recent work explores token-level detection, enabling more fine-grained intervention. However, the distribution of hallucination signal across sequences of hallucinated tokens remains unexplored. We leverage token-level annotations from the RAGTruth corpus and find that the first hallucinated token is far more detectable than later ones. This structural property holds across models, suggesting that first hallucination tokens play a key role in token-level hallucination detection. Our code is available at https://github.com/jakobsnl/RAGTruth_Xtended.
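
The paper's comparison can be reproduced in outline: split hallucinated tokens into first-of-span versus conditional (later) tokens, then score each group's detectability against truthful tokens separately. The detector scores and labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def split_first_vs_conditional(labels):
    """labels: 0/1 array of token-level hallucination annotations."""
    labels = np.asarray(labels)
    first = np.zeros_like(labels, dtype=bool)
    first[0] = labels[0] == 1
    first[1:] = (labels[1:] == 1) & (labels[:-1] == 0)
    conditional = (labels == 1) & ~first
    return first, conditional

scores = np.array([0.1, 0.2, 0.9, 0.4, 0.3, 0.8, 0.5])  # detector scores (toy)
labels = np.array([0, 0, 1, 1, 0, 1, 1])                # hallucination labels (toy)
first, cond = split_first_vs_conditional(labels)
truthful = labels == 0

def group_auc(group):
    """Detectability of one token group against truthful tokens."""
    y = np.r_[np.zeros(truthful.sum()), np.ones(group.sum())]
    s = np.r_[scores[truthful], scores[group]]
    return roc_auc_score(y, s)

print(f"first-token AUC: {group_auc(first):.2f}  conditional AUC: {group_auc(cond):.2f}")
```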

[388] ColNet: Collaborative Optimization in Decentralized Federated Multi-task Learning Systems

Chao Feng, Nicolas Fazli Kohler, Zhi Wang, Weijie Niu, Alberto Huertas Celdran, Gerome Bovet, Burkhard Stiller

Main category: cs.LG

TL;DR: ColNet is a decentralized federated multi-task learning framework that addresses task heterogeneity by partitioning models into backbone and task-specific heads, using adaptive clustering to form task-coherent groups, and performing hyper-conflict-averse cross-group aggregation.

Motivation: Most existing federated multi-task learning research focuses on data heterogeneity rather than task heterogeneity, and relies on centralized settings, leaving decentralized FMTL largely unexplored.

Method: ColNet partitions models into backbone and task-specific heads, uses adaptive clustering based on model and data sensitivity to form task-coherent client groups, averages backbones within groups, and performs hyper-conflict-averse cross-group aggregation through group leaders.

Result: ColNet outperforms competing schemes across datasets and federations under label and task heterogeneity, and shows robustness to poisoning attacks.

Conclusion: ColNet successfully bridges the gap in decentralized federated multi-task learning for heterogeneous tasks, demonstrating superior performance and robustness compared to existing approaches.

Abstract: The integration of Federated Learning (FL) and Multi-Task Learning (MTL) has been explored to address client heterogeneity, with Federated Multi-Task Learning (FMTL) treating each client as a distinct task. However, most existing research focuses on data heterogeneity (e.g., addressing non-IID data) rather than task heterogeneity, where clients solve fundamentally different tasks. Additionally, much of the work relies on centralized settings with a server managing the federation, leaving the more challenging domain of decentralized FMTL largely unexplored. Thus, this work bridges this gap by proposing ColNet, a framework designed for heterogeneous tasks in decentralized federated environments. ColNet partitions models into a backbone and task-specific heads, and uses adaptive clustering based on model and data sensitivity to form task-coherent client groups. Backbones are averaged within groups, and group leaders perform hyper-conflict-averse cross-group aggregation. Across datasets and federations, ColNet outperforms competing schemes under label and task heterogeneity and shows robustness to poisoning attacks.
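
A minimal sketch of the within-group step under stated assumptions: backbones are averaged across a task-coherent group while task-specific heads stay local. The leader-based, hyper-conflict-averse cross-group aggregation is omitted, and the `backbone.` key prefix is an invented convention.

```python
import torch

def average_backbones(client_states, backbone_prefix="backbone."):
    """client_states: list of state_dicts from the clients of one group."""
    avg = {}
    for key in client_states[0]:
        if key.startswith(backbone_prefix):   # task-specific heads are never averaged
            avg[key] = torch.stack([s[key].float() for s in client_states]).mean(0)
    return avg

# Each client then reloads the group backbone while keeping its own head:
#   client_model.load_state_dict(group_avg, strict=False)
```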

[389] Best Policy Learning from Trajectory Preference Feedback

Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen

Main category: cs.LG

TL;DR: Proposes PSPL, a novel algorithm for preference-based RL that uses posterior sampling to identify optimal policies from noisy binary preferences, addressing limitations of RLHF.

Motivation: RLHF is vulnerable to reward mis-specification and hacking, while PbRL offers more robust alignment using direct trajectory comparisons. Need systematic online learning for post-training optimization of generative models.

Method: PSPL algorithm inspired by Top-Two Thompson Sampling, maintains posteriors over reward model and dynamics, combines offline preference datasets with online pure exploration.

Result: Provides the first Bayesian simple regret guarantees for PbRL; an efficient approximation outperforms existing baselines on simulation and image generation benchmarks.

Conclusion: PSPL provides effective solution for preference-based policy identification with theoretical guarantees and practical performance improvements.

Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful approach for aligning generative models, but its reliance on learned reward models makes it vulnerable to mis-specification and reward hacking. Preference-based Reinforcement Learning (PbRL) offers a more robust alternative by directly leveraging noisy binary comparisons over trajectories. We study the best policy identification problem in PbRL, motivated by post-training optimization of generative models, for example, during multi-turn interactions. Learning in this setting combines an offline preference dataset (potentially biased or out-of-distribution, and collected from a rater of subpar ‘competence’) with online pure exploration, making systematic online learning essential. To this end, we propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling that maintains posteriors over the reward model and dynamics. We provide the first Bayesian simple regret guarantees for PbRL and introduce an efficient approximation that outperforms existing baselines on simulation and image generation benchmarks.

[390] Anchored Supervised Fine-Tuning

He Zhu, Junyou Su, Peng Lai, Ren Ma, Wenjia Zhang, Linyi Yang, Guanhua Chen

Main category: cs.LG

TL;DR: The paper introduces Anchored Supervised Fine-Tuning (ASFT), which improves upon Dynamic Fine-Tuning (DFT) by adding KL regularization to prevent distributional drift, achieving better performance than both SFT and DFT across multiple reasoning tasks with minimal computational overhead.

Motivation: To address the trade-off between supervised fine-tuning (SFT) that tends to memorize and reinforcement learning (RL) that generalizes better but is computationally expensive, and to fix the instability issues in DFT.

Method: Proposed Anchored Supervised Fine-Tuning (ASFT) that augments DFT’s reweighting with lightweight KL regularization to preserve theoretical tightness while ensuring training stability.

Result: ASFT consistently outperforms both SFT and DFT across mathematical reasoning, medical knowledge grounding, and code generation tasks, achieving substantial improvements with minimal computational overhead.

Conclusion: The RWR framework provides a systematic understanding of post-training methods, showing that principled theoretical analysis leads to both stronger guarantees and practical performance gains.

Abstract: Post-training of large language models involves a fundamental trade-off between supervised fine-tuning (SFT), which efficiently mimics demonstrations but tends to memorize, and reinforcement learning (RL), which achieves better generalization at higher computational cost. Dynamic Fine-Tuning (DFT) recently emerged as a promising middle ground, reweighting SFT objectives with token probabilities and achieving improvements in certain reasoning domains, though it exhibits instability in other tasks. We provide an analysis of DFT through the reward-weighted regression (RWR) framework, revealing that it corresponds to a specific auxiliary distribution choice that yields provably tighter RL bounds than standard SFT. However, our analysis also uncovers a critical limitation: this construction lacks distributional anchoring, leading to progressive drift that undermines training stability. To address this, we propose Anchored Supervised Fine-Tuning (ASFT), which augments DFT’s reweighting with lightweight KL regularization to preserve tightness while ensuring stability. Empirically, ASFT consistently outperforms both SFT and DFT across mathematical reasoning, medical knowledge grounding, and code generation, achieving substantial improvements with minimal computational overhead. Our RWR framework provides a systematic lens for understanding post-training methods and demonstrates that principled theoretical analysis leads to both stronger guarantees and practical gains.
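
A schematic of an ASFT-style objective as described above: DFT's probability-reweighted NLL plus a light KL anchor to a frozen reference model. The exact weighting and KL placement are a plausible reading of the summary, not the paper's loss.

```python
import torch
import torch.nn.functional as F

def asft_loss(logits, ref_logits, targets, kl_coef=0.1):
    """logits / ref_logits: (batch, seq, vocab); targets: (batch, seq)."""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    weight = token_logp.detach().exp()        # DFT: reweight NLL by token prob
    nll = -(weight * token_logp).mean()
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl = F.kl_div(logp, ref_logp, log_target=True, reduction="batchmean")
    return nll + kl_coef * kl                 # KL anchor prevents distributional drift

loss = asft_loss(torch.randn(2, 5, 100),
                 torch.randn(2, 5, 100),
                 torch.randint(0, 100, (2, 5)))
print(loss)
```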

[391] Learning Counterfactual Outcomes Under Rank Preservation

Peng Wu, Haoxuan Li, Chunyuan Zheng, Yan Zeng, Jiawei Chen, Yang Liu, Ruocheng Guo, Kun Zhang

Main category: cs.LG

TL;DR: Proposes a novel method for counterfactual inference using rank preservation assumption and kernel-based estimation, achieving unbiased learning without requiring known structural causal models.

Motivation: Existing counterfactual inference methods require known structural causal models or make strong assumptions about exogenous variable homogeneity and strict monotonicity, limiting their practical applicability.

Method: Introduces rank preservation assumption for identification, proposes an ideal loss function for unbiased learning, and develops a kernel-based estimator for empirical estimation.

Result: Theoretical analysis shows rank preservation is not stronger than existing assumptions, ideal loss is convex, and estimator is unbiased. Experiments demonstrate effectiveness on semi-synthetic and real-world data.

Conclusion: The proposed approach provides a principled framework for counterfactual inference that overcomes limitations of previous methods while maintaining theoretical guarantees.

Abstract: Counterfactual inference aims to estimate the counterfactual outcome at the individual level given knowledge of an observed treatment and the factual outcome, with broad applications in fields such as epidemiology, econometrics, and management science. Previous methods rely on a known structural causal model (SCM) or assume the homogeneity of the exogenous variable and strict monotonicity between the outcome and exogenous variable. In this paper, we propose a principled approach for identifying and estimating the counterfactual outcome. We first introduce a simple and intuitive rank preservation assumption to identify the counterfactual outcome without relying on a known structural causal model. Building on this, we propose a novel ideal loss for theoretically unbiased learning of the counterfactual outcome and further develop a kernel-based estimator for its empirical estimation. Our theoretical analysis shows that the rank preservation assumption is not stronger than the homogeneity and strict monotonicity assumptions, and shows that the proposed ideal loss is convex, and the proposed estimator is unbiased. Extensive semi-synthetic and real-world experiments are conducted to demonstrate the effectiveness of the proposed method.
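
Rank preservation has a simple operational reading: an individual keeps the same outcome rank in both treatment arms, which suggests a quantile-matching imputation. The sketch below illustrates the assumption on synthetic arms; it is not the paper's kernel-based estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
y0 = rng.normal(0.0, 1.0, 5000)   # observed outcomes under control
y1 = rng.normal(1.0, 2.0, 5000)   # observed outcomes under treatment

def counterfactual(y_factual, from_arm, to_arm):
    """Impute the other-arm outcome by matching the empirical rank."""
    rank = (from_arm < y_factual).mean()   # empirical CDF at the factual outcome
    return np.quantile(to_arm, rank)

# A control unit with y = 0.5 keeps its rank when mapped to the treated arm:
print(counterfactual(0.5, y0, y1))
```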

[392] Grounding the Ungrounded: A Spectral-Graph Framework for Quantifying Hallucinations in multimodal LLMs

Supratik Sarkar, Swagatam Das

Main category: cs.LG

TL;DR: Introduces a rigorous information geometric framework using diffusion dynamics to quantify hallucinations in multimodal LLMs, moving from qualitative detection to mathematically grounded measurement.

Motivation: Hallucinations in LLMs remain a fundamental obstacle to trustworthy AI, especially in high-stakes domains. Existing evaluation techniques are heuristic and lack principled quantification or theoretical guarantees.

Method: Represents MLLM outputs as spectral embeddings over multimodal graph Laplacians, characterizes manifold gaps as semantic distortion, and uses Rayleigh-Ritz bounds on hallucination energy with time-dependent temperature profiles via eigenmode decompositions in RKHS embeddings.

Result: Develops modality-aware, theoretically interpretable metrics that capture the evolution of hallucinations across time and input prompts through temperature annealing.

Conclusion: Establishes a principled foundation for quantifying and bounding hallucinations, transforming them from a qualitative risk to a tractable, analyzable phenomenon.

Abstract: Hallucinations in large language models (LLMs) remain a fundamental obstacle to trustworthy AI, particularly in high-stakes multimodal domains such as medicine, law, and finance. Existing evaluation techniques are largely heuristic – anchored in qualitative benchmarking or ad-hoc empirical mitigation – providing neither principled quantification nor actionable theoretical guarantees. This gap leaves a critical blind spot in understanding how hallucinations arise, propagate, and interact across modalities. We introduce the first (to our knowledge) rigorous information geometric framework in diffusion dynamics for quantifying hallucinations in multimodal LLMs (MLLMs), advancing the field from qualitative detection to mathematically grounded measurement. Our approach represents MLLM outputs as the spectral embeddings over multimodal graph Laplacians and characterizes the manifold gaps of truth vs inconsistencies as the semantic distortion, enabling the tight Rayleigh–Ritz bounds on the multimodal hallucination energy as a functional of time-dependent temperature profiles. By leveraging eigenmode decompositions in Reproducing Kernel Hilbert Space (RKHS) embeddings, our framework delivers modality-aware, theoretically interpretable metrics that capture the evolution of hallucinations across time and input prompts through temperature annealing. This work establishes a principled foundation for quantifying and bounding hallucinations, transforming them from a qualitative risk to a tractable, analyzable phenomenon.
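
For orientation, the basic object the bounds are stated over, a spectral embedding from a graph Laplacian, looks like this on synthetic data; the paper's multimodal construction, temperature profiles, and RKHS machinery are not reproduced here.

```python
import numpy as np

X = np.random.default_rng(4).normal(size=(50, 8))      # toy output embeddings
W = np.exp(-np.square(X[:, None] - X[None]).sum(-1))   # dense Gaussian affinity
L = np.diag(W.sum(1)) - W                              # (unnormalized) graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, 1:4]                            # low-frequency spectral modes
print(embedding.shape)                                 # (50, 3)
```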

[393] On the Effect of Sampling Diversity in Scaling LLM Inference

Tianchun Wang, Zichuan Liu, Yuanzhou Chen, Jonathan Light, Weiyang Liu, Haifeng Chen, Xiang Zhang, Wei Cheng

Main category: cs.LG

TL;DR: Diversified sampling through meaningful prompt perturbations significantly improves LLM scaling inference performance across reasoning, mathematics, and code generation tasks.

Motivation: To enhance LLM scaling inference performance by leveraging the observed relationship between solution accuracy and meaningful response diversity, moving beyond stationary prompts.

Method: Systematically study prompt diversity effects, theoretically analyze diversified sampling for Best-of-N scaling, analyze perturbation fidelity, and develop task-level and query-level perturbations to promote solution diversity.

Result: Diversified sampling achieves relative gains of 10.8% in EM@100 for reasoning, 9.6% for mathematics, and 9.5% in Pass@100 for code generation, with responses from diverse prompts showing significantly lower error rates than stationary prompts.

Conclusion: Meaningful prompt diversity is crucial for effective LLM scaling inference, with moderately relevant perturbations providing optimal performance improvements across diverse tasks.

Abstract: Large language model (LLM) scaling inference is key to unlocking greater performance, and leveraging diversity has proven an effective way to enhance it. Motivated by the observed relationship between solution accuracy and meaningful response diversity, we systematically study the effect of prompt diversity in scaling inference. We theoretically explain why diversified sampling improves Best-of-$N$ scaling, showing that responses generated from meaningful diverse prompts after Best-of-$N$ selection exhibit significantly lower error rates than those produced from stationary prompts. To promote solution diversity, we analyze perturbation fidelity and show that moderately relevant perturbations improve performance, providing guidance for effective perturbation design. Further, we present a set of effective perturbations, including task-level and query-level ones, and analyze the conditions under which they succeed. We systematically evaluate diversified sampling across tasks, finding relative gains of 10.8% in EM@100 for reasoning, 9.6% for mathematics, and 9.5% in Pass@100 for code generation.
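
A skeleton of diversified Best-of-N: sample responses from several semantically perturbed prompts rather than one stationary prompt, then keep the best candidate under a scorer. `generate` and `score` are placeholders for an LLM call and a task-specific checker; the templates are invented examples of task-level perturbations.

```python
import random

TEMPLATES = [                      # invented task-level perturbations
    "{q}",
    "Think step by step. {q}",
    "Restate the problem in your own words, then solve it. {q}",
    "Solve carefully and double-check the result. {q}",
]

def perturb(prompt: str, n: int) -> list[str]:
    return [random.choice(TEMPLATES).format(q=prompt) for _ in range(n)]

def diversified_best_of_n(prompt, n, generate, score):
    """generate: prompt -> response; score: response -> float (verifier)."""
    candidates = [generate(p) for p in perturb(prompt, n)]
    return max(candidates, key=score)
```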

[394] FinP: Fairness-in-Privacy in Federated Learning by Addressing Disparities in Privacy Risk

Tianyu Zhao, Mahmoud Srewa, Salma Elmalaki

Main category: cs.LG

TL;DR: FinP is a novel framework for fair privacy in federated learning that addresses disparities in privacy risk across clients through adaptive aggregation and client-side regularization.

Motivation: To ensure fairness in privacy distribution across clients in federated learning, particularly addressing disproportionate vulnerability to source inference attacks.

Method: Two-pronged strategy: (1) server-side adaptive aggregation that dynamically adjusts client contributions, and (2) client-side regularization to enhance individual privacy robustness.

Result: Achieved 57.14% improvement in group fairness for privacy risk on CIFAR-10 compared to state-of-the-art, with minimal impact on utility and significant mitigation of source inference attack risks.

Conclusion: FinP effectively establishes fairness in privacy within FL systems without compromising utility, making it a promising solution for equitable privacy risk distribution.

Abstract: Ensuring fairness in machine learning extends to the critical dimension of privacy, particularly in human-centric federated learning (FL) settings where decentralized data necessitates an equitable distribution of privacy risk across clients. This paper introduces FinP, a novel framework specifically designed to address disparities in privacy risk by mitigating disproportionate vulnerability to source inference attacks (SIA). FinP employs a two-pronged strategy: (1) server-side adaptive aggregation, which dynamically adjusts client contributions to the global model to foster fairness, and (2) client-side regularization, which enhances the privacy robustness of individual clients. This comprehensive approach directly tackles both the symptoms and underlying causes of privacy unfairness in FL. Extensive evaluations on the Human Activity Recognition (HAR) and CIFAR-10 datasets demonstrate FinP’s effectiveness, achieving improvement in fairness-in-privacy on HAR and CIFAR-10 with minimal impact on utility. FinP improved group fairness with respect to disparity in privacy risk using equal opportunity in CIFAR-10 by 57.14% compared to the state-of-the-art. Furthermore, FinP significantly mitigates SIA risks on CIFAR-10, underscoring its potential to establish fairness in privacy within FL systems without compromising utility.

[395] STORI: A Benchmark and Taxonomy for Stochastic Environments

Aryan Amit Barsainyan, Jing Yu Lim, Dianbo Liu

Main category: cs.LG

TL;DR: The paper introduces STORI, a benchmark that systematically incorporates diverse stochastic effects to evaluate RL techniques under different forms of uncertainty, revealing vulnerabilities in state-of-the-art model-based RL algorithms.

Motivation: Current RL techniques show limited transfer to real-world domains due to environmental stochasticity (noisy observations, unpredictable dynamics, non-stationary conditions), and existing benchmarks rarely capture these uncertainties.

Method: Proposed a comprehensive five-type taxonomy of environmental stochasticity and created the STORI benchmark to systematically evaluate RL techniques under different forms of uncertainty, testing DreamerV3 and STORM algorithms.

Result: World models dramatically underestimate environmental variance, struggle with action corruption, and exhibit unreliable dynamics under partial observability. State-of-the-art model-based RL algorithms show systematic vulnerabilities to different types of stochasticity.

Conclusion: STORI provides a unified framework for developing more robust RL systems by enabling rigorous evaluation under diverse stochastic conditions, addressing a critical gap in current RL benchmarking.

Abstract: Reinforcement learning (RL) techniques have achieved impressive performance on simulated benchmarks such as Atari100k, yet recent advances remain largely confined to simulation and show limited transfer to real-world domains. A central obstacle is environmental stochasticity, as real systems involve noisy observations, unpredictable dynamics, and non-stationary conditions that undermine the stability of current methods. Existing benchmarks rarely capture these uncertainties and favor simplified settings where algorithms can be tuned to succeed. The absence of a well-defined taxonomy of stochasticity further complicates evaluation, as robustness to one type of stochastic perturbation, such as sticky actions, does not guarantee robustness to other forms of uncertainty. To address this critical gap, we introduce STORI (STOchastic-ataRI), a benchmark that systematically incorporates diverse stochastic effects and enables rigorous evaluation of RL techniques under different forms of uncertainty. We propose a comprehensive five-type taxonomy of environmental stochasticity and demonstrate systematic vulnerabilities in state-of-the-art model-based RL algorithms through targeted evaluation of DreamerV3 and STORM. Our findings reveal that world models dramatically underestimate environmental variance, struggle with action corruption, and exhibit unreliable dynamics under partial observability. We release the code and benchmark publicly at https://github.com/ARY2260/stori, providing a unified framework for developing more robust RL systems.
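Sticky actions, mentioned in the abstract, are one classic instance of action-level stochasticity in such a taxonomy. A minimal Gymnasium wrapper of this kind might look as follows; this is an illustrative sketch, not code from the STORI benchmark.

```python
import numpy as np
import gymnasium as gym

class StickyActions(gym.Wrapper):
    """With probability p, repeat the previous action instead of the
    agent's chosen one, stochastically corrupting the action channel."""

    def __init__(self, env, p=0.25, seed=0):
        super().__init__(env)
        self.p = p
        self.rng = np.random.default_rng(seed)
        self.prev_action = None

    def reset(self, **kwargs):
        self.prev_action = None
        return self.env.reset(**kwargs)

    def step(self, action):
        if self.prev_action is not None and self.rng.random() < self.p:
            action = self.prev_action
        self.prev_action = action
        return self.env.step(action)
```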

[396] To Backtrack or Not to Backtrack: When Sequential Search Limits Model Reasoning

Tian Qin, David Alvarez-Melis, Samy Jelassi, Eran Malach

Main category: cs.LG

TL;DR: Sequential backtracking doesn’t always outperform parallel sampling for LLM reasoning - performance depends on task structure, with backtracking being better for Sudoku but worse for CountDown.

DetailsMotivation: To systematically compare sequential backtracking vs parallel sampling approaches for scaling test-time compute in LLM reasoning, as the advantages of backtracking under fixed compute budgets remain poorly understood.

Method: Systematic comparison of sequential search (backtracking with chain-of-thought) and parallel sampling (best-of-N selection) on CountDown and Sudoku reasoning tasks, with analysis extended to reinforcement learning fine-tuning.

Result: Sequential search underperformed parallel sampling on CountDown but outperformed it on Sudoku. Backtracking can degrade performance due to training on fixed search traces and explicit CoT supervision discouraging implicit reasoning. RL fine-tuning significantly benefits models with backtracking capabilities.

Conclusion: Backtracking doesn’t universally enhance LLM reasoning - performance depends on complex interactions between task structure, training data, model scale, and learning paradigm.

Abstract: Recent advancements in large language models (LLMs) have significantly improved their reasoning abilities, particularly through techniques involving search and backtracking. Backtracking naturally scales test-time compute by enabling sequential, linearized exploration via long chain-of-thought (CoT) generation. However, this is not the only strategy for scaling test-time compute: parallel sampling with best-of-N selection provides an alternative that generates diverse solutions simultaneously. Despite the growing adoption of sequential search, its advantages over parallel sampling, especially under a fixed compute budget, remain poorly understood. In this paper, we systematically compare these two approaches on two challenging reasoning tasks: CountDown and Sudoku. Surprisingly, we find that sequential search underperforms parallel sampling on CountDown but outperforms it on Sudoku, suggesting that backtracking is not universally beneficial. We identify two factors that can cause backtracking to degrade performance: (1) training on fixed search traces can lock models into suboptimal strategies, and (2) explicit CoT supervision can discourage implicit (non-verbalized) reasoning. Extending our analysis to reinforcement learning (RL), we show that models with backtracking capabilities benefit significantly from RL fine-tuning, while models without backtracking see limited, mixed gains. Together, these findings challenge the assumption that backtracking universally enhances LLM reasoning, instead revealing a complex interaction between task structure, training data, model scale, and learning paradigm.
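The fixed-budget comparison at the heart of the paper can be stated in a few lines of Python. `generate` and `verify` are hypothetical stand-ins for a sampler and an answer checker; the point is only that both strategies below spend the same total token budget.

```python
def parallel_best_of_n(generate, verify, prompt, n=8, max_tokens=512):
    """Parallel scaling: n independent short samples, keep the best."""
    candidates = [generate(prompt, max_tokens=max_tokens) for _ in range(n)]
    return max(candidates, key=verify)

def sequential_search(generate, prompt, n=8, max_tokens=512):
    """Sequential scaling: one long backtracking chain-of-thought trace
    given the same total budget of n * max_tokens tokens."""
    return generate(prompt, max_tokens=n * max_tokens)
```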

[397] DualRAG: A Dual-Process Approach to Integrate Reasoning and Retrieval for Multi-Hop Question Answering

Rong Cheng, Jinyi Liu, Yan Zheng, Fei Ni, Jiazhen Du, Hangyu Mao, Fuzheng Zhang, Bo Wang, Jianye Hao

Main category: cs.LG

TL;DR: DualRAG is a dual-process framework that integrates reasoning and retrieval for multi-hop question answering, using Reasoning-augmented Querying and progressive Knowledge Aggregation to improve answer accuracy and coherence.

DetailsMotivation: Existing approaches for Multi-Hop Question Answering struggle with identifying and organizing dynamic knowledge despite improvements in iterative retrieval methods.

Method: DualRAG framework with two coupled processes: Reasoning-augmented Querying (RaQ) for navigating reasoning paths and generating queries, and progressive Knowledge Aggregation (pKA) for systematic knowledge integration.

Result: Substantially improves answer accuracy and coherence, approaching or surpassing performance with oracle knowledge access in some cases.

Conclusion: DualRAG establishes itself as a robust and efficient solution for complex multi-hop reasoning tasks, maintaining capabilities even in smaller-scale models through targeted fine-tuning.

Abstract: Multi-Hop Question Answering (MHQA) tasks permeate real-world applications, posing challenges in orchestrating multi-step reasoning across diverse knowledge domains. While existing approaches have been improved with iterative retrieval, they still struggle to identify and organize dynamic knowledge. To address this, we propose DualRAG, a synergistic dual-process framework that seamlessly integrates reasoning and retrieval. DualRAG operates through two tightly coupled processes: Reasoning-augmented Querying (RaQ) and progressive Knowledge Aggregation (pKA). They work in concert: as RaQ navigates the reasoning path and generates targeted queries, pKA ensures that newly acquired knowledge is systematically integrated to support coherent reasoning. This creates a virtuous cycle of knowledge enrichment and reasoning refinement. Through targeted fine-tuning, DualRAG preserves its sophisticated reasoning and retrieval capabilities even in smaller-scale models, demonstrating its versatility and core advantages across different scales. Extensive experiments demonstrate that this dual-process approach substantially improves answer accuracy and coherence, approaching, and in some cases surpassing, the performance achieved with oracle knowledge access. These results establish DualRAG as a robust and efficient solution for complex multi-hop reasoning tasks.
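The coupling between the two processes can be sketched as a simple loop. The callables `reason`, `retrieve`, and `integrate` are hypothetical placeholders for the RaQ and pKA components, not the authors' API.

```python
def dualrag_answer(question, reason, retrieve, integrate, max_hops=4):
    """RaQ/pKA loop: reason over accumulated knowledge, emit either a
    final answer or the next targeted query, then fold retrieved passages
    back into the knowledge notes."""
    knowledge = ""
    for _ in range(max_hops):
        answer, next_query = reason(question, knowledge)   # RaQ
        if answer is not None:
            return answer
        passages = retrieve(next_query)
        knowledge = integrate(knowledge, passages)         # pKA
    return reason(question, knowledge)[0]                  # budget exhausted
```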

[398] Flow-Induced Diagonal Gaussian Processes

Moule Lin, Andrea Patane, Weipeng Jing, Shuhao Guan, Goetz Botterweck

Main category: cs.LG

TL;DR: FiD-GP is a compression framework that uses flow-induced diagonal Gaussian processes with inducing weight matrices to project neural network weight uncertainty into lower-dimensional subspaces, achieving significant compression while maintaining performance.

DetailsMotivation: To address the computational cost and storage requirements of Bayesian neural networks while improving uncertainty estimation and enabling effective out-of-distribution detection.

Method: Uses normalising-flow priors and spectral regularizations with a compact inducing weight matrix to project weight uncertainty into lower-dimensional subspaces, incorporating a numerically stable projection mechanism aligned with feature-gradient geometry.

Result: Reduces Bayesian training cost by several orders of magnitude, compresses parameters by 51%, reduces model size by 75%, matches state-of-the-art accuracy and uncertainty estimation, and enables theoretically guaranteed OoD detection.

Conclusion: FiD-GP provides an effective framework for compressing neural networks while maintaining uncertainty estimation capabilities and enabling reliable out-of-distribution detection across various tasks.

Abstract: We present Flow-Induced Diagonal Gaussian Processes (FiD-GP), a compression framework that incorporates a compact inducing weight matrix to project a neural network’s weight uncertainty into a lower-dimensional subspace. Critically, FiD-GP relies on normalising-flow priors and spectral regularisations to augment its expressiveness and align the inducing subspace with feature-gradient geometry through a numerically stable projection mechanism. Furthermore, we demonstrate how the prediction framework in FiD-GP can help to design a single-pass projection for Out-of-Distribution (OoD) detection. Our analysis shows that FiD-GP improves uncertainty estimation ability on various tasks compared with SVGP-based baselines, satisfies tight spectral residual bounds with theoretically guaranteed OoD detection, and significantly compresses the neural network’s storage requirements at the cost of increased inference computation dependent on the number of inducing weights employed. Specifically, in a comprehensive empirical study spanning regression, image classification, semantic segmentation, and out-of-distribution detection benchmarks, it cuts Bayesian training cost by several orders of magnitude, compresses parameters by roughly 51%, reduces model size by about 75%, and matches state-of-the-art accuracy and uncertainty estimation.

[399] On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm

Huan Li, Yiming Dong, Zhouchen Lin

Main category: cs.LG

TL;DR: This paper establishes the first theoretical convergence rate for the AdamW optimizer, showing O(√d/K^{1/4}) convergence measured by the ℓ₁ norm, which is analogous to SGD’s optimal rate when considering high-dimensional gradient properties.

DetailsMotivation: AdamW is widely used for training large language models but lacks theoretical understanding of its convergence behavior, creating a gap between practical success and theoretical foundations.

Method: Theoretical analysis establishing convergence bounds for AdamW using ℓ₁ norm measurement, with empirical validation on real-world deep learning tasks and extension to NAdamW variant.

Result: Proved the convergence rate (1/K) ∑ₖ E[||∇f(xᵏ)||₁] ≤ O(√d/K^{1/4}), showing AdamW achieves similar convergence behavior to SGD’s optimal rate in high dimensions, with empirical evidence supporting the theoretical findings.

Conclusion: AdamW maintains comparable convergence properties to SGD in high-dimensional settings, providing theoretical justification for its practical success in training large language models, with the analysis extending to momentum-based variants like NAdamW.

Abstract: As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not theoretically well understood. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[||\nabla f(x^k)||_1\right]\leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured by the $\ell_1$ norm, where $K$ represents the iteration number, $d$ denotes the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $||\nabla f(x)||_2\ll ||\nabla f(x)||_1\leq \sqrt{d}||\nabla f(x)||_2$ for any high-dimensional vector $x$, and $E\left[||\nabla f(x)||_1\right]\geq\sqrt{\frac{2d}{\pi}}E\left[||\nabla f(x)||_2\right]$ when each element of $\nabla f(x)$ is drawn from the Gaussian distribution $\mathcal N(0,1)$. Empirically, our experimental results on real-world deep learning tasks reveal $||\nabla f(x)||_1=\varTheta(\sqrt{d})\,||\nabla f(x)||_2$. Both support that our convergence rate can be considered analogous to the optimal $\frac{1}{K}\sum_{k=1}^K E\left[||\nabla f(x^k)||_2\right]\leq O(\frac{C}{K^{1/4}})$ convergence rate of SGD in the ideal case. We also extend our result to NAdamW, an AdamW variant that employs a double-momentum mechanism, and demonstrate that it maintains the same convergence rate.
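The claimed Θ(√d) gap between the two norms is easy to check numerically for Gaussian gradients: for g ~ N(0, I_d), E[||g||₁] = d·√(2/π) while E[||g||₂] ≈ √d, so the ratio grows like √(2d/π).

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [10, 100, 1000, 10000]:
    g = rng.standard_normal((1000, d))   # 1000 simulated Gaussian "gradients"
    ratio = (np.abs(g).sum(axis=1) / np.linalg.norm(g, axis=1)).mean()
    print(f"d={d:6d}  l1/l2 = {ratio:8.2f}  sqrt(2d/pi) = {np.sqrt(2 * d / np.pi):8.2f}")
```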

[400] Graph-Reward-SQL: Execution-Free Reinforcement Learning for Text-to-SQL via Graph Matching and Stepwise Reward

Han Weng, Puzhen Wu, Longjie Cui, Yi Zhan, Boyi Liu, Yuanfeng Song, Dun Zeng, Yingxiang Yang, Qianru Zhang, Dong Huang, Xiaoming Yin, Yang Sun, Xing Chen

Main category: cs.LG

TL;DR: Proposes Graph-Reward-SQL, a novel reward model framework for RL-based Text-to-SQL that uses SQL graph representations to provide accurate rewards while reducing time cost and GPU memory usage, with StepRTM providing intermediate supervision over CTE subqueries.

DetailsMotivation: Existing RL methods for Text-to-SQL suffer from high execution latency due to repeated database calls (execution-based rewards) or substantial GPU memory overhead (LLM-based Bradley-Terry rewards), hindering efficiency and scalability.

Method: Uses Graph-Reward-SQL framework with GMNScore outcome reward model based on SQL graph representations, plus StepRTM stepwise reward model that provides intermediate supervision over Common Table Expression subqueries.

Result: Extensive experiments on Spider and BIRD benchmarks show the method consistently outperforms existing reward models, significantly reducing time cost and GPU memory usage while improving both functional correctness and readability of SQL.

Conclusion: The proposed Graph-Reward-SQL framework with StepRTM provides an efficient and scalable solution for RL-based Text-to-SQL that overcomes limitations of existing reward models.

Abstract: Reinforcement learning (RL) has been widely adopted to enhance the performance of large language models (LLMs) on Text-to-SQL tasks. However, existing methods often rely on execution-based or LLM-based Bradley-Terry reward models. The former suffers from high execution latency caused by repeated database calls, whereas the latter imposes substantial GPU memory overhead, both of which significantly hinder the efficiency and scalability of RL pipelines. To this end, we propose a novel reward model framework for RL-based Text-to-SQL named Graph-Reward-SQL, which employs the GMNScore outcome reward model. We leverage SQL graph representations to provide accurate reward signals while significantly reducing time cost and GPU memory usage. Building on this foundation, we further introduce StepRTM, a stepwise reward model that provides intermediate supervision over Common Table Expression (CTE) subqueries. This encourages both functional correctness and readability of SQL. Extensive comparative and ablation experiments on standard benchmarks, including Spider and BIRD, demonstrate that our method consistently outperforms existing reward models.

[401] Theoretical Investigation on Inductive Bias of Isolation Forest

Qin-Cheng Zheng, Shao-Qun Zhang, Shen-Huan Lyu, Yuan Jiang, Zhi-Hua Zhou

Main category: cs.LG

TL;DR: This paper provides a theoretical foundation for Isolation Forest (iForest) by modeling its growth process as a random walk and deriving the expected depth function, revealing key inductive biases about its performance characteristics.

DetailsMotivation: Despite iForest's widespread adoption and success in anomaly detection, there has been no clear theoretical foundation explaining why it works so well, particularly its remarkable runtime efficiency and superior performance in large-scale tasks.

Method: The authors model the growth process of iForest as a random walk, where split dimensions and values are randomly selected. They derive the expected depth function using transition probabilities to analyze iForest’s inductive bias.

Result: The theoretical analysis reveals that iForest exhibits lower sensitivity to central anomalies and demonstrates greater parameter adaptability compared to k-Nearest Neighbor methods.

Conclusion: This study provides the first theoretical understanding of iForest’s effectiveness and establishes a foundation for further theoretical exploration of this popular anomaly detection algorithm.

Abstract: Isolation Forest (iForest) stands out as a widely-used unsupervised anomaly detector, primarily owing to its remarkable runtime efficiency and superior performance in large-scale tasks. Despite its widespread adoption, a theoretical foundation explaining iForest’s success remains unclear. This paper focuses on the inductive bias of iForest, which theoretically elucidates under what circumstances and to what extent iForest works well. The key is to formulate the growth process of iForest, where the split dimensions and split values are randomly selected. We model the growth process of iForest as a random walk, enabling us to derive the expected depth function, which is the outcome of iForest, using transition probabilities. The case studies reveal key inductive biases: iForest exhibits lower sensitivity to central anomalies while demonstrating greater parameter adaptability compared to $k$-Nearest Neighbor. Our study provides a theoretical understanding of the effectiveness of iForest and establishes a foundation for further theoretical exploration.
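The random-walk quantity at the center of the analysis, the depth at which a point is isolated by uniformly random splits, can be simulated directly. The toy below is a 1-D illustration, not the paper's derivation: points far from the data mass are isolated at a shallower expected depth, which is the anomaly signal iForest exploits.

```python
import numpy as np

def isolation_depth(x, data, rng, max_depth=50):
    """Number of uniformly random splits needed to isolate point x."""
    depth = 0
    while len(data) > 1 and depth < max_depth:
        lo, hi = data.min(), data.max()
        if lo == hi:
            break
        split = rng.uniform(lo, hi)
        data = data[data <= split] if x <= split else data[data > split]
        depth += 1
    return depth

cluster = np.random.default_rng(0).normal(0.0, 1.0, 256)
for point in [0.0, 6.0]:   # central inlier vs far anomaly
    depths = [isolation_depth(point, np.append(cluster, point),
                              np.random.default_rng(s)) for s in range(200)]
    print(f"x={point}: mean isolation depth = {np.mean(depths):.1f}")
```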

[402] Observation-Free Attacks on Online Learning to Rank

Sameep Chattopadhyay, Nikhil Karamchandani, Sharayu Moharir

Main category: cs.LG

TL;DR: The paper presents a framework for coordinated adversarial attacks on online learning to rank algorithms, showing they can be manipulated with only logarithmic effort to promote target items while causing linear regret.

DetailsMotivation: Online learning to rank algorithms are widely used but their vulnerability to coordinated adversarial attacks remains poorly understood, posing security risks in search engines and recommender systems.

Method: Proposed two attack strategies: CascadeOFA for CascadeUCB1 and PBMOFA for PBM-UCB, designed to promote target items to top-K recommendations while inducing linear regret in the learning algorithm.

Result: Theoretical guarantees show both attack strategies require only O(log T) manipulations to succeed, and empirical results on real-world data validate the effectiveness of the attacks.

Conclusion: OLTR algorithms are highly vulnerable to coordinated adversarial attacks that can efficiently manipulate rankings with minimal effort, highlighting critical security concerns in practical applications.

Abstract: Online learning to rank (OLTR) plays a critical role in information retrieval and machine learning systems, with a wide range of applications in search engines and content recommenders. However, despite their extensive adoption, the susceptibility of OLTR algorithms to coordinated adversarial attacks remains poorly understood. In this work, we present a novel framework for attacking some of the widely used OLTR algorithms. Our framework is designed to promote a set of target items so that they appear in the list of top-K recommendations for T - o(T) rounds, while simultaneously inducing linear regret in the learning algorithm. We propose two novel attack strategies: CascadeOFA for CascadeUCB1 and PBMOFA for PBM-UCB. We provide theoretical guarantees showing that both strategies require only O(log T) manipulations to succeed. Additionally, we supplement our theoretical analysis with empirical results on real-world data.

[403] Online Decision-Focused Learning

Aymeric Capitaine, Maxime Haddouche, Eric Moulines, Michael I. Jordan, Etienne Boursier, Alain Durmus

Main category: cs.LG

TL;DR: This paper introduces online decision-focused learning algorithms for dynamic environments where objective functions and data distributions evolve over time, addressing challenges of non-differentiable and non-convex objectives.

DetailsMotivation: Existing decision-focused learning methods assume fixed data batches and static objectives, but real-world environments are dynamic with evolving data distributions and changing objectives, creating a gap in current research.

Method: The authors propose two online algorithms that (i) regularize the objective to make it differentiable and (ii) use perturbation techniques with a near-optimal oracle to handle non-convexity, establishing static and dynamic regret bounds.

Result: The algorithms achieve provable guarantees for online decision-focused learning and outperform standard benchmarks in a knapsack experiment, demonstrating effectiveness in dynamic settings.

Conclusion: This work provides the first provable guarantees for online decision-focused learning in dynamic environments, addressing key challenges of non-differentiability and non-convexity through regularization and perturbation techniques.

Abstract: Decision-focused learning (DFL) is an increasingly popular paradigm for training predictive models whose outputs are used in decision-making tasks. Instead of merely optimizing for predictive accuracy, DFL trains models to directly minimize the loss associated with downstream decisions. However, existing studies focus solely on scenarios where a fixed batch of data is available and the objective function does not change over time. We instead investigate DFL in dynamic environments where the objective function and data distribution evolve over time. This setting is challenging for online learning because the objective function has zero or undefined gradients – which prevents the use of standard first-order optimization methods – and is generally non-convex. To address these difficulties, we (i) regularize the objective to make it differentiable and (ii) use perturbation techniques along with a near-optimal oracle to overcome non-convexity. Combining those techniques yields two original online algorithms tailored for DFL, for which we establish respectively static and dynamic regret bounds. These are the first provable guarantees for the online decision-focused problem. Finally, we showcase the effectiveness of our algorithms on a knapsack experiment, where they outperform two standard benchmarks.

[404] Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement

Anyi Wang, Xuansheng Wu, Dong Shu, Yunpu Ma, Ninghao Liu

Main category: cs.LG

TL;DR: SAE-RSV refines steering vectors for LLM control using sparse autoencoders to remove noise and enrich task-relevant features from limited data, outperforming supervised fine-tuning.

DetailsMotivation: Existing steering methods require large datasets, limiting real-world applicability. Small datasets produce noisy steering vectors with task-irrelevant features.

Method: Use sparse autoencoders to semantically denoise steering vectors by removing task-irrelevant features and enriching relevant ones through semantic similarity.

Result: SAE-RSV substantially outperforms all baseline methods including supervised fine-tuning in extensive experiments.

Conclusion: Effective steering vectors can be constructed from limited data by refining original vectors through SAEs.

Abstract: Steering has emerged as a promising approach to controlling large language models (LLMs) without modifying model parameters. However, most existing steering methods rely on large-scale datasets to learn clear behavioral information, which limits their applicability in many real-world scenarios. Steering vectors extracted from small datasets often contain task-irrelevant noisy features, which degrades their effectiveness. To refine the steering vectors learned from limited data, we introduce Refinement of Steering Vector via Sparse Autoencoder (SAE-RSV), which leverages SAEs to semantically denoise and augment the steering vectors. In our framework, we first remove task-irrelevant features according to their semantics provided by SAEs, and then enrich task-relevant features missing from the small dataset through their semantic similarity to the identified relevant features. Extensive experiments demonstrate that the proposed SAE-RSV substantially outperforms all baseline methods, including supervised fine-tuning. Our findings show that effective steering vectors can be constructed from limited training data by refining the original steering vector through SAEs.
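A bare-bones version of the refinement step might look like the following. `sae_encode`/`sae_decode` and the index sets are hypothetical placeholders: in the paper, the relevant and related features are identified via SAE feature semantics and semantic similarity, which this sketch abstracts away.

```python
import torch

def refine_steering_vector(v, sae_encode, sae_decode,
                           relevant_idx, related_idx=(), alpha=0.5):
    """Denoise a steering vector in SAE feature space: zero task-irrelevant
    features, mildly activate related features missing from the small
    dataset, then decode back to the residual stream."""
    z = sae_encode(v)                         # sparse feature activations
    mask = torch.zeros_like(z)
    mask[list(relevant_idx)] = 1.0
    z = z * mask                              # drop task-irrelevant features
    strength = z[list(relevant_idx)].abs().mean()
    for i in related_idx:                     # enrich missing related features
        z[i] = alpha * strength
    return sae_decode(z)
```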

[405] QiMeng-CodeV-R1: Reasoning-Enhanced Verilog Generation

Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen

Main category: cs.LG

TL;DR: CodeV-R1 is an RLVR framework that trains LLMs to generate Verilog code from natural language specifications, achieving state-of-the-art performance on hardware design benchmarks.

DetailsMotivation: Extend RLVR to electronic design automation (EDA) for automatically generating hardware description languages (HDLs) like Verilog from natural language, addressing challenges in automated verification, data scarcity, and computation costs.

Method: Developed rule-based testbench generator for equivalence checking, round-trip data synthesis method for high-quality dataset creation, and two-stage ‘distill-then-RL’ training pipeline with adaptive DAPO algorithm to reduce training costs.

Result: CodeV-R1-7B model achieves 68.6% pass@1 on VerilogEval v2 and 72.9% pass@1 on RTLLM v1.1, surpassing prior SOTA by 12-20% and exceeding 671B DeepSeek-R1 performance on RTLLM.

Conclusion: The framework successfully addresses key challenges in Verilog generation and achieves superior performance, with released model, code, and dataset to benefit EDA and LLM research communities.

Abstract: Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computation cost of RLVR. To this end, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage “distill-then-RL” training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that can reduce training cost by adaptively adjusting sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing prior state-of-the-art by 12~20%, while even exceeding the performance of 671B DeepSeek-R1 on RTLLM. We have released our model, training code, and dataset to facilitate research in EDA and LLM communities.

[406] Putnam-like dataset summary: LLMs as mathematical competition contestants

Bartosz Bieganowski, Daniel Strzelecki, Robert Skiba, Mateusz Topolewski

Main category: cs.LG

TL;DR: Analysis of LLM performance on Putnam-like mathematical problems from Google DeepMind’s benchmark

DetailsMotivation: To evaluate LLMs' ability to solve complex mathematical contest problems similar to those in the Putnam Competition

Method: Used a dataset of 96 original Putnam-like problems and 576 LLM solutions to analyze model performance

Result: The paper presents performance analysis results of LLMs on mathematical contest problems

Conclusion: The study provides insights into LLMs’ capabilities in solving challenging mathematical problems from competitions

Abstract: In this paper we summarize the results of the Putnam-like benchmark published by Google DeepMind. This dataset consists of 96 original problems in the spirit of the Putnam Competition and 576 solutions of LLMs. We analyse the performance of models on this set of problems to verify their ability to solve problems from mathematical contests.

[407] Risk-Sensitive Agent Compositions

Guruprerana Shabadi, Rajeev Alur

Main category: cs.LG

TL;DR: This paper presents a framework for optimizing agent compositions in workflows by minimizing value-at-risk of safety/fairness/privacy violations, using dynamic programming on agent graphs.

DetailsMotivation: Real-world agentic systems need to minimize low-probability safety violations while maximizing task success, requiring analysis of tail behaviors in agent compositions.

Method: Formalizes agent workflows as directed acyclic graphs, introduces efficient dynamic programming algorithm that approximates value-at-risk using union bounds, and proves near-optimality for practical loss functions.

Result: The algorithm effectively approximates value-at-risk and identifies optimal agent compositions in video game-like control benchmarks with reinforcement learning agents.

Conclusion: The proposed framework provides an efficient solution for risk-aware agent composition selection that balances task success with safety requirements in complex workflows.

Abstract: From software development to robot control, modern agentic systems decompose complex objectives into a sequence of subtasks and choose a set of specialized AI agents to complete them. We formalize agentic workflows as directed acyclic graphs, called agent graphs, where edges represent AI agents and paths correspond to feasible compositions of agents. Real-world deployment requires selecting agent compositions that not only maximize task success but also minimize violations of safety, fairness, and privacy requirements, which demands a careful analysis of the low-probability (tail) behaviors of compositions of agents. In this work, we consider risk minimization over the set of feasible agent compositions and seek to minimize the value-at-risk of the loss distribution of the agent composition, where the loss quantifies violations of these requirements. We introduce an efficient algorithm which traverses the agent graph and finds a near-optimal composition of agents. It uses a dynamic programming approach to approximate the value-at-risk of agent compositions by exploiting a union bound. Furthermore, we prove that the approximation is near-optimal asymptotically for a broad class of practical loss functions. To evaluate our framework, we consider a suite of video game-like control benchmarks that require composing several agents trained with reinforcement learning and demonstrate our algorithm’s effectiveness in approximating the value-at-risk and identifying the optimal agent composition.
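The union-bound trick is what makes the search tractable: if each agent (edge) violates the requirement with probability p_e, the probability that any agent on a path does is at most the sum of the p_e, so minimizing the bound reduces to a shortest-path dynamic program over the DAG. The sketch below illustrates that reduction only; the paper's algorithm bounds the value-at-risk of a full loss distribution.

```python
from collections import defaultdict

def safest_composition(edges, source, sink, topo_order):
    """edges[u] -> list of (v, p_violate); minimize the union bound
    (sum of p_violate) along a source-to-sink path in the agent graph."""
    bound = defaultdict(lambda: float("inf"))
    bound[source], parent = 0.0, {}
    for u in topo_order:                      # DP in topological order
        for v, p in edges.get(u, []):
            if bound[u] + p < bound[v]:
                bound[v], parent[v] = bound[u] + p, u
    path, node = [sink], sink                 # reconstruct the best path
    while node != source:
        node = parent[node]
        path.append(node)
    return bound[sink], path[::-1]

edges = {"s": [("a", 0.05), ("b", 0.02)], "a": [("t", 0.01)], "b": [("t", 0.06)]}
print(safest_composition(edges, "s", "t", ["s", "a", "b", "t"]))  # (0.06, ['s', 'a', 't'])
```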

[408] Exponential Family Variational Flow Matching for Tabular Data Generation

Andrés Guzmán-Cordero, Floor Eijkelboom, Jan-Willem van de Meent

Main category: cs.LG

TL;DR: TabbyFlow is a variational Flow Matching method for generating tabular data with mixed continuous and discrete features, achieving state-of-the-art performance.

DetailsMotivation: Denoising diffusion and flow matching have advanced generative modeling but remain limited for tabular data despite its real-world ubiquity.

Method: Developed TabbyFlow using Exponential Family Variational Flow Matching (EF-VFM) to represent heterogeneous data types with exponential family distributions, enabling efficient moment matching for learning probability paths.

Result: Evaluation on tabular data benchmarks demonstrates state-of-the-art performance compared to baseline methods.

Conclusion: TabbyFlow successfully extends flow matching to tabular data generation and establishes connections between variational flow matching and Bregman divergence-based objectives.

Abstract: While denoising diffusion and flow matching have driven major advances in generative modeling, their application to tabular data remains limited, despite its ubiquity in real-world applications. To this end, we develop TabbyFlow, a variational Flow Matching (VFM) method for tabular data generation. To apply VFM to data with mixed continuous and discrete features, we introduce Exponential Family Variational Flow Matching (EF-VFM), which represents heterogeneous data types using a general exponential family distribution. We hereby obtain an efficient, data-driven objective based on moment matching, enabling principled learning of probability paths over mixed continuous and discrete variables. We also establish a connection between variational flow matching and generalized flow matching objectives based on Bregman divergences. Evaluation on tabular data benchmarks demonstrates state-of-the-art performance compared to baselines.

[409] DM-Bench: Benchmarking LLMs for Personalized Decision Making in Diabetes Management

Maria Ana Cardei, Josephine Lamp, Mark Derdzinski, Karan Bhatia

Main category: cs.LG

TL;DR: DM-Bench is the first benchmark for evaluating LLMs on real-world diabetes management tasks, featuring 360,600 personalized questions across 7 task categories using data from 15,000 individuals.

DetailsMotivation: Existing health benchmarks are either generic, clinician-focused, or limited to clinical tasks, lacking evaluation frameworks for patient-facing AI solutions in diabetes management.

Method: Created a comprehensive dataset with one month of time-series data from 15,000 individuals across three diabetes populations, generating personalized questions across 7 task categories and evaluating models on 5 metrics.

Result: Evaluation of 8 recent LLMs showed substantial variability across tasks and metrics, with no single model consistently outperforming others across all dimensions.

Conclusion: DM-Bench aims to advance the reliability, safety, effectiveness and practical utility of AI solutions in diabetes care by providing a specialized evaluation framework.

Abstract: We present DM-Bench, the first benchmark designed to evaluate large language model (LLM) performance across real-world decision-making tasks faced by individuals managing diabetes in their daily lives. Unlike prior health benchmarks that are either generic, clinician-facing or focused on clinical tasks (e.g., diagnosis, triage), DM-Bench introduces a comprehensive evaluation framework tailored to the unique challenges of prototyping patient-facing AI solutions in diabetes, glucose management, metabolic health and related domains. Our benchmark encompasses 7 distinct task categories, reflecting the breadth of real-world questions individuals with diabetes ask, including basic glucose interpretation, educational queries, behavioral associations, advanced decision making and long term planning. Towards this end, we compile a rich dataset comprising one month of time-series data encompassing glucose traces and metrics from continuous glucose monitors (CGMs) and behavioral logs (e.g., eating and activity patterns) from 15,000 individuals across three different diabetes populations (type 1, type 2, pre-diabetes/general health and wellness). Using this data, we generate a total of 360,600 personalized, contextual questions across the 7 tasks. We evaluate model performance on these tasks across 5 metrics: accuracy, groundedness, safety, clarity and actionability. Our analysis of 8 recent LLMs reveals substantial variability across tasks and metrics; no single model consistently outperforms others across all dimensions. By establishing this benchmark, we aim to advance the reliability, safety, effectiveness and practical utility of AI solutions in diabetes care.

[410] Modern Methods in Associative Memory

Dmitry Krotov, Benjamin Hoover, Parikshit Ram, Bao Pham

Main category: cs.LG

TL;DR: This tutorial introduces Associative Memories (like Hopfield Networks) with modern methods, connecting them to Transformers and Diffusion Models, and providing practical derivations and coding notebooks.

DetailsMotivation: Recent theoretical advances in Associative Memories' storage capabilities and their connections to SOTA AI architectures (Transformers, Diffusion Models) enable new interpretations of traditional AI networks and novel distributed model designs.

Method: The tutorial uses modern language and methods for Associative Memories, including Lagrangian formulations that enable distributed models, with hands-on mathematical derivations and coding notebooks.

Result: The tutorial provides an accessible introduction to Associative Memories, emphasizing their modern applications and theoretical connections to contemporary AI architectures.

Conclusion: Associative Memories offer valuable theoretical frameworks for understanding and designing AI networks, with practical implementations through modern mathematical formulations and coding approaches.

Abstract: Associative Memories like the famous Hopfield Networks are elegant models for describing fully recurrent neural networks whose fundamental job is to store and retrieve information. In the past few years they experienced a surge of interest due to novel theoretical results pertaining to their information storage capabilities, and their relationship with SOTA AI architectures, such as Transformers and Diffusion Models. These connections open up possibilities for interpreting the computation of traditional AI networks through the theoretical lens of Associative Memories. Additionally, novel Lagrangian formulations of these networks make it possible to design powerful distributed models that learn useful representations and inform the design of novel architectures. This tutorial provides an approachable introduction to Associative Memories, emphasizing the modern language and methods used in this area of research, with practical hands-on mathematical derivations and coding notebooks.
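The connection to Transformers comes from the modern continuous Hopfield update x ← Mᵀ softmax(β M x), which is a single attention step over the stored patterns (rows of M). A self-contained retrieval demo in the spirit of the tutorial's notebooks:

```python
import numpy as np

def hopfield_retrieve(M, query, beta=8.0, steps=3):
    """Iterate the modern Hopfield update; rows of M are stored memories."""
    x = query.copy()
    for _ in range(steps):
        logits = beta * M @ x
        p = np.exp(logits - logits.max())
        p /= p.sum()                  # softmax attention over memories
        x = M.T @ p                   # convex combination of stored patterns
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 16))
M /= np.linalg.norm(M, axis=1, keepdims=True)
noisy = M[2] + 0.3 * rng.standard_normal(16)         # corrupted memory 2
print(np.argmax(M @ hopfield_retrieve(M, noisy)))    # -> 2
```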

[411] Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models

Zizhuo Zhang, Jianing Zhu, Xinmu Ge, Zihua Zhao, Zhanke Zhou, Xuan Li, Xiao Feng, Jiangchao Yao, Bo Han

Main category: cs.LG

TL;DR: Co-rewarding is a self-supervised RL framework that addresses training collapse in self-rewarding LLMs by introducing complementary supervision through contrastive agreement across questions and self-distillation with reference teachers.

DetailsMotivation: To overcome the scaling limitations of RLVR (requiring human labels) and training collapse issues in self-rewarding methods (due to single-view supervision creating self-consistent illusions).

Method: Two instantiations: Co-rewarding-I uses contrastive agreement across semantically similar questions, and Co-rewarding-II uses self-distillation with slowly-updated reference teachers and pseudo labels.

Result: Achieves stable training and outperforms self-rewarding baselines by +3.31% average improvement on math reasoning benchmarks, with +7.49% on Llama-3.2-3B-Instruct. Matches or exceeds RLVR with ground-truth labels in some cases (94.01% Pass@1 on GSM8K with Qwen3-8B-Base).

Conclusion: Co-rewarding provides an effective self-supervised alternative to RLVR that avoids training collapse while achieving competitive or superior performance without human annotations.

Abstract: While reinforcement learning with verifiable rewards (RLVR) is effective for improving the reasoning ability of large language models (LLMs), its reliance on human-annotated labels creates a scaling dilemma, especially for complex tasks. Recent self-rewarding methods investigate a label-free alternative to unlock the reasoning capabilities of LLMs, yet they frequently encounter a non-negligible training-collapse issue: the single-view supervision signal easily forms a self-consistent illusion, yielding reward hacking. Inspired by the success of self-supervised learning, we propose Co-rewarding, a novel self-supervised RL framework that improves training stability by seeking complementary supervision from other views. Specifically, we instantiate Co-rewarding in two ways: (1) Co-rewarding-I is a data-side instantiation that derives reward signals from contrastive agreement across semantically analogous questions; and (2) Co-rewarding-II is a model-side instantiation that maintains a slowly-updated reference teacher with pseudo labels to realize self-distillation. Intuitively, such instantiations introduce different levels of discrepancy that make collapse onto trivial reasoning solutions harder. Empirically, Co-rewarding exhibits stable training across various setups, and outperforms other self-rewarding baselines by +3.31% on average on multiple mathematical reasoning benchmarks, and by +7.49% on Llama-3.2-3B-Instruct. Notably, Co-rewarding reaches or even surpasses RLVR with ground-truth (GT) labels in several cases, such as a Pass@1 of 94.01% on GSM8K with Qwen3-8B-Base, remarkably higher than the GT-supervised result. Our code is publicly available at https://github.com/tmlr-group/Co-rewarding.

[412] Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed

Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Ken Ziyu Liu, Sanmi Koyejo

Main category: cs.LG

TL;DR: This paper distinguishes between data-space attacks (which transfer between models) and representation-space attacks (which don’t transfer without geometric alignment), explaining why image jailbreaks fail to transfer between vision-language models while other attacks succeed.

DetailsMotivation: To explain why recent studies found that image jailbreaks cannot transfer between vision-language models, unlike adversarial examples in image classifiers and text jailbreaks in language models.

Method: The authors provide theoretical and empirical evidence across four settings: mathematical proof for networks with same input-output maps but different representations, representation-space attacks on image classifiers, representation-space attacks on language models, and data-space attacks on VLMs with geometric alignment analysis.

Result: Data-space attacks successfully transfer between models, while representation-space attacks fail to transfer unless the models’ latent geometries are sufficiently aligned. The authors demonstrate successful transfer of data-space attacks between VLMs and show representation-space attacks can transfer when geometric alignment exists.

Conclusion: Adversarial transfer is not inherent to all attacks but depends on their operational domain - attacks in shared data-space transfer, while attacks in models’ unique representation spaces do not transfer without geometric alignment, providing critical insights for building more robust models.

Abstract: The field of adversarial robustness has long established that adversarial examples can successfully transfer between image classifiers and that text jailbreaks can successfully transfer between language models (LMs). However, a pair of recent studies reported being unable to successfully transfer image jailbreaks between vision-language models (VLMs). To explain this striking difference, we propose a fundamental distinction regarding the transferability of attacks against machine learning models: attacks in the input data-space can transfer, whereas attacks in model representation space do not, at least not without geometric alignment of representations. We then provide theoretical and empirical evidence of this hypothesis in four different settings. First, we mathematically prove this distinction in a simple setting where two networks compute the same input-output map but via different representations. Second, we construct representation-space attacks against image classifiers that are as successful as well-known data-space attacks, but fail to transfer. Third, we construct representation-space attacks against LMs that successfully jailbreak the attacked models but again fail to transfer. Fourth, we construct data-space attacks against VLMs that successfully transfer to new VLMs, and we show that representation space attacks can transfer when VLMs’ latent geometries are sufficiently aligned in post-projector space. Our work reveals that adversarial transfer is not an inherent property of all attacks but contingent on their operational domain - the shared data-space versus models’ unique representation spaces - a critical insight for building more robust models.

[413] Injecting Measurement Information Yields a Fast and Noise-Robust Diffusion-Based Inverse Problem Solver

Jonathan Patsenker, Henry Li, Myeongseob Ko, Ruoxi Jia, Yuval Kluger

Main category: cs.LG

TL;DR: Proposes a method to estimate conditional posterior mean in diffusion models for inverse problems, improving on traditional Tweedie’s formula by incorporating measurement information directly during sampling.

DetailsMotivation: Traditional diffusion-based inverse solvers use Tweedie's formula, which conditions only on the diffusion variate without incorporating measurement information, leading to suboptimal integration of measurement data downstream.

Method: Estimate conditional posterior mean by solving a lightweight maximum likelihood estimation problem with a single parameter, which can be integrated into any standard sampler with noise-aware stopping criteria.

Result: Achieves comparable or improved performance against contemporary inverse solvers across multiple datasets and tasks, with fast and memory-efficient operation.

Conclusion: The proposed conditional posterior mean estimation provides a more effective way to incorporate measurement information in diffusion-based inverse solvers, offering robust performance with efficient computation.

Abstract: Diffusion models have been firmly established as principled zero-shot solvers for linear and nonlinear inverse problems, owing to their powerful image prior and iterative sampling algorithm. These approaches often rely on Tweedie’s formula, which relates the diffusion variate $\mathbf{x}_t$ to the posterior mean $\mathbb{E} [\mathbf{x}_0 | \mathbf{x}_t]$, in order to guide the diffusion trajectory with an estimate of the final denoised sample $\mathbf{x}_0$. However, this does not consider information from the measurement $\mathbf{y}$, which must then be integrated downstream. In this work, we propose to estimate the conditional posterior mean $\mathbb{E} [\mathbf{x}_0 | \mathbf{x}_t, \mathbf{y}]$, which can be formulated as the solution to a lightweight, single-parameter maximum likelihood estimation problem. The resulting prediction can be integrated into any standard sampler, resulting in a fast and memory-efficient inverse solver. Our optimizer is amenable to a noise-aware likelihood-based stopping criteria that is robust to measurement noise in $\mathbf{y}$. We demonstrate comparable or improved performance against a wide selection of contemporary inverse solvers across multiple datasets and tasks.
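Under an epsilon-prediction DDPM and a linear forward model y = A x₀ + Gaussian noise, the idea can be sketched as a one-parameter correction of the Tweedie estimate; for this linear-Gaussian toy the single-parameter MLE even has a closed form. The notation and parameterization here are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def conditional_posterior_mean(x_t, eps_model, t, alpha_bar, y, A):
    """Approximate E[x0 | x_t, y] by correcting the Tweedie estimate
    E[x0 | x_t] along a measurement-informed direction, with the scalar
    step size chosen by maximum likelihood (closed form in this sketch)."""
    a = alpha_bar[t]                                  # assumed: 0-dim tensor
    eps = eps_model(x_t, t).detach()
    x0_hat = (x_t - torch.sqrt(1.0 - a) * eps) / torch.sqrt(a)  # Tweedie
    residual = y - A @ x0_hat
    direction = A.T @ residual        # pull the estimate toward the data
    Ad = A @ direction
    lam = (Ad @ residual) / (Ad @ Ad)  # argmin of ||y - A(x0_hat + lam*d)||^2
    return x0_hat + lam * direction
```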

[414] KAIROS: Unified Training for Universal Non-Autoregressive Time Series Forecasting

Kuiye Ding, Fanda Fan, Zheya Wang, Hongxiao Li, Yifan Wang, Lei Wang, Chunjie Luo, Jianfeng Zhan

Main category: cs.LG

TL;DR: KAIROS is a non-autoregressive time series forecasting framework that models segment-level multi-peak distributions, enabling fast inference without error accumulation while maintaining high accuracy comparable to state-of-the-art foundation models.

DetailsMotivation: Web applications require fast time series forecasting for real-time decision making in resource planning, cache placement, and anomaly response, but existing approaches either suffer from error accumulation (autoregressive) or produce over-smoothed predictions (non-autoregressive).

Method: KAIROS uses a non-autoregressive framework that directly models segment-level multi-peak distributions, avoiding the sequential error accumulation of autoregressive methods while preventing the over-smoothing issues of existing non-autoregressive approaches.

Result: KAIROS demonstrates strong zero-shot generalization on six benchmarks, achieving forecasting performance comparable to state-of-the-art foundation models with similar scale, but at a fraction of their inference cost.

Conclusion: KAIROS highlights non-autoregressive design as a scalable paradigm for foundation models in time series, offering fast inference without sacrificing accuracy.

Abstract: In the World Wide Web, reliable time series forecasts provide the forward-looking signals that drive resource planning, cache placement, and anomaly response, enabling platforms to operate efficiently as user behavior and content distributions evolve. Compared with other domains, time series forecasting for Web applications requires much faster responsiveness to support real-time decision making. We present KAIROS, a non-autoregressive time series forecasting framework that directly models segment-level multi-peak distributions. Unlike autoregressive approaches, KAIROS avoids error accumulation and achieves just-in-time inference, while improving over existing non-autoregressive models that collapse to over-smoothed predictions. Trained on the large-scale corpus, KAIROS demonstrates strong zero-shot generalization on six widely used benchmarks, delivering forecasting performance comparable to state-of-the-art foundation models with similar scale, at a fraction of their inference cost. Beyond empirical results, KAIROS highlights the importance of non-autoregressive design as a scalable paradigm for foundation models in time series.

[415] Relative Advantage Debiasing for Watch-Time Prediction in Short-Video Recommendation

Emily Liu, Kuan Han, Minfeng Zhan, Bocheng Zhao, Guanyu Mu, Yang Song

Main category: cs.LG

TL;DR: A framework that debiases watch time by comparing it to reference distributions from user and item groups, using quantile-based preference signals and two-stage architecture.

DetailsMotivation: Raw watch times are influenced by confounding factors like video duration and popularity, which can distort preference signals and create biased recommendation models.

Method: Relative advantage debiasing framework with quantile-based preference signals, two-stage architecture separating distribution estimation from preference learning, and distributional embeddings for efficient parameterization.

Result: Significant improvements in recommendation accuracy and robustness demonstrated in both offline and online experiments compared to baseline methods.

Conclusion: The proposed framework effectively addresses watch time bias and enhances recommendation performance through relative advantage debiasing and quantile-based preference modeling.

Abstract: Watch time is widely used as a proxy for user satisfaction in video recommendation platforms. However, raw watch times are influenced by confounding factors such as video duration, popularity, and individual user behaviors, potentially distorting preference signals and resulting in biased recommendation models. We propose a novel relative advantage debiasing framework that corrects watch time by comparing it to empirically derived reference distributions conditioned on user and item groups. This approach yields a quantile-based preference signal and introduces a two-stage architecture that explicitly separates distribution estimation from preference learning. Additionally, we present distributional embeddings to efficiently parameterize watch-time quantiles without requiring online sampling or storage of historical data. Both offline and online experiments demonstrate significant improvements in recommendation accuracy and robustness compared to existing baseline methods.
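The quantile signal itself is simple to compute offline. In the hypothetical pandas sketch below (column names invented for illustration), a watch time is converted to its rank within the empirical reference distribution of its duration bucket, so a fully watched short clip can outrank a half-watched long video.

```python
import pandas as pd

def relative_advantage(df, group_cols=("duration_bucket",)):
    """Replace raw watch time with its within-group quantile."""
    out = df.copy()
    out["advantage"] = (out.groupby(list(group_cols))["watch_time"]
                           .transform(lambda s: s.rank(pct=True)))
    return out

df = pd.DataFrame({
    "duration_bucket": ["short", "short", "long", "long"],
    "watch_time": [25.0, 5.0, 60.0, 240.0],
})
print(relative_advantage(df))
```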

[416] Towards Provable Emergence of In-Context Reinforcement Learning

Jiuqi Wang, Rohan Chandra, Shangtong Zhang

Main category: cs.LG

TL;DR: This paper investigates why RL pretraining algorithms can generate network parameters that enable in-context reinforcement learning (ICRL), where agents solve new tasks without parameter updates by conditioning on context.

DetailsMotivation: The motivation is to understand why standard RL pretraining algorithms can produce network parameters that enable the remarkable ICRL phenomenon, where pretrained agents solve out-of-distribution tasks without parameter updates by using context information.

Method: The authors conduct a case study and prove that when a Transformer is pretrained for policy evaluation, one of the global minimizers of the pretraining loss can enable in-context temporal difference learning.

Result: The paper provides initial support for the hypothesis that parameters capable of ICRL are minimizers of the pretraining loss, specifically demonstrating this for policy evaluation with Transformers.

Conclusion: The work establishes theoretical foundations for understanding ICRL by showing that global minimizers of pretraining loss can enable in-context learning capabilities, specifically temporal difference learning in the case studied.

Abstract: Typically, a modern reinforcement learning (RL) agent solves a task by updating its neural network parameters to adapt its policy to the task. Recently, it has been observed that some RL agents can solve a wide range of new out-of-distribution tasks without parameter updates after pretraining on some task distribution. When evaluated in a new task, instead of making parameter updates, the pretrained agent conditions its policy on additional input called the context, e.g., the agent’s interaction history in the new task. The agent’s performance increases as the information in the context increases, with the agent’s parameters fixed. This phenomenon is typically called in-context RL (ICRL). The pretrained parameters of the agent network enable the remarkable ICRL phenomenon. However, many ICRL works perform the pretraining with standard RL algorithms. This raises the central question this paper aims to address: Why can the RL pretraining algorithm generate network parameters that enable ICRL? We hypothesize that the parameters capable of ICRL are minimizers of the pretraining loss. This work provides initial support for this hypothesis through a case study. In particular, we prove that when a Transformer is pretrained for policy evaluation, one of the global minimizers of the pretraining loss can enable in-context temporal difference learning.

[417] Feature Dynamics as Implicit Data Augmentation: A Depth-Decomposed View on Deep Neural Network Generalization

Tianyu Ruan, Kuo Gai, Shihua Zhang

Main category: cs.LG

TL;DR: Deep networks generalize well due to temporal consistency in feature evolution, where predictions remain stable when combining shallow features from earlier training stages with deeper features from later stages, acting as implicit structured augmentation.

DetailsMotivation: To understand why deep networks generalize well by examining internal feature evolution rather than just inputs and outputs, moving beyond classical generalization theory.

Method: Study temporal consistency of predictions when combining features from different training checkpoints, analyze stability on various data types, and use statistical tests to examine SGD noise patterns.

Result: Found temporal consistency acts as implicit structured augmentation that supports generalization, extends to unseen/corrupted data but collapses with random labels, and SGD injects anisotropic noise aligned with principal directions.

Conclusion: Feature dynamics are linked to generalization, suggesting future work on practical measures for temporal feature evolution as a way to understand and improve generalization.

Abstract: Why do deep networks generalize well? In contrast to classical generalization theory, we approach this fundamental question by examining not only inputs and outputs but also the evolution of internal features. Our study suggests a phenomenon of temporal consistency: predictions remain stable when shallow features from earlier checkpoints are combined with deeper features from later ones. This stability is not a trivial convergence artifact. It acts as a form of implicit, structured augmentation that supports generalization. We show that temporal consistency extends to unseen and corrupted data, but collapses when semantic structure is destroyed (e.g., random labels). Statistical tests further reveal that SGD injects anisotropic noise aligned with a few principal directions, reinforcing its role as a source of structured variability. Together, these findings suggest a conceptual perspective that links feature dynamics to generalization, pointing toward future work on practical surrogates for measuring temporal feature evolution.

[418] Demystifying Network Foundation Models

Sylee Beltiukov, Satyandra Guthula, Wenbo Guo, Walter Willinger, Arpit Gupta

Main category: cs.LG

TL;DR: This paper systematically analyzes latent knowledge in Network Foundation Models (NFMs) through embedding geometry, metric alignment, and causal sensitivity testing, revealing significant limitations like anisotropy and inconsistent feature sensitivity that can be addressed to improve performance.

DetailsMotivation: To investigate the latent knowledge encoded within Network Foundation Models beyond just downstream task performance, focusing on hidden representations analysis to understand model limitations and properties.

Method: Three-part evaluation: Embedding Geometry Analysis to assess representation space utilization, Metric Alignment Assessment to measure correspondence with domain-expert features, and Causal Sensitivity Testing to evaluate robustness to protocol perturbations. Evaluated four state-of-the-art NFMs using five diverse network datasets.

Result: All evaluated NFMs exhibit significant anisotropy, inconsistent feature sensitivity patterns, inability to separate high-level context, payload dependency, and other limitations. Addressing these limitations can significantly improve model performance by up to +0.35 F1 score without architectural changes.

Conclusion: The systematic analysis reveals numerous limitations across all Network Foundation Models, demonstrating that understanding and addressing these latent representation issues can lead to substantial performance improvements in network analysis tasks.

Abstract: This work presents a systematic investigation into the latent knowledge encoded within Network Foundation Models (NFMs) that focuses on hidden representations analysis rather than pure downstream task performance. Different from existing efforts, we analyze the models through a three-part evaluation: Embedding Geometry Analysis to assess representation space utilization, Metric Alignment Assessment to measure correspondence with domain-expert features, and Causal Sensitivity Testing to evaluate robustness to protocol perturbations. Using five diverse network datasets spanning controlled and real-world environments, we evaluate four state-of-the-art NFMs, revealing that they all exhibit significant anisotropy, inconsistent feature sensitivity patterns, an inability to separate the high-level context, payload dependency, and other properties. Our work identifies numerous limitations across all models and demonstrates that addressing them can significantly improve model performance (by up to +0.35 $F_1$ score without architectural changes).
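
A minimal numpy sketch of one probe from the evaluation above: anisotropy measured as the mean cosine similarity between random pairs of embeddings, near 0 for an isotropic space and near 1 when a common direction dominates. The random matrices stand in for NFM representations.

```python
import numpy as np

rng = np.random.default_rng(0)

def anisotropy(emb, n_pairs=10_000):
    i, j = rng.integers(0, len(emb), size=(2, n_pairs))
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalize rows
    return float(np.mean(np.sum(e[i] * e[j], axis=1)))     # mean pairwise cosine

isotropic = rng.normal(size=(5000, 256))
shifted = isotropic + 5.0          # add a dominant common direction
print(anisotropy(isotropic))       # close to 0
print(anisotropy(shifted))         # close to 1
```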

[419] Gradient Descent with Large Step Sizes: Chaos and Fractal Convergence Region

Shuang Liang, Guido Montúfar

Main category: cs.LG

TL;DR: Gradient descent in matrix factorization develops fractal structure under large step sizes, with chaotic dynamics near critical step sizes that eliminate predictable implicit biases.

DetailsMotivation: To understand how gradient descent behaves in matrix factorization problems when using large step sizes, particularly examining convergence properties and sensitivity to initialization.

Method: Analyzed gradient descent dynamics in matrix factorization, derived exact critical step size for scalar-vector factorization, and extended analysis to general matrix factorization with orthogonal initialization.

Result: Found that near critical step sizes, parameter space develops fractal structure, selected minimizer depends sensitively on initialization, and regularization amplifies this sensitivity creating fractal boundaries between convergent and divergent initializations.

Conclusion: Near-critical step sizes induce chaotic regime in gradient descent where long-term dynamics are unpredictable and there are no simple implicit biases like balancedness, minimum norm, or flatness.

Abstract: We examine gradient descent in matrix factorization and show that under large step sizes the parameter space develops a fractal structure. We derive the exact critical step size for convergence in scalar-vector factorization and show that near criticality the selected minimizer depends sensitively on the initialization. Moreover, we show that adding regularization amplifies this sensitivity, generating a fractal boundary between initializations that converge and those that diverge. The analysis extends to general matrix factorization with orthogonal initialization. Our findings reveal that near-critical step sizes induce a chaotic regime of gradient descent where the long-term dynamics are unpredictable and there are no simple implicit biases, such as towards balancedness, minimum norm, or flatness.
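
A minimal numpy sketch of the kind of experiment behind these findings: gradient descent on the scalar factorization loss L(u, v) = (uv - y)^2 / 2 with a large step size, scanning a grid of initializations and recording which ones converge. Near the critical step size the boundary of the convergent set becomes intricate; the step size and grid here are illustrative, and the paper works with scalar-vector and matrix factorizations rather than this toy.

```python
import numpy as np

def converges(u, v, step=0.9, y=1.0, iters=400, tol=1e-6, blowup=1e6):
    for _ in range(iters):
        r = u * v - y                                  # residual
        u, v = u - step * v * r, v - step * u * r      # simultaneous GD update
        if abs(r) > blowup:
            return False                               # diverged
    return abs(u * v - y) < tol

grid = np.linspace(-2.0, 2.0, 100)
basin = np.array([[converges(u, v) for u in grid] for v in grid])
print(f"fraction of convergent initializations: {basin.mean():.3f}")
```

At step = 0.9 only the near-balanced minimizers of this toy loss remain linearly stable, so which minimizer (if any) is reached depends sensitively on the initialization, mirroring the sensitivity the paper analyzes.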

[420] AReUReDi: Annealed Rectified Updates for Refining Discrete Flows with Multi-Objective Guidance

Tong Chen, Yinuo Zhang, Pranam Chatterjee

Main category: cs.LG

TL;DR: AReUReDi is a discrete optimization algorithm that guarantees convergence to Pareto optimal solutions for multi-objective sequence design, outperforming existing methods in therapeutic peptide and SMILES sequence optimization.

DetailsMotivation: Existing generative frameworks operate in continuous spaces with single-objective guidance, while discrete approaches lack guarantees for multi-objective Pareto optimality in therapeutic sequence design.

Method: Builds on Rectified Discrete Flows (ReDi) and combines Tchebycheff scalarization, locally balanced proposals, and annealed Metropolis-Hastings updates to bias sampling toward Pareto-optimal states while preserving distributional invariance.

Result: Successfully optimizes up to five therapeutic properties simultaneously (affinity, solubility, hemolysis, half-life, non-fouling) and outperforms both evolutionary and diffusion-based baselines in peptide and SMILES sequence design.

Conclusion: AReUReDi establishes a powerful sequence-based framework for multi-property biomolecule generation with theoretical guarantees of Pareto optimality.

Abstract: Designing sequences that satisfy multiple, often conflicting, objectives is a central challenge in therapeutic and biomolecular engineering. Existing generative frameworks largely operate in continuous spaces with single-objective guidance, while discrete approaches lack guarantees for multi-objective Pareto optimality. We introduce AReUReDi (Annealed Rectified Updates for Refining Discrete Flows), a discrete optimization algorithm with theoretical guarantees of convergence to the Pareto front. Building on Rectified Discrete Flows (ReDi), AReUReDi combines Tchebycheff scalarization, locally balanced proposals, and annealed Metropolis-Hastings updates to bias sampling toward Pareto-optimal states while preserving distributional invariance. Applied to peptide and SMILES sequence design, AReUReDi simultaneously optimizes up to five therapeutic properties (including affinity, solubility, hemolysis, half-life, and non-fouling) and outperforms both evolutionary and diffusion-based baselines. These results establish AReUReDi as a powerful, sequence-based framework for multi-property biomolecule generation.
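
A minimal numpy sketch of the sampler's core loop under toy assumptions: Tchebycheff scalarization collapses the objective vector to a scalar, a single-site mutation proposes a neighbor, and an annealed Metropolis-Hastings rule accepts or rejects it. The property predictors, weights, and schedule below are illustrative stand-ins, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHABET = list("ACDEFGHIKLMNPQRSTVWY")          # 20 amino acids

def objectives(seq):
    # Stand-in for learned property predictors; both terms are to be minimized.
    return np.array([seq.count("A") / len(seq), -seq.count("K") / len(seq)])

def tchebycheff(f, weights, ideal):
    return np.max(weights * (f - ideal))         # weighted Chebyshev scalarization

def anneal(seq, weights, ideal, steps=2000, t0=1.0):
    score = tchebycheff(objectives(seq), weights, ideal)
    for k in range(steps):
        temp = t0 * (1 - k / steps) + 1e-3       # linear annealing schedule
        pos = rng.integers(len(seq))
        cand = seq[:pos] + rng.choice(ALPHABET) + seq[pos + 1:]  # single-site mutation
        cand_score = tchebycheff(objectives(cand), weights, ideal)
        delta = cand_score - score
        if delta <= 0 or rng.random() < np.exp(-delta / temp):   # MH acceptance
            seq, score = cand, cand_score
    return seq, score

start = "".join(rng.choice(ALPHABET, size=30))
print(anneal(start, weights=np.array([0.5, 0.5]), ideal=np.array([0.0, -1.0])))
```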

[421] SCOPED: Score-Curvature Out-of-distribution Proximity Evaluator for Diffusion

Brett Barkley, Preston Culbertson, David Fridovich-Keil

Main category: cs.LG

TL;DR: SCOPED is a fast OOD detection method for diffusion models that reduces forward passes by an order of magnitude, combining Jacobian trace and score function norm into a single test statistic with kernel density estimation.

DetailsMotivation: Out-of-distribution detection is essential for reliable deployment of machine learning systems across various domains like vision, robotics, and reinforcement learning.

Method: Combines the Jacobian trace and the squared norm of the model’s score function into a single test statistic, estimates the in-distribution density of these scores with kernel density estimation, and in the simplest case requires only one forward pass and one Jacobian-vector product via Hutchinson’s trace estimator.

Result: Achieves competitive or state-of-the-art precision-recall scores on four vision benchmarks despite low computational cost, and generalizes to robotic control tasks with shared state and action spaces.

Conclusion: SCOPED positions as a practical foundation for fast and reliable OOD detection in real-world domains including vision, outlier detection, reinforcement learning, and dataset curation.

Abstract: Out-of-distribution (OOD) detection is essential for reliable deployment of machine learning systems in vision, robotics, reinforcement learning, and beyond. We introduce Score-Curvature Out-of-distribution Proximity Evaluator for Diffusion (SCOPED), a fast and general-purpose OOD detection method for diffusion models that reduces the number of forward passes on the trained model by an order of magnitude compared to prior methods, outperforming most diffusion-based baselines and closely approaching the accuracy of the strongest ones. SCOPED is computed from a single diffusion model trained once on a diverse dataset, and combines the Jacobian trace and squared norm of the model’s score function into a single test statistic. Rather than thresholding on a fixed value, we estimate the in-distribution density of SCOPED scores using kernel density estimation, enabling a flexible, unsupervised test that, in the simplest case, only requires a single forward pass and one Jacobian-vector product (JVP), made efficient by Hutchinson’s trace estimator. On four vision benchmarks, SCOPED achieves competitive or state-of-the-art precision-recall scores despite its low computational cost. The same method generalizes to robotic control tasks with shared state and action spaces, identifying distribution shifts across reward functions and training regimes. These results position SCOPED as a practical foundation for fast and reliable OOD detection in real-world domains, including perceptual artifacts in vision, outlier detection in autoregressive models, exploration in reinforcement learning, and dataset curation for unsupervised training.
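
A minimal PyTorch sketch of the SCOPED test statistic: the score network is an untrained stand-in for a trained diffusion model's score function, and the Jacobian trace is estimated from a single Jacobian-vector product via Hutchinson's estimator. The two terms are simply added here; the paper's exact combination may differ.

```python
import torch
import torch.nn as nn

score_net = nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 16))

def scoped_statistic(x):
    v = torch.randn_like(x)            # random probe for Hutchinson's estimator
    s, jvp = torch.autograd.functional.jvp(score_net, (x,), (v,))
    trace_est = (v * jvp).sum(dim=-1)  # E[v^T J v] = tr(J)
    sq_norm = (s ** 2).sum(dim=-1)     # ||score(x)||^2
    return trace_est + sq_norm         # one scalar statistic per sample

print(scoped_statistic(torch.randn(8, 16)))
```

In-distribution scores would then be fit with a kernel density estimate (e.g., scipy.stats.gaussian_kde) so that new samples can be scored without a fixed threshold.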

[422] Fixing That Free Lunch: When, Where, and Why Synthetic Data Fails in Model-Based Policy Optimization

Brett Barkley, David Fridovich-Keil

Main category: cs.LG

TL;DR: MBPO’s synthetic data can degrade performance in DMC tasks despite success in Gym. Two failure modes identified: scale mismatches in dynamics/reward models causing critic underestimation, and poor target representation inflating model variance. Fixing these enables MBPO to outperform SAC in most DMC tasks.

DetailsMotivation: To understand why MBPO underperforms in DeepMind Control Suite compared to OpenAI Gym, identify failure modes, and enable policy improvement where previously impossible.

Method: Analyzed MBPO’s performance across seven DMC tasks, identified two coupled failure modes: scale mismatches between dynamics/reward models and poor target representation choice.

Result: After addressing failure modes, MBPO outperformed SAC in 5 out of 7 DMC tasks while maintaining strong Gym performance.

Conclusion: Environment-specific assumptions can become encoded in algorithm design when evaluation is limited. Need taxonomies linking MDP structure to failure modes and unified solutions.

Abstract: Synthetic data is a core component of data-efficient Dyna-style model-based reinforcement learning, yet it can also degrade performance. We study when it helps, where it fails, and why, and we show that addressing the resulting failure modes enables policy improvement that was previously unattainable. We focus on Model-Based Policy Optimization (MBPO), which performs actor and critic updates using synthetic action counterfactuals. Despite reports of strong and generalizable sample-efficiency gains in OpenAI Gym, recent work shows that MBPO often underperforms its model-free counterpart, Soft Actor-Critic (SAC), in the DeepMind Control Suite (DMC). Although both suites involve continuous control with proprioceptive robots, this shift leads to sharp performance losses across seven challenging DMC tasks, with MBPO failing in cases where claims of generalization from Gym would imply success. This reveals how environment-specific assumptions can become implicitly encoded into algorithm design when evaluation is limited. We identify two coupled issues behind these failures: scale mismatches between dynamics and reward models that induce critic underestimation and hinder policy improvement during model-policy coevolution, and a poor choice of target representation that inflates model variance and produces error-prone rollouts. Addressing these failure modes enables policy improvement where none was previously possible, allowing MBPO to outperform SAC in five of seven tasks while preserving the strong performance previously reported in OpenAI Gym. Rather than aiming only for incremental average gains, we hope our findings motivate the community to develop taxonomies that tie MDP task- and environment-level structure to algorithmic failure modes, pursue unified solutions where possible, and clarify how benchmark choices ultimately shape the conditions under which algorithms generalize.

[423] Fine-tuning LLMs with variational Bayesian last layer for high-dimensional Bayesian optimization

Haotian Xiang, Jinwen Xu, Qin Lu

Main category: cs.LG

TL;DR: The paper proposes using Large Language Models (LLMs) as surrogates for Bayesian optimization in high-dimensional black-box optimization problems with irregular variables, combining LoRA fine-tuning with variational Bayesian last layer for efficient adaptation.

DetailsMotivation: Bayesian optimization struggles with high-dimensional problems containing irregular variables (categorical, ordinal) where Gaussian processes perform poorly. Neural network surrogates help but need efficient adaptation.

Method: Use LLM as surrogate model, fine-tune with LoRA and variational Bayesian last layer (VBLL) for efficient parameter adaptation. Develop ensemble approach (ENS-LoRA-VBLL) for automated hyperparameter selection.

Result: The proposed methods show compelling performance on high-dimensional benchmarks and real-world molecular optimization tasks, being computationally light and allowing recursive updates.

Conclusion: LLM-based surrogates with LoRA-VBLL adaptation provide an effective solution for high-dimensional Bayesian optimization with irregular variables, outperforming traditional approaches.

Abstract: A plethora of applications entail solving black-box optimization problems with high evaluation costs, including drug discovery, material design, as well as hyperparameter tuning. Toward finding the global optimum of such black-box optimization problems with sample efficiency, Bayesian optimization (BO) is a theoretically elegant framework that relies on a probabilistic surrogate model so as to iteratively select the query point with well-balanced exploration-exploitation tradeoffs. The Gaussian process (GP), as the de-facto choice for surrogate modeling, has achieved compelling performances for vanilla BO with low-dimensional continuous variables. However, GPs fall short in coping with high-dimensional counterparts with {\it irregular} variables (e.g., categorical, ordinal, etc.). To alleviate this, neural network-based surrogates have been explored. Inspired by the powerful capabilities of LLMs, we adopt the LLM as the surrogate to model the mapping from the high-dimensional input variables to the objective function. To adapt to the current problem, we leverage the low-rank adaptation (LoRA) to fine-tune the LLM parameters together with the posterior of a linear regression head via the variational Bayesian last layer (VBLL) framework. The resulting LoRA-VBLL is not only computationally light compared to existing alternatives, but also admits recursive updates. To automate the critical selection of the LoRA rank as well as other hyperparameters, a weighted ensemble (ENS) of LoRA-VBLL surrogates has been devised, which further accommodates continual update of the per-model weight and individual LoRA-VBLL parameters via recursive Bayes. Extensive experimental results demonstrate the compelling performance of the proposed (ENS-)LoRA-VBLL approaches on various high-dimensional benchmarks and the real-world molecular optimization tasks.
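
A minimal numpy sketch of the recursive update that makes a Bayesian last layer attractive here: a conjugate Bayesian linear-regression head over frozen features, refreshed one observation at a time with a rank-one (Kalman-style) update. The feature vectors and noise level are illustrative stand-ins for the LoRA-tuned LLM's representations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                # feature dimension of the surrogate head
mu = np.zeros(d)                     # prior mean over head weights
Sigma = np.eye(d)                    # prior covariance
noise_var = 0.1                      # observation noise variance (assumed known)

def update(mu, Sigma, phi, y):
    """Rank-one posterior update after observing features phi and value y."""
    Sphi = Sigma @ phi
    gain = Sphi / (noise_var + phi @ Sphi)
    mu = mu + gain * (y - phi @ mu)
    Sigma = Sigma - np.outer(gain, Sphi)
    return mu, Sigma

for _ in range(20):                  # stream of BO observations
    phi = rng.normal(size=d)         # stand-in for LLM features of a query point
    y = phi[0] + rng.normal(scale=0.3)
    mu, Sigma = update(mu, Sigma, phi, y)
print(mu.round(2))                   # first coordinate dominates, as constructed
```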

[424] Octax: Accelerated CHIP-8 Arcade Environments for Reinforcement Learning in JAX

Waris Radji, Thomas Michel, Hector Piteau

Main category: cs.LG

TL;DR: Octax is a high-performance JAX-based suite of classic arcade game environments that provides GPU-accelerated alternatives to CPU-bound Atari benchmarks, enabling massive-scale RL experimentation with orders-of-magnitude speedups.

DetailsMotivation: Current RL research needs diverse, challenging environments that are both tractable and scalable. Modern video games are computationally expensive and CPU-bound, limiting large-scale experimentation.

Method: Implemented classic arcade game environments in JAX based on CHIP-8 emulation, creating GPU-accelerated alternatives to traditional CPU emulators while maintaining perfect fidelity to original game mechanics.

Result: Achieved orders-of-magnitude speedups over traditional CPU emulators, demonstrated significant improvements in RL training speed and scalability across multiple games.

Conclusion: Octax provides an ideal platform for large-scale RL experimentation with its modular design that enables easy extension with new games or generation of novel environments using LLMs.

Abstract: Reinforcement learning (RL) research requires diverse, challenging environments that are both tractable and scalable. While modern video games may offer rich dynamics, they are computationally expensive and poorly suited for large-scale experimentation due to their CPU-bound execution. We introduce Octax, a high-performance suite of classic arcade game environments implemented in JAX, based on CHIP-8 emulation, a predecessor to Atari, which is widely adopted as a benchmark in RL research. Octax provides the JAX community with a long-awaited end-to-end GPU alternative to the Atari benchmark, offering image-based environments, spanning puzzle, action, and strategy genres, all executable at massive scale on modern GPUs. Our JAX-based implementation achieves orders-of-magnitude speedups over traditional CPU emulators while maintaining perfect fidelity to the original game mechanics. We demonstrate Octax’s capabilities by training RL agents across multiple games, showing significant improvements in training speed and scalability compared to existing solutions. The environment’s modular design enables researchers to easily extend the suite with new games or generate novel environments using large language models, making it an ideal platform for large-scale RL experimentation.
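
A minimal JAX sketch of the design principle Octax exploits: a pure, jittable environment step vmapped across thousands of instances so whole batches of environments advance in one GPU call. The toy step function is illustrative and is not Octax's actual API.

```python
import jax
import jax.numpy as jnp

def env_step(state, action):
    """Pure-function step: (state, action) -> (next_state, reward)."""
    next_state = state + jnp.where(action == 1, 1, -1)
    reward = jnp.where(jnp.abs(next_state) < 5, 1.0, 0.0)
    return next_state, reward

batched_step = jax.jit(jax.vmap(env_step))   # one step for many envs at once

n_envs = 4096
states = jnp.zeros(n_envs, dtype=jnp.int32)
actions = jnp.ones(n_envs, dtype=jnp.int32)
states, rewards = batched_step(states, actions)
print(rewards.sum())
```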

[425] PepCompass: Navigating peptide embedding spaces using Riemannian Geometry

Marcin Możejko, Adam Bielecki, Jurand Prądzyński, Marcin Traskowski, Antoni Janowski, Karol Jurasz, Michał Kucharczyk, Hyun-Su Lee, Marcelo Der Torossian Torres, Cesar de la Fuente-Nunez, Paulina Szymczak, Michał Kmicikiewicz, Ewa Szczurek

Main category: cs.LG

TL;DR: PepCompass is a geometry-aware framework for antimicrobial peptide discovery that uses Riemannian manifolds to better navigate peptide space, enabling more efficient exploration and optimization through local sampling methods and geodesic interpolation.

DetailsMotivation: Current generative models for peptide discovery use flat Euclidean metrics that distort the true geometry of peptide space, making exploration and optimization inefficient. The astronomical size of peptide space and scarcity of active peptides requires better geometric understanding.

Method: Introduces Union of κ-Stable Riemannian Manifolds to capture local geometry. Uses Second-Order Riemannian Brownian Efficient Sampling and Mutation Enumeration in Tangent Space for local exploration. Combines these in Local Enumeration Bayesian Optimization (LE-BO) for activity optimization. Develops Potential-minimizing Geodesic Search (PoGS) to interpolate along property-enriched geodesics.

Result: In-vitro validation shows PoGS yields four novel seeds, and subsequent optimization with LE-BO discovers 25 highly active peptides with broad-spectrum activity, including against resistant bacterial strains.

Conclusion: Geometry-informed exploration provides a powerful new paradigm for antimicrobial peptide design, overcoming limitations of conventional flat Euclidean approaches.

Abstract: Antimicrobial peptide discovery is challenged by the astronomical size of peptide space and the relative scarcity of active peptides. Generative models provide continuous latent “maps” of peptide space, but conventionally ignore decoder-induced geometry and rely on flat Euclidean metrics, rendering exploration and optimization distorted and inefficient. Prior manifold-based remedies assume fixed intrinsic dimensionality, which critically fails in practice for peptide data. Here, we introduce PepCompass, a geometry-aware framework for peptide exploration and optimization. At its core, we define a Union of $\kappa$-Stable Riemannian Manifolds $\mathbb{M}^{\kappa}$, a family of decoder-induced manifolds that captures local geometry while ensuring computational stability. We propose two local exploration methods: Second-Order Riemannian Brownian Efficient Sampling, which provides a convergent second-order approximation to Riemannian Brownian motion, and Mutation Enumeration in Tangent Space, which reinterprets tangent directions as discrete amino-acid substitutions. Combining these yields Local Enumeration Bayesian Optimization (LE-BO), an efficient algorithm for local activity optimization. Finally, we introduce Potential-minimizing Geodesic Search (PoGS), which interpolates between prototype embeddings along property-enriched geodesics, biasing discovery toward seeds, i.e. peptides with favorable activity. In-vitro validation confirms the effectiveness of PepCompass: PoGS yields four novel seeds, and subsequent optimization with LE-BO discovers 25 highly active peptides with broad-spectrum activity, including against resistant bacterial strains. These results demonstrate that geometry-informed exploration provides a powerful new paradigm for antimicrobial peptide design.
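
A minimal PyTorch sketch of the decoder-induced (pullback) geometry underlying this framework: the local metric at a latent point z is G(z) = J(z)^T J(z), with J the decoder Jacobian, so latent step lengths are measured by how far the decoded output moves. The decoder is an untrained stand-in, and the paper's kappa-stability construction is omitted.

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 100))

def pullback_metric(z):
    J = torch.autograd.functional.jacobian(decoder, z)   # shape (100, 4)
    return J.T @ J                                       # local metric tensor G(z)

def riemannian_length(z, dz):
    """Length of an infinitesimal latent step dz under the induced metric."""
    G = pullback_metric(z)
    return torch.sqrt(dz @ G @ dz)

z = torch.randn(4)
print(riemannian_length(z, 0.01 * torch.randn(4)))
```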

[426] C2AL: Cohort-Contrastive Auxiliary Learning for Large-scale Recommendation Systems

Mertcan Cokbas, Ziteng Liu, Zeyi Tao, Elder Veliz, Qin Huang, Ellie Wen, Huayu Li, Qiang Jin, Murat Duman, Benjamin Au, Guy Lebanon, Sagar Chordia, Chengkai Zhang

Main category: cs.LG

TL;DR: The paper proposes C2AL, a method that uses auxiliary learning with conflicting labels to address heterogeneity in recommendation models, improving performance on minority cohorts while maintaining global performance.

DetailsMotivation: Large-scale recommendation models trained under a single global objective assume user homogeneity, but real-world data contains heterogeneous cohorts. This leads to models being dominated by central distribution patterns while neglecting head and tail regions, causing inactive attention weights and dead neurons.

Method: The approach analyzes dataset substructures and exposes those with strong distributional contrast through auxiliary learning. It leverages partially conflicting auxiliary labels to regularize shared representations, customizing attention layers to preserve mutual information with minority cohorts while improving global performance.

Result: Evaluation on massive production datasets with billions of data points across six SOTA models showed that the factorization machine captures fine-grained user-ad interactions, achieving up to 0.16% reduction in normalized entropy overall and gains exceeding 0.30% on targeted minority cohorts.

Conclusion: The attention mechanism plays a key role in factorization machines for shared embedding selection, and using auxiliary learning with conflicting labels effectively addresses heterogeneity issues in recommendation systems, benefiting both global performance and minority cohorts.

Abstract: Training large-scale recommendation models under a single global objective implicitly assumes homogeneity across user populations. However, real-world data are composites of heterogeneous cohorts with distinct conditional distributions. As models increase in scale and complexity and as more data is used for training, they become dominated by central distribution patterns, neglecting head and tail regions. This imbalance limits the model’s learning ability and can result in inactive attention weights or dead neurons. In this paper, we reveal how the attention mechanism can play a key role in factorization machines for shared embedding selection, and propose to address this challenge by analyzing the substructures in the dataset and exposing those with strong distributional contrast through auxiliary learning. Unlike previous research, which heuristically applies weighted labels or multi-task heads to mitigate such biases, we leverage partially conflicting auxiliary labels to regularize the shared representation. This approach customizes the learning process of attention layers to preserve mutual information with minority cohorts while improving global performance. We evaluated C2AL on massive production datasets with billions of data points each for six SOTA models. Experiments show that the factorization machine is able to capture fine-grained user-ad interactions using the proposed method, achieving up to a 0.16% reduction in normalized entropy overall and delivering gains exceeding 0.30% on targeted minority cohorts.
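
A minimal PyTorch sketch of the auxiliary-learning setup this summary describes: a shared representation trained on the global objective plus an auxiliary head whose cohort labels regularize the shared embedding. The tiny model, cohort count, and loss weighting are illustrative stand-ins for a production-scale ranker.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

shared = nn.Sequential(nn.Linear(64, 32), nn.ReLU())   # shared representation
main_head = nn.Linear(32, 1)    # global CTR-style objective
aux_head = nn.Linear(32, 4)     # cohort labels with distributional contrast

def c2al_loss(x, y_main, y_cohort, aux_weight=0.2):
    h = shared(x)
    main = F.binary_cross_entropy_with_logits(main_head(h).squeeze(-1), y_main)
    aux = F.cross_entropy(aux_head(h), y_cohort)   # regularizes shared features
    return main + aux_weight * aux

x = torch.randn(128, 64)
loss = c2al_loss(x, torch.randint(0, 2, (128,)).float(), torch.randint(0, 4, (128,)))
loss.backward()
print(loss.item())
```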

cs.MA

cs.MM

[427] Detecting Notational Errors in Digital Music Scores

Léo Géré, Nicolas Audebert, Florent Jacquemard

Main category: cs.MM

TL;DR: Automated detection of notational errors in digital music scores, focusing on rhythm/time inconsistencies and contextual errors, with application to the ASAP piano score dataset showing 40% of scores contain errors.

DetailsMotivation: Digital music scores vary widely in quality due to software specificity, conversion issues, and user inputs, creating problems for musical information extraction and retrieval.

Method: Modular state machine approach to detect two types of errors: rhythm/time inconsistencies in individual elements and contextual errors that break common Western music notation rules.

Result: Applied to ASAP piano score dataset, found around 40% of scores contain at least one notational error, with manual fixes implemented to improve dataset quality.

Conclusion: The automated error-detection method effectively identifies and localizes notational defects in music scores, enhancing data quality for musical information processing.

Abstract: Music scores are used to precisely store music pieces for transmission and preservation. To represent and manipulate these complex objects, various formats have been tailored for different use cases. While music notation follows specific rules, digital formats usually enforce them leniently. Hence, digital music scores widely vary in quality, due to software and format specificity, conversion issues, and dubious user inputs. Problems range from minor engraving discrepancies to major notation mistakes. Yet, data quality is a major issue when dealing with musical information extraction and retrieval. We present an automated approach to detect notational errors, aiming at precisely localizing defects in scores. We identify two types of errors: i) rhythm/time inconsistencies in the encoding of individual musical elements, and ii) contextual errors, i.e. notation mistakes that break commonly accepted musical rules. We implement the latter using a modular state machine that can be easily extended to include rules representing the usual conventions from the common Western music notation. Finally, we apply this error-detection method to the piano score dataset ASAP. We highlight that around 40% of the scores contain at least one notational error, and manually fix multiple of them to enhance the dataset’s quality.
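
A minimal Python sketch of the kind of contextual rule such a checker encodes: the note durations in each measure must sum to the duration implied by the time signature. The data structures are simplified stand-ins for a real score encoding such as MusicXML or MEI.

```python
from fractions import Fraction

def check_measures(measures, beats=4, beat_unit=4):
    """Yield (index, actual, expected) for measures violating the time signature."""
    expected = Fraction(beats, beat_unit)          # e.g., 4/4 -> one whole note
    for i, durations in enumerate(measures):
        total = sum(durations, Fraction(0))
        if total != expected:
            yield i, total, expected

score = [
    [Fraction(1, 4)] * 4,                          # well-formed 4/4 measure
    [Fraction(1, 4), Fraction(1, 8)],              # under-filled measure: an error
]
for idx, actual, expected in check_measures(score):
    print(f"measure {idx}: sums to {actual}, expected {expected}")
```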

[428] MHier-RAG: Multi-Modal RAG for Visual-Rich Document Question-Answering via Hierarchical and Multi-Granularity Reasoning

Ziyu Gong, Chengcheng Mai, Yihua Huang

Main category: cs.MM

TL;DR: MHier-RAG is a novel multi-modal RAG model that addresses hallucinations in LVLM-based methods and inter-modal disconnection in traditional RAG approaches for long-context document QA.

DetailsMotivation: Existing methods for multi-modal long-context document QA suffer from hallucinations (in LVLM-based approaches) and inter-modal disconnection/cross-page fragmentation (in RAG-based methods), limiting their effectiveness in handling visual-rich documents across multiple pages.

Method: Proposed MHier-RAG with hierarchical indexing combining flattened in-page chunks and topological cross-page chunks, plus multi-granularity semantic retrieval using joint similarity evaluation and LLM-based re-ranking for page-level parent page retrieval and document-level summary retrieval.

Result: Experimental results on MMLongBench-Doc and LongDocURL datasets demonstrated superior performance in understanding and answering modality-rich, multi-page documents compared to existing methods.

Conclusion: MHier-RAG effectively addresses challenges in multi-modal long-context document QA by establishing in-page multi-modal associations and long-distance cross-page dependencies through hierarchical indexing and multi-granularity retrieval.

Abstract: The multi-modal long-context document question-answering task aims to locate and integrate multi-modal evidences (such as texts, tables, charts, images, and layouts) distributed across multiple pages, for question understanding and answer generation. The existing methods can be categorized into Large Vision-Language Model (LVLM)-based and Retrieval-Augmented Generation (RAG)-based methods. However, the former were susceptible to hallucinations, while the latter struggled with inter-modal disconnection and cross-page fragmentation. To address these challenges, a novel multi-modal RAG model, named MHier-RAG, was proposed, leveraging both textual and visual information across long-range pages to facilitate accurate question answering for visual-rich documents. A hierarchical indexing method with the integration of flattened in-page chunks and topological cross-page chunks was designed to jointly establish in-page multi-modal associations and long-distance cross-page dependencies. By means of joint similarity evaluation and large language model (LLM)-based re-ranking, a multi-granularity semantic retrieval method, including the page-level parent page retrieval and document-level summary retrieval, was proposed to foster multi-modal evidence connection and long-distance evidence integration and reasoning. Experimental results performed on public datasets, MMLongBench-Doc and LongDocURL, demonstrated the superiority of our MHier-RAG method in understanding and answering modality-rich and multi-page documents.
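
A minimal numpy sketch of multi-granularity retrieval in the spirit of this method: chunks are scored by embedding similarity, the top hits are expanded to their parent pages, and a document-level summary is always retained. The embeddings are random stand-ins, and the paper's LLM-based re-ranking step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(50, 64))        # in-page chunk embeddings
chunk_page = rng.integers(0, 10, size=50)     # page each chunk belongs to
summary = "document-level summary"            # coarse context, always available

def retrieve(query_vec, k=5):
    sims = chunk_vecs @ query_vec             # similarity scores per chunk
    top = np.argsort(-sims)[:k]               # top-k chunks
    parent_pages = sorted({int(chunk_page[i]) for i in top})  # expand to pages
    return {"chunks": top.tolist(), "pages": parent_pages, "summary": summary}

print(retrieve(rng.normal(size=64)))
```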

eess.AS

[429] WEE-Therapy: A Mixture of Weak Encoders Framework for Psychological Counseling Dialogue Analysis

Yongqi Kang, Yong Zhao

Main category: eess.AS

TL;DR: WEE-Therapy is a multi-task AudioLLM that uses a Weak Encoder Ensemble mechanism with dual-routing strategy to improve understanding of counseling dialogues, achieving significant performance gains in emotion recognition, technique classification, risk detection, and summarization.

DetailsMotivation: Existing audio language models struggle to capture domain-specific features in counseling dialogues like complex emotions and professional techniques, limiting their effectiveness in computational psychology applications.

Method: Proposes WEE-Therapy with Weak Encoder Ensemble mechanism that supplements a powerful base encoder with lightweight specialized encoders, using a dual-routing strategy combining stable domain knowledge with dynamic expert selection.

Result: Achieves significant performance gains across all tasks (emotion recognition, technique classification, risk detection, summarization) with minimal parameter overhead.

Conclusion: WEE-Therapy demonstrates strong potential for AI-assisted clinical analysis in computational psychology by effectively capturing domain-specific counseling features.

Abstract: The advancement of computational psychology requires AI tools capable of deeply understanding counseling dialogues. Existing audio language models (AudioLLMs) often rely on single speech encoders pre-trained on general data, struggling to capture domain-specific features like complex emotions and professional techniques. To address this, we propose WEE-Therapy, a multi-task AudioLLM incorporating a Weak Encoder Ensemble (WEE) mechanism. This supplements a powerful base encoder with a pool of lightweight, specialized encoders. A novel dual-routing strategy combines stable, data-independent domain knowledge with dynamic, data-dependent expert selection. Evaluated on emotion recognition, technique classification, risk detection, and summarization, WEE-Therapy achieves significant performance gains across all tasks with minimal parameter overhead, demonstrating strong potential for AI-assisted clinical analysis.
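
A minimal PyTorch sketch of a weak-encoder ensemble with dual routing: a strong base encoder is supplemented by lightweight experts whose outputs are mixed with weights combining a learned data-independent prior and a data-dependent gate. All modules and dimensions are illustrative stand-ins.

```python
import torch
import torch.nn as nn

d = 128
base = nn.Linear(400, d)                       # stand-in strong base encoder
experts = nn.ModuleList([nn.Linear(400, d) for _ in range(4)])  # weak encoders
static_logits = nn.Parameter(torch.zeros(4))   # data-independent routing prior
gate = nn.Linear(400, 4)                       # data-dependent routing

def encode(x):
    w = torch.softmax(static_logits + gate(x), dim=-1)        # dual routing
    expert_out = torch.stack([e(x) for e in experts], dim=1)  # (batch, 4, d)
    return base(x) + (w.unsqueeze(-1) * expert_out).sum(dim=1)

print(encode(torch.randn(2, 400)).shape)  # torch.Size([2, 128])
```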

[430] SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis

Lukas Buess, Jan Geier, David Bani-Harouni, Chantal Pellegrini, Matthias Keicher, Paula Andrea Perez-Toro, Nassir Navab, Andreas Maier, Tomas Arias-Vergara

Main category: eess.AS

TL;DR: This paper explores learning visual-language representations directly from spoken radiology reports using speech-to-image alignment, achieving near-text performance through knowledge distillation from text-image models.

DetailsMotivation: Clinical workflows heavily rely on spoken communication (e.g., radiology dictation), but current medical AI systems only use written text, creating a gap between clinical practice and AI capabilities.

Method: Created Speech-RATE dataset of spoken radiology reports and trained SpeechCT-CLIP, a contrastive model aligning speech and 3D CT volumes using knowledge distillation from pretrained text-image CLIP models.

Result: Speech-based models initially underperformed text counterparts, but knowledge distillation improved zero-shot classification F1 from 0.623 to 0.705 (88% performance gap recovered) and achieved strong retrieval without requiring text at inference.

Conclusion: Speech is a practical alternative to text in multimodal pretraining, enabling voice-driven diagnostic support tools that align with clinical workflows.

Abstract: Spoken communication plays a central role in clinical workflows. In radiology, for example, most reports are created through dictation. Yet, nearly all medical AI systems rely exclusively on written text. In this work, we address this gap by exploring the feasibility of learning visual-language representations directly from spoken radiology reports. Specifically, we synthesize a large-scale dataset (Speech-RATE) of spoken radiology reports and train SpeechCT-CLIP, a contrastive model that aligns speech and 3D CT volumes in a shared representation space. While naive speech-based models underperform compared to text-trained counterparts, we show that knowledge distillation from a pretrained text-image CLIP model effectively transfers semantic alignment capabilities from text to speech, substantially narrowing this gap. Experiments demonstrate improved zero-shot classification F1 from 0.623 to 0.705, recovering 88% of the performance difference, and strong retrieval results without requiring text at inference. These findings highlight speech as a practical alternative to text in multimodal pretraining and open the door to voice-driven diagnostic support tools in clinical practice.
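
A minimal PyTorch sketch of the distillation idea this summary describes: a speech encoder is trained so its embeddings match those of a frozen, pretrained text encoder on the same report, transferring the teacher's alignment to speech. The encoders and the plain cosine loss are stand-ins; the paper's architecture and losses may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

speech_encoder = nn.Linear(128, 64)                       # trainable student
text_encoder = nn.Linear(300, 64).requires_grad_(False)   # frozen CLIP-style teacher

def distill_loss(speech_feats, text_feats):
    s = F.normalize(speech_encoder(speech_feats), dim=-1)
    t = F.normalize(text_encoder(text_feats), dim=-1)
    return 1 - (s * t).sum(-1).mean()          # cosine distance to the teacher

loss = distill_loss(torch.randn(16, 128), torch.randn(16, 300))
loss.backward()
print(loss.item())
```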

[431] When Voice Matters: Evidence of Gender Disparity in Positional Bias of SpeechLLMs

Shree Harsha Bokkahalli Satish, Gustav Eje Henter, Éva Székely

Main category: eess.AS

TL;DR: This paper presents the first token-level probabilistic evaluation of multiple choice question answering (MCQA) benchmarks for SpeechLLM systems, examining how temperature, prompt design, and voice gender affect gender and positional biases.

DetailsMotivation: The rapid development of SpeechLLM-based conversational AI systems has created a need for robust benchmarking that includes fairness and bias assessment, but current MCQA benchmarks may not adequately capture speech-based biases.

Method: The study conducts token-level probabilistic evaluation and response-based analysis of MCQA benchmarks, examining: 1) effects of model temperature and prompt design on gender/positional bias, 2) impact of input voice gender on biases, and 3) generalizability across different gender-bias benchmarks.

Result: Results show that positional bias concerns from text domains are equally valid in speech domains, with stronger effects for female voices than male voices. This is the first study to isolate positional bias effects in SpeechLLM gender-bias benchmarks.

Conclusion: Current MCQA benchmarks do not account for speech-based bias, and alternative strategies are needed to ensure fairness towards all users in SpeechLLM systems.

Abstract: The rapid development of SpeechLLM-based conversational AI systems has created a need for robustly benchmarking these efforts, including aspects of fairness and bias. At present, such benchmarks typically rely on multiple choice question answering (MCQA). In this paper, we present the first token-level probabilistic evaluation and response-based study of several issues affecting the use of MCQA in SpeechLLM benchmarking: 1) we examine how model temperature and prompt design affect gender and positional bias on an MCQA gender-bias benchmark; 2) we examine how these biases are affected by the gender of the input voice; and 3) we study to what extent observed trends carry over to a second gender-bias benchmark. Our results show that concerns about positional bias from the text domain are equally valid in the speech domain. We also find the effect to be stronger for female voices than for male voices. To our knowledge, this is the first study to isolate positional bias effects in SpeechLLM-based gender-bias benchmarks. We conclude that current MCQA benchmarks do not account for speech-based bias and alternative strategies are needed to ensure fairness towards all users.

[432] Multi-Source Position and Direction-of-Arrival Estimation Based on Euclidean Distance Matrices

Klaus Brümann, Simon Doclo

Main category: eess.AS

TL;DR: Proposes EDM-based methods for multi-source position and DOA estimation that reduce computational complexity compared to SRP beamforming by exploiting Euclidean distance matrix properties.

DetailsMotivation: SRP-based methods for 3D sound source localization require joint optimization of 3 continuous variables for position or 2 for DOA estimation, which is computationally expensive.

Method: Uses Euclidean distance matrices and Gram matrices to reduce optimization variables: single continuous distance variable for position estimation, and eliminates continuous optimization for DOA estimation by defining relative coordinate systems aligned with source directions.

Result: Significantly reduces computational cost compared to SRP-based methods while consistently outperforming SRP in two-source position and DOA estimation accuracy across different configurations.

Conclusion: EDM-based methods provide more efficient and accurate alternatives to SRP beamforming for multi-source sound localization by leveraging mathematical properties of distance matrices.

Abstract: A popular method to estimate the positions or directions-of-arrival (DOAs) of multiple sound sources using an array of microphones is based on steered-response power (SRP) beamforming. For a three-dimensional scenario, SRP-based methods need to jointly optimize three continuous variables for position estimation or two continuous variables for DOA estimation, which can be computationally expensive. In this paper, we propose novel methods for multi-source position and DOA estimation by exploiting properties of Euclidean distance matrices (EDMs) and their respective Gram matrices. In the proposed multi-source position estimation method only a single continuous variable, representing the distance between each source and a reference microphone, needs to be optimized. For each source, the optimal continuous distance variable and set of candidate time-difference of arrival (TDOA) estimates are determined by minimizing a cost function that is defined using the eigenvalues of the Gram matrix. The estimated relative source positions are then mapped to estimated absolute source positions by solving an orthogonal Procrustes problem for each source. The proposed multi-source DOA estimation method entirely eliminates the need for continuous variable optimization by defining a relative coordinate system per source such that one of its coordinate axes is aligned with the respective source DOA. The optimal set of candidate TDOA estimates is determined by minimizing a cost function that is defined using the eigenvalues of a rank-reduced Gram matrix. The computational cost of the proposed EDM-based methods is significantly reduced compared to the SRP-based methods. Experimental results for different source and microphone configurations show that the proposed EDM-based method consistently outperforms the SRP-based method in terms of two-source position and DOA estimation accuracy.
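
A minimal numpy sketch of the geometric criterion these methods exploit: the Gram matrix obtained from a squared Euclidean distance matrix by double centering has rank at most 3 for points in 3D, so eigenvalue mass outside the top three serves as a cost on candidate distance estimates. The microphone/source positions below are random stand-ins.

```python
import numpy as np

def gram_from_edm(D2):
    """Double-center a squared-distance matrix to recover the Gram matrix."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ D2 @ J

def rank3_cost(D2):
    eig = np.linalg.eigvalsh(gram_from_edm(D2))   # ascending eigenvalues
    return np.abs(eig[:-3]).sum()                 # energy outside the top three

rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 3))                     # mics + source in 3D
D2 = ((pts[:, None] - pts[None]) ** 2).sum(-1)    # consistent squared EDM
N = rng.random(D2.shape)
print(rank3_cost(D2))                             # ~0 for consistent distances
print(rank3_cost(D2 + 0.1 * (N + N.T)))           # grows with inconsistency
```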

[433] STSM-FiLM: A FiLM-Conditioned Neural Architecture for Time-Scale Modification of Speech

Dyah A. M. G. Wisnu, Ryandhimas E. Zezario, Stefano Rini, Fo-Rui Li, Yan-Tsung Peng, Hsin-Min Wang, Yu Tsao

Main category: eess.AS

TL;DR: STSM-FiLM is a neural time-scale modification method that uses Feature-Wise Linear Modulation (FiLM) to condition on a continuous speed factor, trained on WSOLA outputs to produce perceptually consistent speech stretching across a wide range of scaling factors.

DetailsMotivation: Classical TSM methods like WSOLA introduce artifacts under extreme stretching or non-stationary conditions, motivating the development of neural approaches that can handle these challenges better.

Method: Proposed the STSM-FiLM architecture with FiLM conditioning on a continuous speed factor, supervised using WSOLA outputs. Explored four encoder-decoder variants: STFT-HiFiGAN, WavLM-HiFiGAN, Whisper-HiFiGAN, and EnCodec.

Result: STSM-FiLM produces perceptually consistent outputs across a wide range of time-scaling factors, demonstrating improved generalization and flexibility compared to classical methods.

Conclusion: FiLM-based conditioning shows potential for improving the generalization and flexibility of neural TSM models, enabling perceptually consistent speech time-scale modification across a wide range of stretching conditions.

Abstract: Time-Scale Modification (TSM) of speech aims to alter the playback rate of audio without changing its pitch. While classical methods like Waveform Similarity-based Overlap-Add (WSOLA) provide strong baselines, they often introduce artifacts under non-stationary or extreme stretching conditions. We propose STSM-FILM - a fully neural architecture that incorporates Feature-Wise Linear Modulation (FiLM) to condition the model on a continuous speed factor. By supervising the network using WSOLA-generated outputs, STSM-FILM learns to mimic alignment and synthesis behaviors while benefiting from representations learned through deep learning. We explore four encoder-decoder variants: STFT-HiFiGAN, WavLM-HiFiGAN, Whisper-HiFiGAN, and EnCodec, and demonstrate that STSM-FILM is capable of producing perceptually consistent outputs across a wide range of time-scaling factors. Overall, our results demonstrate the potential of FiLM-based conditioning to improve the generalization and flexibility of neural TSM models.
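
A minimal PyTorch sketch of the FiLM conditioning at the core of STSM-FiLM: the continuous speed factor is mapped to per-channel scale (gamma) and shift (beta) parameters that modulate intermediate features. Dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, n_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(1, 2 * n_channels)

    def forward(self, feats, speed):
        # feats: (batch, channels, time); speed: (batch, 1) continuous factor
        gamma, beta = self.to_gamma_beta(speed).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * feats + beta.unsqueeze(-1)

film = FiLM(n_channels=80)
feats = torch.randn(4, 80, 200)                   # e.g., encoder features
speed = torch.tensor([[0.5], [1.0], [1.5], [2.0]])
print(film(feats, speed).shape)                   # torch.Size([4, 80, 200])
```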

[434] SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

Chunbo Hao, Ruibin Yuan, Jixun Yao, Qixin Deng, Xinyi Bai, Wei Xue, Lei Xie

Main category: eess.AS

TL;DR: SongFormer is a scalable framework for music structure analysis that learns from heterogeneous supervision, combining short- and long-window audio representations with learned source embeddings to handle partial, noisy labels.

DetailsMotivation: Progress in music structure analysis has been limited by small, inconsistent corpora, creating a need for scalable frameworks that can learn from diverse and imperfect supervision.

Method: Fuses short- and long-window self-supervised audio representations to capture fine-grained and long-range dependencies, and introduces learned source embeddings to enable training with partial, noisy, and schema-mismatched labels.

Result: Sets new state of the art in strict boundary detection (HR.5F) and achieves highest functional label accuracy on SongFormBench, surpassing strong baselines and Gemini 2.5 Pro while remaining computationally efficient.

Conclusion: SongFormer provides an effective scalable framework for music structure analysis, supported by the largest MSA corpus (SongFormDB) and expert-verified benchmark (SongFormBench), enabling robust performance across diverse music.

Abstract: Music structure analysis (MSA) underpins music understanding and controllable generation, yet progress has been limited by small, inconsistent corpora. We present SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short- and long-window self-supervised audio representations to capture both fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to enable training with partial, noisy, and schema-mismatched labels. To support scaling and fair evaluation, we release SongFormDB, the largest MSA corpus to date (over 10k tracks spanning languages and genres), and SongFormBench, a 300-song expert-verified benchmark. On SongFormBench, SongFormer sets a new state of the art in strict boundary detection (HR.5F) and achieves the highest functional label accuracy, while remaining computationally efficient; it surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed tolerance (HR3F). Code, datasets, and model are publicly available.
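
A minimal PyTorch sketch of the learned source embedding described above: each training example carries an ID for the corpus its label came from, and the corresponding embedding is added to the fused audio features so the model can absorb noisy, schema-mismatched supervision. Shapes and the fusion step are illustrative.

```python
import torch
import torch.nn as nn

n_sources, d = 5, 256
source_emb = nn.Embedding(n_sources, d)           # one vector per label source
fuse = nn.Linear(2 * d, d)                        # fuse short- and long-window feats

def forward(short_feats, long_feats, source_id):
    # short_feats, long_feats: (batch, time, d); source_id: (batch,)
    h = fuse(torch.cat([short_feats, long_feats], dim=-1))
    return h + source_emb(source_id)[:, None, :]  # broadcast over time

out = forward(torch.randn(2, 100, d), torch.randn(2, 100, d), torch.tensor([0, 3]))
print(out.shape)  # torch.Size([2, 100, 256])
```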

[435] Enhancing Photogrammetry Reconstruction For HRTF Synthesis Via A Graph Neural Network

Ludovic Pirard, Katarina C. Poole, Lorenzo Picinali

Main category: eess.AS

TL;DR: This study uses Graph Neural Networks with neural subdivision to upsample low-resolution photogrammetry head meshes into high-resolution meshes for HRTF synthesis, addressing accessibility limitations of traditional methods.

DetailsMotivation: Traditional HRTF acquisition methods require specialized equipment and expertise, while high-resolution 3D scanners are expensive and limited. Photogrammetry offers an alternative but has resolution limitations that restrict HRTF synthesis applications.

Method: Process photogrammetry data from SONICOM dataset using Apple Photogrammetry API to reconstruct low-resolution head meshes. Train a GNN with Hausdorff Distance-based loss to upscale low-resolution inputs to high-resolution outputs. Validate performance geometrically and through HRTF synthesis using Mesh2HRTF.

Result: The GNN’s performance was validated on unseen photogrammetry data both geometrically and through synthesized HRTFs. Results were evaluated against HRTFs from high-resolution 3D scans, acoustic measurements, and KEMAR HRTF using numerical analyses and behavioral experiments including localization and Spatial Release from Masking tasks.

Conclusion: Graph Neural Networks with neural subdivision techniques can effectively upsample low-resolution photogrammetry head meshes to generate high-resolution meshes suitable for individual HRTF synthesis, providing a more accessible alternative to traditional methods.

Abstract: Traditional Head-Related Transfer Functions (HRTFs) acquisition methods rely on specialised equipment and acoustic expertise, posing accessibility challenges. Alternatively, high-resolution 3D modelling offers a pathway to numerically synthesise HRTFs using Boundary Elements Methods and others. However, the high cost and limited availability of advanced 3D scanners restrict their applicability. Photogrammetry has been proposed as a solution for generating 3D head meshes, though its resolution limitations restrict its application for HRTF synthesis. To address these limitations, this study investigates the feasibility of using Graph Neural Networks (GNN) using neural subdivision techniques for upsampling low-resolution Photogrammetry-Reconstructed (PR) meshes into high-resolution meshes, which can then be employed to synthesise individual HRTFs. Photogrammetry data from the SONICOM dataset are processed using Apple Photogrammetry API to reconstruct low-resolution head meshes. The dataset of paired low- and high-resolution meshes is then used to train a GNN to upscale low-resolution inputs to high-resolution outputs, using a Hausdorff Distance-based loss function. The GNN’s performance on unseen photogrammetry data is validated geometrically and through synthesised HRTFs generated via Mesh2HRTF. Synthesised HRTFs are evaluated against those computed from high-resolution 3D scans, to acoustically measured HRTFs, and to the KEMAR HRTF using perceptually-relevant numerical analyses as well as behavioural experiments, including localisation and Spatial Release from Masking (SRM) tasks.
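
A minimal PyTorch sketch of a symmetric Hausdorff distance between two point sets, the quantity the GNN's loss function is based on; the actual loss operates on vertices of predicted and ground-truth head meshes.

```python
import torch

def hausdorff(a, b):
    # a: (n, 3), b: (m, 3) vertex positions
    d = torch.cdist(a, b)                         # (n, m) pairwise distances
    return torch.max(d.min(dim=1).values.max(),   # sup over a of inf over b
                     d.min(dim=0).values.max())   # sup over b of inf over a

low_res = torch.randn(100, 3)
high_res = torch.randn(1000, 3)
print(hausdorff(low_res, high_res))
```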

[436] CVSM: Contrastive Vocal Similarity Modeling

Christos Garoufis, Athanasia Zlatintsi, Petros Maragos

Main category: eess.AS

TL;DR: CVSM is a contrastive self-supervised method for learning music representations that models vocal similarity by maximizing similarity between vocal excerpts and musical mixtures containing the same vocals, using both label-informed and label-agnostic approaches.

DetailsMotivation: The availability of large unlabeled datasets across domains has enabled self-supervised pre-training methods for representation learning, but there's a need for specialized approaches for music signal representation that can handle vocal similarity modeling.

Method: Contrastive self-supervised framework that maximizes similarity between vocal excerpts and musical mixtures with the same vocals. Uses two approaches: label-informed (using artist identity) and label-agnostic (creating artificial mixtures from random vocal and accompaniment samples).

Result: CVSM outperforms baselines in vocal similarity modeling across both isolated vocals and complete mixtures. Label-informed approach shows more consistent performance, but label-agnostic variant with hybrid pre-training achieves comparable performance in artist identification and perceived vocal similarity.

Conclusion: CVSM effectively learns representations for musical and vocal similarity modeling, with both label-informed and label-agnostic approaches showing strong performance, demonstrating the viability of contrastive self-supervised learning for music representation tasks.

Abstract: The availability of large, unlabeled datasets across various domains has contributed to the development of a plethora of methods that learn representations for multiple target (downstream) tasks through self-supervised pre-training. In this work, we introduce CVSM (Contrastive Vocal Similarity Modeling), a contrastive self-supervised procedure for music signal representation learning in the audio domain that can be utilized for musical and vocal similarity modeling. Our method operates under a contrastive framework, maximizing the similarity between vocal excerpts and musical mixtures containing the same vocals; we devise both a label-informed protocol, leveraging artist identity information to sample the contrastive pairs, and a label-agnostic scheme, involving artificial mixture creation from randomly sampled vocal and accompaniment excerpts, which are paired with vocals from the same audio segment. We evaluate our proposed method in measuring vocal similarity both objectively, through linear probing on a suite of appropriate downstream tasks, and subjectively, via conducting a user study consisting of pairwise comparisons between different models in a recommendation-by-query setting. Our results indicate that the representations learned through CVSM are effective in musical and vocal similarity modeling, outperforming numerous baselines across both isolated vocals and complete musical mixtures. Moreover, while the availability of artist identity labels during pre-training leads to overall more consistent performance both in the evaluated downstream tasks and the user study, a label-agnostic CVSM variant incorporating hybrid pre-training with real and artificial mixtures achieves comparable performance to the label-informed one in artist identification and perceived vocal similarity.
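
A minimal PyTorch sketch of the contrastive objective this summary describes: vocal embeddings are pulled toward embeddings of mixtures containing the same vocals (the diagonal of a batch similarity matrix) and pushed away from the other mixtures, i.e., an InfoNCE-style loss. The encoders are untrained stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocal_enc = nn.Linear(1024, 128)   # stand-in for the vocal-branch encoder
mix_enc = nn.Linear(1024, 128)     # stand-in for the mixture-branch encoder

def cvsm_loss(vocals, mixtures, temperature=0.1):
    v = F.normalize(vocal_enc(vocals), dim=-1)
    m = F.normalize(mix_enc(mixtures), dim=-1)
    logits = v @ m.T / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(len(v))            # i-th vocal matches i-th mixture
    return F.cross_entropy(logits, targets)

loss = cvsm_loss(torch.randn(32, 1024), torch.randn(32, 1024))
print(loss.item())
```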

[437] Evaluation of preprocessing pipelines in the creation of in-the-wild TTS datasets

Matías Di Bernardo, Emmanuel Misley, Ignacio Correa, Mateo García Iacovelli, Simón Mellino, Gala Lucía Gonzalez Barrios

Main category: eess.AS

TL;DR: A reproducible methodology for evaluating TTS preprocessing pipelines using objective metrics, tested on Argentine Spanish data with 24 pipeline configurations.

DetailsMotivation: To develop a low-cost, metric-driven approach for evaluating preprocessing pipelines for in-the-wild TTS corpora generation, particularly for low-resource settings.

Method: Applied custom low-cost pipeline to Argentine Spanish collection, compared 24 configurations combining denoising and quality filtering variants using objective measures (PESQ, SI-SDR, SNR), acoustic descriptors (T30, C50), and speech-preservation metrics (F0-STD, MCD).

Result: Denoising variants with permissive filtering provided the best overall compromise between dataset size, signal quality, and voice preservation.

Conclusion: The methodology enables selecting pipeline configurations without training TTS models for each subset, accelerating and reducing preprocessing development costs for low-resource settings.

Abstract: This work introduces a reproducible, metric-driven methodology to evaluate preprocessing pipelines for in-the-wild TTS corpora generation. We apply a custom low-cost pipeline to the first in-the-wild Argentine Spanish collection and compare 24 pipeline configurations combining different denoising and quality filtering variants. Evaluation relies on complementary objective measures (PESQ, SI-SDR, SNR), acoustic descriptors (T30, C50), and speech-preservation metrics (F0-STD, MCD). Results expose trade-offs between dataset size, signal quality, and voice preservation; where denoising variants with permissive filtering provide the best overall compromise for our testbed. The proposed methodology allows selecting pipeline configurations without training TTS models for each subset, accelerating and reducing the cost of preprocessing development for low-resource settings.
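
A minimal numpy sketch of SI-SDR, one of the objective measures used to compare pipeline configurations: the estimate is projected onto the reference to obtain a scale-invariant target, and the energy ratio is reported in decibels.

```python
import numpy as np

def si_sdr(est, ref):
    alpha = np.dot(est, ref) / np.dot(ref, ref)   # optimal scaling of reference
    target = alpha * ref
    noise = est - target
    return 10 * np.log10(np.sum(target**2) / np.sum(noise**2))

rng = np.random.default_rng(0)
clean = rng.normal(size=16000)                    # one second at 16 kHz
noisy = clean + 0.1 * rng.normal(size=16000)
print(f"SI-SDR: {si_sdr(noisy, clean):.1f} dB")
```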

[438] A Survey of Deep Learning for Complex Speech Spectrograms

Yuying Xie, Zheng-Hua Tan

Main category: eess.AS

TL;DR: A comprehensive survey of deep learning techniques for processing complex spectrograms in speech signal processing, covering architectures, training strategies, and applications like phase retrieval and speech enhancement.

DetailsMotivation: Recent deep learning advancements have significantly impacted speech signal processing, particularly in analyzing complex spectrograms that contain both magnitude and phase information, creating a need to systematically review state-of-the-art techniques.

Method: The survey examines complex-valued neural networks designed for complex data, revisits real-valued neural network approaches for complex spectrograms, discusses specialized training strategies and loss functions, and analyzes applications across various speech processing tasks.

Result: The survey provides a comprehensive overview of current deep learning techniques for complex spectrogram processing, highlighting significant progress in applications like phase retrieval, speech enhancement, and speaker separation.

Conclusion: This survey serves as a valuable resource for researchers and practitioners in speech signal processing and deep learning by systematically organizing and reviewing the state-of-the-art in complex spectrogram processing techniques.

Abstract: Recent advancements in deep learning have significantly impacted the field of speech signal processing, particularly in the analysis and manipulation of complex spectrograms. This survey provides a comprehensive overview of the state-of-the-art techniques leveraging deep neural networks for processing complex spectrograms, which encapsulate both magnitude and phase information. We begin by introducing complex spectrograms and their associated features for various speech processing tasks. Next, we examine the key components and architectures of complex-valued neural networks, which are specifically designed to handle complex-valued data and have been applied to complex spectrogram processing. As recent studies have primarily focused on applying real-valued neural networks to complex spectrograms, we revisit these approaches and their architectural designs. We then discuss various training strategies and loss functions tailored for training neural networks to process and model complex spectrograms. The survey further examines key applications, including phase retrieval, speech enhancement, and speaker separation, where deep learning has achieved significant progress by leveraging complex spectrograms or their derived feature representations. Additionally, we examine the intersection of complex spectrograms with generative models. This survey aims to serve as a valuable resource for researchers and practitioners in the field of speech signal processing, deep learning and related fields.

eess.IV

[439] Learning a distance measure from the information-estimation geometry of data

Guy Ohayon, Pierre-Etienne H. Fiquet, Florentin Guth, Jona Ballé, Eero P. Simoncelli

Main category: eess.IV

TL;DR: The paper introduces the Information-Estimation Metric (IEM), a novel distance function based on probability densities that connects information theory and estimation theory through denoising errors.

DetailsMotivation: To develop a principled distance metric that adapts to the geometry of complex signal distributions, bridging information theory and estimation theory.

Method: IEM is derived by comparing denoising error vectors of signals across noise amplitudes, using score vector fields of blurred densities. It can be computed with learned denoisers and involves solving a one-dimensional integral.

Result: IEM is proven to be a valid global metric whose local second-order approximation yields a Riemannian metric. For Gaussian signals it matches the Mahalanobis distance; for more complex distributions it adapts to their geometry. On ImageNet, IEM competes with or outperforms state-of-the-art supervised image quality metrics in predicting human perceptual judgments.

Conclusion: IEM provides a theoretically grounded and practical distance metric that effectively captures perceptual similarity in complex signal distributions like images.

Abstract: We introduce the Information-Estimation Metric (IEM), a novel form of distance function derived from an underlying continuous probability density over a domain of signals. The IEM is rooted in a fundamental relationship between information theory and estimation theory, which links the log-probability of a signal with the errors of an optimal denoiser, applied to noisy observations of the signal. In particular, the IEM between a pair of signals is obtained by comparing their denoising error vectors over a range of noise amplitudes. Geometrically, this amounts to comparing the score vector fields of the blurred density around the signals over a range of blur levels. We prove that the IEM is a valid global metric and derive a closed-form expression for its local second-order approximation, which yields a Riemannian metric. For Gaussian-distributed signals, the IEM coincides with the Mahalanobis distance. But for more complex distributions, it adapts, both locally and globally, to the geometry of the distribution. In practice, the IEM can be computed using a learned denoiser (analogous to generative diffusion models) and solving a one-dimensional integral. To demonstrate the value of our framework, we learn an IEM on the ImageNet database. Experiments show that this IEM is competitive with or outperforms state-of-the-art supervised image quality metrics in predicting human perceptual judgments.
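
The construction can be sketched numerically as follows, assuming a pretrained `denoiser(x, sigma)` in the style of diffusion models; the noise grid and the uniform quadrature weights are simplifying assumptions, as the paper derives the exact one-dimensional integral.

```python
# A schematic sketch of the IEM idea, not the paper's exact formula.
import torch

def iem_distance(x1, x2, denoiser, sigmas=torch.linspace(0.05, 1.0, 20)):
    total = 0.0
    for sigma in sigmas:
        eps = sigma * torch.randn_like(x1)
        # Denoising error vectors are proportional to the score of the
        # density blurred at level sigma, evaluated near each signal.
        e1 = denoiser(x1 + eps, sigma) - (x1 + eps)
        e2 = denoiser(x2 + eps, sigma) - (x2 + eps)
        total = total + torch.sum((e1 - e2) ** 2)
    return torch.sqrt(total / len(sigmas))
```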

[440] High Pixel Resolution Visible to Extended Shortwave Infrared Single Pixel Imaging with a black Phosphorus-Molybdenum disulfide (bP-MoS2) photodiode

Seyed Saleh Mousavi Khaleghi, Jinyuan Chen, Sivacarendran Balendhran, Alexander Corletto, Shifan Wang, Huan Liu, James Bullock, Kenneth B. Crozier

Main category: eess.IV

TL;DR: This paper demonstrates high-resolution single pixel imaging using a van der Waals material photodetector, achieving 1023×768 pixels for visible light and 512×512 for infrared with compressed sampling that reduces imaging time by 4x.

DetailsMotivation: Current high-resolution infrared imagers are expensive due to costly sensor arrays. Van der Waals materials offer potential for low-cost room temperature infrared photodetectors, but creating megapixel arrays remains challenging.

Method: Uses a black phosphorus-molybdenum disulfide (bP-MoS2) photodiode with single pixel imaging and compressed sampling based on a cyclic S-matrix derived from Hadamard sequences, enabling efficient reconstruction via circular convolution and Fourier transforms.

Result: Achieved high-resolution SPI with 1023×768 pixels for visible light and 512×512 for extended shortwave infrared, surpassing previous vdWs material SPI implementations by 64x in pixel count. Also introduced edge detection for rapid feature extraction.

Conclusion: The method enables inexpensive shortwave and midwave infrared cameras, potentially advancing applications in gas detection, biomedical imaging, autonomous driving, security, and surveillance.

Abstract: High-resolution infrared imagers are currently more expensive than CMOS and CCD cameras, due to costly sensor arrays. Van der Waals (vdWs) materials present an opportunity for low-cost, room temperature infrared photodetectors. Although photodetectors based on vdWs materials show promising performance, creating a megapixel array is yet to be achieved. Imaging with a single vdWs photodetector typically relies on time-consuming mechanical scanning and suffers from low resolution. Single pixel imaging (SPI) offers an affordable alternative to achieve high-resolution imaging, utilizing only one photodetector and a spatial light modulator. Progress in SPI using vdWs material photodetectors has been limited, with only one prior demonstration in the near infrared range (64$\times$64 pixels). In this work, we demonstrate a high-resolution SPI system (1023$\times$768 for visible light and 512$\times$512 for extended shortwave infrared) using a black phosphorus-molybdenum disulfide (bP-MoS$_2$) photodiode, surpassing earlier vdWs material SPI implementations by a factor of 64 in pixel count. We introduce an easy-to-implement edge detection method for rapid feature extraction. We employ compressed sampling and reduce imaging time by a factor of four. Our compressed sampling approach is based on a cyclic S-matrix, which is derived from a Hadamard-based sequence, where each row is a circular shift of the first row. This enables efficient imaging reconstruction via circular convolution and Fourier transforms, allowing fewer measurements while preserving the key image features. Our method for SPI using a vdWs material photodetector presents the opportunity for inexpensive shortwave infrared and midwave infrared cameras, and thus may enable advances in gas detection, biomedical imaging, autonomous driving, security, and surveillance.
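
The FFT-based reconstruction trick is easy to demonstrate at small scale. The sketch below, assuming numpy/scipy, builds a 7x7 cyclic S-matrix from a length-7 m-sequence (standing in for the much longer Hadamard-derived sequence of the paper) and recovers the scene by Fourier-domain division, since circulant systems diagonalize under the FFT.

```python
# Cyclic S-matrix reconstruction via circular convolution, assuming numpy/scipy.
import numpy as np
from scipy.linalg import circulant

def m_sequence_7():
    """Length-7 maximal-length sequence from the LFSR x^3 + x + 1."""
    state, bits = [1, 0, 0], []
    for _ in range(7):
        bits.append(state[-1])
        fb = state[0] ^ state[-1]          # feedback taps
        state = [fb] + state[:-1]
    return np.array(bits, dtype=float)

s = m_sequence_7()
S = circulant(s)                            # each column shift of the first row
x = np.random.rand(7)                       # unknown scene (flattened)
y = S @ x                                   # single-pixel measurements

# S @ x is the circular convolution of s and x, so in the Fourier domain
# x = IFFT(FFT(y) / FFT(s)); m-sequence S-matrices are invertible.
x_hat = np.real(np.fft.ifft(np.fft.fft(y) / np.fft.fft(s)))
assert np.allclose(x_hat, x)
```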

[441] A UAV-Based VNIR Hyperspectral Benchmark Dataset for Landmine and UXO Detection

Sagar Lekhak, Emmett J. Ientilucci, Jasper Baur, Susmita Ghosh

Main category: eess.IV

TL;DR: A new UAV-based VNIR hyperspectral dataset for landmine/UXO detection with 143 surrogate targets in various configurations, featuring high spectral fidelity and supporting reproducible research.

DetailsMotivation: To fill the critical gap in open-access UAV-based hyperspectral data specifically for landmine and unexploded ordnance detection research.

Method: Used a Headwall Nano-Hyperspec sensor on a UAV platform flown at 20.6m altitude, capturing 270 spectral bands (398-1002 nm), with radiometric calibration, orthorectification, mosaicking, and reflectance retrieval using Empirical Line Method.

Result: Cross-validation showed RMSE values below 1.0 and SAM values between 1-6 degrees in 400-900 nm range, demonstrating high spectral fidelity. Dataset includes raw radiance cubes, GCP/AeroPoint data, and reference spectra.

Conclusion: This dataset fills a critical gap and provides a multi-sensor benchmark when combined with previously published drone-based EMI data from the same test field.

Abstract: This paper introduces a novel benchmark dataset of Visible and Near-Infrared (VNIR) hyperspectral imagery acquired via an unmanned aerial vehicle (UAV) platform for landmine and unexploded ordnance (UXO) detection research. The dataset was collected over a controlled test field seeded with 143 realistic surrogate landmine and UXO targets, including surface, partially buried, and fully buried configurations. Data acquisition was performed using a Headwall Nano-Hyperspec sensor mounted on a multi-sensor drone platform, flown at an altitude of approximately 20.6 m, capturing 270 contiguous spectral bands spanning 398-1002 nm. Radiometric calibration, orthorectification, and mosaicking were performed followed by reflectance retrieval using a two-point Empirical Line Method (ELM), with reference spectra acquired using an SVC spectroradiometer. Cross-validation against six reference objects yielded RMSE values below 1.0 and SAM values between 1 and 6 degrees in the 400-900 nm range, demonstrating high spectral fidelity. The dataset is released alongside raw radiance cubes, GCP/AeroPoint data, and reference spectra to support reproducible research. This contribution fills a critical gap in open-access UAV-based hyperspectral data for landmine detection and offers a multi-sensor benchmark when combined with previously published drone-based electromagnetic induction (EMI) data from the same test field.
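
A two-point ELM reduces to a per-band linear fit between two calibration references; a minimal numpy sketch, together with the SAM measure used in the validation, might look like this (array names are illustrative):

```python
# Two-point Empirical Line Method and Spectral Angle Mapper, assuming numpy.
import numpy as np

def elm_two_point(rad_bright, rad_dark, ref_bright, ref_dark):
    """Per-band gain/offset so that reflectance = gain * radiance + offset."""
    gain = (ref_bright - ref_dark) / (rad_bright - rad_dark)
    offset = ref_bright - gain * rad_bright
    return gain, offset

def spectral_angle_deg(a, b):
    """Spectral Angle Mapper (degrees) between two spectra."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Illustrative usage for a 270-band cube:
# gain, offset = elm_two_point(rad_b, rad_d, ref_b, ref_d)   # shapes: (270,)
# reflectance_cube = gain * radiance_cube + offset
```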

[442] Image Enhancement Based on Pigment Representation

Se-Ho Lee, Keunsoo Ko, Seung-Wook Kim

Main category: eess.IV

TL;DR: A novel image enhancement method using pigment representation that transforms RGB colors into high-dimensional pigments for adaptive content-based enhancement, achieving superior performance with low computational cost.

DetailsMotivation: Conventional image enhancement methods are limited by pre-defined color spaces like RGB, lacking adaptability to input content. The paper aims to overcome this limitation by developing a more expressive and adaptive representation.

Method: Transform RGB colors into high-dimensional pigments, reproject and blend them individually, then transform back to RGB. Uses a visual encoder to adaptively estimate transformation parameters based on input image content.

Result: Superior performance over state-of-the-art methods in image enhancement tasks including retouching and tone mapping, while maintaining low computational complexity and small model size.

Conclusion: The pigment representation method provides an effective and efficient solution for image enhancement, offering adaptability and expressiveness that outperforms conventional approaches.

Abstract: This paper presents a novel and efficient image enhancement method based on pigment representation. Unlike conventional methods where the color transformation is restricted to pre-defined color spaces like RGB, our method dynamically adapts to input content by transforming RGB colors into a high-dimensional feature space referred to as \textit{pigments}. The proposed pigment representation offers adaptability and expressiveness, achieving superior image enhancement performance. The proposed method involves transforming input RGB colors into high-dimensional pigments, which are then reprojected individually and blended to refine and aggregate the information of the colors in pigment spaces. Those pigments are then transformed back into RGB colors to generate an enhanced output image. The transformation and reprojection parameters are derived from the visual encoder which adaptively estimates such parameters based on the content in the input image. Extensive experimental results demonstrate the superior performance of the proposed method over state-of-the-art methods in image enhancement tasks, including image retouching and tone mapping, while maintaining relatively low computational complexity and small model size.
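
One possible toy reading of the pipeline, assuming PyTorch: a small encoder predicts per-image lift and projection matrices, pixels are mapped from RGB into a K-dimensional pigment space, blended there, and projected back. Dimensions and layer choices below are illustrative, not the paper's architecture.

```python
# A toy content-adaptive "pigment" enhancer, assuming PyTorch.
import torch
import torch.nn as nn

class PigmentEnhancer(nn.Module):
    def __init__(self, k: int = 16):
        super().__init__()
        self.k = k
        # Predicts the 3->K lift and K->3 projection from global image content.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 3 * k + k * 3),
        )
        self.mix = nn.Sequential(nn.Linear(k, k), nn.ReLU(), nn.Linear(k, k))

    def forward(self, img: torch.Tensor) -> torch.Tensor:  # (B, 3, H, W)
        b, _, h, w = img.shape
        params = self.encoder(img)
        lift = params[:, : 3 * self.k].view(b, self.k, 3)
        proj = params[:, 3 * self.k :].view(b, 3, self.k)
        pix = img.flatten(2).transpose(1, 2)          # (B, H*W, 3)
        pigments = pix @ lift.transpose(1, 2)         # lift to pigment space
        pigments = self.mix(pigments)                 # blend pigments
        out = pigments @ proj.transpose(1, 2)         # project back to RGB
        return out.transpose(1, 2).view(b, 3, h, w)
```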

[443] GCVAMD: Causal AMD Analysis with a Modified CausalVAE

Daeyoung Kim

Main category: eess.IV

TL;DR: GCVAMD is a novel causal AMD analysis model using modified CausalVAE to extract latent causal factors from OCT images, enabling causal inference for AMD risk factors like drusen and neovascularization.

DetailsMotivation: Previous deep learning methods focused only on prediction performance without considering underlying causal mechanisms, which limits intervention analysis and reliability. There's a need for causal understanding in AMD detection.

Method: Modified CausalVAE approach that extracts latent causal factors from raw OCT images, enabling causal inference for AMD risk factors including drusen and neovascularization.

Result: GCVAMD successfully identifies drusen status and neovascularization status with AMD causal mechanisms in latent spaces, supporting tasks from AMD classification to intervention analysis.

Conclusion: GCVAMD provides a causal framework for AMD analysis that enhances downstream tasks and enables treatment simulation and intervention analysis on specific risk factors.

Abstract: Age-Related Macular Degeneration (AMD) is one of the leading causes of permanent vision impairment in ophthalmology. Though treatments such as anti-VEGF drugs or photodynamic therapies were developed to slow down the degenerative process of AMD, there is still no specific cure to reverse vision loss caused by AMD. Thus, detecting risk factors of AMD, or AMD itself, within the patient retina at an early stage is a crucial task for reducing the possibility of vision impairment. Apart from traditional approaches, deep learning based methods, especially attention-mechanism-based CNNs and GradCAM-based XAI analysis on OCT scans, have exhibited successful performance in distinguishing AMD retinas from normal retinas, making it possible to use AI-driven models to aid medical diagnosis and analysis by ophthalmologists regarding AMD. However, despite this success, previous works mostly focused on prediction performance itself, not on pathologies or the underlying causal mechanisms of AMD, which can preclude intervention analysis on specific factors or even lead to less reliable decisions. Thus, this paper introduces a novel causal AMD analysis model: GCVAMD, which incorporates a modified CausalVAE approach that can extract latent causal factors from only raw OCT images. By considering causality in AMD detection, GCVAMD enables causal inference such as treatment simulation or intervention analysis regarding the major risk factors, drusen and neovascularization, while returning informative latent causal features that can enhance downstream tasks. Results show that through GCVAMD, drusen status and neovascularization status can be identified with AMD causal mechanisms in GCVAMD latent spaces, which can in turn be used for various tasks from AMD detection (classification) to intervention analysis.
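
For intuition, a CausalVAE-style latent causal layer can be sketched as a linear structural causal model z = (I - A^T)^{-1} eps, where A is a learned DAG adjacency over latent factors; an intervention (do-operation) overwrites one latent, e.g. a drusen factor, before decoding. The sketch below, assuming PyTorch, is illustrative and not GCVAMD's actual implementation.

```python
# A schematic CausalVAE-style causal layer, assuming PyTorch.
import torch
import torch.nn as nn

class CausalLayer(nn.Module):
    def __init__(self, n_factors: int = 4):
        super().__init__()
        self.A = nn.Parameter(torch.zeros(n_factors, n_factors))  # learned DAG

    def forward(self, eps: torch.Tensor) -> torch.Tensor:  # eps: (B, n_factors)
        eye = torch.eye(self.A.shape[0], device=eps.device)
        # Independent exogenous noise -> causally related latents:
        return eps @ torch.inverse(eye - self.A.T).T  # z = (I - A^T)^{-1} eps

    def intervene(self, eps: torch.Tensor, index: int, value: float):
        z = self.forward(eps).clone()
        z[:, index] = value   # do(z_index = value), e.g. set a "drusen" factor
        return z
```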

[444] Wave-GMS: Lightweight Multi-Scale Generative Model for Medical Image Segmentation

Talha Ahmed, Nehal Ahmed Shaikh, Hassan Mohy-ud-Din

Main category: eess.IV

TL;DR: Wave-GMS is a lightweight multi-scale generative model for medical image segmentation that achieves state-of-the-art performance with only ~2.6M parameters, enabling training on cost-effective GPUs with limited memory.

DetailsMotivation: To enable equitable deployment of AI tools in healthcare by developing segmentation networks that can be trained on cost-effective GPUs with limited memory and large batch sizes, without requiring memory-intensive pretrained models.

Method: Proposed Wave-GMS, a lightweight and efficient multi-scale generative model for medical image segmentation that has substantially fewer trainable parameters and supports training with large batch sizes on GPUs with limited memory.

Result: Extensive experiments on four public datasets (BUS, BUSI, Kvasir-Instrument, HAM10000) show Wave-GMS achieves state-of-the-art segmentation performance with superior cross-domain generalizability using only ~2.6M trainable parameters.

Conclusion: Wave-GMS provides an effective solution for equitable AI deployment in healthcare by offering high-performance segmentation with minimal computational requirements, making it suitable for resource-constrained hospital environments.

Abstract: For equitable deployment of AI tools in hospitals and healthcare facilities, we need Deep Segmentation Networks that offer high performance and can be trained on cost-effective GPUs with limited memory and large batch sizes. In this work, we propose Wave-GMS, a lightweight and efficient multi-scale generative model for medical image segmentation. Wave-GMS has a substantially smaller number of trainable parameters, does not require loading memory-intensive pretrained vision foundation models, and supports training with large batch sizes on GPUs with limited memory. We conducted extensive experiments on four publicly available datasets (BUS, BUSI, Kvasir-Instrument, and HAM10000), demonstrating that Wave-GMS achieves state-of-the-art segmentation performance with superior cross-domain generalizability, while requiring only ~2.6M trainable parameters. Code is available at https://github.com/ATPLab-LUMS/Wave-GMS.

[445] WaveNet-SF: A Hybrid Network for Retinal Disease Detection Based on Wavelet Transform in Spatial-Frequency Domain

Jilan Cheng, Guoli Long, Zeyu Zhang, Zhenjia Qi, Hanyu Wang, Libin Lu, Shuihua Wang, Yudong Zhang, Jin Hong

Main category: eess.IV

TL;DR: WaveNet-SF is a novel framework that integrates spatial and frequency domain learning for retinal disease detection from OCT images, achieving state-of-the-art classification accuracies on benchmark datasets.

DetailsMotivation: Retinal diseases cause vision impairment and blindness, with timely diagnosis being critical. OCT images suffer from speckle noise, complex lesion shapes, and varying lesion sizes, making interpretation challenging.

Method: The framework uses wavelet transforms to decompose OCT images into low- and high-frequency components. It includes a Multi-Scale Wavelet Spatial Attention (MSW-SA) module for multi-scale lesion detection and a High-Frequency Feature Compensation (HFFC) block to recover edge information and suppress noise.

Result: Achieved state-of-the-art classification accuracies of 97.82% on OCT-C8 dataset and 99.58% on OCT2017 dataset, surpassing existing methods.

Conclusion: WaveNet-SF effectively addresses challenges in OCT image analysis and shows potential as a powerful tool for retinal disease diagnosis.

Abstract: Retinal diseases are a leading cause of vision impairment and blindness, with timely diagnosis being critical for effective treatment. Optical Coherence Tomography (OCT) has become a standard imaging modality for retinal disease diagnosis, but OCT images often suffer from issues such as speckle noise, complex lesion shapes, and varying lesion sizes, making interpretation challenging. In this paper, we propose a novel framework, WaveNet-SF, to enhance retinal disease detection by integrating the spatial-domain and frequency-domain learning. The framework utilizes wavelet transforms to decompose OCT images into low- and high-frequency components, enabling the model to extract both global structural features and fine-grained details. To improve lesion detection, we introduce a Multi-Scale Wavelet Spatial Attention (MSW-SA) module, which enhances the model’s focus on regions of interest at multiple scales. Additionally, a High-Frequency Feature Compensation (HFFC) block is incorporated to recover edge information lost during wavelet decomposition, suppress noise, and preserve fine details crucial for lesion detection. Our approach achieves state-of-the-art (SOTA) classification accuracies of 97.82% and 99.58% on the OCT-C8 and OCT2017 datasets, respectively, surpassing existing methods. These results demonstrate the efficacy of WaveNet-SF in addressing the challenges of OCT image analysis and its potential as a powerful tool for retinal disease diagnosis.
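
The wavelet split the framework builds on is straightforward to reproduce with PyWavelets: one DWT level separates an image into a low-frequency approximation carrying global structure and three high-frequency detail bands carrying edges and noise, the content the HFFC block compensates after decomposition. A minimal sketch:

```python
# One-level 2D wavelet decomposition, assuming PyWavelets (pip install PyWavelets).
import numpy as np
import pywt

oct_image = np.random.rand(256, 256)        # stand-in for a grayscale OCT scan
ll, (lh, hl, hh) = pywt.dwt2(oct_image, "haar")

# ll: 128x128 low-frequency component  -> global structural features
# lh/hl/hh: horizontal/vertical/diagonal detail -> edges and fine lesions
recon = pywt.idwt2((ll, (lh, hl, hh)), "haar")
assert np.allclose(recon, oct_image)        # decomposition is lossless
```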

[446] CostFilter-AD: Enhancing Anomaly Detection through Matching Cost Filtering

Zhe Zhang, Mingxiu Cai, Hanxiao Wang, Gaochang Wu, Tianyou Chai, Xiatian Zhu

Main category: eess.IV

TL;DR: CostFilter-AD introduces cost filtering from classical matching tasks to unsupervised anomaly detection, refining matching cost volumes between input and normal samples to improve anomaly localization accuracy.

DetailsMotivation: Existing UAD methods rely on inaccurate image-level or feature-level matching processes that lead to sub-optimal detection performance. The matching process is often overlooked despite being fundamental to deriving anomaly scores.

Method: Constructs a matching cost volume between input and normal samples, then uses a cost volume filtering network guided by input observation as attention query to suppress noise while preserving edges and capturing subtle anomalies. Works as a generic post-processing plug-in for both reconstruction-based and embedding-based methods.

Result: Extensive experiments on MVTec-AD and VisA benchmarks show generic benefits for both single- and multi-class UAD tasks, with improved anomaly localization accuracy.

Conclusion: CostFilter-AD effectively addresses the matching accuracy problem in UAD by borrowing cost filtering concepts from classical matching tasks, providing a versatile solution that enhances existing methods without requiring architectural changes.

Abstract: Unsupervised anomaly detection (UAD) seeks to localize the anomaly mask of an input image with respect to normal samples. Either by reconstructing normal counterparts (reconstruction-based) or by learning an image feature embedding space (embedding-based), existing approaches fundamentally rely on image-level or feature-level matching to derive anomaly scores. Often, such a matching process is inaccurate yet overlooked, leading to sub-optimal detection. To address this issue, we introduce the concept of cost filtering, borrowed from classical matching tasks, such as depth and flow estimation, into the UAD problem. We call this approach {\em CostFilter-AD}. Specifically, we first construct a matching cost volume between the input and normal samples, comprising two spatial dimensions and one matching dimension that encodes potential matches. To refine this, we propose a cost volume filtering network, guided by the input observation as an attention query across multiple feature layers, which effectively suppresses matching noise while preserving edge structures and capturing subtle anomalies. Designed as a generic post-processing plug-in, CostFilter-AD can be integrated with either reconstruction-based or embedding-based methods. Extensive experiments on MVTec-AD and VisA benchmarks validate the generic benefits of CostFilter-AD for both single- and multi-class UAD tasks. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.
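
The cost-volume construction can be sketched as cosine matching costs between each input feature location and a bank of normal features, assuming PyTorch; the filtering network that refines the volume (queried by the input observation) is the paper's contribution and is omitted here.

```python
# A minimal matching cost volume for anomaly detection, assuming PyTorch.
import torch
import torch.nn.functional as F

def matching_cost_volume(feat: torch.Tensor, normal_bank: torch.Tensor):
    """feat: (C, H, W) input features; normal_bank: (N, C) normal features.
    Returns an (N, H, W) volume; higher cost = worse match = more anomalous."""
    c, h, w = feat.shape
    f = F.normalize(feat.flatten(1).T, dim=1)        # (H*W, C), unit norm
    n = F.normalize(normal_bank, dim=1)              # (N, C), unit norm
    cost = 1.0 - f @ n.T                             # (H*W, N) cosine distance
    return cost.T.view(-1, h, w)                     # (N, H, W)

# An initial anomaly map is the best match per pixel; CostFilter-AD instead
# filters the full volume before this reduction (illustrative usage):
# anomaly_map = matching_cost_volume(feat, bank).min(dim=0).values
```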

[447] Robust Pan-Cancer Mitotic Figure Detection with YOLOv12

Raphaël Bourgade, Guillaume Balezo, Thomas Walter

Main category: eess.IV

TL;DR: The paper presents a YOLOv12-based approach for mitosis detection that achieved second place in the MIDOG 2025 challenge with strong performance on both hotspot and whole-slide regions.

DetailsMotivation: Mitotic figures are crucial for tumor pathology assessment but suffer from high inter-observer variability among pathologists, necessitating robust automated detection methods.

Method: Used state-of-the-art YOLOv12 object detection architecture for mitosis detection without relying on external data.

Result: Achieved F1-score of 0.801 on preliminary test set (hotspots only) and ranked second on final test leaderboard with F1-score of 0.7216 across complex whole-slide regions.

Conclusion: The YOLOv12-based approach demonstrates strong performance for robust mitosis detection in histopathological images, addressing the challenge of inter-observer variability.

Abstract: Mitotic figures represent a key histoprognostic feature in tumor pathology, providing crucial insights into tumor aggressiveness and proliferation. However, their identification remains challenging, subject to significant inter-observer variability, even among experienced pathologists. To address this issue, the MItosis DOmain Generalization (MIDOG) 2025 challenge marks the third edition of an international competition aiming to develop robust mitosis detection algorithms. In this paper, we present a mitotic figure detection approach based on the state-of-the-art YOLOv12 object detection architecture. Our method achieved an F1-score of 0.801 on the preliminary test set (hotspots only) and ranked second on the final test leaderboard with an F1-score of 0.7216 across complex and heterogeneous whole-slide regions, without relying on external data.
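
As a hedged sketch of this kind of detection pipeline with the Ultralytics API (the weights name "yolo12n.pt" and the dataset YAML are assumptions, not the authors' training setup):

```python
# Fine-tuning and running a YOLO-family detector with Ultralytics; a sketch,
# not the authors' configuration.
from ultralytics import YOLO

model = YOLO("yolo12n.pt")                               # pretrained YOLOv12 weights
model.train(data="mitosis.yaml", imgsz=640, epochs=100)  # boxes = mitotic figures
results = model("slide_region.png", conf=0.25)           # detect on a WSI tile
for box in results[0].boxes:
    print(box.xyxy, float(box.conf))                     # candidate mitotic figures
```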

Last updated: 2025-10-13