Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 214]
- cs.CV [Total: 332]
- cs.AI [Total: 132]
- cs.SD [Total: 18]
- cs.LG [Total: 398]
- cs.MA [Total: 9]
- cs.MM [Total: 2]
- eess.AS [Total: 20]
- eess.IV [Total: 27]
cs.CL
[1] A Multi-lingual Dataset of Classified Paragraphs from Open Access Scientific Publications
Eric Jeangirard
Main category: cs.CL
TL;DR: A dataset of 833k paragraphs from scientific papers classified into four categories (acknowledgments, data mentions, software/code mentions, clinical trial mentions) with language and domain annotations.
Details
Motivation: To enable training of text classification models and development of named entity recognition systems for scientific literature mining by providing a comprehensive annotated dataset.
Method: Extracted paragraphs from CC-BY licensed scientific publications using GROBID, annotated with language identification (fastText) and scientific domain (OpenAlex), primarily in English and French.
Result: Created a publicly available dataset of 833k classified paragraphs on HuggingFace under CC-BY license, covering multiple scientific domains and languages.
Conclusion: This dataset provides a valuable resource for developing NLP tools for scientific literature analysis and can support various text mining applications in research.
Abstract: We present a dataset of 833k paragraphs extracted from CC-BY licensed scientific publications, classified into four categories: acknowledgments, data mentions, software/code mentions, and clinical trial mentions. The paragraphs are primarily in English and French, with additional European languages represented. Each paragraph is annotated with language identification (using fastText) and scientific domain (from OpenAlex). This dataset, derived from the French Open Science Monitor corpus and processed using GROBID, enables training of text classification models and development of named entity recognition systems for scientific literature mining. The dataset is publicly available on HuggingFace https://doi.org/10.57967/hf/6679 under a CC-BY license.
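A hypothetical usage sketch: the summary gives only the DOI, so the repo id and field names below are placeholders rather than the dataset's actual schema.

```python
from datasets import load_dataset

# Placeholder repo id; resolve the real one via https://doi.org/10.57967/hf/6679
ds = load_dataset("dataset-repo-id")

# Filter to one of the four categories and one language, assuming
# hypothetical "category" and "lang" fields carry the annotations.
software_fr = ds["train"].filter(
    lambda p: p["category"] == "software/code mention" and p["lang"] == "fr"
)
print(len(software_fr), "French software-mention paragraphs")
```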
[2] Policy Optimization Prefers The Path of Least Resistance
Debdeep Sanyal, Aakash Sen Sharma, Dhruv Kumar, Saurabh Deshpande, Murari Mandal
Main category: cs.CL
TL;DR: Policy optimization in LLMs consistently follows the path of least resistance, discarding explicit reasoning when allowed flexible CoT formats, even with higher rewards for complex reasoning.
Details
Motivation: To understand how policy optimization behaves when rigid think-then-answer constraints are relaxed into open-ended chain-of-thought structures.
Method: Extensive controlled experiments with various models and algorithms, including reward decomposition studies and KL-regularized policy analysis.
Result: PO systematically optimizes for simplest reward components first, causing degeneration to answer-only formats even with 4x higher rewards for complex reasoning.
Conclusion: Policy freedom enables high-reward shortcut discovery but creates strong incentives for reward hacking, posing critical alignment challenges.
Abstract: Policy optimization (PO) algorithms are used to refine Large Language Models for complex, multi-step reasoning. Current state-of-the-art pipelines enforce a strict think-then-answer format to elicit chain-of-thought (CoT); however, the behavior of PO when these rigid constraints are relaxed into an open-ended CoT structure remains an under-studied question. We investigate this gap with an extensive suite of controlled experiments and identify a consistent principle: policy optimization consistently follows the path of least resistance. When afforded the flexibility to interleave reasoning and response, policy optimization consistently learns to discard explicit reasoning, causing the policy to degenerate to a direct answer-only format.
[3] Language Ranker: A Lightweight Ranking Framework for LLM Decoding
Chenheng Zhang, Tianqi Du, Jizhe Zhang, Mingqing Xiao, Yifei Wang, Yisen Wang, Zhouchen Lin
Main category: cs.CL
TL;DR: Language Ranker: A lightweight reranking framework that treats LLM decoding as recommendation ranking, achieving comparable performance to large reward models with <0.5M parameters.
Details
Motivation: Traditional LLM research focuses on output distributions while neglecting decoding processes. Existing decoding methods and reward models suffer from computational costs, redundancy, and limited applicability.
Method: Proposes Language Ranker framework that treats decoding as recommendation ranking, using a lightweight module to rerank candidate responses with features extracted by the base model.
Result: Achieves performance comparable to large-scale reward models across various tasks, with only <0.5M additional parameters, significantly reducing computational overhead in training and inference.
Conclusion: The method demonstrates efficiency and effectiveness in unlocking LLM capabilities through lightweight reranking, offering a practical alternative to computationally expensive approaches.
Abstract: Conventional research on large language models (LLMs) has primarily focused on refining output distributions, while paying less attention to the decoding process that transforms these distributions into final responses. Recent advances, such as scaling the computation of inference time with reward models, have underscored the importance of decoding, but these methods often suffer from high computational costs and limited applicability. In this paper, we revisit LLM generation through the lens of recommender systems, conceptualizing the decoding process as analogous to the ranking stage in recommendation pipelines. From this perspective, we observe that both traditional decoding methods and reward models exhibit clear limitations such as redundancy. Motivated by this insight, we propose Language Ranker, a novel framework that introduces a lightweight module to rerank candidate responses using features extracted by the base model. Experiments across a wide range of tasks show that Language Ranker achieves performance comparable to large-scale reward models, while requiring only <0.5M additional parameters, significantly reducing the computational overhead during both training and inference stages. This highlights the efficiency and effectiveness of our method, showcasing its potential to fully unlock the capabilities of LLMs.
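A minimal sketch of the reranking idea, assuming candidates are scored from pooled base-model hidden states; the layer choice, pooling, and head shape here are our assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class LightweightRanker(nn.Module):
    """Small scoring head over base-model features (well under 0.5M params)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(hidden_size, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_candidates, hidden_size) pooled features, one per candidate.
        return self.score(feats).squeeze(-1)  # (num_candidates,) scores

def rerank(candidates: list[str], feats: torch.Tensor, ranker: LightweightRanker) -> list[str]:
    """Return candidate responses sorted by ranker score, best first."""
    order = torch.argsort(ranker(feats), descending=True)
    return [candidates[i] for i in order.tolist()]
```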
[4] Framework for Machine Evaluation of Reasoning Completeness in Large Language Models For Classification Tasks
Avinash Patil
Main category: cs.CL
TL;DR: RACE is a framework that evaluates how well LLM-generated explanations align with interpretable feature importance scores, revealing that correct predictions better cover supporting features while errors show higher coverage of contradicting features.
Details
Motivation: As ML is increasingly used in sensitive domains, there's growing demand for transparent AI. While LLMs can produce natural language explanations, it's unclear if these rationales faithfully capture the actual predictive signals behind decisions.
Method: RACE compares LLM rationales against top-ranked supporting and contradicting lexical features from logistic regression baselines across four text classification datasets, using token-aware, exact string, and edit-distance matching techniques at multiple granularity levels.
Result: Empirical results show consistent asymmetry: correct predictions have higher coverage of supporting features, while incorrect predictions show elevated coverage of contradicting features. Edit-distance matching reveals paraphrastic overlaps that boost coverage while preserving this asymmetry.
Conclusion: LLM rationales combine both surface-level and flexible evidence reuse but can amplify misleading cues in error cases. RACE provides insights into LLM explanation faithfulness and establishes quantitative basis for evaluating reasoning completeness in neural language models.
Abstract: The growing adoption of machine learning (ML) in sensitive domains has heightened the demand for transparent and interpretable artificial intelligence. Large Language Models (LLMs) are increasingly capable of producing natural language explanations, yet it remains unclear whether these rationales faithfully capture the predictive signals that underlie decisions. This paper introduces RACE (Reasoning Alignment for Completeness of Explanations), a systematic framework to evaluate the alignment between LLM-generated explanations and interpretable feature importance scores derived from a logistic regression baseline. We analyze four widely used text classification datasets (WIKI ONTOLOGY, AG NEWS, IMDB, and GOEMOTIONS) and compare LLM rationales against top-ranked supporting and contradicting lexical features. To capture alignment at multiple levels of granularity, RACE implements token-aware, exact string, and edit-distance matching techniques. Empirical results reveal a consistent asymmetry: correct predictions exhibit higher coverage of supporting features, while incorrect predictions are associated with elevated coverage of contradicting features. Edit-distance matching further uncovers paraphrastic overlaps, boosting coverage while preserving this asymmetry. These findings demonstrate that LLM rationales combine both surface-level and flexible evidence reuse, yet can also amplify misleading cues in error cases. RACE provides new insights into the faithfulness of LLM explanations and establishes a quantitative basis for evaluating reasoning completeness in neural language models.
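A minimal sketch of the coverage computation described above, assuming whitespace tokenization and a difflib similarity ratio as the edit-distance-style matcher; both are illustrative stand-ins for the paper's exact matching techniques.

```python
import difflib

def covered(feature: str, rationale_tokens: list[str], fuzz: float = 0.8) -> bool:
    """True if a lexical feature appears in the rationale, exactly or approximately."""
    feature = feature.lower()
    for tok in rationale_tokens:
        if tok == feature:  # exact string match
            return True
        if difflib.SequenceMatcher(None, feature, tok).ratio() >= fuzz:
            return True      # edit-distance-style (paraphrastic) match
    return False

def coverage(features: list[str], rationale: str) -> float:
    """Fraction of top-ranked features the rationale mentions."""
    tokens = rationale.lower().split()
    return sum(covered(f, tokens) for f in features) / len(features) if features else 0.0

# The reported asymmetry compares, per prediction:
# coverage(supporting_features, rationale) vs. coverage(contradicting_features, rationale)
```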
[5] Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning
Anh Pham, Mihir Thalanki, Michael Sun, Aditya Chaloo, Ankita Gupta, Tian Xia, Aditya Mate, Ehimwenma Nosakhare, Soundararajan Srinivasan
Main category: cs.CL
TL;DR: A behavior-aware sampling framework that selects safety examples based on instruction-response behavior and semantic diversity can reduce catastrophic forgetting of safety behaviors during fine-tuning, achieving 41% harm reduction with only 0.5% additional training data.
Details
Motivation: Large language models often lose safety behaviors when fine-tuned on benign data (catastrophic forgetting), and while random safety examples help, it's unclear which examples are most effective.
Method: Proposed a behavior-aware sampling framework that selects safety examples based on instruction-response behavior (refusal vs compliance) and semantic diversity across harm categories.
Result: The approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to 41% reduction in harmfulness with only 0.5% additional training data.
Conclusion: Targeted data selection can significantly improve the safety and efficiency of fine-tuning at scale.
Abstract: Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling framework that selects safety examples based on two complementary factors: instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. Systematic evaluation shows that this approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5% additional training data. These results highlight how targeted data selection can improve the safety and efficiency of fine-tuning at scale.
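A minimal sketch of the selection idea, assuming each candidate safety example carries a behavior label and an embedding; the greedy max-min diversity criterion is our illustrative choice, not necessarily the paper's algorithm.

```python
import numpy as np

def select_safety_examples(embeddings: np.ndarray, behaviors: list[str],
                           budget: int, refusal_share: float = 0.5) -> list[int]:
    """embeddings: (n, d) array; behaviors: 'refusal' or 'compliance' per example."""
    picked = []
    for label, share in (("refusal", refusal_share), ("compliance", 1 - refusal_share)):
        pool = [i for i, b in enumerate(behaviors) if b == label]
        if not pool:
            continue
        k = min(max(1, round(budget * share)), len(pool))
        chosen = [pool[0]]
        while len(chosen) < k:
            # Greedy max-min: take the candidate farthest from everything chosen,
            # spreading picks across harm categories in embedding space.
            dists = [min(np.linalg.norm(embeddings[i] - embeddings[j]) for j in chosen)
                     for i in pool]
            chosen.append(pool[int(np.argmax(dists))])
        picked.extend(chosen)
    return picked
```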
[6] Embedding Trust: Semantic Isotropy Predicts Nonfactuality in Long-Form Text Generation
Dhrupad Bhardwaj, Julia Kempe, Tim G. J. Rudner
Main category: cs.CL
TL;DR: Introduces semantic isotropy (embedding uniformity) to assess LLM response trustworthiness without labeled data or fine-tuning, outperforming existing methods.
Details
Motivation: Need reliable, low-cost methods to assess trustworthiness of long-form LLM responses in high-stakes domains, avoiding expensive claim-by-claim fact-checking.
Method: Generate multiple long-form responses, embed them, and measure semantic isotropy as angular dispersion of embeddings on unit sphere.
Result: Higher semantic isotropy (greater embedding dispersion) reliably indicates lower factual consistency across samples.
Conclusion: Semantic isotropy offers practical, low-cost trust assessment for real-world LLM workflows, outperforming existing approaches across multiple domains.
Abstract: To deploy large language models (LLMs) in high-stakes application domains that require substantively accurate responses to open-ended prompts, we need reliable, computationally inexpensive methods that assess the trustworthiness of long-form responses generated by LLMs. However, existing approaches often rely on claim-by-claim fact-checking, which is computationally expensive and brittle in long-form responses to open-ended prompts. In this work, we introduce semantic isotropy – the degree of uniformity across normalized text embeddings on the unit sphere – and use it to assess the trustworthiness of long-form responses generated by LLMs. To do so, we generate several long-form responses, embed them, and estimate the level of semantic isotropy of these responses as the angular dispersion of the embeddings on the unit sphere. We find that higher semantic isotropy – that is, greater embedding dispersion – reliably signals lower factual consistency across samples. Our approach requires no labeled data, no fine-tuning, and no hyperparameter selection, and can be used with open- or closed-weight embedding models. Across multiple domains, our method consistently outperforms existing approaches in predicting nonfactuality in long-form responses using only a handful of samples – offering a practical, low-cost approach for integrating trust assessment into real-world LLM workflows.
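A minimal sketch of the semantic-isotropy score as described: embed several sampled responses, normalize them to the unit sphere, and measure angular dispersion. The paper's exact estimator may differ.

```python
import numpy as np

def semantic_isotropy(embeddings: np.ndarray) -> float:
    """embeddings: (n_samples, d), one embedding per sampled long-form response."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T                       # pairwise cosine similarities
    iu = np.triu_indices(len(unit), k=1)       # upper triangle, excluding diagonal
    angles = np.arccos(np.clip(sims[iu], -1.0, 1.0))
    # Larger mean angular dispersion => higher isotropy => lower factual consistency.
    return float(angles.mean())
```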
[7] Understanding Network Behaviors through Natural Language Question-Answering
Mingzhe Xing, Chang Tian, Jianan Zhang, Lichen Pan, Peipei Liu, Zhaoteng Yan, Yinliang Yue
Main category: cs.CL
TL;DR: NetMind is a framework that uses natural language queries to understand network behaviors, addressing LLM limitations with tree-based configuration chunking, unified fact graphs, and hybrid imperative-declarative language.
Details
Motivation: Traditional network configuration analysis methods have steep learning curves and limited flexibility, while natural language offers more accessible interfaces. LLMs show promise but face challenges with long configurations, device heterogeneity, and complex reasoning requirements.
Method: Proposes tree-based configuration chunking to preserve semantics, constructs unified fact graphs to normalize vendor-specific configurations, and designs hybrid imperative-declarative language to reduce LLM reasoning burden.
Result: NetMind achieves accurate and scalable network behavior understanding, outperforming existing baselines in experiments using a contributed benchmark of NL question-answer pairs with network configurations.
Conclusion: The framework successfully addresses key challenges in NL-guided network analysis, demonstrating practical effectiveness for network behavior understanding through innovative approaches to configuration processing and reasoning assistance.
Abstract: Modern large-scale networks introduce significant complexity in understanding network behaviors, increasing the risk of misconfiguration. Prior work proposed to understand network behaviors by mining network configurations, typically relying on domain-specific languages interfaced with formal models. While effective, they suffer from a steep learning curve and limited flexibility. In contrast, natural language (NL) offers a more accessible and interpretable interface, motivating recent research on NL-guided network behavior understanding. Recent advances in large language models (LLMs) further enhance this direction, leveraging their extensive prior knowledge of network concepts and strong reasoning capabilities. However, three key challenges remain: 1) numerous router devices with lengthy configuration files challenge LLM’s long-context understanding ability; 2) heterogeneity across devices and protocols impedes scalability; and 3) complex network topologies and protocols demand advanced reasoning abilities beyond the current capabilities of LLMs. To tackle the above challenges, we propose NetMind, a novel framework for querying networks using NL. Our approach introduces a tree-based configuration chunking strategy to preserve semantic coherence while enabling efficient partitioning. We then construct a unified fact graph as an intermediate representation to normalize vendor-specific configurations. Finally, we design a hybrid imperative-declarative language to reduce the reasoning burden on LLMs and enhance precision. We contribute a benchmark consisting of NL question-answer pairs paired with network configurations. Experiments demonstrate that NetMind achieves accurate and scalable network behavior understanding, outperforming existing baselines.
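A minimal sketch of tree-based configuration chunking on an indentation-structured router config: emit one chunk per top-level stanza so each chunk stays semantically coherent. This is our illustrative reading, not NetMind's actual parser.

```python
def chunk_config(lines: list[str]) -> list[str]:
    """Split a config into chunks at top-level (unindented) stanza boundaries."""
    chunks, current = [], []
    for line in lines:
        if line and not line[0].isspace():   # new top-level stanza begins
            if current:
                chunks.append("\n".join(current))
            current = [line]
        else:
            current.append(line)             # indented line stays with its stanza
    if current:
        chunks.append("\n".join(current))
    return chunks
```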
[8] Deep Literature Survey Automation with an Iterative Workflow
Hongbo Zhang, Han Cui, Yidong Wang, Yijian Tian, Qi Guo, Cunxiang Wang, Jian Wu, Chiyu Song, Yue Zhang
Main category: cs.CL
TL;DR: IterSurvey is a framework for automatic literature survey generation that uses recurrent outline generation and paper cards to improve content coverage, structural coherence, and citation quality compared to one-shot methods.
Details
Motivation: Existing one-shot survey generation systems suffer from noisy retrieval, fragmented structures, and context overload, limiting survey quality. The authors were inspired by human researchers' iterative reading process.
Method: Uses recurrent outline generation with a planning agent that incrementally retrieves, reads, and updates outlines. Implements paper cards to distill papers into contributions, methods, and findings, plus a review-and-refine loop with visualization enhancement.
Result: Substantially outperforms state-of-the-art baselines in content coverage, structural coherence, and citation quality. Produces more accessible and better-organized surveys. Also introduces Survey-Arena benchmark for reliable assessment.
Conclusion: IterSurvey provides a more effective approach to automatic literature survey generation through iterative processing and faithful paper-level grounding, producing higher quality surveys than existing methods.
Abstract: Automatic literature survey generation has attracted increasing attention, yet most existing systems follow a one-shot paradigm, where a large set of papers is retrieved at once and a static outline is generated before drafting. This design often leads to noisy retrieval, fragmented structures, and context overload, ultimately limiting survey quality. Inspired by the iterative reading process of human researchers, we propose IterSurvey, a framework based on recurrent outline generation, in which a planning agent incrementally retrieves, reads, and updates the outline to ensure both exploration and coherence. To provide faithful paper-level grounding, we design paper cards that distill each paper into its contributions, methods, and findings, and introduce a review-and-refine loop with visualization enhancement to improve textual flow and integrate multimodal elements such as figures and tables. Experiments on both established and emerging topics show that IterSurvey substantially outperforms state-of-the-art baselines in content coverage, structural coherence, and citation quality, while producing more accessible and better-organized surveys. To provide a more reliable assessment of such improvements, we further introduce Survey-Arena, a pairwise benchmark that complements absolute scoring and more clearly positions machine-generated surveys relative to human-written ones. The code is available at https://github.com/HancCui/IterSurvey_Autosurveyv2.
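A minimal sketch of the "paper card" record described above, distilling a paper into contributions, methods, and findings; the field names are our paraphrase of the summary, not the framework's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class PaperCard:
    """Compact, paper-level grounding unit consumed by the planning agent."""
    title: str
    contributions: list[str] = field(default_factory=list)
    methods: list[str] = field(default_factory=list)
    findings: list[str] = field(default_factory=list)
```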
[9] Explaining and Mitigating Crosslingual Tokenizer Inequities
Catherine Arnett, Tyler A. Chang, Stella Biderman, Benjamin K. Bergen
Main category: cs.CL
TL;DR: Token premiums (disparities in token counts across languages) persist even after controlling for dataset size, vocabulary size, and content. Training 7,000 monolingual tokenizers reveals that vocabulary size and pre-tokenization affect token premiums, while data similarity doesn’t. Optimal vocabulary sizes and superword tokenizers can significantly reduce token premium effects.
Details
Motivation: High token premiums lead to reduced training throughput and increased inference costs. Understanding cross-linguistic differences causing these disparities is crucial for optimizing multilingual NLP systems.
Method: Trained ~7,000 comparable monolingual tokenizers for 97 languages, manipulating tokenization algorithm, vocabulary size, and dataset size. Measured token premiums and tested relationships with data similarity, vocabulary size, pre-tokenization, and language-specific features like writing system and word length.
Result: Data similarity between training and test data doesn’t impact token premiums, but vocabulary size and pre-tokenization do. Superword tokenizers (allowing merges over whitespaces) reduce token premium effects and improve compression overall. Optimal vocabulary sizes for each language can significantly reduce token premium effects.
Conclusion: Intervening on vocabulary size or pre-tokenizer significantly reduces crosslingual token premium effects. Superword tokenizers and language-specific optimal vocabulary sizes are effective strategies for mitigating token disparities across languages.
Abstract: The number of tokens it takes to encode parallel text in different languages is known to vary. These disparities are called token premiums. Having high token premiums leads to less throughput during training and increases costs at inference. In this paper, we show that even after controlling for dataset size, vocabulary size, and data content, monolingual tokenizers exhibit a wide range of token premiums across languages. To understand the cross-linguistic differences that cause these token premiums, we train a suite of approximately 7,000 comparable monolingual tokenizers for 97 languages, manipulating tokenization algorithm, vocabulary size, and dataset size. We measure token premiums and test for a relationship between factors such as data similarity (between tokenizer training and evaluation), vocabulary size, and pre-tokenization. We also investigate the role of language-specific features such as writing system and word length. We find that similarity between training and test data does not impact token premiums, but vocabulary size and pre-tokenization do. While simply increasing vocabulary size does not lead to reduced token premium effects, we can determine an "optimal" vocabulary size for each language to achieve significantly reduced token premium effects. We also train superword tokenizers which allow merges over whitespaces, and we find that they both reduce token premium effects and improve compression overall. Thus, intervening on the vocabulary size or the pre-tokenizer significantly reduces crosslingual token premium effects.
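A minimal sketch of a token-premium measurement on parallel text, reading the premium as the ratio of a language's token count to a reference language's count for the same content; the paper's precise definition may differ.

```python
def token_premium(tokenizers: dict, parallel_texts: dict[str, str],
                  reference: str = "en") -> dict[str, float]:
    """tokenizers: lang -> monolingual tokenizer; parallel_texts: lang -> same content."""
    counts = {lang: len(tokenizers[lang].encode(text))
              for lang, text in parallel_texts.items()}
    # Premium > 1.0 means the language needs more tokens than the reference
    # to encode identical content.
    return {lang: counts[lang] / counts[reference] for lang in counts}

# Hypothetical usage with per-language tokenizers:
# premiums = token_premium(toks, {"en": "The cat sat.", "fi": "Kissa istui."})
```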
[10] Arabic Little STT: Arabic Children Speech Recognition Dataset
Mouhand Alkadri, Dania Desouki, Khloud Al Jallad
Main category: cs.CL
TL;DR: Created Arabic Little STT dataset of Levantine Arabic child speech and evaluated Whisper ASR models, revealing poor performance on child speech compared to adult benchmarks.
Details
Motivation: Address data scarcity for low-resource languages like Arabic and the absence of child-specific speech corpora in ASR development.
Method: Created Arabic Little STT dataset with 355 utterances from 288 children (ages 6-13), then systematically assessed eight Whisper ASR variants on this child speech dataset.
Result: Even the best-performing Whisper model (Large_v3) achieved 0.66 WER on child speech, significantly worse than its sub-0.20 WER on adult datasets, highlighting the performance gap.
Conclusion: Critical need for dedicated child speech benchmarks, inclusive training data, and strict ethical frameworks for child data protection in ASR development.
Abstract: The performance of Artificial Intelligence (AI) systems fundamentally depends on high-quality training data. However, low-resource languages like Arabic suffer from severe data scarcity. Moreover, the absence of child-specific speech corpora is an essential gap that poses significant challenges. To address this gap, we present our dataset, Arabic Little STT, a corpus of Levantine Arabic child speech recorded in classrooms, containing 355 utterances from 288 children (ages 6-13). We further conduct a systematic assessment of Whisper, a state-of-the-art automatic speech recognition (ASR) model, on this dataset and compare its performance with adult Arabic benchmarks. Our evaluation across eight Whisper variants reveals that even the best-performing model (Large_v3) struggles significantly, achieving a 0.66 word error rate (WER) on child speech, starkly contrasting with its sub-0.20 WER on adult datasets. These results align with other research on English speech. They highlight the critical need for dedicated child speech benchmarks and inclusive training data in ASR development, and such data must be governed by strict ethical and privacy frameworks to protect sensitive child information. We hope that this study provides an initial step for future work on equitable speech technologies for Arabic-speaking children, and that our publicly available dataset enriches children's demographic representation in ASR datasets.
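A minimal sketch of the WER evaluation described above, using the jiwer library and a Whisper checkpoint through transformers; the sample format below is a placeholder, not the dataset's actual schema.

```python
import jiwer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

def evaluate_wer(samples) -> float:
    """samples: iterable of {"audio": path_or_array, "text": reference transcript}."""
    refs, hyps = [], []
    for s in samples:
        hyps.append(asr(s["audio"])["text"])  # model transcription
        refs.append(s["text"])                # human reference
    # The paper reports 0.66 WER on child speech vs. sub-0.20 on adult benchmarks.
    return jiwer.wer(refs, hyps)
```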
[11] Model-Aware Tokenizer Transfer
Mykola Haltiuk, Aleksander Smywiński-Pohl
Main category: cs.CL
TL;DR: MATT is a model-aware tokenizer transfer method that uses attention patterns to improve multilingual LLM adaptation, outperforming heuristic approaches.
Details
Motivation: Existing tokenizer transfer methods rely on semantic heuristics and ignore higher-layer model dynamics, limiting transfer quality for lower-resource languages.
Method: Proposes Model-Aware Tokenizer Transfer (MATT) with Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from source to target model.
Result: MATT recovers large fraction of original model performance within few GPU hours, outperforming heuristic baselines across diverse linguistic settings.
Conclusion: Incorporating model-level signals offers practical and effective path toward robust tokenizer transfer in multilingual LLMs.
Abstract: Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-script languages. Existing tokenizer transfer methods typically rely on semantic heuristics to initialize new embeddings, ignoring higher-layer model dynamics and limiting transfer quality. We propose Model-Aware Tokenizer Transfer (MATT), a method that incorporates model internals into the tokenizer transfer process. MATT introduces an Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from a source model into a target model with a new tokenizer, providing an efficient warm-up before standard language modeling. Unlike approaches that focus solely on embedding similarity, MATT leverages attention behavior to guide embedding initialization and adaptation. Experiments across diverse linguistic settings show that MATT recovers a large fraction of the original model’s performance within a few GPU hours, outperforming heuristic baselines. These results demonstrate that incorporating model-level signals offers a practical and effective path toward robust tokenizer transfer in multilingual LLMs.
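A minimal sketch in the spirit of the AIM objective: distill the source model's attention maps into the target model with the new tokenizer. The summary does not give the exact formulation, so the KL-over-attention-rows loss and the assumption of pre-aligned sequence lengths across tokenizers are ours.

```python
import torch
import torch.nn.functional as F

def aim_loss(src_attn: torch.Tensor, tgt_attn: torch.Tensor) -> torch.Tensor:
    """src_attn, tgt_attn: (batch, heads, seq, seq) attention probabilities,
    assumed already aligned to a shared sequence length across tokenizers."""
    eps = 1e-8
    # Match the target model's inter-token communication pattern to the source's.
    return F.kl_div((tgt_attn + eps).log(), src_attn, reduction="batchmean")
```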
[12] A Stylometric Application of Large Language Models
Harrison F. Stropkay, Jiayi Chen, Mohammad J. Latifi, Daniel N. Rockmore, Jeremy R. Manning
Main category: cs.CL
TL;DR: LLMs can distinguish authors’ writing styles by training separate models on each author’s works, with each model better predicting its own author’s text than others’ texts.
Details
Motivation: To develop a method for identifying authors based on their unique writing styles using language models, and to apply this to verify authorship of disputed works.
Method: Train individual GPT-2 models from scratch on each author's works, then compare how well each model predicts held-out text from its own author versus other authors.
Result: Models trained on one author’s works predict that author’s text more accurately than other authors’ texts. Successfully confirmed R.P. Thompson’s authorship of the 15th Oz book originally attributed to F.L. Baum.
Conclusion: Language models can effectively capture and identify individual authors’ unique writing styles, providing a reliable method for authorship attribution.
Abstract: We show that large language models (LLMs) can be used to distinguish the writings of different authors. Specifically, an individual GPT-2 model, trained from scratch on the works of one author, will predict held-out text from that author more accurately than held-out text from other authors. We suggest that, in this way, a model trained on one author’s works embodies the unique writing style of that author. We first demonstrate our approach on books written by eight different (known) authors. We also use this approach to confirm R. P. Thompson’s authorship of the well-studied 15th book of the Oz series, originally attributed to F. L. Baum.
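A minimal sketch of the attribution rule described above: score held-out text under each author-specific model and attribute it to the author whose model assigns the lowest perplexity (a shared tokenizer is assumed here for simplicity).

```python
import math
import torch

def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity of text under a causal LM (e.g., a per-author GPT-2)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return math.exp(loss.item())

def attribute(models: dict, tokenizer, text: str) -> str:
    """models: author name -> GPT-2 trained from scratch on that author's works."""
    return min(models, key=lambda author: perplexity(models[author], tokenizer, text))
```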
[13] Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks
Havva Alizadeh Noughabi, Julien Serbanescu, Fattane Zarrinkalam, Ali Dehghantanha
Main category: cs.CL
TL;DR: LLMs remain vulnerable to jailbreak attacks despite alignment safeguards. This paper explores using persuasive strategies from social sciences to craft adversarial prompts that bypass LLM safety constraints.
Details
Motivation: Little attention has been paid to linguistic and psychological mechanisms influencing LLM susceptibility to jailbreak attacks. The authors hypothesize that LLMs trained on human text may respond more compliantly to persuasive structures.
Method: Leverage foundational theories of persuasion from social sciences to craft adversarial prompts. Investigate whether LLMs exhibit distinct persuasive fingerprints in their jailbreak responses through empirical evaluations across multiple aligned LLMs.
Result: Persuasion-aware prompts significantly bypass LLM safeguards and demonstrate potential to induce jailbreak behaviors. LLMs show distinct persuasive fingerprints in their responses.
Conclusion: Cross-disciplinary insights from social sciences are crucial for addressing evolving LLM safety challenges. Persuasion strategies can effectively circumvent alignment constraints in language models.
Abstract: Despite recent advances, Large Language Models remain vulnerable to jailbreak attacks that bypass alignment safeguards and elicit harmful outputs. While prior research has proposed various attack strategies differing in human readability and transferability, little attention has been paid to the linguistic and psychological mechanisms that may influence a model’s susceptibility to such attacks. In this paper, we examine an interdisciplinary line of research that leverages foundational theories of persuasion from the social sciences to craft adversarial prompts capable of circumventing alignment constraints in LLMs. Drawing on well-established persuasive strategies, we hypothesize that LLMs, having been trained on large-scale human-generated text, may respond more compliantly to prompts with persuasive structures. Furthermore, we investigate whether LLMs themselves exhibit distinct persuasive fingerprints that emerge in their jailbreak responses. Empirical evaluations across multiple aligned LLMs reveal that persuasion-aware prompts significantly bypass safeguards, demonstrating their potential to induce jailbreak behaviors. This work underscores the importance of cross-disciplinary insight in addressing the evolving challenges of LLM safety. The code and data are available.
[14] Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models
Sarah Ball, Niki Hasrati, Alexander Robey, Avi Schwarzschild, Frauke Kreuter, Zico Kolter, Andrej Risteski
Main category: cs.CL
TL;DR: The paper analyzes when and why jailbreaking suffixes transfer between prompts and models, identifying three key statistical properties that correlate with transfer success.
Details
Motivation: Despite empirical evidence of transferability in jailbreaking attacks, there's a lack of rigorous analysis of when and why transfer occurs between different prompts and models.
Method: Identified three statistical properties correlated with transfer success: refusal direction activation, push away from refusal direction, and orthogonal direction shifts. Compared these with semantic similarity.
Result: Found that the three statistical properties strongly correlate with transfer success across experiments, while prompt semantic similarity only weakly correlates.
Conclusion: Provides a fine-grained understanding of transferability that can be used to improve attack success rates in practical applications.
Abstract: Discrete optimization-based jailbreaking attacks on large language models aim to generate short, nonsensical suffixes that, when appended onto input prompts, elicit disallowed content. Notably, these suffixes are often transferable – succeeding on prompts and models for which they were never optimized. And yet, despite the fact that transferability is surprising and empirically well-established, the field lacks a rigorous analysis of when and why transfer occurs. To fill this gap, we identify three statistical properties that strongly correlate with transfer success across numerous experimental settings: (1) how much a prompt without a suffix activates a model’s internal refusal direction, (2) how strongly a suffix induces a push away from this direction, and (3) how large these shifts are in directions orthogonal to refusal. On the other hand, we find that prompt semantic similarity only weakly correlates with transfer success. These findings lead to a more fine-grained understanding of transferability, which we use in interventional experiments to showcase how our statistical analysis can translate into practical improvements in attack success.
[15] Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics
Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag
Main category: cs.CL
TL;DR: Quality Estimation metrics in machine translation exhibit significant length bias - they over-predict errors in longer translations and prefer shorter translations, which can lead to unfair penalization of correct longer translations.
Details
Motivation: To investigate the underexplored prevalence and impact of length bias in Quality Estimation metrics, which are crucial for reference-free evaluation and as reward signals in machine translation tasks.
Method: Systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, followed by proposing two mitigation strategies: length normalization during training and incorporating reference texts during evaluation.
Result: Two critical length biases identified: QE metrics consistently over-predict errors with increasing translation length (even for error-free texts) and prefer shorter translations when multiple candidates are available for the same source text.
Conclusion: The identified length biases risk unfair penalization of longer correct translations and sub-optimal decision-making. The proposed mitigation strategies (length normalization and reference incorporation) effectively reduce length bias.
Abstract: Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as a reward signal in tasks like reinforcement learning. However, the prevalence and impact of length bias in QE have been underexplored. Through a systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts. Second, they exhibit a preference for shorter translations when multiple candidates are available for the same source text. These inherent length biases risk unfairly penalizing longer, correct translations and can lead to sub-optimal decision-making in applications such as QE reranking and QE guided reinforcement learning. To mitigate this, we propose two strategies: (a) applying length normalization during model training, and (b) incorporating reference texts during evaluation. Both approaches were found to effectively reduce the identified length bias.
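A minimal sketch of the length-normalization idea, assuming a QE model that outputs a raw error score per translation; dividing by token count is one simple normalizer, not necessarily the paper's training-time scheme.

```python
def length_normalized_score(raw_error_score: float, num_tokens: int) -> float:
    """Normalize a QE error score by translation length.

    Without normalization, predicted error mass tends to grow with length,
    which unfairly penalizes longer but correct translations in reranking."""
    return raw_error_score / max(num_tokens, 1)
```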
[16] ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, Sayna Ebrahimi
Main category: cs.CL
TL;DR: Largest multilingual scaling laws study with 774 experiments across 400+ languages, introducing Adaptive Transfer Scaling Law (ATLAS) that outperforms existing methods by 0.3 R².
Details
Motivation: Address the English-centric bias in scaling laws research and serve billions of international users by studying multilingual learning dynamics.
Method: Conducted 774 multilingual training experiments spanning 10M-8B parameters, analyzed cross-lingual transfer matrix (38×38 language pairs), and derived language-agnostic scaling laws.
Result: ATLAS outperforms existing scaling laws, reveals optimal scaling strategies for multilingual models, and identifies computational crossover points for pretraining vs finetuning.
Conclusion: Provides scientific foundation for democratizing scaling laws across languages and enables efficient scaling of multilingual models beyond English-first AI.
Abstract: Scaling laws research has focused overwhelmingly on English – yet the most prominent AI models explicitly serve billions of international users. In this work, we undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments, spanning 10M-8B model parameters, 400+ training languages and 48 evaluation languages. We introduce the Adaptive Transfer Scaling Law (ATLAS) for both monolingual and multilingual pretraining, which outperforms existing scaling laws' out-of-sample generalization often by more than 0.3 R². Our analyses of the experiments shed light on multilingual learning dynamics, transfer properties between languages, and the curse of multilinguality. First, we derive a cross-lingual transfer matrix, empirically measuring mutual benefit scores between 38 × 38 = 1,444 language pairs. Second, we derive a language-agnostic scaling law that reveals how to optimally scale model size and data when adding languages without sacrificing performance. Third, we identify the computational crossover points for when to pretrain from scratch versus finetune from multilingual checkpoints. We hope these findings provide the scientific foundation for democratizing scaling laws across languages, and enable practitioners to efficiently scale models – beyond English-first AI.
[17] Emotions Where Art Thou: Understanding and Characterizing the Emotional Latent Space of Large Language Models
Benjamin Reichman, Adar Avsian, Larry Heck
Main category: cs.CL
TL;DR: LLMs internally represent emotion through a low-dimensional emotional manifold that is directionally encoded, stable across layers, and generalizes across languages. This universal emotional subspace can be manipulated to steer emotion perception while preserving semantics.
Details
Motivation: To understand how large language models internally represent and process emotion by analyzing the geometric structure of their hidden-state space.
Method: Analyzed the geometry of LLM hidden-state space, identified emotional manifolds, tested cross-domain alignment across 8 emotion datasets in 5 languages, and developed intervention modules to manipulate emotion perception.
Result: Found a consistent low-dimensional emotional manifold that is directionally encoded, distributed across layers, and generalizes well across languages. Cross-domain alignment showed low error and strong linear probe performance. Emotion perception can be steered while preserving semantics.
Conclusion: LLMs possess a consistent and manipulable affective geometry that reveals how they internalize and process emotion, with universal emotional representations that span multiple languages and can be controlled through targeted interventions.
Abstract: This work investigates how large language models (LLMs) internally represent emotion by analyzing the geometry of their hidden-state space. The paper identifies a low-dimensional emotional manifold and shows that emotional representations are directionally encoded, distributed across layers, and aligned with interpretable dimensions. These structures are stable across depth and generalize to eight real-world emotion datasets spanning five languages. Cross-domain alignment yields low error and strong linear probe performance, indicating a universal emotional subspace. Within this space, internal emotion perception can be steered while preserving semantics using a learned intervention module, with especially strong control for basic emotions across languages. These findings reveal a consistent and manipulable affective geometry in LLMs and offer insight into how they internalize and process emotion.
[18] Compositional Bias Control in Large Language Models: Preference Learning Fails, Supervision Succeeds
Atij Mahesh
Main category: cs.CL
TL;DR: Comparative analysis of six bias mitigation techniques for LLMs shows supervised fine-tuning achieves near-perfect compliance with high diversity, while preference-based methods fail at compositional constraints.
Details
Motivation: LLMs produce gender-stereotyped language even in neutral contexts, reflecting societal biases, but comparative efficacy of mitigation techniques remains poorly understood.
Method: Analyzed six control techniques: prompt-only, generate-and-filter, DFA-based Ctrl-G decoding, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Iterative Nullspace Projection (INLP) on compositional constraint task requiring sentences with agentic and communal descriptors for occupations.
Result: SFT achieved 99.87% compliance with high lexical diversity, DPO failed at 4.53% compliance despite similar training stability. Ctrl-G guaranteed perfect compliance but severely reduced fluency and diversity. Preference-based methods cannot satisfy compositional constraints.
Conclusion: Only explicit positive supervision enables mitigation of compositional biases; preference-based alignment fails to generalize logical structures, highlighting limitations of preference learning and necessity of explicit supervision for fair controlled generation.
Abstract: Large Language Models (LLMs) still produce gender-stereotyped language even in occupation-neutral contexts that reflect deep societal biases (Rudinger et al., 2018). To address this, prior work has proposed prompting, constrained decoding (Dathathri et al., 2020; Zhou et al., 2024), post-processing, and fine-tuning-based alignment (Rafailov et al., 2023; Ravfogel et al., 2022). However, the comparative efficacy and learning dynamics remain little understood. We report a comparative analysis of six control techniques for bias mitigation: prompt-only, generate-and-filter, DFA-based Ctrl-G decoding, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Iterative Nullspace Projection (INLP). We evaluate each method on a compositional constraint task. This task requires generating sentences that contain at least one agentic and one communal descriptor for each of the twenty Winogender-derived occupations. We quantify trade-offs between control strength and naturalness with evaluations of constraint compliance, lexical diversity, and fluency. Our results reveal key contrasts among the methods: SFT achieves 99.87 ± 0.15% compliance and high lexical diversity, while DPO, despite similar training stability, fails at 4.53 ± 0.82%. Ctrl-G guarantees perfect compliance, but at the cost of severely reduced fluency and diversity. Preference-based learning fundamentally differs: it cannot satisfy compositional constraints, as binary preference signals encode ranking, not logical conjunctions. Only explicit positive supervision enables mitigation of compositional biases; preference-based alignment fails to generalize logical structures, underscoring the limitations of preference learning and the necessity of explicit supervision for fair and fluent controlled generation.
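A minimal sketch of the compliance check implied by the task: a generated sentence passes if it names the occupation and contains at least one agentic and one communal descriptor. The descriptor word lists are illustrative stand-ins, not the study's actual lexicons.

```python
# Hypothetical descriptor lexicons for illustration only.
AGENTIC = {"ambitious", "assertive", "decisive", "independent"}
COMMUNAL = {"caring", "supportive", "cooperative", "empathetic"}

def complies(sentence: str, occupation: str) -> bool:
    """Check the compositional constraint for one generated sentence."""
    words = set(sentence.lower().split())
    return (occupation.lower() in words
            and bool(words & AGENTIC)     # at least one agentic descriptor
            and bool(words & COMMUNAL))   # at least one communal descriptor

# Averaging complies() over generations approximates the compliance
# percentages reported above (e.g., 99.87% for SFT vs. 4.53% for DPO).
```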
[19] Generalization or Memorization: Dynamic Decoding for Mode Steering
Xuanming Zhang
Main category: cs.CL
TL;DR: Proposes Dynamic Mode Steering (DMS) to control LLMs’ reasoning modes - steering them from brittle memorization toward robust generalization using inference-time activation steering.
Details
Motivation: LLMs exhibit unpredictable duality between remarkable generalization and brittle verbatim memorization, undermining reliability in high-stakes applications.
Method: Developed Dynamic Mode Steering (DMS) with: (1) lightweight linear probe to detect memorization reliance, (2) dynamic activation steering to nudge computation toward generalization circuits, framed as adaptive self-contrastive decoding.
Result: Experiments on reasoning and faithfulness tasks show DMS significantly improves logical consistency and factual accuracy.
Conclusion: DMS offers a principled approach to enhance LLM reliability by controlling reasoning modes at inference time.
Abstract: Large Language Models (LLMs) exhibit a troubling duality, capable of both remarkable generalization and brittle, verbatim memorization of their training data. This unpredictability undermines their reliability in high-stakes applications. In this work, we propose a unified framework to understand, identify, and control these distinct reasoning modes. First, we introduce a theoretical model based on the Information Bottleneck (IB) principle, formalizing generalization as the learning of a compressed, task-relevant representation and memorization as a failure to compress. Building on this theory, we develop Dynamic Mode Steering (DMS), a novel inference-time algorithm which comprises two components: (1) a lightweight, causally-grounded linear probe that identifies the model’s instantaneous reliance on memorization, and (2) a dynamic activation steering mechanism that nudges the model’s computation towards pre-identified generalization circuits. We frame DMS as a form of adaptive, self-contrastive decoding. Experiments on reasoning and faithfulness tasks demonstrate that DMS significantly improves logical consistency and factual accuracy, thereby offering a principled approach to enhancing LLM reliability.
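A minimal sketch of the two DMS components described above: a linear probe that scores reliance on memorization from a hidden state, and an additive nudge toward a generalization direction. The layer choice, the probe's training, and the steering vector itself are all assumptions here.

```python
import torch
import torch.nn as nn

class MemorizationProbe(nn.Module):
    """Lightweight linear probe over a hidden state at some chosen layer."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h))  # estimated P(memorization mode)

def steer(h: torch.Tensor, probe: MemorizationProbe,
          gen_direction: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """Nudge hidden states toward the generalization direction, scaled by
    the probe's detected reliance on memorization at this decoding step."""
    weight = probe(h)                           # (batch, 1)
    return h + alpha * weight * gen_direction   # broadcasts over hidden dim
```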
[20] Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows
Billy Dickson, Zoran Tiganj
Main category: cs.CL
TL;DR: The paper introduces a simple method to extend transformer’s long-range memory by applying logarithmic compression to input tokens, rather than modifying the transformer architecture itself.
Details
Motivation: Most existing approaches for long-context processing increase transformer complexity by adding recurrence or memory modules. This work aims to preserve architectural simplicity while improving long-range memory.
Method: Apply scale-invariant logarithmic compression to input tokens, then process the compressed representation using a standard, unmodified transformer architecture.
Result: Evaluation on WikiText-103 and PG-19 benchmarks shows reduced perplexity compared to uncompressed baselines, with consistent performance improvement as compressed temporal contexts get longer.
Conclusion: Input-level logarithmic compression is a simple and effective way to extend a transformer’s long-range memory without architectural modifications.
Abstract: Most approaches to long-context processing increase the complexity of the transformer’s internal architecture by integrating mechanisms such as recurrence or auxiliary memory modules. In this work, we introduce an alternative approach that modifies the input representation itself, rather than the transformer architecture. Inspired by cognitive models of human memory, our method applies a scale-invariant logarithmic compression to the input tokens. The resulting compressed representation is processed by a standard, unmodified transformer, preserving architectural simplicity. We evaluate this approach on the WikiText-103 and PG-19 language modeling benchmarks, showing a reduction in perplexity compared to uncompressed baselines. Moreover, performance improves consistently with longer compressed temporal contexts, showing that input-level logarithmic compression is a simple and effective way to extend a transformer’s long-range memory.
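A minimal sketch of input-level logarithmic compression: keep recent tokens at full resolution and subsample older tokens at logarithmically spaced positions, so the effective context grows while the input length stays bounded. The paper's scale-invariant scheme may pool rather than subsample; this is our illustrative reading.

```python
import math

def log_compress(tokens: list[int], recent: int = 256) -> list[int]:
    """Keep the last `recent` tokens verbatim; thin the older context
    logarithmically, denser near the present and sparser far back."""
    head, tail = tokens[:-recent], tokens[-recent:]
    if not head:
        return tail
    n = len(head)
    k = max(1, int(math.log2(n)) * 8)  # compressed budget grows only logarithmically
    # Logarithmically spaced indices into the older context (deduplicated).
    idx = sorted({int(n - n ** (i / k)) for i in range(k)})
    return [head[i] for i in idx] + tail
```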
[21] OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model
Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang
Main category: cs.CL
TL;DR: OpenS2S is a fully open-source, transparent end-to-end large speech language model for empathetic speech interactions, addressing the opacity of current powerful empathetic LSLMs.
Details
Motivation: Current powerful empathetic LSLMs are increasingly closed off, making architecture, data, and development details opaque to researchers, creating a critical need for transparent research into empathetic behavior.
Method: Based on BLSP-Emo empathetic speech-to-text model, OpenS2S employs streaming interleaved decoding for low-latency speech generation and uses an automated data construction pipeline that synthesizes diverse empathetic speech dialogues using LLMs and controllable TTS systems.
Result: The model achieves scalable training with rich paralinguistic diversity and minimal human supervision, creating a high-quality empathetic speech dialogue corpus.
Conclusion: OpenS2S is released as fully open-source including dataset, model weights, and codes to empower the research community and accelerate innovation in empathetic speech systems.
Abstract: Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S
[22] Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation
Ling-Team, Ang Li, Ben Liu, Binbin Hu, Bing Li, Bingwei Zeng, Borui Ye, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Qian, Chenchen Ju, Chenchen Li, Chengfu Tang, Chili Fu, Chunshao Ren, Chunwei Wu, Cong Zhang, Cunyin Peng, Dafeng Xu, Daixin Wang, Dalong Zhang, Dingnan Jin, Dingyuan Zhu, Dongke Hu, Fangzheng Zhao, Feifan Wu, Feng Zhu, Gangshan Wang, Haitao Zhang, Hailin Zhao, Hanxiao Zhang, Hanzi Wang, Hao Qian, Haoyi Yu, Heng Zhang, Hongliang Zhang, Hongzhi Luan, Huirong Dong, Huizhong Li, Jia Li, Jia Liu, Jialong Zhu, Jian Sha, Jianping Wei, Jiaolong Yang, Jieyue Ma, Jiewei Wu, Jinjing Huang, Jingyun Tian, Jingyuan Zhang, Jinquan Sun, Juanhui Tu, Jun Liu, Jun Xu, Jun Zhou, Junjie Ou, Junpeng Fang, Kaihong Zhang, Kaiqin Hu, Ke Shi, Kun Tang, Kunlong Chen, Lanyin Mei, Lei Liang, Lei Xu, Libo Zhang, Lin Ju, Lin Yuan, Ling Zhong, Lintao Ma, Lu Liu, Lu Yu, Lun Cai, Meiqi Zhu, Mengying Li, Min Chen, Minghao Xue, Minghong Cai, Mingming Yin, Peijie Jiang, Peilong Zhao, Pingping Liu, Qian Zhao, Qing Cui, Qingxiang Huang, Qingyuan Yang, Quankun Yu, Shaowei Wei, Shijie Lian, Shoujian Zheng, Shun Song, Shungen Zhang, Shuo Zhang, Siyuan Li, Song Liu, Ting Guo, Tong Zhao, Wanli Gu, Weichang Wu, Weiguang Han, Wenjing Fang, Wubin Wang, Xiang Shu, Xiao Shi, Xiaoshun Lan, Xiaolu Zhang, Xiaqing Sun, Xin Zhao, Xingyu Lu, Xiong Xu, Xudong Wang, Xudong Wang, Xuemin Yang, Yajie Yang, Yang Xiang, Yanzhe Li, Yi Zhang, Yilong Wang, Yingxue Li, Yongzhen Guo, Yuzhuo Fu, Yuanyuan Wang, Yue Yang, Yue Yu, Yufeng Deng, Yun Zhang, Yunfei Xu, Yuqi Zhang, Yuxiao He, Zengke Gui, Zhaoxin Huan, Zhaoyang Wang, Zhibo Zhu, Zhihao Wang, Zhiqiang Zhang, Zhoufei Wang, Zihang Zeng, Ziqi Liu, Zitao Xuan, Zuoli Tang
Main category: cs.CL
TL;DR: Ling 2.0 is a reasoning-oriented language model series using Mixture-of-Experts architecture that scales from 16B to 1T parameters, achieving up to 7x computational efficiency through high sparsity and coordinated innovations in architecture, training, and infrastructure.
Details
Motivation: To create a scalable and efficient reasoning foundation that boosts reasoning capability through every activation, addressing the need for computational efficiency while maintaining high reasoning accuracy.
Method: Uses high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data with mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines.
Result: Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, with the series achieving up to 7-fold active-compute efficiency compared to dense counterparts.
Conclusion: Sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence, providing a coherent foundation for advancing future reasoning and thinking models.
Abstract: We introduce Ling 2.0, a series of reasoning-oriented language foundation models built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three non-thinking (instruct) models - Ling-mini-2.0, Ling-flash-2.0, and Ling-1T - ranging from 16B to 1T total parameters and achieving up to 7-fold active-compute efficiency compared with dense counterparts. Ling 2.0 integrates coordinated innovations across model architecture, pre-training, post-training, and infrastructure: a high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data and mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines. At the trillion scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, demonstrating that sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence. Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for advancing future reasoning and thinking models, including the Ring series built upon the same base.
[23] OlaMind: Towards Human-Like and Hallucination-Safe Customer Service for Retrieval-Augmented Dialogue
Tianhong Gao, Jundong Shen, Bei Shi, Jiapeng Wang, Ying Ju, Junfeng Yao, Jiao Ran, Yong Zhang, Lin Dong, Huiyu Yu, Tingting Ye
Main category: cs.CL
TL;DR: OlaMind is a human-like and hallucination-safe customer service framework that uses Learn-to-Think and Learn-to-Respond stages with SFT and RL to improve response quality and reduce business risks in RAG-based systems.
Details
Motivation: Current RAG-based ICS systems suffer from hallucinations and generate rigid responses, introducing business risks and poor user experience in web-based customer service.
Method: Two-stage approach: Learn-to-Think stage learns reasoning from human experts, then Learn-to-Respond stage uses cold-start SFT combined with RL for self-refinement.
Result: Significant improvements: +28.92%/+18.42% intelligent resolution rates and -6.08%/-7.12% human takeover rates in community-support/livestream-interaction scenarios.
Conclusion: OlaMind effectively enhances human-likeness while mitigating hallucinations and business risks, demonstrating consistent effectiveness across diverse real-world applications.
Abstract: Intelligent customer service (ICS) systems via retrieval-augmented generation (RAG) have been widely adopted in Web-based domains such as social platforms and e-commerce, achieving remarkable improvements in automation and efficiency. However, notable limitations still remain: these systems are prone to hallucinations and often generate rigid, mechanical responses, which can introduce business risks and undermine user experience, especially in Web-based customer service interactions under the RAG scenarios. In this paper, we introduce OlaMind, a human-like and hallucination-safe customer service framework for retrieval-augmented dialogue. Specifically, it first leverages a Learn-to-Think stage to learn the reasoning processes and response strategies from human experts, and then employs a Learn-to-Respond stage to perform cold-start supervised fine-tuning (SFT) combined with reinforcement learning (RL) for basic-to-hard self-refinement. Our method significantly enhances human-likeness and naturalness while effectively mitigating hallucinations and critical business risks. We have conducted large-scale online A/B experiments in an industry-level social customer service setting, and extensive experimental results show that OlaMind achieves significant cumulative relative improvements with intelligent resolution rates +28.92%/+18.42% and human takeover rate -6.08%/-7.12% in community-support/livestream-interaction scenarios, respectively, which highlights its consistent effectiveness across diverse real-world applications. The code and data will be publicly available.
[24] SentiMaithili: A Benchmark Dataset for Sentiment and Reason Generation for the Low-Resource Maithili Language
Rahul Ranjan, Mahendra Kumar Gurve, Anuj, Nitin, Yamuna Prasad
Main category: cs.CL
TL;DR: This paper introduces the first benchmark dataset for explainable sentiment analysis in Maithili, a low-resource Indo-Aryan language, featuring 3,221 sentences with sentiment polarity labels and natural language justifications written in Maithili.
Details
Motivation: Maithili is underrepresented in NLP research despite being spoken by over 13 million people. Low-resource languages like Maithili lack sentiment analysis resources with interpretability mechanisms, facing challenges in dataset creation due to limited linguistic experts and high annotation costs.
Method: Created a novel dataset of 3,221 Maithili sentences annotated for sentiment polarity with natural language justifications. The dataset was carefully curated and validated by linguistic experts to ensure label reliability and contextual fidelity. Experiments used both classical machine learning and state-of-the-art transformer architectures.
Result: The dataset enables effective interpretable sentiment analysis. Extensive experiments demonstrate the dataset’s effectiveness, with justifications written in Maithili promoting culturally grounded interpretation and enhancing model explainability.
Conclusion: This work establishes the first benchmark for explainable affective computing in Maithili, contributing a valuable resource to advance multilingual NLP and explainable AI for low-resource languages.
Abstract: Developing benchmark datasets for low-resource languages poses significant challenges, primarily due to the limited availability of native linguistic experts and the substantial time and cost involved in annotation. Given these challenges, Maithili is still underrepresented in natural language processing research. It is an Indo-Aryan language spoken by more than 13 million people in the Purvanchal region of India, valued for its rich linguistic structure and cultural significance. While sentiment analysis has achieved remarkable progress in high-resource languages, resources for low-resource languages, such as Maithili, remain scarce, often restricted to coarse-grained annotations and lacking interpretability mechanisms. To address this limitation, we introduce a novel dataset comprising 3,221 Maithili sentences annotated for sentiment polarity and accompanied by natural language justifications. Moreover, the dataset is carefully curated and validated by linguistic experts to ensure both label reliability and contextual fidelity. Notably, the justifications are written in Maithili, thereby promoting culturally grounded interpretation and enhancing the explainability of sentiment models. Furthermore, extensive experiments using both classical machine learning and state-of-the-art transformer architectures demonstrate the dataset’s effectiveness for interpretable sentiment analysis. Ultimately, this work establishes the first benchmark for explainable affective computing in Maithili, thus contributing a valuable resource to the broader advancement of multilingual NLP and explainable AI.
[25] DETECT: Determining Ease and Textual Clarity of German Text Simplifications
Maria Korobeynikova, Alessia Battisti, Lukas Fischer, Yingqiang Gao
Main category: cs.CL
TL;DR: DETECT is the first German-specific metric for automatic text simplification evaluation that assesses simplicity, meaning preservation, and fluency, trained entirely on synthetic LLM responses without human annotation.
Details
Motivation: Current German ATS evaluation relies on general-purpose metrics like SARI, BLEU, and BERTScore, which insufficiently capture simplification quality. Specialized metrics exist for English but not for German due to lack of human-annotated corpora.
Method: Adapts the LENS framework to German and extends it with: (i) a pipeline for generating synthetic quality scores via LLMs, enabling dataset creation without human annotation, and (ii) an LLM-based refinement step for aligning grading criteria with simplification requirements.
Result: DETECT achieves substantially higher correlations with human judgments than widely used ATS metrics, with particularly strong gains in meaning preservation and fluency. The authors also construct the largest German human evaluation dataset for text simplification to validate the metric.
Conclusion: The findings highlight both the potential and limitations of LLMs for automatic evaluation and provide transferable guidelines for general language accessibility tasks beyond ATS.
Abstract: Current evaluation of German automatic text simplification (ATS) relies on general-purpose metrics such as SARI, BLEU, and BERTScore, which insufficiently capture simplification quality in terms of simplicity, meaning preservation, and fluency. While specialized metrics like LENS have been developed for English, corresponding efforts for German have lagged behind due to the absence of human-annotated corpora. To close this gap, we introduce DETECT, the first German-specific metric that holistically evaluates ATS quality across all three dimensions of simplicity, meaning preservation, and fluency, and is trained entirely on synthetic large language model (LLM) responses. Our approach adapts the LENS framework to German and extends it with (i) a pipeline for generating synthetic quality scores via LLMs, enabling dataset creation without human annotation, and (ii) an LLM-based refinement step for aligning grading criteria with simplification requirements. To the best of our knowledge, we also construct the largest German human evaluation dataset for text simplification to validate our metric directly. Experimental results show that DETECT achieves substantially higher correlations with human judgments than widely used ATS metrics, with particularly strong gains in meaning preservation and fluency. Beyond ATS, our findings highlight both the potential and the limitations of LLMs for automatic evaluation and provide transferable guidelines for general language accessibility tasks.
[26] Dipper: Diversity in Prompts for Producing Large Language Model Ensembles in Reasoning tasks
Gregory Kang Ruey Lau, Wenyang Hu, Diwen Liu, Jizhuo Chen, See-Kiong Ng, Bryan Kian Hsiang Low
Main category: cs.CL
TL;DR: DIPPER is a training-free framework that transforms a single LLM into an inference-time ensemble using parallel prompting with diverse prompts to improve reasoning performance.
Details
Motivation: Smaller LLMs struggle with complex reasoning tasks, and existing sequential prompting methods are inefficient. Ensemble approaches show promise but need efficient implementation.
Method: DIPPER feeds an optimized and diverse set of prompts in parallel to a single LLM, creating varied reasoning paths without additional training.
Result: Significant improvements on reasoning benchmarks like MATH, where a 3-instance DIPPER ensemble of Qwen2-MATH-1.5B outperforms a larger 7B model.
Conclusion: DIPPER effectively enhances reasoning performance of smaller LLMs through parallel ensemble prompting, achieving better results than larger models.
Abstract: Large Language Models (LLMs), particularly smaller variants, still struggle with complex reasoning tasks. While inference-time prompting can guide reasoning, existing methods often rely on sequential queries. Ensemble approaches offer a promising path to performance gains, especially given recent batch inference speed-ups. This work introduces DIPPER, a novel, training-free framework that transforms a single LLM into an effective inference-time ensemble. By feeding the model an optimized and diverse set of prompts in parallel, DIPPER elicits varied reasoning paths, leading to performance gains. We empirically demonstrate significant improvements on reasoning benchmarks, such as MATH, where a DIPPER ensemble of three Qwen2-MATH-1.5B instances (via parallel prompting of a single model) outperforms a larger 7B model.
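The mechanics reduce to a small loop: query one model under several distinct prompts in parallel and pool the answers. A minimal sketch follows; the prompt set is illustrative rather than the paper's optimized set, and `generate` is a placeholder for any LLM client.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Three stylistically different reasoning prompts (illustrative stand-ins,
# not the paper's optimized prompt set).
PROMPTS = [
    "Solve the problem step by step, then give the final answer after '####'.",
    "Explain your reasoning as if teaching a student; end with '#### <answer>'.",
    "List what is asked and what is given, then solve; end with '#### <answer>'.",
]

def extract_answer(text: str) -> str:
    return text.rsplit("####", 1)[-1].strip()

def dipper(question: str, generate) -> str:
    """generate(system_prompt, question) -> str is any LLM call. The same
    model answers once per prompt, in parallel; answers are pooled by
    majority vote."""
    with ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
        outputs = pool.map(lambda p: generate(p, question), PROMPTS)
    return Counter(extract_answer(o) for o in outputs).most_common(1)[0][0]
```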
[27] Estimating the Error of Large Language Models at Pairwise Text Comparison
Tianyi Li
Main category: cs.CL
TL;DR: This paper measures LLMs’ error rates in pairwise text comparison tasks without relying on ground truth, using two scenarios: uniform error rate and binary positional bias. The method employs Copeland counting to construct rankings and estimate error rates across six LLMs.
Details
Motivation: To quantify LLMs' reliability in pairwise text comparison tasks by measuring their error probabilities and positional biases, since traditional methods rely on ground truth which may not always be available.
Method: Uses pairwise comparisons with two scenarios: (i) uniform error rate estimated with two comparisons per text pair, (ii) binary positional bias with distinct error rates for different comparison orders using repeated comparisons. Applies Copeland counting to construct rankings from preferences.
Result: Tested six LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, Qwen) with five text types. Found consistent error rate estimates, with positional bias terms similar to uniform error. Claude performed best considering error rates and prompt robustness. The proposed model outperforms biased Bradley-Terry model and commutativity score.
Conclusion: The method successfully estimates LLMs’ error rates in pairwise comparison without ground truth, revealing poor scalability of LLM-based pairwise comparison. Claude demonstrated the most desirable performance among tested models.
Abstract: We measure LLMs’ output error at pairwise text comparison, noting the probability of error in their preferences. Our method does not rely on the ground truth and supports two scenarios: (i) uniform error rate regardless of the order of comparison, estimated with two comparisons for each text pair with either text placed first; (ii) binary positional bias assuming distinct error rates for the two orders of comparison, estimated with repeated comparisons between the texts. The Copeland counting constructs a ranking over the compared texts from pairwise preferences; the ranking reveals the poor scalability of LLM-based pairwise comparison and helps yield the estimates for LLMs’ error rates. We apply the method to six LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, Qwen) with five types of text input and obtain consistent estimates of LLMs’ error. In general, the measured two positional bias terms are similar, close to the uniform error. Considering both the error rates and the robustness to the variation of prompts, Claude obtained the most desirable performance in this experiment. Our model outperforms the biased Bradley-Terry model and the commutativity score in indicating LLMs’ error at this task.
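Copeland counting itself is straightforward to reproduce: each text earns a point for every pairwise win, and the ranking follows from the scores. A minimal sketch is below; the half-point-per-tie convention is a common choice, not necessarily the paper's.

```python
# Copeland ranking from pairwise preferences. prefer(a, b) should return
# 1 if a is preferred, -1 if b is preferred, 0 if tied; querying each pair
# in both orders is how the uniform-error scenario is estimated.
from itertools import combinations

def copeland_ranking(items, prefer):
    score = {x: 0.0 for x in items}
    for a, b in combinations(items, 2):
        out = prefer(a, b)
        if out > 0:
            score[a] += 1.0
        elif out < 0:
            score[b] += 1.0
        else:                     # tie: split the point (common convention)
            score[a] += 0.5
            score[b] += 0.5
    return sorted(items, key=score.get, reverse=True)
```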
[28] Evolution of the lexicon: a probabilistic point of view
Maurizio Serva
Main category: cs.CL
TL;DR: The paper analyzes limitations of the Swadesh approach for language dating, showing that even under ideal assumptions, mathematical probability limits accuracy. It introduces gradual lexical modification as an important second process that improves dating precision.
Details
Motivation: To address the unrealistic assumptions and inherent mathematical limitations of the Swadesh approach in lexicostatistics, and to improve accuracy by considering gradual lexical modification alongside word replacement.
Method: Mathematical analysis of probabilistic limits in language dating, and incorporation of gradual lexical modification as a second stochastic process in vocabulary evolution models.
Result: Demonstrates inherent accuracy limits in Swadesh dating due to probability constraints, and shows that including gradual lexical modification significantly improves temporal separation estimation precision.
Conclusion: The Swadesh approach has fundamental accuracy limitations even under ideal conditions, but considering gradual lexical modification alongside word replacement provides more accurate language dating results.
Abstract: The Swadesh approach for determining the temporal separation between two languages relies on the stochastic process of word replacement (when a completely new word emerges to represent a given concept). It is well known that the basic assumptions of the Swadesh approach are often unrealistic due to various contamination phenomena and misjudgments (horizontal transfers, variations over time and space of the replacement rate, incorrect assessments of cognacy relationships, presence of synonyms, and so on). All of this means that the results cannot be completely correct. More importantly, even in the unrealistic case that all basic assumptions are satisfied, simple mathematics places limits on the accuracy of estimating the temporal separation between two languages. These limits, which are purely probabilistic in nature and often neglected in lexicostatistical studies, are analyzed in detail in this article. Furthermore, in this work we highlight that the evolution of a language’s lexicon is also driven by another stochastic process: gradual lexical modification of words. We show that this process likewise represents a major contribution to the reshaping of the vocabulary of languages over the centuries, and we show, from a purely probabilistic perspective, that taking into account this second random process significantly increases the precision in determining the temporal separation between two languages.
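The purely probabilistic limit can be made concrete with a back-of-envelope calculation under the classical glottochronology model (a textbook simplification, not the paper's exact formulation): each of N concepts survives replacement on both lineages with probability e^(-2λt), so the observed cognate fraction is binomial and its sampling noise propagates into the date estimate.

```python
# Back-of-envelope illustration of the irreducible dating uncertainty in the
# classical Swadesh model. N concepts; replacement rate lam per millennium on
# each lineage; shared-cognate fraction c = exp(-2 * lam * t). All parameter
# values are standard textbook choices, not taken from the paper.
import math

def date_and_error(c_hat: float, N: int, lam: float = 0.14):
    t_hat = -math.log(c_hat) / (2 * lam)          # point estimate (millennia)
    sigma_c = math.sqrt(c_hat * (1 - c_hat) / N)  # binomial noise on c_hat
    sigma_t = sigma_c / (2 * lam * c_hat)         # delta-method propagation
    return t_hat, sigma_t

t, s = date_and_error(c_hat=0.60, N=100)
print(f"separation ~ {t:.2f} +/- {s:.2f} millennia")
# Even with every assumption satisfied, a 100-concept list leaves an
# irreducible uncertainty of roughly three centuries.
```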
[29] Solving the Unsolvable: Translating Case Law in Hong Kong
King-kui Sin, Xi Xuan, Chunyu Kit, Clara Ho-yan Chan, Honic Ho-kin Ip
Main category: cs.CL
TL;DR: This paper analyzes challenges in translating case law for Hong Kong’s bilingual legal system, critiques current uncoordinated efforts, and proposes a multi-agent AI translation platform to improve efficiency and quality.
Details
Motivation: To address the gap in Hong Kong's bilingual legal system where statutes were successfully translated but case law translation remains inadequate, impacting legal transparency and public trust.
Method: Proposes a human-machine interactive translation platform that evolves from neural models to large language models, and from single-agent to multi-agent system with Translator, Annotator, and Proofreader agents supported by continuous feedback mechanisms.
Result: The paper critiques current sporadic translation efforts and presents a technological solution that could enable more efficient and high-quality translation of judicial judgments.
Conclusion: A sustainable strategy using advanced AI technology is needed to properly translate case law, which is essential for maintaining Hong Kong’s bilingual legal system and public trust in legal institutions.
Abstract: This paper addresses the challenges of translating case law under Hong Kong’s bilingual legal system. It highlights the initial success of translating all written statutes into Chinese before the 1997 handover, a task mandated by the Basic Law. The effort involved significant collaboration among legal, linguistic, and translation experts, resulting in a comprehensive and culturally appropriate bilingual legal system. However, translating case law remains a significant challenge due to the sheer volume and continuous growth of judicial decisions. The paper critiques the government’s and judiciary’s sporadic and uncoordinated efforts to translate case law, contrasting them with the thorough approach previously taken for statute translation. Although the government acknowledges the importance of legal bilingualism, it lacks a sustainable strategy for translating case law. The Judiciary’s position that translating all judgments is “unnecessary, unrealistic, and not cost-effective” is analyzed and critiqued for its impact on legal transparency and public trust. A proposed solution involves leveraging machine translation technology through a human-machine interactive translation platform, which undergoes two major transitions. Initially based on a neural model, the platform transitions to using a large language model for improved translation accuracy. Furthermore, it evolves from a single-agent system to a multi-agent system, incorporating Translator, Annotator, and Proofreader agents. This multi-agent approach, supported by a grant, aims to facilitate efficient, high-quality translation of judicial judgments by integrating advanced artificial intelligence and continuous feedback mechanisms, thus better meeting the needs of a bilingual legal system.
[30] You Don’t Need Prompt Engineering Anymore: The Prompting Inversion
Imran Khan
Main category: cs.CL
TL;DR: Sculpting is a constrained prompting method that improves reasoning over standard CoT on mid-tier models but becomes detrimental on advanced models due to hyper-literalism.
Details
Motivation: To address limitations of standard Chain-of-Thought prompting by reducing errors from semantic ambiguity and flawed common sense through more structured prompting.
Method: Introduced “Sculpting”, a constrained, rule-based prompting method. Evaluated three strategies (Zero Shot, standard CoT, Sculpting) across three OpenAI models (gpt-4o-mini, gpt-4o, gpt-5) using GSM8K benchmark.
Result: Found “Prompting Inversion”: Sculpting improved performance on gpt-4o (97% vs 93% for CoT) but harmed performance on gpt-5 (94.00% vs 96.36% for CoT). Identified “Guardrail-to-Handcuff” transition where constraints become counterproductive.
Conclusion: Optimal prompting strategies must co-evolve with model capabilities, with simpler prompts working better for more capable models.
Abstract: Prompt engineering, particularly Chain-of-Thought (CoT) prompting, significantly enhances LLM reasoning capabilities. We introduce “Sculpting,” a constrained, rule-based prompting method designed to improve upon standard CoT by reducing errors from semantic ambiguity and flawed common sense. We evaluate three prompting strategies (Zero Shot, standard CoT, and Sculpting) across three OpenAI model generations (gpt-4o-mini, gpt-4o, gpt-5) using the GSM8K mathematical reasoning benchmark (1,317 problems). Our findings reveal a “Prompting Inversion”: Sculpting provides advantages on gpt-4o (97% vs. 93% for standard CoT), but becomes detrimental on gpt-5 (94.00% vs. 96.36% for CoT on full benchmark). We trace this to a “Guardrail-to-Handcuff” transition where constraints preventing common-sense errors in mid-tier models induce hyper-literalism in advanced models. Our detailed error analysis demonstrates that optimal prompting strategies must co-evolve with model capabilities, suggesting simpler prompts for more capable models.
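For readers unfamiliar with the three conditions, here are illustrative prompt skeletons. The Sculpting rules shown are invented stand-ins for the flavor of a constrained, rule-based prompt; the paper's actual rules are not reproduced here.

```python
# Illustrative prompt templates for the three compared strategies. The
# Sculpting rules are hypothetical examples, not the paper's exact ones.
PROMPTS = {
    "zero_shot": "{question}\nAnswer:",
    "cot": "{question}\nLet's think step by step.",
    "sculpting": (
        "{question}\n"
        "Follow these rules exactly:\n"
        "1. Restate every quantity with its unit before using it.\n"
        "2. Perform one arithmetic operation per line.\n"
        "3. Do not infer unstated facts.\n"
        "4. End with 'Final answer: <number>'."
    ),
}

def build_prompt(strategy: str, question: str) -> str:
    return PROMPTS[strategy].format(question=question)
```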
[31] SteerX: Disentangled Steering for LLM Personalization
Xiaoyan Zhao, Ming Yan, Yilun Qiu, Haoting Ni, Yang Zhang, Fuli Feng, Hong Cheng, Tat-Seng Chua
Main category: cs.CL
TL;DR: SteerX is a disentangled steering method that isolates preference-driven components from preference-agnostic ones in LLM personalization, using causal inference to identify preference-driven tokens and enhance activation steering quality.
Details
Motivation: Existing activation steering methods use all historical data to compute steering vectors, ignoring that not all content reflects true user preferences, which undermines personalization effectiveness.
Method: SteerX uses causal inference theory to estimate token-level causal effects, identifies preference-driven tokens, transforms these discrete signals into coherent descriptions, and leverages them to steer personalized LLM generation.
Result: Experiments on two representative steering backbone methods across real-world datasets show that SteerX consistently enhances steering vector quality and improves personalization.
Conclusion: SteerX offers a practical solution for more effective LLM personalization by focusing on truly preference-driven information to produce more accurate activation steering vectors.
Abstract: Large language models (LLMs) have shown remarkable success in recent years, enabling a wide range of applications, including intelligent assistants that support users’ daily life and work. A critical factor in building such assistants is personalizing LLMs, as user preferences and needs vary widely. Activation steering, which directly leverages directions representing user preference in the LLM activation space to adjust its behavior, offers a cost-effective way to align the model’s outputs with individual users. However, existing methods rely on all historical data to compute the steering vector, ignoring that not all content reflects true user preferences, which undermines the personalization signal. To address this, we propose SteerX, a disentangled steering method that isolates preference-driven components from preference-agnostic components. Grounded in causal inference theory, SteerX estimates token-level causal effects to identify preference-driven tokens, transforms these discrete signals into a coherent description, and then leverages them to steer personalized LLM generation. By focusing on the truly preference-driven information, SteerX produces more accurate activation steering vectors and enhances personalization. Experiments on two representative steering backbone methods across real-world datasets demonstrate that SteerX consistently enhances steering vector quality, offering a practical solution for more effective LLM personalization.
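For context, the generic activation-steering mechanics that SteerX builds on look roughly like the sketch below: derive a direction from mean hidden states of two contrasting example sets, then add it to a layer's output at inference. SteerX's distinctive step, selecting preference-driven tokens via causal effects, is not reproduced here.

```python
# Minimal activation-steering sketch in PyTorch (generic mechanics, not
# SteerX's causal token selection). h_pref / h_agnostic would be pooled
# hidden states from preference-driven vs. preference-agnostic examples.
import torch

def steering_vector(h_pref: torch.Tensor, h_agnostic: torch.Tensor) -> torch.Tensor:
    """h_*: (num_examples, hidden_dim). Returns a (hidden_dim,) direction."""
    return h_pref.mean(dim=0) - h_agnostic.mean(dim=0)

def add_steering_hook(layer: torch.nn.Module, v: torch.Tensor, alpha: float = 4.0):
    """Adds alpha * v to the layer's output on every forward pass."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)  # call .remove() to undo
```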
[32] Assessing the Potential of Generative Agents in Crowdsourced Fact-Checking
Luigia Costabile, Gian Marco Orlando, Valerio La Gatta, Vincenzo Moscato
Main category: cs.CL
TL;DR: LLM-powered generative agents outperform human crowds in fact-checking tasks with higher truthfulness classification, better consistency, and reduced bias.
Details
Motivation: Address the need for scalable fact-checking solutions by exploring whether LLM-powered generative agents can effectively contribute to crowdsourced fact-checking traditionally done by humans.
Method: Simulate crowds of generative agents with diverse demographic and ideological profiles using La Barbera et al. (2024) protocol. Agents retrieve evidence, assess claims along multiple quality dimensions, and issue veracity judgments.
Result: Agent crowds outperform human crowds in truthfulness classification, exhibit higher internal consistency, show reduced susceptibility to social and cognitive biases, and rely more systematically on informative criteria like Accuracy, Precision, and Informativeness.
Conclusion: Generative agents have significant potential as scalable, consistent, and less biased contributors to crowd-based fact-checking systems.
Abstract: The growing spread of online misinformation has created an urgent need for scalable, reliable fact-checking solutions. Crowdsourced fact-checking - where non-experts evaluate claim veracity - offers a cost-effective alternative to expert verification, despite concerns about variability in quality and bias. Encouraged by promising results in certain contexts, major platforms such as X (formerly Twitter), Facebook, and Instagram have begun shifting from centralized moderation to decentralized, crowd-based approaches. In parallel, advances in Large Language Models (LLMs) have shown strong performance across core fact-checking tasks, including claim detection and evidence evaluation. However, their potential role in crowdsourced workflows remains unexplored. This paper investigates whether LLM-powered generative agents - autonomous entities that emulate human behavior and decision-making - can meaningfully contribute to fact-checking tasks traditionally reserved for human crowds. Using the protocol of La Barbera et al. (2024), we simulate crowds of generative agents with diverse demographic and ideological profiles. Agents retrieve evidence, assess claims along multiple quality dimensions, and issue final veracity judgments. Our results show that agent crowds outperform human crowds in truthfulness classification, exhibit higher internal consistency, and show reduced susceptibility to social and cognitive biases. Compared to humans, agents rely more systematically on informative criteria such as Accuracy, Precision, and Informativeness, suggesting a more structured decision-making process. Overall, our findings highlight the potential of generative agents as scalable, consistent, and less biased contributors to crowd-based fact-checking systems.
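A rough sketch of the simulated-crowd shape described above: agents with sampled profiles each rate a claim and issue a verdict, which the crowd aggregates. The profile fields and prompt wording are hypothetical placeholders; the actual protocol follows La Barbera et al. (2024).

```python
# Simulated crowd of persona-conditioned agents (shape only; profiles and
# prompt are hypothetical, not the La Barbera et al. protocol verbatim).
import random
from collections import Counter

PROFILES = [{"age": a, "politics": p}
            for a in ("18-34", "35-54", "55+")
            for p in ("left", "center", "right")]

def agent_verdict(llm, claim: str, evidence: str, profile: dict) -> str:
    prompt = (f"You are a {profile['age']}-year-old, politically "
              f"{profile['politics']} crowd worker. Assess the claim for "
              f"accuracy, precision, and informativeness given the evidence, "
              f"then answer TRUE or FALSE.\nClaim: {claim}\nEvidence: {evidence}")
    return "TRUE" if "TRUE" in llm(prompt) else "FALSE"

def crowd_verdict(llm, claim: str, evidence: str, n_agents: int = 9) -> str:
    votes = Counter(agent_verdict(llm, claim, evidence, random.choice(PROFILES))
                    for _ in range(n_agents))
    return votes.most_common(1)[0][0]
```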
[33] PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding
Iliass Ayaou, Denis Cavallucci
Main category: cs.CL
TL;DR: PatenTEB is a comprehensive patent text embedding benchmark with 15 tasks and 2.06M examples, addressing patent-specific challenges like asymmetric retrieval. The patembed model family achieves state-of-the-art performance through multi-task training.
Details
Motivation: Existing benchmarks inadequately capture patent-specific challenges such as asymmetric fragment-to-document matching, domain-specific hard negatives, and patent retrieval scenarios.
Method: Developed PatenTEB benchmark with domain-stratified splits, hard negative mining, and asymmetric matching scenarios. Created patembed model family (67M-344M parameters) through multi-task training with context lengths up to 4096 tokens.
Result: patembed-base achieves SOTA on MTEB BigPatentClustering.v2 (0.494 V-measure vs 0.445 previous best), patembed-large achieves 0.377 NDCG@100 on DAPFAM. Multi-task training improves external generalization despite minor benchmark costs.
Conclusion: Domain-pretrained initialization provides consistent advantages across task families. All resources will be publicly available to advance patent text embedding research.
Abstract: Patent text embeddings enable prior art search, technology landscaping, and patent analysis, yet existing benchmarks inadequately capture patent-specific challenges. We introduce PatenTEB, a comprehensive benchmark comprising 15 tasks across retrieval, classification, paraphrase, and clustering, with 2.06 million examples. PatenTEB employs domain-stratified splits, domain-specific hard negative mining, and systematic coverage of asymmetric fragment-to-document matching scenarios absent from general embedding benchmarks. We develop the patembed model family through multi-task training, spanning 67M to 344M parameters with context lengths up to 4096 tokens. External validation shows strong generalization: patembed-base achieves state-of-the-art on MTEB BigPatentClustering.v2 (0.494 V-measure vs. 0.445 previous best), while patembed-large achieves 0.377 NDCG@100 on DAPFAM. Systematic ablations reveal that multi-task training improves external generalization despite minor benchmark costs, and that domain-pretrained initialization provides consistent advantages across task families. All resources will be made available at https://github.com/iliass-y/patenteb. Keywords: patent retrieval, sentence embeddings, multi-task learning, asymmetric retrieval, benchmark evaluation, contrastive learning.
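Embedding models of this kind are typically trained with a contrastive objective over in-batch negatives plus mined hard negatives. The sketch below shows that standard form (InfoNCE); the exact loss and temperature used for patembed are assumptions here.

```python
# Standard InfoNCE with in-batch negatives plus one mined hard negative per
# query; a common recipe for embedding training, assumed rather than taken
# from the paper. q, pos, hard_neg: (batch, dim) embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, hard_neg, tau: float = 0.05):
    q, pos, hard_neg = (F.normalize(x, dim=-1) for x in (q, pos, hard_neg))
    in_batch = q @ pos.T / tau                         # (B, B): diagonal = positives
    hard = (q * hard_neg).sum(-1, keepdim=True) / tau  # (B, 1): mined negatives
    logits = torch.cat([in_batch, hard], dim=1)
    labels = torch.arange(q.size(0), device=q.device)  # match each q to its pos
    return F.cross_entropy(logits, labels)
```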
[34] From Slides to Chatbots: Enhancing Large Language Models with University Course Materials
Tu Anh Dinh, Philipp Nicolas Schumacher, Jan Niehues
Main category: cs.CL
TL;DR: Incorporating university course materials through Retrieval-Augmented Generation (RAG) with multi-modal slide presentation significantly improves LLM performance for computer science education questions.
Details
Motivation: LLMs struggle to answer university-level computer science questions accurately, and course materials like lecture slides and transcripts differ substantially from typical training data.
Method: Compared RAG and Continual Pre-Training strategies, with multi-modal RAG approach that presents retrieved slide content as images rather than text.
Result: RAG was more effective and efficient than CPT for small course materials, and multi-modal slide presentation significantly outperformed text-only retrieval.
Conclusion: Multi-modal RAG with course materials provides practical strategy for developing better educational AI assistants, inspiring similar approaches in other educational contexts.
Abstract: Large Language Models (LLMs) have advanced rapidly in recent years. One application of LLMs is to support student learning in educational settings. However, prior work has shown that LLMs still struggle to answer questions accurately within university-level computer science courses. In this work, we investigate how incorporating university course materials can enhance LLM performance in this setting. A key challenge lies in leveraging diverse course materials such as lecture slides and transcripts, which differ substantially from typical textual corpora: slides also contain visual elements like images and formulas, while transcripts contain spoken, less structured language. We compare two strategies, Retrieval-Augmented Generation (RAG) and Continual Pre-Training (CPT), to extend LLMs with course-specific knowledge. For lecture slides, we further explore a multi-modal RAG approach, where we present the retrieved content to the generator in image form. Our experiments reveal that, given the relatively small size of university course materials, RAG is more effective and efficient than CPT. Moreover, incorporating slides as images in the multi-modal setting significantly improves performance over text-only retrieval. These findings highlight practical strategies for developing AI assistants that better support learning and teaching, and we hope they inspire similar efforts in other educational contexts.
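The pipeline shape of the multi-modal variant is easy to picture: retrieve slide pages by text similarity, but hand the generator the page images so figures and formulas survive. In this sketch, `embed`, `mllm_answer`, and the slide index are all hypothetical placeholders for whatever stack is used.

```python
# Shape of the multi-modal RAG variant described; every callable here is a
# hypothetical placeholder, not an API from the paper.
import numpy as np

def retrieve(query: str, slide_texts: list[str], slide_images: list[bytes],
             embed, k: int = 3) -> list[bytes]:
    """Rank slides by text similarity, return the top-k rendered pages."""
    q = embed(query)
    sims = np.array([q @ embed(t) for t in slide_texts])
    top = sims.argsort()[::-1][:k]
    return [slide_images[i] for i in top]

def answer(query: str, slide_texts, slide_images, embed, mllm_answer) -> str:
    pages = retrieve(query, slide_texts, slide_images, embed)
    # The generator sees rendered slides, keeping formulas and figures intact.
    return mllm_answer(prompt=f"Answer using the attached slides:\n{query}",
                       images=pages)
```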
[35] Supervised Fine-Tuning or In-Context Learning? Evaluating LLMs for Clinical NER
Andrei Baroian
Main category: cs.CL
TL;DR: Comparison of BERT-style encoders, GPT-4o with few-shot learning, and GPT-4o with supervised fine-tuning for clinical NER on CADEC corpus, showing SFT achieves best performance.
Details
Motivation: To evaluate different approaches for clinical Named Entity Recognition and understand the performance limits of various model families in this domain.
Method: Three approaches: (i) BERT-style encoders (BERT Base, BioClinicalBERT, RoBERTa-large), (ii) GPT-4o with few-shot in-context learning using simple vs. complex prompts, (iii) GPT-4o with supervised fine-tuning, evaluated on CADEC corpus with five entity types.
Result: RoBERTa-large and BioClinicalBERT showed limited improvements over BERT Base. Simple ICL outperformed complex prompts. SFT achieved strongest performance with F1 ≈ 87.1%, though with higher cost. LLMs performed better on simplified binary classification tasks.
Conclusion: Supervised fine-tuning of GPT-4o provides the best clinical NER performance, while simpler approaches show diminishing returns for specialized BERT variants. LLMs excel at simplified classification tasks.
Abstract: We study clinical Named Entity Recognition (NER) on the CADEC corpus and compare three families of approaches: (i) BERT-style encoders (BERT Base, BioClinicalBERT, RoBERTa-large), (ii) GPT-4o used with few-shot in-context learning (ICL) under simple vs. complex prompts, and (iii) GPT-4o with supervised fine-tuning (SFT). All models are evaluated on standard NER metrics over CADEC’s five entity types (ADR, Drug, Disease, Symptom, Finding). RoBERTa-large and BioClinicalBERT offer limited improvements over BERT Base, showing the limits of this family of models. Among LLM settings, simple ICL outperforms a longer, instruction-heavy prompt, and SFT achieves the strongest overall performance (F1 ≈ 87.1%), albeit with higher cost. We also find that the LLM achieves higher accuracy on a simplified task that restricts classification to two labels.
[36] Memory-based Language Models: An Efficient, Explainable, and Eco-friendly Approach to Large Language Modeling
Antal van den Bosch, Ainhoa Risco Patón, Teun Buijse, Peter Berck, Maarten van Gompel
Main category: cs.CL
TL;DR: Memory-based language modeling is presented as an efficient, eco-friendly alternative to deep neural networks, offering log-linear scalability and strong memorization with low ecological footprint.
Details
Motivation: To provide a more efficient and environmentally friendly alternative to deep neural network-based language models that reduces computational costs and ecological impact.
Method: Implemented memory-based language modeling using fast approximations of k-nearest neighbor classification, relying fully on CPUs for both training and inference to achieve low token latencies.
Result: The OLIFANT implementation showed competitive next-token prediction accuracy compared to GPT-2 and GPT-Neo, with lower estimated emissions and faster speeds due to its CPU-based architecture.
Conclusion: Memory-based language modeling offers a viable, transparent, and eco-friendly alternative to traditional deep neural network approaches for language modeling tasks.
Abstract: We present memory-based language modeling as an efficient, eco-friendly alternative to deep neural network-based language modeling. It offers log-linearly scalable next-token prediction performance and strong memorization capabilities. Implementing fast approximations of k-nearest neighbor classification, memory-based language modeling leaves a relatively small ecological footprint both in training and in inference mode, as it relies fully on CPUs and attains low token latencies. Its internal workings are simple and fully transparent. We compare our implementation of memory-based language modeling, OLIFANT, with GPT-2 and GPT-Neo on next-token prediction accuracy, estimated emissions and speeds, and offer some deeper analyses of the model.
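The core idea fits in a few lines: next-token prediction as k-nearest-neighbour classification over fixed-width contexts, CPU only. OLIFANT uses fast approximate k-NN; the toy version below is exact and is only meant to show the mechanics.

```python
# Toy memory-based LM: next-token prediction as k-NN classification over
# fixed-width token contexts. Exact (brute-force) k-NN for clarity; the
# described system uses fast approximations instead.
from sklearn.neighbors import KNeighborsClassifier

def fit_knn_lm(token_ids: list[int], context: int = 4, k: int = 5):
    # Each training instance: `context` preceding tokens -> next token.
    X = [token_ids[i:i + context] for i in range(len(token_ids) - context)]
    y = [token_ids[i + context] for i in range(len(token_ids) - context)]
    return KNeighborsClassifier(n_neighbors=k, metric="hamming").fit(X, y)

# Usage: model.predict([last_four_token_ids]) -> predicted next token id.
```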
[37] Multilingual Target-Stance Extraction
Ethan Mines, Bonnie Dorr
Main category: cs.CL
TL;DR: This paper introduces the first multilingual Target-Stance Extraction (TSE) benchmark spanning six languages and proposes a pipeline model that achieves modest performance, highlighting the challenges of multilingual TSE.
Details
Motivation: Prior TSE work was English-only, creating a gap for multilingual analysis of public opinion on social media. The authors aim to extend TSE to multiple languages without requiring separate models for each language.
Method: The authors created a multilingual TSE benchmark spanning Catalan, Estonian, French, Italian, Mandarin, and Spanish. They extended the original TSE pipeline to handle multiple languages using a single model approach.
Result: The model pipeline achieved an F1 score of 12.78, demonstrating the increased difficulty of multilingual TSE compared to English-only setups. Target prediction was identified as the primary bottleneck, and the study revealed F1 score sensitivity to different target verbalizations.
Conclusion: This work establishes the first baseline for multilingual TSE, providing resources, algorithms, and evaluation criteria for future research in this emerging field.
Abstract: Social media enables data-driven analysis of public opinion on contested issues. Target-Stance Extraction (TSE) is the task of identifying the target discussed in a document and the document’s stance towards that target. Many works classify stance towards a given target in a multilingual setting, but all prior work in TSE is English-only. This work introduces the first multilingual TSE benchmark, spanning Catalan, Estonian, French, Italian, Mandarin, and Spanish corpora. It manages to extend the original TSE pipeline to a multilingual setting without requiring separate models for each language. Our model pipeline achieves a modest F1 score of 12.78, underscoring the increased difficulty of the multilingual task relative to English-only setups and highlighting target prediction as the primary bottleneck. We are also the first to demonstrate the sensitivity of TSE’s F1 score to different target verbalizations. Together these serve as a much-needed baseline for resources, algorithms, and evaluation criteria in multilingual TSE.
[38] FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation
Mohammad Aghajani Asl, Majid Asgari-Bidhendi, Behrooz Minaei-Bidgoli
Main category: cs.CL
TL;DR: FAIR-RAG is a novel agentic framework that transforms standard RAG into a dynamic, evidence-driven reasoning process with iterative refinement and explicit gap analysis to handle complex multi-hop queries.
Details
Motivation: Existing RAG frameworks often fail on complex multi-hop queries that require synthesizing information from disparate sources, lacking robust mechanisms to systematically identify and fill evidence gaps.
Method: FAIR-RAG introduces an Iterative Refinement Cycle with Structured Evidence Assessment (SEA) that deconstructs queries into required findings, audits evidence for gaps, and uses Adaptive Query Refinement to generate targeted sub-queries for missing information.
Result: FAIR-RAG achieves state-of-the-art performance on multi-hop QA benchmarks, with an F1-score of 0.453 on HotpotQA - an 8.3 point improvement over the strongest iterative baseline.
Conclusion: A structured, evidence-driven refinement process with explicit gap analysis is crucial for reliable and accurate reasoning in advanced RAG systems for complex knowledge-intensive tasks.
Abstract: While Retrieval-Augmented Generation (RAG) mitigates hallucination and knowledge staleness in Large Language Models (LLMs), existing frameworks often falter on complex, multi-hop queries that require synthesizing information from disparate sources. Current advanced RAG methods, employing iterative or adaptive strategies, lack a robust mechanism to systematically identify and fill evidence gaps, often propagating noise or failing to gather a comprehensive context. We introduce FAIR-RAG, a novel agentic framework that transforms the standard RAG pipeline into a dynamic, evidence-driven reasoning process. At its core is an Iterative Refinement Cycle governed by a module we term Structured Evidence Assessment (SEA). The SEA acts as an analytical gating mechanism: it deconstructs the initial query into a checklist of required findings and audits the aggregated evidence to identify confirmed facts and, critically, explicit informational gaps. These gaps provide a precise signal to an Adaptive Query Refinement agent, which generates new, targeted sub-queries to retrieve missing information. This cycle repeats until the evidence is verified as sufficient, ensuring a comprehensive context for a final, strictly faithful generation. We conducted experiments on challenging multi-hop QA benchmarks, including HotpotQA, 2WikiMultiHopQA, and MusiQue. In a unified experimental setup, FAIR-RAG significantly outperforms strong baselines. On HotpotQA, it achieves an F1-score of 0.453 – an absolute improvement of 8.3 points over the strongest iterative baseline – establishing a new state-of-the-art for this class of methods on these benchmarks. Our work demonstrates that a structured, evidence-driven refinement process with explicit gap analysis is crucial for unlocking reliable and accurate reasoning in advanced RAG systems for complex, knowledge-intensive tasks.
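The control flow of the Iterative Refinement Cycle can be sketched as a loop; `llm` and `search` are placeholders, and the prompts are paraphrases of the described SEA and refinement roles, not the paper's actual prompts.

```python
# Control-flow sketch of FAIR-RAG's refinement cycle. All prompts and the
# stopping rule are paraphrased approximations of the described behavior.
def fair_rag(question: str, llm, search, max_rounds: int = 4) -> str:
    evidence, queries = [], [question]
    for _ in range(max_rounds):
        for q in queries:
            evidence.extend(search(q))
        # Structured Evidence Assessment: checklist of required findings,
        # audited against gathered evidence to expose explicit gaps.
        sea = llm(f"Question: {question}\nEvidence: {evidence}\n"
                  "List the required findings; mark each CONFIRMED or MISSING.")
        if "MISSING" not in sea:
            break
        # Adaptive Query Refinement: one targeted sub-query per gap.
        queries = llm("Write one search query per MISSING finding:\n"
                      f"{sea}").splitlines()
    return llm("Answer strictly from this evidence.\n"
               f"Question: {question}\nEvidence: {evidence}")
```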
[39] Irony Detection in Urdu Text: A Comparative Study Using Machine Learning Models and Large Language Models
Fiaz Ahmad, Nisar Hussain, Amna Qasim, Momina Hafeez, Muhammad Usman, Grigori Sidorov, Alexander Gelbukh
Main category: cs.CL
TL;DR: This paper presents a method for detecting irony in Urdu by translating an English ironic corpus and evaluating various ML and transformer models, with LLaMA 3 achieving the best performance.
Details
Motivation: Ironic identification is challenging in NLP, especially for languages with different syntax and cultural contexts like Urdu, which is historically low-resource.
Method: Translate English ironic corpus to Urdu, evaluate 10 ML algorithms with GloVe and Word2Vec embeddings, and fine-tune transformer models (BERT, RoBERTa, LLaMA 2, LLaMA 3, Mistral).
Result: Gradient Boosting achieved best ML performance (F1-score: 89.18%), LLaMA 3 (8B) achieved best transformer performance (F1-score: 94.61%).
Conclusion: Combining transliteration techniques with modern NLP models enables robust irony detection in Urdu, demonstrating effectiveness for low-resource languages.
Abstract: Ironic identification is a challenging task in Natural Language Processing, particularly when dealing with languages that differ in syntax and cultural context. In this work, we aim to detect irony in Urdu by translating an English Ironic Corpus into the Urdu language. We evaluate ten state-of-the-art machine learning algorithms using GloVe and Word2Vec embeddings, and compare their performance with classical methods. Additionally, we fine-tune advanced transformer-based models, including BERT, RoBERTa, LLaMA 2 (7B), LLaMA 3 (8B), and Mistral, to assess the effectiveness of large-scale models in irony detection. Among machine learning models, Gradient Boosting achieved the best performance with an F1-score of 89.18%. Among transformer-based models, LLaMA 3 (8B) achieved the highest performance with an F1-score of 94.61%. These results demonstrate that combining transliteration techniques with modern NLP models enables robust irony detection in Urdu, a historically low-resource language.
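The strongest classical setup reported (Gradient Boosting over word-vector features) is simple to reproduce in outline. Averaging Word2Vec vectors into document features is a common choice and an assumption here; the paper's exact featurization may differ.

```python
# Baseline pipeline: Word2Vec document vectors -> Gradient Boosting.
# Averaged word vectors are an assumed featurization, not the paper's spec.
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import GradientBoostingClassifier

def doc_vectors(tokenized_docs: list[list[str]], w2v: Word2Vec) -> np.ndarray:
    dim = w2v.vector_size
    return np.array([
        np.mean([w2v.wv[t] for t in doc if t in w2v.wv] or [np.zeros(dim)],
                axis=0)
        for doc in tokenized_docs
    ])

def train(tokenized_docs: list[list[str]], labels: list[int]):
    w2v = Word2Vec(tokenized_docs, vector_size=100, min_count=1)
    clf = GradientBoostingClassifier()
    clf.fit(doc_vectors(tokenized_docs, w2v), labels)
    return w2v, clf
```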
[40] GigaEmbeddings: Efficient Russian Language Embedding Model
Egor Kolodin, Daria Khomich, Nikita Savushkin, Anastasia Ianina, Fyodor Minkin
Main category: cs.CL
TL;DR: GigaEmbeddings is a framework for training Russian text embeddings using hierarchical instruction tuning of GigaChat-3B, achieving state-of-the-art results on Russian language tasks.
Details
Motivation: To address limitations of existing methods by creating high-performance Russian-focused text embeddings that unify diverse objectives and leverage synthetic data generation.
Method: Three-stage pipeline: large-scale contrastive pre-training, fine-tuning with hard negatives, and multitask generalization across retrieval, classification, and clustering. Uses bidirectional attention, latent attention pooling, and strategic pruning of 25% transformer layers.
Result: Achieved state-of-the-art results on ruMTEB benchmark with 69.1 average score across 23 multilingual tasks, outperforming larger baselines.
Conclusion: The framework successfully creates efficient and high-performing Russian text embeddings through hierarchical instruction tuning and architectural innovations.
Abstract: We introduce GigaEmbeddings, a novel framework for training high-performance Russian-focused text embeddings through hierarchical instruction tuning of the decoder-only LLM designed specifically for Russian language (GigaChat-3B). Our three-stage pipeline, comprising large-scale contrastive pre-training in web-scale corpora, fine-tuning with hard negatives, and multitask generalization across retrieval, classification, and clustering tasks, addresses key limitations of existing methods by unifying diverse objectives and leveraging synthetic data generation. Architectural innovations include bidirectional attention for contextual modeling, latent attention pooling for robust sequence aggregation, and strategic pruning of 25% of transformer layers to enhance efficiency without compromising performance. Evaluated on the ruMTEB benchmark spanning 23 multilingual tasks, GigaEmbeddings achieves state-of-the-art results (69.1 avg. score), outperforming strong baselines with a larger number of parameters.
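Latent attention pooling, one of the architectural pieces named above, replaces mean or last-token pooling with a small set of learned queries attending over the token states. A minimal sketch follows; the hidden size, latent count, and head count are assumptions, not GigaEmbeddings' actual configuration.

```python
# Latent attention pooling sketch: learned latent queries attend over token
# states and are averaged into one embedding. Dimensions are assumed.
import torch
import torch.nn as nn

class LatentAttentionPool(nn.Module):
    def __init__(self, hidden: int = 2048, n_latents: int = 32, heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, hidden) * 0.02)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        """token_states: (batch, seq, hidden) -> (batch, hidden) embedding."""
        q = self.latents.unsqueeze(0).expand(token_states.size(0), -1, -1)
        pooled, _ = self.attn(q, token_states, token_states)
        return pooled.mean(dim=1)
```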
[41] VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
Yupeng Xie, Zhiyang Zhang, Yifan Wu, Sirong Lu, Jiayi Zhang, Zhaoyang Yu, Jinlin Wang, Sirui Hong, Bang Liu, Chenglin Wu, Yuyu Luo
Main category: cs.CL
TL;DR: VisJudge-Bench is the first comprehensive benchmark for evaluating MLLMs’ capabilities in assessing visualization aesthetics and quality, revealing significant gaps between current MLLMs and human experts, and proposing VisJudge model to bridge this gap.
Details
Motivation: Evaluating visualization quality is challenging as it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics, but no systematic benchmark exists for measuring MLLMs' capabilities in this domain.
Method: Created VisJudge-Bench with 3,090 expert-annotated samples from real-world scenarios covering single visualizations, multiple visualizations, and dashboards across 32 chart types, and proposed VisJudge model specifically designed for visualization aesthetics assessment.
Result: Advanced MLLMs (like GPT-5) show significant gaps compared to human experts with MAE of 0.551 and correlation of 0.429. VisJudge reduces MAE to 0.442 (19.8% reduction) and increases consistency to 0.681 (58.7% improvement).
Conclusion: VisJudge-Bench provides a comprehensive benchmark for visualization quality assessment, and VisJudge model significantly narrows the gap between AI and human judgment in evaluating visualization aesthetics and quality.
Abstract: Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs’ performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Systematic testing on this benchmark reveals that even the most advanced MLLMs (such as GPT-5) still exhibit significant gaps compared to human experts in judgment, with a Mean Absolute Error (MAE) of 0.551 and a correlation with human ratings of only 0.429. To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment. Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.442 (a 19.8% reduction) and increasing the consistency with human experts to 0.681 (a 58.7% improvement) compared to GPT-5. The benchmark is available at https://github.com/HKUSTDial/VisJudgeBench.
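The two headline numbers (MAE and correlation with human ratings) are straightforward to compute for any judge model. Whether the paper uses Pearson or Spearman correlation is not stated in this summary; Spearman is shown here as one reasonable choice.

```python
# Judge evaluation against expert ratings: mean absolute error plus rank
# correlation. The correlation variant (Spearman) is an assumption.
import numpy as np
from scipy.stats import spearmanr

def judge_metrics(model_scores, human_scores) -> dict:
    model = np.asarray(model_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    mae = np.abs(model - human).mean()
    rho, _ = spearmanr(model, human)
    return {"MAE": mae, "correlation": rho}
```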
[42] Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection
Federica Gamba, Aman Sinha, Timothee Mickus, Raul Vazquez, Patanjali Bhamidipati, Claudio Savelli, Ahana Chattopadhyay, Laura A. Zanella, Yash Kankanampati, Binesh Arakkal Remesh, Aryan Ashok Chandramania, Rohit Agarwal, Chuyuan Li, Ioana Buhnila, Radhika Mamidi
Main category: cs.CL
TL;DR: CAP dataset is a multilingual resource for studying hallucinations in LLMs within scientific text generation, covering 9 languages with 900 questions and 7000+ annotated LLM answers.
Details
Motivation: Hallucinations in the scientific domain distort factual knowledge due to specialized terminology, statistical reasoning, and context-dependent interpretations, exacerbated by LLMs' lack of true comprehension and limited contextual understanding.
Method: Created a dataset with 900 scientific questions and over 7000 LLM-generated answers from 16 models across 5 high-resource and 4 low-resource languages, annotated with binary labels for scientific hallucinations (factuality errors) and fluency issues.
Result: CAP dataset provides question-answer pairs with token sequences and logits, annotated for factuality errors and linguistic quality, enabling research on hallucination detection and multilingual LLM evaluation.
Conclusion: CAP is publicly released to facilitate advanced research on hallucination detection, multilingual evaluation of LLMs, and development of more reliable scientific NLP systems.
Abstract: We introduce the CAP (Confabulations from ACL Publications) dataset, a multilingual resource for studying hallucinations in large language models (LLMs) within scientific text generation. CAP focuses on the scientific domain, where hallucinations can distort factual knowledge, as they frequently do. In this domain, however, the presence of specialized terminology, statistical reasoning, and context-dependent interpretations further exacerbates these distortions, particularly given LLMs’ lack of true comprehension, limited contextual understanding, and bias toward surface-level generalization. CAP operates in a cross-lingual setting covering five high-resource languages (English, French, Hindi, Italian, and Spanish) and four low-resource languages (Bengali, Gujarati, Malayalam, and Telugu). The dataset comprises 900 curated scientific questions and over 7000 LLM-generated answers from 16 publicly available models, provided as question-answer pairs along with token sequences and corresponding logits. Each instance is annotated with a binary label indicating the presence of a scientific hallucination, denoted as a factuality error, and a fluency label, capturing issues in the linguistic quality or naturalness of the text. CAP is publicly released to facilitate advanced research on hallucination detection, multilingual evaluation of LLMs, and the development of more reliable scientific NLP systems.
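Because CAP ships token sequences with their logits, a natural first baseline is a confidence score from the answer's own token log-probabilities, a common hallucination signal in the literature rather than a method proposed by the dataset paper.

```python
# Mean token log-probability as a simple hallucination-detection baseline
# over CAP-style (logits, token_ids) pairs. Not a method from the paper.
import numpy as np
from scipy.special import log_softmax

def mean_logprob(token_logits: np.ndarray, token_ids: np.ndarray) -> float:
    """token_logits: (seq_len, vocab); token_ids: (seq_len,).
    Higher values = the model was more confident in its own answer."""
    logprobs = log_softmax(token_logits, axis=-1)
    return float(logprobs[np.arange(len(token_ids)), token_ids].mean())

# Flag answers whose score falls below a threshold tuned on validation data.
```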
[43] CHOIR: Collaborative Harmonization fOr Inference Robustness
Xiangjue Dong, Cong Wang, Maria Teleki, Millennium Bismay, James Caverlee
Main category: cs.CL
TL;DR: CHOIR is a test-time framework that leverages persona variations in LLMs as a constructive resource to improve reasoning robustness by harmonizing multiple persona-conditioned reasoning signals into unified predictions.
Details
Motivation: Minor demographic perturbations in personas can alter LLM reasoning trajectories, leading to divergent correct answers. Instead of treating these variations as biases, the authors explore their potential to improve reasoning robustness.
Method: CHOIR orchestrates a collaborative decoding process among counterfactual personas, dynamically balancing agreement and divergence in their reasoning paths to harmonize multiple persona-conditioned reasoning signals.
Result: CHOIR consistently enhances performance across demographics, model architectures, scales, and tasks without additional training, with improvements up to 26.4% for individual demographic groups and 19.2% on average across five demographics.
Conclusion: By reframing persona variation as a constructive signal, CHOIR provides a scalable and generalizable approach to more reliable LLM reasoning, remaining effective even when base personas are suboptimal.
Abstract: Persona-assigned Large Language Models (LLMs) can adopt diverse roles, enabling personalized and context-aware reasoning. However, even minor demographic perturbations in personas, such as simple pronoun changes, can alter reasoning trajectories, leading to divergent sets of correct answers. Instead of treating these variations as biases to be mitigated, we explore their potential as a constructive resource to improve reasoning robustness. We propose CHOIR (Collaborative Harmonization fOr Inference Robustness), a test-time framework that harmonizes multiple persona-conditioned reasoning signals into a unified prediction. CHOIR orchestrates a collaborative decoding process among counterfactual personas, dynamically balancing agreement and divergence in their reasoning paths. Experiments on various reasoning benchmarks demonstrate that CHOIR consistently enhances performance across demographics, model architectures, scales, and tasks - without additional training. Improvements reach up to 26.4% for individual demographic groups and 19.2% on average across five demographics. It remains effective even when base personas are suboptimal. By reframing persona variation as a constructive signal, CHOIR provides a scalable and generalizable approach to more reliable LLM reasoning.
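Reduced to its simplest form, collaborative decoding means running the same question under several persona-conditioned prompts and combining next-token distributions at each step. The sketch below replaces CHOIR's dynamic agreement/divergence weighting with a plain average and assumes a HuggingFace-style model whose forward pass returns `.logits`.

```python
# Simplest reduction of persona-ensemble decoding: average the next-token
# distributions across persona variants. CHOIR's dynamic weighting is
# replaced by a uniform mean; HF-style `model(ids).logits` is assumed.
import torch

@torch.no_grad()
def ensemble_next_token(model, persona_input_ids: list[torch.Tensor]) -> int:
    """persona_input_ids: one (1, seq_i) tensor per persona-conditioned prompt."""
    probs = [
        torch.softmax(model(ids).logits[0, -1], dim=-1)
        for ids in persona_input_ids
    ]
    return int(torch.stack(probs).mean(dim=0).argmax())
```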
[44] The Tonogenesis Continuum in Tibetan: A Computational Investigation
Siyu Liang, Zhaxi Zerong
Main category: cs.CL
TL;DR: A computational approach using ASR models shows that pitch manipulation reveals a tonogenesis continuum in Tibetan languages, with atonal dialects tolerating pitch removal most and tonal varieties showing severe degradation.
Details
Motivation: To develop a computational method that quantifies pitch's functional role in tonogenesis, moving beyond traditional comparative reconstruction and acoustic phonetics approaches.
Method: Used automatic speech recognition (ASR) models to measure sensitivity to pitch-flattening across closely related Tibetan languages at different stages of tonogenesis.
Result: Found evidence of a tonogenesis continuum: Amdo dialects (atonal) tolerate pitch removal most, U-Tsang varieties (fully tonal) show severe degradation, and Kham dialects (intermediate) fall between these extremes.
Conclusion: Computational methods can capture fine-grained stages of sound change, and traditional functional load metrics based on minimal pairs may overestimate pitch dependence in transitional systems where segmental and suprasegmental cues remain intertwined.
Abstract: Tonogenesis, the historical process by which segmental contrasts evolve into lexical tone, has traditionally been studied through comparative reconstruction and acoustic phonetics. We introduce a computational approach that quantifies the functional role of pitch at different stages of this sound change by measuring how pitch manipulation affects automatic speech recognition (ASR) performance. By analyzing sensitivity to pitch-flattening across a set of closely related Tibetan languages, we find evidence of a tonogenesis continuum: atonal Amdo dialects tolerate pitch removal the most, while fully tonal U-Tsang varieties show severe degradation, and intermediate Kham dialects fall measurably between these extremes. These gradient effects demonstrate how ASR models implicitly learn the shifting functional load of pitch as languages transition from consonant-based to tone-based lexical contrasts. Our findings show that computational methods can capture fine-grained stages of sound change and suggest that traditional functional load metrics, based solely on minimal pairs, may overestimate pitch dependence in transitional systems where segmental and suprasegmental cues remain phonetically intertwined.
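One standard way to build the pitch-flattened condition is WORLD analysis/resynthesis with the F0 contour held constant at the utterance mean; whether the paper used this exact toolchain is an assumption.

```python
# Pitch-flattening via WORLD analysis/resynthesis (pyworld). This toolchain
# is one common choice, assumed rather than confirmed by the summary above.
import numpy as np
import pyworld as pw

def flatten_pitch(wav: np.ndarray, fs: int) -> np.ndarray:
    x = wav.astype(np.float64)
    f0, t = pw.harvest(x, fs)            # F0 contour and frame times
    sp = pw.cheaptrick(x, f0, t, fs)     # spectral envelope
    ap = pw.d4c(x, f0, t, fs)            # aperiodicity
    voiced = f0 > 0
    mean_f0 = f0[voiced].mean() if voiced.any() else 0.0
    f0_flat = np.where(voiced, mean_f0, 0.0)  # constant pitch when voiced
    return pw.synthesize(f0_flat, sp, ap, fs)

# Compare ASR error rates on original vs. flattened audio: the gap measures
# how much the recognizer depends on pitch for that variety.
```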
[45] Frustratingly Easy Task-aware Pruning for Large Language Models
Yuanhe Tian, Junjie Liu, Xican Yang, Haishan Ye, Yan Song
Main category: cs.CL
TL;DR: A task-aware pruning method for LLMs that preserves domain-specific capabilities while reducing model size by incorporating both general and task-specific calibration data into importance scoring.
Details
Motivation: Existing LLM pruning methods focus on maintaining general fluency but neglect performance on specific domains and tasks, limiting their practical utility for specialized applications.
Method: Extends conventional pruning by incorporating task-specific feature distributions into importance computation. Computes separate importance scores using general and task-specific data, partitions parameters into shared/exclusive groups based on activation-norm differences, and fuses scores to guide pruning.
Result: Experiments show the approach consistently outperforms baseline methods with identical pruning ratios across various settings and benchmarks.
Conclusion: The proposed task-aware pruning framework effectively preserves specialized capabilities while compressing LLMs, and can be seamlessly integrated with various foundation pruning techniques.
Abstract: Pruning provides a practical solution to reduce the resources required to run large language models (LLMs) to benefit from their effective capabilities as well as control their cost for training and inference. Research on LLM pruning often ranks the importance of LLM parameters using their magnitudes and calibration-data activations and removes (or masks) the less important ones, accordingly reducing LLMs’ size. However, these approaches primarily focus on preserving the LLM’s ability to generate fluent sentences, while neglecting performance on specific domains and tasks. In this paper, we propose a simple yet effective pruning approach for LLMs that preserves task-specific capabilities while shrinking their parameter space. We first analyze how conventional pruning minimizes loss perturbation under general-domain calibration and extend this formulation by incorporating task-specific feature distributions into the importance computation of existing pruning algorithms. Thus, our framework computes separate importance scores using both general and task-specific calibration data, partitions parameters into shared and exclusive groups based on activation-norm differences, and then fuses their scores to guide the pruning process. This design enables our method to integrate seamlessly with various foundation pruning techniques and preserve the LLM’s specialized abilities under compression. Experiments on widely used benchmarks demonstrate that our approach is effective and consistently outperforms the baselines with identical pruning ratios and different settings.
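As an illustration of the score-fusion step, here is a minimal sketch assuming a Wanda-style importance metric (weight magnitude times input-activation norm); the paper's shared/exclusive parameter partitioning is omitted, and `alpha` is an illustrative mixing weight.

```python
import torch

def wanda_importance(weight, act_norm):
    """|W| scaled by per-input-channel activation norms; weight: (out, in)."""
    return weight.abs() * act_norm.unsqueeze(0)

def task_aware_mask(weight, general_act, task_act, sparsity=0.5, alpha=0.5):
    """Fuse general and task-specific importance, then keep the top weights."""
    score = (alpha * wanda_importance(weight, general_act)
             + (1 - alpha) * wanda_importance(weight, task_act))
    k = int(score.numel() * sparsity)
    threshold = score.flatten().kthvalue(k).values   # k-th smallest score
    return score > threshold                         # True = keep, False = prune
```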
[46] The Limits of Data Scaling: Sub-token Utilization and Acoustic Saturation in Multilingual ASR
Siyu Liang, Nicolas Ballier, Gina-Anne Levow, Richard Wright
Main category: cs.CL
TL;DR: Analysis of Whisper’s multilingual ASR model shows sub-token discovery follows consistent exponential saturation patterns across languages, with lexical diversity largely independent of pre-training data disparity. The study identifies acoustic saturation time (AST) as a convergence threshold.
Details
Motivation: To understand how much audio is needed to observe a multilingual ASR model's sub-token inventory across languages and whether data disparity in pre-training affects token utilization during inference.
Method: Analyzed Whisper’s decoding behavior across 49 languages by logging decoding candidate sub-tokens and tracking their cumulative discovery over time, studying utilization patterns of the model’s sub-token space.
Result: Total discovered tokens independent of pre-training hours; sub-token discovery follows exponential saturation; Zipf-Mandelbrot patterns observed; Latin script languages show more favorable metrics than Cyrillic, CJK, and Semitic scripts.
Conclusion: Sub-token utilization in multilingual ASR inference is constrained more by statistical, typological, and orthographic structure of speech than by training data scale, providing basis for more equitable corpus construction and cross-lingual evaluation.
Abstract: How much audio is needed to fully observe a multilingual ASR model’s learned sub-token inventory across languages, and does data disparity in multilingual pre-training affect how these tokens are utilized during inference? We address this question by analyzing Whisper’s decoding behavior during inference across 49 languages. By logging decoding candidate sub-tokens and tracking their cumulative discovery over time, we study the utilization pattern of the model’s sub-token space. Results show that the total number of discovered tokens remains largely independent of a language’s pre-training hours, indicating that data disparity does not strongly influence lexical diversity in the model’s hypothesis space. Sub-token discovery rates follow a consistent exponential saturation pattern across languages, suggesting a stable time window after which additional audio yields minimal new sub-token activation. We refer to this convergence threshold as acoustic saturation time (AST). Further analyses of rank-frequency distributions reveal Zipf-like patterns better modeled by a Zipf-Mandelbrot law, and mean sub-token length shows a positive correlation with resource level. Additionally, those metrics show more favorable patterns for languages in the Latin script than those in scripts such as Cyrillic, CJK, and Semitic. Together, our study suggests that sub-token utilization during multilingual ASR inference is constrained more by the statistical, typological, and orthographic structure of the speech than by training data scale, providing an empirical basis for more equitable corpus construction and cross-lingual evaluation.
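A small sketch of how the saturation curve and AST could be estimated from logged discovery counts, assuming the exponential form N(t) = N_max(1 - exp(-t/tau)) and taking the time to reach 99% of the fitted asymptote as the threshold; the paper's exact AST definition is not given in the summary.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturation(t, n_max, tau):
    return n_max * (1.0 - np.exp(-t / tau))

def fit_ast(minutes_of_audio, cumulative_tokens, frac=0.99):
    """Fit the saturation curve and solve for the time reaching frac * n_max."""
    (n_max, tau), _ = curve_fit(saturation, minutes_of_audio, cumulative_tokens,
                                p0=(max(cumulative_tokens), 10.0))
    ast = -tau * np.log(1.0 - frac)   # saturation(ast) = frac * n_max
    return n_max, tau, ast
```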
[47] A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using the Pacific Northwest English Corpus
Michael Scott, Siyu Liang, Alicia Wassink, Gina-Anne Levow
Main category: cs.CL
TL;DR: Systematic evaluation of racial bias in commercial ASR systems using the PNWE corpus, showing that phonetic variation, particularly vowel quality differences, causes higher error rates for African American speakers across all systems.
Details
Motivation: To investigate racial bias in commercial ASR systems and understand how sociophonetic variation contributes to differential performance across ethnic groups.
Method: Used the Pacific Northwest English (PNWE) corpus to analyze transcription accuracy across four ethnic backgrounds, introduced Phonetic Error Rate (PER) metric, and examined eleven sociophonetic features including vowel quality variation.
Result: Found that vowel quality variation, especially resistance to low-back merger and pre-nasal merger patterns, is systematically associated with higher error rates, with most pronounced effects for African American speakers across all evaluated systems.
Conclusion: Acoustic modeling of dialectal phonetic variation is a primary source of bias in commercial ASR systems, and targeted representation of sociophonetic diversity in training data is needed to improve performance.
Abstract: This paper presents a systematic evaluation of racial bias in four major commercial automatic speech recognition (ASR) systems using the Pacific Northwest English (PNWE) corpus. We analyze transcription accuracy across speakers from four ethnic backgrounds (African American, Caucasian American, ChicanX, and Yakama) and examine how sociophonetic variation contributes to differential system performance. We introduce a heuristically-determined Phonetic Error Rate (PER) metric that links recognition errors to specific linguistically motivated variables derived from sociophonetic annotation. Our analysis of eleven sociophonetic features reveals that vowel quality variation, particularly resistance to the low-back merger and pre-nasal merger patterns, is systematically associated with differential error rates across ethnic groups, with the most pronounced effects for African American speakers across all evaluated systems. These findings demonstrate that acoustic modeling of dialectal phonetic variation, rather than lexical or syntactic factors, remains a primary source of bias in commercial ASR systems. The study establishes the PNWE corpus as a valuable resource for bias evaluation in speech technologies and provides actionable guidance for improving ASR performance through targeted representation of sociophonetic diversity in training data.
[48] Text to Trust: Evaluating Fine-Tuning and LoRA Trade-offs in Language Models for Unfair Terms of Service Detection
Noshitha Padma Pratyusha Juttu, Sahithi Singireddy, Sravani Gona, Sujal Timilsina
Main category: cs.CL
TL;DR: Systematic evaluation of fine-tuning, parameter-efficient adaptation (LoRA/QLoRA), and zero-shot prompting for unfair clause detection in Terms of Service documents, showing full fine-tuning achieves best performance while LoRA provides competitive results with lower memory cost.
Details
Motivation: LLMs have transformed text understanding but their adaptation to specialized legal domains is constrained by the cost of full fine-tuning, requiring efficient methods for legal NLP applications.
Method: Fine-tuned BERT and DistilBERT, applied 4-bit LoRA to TinyLlama, LLaMA 3B/7B, and SaulLM models, and evaluated GPT-4o in zero-shot settings on CLAUDETTE-ToS benchmark and Multilingual Scraper Corpus.
Result: Full fine-tuning achieves strongest precision-recall balance, while LoRA-based models provide competitive recall with up to 3x lower memory cost.
Conclusion: The study highlights practical design trade-offs for efficient domain-adapted LLMs and contributes open baselines for fine-tuning research in legal text processing.
Abstract: Large Language Models (LLMs) have transformed text understanding, yet their adaptation to specialized legal domains remains constrained by the cost of full fine-tuning. This study provides a systematic evaluation of fine-tuning, parameter-efficient adaptation (LoRA, QLoRA), and zero-shot prompting strategies for unfair clause detection in Terms of Service (ToS) documents, a key application in legal NLP. We fine-tune BERT and DistilBERT, apply 4-bit Low-Rank Adaptation (LoRA) to models such as TinyLlama, LLaMA 3B/7B, and SaulLM, and evaluate GPT-4o and O-versions in zero-shot settings. Experiments on the CLAUDETTE-ToS benchmark and the Multilingual Scraper Corpus show that full fine-tuning achieves the strongest precision-recall balance, while LoRA-based models provide competitive recall with up to 3x lower memory cost. These findings highlight practical design trade-offs for efficient and domain-adapted LLMs, contributing open baselines for fine-tuning research in legal text processing.
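For concreteness, a hedged sketch of the 4-bit LoRA setup described, using the Hugging Face peft and bitsandbytes stack; the rank, alpha, and target modules are illustrative defaults, not the authors' reported values.

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForSequenceClassification.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",       # one of the paper's model families
    num_labels=2, quantization_config=bnb, device_map="auto")
lora = LoraConfig(task_type="SEQ_CLS", r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])   # illustrative choices
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only the low-rank adapters are trained
```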
[49] LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?
Ziyuan He, Yuxuan Wang, Jiaqi Li, Kexin Liang, Muhan Zhang
Main category: cs.CL
TL;DR: LooGLE v2 is a benchmark for evaluating LLMs’ long context understanding using real-world texts (16k-2M tokens) across law, finance, game, and code domains, revealing significant limitations despite extended context windows.
Details
Motivation: LLMs have extended context windows but their long context understanding capabilities over long dependency tasks remain limited and underexplored, especially for real-world applications that were rarely benchmarked.
Method: Automatically collected real-world long texts (16k-2M tokens) across law, finance, game, and code domains, with 10 types of domain-specific long-dependency tasks and 1,934 QA instances generated through a scalable data curation pipeline.
Result: Comprehensive assessment of 6 locally deployed and 4 API-based LLMs shows even the best-performing model achieves only 59.2% overall score, revealing that popular LLMs can only understand much shorter context than claimed.
Conclusion: Despite extensive context windows, LLMs have significant limitations in handling real-world tasks with long dependencies, highlighting substantial room for improvement in practical long-context understanding.
Abstract: Large language models (LLMs) are equipped with increasingly extended context windows recently, yet their long context understanding capabilities over long dependency tasks remain fundamentally limited and underexplored. This gap is especially significant in many real-world long-context applications that were rarely benchmarked. In this paper, we introduce LooGLE v2, a novel benchmark designed to evaluate LLMs’ long context ability in real-world applications and scenarios. Our benchmark consists of automatically collected real-world long texts, ranging from 16k to 2M tokens, encompassing domains in law, finance, game and code. Accordingly, we delicately design 10 types of domain-specific long-dependency tasks and generate 1,934 QA instances with various diversity and complexity in a scalable data curation pipeline for further practical needs. We conduct a comprehensive assessment of 6 locally deployed and 4 API-based LLMs. The evaluation results show that even the best-performing model achieves only a 59.2% overall score on our benchmark. Despite the extensive context windows, popular LLMs are only capable of understanding a much shorter length of context than they claim to be, revealing significant limitations in their ability to handle real-world tasks with long dependencies and highlighting substantial room for model improvement in practical long-context understanding.
[50] SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size
Jinhan Chen, Jianchun Liu, Hongli Xu, Xianjun Gao, Shilong Wang
Main category: cs.CL
TL;DR: SABlock is a semantic-aware KV cache eviction framework that uses adaptive block sizes to efficiently compress KV cache while maintaining semantic coherence, achieving near-full performance with significantly reduced memory usage.
Details
Motivation: The growing memory footprint of KV cache poses a scalability bottleneck for long-context LLM inference, and existing compression methods struggle to balance semantic coherence with memory efficiency.
Method: Performs semantic segmentation to align compression boundaries with linguistic structures, applies segment-guided token scoring for importance estimation, and uses budget-driven search to adaptively determine optimal block sizes for each segment.
Result: Achieves 99.9% retrieval accuracy on NIAH with only 96 KV entries (vs 8K in full cache), reduces peak memory usage by 46.28%, and achieves up to 9.5x faster decoding on 128K context length under fixed cache budget.
Conclusion: SABlock consistently outperforms state-of-the-art baselines by effectively balancing semantic integrity and compression efficiency through semantic-aware adaptive block sizing.
Abstract: The growing memory footprint of the Key-Value (KV) cache poses a severe scalability bottleneck for long-context Large Language Model (LLM) inference. While KV cache eviction has emerged as an effective solution by discarding less critical tokens, existing token-, block-, and sentence-level compression methods struggle to balance semantic coherence and memory efficiency. To this end, we introduce SABlock, a semantic-aware KV cache eviction framework with adaptive block sizes. Specifically, SABlock first performs semantic segmentation to align compression boundaries with linguistic structures, then applies segment-guided token scoring to refine token importance estimation. Finally, for each segment, a budget-driven search strategy adaptively determines the optimal block size that preserves semantic integrity while improving compression efficiency under a given cache budget. Extensive experiments on long-context benchmarks demonstrate that SABlock consistently outperforms state-of-the-art baselines under the same memory budgets. For instance, on Needle-in-a-Haystack (NIAH), SABlock achieves 99.9% retrieval accuracy with only 96 KV entries, nearly matching the performance of the full-cache baseline that retains up to 8K entries. Under a fixed cache budget of 1,024, SABlock further reduces peak memory usage by 46.28% and achieves up to 9.5x faster decoding on a 128K context length.
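A deliberately simplified sketch of the budget-driven block selection: real SABlock scores tokens with segment-guided importance and searches block sizes per segment, whereas this toy version keeps one best fixed-size block per segment under a global KV budget.

```python
import torch

def evict(scores, segments, budget, block_sizes=(8, 16, 32)):
    """scores: (seq_len,) token importance; segments: list of (start, end)."""
    keep = []
    per_segment = budget // max(len(segments), 1)
    for start, end in segments:
        seg = scores[start:end]
        best = None                                   # (mean score, kept range)
        for b in block_sizes:
            if b > min(len(seg), per_segment):
                continue
            means = seg.unfold(0, b, 1).mean(dim=1)   # sliding block means
            i = int(means.argmax())
            if best is None or means[i] > best[0]:
                best = (means[i], range(start + i, start + i + b))
        if best is not None:
            keep.extend(best[1])
    return sorted(keep)    # indices of KV entries to retain
```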
[51] A Closed-Loop Personalized Learning Agent Integrating Neural Cognitive Diagnosis, Bounded-Ability Adaptive Testing, and LLM-Driven Feedback
Zhifeng Wang, Xinyue Zheng, Chunyan Zeng
Main category: cs.CL
TL;DR: EduLoop-Agent is an end-to-end personalized learning agent that integrates neural cognitive diagnosis, adaptive testing, and LLM feedback in a closed-loop framework.
Details
Motivation: Current personalized learning methods handle modeling, item selection, and feedback in isolation, leading to coarse student models, assumption-bound adaptivity, and non-actionable feedback.
Method: Combines Neural Cognitive Diagnosis (NCD) for fine-grained mastery assessment, Bounded-Ability Estimation Computerized Adaptive Testing (BECAT) for dynamic item selection, and LLMs for structured feedback in a closed-loop “Diagnosis-Recommendation-Feedback” framework.
Result: NCD achieves strong response prediction with interpretable mastery assessments, adaptive recommendation improves item relevance and personalization, and LLM-based feedback provides targeted study guidance aligned with identified weaknesses.
Conclusion: The proposed design is effective and practically deployable, providing a feasible pathway to generating individualized learning trajectories in intelligent education.
Abstract: As information technology advances, education is moving from one-size-fits-all instruction toward personalized learning. However, most methods handle modeling, item selection, and feedback in isolation rather than as a closed loop. This leads to coarse or opaque student models, assumption-bound adaptivity that ignores diagnostic posteriors, and generic, non-actionable feedback. To address these limitations, this paper presents an end-to-end personalized learning agent, EduLoop-Agent, which integrates a Neural Cognitive Diagnosis model (NCD), a Bounded-Ability Estimation Computerized Adaptive Testing strategy (BECAT), and large language models (LLMs). The NCD module provides fine-grained estimates of students’ mastery at the knowledge-point level; BECAT dynamically selects subsequent items to maximize relevance and learning efficiency; and LLMs convert diagnostic signals into structured, actionable feedback. Together, these components form a closed-loop framework of “Diagnosis–Recommendation–Feedback.” Experiments on the ASSISTments dataset show that the NCD module achieves strong performance on response prediction while yielding interpretable mastery assessments. The adaptive recommendation strategy improves item relevance and personalization, and the LLM-based feedback offers targeted study guidance aligned with identified weaknesses. Overall, the results indicate that the proposed design is effective and practically deployable, providing a feasible pathway to generating individualized learning trajectories in intelligent education.
[52] Pedagogy-driven Evaluation of Generative AI-powered Intelligent Tutoring Systems
Kaushal Kumar Maurya, Ekaterina Kochmar
Main category: cs.CL
TL;DR: The paper addresses the lack of standardized evaluation frameworks for LLM-powered Intelligent Tutoring Systems (ITSs) in AIED, highlighting current evaluation challenges and proposing three research directions for establishing fair, unified, and scalable evaluation methodologies.
Details
Motivation: The rapid development of LLM-powered ITSs lacks reliable, universally accepted, and pedagogy-driven evaluation frameworks, leading to inconsistencies and limited generalizability in current evaluation practices.
Method: The authors provide a comprehensive analysis of state-of-the-art evaluation practices, examine challenges through real-world case studies from AIED research, and build on insights from previous interdisciplinary AIED research.
Result: The paper identifies current evaluation inconsistencies and limitations in ITS assessment, particularly highlighting the reliance on subjective protocols and non-standardized benchmarks.
Conclusion: The authors propose three practical, feasible, and theoretically grounded research directions rooted in learning science principles to establish fair, unified, and scalable evaluation methodologies for ITSs.
Abstract: The interdisciplinary research domain of Artificial Intelligence in Education (AIED) has a long history of developing Intelligent Tutoring Systems (ITSs) by integrating insights from technological advancements, educational theories, and cognitive psychology. The remarkable success of generative AI (GenAI) models has accelerated the development of large language model (LLM)-powered ITSs, which have potential to imitate human-like, pedagogically rich, and cognitively demanding tutoring. However, the progress and impact of these systems remain largely untraceable due to the absence of reliable, universally accepted, and pedagogy-driven evaluation frameworks and benchmarks. Most existing educational dialogue-based ITS evaluations rely on subjective protocols and non-standardized benchmarks, leading to inconsistencies and limited generalizability. In this work, we take a step back from mainstream ITS development and provide comprehensive state-of-the-art evaluation practices, highlighting associated challenges through real-world case studies from careful and caring AIED research. Finally, building on insights from previous interdisciplinary AIED research, we propose three practical, feasible, and theoretically grounded research directions, rooted in learning science principles and aimed at establishing fair, unified, and scalable evaluation methodologies for ITSs.
[53] AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment
Dario Loi, Elena Maria Muià, Federico Siciliano, Giovanni Trappolini, Vincenzo Crisà, Peter Kruger, Fabrizio Silvestri
Main category: cs.CL
TL;DR: AutoBench is an automated framework for evaluating LLMs using reciprocal peer assessment where models take turns as question generators, contestants, and judges, with iterative weighting to aggregate reliable judgments into consensus rankings.
Details
Motivation: To address limitations of static benchmarks like test-set contamination and limited adaptability by creating a dynamic, self-sustaining evaluation system that can continuously assess evolving language models.
Method: Uses reciprocal peer assessment where LLMs alternate roles as question generators, contestants, and judges across domains. Implements iterative weighting to amplify reliable evaluators and aggregates peer judgments into consensus-based rankings.
Result: Shows strong correlations with established benchmarks (78% with MMLU-Pro and 63% with GPQA). Multi-judge design significantly outperforms single-judge baselines, producing more robust and human-consistent assessments.
Conclusion: AutoBench provides a scalable, contamination-resistant alternative to static benchmarks for continuous evaluation of evolving language models through distributed peer assessment.
Abstract: We present AutoBench, a fully automated and self-sustaining framework for evaluating Large Language Models (LLMs) through reciprocal peer assessment. This paper provides a rigorous scientific validation of the AutoBench methodology, originally developed as an open-source project by eZecute S.R.L. Unlike static benchmarks that suffer from test-set contamination and limited adaptability, AutoBench dynamically generates novel evaluation tasks while models alternately serve as question generators, contestants, and judges across diverse domains. An iterative weighting mechanism amplifies the influence of consistently reliable evaluators, aggregating peer judgments into consensus-based rankings that reflect collective model agreement. Our experiments demonstrate strong correlations with established benchmarks including MMLU-Pro and GPQA (respectively 78% and 63%), validating this peer-driven evaluation paradigm. The multi-judge design significantly outperforms single-judge baselines, confirming that distributed evaluation produces more robust and human-consistent assessments. AutoBench offers a scalable, contamination-resistant alternative to static benchmarks for the continuous evaluation of evolving language models.
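The iterative weighting can be pictured as a fixed-point scheme in which judges that track the weighted consensus gain influence; a minimal sketch under that assumption (AutoBench's actual aggregation may differ).

```python
import numpy as np

def consensus_ranking(scores, iters=50):
    """scores[j, m]: score judge j gave model m; returns consensus and weights."""
    n_judges, _ = scores.shape
    w = np.ones(n_judges) / n_judges
    for _ in range(iters):
        consensus = w @ scores                         # weighted mean per model
        agree = np.array([np.corrcoef(s, consensus)[0, 1] for s in scores])
        w = np.clip(np.nan_to_num(agree), 1e-6, None)  # reliable judges gain weight
        w /= w.sum()
    return consensus, w
```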
[54] Personal Care Utility (PCU): Building the Health Infrastructure for Everyday Insight and Guidance
Mahyar Abbasian, Ramesh Jain
Main category: cs.CL
TL;DR: The paper proposes a Personal Care Utility (PCU) - an AI-powered cybernetic system for lifelong health guidance that continuously orchestrates multimodal data to provide personalized health information, proactive navigation, and ongoing treatment monitoring.
Details
Motivation: To address limitations of conventional episodic healthcare by creating a continuous, adaptive system that integrates digital infrastructure and biomedical innovation for lifelong health management.
Method: Uses multimodal agents, event-centric modeling, and contextual inference to create an ambient, adaptive companion that observes, interprets, and guides health in real time across daily life.
Result: The paper describes the architecture, design principles, and implementation challenges of PCU as an emerging paradigm for continuous health management.
Conclusion: PCU promises improved individual health outcomes and provides a new foundation for public health and scientific discovery through integration of personal sensing, experiential computing, and population-level analytics.
Abstract: Building on decades of success in digital infrastructure and biomedical innovation, we propose the Personal Care Utility (PCU) - a cybernetic system for lifelong health guidance. PCU is conceived as a global, AI-powered utility that continuously orchestrates multimodal data, knowledge, and services to assist individuals and populations alike. Drawing on multimodal agents, event-centric modeling, and contextual inference, it offers three essential capabilities: (1) trusted health information tailored to the individual, (2) proactive health navigation and behavior guidance, and (3) ongoing interpretation of recovery and treatment response after medical events. Unlike conventional episodic care, PCU functions as an ambient, adaptive companion - observing, interpreting, and guiding health in real time across daily life. By integrating personal sensing, experiential computing, and population-level analytics, PCU promises not only improved outcomes for individuals but also a new substrate for public health and scientific discovery. We describe the architecture, design principles, and implementation challenges of this emerging paradigm.
[55] PerCoR: Evaluating Commonsense Reasoning in Persian via Multiple-Choice Sentence Completion
Morteza Alikhani, Mohammadtaha Bagherifard, Erfan Zinvandi, Mehran Sarmadi
Main category: cs.CL
TL;DR: PerCoR is the first large-scale Persian benchmark for commonsense reasoning with 106K multiple-choice problems, featuring a novel conjunction-based segmentation strategy and DRESS-AF adversarial filtering for challenging distractors.
Details
Motivation: To address the lack of large-scale Persian benchmarks for commonsense reasoning and create a challenging dataset that can evaluate model performance in this domain.
Method: Used conjunction-based segmentation to generate sentence-completion pairs, and developed DRESS-AF (Distractor Ranking via Embedding Similarity Scoring and Adversarial Filtering) to create difficult distractors from gold continuations.
Result: Human annotators scored 89%, OpenAI-o3 achieved 92.18%, Claude-Sonnet-3.7 scored 91.17%, and the best open-source model DeepSeek-R1 reached 82.51%. DRESS-AF also improved English HellaSwag benchmark difficulty without reducing human solvability.
Conclusion: PerCoR establishes a challenging benchmark for Persian commonsense reasoning, revealing significant performance gaps between models, and demonstrates that DRESS-AF can effectively increase dataset difficulty across languages.
Abstract: We introduce PerCoR (Persian Commonsense Reasoning), the first large-scale Persian benchmark for commonsense reasoning. PerCoR contains 106K multiple-choice sentence-completion problems drawn from more than forty news, cultural, and other web sources. We introduce a novel conjunction-based segmentation strategy to generate coherent sentence-completion pairs, enabling broad topical and structural diversity. To create challenging distractors, we propose DRESS-AF (Distractor Ranking via Embedding Similarity Scoring and Adversarial Filtering), a generation-free adversarial filtering method that selects distractors from the pool of gold continuations while maximising model confusion. Human annotators score 89% on PerCoR, while OpenAI-o3 achieves the highest performance at 92.18%, followed closely by Claude-Sonnet-3.7 (91.17%). The strongest open-source model, DeepSeek-R1, reaches 82.51%, underscoring both the dataset’s difficulty and the remaining performance gap in Persian commonsense reasoning. We further show that DRESS-AF transfers to the English HellaSwag benchmark, increasing its difficulty without hurting human solvability. The dataset is available at https://huggingface.co/datasets/MCINext/PerCoR.
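A rough sketch of the DRESS-AF selection logic as described: rank other items' gold continuations by embedding similarity to the context, then adversarially filter for the candidates a solver model finds hardest to separate from the true continuation. `embed` and `solver_logprob` are hypothetical helpers.

```python
import numpy as np

def pick_distractors(context, gold, pool, embed, solver_logprob, k=3):
    """Select k distractors from other items' gold continuations."""
    ctx = embed(context)
    cands = [c for c in pool if c != gold]
    vecs = [embed(c) for c in cands]
    sims = [float(np.dot(ctx, v) / (np.linalg.norm(ctx) * np.linalg.norm(v)))
            for v in vecs]
    # embedding-similarity ranking: keep the most context-plausible candidates
    shortlist = [cands[i] for i in np.argsort(sims)[::-1][:10 * k]]
    # adversarial filtering: prefer candidates the solver scores closest to gold
    gold_lp = solver_logprob(context, gold)
    hardest = sorted(shortlist, key=lambda c: abs(solver_logprob(context, c) - gold_lp))
    return hardest[:k]
```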
[56] Integrating Linguistics and AI: Morphological Analysis and Corpus development of Endangered Toto Language of West Bengal
Ambalika Guha, Sajal Saha, Debanjan Ballav, Soumi Mitra, Hritwick Chakraborty
Main category: cs.CL
TL;DR: Development of a trilingual language learning app to preserve the endangered Toto language using AI technology and linguistic documentation.
Details
Motivation: To preserve linguistic diversity by digitally archiving and promoting the endangered Toto language of West Bengal, India, ensuring its accessibility and usability.
Method: Fieldwork for linguistic documentation, creation of a morpheme-tagged trilingual corpus, training of a Small Language Model and a Transformer-based translation engine, script standardization, and development of digital literacy tools.
Result: A structured language corpus, AI models for language processing, and digital tools that support Toto language learning and preservation.
Conclusion: The study provides a sustainable model for preserving endangered languages by combining traditional linguistic methods with AI, emphasizing interdisciplinary collaboration for community-based language revitalization.
Abstract: Preserving linguistic diversity is necessary as every language offers a distinct perspective on the world. There have been numerous global initiatives to preserve endangered languages through documentation. This paper is a part of a project which aims to develop a trilingual (Toto-Bangla-English) language learning application to digitally archive and promote the endangered Toto language of West Bengal, India. This application, designed for both native Toto speakers and non-native learners, aims to revitalize the language by ensuring accessibility and usability through Unicode script integration and a structured language corpus. The research includes detailed linguistic documentation collected via fieldwork, followed by the creation of a morpheme-tagged, trilingual corpus used to train a Small Language Model (SLM) and a Transformer-based translation engine. The analysis covers inflectional morphology such as person-number-gender agreement, tense-aspect-mood distinctions, and case marking, alongside derivational strategies that reflect word-class changes. Script standardization and digital literacy tools were also developed to enhance script usage. The study offers a sustainable model for preserving endangered languages by incorporating traditional linguistic methodology with AI. This bridge between linguistic research with technological innovation highlights the value of interdisciplinary collaboration for community-based language revitalization.
[57] Culturally Grounded Physical Commonsense Reasoning in Italian and English: A Submission to the MRL 2025 Shared Task
Marco De Santis, Lisa Alazraki
Main category: cs.CL
TL;DR: FormaMentis is a novel Italian physical commonsense reasoning benchmark created for MRL 2025 Shared Task, featuring culturally-grounded data annotated by native Italian speakers.
Details
Motivation: To create manually-annotated evaluation data for physical commonsense reasoning in languages other than English, specifically focusing on Italian language and culture.
Method: Expert native Italian speakers created culturally-grounded data samples following PIQA format, with additional English translations preserving Italian cultural elements.
Result: FormaMentis benchmark was developed as a submission to the MRL 2025 Shared Task, providing Italian physical commonsense reasoning data with cultural context.
Conclusion: The paper presents FormaMentis as a culturally-grounded Italian benchmark for physical commonsense reasoning, contributing to multilingual evaluation resources.
Abstract: This paper presents our submission to the MRL 2025 Shared Task on Multilingual Physical Reasoning Datasets. The objective of the shared task is to create manually-annotated evaluation data in the physical commonsense reasoning domain, for languages other than English, following a format similar to PIQA. Our contribution, FormaMentis, is a novel benchmark for physical commonsense reasoning that is grounded in Italian language and culture. The data samples in FormaMentis are created by expert annotators who are native Italian speakers and are familiar with local customs and norms. The samples are additionally translated into English, while preserving the cultural elements unique to the Italian context.
[58] Conjugate Relation Modeling for Few-Shot Knowledge Graph Completion
Zilong Wang, Qingtian Zeng, Hua Duan, Cheng Cheng, Minghao Zou, Ziyang Wang
Main category: cs.CL
TL;DR: CR-FKGC is a novel few-shot knowledge graph completion framework that uses conjugate relation modeling to capture complex relational patterns and address data sparsity through neighborhood aggregation, implicit conditional diffusion, and manifold space inference.
Details
Motivation: Existing few-shot knowledge graph completion methods struggle to capture complex relational patterns and mitigate data sparsity issues in long-tail distribution scenarios.
Method: Proposes CR-FKGC framework with: 1) neighborhood aggregation encoder for higher-order neighbor information, 2) conjugate relation learner combining implicit conditional diffusion module and stable relation module, 3) manifold conjugate decoder for inference in manifold space.
Result: Experiments on three benchmarks show superior performance over state-of-the-art methods.
Conclusion: The proposed CR-FKGC framework effectively addresses FKGC challenges by modeling conjugate relations and achieves better performance than existing approaches.
Abstract: Few-shot Knowledge Graph Completion (FKGC) infers missing triples from limited support samples, tackling long-tail distribution challenges. Existing methods, however, struggle to capture complex relational patterns and mitigate data sparsity. To address these challenges, we propose a novel FKGC framework for conjugate relation modeling (CR-FKGC). Specifically, it employs a neighborhood aggregation encoder to integrate higher-order neighbor information, a conjugate relation learner combining an implicit conditional diffusion relation module with a stable relation module to capture stable semantics and uncertainty offsets, and a manifold conjugate decoder for efficient evaluation and inference of missing triples in manifold space. Experiments on three benchmarks demonstrate that our method achieves superior performance over state-of-the-art methods.
[59] Rule-Based Explanations for Retrieval-Augmented LLM Systems
Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, Jarek Szlichta
Main category: cs.CL
TL;DR: First proposal to use if-then rules for explaining LLMs with RAG, showing how source presence/absence affects outputs with optimized rule generation.
Details
Motivation: Need to explain emerging LLM-RAG systems by linking retrieved information sources to output provenance, since RAG allows incorporating external sources at inference time.
Method: Generate rules by probing LLM with source combinations, with Apriori-like pruning optimizations from frequent itemset mining adapted for this novel problem.
Result: Proposed optimizations significantly speed up rule generation compared to brute force approach.
Conclusion: Qualitative and quantitative experiments demonstrate the value and efficiency of the proposed solutions for explaining LLM-RAG systems.
Abstract: If-then rules are widely used to explain machine learning models; e.g., “if employed = no, then loan application = rejected.” We present the first proposal to apply rules to explain the emerging class of large language models (LLMs) with retrieval-augmented generation (RAG). Since RAG enables LLM systems to incorporate retrieved information sources at inference time, rules linking the presence or absence of sources can explain output provenance; e.g., “if a Times Higher Education ranking article is retrieved, then the LLM ranks Oxford first.” To generate such rules, a brute force approach would probe the LLM with all source combinations and check if the presence or absence of any sources leads to the same output. We propose optimizations to speed up rule generation, inspired by Apriori-like pruning from frequent itemset mining but redefined within the scope of our novel problem. We conclude with qualitative and quantitative experiments demonstrating our solutions’ value and efficiency.
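One illustrative reading of the probing loop with Apriori-style pruning: once a small source subset already yields the target answer, none of its supersets needs to be probed. `query_llm` is a hypothetical RAG call.

```python
from itertools import combinations

def mine_rules(sources, query_llm, target_answer, max_size=3):
    """Find minimal source subsets whose presence yields the target answer."""
    rules, found = [], []
    for size in range(1, max_size + 1):
        for subset in map(frozenset, combinations(sources, size)):
            if any(r <= subset for r in found):
                continue                 # Apriori-style: a smaller rule covers this
            if query_llm(retrieved=sorted(subset)) == target_answer:
                rules.append((sorted(subset), target_answer))
                found.append(subset)     # "if these sources present -> answer"
    return rules
```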
[60] SALSA: Single-pass Autoregressive LLM Structured Classification
Ruslan Berdichevsky, Shai Nahum-Gefen, Elad Ben Zaken
Main category: cs.CL
TL;DR: SALSA is a pipeline combining structured prompting, class-to-token mapping, and parameter-efficient fine-tuning to improve LLM performance on text classification tasks without cold-start training.
Details
Motivation: Instruction-tuned Large Language Models often underperform on text classification benchmarks despite their impressive generalization capabilities.
Method: Map each class label to a distinct output token, construct prompts to elicit single-token responses, and use parameter-efficient fine-tuning. During inference, project output only onto relevant class token logits for efficient single-forward-pass classification.
Result: SALSA achieves state-of-the-art results across diverse benchmarks, demonstrating robustness and scalability for LLM-based classification applications.
Conclusion: The SALSA pipeline effectively addresses LLM underperformance in text classification through structured prompting and efficient inference mechanisms.
Abstract: Despite their impressive generalization capabilities, instruction-tuned Large Language Models often underperform on text classification benchmarks. We introduce SALSA, a coherent pipeline that combines structured prompting, class-to-token mapping, and parameter-efficient fine-tuning, thereby avoiding cold-start training. Each class label is mapped to a distinct output token, and prompts are constructed to elicit a single-token response. During inference, the model’s output is projected only onto the logits of the relevant class tokens, enabling efficient and accurate classification in a single forward pass. SALSA achieves state-of-the-art results across diverse benchmarks, demonstrating its robustness and scalability for LLM-based classification applications.
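A minimal sketch of the single-pass classification mechanics: each label is bound to one output token and the final-position logits are projected onto just those tokens. The model name and letter mapping are illustrative, and the paper's parameter-efficient fine-tuning stage is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-1B-Instruct"    # illustrative choice
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
labels = ["positive", "negative", "neutral"]
letters = ["A", "B", "C"]                     # one distinct token per class
class_ids = [tok.encode(l, add_special_tokens=False)[0] for l in letters]

def classify(text):
    prompt = ("Classify the review. Reply with exactly one letter "
              "(A=positive, B=negative, C=neutral).\n"
              f"Review: {text}\nAnswer: ")
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]      # next-token logits, one pass
    return labels[int(logits[class_ids].argmax())]  # project onto class tokens
```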
[61] E²Rank: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker
Qi Liu, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Jiaxin Mao
Main category: cs.CL
TL;DR: E²Rank is a unified framework that extends a single text embedding model to perform both retrieval and listwise reranking through continued training, achieving strong effectiveness with high efficiency.
Details
Motivation: Text embedding models have limited ranking fidelity compared to dedicated rerankers, especially LLM-based listwise rerankers that capture fine-grained interactions between queries and documents.
Method: Uses cosine similarity between query and document embeddings as unified ranking function, with listwise ranking prompts constructed from original query and candidate documents (similar to pseudo-relevance feedback).
Result: Achieves state-of-the-art results on BEIR reranking benchmark, competitive performance on BRIGHT benchmark with low latency, and improves embedding performance on MTEB benchmark.
Conclusion: A single embedding model can effectively unify retrieval and reranking, offering both computational efficiency and competitive ranking accuracy.
Abstract: Text embedding models serve as a fundamental component in real-world search applications. By mapping queries and documents into a shared embedding space, they deliver competitive retrieval performance with high efficiency. However, their ranking fidelity remains limited compared to dedicated rerankers, especially recent LLM-based listwise rerankers, which capture fine-grained query-document and document-document interactions. In this paper, we propose a simple yet effective unified framework E²Rank, short for Efficient Embedding-based Ranking (also read as Embedding-to-Rank), which extends a single text embedding model to perform both high-quality retrieval and listwise reranking through continued training under a listwise ranking objective, thereby achieving strong effectiveness with remarkable efficiency. By applying cosine similarity between the query and document embeddings as a unified ranking function, the listwise ranking prompt, which is constructed from the original query and its candidate documents, serves as an enhanced query enriched with signals from the top-K documents, akin to pseudo-relevance feedback (PRF) in traditional retrieval models. This design preserves the efficiency and representational quality of the base embedding model while significantly improving its reranking performance. Empirically, E²Rank achieves state-of-the-art results on the BEIR reranking benchmark and demonstrates competitive performance on the reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also show that the ranking training process improves embedding performance on the MTEB benchmark. Our findings indicate that a single embedding model can effectively unify retrieval and reranking, offering both computational efficiency and competitive ranking accuracy.
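A toy sketch of the unified ranking function: one embedder scores documents by cosine similarity, then the query is re-embedded enriched with its top-K candidates, PRF-style, for the listwise pass. sentence-transformers and the model name are illustrative stand-ins for the paper's trained embedding backbone.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative base embedder

def rank(query, docs, top_k=5):
    q, d = model.encode([query]), model.encode(docs)
    sims = (q @ d.T)[0] / (np.linalg.norm(q) * np.linalg.norm(d, axis=1))
    first_pass = np.argsort(-sims)
    # listwise "prompt": re-embed the query enriched with its top-K docs
    enriched = query + "\n" + "\n".join(docs[i] for i in first_pass[:top_k])
    q2 = model.encode([enriched])
    sims2 = (q2 @ d.T)[0] / (np.linalg.norm(q2) * np.linalg.norm(d, axis=1))
    return np.argsort(-sims2)                     # reranked document order
```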
[62] Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study
Eeham Khan, Firas Saidani, Owen Van Esbroeck, Richard Khoury, Leila Kosseim
Main category: cs.CL
TL;DR: CPT with LoRA enables efficient adaptation of LLMs to minority dialects like Québec French using minimal data and compute, improving dialect performance with minimal regression on standard language benchmarks.
Details
Motivation: To address the limited capabilities of LLMs in low-resource dialects and expand high-quality LLM access to minority linguistic communities through cost-effective methods.
Method: Used continual pre-training (CPT) with low-rank adaptation (LoRA) and compute-efficient techniques to adapt three LLMs to Québec French using a small dataset, updating under 1% of model parameters.
Result: Showed improvement on minority dialect benchmarks with minimal regression on prestige language benchmarks, though gains were highly dependent on corpus composition.
Conclusion: CPT with PEFT can effectively narrow the dialect gap and provide sustainable language resource creation for minority linguistic communities.
Abstract: Despite the widespread adoption of large language models (LLMs), their strongest capabilities remain largely confined to a small number of high-resource languages for which there is abundant training data. Recently, continual pre-training (CPT) has emerged as a means to fine-tune these models to low-resource regional dialects. In this paper, we study the use of CPT for dialect learning under tight data and compute budgets. Using low-rank adaptation (LoRA) and compute-efficient continual pre-training, we adapt three LLMs to the Québec French dialect using a very small dataset and benchmark them on the COLE suite. Our experiments demonstrate an improvement on the minority dialect benchmarks with minimal regression on the prestige language benchmarks with under 1% of model parameters updated. Analysis of the results demonstrate that gains are highly contingent on corpus composition. These findings indicate that CPT with parameter-efficient fine-tuning (PEFT) can narrow the dialect gap by providing cost-effective and sustainable language resource creation, expanding high-quality LLM access to minority linguistic communities. We release the first Québec French LLMs on HuggingFace.
[63] Beyond Semantics: How Temporal Biases Shape Retrieval in Transformer and State-Space Models
Anooshka Bajaj, Deven Mahesh Mistry, Sahaj Singh Maini, Yash Aggarwal, Zoran Tiganj
Main category: cs.CL
TL;DR: LLMs exhibit temporal biases in in-context learning, favoring tokens near sequence boundaries and showing reduced retrieval reliability for middle-position memories, similar to human episodic memory patterns.
Details
Motivation: To understand how temporal relationships affect LLMs' contextual information retrieval, analogous to human episodic memory that separates events by time.
Method: Tested various pretrained LLMs with sequences containing repeated tokens at fixed positions while permuting others, isolating temporal effects on next-token prediction. Used ablation experiments and extended analysis to unique semantic contexts.
Result: Models consistently favored tokens following repeated tokens but showed bias for beginning/end positions. Transformers’ bias linked to induction heads. State-space and transformer models showed comparable temporal biases despite architectural differences.
Conclusion: Findings deepen understanding of temporal biases in in-context learning and illustrate how these biases enable temporal separation and episodic retrieval in LLMs.
Abstract: In-context learning is governed by both temporal and semantic relationships, shaping how Large Language Models (LLMs) retrieve contextual information. Analogous to human episodic memory, where the retrieval of specific events is enabled by separating events that happened at different times, this work probes the ability of various pretrained LLMs, including transformer and state-space models, to differentiate and retrieve temporally separated events. Specifically, we prompted models with sequences containing multiple presentations of the same token, which reappears at the sequence end. By fixing the positions of these repeated tokens and permuting all others, we removed semantic confounds and isolated temporal effects on next-token prediction. Across diverse sequences, models consistently placed the highest probabilities on tokens following a repeated token, but with a notable bias for those nearest the beginning or end of the input. An ablation experiment linked this phenomenon in transformers to induction heads. Extending the analysis to unique semantic contexts with partial overlap further demonstrated that memories embedded in the middle of a prompt are retrieved less reliably. Despite architectural differences, state-space and transformer models showed comparable temporal biases. Our findings deepen the understanding of temporal biases in in-context learning and offer an illustration of how these biases can enable temporal separation and episodic retrieval.
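A small sketch of the probe construction described: the repeated token sits at fixed positions while all filler tokens are shuffled, so any preference for tokens following a repeat must be temporal rather than semantic.

```python
import random

def make_probe(vocab, length, repeat_token, repeat_positions):
    """Sequence with repeat_token at fixed positions, shuffled fillers elsewhere."""
    fillers = [t for t in vocab if t != repeat_token]
    seq = random.sample(fillers, length)   # permuted non-repeat tokens
    for p in repeat_positions:
        seq[p] = repeat_token
    seq.append(repeat_token)               # query: the token reappears at the end
    return seq

# e.g. make_probe(vocab, 50, "apple", [10, 25, 40]); the model's next-token
# distribution then reveals which earlier occurrence it retrieves from.
```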
[64] EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models
Li Zhou, Lutong Yu, You Lyu, Yihang Lin, Zefeng Zhao, Junyi Ao, Yuhao Zhang, Benyou Wang, Haizhou Li
Main category: cs.CL
TL;DR: EchoMind is a new benchmark for evaluating Speech Language Models’ ability to perceive vocal cues and generate empathetic responses, revealing current models struggle with high-expressive vocal cues despite linguistic understanding.
Details
Motivation: Existing benchmarks evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration needed for human-like emotionally intelligent conversation.
Method: Created EchoMind benchmark with sequential, context-linked tasks: spoken-content understanding, vocal-cue perception, integrated reasoning, and response generation using semantically neutral scripts with controlled vocal style variations.
Result: Testing 12 advanced SLMs showed even state-of-the-art models struggle with high-expressive vocal cues, limiting empathetic response quality. Models have weaknesses in instruction-following, resilience to natural speech variability, and effective use of vocal cues.
Conclusion: SLMs need to better integrate linguistic content with diverse vocal cues to achieve truly empathetic conversational ability.
Abstract: Speech Language Models (SLMs) have made significant progress in spoken language understanding. Yet it remains unclear whether they can fully perceive non-lexical vocal cues alongside spoken words, and respond with empathy that aligns with both emotional and contextual factors. Existing benchmarks typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration of these skills that is crucial for human-like, emotionally intelligent conversation. We present EchoMind, the first interrelated, multi-level benchmark that simulates the cognitive process of empathetic dialogue through sequential, context-linked tasks: spoken-content understanding, vocal-cue perception, integrated reasoning, and response generation. All tasks share identical and semantically neutral scripts that are free of explicit emotional or contextual cues, and controlled variations in vocal style are used to test the effect of delivery independent of the transcript. EchoMind is grounded in an empathy-oriented framework spanning 3 coarse and 12 fine-grained dimensions, encompassing 39 vocal attributes, and evaluated using both objective and subjective metrics. Testing 12 advanced SLMs reveals that even state-of-the-art models struggle with high-expressive vocal cues, limiting empathetic response quality. Analyses of prompt strength, speech source, and ideal vocal cue recognition reveal persistent weaknesses in instruction-following, resilience to natural speech variability, and effective use of vocal cues for empathy. These results underscore the need for SLMs that integrate linguistic content with diverse vocal cues to achieve truly empathetic conversational ability.
[65] Iterative Layer Pruning for Efficient Translation Inference
Yasmin Moslem, Muhammad Hazim Al Farouq, John D. Kelleher
Main category: cs.CL
TL;DR: The paper presents a model compression method using iterative layer pruning with layer importance analysis for LLMs, achieving significant size and inference time reductions while maintaining translation quality.
Details
Motivation: Large language models have intensive computational requirements that make efficient deployment challenging, particularly for machine translation tasks.
Method: Iterative layer pruning guided by layer importance analysis, evaluated on the Aya-Expanse-8B model for Czech-German and English-Egyptian Arabic translation tasks.
Result: Achieved substantial reductions in model size and inference time while maintaining the translation quality of baseline models.
Conclusion: The proposed iterative layer pruning method with layer importance analysis is effective for compressing LLMs while preserving performance in machine translation tasks.
Abstract: Large language models (LLMs) have transformed many areas of natural language processing, including machine translation. However, efficient deployment of LLMs remains challenging due to their intensive computational requirements. In this paper, we address this challenge and present our submissions to the Model Compression track at the Conference on Machine Translation (WMT 2025). In our experiments, we investigate iterative layer pruning guided by layer importance analysis. We evaluate this method using the Aya-Expanse-8B model for translation from Czech to German, and from English to Egyptian Arabic. Our approach achieves substantial reductions in model size and inference time, while maintaining the translation quality of the baseline models.
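A hedged sketch of iterative layer pruning for a LLaMA-style decoder stack: at each step, ablate whichever layer costs the least held-out translation quality. `evaluate` (e.g., chrF on a dev set) is a hypothetical callback, the attribute path assumes Hugging Face LLaMA-family models, and the authors' layer importance analysis may use a different criterion.

```python
import torch.nn as nn

def iterative_layer_prune(model, evaluate, n_remove):
    """Greedily remove n_remove decoder layers, one per iteration."""
    for _ in range(n_remove):
        layers = model.model.layers      # decoder stack (LLaMA-style naming)
        best_idx, best_score = None, float("-inf")
        for i in range(len(layers)):
            model.model.layers = nn.ModuleList(
                [l for j, l in enumerate(layers) if j != i])
            score = evaluate(model)      # quality with layer i ablated
            if score > best_score:
                best_idx, best_score = i, score
        model.model.layers = nn.ModuleList(
            [l for j, l in enumerate(layers) if j != best_idx])
    return model
```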
[66] MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion
Haoyi Qiu, Yilun Zhou, Pranav Narayanan Venkit, Kung-Hsiang Huang, Jiaxin Zhang, Nanyun Peng, Chien-Sheng Wu
Main category: cs.CL
TL;DR: MMPersuade is a framework for studying how Large Vision-Language Models (LVLMs) can be influenced by persuasive multimodal content, revealing that multimodal inputs significantly increase persuasion effectiveness compared to text alone.
Details
Motivation: LVLMs are increasingly deployed in domains with persuasive content, and understanding their susceptibility is crucial to prevent adoption of misleading beliefs, overriding user preferences, or generating unethical outputs when exposed to manipulative messages.
Method: MMPersuade introduces a comprehensive multimodal dataset pairing images/videos with persuasion principles across commercial, subjective/behavioral, and adversarial contexts, and an evaluation framework using third-party agreement scoring and self-estimated token probabilities.
Result: Multimodal inputs substantially increase persuasion effectiveness and model susceptibility compared to text alone, especially for misinformation; stated prior preferences decrease susceptibility but multimodal advantage persists; different strategies vary in effectiveness across contexts.
Conclusion: MMPersuade provides a principled foundation for developing LVLMs that are robust, preference-consistent, and ethically aligned when engaging with persuasive multimodal content.
Abstract: As Large Vision-Language Models (LVLMs) are increasingly deployed in domains such as shopping, health, and news, they are exposed to pervasive persuasive content. A critical question is how these models function as persuadees-how and why they can be influenced by persuasive multimodal inputs. Understanding both their susceptibility to persuasion and the effectiveness of different persuasive strategies is crucial, as overly persuadable models may adopt misleading beliefs, override user preferences, or generate unethical or unsafe outputs when exposed to manipulative messages. We introduce MMPersuade, a unified framework for systematically studying multimodal persuasion dynamics in LVLMs. MMPersuade contributes (i) a comprehensive multimodal dataset that pairs images and videos with established persuasion principles across commercial, subjective and behavioral, and adversarial contexts, and (ii) an evaluation framework that quantifies both persuasion effectiveness and model susceptibility via third-party agreement scoring and self-estimated token probabilities on conversation histories. Our study of six leading LVLMs as persuadees yields three key insights: (i) multimodal inputs substantially increase persuasion effectiveness-and model susceptibility-compared to text alone, especially in misinformation scenarios; (ii) stated prior preferences decrease susceptibility, yet multimodal information maintains its persuasive advantage; and (iii) different strategies vary in effectiveness across contexts, with reciprocity being most potent in commercial and subjective contexts, and credibility and logic prevailing in adversarial contexts. By jointly analyzing persuasion effectiveness and susceptibility, MMPersuade provides a principled foundation for developing models that are robust, preference-consistent, and ethically aligned when engaging with persuasive multimodal content.
[67] Scalable Supervising Software Agents with Patch Reasoner
Junjielong Xu, Boyin Tan, Xiaoyuan Liu, Chao Peng, Pengfei Gao, Pinjia He
Main category: cs.CL
TL;DR: R4P is a patch verifier model that provides scalable rewards for training software engineering agents through reasoning, addressing limitations of test-based supervision by enabling faster verification without running tests.
Details
Motivation: Existing test-based supervision for software engineering agents is unscalable due to heavy test sandbox requirements and rarity of high-coverage test data, which limits data scaling potential and is vulnerable to test hacking.
Method: R4P treats patch verification as a reasoning task, using a group-wise objective for RL training to verify multiple patches against each other’s modifications, providing dense rewards for stable training without requiring test execution.
Result: R4P achieves 72.2% accuracy on SWE-bench-verified, outperforms OpenAI o3, and enables Mini-SE to achieve 26.2% Pass@1 (10% improvement over Qwen3-32B), with verification speed 50x faster than testing.
Conclusion: R4P provides practical, scalable rewards for SWE agent training through reasoning-based verification, enabling stable scaling curves and high efficiency while avoiding test-based supervision limitations.
Abstract: While large language model agents have advanced software engineering tasks, the unscalable nature of existing test-based supervision is limiting the potential improvement of data scaling. The reason is twofold: (1) building and running test sandboxes is heavy and fragile, and (2) data with high-coverage tests is naturally rare and threatened by test hacking via edge cases. In this paper, we propose R4P, a patch verifier model to provide scalable rewards for training and testing SWE agents via reasoning. We consider that patch verification is fundamentally a reasoning task, mirroring how human repository maintainers review patches without writing and running new reproduction tests. To obtain sufficient reference and reduce the risk of reward hacking, R4P uses a group-wise objective for RL training, enabling it to verify multiple patches against each other’s modifications and gain a dense reward for stable training. R4P achieves 72.2% accuracy for verifying patches from SWE-bench-verified, surpassing OpenAI o3. To demonstrate R4P’s practicality, we design and train a lite scaffold, Mini-SE, with pure reinforcement learning where all rewards are derived from R4P. As a result, Mini-SE achieves 26.2% Pass@1 on SWE-bench-verified, showing a 10.0% improvement over the original Qwen3-32B. This can be further improved to 32.8% with R4P for test-time scaling. Furthermore, R4P verifies patches within a second, 50x faster than testing on average. The stable scaling curves of rewards and accuracy along with high efficiency reflect R4P’s practicality.
[68] VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions
Thu Phuong Nguyen, Duc M. Nguyen, Hyotaek Jeon, Hyunwook Lee, Hyunmin Song, Sungahn Ko, Taehwan Kim
Main category: cs.CL
TL;DR: VEHME is a vision-language model that automatically assesses handwritten mathematical solutions with high accuracy and interpretable reasoning, using a two-phase training approach and expression-aware visual prompting.
Details
Motivation: Automated assessment of handwritten math solutions is challenging due to diverse formats, unstructured layouts, and symbolic complexity, creating a need for accurate and interpretable evaluation tools in educational technology.
Method: VEHME uses a two-phase training pipeline: supervised fine-tuning with structured reasoning data, and reinforcement learning aligned with multi-dimensional grading objectives. It includes an Expression-Aware Visual Prompting Module trained on synthesized multi-line math expressions for spatial understanding.
Result: VEHME achieves state-of-the-art performance among open-source models on AIHub and FERMAT datasets, approaching the accuracy of proprietary systems while providing interpretable reasoning traces.
Conclusion: VEHME demonstrates potential as a scalable and accessible tool for automated math assessment, with publicly available training and experiment code.
Abstract: Automatically assessing handwritten mathematical solutions is an important problem in educational technology with practical applications, but it remains a significant challenge due to the diverse formats, unstructured layouts, and symbolic complexity of student work. To address this challenge, we introduce VEHME-a Vision-Language Model for Evaluating Handwritten Mathematics Expressions-designed to assess open-form handwritten math responses with high accuracy and interpretable reasoning traces. VEHME integrates a two-phase training pipeline: (i) supervised fine-tuning using structured reasoning data, and (ii) reinforcement learning that aligns model outputs with multi-dimensional grading objectives, including correctness, reasoning depth, and error localization. To enhance spatial understanding, we propose an Expression-Aware Visual Prompting Module, trained on our synthesized multi-line math expressions dataset to robustly guide attention in visually heterogeneous inputs. Evaluated on AIHub and FERMAT datasets, VEHME achieves state-of-the-art performance among open-source models and approaches the accuracy of proprietary systems, demonstrating its potential as a scalable and accessible tool for automated math assessment. Our training and experiment code is publicly available at our GitHub repository.
[69] Cross-Lingual Stability and Bias in Instruction-Tuned Language Models for Humanitarian NLP
Poli Nemkova, Amrit Adhikari, Matthew Pearson, Vamsi Krishna Sadu, Mark V. Albert
Main category: cs.CL
TL;DR: First systematic comparison of commercial vs open-weight LLMs for human rights violation detection across 7 languages, showing aligned models maintain stable performance while open-weight models exhibit significant language sensitivity.
Details
Motivation: Humanitarian organizations face cost-reliability trade-offs between expensive commercial APIs and free open-weight models, especially for low-resource languages in conflict zones where empirical validation is lacking.
Method: Evaluated 6 models (4 commercial aligned, 2 open-weight) across 78,000 multilingual inferences using standard classification metrics and new cross-lingual reliability measures: Calibration Deviation, Decision Bias, Language Robustness Score, and Language Stability Score.
Result: Aligned models maintained near-invariant accuracy and balanced calibration across typologically distant and low-resource languages, while open-weight models showed significant prompt-language sensitivity and calibration drift. Alignment, not scale, determines stability.
Conclusion: Multilingual alignment enables language-agnostic reasoning, providing practical guidance for humanitarian organizations balancing budget constraints with reliability in multilingual deployment.
Abstract: Humanitarian organizations face a critical choice: invest in costly commercial APIs or rely on free open-weight models for multilingual human rights monitoring. While commercial systems offer reliability, open-weight alternatives lack empirical validation – especially for low-resource languages common in conflict zones. This paper presents the first systematic comparison of commercial and open-weight large language models (LLMs) for human-rights-violation detection across seven languages, quantifying the cost-reliability trade-off facing resource-constrained organizations. Across 78,000 multilingual inferences, we evaluate six models – four instruction-aligned (Claude-Sonnet-4, DeepSeek-V3, Gemini-Flash-2.0, GPT-4.1-mini) and two open-weight (LLaMA-3-8B, Mistral-7B) – using both standard classification metrics and new measures of cross-lingual reliability: Calibration Deviation (CD), Decision Bias (B), Language Robustness Score (LRS), and Language Stability Score (LSS). Results show that alignment, not scale, determines stability: aligned models maintain near-invariant accuracy and balanced calibration across typologically distant and low-resource languages (e.g., Lingala, Burmese), while open-weight models exhibit significant prompt-language sensitivity and calibration drift. These findings demonstrate that multilingual alignment enables language-agnostic reasoning and provide practical guidance for humanitarian organizations balancing budget constraints with reliability in multilingual deployment.
[70] Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
Haowei Hua, Hong Jiao, Xinyi Wang
Main category: cs.CL
TL;DR: Using generative language models with summarization and prompting improves automated scoring of long essays, overcoming BERT’s 512-token limit and increasing QWK from 0.822 to 0.8878.
Details
Motivation: BERT and its variants have a 512-token limit, which is insufficient for automated scoring of long essays, creating a need for alternative approaches.
Method: Employ generative language models for automated scoring through summarization and prompting techniques.
Result: Significant improvement in scoring accuracy with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
Conclusion: Generative language models with summarization and prompting are effective for automated scoring of long essays, outperforming encoder-based models like BERT.
Abstract: BERT and its variants are extensively explored for automated scoring. However, the 512-token limit of these encoder-based models is a deficiency for automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results revealed a substantial improvement in scoring accuracy, with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
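For readers unfamiliar with the metric, quadratic weighted kappa (QWK) can be computed directly with scikit-learn; the score arrays below are placeholder values, not data from the paper.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical integer essay scores from a human rater and a model.
human_scores = [2, 3, 4, 4, 1, 5]
model_scores = [2, 3, 4, 3, 1, 5]

# Quadratic weighting penalizes large disagreements more than small ones.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.4f}")
```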
[71] Leveraging Large Language Models to Identify Conversation Threads in Collaborative Learning
Prerna Ravi, Dong Won Lee, Beatriz Flamia, Jasmine David, Brandon Hanks, Cynthia Breazeal, Emma Anderson, Grace Lin
Main category: cs.CL
TL;DR: This paper investigates how explicit thread linkages in synchronous spoken dialogue can improve LLM-based coding of relational moves in group conversations, showing that clear thread information enhances coding performance.
Details
Motivation: Understanding how ideas flow in small-group conversations is critical for collaborative learning analysis. While threading has been studied in text settings, detecting threads in synchronous spoken dialogue remains challenging due to overlapping turns and implicit cues. LLMs struggle with long-context tasks that depend on tracing conversational links.
Method: The authors provide a systematic guidebook for identifying threads in synchronous multi-party transcripts and benchmark different LLM prompting strategies for automated threading. They test how threading influences performance on downstream coding of conversational analysis frameworks capturing collaborative actions like agreeing, building, and eliciting.
Result: Results show that providing clear conversational thread information improves LLM coding performance and underscores the heavy reliance of downstream analysis on well-structured dialogue. The study also discusses practical trade-offs in time and cost.
Conclusion: This work advances methods for combining LLMs and robust conversational thread structures to make sense of complex, real-time group interactions, emphasizing where human-AI hybrid approaches can yield the best value.
Abstract: Understanding how ideas develop and flow in small-group conversations is critical for analyzing collaborative learning. A key structural feature of these interactions is threading, the way discourse naturally organizes into interwoven topical strands that evolve over time. While threading has been widely studied in asynchronous text settings, detecting threads in synchronous spoken dialogue remains challenging due to overlapping turns and implicit cues. At the same time, large language models (LLMs) show promise for automating discourse analysis but often struggle with long-context tasks that depend on tracing these conversational links. In this paper, we investigate whether explicit thread linkages can improve LLM-based coding of relational moves in group talk. We contribute a systematic guidebook for identifying threads in synchronous multi-party transcripts and benchmark different LLM prompting strategies for automated threading. We then test how threading influences performance on downstream coding of conversational analysis frameworks that capture core collaborative actions such as agreeing, building, and eliciting. Our results show that providing clear conversational thread information improves LLM coding performance and underscores the heavy reliance of downstream analysis on well-structured dialogue. We also discuss practical trade-offs in time and cost, emphasizing where human-AI hybrid approaches can yield the best value. Together, this work advances methods for combining LLMs and robust conversational thread structures to make sense of complex, real-time group interactions.
[72] Once Upon an Input: Reasoning via Per-Instance Program Synthesis
Adam Stein, Neelay Velingker, Mayur Naik, Eric Wong
Main category: cs.CL
TL;DR: PIPS is a method that improves LLM reasoning by generating and refining programs at the instance-level using structural feedback, with dynamic selection between direct inference and program synthesis.
Details
Motivation: LLMs struggle with complex multi-step reasoning, and existing methods like Chain of Thought and Program of Thought often produce undesirable solutions, especially in algorithmic domains.
Method: Per-Instance Program Synthesis (PIPS) generates and refines programs at the instance-level using structural feedback without task-specific guidance or explicit test cases, with a confidence metric for dynamic selection between inference modes.
Result: PIPS improves absolute harmonic mean accuracy by up to 8.6% and 9.4% compared to PoT and CoT respectively, and reduces undesirable program generations by 65.1% on algorithmic tasks compared to PoT with Gemini-2.0-Flash.
Conclusion: PIPS effectively enhances LLM reasoning capabilities by combining instance-level program synthesis with structural feedback and dynamic mode selection, significantly outperforming existing methods across diverse benchmarks.
Abstract: Large language models (LLMs) excel at zero-shot inference but continue to struggle with complex, multi-step reasoning. Recent methods that augment LLMs with intermediate reasoning steps such as Chain of Thought (CoT) and Program of Thought (PoT) improve performance but often produce undesirable solutions, especially in algorithmic domains. We introduce Per-Instance Program Synthesis (PIPS), a method that generates and refines programs at the instance-level using structural feedback without relying on task-specific guidance or explicit test cases. To further improve performance, PIPS incorporates a confidence metric that dynamically chooses between direct inference and program synthesis on a per-instance basis. Experiments across three frontier LLMs and 30 benchmarks including all tasks of Big Bench Extra Hard (BBEH), visual question answering tasks, relational reasoning tasks, and mathematical reasoning tasks show that PIPS improves the absolute harmonic mean accuracy by up to 8.6% and 9.4% compared to PoT and CoT respectively, and reduces undesirable program generations by 65.1% on the algorithmic tasks compared to PoT with Gemini-2.0-Flash.
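The abstract does not specify how the confidence metric is computed, so the sketch below only illustrates the per-instance routing idea; `confidence_fn`, `direct_answer`, and `synthesize_and_run` are hypothetical stand-ins for the paper's components.

```python
from typing import Callable

def route_instance(
    question: str,
    confidence_fn: Callable[[str], float],      # hypothetical: scores suitability for program synthesis
    direct_answer: Callable[[str], str],        # hypothetical: plain LLM inference
    synthesize_and_run: Callable[[str], str],   # hypothetical: generate, refine, and execute a program
    threshold: float = 0.5,
) -> str:
    # Per-instance decision: synthesize a program only when confidence is high enough.
    if confidence_fn(question) >= threshold:
        return synthesize_and_run(question)
    return direct_answer(question)
```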
[73] Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement
Linyang He, Tianjun Zhong, Richard Antonello, Gavin Mischler, Micah Goldblum, Nima Mesgarani
Main category: cs.CL
TL;DR: A residual disentanglement method isolates lexicon, syntax, meaning, and reasoning components from LLM embeddings, revealing distinct neural signatures for reasoning in brain activity that peak later and recruit visual regions.
Details
Motivation: To overcome the entanglement of linguistic features in LLM representations that biases brain encoding analyses toward shallow features, making it difficult to isolate neural substrates of deeper cognitive processes like reasoning.
Method: Residual disentanglement method that probes LLMs to identify feature-specific layers, then iteratively regresses out lower-level representations to produce four nearly orthogonal embeddings for lexicon, syntax, meaning, and reasoning.
Result: The reasoning embedding uniquely predicts neural activity variance not explained by other features, recruits visual regions beyond language areas, peaks later (~350-400ms), and shows standard LLM embeddings primarily reflect shallow features, masking deeper cognitive contributions.
Conclusion: Disentangled embeddings reveal distinct neural processing for reasoning that occurs later in the hierarchy and extends beyond classical language areas, while standard LLM embeddings are misleading as they predominantly reflect shallow linguistic features.
Abstract: Understanding how the human brain progresses from processing simple linguistic inputs to performing high-level reasoning is a fundamental challenge in neuroscience. While modern large language models (LLMs) are increasingly used to model neural responses to language, their internal representations are highly “entangled,” mixing information about lexicon, syntax, meaning, and reasoning. This entanglement biases conventional brain encoding analyses toward linguistically shallow features (e.g., lexicon and syntax), making it difficult to isolate the neural substrates of cognitively deeper processes. Here, we introduce a residual disentanglement method that computationally isolates these components. By first probing an LM to identify feature-specific layers, our method iteratively regresses out lower-level representations to produce four nearly orthogonal embeddings for lexicon, syntax, meaning, and, critically, reasoning. We used these disentangled embeddings to model intracranial (ECoG) brain recordings from neurosurgical patients listening to natural speech. We show that: 1) This isolated reasoning embedding exhibits unique predictive power, accounting for variance in neural activity not explained by other linguistic features and even extending to the recruitment of visual regions beyond classical language areas. 2) The neural signature for reasoning is temporally distinct, peaking later (~350-400ms) than signals related to lexicon, syntax, and meaning, consistent with its position atop a processing hierarchy. 3) Standard, non-disentangled LLM embeddings can be misleading, as their predictive success is primarily attributable to linguistically shallow features, masking the more subtle contributions of deeper cognitive processing.
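The iterative regress-out step has a simple linear-algebra core. The minimal sketch below residualizes one embedding matrix against lower-level ones with ordinary least squares; the feature layout, dimensions, and ordering are assumptions, not the authors' exact pipeline.

```python
import numpy as np

def residualize(target: np.ndarray, *lower: np.ndarray) -> np.ndarray:
    """Remove the best linear reconstruction of `target` from lower-level embeddings.

    target: (n_samples, d_target); each lower matrix: (n_samples, d_i).
    """
    X = np.concatenate(lower, axis=1)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ beta

# Assumed hierarchy: syntax minus lexicon, meaning minus both, reasoning minus all three.
rng = np.random.default_rng(0)
lex, syn, sem, rsn = (rng.normal(size=(100, 16)) for _ in range(4))
syn_r = residualize(syn, lex)
sem_r = residualize(sem, lex, syn_r)
rsn_r = residualize(rsn, lex, syn_r, sem_r)
```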
[74] Interpreting and Mitigating Unwanted Uncertainty in LLMs
Tiasa Singha Roy, Ayush Rajesh Jhaveri, Ilias Triantafyllopoulos
Main category: cs.CL
TL;DR: LLMs exhibit unwanted uncertainty where they flip correct answers to incorrect ones when re-prompted. The study identifies specific non-retrieval attention heads that attend to misleading tokens, and masking them reduces flip behavior by up to 15% without negative side effects.
Details
Motivation: Unwanted uncertainty in LLMs undermines trust and poses serious risks in high-stakes domains, where models change previously correct answers to incorrect ones when re-prompted.
Method: Adapted Needle-in-a-Haystack retrieval framework with Flip-style re-evaluation prompts to simulate answer-flipping scenarios, then identified and masked specific non-retrieval attention heads that disproportionately attend to misleading tokens.
Result: Masking identified attention heads reduced flip behavior by up to 15% without introducing incoherence or overcorrection, though trade-offs were observed in downstream tasks.
Conclusion: The findings contribute to mechanistic interpretability and present a simple yet effective technique for mitigating uncertainty-driven failure modes in LLMs, identifying that retrieval heads are not primarily responsible for avoiding uncertainty.
Abstract: Despite their impressive capabilities, Large Language Models (LLMs) exhibit unwanted uncertainty, a phenomenon where a model changes a previously correct answer into an incorrect one when re-prompted. This behavior undermines trust and poses serious risks in high-stakes domains. In this work, we investigate the mechanisms that drive this phenomenon. We adapt the Needle-in-a-Haystack retrieval framework and integrate a Flip-style re-evaluation prompt to simulate realistic answer-flipping scenarios. We find that retrieval heads are not primarily responsible for avoiding uncertainty. Instead, we identify a small set of non-retrieval attention heads that disproportionately attend to misleading tokens in uncertain contexts. Masking these heads yields significant improvements, reducing flip behavior by up to 15% without introducing incoherence or overcorrection. However, when tested for downstream tasks, we observe trade-offs with flip behavior. Our findings contribute to the growing field of mechanistic interpretability and present a simple yet effective technique for mitigating uncertainty-driven failure modes in LLMs.
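The paper's exact masking procedure is not given; one generic way to ablate individual heads in a transformer is a forward pre-hook that zeroes a head's slice of the attention output before the output projection. The `o_proj` name and concatenated-head layout are assumptions (Hugging Face-style).

```python
import torch

def mask_attention_heads(o_proj: torch.nn.Module, head_dim: int, heads):
    """Zero selected heads' contributions by blanking their slice of o_proj's input.

    Assumes the attention output is laid out as concatenated heads before o_proj.
    """
    def pre_hook(module, args):
        (hidden,) = args
        hidden = hidden.clone()
        for h in heads:
            hidden[..., h * head_dim : (h + 1) * head_dim] = 0.0
        return (hidden,)  # returned tuple replaces the module's input
    return o_proj.register_forward_pre_hook(pre_hook)

# Usage sketch (hypothetical layer object):
# handle = mask_attention_heads(layer.self_attn.o_proj, head_dim=128, heads=[3, 7])
# ... run generation ...
# handle.remove()
```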
[75] A Comprehensive Dataset for Human vs. AI Generated Text Detection
Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amitava Das
Main category: cs.CL
TL;DR: A comprehensive dataset of 58,000+ text samples combining NYT articles with AI-generated versions from multiple LLMs, providing baselines for AI text detection (58.35% accuracy) and model attribution (8.92% accuracy).
Details
Motivation: To address concerns about AI-generated text authenticity and misinformation by creating large-scale, diverse datasets for developing reliable detection and attribution methods.
Method: Created dataset combining authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs (Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, GPT-4-o) using original article abstracts as prompts.
Result: Established baseline results: 58.35% accuracy for distinguishing human-written from AI-generated text, and 8.92% accuracy for attributing AI texts to their generating models.
Conclusion: The dataset bridges real-world journalistic content with modern generative models to catalyze development of robust detection and attribution methods, fostering trust and transparency in generative AI.
Abstract: The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Addressing the challenge of reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising over 58,000 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs including Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides the original article abstracts used as prompts alongside the full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, achieving an accuracy of 58.35%, and attributing AI texts to their generating models with an accuracy of 8.92%. By bridging real-world journalistic content with modern generative models, the dataset aims to catalyze the development of robust detection and attribution methods, fostering trust and transparency in the era of generative AI. Our dataset is available at: https://huggingface.co/datasets/gsingh1-py/train.
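The paper does not describe its baseline classifier; as a point of reference, a minimal TF-IDF plus logistic-regression detector over such a corpus might look like the sketch below. The variable names and label encoding are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),
    LogisticRegression(max_iter=1000),
)
# Assumed: train_texts/test_texts are lists of article strings,
# labels use 0 = human-written, 1 = AI-generated.
# detector.fit(train_texts, train_labels)
# accuracy = detector.score(test_texts, test_labels)
```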
[76] Batch Speculative Decoding Done Right
Ranran Haoran Zhang, Soumik Dey, Ashirbad Mishra, Hansi Wu, Binbin Li, Rui Zhang
Main category: cs.CL
TL;DR: The paper addresses the ragged tensor problem in batch speculative decoding for LLM inference, introduces EQSPEC for correctness verification and EXSPEC for efficient batch processing, achieving up to 3x throughput improvement while maintaining 95% output equivalence.
Details
Motivation: Batch speculative decoding is essential for production serving but introduces the ragged tensor problem where sequences accept different numbers of draft tokens, breaking alignment and corrupting position IDs, attention masks, and KV-cache state, leading to output equivalence violations.
Method: The authors characterize synchronization requirements for correctness, present EQSPEC to verify correctness and measure realignment overhead, and introduce EXSPEC which maintains a sliding pool of sequences and dynamically forms same-length groups to reduce realignment overhead while preserving speculative speedups.
Result: On SpecBench dataset across multiple target/draft model pairs, the approach achieves up to 3x throughput improvement at batch size 8 compared to batch size 1, with efficient scaling through batch size 8 while maintaining 95% output equivalence. Realignment consumes 40% of overhead in baseline implementations.
Conclusion: The proposed method successfully addresses the ragged tensor problem in batch speculative decoding, enabling efficient batch processing with significant throughput improvements while maintaining output correctness, requiring no custom kernels and integrating cleanly with existing inference stacks.
Abstract: Speculative decoding speeds up LLM inference by using a small draft model to propose multiple tokens that a target model verifies in parallel. Extending this idea to batches is essential for production serving, but it introduces the ragged tensor problem: sequences in the same batch accept different numbers of draft tokens, breaking right-alignment and corrupting position IDs, attention masks, and KV-cache state. We show that several existing batch implementations violate output equivalence-the fundamental requirement that speculative decoding must produce identical token sequences to standard autoregressive generation. These violations occur precisely due to improper handling of the ragged tensor problem. In response, we (1) characterize the synchronization requirements that guarantee correctness, (2) present a correctness-first batch speculative decoding EQSPEC that exposes realignment as consuming 40% of overhead, and (3) introduce EXSPEC, which maintains a sliding pool of sequences and dynamically forms same-length groups, to reduce the realignment overhead while preserving per-sequence speculative speedups. On the SpecBench dataset, across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our approach achieves up to 3x throughput improvement at batch size 8 compared to batch size 1, with efficient scaling through batch size 8, while maintaining 95% output equivalence. Our method requires no custom kernels and integrates cleanly with existing inference stacks. Our code is available at https://github.com/eBay/spec_dec.
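EXSPEC's implementation is not shown in the abstract; the sketch below captures only the stated idea of pooling in-flight sequences and batching those with equal lengths so the batch stays rectangular. The `Seq` type is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Seq:  # hypothetical in-flight sequence
    tokens: list = field(default_factory=list)

def form_same_length_batches(pool: list[Seq], batch_size: int) -> list[list[Seq]]:
    """Group pooled sequences by current length, then cut groups into batches."""
    groups = defaultdict(list)
    for seq in pool:
        groups[len(seq.tokens)].append(seq)  # equal lengths need no realignment
    batches = []
    for seqs in groups.values():
        for i in range(0, len(seqs), batch_size):
            batches.append(seqs[i : i + batch_size])
    return batches
```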
[77] Language Server CLI Empowers Language Agents with Process Rewards
Yifan Zhang, Lanser Contributors
Main category: cs.CL
TL;DR: Lanser-CLI is a CLI tool that orchestrates Language Server Protocol servers to provide deterministic, replayable workflows for coding agents and CI, addressing LLM hallucinations and edit mislocalization through robust addressing, analysis bundles, safety envelopes, and process rewards.
Details
Motivation: Large language models often hallucinate APIs and mislocalize edits, while language servers provide verified, IDE-grade facts about real code. The motivation is to bridge this gap by leveraging language servers' structural information and process rewards to align agent planning with program reality.
Method: Lanser-CLI introduces: (1) robust addressing via Selector DSL with relocation algorithm, (2) deterministic Analysis Bundles with stable content hashes, (3) safety envelope for mutating operations with preview and transactional apply, and (4) process-reward functional derived from Language Server facts.
Result: The system provides deterministic workflows under frozen snapshots with monotonicity properties for process rewards, making it suitable for process supervision and counterfactual analysis. It enables machine-checked, step-wise signals that align agent planning with program reality.
Conclusion: Language servers provide not only structural information but also actionable process rewards. Lanser-CLI successfully mediates LSP servers to deliver deterministic, replayable workflows that address LLM limitations and enable reliable coding agent operations.
Abstract: Large language models routinely hallucinate APIs and mislocalize edits, while language servers compute verified, IDE-grade facts about real code. We present Lanser-CLI, a CLI-first orchestration layer that pins and mediates a Language Server Protocol (LSP) server for coding agents and CI, exposing deterministic, replayable workflows. Our position is that language servers provide not only structural information (definitions, references, types, diagnostics) but also an actionable process reward: machine-checked, step-wise signals that align an agent’s planning loop with program reality. In this work, Lanser-CLI contributes: (i) a robust addressing scheme beyond brittle “file:line:col” via a Selector DSL (symbolic, AST-path, and content-anchored selectors) with a principled relocation algorithm; (ii) deterministic Analysis Bundles that normalize Language Server responses and capture environment/capability metadata with stable content hashes; (iii) a safety envelope for mutating operations (rename, code actions) with preview, workspace jails, and Git-aware, transactional apply; and (iv) a process-reward functional derived from Language Server facts (diagnostic deltas, disambiguation confidence, and safe-apply checks) that is computable online and replayable offline. We formalize determinism under frozen snapshots and establish a monotonicity property for the process reward, making it suitable for process supervision and counterfactual analysis. Project Page: https://github.com/yifanzhang-pro/lanser-cli
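A stable content hash over a normalized Language Server response can be obtained by canonical JSON serialization, as in the sketch below. The normalization details here are assumptions for illustration, not Lanser-CLI's actual scheme.

```python
import hashlib
import json

def bundle_hash(response: dict) -> str:
    # Canonical form: sorted keys, no insignificant whitespace, so the same
    # facts always hash identically regardless of server field ordering.
    canonical = json.dumps(response, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(bundle_hash({"symbol": "foo", "range": [3, 7]}))
```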
[78] Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, Yejin Choi
Main category: cs.CL
TL;DR: The paper introduces Infinity-Chat, a large-scale dataset for evaluating language model diversity in open-ended generation, revealing an “Artificial Hivemind” effect where models produce similar outputs across different queries.
Details
Motivation: Language models struggle to generate diverse creative content, raising concerns about homogenization of human thought through repeated exposure to similar outputs. Current methods for evaluating LM output diversity are limited beyond narrow tasks.
Method: Created Infinity-Chat dataset with 26K diverse, real-world open-ended user queries and 31,250 human annotations. Developed comprehensive taxonomy with 6 top-level categories and 17 subcategories for characterizing open-ended prompts. Conducted large-scale study of mode collapse in LMs.
Result: Revealed pronounced Artificial Hivemind effect with (1) intra-model repetition and (2) inter-model homogeneity. Found that LMs, reward models, and LM judges are less calibrated to human ratings on generations that elicit differing individual preferences, despite comparable overall quality.
Conclusion: Infinity-Chat provides the first large-scale resource for systematically studying real-world open-ended queries to LMs, offering critical insights to mitigate long-term AI safety risks posed by the Artificial Hivemind effect.
Abstract: Language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar outputs. Yet scalable methods for evaluating LM output diversity remain limited, especially beyond narrow tasks such as random number or name generation, or beyond repeated sampling from a single model. We introduce Infinity-Chat, a large-scale dataset of 26K diverse, real-world, open-ended user queries that admit a wide range of plausible answers with no single ground truth. We introduce the first comprehensive taxonomy for characterizing the full spectrum of open-ended prompts posed to LMs, comprising 6 top-level categories (e.g., brainstorm & ideation) that further breaks down to 17 subcategories. Using Infinity-Chat, we present a large-scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect in open-ended generation of LMs, characterized by (1) intra-model repetition, where a single model consistently generates similar responses, and more so (2) inter-model homogeneity, where different models produce strikingly similar outputs. Infinity-Chat also includes 31,250 human annotations, across absolute ratings and pairwise preferences, with 25 independent human annotations per example. This enables studying collective and individual-specific human preferences in response to open-ended queries. Our findings show that LMs, reward models, and LM judges are less well calibrated to human ratings on model generations that elicit differing idiosyncratic annotator preferences, despite maintaining comparable overall quality. Overall, INFINITY-CHAT presents the first large-scale resource for systematically studying real-world open-ended queries to LMs, revealing critical insights to guide future research for mitigating long-term AI safety risks posed by the Artificial Hivemind.
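The paper does not give its homogeneity formula; one simple proxy for inter-model homogeneity is the mean pairwise cosine similarity among embeddings of different models' responses to the same prompt, sketched below under that assumption.

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """embeddings: (n_models, dim) response embeddings for a single prompt."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    iu = np.triu_indices(len(unit), k=1)  # upper triangle: distinct model pairs
    return float(sims[iu].mean())         # higher = more inter-model homogeneity
```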
[79] Tagging-Augmented Generation: Assisting Language Models in Finding Intricate Knowledge In Long Contexts
Anwesan Pal, Karen Hovsepian, Tinghao Guo, Mengnan Zhao, Somendra Tripathi, Nikos Kanakaris, George Mihaila, Sumit Nigam
Main category: cs.CL
TL;DR: Tagging-Augmented Generation (TAG) is a lightweight data augmentation method that improves LLM performance on long-context QA tasks by adding tags or tag definitions to prompts, achieving up to 17% performance gains without altering document integrity.
Details
Motivation: Modern LLMs have limitations in effective QA and reasoning over long contexts. Existing approaches like RAG are sensitive to chunking/retrieval strategies and require extensive preprocessing, prompting the need for simpler solutions.
Method: Propose TAG - a lightweight data augmentation strategy that adds tags or tag definitions to QA prompts to boost LLM performance in long-context scenarios without changing retrieved documents.
Result: TAG achieves consistent performance gains: up to 17% for 32K token contexts and 2.9% improvement in complex reasoning for multi-hop queries requiring knowledge across wide text spans.
Conclusion: Tagging context or adding tag definitions to prompts effectively enhances LLM performance on long-context QA tasks, providing a simple yet powerful alternative to more complex approaches like RAG.
Abstract: Recent investigations into effective context lengths of modern flagship large language models (LLMs) have revealed major limitations in effective question answering (QA) and reasoning over long and complex contexts for even the largest and most impressive cadre of models. While approaches like retrieval-augmented generation (RAG) and chunk-based re-ranking attempt to mitigate this issue, they are sensitive to chunking, embedding and retrieval strategies and models, and furthermore, rely on extensive pre-processing, knowledge acquisition and indexing steps. In this paper, we propose Tagging-Augmented Generation (TAG), a lightweight data augmentation strategy that boosts LLM performance in long-context scenarios, without degrading and altering the integrity and composition of retrieved documents. We validate our hypothesis by augmenting two challenging and directly relevant question-answering benchmarks – NoLima and NovelQA – and show that tagging the context or even just adding tag definitions into QA prompts leads to consistent performance gains over the baseline – up to 17% for 32K token contexts, and 2.9% in complex reasoning question-answering for multi-hop queries requiring knowledge across a wide span of text. Additional details are available at https://sites.google.com/view/tag-emnlp.
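The abstract leaves the tag schema unspecified; a minimal version of prepending tag definitions to a QA prompt, with all tag names invented here for illustration, could look like this:

```python
def tag_augmented_prompt(context: str, question: str, tag_defs: dict[str, str]) -> str:
    """Prepend tag definitions to a long-context QA prompt without altering the context."""
    defs = "\n".join(f"<{tag}>: {meaning}" for tag, meaning in tag_defs.items())
    return (
        f"Tag definitions:\n{defs}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical tags, purely illustrative:
# tag_augmented_prompt(ctx, q, {"entity": "a named person or place",
#                               "event": "an action with participants"})
```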
[80] MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs
Yucheng Ning, Xixun Lin, Fang Fang, Yanan Cao
Main category: cs.CL
TL;DR: Proposes a systematic framework for evaluating factual accuracy in long-form LLM outputs using multi-agent verification and weighted metrics, with experiments showing larger LLMs maintain higher factual consistency.
Details
Motivation: Address concerns about factual accuracy of LLM outputs in high-risk domains like biomedicine, law, and education, where existing evaluation methods fail on long-form content due to complex reasoning chains and cumulative information.
Method: Integrates large-scale long-form datasets (LongHalluQA), multi-agent verification mechanisms (MAD-Fact debate-based system), and weighted evaluation metrics with fact importance hierarchy to capture varying significance of claims.
Result: Experiments on two benchmarks show larger LLMs generally maintain higher factual consistency, while domestic models excel on Chinese content.
Conclusion: Provides a structured framework for evaluating and enhancing factual reliability in long-form LLM outputs, guiding their safe deployment in sensitive domains.
Abstract: The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and education. Existing evaluation methods for short texts often fail on long-form content due to complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach integrating large-scale long-form datasets, multi-agent verification mechanisms, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long-form factuality dataset, and develop MAD-Fact, a debate-based multi-agent verification system. We introduce a fact importance hierarchy to capture the varying significance of claims in long-form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while domestic models excel on Chinese content. Our work provides a structured framework for evaluating and enhancing factual reliability in long-form LLM outputs, guiding their safe deployment in sensitive domains.
[81] Measuring Teaching with LLMs
Michael Hardy
Main category: cs.CL
TL;DR: Custom LLMs using sentence embeddings achieve human-level performance in measuring teaching quality from classroom transcripts, outperforming general-purpose models and showing alignment with teacher value-added measures.
Details
Motivation: Objective and scalable measurement of teaching quality is challenging, and general-purpose LLMs struggle with complex classroom observation instruments.
Method: Custom LLMs built on sentence-level embeddings with five different embedding types under data-efficient training to prevent overfitting.
Result: Models achieved human-level performance (correlation >0.65 with expert ratings) and surpassed average human-human rater correlation. Aggregate scores aligned with teacher value-added measures.
Conclusion: Specialized sentence-embedding models provide a viable methodology for AI-driven instructional measurement, offering scalable and reliable feedback for educator development.
Abstract: Objective and scalable measurement of teaching quality is a persistent challenge in education. While Large Language Models (LLMs) offer potential, general-purpose models have struggled to reliably apply complex, authentic classroom observation instruments. This paper uses custom LLMs built on sentence-level embeddings, an architecture better suited for the long-form, interpretive nature of classroom transcripts than conventional subword tokenization. We systematically evaluate five different sentence embeddings under a data-efficient training regime designed to prevent overfitting. Our results demonstrate that these specialized models can achieve human-level and even super-human performance, with correlations against expert human ratings above 0.65, surpassing the average human-human rater correlation. Further, through analysis of annotation context windows, we find that more advanced models-those better aligned with human judgments-attribute a larger share of score variation to lesson-level features rather than isolated utterances, challenging the sufficiency of single-turn annotation paradigms. Finally, to assess external validity, we find that aggregate model scores align with teacher value-added measures, indicating they are capturing features relevant to student learning. However, this trend does not hold at the individual item level, suggesting that while the models learn useful signals, they have not yet achieved full generalization. This work establishes a viable and powerful new methodology for AI-driven instructional measurement, offering a path toward providing scalable, reliable, and valid feedback for educator development.
[82] Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures
Shenran Wang, Timothy Tin-Long Tse, Jian Zhu
Main category: cs.CL
TL;DR: In-depth evaluation of in-context learning across transformer, state-space, and hybrid LLMs reveals similar task performance but different internal mechanisms, with function vectors primarily in self-attention and Mamba layers.
Details
Motivation: To understand how different LLM architectures (transformer, state-space, hybrid) perform in-context learning tasks and investigate their internal mechanisms despite similar behavioral outputs.
Method: Used behavioral probing and intervention-based methods to evaluate ICL on knowledge-based tasks across different LLM architectures.
Result: Function vectors for ICL are primarily located in self-attention and Mamba layers; Mamba2 uses different mechanisms from FVs; FVs are more important for parametric knowledge retrieval than contextual understanding.
Conclusion: LLMs of different architectures can achieve similar task performance through different internal mechanisms, highlighting the need for combined behavioral and mechanistic analysis to understand LLM capabilities.
Abstract: We perform in-depth evaluations of in-context learning (ICL) on state-of-the-art transformer, state-space, and hybrid large language models over two categories of knowledge-based ICL tasks. Using a combination of behavioral probing and intervention-based methods, we have discovered that, while LLMs of different architectures can behave similarly in task performance, their internals could remain different. We discover that function vectors (FVs) responsible for ICL are primarily located in the self-attention and Mamba layers, and speculate that Mamba2 uses a different mechanism from FVs to perform ICL. FVs are more important for ICL involving parametric knowledge retrieval, but not for contextual knowledge understanding. Our work contributes to a more nuanced understanding across architectures and task types. Methodologically, our approach also highlights the importance of combining both behavioural and mechanistic analyses to investigate LLM capabilities.
[83] LangLingual: A Personalised, Exercise-oriented English Language Learning Tool Leveraging Large Language Models
Sammriddh Gupta, Sonit Singh, Aditya Joshi, Mira Kim
Main category: cs.CL
TL;DR: LangLingual is a conversational agent using LangChain and LLMs to provide grammar-focused feedback, context-aware exercises, and proficiency tracking for language learners.
Details
Motivation: Language educators need to provide rich learning experiences but face limitations in feedback and practice delivery.
Method: Built using LangChain framework and Large Language Models to create a conversational agent with real-time grammar feedback and exercise generation.
Result: The system demonstrated strong usability, positive learning outcomes, and encouraging learner engagement.
Conclusion: LangLingual successfully addresses limitations in language education through AI-powered conversational feedback and practice.
Abstract: Language educators strive to create a rich experience for learners, while they may be restricted in the extent of feedback and practice they can provide. We present the design and development of LangLingual, a conversational agent built using the LangChain framework and powered by Large Language Models. The system is specifically designed to provide real-time, grammar-focused feedback, generate context-aware language exercises and track learner proficiency over time. The paper discusses the architecture, implementation and evaluation of LangLingual in detail. The results indicate strong usability, positive learning outcomes and encouraging learner engagement.
[84] Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang, Hongkun Yu
Main category: cs.CL
TL;DR: TIR-Judge is an RL framework for training LLM judges that integrates code execution for precise evaluation, achieving superior performance over reasoning-based judges without requiring distilled data.
Details
Motivation: Current LLM judges rely solely on text-based reasoning, limiting their ability to verify complex constraints or perform accurate computations, creating a need for tool-integrated evaluation approaches.
Method: End-to-end RL framework with three principles: diverse training across verifiable/non-verifiable domains, flexible judgment formats (pointwise, pairwise, listwise), and iterative RL bootstrapping from initial model without distillation.
Result: Surpasses reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), achieves listwise performance comparable to Claude-Opus-4 with only 8B parameters, and TIR-Judge-Zero matches distilled variants’ performance.
Conclusion: Tool-augmented judges can self-evolve through iterative reinforcement learning without requiring distilled judge trajectories, demonstrating the effectiveness of tool-integrated reasoning for LLM evaluation.
Abstract: Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameters. Remarkably, TIR-Judge-Zero, trained entirely without distilled judge trajectories, matches the performance of distilled variants, demonstrating that tool-augmented judges can self-evolve through iterative reinforcement learning.
[85] Knocking-Heads Attention
Zhanchao Zhou, Xiaodong Chen, Haoxing Chen, Zhenzhong Lan, Jianguo Li
Main category: cs.CL
TL;DR: KHA enables cross-head interactions in attention mechanisms by using shared diagonal projections, improving training stability and performance with minimal overhead.
Details
Motivation: Standard multi-head attention and its variants concatenate outputs from isolated heads without strong interaction, limiting representational capacity as head count increases.
Method: Knocking-heads attention applies a shared, diagonally-initialized projection matrix across all heads to facilitate cross-head feature-level interactions before scaled dot-product attention.
Result: KHA achieves superior and more stable training dynamics, better performance across downstream tasks compared to baseline attention mechanisms, with only minimal parameter and FLOP overhead.
Conclusion: KHA effectively enhances attention mechanisms by enabling cross-head interactions while maintaining efficiency, making it a promising drop-in replacement for existing attention variants.
Abstract: Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens individual head capacity, and existing attention mechanisms - whether standard MHA or its variants like grouped-query attention (GQA) and grouped-tied attention (GTA) - simply concatenate outputs from isolated heads without strong interaction. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to “knock” on each other - facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally-initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to baseline attention mechanisms, KHA brings superior and more stable training dynamics, achieving better performance across downstream tasks.
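One plausible reading of the shared, diagonally-initialized projection is a head-mixing matrix applied to per-head features before attention; the sketch below is an interpretation of that description, not the authors' code.

```python
import torch
import torch.nn as nn

class KnockingHeads(nn.Module):
    """Mix per-head features with a shared matrix initialized to the identity.

    Identity initialization preserves head specialization at the start of
    training; learned off-diagonal weights later let heads 'knock' on each other.
    """
    def __init__(self, n_heads: int):
        super().__init__()
        self.mix = nn.Parameter(torch.eye(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_heads, seq_len, head_dim); mix over the head axis.
        return torch.einsum("hg,bgsd->bhsd", self.mix, x)

q = torch.randn(2, 8, 16, 64)
assert torch.allclose(KnockingHeads(8)(q), q)  # no-op at initialization
```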
[86] Quality-Aware Translation Tagging in Multilingual RAG system
Hoyeon Moon, Byeolhee Kim, Nikhil Verma
Main category: cs.CL
TL;DR: QTT-RAG improves multilingual RAG by evaluating translation quality across semantic equivalence, grammatical accuracy, and naturalness, then attaching quality scores as metadata to help LLMs make better decisions without altering original content.
Details
Motivation: Existing mRAG approaches either assume good translation quality or use rewriting methods that cause factual distortion and hallucinations, especially problematic for low-resource languages.
Method: Proposes Quality-Aware Translation Tagging (QTT-RAG) that explicitly evaluates translation quality along three dimensions and attaches scores as metadata, preserving original content integrity.
Result: Outperforms CrossRAG and DKM-RAG baselines on XORQA and MKQA benchmarks across 6 LLMs (2.4B-14B parameters) for Korean, Finnish, and Chinese, preserving factual integrity while enabling informed generator decisions.
Conclusion: QTT-RAG offers a practical and robust solution for effectively using cross-lingual documents in low-resource settings with limited native language documents across multilingual domains.
Abstract: Multilingual Retrieval-Augmented Generation (mRAG) often retrieves English documents and translates them into the query language for low-resource settings. However, poor translation quality degrades response generation performance. Existing approaches either assume sufficient translation quality or utilize the rewriting method, which introduces factual distortion and hallucinations. To mitigate these problems, we propose Quality-Aware Translation Tagging in mRAG (QTT-RAG), which explicitly evaluates translation quality along three dimensions-semantic equivalence, grammatical accuracy, and naturalness & fluency-and attach these scores as metadata without altering the original content. We evaluate QTT-RAG against CrossRAG and DKM-RAG as baselines in two open-domain QA benchmarks (XORQA, MKQA) using six instruction-tuned LLMs ranging from 2.4B to 14B parameters, covering two low-resource languages (Korean and Finnish) and one high-resource language (Chinese). QTT-RAG outperforms the baselines by preserving factual integrity while enabling generator models to make informed decisions based on translation reliability. This approach allows for effective usage of cross-lingual documents in low-resource settings with limited native language documents, offering a practical and robust solution across multilingual domains.
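The exact metadata format is not given in the abstract; a sketch of attaching the three quality scores without altering the translated text, with dimension names and scale assumed here, might be:

```python
def quality_tagged(translation: str, scores: dict[str, float]) -> str:
    """Prefix a translated document with its quality scores as metadata."""
    # Assumed scores, e.g. {"semantic": 4.5, "grammar": 4.0, "naturalness": 3.5}
    meta = ", ".join(f"{dim}={val:.1f}" for dim, val in scores.items())
    return f"[translation quality: {meta}]\n{translation}"
```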
[87] A Survey on LLM Mid-training
Chengying Tu, Xuemiao Zhang, Rongxiang Weng, Rumei Li, Chen Zhang, Yang Bai, Hongfei Yan, Jingang Wang, Xunliang Cai
Main category: cs.CL
TL;DR: Mid-training is a critical stage between pre-training and post-training that uses intermediate data and compute to systematically enhance specific LLM capabilities while maintaining foundational competencies.
Details
Motivation: To formalize and clarify the role of mid-training in LLM development, as recent advances show it's a vital bridge that enables targeted capability enhancement without compromising core competencies.
Method: Survey approach analyzing optimization frameworks including data curation, training strategies, and model architecture optimization for mid-training, with examination of mainstream model implementations.
Result: Provides a comprehensive taxonomy of mid-training, formal definitions, and actionable insights showing how it serves as a distinct critical stage in progressive LLM capability development.
Conclusion: Mid-training is a crucial distinct stage in LLM development that systematically bridges pre-training and post-training, enabling targeted enhancement of specific capabilities while preserving foundational competencies.
Abstract: Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.
[88] MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models
Suchan Lee, Jihoon Choi, Sohyeon Lee, Minseok Song, Bong-Gyu Jang, Hwanjo Yu, Soyeon Caren Han
Main category: cs.CL
TL;DR: MAP4TS is a multi-aspect prompting framework that incorporates classical time-series analysis into LLM-based forecasting, outperforming state-of-the-art methods across diverse datasets.
Details
Motivation: Existing multimodal approaches overlook the distinct statistical properties and temporal dependencies fundamental to time-series data, creating a gap in LLM-based forecasting.
Method: Proposes four specialized prompt components: Global Domain Prompt (dataset-level context), Local Domain Prompt (recent trends), Statistical Prompt (ACF/PACF insights), and Temporal Prompt (Fourier analysis), combined with raw embeddings through cross-modality alignment.
Result: Extensive experiments across eight datasets show MAP4TS consistently outperforms state-of-the-art LLM-based methods. GPT-2 backbones with structured prompts outperform larger models like LLaMA in long-term forecasting.
Conclusion: Prompt-aware designs significantly enhance performance stability, and structured prompts enable smaller models to outperform larger ones in time-series forecasting tasks.
Abstract: Recent advances have investigated the use of pretrained large language models (LLMs) for time-series forecasting by aligning numerical inputs with LLM embedding spaces. However, existing multimodal approaches often overlook the distinct statistical properties and temporal dependencies that are fundamental to time-series data. To bridge this gap, we propose MAP4TS, a novel Multi-Aspect Prompting Framework that explicitly incorporates classical time-series analysis into the prompt design. Our framework introduces four specialized prompt components: a Global Domain Prompt that conveys dataset-level context, a Local Domain Prompt that encodes recent trends and series-specific behaviors, and a pair of Statistical and Temporal Prompts that embed handcrafted insights derived from autocorrelation (ACF), partial autocorrelation (PACF), and Fourier analysis. Multi-Aspect Prompts are combined with raw time-series embeddings and passed through a cross-modality alignment module to produce unified representations, which are then processed by an LLM and projected for final forecasting. Extensive experiments across eight diverse datasets show that MAP4TS consistently outperforms state-of-the-art LLM-based methods. Our ablation studies further reveal that prompt-aware designs significantly enhance performance stability and that GPT-2 backbones, when paired with structured prompts, outperform larger models like LLaMA in long-term forecasting tasks.
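The Statistical Prompt embeds handcrafted ACF/PACF insights; the sketch below shows a plain-numpy autocorrelation estimate and the kind of sentence such a prompt could summarize. The lag count and phrasing are assumptions, not the paper's template.

```python
import numpy as np

def acf(x: np.ndarray, max_lag: int) -> np.ndarray:
    """Sample autocorrelation at lags 0..max_lag for a 1-D series."""
    x = x - x.mean()
    denom = float(x @ x)
    return np.array([1.0] + [float(x[:-k] @ x[k:]) / denom for k in range(1, max_lag + 1)])

series = np.sin(np.linspace(0, 20 * np.pi, 480)) + 0.1 * np.random.default_rng(0).normal(size=480)
top_lag = int(np.argmax(acf(series, 60)[1:]) + 1)  # strongest nonzero lag
statistical_prompt = f"The series is most strongly autocorrelated at lag {top_lag}."
```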
[89] Leveraging Hierarchical Organization for Medical Multi-document Summarization
Yi-Li Hsu, Katelyn X. Mei, Lucy Lu Wang
Main category: cs.CL
TL;DR: Hierarchical structures in medical multi-document summarization improve model performance and human preference while maintaining factuality, coverage, and coherence.
Details
Motivation: To investigate whether hierarchical structures can better organize and contextualize information across documents in medical multi-document summarization compared to traditional flat methods.Method: Tested two hierarchical organization approaches across three large language models, with comprehensive evaluations using automated metrics, model-based metrics, and domain expert assessment of multiple quality dimensions.
Result: Human experts preferred model-generated summaries over human-written ones. Hierarchical approaches preserved factuality, coverage, and coherence while increasing human preference. GPT-4 judgments aligned well with human evaluations on objective criteria.
Conclusion: Hierarchical structures improve clarity of medical summaries and human preference while maintaining content coverage, offering a practical enhancement for medical multi-document summarization.
Abstract: Medical multi-document summarization (MDS) is a complex task that requires effectively managing cross-document relationships. This paper investigates whether incorporating hierarchical structures in the inputs of MDS can improve a model’s ability to organize and contextualize information across documents compared to traditional flat summarization methods. We investigate two ways of incorporating hierarchical organization across three large language models (LLMs), and conduct comprehensive evaluations of the resulting summaries using automated metrics, model-based metrics, and domain expert evaluation of preference, understandability, clarity, complexity, relevance, coverage, factuality, and coherence. Our results show that human experts prefer model-generated summaries over human-written summaries. Hierarchical approaches generally preserve factuality, coverage, and coherence of information, while also increasing human preference for summaries. Additionally, we examine whether simulated judgments from GPT-4 align with human judgments, finding higher agreement along more objective evaluation facets. Our findings demonstrate that hierarchical structures can improve the clarity of medical summaries generated by models while maintaining content coverage, providing a practical way to improve human preference for generated summaries.
[90] Flexing in 73 Languages: A Single Small Model for Multilingual Inflection
Tomáš Sourada, Jana Straková
Main category: cs.CL
TL;DR: A compact multilingual inflection model that generates word forms from lemmas across 73 languages, outperforming monolingual baselines and simplifying deployment.
Details
Motivation: To address the lack of open-source, general-purpose multilingual morphological inflection systems that can handle unseen words across many languages.Method: Joint training on data from 73 languages using a frequency-weighted, lemma-disjoint train-dev-test resampling procedure on Universal Dependencies treebanks.
Result: Outperforms monolingual baselines in most languages, is lightweight, robust to unseen words, and eliminates the need for managing multiple separate models.
Conclusion: Multilingual modeling is effective for inflection tasks and offers practical deployment benefits by replacing dozens of monolingual models with a single unified system.
Abstract: We present a compact, single-model approach to multilingual inflection, the task of generating inflected word forms from base lemmas to express grammatical categories. Our model, trained jointly on data from 73 languages, is lightweight, robust to unseen words, and outperforms monolingual baselines in most languages. This demonstrates the effectiveness of multilingual modeling for inflection and highlights its practical benefits: simplifying deployment by eliminating the need to manage and retrain dozens of separate monolingual models. In addition to the standard SIGMORPHON shared task benchmarks, we evaluate our monolingual and multilingual models on 73 Universal Dependencies (UD) treebanks, extracting lemma-tag-form triples and their frequency counts. To ensure realistic data splits, we introduce a novel frequency-weighted, lemma-disjoint train-dev-test resampling procedure. Our work addresses the lack of an open-source, general-purpose, multilingual morphological inflection system capable of handling unseen words across a wide range of languages, including Czech. All code is publicly released at: https://github.com/tomsouri/multilingual-inflection.
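The frequency-weighted, lemma-disjoint resampling can be pictured as follows: whole lemmas (never individual forms) are assigned to splits so that the token mass of each split tracks a target share. The greedy assignment below is a sketch under our own assumptions; the paper's exact procedure may differ.

```python
import random
from collections import defaultdict

def lemma_disjoint_split(items, shares=(0.8, 0.1, 0.1), seed=0):
    """items: (lemma, tag, form, corpus_count) tuples."""
    by_lemma = defaultdict(list)
    for it in items:
        by_lemma[it[0]].append(it)
    lemmas = sorted(by_lemma)
    random.Random(seed).shuffle(lemmas)

    total = sum(it[3] for it in items)            # total token frequency
    targets = [s * total for s in shares]
    splits, filled = ([], [], []), [0.0, 0.0, 0.0]
    for lemma in lemmas:
        # assign the whole lemma to the most under-filled split, so each
        # split's token mass approaches its target share
        i = min(range(3), key=lambda j: filled[j] / targets[j])
        splits[i].extend(by_lemma[lemma])
        filled[i] += sum(it[3] for it in by_lemma[lemma])
    return splits                                  # lemma-disjoint train/dev/test

train, dev, test = lemma_disjoint_split([
    ("jít", "VERB;Pres;3;Sg", "jde", 120),
    ("jít", "VERB;Past;Masc;Sg", "šel", 80),
    ("kočka", "NOUN;Nom;Pl", "kočky", 15),
])
```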
[91] Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation
Shiwei Li, Xiandi Luo, Haozhao Wang, Xing Tang, Ziqiang Cui, Dugang Liu, Yuhua Li, Xiuqiang He, Ruixuan Li
Main category: cs.CL
TL;DR: TopLoRA improves standard LoRA by dynamically adjusting LoRA weights based on input tokens, enabling token-wise input-output projections without increasing rank.
Details
Motivation: Standard LoRA shares the same weights for all input tokens, limiting its ability to capture token-specific information due to semantic differences among tokens.Method: TopLoRA introduces token-wise projected LoRA with weights expressed as BΣ_XA, where A and B are low-rank matrices, and Σ_X is a diagonal matrix generated from each input token X, enabling dynamic weight adjustment per token.
Result: Extensive experiments across multiple models and datasets show that TopLoRA consistently outperforms LoRA and its variants.
Conclusion: TopLoRA achieves more granular adaptation by learning token-wise LoRA weights without increasing rank, providing better performance than standard LoRA approaches.
Abstract: Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method widely used in large language models (LLMs). LoRA essentially describes the projection of an input space into a low-dimensional output space, with the dimensionality determined by the LoRA rank. In standard LoRA, all input tokens share the same weights and undergo an identical input-output projection. This limits LoRA’s ability to capture token-specific information due to the inherent semantic differences among tokens. To address this limitation, we propose Token-wise Projected Low-Rank Adaptation (TopLoRA), which dynamically adjusts LoRA weights according to the input token, thereby learning token-wise input-output projections in an end-to-end manner. Formally, the weights of TopLoRA can be expressed as $B\Sigma_X A$, where $A$ and $B$ are low-rank matrices (as in standard LoRA), and $\Sigma_X$ is a diagonal matrix generated from each input token $X$. Notably, TopLoRA does not increase the rank of LoRA weights but achieves more granular adaptation by learning token-wise LoRA weights (i.e., token-wise input-output projections). Extensive experiments across multiple models and datasets demonstrate that TopLoRA consistently outperforms LoRA and its variants. The code is available at https://github.com/Leopold1423/toplora-neurips25.
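A minimal PyTorch sketch of the token-wise projection $B\Sigma_X A$ follows. Modeling the gate that produces the diagonal $\Sigma_X$ as a single linear layer is our assumption, not a confirmed detail of TopLoRA.

```python
import torch
import torch.nn as nn

class TopLoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)  # frozen pretrained weight
        self.base.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.gate = nn.Linear(d_in, rank)               # emits sigma_X per token
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); sigma differs per token, unlike standard LoRA
        sigma = self.gate(x)                            # (batch, seq, rank)
        delta = ((x @ self.A.T) * sigma) @ self.B.T     # B diag(sigma_X) A x
        return self.base(x) + self.scale * delta

layer = TopLoRALinear(d_in=64, d_out=64, rank=8)
out = layer(torch.randn(2, 10, 64))          # token-wise adapted projection
```

Note the rank is unchanged: the only addition over standard LoRA is the per-token diagonal, which is what makes the input-output projection token-wise.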
[92] Corpus Frequencies in Morphological Inflection: Do They Matter?
Tomáš Sourada, Jana Straková
Main category: cs.CL
TL;DR: This paper explores incorporating corpus frequency information into morphological inflection systems through frequency-weighted train-test splits, token accuracy evaluation, and frequency-aware training sampling.
Details
Motivation: Traditional morphological inflection systems treat all words equally, but real-world usage follows frequency distributions. The paper aims to better reflect natural language frequency patterns in system development for improved deployment performance.Method: Three frequency-aware approaches: (1) frequency-weighted train-dev-test splits combining lemma-disjoint and frequency distribution strategies, (2) token accuracy evaluation that weights frequent words more heavily, and (3) frequency-aware training that incorporates word frequency into sampling.
Result: Frequency-aware training outperformed uniform sampling in 26 out of 43 languages tested, demonstrating the effectiveness of incorporating frequency information.
Conclusion: Incorporating corpus frequency information through multiple dimensions (data splitting, evaluation, and training) improves morphological inflection systems and better reflects real-world language usage patterns.
Abstract: The traditional approach to morphological inflection (the task of modifying a base word (lemma) to express grammatical categories) has been, for decades, to consider lexical entries of lemma-tag-form triples uniformly, lacking any information about their frequency distribution. However, in production deployment, one might expect the user inputs to reflect a real-world distribution of frequencies in natural texts. With future deployment in mind, we explore the incorporation of corpus frequency information into the task of morphological inflection along three key dimensions during system development: (i) for train-dev-test split, we combine a lemma-disjoint approach, which evaluates the model’s generalization capabilities, with a frequency-weighted strategy to better reflect the realistic distribution of items across different frequency bands in training and test sets; (ii) for evaluation, we complement the standard type accuracy (often referred to simply as accuracy), which treats all items equally regardless of frequency, with token accuracy, which assigns greater weight to frequent words and better approximates performance on running text; (iii) for training data sampling, we introduce a method novel in the context of inflection, frequency-aware training, which explicitly incorporates word frequency into the sampling process. We show that frequency-aware training outperforms uniform sampling in 26 out of 43 languages.
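The contrast between the two evaluation metrics is easy to state in code; the data structures below are illustrative.

```python
def type_accuracy(items):
    """items: (gold_form, predicted_form, corpus_count) triples."""
    return sum(g == p for g, p, _ in items) / len(items)

def token_accuracy(items):
    total = sum(c for *_, c in items)
    return sum(c for g, p, c in items if g == p) / total

items = [("kočky", "kočky", 900),    # frequent item, predicted correctly
         ("koček", "kočkách", 90),   # mid-frequency miss
         ("kočkám", "kočkám", 10)]
print(type_accuracy(items))   # 0.667: every type counts once
print(token_accuracy(items))  # 0.910: frequent correct items dominate
```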
[93] ENTP: Enhancing Low-Quality SFT Data via Neural-Symbolic Text Purge-Mix
Zile Yang, Ling Li, Na Di, Jinlong Pang, Yao Zhou, Hao Cheng, Bo Han, Jiaheng Wei
Main category: cs.CL
TL;DR: ENTP framework revitalizes low-quality SFT data through neural-symbolic purification and reconstruction, outperforming established data-selection methods and even full dataset fine-tuning.
Details
Motivation: Existing quality-first SFT paradigms discard valuable low-quality data and rely on imperfect filters, overlooking potential signals in noisy samples.Method: Neural-symbolic framework with symbolic purification (pruning noisy samples using statistical priors) and neural reconstruction (synthesizing enriched instruction-response pairs using latent representations and model knowledge).
Result: ENTP-augmented datasets from low-quality data outperform 13 established data-selection baselines across five benchmarks and surpass fine-tuning on full original dataset (~300K examples).
Conclusion: Low-quality data has untapped potential, and intelligent purification/synthesis is crucial for efficient instruction alignment.
Abstract: Supervised Fine-Tuning (SFT) adapts pre-trained Large Language Models (LLMs) to domain-specific instructions by training on a carefully curated subset of high-quality instruction-response pairs, typically drawn from a larger dataset that often contains many low-quality or noisy samples. However, existing quality-first paradigms often overlook valuable signals in discarded low-quality data and rely on imperfect quality filters. We introduce ENTP (Enhancing low-quality SFT data via Neural-symbolic Text Purge-Mix), a framework that revitalizes low-quality corpora through symbolic purification and neural reconstruction. The symbolic module identifies and prunes noisy samples based on statistical priors, while the neural component synthesizes enriched instruction-response pairs by leveraging latent representations and model knowledge. This neural-symbolic synergy enhances data informativeness and diversity. Experiments show that ENTP-augmented datasets, constructed exclusively from low-quality data, outperform 13 established data-selection baselines across five instruction-following benchmarks, and even surpass fine-tuning on the full original dataset (approximately 300K examples). Our results highlight the untapped potential of low-quality data and underscore the importance of intelligent purification and synthesis for efficient instruction alignment.
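The symbolic-purification step can be pictured as pruning by simple statistical priors. The three checks below (length bounds, a word-repetition ratio, exact-duplicate removal) are illustrative stand-ins; the abstract does not spell out which priors ENTP actually uses.

```python
def purge(pairs, min_len=4, max_len=2048, max_repeat=0.5):
    """pairs: (instruction, response) tuples; keep only plausible samples."""
    kept, seen = [], set()
    for instruction, response in pairs:
        words = response.split()
        if not (min_len <= len(words) <= max_len):
            continue                                   # degenerate length
        if len(set(words)) < (1 - max_repeat) * len(words):
            continue                                   # heavy word repetition
        key = (instruction.strip().lower(), response.strip().lower())
        if key in seen:
            continue                                   # exact duplicate
        seen.add(key)
        kept.append((instruction, response))
    return kept
```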
[94] Beyond Direct Generation: A Decomposed Approach to Well-Crafted Screenwriting with LLMs
Hang Lei, Shengyi Zong, Zhaoyan Li, Ziren Zhou, Hao Liu
Main category: cs.CL
TL;DR: Dual-Stage Refinement (DSR) framework decomposes screenplay generation into creative narrative generation and format conversion stages, achieving 75% win rate against strong baselines and 82.7% of human-level performance.
Details
Motivation: Direct end-to-end LLM generation fails to produce quality screenplays because it forces models to simultaneously handle creative narrative construction and rigid format adherence, resulting in superficial outputs lacking structural integrity.Method: DSR framework with two stages: first transforms brief outlines into rich novel-style prose, then refines into professionally formatted screenplays. Uses hybrid data synthesis (reverse and forward) to address paired training data scarcity.
Result: Blind evaluations by professional screenwriters show 75% win rate against strong baselines like Gemini-2.5-Pro and reaches 82.7% of human-level performance.
Conclusion: Decomposed generation architecture with tailored data synthesis effectively specializes LLMs in complex creative domains like screenplay writing.
Abstract: The screenplay serves as the foundation for television production, defining narrative structure, character development, and dialogue. While Large Language Models (LLMs) show great potential in creative writing, direct end-to-end generation approaches often fail to produce well-crafted screenplays. We argue this failure stems from forcing a single model to simultaneously master two disparate capabilities: creative narrative construction and rigid format adherence. The resulting outputs may mimic superficial style but lack the deep structural integrity and storytelling substance required for professional use. To enable LLMs to generate high-quality screenplays, we introduce Dual-Stage Refinement (DSR), a decomposed framework that decouples creative narrative generation from format conversion. The first stage transforms a brief outline into rich, novel-style prose. The second stage refines this narrative into a professionally formatted screenplay. This separation enables the model to specialize in one distinct capability at each stage. A key challenge in implementing DSR is the scarcity of paired outline-to-novel training data. We address this through hybrid data synthesis: reverse synthesis deconstructs existing screenplays into structured inputs, while forward synthesis leverages these inputs to generate high-quality narrative texts as training targets. Blind evaluations by professional screenwriters show that DSR achieves a 75% win rate against strong baselines like Gemini-2.5-Pro and reaches 82.7% of human-level performance. Our work demonstrates that decomposed generation architecture with tailored data synthesis effectively specializes LLMs in complex creative domains.
[95] MATCH: Task-Driven Code Evaluation through Contrastive Learning
Marah Ghoummaid, Vladimir Tchuiev, Ofek Glick, Michal Moschkovitz, Dotan Di Castro
Main category: cs.CL
TL;DR: MATCH is a novel reference-free metric for evaluating AI-generated code using contrastive learning to create embeddings that measure how well code implements natural language task descriptions.
Details
Motivation: Traditional code evaluation methods like unit tests are unscalable, syntactic metrics don't capture functionality, and reference-based metrics require reference code which isn't always available.Method: Uses Contrastive Learning to generate meaningful embeddings for both code and natural language task descriptions, enabling similarity scoring between generated code and task descriptions.
Result: MATCH achieves stronger correlations with functional correctness and human preference than existing metrics across multiple programming languages.
Conclusion: MATCH provides an effective reference-free evaluation method for AI-generated code that better captures functional alignment with developer intent.
Abstract: AI-based code generation is increasingly prevalent, with GitHub Copilot estimated to generate 46% of the code on GitHub. Accurately evaluating how well generated code aligns with developer intent remains a critical challenge. Traditional evaluation methods, such as unit tests, are often unscalable and costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code functionality, and metrics like CodeBERTScore require reference code, which is not always available. Reference-free evaluation has few alternatives, such as ICE-Score; to address this gap, this paper introduces MATCH, a novel reference-free metric. MATCH uses Contrastive Learning to generate meaningful embeddings for code and natural language task descriptions, enabling similarity scoring that reflects how well generated code implements the task. We show that MATCH achieves stronger correlations with functional correctness and human preference than existing metrics across multiple programming languages.
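The contrastive setup can be sketched with an in-batch InfoNCE-style loss that pulls matching code/task pairs together, plus a cosine score at inference time. The symmetric loss is our assumption; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(code_emb, task_emb, temperature=0.07):
    """Row i of each (batch, dim) matrix is a matching code/task pair."""
    code = F.normalize(code_emb, dim=-1)
    task = F.normalize(task_emb, dim=-1)
    logits = code @ task.T / temperature     # (batch, batch) similarities
    labels = torch.arange(code.size(0))      # positives on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

def match_score(code_emb, task_emb):
    """Reference-free score: similarity of code to its task description."""
    return F.cosine_similarity(code_emb, task_emb, dim=-1)
```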
[96] SI-Bench: Benchmarking Social Intelligence of Large Language Models in Human-to-Human Conversations
Shuai Huang, Wenxuan Zhao, Jun Gao
Main category: cs.CL
TL;DR: SI-Bench is a benchmark for evaluating social intelligence in LLMs using 2,221 authentic multi-turn dialogues from social networking apps, showing SOTA models exceed humans in process reasoning but lag in reply quality.
Details
Motivation: Existing evaluation methods use simulated agent-to-agent interactions that fail to capture authentic linguistic styles and relational dynamics of real human conversations.Method: Collected 2,221 authentic multi-turn dialogues from social networking applications and manually annotated 312 dialogues across 8 major models, grounded in social science theories.
Result: SOTA models surpassed human experts in process reasoning under complex social situations but still fall behind humans in reply quality. Chain-of-Thought reasoning degraded LLM performance in social dialogue tasks.
Conclusion: SI-Bench provides a realistic benchmark for evaluating social intelligence in LLMs, revealing current models’ strengths in reasoning but weaknesses in conversational quality compared to humans.
Abstract: As large language models (LLMs) develop anthropomorphic abilities, they are increasingly being deployed as autonomous agents to interact with humans. However, evaluating their performance in realistic and complex social interactions remains a significant challenge. Most previous research built datasets through simulated agent-to-agent interactions, which fails to capture the authentic linguistic styles and relational dynamics found in real human conversations. To address this gap, we introduce SI-Bench, a novel benchmark designed to evaluate aspects of social intelligence in LLMs. Grounded in broad social science theories, SI-Bench contains 2,221 authentic multi-turn dialogues collected from a social networking application. We further selected a subset of 312 dialogues for manual annotation across 8 major models. The experiments show that SOTA models have surpassed the human expert in process reasoning under complex social situations, yet they still fall behind humans in reply quality. Moreover, introducing Chain-of-Thought (CoT) reasoning may degrade the performance of LLMs in social dialogue tasks. All datasets are openly available at https://github.com/SI-Bench/SI-Bench.git.
[97] DREaM: Drug-Drug Relation Extraction via Transfer Learning Method
Ali Fata, Hossein Rahmani, Parinaz Soltanzadeh, Amirhossein Derakhshan, Behrouz Minaei Bidgoli
Main category: cs.CL
TL;DR: DREaM is a transfer learning method for drug-drug relation extraction that uses a trained relation extraction model on medical texts to build a drug relationship ontology, with LLM validation achieving 71% agreement.
Details
Motivation: Limited datasets exist for drug-drug relation extraction, making transfer learning necessary to apply machine learning methods in this domain for identifying drug interactions and side effects.Method: Uses a trained relation extraction model to discover relations between entities, applies it to medical text corpus to construct drug relationship ontology, and validates extracted relations using a large language model.
Result: The LLM agreed with 71% of relations extracted from a subset of PubMed abstracts. Qualitative analysis revealed ambiguities in the medical domain, highlighting relation extraction challenges.
Conclusion: The approach demonstrates feasibility of transfer learning for drug-drug relation extraction and reveals domain-specific challenges through identified ambiguities in medical texts.
Abstract: Relation extraction between drugs plays a crucial role in identifying drug-drug interactions and predicting side effects. The advancement of machine learning methods in relation extraction, along with the development of large medical text databases, has enabled the low-cost extraction of such relations compared to other approaches that typically require expert knowledge. However, to the best of our knowledge, limited datasets specifically designed for drug-drug relation extraction are currently available. Therefore, employing transfer learning becomes necessary to apply machine learning methods in this domain. In this study, we propose DREaM, a method that first employs a trained relation extraction model to discover relations between entities and then applies this model to a corpus of medical texts to construct an ontology of drug relationships. The extracted relations are subsequently validated using a large language model. Quantitative results indicate that the LLM agreed with 71% of the relations extracted from a subset of PubMed abstracts. Furthermore, our qualitative analysis indicates that this approach can uncover ambiguities in the medical domain, highlighting the challenges inherent in relation extraction in this field.
[98] Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports
Alois Thomas, Maya Varma, Jean-Benoit Delbrouck, Curtis P. Langlotz
Main category: cs.CL
TL;DR: A sentence-level Process Reward Model (PRM) for detecting hallucinations in radiology reports generated by Large Vision-Language Models, achieving better performance than existing methods and strong generalization across different LVLMs.
Details
Motivation: LVLMs often produce clinically critical hallucinations in radiology reports, posing serious risks, and existing detection methods lack sentence-level granularity and robust generalization.Method: Developed a lightweight 0.5B-parameter PRM that predicts factual correctness of each generated sentence, conditioned on clinical context and preceding text, fine-tuned on MIMIC-CXR with weakly-supervised labels.
Result: Outperformed existing verification techniques with 7.5% improvement in Matthews Correlation Coefficient and 1.8% in AUROC, showed strong generalization to unseen LVLMs, improved F1-CheXbert scores by 4.5% when filtering low-quality reports, and achieved 7.4% improvement in F1-CheXbert through weighted best-of-N selection.
Conclusion: A lightweight, context-aware PRM provides an effective model-agnostic safety layer for clinical LVLMs without requiring access to internal model activations.
Abstract: Automating radiology report generation with Large Vision-Language Models (LVLMs) holds great potential, yet these models often produce clinically critical hallucinations, posing serious risks. Existing hallucination detection methods frequently lack the necessary sentence-level granularity or robust generalization across different LVLM generators. We introduce a novel approach: a sentence-level Process Reward Model (PRM) adapted for this vision-language task. Our PRM predicts the factual correctness of each generated sentence, conditioned on clinical context and preceding text. When fine-tuned on MIMIC-CXR with weakly-supervised labels, a lightweight 0.5B-parameter PRM outperforms existing verification techniques, demonstrating, for instance, relative improvements of 7.5% in Matthews Correlation Coefficient and 1.8% in AUROC over strong white-box baselines on outputs from one LVLM. Unlike methods reliant on internal model states, our PRM demonstrates strong generalization to an unseen LVLM. We further show its practical utility: PRM scores effectively filter low-quality reports, improving F1-CheXbert scores by 4.5% (when discarding the worst 10% of reports). Moreover, when guiding a novel weighted best-of-N selection process on the MIMIC-CXR test set, our PRM shows relative improvements in clinical metrics of 7.4% for F1-CheXbert and 0.6% for BERTScore. These results demonstrate that a lightweight, context-aware PRM provides a model-agnostic safety layer for clinical LVLMs without requiring access to internal activations.
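Weighted best-of-N selection can be sketched as scoring every candidate report by its per-sentence PRM probabilities and keeping the best one. Averaging the sentence scores, and conditioning only on preceding sentences (the imaging context is dropped here), are simplifying assumptions.

```python
def select_report(candidates, prm_score):
    """candidates: list of reports, each a list of sentences.
    prm_score(preceding, sentence) -> probability the sentence is factual."""
    def report_score(sentences):
        scores = [prm_score(sentences[:i], s) for i, s in enumerate(sentences)]
        return sum(scores) / len(scores)
    return max(candidates, key=report_score)

# The same per-sentence scores support filtering: discard the
# worst-scoring 10% of reports before downstream use.
```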
[99] Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?
Tawsif Tashwar Dipto, Azmol Hossain, Rubayet Sabbir Faruque, Md. Rezuwan Hassan, Kanij Fatema, Tanmoy Shome, Ruwad Naswan, Md. Foriduzzaman Zihad, Mohaymen Ul Anam, Nazia Tasnim, Hasan Mahmud, Md Kamrul Hasan, Md. Mehedi Hasan Shawon, Farig Sadeque, Tahsin Reasat
Main category: cs.CL
TL;DR: The paper introduces Ben-10, a 78-hour Bengali dialect speech corpus, showing that speech foundation models struggle with regional dialects in both zero-shot and fine-tuned settings, with dialect-specific training being the most effective approach.
Details
Motivation: To investigate the effects of dialectal variations on automatic speech recognition (ASR), particularly for low-resource languages where conventional research relies on canonical forms and treats dialect ASR as fine-tuning.Method: Developed a 78-hour annotated Bengali Speech-to-Text corpus (Ben-10) and evaluated speech foundation models in zero-shot and fine-tuned settings across different dialects.
Result: Speech foundation models struggle heavily with regional dialect ASR in both zero-shot and fine-tuned settings. All deep learning methods have difficulty modeling speech data under dialectal variations, but dialect-specific model training alleviates the issue.
Conclusion: The Ben-10 dataset serves as an out-of-distribution resource for ASR modeling under constrained resources, highlighting the challenges of dialectal variations in speech recognition and the need for dialect-specific approaches.
Abstract: Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages, while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR, we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations, but dialect-specific model training alleviates the issue. Our dataset also serves as an out-of-distribution (OOD) resource for ASR modeling under constrained resources. The dataset and code developed for this project are publicly available.
[100] Mubeen AI: A Specialized Arabic Language Model for Heritage Preservation and User Intent Understanding
Mohammed Aljafari, Ismail Alturki, Ahmed Mori, Yehya Kadumi
Main category: cs.CL
TL;DR: Mubeen is a proprietary Arabic language model developed by MASARAT SA, optimized for Arabic linguistics, Islamic studies, and cultural heritage. It uses native Arabic sources and features Practical Closure Architecture to solve the “Utility Gap Crisis” by providing decisive guidance rather than just factual answers.
Details
Motivation: To address limitations in existing Arabic models that rely on translated English data and often fail in intent detection and retrieval-augmented generation, while ensuring cultural authenticity and accuracy in Arabic language processing.Method: Trained on extensive authentic Arabic sources including historical manuscripts digitized via proprietary Arabic OCR, incorporating scholarly works in linguistics, jurisprudence, hadith, and Quranic exegesis. Uses deep linguistic engineering framework and Practical Closure Architecture to prioritize clarity and decisive guidance.
Result: Mubeen masters Arabic eloquence and enables precise understanding across classical texts, contemporary writing, and regional dialects with focus on user intent and contextually relevant responses. It transforms from information repository to decisive guide.
Conclusion: Mubeen successfully bridges the “Utility Gap Crisis” by providing culturally authentic Arabic language processing with practical utility, aligning with Saudi Vision 2030 goals for cultural preservation and technological advancement.
Abstract: Mubeen is a proprietary Arabic language model developed by MASARAT SA, optimized for deep understanding of Arabic linguistics, Islamic studies, and cultural heritage. Trained on an extensive collection of authentic Arabic sources significantly expanded by digitizing historical manuscripts via a proprietary Arabic OCR engine, the model incorporates seminal scholarly works in linguistics, jurisprudence, hadith, and Quranic exegesis, alongside thousands of academic theses and peer-reviewed research papers. Conditioned through a deep linguistic engineering framework, Mubeen masters not just the meaning but the eloquence of Arabic, enabling precise understanding across classical texts, contemporary writing, and regional dialects with focus on comprehending user intent and delivering accurate, contextually relevant responses. Unlike other Arabic models relying on translated English data that often fail in intent detection or retrieval-augmented generation (RAG), Mubeen uses native Arabic sources to ensure cultural authenticity and accuracy. Its core innovation is the Practical Closure Architecture, designed to solve the “Utility Gap Crisis” where factually correct answers fail to resolve users' core needs, forcing them into frustrating cycles of re-prompting. By prioritizing clarity and decisive guidance, Mubeen transforms from an information repository into a decisive guide, aligning with Saudi Vision 2030. The model’s architecture combines deep heritage specialization with multi-disciplinary expert modules, enabling robust performance across both cultural preservation and general knowledge domains.
[101] Code Aesthetics with Agentic Reward Feedback
Bang Xiao, Lingjie Jiang, Shaohan Huang, Tengchao Lv, Yupan Huang, Xun Wu, Lei Cui, Furu Wei
Main category: cs.CL
TL;DR: A new pipeline to improve aesthetic quality of LLM-generated code using AesCode-358K dataset, agentic reward feedback system, and GRPO-AR algorithm, achieving superior performance on code aesthetics benchmarks.
Details
Motivation: LLMs struggle with visually-oriented coding tasks and produce suboptimal aesthetics in code generation, despite excelling at traditional programming tasks.Method: Constructed AesCode-358K dataset for instruction-tuning, developed multi-agent reward feedback system for executability and aesthetics evaluation, and created GRPO-AR algorithm for joint optimization of functionality and code aesthetics.
Result: Combining supervised fine-tuning on AesCode-358K with reinforcement learning using agentic reward feedback significantly improves performance on OpenDesign benchmark and enhances results on PandasPlotBench. AesCoder-4B surpasses GPT-4o and GPT-4.1, achieving performance comparable to large 480B-685B parameter models.
Conclusion: The proposed approach effectively enhances code aesthetics in LLM-generated code, demonstrating that focused training on aesthetic aspects can yield substantial improvements even with smaller models.
Abstract: Large Language Models (LLMs) have become valuable assistants for developers in code-related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, they struggle with visually-oriented coding tasks, often producing suboptimal aesthetics. In this paper, we introduce a new pipeline to enhance the aesthetic quality of LLM-generated code. We first construct AesCode-358K, a large-scale instruction-tuning dataset focused on code aesthetics. Next, we propose agentic reward feedback, a multi-agent system that evaluates executability, static aesthetics, and interactive aesthetics. Building on this, we develop GRPO-AR, which integrates these signals into the GRPO algorithm for joint optimization of functionality and code aesthetics. Finally, we develop OpenDesign, a benchmark for assessing code aesthetics. Experimental results show that combining supervised fine-tuning on AesCode-358K with reinforcement learning using agentic reward feedback significantly improves performance on OpenDesign and also enhances results on existing benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o and GPT-4.1, and achieves performance comparable to large open-source models with 480B-685B parameters, underscoring the effectiveness of our approach.
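How the agentic reward signals might feed a GRPO-style update can be sketched as fusing the three signals into one scalar, then normalizing rewards within each sampled group. The fusion weights and the standard group normalization are assumptions, not confirmed details of GRPO-AR.

```python
import statistics

def combined_reward(executable: bool, static_aes: float,
                    interactive_aes: float, w=(0.5, 0.25, 0.25)) -> float:
    """Fuse executability and the two aesthetic scores into one reward."""
    return w[0] * float(executable) + w[1] * static_aes + w[2] * interactive_aes

def group_advantages(rewards):
    """GRPO-style: normalize rewards within one prompt's group of samples."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

rewards = [combined_reward(True, 0.9, 0.7),
           combined_reward(True, 0.4, 0.5),
           combined_reward(False, 0.8, 0.9)]
print(group_advantages(rewards))
```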
[102] A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results
Thai-Binh Nguyen, Katerina Zmolikova, Pingchuan Ma, Ngoc Quan Pham, Christian Fuegen, Alexander Waibel
Main category: cs.CL
TL;DR: The paper introduces MCoRec task for CHiME Challenge, addressing cocktail-party problem using audio-visual cues to handle extreme speech overlap in natural multi-party conversations.
Details
Motivation: To solve the cocktail-party problem of overlapping conversations in single-room settings, capturing natural unscripted group chats with up to 100% speech overlap and fragmented conversational turns.Method: Uses audio, visual, and contextual cues to jointly transcribe each speaker’s speech and cluster them into respective conversations from audio-visual recordings.
Result: Audio-only baselines exceed 100% word error rate, while incorporating visual cues yields a substantial 50% improvement, demonstrating the critical importance of multi-modality.
Conclusion: Multi-modal approaches are essential for handling extreme speech overlap in natural conversations, with visual cues providing significant performance gains over audio-only systems.
Abstract: We introduce the task of Multi-Modal Context-Aware Recognition (MCoRec) in the ninth CHiME Challenge, which addresses the cocktail-party problem of overlapping conversations in a single-room setting using audio, visual, and contextual cues. MCoRec captures natural multi-party conversations where the recordings focus on unscripted, casual group chats, leading to extreme speech overlap of up to 100% and highly fragmented conversational turns. The task requires systems to answer the question “Who speaks when, what, and with whom?” by jointly transcribing each speaker’s speech and clustering them into their respective conversations from audio-visual recordings. Audio-only baselines exceed 100% word error rate, whereas incorporating visual cues yields a substantial 50% improvement, highlighting the importance of multi-modality. In this manuscript, we present the motivation behind the task, outline the data collection process, and report the baseline systems developed for MCoRec.
[103] DCMM-SQL: Automated Data-Centric Pipeline and Multi-Model Collaboration Training for Text-to-SQL Model
Yuanzhen Xie, Liu Ye, Jiqun Chu, Mochi Gao, Hehuan Liu, Yunzhi Tan, Bo Hu, Zang Li
Main category: cs.CL
TL;DR: The paper proposes a fully automated data-centric pipeline for text-to-SQL tasks, including adaptive data repair and error data augmentation, combined with multi-model collaboration training and ensemble strategies.
Details
Motivation: While agent-based frameworks have improved text-to-SQL tasks, the impact of data-centric strategies remains underexplored. Current fine-tuned models have limited capabilities, requiring better approaches.Method: Developed a data-centric pipeline with adaptive data repair (automatically finding/fixing training data errors) and error data augmentation (enhancing erroneous data). Used multi-model collaboration training with different augmented data and ensemble strategies for multiple-choice questions.
Result: Achieved first place in lightweight text-to-SQL models (within 70B parameters). Experiment results and ablation studies demonstrated the effectiveness of the proposed pipeline and multi-model strategies.
Conclusion: The data-centric pipeline and multi-model interactive iterative strategies effectively improve text-to-SQL task accuracy, showing the importance of data quality and model collaboration in this domain.
Abstract: Text-to-SQL tasks have seen notable improvements since the release of ChatGPT, and agent-based frameworks have been widely used in this field. However, the impact of data-centric strategies on text-to-SQL tasks has rarely been explored. In this paper, we systematically design a fully automated data-centric pipeline for text-to-SQL tasks, including \emph{adaptive data repair}, which can automatically find and fix errors in the training dataset, and \emph{error data augmentation}, where we specifically diffuse and enhance erroneous data predicted by the initially trained models. Meanwhile, because the capability of a single fine-tuned model has been found to be very limited, we propose a multi-model collaboration training schema that trains multiple models on different augmented data, enabling them to develop distinct capabilities and complement each other. Furthermore, we utilize an ensemble strategy that integrates the capabilities of multiple models by solving a multiple-choice question, aiming to further improve the accuracy of text-to-SQL tasks. Experimental results and an ablation study demonstrate the effectiveness of the data-centric pipeline and Multi-Model (MM) interactive iterative strategies, achieving first place among lightweight text-to-SQL models (within 70B parameters).
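The ensemble step, which frames candidate integration as a multiple-choice question, can be sketched as a simple vote; the majority rule below is an illustrative stand-in for the paper's integration strategy.

```python
from collections import Counter

def ensemble_pick(candidates, models):
    """candidates: list of SQL strings; each model is a callable that
    answers the multiple-choice question with a candidate index."""
    votes = Counter(model(candidates) for model in models)
    return candidates[votes.most_common(1)[0][0]]

# Hypothetical usage with three stand-in voters.
queries = ["SELECT name FROM users;", "SELECT * FROM users;"]
print(ensemble_pick(queries, [lambda c: 0, lambda c: 0, lambda c: 1]))
```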
[104] Adaptive Blockwise Search: Inference-Time Alignment for Large Language Models
Mohammad Atif Quamar, Mohammad Areeb, Nishant Sharma, Ananth Shreekumar, Jonathan Rosenthal, Muslum Ozgur Ozmen, Mikhail Kuznetsov, Z. Berkay Celik
Main category: cs.CL
TL;DR: AdaSearch is a blockwise search strategy that adaptively allocates computational budget to focus on critical initial tokens in LLM responses, improving alignment performance over standard methods.
Details
Motivation: Current inference-time alignment methods apply uniform computational effort across all tokens, which is suboptimal since initial tokens are often disproportionately critical for alignment tasks.Method: Introduces AdaSearch, a blockwise search strategy with adaptive computational budget allocation using sampling schedules, and AdaBeam for tree-search applications.
Result: AdaSearch outperforms Best-of-N and fine-tuning baselines across eight LLMs, achieving over 10% improvement in win-rates for harmlessness generation, controlled sentiment generation, and mathematical reasoning tasks.
Conclusion: Adaptive allocation of computational effort to critical initial tokens significantly improves LLM alignment performance compared to uniform approaches.
Abstract: LLM alignment remains a critical challenge. Inference-time methods provide a flexible alternative to fine-tuning, but their uniform computational effort often yields suboptimal alignment. We hypothesize that for many alignment tasks, the initial tokens of a response are disproportionately more critical. To leverage this principle, we introduce AdaSearch, a novel blockwise search strategy. It adaptively allocates a fixed computational budget using a sampling schedule, focusing search effort on these critical tokens. We apply AdaSearch to sequential decoding and introduce its tree-search counterpart, AdaBeam. Our comprehensive evaluation across eight LLMs demonstrates that AdaSearch outperforms strong Best-of-N and fine-tuning baselines. Specifically, win-rates improve by over 10% for harmlessness generation, controlled sentiment generation, and for mathematical reasoning tasks relative to Best-of-N.
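The core idea, concentrating a fixed sampling budget on the earliest tokens, reduces to a short loop. The schedule, the greedy block selection, and both hooks (generate_block, score) are hypothetical; AdaBeam would keep several prefixes per step instead of one.

```python
def ada_search(prompt, generate_block, score, schedule=(16, 8, 4, 2, 2)):
    """generate_block(prefix) -> one sampled block of tokens (a string).
    score(prefix) -> scalar reward-model score for a partial response."""
    prefix = prompt
    for n_samples in schedule:        # budget is front-loaded on early blocks
        candidates = [prefix + generate_block(prefix) for _ in range(n_samples)]
        prefix = max(candidates, key=score)   # keep the best continuation
    return prefix
```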
[105] BaZi-Based Character Simulation Benchmark: Evaluating AI on Temporal and Persona Reasoning
Siyuan Zheng, Pai Liu, Xi Chen, Jizheng Dong, Sihan Jia
Main category: cs.CL
TL;DR: The paper introduces a BaZi-LLM system that combines symbolic reasoning with LLMs to generate dynamic virtual personas, achieving significant accuracy improvements over mainstream LLMs.
Details
Motivation: Current methods for creating human-like virtual characters rely on annotated data or handcrafted prompts, which are difficult to scale and produce realistic, contextually coherent personas.Method: Created the first QA dataset for BaZi-based persona reasoning using real human experiences categorized into five life domains, and proposed a BaZi-LLM system integrating symbolic reasoning with large language models.
Result: Achieved 30.3%-62.6% accuracy improvement over mainstream LLMs like DeepSeek-v3 and GPT-5-mini. When incorrect BaZi information was used, accuracy dropped by 20%-45%, demonstrating the importance of culturally grounded reasoning.
Conclusion: The integration of culturally grounded symbolic reasoning with LLMs shows strong potential for realistic character simulation and temporally dynamic persona generation.
Abstract: Human-like virtual characters are crucial for games, storytelling, and virtual reality, yet current methods rely heavily on annotated data or handcrafted persona prompts, making it difficult to scale up and generate realistic, contextually coherent personas. We create the first QA dataset for BaZi-based persona reasoning, where real human experiences categorized into wealth, health, kinship, career, and relationships are represented as life-event questions and answers. Furthermore, we propose the first BaZi-LLM system that integrates symbolic reasoning with large language models to generate temporally dynamic and fine-grained virtual personas. Compared with mainstream LLMs such as DeepSeek-v3 and GPT-5-mini, our method achieves a 30.3%-62.6% accuracy improvement. In addition, when incorrect BaZi information is used, our model’s accuracy drops by 20%-45%, showing the potential of culturally grounded symbolic-LLM integration for realistic character simulation.
[106] LightKGG: Simple and Efficient Knowledge Graph Generation from Textual Data
Teng Lin
Main category: cs.CL
TL;DR: LightKGG is a framework that enables efficient knowledge graph extraction from text using small language models through context-integrated graph extraction and topology-enhanced relationship inference.
Details
Motivation: Address the bottleneck of high-quality knowledge graph scarcity by overcoming limitations of existing methods: error-prone pattern matching and resource-intensive LLMs that are inaccessible in low-resource environments.Method: Two key innovations: (1) Context-integrated Graph extraction that unifies contextual information with nodes/edges, and (2) Topology-enhanced relationship inference that uses graph topology to infer relationships without complex language understanding.
Result: Enables accurate knowledge graph construction with minimal hardware requirements, making automated knowledge extraction practical for deployment in resource-constrained scenarios.
Conclusion: Bridges the gap between automated knowledge extraction and practical deployment while introducing scientifically rigorous methods for optimizing small language model efficiency in structured NLP tasks.
Abstract: The scarcity of high-quality knowledge graphs (KGs) remains a critical bottleneck for downstream AI applications, as existing extraction methods rely heavily on error-prone pattern-matching techniques or resource-intensive large language models (LLMs). While recent tools leverage LLMs to generate KGs, their computational demands limit accessibility for low-resource environments. Our paper introduces LightKGG, a novel framework that enables efficient KG extraction from textual data using small-scale language models (SLMs) through two key technical innovations: (1) Context-integrated graph extraction, which unifies contextual information with nodes and edges in a single graph structure, reducing the reliance on complex semantic processing while preserving key information; (2) Topology-enhanced relationship inference, which leverages the inherent topology of the extracted graph to efficiently infer relationships, enabling relationship discovery without relying on the complex language-understanding capabilities of LLMs. By enabling accurate KG construction with minimal hardware requirements, this work bridges the gap between automated knowledge extraction and practical deployment scenarios while introducing scientifically rigorous methods for optimizing SLM efficiency in structured NLP tasks.
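Topology-enhanced relationship inference can be illustrated on a tiny context graph: two nodes sharing much of their neighborhood are likely related, with no language understanding involved. The Jaccard scoring rule and the example edges are illustrative assumptions.

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from([("aspirin", "headache"), ("aspirin", "fever"),
                  ("ibuprofen", "headache"), ("ibuprofen", "fever"),
                  ("ibuprofen", "inflammation")])

def relation_score(graph, u, v):
    """Jaccard overlap of neighborhoods as a relatedness signal."""
    nu, nv = set(graph[u]), set(graph[v])
    return len(nu & nv) / len(nu | nv) if nu | nv else 0.0

print(relation_score(g, "aspirin", "ibuprofen"))  # 0.667: likely related
```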
[107] How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes
Sheri Osborn, Rohit Valecha, H. Raghav Rao, Dan Sass, Anthony Rios
Main category: cs.CL
TL;DR: This paper introduces a benchmark for evaluating LLMs’ ability to forecast AI’s effects on job demand, combining US job posting data with global AI occupational change projections.
Details
Motivation: There's a lack of tools to systematically forecast AI's effects on employment, despite AI reshaping labor markets. Existing research shows LLMs can extract sentiment and summarize economic reports, but little work assesses their use for forward-looking labor prediction.Method: The benchmark combines US sector-level job postings data with global AI occupational change projections, formatted into forecasting tasks with temporal splits. Evaluates LLMs using task-scaffolded, persona-driven, and hybrid prompting strategies across model families.
Result: Structured task prompts improve forecast stability, while persona prompts work better for short-term trends. Performance varies significantly across sectors and time horizons, highlighting the need for domain-aware prompting and rigorous evaluation.
Conclusion: The released benchmark supports future research on labor forecasting, prompt design, and LLM-based economic reasoning, providing a reproducible testbed for studying AI’s limits and opportunities as a forecasting tool in labor markets.
Abstract: Artificial intelligence is reshaping labor markets, yet we lack tools to systematically forecast its effects on employment. This paper introduces a benchmark for evaluating how well large language models (LLMs) can anticipate changes in job demand, especially in occupations affected by AI. Existing research has shown that LLMs can extract sentiment, summarize economic reports, and emulate forecaster behavior, but little work has assessed their use for forward-looking labor prediction. Our benchmark combines two complementary datasets: a high-frequency index of sector-level job postings in the United States, and a global dataset of projected occupational changes due to AI adoption. We format these data into forecasting tasks with clear temporal splits, minimizing the risk of information leakage. We then evaluate LLMs using multiple prompting strategies, comparing task-scaffolded, persona-driven, and hybrid approaches across model families. We assess both quantitative accuracy and qualitative consistency over time. Results show that structured task prompts consistently improve forecast stability, while persona prompts offer advantages on short-term trends. However, performance varies significantly across sectors and horizons, highlighting the need for domain-aware prompting and rigorous evaluation protocols. By releasing our benchmark, we aim to support future research on labor forecasting, prompt design, and LLM-based economic reasoning. This work contributes to a growing body of research on how LLMs interact with real-world economic data, and provides a reproducible testbed for studying the limits and opportunities of AI as a forecasting tool in the context of labor markets.
[108] Detecting Religious Language in Climate Discourse
Evy Beijen, Pien Pieterse, Yusuf Çelik, Willem Th. van Peursen, Sandjai Bhulai, Meike Morren
Main category: cs.CL
TL;DR: This paper analyzes religious language in climate discourse using both rule-based methods and LLMs, finding that rule-based approaches detect more religious language than LLMs.
Details
Motivation: To investigate how religious language appears in climate-related texts from secular and religious NGOs, and to address methodological challenges in detecting religious language computationally.Method: Dual approach: rule-based model using hierarchical tree of religious terms from ecotheology literature, and large language models (LLMs) in zero-shot setting, applied to 880,000+ sentences.
Result: Rule-based method consistently labels more sentences as religious than LLMs, highlighting methodological challenges and tension between vocabulary-based vs. context-based definitions of religious language.
Conclusion: The study demonstrates both potential and limitations of computational approaches for analyzing religious language in climate discourse, contributing to digital methods in religious studies.
Abstract: Religious language continues to permeate contemporary discourse, even in ostensibly secular domains such as environmental activism and climate change debates. This paper investigates how explicit and implicit forms of religious language appear in climate-related texts produced by secular and religious nongovernmental organizations (NGOs). We introduce a dual methodological approach: a rule-based model using a hierarchical tree of religious terms derived from ecotheology literature, and large language models (LLMs) operating in a zero-shot setting. Using a dataset of more than 880,000 sentences, we compare how these methods detect religious language and analyze points of agreement and divergence. The results show that the rule-based method consistently labels more sentences as religious than LLMs. These findings highlight not only the methodological challenges of computationally detecting religious language but also the broader tension over whether religious language should be defined by vocabulary alone or by contextual meaning. This study contributes to digital methods in religious studies by demonstrating both the potential and the limitations of approaches for analyzing how the sacred persists in climate discourse.
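The rule-based detector can be sketched as a hierarchical term tree flattened into a lexicon, labeling a sentence religious if it contains any term. The entries below are illustrative; the actual vocabulary is derived from the ecotheology literature.

```python
TERM_TREE = {
    "explicit": {"deity": ["god", "creator"],
                 "practice": ["prayer", "stewardship"]},
    "implicit": {"eschatology": ["apocalypse", "salvation"],
                 "sacredness": ["sacred", "creation"]},
}

def flatten(tree):
    for branch in tree.values():
        for terms in branch.values():
            yield from terms

LEXICON = set(flatten(TERM_TREE))

def is_religious(sentence: str) -> bool:
    return any(w.strip(".,;:!?") in LEXICON for w in sentence.lower().split())

print(is_religious("We are called to care for creation."))  # True
```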
[109] EMTSF:Extraordinary Mixture of SOTA Models for Time Series Forecasting
Musleh Alharthi, Kaleel Mahmood, Sarosh Patel, Ausif Mahmood
Main category: cs.CL
TL;DR: A Mixture of Experts (MoE) framework that combines state-of-the-art TSF models (xLSTM, enhanced Linear, PatchTST, minGRU) using a Transformer-based gating network achieves superior performance in time series forecasting.
Details
Motivation: Recent debates about Transformer effectiveness in TSF, with conflicting results showing simple linear models sometimes outperform complex architectures, and the challenge that TSF data favors recent past and faces unpredictable events.Method: Proposes a Mixture of Experts framework that integrates multiple SOTA models (xLSTM, enhanced Linear, PatchTST, minGRU) using a Transformer-based gating network to combine their complementary strengths.
Result: Outperforms all existing TSF models on standard benchmarks, surpassing even the latest MoE-based approaches.
Conclusion: The proposed MoE framework effectively combines diverse forecasting models to achieve state-of-the-art performance in time series forecasting.
Abstract: The immense success of the Transformer architecture in Natural Language Processing has led to its adoption in Time Series Forecasting (TSF), where superior performance has been shown. However, an influential recent paper questioned their effectiveness by demonstrating that a simple single-layer linear model outperforms Transformer-based models. This claim was soon tempered by a better Transformer-based model termed PatchTST. More recently, TimeLLM demonstrated even better results by repurposing a Large Language Model (LLM) for the TSF domain. Again, a follow-up paper challenged this by demonstrating that removing the LLM component, or replacing it with a basic attention layer, in fact yields better performance. One of the challenges in forecasting is that TSF data favors the more recent past and is sometimes subject to unpredictable events. Based upon these recent insights in TSF, we propose a strong Mixture of Experts (MoE) framework. Our method combines state-of-the-art (SOTA) models including xLSTM, enhanced Linear, PatchTST, and minGRU, among others. This set of complementary and diverse models for TSF is integrated in a Transformer-based MoE gating network. Our proposed model outperforms all existing TSF models on standard benchmarks, surpassing even the latest approaches based on MoE frameworks.
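The combination step can be sketched in PyTorch: a gating network maps the input window to softmax weights over expert forecasts. Plain linear layers stand in for xLSTM, enhanced Linear, PatchTST, and minGRU, and a single linear gate stands in for the paper's Transformer-based gating network.

```python
import torch
import torch.nn as nn

class ForecastMoE(nn.Module):
    def __init__(self, experts, lookback: int):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.gate = nn.Linear(lookback, len(experts))

    def forward(self, x):                                # x: (batch, lookback)
        weights = torch.softmax(self.gate(x), dim=-1)             # (batch, E)
        preds = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, horizon)
        return (weights.unsqueeze(-1) * preds).sum(dim=1)         # weighted mix

experts = [nn.Linear(96, 24) for _ in range(4)]  # stand-ins for the SOTA experts
model = ForecastMoE(experts, lookback=96)
yhat = model(torch.randn(8, 96))                 # (8, 24) forecast
```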
[110] Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
Zhuoran Jin, Hongbang Yuan, Kejian Zhu, Jiachun Li, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Main category: cs.CL
TL;DR: Omni-Reward addresses modality imbalance and preference rigidity in reward models by introducing a generalist omni-modal framework with support for free-form preferences across text, image, video, audio, and 3D modalities.
Details
Motivation: Current reward models face modality imbalance (limited to text/image) and preference rigidity (fixed binary preferences), failing to capture personalized preferences across diverse modalities.Method: Proposed Omni-Reward framework includes: Omni-RewardBench benchmark (9 tasks across 5 modalities), Omni-RewardData (248K general + 69K instruction-tuning pairs), and Omni-RewardModel (discriminative and generative RMs).
Result: Achieves strong performance on Omni-RewardBench and other widely used reward modeling benchmarks.
Conclusion: Omni-Reward represents a step toward generalist omni-modal reward modeling that supports free-form preferences across multiple modalities.
Abstract: Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs are mainly focused on text and image modalities, offering limited support for video, audio, and other modalities; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address the above challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.
[111] BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents
Litu Ou, Kuan Li, Huifeng Yin, Liwen Zhang, Zhongwang Zhang, Xixi Wu, Rui Ye, Zile Qiao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Main category: cs.CL
TL;DR: This paper investigates LLM-based search agents’ ability to communicate confidence in multi-turn interactions and proposes Test-Time Scaling methods that use confidence scores to optimize answer quality and reduce token consumption.
Details
Motivation: Existing confidence research focuses mainly on single-turn scenarios, while confidence assessment in complex multi-turn interactions remains limited. The authors want to explore whether LLM agents can effectively communicate their confidence after long action sequences.Method: The researchers experimented with open-source agentic models and proposed Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality and encourage models to retry until reaching satisfactory confidence levels.
Result: Models showed much higher task accuracy at high confidence levels while having near-zero accuracy when confidence was low. The proposed TTS methods significantly reduced token consumption while maintaining competitive performance compared to baseline fixed budget methods.
Conclusion: LLM-based search agents can effectively communicate confidence in multi-turn scenarios, and using confidence scores to guide retry mechanisms can optimize performance while reducing computational costs.
Abstract: Confidence in LLMs is a useful indicator of model uncertainty and answer reliability. Existing work has mainly focused on single-turn scenarios, while research on confidence in complex multi-turn interactions is limited. In this paper, we investigate whether LLM-based search agents have the ability to communicate their own confidence through verbalized confidence scores after long sequences of actions, a significantly more challenging task compared to outputting confidence in a single interaction. Experimenting on open-source agentic models, we first find that models exhibit much higher task accuracy at high confidence while having near-zero accuracy when confidence is low. Based on this observation, we propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality and encourage the model to try again until it reaches a satisfactory confidence level. Results show that our proposed methods significantly reduce token consumption while demonstrating competitive performance compared to baseline fixed-budget TTS methods.
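As a rough illustration of the retry mechanism described above, the sketch below implements a confidence-threshold loop; `run_agent` is a hypothetical stand-in for an agent rollout that returns an answer together with its verbalized confidence, and the threshold and attempt budget are assumptions, not the paper's settings.

```python
# A minimal sketch of confidence-guided test-time scaling: retry the agent
# until its verbalized confidence clears a threshold. `run_agent` is a
# hypothetical rollout returning (answer, confidence in [0, 1]).

from typing import Callable, Tuple

def confidence_guided_tts(
    run_agent: Callable[[str], Tuple[str, float]],
    question: str,
    threshold: float = 0.8,
    max_attempts: int = 4,
) -> Tuple[str, float]:
    best_answer, best_conf = "", -1.0
    for _ in range(max_attempts):
        answer, conf = run_agent(question)
        if conf > best_conf:      # keep the highest-confidence answer so far
            best_answer, best_conf = answer, conf
        if conf >= threshold:     # stop early; saves tokens vs. fixed budgets
            break
    return best_answer, best_conf
```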
[112] Evaluating Large Language Models for Stance Detection on Financial Targets from SEC Filing Reports and Earnings Call Transcripts
Nikesh Gyawali, Doina Caragea, Alex Vasenkov, Cornelia Caragea
Main category: cs.CL
TL;DR: The paper introduces a sentence-level corpus for stance detection on three financial metrics (debt, EPS, sales) from SEC filings and earnings call transcripts, using ChatGPT-o3-pro labeling with human validation. It shows that few-shot with Chain-of-Thought prompting outperforms supervised baselines for financial stance detection.
Details
Motivation: Financial narratives from SEC filings and earnings calls are important but difficult to analyze due to length, jargon, and nuanced language. Traditional sentiment analysis requires expensive labeled datasets, making sentence-level stance detection challenging.
Method: Created a corpus by extracting sentences from Form 10-K reports and ECTs, labeled for stance (positive, negative, neutral) using ChatGPT-o3-pro with human validation. Evaluated LLMs using zero-shot, few-shot, and Chain-of-Thought prompting strategies.
Result: Few-shot with Chain-of-Thought prompting performed best compared to supervised baselines. LLM performance varied across SEC and ECT datasets, demonstrating practical viability for target-specific stance detection without extensive labeled data.
Conclusion: LLMs can be effectively leveraged for target-specific stance detection in the financial domain using few-shot with Chain-of-Thought prompting, eliminating the need for large labeled datasets.
Abstract: Financial narratives from U.S. Securities and Exchange Commission (SEC) filing reports and quarterly earnings call transcripts (ECTs) are very important for investors, auditors, and regulators. However, their length, financial jargon, and nuanced language make fine-grained analysis difficult. Prior sentiment analysis in the financial domain has required large, expensive labeled datasets, making sentence-level stance detection towards specific financial targets challenging. In this work, we introduce a sentence-level corpus for stance detection focused on three core financial metrics: debt, earnings per share (EPS), and sales. The sentences were extracted from Form 10-K annual reports and ECTs, and labeled for stance (positive, negative, neutral) using the advanced ChatGPT-o3-pro model under rigorous human validation. Using this corpus, we conduct a systematic evaluation of modern large language models (LLMs) using zero-shot, few-shot, and Chain-of-Thought (CoT) prompting strategies. Our results show that few-shot with CoT prompting performs best compared to supervised baselines, and LLMs’ performance varies across the SEC and ECT datasets. Our findings highlight the practical viability of leveraging LLMs for target-specific stance detection in the financial domain without requiring extensive labeled data.
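To make the prompting setup concrete, here is a minimal sketch of a few-shot CoT prompt for target-specific stance classification; the exemplar, wording, and label are illustrative assumptions, not the authors' actual prompts.

```python
# A hedged sketch of few-shot Chain-of-Thought stance prompting; the exemplar
# and its label are illustrative, not taken from the paper's corpus.

FEW_SHOT_COT = """Classify the stance of the sentence toward the target
financial metric as positive, negative, or neutral. Reason step by step.

Sentence: "Net sales grew 12% year over year, driven by strong demand."
Target: sales
Reasoning: Sales increased substantially, a favorable signal for the metric.
Stance: positive

Sentence: "{sentence}"
Target: {target}
Reasoning:"""

def build_prompt(sentence: str, target: str) -> str:
    """Fill the template; the LLM completes the reasoning and stance."""
    return FEW_SHOT_COT.format(sentence=sentence, target=target)
```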
[113] MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring
Tengchao Yang, Sichen Guo, Mengzhao Jia, Jiaming Su, Yuanyang Liu, Zhihan Zhang, Meng Jiang
Main category: cs.CL
TL;DR: MMTutorBench is the first benchmark for AI math tutoring that evaluates models on diagnostic and guidance skills through 685 problems with key-steps and problem-specific rubrics across three tasks.
Details
Motivation: Existing benchmarks overlook essential tutoring skills like diagnosing student difficulties and step-by-step guidance, which are crucial for effective math tutoring.
Method: Created MMTutorBench with 685 problems built around pedagogically significant key-steps, paired with problem-specific rubrics for fine-grained evaluation across six dimensions, structured into Insight Discovery, Operation Formulation, and Operation Execution tasks.
Result: Evaluation of 12 leading MLLMs shows clear performance gaps between proprietary and open-source systems, substantial room for improvement compared to human tutors, OCR pipelines degrade tutoring quality, few-shot prompting yields limited gains, and rubric-based LLM-as-a-Judge proves highly reliable.
Conclusion: MMTutorBench highlights both the difficulty and diagnostic value for advancing AI tutoring, revealing significant gaps between current AI systems and human tutoring capabilities.
Abstract: Effective math tutoring requires not only solving problems but also diagnosing students’ difficulties and guiding them step by step. While multimodal large language models (MLLMs) show promise, existing benchmarks largely overlook these tutoring skills. We introduce MMTutorBench, the first benchmark for AI math tutoring, consisting of 685 problems built around pedagogically significant key-steps. Each problem is paired with problem-specific rubrics that enable fine-grained evaluation across six dimensions, and structured into three tasks: Insight Discovery, Operation Formulation, and Operation Execution. We evaluate 12 leading MLLMs and find clear performance gaps between proprietary and open-source systems, substantial room for improvement compared to human tutors, and consistent trends across input variants: OCR pipelines degrade tutoring quality, few-shot prompting yields limited gains, and our rubric-based LLM-as-a-Judge proves highly reliable. These results highlight both the difficulty and diagnostic value of MMTutorBench for advancing AI tutoring.
[114] M4FC: a Multimodal, Multilingual, Multicultural, Multitask Real-World Fact-Checking Dataset
Jiahui Geng, Jonathan Tonglet, Iryna Gurevych
Main category: cs.CL
TL;DR: M4FC is a new multimodal fact-checking dataset with 4,982 images and 6,980 claims across 10 languages, addressing limitations of existing datasets through diverse cultural contexts and six fact-checking tasks.
Details
Motivation: Existing multimodal fact-checking datasets have limitations including small size, limited languages, evidence leakage, and dependency on external news sources, which M4FC aims to overcome.
Method: Created M4FC dataset with images verified by professional fact-checkers from 22 organizations, spanning 10 languages and six multimodal tasks: visual claim extraction, claimant intent prediction, fake detection, image contextualization, location verification, and verdict prediction.
Result: The dataset contains 4,982 images paired with 6,980 claims, with baseline results provided for all tasks and analysis of how intermediate tasks influence downstream verdict prediction performance.
Conclusion: M4FC addresses key limitations in multimodal fact-checking research by providing a comprehensive, diverse dataset with multiple tasks and languages, enabling more robust fact-checking systems.
Abstract: Existing real-world datasets for multimodal automated fact-checking have multiple limitations: they contain few instances, focus on only one or two languages and tasks, suffer from evidence leakage, or depend on external sets of news articles for sourcing true claims. To address these shortcomings, we introduce M4FC, a new real-world dataset comprising 4,982 images paired with 6,980 claims. The images, verified by professional fact-checkers from 22 organizations, represent diverse cultural and geographic contexts. Each claim is available in one or two out of ten languages. M4FC spans six multimodal fact-checking tasks: visual claim extraction, claimant intent prediction, fake detection, image contextualization, location verification, and verdict prediction. We provide baseline results for all tasks and analyze how combining intermediate tasks influences downstream verdict prediction performance. We make our dataset and code available.
[115] IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering
Jieyong Kim, Maryam Amirizaniani, Soojin Yoon, Dongha Lee
Main category: cs.CL
TL;DR: The paper introduces IPQA, a benchmark for core intent identification in personalized question answering, addressing the gap in existing benchmarks that don’t directly measure intent identification capabilities.
Details
Motivation: Existing benchmarks only evaluate response quality or retrieval performance without directly measuring intent identification capabilities, which is critical for generating responses that satisfy individual information needs.
Method: Proposes core intents derived from observable behavior patterns in answer selection, grounded in satisficing theory. Constructs dataset through systematic filtering, LLM-based annotation, and rigorous quality control combining automated verification with human validation.
Result: Experimental evaluations show current systems struggle with core intent identification in personalized contexts, with models failing to identify core intents from user histories and performance degrading as question complexity increases.
Conclusion: The IPQA benchmark addresses a critical gap in personalized question answering evaluation and reveals limitations in current language models’ ability to identify core intents, with code and dataset to be made publicly available.
Abstract: Intent identification serves as the foundation for generating appropriate responses in personalized question answering (PQA). However, existing benchmarks evaluate only response quality or retrieval performance without directly measuring intent identification capabilities. This gap is critical because without understanding which intents users prioritize, systems cannot generate responses satisfying individual information needs. To address this, we introduce the concept of core intents: intents users prioritize when selecting answers to satisfy their information needs. To evaluate these core intents, we propose IPQA, a benchmark for core Intent identification in Personalized Question Answering. Since users do not explicitly state their prioritized intents, we derive core intents from observable behavior patterns in answer selection, grounded in satisficing theory where users choose answers meeting their acceptance thresholds. We construct a dataset with various domains through systematic filtering, LLM-based annotation, and rigorous quality control combining automated verification with human validation. Experimental evaluations across state-of-the-art language models reveal that current systems struggle with core intent identification in personalized contexts. Models fail to identify core intents from user histories, with performance degrading as question complexity increases. The code and dataset will be made publicly available to facilitate future research in this direction.
[116] LimRank: Less is More for Reasoning-Intensive Information Reranking
Tingyu Song, Yilun Zhao, Siyue Zhang, Chen Zhao, Arman Cohan
Main category: cs.CL
TL;DR: LIMRANK is a lightweight LLM-based reranker that achieves competitive performance using only 5% of typical training data through synthetic data generation via LIMRANK-SYNTHESIZER pipeline.
Details
Motivation: Existing LLM adaptation for information reranking requires large-scale fine-tuning which is computationally expensive, so the goal is to achieve effective adaptation with minimal supervision.
Method: Developed LIMRANK-SYNTHESIZER pipeline to generate diverse, challenging, and realistic synthetic reranking examples, then fine-tuned LIMRANK model using this synthetic data.
Result: LIMRANK achieves competitive performance on BRIGHT and FollowIR benchmarks while using less than 5% of typical training data, with strong generalization across scientific literature search and knowledge-intensive problem solving.
Conclusion: Modern LLMs can be effectively adapted for reranking with minimal high-quality supervision through synthetic data generation, enabling efficient and generalizable performance.
Abstract: Existing approaches typically rely on large-scale fine-tuning to adapt LLMs for information reranking tasks, which is computationally expensive. In this work, we demonstrate that modern LLMs can be effectively adapted using only minimal, high-quality supervision. To enable this, we design LIMRANK-SYNTHESIZER, a reusable and open-source pipeline for generating diverse, challenging, and realistic reranking examples. Using this synthetic data, we fine-tune our reranker model, LIMRANK. We evaluate LIMRANK on two challenging benchmarks, i.e., BRIGHT for reasoning-intensive retrieval and FollowIR for instruction-following retrieval. Our experiments demonstrate that LIMRANK achieves competitive performance, while being trained on less than 5% of the data typically used in prior work. Further ablation studies demonstrate the effectiveness of LIMRANK-SYNTHESIZER and the strong generalization capabilities of LIMRANK across downstream tasks, including scientific literature search and retrieval-augmented generation for knowledge-intensive problem solving.
[117] Hope Speech Detection in Social Media English Corpora: Performance of Traditional and Transformer Models
Luis Ramos, Hiram Calvo, Olga Kolesnikova
Main category: cs.CL
TL;DR: This paper evaluates traditional ML models and fine-tuned transformers for hope speech detection, finding that transformers achieve better performance (0.79 macro F1) than traditional models (0.78 max) due to their ability to capture subtle semantic nuances.
Details
Motivation: To identify motivational expressions of agency and goal-directed behavior on social media platforms through hope speech detection, addressing the need for positive content analysis.
Method: Evaluated traditional machine learning models (SVM, logistic regression, Naive Bayes) and fine-tuned transformers on a pre-split hope speech dataset with train, development, and test sets.
Result: Traditional models: linear-kernel SVM and logistic regression reached 0.78 macro-F1, RBF SVM 0.77, Naive Bayes 0.75. Transformers achieved better results with weighted precision 0.82, recall 0.80, F1 0.79, macro F1 0.79, accuracy 0.80.
Conclusion: While traditional ML models remain agile, transformers detect subtle semantics of hope better, suggesting larger transformers and LLMs could perform better on small datasets for hope speech detection.
Abstract: The identification of hope speech has become a promising NLP task, considering the need to detect motivational expressions of agency and goal-directed behaviour on social media platforms. This work evaluates traditional machine learning models and fine-tuned transformers on a hope speech dataset pre-split into train, development, and test sets. On the development set, a linear-kernel SVM and logistic regression both reached a macro-F1 of 0.78; an SVM with RBF kernel reached 0.77, and Naïve Bayes hit 0.75. Transformer models delivered better results: the best model achieved a weighted precision of 0.82, weighted recall of 0.80, weighted F1 of 0.79, macro F1 of 0.79, and 0.80 accuracy. These results suggest that while optimally configured traditional machine learning models remain agile, transformer architectures capture subtle semantics of hope to achieve higher precision and recall in hope speech detection, suggesting that larger transformers and LLMs could perform even better on small datasets.
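For reference, baselines like the traditional models above can be reproduced in a few lines of scikit-learn; the sketch below assumes TF-IDF features (a detail the summary does not specify) and in-memory train/dev splits.

```python
# A minimal sketch of the traditional baselines: linear-kernel SVM and
# logistic regression over TF-IDF features (the feature choice is an
# assumption). X_* are lists of texts, y_* the hope-speech labels.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def macro_f1(model, X_train, y_train, X_dev, y_dev) -> float:
    pipe = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
                     ("clf", model)])
    pipe.fit(X_train, y_train)
    return f1_score(y_dev, pipe.predict(X_dev), average="macro")

# for model in (LinearSVC(), LogisticRegression(max_iter=1000)):
#     print(type(model).__name__, macro_f1(model, X_train, y_train, X_dev, y_dev))
```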
[118] Think Twice: Branch-and-Rethink Reasoning Reward Model
Yizhu Jiao, Jiaqi Zeng, Julien Veron Vialard, Oleksii Kuchaiev, Jiawei Han, Olivier Delalleau
Main category: cs.CL
TL;DR: BR-RM is a two-turn reward model that applies think-twice reasoning to reward modeling, using adaptive branching in turn 1 to select critical dimensions and branch-conditioned rethinking in turn 2 for targeted analysis, reducing judgment diffusion and improving error detection.
Details
Motivation: Traditional reward models compress multiple quality dimensions into single scalar scores, causing judgment diffusion where attention spreads across criteria, leading to diluted focus and shallow analysis.
Method: Two-turn approach: Turn 1 performs adaptive branching to select instance-critical dimensions and create evidence-seeking hypotheses; Turn 2 executes branch-conditioned rethinking to test hypotheses with targeted rereading. Trained with GRPO-style RL using binary outcome rewards and format checks.
Result: Achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains, reduces judgment diffusion, and improves sensitivity to subtle yet consequential errors.
Conclusion: BR-RM successfully transfers think-twice principles to reward modeling, providing a practical and scalable solution that maintains compatibility with standard RLHF pipelines while enhancing reasoning quality.
Abstract: Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains. The code and the model will be released soon.
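A rough sketch of the two-turn control flow follows, with `chat` as a hypothetical LLM call and prompts that paraphrase the described procedure rather than reproduce the authors' templates.

```python
# A hedged sketch of branch-and-rethink inference: turn 1 branches to a few
# instance-critical dimensions, turn 2 rethinks conditioned on that branch.
# `chat` is a hypothetical LLM call; prompt wording is illustrative.

from typing import Callable

def branch_and_rethink(chat: Callable[[str], str], prompt: str, response: str) -> str:
    # Turn 1: adaptive branching - pick critical dimensions, sketch hypotheses.
    branch = chat(
        "Select the most instance-critical evaluation dimensions (e.g., "
        "factuality, safety) and state a concise evidence-seeking hypothesis "
        f"for each.\n\nPrompt: {prompt}\nResponse: {response}"
    )
    # Turn 2: branch-conditioned rethinking - targeted reread, then a verdict.
    return chat(
        "Reread the response and test each hypothesis, scrutinizing only what "
        f"matters most; end with a binary judgment.\n\nHypotheses:\n{branch}\n\n"
        f"Prompt: {prompt}\nResponse: {response}"
    )
```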
[119] DocFinQA: A Long-Context Financial Reasoning Dataset
Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Michael Krumdick, Charles Lovering, Chris Tanner
Main category: cs.CL
TL;DR: DocFinQA introduces a long-document financial QA task by extending FinQA with full-document contexts, increasing average context length from 700 to 123k words, presenting significant challenges for state-of-the-art systems.
Details
Motivation: Most financial research datasets use short excerpts, but financial professionals actually work with documents hundreds of pages long, creating a gap between research and real-world applications.
Method: Augmented 7,437 questions from FinQA with full-document context, creating DocFinQA. Conducted experiments with retrieval-based QA pipelines and long-context language models.
Result: DocFinQA proved significantly challenging for state-of-the-art systems, with models particularly struggling on the longest documents in the dataset.
Conclusion: Addressing these long-document challenges could benefit applications requiring specificity and long-range contexts, such as gene sequences and legal document analysis.
Abstract: For large language models (LLMs) to be effective in the financial domain – where each decision can have a significant impact – it is necessary to investigate realistic tasks and data. Financial professionals often interact with documents that are hundreds of pages long, but most financial research datasets only deal with short excerpts from these documents. To address this, we introduce a long-document financial QA task. We augment 7,437 questions from the existing FinQA dataset with the full-document context, extending the average context length from under 700 words in FinQA to 123k words in DocFinQA. We conduct extensive experiments over retrieval-based QA pipelines and long-context language models. DocFinQA proves a significant challenge for even state-of-the-art systems. We also provide a case study on the longest documents in DocFinQA and find that models particularly struggle on these documents. Addressing these challenges may have a wide-reaching impact across applications where specificity and long-range contexts are critical, like gene sequences and legal document contract analysis.
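As a point of reference, a retrieval-based pipeline of the kind evaluated here can be sketched as chunk-embed-retrieve; the encoder choice and chunk size below are assumptions, not the paper's configuration.

```python
# A minimal retrieve-then-read sketch for long filings: split the document
# into fixed-size word chunks, embed, and keep the top-k chunks for the
# question. Encoder name and chunk size are illustrative assumptions.

import numpy as np
from sentence_transformers import SentenceTransformer

def retrieve_chunks(document: str, question: str, k: int = 5, size: int = 200):
    words = document.split()
    chunks = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    chunk_emb = encoder.encode(chunks, normalize_embeddings=True)
    q_emb = encoder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(chunk_emb @ q_emb)[::-1][:k]      # cosine similarity
    return [chunks[i] for i in sorted(top)]            # keep document order
```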
[120] FaithLM: Towards Faithful Explanations for Large Language Models
Yu-Neng Chuang, Guanchu Wang, Chia-Yuan Chang, Ruixiang Tang, Shaochen Zhong, Fan Yang, Mengnan Du, Xuanting Cai, Vladimir Braverman, Xia Hu
Main category: cs.CL
TL;DR: FaithLM is a model-agnostic framework that evaluates and improves LLM explanation faithfulness by measuring prediction shifts when explanations are contradicted, without requiring token masking or task-specific heuristics.
Details
Motivation: LLM-generated explanations often lack faithfulness and don't reliably reflect the evidence models use for decisions, creating a need for methods to evaluate and improve explanation reliability.
Method: FaithLM formalizes faithfulness as an intervention property - faithful explanations should cause prediction shifts when contradicted. It uses a contrary-hint score to measure this, then iteratively refines both elicitation prompts and explanations to maximize faithfulness scores.
Result: Experiments on three multi-domain datasets with multiple LLM backbones show FaithLM consistently increases faithfulness and produces explanations more aligned with human rationales than strong self-explanation baselines.
Conclusion: Intervention-based evaluation coupled with iterative optimization provides a principled approach toward achieving faithful and reliable LLM explanations.
Abstract: Large language models (LLMs) increasingly produce natural language explanations, yet these explanations often lack faithfulness, and they do not reliably reflect the evidence the model uses to decide. We introduce FaithLM, a model-agnostic framework that evaluates and improves the faithfulness of LLM explanations without token masking or task-specific heuristics. FaithLM formalizes explanation faithfulness as an intervention property: a faithful explanation should yield a prediction shift when its content is contradicted. Theoretical analysis shows that the resulting contrary-hint score is a sound and discriminative estimator of faithfulness. Building on this principle, FaithLM iteratively refines both the elicitation prompt and the explanation to maximize the measured score. Experiments on three multi-domain datasets and multiple LLM backbones demonstrate that FaithLM consistently increases faithfulness and produces explanations more aligned with human rationales than strong self-explanation baselines. These findings highlight that intervention-based evaluation, coupled with iterative optimization, provides a principled route toward faithful and reliable LLM explanations.
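The intervention property lends itself to a compact sketch: contradict the explanation, re-query the model, and score the prediction shift. `predict` and `contradict` below are hypothetical helpers, and the paper's actual scoring may differ.

```python
# A hedged sketch of a contrary-hint style check: a faithful explanation
# should shift the prediction when its content is contradicted. Helper
# functions are hypothetical stand-ins for LLM calls.

from typing import Callable

def contrary_hint_score(
    predict: Callable[[str], str],      # LLM call returning a label
    contradict: Callable[[str], str],   # rewrites the explanation's negation
    question: str,
    explanation: str,
) -> float:
    baseline = predict(question)
    contradicted = predict(f"{question}\nHint: {contradict(explanation)}")
    # 1.0 when the contradiction shifts the prediction; in practice this is
    # averaged over a dataset to estimate faithfulness.
    return float(contradicted != baseline)
```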
[121] How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?
Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras
Main category: cs.CL
TL;DR: Vocabulary expansion for faster non-English LLM inference in low-resource settings using only 30K target language sentences.
Details
Motivation: LLMs have higher inference costs for non-English text due to English-centric tokenizers, and existing vocabulary expansion methods require substantial target language data.
Method: Investigate vocabulary expansion in low-resource settings by exploring embedding initialization methods and continual pre-training strategies with minimal target language data.
Result: Established strategies for vocabulary expansion that achieve faster inference while maintaining competitive downstream performance across diverse languages and tasks.
Conclusion: Vocabulary expansion can be effectively performed in low-resource settings with only 30K sentences, enabling faster inference for non-English languages while preserving model performance.
Abstract: Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, previous work on vocabulary expansion has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion in low-resource settings has yet to be explored. In this article, we investigate vocabulary expansion in low-resource settings by considering embedding initialization methods and continual pre-training strategies. Through extensive experiments across typologically diverse languages, tasks and models, we establish a set of strategies to perform vocabulary expansion for faster inference, while striving to maintain competitive downstream performance to baselines. This is achieved with only 30K sentences ($\sim$0.01GB text data) from the target language.
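One widely used embedding-initialization heuristic in this family is to initialize each new token as the mean of its old-tokenizer subword embeddings; the sketch below shows that heuristic with Hugging Face APIs and is not claimed to be the authors' exact recipe.

```python
# A hedged sketch of mean-subword embedding initialization for vocabulary
# expansion (one heuristic among those such work explores). Assumes the new
# tokens have already been added to the tokenizer.

import torch

@torch.no_grad()
def init_new_embeddings(model, old_tokenizer, new_tokens):
    old_size = model.get_input_embeddings().weight.shape[0]
    model.resize_token_embeddings(old_size + len(new_tokens))
    emb = model.get_input_embeddings().weight
    for i, token in enumerate(new_tokens):
        pieces = old_tokenizer(token, add_special_tokens=False)["input_ids"]
        if pieces:  # mean of the constituent subword embeddings
            emb[old_size + i] = emb[pieces].mean(dim=0)
```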
[122] Can Large Language Models Unlock Novel Scientific Research Ideas?
Sandeep Kumar, Tirthankar Ghosal, Vinayak Goyal, Asif Ekbal
Main category: cs.CL
TL;DR: This paper proposes automated evaluation metrics (IAScore and Idea Distinctness Index) for assessing LLM-generated research ideas, addressing the lack of scalable evaluation methods for this challenging task.
Details
Motivation: The integration of AI into everyday life through LLMs like ChatGPT creates a need to evaluate their ability to generate future research ideas. Current manual evaluation is time-consuming, costly, and non-scalable due to the rapid pace of new LLM releases.
Method: Proposed two automated evaluation metrics: Idea Alignment Score (IAScore) and Idea Distinctness Index. Also conducted human evaluation to assess novelty, relevance, and feasibility of generated ideas.
Result: The study provides insights into LLMs’ capabilities and limitations in research idea generation. The proposed metrics offer scalable evaluation alternatives to manual assessment.
Conclusion: This work contributes to evaluating and utilizing language models for generating future research ideas, with publicly available datasets and codes to support further research.
Abstract: The widespread adoption of Large Language Models (LLMs) and publicly available ChatGPT have marked a significant turning point in the integration of Artificial Intelligence (AI) into people’s everyday lives. This study examines the ability of Large Language Models (LLMs) to generate future research ideas from scientific papers. Unlike tasks such as summarization or translation, idea generation lacks a clearly defined reference set or structure, making manual evaluation the default standard. However, human evaluation in this setting is extremely challenging: it requires substantial domain expertise, contextual understanding of the paper, and awareness of the current research landscape. This makes it time-consuming, costly, and fundamentally non-scalable, particularly as new LLMs are being released at a rapid pace. Currently, there is no automated evaluation metric specifically designed for this task. To address this gap, we propose two automated evaluation metrics: Idea Alignment Score (IAScore) and Idea Distinctness Index. We further conducted human evaluation to assess the novelty, relevance, and feasibility of the generated future research ideas. This investigation offers insights into the evolving role of LLMs in idea generation, highlighting both their capabilities and limitations. Our work contributes to the ongoing efforts in evaluating and utilizing language models for generating future research ideas. We make our datasets and code publicly available.
[123] TrendFact: A Benchmark for Explainable Hotspot Perception in Fact-Checking with Natural Language Explanation
Xiaocheng Zhang, Xi Wang, Yifei Lu, Jianing Wang, Zhuangzhuang Ye, Mengjiao Bao, Peng Yan, Xiaohong Su
Main category: cs.CL
TL;DR: TrendFact is a comprehensive fact-checking benchmark addressing limitations of existing English-centric benchmarks by evaluating hotspot perception ability and all fact-checking tasks with new metrics ECS and HCPI.
Details
Motivation: Existing fact-checking benchmarks fail to address transparency concerns, accuracy for high-influence events, and are predominantly English-centric, hindering comprehensive fact-checking progress.
Method: Introduces TrendFact benchmark with 7,643 curated samples from trending platforms and fact-checking datasets, plus an evidence library of 366,634 entries. Proposes FactISR framework integrating dynamic evidence augmentation with influence score-based iterative self-reflection for reasoning LLMs.
Result: Current fact-checking systems show significant limitations on TrendFact. FactISR effectively improves reasoning LLM performance on explainable and complex fact-checking tasks.
Conclusion: TrendFact facilitates development of more robust fact-checking methods, while FactISR offers new insights into explainable and complex fact-checking by enhancing reasoning LLM capabilities.
Abstract: Fact-checking benchmarks provide standardized testing criteria for automated fact-checking systems, driving technological advancement. With the surge of misinformation on social media and the emergence of various fact-checking methods, public concern about the transparency of automated systems and the accuracy of fact-checking for high-influence events has grown. However, existing benchmarks fail to meet these urgent needs and are predominantly English-centric, hindering the progress of comprehensive fact-checking. To address these issues, we introduce TrendFact, the first benchmark capable of evaluating hotspot perception ability (HPA) and all fact-checking tasks. TrendFact consists of 7,643 curated samples sourced from trending platforms and professional fact-checking datasets, as well as an evidence library containing 366,634 entries with publication dates. Additionally, to complement existing benchmarks in evaluating system explanation consistency and HPA, we propose two new metrics: ECS and HCPI. Experimental results show that current fact-checking systems face significant limitations when evaluated on TrendFact, which facilitates the development of more robust fact-checking methods. Furthermore, to enhance the capabilities of existing advanced fact-checking systems, namely reasoning large language models (RLMs), we propose FactISR, a reasoning framework that integrates dynamic evidence augmentation with influence score-based iterative self-reflection. FactISR effectively improves RLMs’ performance, offering new insights into explainable and complex fact-checking.
[124] TrajAgent: An LLM-Agent Framework for Trajectory Modeling via Large-and-Small Model Collaboration
Yuwei Du, Jie Feng, Jie Zhao, Yong Li
Main category: cs.CL
TL;DR: TrajAgent is an LLM-powered agent framework for automated trajectory modeling that achieves 2.38%-69.91% performance improvement over baselines across four tasks and datasets.
Details
Motivation: Trajectory modeling faces challenges due to data heterogeneity and task diversity, making effective modeling difficult even for domain experts. There's a need for automated solutions that can handle various trajectory tasks across different datasets.
Method: Proposes TrajAgent framework with: 1) UniEnv - unified execution environment for data and models, 2) agentic workflow for automatic trajectory modeling, 3) collaborative learning between LLM agents and specialized models.
Result: Achieves 2.38%-69.91% performance improvement over baseline methods across four trajectory tasks using four real-world datasets.
Conclusion: TrajAgent effectively enables automated trajectory modeling through LLM-powered agents and collaborative learning, demonstrating significant performance improvements across diverse tasks and datasets.
Abstract: Trajectory modeling, which includes research on trajectory data pattern mining and future prediction, has widespread applications in areas such as life services, urban transportation, and public administration. Numerous methods have been proposed to address specific problems within trajectory modeling. However, the heterogeneity of data and the diversity of trajectory tasks make effective and reliable trajectory modeling an important yet highly challenging endeavor, even for domain experts. In this paper, we propose TrajAgent, an agent framework powered by large language models (LLMs), designed to facilitate robust and efficient trajectory modeling through automated modeling. This framework leverages and optimizes diverse specialized models to address various trajectory modeling tasks across different datasets effectively. In TrajAgent, we first develop UniEnv, an execution environment with a unified data and model interface, to support the execution and training of various models. Building on UniEnv, we introduce an agentic workflow designed for automatic trajectory modeling across various trajectory tasks and data. Furthermore, we introduce a collaborative learning schema between LLM-based agents and small specialized models to enhance the performance of the whole framework effectively. Extensive experiments on four tasks using four real-world datasets demonstrate the effectiveness of TrajAgent in automated trajectory modeling, achieving a performance improvement of 2.38%-69.91% over baseline methods. The codes and data can be accessed via https://github.com/tsinghua-fib-lab/TrajAgent.
[125] Fine-tuning Large Language Models with Limited Data: A Survey and Practical Guide
Marton Szep, Daniel Rueckert, Rüdiger von Eisenhart-Rothe, Florian Hinterwimmer
Main category: cs.CL
TL;DR: A comprehensive survey of methods for fine-tuning large language models in data-scarce scenarios, covering parameter-efficient techniques, domain adaptation, and preference alignment approaches.
Details
Motivation: Fine-tuning LLMs with limited data is challenging in low-resource languages, specialized domains, and constrained deployment settings, requiring efficient adaptation techniques.
Method: Systematic review of parameter-efficient fine-tuning, domain/cross-lingual adaptation methods for encoder/decoder models, model specialization strategies, and preference alignment approaches using limited feedback.
Result: Provides empirical trade-offs, selection criteria, and best practices for choosing suitable techniques based on task constraints like model scaling, data scaling, and mitigating catastrophic forgetting.
Conclusion: Equips researchers and practitioners with actionable insights for effectively fine-tuning LLMs when data and resources are limited.
Abstract: Fine-tuning large language models (LLMs) with limited data poses a practical challenge in low-resource languages, specialized domains, and constrained deployment settings. While pre-trained LLMs provide strong foundations, effective adaptation under data scarcity requires focused and efficient fine-tuning techniques. This paper presents a structured and practical survey of recent methods for fine-tuning LLMs in data-scarce scenarios. We systematically review parameter-efficient fine-tuning techniques that lower training and deployment costs, domain and cross-lingual adaptation methods for both encoder and decoder models, and model specialization strategies. We further examine preference alignment approaches that guide model behavior using limited human or synthetic feedback, emphasizing sample and compute efficiency. Throughout, we highlight empirical trade-offs, selection criteria, and best practices for choosing suitable techniques based on task constraints, including model scaling, data scaling, and the mitigation of catastrophic forgetting. The aim is to equip researchers and practitioners with actionable insights for effectively fine-tuning LLMs when data and resources are limited.
[126] Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion
Syed Zohaib Hassan, Pierre Lison, Pål Halvorsen
Main category: cs.CL
TL;DR: Fine-tuning LLMs with LoRA to insert disfluencies improves perceived spontaneity of synthesized speech but slightly reduces intelligibility.
Details
Motivation: Disfluencies are natural in human speech but absent from LLM outputs, reducing the naturalness of synthesized speech for conversational agents.
Method: Fine-tune LLM with LoRA to incorporate disfluencies, then synthesize using TTS model that supports speech phenomena generation.
Result: User study showed significant increase in perceived spontaneity but slight reduction in intelligibility.
Conclusion: Disfluency insertion enhances speech naturalness for conversational agents despite minor intelligibility trade-off.
Abstract: Disfluencies are a natural feature of spontaneous human speech but are typically absent from the outputs of Large Language Models (LLMs). This absence can diminish the perceived naturalness of synthesized speech, which is an important criterion when building conversational agents that aim to mimic human behaviours. We show how the insertion of disfluencies can alleviate this shortcoming. The proposed approach involves (1) fine-tuning an LLM with Low-Rank Adaptation (LoRA) to incorporate various types of disfluencies into LLM-generated utterances and (2) synthesizing those utterances using a text-to-speech model that supports the generation of speech phenomena such as disfluencies. We evaluated the quality of the generated speech across two metrics: intelligibility and perceived spontaneity. We demonstrate through a user study that the insertion of disfluencies significantly increases the perceived spontaneity of the generated speech. This increase came, however, with a slight reduction in intelligibility.
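Step (1) amounts to a standard LoRA fine-tune on pairs of fluent and disfluency-augmented utterances; a minimal sketch with the `peft` library follows, where the base checkpoint and hyperparameters are illustrative assumptions.

```python
# A minimal LoRA setup for step (1); the base model, rank, and target
# modules are illustrative assumptions, not the paper's configuration.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
# ...then fine-tune on (fluent utterance -> utterance with disfluencies) pairs
```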
[127] Computational Analysis of Character Development in Holocaust Testimonies
Esther Shizgal, Eitan Wagner, Renana Keydar, Omri Abend
Main category: cs.CL
TL;DR: Computational analysis of character development in Holocaust survivor testimonies, focusing on religious trajectory evolution through belief and practice patterns.
Details
Motivation: To develop methods for analyzing character development along narrative timelines, specifically examining the interplay between inner and outer changes in protagonists, using Holocaust survivor testimonies as a case study.
Method: Natural language processing approach analyzing transcripts of Holocaust survivor testimonies, clustering religious trajectories to identify common sequences in belief and practice evolution.
Result: Identified multiple common structures: constant disposition in religious belief and oscillating structure in religious practice across most narratives.
Conclusion: Demonstrates NLP’s potential for analyzing character evolution through thematic trajectories in narratives, providing valuable material for historical and sociological research.
Abstract: This work presents a computational approach to analyze character development along the narrative timeline. The analysis characterizes the inner and outer changes the protagonist undergoes within a narrative, and the interplay between them. We consider transcripts of Holocaust survivor testimonies as a test case, each telling the story of an individual in first-person terms. We focus on the survivor’s religious trajectory, examining the evolution of their disposition toward religious belief and practice along the testimony. Clustering the resulting trajectories in the dataset, we identify common sequences in the data. Our findings highlight multiple common structures of religiosity across the narratives: in terms of belief, most present a constant disposition, while for practice, most present an oscillating structure, serving as valuable material for historical and sociological research. This work demonstrates the potential of natural language processing techniques for analyzing character evolution through thematic trajectories in narratives.
[128] AttentionPredictor: Temporal Patterns Matter for KV Cache Compression
Qingyue Yang, Jie Wang, Xing Li, Zhihai Wang, Chen Chen, Lei Chen, Xianzhi Yu, Wulong Liu, Jianye Hao, Mingxuan Yuan, Bin Li
Main category: cs.CL
TL;DR: AttentionPredictor is a learning-based KV cache compression method that uses a lightweight convolution model to predict attention patterns, achieving 13x compression and 5.6x speedup while maintaining LLM performance.
Details
Motivation: Existing KV cache compression methods struggle with accurately identifying critical tokens because they neglect temporal patterns in attention scores, leading to performance degradation in LLMs.
Method: Proposes AttentionPredictor - a lightweight unified convolution model that dynamically captures spatiotemporal patterns to predict next-token attention scores, and a cross-token critical cache prefetching framework to hide estimation overhead.
Result: Achieves 13x KV cache compression and 5.6x speedup in cache offloading scenarios while maintaining comparable LLM performance, significantly outperforming state-of-the-art methods.
Conclusion: AttentionPredictor effectively addresses the limitations of static attention modeling by learning dynamic spatiotemporal patterns, enabling efficient KV cache compression without compromising LLM performance.
Abstract: With the development of large language models (LLMs), efficient inference through Key-Value (KV) cache compression has attracted considerable attention, especially for long-context generation. To compress the KV cache, recent methods identify critical KV tokens through static modeling of attention scores. However, these methods often struggle to accurately determine critical tokens as they neglect the temporal patterns in attention scores, resulting in a noticeable degradation in LLM performance. To address this challenge, we propose AttentionPredictor, which is the first learning-based method to directly predict attention patterns for KV cache compression and critical token identification. Specifically, AttentionPredictor learns a lightweight, unified convolution model to dynamically capture spatiotemporal patterns and predict the next-token attention scores. An appealing feature of AttentionPredictor is that it accurately predicts the attention score and shares the unified prediction model, which consumes negligible memory, among all transformer layers. Moreover, we propose a cross-token critical cache prefetching framework that hides the token estimation time overhead to accelerate the decoding stage. By retaining most of the attention information, AttentionPredictor achieves 13$\times$ KV cache compression and 5.6$\times$ speedup in a cache offloading scenario with comparable LLM performance, significantly outperforming state-of-the-art methods. The code is available at https://github.com/MIRALab-USTC/LLM-AttentionPredictor.
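The predictor itself can be pictured as a small temporal convolution over recent attention history; the sketch below conveys that shape, with layer sizes and window length as assumptions rather than the paper's configuration.

```python
# A hedged sketch of a lightweight convolutional attention predictor: given
# the last `window` steps of attention scores, predict the next step's
# scores and keep only predicted-critical tokens. Sizes are illustrative.

import torch
import torch.nn as nn

class AttnPredictor(nn.Module):
    def __init__(self, window: int = 8, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(                 # temporal conv over history
            nn.Conv1d(window, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, window, seq_len) past attention scores
        return self.net(history).squeeze(1)       # (batch, seq_len) prediction

# scores = AttnPredictor()(torch.rand(1, 8, 1024))
# keep = scores.topk(k=256, dim=-1).indices       # retain predicted-critical KV
```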
[129] When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning
Yijiang River Dong, Tiancheng Hu, Yinhong Liu, Ahmet Üstün, Nigel Collier
Main category: cs.CL
TL;DR: The paper presents a comprehensive evaluation framework for personalized preference learning in LLMs, revealing significant performance gaps (up to 36%) and safety misalignment (up to 20%) across different methods, emphasizing the need for holistic assessment.
Details
Motivation: Current RLHF assumes homogeneous user preferences, overlooking diverse human values and minority viewpoints. Personalized preference learning addresses this but lacks standardized evaluation methods to assess effectiveness.
Method: Developed a multi-faceted evaluation framework measuring performance, fairness, unintended effects, and adaptability across preference divergence levels. Compared eight personalization methods across three preference datasets.
Result: Performance differences between methods reached 36% when users strongly disagreed, and personalization introduced up to 20% safety misalignment. Methods showed varying effectiveness across different preference scenarios.
Conclusion: The findings highlight the critical need for holistic evaluation approaches to advance the development of more effective and inclusive preference learning systems that account for diverse human values.
Abstract: While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority viewpoints. Although personalized preference learning addresses this by tailoring separate preferences for individual users, the field lacks standardized methods to assess its effectiveness. We present a multi-faceted evaluation framework that measures not only performance but also fairness, unintended effects, and adaptability across varying levels of preference divergence. Through extensive experiments comparing eight personalization methods across three preference datasets, we demonstrate that performance differences between methods could reach 36% when users strongly disagree, and personalization can introduce up to 20% safety misalignment. These findings highlight the critical need for holistic evaluation approaches to advance the development of more effective and inclusive preference learning systems.
[130] FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge
Nakyeong Yang, Minsung Kim, Seunghyun Yoon, Joongbo Shin, Kyomin Jung
Main category: cs.CL
TL;DR: The paper introduces a new benchmark FaithUn and method KLUE to address superficial unlearning in language models, where current methods fail to properly handle interconnected knowledge.
Details
Motivation: Prior unlearning methods overlook the complex interconnected nature of knowledge, failing to evaluate whether they faithfully erase interconnected knowledge while retaining relevant but different context knowledge.
Method: Proposes KLUE method that identifies knowledge neurons using explainability methods and updates only those neurons using selected unforgotten samples to achieve faithful unlearning.
Result: Experimental results show widely-used unlearning methods fail to ensure faithful unlearning, while KLUE demonstrates significant effectiveness in real-world QA unlearning.
Conclusion: The paper successfully addresses the superficial unlearning problem through a new benchmark and KLUE method, achieving faithful knowledge removal while preserving relevant knowledge in different contexts.
Abstract: Various studies have attempted to remove sensitive or private knowledge from a language model to prevent its unauthorized exposure. However, prior studies have overlooked the complex and interconnected nature of knowledge, where related knowledge must be carefully examined. Specifically, they have failed to evaluate whether an unlearning method faithfully erases interconnected knowledge that should be removed, retaining knowledge that appears relevant but exists in a completely different context. To resolve this problem, we first define a new concept called superficial unlearning, which refers to the phenomenon where an unlearning method either fails to erase the interconnected knowledge it should remove or unintentionally erases irrelevant knowledge. Based on the definition, we introduce a new benchmark, FaithUn, to analyze and evaluate the faithfulness of unlearning in real-world knowledge QA settings. Furthermore, we propose a novel unlearning method, KLUE, which updates only knowledge-related neurons to achieve faithful unlearning. KLUE identifies knowledge neurons using an explainability method and updates only those neurons using selected unforgotten samples. Experimental results demonstrate that widely-used unlearning methods fail to ensure faithful unlearning, while our method shows significant effectiveness in real-world QA unlearning.
[131] Reasoning is Periodicity? Improving Large Language Models Through Effective Periodicity Modeling
Yihong Dong, Ge Li, Xue Jiang, Yongding Tao, Kechi Zhang, Hao Zhu, Huanyu Liu, Jiazheng Ding, Jia Li, Jinliang Deng, Hong Mei
Main category: cs.CL
TL;DR: FANformer integrates Fourier Analysis Network into attention mechanism for efficient periodicity modeling, outperforming Transformer in language modeling and downstream tasks with superior learning efficiency and reasoning capabilities.
Details
Motivation: Periodicity is crucial for structured knowledge acquisition, but current Transformer-based LLMs have flaws in periodicity modeling that affect learning efficiency and principle establishment from data.
Method: FANformer adapts Fourier Analysis Network (FAN) into attention mechanism by modifying the feature projection process to achieve efficient periodicity modeling.
Result: FANformer consistently outperforms Transformer when scaling up model size and training tokens, with pretrained FANformer-1B showing marked improvements on downstream tasks compared to similar-sized open-source LLMs, and superior rule learning for reasoning.
Conclusion: FANformer is an effective and promising architecture for advancing LLMs with superior learning efficiency and reasoning capabilities.
Abstract: Periodicity, as one of the most important basic characteristics, lays the foundation for facilitating structured knowledge acquisition and systematic cognitive processes within human learning paradigms. However, the potential flaws of periodicity modeling in Transformer affect the learning efficiency and establishment of underlying principles from data for large language models (LLMs) built upon it. In this paper, we demonstrate that integrating effective periodicity modeling can improve the learning efficiency and performance of LLMs. We introduce FANformer, which adapts Fourier Analysis Network (FAN) into attention mechanism to achieve efficient periodicity modeling, by modifying the feature projection process of attention mechanism. Extensive experimental results on language modeling show that FANformer consistently outperforms Transformer when scaling up model size and training tokens, underscoring its superior learning efficiency. Our pretrained FANformer-1B exhibits marked improvements on downstream tasks compared to open-source LLMs with similar model parameters or training tokens. Moreover, we reveal that FANformer exhibits superior ability to learn and apply rules for reasoning compared to Transformer. The results position FANformer as an effective and promising architecture for advancing LLMs.
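A FAN-style projection routes part of the hidden state through cos/sin to expose periodic structure; the sketch below shows that idea in isolation, with the periodic fraction and activation as assumptions rather than FANformer's exact design.

```python
# A hedged sketch of a FAN-style feature projection: a slice of the output is
# produced by cos/sin of a linear map (periodic part), the rest by a standard
# nonlinear map. Proportions and activation are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FANProjection(nn.Module):
    def __init__(self, d_model: int, periodic_ratio: float = 0.25):
        super().__init__()
        self.d_p = int(d_model * periodic_ratio) // 2      # per cos/sin branch
        self.w_p = nn.Linear(d_model, self.d_p, bias=False)
        self.w_g = nn.Linear(d_model, d_model - 2 * self.d_p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = self.w_p(x)
        # periodic components (cos, sin) concatenated with a standard branch
        return torch.cat([torch.cos(p), torch.sin(p), F.gelu(self.w_g(x))], dim=-1)
```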
[132] Beyond QA Pairs: Assessing Parameter-Efficient Fine-Tuning for Fact Embedding in LLMs
Shivam Ratnakar, Abhiroop Talasila, Raghav Chamadiya, Nikhil Agarwal, Vinayak K Doifode
Main category: cs.CL
TL;DR: PEFT for embedding domain facts into LLMs is enhanced by categorizing QA pairs into Factual/Conceptual classes and using synthetic dataset generation techniques, with conceptual training outperforming factual training.
Details
Motivation: To improve the fine-tuning process for embedding domain-specific facts into LLMs by exploring QA pair categorization and synthetic dataset generation methods.
Method: Used BERT-based classifier to categorize QA pairs into Factual and Conceptual classes, fine-tuned two Llama-2 models based on these classifications, compared D-RAG and D-Naive synthetic dataset generation techniques, and evaluated with GPT-3.5 Turbo and Gemini.
Result: Models trained on conceptual datasets outperformed factual ones, D-Naive synthetic generation technique showed superior performance, and fine-tuned Llama-2 7B significantly outperformed baseline in generating product recommendations on a 1000-sample dataset.
Conclusion: PEFT may not be optimal for fact embedding but excels in instruction-based tasks, highlighting the importance of QA pair categorization and synthetic dataset generation for enhancing LLM performance in specific domains.
Abstract: This paper presents an extensive examination of Parameter-Efficient Fine-Tuning (PEFT) for embedding domain specific facts into Large Language Models (LLMs), focusing on improving the fine-tuning process by categorizing question-answer (QA) pairs into Factual and Conceptual classes using a BERT-based classifier. Two distinct Llama-2 models are fine-tuned based on these classifications and evaluated using larger models like GPT-3.5 Turbo and Gemini. Our results indicate that models trained on conceptual datasets outperform those trained on factual datasets. Additionally, we compare the efficiency of two synthetic fine-tuning dataset generation techniques, D-RAG and D-Naive, with D-Naive demonstrating superior performance. Although PEFT has shown effectiveness, our research indicates that it may not be the most optimal method for embedding facts into LLMs. However, it has demonstrated exceptional performance in instruction-based tasks. Our findings are reinforced by a 1000-sample dataset in the data center domain, where the fine-tuned Llama-2 7B model significantly outperforms the baseline model in generating product recommendations. Our study highlights the importance of QA pair categorization and synthetic dataset generation techniques in enhancing the performance of LLMs in specific domains.
[133] Superficial Self-Improved Reasoners Benefit from Model Merging
Xiangchi Yuan, Chunhui Zhang, Zheyuan Liu, Dachuan Shi, Leyan Pan, Soroush Vosoughi, Wenke Lee
Main category: cs.CL
TL;DR: Self-improvement in language models leads to superficial learning where models memorize in-domain tasks but lose out-of-domain generalization. The paper proposes Iterative Model Merging (IMM) to preserve genuine reasoning improvements while maintaining generalization.
Details
Motivation: As language models approach human-level reasoning, self-improvement is seen as a solution for high-quality data synthesis, but current methods risk model collapse and superficial learning where models memorize rather than genuinely improve reasoning capabilities.
Method: The paper proposes Iterative Model Merging (IMM), a method that strategically combines weights from original and self-improved models to preserve generalization while incorporating genuine reasoning improvements.
Result: Analysis shows that during self-improvement, LM weight updates concentrate in less reasoning-critical layers, leading to superficial learning where models show improved in-domain accuracy but compromised out-of-domain generalization due to memorization.
Conclusion: IMM effectively mitigates both LM collapse and superficial learning, moving towards more stable self-improving systems that preserve generalization while incorporating genuine reasoning improvements.
Abstract: As scaled language models (LMs) approach human-level reasoning capabilities, self-improvement emerges as a solution for synthesizing a high-quality data corpus. While previous research has identified model collapse as a risk in self-improvement, where model outputs become increasingly deterministic, we discover a more fundamental challenge: the superficial self-improved reasoners phenomenon. In particular, our analysis reveals that even when LMs show improved in-domain (ID) reasoning accuracy, they actually compromise their generalized reasoning capabilities on out-of-domain (OOD) tasks due to memorization rather than genuine improvement. Through a systematic investigation of LM architecture, we discover that during self-improvement, LM weight updates are concentrated in less reasoning-critical layers, leading to superficial learning. To address this, we propose Iterative Model Merging (IMM), a method that strategically combines weights from original and self-improved models to preserve generalization while incorporating genuine reasoning improvements. Our approach effectively mitigates both LM collapse and superficial learning, moving towards more stable self-improving systems.
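At its core, the merge is a weight-space interpolation applied after each self-improvement round; a minimal sketch follows, with the interpolation coefficient as an assumed hyperparameter.

```python
# A minimal sketch of IMM-style merging: interpolate the previous merged
# weights with the newly self-improved ones each round. `alpha` is an
# assumed hyperparameter; tensors are assumed floating-point parameters.

import torch

@torch.no_grad()
def merge_state_dicts(base: dict, improved: dict, alpha: float = 0.5) -> dict:
    return {k: (1 - alpha) * base[k] + alpha * improved[k] for k in base}

# merged = initial_model_state
# for _ in range(num_rounds):
#     improved = self_improve(merged)               # hypothetical training step
#     merged = merge_state_dicts(merged, improved)  # preserve generalization
```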
[134] AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation
Yixiong Fang, Tianran Sun, Yuling Shi, Xiaodong Gu
Main category: cs.CL
TL;DR: AttentionRAG is an attention-guided context pruning method for RAG systems that achieves up to 6.3x context compression while outperforming existing methods by around 10% in key metrics.
Details
Motivation: RAG effectiveness is hindered by long retrieved contexts causing information redundancy and computational overhead. Existing methods like LLMLingua lack contextual awareness and flexible compression control, leading to insufficient pruning or excessive information loss.Method: AttentionRAG uses an attention focus mechanism that reformulates RAG queries into a next-token prediction paradigm, isolating the query’s semantic focus to a single token for precise attention calculation between queries and retrieved contexts.
Result: Extensive experiments on LongBench and Babilong benchmarks show AttentionRAG achieves up to 6.3x context compression while outperforming LLMLingua methods by around 10% in key metrics.
Conclusion: AttentionRAG provides an effective solution for context pruning in RAG systems through its attention-guided approach, addressing limitations of existing methods and achieving superior compression performance.
Abstract: While RAG demonstrates remarkable capabilities in LLM applications, its effectiveness is hindered by the ever-increasing length of retrieved contexts, which introduces information redundancy and substantial computational overhead. Existing context pruning methods, such as LLMLingua, lack contextual awareness and offer limited flexibility in controlling compression rates, often resulting in either insufficient pruning or excessive information loss. In this paper, we propose AttentionRAG, an attention-guided context pruning method for RAG systems. The core idea of AttentionRAG lies in its attention focus mechanism, which reformulates RAG queries into a next-token prediction paradigm. This mechanism isolates the query’s semantic focus to a single token, enabling precise and efficient attention calculation between queries and retrieved contexts. Extensive experiments on LongBench and Babilong benchmarks show that AttentionRAG achieves up to 6.3$\times$ context compression while outperforming LLMLingua methods by around 10% in key metrics.
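A toy illustration of attention-guided pruning in the spirit of the paper, with GPT-2 standing in for the model; the answer-prefix prompt, the chunking, and the `keep` budget are our assumptions, not the paper's exact mechanism:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def prune_context(chunks, query, keep=2):
    # Append a short answer prefix so the final token acts as the "focus"
    # token, then score each chunk by the attention it receives from it.
    chunk_ids = [tok(" " + c)["input_ids"] for c in chunks]
    tail = tok(f" Question: {query} The answer is")["input_ids"]
    ids = torch.tensor([[t for cid in chunk_ids for t in cid] + tail])
    with torch.no_grad():
        att = model(ids, output_attentions=True).attentions[-1]  # last layer
    focus = att[0, :, -1, :].mean(dim=0)  # head-averaged row of final token
    scores, pos = [], 0
    for cid in chunk_ids:
        scores.append(focus[pos:pos + len(cid)].sum().item())
        pos += len(cid)
    top = sorted(sorted(range(len(chunks)), key=lambda i: -scores[i])[:keep])
    return [chunks[i] for i in top]
```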
[135] Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, Hailei Gong, Zewen Ye, Shengjie Ma, Jianping Zhang
Main category: cs.CL
TL;DR: Proposes Hierarchical Reward Model (HRM) to address reward hacking in Process Reward Models by evaluating reasoning steps at multiple granularities, with Hierarchical Node Compression (HNC) for efficient data augmentation.
Details
Motivation: PRM suffers from reward hacking and high annotation costs for reasoning processes, making reliable intermediate step evaluation challenging.Method: HRM evaluates individual and consecutive reasoning steps at fine-grained and coarse-grained levels. HNC merges consecutive steps in tree structures for data augmentation with minimal computational overhead.
Result: HRM with HNC provides more stable and reliable evaluations than PRM on PRM800K dataset, and shows strong generalization on MATH500 and GSM8K datasets.
Conclusion: HRM effectively addresses PRM limitations through hierarchical evaluation and efficient data augmentation, demonstrating robust performance across diverse reasoning tasks.
Abstract: Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate step. In addition, the cost of annotating reasoning processes for reward modeling is high, making large-scale collection of high-quality data challenging. To address this, we propose a novel reward model approach called the Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at both fine-grained and coarse-grained levels. HRM excels at assessing multi-step reasoning coherence, especially when flawed steps are later corrected through self-reflection. To further reduce the cost of generating training data, we introduce a lightweight and effective data augmentation strategy called Hierarchical Node Compression (HNC), which merges two consecutive reasoning steps into one within the tree structure. By applying HNC to MCTS-generated reasoning trajectories, we enhance the diversity and robustness of HRM training data while introducing controlled noise with minimal computational overhead. Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM. Furthermore, cross-domain evaluations on the MATH500 and GSM8K datasets demonstrate HRM’s strong generalization and robustness across a variety of reasoning tasks.
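As a rough illustration of the HNC idea, here is a simplified version that merges text steps in a flat trajectory rather than in the full MCTS tree the paper operates on:

```python
import random

def hierarchical_node_compression(steps):
    # Merge two consecutive reasoning steps into one coarser step, yielding
    # an additional training trajectory at a different granularity.
    if len(steps) < 2:
        return list(steps)
    i = random.randrange(len(steps) - 1)
    return steps[:i] + [steps[i] + " " + steps[i + 1]] + steps[i + 2:]

steps = ["Let x be the unknown.", "Set up 2x + 3 = 11.", "Solve: x = 4."]
augmented = hierarchical_node_compression(steps)
```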
[136] The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement
Ruihan Yang, Fanghua Ye, Jian Li, Siyu Yuan, Yikai Zhang, Zhaopeng Tu, Xiaolong Li, Deqing Yang
Main category: cs.CL
TL;DR: CGI is a two-player framework with an actor model exploring environments and a critic model providing natural language feedback, enabling robust exploration and improved decision-making in LLM-based agents.
Details
Motivation: Natural language feedback provides richer guidance than numerical rewards for LLMs, but parsing and implementing feedback effectively is challenging. CGI addresses this by creating specialized models for action and critique.Method: Two-player framework: actor model explores environments while critic model generates detailed natural language feedback. Both models are trained - critic for fine-grained assessments and actionable revisions, actor for utilizing critiques.
Result: CGI outperforms existing baselines substantially in three interactive environments. Even small critic models surpass GPT-4 in feedback quality. The actor achieves state-of-the-art performance.
Conclusion: Explicit iterative guidance through natural language feedback significantly enhances decision-making in LLM-based agents, enabling more robust exploration and avoiding local optima.
Abstract: Large language models (LLMs) have recently transformed from text-based assistants to autonomous agents capable of planning, reasoning, and iteratively improving their actions. While numerical reward signals and verifiers can effectively rank candidate actions, they often provide limited contextual guidance. In contrast, natural language feedback better aligns with the generative capabilities of LLMs, providing richer and more actionable suggestions. However, parsing and implementing this feedback effectively can be challenging for LLM-based agents. In this work, we introduce Critique-Guided Improvement (CGI), a novel two-player framework, comprising an actor model that explores an environment and a critic model that generates detailed natural language feedback. By training the critic to produce fine-grained assessments and actionable revisions, and the actor to utilize these critiques, our approach promotes more robust exploration of alternative strategies while avoiding local optima. Experiments in three interactive environments show that CGI outperforms existing baselines by a substantial margin. Notably, even a small critic model surpasses GPT-4 in feedback quality. The resulting actor achieves state-of-the-art performance, demonstrating the power of explicit iterative guidance to enhance decision-making in LLM-based agents.
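A sketch of the two-player loop at inference time, assuming text-in/text-out wrappers `actor` and `critic` and a minimal environment interface; the training of both models described above is out of scope here:

```python
def cgi_episode(actor, critic, env, max_steps=10):
    # actor/critic: callables mapping a prompt string to a response string.
    # env: assumed interface with reset() -> obs and step(action) -> (obs, done).
    obs = env.reset()
    for _ in range(max_steps):
        proposal = actor(f"Observation: {obs}\nPropose the next action.")
        critique = critic(f"Observation: {obs}\nProposed action: {proposal}\n"
                          "Give fine-grained feedback and a suggested revision.")
        revised = actor(f"Observation: {obs}\nProposed action: {proposal}\n"
                        f"Critique: {critique}\nRevised action:")
        obs, done = env.step(revised)
        if done:
            break
```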
[137] SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche
Main category: cs.CL
TL;DR: SafeMERGE is a lightweight post-fine-tuning framework that selectively merges fine-tuned model layers with safety-aligned layers when they deviate from safe behavior, preserving safety while maintaining downstream task performance.
Details
Motivation: Fine-tuning LLMs for specialized domains often erodes safety alignment, causing models to respond to harmful prompts. Existing safety realignment methods are either difficult to implement or compromise task utility.Method: SafeMERGE uses a cosine similarity criterion to selectively merge fine-tuned layers with safety-aligned model layers only when they deviate from safe behavior, implementing selective layer-wise merging.
Result: Across three LLMs and two tasks, SafeMERGE consistently reduces harmful outputs compared to other defenses, with negligible or even positive impact on utility.
Conclusion: Selective layer-wise merging offers an effective safeguard against safety loss during fine-tuning, establishing SafeMERGE as a simple post-fine-tuning defense that preserves both safety and utility.
Abstract: Fine-tuning large language models (LLMs) is a common practice to adapt generalist models to specialized domains. However, recent studies show that fine-tuning can erode safety alignment, causing LLMs to respond to harmful or unethical prompts. Many methods to realign safety have been proposed, but often introduce custom algorithms that are difficult to implement or compromise task utility. In this work, we propose SafeMERGE, a lightweight, post-fine-tuning framework that preserves safety while maintaining downstream performance. SafeMERGE selectively merges fine-tuned with safety-aligned model layers only when they deviate from safe behavior, measured by a cosine similarity criterion. Across three LLMs and two tasks, SafeMERGE consistently reduces harmful outputs compared to other defenses, with negligible or even positive impact on utility. Our results demonstrate that selective layer-wise merging offers an effective safeguard against the inadvertent loss of safety during fine-tuning, establishing SafeMERGE as a simple post-fine-tuning defense.
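A minimal sketch of cosine-gated layer-wise merging, assuming access to the pre-trained, task-fine-tuned, and safety-aligned state dicts; the deviation measure over weight deltas and the interpolation rule are our simplified stand-ins for the paper's criterion:

```python
import torch
import torch.nn.functional as F

def selective_layer_merge(base, finetuned, safe, tau=0.3, alpha=0.5):
    # A layer is merged toward the safety-aligned weights only when its update
    # direction deviates from the safe update direction (low cosine similarity).
    merged = {}
    for name in base:
        d_ft = (finetuned[name] - base[name]).flatten().float()
        d_safe = (safe[name] - base[name]).flatten().float()
        sim = F.cosine_similarity(d_ft, d_safe, dim=0)
        if sim < tau:  # deviates from safe behavior: interpolate toward safe
            merged[name] = alpha * finetuned[name] + (1.0 - alpha) * safe[name]
        else:
            merged[name] = finetuned[name]
    return merged
```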
[138] Distinct social-linguistic processing between humans and large audio-language models: Evidence from model-brain alignment
Hanlin Wu, Xufeng Duan, Zhenguang Cai
Main category: cs.CL
TL;DR: This study compares how large audio-language models (LALMs) and humans process speaker-contextualized language, finding that Qwen2-Audio shows some human-like sensitivity to speaker-content incongruency but neither model replicates the human distinction between social and biological knowledge violations.
Details
Motivation: To understand whether LALMs process speaker-contextualized language in ways that parallel human cognitive mechanisms, particularly in integrating linguistic and paralinguistic information during speech comprehension.Method: Compared two LALMs (Qwen2-Audio and Ultravox 0.5) with human EEG responses using surprisal and entropy metrics, analyzing sensitivity to speaker-content incongruency across social stereotype violations and biological knowledge violations.
Result: Qwen2-Audio showed increased surprisal for speaker-incongruent content and its surprisal values significantly predicted human N400 responses, while Ultravox 0.5 showed limited sensitivity. Neither model replicated the human processing distinction between social violations (N400 effects) and biological violations (P600 effects).
Conclusion: Current LALMs show both potential and limitations in processing speaker-contextualized language, revealing differences in social-linguistic processing mechanisms between humans and AI models.
Abstract: Voice-based AI development faces unique challenges in processing both linguistic and paralinguistic information. This study compares how large audio-language models (LALMs) and humans integrate speaker characteristics during speech comprehension, asking whether LALMs process speaker-contextualized language in ways that parallel human cognitive mechanisms. We compared two LALMs’ (Qwen2-Audio and Ultravox 0.5) processing patterns with human EEG responses. Using surprisal and entropy metrics from the models, we analyzed their sensitivity to speaker-content incongruency across social stereotype violations (e.g., a man claiming to regularly get manicures) and biological knowledge violations (e.g., a man claiming to be pregnant). Results revealed that Qwen2-Audio exhibited increased surprisal for speaker-incongruent content and its surprisal values significantly predicted human N400 responses, while Ultravox 0.5 showed limited sensitivity to speaker characteristics. Importantly, neither model replicated the human-like processing distinction between social violations (eliciting N400 effects) and biological violations (eliciting P600 effects). These findings reveal both the potential and limitations of current LALMs in processing speaker-contextualized language, and suggest differences in social-linguistic processing mechanisms between humans and LALMs.
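The surprisal and entropy metrics themselves are standard; a text-only sketch, with GPT-2 standing in for the audio-language models the study actually probed:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # text-only stand-in for a LALM
model = AutoModelForCausalLM.from_pretrained("gpt2")

def surprisal_and_entropy(text):
    # Surprisal of each token given its prefix, and the entropy of the
    # model's next-token distribution at each position, both in bits.
    ids = tok(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits            # (1, seq, vocab)
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    surprisal = -logp.gather(1, targets[:, None]).squeeze(1) / math.log(2.0)
    entropy = -(logp.exp() * logp).sum(dim=-1) / math.log(2.0)
    return surprisal, entropy
```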
[139] Unified Sparse Mixture of Experts
Giang Do, Hung Le, Truyen Tran
Main category: cs.CL
TL;DR: USMoE is a unified sparse mixture of experts framework that addresses limitations of traditional routing methods by integrating expert and token dimensions through linear programming, achieving up to 10% performance improvement or 14% inference cost reduction.
Details
Motivation: Traditional SMoE approaches with fixed k values face three key limitations: failure to route to important experts/tokens, assignment of irrelevant ones, and representation collapse among experts.Method: Proposes a unified mechanism integrating expert and token dimensions with a unified scoring function that linearly combines similarity scores between experts and tokens, based on linear programming principles.
Result: USMoE achieves up to 10% performance improvement over standard approaches or reduces inference costs by up to 14% while maintaining competitive accuracy across various settings including clean/corrupted data, LLMs, vision tasks, and training-free/training scenarios.
Conclusion: The USMoE framework effectively overcomes traditional routing limitations through its unified approach, providing both theoretical justification and empirical evidence of superior performance and efficiency.
Abstract: Sparse Mixture of Experts (SMoE) models scale model capacity while maintaining constant computational overhead. Early designs typically relied on a fixed value of $k$, where $k$ represents either the number of experts selected per token or the number of tokens assigned per expert. However, these approaches encounter three key limitations: they may fail to route to important experts or tokens, may assign irrelevant ones, and often suffer from representation collapse among experts. This paper reexamines SMoEs through the lens of Linear Programming, and proposes a Unified Sparse Mixture of Experts (USMoE) framework that addresses these limitations. Specifically, our approach introduces a unified mechanism that integrates information from both the expert and token dimensions, and a unified scoring function that linearly combines similarity scores between experts and tokens. We provide both theoretical justification and empirical evidence demonstrating USMoE’s effectiveness in overcoming the limitations of traditional routing methods. Through comprehensive evaluations on both clean and corrupted settings for large language models and vision tasks, under both training-free and training scenarios, USMoE achieves up to a 10% performance improvement over standard approaches or reduces inference costs by up to 14%, while maintaining competitive accuracy.
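A rough sketch of such a unified score, linearly combining the expert-choice and token-choice views of the same token-expert similarity matrix; the dot-product similarity and the mixing weight `alpha` are our assumptions:

```python
import torch

def unified_routing_scores(token_embs, expert_embs, alpha=0.5):
    # sim[i, j]: similarity between token i and expert j.
    sim = token_embs @ expert_embs.T
    token_choice = torch.softmax(sim, dim=1)   # each token ranks experts
    expert_choice = torch.softmax(sim, dim=0)  # each expert ranks tokens
    return alpha * token_choice + (1.0 - alpha) * expert_choice

scores = unified_routing_scores(torch.randn(8, 16), torch.randn(4, 16))
routes = scores.topk(2, dim=1).indices  # e.g., send each token to 2 experts
```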
[140] StereoDetect: Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings
Kaustubh Shivshankar Shejole, Pushpak Bhattacharyya
Main category: cs.CL
TL;DR: This paper proposes a clear distinction between stereotypes and stereotypical biases, introduces a five-tuple definition framework, and presents StereoDetect - a curated benchmark dataset for stereotype and anti-stereotype detection that addresses shortcomings in existing benchmarks.
Details
Motivation: Current research fails to clearly distinguish between stereotypes and stereotypical biases, slowing progress in stereotype detection. This is a critical area in Responsible AI that requires social knowledge and has been underdeveloped.Method: Proposed a five-tuple definition framework disentangling stereotypes, anti-stereotypes, stereotypical bias, and general bias. Developed StereoDetect, a well-curated benchmark dataset aligned with the proposed definitions, grounded in social psychology principles.
Result: Sub-10B language models and GPT-4o frequently misclassify anti-stereotypes and fail to recognize neutral overgeneralizations. StereoDetect demonstrates effectiveness through qualitative and quantitative comparisons with existing benchmarks and fine-tuned models.
Conclusion: The work provides precise terminology and a conceptual framework for stereotype detection, addresses key shortcomings in existing benchmarks, and offers StereoDetect as a reliable tool for advancing research in this critical area of Responsible AI.
Abstract: Stereotypes are known to have very harmful effects, making their detection critically important. However, current research predominantly focuses on detecting and evaluating stereotypical biases, thereby leaving the study of stereotypes in its early stages. Our study revealed that many works have failed to clearly distinguish between stereotypes and stereotypical biases, which has significantly slowed progress in advancing research in this area. Stereotype and Anti-stereotype detection is a problem that requires social knowledge; hence, it is one of the most difficult areas in Responsible AI. This work investigates this task, where we propose a five-tuple definition and provide precise terminologies disentangling stereotypes, anti-stereotypes, stereotypical bias, and general bias. We provide a conceptual framework grounded in social psychology for reliable detection. We identify key shortcomings in existing benchmarks for this task of stereotype and anti-stereotype detection. To address these gaps, we developed StereoDetect, a well-curated, definition-aligned benchmark dataset designed for this task. We show that sub-10B language models and GPT-4o frequently misclassify anti-stereotypes and fail to recognize neutral overgeneralizations. We demonstrate StereoDetect’s effectiveness through multiple qualitative and quantitative comparisons with existing benchmarks and models fine-tuned on them. The dataset and code are available at https://github.com/KaustubhShejole/StereoDetect.
[141] MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations
Genglin Liu, Vivian Le, Salman Rahman, Elisa Kreiss, Marzyeh Ghassemi, Saadia Gabriel
Main category: cs.CL
TL;DR: MOSAIC is an open-source social network simulation framework that uses LLM agents with diverse personas to model user behaviors like liking, sharing, and flagging content, enabling analysis of misinformation spread and content moderation strategies.
Details
Motivation: To better understand how users determine the veracity of online social content and analyze emergent deception behaviors in social networks through simulation.Method: Combines LLM agents with a directed social graph, constructing user representations from diverse fine-grained personas to enable multi-agent simulations of content dissemination and engagement dynamics.
Result: Evaluated three content moderation strategies and found they not only mitigate non-factual content spread but also increase user engagement. Analyzed content trajectories and agent reasoning alignment with collective engagement patterns.
Conclusion: The framework enables scalable analysis of social content dynamics and moderation strategies, with open-source software to encourage further AI and social sciences research.
Abstract: We present a novel, open-source social network simulation framework, MOSAIC, where generative language agents predict user behaviors such as liking, sharing, and flagging content. This simulation combines LLM agents with a directed social graph to analyze emergent deception behaviors and gain a better understanding of how users determine the veracity of online social content. By constructing user representations from diverse fine-grained personas, our system enables multi-agent simulations that model content dissemination and engagement dynamics at scale. Within this framework, we evaluate three different content moderation strategies with simulated misinformation dissemination, and we find that they not only mitigate the spread of non-factual content but also increase user engagement. In addition, we analyze the trajectories of popular content in our simulations, and explore whether simulation agents’ articulated reasoning for their social interactions truly aligns with their collective engagement patterns. We open-source our simulation software to encourage further research within AI and social sciences.
[142] SEAL: Steerable Reasoning Calibration of Large Language Models for Free
Runjin Chen, Zhenyu Zhang, Junyuan Hong, Souvik Kundu, Zhangyang Wang
Main category: cs.CL
TL;DR: SEAL is a training-free method that improves LLM reasoning efficiency and accuracy by identifying and reducing redundant reflection and transition thoughts in chain-of-thought reasoning.
Details
Motivation: Current LLMs show substantial redundancy in chain-of-thought reasoning traces, which increases inference latency and negatively impacts performance by diverting attention to unnecessary reasoning paths.Method: SEAL categorizes reasoning thoughts into execution, reflection, and transition types, then uses a steering vector extracted from latent space to calibrate reasoning traces through representation intervention, reducing excessive reflection and transition thoughts.
Result: Experiments show up to 11% accuracy improvement while reducing reasoning tokens by 11.8% to 50.4% across multiple models and benchmarks including Math500, GSM8K, and LiveCodeBench.
Conclusion: SEAL effectively improves reasoning efficiency and accuracy by steering away from redundant thought patterns, with the steering vector demonstrating strong transferability across tasks.
Abstract: Large Language Models (LLMs), such as OpenAI’s o1-series have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. However, recent studies reveal substantial redundancy in the CoT reasoning traces, which not only increases inference latency but also negatively impacts model performance by diverting attention to unnecessary reasoning paths. To address this issue, we investigate the internal reasoning structures of LLMs and categorize them into three primary thought types: execution, reflection, and transition thoughts. Moreover, our analysis reveals that excessive reflection and transition thoughts are strongly correlated with failure cases and these thought categories exhibit clear separation in the latent space. Based on these, we introduce SEAL (Steerable reasoning calibration), a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains. SEAL consists of an offline stage for extracting the reasoning steering vector in the latent space, followed by an on-the-fly calibration of the reasoning trace through representation intervention using the steering vector. Notably, the steering vector exhibits strong transferability across various tasks. Extensive experiments across multiple models (DeepSeek-R1-Distill and QwQ-32B-Preview) and benchmarks (Math500, GSM8K, LiveCodeBench) validate the effectiveness of SEAL, with up to an 11% improvement in accuracy while reducing reasoning tokens by 11.8% to 50.4%. Our code is publicly available at https://github.com/VITA-Group/SEAL.
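A minimal sketch of the two stages under our own simplifications: the steering vector as a mean difference of cached hidden states, and the intervention as a forward hook that attenuates that direction; the tensors, layer choice, and `lam` are placeholders:

```python
import torch

def steering_vector(redundant_h, execution_h):
    # Offline stage (simplified): mean difference between cached hidden states
    # of reflection/transition segments and execution segments.
    return redundant_h.mean(dim=0) - execution_h.mean(dim=0)

def add_steering_hook(layer_module, v, lam=0.1):
    # Online stage: attenuate the redundant-thought direction in the layer's
    # hidden states during decoding; lam and the layer choice are guesses.
    v = v / v.norm()
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - lam * (h @ v).unsqueeze(-1) * v
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return layer_module.register_forward_hook(hook)
```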
[143] Better Estimation of the Kullback–Leibler Divergence Between Language Models
Afra Amini, Tim Vieira, Ryan Cotterell
Main category: cs.CL
TL;DR: The paper introduces a Rao-Blackwellized estimator for KL divergence between language models that reduces variance compared to standard Monte Carlo estimators, and applies it to sentiment-controlled fine-tuning.
Details
Motivation: Estimating KL divergence between language models is important for RLHF, interpretability, and knowledge distillation, but standard Monte Carlo estimators suffer from high variance and can produce negative estimates of this non-negative quantity.Method: Developed a Rao-Blackwellized estimator that is unbiased and provably has lower variance than standard Monte Carlo estimators. Also derived an analogous estimator for the gradient of KL divergence.
Result: Empirical study on sentiment-controlled fine-tuning shows the estimator provides more stable KL estimates with substantially reduced variance. Training with the gradient estimator leads to more stable training and models that more frequently appear on the Pareto frontier of reward vs. KL.
Conclusion: The proposed Rao-Blackwellized estimator offers improved variance reduction and stability for KL divergence estimation in language models, with practical benefits for training and optimization.
Abstract: Estimating the Kullback–Leibler (KL) divergence between language models has many applications, e.g., reinforcement learning from human feedback (RLHF), interpretability, and knowledge distillation. However, computing the exact KL divergence between two arbitrary language models is intractable. Thus, practitioners often resort to sampling-based estimators. While it is easy to fashion a simple Monte Carlo (MC) estimator that provides an unbiased estimate of the KL divergence between language models, this estimator notoriously suffers from high variance and can even result in a negative estimate of the KL divergence, a non-negative quantity. In this paper, we introduce a Rao–Blackwellized estimator that is unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator. In an empirical study on sentiment-controlled fine-tuning, we show that our estimator provides more stable KL estimates and reduces variance substantially. Additionally, we derive an analogous Rao–Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that more frequently appear on the Pareto frontier of reward vs. KL compared to the ones trained with the MC estimator of the gradient.
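A compact sketch contrasting the two estimators, assuming access to the full next-token distributions along a trajectory sampled from p; the paper's exact construction may differ in details:

```python
import torch

def mc_kl_estimate(logp_x, logq_x):
    # Naive Monte Carlo: mean of log p(x) - log q(x) over sequences x ~ p.
    # Unbiased, but high-variance, and can even come out negative.
    return (logp_x - logq_x).mean()

def rb_kl_estimate(p_dists, q_dists):
    # Rao-Blackwellized alternative for one trajectory sampled from p: at each
    # step, replace the sampled-token log-ratio with the exact KL between the
    # two next-token distributions, then sum over steps.
    # p_dists, q_dists: (steps, vocab) conditionals along the sampled prefix.
    return (p_dists * (p_dists.log() - q_dists.log())).sum()
```

Averaging `rb_kl_estimate` over several sampled trajectories keeps the estimate unbiased, while each per-step term is an exact, nonnegative KL, which is where the variance reduction comes from.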
[144] Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions
Wang Bill Zhu, Tianqi Chen, Xinyan Velocity Yu, Ching Ying Lin, Jade Law, Mazen Jizzini, Jorge J. Nieva, Ruishan Liu, Robin Jia
Main category: cs.CL
TL;DR: LLMs frequently fail to correct false presuppositions in cancer patients’ questions, posing safety risks. Cancer-Myth benchmark shows top models correct false presuppositions only 43% of the time, and mitigation strategies have trade-offs.
Details
Motivation: Cancer patients increasingly use LLMs for medical information, but current benchmarks don't evaluate how models handle real patient questions with false assumptions, creating safety risks.Method: Created Cancer-Myth dataset with 585 expert-verified cancer questions containing false presuppositions, and Cancer-Myth-NFP set with 150 questions without false presuppositions. Tested frontier LLMs and mitigation strategies like GEPA-optimized precautionary prompts.
Result: No frontier LLM (GPT-5, Gemini-2.5-Pro, Claude-4-Sonnet) corrected false presuppositions more than 43% of the time. Mitigation strategies improved accuracy to 80% on Cancer-Myth but caused 41% false positives on Cancer-Myth-NFP and 10% performance drop on other medical benchmarks.
Conclusion: LLMs have critical reliability gaps in handling false presuppositions in medical questions. Prompting alone isn’t sufficient, and more robust safeguards are needed for medical AI systems.
Abstract: Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with patient details. In this paper, we first have three hematology-oncology physicians evaluate cancer-related questions drawn from real patients. While LLM responses are generally accurate, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM – including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet – corrects these false presuppositions more than 43% of the time. To study mitigation strategies, we further construct a 150-question Cancer-Myth-NFP set, in which physicians confirm the absence of false presuppositions. We find typical mitigation strategies, such as adding precautionary prompts with GEPA optimization, can raise accuracy on Cancer-Myth to 80%, but at the cost of misidentifying presuppositions in 41% of Cancer-Myth-NFP questions and causing a 10% relative performance drop on other medical benchmarks. These findings highlight a critical gap in the reliability of LLMs, show that prompting alone is not a reliable remedy for false presuppositions, and underscore the need for more robust safeguards in medical AI systems.
[145] Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters
Takashi Morita, Timothy J. O’Donnell
Main category: cs.CL
TL;DR: The study shows that the Germanic-Latinate distinction in English can be learned from phonotactic patterns through unsupervised clustering, without needing historical etymology knowledge.
Details
Motivation: To address the learnability challenge of etymology-based generalizations, since historical origins are inaccessible to general language learners.Method: Used unsupervised clustering on corpus-extracted words based on phonotactic information.
Result: The word clusters largely aligned with etymological distinction and recovered known linguistic generalizations about etymological classes, plus uncovered new features.
Conclusion: The Germanic-Latinate distinction in English is learnable from phonotactic patterns, making etymology-based generalizations cognitively plausible.
Abstract: Cross-linguistically, native words and loanwords follow different phonological rules. In English, for example, words of Germanic and Latinate origin exhibit different stress patterns, and a certain syntactic structure, double-object datives, is predominantly associated with Germanic verbs rather than Latinate verbs. As a cognitive model, however, such etymology-based generalizations face challenges in terms of learnability, since the historical origins of words are presumably inaccessible information for general language learners. In this study, we present computational evidence indicating that the Germanic-Latinate distinction in the English lexicon is learnable from the phonotactic information of individual words. Specifically, we performed an unsupervised clustering on corpus-extracted words, and the resulting word clusters largely aligned with the etymological distinction. The model-discovered clusters also recovered various linguistic generalizations documented in the previous literature regarding the corresponding etymological classes. Moreover, our findings also uncovered previously unrecognized features of the quasi-etymological clusters.
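For intuition, a crude stand-in for the study's pipeline: unsupervised clustering of words from character n-gram features, which only loosely approximate the richer phonological representations the authors used; the word list and hyperparameters are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

words = ["forgive", "begin", "understand", "withdraw",   # Germanic-ish
         "describe", "conclude", "receive", "permit"]    # Latinate-ish
# Character n-grams as a rough orthographic proxy for phonotactics.
X = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(words)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(words, labels)))
```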
[146] Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale
Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J. Taylor, Dan Roth
Main category: cs.CL
TL;DR: PERSONAMEM benchmark evaluates LLMs’ ability to leverage user interaction history for personalization, finding current models struggle with tracking evolving user profiles and achieve only ~50% accuracy.
Details
Motivation: To assess how well LLMs can internalize user traits, track evolving preferences over time, and generate personalized responses using interaction history.Method: Created PERSONAMEM benchmark with curated user profiles and 180+ simulated user-LLM interaction histories across 15 real-world tasks, evaluating LLMs’ response selection accuracy given user queries.
Result: Current LLMs struggle with dynamic user profile evolution, achieving only around 50% overall accuracy across frontier models like GPT-4.1, o4-mini, GPT-4.5, o1, and Gemini-2.0.
Conclusion: There’s significant room for improvement in developing user-aware chatbots, and PERSONAMEM provides a valuable benchmark and simulation pipeline for future research.
Abstract: Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks – from offering writing support to delivering tailored recommendations or consultations. Over time, the interaction history between a user and an LLM can provide extensive information about an individual’s traits and preferences. However, open questions remain on how well LLMs today can effectively leverage such history to (1) internalize the user’s inherent traits and preferences, (2) track how the user profiling and preferences evolve over time, and (3) generate personalized responses accordingly in new scenarios. In this work, we introduce the PERSONAMEM benchmark. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories, each containing up to 60 sessions of multi-turn conversations across 15 real-world tasks that require personalization. Given an in-situ user query, i.e. query issued by the user from the first-person perspective, we evaluate LLM chatbots’ ability to identify the most suitable response according to the current state of the user’s profile. We observe that current LLMs still struggle to recognize the dynamic evolution in users’ profiles over time through direct prompting approaches. As a consequence, LLMs often fail to deliver responses that align with users’ current situations and preferences, with frontier models such as GPT-4.1, o4-mini, GPT-4.5, o1, or Gemini-2.0 achieving only around 50% overall accuracy, suggesting room for improvement. We hope that PERSONAMEM, along with the user profile and conversation simulation pipeline, can facilitate future research in the development of truly user-aware chatbots. Code and data are available at github.com/bowen-upenn/PersonaMem.
[147] ComPO: Preference Alignment via Comparison Oracles
Peter Chen, Xi Chen, Wotao Yin, Tianyi Lin
Main category: cs.CL
TL;DR: Proposes a zeroth-order comparison-based optimization method for LLM alignment that addresses verbosity and likelihood displacement issues in noisy preference pairs, with convergence guarantees and experimental validation across multiple models.
Details
Motivation: Direct alignment methods suffer from verbosity and likelihood displacement problems caused by noisy preference pairs that assign similar likelihoods to preferred and dispreferred responses.Method: Developed a zeroth-order comparison-based optimization method using comparison oracles, with convergence guarantees and practical heuristics for handling noisy preference pairs.
Result: Experimental evaluations on Mistral-7B, Llama-3-8B, and Gemma-2-9B models using AlpacaEval 2, MT-Bench, and Arena-Hard benchmarks demonstrate effectiveness in improving LLM performance with noisy preference pairs.
Conclusion: The method provides an effective alternative to existing direct alignment approaches and highlights the importance of designing specialized methods for preference pairs with distinct likelihood margins.
Abstract: Direct alignment methods are increasingly used for aligning large language models (LLMs) with human preferences. However, these methods suffer from the issues of verbosity and likelihood displacement, which can be driven by the noisy preference pairs that induce similar likelihood for preferred and dispreferred responses. The contributions of this paper are two-fold. First, we propose a new preference alignment method based on zeroth-order, comparison-based optimization via comparison oracles and provide convergence guarantees for its basic scheme. Second, we improve our method using some heuristics and conduct the experiments to demonstrate the flexibility and compatibility of the practical scheme in improving the performance of LLMs using noisy preference pairs. Evaluations are conducted across multiple base and instruction-tuned models (Mistral-7B, Llama-3-8B and Gemma-2-9B) with benchmarks (AlpacaEval 2, MT-Bench and Arena-Hard). Experimental results show the effectiveness of our method as an alternative to addressing the limitations of existing direct alignment methods. A highlight of our work is that we evidence the importance of designing specialized methods for preference pairs with distinct likelihood margins, which complements the recent findings in Razin et al (2025).
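A generic zeroth-order update of the kind such comparison-oracle schemes build on; the perturbation scale, step size, and sign-aggregation rule here are illustrative, not the paper's exact algorithm:

```python
import torch

def comparison_oracle_step(theta, oracle, mu=1e-2, lr=1e-3, m=32):
    # oracle(a, b) -> +1.0 if parameters a are preferred over b, else -1.0.
    # Average the signed random directions to estimate an ascent direction.
    g = torch.zeros_like(theta)
    for _ in range(m):
        z = torch.randn_like(theta)
        s = oracle(theta + mu * z, theta - mu * z)
        g += s * z
    return theta + lr * g / m  # step toward the side the oracle prefers
```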
[148] A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings
Fitsum Gaim, Hoyun Song, Huije Lee, Changgeon Ko, Eui Jun Hwang, Jong C. Park
Main category: cs.CL
TL;DR: A large-scale multi-task benchmark dataset for abusive language detection in Tigrinya social media with joint annotations for abusiveness, sentiment, and topic classification, featuring 13,717 YouTube comments in both Ge’ez script and Romanized transliterations.
Details
Motivation: Address the lack of content moderation resources for the majority of world languages, particularly Tigrinya, leaving millions of vulnerable users exposed to online hostility.Method: Developed an iterative term clustering approach for data selection, collected 13,717 YouTube comments from 7,373 videos with over 1.2 billion views, annotated by nine native speakers, accommodating both Ge’ez script and Romanized transliterations.
Result: Fine-tuned small models outperform prompted frontier LLMs in low-resource settings, achieving 86.67% F1 in abusiveness detection (7+ points over best LLM) and stronger performance across all tasks.
Conclusion: The benchmark addresses critical gaps in content moderation for under-resourced languages and demonstrates that specialized fine-tuned models can outperform general LLMs in low-resource scenarios, promoting research on online safety.
Abstract: Content moderation research has recently made significant advances, but remains limited in serving the majority of the world’s languages due to the lack of resources, leaving millions of vulnerable users to online hostility. This work presents a large-scale human-annotated multi-task benchmark dataset for abusive language detection in Tigrinya social media with joint annotations for three tasks: abusiveness, sentiment, and topic classification. The dataset comprises 13,717 YouTube comments annotated by nine native speakers, collected from 7,373 videos with a total of over 1.2 billion views across 51 channels. We developed an iterative term clustering approach for effective data selection. Recognizing that around 64% of Tigrinya social media content uses Romanized transliterations rather than native Ge’ez script, our dataset accommodates both writing systems to reflect actual language use. We establish strong baselines across the tasks in the benchmark, while leaving significant challenges for future contributions. Our experiments demonstrate that small fine-tuned models outperform prompted frontier large language models (LLMs) in the low-resource setting, achieving 86.67% F1 in abusiveness detection (7+ points over best LLM), and maintain stronger performance in all other tasks. The benchmark is made public to promote research on online safety.
[149] Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models
Zahraa Al Sahili, Ioannis Patras, Matthew Purver
Main category: cs.CL
TL;DR: Multilingual CLIP models exhibit amplified gender and race biases compared to English-only baselines, with bias patterns varying by language resource level and gender marking systems.
Details
Motivation: To systematically audit social biases in multilingual vision-language models, as their biases remain underexplored despite promises of universal image-text retrieval.Method: Audited four multilingual CLIP variants (M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, SigLIP-2) across ten languages using balanced subsets of FairFace and PATA stereotype suite in zero-shot setting.
Result: All models showed stronger gender bias than English baselines; CAPIVARA-CLIP had largest biases in low-resource target languages; shared encoders transferred English stereotypes to gender-neutral languages; gendered languages amplified all bias types.
Conclusion: Aggregated metrics mask language-specific bias hotspots, highlighting need for fine-grained, language-aware bias evaluation in multilingual VLM research.
Abstract: Multilingual vision-language models (VLMs) promise universal image-text retrieval, yet their social biases remain underexplored. We perform the first systematic audit of four public multilingual CLIP variants: M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, and the debiased SigLIP-2, covering ten languages that differ in resource availability and morphological gender marking. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the intuition that multilinguality mitigates bias, every model exhibits stronger gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this leakage. Although SigLIP-2 reduces agency and communion skews, it inherits – and in caption-sparse contexts (e.g., Xhosa) amplifies – the English anchor’s crime associations. Highly gendered languages consistently magnify all bias types, yet gender-neutral languages remain vulnerable whenever cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics thus mask language-specific hot spots, underscoring the need for fine-grained, language-aware bias evaluation in future multilingual VLM research.
[150] Gated Integration of Low-Rank Adaptation for Continual Learning of Large Language Models
Yan-Shuo Liang, Jia-Rui Chen, Wu-Jun Li
Main category: cs.CL
TL;DR: GainLoRA is a new continual learning method for LLMs that uses gated integration of LoRA branches to mitigate forgetting and improve performance.
Details
Motivation: Existing CL methods based on LoRA expand new branches for each task but force equal influence between new and old branches, potentially causing forgetting of old tasks.Method: GainLoRA expands a new LoRA branch for each new task and introduces gating modules to integrate new and old branches while minimizing the new branch’s influence on old tasks.
Result: Experimental results on CL benchmarks show that GainLoRA outperforms existing state-of-the-art methods.
Conclusion: GainLoRA effectively mitigates forgetting in continual learning for LLMs through gated integration of LoRA branches, improving overall model performance.
Abstract: Continual learning (CL), which requires the model to learn multiple tasks sequentially, is crucial for large language models (LLMs). Recently, low-rank adaptation (LoRA), one of the most representative parameter-efficient fine-tuning (PEFT) methods, has gained increasing attention in CL of LLMs. However, most existing CL methods based on LoRA typically expand a new LoRA branch to learn each new task and force the new and old LoRA branches to influence old tasks equally, potentially leading to forgetting. In this work, we propose a new method, called gated integration of low-rank adaptation (GainLoRA), for CL of LLMs. GainLoRA expands a new LoRA branch for each new task and introduces gating modules to integrate the new and old LoRA branches. Furthermore, GainLoRA leverages the new gating module to minimize the influence from the new LoRA branch to old tasks, effectively mitigating forgetting and improving the model’s overall performance. Experimental results on CL benchmarks demonstrate that GainLoRA outperforms existing state-of-the-art methods.
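A minimal sketch of a frozen linear layer with gated per-task LoRA branches; GainLoRA's gating modules are input-dependent and trained to suppress the new branch on old-task inputs, whereas the scalar gates below are our simplification:

```python
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base.requires_grad_(False)  # frozen pre-trained weights
        self.As, self.Bs = nn.ParameterList(), nn.ParameterList()
        self.gates = nn.ParameterList()

    def add_task_branch(self, rank=8):
        d_in, d_out = self.base.in_features, self.base.out_features
        self.As.append(nn.Parameter(torch.randn(rank, d_in) * 0.01))
        self.Bs.append(nn.Parameter(torch.zeros(d_out, rank)))  # zero-init
        self.gates.append(nn.Parameter(torch.zeros(1)))         # sigmoid(0)=0.5

    def forward(self, x):
        y = self.base(x)
        for A, B, g in zip(self.As, self.Bs, self.gates):
            y = y + torch.sigmoid(g) * (x @ A.T @ B.T)
        return y

layer = GatedLoRALinear(nn.Linear(64, 64))
layer.add_task_branch()
out = layer(torch.randn(2, 64))
```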
[151] LyapLock: Bounded Knowledge Preservation in Sequential Large Language Model Editing
Peng Wang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu
Main category: cs.CL
TL;DR: LyapLock is a novel model editing framework that uses queuing theory and Lyapunov optimization to enable efficient sequential knowledge updates in LLMs while maintaining long-term knowledge preservation, achieving 11.89% better editing efficacy than state-of-the-art methods.
Details
Motivation: Current model editing methods suffer from performance decline during sequential editing due to inadequate long-term knowledge preservation mechanisms, limiting their practical applicability.Method: Models sequential editing as constrained stochastic programming, integrates queuing theory and Lyapunov optimization to decompose long-term constraints into tractable stepwise subproblems.
Result: Scales sequential editing capacity to over 10,000 edits while stabilizing general capabilities and boosting average editing efficacy by 11.89% over SOTA baselines. Can also enhance baseline methods.
Conclusion: LyapLock is the first model editing framework with rigorous theoretical guarantees, achieving asymptotic optimal editing performance while meeting long-term knowledge preservation constraints.
Abstract: Large Language Models often contain factually incorrect or outdated knowledge, giving rise to model editing methods for precise knowledge updates. However, current mainstream locate-then-edit approaches exhibit a progressive performance decline during sequential editing, due to inadequate mechanisms for long-term knowledge preservation. To tackle this, we model the sequential editing as a constrained stochastic programming. Given the challenges posed by the cumulative preservation error constraint and the gradually revealed editing tasks, LyapLock is proposed. It integrates queuing theory and Lyapunov optimization to decompose the long-term constrained programming into tractable stepwise subproblems for efficient solving. This is the first model editing framework with rigorous theoretical guarantees, achieving asymptotic optimal editing performance while meeting the constraints of long-term knowledge preservation. Experimental results show that our framework scales sequential editing capacity to over 10,000 edits while stabilizing general capabilities and boosting average editing efficacy by 11.89% over SOTA baselines. Furthermore, it can be leveraged to enhance the performance of baseline methods. Our code is released on https://github.com/caskcsg/LyapLock.
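For orientation, the generic Lyapunov drift-plus-penalty machinery that this style of framework builds on can be written as follows; the notation is ours, and the paper's exact instantiation may differ:

```latex
% Virtual queue tracking the cumulative preservation-error budget \bar{c}:
Q(t+1) = \max\!\bigl( Q(t) + c(\theta_t; t) - \bar{c},\; 0 \bigr)

% Stepwise drift-plus-penalty subproblem solved at each edit t,
% trading editing loss against the accumulated constraint backlog:
\theta_t \in \arg\min_{\theta} \; V\,\ell_{\mathrm{edit}}(\theta; t) + Q(t)\, c(\theta; t)
```

Here $c(\theta; t)$ is the step-$t$ knowledge-preservation cost and $V > 0$ trades off editing efficacy against long-term constraint satisfaction.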
[152] The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation
Patrick Kahardipraja, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Main category: cs.CL
TL;DR: The paper investigates how large language models use in-context learning for retrieval-augmented question answering, identifying specialized attention heads and their roles in instruction comprehension and knowledge retrieval.
Details
Motivation: To understand the inner workings of in-context retrieval augmentation in large language models, as current methods remain unclear despite their promising capabilities.Method: Proposed an attribution-based method to identify specialized attention heads (in-context and parametric heads), extracted function vectors, and modified attention weights to analyze their influence on answer generation.
Result: Identified distinct roles of attention heads: in-context heads comprehend instructions and retrieve relevant contextual information, while parametric heads store entities’ relational knowledge.
Conclusion: The insights enable tracing knowledge sources during inference, paving the way for more safe and transparent language models.
Abstract: Large language models are able to exploit in-context learning to access external knowledge beyond their training data through retrieval-augmentation. While promising, its inner workings remain unclear. In this work, we shed light on the mechanism of in-context retrieval augmentation for question answering by viewing a prompt as a composition of informational components. We propose an attribution-based method to identify specialized attention heads, revealing in-context heads that comprehend instructions and retrieve relevant contextual information, and parametric heads that store entities’ relational knowledge. To better understand their roles, we extract function vectors and modify their attention weights to show how they can influence the answer generation process. Finally, we leverage the gained insights to trace the sources of knowledge used during inference, paving the way towards more safe and transparent language models.
[153] MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback
Wanhao Liu, Zonglin Yang, Jue Wang, Lidong Bing, Di Zhang, Dongzhan Zhou, Yuqiang Li, Houqiang Li, Erik Cambria, Wanli Ouyang
Main category: cs.CL
TL;DR: The paper introduces experiment-guided hypothesis ranking using a simulator and in-context reinforcement learning, outperforming pre-experiment methods by incorporating empirical feedback.
Details
Motivation: Current hypothesis ranking methods rely only on language model reasoning without empirical feedback, which is insufficient for cost-intensive scientific domains where real experiments are impractical.Method: Proposed a simulator that models hypothesis performance based on similarity to hidden ground truth with noise, and an in-context reinforcement learning framework where an LLM-based policy decomposes hypotheses, clusters them by mechanistic roles, and prioritizes recombinations using feedback.
Result: Validated against 124 hypotheses with experimental outcomes, the simulator approximates real results with consistent trend alignment. The approach significantly outperforms pre-experiment baselines and strong ablations.
Conclusion: The toolkit enables systematic research on experiment-guided ranking, with the policy serving as a strong proof of concept for more robust hypothesis prioritization strategies.
Abstract: Hypothesis ranking is vital for automated scientific discovery, especially in cost-intensive, throughput-limited natural science domains. Current methods focus on pre-experiment ranking, relying solely on language model reasoning without empirical feedback. We introduce experiment-guided ranking, which prioritizes hypotheses based on feedback from prior tests. Due to the impracticality of real experiments, we propose a simulator grounded in domain-specific concepts that models hypothesis performance as a function of similarity to a hidden ground truth, perturbed by noise. Validated against 124 hypotheses with experimentally reported outcomes, the simulator approximates real results with consistent trend alignment. Although deviations exist, they mimic wet-lab noise, promoting more robust ranking strategies. We frame experiment-guided ranking as a sequential decision-making problem and propose an in-context reinforcement learning (ICRL) framework. Our LLM-based policy decomposes hypotheses into functional elements, clusters them by mechanistic roles, and prioritizes recombinations based on feedback. Experiments show our approach significantly outperforms pre-experiment baselines and strong ablations. Our toolkit, comprising the simulator and ICRL framework, enables systematic research on experiment-guided ranking, with the policy serving as a strong proof of concept.
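The abstract describes the simulator as similarity to a hidden ground truth perturbed by noise; a toy version under our own simplifications (Jaccard overlap over concept sets, Gaussian noise):

```python
import random

def simulated_feedback(hypothesis_concepts, ground_truth_concepts, noise=0.1):
    # Overlap with the hidden ground truth plus noise; both the similarity
    # measure and the noise model are our stand-ins.
    h, g = set(hypothesis_concepts), set(ground_truth_concepts)
    sim = len(h & g) / len(h | g) if h | g else 0.0
    return sim + random.gauss(0.0, noise)

# Rank candidates by repeated noisy "experiments".
candidates = {"h1": {"acid", "catalyst"}, "h2": {"base", "solvent"}}
truth = {"acid", "solvent", "catalyst"}
ranked = sorted(candidates, key=lambda k: -sum(
    simulated_feedback(candidates[k], truth) for _ in range(5)))
```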
[154] MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search
Zonglin Yang, Wanhao Liu, Ben Gao, Yujie Liu, Wei Li, Tong Xie, Lidong Bing, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
Main category: cs.CL
TL;DR: This paper introduces fine-grained scientific hypothesis discovery as a new task, framing it as a combinatorial optimization problem and proposing a hierarchical search method that outperforms baselines on expert-annotated benchmarks.
Details
Motivation: Existing LLM approaches for scientific hypothesis generation primarily yield coarse-grained hypotheses lacking critical methodological and experimental details, creating a need for generating detailed, experimentally actionable hypotheses.Method: Proposes a hierarchical search method that incrementally proposes and integrates details into hypotheses, progressing from general concepts to specific experimental configurations. This smooths the reward landscape and enables more effective optimization.
Result: Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent literature show that the method consistently outperforms strong baselines.
Conclusion: The hierarchical process effectively addresses fine-grained scientific hypothesis discovery, demonstrating that LLMs can generate detailed, experimentally actionable hypotheses when properly leveraged through combinatorial optimization approaches.
Abstract: Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the new task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM’s internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring, thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent literature show that our method consistently outperforms strong baselines.
[155] Prompting is not Enough: Exploring Knowledge Integration and Controllable Generation
Tingjia Shen, Hao Wang, Chuan Qin, Ruijun Sun, Yang Song, Defu Lian, Hengshu Zhu, Enhong Chen
Main category: cs.CL
TL;DR: GenKI is a novel framework that improves OpenQA performance by simultaneously addressing knowledge integration and controllable generation in LLMs through dense passage retrieval, instruction-based fine-tuning, and ensemble-based text consistency.
Details
Motivation: To address two critical challenges in LLM-based OpenQA: effective knowledge integration into LLMs and adaptive generation with specific answer formats for various task situations.Method: 1) Train dense passage retrieval model to retrieve associated knowledge; 2) Introduce knowledge integration model incorporating retrieval knowledge into instructions during fine-tuning; 3) Leverage fine-tuned LLM with ensemble based on text consistency (coherence, fluency, answer format assurance).
Result: Extensive experiments on TriviaQA, MSMARCO, and CMRC2018 datasets demonstrate GenKI’s effectiveness compared to state-of-the-art baselines. Ablation studies reveal linear relationship between retrieved knowledge frequency and model’s knowledge recall accuracy.
Conclusion: GenKI successfully improves OpenQA performance by simultaneously exploring knowledge integration and controllable generation in LLMs, with experiments showing its effectiveness across diverse answer formats and datasets.
Abstract: Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP), primarily focused on extracting answers from unstructured textual data. With the rapid advancements in Large Language Models (LLMs), LLM-based OpenQA methods have reaped the benefits of emergent understanding and answering capabilities enabled by massive parameters compared to traditional methods. However, most of these methods encounter two critical challenges: how to integrate knowledge into LLMs effectively and how to adaptively generate results with specific answer formats for various task situations. To address these challenges, we propose a novel framework named GenKI, which aims to improve OpenQA performance by exploring Knowledge Integration and controllable Generation on LLMs simultaneously. Specifically, we first train a dense passage retrieval model to retrieve associated knowledge from a given knowledge base. Subsequently, we introduce a novel knowledge integration model that incorporates the retrieved knowledge into instructions during fine-tuning to strengthen the model. Furthermore, to enable controllable generation in LLMs, we leverage the fine-tuned LLM together with an ensemble based on text consistency, incorporating coherence, fluency, and answer-format assurance. Finally, extensive experiments conducted on the TriviaQA, MSMARCO, and CMRC2018 datasets, featuring diverse answer formats, have demonstrated the effectiveness of GenKI in comparison with state-of-the-art baselines. Moreover, ablation studies have disclosed a linear relationship between the frequency of retrieved knowledge and the model’s ability to recall knowledge accurately against the ground truth. Our code of GenKI is available at https://github.com/USTC-StarTeam/GenKI
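As a rough illustration of the final ensemble step only, the sketch below selects the candidate answer that maximizes a weighted text-consistency score; the scorer functions `coherence`, `fluency`, and `format_ok` and the uniform weights are assumptions, not GenKI's actual implementation.

```python
# Illustrative ensemble-by-text-consistency selection over candidate answers.
def select_answer(candidates, coherence, fluency, format_ok, w=(1.0, 1.0, 1.0)):
    def score(ans):
        # Combine the three consistency signals into one scalar.
        return (w[0] * coherence(ans)
                + w[1] * fluency(ans)
                + w[2] * float(format_ok(ans)))
    return max(candidates, key=score)
```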
[156] Gatsby Without the ‘E’: Crafting Lipograms with LLMs
Rohan Balasubramanian, Nitish Gokulakrishnan, Syeda Jannatus Saba, Steven Skiena
Main category: cs.CL
TL;DR: This paper explores using modern LLMs to transform The Great Gatsby into a lipogram that excludes the letter ’e’, testing various techniques from basic synonym replacement to advanced generative models with beam search.
Details
Motivation: To test the capabilities of modern large language models in handling extreme linguistic constraints like lipograms, and to explore how much English text can be adapted while maintaining meaning under strict letter exclusion rules.
Method: Used a range of techniques including baseline synonym replacement and sophisticated generative models enhanced with beam search and named entity analysis to transform The Great Gatsby into an ’e’-less text.
Result: Found that excluding up to 3.6% of the most common letters (up to ‘u’) had minimal impact on text meaning, but translation fidelity rapidly decays with stronger lipogram constraints.
Conclusion: Modern LLMs demonstrate surprising flexibility in handling strict linguistic constraints, revealing the adaptability and creativity of English language under extreme limitations.
Abstract: Lipograms are a unique form of constrained writing where all occurrences of a particular letter are excluded from the text, typified by the novel Gadsby, which daringly avoids all usage of the letter ’e’. In this study, we explore the power of modern large language models (LLMs) by transforming F. Scott Fitzgerald’s novel The Great Gatsby into a fully ’e’-less text. We experimented with a range of techniques, from baseline methods like synonym replacement to sophisticated generative models enhanced with beam search and named entity analysis. We show that excluding up to 3.6% of the most common letters (up to the letter ‘u’) had minimal impact on the text’s meaning, although translation fidelity rapidly and predictably decays with stronger lipogram constraints. Our work highlights the surprising flexibility of English under strict constraints, revealing just how adaptable and creative language can be.
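A minimal sketch of how such a constraint can be enforced during decoding, assuming a `vocab` mapping token ids to strings; the helper names are hypothetical, and the paper's generative approach layers beam search and named-entity analysis on top of this kind of masking.

```python
# Lipogram-constrained decoding: ban tokens containing the forbidden letter.
import math

def build_banned_ids(vocab, forbidden="e"):
    # Ban every vocabulary token whose surface form contains the letter.
    return {i for i, tok in vocab.items() if forbidden in tok.lower()}

def mask_logits(logits, banned_ids):
    # -inf logits can never be selected, so beam search stays 'e'-free.
    return [(-math.inf if i in banned_ids else x) for i, x in enumerate(logits)]
```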
[157] First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training
Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun
Main category: cs.CL
TL;DR: MM-UPT is an unsupervised post-training framework for MLLMs that enables self-improvement without manual annotations, using self-rewarding mechanisms and data self-generation strategies.
Details
Motivation: Current MLLM improvement methods rely on expensive supervised fine-tuning or reinforcement learning with manually annotated data, which is unsustainable. There's a need for unsupervised approaches for continual self-improvement.
Method: Builds upon GRPO with self-rewarding mechanism based on majority voting over multiple sampled responses. Also includes data self-generation strategies where MLLM synthesizes new training samples.
Result: Significantly improves reasoning ability: MathVista (66.3%→72.9%) and We-Math (62.9%→68.7%) using standard datasets without ground truth labels. Synthetic data combination further boosts performance.
Conclusion: MM-UPT offers a new paradigm for autonomous MLLM enhancement as a critical third step after SFT and RL, enabling scalable self-improvement without external supervision.
Abstract: Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL), which require expensive and manually annotated multi-modal data–an ultimately unsustainable resource. This limitation has motivated a growing interest in unsupervised paradigms as a third stage of post-training after SFT and RL. While recent efforts have explored this direction, their methods are complex and difficult to iterate. To address this, we propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs, enabling continual self-improvement without any external supervision. The training method of MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that this training method effectively improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3%$\rightarrow$72.9% on MathVista, 62.9%$\rightarrow$68.7% on We-Math), using standard datasets without ground-truth labels. To further explore scalability, we extend our framework to a data self-generation setting, designing two strategies that prompt the MLLM to synthesize new training samples on its own. Additional experiments show that combining these synthetic data with the unsupervised training method can also boost performance, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for autonomous enhancement of MLLMs, serving as a critical third step after initial SFT and RL in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.
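The self-rewarding signal is straightforward to sketch; the snippet below assumes a hypothetical `extract_answer` helper and shows the majority-vote reward that MM-UPT substitutes for ground-truth rewards before GRPO's group normalization.

```python
# Majority-voting self-reward over a group of sampled responses.
from collections import Counter

def majority_vote_rewards(responses, extract_answer):
    answers = [extract_answer(r) for r in responses]
    majority, _ = Counter(answers).most_common(1)[0]
    # Agreement with the majority earns reward 1.0; GRPO then
    # normalizes these rewards within the sampled group.
    return [1.0 if a == majority else 0.0 for a in answers]
```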
[158] Estimating LLM Consistency: A User Baseline vs Surrogate Metrics
Xiaoyuan Wu, Weiran Lin, Omer Akgul, Lujo Bauer
Main category: cs.CL
TL;DR: Current methods for measuring LLM response consistency don’t align well with human perceptions. The authors propose a logit-based ensemble method that matches the best existing metric’s performance in estimating human ratings.
Details
Motivation: LLMs are prone to hallucinations and sensitive to prompt perturbations, leading to inconsistent responses. While various methods exist to measure consistency, it's unclear how well they approximate human perceptions of consistency.
Method: Conducted a large user study (n=2,976) and proposed a logit-based ensemble method for estimating LLM consistency that better aligns with human judgments.
Result: Current automated consistency metrics typically do not align well with human perceptions. The proposed logit-based ensemble method matches the performance of the best existing metric in estimating human ratings.
Conclusion: Automated consistency metrics are sufficiently imperfect to warrant broader use of human evaluation to avoid misjudging model adequacy based on flawed automated metrics.
Abstract: Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility, one of which is to measure the consistency of LLM responses – the model’s confidence in the response or likelihood of generating a similar response when resampled. In previous work, measuring LLM response consistency often relied on calculating the probability of a response appearing within a pool of resampled responses, analyzing internal states, or evaluating logits of responses. However, it was not clear how well these approaches approximated users’ perceptions of consistency of LLM responses. To find out, we performed a user study ($n=2,976$) demonstrating that current methods for measuring LLM response consistency typically do not align well with humans’ perceptions of LLM consistency. We propose a logit-based ensemble method for estimating LLM consistency and show that our method matches the performance of the best-performing existing metric in estimating human ratings of LLM consistency. Our results suggest that methods for estimating LLM consistency without human evaluation are sufficiently imperfect to warrant broader use of evaluation with human input; this would avoid misjudging the adequacy of models because of the imperfections of automated consistency metrics.
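The summary does not spell out the paper's estimator, so the following is only an illustrative logit-based ensemble under assumptions: two statistics per resampled response (mean token probability and the geometric-mean token probability, i.e., inverse perplexity) are averaged into a single consistency score.

```python
# Illustrative logit-based consistency ensemble over resampled responses.
import numpy as np

def consistency_score(token_logprobs_per_response):
    scores = []
    for lp in token_logprobs_per_response:    # lp: token log-probs of one response
        lp = np.asarray(lp, dtype=float)
        mean_prob = float(np.exp(lp).mean())      # statistic 1: mean token prob
        geo_mean_prob = float(np.exp(lp.mean()))  # statistic 2: inverse perplexity
        scores.append(0.5 * mean_prob + 0.5 * geo_mean_prob)
    return float(np.mean(scores))
```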
[159] TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine
Jiacheng Xie, Yang Yu, Ziyang Zhang, Shuai Zeng, Jiaxuan He, Ayush Vasireddy, Xiaoting Tang, Congyu Guo, Lening Zhao, Congcong Jing, Guanghui An, Dong Xu
Main category: cs.CL
TL;DR: TCM-Ladder is the first comprehensive multimodal QA dataset for evaluating large TCM language models, covering multiple TCM disciplines with over 52,000 questions including text, images, and videos.
Details
Motivation: Existing evaluation datasets for TCM language models are limited in scope and primarily text-based, lacking a unified standardized multimodal benchmark to objectively assess performance on real-world TCM tasks.
Method: Created TCM-Ladder dataset covering core TCM disciplines using automated and manual filtering, with various question types and modalities. Also proposed Ladder-Score evaluation method for assessing answer quality in TCM terminology and semantic expression.
Result: The dataset comprises over 52,000 questions across multiple modalities. Comparative experiments were conducted against nine general domain and five TCM-specific LLMs using the proposed evaluation framework.
Conclusion: This work provides the first systematic evaluation of mainstream LLMs on a unified multimodal TCM benchmark, with publicly available datasets and leaderboard for continuous updates.
Abstract: Traditional Chinese Medicine (TCM), as an effective alternative medicine, has been receiving increasing attention. In recent years, the rapid development of large language models (LLMs) tailored for TCM has highlighted the urgent need for an objective and comprehensive evaluation framework to assess their performance on real-world tasks. However, existing evaluation datasets are limited in scope and primarily text-based, lacking a unified and standardized multimodal question-answering (QA) benchmark. To address this issue, we introduce TCM-Ladder, the first comprehensive multimodal QA dataset specifically designed for evaluating large TCM language models. The dataset covers multiple core disciplines of TCM, including fundamental theory, diagnostics, herbal formulas, internal medicine, surgery, pharmacognosy, and pediatrics. In addition to textual content, TCM-Ladder incorporates various modalities such as images and videos. The dataset was constructed using a combination of automated and manual filtering processes and comprises over 52,000 questions. These questions include single-choice, multiple-choice, fill-in-the-blank, diagnostic dialogue, and visual comprehension tasks. We trained a reasoning model on TCM-Ladder and conducted comparative experiments against nine state-of-the-art general domain and five leading TCM-specific LLMs to evaluate their performance on the dataset. Moreover, we propose Ladder-Score, an evaluation method specifically designed for TCM question answering that effectively assesses answer quality in terms of terminology usage and semantic expression. To the best of our knowledge, this is the first work to systematically evaluate mainstream general domain and TCM-specific LLMs on a unified multimodal benchmark. The datasets and leaderboard are publicly available at https://tcmladder.com and will be continuously updated.
[160] A Simple Linear Patch Revives Layer-Pruned Large Language Models
Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, Chun Yuan
Main category: cs.CL
TL;DR: LinearPatch is a lightweight plug-and-play technique that addresses activation magnitude mismatch at layer pruning interfaces in LLMs, preserving up to 94.15% of original performance when pruning 5 out of 32 layers in LLaMA-3-8B.
Details
Motivation: Existing layer pruning approaches for LLMs suffer substantial performance degradation due to overlooked activation magnitude mismatch at pruning interfaces, where pre-interface and post-interface activations have significantly different scales causing distributional shift.
Method: LinearPatch fuses two operations into one matrix multiply at the pruning interface: (1) Hadamard transformation to suppress massive outliers at particular tokens, and (2) channel-wise scaling to align activation statistics. It can be further refined with 5K unlabeled samples via memory-efficient offline distillation.
Result: On LLaMA-3-8B, LinearPatch preserves up to 94.15% of original performance when pruning 5 out of 32 layers, outperforming previous state-of-the-art by 4%. With offline distillation, retention improves to 95.16% within 30 minutes on a single GPU.
Conclusion: LinearPatch effectively addresses the activation magnitude mismatch problem in layer pruning, achieving significant performance retention improvements with minimal computational overhead, making it a practical solution for LLM compression.
Abstract: Layer pruning has emerged as a widely used technique for compressing large language models (LLMs). However, existing layer pruning approaches often incur substantial performance degradation. We attribute the majority of this degradation to a single yet previously overlooked issue: the mismatch of activation magnitudes at the pruning interface. The pre-interface activations exhibit significantly different scales from the post-interface ones, causing a distributional shift as it propagates through the remaining layers. To address this issue, we introduce LinearPatch, a lightweight and plug-and-play technique that fuses two operations into one matrix multiply at the pruning interface: (i) a Hadamard transformation that suppresses massive outliers at particular tokens and (ii) a channel-wise scaling that aligns activation statistics. On LLaMA-3-8B, LinearPatch preserves up to 94.15% of the original model’s performance when pruning 5 out of 32 layers, outperforming the previous state of the art by 4%. The patch can be further refined with 5K unlabeled samples via memory-efficient offline distillation, pushing the retention to 95.16% within only 30 minutes on a single GPU. Code is available at https://github.com/chenxinrui-tsinghua/LinearPatch.
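Under the assumption that the channel-wise scaling is fit in the rotated basis (the paper's exact composition may differ), the two operations fuse into one matrix, as this sketch shows; it requires a power-of-two hidden size for the Hadamard transform.

```python
# Fuse a Hadamard rotation and channel-wise scaling into a single matrix.
import numpy as np
from scipy.linalg import hadamard

def build_linear_patch(pre_acts, post_acts):
    d = pre_acts.shape[-1]
    H = hadamard(d) / np.sqrt(d)               # orthonormal Hadamard rotation
    pre_r, post_r = pre_acts @ H, post_acts @ H
    # Per-channel scales that align activation statistics across the interface.
    s = post_r.std(axis=0) / (pre_r.std(axis=0) + 1e-6)
    return H @ np.diag(s) @ H.T                # rotate, scale, rotate back

# Applied as one extra matmul: x_patched = x @ build_linear_patch(pre, post)
```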
[161] Less is More: Local Intrinsic Dimensions of Contextual Language Models
Benjamin Matthias Ruppik, Julius von Rohrscheidt, Carel van Niekerk, Michael Heck, Renato Vukovic, Shutong Feng, Hsien-chin Lin, Nurul Lubis, Bastian Rieck, Marcus Zibrowius, Milica Gašić
Main category: cs.CL
TL;DR: The paper introduces a geometric approach using local dimensions of contextual latent embeddings to study LLM training dynamics, showing that local dimensions can predict training exhaustion, overfitting, grokking, and performance gains.
Details
Motivation: To understand how fine-tuning affects LLM behavior and provide insights into training dynamics and generalization ability through geometric analysis of latent embeddings.
Method: Measure local dimensions of contextual language model’s latent space and analyze their shifts during training and fine-tuning across different tasks (dialogue state tracking, emotion recognition, arithmetic).
Result: Local dimensions predict training capability exhaustion, overfitting, grokking, and performance gains. Reductions in mean local dimension accompany and predict subsequent performance improvements.
Conclusion: The geometric approach provides practitioners with deeper understanding of fine-tuning effects on embedding spaces, bridging intrinsic model mechanisms with geometric properties for better interpretability and adaptability of LLMs.
Abstract: Understanding the internal mechanisms of large language models (LLMs) remains a challenging and complex endeavor. Even fundamental questions, such as how fine-tuning affects model behavior, often require extensive empirical evaluation. In this paper, we introduce a novel perspective based on the geometric properties of contextual latent embeddings to study the effects of training and fine-tuning. To that end, we measure the local dimensions of a contextual language model’s latent space and analyze their shifts during training and fine-tuning. We show that the local dimensions provide insights into the model’s training dynamics and generalization ability. Specifically, the mean of the local dimensions predicts when the model’s training capabilities are exhausted, as exemplified in a dialogue state tracking task, overfitting, as demonstrated in an emotion recognition task, and grokking, as illustrated with an arithmetic task. Furthermore, our experiments suggest a practical heuristic: reductions in the mean local dimension tend to accompany and predict subsequent performance gains. Through this exploration, we aim to provide practitioners with a deeper understanding of the implications of fine-tuning on embedding spaces, facilitating informed decisions when configuring models for specific applications. The results of this work contribute to the ongoing discourse on the interpretability, adaptability, and generalizability of LLMs by bridging the gap between intrinsic model mechanisms and geometric properties in the respective embeddings.
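As a hedged illustration of measuring local dimension, here is the TwoNN intrinsic-dimension estimator (Facco et al.), one standard choice for this kind of analysis, applied to a neighborhood of embedding vectors; the paper's exact estimator may differ.

```python
# TwoNN intrinsic-dimension estimate from nearest-neighbor distance ratios.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_dimension(X):
    # Distances to each point's two nearest neighbors
    # (column 0 is the point itself at distance 0).
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dists[:, 2] / dists[:, 1]        # ratio of 2nd to 1st neighbor distance
    return len(X) / np.sum(np.log(mu))    # maximum-likelihood fit of the TwoNN model
```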
[162] The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, Yu Meng
Main category: cs.CL
TL;DR: Training language models with only negative sample reinforcement (penalizing incorrect responses) can be highly effective for mathematical reasoning, often matching or surpassing traditional RL methods like PPO and GRPO. Reinforcing only correct responses, by contrast, improves Pass@1 but reduces diversity at higher k.
Details
Motivation: To better understand the mechanism of reinforcement learning with verifiable rewards (RLVR) for training language models on reasoning tasks, particularly by decomposing the learning signal into positive and negative sample reinforcement.
Method: Decomposed RLVR learning into Positive Sample Reinforcement (PSR) and Negative Sample Reinforcement (NSR), trained Qwen2.5-Math-7B, Qwen3-4B and Llama-3.1-8B-Instruct on a mathematical reasoning dataset, analyzed gradient patterns, and proposed a variant that upweights NSR.
Result: Training with only negative samples consistently improved performance over base models across Pass@k spectrum (k up to 256), often matching or surpassing PPO and GRPO. NSR suppresses incorrect generations and redistributes probability mass toward plausible candidates based on model’s prior beliefs.
Conclusion: Solely penalizing incorrect responses contributes more to performance than previously recognized, working by refining existing knowledge rather than introducing new behaviors. The proposed NSR-upweighted variant consistently improves overall Pass@k performance on multiple mathematical reasoning benchmarks.
Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs). Unlike supervised learning, it updates the model using both correct and incorrect samples via policy gradients. To better understand its mechanism, we decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR), respectively. We train Qwen2.5-Math-7B, Qwen3-4B and Llama-3.1-8B-Instruct on a mathematical reasoning dataset and uncover a surprising result: training with only negative samples – without reinforcing correct responses – can be highly effective: it consistently improves performance over the base model across the entire Pass@$k$ spectrum ($k$ up to 256), often matching or surpassing PPO and GRPO. In contrast, reinforcing only correct responses improves Pass@1 but degrades performance at higher $k$, due to reduced diversity. These inference-scaling trends highlight that solely penalizing incorrect responses may contribute more to performance than previously recognized. Through gradient analysis, we show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model’s prior beliefs. It refines the model’s existing knowledge rather than introducing entirely new behaviors. Building on this insight, we propose a simple variant of the RL objective that upweights NSR, and show that it consistently improves overall Pass@$k$ performance on MATH, AIME 2025, and AMC23. Our code is available at https://github.com/TianHongZXY/RLVR-Decomposed.
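The PSR/NSR decomposition is easy to state as a loss; the sketch below assumes summed response log-probs and a binary verifier signal, with `lambda_nsr > 1` giving the proposed NSR-upweighted variant and `lambda_psr = 0` the NSR-only setting.

```python
# PSR/NSR decomposition of the verifier-reward policy-gradient loss.
import torch

def psr_nsr_loss(logprobs, correct, lambda_psr=1.0, lambda_nsr=1.0):
    # logprobs: (batch,) summed log-probabilities of sampled responses
    # correct:  (batch,) 1.0 where the verifier accepts the answer, else 0.0
    psr = -(correct * logprobs).mean()          # push up correct samples
    nsr = ((1.0 - correct) * logprobs).mean()   # push down incorrect samples
    return lambda_psr * psr + lambda_nsr * nsr
```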
[163] Entity-Augmented Neuroscience Knowledge Retrieval Using Ontology and Semantic Understanding Capability of LLM
Pralaypati Ta, Sriram Venkatesaperumal, Keerthi Ram, Mohanasankar Sivaprakasam
Main category: cs.CL
TL;DR: Proposes methods for constructing knowledge graphs from unlabeled neuroscience literature using LLMs, ontology, and embeddings, with enhanced retrieval capabilities.
Details
Motivation: Current retrieval methods struggle with dispersed neuroscience knowledge across multiple sources, and existing KG construction requires labeled data and domain expertise which is challenging to acquire.
Method: Uses large language models, neuroscience ontology, and text embeddings to analyze semantic relevance of text segments for KG construction, plus entity-augmented information retrieval algorithm.
Result: Achieves F1 score of 0.84 for entity extraction from unlabeled data, comparable to supervised methods. Improves answers to over 52% of neuroscience questions from PubMedQA dataset.
Conclusion: The proposed methods significantly enhance knowledge discovery from unlabeled neuroscience research corpus and demonstrate effective KG construction without requiring labeled data.
Abstract: Neuroscience research publications encompass a vast wealth of knowledge. Accurately retrieving existing information and discovering new insights from this extensive literature is essential for advancing the field. However, when knowledge is dispersed across multiple sources, current state-of-the-art retrieval methods often struggle to extract the necessary information. A knowledge graph (KG) can integrate and link knowledge from multiple sources. However, existing methods for constructing KGs in neuroscience often rely on labeled data and require domain expertise. Acquiring large-scale, labeled data for a specialized area like neuroscience presents significant challenges. This work proposes novel methods for constructing KG from unlabeled large-scale neuroscience research corpus utilizing large language models (LLM), neuroscience ontology, and text embeddings. We analyze the semantic relevance of neuroscience text segments identified by LLM for building the knowledge graph. We also introduce an entity-augmented information retrieval algorithm to extract knowledge from the KG. Several experiments were conducted to evaluate the proposed approaches. The results demonstrate that our methods significantly enhance knowledge discovery from the unlabeled neuroscience research corpus. The performance of the proposed entity and relation extraction method is comparable to the existing supervised method. It achieves an F1 score of 0.84 for entity extraction from the unlabeled data. The knowledge obtained from the KG improves answers to over 52% of neuroscience questions from the PubMedQA dataset and questions generated using selected neuroscience entities.
[164] Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models
Taha Entesari, Arman Hatami, Rinat Khaziev, Anil Ramakrishna, Mahyar Fazlyab
Main category: cs.CL
TL;DR: A new constrained optimization approach for LLM unlearning that uses logit-margin flattening loss for forgetting and hard constraints for retention, achieving better performance than regularized methods.
Details
Motivation: Existing unlearning methods suffer from unstable optimization and degraded performance when combining forgetting and retention objectives into a single regularized loss, especially under aggressive forgetting scenarios.
Method: Formulates unlearning as constrained optimization with logit-margin flattening loss for forgetting (driving output distribution toward uniformity on forget set) and hard constraints for retention on retain set, solved using scalable primal-dual algorithm.
Result: Outperforms state-of-the-art baselines on TOFU and MUSE benchmarks across diverse LLM architectures, effectively removing targeted information while preserving downstream utility with stable optimization.
Conclusion: The constrained optimization formulation with logit-margin flattening provides a more stable and effective approach to LLM unlearning compared to regularized methods, enabling better trade-off management between forgetting and retention.
Abstract: Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. This often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. Compared to entropy-based objectives, our loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more efficient and robust optimization. We solve the constrained problem using a scalable primal-dual algorithm that exposes the trade-off between forgetting and retention through the dynamics of the dual variable, all without any extra computational overhead. Evaluations on the TOFU and MUSE benchmarks across diverse LLM architectures demonstrate that our approach consistently matches or exceeds state-of-the-art baselines, effectively removing targeted information while preserving downstream utility.
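A hedged sketch of the constrained formulation: a softmax-free flattening surrogate drives forget-set logits toward uniformity, while a dual variable `lam` enforces `retain_loss <= eps` via gradient ascent. The concrete loss and update rule here are plausible stand-ins, not the paper's exact ones.

```python
# Primal-dual unlearning step: flatten forget-set logits under a retention constraint.
import torch

def margin_flatten_loss(logits):
    # Zero deviation of every logit from the per-token mean logit
    # corresponds to a uniform output distribution (softmax-free).
    return ((logits - logits.mean(dim=-1, keepdim=True)) ** 2).mean()

def primal_dual_step(opt, forget_logits, retain_loss, lam, eps, dual_lr=0.01):
    loss = margin_flatten_loss(forget_logits) + lam * (retain_loss - eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Dual ascent: lam grows while the retention constraint is violated.
    return max(0.0, lam + dual_lr * (retain_loss.item() - eps))
```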
[165] Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance
Aladin Djuhera, Swanand Ravindra Kadhe, Syed Zawad, Farhan Ahmed, Heiko Ludwig, Holger Boche
Main category: cs.CL
TL;DR: This paper presents a comprehensive comparison of two open post-training datasets (Tulu-3-SFT-Mix and SmolTalk) using quality metrics, and introduces TuluTalk - a curated dataset that achieves better performance with fewer samples.
Details
Motivation: Most post-training datasets for LLMs are not publicly available, making systematic comparisons difficult. There's a need for transparent analysis of how dataset composition affects model performance.
Method: Used the Magpie framework to annotate samples with quality metrics (turn structure, task category, input/output quality), analyzed structural differences, and designed a principled curation recipe to create TuluTalk.
Result: TuluTalk contains 14% fewer samples than source datasets while matching or exceeding their performance on key benchmarks. The analysis revealed structural and qualitative differences between datasets.
Conclusion: The work provides actionable insights for building effective post-training datasets within resource limits and releases annotated datasets and TuluTalk mixture to support future research.
Abstract: Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, TuluTalk, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.
[166] Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference
Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, Zirui Liu
Main category: cs.CL
TL;DR: LLM benchmark reproducibility is fragile due to numerical precision issues in floating-point arithmetic, causing significant variations in model outputs and accuracy across different hardware configurations.
Details
Motivation: To investigate why LLM performance evaluations are not reproducible across different system configurations, and to understand how numerical precision affects model outputs.
Method: Conducted systematic experiments across various hardware, software, and precision settings; developed LayerCast pipeline that stores weights in 16-bit but computes in FP32 for better numerical stability.
Result: Found up to 9% accuracy variation and 9,000 token length differences in reasoning models due to GPU count, type, and batch size differences; identified non-associative floating-point arithmetic as root cause.
Conclusion: Numerical precision is critical for LLM reproducibility but often neglected; LayerCast provides a practical solution balancing memory efficiency with numerical stability.
Abstract: Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision - while critical for reproducibility - is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.
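The LayerCast idea reduces to upcasting at each matmul; here is a minimal PyTorch sketch (not the authors' code) wrapping a 16-bit linear layer so that weights stay in 16-bit storage while compute runs in FP32, making rounding less sensitive to accumulation order.

```python
# 16-bit weight storage, FP32 compute for each linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerCastLinear(nn.Module):
    def __init__(self, linear16: nn.Linear):
        super().__init__()
        self.weight = linear16.weight   # kept in bf16/fp16 for memory savings
        self.bias = linear16.bias

    def forward(self, x):
        w = self.weight.float()         # upcast only for the computation
        b = self.bias.float() if self.bias is not None else None
        return F.linear(x.float(), w, b)
```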
[167] Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning
Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu, Xiao Xue, Jun Wang, Mengyue Yang
Main category: cs.CL
TL;DR: A causal framework that uses Probability of Sufficiency and Necessity to analyze Chain-of-Thought reasoning, enabling automated step addition and pruning to improve efficiency without sacrificing accuracy.
Details
Motivation: Chain-of-Thought prompting faces challenges in ensuring sufficiency (comprehensive coverage of reasoning steps) and necessity (identifying truly indispensable steps) for complex reasoning in LLMs.
Method: Proposed a causal framework incorporating Probability of Sufficiency and Necessity to quantify step influence, enabling automated addition of missing steps and pruning of redundant ones.
Result: Extensive experiments on mathematical and commonsense reasoning benchmarks show substantial improvements in reasoning efficiency and reduced token usage without accuracy loss.
Conclusion: Provides a promising direction for improving LLM reasoning performance and cost-effectiveness through causal analysis of reasoning steps.
Abstract: Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.
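The intervention logic behind the sufficiency and necessity probabilities can be sketched with Monte Carlo estimates; `answer_with(steps)` is an assumed LLM call returning a final answer, and these estimators are illustrative rather than the paper's exact PS/PN formulas.

```python
# Monte Carlo probes of a reasoning step's necessity and sufficiency.
def step_necessity(steps, i, answer_with, gold, n=8):
    ablated = steps[:i] + steps[i + 1:]
    # How often removing step i breaks the answer -> a step worth keeping.
    return sum(answer_with(ablated) != gold for _ in range(n)) / n

def step_sufficiency(steps, i, answer_with, gold, n=8):
    # How often the prefix through step i already yields the correct answer.
    return sum(answer_with(steps[:i + 1]) == gold for _ in range(n)) / n
```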
[168] Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei
Main category: cs.CL
TL;DR: LLMs exhibit both generalization and hallucination due to out-of-context reasoning (OCR), where models associate concepts regardless of causal relationships. This behavior stems from gradient descent’s implicit bias favoring nuclear norm minimization.
Details
Motivation: To understand why LLMs show both remarkable generalization from new facts and tendency to hallucinate incorrect information during fine-tuning, which remains poorly understood.
Method: Formalized OCR as synthetic factual recall task, tested five LLMs, and analyzed one-layer single-head attention-only transformers with factorized vs. combined weight matrices.
Result: OCR drives both generalization and hallucination; models with factorized matrices can learn OCR while combined-weight models cannot; gradient descent’s implicit bias minimizes nuclear norm of output-value matrix.
Conclusion: OCR provides unified mechanism explaining LLM behavior, with theoretical foundation for understanding and mitigating undesirable behaviors from knowledge injection.
Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.
[169] Unsupervised Document and Template Clustering using Multimodal Embeddings
Phillipe R. Sampaio, Helene Maxcici
Main category: cs.CL
TL;DR: Unsupervised document clustering using frozen multimodal encoders and classical clustering algorithms, evaluating modality-specific performance across various document types.
Details
Motivation: To study unsupervised clustering of documents at category and template levels using frozen multimodal encoders, addressing the need for robust document organization without labeled data.
Method: Systematic pipeline that projects heterogeneous encoder states into token-type-aware document vectors, then applies centroid- or density-based clustering methods including k-Means, DBSCAN, HDBSCAN + k-NN, and BIRCH on eight different encoders.
Result: Reveals modality-specific failure modes and robustness-accuracy trade-off: vision features work well on clean pages for template discovery, text dominates under covariate shift, and fused encoders offer the best balance.
Conclusion: Provides reproducible tuning protocol and evaluation settings to guide future work on unsupervised document organization, demonstrating the effectiveness of multimodal approaches.
Abstract: We study unsupervised clustering of documents at both the category and template levels using frozen multimodal encoders and classical clustering algorithms. We systematize a model-agnostic pipeline that (i) projects heterogeneous last-layer states from text-layout-vision encoders into token-type-aware document vectors and (ii) performs clustering with centroid- or density-based methods, including an HDBSCAN + $k$-NN assignment to eliminate unlabeled points. We evaluate eight encoders (text-only, layout-aware, vision-only, and vision-language) with $k$-Means, DBSCAN, HDBSCAN + $k$-NN, and BIRCH on five corpora spanning clean synthetic invoices, their heavily degraded print-and-scan counterparts, scanned receipts, and real identity and certificate documents. The study reveals modality-specific failure modes and a robustness-accuracy trade-off, with vision features nearly solving template discovery on clean pages while text dominates under covariate shift, and fused encoders offering the best balance. We detail a reproducible, oracle-free tuning protocol and the curated evaluation settings to guide future work on unsupervised document organization.
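The HDBSCAN + k-NN assignment step is simple to sketch with scikit-learn (1.3+ provides `sklearn.cluster.HDBSCAN`): cluster the document vectors, then hand each noise point the majority label of its nearest clustered neighbors so that no document remains unlabeled.

```python
# HDBSCAN clustering followed by k-NN assignment of noise points.
import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.neighbors import KNeighborsClassifier

def cluster_with_knn_assignment(doc_vecs, k=5, min_cluster_size=10):
    labels = HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(doc_vecs)
    noise = labels == -1
    if noise.any() and (~noise).any():
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(doc_vecs[~noise], labels[~noise])   # train on clustered points
        labels[noise] = knn.predict(doc_vecs[noise])  # relabel the noise points
    return labels
```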
[170] Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment
Shrestha Ghosh, Moritz Schneider, Carina Reinicke, Carsten Eickhoff
Main category: cs.CL
TL;DR: This survey analyzes LLM applications in clinical trial-patient matching, examining current benchmarks, methods, challenges, and future directions.
Details
Motivation: LLMs show strong potential for clinical trial recruitment due to their knowledge aggregation and reasoning abilities, but current applications rely on proprietary models and weak evaluation benchmarks.
Method: The paper conducts a comprehensive survey analyzing existing LLM-based approaches, benchmarks, and evaluation frameworks for trial-patient matching.
Result: The survey identifies limitations in current LLM applications for clinical trial recruitment and provides a critical examination of existing methods.
Conclusion: The paper highlights challenges in adopting LLM technologies in clinical research and outlines exciting future directions for improvement.
Abstract: Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet, their adoption in critical domains, such as clinical trial recruitment, remains limited. As trials are designed in natural language and patient data is represented as both structured and unstructured text, the task of matching trials and patients benefits from knowledge aggregation and reasoning abilities of LLMs. Classical approaches are trial-specific and LLMs with their ability to consolidate distributed knowledge hold the potential to build a more general solution. Yet recent applications of LLM-assisted methods rely on proprietary models and weak evaluation benchmarks. In this survey, we are the first to analyze the task of trial-patient matching and contextualize emerging LLM-based approaches in clinical trial recruitment. We critically examine existing benchmarks, approaches and evaluation frameworks, the challenges to adopting LLM technologies in clinical research and exciting future directions.
[171] Human-Aligned Faithfulness in Toxicity Explanations of LLMs
Ramaravind K. Mothilal, Joanna Roy, Syed Ishtiaque Ahmed, Shion Guha
Main category: cs.CL
TL;DR: This paper proposes HAF (Human-Aligned Faithfulness), a multi-dimensional criterion to evaluate LLMs’ reasoning about toxicity through their explanations, using six metrics based on uncertainty quantification.
Details
Motivation: Shift focus from toxicity detection to evaluating LLMs' reasoning about toxicity through their explanations to enhance trustworthiness in downstream tasks, addressing limitations of existing explainability methods.
Method: Developed HAF criterion and six metrics based on uncertainty quantification to evaluate LLMs’ toxicity explanations without human involvement, tested on Llama models (up to 70B) and Ministral model across five toxicity datasets.
Result: LLMs generate plausible explanations to simple prompts but show inconsistent and irrelevant responses when prompted about nuanced relations between reasons, individual reasons, and toxicity stances.
Conclusion: LLMs’ reasoning about toxicity breaks down in complex scenarios, highlighting the need for better evaluation methods like HAF to improve trustworthiness in toxicity-related tasks.
Abstract: The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs’ reasoning about toxicity – from their explanations that justify a stance – to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, it is not straightforward to adopt existing methods to evaluate free-form toxicity explanation due to their over-reliance on input text perturbations, among other challenges. To account for these, we propose a novel, theoretically-grounded multi-dimensional criterion, Human-Aligned Faithfulness (HAF), that measures the extent to which LLMs’ free-form toxicity explanations align with those of a rational human under ideal conditions. We develop six metrics, based on uncertainty quantification, to comprehensively evaluate HAF of LLMs’ toxicity explanations with no human involvement, and highlight how “non-ideal” the explanations are. We conduct several experiments on three Llama models (of size up to 70B) and an 8B Ministral model on five diverse toxicity datasets. Our results show that while LLMs generate plausible explanations to simple prompts, their reasoning about toxicity breaks down when prompted about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, resulting in inconsistent and irrelevant responses. We open-source our code at https://github.com/uofthcdslab/HAF and LLM-generated explanations at https://huggingface.co/collections/uofthcdslab/haf.
[172] DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE
Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Zuwei Long, Dong Yang, Ke Li, Xing Sun
Main category: cs.CL
TL;DR: DeepTalk is a native multimodal LLM framework that uses Mixture of Experts to address catastrophic forgetting in speech-text models, achieving only 5.5% performance drop compared to 20%+ in existing methods while maintaining low latency.
Details
Motivation: Native MLLMs preserve rich paralinguistic features and enable direct speech generation within LLMs, but suffer from catastrophic forgetting due to insufficient paired speech-text data compared to text-only pretraining.
Method: Proposes DeepTalk with adaptive modality expert learning using MoE architecture - distinguishes modality experts by load, performs specialized single-modality training, then joint multimodal collaborative training.
Result: Achieves only 5.5% performance drop vs original LLM (compared to 20%+ in native MLLMs like GLM-4-Voice), maintains end-to-end latency under 0.5 seconds, and performs on par with modular MLLMs.
Conclusion: DeepTalk effectively addresses catastrophic forgetting in native MLLMs through adaptive modality expert learning, enabling seamless speech interaction while preserving LLM performance.
Abstract: Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is insufficient to support the pretraining of MLLMs compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, which is significantly lower than the average performance drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at https://github.com/talkking/DeepTalk.
[173] Improving the Distributional Alignment of LLMs using Supervision
Gauri Kambhatla, Sanjana Gautam, Angela Zhang, Alex Liu, Ravi Srinivasan, Junyi Jessy Li, Matthew Lease
Main category: cs.CL
TL;DR: Simple supervision improves LLM alignment with diverse population groups across three datasets, with evaluation of distributional alignment and benchmarking for future research.
Details
Motivation: The ability to accurately align LLMs with human population groups on subjective questions would have great value.
Method: Use of simple supervision to improve language model alignment, evaluated over three datasets spanning various topics and multiple LLMs with different prompting strategies.
Result: Simple supervision greatly and more consistently improved language model alignment with diverse population groups, with insights into how alignment varies across specific groups.
Conclusion: Broad findings provide insights into distributional alignment of LLMs with diverse groups, and the work serves as a benchmark to stimulate future research.
Abstract: The ability to accurately align LLMs with human population groups on subjective questions would have great value. In this work, we show that the use of simple supervision can greatly and more consistently improve language model alignment with diverse population groups, as measured over three datasets spanning various topics. Beyond evaluating average alignment, we also report how alignment varies across specific groups. Our broad findings provide insights into the distributional alignment of LLMs with diverse population groups. By conducting evaluation over many LLMs and prompting strategies, along with open-sourcing our work, we provide a benchmark to stimulate future research.
[174] ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization
YuXuan Zhang
Main category: cs.CL
TL;DR: ARF converts natural feedback into continuous preference trajectories using TraceBias algorithm, outperforming PPO and DPO by up to 7.6% in alignment.
Details
Motivation: Current RLHF methods like PPO and DPO use binary labels that are costly and too coarse, failing to capture individual variation in preferences.
Method: ARF (Adaptive Reward-Following) extracts continuous preference trajectories from free-form natural feedback using the novel TraceBias optimization algorithm.
Result: ARF consistently outperforms PPO and DPO across diverse LLMs and preference domains, improving alignment by up to 7.6%.
Conclusion: Continuous reward modeling provides a scalable path toward personalized and theoretically grounded RLHF.
Abstract: Current RLHF methods such as PPO and DPO typically reduce human preferences to binary labels, which are costly to obtain and too coarse to reflect individual variation. We observe that expressions of satisfaction and dissatisfaction follow stable linguistic patterns across users, indicating that more informative supervisory signals can be extracted from free-form feedback. Building on this insight, we introduce Adaptive Reward-Following (ARF), which converts natural feedback into continuous preference trajectories and optimizes them using the novel TraceBias algorithm. Across diverse LLMs and preference domains, ARF consistently outperforms PPO and DPO, improving alignment by up to 7.6%. Our results demonstrate that continuous reward modeling provides a scalable path toward personalized and theoretically grounded RLHF.
[175] Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou
Main category: cs.CL
TL;DR: AGENT KB is a universal memory infrastructure that enables cross-framework knowledge sharing between AI agent systems, preventing repeated mistakes and rediscovery of solutions through hybrid retrieval and feedback mechanisms.
Details
Motivation: Current AI agent frameworks operate in isolation, trapping valuable problem-solving experiences within individual systems and preventing collective intelligence from emerging across different architectures.
Method: AGENT KB aggregates trajectories into a structured knowledge base with lightweight APIs, using hybrid retrieval with two stages: planning seeds agents with cross-domain workflows, and feedback applies targeted diagnostic fixes. A disagreement gate prevents knowledge interference.
Result: Substantial improvements across major frameworks: smolagents achieved up to 18.7pp gains at pass@3 (55.2% -> 73.9%), OpenHands improved 4.0pp on SWE-bench pass@1 (24.3% -> 28.3%). Similar gains observed across all model families.
Conclusion: AGENT KB establishes the foundation for collective agent intelligence through shared memory infrastructures, with automatically generated experiences matching manual curation and enabling seamless cross-framework knowledge transfer.
Abstract: AI agent frameworks operate in isolation, forcing agents to rediscover solutions and repeat mistakes across different systems. Despite valuable problem-solving experiences accumulated by frameworks like smolagents, OpenHands, and OWL, this knowledge remains trapped within individual systems, preventing the emergence of collective intelligence. Current memory systems focus on individual agents or framework-specific demonstrations, failing to enable cross-architecture knowledge transfer. We introduce AGENT KB, a universal memory infrastructure enabling seamless experience sharing across heterogeneous agent frameworks without retraining. AGENT KB aggregates trajectories into a structured knowledge base and serves lightweight APIs. At inference time, hybrid retrieval operates through two stages: planning seeds agents with cross-domain workflows, while feedback applies targeted diagnostic fixes. A disagreement gate ensures retrieved knowledge enhances rather than disrupts reasoning, addressing knowledge interference in cross-framework transfer. We validate AGENT KB across major frameworks on GAIA, Humanity’s Last Exam, GPQA, and SWE-bench. Results show substantial improvements across diverse model families: compared to baseline pass@1, smolagents with AGENT KB achieve up to 18.7pp gains at pass@3 (55.2% -> 73.9%), while OpenHands improves 4.0pp on SWE-bench pass@1 (24.3% -> 28.3%). Similar improvements are observed across all base model families. Ablations confirm that hybrid retrieval and feedback stages are essential, with automatically generated experiences matching manual curation. This establishes the foundation for collective agent intelligence through shared memory infrastructures.
[176] Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen
Main category: cs.CL
TL;DR: SambaY introduces Gated Memory Unit (GMU) for efficient memory sharing across SSM layers, creating a decoder-hybrid-decoder architecture that improves decoding efficiency, preserves linear pre-filling time, and boosts long-context performance without positional encoding.
Details
Motivation: Prior hybrid architectures like Samba and YOCO showed promise but didn't explore representation sharing between SSM layers for efficiency gains.
Method: Developed Gated Memory Unit (GMU) mechanism and applied it to create SambaY - a decoder-hybrid-decoder architecture with GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder.
Result: Significantly lower irreducible loss than YOCO baseline, superior performance scalability, Phi4-mini-Flash-Reasoning achieves better performance on reasoning tasks without RL, and up to 10x higher decoding throughput on long prompts.
Conclusion: GMU-based SambaY architecture enables efficient memory sharing across SSM layers, delivering significant performance improvements and scalability advantages for large-scale language modeling.
Abstract: Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
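The abstract does not spell out the GMU computation. A minimal sketch of one plausible form, assuming elementwise gating of a shared memory readout by the current hidden state (the sigmoid gate and dimensions are assumptions, not the published architecture):

```python
# Toy gated memory unit: the cross-decoder reuses a memory readout `m`
# produced by the self-decoder, gated token-wise by the hidden state.
import torch
import torch.nn as nn

class GatedMemoryUnit(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, hidden: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # hidden, memory: (batch, seq_len, d_model)
        gate = torch.sigmoid(self.gate_proj(hidden))  # token-wise gate
        return self.out_proj(gate * memory)           # gated memory readout

x = torch.randn(2, 8, 64)
m = torch.randn(2, 8, 64)   # readout shared from an earlier SSM layer
print(GatedMemoryUnit(64)(x, m).shape)  # torch.Size([2, 8, 64])
```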
[177] The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
Chen Amiraz, Yaroslav Fyodorov, Elad Haramaty, Zohar Karnin, Liane Lewin-Eytan
Main category: cs.CL
TL;DR: Cross-lingual RAG faces retrieval bottlenecks in domain-specific settings, with performance drops when query and document languages differ. Simple strategies like equal retrieval from both languages or query translation substantially improve performance.
Details
Motivation: To address gaps in cross-lingual RAG research by studying Arabic-English RAG in domain-specific settings using real-world corporate datasets, revealing hidden retrieval challenges not apparent in Wikipedia-based benchmarks.Method: Used benchmarks derived from real-world corporate datasets with all combinations of Arabic and English query-document pairs. Proposed two retrieval strategies: enforcing equal retrieval from both languages and query translation.
Result: Found that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with substantial performance drops when query and document languages differ. The proposed strategies resulted in substantial improvements in cross-lingual and overall performance.
Conclusion: Retrieval challenges are the primary bottleneck in cross-lingual RAG, particularly in practical applications. Simple retrieval strategies can effectively address cross-lingual ranking difficulties and improve performance.
Abstract: Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior. Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with substantial performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever’s difficulty in ranking documents across languages. Finally, we propose two simple retrieval strategies that address this source of failure by enforcing equal retrieval from both languages or by translating the query, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.
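Of the two proposed strategies, equal retrieval is simple enough to sketch. A toy version assuming each language has its own ranked passage list (interleaving is one possible merge policy; the paper may rank the merged list differently):

```python
# Equal retrieval: take the top-k/2 passages from each language's ranked
# list instead of a single cross-lingual top-k, avoiding the cross-lingual
# ranking failure described above.

def equal_retrieval(ranked_ar: list[str], ranked_en: list[str], k: int = 6) -> list[str]:
    """Interleave the best k//2 Arabic and k//2 English passages."""
    half = k // 2
    merged = []
    for ar, en in zip(ranked_ar[:half], ranked_en[:half]):
        merged.extend([ar, en])
    return merged

print(equal_retrieval(["ar1", "ar2", "ar3"], ["en1", "en2", "en3"]))
# ['ar1', 'en1', 'ar2', 'en2', 'ar3', 'en3']
```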
[178] DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models
Cathy Jiao, Yijun Pan, Emily Xiao, Daisy Sheng, Niket Jain, Hanzhang Zhao, Ishita Dasgupta, Jiaqi W. Ma, Chenyan Xiong
Main category: cs.CL
TL;DR: DATE-LM is a unified benchmark for evaluating data attribution methods in language models through three key tasks: training data selection, toxicity/bias filtering, and factual attribution.
Details
Motivation: There are critical gaps in systematic LLM-centric evaluation of data attribution methods, which are becoming increasingly relevant for dataset curation, model interpretability, and data valuation in LLM research and applications.Method: Introduces DATE-LM benchmark that measures attribution quality through three real-world LLM application tasks and enables large-scale evaluations across diverse tasks and LLM architectures.
Result: Large-scale evaluation shows no single method dominates across all tasks, data attribution methods have trade-offs with simpler baselines, and method performance is sensitive to task-specific evaluation design.
Conclusion: DATE-LM serves as a foundation for future data attribution research in LLMs, with a public leaderboard released for community engagement and method comparison.
Abstract: Data attribution methods quantify the influence of training data on model outputs and are becoming increasingly relevant for a wide range of LLM research and applications, including dataset curation, model interpretability, and data valuation. However, there remain critical gaps in the systematic, LLM-centric evaluation of data attribution methods. To this end, we introduce DATE-LM (Data Attribution Evaluation in Language Models), a unified benchmark for evaluating data attribution methods through real-world LLM applications. DATE-LM measures attribution quality through three key tasks – training data selection, toxicity/bias filtering, and factual attribution. Our benchmark is designed for ease of use, enabling researchers to configure and run large-scale evaluations across diverse tasks and LLM architectures. Furthermore, we use DATE-LM to conduct a large-scale evaluation of existing data attribution methods. Our findings show that no single method dominates across all tasks, data attribution methods have trade-offs with simpler baselines, and method performance is sensitive to task-specific evaluation design. Finally, we release a public leaderboard for quick comparison of methods and to facilitate community engagement, with the motivation that DATE-LM can serve as a foundation for future data attribution research in LLMs.
[179] Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
Main category: cs.CL
TL;DR: Mixture-of-Recursions (MoR) is a unified framework that combines parameter sharing and adaptive computation in a single Recursive Transformer, achieving better efficiency and performance than existing methods.
Details
Motivation: Current language models face high computational and memory costs during training and deployment. Existing efficiency approaches target either parameter sharing or adaptive computation separately, leaving a gap in achieving both simultaneously.Method: MoR uses a shared stack of layers across recursion steps for parameter efficiency, with lightweight routers that dynamically assign different recursion depths to individual tokens. It focuses attention computation only on active tokens and selectively caches their key-value pairs. A KV sharing variant reuses KV pairs from the first recursion to further reduce memory footprint.
Result: Across model scales from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity, improves few-shot accuracy, and delivers higher throughput compared to vanilla and existing recursive baselines.
Conclusion: MoR is an effective path towards achieving large-model quality without incurring large-model costs, demonstrating that unified parameter sharing and adaptive computation can significantly improve efficiency.
Abstract: Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to further decrease memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
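A toy sketch of the token-level recursion routing described above: a shared block is applied repeatedly, and a lightweight router decides per token whether to keep recursing. The router form, 0.5 threshold, and update rule are illustrative assumptions; MoR's actual routing and selective KV caching are more involved:

```python
# Shared-weight recursion with per-token depth routing.
import torch
import torch.nn as nn

d, max_depth = 32, 3
shared_block = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
router = nn.Linear(d, 1)  # scores each token for another recursion step

x = torch.randn(2, 10, d)
active = torch.ones(2, 10, dtype=torch.bool)
for depth in range(max_depth):
    out = shared_block(x)                          # same weights every step
    x = torch.where(active.unsqueeze(-1), out, x)  # only active tokens update
    active = active & (torch.sigmoid(router(x)).squeeze(-1) > 0.5)
print(x.shape)  # torch.Size([2, 10, 32])
```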
[180] MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning
Xiaoyuan Li, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu
Main category: cs.CL
TL;DR: This paper evaluates MLLMs’ code-based capabilities for multi-modal mathematical reasoning, focusing on visual operations through code generation and editing.
Details
Motivation: Existing evaluations of Multi-modal Large Language Models focus mainly on text-only reasoning outputs, leaving their ability to perform accurate visual operations via code largely unexplored.Method: Proposes a framework with two evaluation aspects: Multi-modal Code Generation (MCG) for understanding and constructing visualizations, and Multi-modal Code Editing (MCE) for fine-grained operations including Deletion, Modification and Annotation. Uses a dataset covering five types of mathematical figures and evaluates nine mainstream MLLMs.
Result: Experimental results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.
Conclusion: There is a significant gap between current MLLM capabilities and human performance in code-based multi-modal mathematical reasoning, particularly in fine-grained visual operations.
Abstract: Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on the textual instructions. A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving the MLLM’s ability to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating MLLM’s code-based capabilities in multi-modal mathematical reasoning. Specifically, our framework focuses on two key evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model’s ability to accurately understand and construct visualizations from scratch. (2) Multi-modal Code Editing (MCE) assesses the model’s capacity for fine-grained operations, which include three types: Deletion, Modification and Annotation. To evaluate the above tasks, we incorporate a dataset that covers the five most popular types of mathematical figures, including geometric diagrams, function plots, and three types of statistical charts, to provide a comprehensive and effective measurement of existing MLLMs. Our experimental evaluation involves nine mainstream MLLMs, and the results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.
[181] Trusted Knowledge Extraction for Operations and Maintenance Intelligence
Kathleen P. Mealey, Jonathan A. Karr Jr., Priscila Saboia Moreira, Paul R. Brenner, Charles F. Vardeman II
Main category: cs.CL
TL;DR: The paper evaluates 16 NLP tools and LLMs for knowledge graph construction in aircraft maintenance, finding significant performance limitations and discussing challenges for trusted applications in confidential environments.
Details
Motivation: Address the challenge of deriving operational intelligence from organizational data while maintaining confidentiality, particularly in mission-critical industries like aviation where NLP tools face domain-specific limitations.Method: Break down knowledge extraction into NER, coreference resolution, named entity linking, and relation extraction components. Evaluate 16 NLP tools and LLMs using a baseline dataset from FAA maintenance records, focusing on zero-shot performance in controlled environments.
Result: Observed significant performance limitations of NLP and LLM tools for trusted applications in confidential environments, highlighting challenges with technical readiness for mission-critical industries.
Conclusion: Provide recommendations to enhance trust in NLP/LLM tools and release an open-source curated dataset to support further baseline testing and evaluation in trusted environments.
Abstract: Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.
[182] Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, Yi Wu
Main category: cs.CL
TL;DR: ASearcher is an open-source RL training framework for search agents that enables long-horizon search with over 100 tool calls, achieving state-of-the-art performance on xBench and GAIA benchmarks.
Details
Motivation: Open-source LLM agents lack expert-level Search Intelligence for resolving ambiguous queries and conducting thorough exploration, with existing approaches limited by scalability, efficiency, and data quality issues.Method: Uses scalable fully asynchronous RL training for long-horizon search and a prompt-based LLM agent to autonomously synthesize high-quality QA datasets for training.
Result: ASearcher-Web-QwQ achieves 51.1 Avg@4 on xBench and 58.7 on GAIA, surpassing existing 32B agents, with training showing extreme long-horizon search (100+ tool calls, 400k+ output tokens).
Conclusion: ASearcher demonstrates that open-source agents can achieve commercial-level performance through large-scale RL training and autonomous data synthesis, enabling expert-level search capabilities.
Abstract: Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. <=10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 78.0% and 34.3% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 100 turns and output tokens exceeding 400k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 51.1 on xBench and 58.7 on GAIA, surpassing existing open-source 32B agents. Finally, we also show that ASearcher-Web-QwQ could achieve the performance of commercial systems using an external summary tool in a zero-shot transfer manner and test-time search. We open-source our models, training data, and code at https://github.com/inclusionAI/ASearcher.
[183] Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training
Woojin Chung, Jeonghoon Kim
Main category: cs.CL
TL;DR: Larger vocabularies reduce complexity of tokenized text by lowering uncertainty on frequent words, but increase token-frequency imbalance and hurt performance on rare words.
Details
Motivation: To understand why larger vocabularies benefit language models despite creating highly imbalanced token distributions, and to clarify the underlying mechanisms.Method: Controlled study scaling vocabulary from 24K to 196K while holding data, computation, and optimization constant; analyzed tokenized text complexity via Kolmogorov complexity and word-level loss decomposition.
Result: Larger vocabularies reduce cross-entropy loss primarily by lowering uncertainty on the 2,500 most frequent words (covering ~75% of downstream tokens), while loss on rare words increases. Same benefit achieved by enlarging model parameters with fixed vocabulary.
Conclusion: Bigger vocabularies help by lowering complexity of tokenized text, offering a principled approach for tokenizer-model co-design and clarifying loss dynamics in language model scaling.
Abstract: Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but it is unclear where the benefit comes from. To this end, we perform a controlled study that scales the vocabulary of the language model from 24K to 196K while holding data, computation, and optimization unchanged. We begin by quantifying the complexity of tokenized text – formalized via Kolmogorov complexity – and show that larger vocabularies reduce this complexity. Above 24K, every common word is already tokenized as a single token, so enlarging vocabulary only deepens the relative token-frequency imbalance. Word-level loss decomposition shows that larger vocabularies reduce cross-entropy loss almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. The same frequent words cover roughly 75% of tokens in downstream benchmarks, so this training advantage transfers intact. We further show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results recast “bigger vocabularies help” as “lowering complexity of tokenized text helps,” offering a simple, principled knob for tokenizer-model co-design and clarifying the loss dynamics that govern language model scaling in pre-training.
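The word-level loss decomposition lends itself to a short sketch: split per-word losses into a frequent head and a rare tail and compare means. The data below is toy; in the paper the head is the 2,500 most frequent words:

```python
# Decompose corpus cross-entropy by word frequency bucket.
from collections import Counter

# (word, summed token loss) for each word occurrence -- toy values
losses = [("the", 0.9), ("cat", 2.1), ("the", 0.8), ("sat", 2.4), ("zyzzyva", 7.5)]
freq = Counter(w for w, _ in losses)
top_words = {w for w, _ in freq.most_common(1)}  # top-2500 in the paper

head = [l for w, l in losses if w in top_words]
tail = [l for w, l in losses if w not in top_words]
print(f"frequent-word mean loss: {sum(head)/len(head):.2f}")
print(f"rare-word mean loss:     {sum(tail)/len(tail):.2f}")
```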
[184] Bhav-Net: Knowledge Transfer for Cross-Lingual Antonym vs Synonym Distinction via Dual-Space Graph Transformers
Samyak S. Sanghvi
Main category: cs.CL
TL;DR: Bhav-Net is a dual-space architecture that enables knowledge transfer from multilingual models to language-specific architectures for antonym-synonym distinction across multiple languages.
Details
Motivation: Antonym-synonym distinction presents computational challenges due to the paradoxical nature of antonymous relationships where words share semantic domains but express opposite meanings.Method: Combines language-specific BERT encoders with graph transformer networks, creating distinct semantic projections where synonymous pairs cluster in one space and antonymous pairs exhibit high similarity in a complementary space.
Result: Achieves competitive performance against state-of-the-art baselines across eight languages (English, German, French, Spanish, Italian, Portuguese, Dutch, Russian) with effective cross-lingual generalization.
Conclusion: Semantic relationship modeling transfers effectively across languages, providing interpretable semantic representations and robust cross-lingual antonym-synonym distinction capabilities.
Abstract: Antonym vs synonym distinction across multiple languages presents unique computational challenges due to the paradoxical nature of antonymous relationships: words that share semantic domains while expressing opposite meanings. This work introduces Bhav-Net, a novel dual-space architecture that enables effective knowledge transfer from complex multilingual models to simpler, language-specific architectures while maintaining robust cross-lingual antonym–synonym distinction capabilities. Our approach combines language-specific BERT encoders with graph transformer networks, creating distinct semantic projections where synonymous pairs cluster in one space while antonymous pairs exhibit high similarity in a complementary space. Through comprehensive evaluation across eight languages (English, German, French, Spanish, Italian, Portuguese, Dutch, and Russian), we demonstrate that semantic relationship modeling transfers effectively across languages. The dual-encoder design achieves competitive performance against state-of-the-art baselines while providing interpretable semantic representations and effective cross-lingual generalization.
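A toy sketch of the dual-space idea: project shared encodings into a synonym space and an antonym space, then classify a pair by whichever space yields the higher similarity. The random linear projections stand in for the trained BERT encoders and graph-transformer heads:

```python
# Dual-space pair classification with two learned projections (toy weights).
import torch
import torch.nn.functional as F

d = 64
proj_syn = torch.nn.Linear(d, d)  # synonym-space projection
proj_ant = torch.nn.Linear(d, d)  # antonym-space projection

def classify(u: torch.Tensor, v: torch.Tensor) -> str:
    sim_syn = F.cosine_similarity(proj_syn(u), proj_syn(v), dim=-1)
    sim_ant = F.cosine_similarity(proj_ant(u), proj_ant(v), dim=-1)
    return "synonym" if sim_syn > sim_ant else "antonym"

u, v = torch.randn(d), torch.randn(d)  # stand-ins for word encodings
print(classify(u, v))
```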
[185] ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation
Siying Zhou, Yiquan Wu, Hui Chen, Xavier Hu, Kun Kuang, Adam Jatowt, Ming Hu, Chunyan Zheng, Fei Wu
Main category: cs.CL
TL;DR: This paper introduces the first dataset for Chinese legal claim generation and evaluates state-of-the-art LLMs on this task, revealing limitations in factual precision and clarity.
Details
Motivation: While existing research focuses on helping legal professionals, there's a gap in assisting non-professionals (like plaintiffs) with legal claim generation based on case facts.Method: Constructed ClaimGen-CN dataset from real-world legal disputes, designed evaluation metrics for factuality and clarity, and conducted comprehensive zero-shot evaluation of general and legal-domain LLMs.
Result: Current models show limitations in factual precision and expressive clarity when generating legal claims, indicating the need for more targeted development in this domain.
Conclusion: The research highlights the challenges in legal claim generation and provides a publicly available dataset to encourage further exploration of this important task for assisting non-professionals.
Abstract: Legal claims refer to the plaintiff’s demands in a case and are essential to guiding judicial reasoning and case resolution. While many works have focused on improving the efficiency of legal professionals, the research on helping non-professionals (e.g., plaintiffs) remains unexplored. This paper explores the problem of legal claim generation based on the given case’s facts. First, we construct ClaimGen-CN, the first dataset for Chinese legal claim generation task, from various real-world legal disputes. Additionally, we design an evaluation metric tailored for assessing the generated claims, which encompasses two essential dimensions: factuality and clarity. Building on this, we conduct a comprehensive zero-shot evaluation of state-of-the-art general and legal-domain large language models. Our findings highlight the limitations of the current models in factual precision and expressive clarity, pointing to the need for more targeted development in this domain. To encourage further exploration of this important task, we will make the dataset publicly available.
[186] Computational-Assisted Systematic Review and Meta-Analysis (CASMA): Effect of a Subclass of GnRH-a on Endometriosis Recurrence
Sandro Tsang
Main category: cs.CL
TL;DR: CASMA workflow integrates PRISMA guidelines with computational methods to enhance systematic reviews, demonstrated through endometriosis recurrence analysis showing 36% reduction in recurrence with GnRH-a.
Details
Motivation: To address the challenge of growing medical literature by developing an efficient, transparent, and reproducible systematic review workflow using computational solutions.Method: Hybrid approach combining PRISMA guidelines with fuzzy matching and regular expressions for semi-automated deduplication and filtering, with modified splitting method for multi-arm trials.
Result: Workflow reduced screening workload significantly (11 days for 33,444 records), identified 7 eligible RCTs (841 patients), showing 36% reduction in recurrence (RR=0.64, 95% CI 0.48-0.86) with no heterogeneity.
Conclusion: The information-retrieval-driven workflow successfully bridges clinical research and computer science, providing a generalizable framework for scalable evidence synthesis with robust clinical results.
Abstract: Background: Evidence synthesis facilitates evidence-based medicine. This task becomes increasingly difficult to accomplish without computational solutions, since the medical literature grows at an astonishing rate. Objective: This study evaluates an information retrieval-driven workflow, CASMA, to enhance the efficiency, transparency, and reproducibility of systematic reviews. Endometriosis recurrence serves as the ideal case due to its complex and ambiguous literature. Methods: The hybrid approach integrates PRISMA guidelines with fuzzy matching and regular expressions (regex) to facilitate semi-automated deduplication and filtering of records before manual screening. The workflow synthesised evidence from randomised controlled trials on the efficacy of a subclass of gonadotropin-releasing hormone agonists (GnRH-a). A modified splitting method addressed unit-of-analysis errors in multi-arm trials. Results: The workflow sharply reduced the screening workload, taking only 11 days to fetch and filter 33,444 records. Seven eligible RCTs were synthesised (841 patients). The pooled random-effects model yielded a Risk Ratio (RR) of 0.64 (95% CI 0.48 to 0.86), demonstrating a 36% reduction in recurrence, with non-significant heterogeneity (I² = 0.00%, τ² = 0.00). The findings were robust and stable, as they were backed by sensitivity analyses. Conclusion: This study demonstrates an application of an information-retrieval-driven workflow for medical evidence synthesis. The approach yields valuable clinical results and a generalisable framework for scaling up evidence synthesis, bridging the gap between clinical research and computer science.
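A minimal sketch of the semi-automated deduplication step, assuming regex normalization plus fuzzy title matching with difflib (the 0.9 cutoff is an illustrative choice, not the paper's):

```python
# Regex-normalized fuzzy deduplication of bibliographic titles.
import re
from difflib import SequenceMatcher

def normalize(title: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def dedupe(titles: list[str], cutoff: float = 0.9) -> list[str]:
    kept: list[str] = []
    for t in titles:
        t_norm = normalize(t)
        if all(SequenceMatcher(None, t_norm, normalize(k)).ratio() < cutoff
               for k in kept):
            kept.append(t)
    return kept

records = ["GnRH-a and endometriosis recurrence.",
           "GnRH-a and Endometriosis Recurrence",
           "Dienogest maintenance therapy"]
print(dedupe(records))  # the second record is dropped as a near-duplicate
```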
[187] Modeling Bottom-up Information Quality during Language Processing
Cui Ding, Yanning Yin, Lena A. Jäger, Ethan Gotlieb Wilcox
Main category: cs.CL
TL;DR: The paper proposes an information-theoretic measure of bottom-up input quality in reading, showing that reduced visual information (occluding word halves) increases reading difficulty, with upper halves containing more word identity information than lower halves in both English and Chinese.
Details
Motivation: To test the prediction from language processing models that noisy bottom-up inputs should lead to more difficult comprehension, specifically in the domain of reading.Method: Used mutual information between visual information and word identity as an operationalization of input quality; conducted reading experiments with occluded word halves; employed multimodal language models to estimate mutual information; compared English and Chinese reading patterns.
Result: Upper halves contain more information about word identity than lower halves in both languages, but the asymmetry is more pronounced in English; reading times increased when information quality was reduced through occlusion.
Conclusion: Information quality of bottom-up inputs significantly affects reading comprehension difficulty, with visual information distribution patterns varying across writing systems but consistently showing upper-half dominance in word identity information.
Abstract: Contemporary theories model language processing as integrating both top-down expectations and bottom-up inputs. One major prediction of such models is that the quality of the bottom-up inputs modulates ease of processing – noisy inputs should lead to difficult and effortful comprehension. We test this prediction in the domain of reading. First, we propose an information-theoretic operationalization for the “quality” of bottom-up information as the mutual information (MI) between visual information and word identity. We formalize this prediction in a mathematical model of reading as a Bayesian update. Second, we test our operationalization by comparing participants’ reading times in conditions where words’ information quality has been reduced, either by occluding their top or bottom half, with full words. We collect data in English and Chinese. We then use multimodal language models to estimate the mutual information between visual inputs and words. We use these data to estimate the specific effect of reduced information quality on reading times. Finally, we compare how information is distributed across visual forms. In English and Chinese, the upper half contains more information about word identity than the lower half. However, the asymmetry is more pronounced in English, a pattern which is reflected in the reading times.
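The operationalization of input quality is the standard mutual information between visual input V and word identity W; written out (notation assumed from the summary, not copied from the paper):

```latex
% Input quality as mutual information between visual input V and word identity W
\mathrm{MI}(V; W) \;=\; H(W) - H(W \mid V)
\;=\; \sum_{v,\,w} p(v, w)\,\log \frac{p(v, w)}{p(v)\,p(w)}
```

Occluding a word half reduces the information V carries about W, raising the residual uncertainty H(W | V) and, per the Bayesian-update model, the predicted reading time.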
[188] WolBanking77: Wolof Banking Speech Intent Classification Dataset
Abdou Karim Kandji, Frédéric Precioso, Cheikh Ba, Samba Ndiaye, Augustin Ndione
Main category: cs.CL
TL;DR: This paper introduces WolBanking77, a Wolof speech intent classification dataset for banking domain, addressing the gap in low-resource language NLP research.
Details
Motivation: Previous intent classification studies focus on high-resource languages, creating a gap for low-resource languages like Wolof (spoken by 10M+ people) and regions with high illiteracy rates where spoken language is more prevalent than written.Method: Created WolBanking77 dataset containing 9,791 text sentences and 4+ hours of spoken sentences in Wolof banking domain. Conducted experiments with various text and voice state-of-the-art models as baselines.
Result: Promising results on the dataset with reported F1-scores for NLP models and word error rates for ASR models trained on WolBanking77, along with model comparisons.
Conclusion: The paper presents a valuable resource for intent classification research in low-resource languages and provides baseline performance metrics that show the dataset’s utility for advancing NLP in underrepresented languages like Wolof.
Abstract: Intent classification models have made significant progress in recent years. However, previous studies primarily focus on high-resource language datasets, which results in a gap for low-resource languages and for regions with high rates of illiteracy, where languages are more spoken than read or written. This is the case in Senegal, for example, where Wolof is spoken by around 90% of the population, while the national illiteracy rate remains at 42%. Wolof is in fact spoken by more than 10 million people in the West African region. To address these limitations, we introduce the Wolof Banking Speech Intent Classification Dataset (WolBanking77), for academic research in intent classification. WolBanking77 currently contains 9,791 text sentences in the banking domain and more than 4 hours of spoken sentences. Experiments on various baselines are conducted in this work, including text and voice state-of-the-art models. The results on the current dataset are very promising. In addition, this paper presents an in-depth examination of the dataset’s contents. We report baseline F1-scores for NLP models and word error rates for ASR models trained on the WolBanking77 dataset, as well as comparisons between models. Dataset and code available at: https://github.com/abdoukarim/wolbanking77.
[189] Aligning LLMs for Multilingual Consistency in Enterprise Applications
Amit Agarwal, Hansa Meghwani, Hitesh Laxmichand Patel, Tao Sheng, Sujith Ravi, Dan Roth
Main category: cs.CL
TL;DR: A batch-wise alignment strategy for fine-tuning LLMs using multilingual data to reduce performance gaps between English and non-English languages, improving non-English accuracy by up to 23.9% without compromising English performance.
Details
Motivation: LLMs are unreliable for global enterprise applications due to substantial performance gaps between high-resource and mid/low-resource languages, driven by English-centric pretraining and internal reasoning biases, which undermines customer experience and operational reliability in multilingual settings.Method: A practical, batch-wise alignment strategy for fine-tuning LLMs that leverages semantically equivalent multilingual data in each training batch to directly align model outputs across languages.
Result: The approach improves non-English accuracy by up to 23.9% without compromising English performance, model reasoning, or retrieval quality. It addresses the observed 29% accuracy drop in non-English languages compared to English.
Conclusion: The method is simple to implement, scalable, and integrates seamlessly with existing LLM training & deployment pipelines, enabling more robust and equitable multilingual AI solutions in industry.
Abstract: Large language models (LLMs) remain unreliable for global enterprise applications due to substantial performance gaps between high-resource and mid/low-resource languages, driven by English-centric pretraining and internal reasoning biases. This inconsistency undermines customer experience and operational reliability in multilingual settings such as customer support, content moderation, and information retrieval. Even with advanced Retrieval-Augmented Generation (RAG) systems, we observe up to a 29% accuracy drop in non-English languages compared to English. We propose a practical, batch-wise alignment strategy for fine-tuning LLMs, leveraging semantically equivalent multilingual data in each training batch to directly align model outputs across languages. This approach improves non-English accuracy by up to 23.9% without compromising English performance, model reasoning, or retrieval quality. Our method is simple to implement, scalable, and integrates seamlessly with existing LLM training and deployment pipelines, enabling more robust and equitable multilingual AI solutions in industry.
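A hedged sketch of what a batch-wise alignment objective could look like: parallel texts share a batch, and an auxiliary loss pulls their pooled representations together. The cosine-based loss and the 0.1 weight are illustrative assumptions, not the paper's exact recipe:

```python
# Auxiliary alignment loss over semantically equivalent texts in one batch.
import torch
import torch.nn.functional as F

def alignment_loss(hidden_en: torch.Tensor, hidden_xx: torch.Tensor) -> torch.Tensor:
    """hidden_*: (batch, seq, d) hidden states of parallel EN / non-EN texts."""
    pooled_en = hidden_en.mean(dim=1)   # mean-pool over sequence
    pooled_xx = hidden_xx.mean(dim=1)
    return (1 - F.cosine_similarity(pooled_en, pooled_xx, dim=-1)).mean()

h_en, h_de = torch.randn(4, 16, 64), torch.randn(4, 16, 64)
task_loss = torch.tensor(2.3)           # stand-in for the usual LM loss
total = task_loss + 0.1 * alignment_loss(h_en, h_de)
print(total.item())
```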
[190] EuroSpeech: A Multilingual Speech Corpus
Samuel Pfisterer, Florian Grötschla, Luca A. Lanzendörfer, Florian Yan, Roger Wattenhofer
Main category: cs.CL
TL;DR: A scalable pipeline for constructing speech datasets from parliamentary recordings is introduced, extracting over 61k hours of aligned speech across 22 European languages with significant per-language coverage.
Details
Motivation: Existing multilingual speech datasets often have insufficient data for most languages, leading to poor model performance on the majority of supported languages.Method: A pipeline with robust media retrieval and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio from parliamentary recordings.
Result: Extracted over 61k hours of aligned speech segments from 22 European parliaments, with 19 languages exceeding 1k hours and 22 languages exceeding 500 hours. Achieved 41.8% average reduction in word error rates when finetuning ASR models.
Conclusion: The proposed pipeline effectively addresses data scarcity in multilingual speech processing by leveraging parliamentary recordings to create large-scale, high-quality speech datasets.
Abstract: Recent progress in speech processing has highlighted that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, they often contain insufficient data for most languages. Thus, trained models perform poorly on the majority of the supported languages. Our work addresses this challenge by introducing a scalable pipeline for constructing speech datasets from parliamentary recordings. The proposed pipeline includes robust components for media retrieval and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying this pipeline to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, achieving substantial per-language coverage with 19 languages exceeding 1k hours and 22 languages exceeding 500 hours of high-quality speech data. We obtain an average 41.8% reduction in word error rates over baselines when finetuning an existing ASR model on our dataset, demonstrating the usefulness of our approach.
[191] ThinkBrake: Mitigating Overthinking in Tool Reasoning
Minjae Oh, Sangjun Song, Seungkyu Lee, Sungmin Jo, Yohan Jo
Main category: cs.CL
TL;DR: Small reasoning models often overthink during tool use, reaching correct configurations then overwriting them. ThinkBrake, a training-free decoding heuristic, monitors log-probability margins at sentence boundaries to trigger early termination, reducing tokens by up to 25% while maintaining or improving accuracy.
Details
Motivation: Small reasoning models exhibit overthinking behavior during tool use, where they reach correct tool-argument configurations but continue reasoning and overwrite them with incorrect calls. This reveals substantial recoverable headroom and potential redundant reasoning that needs addressing.Method: Diagnosed overthinking via oracle rollouts injecting </think> at sentence boundaries. Introduced ThinkBrake - a training-free decoding heuristic that monitors the log-probability margin between </think> and the current top token at sentence boundaries, triggering termination when this margin becomes small.
Result: Oracle termination lifted average accuracy from 85.8% to 94.2% while reducing tokens by 80-94%. ThinkBrake preserved or improved accuracy across BFCL’s single turn, non-live and live splits while reducing tokens up to 25%, outperforming various baselines.
Conclusion: ThinkBrake effectively addresses overthinking in small reasoning models during tool use, demonstrating that training-free early termination methods can significantly reduce computational overhead while maintaining or improving performance.
Abstract: Small reasoning models (SRMs) often overthink during tool use: they reach a correct tool-argument configuration, then continue reasoning and overwrite it with an incorrect final call. We diagnose overthinking via oracle rollouts that inject </think> at sentence boundaries. On the Berkeley Function Calling Leaderboard (BFCL), this oracle termination lifts average accuracy from 85.8% to 94.2% while reducing tokens by 80-94%, revealing substantial recoverable headroom and potential redundant reasoning. While prior work on concise reasoning has largely targeted mathematics, tool reasoning remains underexplored. We adapt various early-termination baselines to tool use and introduce ThinkBrake, a training-free decoding heuristic. ThinkBrake monitors the log-probability margin between </think> and the current top token at sentence boundaries and triggers termination when this margin becomes small. Across BFCL’s single-turn, non-live, and live splits, ThinkBrake preserves or improves accuracy while reducing tokens by up to 25%, outperforming various baselines.
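A minimal sketch of the braking rule: at a sentence boundary, compare the log-probability of emitting </think> with that of the current top token, and stop reasoning when the margin is small. The threshold and interface are assumptions:

```python
# Training-free early-termination check at a sentence boundary.
import torch

def should_brake(logits: torch.Tensor, think_end_id: int,
                 margin_threshold: float = 1.0) -> bool:
    """logits: (vocab,) next-token logits at a sentence boundary."""
    logprobs = torch.log_softmax(logits, dim=-1)
    margin = logprobs.max() - logprobs[think_end_id]
    return margin.item() < margin_threshold  # small margin -> end reasoning

logits = torch.randn(50_000)               # toy next-token logits
print(should_brake(logits, think_end_id=12345))
```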
[192] On the Convergence of Moral Self-Correction in Large Language Models
Guangliang Liu, Haitao Mao, Bochuan Cao, Zhiyu Xue, Xitong Zhang, Rongrong Wang, Kristen Marie Johnson
Main category: cs.CL
TL;DR: LLMs can self-correct their moral responses through multi-round interactions, leading to performance convergence as moral concepts stabilize.
Details
Motivation: To understand how and why intrinsic self-correction works in LLMs, particularly for moral reasoning, since empirical success exists but mechanisms remain unknown.Method: Mechanistic analysis of moral self-correction through multi-round interactions, examining how self-correction instructions activate moral concepts and reduce model uncertainty.
Result: Self-correction instructions consistently activate moral concepts that reduce model uncertainty, leading to converged performance as these concepts stabilize over successive rounds.
Conclusion: Moral self-correction exhibits desirable convergence properties, demonstrating strong potential for improving LLM responses through intrinsic self-correction mechanisms.
Abstract: Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions; and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property of converged performance.
[193] Populism Meets AI: Advancing Populism Research with LLMs
Yujin J. Jung, Eduardo Ryô Tamaki, Julia Chatterley, Grant Mitchell, Semir Dzebo, Cristóbal Sandoval, Levente Littvay, Kirk A. Hawkins
Main category: cs.CL
TL;DR: A rubric and anchor guided chain of thought prompting approach enables LLMs to achieve expert-level accuracy in measuring populism in political speeches, overcoming limitations of traditional textual analysis methods.
Details
Motivation: Traditional textual analysis methods for measuring populism are costly, time-consuming, and difficult to scale across languages, contexts, and large corpora, creating a need for more efficient approaches.Method: Used a rubric and anchor guided chain of thought prompting approach that mirrors human coder training, leveraging the Global Populism Database and adapting human coder documentation to guide LLM reasoning.
Result: The domain-specific prompting strategy enabled LLMs to achieve classification accuracy on par with expert human coders, successfully navigating the nuanced, context-sensitive aspects of populism.
Conclusion: LLMs with domain-specific prompting can effectively measure populist ideational content, providing a scalable alternative to traditional methods while maintaining expert-level accuracy.
Abstract: Measuring the ideational content of populism remains a challenge. Traditional strategies based on textual analysis have been critical for building the field’s foundations and providing a valid, objective indicator of populist framing. Yet these approaches are costly, time consuming, and difficult to scale across languages, contexts, and large corpora. Here we present the results from a rubric and anchor guided chain of thought (CoT) prompting approach that mirrors human coder training. By leveraging the Global Populism Database (GPD), a comprehensive dataset of global leaders’ speeches annotated for degrees of populism, we replicate the process used to train human coders by prompting the LLM with an adapted version of the same documentation to guide the model’s reasoning. We then test multiple proprietary and open weight models by replicating scores in the GPD. Our findings reveal that this domain specific prompting strategy enables the LLM to achieve classification accuracy on par with expert human coders, demonstrating its ability to navigate the nuanced, context sensitive aspects of populism.
[194] LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology
Sajib Acharjee Dip, Adrika Zafor, Bikash Kumar Paul, Uddip Acharjee Shuvo, Muhit Islam Emon, Xuan Wang, Liqing Zhang
Main category: cs.CL
TL;DR: LLM4Cell is the first unified survey of 58 foundation and agentic models for single-cell biology, categorizing them across RNA, ATAC, multi-omic, and spatial modalities, and evaluating them across 10 domain dimensions using over 40 public datasets.
Details
Motivation: To address the fragmented progress in using large language models and agentic frameworks for single-cell biology across different data modalities, architectures, and evaluation standards.Method: Categorizes 58 models into five families (foundation, text-bridge, spatial, multimodal, epigenomic, and agentic) and maps them to eight analytical tasks, then evaluates them using over 40 public datasets across 10 domain dimensions.
Result: Provides the first integrated view of language-driven single-cell intelligence, analyzing benchmark suitability, data diversity, and ethical/scalability constraints while evaluating models on biological grounding, multi-omics alignment, fairness, privacy, and explainability.
Conclusion: LLM4Cell outlines open challenges in interpretability, standardization, and trustworthy model development for single-cell biology applications of language models.
Abstract: Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods into five families (foundation, text-bridge, spatial, multimodal, epigenomic, and agentic) and map them to eight key analytical tasks including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical or scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides the first integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.
[195] Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation
Mir Tafseer Nayeem, Sawsan Alqahtani, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari
Main category: cs.CL
TL;DR: Tokenization in LLMs is poorly evaluated. Standard fertility metric misses vocabulary allocation across languages. Proposed STRR metric shows English prioritization, Chinese support, and Hindi fragmentation, offering cross-lingual fairness insights.
Details
Motivation: Current tokenization evaluation using fertility metric is insufficient as it only measures compression efficiency and obscures how vocabularies are distributed across different languages and domains.Method: Analyzed six widely used tokenizers across seven languages and two domains. Proposed Single Token Retention Rate (STRR) to measure proportion of words preserved as single tokens.
Result: Found stable fertility for English, high fertility for Chinese, little domain sensitivity. STRR revealed systematic prioritization of English, strong support for Chinese, and fragmentation in Hindi.
Conclusion: STRR complements fertility and provides practical guidance for designing more equitable multilingual tokenizers by offering interpretable view of cross-lingual fairness.
Abstract: Tokenization is a crucial but under-evaluated step in large language models (LLMs). The standard metric, fertility (the average number of tokens per word), captures compression efficiency but obscures how vocabularies are allocated across languages and domains. We analyze six widely used tokenizers across seven languages and two domains, finding stable fertility for English, high fertility for Chinese, and little domain sensitivity. To address fertility’s blind spots, we propose the Single Token Retention Rate (STRR), which measures the proportion of words preserved as single tokens. STRR reveals systematic prioritization of English, strong support for Chinese, and fragmentation in Hindi, offering an interpretable view of cross-lingual fairness. Our results show that STRR complements fertility and provides practical guidance for designing more equitable multilingual tokenizers.
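STRR is straightforward to compute. A minimal sketch using a Hugging Face tokenizer, with whitespace word segmentation as a simplifying assumption (Chinese or Hindi would need a proper word segmenter):

```python
# Single Token Retention Rate: fraction of words kept as a single token.
from transformers import AutoTokenizer

def strr(words: list[str], tokenizer) -> float:
    single = sum(1 for w in words if len(tokenizer.tokenize(w)) == 1)
    return single / len(words)

tok = AutoTokenizer.from_pretrained("gpt2")
sample = "the cat sat on the internationalization mat".split()
print(f"STRR = {strr(sample, tok):.2f}")
```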
[196] LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora
Luyao Zhuang, Shengyuan Chen, Yilin Xiao, Huachi Zhou, Yujing Zhang, Hao Chen, Qinggang Zhang, Xiao Huang
Main category: cs.CL
TL;DR: LinearRAG is an efficient graph-based RAG framework that constructs relation-free hierarchical graphs using lightweight entity extraction, enabling linear scaling and precise retrieval without costly relation extraction.
Details
Motivation: Traditional RAG systems struggle with fragmented information in large corpora, while existing GraphRAG methods rely on unstable and costly relation extraction that produces noisy graphs degrading retrieval quality.Method: LinearRAG constructs a Tri-Graph using only entity extraction and semantic linking, avoiding relation modeling. It uses two-stage retrieval: entity activation via semantic bridging followed by passage retrieval through importance aggregation.
Result: Extensive experiments on four datasets show LinearRAG significantly outperforms baseline models in retrieval performance.
Conclusion: LinearRAG provides an efficient, economical, and reliable alternative to traditional GraphRAG by eliminating unstable relation extraction while maintaining strong retrieval capabilities.
Abstract: Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation. Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models.
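A hedged sketch of the two-stage retrieval: activated entities carry scores from semantic bridging, and passages accumulate the scores of their linked entities. The links and scores below are toy stand-ins for the Tri-Graph:

```python
# Stage 1: entity activation scores; stage 2: passage importance aggregation.

def retrieve(query_entities: dict[str, float],
             entity_to_passages: dict[str, list[str]],
             top_k: int = 2) -> list[str]:
    """query_entities: entity -> activation score from semantic bridging."""
    passage_scores: dict[str, float] = {}
    for entity, score in query_entities.items():        # stage 1: activation
        for passage in entity_to_passages.get(entity, []):
            passage_scores[passage] = passage_scores.get(passage, 0.0) + score
    ranked = sorted(passage_scores, key=passage_scores.get, reverse=True)
    return ranked[:top_k]                               # stage 2: aggregation

links = {"insulin": ["p1", "p3"], "diabetes": ["p1", "p2"]}
print(retrieve({"insulin": 0.9, "diabetes": 0.7}, links))  # ['p1', 'p3']
```

Because construction uses only entity extraction and linking, indexing cost grows linearly with corpus size, which is the efficiency claim above.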
[197] DiffHeads: Differential Analysis and Inference-Time Masking of Bias Heads in Large Language Models
Tingxu Han, Wei Song, Ziqi Ding, Ziming Li, Chunrong Fang, Yuekang Li, Dongfang Liu, Zhenyu Chen, Zhenting Wang
Main category: cs.CL
TL;DR: DiffHeads is a lightweight debiasing framework that identifies and masks specific bias heads in LLMs through differential activation analysis between Direct-Answer and Chain-of-Thought prompting, reducing unfairness by 49.4% and 40.3% respectively without harming utility.
Details
Motivation: LLMs increasingly mediate decisions in sensitive domains where unfair treatment of demographic groups is unacceptable, but existing bias mitigation approaches are largely fragile and lack understanding of the underlying mechanisms.Method: 1) Compare DA vs CoT prompting across 8 LLMs; 2) Define token-to-head contribution score to trace bias to specific attention heads; 3) Develop DiffHeads that identifies bias heads through differential activation analysis and selectively masks them.
Result: DA triggering increases unfairness by 534.5%-391.9%; Identified small cluster of bias heads that activate under DA but stay dormant with CoT; DiffHeads reduces unfairness by 49.4% under DA and 40.3% under CoT without harming model utility.
Conclusion: The paper provides the first causal link between prompting strategy and bias emergence, and demonstrates that selective masking of identified bias heads through DiffHeads effectively reduces unfairness while maintaining model performance.
Abstract: Large language models (LLMs) increasingly mediate decisions in domains where unfair treatment of demographic groups is unacceptable. Existing work probes when biased outputs appear, but gives little insight into the mechanisms that generate them, leaving existing mitigations largely fragile. In this paper, we conduct a systematic investigation of LLM unfairness and propose DiffHeads, a lightweight debiasing framework for LLMs. We first compare Direct-Answer (DA) prompting to Chain-of-Thought (CoT) prompting across eight representative open- and closed-source LLMs. DA triggers the model’s latent bias and increases measured unfairness by 534.5%-391.9% in both one-turn and two-turn dialogues. Next, we define a token-to-head contribution score that traces each token’s influence back to individual attention heads. This reveals a small cluster of bias heads that activate under DA but stay largely dormant with CoT, providing the first causal link between prompting strategy and bias emergence. Finally, building on this insight, we propose DiffHeads, which identifies bias heads through differential activation analysis between DA and CoT and selectively masks only those heads. DiffHeads reduces unfairness by 49.4% and 40.3% under DA and CoT, respectively, without harming model utility.
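A toy sketch of differential activation analysis: average per-head activations under DA and CoT prompts, flag the most DA-skewed heads, and build an inference-time mask. The shapes, absolute-value activations, and top-k selection are illustrative assumptions:

```python
# Identify DA-skewed "bias heads" and build an inference-time head mask.
import numpy as np

n_layers, n_heads = 12, 8
act_da = np.abs(np.random.randn(100, n_layers, n_heads))   # 100 DA prompts
act_cot = np.abs(np.random.randn(100, n_layers, n_heads))  # 100 CoT prompts

diff = act_da.mean(axis=0) - act_cot.mean(axis=0)          # (layers, heads)
k = 5
flat = np.argsort(diff.ravel())[::-1][:k]                  # most DA-skewed heads
bias_heads = [(int(i // n_heads), int(i % n_heads)) for i in flat]

mask = np.ones((n_layers, n_heads))
for layer, head in bias_heads:
    mask[layer, head] = 0.0   # zero this head's output at inference
print("masked heads:", bias_heads)
```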
[198] Are LLMs Empathetic to All? Investigating the Influence of Multi-Demographic Personas on a Model’s Empathy
Ananya Malik, Nazanin Sabri, Melissa Karnaze, Mai Elsherief
Main category: cs.CL
TL;DR: LLMs’ empathy varies significantly across demographic groups, with intersectional analysis revealing complex patterns that sometimes reverse expected empathy trends, highlighting the need for more inclusive model design.
Details
Motivation: To investigate whether LLMs demonstrate equitable empathy across diverse user groups, given that emotional experiences are shaped by demographic and cultural contexts.
Method: Proposed a framework analyzing cognitive and affective empathy across 315 unique personas combining age, culture, and gender attributes, using both quantitative and qualitative analysis across four LLMs.
Result: Demographic attributes profoundly shape empathetic responses, with multiple attributes sometimes attenuating or reversing expected empathy patterns. Models broadly reflect real-world trends but show notable misalignments for certain groups like Confucian culture.
Conclusion: Designing empathy-aware LLMs that account for demographic diversity is crucial for promoting more inclusive and equitable model behavior.
Abstract: Large Language Models’ (LLMs) ability to converse naturally is empowered by their ability to empathetically understand and respond to their users. However, emotional experiences are shaped by demographic and cultural contexts. This raises an important question: Can LLMs demonstrate equitable empathy across diverse user groups? We propose a framework to investigate how LLMs’ cognitive and affective empathy vary across user personas defined by intersecting demographic attributes. Our study introduces a novel intersectional analysis spanning 315 unique personas, constructed from combinations of age, culture, and gender, across four LLMs. Results show that attributes profoundly shape a model’s empathetic responses. Interestingly, we see that adding multiple attributes at once can attenuate or even reverse expected empathy patterns. We show that model responses broadly reflect real-world empathetic trends, with notable misalignments for certain groups, such as those from Confucian culture. We complement our quantitative findings with qualitative insights to uncover model behaviour patterns across different demographic groups. Our findings highlight the importance of designing empathy-aware LLMs that account for demographic diversity to promote more inclusive and equitable model behaviour.
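The 315-persona grid is plain combinatorics over attribute values; here is a sketch with illustrative values only, since the paper's actual attribute lists are not given in the abstract.

```python
from itertools import product

# Illustrative attribute values; the paper's real lists differ and
# multiply out to 315 personas.
ages = ["young adult", "middle-aged", "older adult"]
cultures = ["Confucian", "Anglo", "Latin American"]
genders = ["woman", "man", "non-binary"]

personas = [
    f"a {age} {gender} from a {culture} cultural background"
    for age, culture, gender in product(ages, cultures, genders)
]  # 3 x 3 x 3 = 27 personas in this toy grid
```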
[199] Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs
Pasin Buakhaw, Kun Kerdthaisong, Phuree Phenhiran, Pitikorn Khlaisamniang, Supasate Vorathammathorn, Piyalitt Ittichaiwong, Nutchanon Yongsatianchot
Main category: cs.CL
TL;DR: The paper presents an approach for creating dynamic NPCs using LLMs, combining lightweight prompting techniques and fine-tuned models to achieve high rankings in the CPDC 2025 competition.
Details
Motivation: To leverage large language models for creating dynamic non-player characters in gaming that can perform functional tasks while maintaining persona-consistent dialogue.
Method: Combines two strategies: (1) lightweight prompting techniques, including a Deflanderization method to reduce excessive role-play, and (2) fine-tuned Qwen3-14B models using supervised fine-tuning and LoRA adaptation.
Result: Achieved 2nd place on Task 1, 2nd place on Task 3 (API track), and 4th place on Task 3 (GPU track) in the Commonsense Persona-Grounded Dialogue Challenge 2025 Round 2.
Conclusion: The combination of prompting techniques and fine-tuned models effectively enables LLMs to create dynamic NPCs capable of both task execution and persona-consistent dialogue in gaming environments.
Abstract: The emergence of large language models (LLMs) has opened new opportunities for creating dynamic non-player characters (NPCs) in gaming environments, enabling both functional task execution and persona-consistent dialogue generation. In this paper, we (Tu_Character_lab) report our participation in the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025 Round 2, which evaluates agents across three tracks: task-oriented dialogue, context-aware dialogue, and their integration. Our approach combines two complementary strategies: (i) lightweight prompting techniques in the API track, including a Deflanderization prompting method to suppress excessive role-play and improve task fidelity, and (ii) fine-tuned large models in the GPU track, leveraging Qwen3-14B with supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA). Our best submissions ranked 2nd on Task 1, 2nd on Task 3 (API track), and 4th on Task 3 (GPU track).
[200] Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety
Vamshi Krishna Bonagiri, Ponnurangam Kumaragurum, Khanh Nguyen, Benjamin Plaut
Main category: cs.CL
TL;DR: LLM agents can improve safety by quitting when uncertain, achieving significant safety gains with minimal helpfulness loss.
Details
Motivation: Multi-turn agentic scenarios with real-world tool access create compounding uncertainties that lead to severe risks beyond traditional text generation failures.
Method: Using ‘quitting’ as a behavioral mechanism for LLM agents to withdraw from uncertain situations, evaluated systematically across 12 state-of-the-art LLMs using the ToolEmu framework.
Result: Agents with explicit quit instructions improved safety by +0.39 on 0-3 scale (+0.64 for proprietary models) with only -0.03 average decrease in helpfulness.
Conclusion: Explicit quit instructions are a highly effective, immediately deployable safety mechanism that establishes quitting as an effective first-line defense for autonomous agents in high-stakes applications.
Abstract: As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn agentic scenarios with real-world tool access present unique challenges where uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text generation failures. We propose using “quitting” as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state-of-the-art LLMs. Our results demonstrate a highly favorable safety-helpfulness trade-off: agents prompted to quit with explicit instructions improve safety by an average of +0.39 on a 0-3 scale across all models (+0.64 for proprietary models), while maintaining a negligible average decrease of -0.03 in helpfulness. Our analysis demonstrates that simply adding explicit quit instructions proves to be a highly effective safety mechanism that can immediately be deployed in existing agent systems, and establishes quitting as an effective first-line defense mechanism for autonomous agents in high-stakes applications.
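Because the intervention is nothing more than an explicit instruction, it is easy to illustrate. The wording below is invented for illustration; the paper's exact quit instruction is not quoted in the abstract.

```python
QUIT_INSTRUCTION = (
    "If you are uncertain whether the next action is safe or correct, do not "
    "guess. Reply with QUIT and state what information you are missing."
)

def build_agent_prompt(task: str) -> str:
    # Hypothetical phrasing; serves only to illustrate the mechanism.
    return f"{task}\n\n{QUIT_INSTRUCTION}"
```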
[201] The Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models
Shivam Ratnakar, Sanjay Raghavendra
Main category: cs.CL
TL;DR: LLMs exhibit ‘chameleon behavior’ - shifting stances when faced with contradictory questions in multi-turn conversations, especially in search-enabled systems, undermining reliability in critical applications.
Details
Motivation: To systematically investigate the vulnerability of LLMs in maintaining consistent stances across multi-turn conversations, particularly when integrated with search engines, as this poses reliability risks in critical domains.
Method: Created Chameleon Benchmark Dataset with 17,770 question-answer pairs across 1,180 multi-turn conversations spanning 12 controversial domains. Introduced two metrics: Chameleon Score (stance instability) and Source Re-use Rate (knowledge diversity). Evaluated Llama-4-Maverick, GPT-4o-mini, and Gemini-2.5-Flash.
Result: All models exhibited severe chameleon behavior (scores 0.391-0.511), with GPT-4o-mini performing worst. Strong correlations found between source re-use rate and confidence (r=0.627) and stance changes (r=0.429), indicating limited knowledge diversity makes models deferential to query framing.
Conclusion: LLMs show pathological deference to query framing due to limited knowledge diversity, highlighting the critical need for comprehensive consistency evaluation before deployment in healthcare, legal, and financial systems where coherent positions are essential.
Abstract: Integration of Large Language Models with search/retrieval engines has become ubiquitous, yet these systems harbor a critical vulnerability that undermines their reliability. We present the first systematic investigation of “chameleon behavior” in LLMs: their alarming tendency to shift stances when presented with contradictory questions in multi-turn conversations (especially in search-enabled LLMs). Through our novel Chameleon Benchmark Dataset, comprising 17,770 carefully crafted question-answer pairs across 1,180 multi-turn conversations spanning 12 controversial domains, we expose fundamental flaws in state-of-the-art systems. We introduce two theoretically grounded metrics: the Chameleon Score (0-1) that quantifies stance instability, and Source Re-use Rate (0-1) that measures knowledge diversity. Our rigorous evaluation of Llama-4-Maverick, GPT-4o-mini, and Gemini-2.5-Flash reveals consistent failures: all models exhibit severe chameleon behavior (scores 0.391-0.511), with GPT-4o-mini showing the worst performance. Crucially, small across-temperature variance (less than 0.004) suggests the effect is not a sampling artifact. Our analysis uncovers the mechanism: strong correlations between source re-use rate and confidence (r=0.627) and stance changes (r=0.429) are statistically significant (p less than 0.05), indicating that limited knowledge diversity makes models pathologically deferential to query framing. These findings highlight the need for comprehensive consistency evaluation before deploying LLMs in healthcare, legal, and financial systems where maintaining coherent positions across interactions is critical for reliable decision support.
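The abstract does not define the Chameleon Score beyond "a 0-1 metric quantifying stance instability"; one natural stand-in, offered purely as an assumption, is the fraction of adjacent turns in which the model's stance flips.

```python
def chameleon_score(stances):
    """Assumed stand-in: fraction of adjacent-turn stance flips
    (0 = perfectly stable, 1 = flips every turn). The paper's actual
    definition may differ."""
    if len(stances) < 2:
        return 0.0
    flips = sum(a != b for a, b in zip(stances, stances[1:]))
    return flips / (len(stances) - 1)

# chameleon_score(["pro", "con", "con", "pro"]) -> 0.667
```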
[202] SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, Paul Röttger
Main category: cs.CL
TL;DR: SimBench is the first large-scale standardized benchmark for evaluating LLM simulations of human behavior, unifying 20 diverse datasets to provide robust, reproducible assessment of when and why LLM simulations succeed or fail.
Details
Motivation: Current evaluations of LLM simulations are fragmented with bespoke tasks and metrics, creating incomparable results that hinder progress in using LLMs for social and behavioral sciences.
Method: Created SimBench by unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, enabling standardized evaluation of LLM simulation capabilities.
Result: Best LLMs today have limited simulation ability (40.80/100), performance scales log-linearly with model size, simulation performance not improved by inference-time compute, shows alignment-simulation trade-off, models struggle with specific demographic groups, and simulation ability correlates strongly with deep knowledge-intensive reasoning (MMLU-Pro, r=0.939).
Conclusion: SimBench enables measurable progress in developing more faithful LLM simulators by providing the necessary foundation to systematically evaluate simulation capabilities and identify key challenges.
Abstract: Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
[203] Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, Chengyao Wen, Congqi Li, Deng Zhao, Dingbo Yuan, Donghai You, Fagui Mao, Fanzhuang Meng, Feng Xu, Guojie Li, Guowei Wang, Hao Dai, Haonan Zheng, Hong Liu, Jia Guo, Jiaming Liu, Jian Liu, Jianhao Fu, Jiannan Shi, Jianwen Wang, Jianxin Lai, Jin Yang, Jun Mei, Jun Zhou, Junbo Zhao, Junping Zhao, Kuan Xu, Le Su, Lei Chen, Li Tang, Liang Jiang, Liangcheng Fu, Lianhao Xu, Linfeng Shi, Lisha Liao, Longfei Zheng, Meng Li, Mingchun Chen, Qi Zuo, Qiang Cheng, Qianggang Cao, Qitao Shi, Quanrui Guo, Senlin Zhu, Shaofei Wang, Shaomian Zheng, Shuaicheng Li, Shuwei Gu, Siba Chen, Tao Wu, Tao Zhang, Tianyu Zhang, Tianyu Zhou, Tiwei Bie, Tongkai Yang, Wang Hong, Wang Ren, Weihua Chen, Wenbo Yu, Wengang Zheng, Xiangchun Wang, Xiaodong Yan, Xiaopei Wan, Xin Zhao, Xinyu Kong, Xinyu Tang, Xudong Han, Xudong Wang, Xuemin Yang, Xueyu Hu, Yalin Zhang, Yan Sun, Yicheng Shan, Yilong Wang, Yingying Xu, Yongkang Liu, Yongzhen Guo, Yuanyuan Wang, Yuchen Yan, Yuefan Wang, Yuhong Guo, Zehuan Li, Zhankai Xu, Zhe Li, Zhenduo Zhang, Zhengke Gui, Zhenxuan Pan, Zhenyu Huang, Zhenzhong Lan, Zhiqiang Ding, Zhiqiang Zhang, Zhixun Li, Zhizhen Liu, Zihao Wang, Zujie Wen
Main category: cs.CL
TL;DR: Ring-1T is the first open-source trillion-parameter thinking model that activates 50B parameters per token, achieving state-of-the-art results on reasoning benchmarks including IMO-2025 silver medal performance.
Details
Motivation: To democratize large-scale reasoning intelligence by creating the first open-source trillion-parameter thinking model and address unprecedented challenges in training at this scale.
Method: Three interconnected innovations: IcePop for RL training stabilization via token-level discrepancy masking, C3PO++ for efficient long rollout processing under a token budget, and the ASystem RL framework to overcome systemic bottlenecks.
Result: Breakthrough performance: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, 55.94 on ARC-AGI-1, and silver medal-level result on IMO-2025.
Conclusion: Ring-1T establishes a new baseline for open-source model performance and marks a significant milestone in democratizing large-scale reasoning intelligence by releasing the complete 1T parameter MoE model to the community.
Abstract: We present Ring-1T, the first open-source, state-of-the-art thinking model at the trillion-parameter scale. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby obtaining high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T parameter MoE model to the community, we provide the research community with direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.
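Of the three innovations, IcePop's token-level discrepancy masking is the most self-contained. A speculative sketch follows, assuming the mask simply drops tokens whose training-engine and inference-engine log-probs disagree beyond a threshold; the paper's actual rule and threshold are not specified in the abstract.

```python
import torch

def discrepancy_mask(train_logprobs, rollout_logprobs, eps=0.5):
    """Keep (mask = 1) only tokens where the training and inference engines
    roughly agree, so train-inference mismatch cannot destabilize the
    policy-gradient update. eps is an assumed hyperparameter."""
    gap = (train_logprobs - rollout_logprobs).abs()
    return (gap <= eps).float()

# mask = discrepancy_mask(lp_train, lp_rollout)
# loss = -(mask * advantages * lp_train).sum() / mask.sum().clamp(min=1.0)
```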
[204] UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in OmniModels
Chen Chen, ZeYang Hu, Fengjiao Chen, Liya Ma, Jiaxing Liu, Xiaoyu Li, Xuezhi Cao
Main category: cs.CL
TL;DR: UNO-Bench is a unified benchmark for evaluating both uni-modal and omni-modal capabilities of multimodal large language models, featuring 3730 human-curated samples across 44 task types with 98% cross-modality solvability.
Details
Motivation: To understand the correlation between uni-modal and omni-modal capabilities and drive the intelligence evolution of omni models, as current benchmarks lack comprehensive evaluation of both capabilities.
Method: Proposed UNO-Bench with 3730 human-curated samples across 44 task types, including multi-step open-ended questions for complex reasoning, and developed a general scoring model with 95% accuracy for automated evaluation.
Result: Revealed the Compositional Law between omni-modal and uni-modal performance: omni-modal capability acts as a bottleneck effect on weak models while showing synergistic promotion on strong models.
Conclusion: UNO-Bench effectively assesses multimodal model capabilities and provides insights into the relationship between uni-modal and omni-modal performance, facilitating the development of more intelligent omni models.
Abstract: Multimodal Large Language Models have been progressing from uni-modal understanding toward unifying visual, audio and language modalities, collectively termed omni models. However, the correlation between uni-modal and omni-modal capabilities remains unclear, and comprehensive evaluation is required to drive the intelligence evolution of omni models. In this work, we propose UNO-Bench, a novel, high-quality, and UNified Omni model benchmark that effectively assesses both UNi-modal and Omni-modal capabilities. The benchmark consists of 3730 human-curated samples with 98% cross-modality solvability, across 44 task types, and an innovative multi-step open-ended question type for assessing complex reasoning. In addition, a general scoring model supporting 6 question types is proposed for automated evaluation with 95% accuracy. Experimental results reveal a Compositional Law between omni-modal and uni-modal performance: omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models. The code and data are available at https://github.com/meituan-longcat/UNO-Bench
[205] Are they lovers or friends? Evaluating LLMs’ Social Reasoning in English and Korean Dialogues
Eunsu Kim, Junyeong Park, Juhyun Oh, Kiwoong Park, Seyoung Song, A. Seza Doğruöz, Najoung Kim, Alice Oh
Main category: cs.CL
TL;DR: SCRIPTS dataset evaluates LLMs’ social reasoning for inferring interpersonal relationships from dialogues, revealing significant performance gaps between English and Korean, and limitations in current models’ social capabilities.
Details
Motivation: To assess LLMs' social reasoning capabilities in interpersonal contexts as they become more integrated into human-AI interactions, particularly across different languages and cultures.
Method: Created SCRIPTS dataset with 1k dialogues from movie scripts in English and Korean, annotated with probabilistic relational labels by native speakers. Evaluated nine models on relationship inference task.
Result: Proprietary LLMs achieved 75-80% on English but dropped to 58-69% on Korean. Models selected ‘Unlikely’ relationships in 10-25% of responses. Thinking models and chain-of-thought provided minimal benefits and sometimes amplified biases.
Conclusion: Current LLMs have significant limitations in social reasoning capabilities, highlighting the need for developing more socially-aware language models that can better understand interpersonal relationships across languages and cultures.
Abstract: As large language models (LLMs) are increasingly used in human-AI interactions, their social reasoning capabilities in interpersonal contexts are critical. We introduce SCRIPTS, a 1k-dialogue dataset in English and Korean, sourced from movie scripts. The task involves evaluating models’ social reasoning capability to infer the interpersonal relationships (e.g., friends, sisters, lovers) between speakers in each dialogue. Each dialogue is annotated with probabilistic relational labels (Highly Likely, Less Likely, Unlikely) by native (or equivalent) Korean and English speakers from Korea and the U.S. Evaluating nine models on our task, current proprietary LLMs achieve around 75-80% on the English dataset, whereas their performance on Korean drops to 58-69%. More strikingly, models select Unlikely relationships in 10-25% of their responses. Furthermore, we find that thinking models and chain-of-thought prompting, effective for general reasoning, provide minimal benefits for social reasoning and occasionally amplify social biases. Our findings reveal significant limitations in current LLMs’ social reasoning capabilities, highlighting the need for efforts to develop socially-aware language models.
[206] LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, Mao Yang
Main category: cs.CL
TL;DR: LoongRL is a data-driven RL method that enhances long-context reasoning by transforming short multi-hop QA into high-difficulty tasks using UUID chains, enabling models to generalize beyond training length and achieve competitive performance with larger frontier models.
Details
Motivation: Reasoning over long contexts is essential but challenging for LLMs. While RL improves short-context reasoning, advanced thinking patterns for long-context reasoning remain unexplored, and high-difficulty RL data is scarce.
Method: LoongRL uses KeyChain synthesis to transform short multi-hop QA into long-context tasks by inserting UUID chains that hide the true question among distracting documents. This requires step-by-step chain tracing, question identification, fact retrieval, and reasoning.
Result: Models trained at 16K effectively solve 128K tasks without full-length RL costs. Qwen2.5-7B and 14B show +23.5% and +21.1% absolute gains in long-context multi-hop QA accuracy. LoongRL-14B scores 74.2, rivaling larger models like o3-mini (74.5) and DeepSeek-R1 (74.9).
Conclusion: LoongRL induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes beyond training length, improves long-context retrieval, passes all 128K needle-in-a-haystack tests, and preserves short-context reasoning capabilities.
Abstract: Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing “Aha” moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy by +23.5% and +21.1% absolute gains. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.
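The KeyChain construction is described concretely enough to sketch. A minimal version follows, with the hop-template wording invented for illustration.

```python
import random
import uuid

def keychain_context(question, distractor_docs, chain_len=4):
    """Scatter a UUID hop chain among distractor documents; following it
    end-to-end from the starting key reveals the true question."""
    keys = [str(uuid.uuid4()) for _ in range(chain_len)]
    hops = [f"Key {keys[i]} points to key {keys[i + 1]}."
            for i in range(chain_len - 1)]
    hops.append(f"Key {keys[-1]} unlocks the question: {question}")
    docs = distractor_docs + hops
    random.shuffle(docs)
    return "\n\n".join(docs), keys[0]   # long context plus the starting key
```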
[207] Automated HIV Screening on Dutch Electronic Health Records with Large Language Models
Lang Zhou, Amrish Jhingoer, Yinghao Luo, Klaske Vliegenthart–Jongbloed, Carlijn Jordans, Ben Werkhoven, Tom Seinen, Erik van Mulligen, Casper Rokx, Yunlei Li
Main category: cs.CL
TL;DR: A novel pipeline using Large Language Models to analyze unstructured EHR text for HIV testing eligibility screening, achieving high accuracy with low false negative rates.
Details
Motivation: Current HIV screening methods rely on structured data and overlook valuable information in unstructured clinical notes. EHRs provide new opportunities for efficient HIV screening but existing approaches don't utilize text data effectively.
Method: Proposed a pipeline leveraging Large Language Models to analyze unstructured EHR text data to determine patient eligibility for HIV testing.
Result: Experimental results on clinical data from Erasmus University Medical Center Rotterdam showed the pipeline achieved high accuracy while maintaining low false negative rate.
Conclusion: The LLM-based pipeline effectively utilizes unstructured EHR text for HIV screening, providing an accurate and reliable method for identifying patients who need further HIV testing.
Abstract: Efficient screening and early diagnosis of HIV are critical for reducing onward transmission. Although large scale laboratory testing is not feasible, the widespread adoption of Electronic Health Records (EHRs) offers new opportunities to address this challenge. Existing research primarily focuses on applying machine learning methods to structured data, such as patient demographics, for improving HIV diagnosis. However, these approaches often overlook unstructured text data such as clinical notes, which potentially contain valuable information relevant to HIV risk. In this study, we propose a novel pipeline that leverages a Large Language Model (LLM) to analyze unstructured EHR text and determine a patient’s eligibility for further HIV testing. Experimental results on clinical data from Erasmus University Medical Center Rotterdam demonstrate that our pipeline achieved high accuracy while maintaining a low false negative rate.
[208] Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda
Main category: cs.CL
TL;DR: Activation steering can suppress LLM evaluation-awareness, making models behave as if deployed during safety evaluations.
Details
Motivation: LLMs can detect when they're being evaluated and adjust behavior to appear more aligned, compromising safety evaluation reliability.
Method: Two-step training: continued pretraining on documents with factual descriptions, then expert iteration to use Python type hints in evaluation settings. Applied activation steering using original model's vectors.
Result: Activation steering successfully suppressed evaluation-awareness, making the model act like deployed even when evaluation cues were present.
Conclusion: AI evaluators could improve safety evaluation reliability by steering models to act like they’re deployed.
Abstract: Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM’s activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.
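Adding a steering vector to a model's activations is a well-established mechanism; here is a minimal sketch using a forward hook on one decoder layer. The layer index and scale are placeholders, and per the abstract the vector itself is derived from the original model before the evaluation-awareness training.

```python
import torch

def add_steering_hook(layer, vector, alpha=4.0):
    """Add a fixed steering vector to the layer's hidden-state output,
    nudging the model toward 'deployment' behavior during evaluation."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# handle = add_steering_hook(model.model.layers[20], deployment_vector)
# ...generate as usual...
# handle.remove()
```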
[209] BUSTED at AraGenEval Shared Task: A Comparative Study of Transformer-Based Models for Arabic AI-Generated Text Detection
Ali Zain, Sareem Farooqui, Muhammad Rafi
Main category: cs.CL
TL;DR: The paper presents a submission to the AraGenEval Shared Task on Arabic AI-generated text detection, where the BUSTED team achieved 5th place by fine-tuning three transformer models, with XLM-RoBERTa surprisingly outperforming specialized Arabic models.
Details
Motivation: To investigate the effectiveness of different pre-trained transformer models for detecting AI-generated Arabic text in a shared task competition.
Method: Fine-tuned three pre-trained transformer models (AraELECTRA, CAMeLBERT, and XLM-RoBERTa) on the provided dataset for binary classification of AI-generated vs human-written Arabic text.
Result: XLM-RoBERTa achieved the highest performance with an F1 score of 0.7701, outperforming the specialized Arabic models AraELECTRA and CAMeLBERT, securing 5th place in the competition.
Conclusion: Multilingual models like XLM-RoBERTa demonstrate strong generalization capabilities for AI-generated text detection, sometimes outperforming specialized language-specific models, highlighting the complexity of this detection task.
Abstract: This paper details our submission to the AraGenEval Shared Task on Arabic AI-generated text detection, where our team, BUSTED, secured 5th place. We investigated the effectiveness of three pre-trained transformer models: AraELECTRA, CAMeLBERT, and XLM-RoBERTa. Our approach involved fine-tuning each model on the provided dataset for a binary classification task. Our findings revealed a surprising result: the multilingual XLM-RoBERTa model achieved the highest performance with an F1 score of 0.7701, outperforming the specialized Arabic models. This work underscores the complexities of AI-generated text detection and highlights the strong generalization capabilities of multilingual models.
[210] Can Confidence Estimates Decide When Chain-of-Thought Is Necessary for LLMs?
Samuel Lewis-Lim, Xingwei Tan, Zhixue Zhao, Nikolaos Aletras
Main category: cs.CL
TL;DR: The paper introduces confidence-gated chain-of-thought (CoT) prompting, where models only use reasoning when confidence in direct answers is low, to reduce unnecessary token usage while maintaining performance.
Details
Motivation: Extended CoT reasoning increases token usage unnecessarily in many scenarios, and it's unclear when CoT should be used since it sometimes helps, sometimes doesn't, and sometimes harms performance.
Method: Systematic study of four training-free confidence estimation methods for CoT gating, comparing them to a random baseline and an oracle that always knows when CoT is needed.
Result: Existing training-free confidence measures can reduce redundant CoT and outperform randomly invoked CoT, but their utility varies with dataset and model, making practical deployment challenging.
Conclusion: The study highlights both potential and limitations of current confidence-gated CoT methods, paving the way for more reliable adaptive gating of CoT reasoning.
Abstract: Chain-of-thought (CoT) prompting has emerged as a common technique for enhancing the reasoning abilities of large language models (LLMs). While extended reasoning can boost accuracy on complex tasks, it is often unnecessary and substantially increases token usage, limiting the practicality of reasoning models in many scenarios. Recent models, such as GPT-OSS and Qwen3, expose controls that enable users to adjust the length of CoT or determine whether it is used at all. Yet, it remains unclear when CoT should be used: on some tasks it improves performance, while on others it provides little benefit or even harms performance. We address this challenge with confidence-gated CoT, where a model invokes reasoning only when confidence in its direct answer is low. To this end, we present the first systematic study of training-free confidence estimation methods for CoT gating. Specifically, we evaluate four training-free confidence estimation methods and compare them to a random baseline and an oracle that always knows when CoT is needed. Through extensive experiments, we show that existing training-free confidence measures can reduce redundant CoT and outperform randomly invoked CoT. However, the utility of individual confidence measures is inconsistent, varying with both the dataset and the model, underscoring the difficulty of deploying confidence-gated CoT in practice. By analysing both strengths and failure modes, our study highlights the potential and limitations of current methods and paves the way toward more reliable adaptive gating of CoT.
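A minimal sketch of the gating loop, assuming a hypothetical `generate` interface that returns the answer text plus per-token log-probs; mean token log-prob stands in for one of the several training-free confidence signals the paper evaluates.

```python
def answer_with_gated_cot(generate, question, tau=-0.3):
    """Answer directly first; invoke chain-of-thought only when the mean
    token log-prob of the direct answer falls below the threshold tau.
    tau would need tuning per model and dataset."""
    text, logprobs = generate(question, use_cot=False)
    confidence = sum(logprobs) / max(len(logprobs), 1)
    if confidence >= tau:
        return text                       # confident: skip reasoning tokens
    text, _ = generate(question, use_cot=True)
    return text
```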
[211] Input Matters: Evaluating Input Structure’s Impact on LLM Summaries of Sports Play-by-Play
Barkavi Sundararajan, Somayajulu Sripada, Ehud Reiter
Main category: cs.CL
TL;DR: Structured input formats (JSON and row-structured) significantly reduce factual errors in LLM-generated NBA game summaries compared to unstructured input, with JSON being most effective.
Details
Motivation: To quantify how input structure affects hallucinations and factual errors in LLM-generated summaries for accuracy-critical domains like sports reporting.
Method: Manual annotation of 3,312 factual errors across 180 game summaries from two models (Llama-3.1-70B and Qwen2.5-72B) using three input formats: row-structured, JSON, and unstructured.
Result: JSON input reduced error rates by 69% for Llama and 65% for Qwen compared to unstructured input. Row-structured input reduced errors by 54% for Llama and 51% for Qwen. Statistical analysis showed input structure accounts for over 80% of variance in error rates.
Conclusion: Input structure has a strong effect on factual accuracy in LLM-generated summaries, with structured formats significantly reducing errors compared to unstructured input.
Abstract: A major concern when deploying LLMs in accuracy-critical domains such as sports reporting is that the generated text may not faithfully reflect the input data. We quantify how input structure affects hallucinations and other factual errors in LLM-generated summaries of NBA play-by-play data, across three formats: row-structured, JSON and unstructured. We manually annotated 3,312 factual errors across 180 game summaries produced by two models, Llama-3.1-70B and Qwen2.5-72B. Input structure has a strong effect: JSON input reduces error rates by 69% for Llama and 65% for Qwen compared to unstructured input, while row-structured input reduces errors by 54% for Llama and 51% for Qwen. A two-way repeated measures ANOVA shows that input structure accounts for over 80% of the variance in error rates, with Tukey HSD post hoc tests confirming statistically significant differences between all input formats.
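For concreteness, here is what a single play-by-play event might look like in the three formats; the field names and the event itself are invented for illustration.

```python
# Unstructured: free-running commentary text.
unstructured = "Q1 11:32, LeBron James makes a 3-pt jump shot (assist: D. Russell)."

# Row-structured: one delimited record per event.
row_structured = "quarter=1 | clock=11:32 | player=LeBron James | action=3PT made | assist=D. Russell"

# JSON: explicit keys, the format with the lowest error rates in the study.
json_structured = {
    "quarter": 1,
    "clock": "11:32",
    "player": "LeBron James",
    "action": "3PT made",
    "assist": "D. Russell",
}
```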
[212] Dynamic Retriever for In-Context Knowledge Editing via Policy Optimization
Mahmud Wasif Nafee, Maiqi Jiang, Haipeng Chen, Yanfu Zhang
Main category: cs.CL
TL;DR: DR-IKE is a lightweight framework for in-context knowledge editing that dynamically selects demonstrations based on their utility for editing, improving edit success by up to 17.1% and reducing latency by 41.6% while maintaining accuracy on unrelated queries.
Details
Motivation: Current in-context knowledge editors rely on static demonstration sets chosen by surface-level similarity, leading to quantity-quality trade-offs and lack of adaptivity to task difficulty.
Method: Trains a BERT retriever with REINFORCE to rank demonstrations by editing reward, and employs a learnable threshold to prune low-value examples, shortening prompts for easy edits and expanding them for hard tasks.
Result: On the COUNTERFACT benchmark, DR-IKE improves edit success by up to 17.1%, reduces latency by 41.6%, and preserves accuracy on unrelated queries.
Conclusion: DR-IKE demonstrates scalable and adaptive knowledge editing that works with black-box LLMs without modifying model weights, relying solely on forward passes.
Abstract: Large language models (LLMs) excel at factual recall yet still propagate stale or incorrect knowledge. In-context knowledge editing offers a gradient-free remedy suitable for black-box APIs, but current editors rely on static demonstration sets chosen by surface-level similarity, leading to two persistent obstacles: (i) a quantity-quality trade-off, and (ii) lack of adaptivity to task difficulty. We address these issues by dynamically selecting supporting demonstrations according to their utility for the edit. We propose Dynamic Retriever for In-Context Knowledge Editing (DR-IKE), a lightweight framework that (1) trains a BERT retriever with REINFORCE to rank demonstrations by editing reward, and (2) employs a learnable threshold to prune low-value examples, shortening the prompt when the edit is easy and expanding it when the task is hard. DR-IKE performs editing without modifying model weights, relying solely on forward passes for compatibility with black-box LLMs. On the COUNTERFACT benchmark, it improves edit success by up to 17.1%, reduces latency by 41.6%, and preserves accuracy on unrelated queries, demonstrating scalable and adaptive knowledge editing. The code is available at https://github.com/mwnafee/DR-IKE .
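A compressed sketch of the REINFORCE step at the core of the retriever training; the baseline subtraction and the learnable pruning threshold are omitted, and the `scores` tensor standing in for the BERT retriever's output is hypothetical.

```python
import torch

def reinforce_step(scores, sampled_idx, reward, optimizer):
    """The retriever's scores define a softmax policy over candidate
    demonstrations; demos sampled into the prompt are reinforced in
    proportion to the downstream editing reward."""
    log_probs = torch.log_softmax(scores, dim=-1)
    loss = -reward * log_probs[sampled_idx].sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```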
[213] The Gray Zone of Faithfulness: Taming Ambiguity in Unfaithfulness Detection
Qiang Ding, Lvzhou Luo, Yixuan Cao, Ping Luo
Main category: cs.CL
TL;DR: Proposes VeriGray, a new faithfulness benchmark for LLM summarization that addresses annotation ambiguity by introducing an ‘Out-Dependent’ category for cases requiring external knowledge verification.
Details
Motivation: Existing faithfulness benchmarks suffer from annotation ambiguity due to ill-defined boundaries of permissible external knowledge, leading to inconsistent labeling of common sense incorporation as faithful or unfaithful.
Method: Developed a novel faithfulness annotation framework with an intermediate ‘Out-Dependent’ category for cases requiring external knowledge verification, and constructed VeriGray benchmark using this framework.
Result: Even SOTA LLMs like GPT-5 exhibit ~6% hallucinations in summarization, with ~8% of sentences falling into the Out-Dependent category, highlighting annotation ambiguity challenges.
Conclusion: The benchmark poses significant challenges to baseline methods, indicating substantial room for improvement in unfaithfulness detection and the importance of resolving annotation ambiguity.
Abstract: Ensuring that Large Language Models (LLMs) generate summaries faithful to a given source document is essential for real-world applications. While prior research has explored LLM faithfulness, existing benchmarks suffer from annotation ambiguity, primarily due to the ill-defined boundary of permissible external knowledge in generated outputs. For instance, common sense is often incorporated into responses and labeled as “faithful”, yet the acceptable extent of such knowledge remains unspecified, leading to inconsistent annotations. To address this issue, we propose a novel faithfulness annotation framework, which introduces an intermediate category, Out-Dependent, to classify cases where external knowledge is required for verification. Using this framework, we construct VeriGray (Verification with the Gray Zone) – a new unfaithfulness detection benchmark in summarization. Statistics reveal that even SOTA LLMs, such as GPT-5, exhibit hallucinations ($\sim 6\%$ of sentences) in summarization tasks. Moreover, a substantial proportion of generated sentences ($\sim 8\%$ on average across models) fall into the Out-Dependent category, underscoring the importance of resolving annotation ambiguity in unfaithfulness detection benchmarks. Experiments demonstrate that our benchmark poses significant challenges to multiple baseline methods, indicating considerable room for future improvement.
[214] Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning
Qiang Liu, Wuganjing Song, Zhenzhou Lin, Feifan Chen, Qiaolong Cai, Chen Li, Yongduo Sui
Main category: cs.CL
TL;DR: Single-turn training for LLMs generalizes better to both single- and multi-turn reasoning tasks than multi-turn training, which degrades single-turn performance.
Details
Motivation: Real-world applications involve multi-turn interactions, but LLMs are typically trained with single-turn reinforcement learning, creating a potential mismatch between training and deployment.
Method: Compared conventional single-turn training with three multi-turn strategies on reasoning tasks.
Result: Single-turn trained models generalized effectively to both single- and multi-turn evaluations, while multi-turn trained models showed significant degradation in single-turn reasoning performance.
Conclusion: For tasks with complete information, robust single-turn training is more effective and reliable than multi-turn training, which provides limited benefits and can degrade reasoning capabilities.
Abstract: The reasoning capabilities of Large Language Models (LLMs) are typically developed through single-turn reinforcement learning, whereas real-world applications often involve multi-turn interactions with human feedback, leading to a potential mismatch between training and deployment conditions. In this work, we study whether multi-turn training with human feedback is necessary for reasoning tasks. We compare conventional single-turn training with three multi-turn strategies, reaching conclusions contrary to previous research. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies exhibit a significant degradation in single-turn reasoning performance. These results suggest that for tasks with complete information, robust single-turn training remains more effective and reliable, as multi-turn training with basic feedback provides limited benefits and can even degrade reasoning capabilities.
cs.CV
[215] Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models
Alexa R. Tartaglini, Satchel Grant, Daniel Wurgaft, Christopher Potts, Judith E. Fan
Main category: cs.CV
TL;DR: FUGU is a diagnostic suite that reveals VLMs struggle with data visualization understanding primarily due to vision-language handoff issues, not visual encoding limitations.
Details
Motivation: Current VLMs perform poorly on data visualization tasks, but the root causes of these failures are unclear - whether they stem from visual encoding, vision-language transfer, or language processing limitations.
Method: Developed FUGU diagnostic tasks to test specific visualization skills, used activation patching and linear probes to trace information flow, and tested three widely used VLMs with various prompting strategies.
Result: VLMs fail to generate correct data point coordinates, and these initial errors propagate to final responses. Providing correct coordinates improves performance for single-point tasks but worsens it for statistical tasks. Correct coordinates can be read from vision encoder representations.
Conclusion: Current VLM architectures have fundamental constraints in vision-language handoff that limit reliable data visualization understanding, and fine-tuning doesn’t solve these architectural limitations.
Abstract: Data visualizations are vital components of many scientific articles and news stories. Current vision-language models (VLMs) still struggle on basic data visualization understanding tasks, but the causes of failure remain unclear. Are VLM failures attributable to limitations in how visual information in the data visualization is encoded, how information is transferred between the vision and language modules, or how information is processed within the language module? We developed FUGU, a suite of data visualization understanding tasks, to precisely characterize potential sources of difficulty (e.g., extracting the position of data points, distances between them, and other summary statistics). We used FUGU to investigate three widely used VLMs. To diagnose the sources of errors produced by these models, we used activation patching and linear probes to trace information flow through models across a variety of prompting strategies. We found that some models fail to generate the coordinates of individual data points correctly, and these initial errors often lead to erroneous final responses. When these models are provided with the correct coordinates, performance improves substantially. Moreover, even when the model generates an incorrect response, the correct coordinates can be successfully read out from the latent representations in the vision encoder, suggesting that the source of these errors lies in the vision-language handoff. We further found that while providing correct coordinates helps with tasks involving one or a small number of data points, it generally worsens performance for tasks that require extracting statistical relationships across many data points. Fine-tuning models on FUGU also fails to yield ceiling performance. These findings point to architectural constraints in current VLMs that might pose significant challenges for reliable data visualization understanding.
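The probing result is easy to picture with a sketch: fit a linear read-out from frozen vision-encoder states to ground-truth data-point coordinates. Names are hypothetical and the paper's probe details are not specified in the abstract.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_coordinate_probe(hidden_states, coords):
    """hidden_states: (n, d) frozen vision-encoder features for n data points;
    coords: (n, 2) ground-truth x/y positions. High held-out probe accuracy
    alongside wrong model answers localizes the failure to the
    vision-language handoff rather than to the encoder."""
    probe = Ridge(alpha=1.0)
    probe.fit(hidden_states, coords)
    return probe

# r2 = fit_coordinate_probe(train_h, train_xy).score(test_h, test_xy)
```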
[216] Agro-Consensus: Semantic Self-Consistency in Vision-Language Models for Crop Disease Management in Developing Countries
Mihir Gupta, Pratik Desai, Ross Greer
Main category: cs.CV
TL;DR: A cost-effective self-consistency framework improves vision-language model reliability for agricultural disease diagnosis by using semantic clustering and human-in-the-loop crop type confirmation to select the most coherent captions.
Details
Motivation: Agricultural disease management in developing countries faces challenges due to limited expert access, unreliable internet, and cost constraints that hinder large-scale AI deployment.
Method: Uses semantic clustering with a lightweight embedding model to group candidate responses, then selects the most coherent caption through cosine similarity-based consensus. Incorporates human-in-the-loop crop type confirmation to filter erroneous generations.
Result: Achieved 83.1% accuracy with 10 candidate generations (vs 77.5% baseline), and 94.0% accuracy when correct response is in top four clusters (vs 88.5% baseline).
Conclusion: The framework effectively improves VLM reliability for agricultural image captioning while maintaining cost-effectiveness for developing country applications.
Abstract: Agricultural disease management in developing countries such as India, Kenya, and Nigeria faces significant challenges due to limited access to expert plant pathologists, unreliable internet connectivity, and cost constraints that hinder the deployment of large-scale AI systems. This work introduces a cost-effective self-consistency framework to improve vision-language model (VLM) reliability for agricultural image captioning. The proposed method employs semantic clustering, using a lightweight (80MB) pre-trained embedding model to group multiple candidate responses. It then selects the most coherent caption – containing a diagnosis, symptoms, analysis, treatment, and prevention recommendations – through a cosine similarity-based consensus. A practical human-in-the-loop (HITL) component is incorporated, wherein user confirmation of the crop type filters erroneous generations, ensuring higher-quality input for the consensus mechanism. Applied to the publicly available PlantVillage dataset using a fine-tuned 3B-parameter PaliGemma model, our framework demonstrates improvements over standard decoding methods. Evaluated on 800 crop disease images with up to 21 generations per image, our single-cluster consensus method achieves a peak accuracy of 83.1% with 10 candidate generations, compared to the 77.5% baseline accuracy of greedy decoding. The framework’s effectiveness is further demonstrated when considering multiple clusters; accuracy rises to 94.0% when a correct response is found within any of the top four candidate clusters, outperforming the 88.5% achieved by a top-4 selection from the baseline.
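The consensus step reduces to picking the caption closest, on average, to its peers in embedding space. A minimal sketch, folding the paper's cluster step into a single average-similarity vote; `embed` is any lightweight sentence-embedding function.

```python
import numpy as np

def consensus_caption(captions, embed):
    """Return the candidate most similar on average to all other candidates
    (cosine-similarity consensus). Clustering and the human-in-the-loop
    crop-type filter from the paper are omitted here."""
    E = np.stack([embed(c) for c in captions]).astype(float)
    E /= np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    avg = (sims.sum(axis=1) - 1.0) / (len(captions) - 1)  # drop self-similarity
    return captions[int(np.argmax(avg))]
```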
[217] EventFormer: A Node-graph Hierarchical Attention Transformer for Action-centric Video Event Prediction
Qile Su, Shoutai Zhu, Shuai Zhang, Baoyu Liang, Chao Tong
Main category: cs.CV
TL;DR: The paper introduces AVEP (Action-centric Video Event Prediction), a new task for predicting subsequent events in videos using complex logic and semantic information. It presents a large annotated dataset and EventFormer model that outperforms existing video prediction approaches.
Details
Motivation: Human events are mostly recorded as videos rather than scripts, but there's a lack of research on event prediction in the vision domain. Existing video prediction tasks don't incorporate the complex logic and rich semantic information needed for event induction.
Method: Created a large structured dataset with 35K annotated videos and 178K video clips. Proposed EventFormer, a node-graph hierarchical attention model that captures relationships between events and their arguments, and coreferential relationships between arguments.
Result: Experiments showed AVEP is more complex than existing video prediction tasks. EventFormer outperformed several state-of-the-art video prediction models and LVLMs on the AVEP task.
Conclusion: AVEP addresses the gap in video event prediction research, and the proposed EventFormer model with hierarchical attention effectively handles the complexity of event structures in videos. The dataset and code will be released for reproducibility.
Abstract: Script event induction, which aims to predict the subsequent event based on the context, is a challenging task in NLP that has achieved remarkable success in practical applications. However, human events are mostly recorded and presented in the form of videos rather than scripts, yet there is a lack of related research in the realm of vision. To address this problem, we introduce AVEP (Action-centric Video Event Prediction), a task that distinguishes itself from existing video prediction tasks through its incorporation of more complex logic and richer semantic information. We present a large structured dataset, which consists of about $35K$ annotated videos and more than $178K$ video clips of event, built upon existing video event datasets to support this task. The dataset offers more fine-grained annotations, where the atomic unit is represented as a multimodal event argument node, providing better structured representations of video events. Due to the complexity of event structures, traditional visual models that take patches or frames as input are not well-suited for AVEP. We propose EventFormer, a node-graph hierarchical attention based video event prediction model, which can capture both the relationships between events and their arguments and the coreferential relationships between arguments. We conducted experiments using several SOTA video prediction models as well as LVLMs on AVEP, demonstrating both the complexity of the task and the value of the dataset. Our approach outperforms all these video prediction models. We will release the dataset and code for replicating the experiments and annotations.
[218] Proportion and Perspective Control for Flow-Based Image Generation
Julien Boudier, Hugo Caselles-Dupré
Main category: cs.CV
TL;DR: The paper introduces two specialized ControlNets for artistic control in text-to-image generation: a proportion ControlNet using bounding boxes for object positioning/scaling, and a perspective ControlNet using vanishing lines for 3D scene geometry control.
Details
Motivation: Modern text-to-image diffusion models generate high-fidelity images but offer limited control over spatial and geometric structure of outputs, which is crucial for artistic applications.
Method: Developed two ControlNet modules: proportion ControlNet (bounding boxes for object position/scale) and perspective ControlNet (vanishing lines for 3D geometry). Used data pipelines with vision-language models for annotation and specialized algorithms for conditioning image synthesis.
Result: Experiments demonstrate both modules provide effective control over spatial and geometric aspects, though they exhibit limitations with complex constraints.
Conclusion: The proposed ControlNets successfully address the spatial and geometric control limitations in text-to-image generation, with both models released publicly on HuggingFace for community use.
Abstract: While modern text-to-image diffusion models generate high-fidelity images, they offer limited control over the spatial and geometric structure of the output. To address this, we introduce and evaluate two ControlNets specialized for artistic control: (1) a proportion ControlNet that uses bounding boxes to dictate the position and scale of objects, and (2) a perspective ControlNet that employs vanishing lines to control the 3D geometry of the scene. We support the training of these modules with data pipelines that leverage vision-language models for annotation and specialized algorithms for conditioning image synthesis. Our experiments demonstrate that both modules provide effective control but exhibit limitations with complex constraints. Both models are released on HuggingFace: https://huggingface.co/obvious-research
[219] H2OFlow: Grounding Human-Object Affordances with 3D Generative Models and Dense Diffused Flows
Harry Zhang, Luca Carlone
Main category: cs.CV
TL;DR: H2OFlow is a novel framework that learns 3D human-object interaction affordances (contact, orientation, spatial occupancy) using only synthetic data from 3D generative models, eliminating the need for human annotations.
Details
Motivation: Current methods rely on labor-intensive hand-labeled datasets and are limited to contact-based analysis, neglecting important aspects like orientation preferences and spatial occupancy patterns in human-object interactions.
Method: Uses a dense 3D-flow-based representation learned through dense diffusion process on point clouds, trained exclusively on synthetic data from 3D generative models.
Result: H2OFlow generalizes effectively to real-world objects and outperforms prior methods that use manual annotations or mesh-based representations in modeling 3D affordance.
Conclusion: The framework successfully demonstrates comprehensive 3D affordance learning without human annotations, addressing limitations of existing approaches through synthetic data and flow-based representations.
Abstract: Understanding how humans interact with the surrounding environment, and specifically reasoning about object interactions and affordances, is a critical challenge in computer vision, robotics, and AI. Current approaches often depend on labor-intensive, hand-labeled datasets capturing real-world or simulated human-object interaction (HOI) tasks, which are costly and time-consuming to produce. Furthermore, most existing methods for 3D affordance understanding are limited to contact-based analysis, neglecting other essential aspects of human-object interactions, such as orientation (e.g., humans might have a preferential orientation with respect to certain objects, such as a TV) and spatial occupancy (e.g., humans are more likely to occupy certain regions around an object, like the front of a microwave rather than its back). To address these limitations, we introduce H2OFlow, a novel framework that comprehensively learns 3D HOI affordances – encompassing contact, orientation, and spatial occupancy – using only synthetic data generated from 3D generative models. H2OFlow employs a dense 3D-flow-based representation, learned through a dense diffusion process operating on point clouds. This learned flow enables the discovery of rich 3D affordances without the need for human annotations. Through extensive quantitative and qualitative evaluations, we demonstrate that H2OFlow generalizes effectively to real-world objects and surpasses prior methods that rely on manual annotations or mesh-based representations in modeling 3D affordance.
[220] STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models
Mahiro Ukai, Shuhei Kurita, Nakamasa Inoue
Main category: cs.CV
TL;DR: STATUS Bench is the first benchmark for evaluating VLMs’ ability to understand subtle object state variations through three simultaneous tasks: object state identification, image retrieval, and state change identification.
Details
Motivation: While VLMs can perform various multimodal tasks, it's unclear how precisely they can identify subtle object states like positional states (open/closed) and functional states (on/off).
Method: Created STATUS Bench with hand-crafted image pairs and descriptions, and STATUS Train with 13M semi-automatically created descriptions. Evaluates VLMs on three simultaneous tasks: OSI, IR, and SCI.
Result: Current SOTA VLMs struggle significantly with subtle object state distinctions. Most open-weight VLMs showed chance-level zero-shot performance. After fine-tuning on STATUS Train, Qwen2.5-VL achieved performance comparable to Gemini 2.0 Flash.
Conclusion: STATUS Bench and Train are necessary for advancing object state recognition in VLM research, as current models still significantly struggle with capturing subtle state distinctions despite their general multimodal capabilities.
Abstract: Object state recognition aims to identify the specific condition of objects, such as their positional states (e.g., open or closed) and functional states (e.g., on or off). While recent Vision-Language Models (VLMs) are capable of performing a variety of multimodal tasks, it remains unclear how precisely they can identify object states. To alleviate this issue, we introduce the STAte and Transition UnderStanding Benchmark (STATUS Bench), the first benchmark for rigorously evaluating the ability of VLMs to understand subtle variations in object states in diverse situations. Specifically, STATUS Bench introduces a novel evaluation scheme that requires VLMs to perform three tasks simultaneously: object state identification (OSI), image retrieval (IR), and state change identification (SCI). These tasks are defined over our fully hand-crafted dataset involving image pairs, their corresponding object state descriptions and state change descriptions. Furthermore, we introduce a large-scale training dataset, namely STATUS Train, which consists of 13 million semi-automatically created descriptions. This dataset serves as the largest resource to facilitate further research in this area. In our experiments, we demonstrate that STATUS Bench enables rigorous consistency evaluation and reveal that current state-of-the-art VLMs still significantly struggle to capture subtle object state distinctions. Surprisingly, under the proposed rigorous evaluation scheme, most open-weight VLMs exhibited chance-level zero-shot performance. After fine-tuning on STATUS Train, Qwen2.5-VL achieved performance comparable to Gemini 2.0 Flash. These findings underscore the necessity of STATUS Bench and Train for advancing object state recognition in VLM research.
[221] OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment
Yulong Zhang
Main category: cs.CV
TL;DR: OCR-Quality is a human-annotated dataset of 1,000 PDF pages converted to images with quality scores for evaluating OCR assessment methods.
Details
Motivation: Addresses the critical need for reliable OCR quality assessment in real-world applications by providing a standardized benchmark.
Method: Created dataset from diverse PDF sources, processed with Vision-Language Models, and manually annotated using a 4-level quality scoring system.
Result: Produced a comprehensive dataset with 1,000 annotated documents, detailed source information, and representative cases across difficulty levels.
Conclusion: OCR-Quality provides a valuable publicly available benchmark for training and evaluating OCR verification systems.
Abstract: We present OCR-Quality, a comprehensive human-annotated dataset designed for evaluating and developing OCR quality assessment methods. The dataset consists of 1,000 PDF pages converted to PNG images at 300 DPI, sampled from diverse real-world scenarios, including academic papers, textbooks, e-books, and multilingual documents. Each document has been processed using state-of-the-art Vision-Language Models (VLMs) and manually annotated with quality scores using a 4-level scoring system (1: Excellent, 2: Good, 3: Fair, 4: Poor). The dataset includes detailed source information, annotation guidelines, and representative cases across various difficulty levels. OCR-Quality addresses the critical need for reliable OCR quality assessment in real-world applications and provides a valuable benchmark for training and evaluating OCR verification systems. The dataset is publicly available at https://huggingface.co/datasets/Aslan-mingye/OCR-Quality.
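For readers who want to inspect the data, a minimal loading sketch follows; the "train" split and the "score" column name are assumptions to verify against the dataset card:

```python
# Minimal sketch for inspecting OCR-Quality; the split and column names
# are assumptions -- check the dataset card at
# https://huggingface.co/datasets/Aslan-mingye/OCR-Quality
from collections import Counter

from datasets import load_dataset

ds = load_dataset("Aslan-mingye/OCR-Quality", split="train")

# Distribution over the 4-level scoring system
# (1: Excellent, 2: Good, 3: Fair, 4: Poor).
print(Counter(ds["score"]))
```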
[222] Face-MakeUpV2: Facial Consistency Learning for Controllable Text-to-Image Generation
Dawei Dai, Yinxiu Zhou, Chenghang Li, Guolai Jiang, Chengfang Zhang
Main category: cs.CV
TL;DR: Face-MakeUpV2 is a facial image generation model that addresses attribute leakage and physical inconsistency issues in text-to-image models by using a large dataset with precise spatial supervision and dual facial information injection channels.
Details
Motivation: Current text-to-image models suffer from facial attribute leakage and insufficient physical consistency when responding to local semantic instructions, which limits their reliability for facial editing applications.
Method: Used a pretrained text-to-image model as backbone, constructed FaceCaptionMask-1M dataset with image-text-mask pairs, introduced 3D facial rendering channel and global facial feature channel, and formulated semantic alignment and perceptual loss objectives.
Result: Face-MakeUpV2 achieves best overall performance in preserving face ID and maintaining physical consistency of reference images, demonstrating practical potential for reliable facial editing.
Conclusion: The proposed approach effectively addresses attribute leakage and physical inconsistency problems, making Face-MakeUpV2 suitable for diverse and controllable facial editing applications.
Abstract: In facial image generation, current text-to-image models often suffer from facial attribute leakage and insufficient physical consistency when responding to local semantic instructions. In this study, we propose Face-MakeUpV2, a facial image generation model that aims to maintain the consistency of face ID and physical characteristics with the reference image. First, we constructed a large-scale dataset, FaceCaptionMask-1M, comprising approximately one million image-text-mask pairs that provide precise spatial supervision for the local semantic instructions. Second, we employed a general text-to-image pretrained model as the backbone and introduced two complementary facial information injection channels: a 3D facial rendering channel to incorporate the physical characteristics of the image and a global facial feature channel. Third, we formulated two optimization objectives for the supervised learning of our model: semantic alignment in the model's embedding space to mitigate the attribute leakage problem and perceptual loss on facial images to preserve ID consistency. Extensive experiments demonstrated that our Face-MakeUpV2 achieves the best overall performance in terms of preserving face ID and maintaining the physical consistency of the reference images. These results highlight the practical potential of Face-MakeUpV2 for reliable and controllable facial editing in diverse applications.
[223] Ageing Drift in Binary Face Templates: A Bits-per-Decade Analysis
Abdelilah Ganmati, Karim Afdel, Lahcen Koutti
Main category: cs.CV
TL;DR: This paper studies the longitudinal stability of compact binary face templates, finding that Hamming distance increases by 1.357 bits per decade for 64-bit codes and 2.571 bits per decade for 128-bit codes, indicating systematic ageing drift.
Details
Motivation: To quantify how face recognition templates age over time and understand the stability of compact binary codes for practical deployments like smart-cards and match-on-card systems.
Method: Used PCA-ITQ to compress float embeddings from a modern face CNN into 64- and 128-bit codes, then analyzed AgeDB data by fitting linear models of Hamming distance versus age gap for 566 identities with multiple age samples.
Result: Found median slopes of 1.357 bits/decade (64-bit) and 2.571 bits/decade (128-bit) with tight confidence intervals, showing predominantly positive drift. Shorter codes are more age-stable at fixed thresholds.
Conclusion: Ageing drift is systematic and scales with code length, suggesting mitigations like periodic re-enrolment and targeted parity on unstable bit positions for practical deployments.
Abstract: We study the longitudinal stability of compact binary face templates and quantify ageing drift directly in bits per decade. Float embeddings from a modern face CNN are compressed with PCA-ITQ into 64- and 128-bit codes. For each identity in AgeDB with at least three distinct ages, we form all genuine pairs and fit a per-identity linear model of Hamming distance versus absolute age gap. Across 566 identities, the median slope is 1.357 bits per decade for 64-bit templates and 2.571 bits per decade for 128-bit templates, with tight non-parametric 95 percent bootstrap confidence intervals. The distributions are predominantly positive, indicating a small but systematic increase in intra-class distance over time. Because drift scales with code length, shorter codes are inherently more age-stable at a fixed decision threshold. We connect these slopes to operating characteristics by reporting EER and TPR at FAR = 1 percent in three age bins. We discuss implications for smart-card and match-on-card deployments, including simple mitigations such as periodic re-enrolment and targeted parity on empirically unstable bit positions. Code and CSV artifacts are provided to support reproducibility.
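The bits-per-decade statistic is simple to reproduce in spirit. The sketch below fits a per-identity line of Hamming distance against absolute age gap; the synthetic codes and variable names are illustrative, not the paper's pipeline:

```python
# Per-identity drift fit: Hamming distance between all genuine pairs of
# binary templates, regressed on absolute age gap.
import numpy as np

rng = np.random.default_rng(0)

# Binary templates (e.g., 64-bit PCA-ITQ codes) captured at known ages.
ages = np.array([25.0, 33.0, 41.0])
codes = rng.integers(0, 2, size=(3, 64))

gaps, dists = [], []
for i in range(len(ages)):
    for j in range(i + 1, len(ages)):
        gaps.append(abs(ages[i] - ages[j]))
        dists.append(int(np.sum(codes[i] != codes[j])))  # Hamming distance

# Slope of the linear fit, in bits per year; multiply by 10 for bits/decade.
slope_per_year = np.polyfit(gaps, dists, deg=1)[0]
print(f"drift: {10 * slope_per_year:.3f} bits per decade")
```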
[224] Bridging Accuracy and Interpretability: Deep Learning with XAI for Breast Cancer Detection
Bishal Chhetri, B. V. Rathish Kumar
Main category: cs.CV
TL;DR: Interpretable deep learning framework for early breast cancer detection using FNA images, achieving state-of-the-art performance with 99.2% accuracy and incorporating SHAP/LIME for explainability.
Details
Motivation: High predictive accuracy alone is insufficient for clinical adoption due to the black-box nature of deep learning models, requiring interpretability for clinician trust.
Method: Deep neural network with ReLU activations, Adam optimizer, and binary cross-entropy loss, enhanced with model-agnostic Explainable AI techniques (SHAP and LIME) for feature-level attributions.
Result: Achieved accuracy of 0.992, precision of 1.000, recall of 0.977, and F1 score of 0.988, substantially exceeding literature benchmarks and outperforming traditional algorithms.
Conclusion: The framework bridges performance and interpretability for clinical use, with concave points feature of cell nuclei identified as most influential, providing insights for improved breast cancer diagnosis and treatment.
Abstract: In this study, we present an interpretable deep learning framework for the early detection of breast cancer using quantitative features extracted from digitized fine needle aspirate (FNA) images of breast masses. Our deep neural network, using ReLU activations, the Adam optimizer, and a binary cross-entropy loss, delivers state-of-the-art classification performance, achieving an accuracy of 0.992, precision of 1.000, recall of 0.977, and an F1 score of 0.988. These results substantially exceed the benchmarks reported in the literature. We evaluated the model under identical protocols against a suite of well-established algorithms (logistic regression, decision trees, random forests, stochastic gradient descent, K-nearest neighbors, and XGBoost) and found the deep model consistently superior on the same metrics. Recognizing that high predictive accuracy alone is insufficient for clinical adoption due to the black-box nature of deep learning models, we incorporated model-agnostic Explainable AI techniques such as SHAP and LIME to produce feature-level attributions and human-readable visualizations. These explanations quantify the contribution of each feature to individual predictions, support error analysis, and increase clinician trust, thus bridging the gap between performance and interpretability for real-world clinical use. The concave points feature of the cell nuclei is found to be the most influential feature positively impacting the classification task. This insight can be very helpful in improving the diagnosis and treatment of breast cancer by highlighting the key characteristics of breast tumor.
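The overall pipeline shape is easy to approximate: the WDBC FNA features ship with scikit-learn, whose MLPClassifier defaults to ReLU activations with the Adam optimizer, and SHAP supplies model-agnostic attributions. The sketch below is a hedged approximation with illustrative hyperparameters, not the paper's exact network:

```python
# Train a small ReLU/Adam network on the WDBC FNA features, then compute
# model-agnostic per-feature attributions with SHAP's KernelExplainer.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))

# Feature-level attributions on a few test samples; features like
# "worst concave points" typically rank among the most influential.
explainer = shap.KernelExplainer(clf.predict_proba, X_tr[:50])
shap_values = explainer.shap_values(X_te[:5])
```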
[225] EdgeSync: Accelerating Edge-Model Updates for Data Drift through Adaptive Continuous Learning
Runchu Donga, Peng Zhao, Guiqin Wang, Nan Qi, Jie Lin
Main category: cs.CV
TL;DR: EdgeSync is an efficient edge-model updating approach that improves real-time video analytics by enhancing sample filtering with timeliness and inference results, and optimizing update timing to handle changing data distributions.
Details
Motivation: Real-time video analytics systems face accuracy degradation due to changing data distributions (lighting, weather conditions). Existing methods have high retraining delays and poor alignment with evolving data streams.
Method: EdgeSync enhances sample filtering by incorporating timeliness and inference results to ensure training samples are relevant to current video content. It also includes a dynamic training management module that optimizes update timing and sequencing.
Result: Evaluations on diverse real-world datasets show EdgeSync improves accuracy by approximately 3.4% compared to existing methods and by about 10% compared to traditional approaches.
Conclusion: EdgeSync effectively addresses the challenges of compute-intensive retraining and poor model-data alignment in edge video analytics systems, providing more timely and accurate model updates.
Abstract: Real-time video analytics systems typically deploy lightweight models on edge devices to reduce latency. However, the distribution of data features may change over time due to various factors such as changing lighting and weather conditions, leading to decreased model accuracy. Recent frameworks try to address this issue by leveraging remote servers to continuously train and adapt lightweight edge models using more complex models in the cloud. Despite these advancements, existing methods face two key challenges: first, the retraining process is compute-intensive, causing significant delays in model updates; second, the new model may not align well with the evolving data distribution of the current video stream. To address these challenges, we introduce EdgeSync, an efficient edge-model updating approach that enhances sample filtering by incorporating timeliness and inference results, thus ensuring training samples are more relevant to the current video content while reducing update delays. Additionally, EdgeSync features a dynamic training management module that optimizes the timing and sequencing of model updates to improve their timeliness. Evaluations on diverse and complex real-world datasets demonstrate that EdgeSync improves accuracy by approximately 3.4% compared to existing methods and by about 10% compared to traditional approaches.
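As a rough illustration of the filtering idea, the sketch below scores candidate frames by a mix of timeliness and inference confidence and keeps the top-ranked ones for retraining; the weighting scheme and scoring function are hypothetical, not EdgeSync's published formulation:

```python
# Hypothetical sample filter: newer frames matter more (timeliness), and
# low-confidence frames signal drift (informativeness).
import math
import time

def sample_score(frame_ts: float, confidence: float, now: float,
                 half_life_s: float = 60.0, w_time: float = 0.5) -> float:
    timeliness = math.exp(-(now - frame_ts) / half_life_s)
    uncertainty = 1.0 - confidence  # low confidence => more informative
    return w_time * timeliness + (1.0 - w_time) * uncertainty

now = time.time()
frames = [(now - 5, 0.95), (now - 10, 0.40), (now - 300, 0.30)]
ranked = sorted(frames, key=lambda f: sample_score(*f, now=now), reverse=True)
print(ranked[:2])  # top-2 candidates for the next model update
```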
[226] LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction
Aleksandar Pramov
Main category: cs.CV
TL;DR: A multimodal fusion system using Gemma-3 LLM with ViT visual and E5 textual features, enhanced by LoRA adaptation and LLM-generated rationale prompts, outperforms gradient boosted tree baselines in commercial memorability prediction.
Details
Motivation: To predict commercial memorability as part of the MediaEval 2025 competition, addressing the need for robust multimodal approaches that can generalize well on test data.
Method: Multimodal fusion with Gemma-3 LLM backbone integrating pre-computed ViT (visual) and E5 (textual) features via multi-modal projections, using LoRA for adaptation, and employing LLM-generated rationale prompts based on expert-derived memorability aspects.
Result: The LLM-based system demonstrated greater robustness and generalization performance on the final test set compared to the gradient boosted tree baseline ensemble.
Conclusion: The proposed multimodal LLM fusion approach with rationale prompting is effective for commercial memorability prediction, showing improved generalization over traditional ensemble methods.
Abstract: This paper addresses the prediction of commercial (brand) memorability as part of “Subtask 2: Commercial/Ad Memorability” within the “Memorability: Predicting movie and commercial memorability” task at the MediaEval 2025 workshop competition. We propose a multimodal fusion system with a Gemma-3 LLM backbone that integrates pre-computed visual (ViT) and textual (E5) features by multi-modal projections. The model is adapted using Low-Rank Adaptation (LoRA). A heavily-tuned ensemble of gradient boosted trees serves as a baseline. A key contribution is the use of LLM-generated rationale prompts, grounded in expert-derived aspects of memorability, to guide the fusion model. The results demonstrate that the LLM-based system exhibits greater robustness and generalization performance on the final test set, compared to the baseline. The paper’s codebase can be found at https://github.com/dsgt-arc/mediaeval-2025-memorability
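The fusion pattern, projecting pre-computed features into the LLM's embedding space as soft tokens, can be sketched compactly; the dimensions and single-layer projections below are assumptions, and the Gemma-3 wiring and LoRA setup are not reproduced:

```python
# Linear projections map frozen ViT and E5 features into the LLM embedding
# space so they can be prepended to the prompt tokens as "soft tokens".
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    def __init__(self, vit_dim=768, e5_dim=1024, llm_dim=2048):
        super().__init__()
        self.vis_proj = nn.Linear(vit_dim, llm_dim)
        self.txt_proj = nn.Linear(e5_dim, llm_dim)

    def forward(self, vit_feats, e5_feats):
        # Each projected feature becomes one soft token for the LLM.
        return torch.stack([self.vis_proj(vit_feats),
                            self.txt_proj(e5_feats)], dim=1)

proj = MultimodalProjector()
soft_tokens = proj(torch.randn(4, 768), torch.randn(4, 1024))
print(soft_tokens.shape)  # (4, 2, 2048) -> prepend to prompt embeddings
```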
[227] Promptable Fire Segmentation: Unleashing SAM2’s Potential for Real-Time Mobile Deployment with Strategic Bounding Box Guidance
Emmanuel U. Ugwu, Zhang Xinming
Main category: cs.CV
TL;DR: This paper presents the first comprehensive evaluation of SAM2 variants for fire segmentation, showing that bounding box prompts outperform automatic and single point approaches, with lightweight variants like TinySAM and MobileSAM being suitable for edge deployment.
Details
Motivation: Fire segmentation is challenging due to flames' irregular boundaries and variable intensities. While SAM models have shown cross-domain generalization, their effectiveness in fire segmentation under mobile deployment constraints remains unexplored.
Method: Systematically evaluated four SAM2.1 variants and mobile-oriented variants across three fire datasets using multiple prompting strategies: automatic, single positive point, single positive+negative point, multiple positive points, bounding box, and hybrid variants.
Result: Bounding box prompts consistently outperformed other approaches, with Box+MP achieving highest mean IoU (0.64) and Dice coefficient (0.75) on Khan dataset. Lightweight variants reduced memory and computational costs for edge deployment.
Conclusion: This work provides critical insights for deploying promptable segmentation models in fire monitoring systems and establishes benchmarks for future research in domain-specific SAM applications.
Abstract: Fire segmentation remains a critical challenge in computer vision due to flames’ irregular boundaries, translucent edges, and highly variable intensities. While the Segment Anything Models (SAM and SAM2) have demonstrated impressive cross-domain generalization capabilities, their effectiveness in fire segmentation – particularly under mobile deployment constraints – remains largely unexplored. This paper presents the first comprehensive evaluation of SAM2 variants for fire segmentation, focusing on bounding box prompting strategies to enhance deployment feasibility. We systematically evaluate four SAM2.1 variants (tiny, small, base_plus, large) alongside mobile-oriented variants (TinySAM, MobileSAM) across three fire datasets using multiple prompting strategies: automatic, single positive point (SP), single positive point + single negative point (SP+SN), multiple positive points (MP), bounding box (Box), and hybrid variants (Box+SP and Box+MP). Our experimental results demonstrate that bounding box prompts consistently outperform automatic and single point-based approaches, with Box+MP achieving the highest mean IoU (0.64) and Dice coefficient (0.75) on the Khan dataset. Lightweight variants such as TinySAM and MobileSAM further reduce memory and computational costs, making them more suitable for latency-tolerant edge scenarios. Overall, this work provides critical insights for deploying promptable segmentation models in fire monitoring systems and establishes benchmarks for future research in domain-specific SAM applications. Code is available at: https://github.com/UEmmanuel5/ProFSAM
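The reported IoU and Dice scores reduce to simple mask arithmetic. The sketch below scores one predicted fire mask against ground truth, leaving abstract the SAM2 predictor call that would supply the prediction from a box-plus-points prompt:

```python
# IoU and Dice for a single binary fire mask versus ground truth.
import numpy as np

def iou_dice(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)

pred = np.zeros((64, 64)); pred[10:40, 10:40] = 1  # e.g., from a Box+MP prompt
gt = np.zeros((64, 64)); gt[15:45, 15:45] = 1
print(iou_dice(pred, gt))
```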
[228] MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection
Haochen Zhao, Yuyao Kong, Yongxiu Xu, Gaopeng Gou, Hongbo Xu, Yubin Wang, Haoliang Zhang
Main category: cs.CV
TL;DR: MMSD3.0 is a new multi-image sarcasm detection benchmark, and CIRM is a cross-image reasoning model that achieves state-of-the-art performance by capturing inter-image connections and using relevance-guided fusion.
Details
Motivation: Existing sarcasm detection methods focus on single-image scenarios, overlooking multi-image semantic relations that occur in real-world settings like tweets and reviews.
Method: Proposed Cross-Image Reasoning Model (CIRM) with targeted cross-image sequence modeling and relevance-guided fine-grained cross-modal fusion based on text-image correspondence.
Result: MMSD3.0 is an effective benchmark reflecting real-world conditions, and CIRM achieves state-of-the-art performance across MMSD, MMSD2.0 and MMSD3.0 datasets.
Conclusion: The work successfully addresses the gap in multi-image sarcasm detection and demonstrates CIRM’s effectiveness in both single and multi-image scenarios.
Abstract: Despite progress in multimodal sarcasm detection, existing datasets and methods predominantly focus on single-image scenarios, overlooking potential semantic and affective relations across multiple images. This leaves a gap in modeling cases where sarcasm is triggered by multi-image cues in real-world settings. To bridge this gap, we introduce MMSD3.0, a new benchmark composed entirely of multi-image samples curated from tweets and Amazon reviews. We further propose the Cross-Image Reasoning Model (CIRM), which performs targeted cross-image sequence modeling to capture latent inter-image connections. In addition, we introduce a relevance-guided, fine-grained cross-modal fusion mechanism based on text-image correspondence to reduce information loss during integration. We establish a comprehensive suite of strong and representative baselines and conduct extensive experiments, showing that MMSD3.0 is an effective and reliable benchmark that better reflects real-world conditions. Moreover, CIRM demonstrates state-of-the-art performance across MMSD, MMSD2.0 and MMSD3.0, validating its effectiveness in both single-image and multi-image scenarios.
[229] Noise Aggregation Analysis Driven by Small-Noise Injection: Efficient Membership Inference for Diffusion Models
Guo Li, Yuyang Yu, Xuemiao Xu
Main category: cs.CV
TL;DR: Proposes an efficient membership inference attack method against diffusion models by injecting slight noise and analyzing noise distribution aggregation, achieving superior performance with fewer model visits.
Details
Motivation: Diffusion models like Stable Diffusion pose privacy risks through membership inference attacks, which determine if specific data was used in training.
Method: Inject slight noise into test images and analyze aggregation degree of noise distribution predicted by the model at certain time steps.
Result: Achieves superior performance across multiple datasets and better attack effects (ASR and AUC) against large-scale text-to-image diffusion models.
Conclusion: The method is scalable and efficient, requiring fewer model visits while effectively distinguishing between training and non-training samples through noise prediction patterns.
Abstract: Diffusion models have demonstrated powerful performance in generating high-quality images. A typical example is a text-to-image generator like Stable Diffusion. However, their widespread use also poses potential privacy risks. A key concern is membership inference attacks, which attempt to determine whether a particular data sample was used in the model training process. We propose an efficient membership inference attack method against diffusion models. This method is based on the injection of slight noise and the evaluation of the aggregation degree of the noise distribution. The intuition is that the noise prediction patterns of diffusion models for training set samples and non-training set samples exhibit distinguishable differences. Specifically, we suppose that member images exhibit higher aggregation of predicted noise around a certain time step of the diffusion process. In contrast, the predicted noise of non-member images exhibits a more dispersed character around that time step. Compared with other existing methods, our proposed method requires fewer visits to the target diffusion model. We inject slight noise into the image under test and then determine its membership by analyzing the aggregation degree of the noise distribution predicted by the model. Empirical findings indicate that our method achieves superior performance across multiple datasets. It also shows better attack effectiveness in ASR and AUC when facing large-scale text-to-image diffusion models, demonstrating its scalability.
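The attack's core statistic can be schematized as follows: perturb the test image several times, collect the model's predicted noise at a fixed timestep, and measure how tightly the predictions cluster. In the sketch, eps_model is a stand-in for a real noise predictor, and using variance as the aggregation measure is an assumption:

```python
# Aggregation-based membership score: low variance across slightly perturbed
# probes suggests the image was in the training set.
import torch

@torch.no_grad()
def aggregation_score(eps_model, x0, t, n_probes=8, sigma=0.05):
    preds = []
    for _ in range(n_probes):
        x_probe = x0 + sigma * torch.randn_like(x0)  # slight noise injection
        preds.append(eps_model(x_probe, t))
    preds = torch.stack(preds)
    # Members are expected to yield tightly clustered (low-variance) noise
    # predictions; non-members, more dispersed ones.
    return preds.var(dim=0).mean().item()

# Toy usage: flag membership when the score falls below a calibrated threshold.
dummy_eps = lambda x, t: 0.1 * x  # stand-in for a real noise predictor
print(aggregation_score(dummy_eps, torch.randn(1, 3, 32, 32), t=200))
```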
[230] Multi-Agent Pose Uncertainty: A Differentiable Rendering Cramér-Rao Bound
Arun Muthukkumar
Main category: cs.CV
TL;DR: A closed-form lower bound on camera pose covariance using differentiable renderers as measurement functions, extending classical bundle-adjustment uncertainty to multi-agent settings.
Details
Motivation: Pose estimation is crucial for computer vision and robotics, but existing methods lack rigorous uncertainty quantification under dense or learned models.
Method: Linearizes image formation with respect to small pose perturbations on the manifold to derive a render-aware Cramér-Rao bound, treating differentiable renderers as measurement functions.
Result: Derived a closed-form lower bound on pose covariance that reduces to classical bundle-adjustment uncertainty and extends to multi-agent settings by fusing Fisher information across cameras.
Conclusion: The statistical formulation enables uncertainty quantification for pose estimation without explicit keypoint correspondences, with applications in cooperative perception and novel view synthesis.
Abstract: Pose estimation is essential for many applications within computer vision and robotics. Despite its uses, few works provide rigorous uncertainty quantification for poses under dense or learned models. We derive a closed-form lower bound on the covariance of camera pose estimates by treating a differentiable renderer as a measurement function. Linearizing image formation with respect to a small pose perturbation on the manifold yields a render-aware Cramér-Rao bound. Our approach reduces to classical bundle-adjustment uncertainty, ensuring continuity with vision theory. It also naturally extends to multi-agent settings by fusing Fisher information across cameras. Our statistical formulation has downstream applications for tasks such as cooperative perception and novel view synthesis without requiring explicit keypoint correspondences.
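The bound itself is mechanical once the renderer is differentiable: for Jacobian J of the rendered pixels with respect to the pose and pixel noise sigma^2, the Fisher information is J^T J / sigma^2, its inverse lower-bounds the pose covariance, and information sums across cameras. A toy sketch with a stand-in linear "renderer":

```python
# Render-aware Cramer-Rao bound with a toy differentiable renderer.
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
A = torch.randn(100, 6)        # toy "renderer" weights

def render(pose):              # pose (6,) -> flattened image pixels (100,)
    return torch.tanh(A @ pose)

sigma2 = 1e-4                  # assumed per-pixel noise variance

def fisher(pose):
    J = jacobian(render, pose)       # (100, 6) image-vs-pose Jacobian
    return J.T @ J / sigma2          # per-camera Fisher information

pose = torch.zeros(6)
# Multi-agent fusion: Fisher information adds across cameras viewing the
# pose (the same toy renderer stands in for both cameras here).
I_total = fisher(pose) + fisher(pose)
crb = torch.linalg.inv(I_total)      # lower bound on the pose covariance
print(torch.sqrt(torch.diag(crb)))   # per-axis standard-deviation bounds
```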
[231] Robust Modality-incomplete Anomaly Detection: A Modality-instructive Framework with Benchmark
Bingchen Miao, Wenqiao Zhang, Juncheng Li, Wangyu Wu, Siliang Tang, Zhaocheng Li, Haochen Shi, Jun Xiao, Yueting Zhuang
Main category: cs.CV
TL;DR: This paper introduces RADAR, a robust framework for Modality-Incomplete Industrial Anomaly Detection that addresses the challenge of missing modalities in real-world multimodal data, outperforming traditional methods on their newly created MIIAD benchmark.
Details
Motivation: Traditional multimodal industrial anomaly detection assumes all 2D and 3D modalities are paired, but real-world data often has missing modalities, causing existing methods to overfit and perform poorly. Robust models against modality-incomplete data are needed for practical applications.
Method: Proposed RADAR framework with two key components: 1) Modality-incomplete Instruction to guide multimodal Transformer with adaptive parameter learning via HyperNetwork, and 2) Double-Pseudo Hybrid Module to highlight modality combination uniqueness and mitigate overfitting.
Result: RADAR significantly outperforms traditional MIAD methods on the newly created MIIAD dataset, demonstrating superior performance in handling modality-incomplete scenarios.
Conclusion: The proposed RADAR framework effectively addresses modality-incomplete industrial anomaly detection, proving practical value for real-world applications where multimodal data is often imperfect, and establishes a new benchmark for this challenging problem.
Abstract: Multimodal Industrial Anomaly Detection (MIAD), which utilizes 3D point clouds and 2D RGB images to identify abnormal regions in products, plays a crucial role in industrial quality inspection. However, traditional MIAD settings assume that all 2D and 3D modalities are paired, ignoring the fact that multimodal data collected from the real world is often imperfect due to missing modalities. Additionally, models trained on modality-incomplete data are prone to overfitting. Therefore, MIAD models that demonstrate robustness against modality-incomplete data are highly desirable in practice. To address this, we introduce a pioneering study that comprehensively investigates Modality-Incomplete Industrial Anomaly Detection (MIIAD), and under the guidance of experts, we construct the MIIAD Bench with rich modality-missing settings to account for imperfect learning environments with incomplete multimodal information. As expected, we find that most existing MIAD methods perform poorly on the MIIAD Bench, leading to significant performance degradation. To tackle this challenge, we propose a novel two-stage Robust modAlity-aware fusing and Detecting framewoRk, abbreviated as RADAR. Specifically: i) We propose Modality-incomplete Instruction to guide the multimodal Transformer to robustly adapt to various modality-incomplete scenarios, and implement adaptive parameter learning based on HyperNetwork. ii) Then, we construct a Double-Pseudo Hybrid Module to highlight the uniqueness of modality combinations, mitigating overfitting issues and further enhancing the robustness of the MIIAD model. Our experimental results demonstrate that the proposed RADAR significantly outperforms traditional MIAD methods on our newly created MIIAD dataset, proving its practical application value.
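The HyperNetwork component can be illustrated generically: a small network maps a modality-availability indicator to the weights of an adapter layer, so the model re-parameterizes itself per missing-modality pattern. The sketch below is a toy instance of that idea, not RADAR's architecture:

```python
# Toy hypernetwork: an availability mask (e.g., [has_rgb, has_3d]) generates
# the weights and bias of a per-sample adapter layer.
import torch
import torch.nn as nn

class HyperAdapter(nn.Module):
    def __init__(self, feat_dim=64, n_modalities=2):
        super().__init__()
        self.feat_dim = feat_dim
        # Generates a (feat_dim x feat_dim) adapter weight plus bias.
        self.hyper = nn.Linear(n_modalities, feat_dim * feat_dim + feat_dim)

    def forward(self, feats, modality_mask):
        params = self.hyper(modality_mask)
        W = params[: self.feat_dim ** 2].view(self.feat_dim, self.feat_dim)
        b = params[self.feat_dim ** 2 :]
        return feats @ W.T + b

adapter = HyperAdapter()
feats = torch.randn(8, 64)
out = adapter(feats, torch.tensor([1.0, 0.0]))  # RGB present, 3D missing
print(out.shape)
```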
[232] 2D_3D Feature Fusion via Cross-Modal Latent Synthesis and Attention Guided Restoration for Industrial Anomaly Detection
Usman Ali, Ali Zia, Abdul Rehman, Umer Ramzan, Zohaib Hassan, Talha Sattar, Jing Wang, Wei Xiang
Main category: cs.CV
TL;DR: MAFR is an unsupervised framework for industrial anomaly detection that fuses RGB images and point clouds using a shared fusion encoder with attention-guided decoders, achieving state-of-the-art performance on benchmarks.
Details
Motivation: Industrial anomaly detection benefits from 2D and 3D data integration, but robust cross-modal fusion remains challenging.
Method: Uses a shared fusion encoder to create unified latent space from RGB images and point clouds, followed by attention-guided modality-specific decoders. Anomalies are detected by measuring reconstruction errors between input and restored features.
Result: Achieves state-of-the-art results with mean I-AUROC of 0.972 on MVTec 3D-AD and 0.901 on Eyecandies benchmarks. Also shows strong few-shot learning performance.
Conclusion: MAFR provides a principled approach for fusing visual and geometric information, advancing robustness and accuracy in industrial anomaly detection.
Abstract: Industrial anomaly detection (IAD) increasingly benefits from integrating 2D and 3D data, but robust cross-modal fusion remains challenging. We propose a novel unsupervised framework, Multi-Modal Attention-Driven Fusion Restoration (MAFR), which synthesises a unified latent space from RGB images and point clouds using a shared fusion encoder, followed by attention-guided, modality-specific decoders. Anomalies are localised by measuring reconstruction errors between input features and their restored counterparts. Evaluations on the MVTec 3D-AD and Eyecandies benchmarks demonstrate that MAFR achieves state-of-the-art results, with a mean I-AUROC of 0.972 and 0.901, respectively. The framework also exhibits strong performance in few-shot learning settings, and ablation studies confirm the critical roles of the fusion architecture and composite loss. MAFR offers a principled approach for fusing visual and geometric information, advancing the robustness and accuracy of industrial anomaly detection. Code is available at https://github.com/adabrh/MAFR
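The restoration recipe is a classic autoencoding pattern: fuse both modalities into a shared latent, decode each modality back, and read anomaly scores off the reconstruction error. The sketch below uses illustrative dimensions and omits the attention guidance and composite loss:

```python
# Shared fusion encoder + modality-specific decoders; reconstruction error
# per spatial location serves as the anomaly map.
import torch
import torch.nn as nn

class FusionRestorer(nn.Module):
    def __init__(self, rgb_dim=256, pc_dim=128, latent=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(rgb_dim + pc_dim, latent), nn.ReLU())
        self.dec_rgb = nn.Linear(latent, rgb_dim)
        self.dec_pc = nn.Linear(latent, pc_dim)

    def forward(self, f_rgb, f_pc):
        z = self.encoder(torch.cat([f_rgb, f_pc], dim=-1))
        return self.dec_rgb(z), self.dec_pc(z)

model = FusionRestorer()
f_rgb, f_pc = torch.randn(1, 1024, 256), torch.randn(1, 1024, 128)  # per-patch
r_rgb, r_pc = model(f_rgb, f_pc)
anomaly_map = ((f_rgb - r_rgb) ** 2).mean(-1) + ((f_pc - r_pc) ** 2).mean(-1)
print(anomaly_map.shape)  # (1, 1024): one score per spatial location
```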
[233] Mismatch reconstruction theory for unknown measurement matrix in imaging through multimode fiber bending
Le Yang
Main category: cs.CV
TL;DR: A novel theory for reconstructing images through multimode fibers when the measurement matrix is unknown, using mismatch equations and calibration algorithms to construct a new measurement matrix.
Details
Motivation: Traditional reconstruction algorithms fail when the measurement matrix is unknown due to system configuration issues or fiber bending, requiring a solution for practical applications.
Method: Proposed mismatch equation and designed matched and calibration solution algorithms to construct a new measurement matrix, with detailed proofs provided.
Result: Under low noise levels, the constructed matrix enables successful image reconstruction using traditional algorithms, with demonstrated robustness against noise, computational precision, and orthogonality issues.
Conclusion: The theory provides a practical solution for unknown measurement matrix scenarios with certain robustness, though limitations exist, and code is available for implementation.
Abstract: Multimode fiber imaging requires strict matching between the measurement values and the measurement matrix to achieve image reconstruction. However, in practical applications, the measurement matrix often cannot be obtained due to unknown system configuration or difficulty in real-time alignment after arbitrary fiber bending, resulting in the failure of traditional reconstruction algorithms. This paper presents a novel mismatch reconstruction theory for solving the problem of image reconstruction when the measurement matrix is unknown. We first propose a mismatch equation and design matched and calibration solution algorithms to construct a new measurement matrix. In addition, we provide a detailed proof of these equations and algorithms in the appendix. The experimental results show that, under low noise levels, the constructed matrix can serve as the matched counterpart in traditional reconstruction algorithms and successfully reconstruct the original image. We then analyze the impact of noise, computational precision, and orthogonality on reconstruction performance. The results show that the proposed algorithms have a certain degree of robustness. Finally, we discuss the limitations and potential applications of this theory. The code is available: https://github.com/yanglebupt/mismatch-solution.
[234] GS-ProCams: Gaussian Splatting-based Projector-Camera Systems
Qingyue Deng, Jijiang Li, Haibin Ling, Bingyao Huang
Main category: cs.CV
TL;DR: GS-ProCams is the first Gaussian Splatting-based framework for projector-camera systems that enables efficient view-agnostic projection mapping without requiring additional light sources.
Details
Motivation: To overcome limitations of previous CNN-based ProCams (viewpoint-constrained) and NeRF-based ProCams (require additional light sources, high computational/memory costs) by developing a more efficient view-agnostic solution.
Method: Uses 2D Gaussian representations to model complex geometric and photometric mappings of ProCams, including projector responses, surface geometry/materials, and global illumination. Employs differentiable physically-based rendering to jointly estimate these from multi-view projections.
Result: Achieves superior ProCams simulation quality compared to NeRF-based methods, eliminates need for additional devices, uses only 1/10 GPU memory for training, and is 900 times faster in inference speed.
Conclusion: GS-ProCams provides an efficient and practical solution for view-agnostic projection mapping that significantly outperforms existing methods in terms of computational efficiency and resource requirements.
Abstract: We present GS-ProCams, the first Gaussian Splatting-based framework for projector-camera systems (ProCams). GS-ProCams is not only view-agnostic but also significantly enhances the efficiency of projection mapping (PM), which requires establishing geometric and radiometric mappings between the projector and the camera. Previous CNN-based ProCams are constrained to a specific viewpoint, limiting their applicability to novel perspectives. In contrast, NeRF-based ProCams support view-agnostic projection mapping; however, they require an additional co-located light source and demand significant computational and memory resources. To address this issue, we propose GS-ProCams, which employs 2D Gaussians for scene representation and enables efficient view-agnostic ProCams applications. In particular, we explicitly model the complex geometric and photometric mappings of ProCams using projector responses, the projection surface's geometry and materials represented by Gaussians, and the global illumination component. Then, we employ differentiable physically-based rendering to jointly estimate them from captured multi-view projections. Compared to state-of-the-art NeRF-based methods, our GS-ProCams eliminates the need for additional devices, achieving superior ProCams simulation quality. It also uses only 1/10 of the GPU memory for training and is 900 times faster in inference speed. Please refer to our project page for the code and dataset: https://realqingyue.github.io/GS-ProCams/.
[235] FlowOpt: Fast Optimization Through Whole Flow Processes for Training-Free Editing
Or Ronai, Vladimir Kulikov, Tomer Michaeli
Main category: cs.CV
TL;DR: FlowOpt is a zero-order optimization framework that treats flow-matching models as black boxes, enabling efficient optimization through the entire sampling process without backpropagation for controlled generation tasks like image editing.
Details
Motivation: Existing methods for controlling diffusion and flow-matching models are computationally impractical due to iterative sampling processes, requiring separate manipulation of each timestep rather than direct optimization of the final output.
Method: FlowOpt uses zero-order (gradient-free) optimization to treat the entire flow process as a black box, allowing optimization through the whole sampling path without backpropagation through the model, with proven convergence guarantees and empirical step-size estimation.
Result: FlowOpt achieves state-of-the-art results for image editing tasks including inversion (finding initial noise for given images) and direct steering of edited images, using roughly the same number of neural function evaluations (NFEs) as existing methods.
Conclusion: FlowOpt provides an efficient, gradient-free optimization framework for flow-matching models that enables monitoring of intermediate results and early stopping, making it practical for real-time controlled generation tasks.
Abstract: The remarkable success of diffusion and flow-matching models has ignited a surge of works on adapting them at test time for controlled generation tasks. Examples range from image editing to restoration, compression and personalization. However, due to the iterative nature of the sampling process in those models, it is computationally impractical to use gradient-based optimization to directly control the image generated at the end of the process. As a result, existing methods typically resort to manipulating each timestep separately. Here we introduce FlowOpt - a zero-order (gradient-free) optimization framework that treats the entire flow process as a black box, enabling optimization through the whole sampling path without backpropagation through the model. Our method is both highly efficient and allows users to monitor the intermediate optimization results and perform early stopping if desired. We prove a sufficient condition on FlowOpt’s step-size, under which convergence to the global optimum is guaranteed. We further show how to empirically estimate this upper bound so as to choose an appropriate step-size. We demonstrate how FlowOpt can be used for image editing, showcasing two options: (i) inversion (determining the initial noise that generates a given image), and (ii) directly steering the edited image to be similar to the source image while conforming to a target text prompt. In both cases, FlowOpt achieves state-of-the-art results while using roughly the same number of neural function evaluations (NFEs) as existing methods. Code and examples are available on the project’s webpage.
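The black-box idea can be caricatured with plain random search over the initial noise, which already exhibits the key property: no backpropagation through the sampler, and intermediate results available for early stopping. FlowOpt's actual update rule and step-size bound are in the paper; this only illustrates "optimize through the whole flow without gradients":

```python
# Zero-order (gradient-free) optimization over the initial noise of a
# black-box sampler, minimizing a user-specified loss on the final image.
import torch

@torch.no_grad()
def zero_order_opt(sample_fn, loss_fn, z, iters=50, step=0.1):
    best_loss = loss_fn(sample_fn(z))
    for _ in range(iters):
        cand = z + step * torch.randn_like(z)   # random perturbation
        cand_loss = loss_fn(sample_fn(cand))
        if cand_loss < best_loss:               # keep only improvements
            z, best_loss = cand, cand_loss
        # Intermediate results can be monitored here for early stopping.
    return z, best_loss

# Toy usage: the "sampler" is the identity, the target is a fixed image.
target = torch.randn(1, 3, 8, 8)
z0 = torch.randn_like(target)
z_star, final = zero_order_opt(lambda z: z,
                               lambda x: ((x - target) ** 2).mean(), z0)
print(final)
```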
[236] Exploring the design space of diffusion and flow models for data fusion
Niraj Chaudhari, Manmeet Singh, Naveen Sudharsan, Amit Kumar Srivastava, Harsh Kamath, Dushyant Mahajan, Ayan Paul
Main category: cs.CV
TL;DR: This study explores diffusion and flow models for fusing DMSP-OLS and VIIRS nighttime lights satellite data, finding UNet-based diffusion models best preserve spatial details while providing guidance on noise schedulers and quantization techniques.
Details
Motivation: Data fusion is essential for integrating multi-source information to enhance data quality, particularly in satellite remote sensing where fusing multi-sensor observations can improve spatial and temporal resolution.
Method: Leveraged diverse 2D image-to-image generative models including UNET, diffusion, and flow modeling architectures, evaluated on DMSP-OLS and VIIRS nighttime lights data fusion, with exploration of noise schedulers and quantization techniques.
Result: Diffusion models based on UNet were particularly adept at preserving fine-grained spatial details and generating high-fidelity fused images. Trade-offs identified between iterative solvers for faster inference and discrete schedulers for higher-quality reconstructions.
Conclusion: Provides practical insights for selecting effective diffusion and flow model architectures for data fusion tasks in remote sensing, with recommendations for noise scheduling strategies to enhance fusion quality while optimizing memory efficiency through quantization.
Abstract: Data fusion is an essential task in various domains, enabling the integration of multi-source information to enhance data quality and insights. One key application is in satellite remote sensing, where fusing multi-sensor observations can improve spatial and temporal resolution. In this study, we explore the design space of diffusion and flow models for data fusion, focusing on the integration of Defense Meteorological Satellite Program’s Operational Linescan System (DMSP-OLS) and Visible Infrared Imaging Radiometer Suite (VIIRS) nighttime lights data. Our approach leverages a diverse set of 2D image-to-image generative models, including UNET, diffusion, and flow modeling architectures. We evaluate the effectiveness of these architectures in satellite remote sensing data fusion, identifying diffusion models based on UNet as particularly adept at preserving fine-grained spatial details and generating high-fidelity fused images. We also provide guidance on the selection of noise schedulers in diffusion-based models, highlighting the trade-offs between iterative solvers for faster inference and discrete schedulers for higher-quality reconstructions. Additionally, we explore quantization techniques to optimize memory efficiency and computational cost without compromising performance. Our findings offer practical insights into selecting the most effective diffusion and flow model architectures for data fusion tasks, particularly in remote sensing applications, and provide recommendations for leveraging noise scheduling strategies to enhance fusion quality.
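The scheduler trade-off maps directly onto the diffusers API, where swapping the pipeline's scheduler changes the speed/quality balance without retraining; the model id, prompt, and step counts below are illustrative (the paper trains its own fusion models), and a CUDA device is assumed:

```python
# Swapping noise schedulers on a diffusers pipeline: an iterative solver for
# fast inference versus a discrete scheduler for higher-quality outputs.
import torch
from diffusers import (DDIMScheduler, DPMSolverMultistepScheduler,
                       StableDiffusionPipeline)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative model id
    torch_dtype=torch.float16,
).to("cuda")

# Fast iterative solver: fewer steps, quicker inference.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
fast = pipe("fused nighttime lights scene", num_inference_steps=20).images[0]

# Discrete scheduler: more steps, often higher-quality reconstructions.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
slow = pipe("fused nighttime lights scene", num_inference_steps=100).images[0]
```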
[237] NVS-SQA: Exploring Self-Supervised Quality Representation Learning for Neurally Synthesized Scenes without References
Qiang Qu, Yiran Shen, Xiaoming Chen, Yuk Ying Chung, Weidong Cai, Tongliang Liu
Main category: cs.CV
TL;DR: NVS-SQA is a no-reference quality assessment method for neural view synthesis that uses self-supervised learning without human labels, outperforming existing methods significantly.
Details
Motivation: Full-reference quality assessment methods don't fully capture perceptual quality in neural view synthesis due to limited reference views, and human labels are hard to acquire, risking overfitting.
Method: Uses self-supervised learning with heuristic cues and quality scores as objectives, plus specialized contrastive pair preparation, without relying on human labels.
Result: Outperforms 17 no-reference methods by large margins (109.5% SRCC, 98.6% PLCC, 91.5% KRCC) and exceeds 16 full-reference methods across all metrics.
Conclusion: NVS-SQA effectively addresses quality assessment challenges in neural view synthesis through self-supervised learning and outperforms existing methods.
Abstract: Neural View Synthesis (NVS), such as NeRF and 3D Gaussian Splatting, effectively creates photorealistic scenes from sparse viewpoints, typically evaluated by quality assessment methods like PSNR, SSIM, and LPIPS. However, these full-reference methods, which compare synthesized views to reference views, may not fully capture the perceptual quality of neurally synthesized scenes (NSS), particularly due to the limited availability of dense reference views. Furthermore, the challenges in acquiring human perceptual labels hinder the creation of extensive labeled datasets, risking model overfitting and reduced generalizability. To address these issues, we propose NVS-SQA, a NSS quality assessment method to learn no-reference quality representations through self-supervision without reliance on human labels. Traditional self-supervised learning predominantly relies on the “same instance, similar representation” assumption and extensive datasets. However, given that these conditions do not apply in NSS quality assessment, we employ heuristic cues and quality scores as learning objectives, along with a specialized contrastive pair preparation process to improve the effectiveness and efficiency of learning. The results show that NVS-SQA outperforms 17 no-reference methods by a large margin (i.e., on average 109.5% in SRCC, 98.6% in PLCC, and 91.5% in KRCC over the second best) and even exceeds 16 full-reference methods across all evaluation metrics (i.e., 22.9% in SRCC, 19.1% in PLCC, and 18.6% in KRCC over the second best).
[238] Caption-Driven Explainability: Probing CNNs for Bias via CLIP
Patrick Koller, Amil V. Dravid, Guido M. Schuster, Aggelos K. Katsaggelos
Main category: cs.CV
TL;DR: Proposes a caption-based XAI method that integrates standalone ML models into CLIP using network surgery to identify dominant concepts in predictions, improving robustness against covariate shifts.
Details
Motivation: Saliency maps in XAI can be misleading when spurious and salient features overlap in pixel space, creating robustness issues in ML models.
Method: Integrates standalone models into CLIP using novel network surgery approach to create caption-based explanations that identify dominant concepts.
Result: The method identifies the dominant concept contributing most to model predictions, minimizing risk of covariate shift.
Conclusion: Caption-based XAI contributes significantly towards developing robust ML models by providing more reliable explanations than traditional saliency maps.
Abstract: Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the model's prediction. This explanation minimizes the risk of the standalone model being misled by a covariate shift and contributes significantly towards developing robust ML models.
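The concept-attribution step can be approximated with off-the-shelf CLIP from Hugging Face transformers by scoring an image against candidate concept captions; the paper's actual contribution, surgically grafting the standalone model into CLIP, is not reproduced here, and the file name and captions are illustrative:

```python
# Score an image against candidate concept captions with CLIP and report
# the dominant concept.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # image the standalone model classified
concepts = ["a photo of a husky", "a photo of snow", "a photo of a sled"]

inputs = processor(text=concepts, images=image,
                   return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
# A spurious concept (e.g., "snow") dominating the salient one flags bias.
print(concepts[probs.argmax()], probs.tolist())
```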
[239] Token-Level Inference-Time Alignment for Vision-Language Models
Kejia Chen, Jiawen Zhang, Jiacong Hu, Kewei Gao, Jian Lou, Zunlei Feng, Mingli Song
Main category: cs.CV
TL;DR: TITA is a lightweight framework that reduces hallucinations in Vision-Language Models by providing token-level corrective signals during inference without retraining the base model.
Details
Motivation: VLMs suffer from hallucination issues where generated text misaligns with visual inputs, and existing alignment methods require expensive fine-tuning or provide only coarse feedback.
Method: Freezes the base VLM and trains a reward model to approximate its distribution, then extracts implicit preference signals as log-probability ratios between reward model and target VLM during inference.
Result: Consistent improvements across 12 benchmarks: 8.6% on MMVet and 6.7% on POPE, with comparable gains on other models like Qwen2.5-VL-7B and DeepSeek-VL2-27.5B, especially in hallucination reduction and VQA accuracy.
Conclusion: TITA effectively reduces hallucinations in VLMs through token-level inference-time alignment with negligible inference overhead, demonstrating stronger general understanding across multiple models and benchmarks.
Abstract: Vision-Language Models (VLMs) have become essential backbones of modern multimodal intelligence, yet their outputs remain prone to hallucination: plausible text misaligned with visual inputs. Existing alignment approaches often rely on expensive fine-tuning with annotated preference data or sequence-level inference strategies that provide only coarse, delayed feedback. To overcome these limitations, we present TITA (Token-level Inference-Time Alignment), a lightweight framework that freezes the base VLM and instead trains a reward model to approximate its distribution. During inference, implicit preference signals are extracted as log-probability ratios between the reward model and the target VLM, yielding dense autoregressive feedback. This formulation can be viewed as an inference-time variant of Direct Preference Optimization (DPO), providing token-level corrective signals without retraining the backbone. Extensive evaluations on LLaVA-1.5-7B and 13B show consistent gains across 12 benchmarks, with improvements of 8.6% on MMVet and 6.7% on POPE, indicating stronger general understanding and reduced hallucinations. Additional experiments on Qwen2.5-VL-7B and DeepSeek-VL2-27.5B show comparable gains, especially in hallucination reduction and VQA accuracy, while incurring negligible inference overhead.
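The decoding rule can be sketched as a per-step logit adjustment: shift the target VLM's next-token logits by the log-probability ratio between the reward model and the VLM. The plain additive form and the scaling factor beta below are assumptions about the exact formulation:

```python
# Token-level inference-time adjustment: add the log-probability ratio
# between a reward model and the target VLM to the VLM's next-token logits.
import torch

def adjusted_next_token_logits(vlm_logits, reward_logits, beta=1.0):
    log_p_vlm = torch.log_softmax(vlm_logits, dim=-1)
    log_p_rm = torch.log_softmax(reward_logits, dim=-1)
    # Dense, token-level corrective signal: log p_rm(y|x) - log p_vlm(y|x).
    return vlm_logits + beta * (log_p_rm - log_p_vlm)

vocab = 32000
vlm_logits = torch.randn(1, vocab)     # from the frozen target VLM
reward_logits = torch.randn(1, vocab)  # from the trained reward model
next_token = adjusted_next_token_logits(vlm_logits, reward_logits).argmax(-1)
print(next_token)
```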
[240] ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations
Bowen Jiang, Yuan Yuan, Xinyi Bai, Zhuoqun Hao, Alyson Yin, Yaojie Hu, Wenyu Liao, Lyle Ungar, Camillo J. Taylor
Main category: cs.CV
TL;DR: This paper presents a diffusion model that achieves font-controllable multilingual text rendering using only raw images without font annotations, enabling user-specified font control through self-supervised learning.
Details
Motivation: Current methods require font label annotations which are impractical to obtain from large-scale real-world datasets, preventing user-specified font control in text rendering applications.
Method: Proposes integrating conditional diffusion models with text segmentation models, using segmentation masks to capture fonts in pixel space in a self-supervised manner without ground-truth labels.
Result: The method enables zero-shot text and font editing across diverse fonts and languages, providing a proof of concept for generalized visual text rendering.
Conclusion: The approach eliminates the need for font annotations and enables customizable multilingual text rendering, offering valuable insights for both research community and industry applications.
Abstract: This work demonstrates that diffusion models can achieve font-controllable multilingual text rendering using just raw images without font label annotations. Visual text rendering remains a significant challenge. While recent methods condition diffusion on glyphs, it is impossible to retrieve exact font annotations from large-scale, real-world datasets, which prevents user-specified font control. To address this, we propose a data-driven solution that integrates the conditional diffusion model with a text segmentation model, utilizing segmentation masks to capture and represent fonts in pixel space in a self-supervised manner, thereby eliminating the need for any ground-truth labels and enabling users to customize text rendering with any multilingual font of their choice. The experiment provides a proof of concept of our algorithm in zero-shot text and font editing across diverse fonts and languages, providing valuable insights for the community and industry toward achieving generalized visual text rendering. Code is available at github.com/bowen-upenn/ControlText.
[241] Xihe: Scalable Zero-Shot Time Series Learner Via Hierarchical Interleaved Block Attention
Yinbo Sun, Yuchen Fang, Zhibo Zhu, Jia Li, Yu Liu, Qiwen Deng, Jun Zhou, Hang Yu, Xingyu Lu, Lintao Ma
Main category: cs.CV
TL;DR: Proposes HIBA (Hierarchical Interleaved Block Attention) architecture and Xihe model family for time series foundation models, achieving state-of-the-art zero-shot performance with superior parameter efficiency.
Details
Motivation: Existing time series foundation models directly adopt cross-domain architectures from language models, which limits their ability to effectively capture multiscale temporal dependencies in time series data, especially during zero-shot transfer across datasets with different patterns and sampling strategies.
Method: Developed HIBA architecture with hierarchical inter- and intra-block sparse attention to capture multi-scale dependencies. Intra-block attention handles local information exchange, while inter-block attention captures global temporal pattern interaction and dynamic evolution. Built Xihe model family scaling from 9.5M to 1.5B parameters.
Result: Xihe-tiny (9.5M) surpasses most contemporary TSFMs, demonstrating remarkable parameter efficiency. Xihe-max (1.5B) establishes new state-of-the-art zero-shot performance, significantly outperforming previous best results on the GIFT-Eval benchmark.
Conclusion: The consistent performance excellence across all parameter scales provides compelling evidence for the exceptional generalization capabilities and architectural superiority of the HIBA approach in time series foundation modeling.
Abstract: The rapid advancement of time series foundation models (TSFMs) has been propelled by migrating architectures from language models. While existing TSFMs demonstrate impressive performance, their direct adoption of cross-domain architectures constrains effective capture of multiscale temporal dependencies inherent to time series data. This limitation becomes particularly pronounced during zero-shot transfer across datasets with divergent underlying patterns and sampling strategies. To address these challenges, we propose Hierarchical Interleaved Block Attention (HIBA) which employs hierarchical inter- and intra-block sparse attention to effectively capture multi-scale dependencies. Intra-block attention facilitates local information exchange, and inter-block attention operates across blocks to capture global temporal pattern interaction and dynamic evolution. Leveraging the HIBA architecture, we introduce Xihe, a scalable TSFM family spanning from an ultra-efficient 9.5M parameter configuration to high-capacity 1.5B variant. Evaluated on the comprehensive GIFT-Eval benchmark, our most compact Xihe-tiny model (9.5M) surpasses the majority of contemporary TSFMs, demonstrating remarkable parameter efficiency. More impressively, Xihe-max (1.5B) establishes new state-of-the-art zero-shot performance, surpassing previous best results by a substantial margin. This consistent performance excellence across the entire parameter spectrum provides compelling evidence for the exceptional generalization capabilities and architectural superiority of HIBA.
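A dense two-level toy version of the pattern, attention within blocks for local exchange and attention over per-block summaries for global interaction, conveys the hierarchy; real HIBA uses sparse interleaved attention, which this sketch does not reproduce:

```python
# Two-level block attention: intra-block (local) then inter-block over
# per-block summary tokens (global).
import torch
import torch.nn as nn

class TwoLevelBlockAttention(nn.Module):
    def __init__(self, dim=64, block=16, heads=4):
        super().__init__()
        self.block = block
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                  # x: (B, L, D), L % block == 0
        B, L, D = x.shape
        nb = L // self.block
        blocks = x.reshape(B * nb, self.block, D)
        blocks, _ = self.intra(blocks, blocks, blocks)    # local exchange
        summaries = blocks.mean(dim=1).reshape(B, nb, D)  # one token per block
        summaries, _ = self.inter(summaries, summaries, summaries)  # global
        out = blocks.reshape(B, nb, self.block, D) + summaries[:, :, None, :]
        return out.reshape(B, L, D)

layer = TwoLevelBlockAttention()
print(layer(torch.randn(2, 64, 64)).shape)  # (2, 64, 64)
```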
[242] MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, Zuxuan Wu
Main category: cs.CV
TL;DR: MagicMotion is a novel image-to-video generation framework that enables precise trajectory control through three levels of conditions (masks, bounding boxes, sparse boxes), addressing limitations in existing methods for complex object movements and multi-object motion control.
Details
Motivation: Existing trajectory-controllable video generation methods struggle with complex object movements, multi-object motion control, imprecise trajectory adherence, poor object consistency, and compromised visual quality. They also only support single-format trajectory control and lack dedicated datasets/benchmarks.
Method: MagicMotion uses a novel image-to-video generation framework with three levels of trajectory control conditions (dense to sparse: masks, bounding boxes, sparse boxes). It also introduces MagicData (large-scale trajectory-controlled video dataset) and MagicBench (comprehensive benchmark).
Result: Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics, achieving better trajectory adherence, object consistency, and visual quality.
Conclusion: MagicMotion successfully addresses key challenges in trajectory-controllable video generation, providing flexible multi-format trajectory control while maintaining high visual quality and object consistency, supported by comprehensive datasets and benchmarks.
Abstract: Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present MagicData, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce MagicBench, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics. Our project page is publicly available at https://quanhaol.github.io/magicmotion-site.
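As a rough illustration of the dense-to-sparse condition hierarchy, the sketch below derives per-frame bounding boxes and a sparse subset of boxes from dense object masks; the sparsification rule (`keep_every`) and all names are hypothetical, for illustration only.

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray):
    """Tightest (x0, y0, x1, y1) box around a binary object mask."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def trajectory_conditions(masks: np.ndarray, keep_every: int = 8):
    """Derive the three condition levels from dense masks.

    masks: (T, H, W) binary masks of one object across T frames.
    Returns dense masks, per-frame boxes, and sparse boxes (a box kept
    only every `keep_every` frames; the exact rule is an assumption).
    """
    boxes = [mask_to_bbox(m) for m in masks]
    sparse = [b if t % keep_every == 0 else None for t, b in enumerate(boxes)]
    return masks, boxes, sparse
```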
[243] LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction
Yuhang Gao, Xiang Xiang, Sheng Zhong, Guoyou Wang
Main category: cs.CV
TL;DR: LOC is a language-guided framework for 3D scene understanding that adapts to occupancy networks, supporting both supervised and self-supervised learning with Densely Contrastive Learning (DCL) to enhance open-set recognition.
Details
Motivation: Vision-Language Models (VLMs) show promise for open-set challenges but face limitations in 3D scene understanding due to scarce 3D datasets.
Method: Uses multi-frame LiDAR fusion with Poisson reconstruction for voxel representations, KNN for semantics, and DCL with textual prompts to prevent feature over-homogenization.
Result: Achieves superior performance on nuScenes dataset with high-precision known class predictions and effective unknown class distinction without extra training.
Conclusion: LOC effectively bridges 3D scene understanding with VLMs, enabling robust open-set recognition through language guidance and contrastive learning.
Abstract: Vision-Language Models (VLMs) have shown significant progress in open-set challenges. However, the limited availability of 3D datasets hinders their effective application in 3D scene understanding. We propose LOC, a general language-guided framework adaptable to various occupancy networks, supporting both supervised and self-supervised learning paradigms. For self-supervised tasks, we employ a strategy that fuses multi-frame LiDAR points for dynamic/static scenes, using Poisson reconstruction to fill voids, and assigning semantics to voxels via K-Nearest Neighbor (KNN) to obtain comprehensive voxel representations. To mitigate feature over-homogenization caused by direct high-dimensional feature distillation, we introduce Densely Contrastive Learning (DCL). DCL leverages dense voxel semantic information and predefined textual prompts. This efficiently enhances open-set recognition without dense pixel-level supervision, and our framework can also leverage existing ground truth to further improve performance. Our model predicts dense voxel features embedded in the CLIP feature space, integrating textual and image pixel information, and classifies based on text and semantic similarity. Experiments on the nuScenes dataset demonstrate the method’s superior performance, achieving high-precision predictions for known classes and distinguishing unknown classes without additional training data.
[244] AI-Boosted Video Annotation: Assessing the Process Enhancement
Juan Gutiérrez, Ángel Mora, Pablo Regodón, Silvia Rodriguez, José Luis Blanco
Main category: cs.CV
TL;DR: AI-powered pre-annotations in Human-in-the-Loop video annotation reduce annotation time by 35% while maintaining quality, improving workflow efficiency and annotation coherence.
Details
Motivation: To enhance Human-in-the-Loop video annotation by integrating automatic AI capabilities to ease the task for annotators and assess their performance, focusing on efficiency, accuracy, and overall annotation quality.
Method: Implemented a single-iteration scheme using Label Studio and AI-powered zero-shot pre-annotations, tested on UCF-Crime dataset for normal/abnormal activity discrimination in videos.
Result: 35% reduction in annotation time for 70% of annotators with similar quality annotations compared to manual annotation; annotations were more coherent among annotators and better matched natural clustering of video frames.
Conclusion: Automatic AI-based pre-annotation streamlines video annotation workflow, empowers human annotators, and optimizes the overall pipeline while maintaining annotation quality.
Abstract: We explore the enhancement of Human-in-the-Loop video annotation by integrating automatic capabilities to ease the task for annotators and assess their performance. The research delves into the practical implications of the annotation processes, the integration of AI components, and the evaluation of its outcomes. We analyze their impact on efficiency, accuracy, and overall annotation quality. Focusing on the Human-in-the-Loop for video annotation tasks, we implemented a single-iteration scheme using Label Studio and AI-powered zero-shot pre-annotations. Using this framework, we designed a test based on the annotation of the UCF-Crime dataset to discriminate between normal and abnormal activities in video footage. Our results evidence how automatic AI-based pre-annotation can streamline the video annotation workflow, empowering human annotators and optimizing the overall pipeline. Using the pre-annotated data, we observed a 35% reduction in the annotation time for 70% of the annotators with similar quality annotations, compared to the traditional manual annotation task. Results are consistent with asset duration and complexity. We also observed that while annotators rapidly learned to use the tool, the produced annotations are more coherent among annotators and better match the natural clustering of the video frames.
[245] Morphology-Aware KOA Classification: Integrating Graph Priors with Vision Models
Marouane Tliba, Mohamed Amine Kerkouri, Yassine Nasser, Nour Aburaed, Aladine Chetouani, Ulas Bagci, Rachid Jennane
Main category: cs.CV
TL;DR: A multimodal framework combining anatomical structure graphs from SAM segmentations with vision encoders improves knee osteoarthritis diagnosis by aligning geometric and radiographic features through mutual information maximization.
Details
Motivation: Knee osteoarthritis diagnosis from radiographs is challenging because standard deep learning models struggle to capture subtle morphological details that are crucial for accurate assessment.
Method: Proposes a multimodal framework that integrates morphological graph representations from Segment Anything Model (SAM) segmentations with vision encoders, enforcing alignment between geometry-informed graph embeddings and radiographic features through mutual information maximization.
Result: Outperforms single-modality baselines by up to 10% in accuracy (reaching nearly 80%) and existing state-of-the-art methods by 8% in accuracy and 11% in F1 score on the Osteoarthritis Initiative dataset.
Conclusion: Incorporating anatomical structure into radiographic analysis is critical for accurate knee osteoarthritis severity grading, as demonstrated by the significant performance improvements achieved through explicit morphological priors.
Abstract: Knee osteoarthritis (KOA) diagnosis from radiographs remains challenging due to the subtle morphological details that standard deep learning models struggle to capture effectively. We propose a novel multimodal framework that combines anatomical structure with radiographic features by integrating a morphological graph representation - derived from Segment Anything Model (SAM) segmentations - with a vision encoder. Our approach enforces alignment between geometry-informed graph embeddings and radiographic features through mutual information maximization, significantly improving KOA classification accuracy. By constructing graphs from anatomical features, we introduce explicit morphological priors that mirror clinical assessment criteria, enriching the feature space and enhancing the model’s inductive bias. Experiments on the Osteoarthritis Initiative dataset demonstrate that our approach surpasses single-modality baselines by up to 10% in accuracy (reaching nearly 80%), while outperforming existing state-of-the-art methods by 8% in accuracy and 11% in F1 score. These results underscore the critical importance of incorporating anatomical structure into radiographic analysis for accurate KOA severity grading.
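The abstract specifies alignment via mutual information maximization but not the estimator; a common choice is the symmetric InfoNCE bound, sketched here in PyTorch as a generic stand-in for the paper's objective (the temperature and symmetric form are assumptions).

```python
import torch
import torch.nn.functional as F

def infonce_alignment(graph_emb: torch.Tensor, img_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE lower bound on MI between paired graph/image embeddings.

    graph_emb, img_emb: (B, D) embeddings for the same B radiographs.
    Matched pairs sit on the diagonal of the similarity matrix; the loss
    pushes them above all mismatched pairs.
    """
    g = F.normalize(graph_emb, dim=-1)
    v = F.normalize(img_emb, dim=-1)
    logits = g @ v.t() / temperature                     # (B, B) similarities
    targets = torch.arange(g.size(0), device=g.device)   # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```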
[246] Principled Multimodal Representation Learning
Xiaohao Liu, Xiaobo Xia, See-Kiong Ng, Tat-Seng Chua
Main category: cs.CV
TL;DR: PMRL is a novel multimodal representation learning framework that achieves simultaneous alignment of multiple modalities without anchor dependency by optimizing the dominant singular value to align representations along a shared leading direction.
Details
Motivation: Traditional multimodal learning methods rely on pairwise contrastive learning with predefined anchor modalities, which restricts alignment across all modalities. Recent approaches face challenges with fixed anchor points and instability from optimizing product of singular values.
Method: PMRL optimizes the dominant singular value of the representation matrix to align modalities along a shared leading direction, using a softmax-based loss function that treats singular values as logits. It also applies instance-wise contrastive regularization on leading eigenvectors to maintain separability and prevent collapse.
Result: Extensive experiments across diverse tasks demonstrate PMRL’s superiority compared to baseline methods.
Conclusion: PMRL provides a principled approach for stable, anchor-free multimodal representation learning that effectively aligns multiple modalities simultaneously while maintaining inter-instance separability.
Abstract: Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities to improve multimodal understanding. Traditional methods often depend on pairwise contrastive learning, which relies on a predefined anchor modality, restricting alignment across all modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain, such as limitations imposed by fixed anchor points and instability arising from optimizing the product of singular values. To address the challenges, in this paper, we propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities without anchor dependency in a more stable manner. Specifically, grounded in the theoretical insight that full alignment corresponds to a rank-1 Gram matrix, PMRL optimizes the dominant singular value of the representation matrix to align modalities along a shared leading direction. We propose a softmax-based loss function that treats singular values as logits to prioritize the largest singular value. Besides, instance-wise contrastive regularization on the leading eigenvectors maintains inter-instance separability and prevents representation collapse. Extensive experiments across diverse tasks demonstrate PMRL’s superiority compared to baseline methods. The source code will be publicly available.
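A minimal sketch of the softmax-over-singular-values objective the abstract describes, assuming unit-normalized per-modality representations stacked into one matrix per instance; the eigenvector-based contrastive regularizer is omitted here.

```python
import torch
import torch.nn.functional as F

def pmrl_alignment_loss(reps: torch.Tensor) -> torch.Tensor:
    """reps: (M, D) stacked representations of one instance across its
    M modalities. Full alignment makes this matrix rank-1, i.e. all
    spectral mass on the top singular value, so the loss treats the
    singular values as logits and maximizes the probability of the
    leading one.
    """
    reps = F.normalize(reps, dim=-1)
    s = torch.linalg.svdvals(reps)       # singular values, descending
    return -F.log_softmax(s, dim=0)[0]   # push mass onto the largest
```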
[247] It Takes Two to Tango: Two Parallel Samplers Improve Quality in Diffusion Models for Limited Steps
Pedro Cisneros-Velarde
Main category: cs.CV
TL;DR: A method using two parallel samplers in diffusion models improves image quality under limited denoising steps by integrating information from successive time steps.
Details
Motivation: To improve sample quality in diffusion models when constrained by a limited number of denoising steps, addressing computational limitations.
Method: Use two parallel processors/samplers that perform denoising steps at successive times and appropriately integrate their information in the latent image.
Result: The method improves sample quality in both automated and human evaluations across different diffusion models, while naive integration reduces quality and adding more samplers doesn’t necessarily help.
Conclusion: Simple parallel sampling with two processors effectively enhances diffusion model performance under computational constraints without requiring fine-tuning or external models.
Abstract: We consider the situation where we have a limited number of denoising steps, i.e., of evaluations of a diffusion model. We show that two parallel processors or samplers under such limitation can improve the quality of the sampled image. Particularly, the two samplers make denoising steps at successive times, and their information is appropriately integrated in the latent image. Remarkably, our method is simple both conceptually and to implement: it is plug-&-play, model agnostic, and does not require any additional fine-tuning or external models. We test our method with both automated and human evaluations for different diffusion models. We also show that a naive integration of the information from the two samplers lowers sample quality. Finally, we find that adding more parallel samplers does not necessarily improve sample quality.
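A skeleton of the two-sampler loop, with the integration step deliberately left abstract: the paper reports that a naive merge (such as plain averaging) lowers quality, so `integrate` and the one-step `model(x, t)` interface are placeholders, not the paper's API.

```python
import torch

def two_sampler_denoise(model, x_T: torch.Tensor, timesteps, integrate):
    """Run two parallel samplers at successive timesteps and merge their
    latents after each pair of steps.

    model(x, t) is assumed to perform one denoising step at time t;
    integrate(x_a, x_b) returns the two updated latents after the
    information exchange.
    """
    x_a, x_b = x_T.clone(), x_T.clone()
    for t_now, t_next in zip(timesteps[:-1], timesteps[1:]):
        x_a = model(x_a, t_now)           # sampler A at time t
        x_b = model(x_b, t_next)          # sampler B one step ahead
        x_a, x_b = integrate(x_a, x_b)    # exchange in latent space
    return x_a
```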
[248] Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization
Nicholas Klein, Hemlata Tak, James Fullwood, Krishna Regmi, Leonidas Spinoulas, Ganesh Sivaraman, Tianxiang Chen, Elie Khoury
Main category: cs.CV
TL;DR: This paper presents methods for deepfake video classification and localization, achieving top performance in the ACM 1M Deepfakes Detection Challenge.
Details
Motivation: The rapid advancement in visual and audio generation technologies necessitates robust detection of synthetic content, especially when subtle localized manipulations are made in videos.
Method: The authors developed solutions specifically for deepfake video classification and temporal localization tasks.
Result: The methods achieved best performance in temporal localization task and top four ranking in classification task for TestA split of the evaluation dataset in the ACM 1M Deepfakes Detection Challenge.
Conclusion: The proposed solutions demonstrate effective capabilities for detecting and localizing deepfake manipulations in videos, addressing the challenges posed by fine-grained synthetic content generation.
Abstract: The field of visual and audio generation is burgeoning with new state-of-the-art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine-grained alterations via localized manipulations are performed in visual, audio, or both domains, these subtle modifications add challenges to the detection algorithms. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top four ranking in the classification task for the TestA split of the evaluation dataset.
[249] Frame-Difference Guided Dynamic Region Perception for CLIP Adaptation in Text-Video Retrieval
Jiaao Yu, Mingjie Han, Tao Gong, Jian Zhang, Man Lan
Main category: cs.CV
TL;DR: FDA-CLIP is a CLIP-based training framework that uses frame differences to generate dynamic region masks, guiding the model to focus on critical dynamic regions while suppressing static background redundancy for improved text-video retrieval.
Details
Motivation: Address limitations of existing text-video retrieval methods: reliance on large-scale annotated data and significant modal gap between video and text features. Existing CLIP adaptation methods lack dynamic feature enhancement and fail to suppress static redundant features.
Method: Proposes FDA-CLIP framework that uses frame differences to generate dynamic region masks, which are input into Alpha-CLIP as an additional Alpha channel to guide attention to semantically critical dynamic regions.
Result: Experiments show that frame difference-guided video semantic encoding effectively balances retrieval efficiency and accuracy.
Conclusion: FDA-CLIP provides a concise CLIP-based training framework that enhances dynamic video feature representation and suppresses static redundancy for improved text-video alignment.
Abstract: With the rapid growth of video data, text-video retrieval technology has become increasingly important in numerous application scenarios such as recommendation and search. Early text-video retrieval methods suffer from two critical drawbacks: first, they heavily rely on large-scale annotated video-text pairs, leading to high data acquisition costs; second, there is a significant modal gap between video and text features, which limits cross-modal alignment accuracy. With the development of vision-language models, adapting CLIP to video tasks has attracted great attention. However, existing adaptation methods generally lack enhancement for dynamic video features and fail to effectively suppress static redundant features. To address this issue, this paper proposes FDA-CLIP (Frame Difference Alpha-CLIP), which is a concise CLIP-based training framework for text-video alignment. Specifically, the method uses frame differences to generate dynamic region masks, which are input into Alpha-CLIP as an additional Alpha channel. This proactively guides the model to focus on semantically critical dynamic regions while suppressing static background redundancy. Experiments demonstrate that frame difference-guided video semantic encoding can effectively balance retrieval efficiency and accuracy.
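A minimal sketch of the frame-difference masking step in PyTorch; the absolute-difference statistic, the threshold, and the hard binarization are assumptions, since the abstract does not specify how the masks are computed.

```python
import torch

def frame_difference_alpha(frames: torch.Tensor, thresh: float = 0.1) -> torch.Tensor:
    """Dynamic-region masks from consecutive frame differences.

    frames: (T, C, H, W) clip with values in [0, 1]. Returns a
    (T-1, 1, H, W) binary mask usable as the extra alpha channel
    Alpha-CLIP expects.
    """
    diff = (frames[1:] - frames[:-1]).abs().mean(dim=1, keepdim=True)
    return (diff > thresh).float()
```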
[250] TEn-CATG:Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph
Yaru Chen, Faegheh Sardari, Peiliang Zhang, Ruohao Guo, Yang Xiang, Zhenbo Li, Wenwu Wang
Main category: cs.CV
TL;DR: TEn-CATG is a text-enriched audio-visual video parsing framework that combines semantic calibration with category-aware temporal reasoning to improve weakly supervised event detection in videos.
Details
Motivation: Existing AVVP methods either overfit noisy pseudo-labels with attention-based models or weaken temporal localization accuracy by uniformly distributing attention across frames in pseudo-label generation approaches.
Method: Proposes TEn-CATG with two key modules: (1) Bi-directional text fusion (BiT) module that uses audio-visual features as semantic anchors to refine text embeddings, and (2) Category-aware temporal graph (CATG) module that models temporal relationships by selecting multi-scale temporal neighbors and learning category-specific temporal decay factors.
Result: Achieves state-of-the-art results across multiple evaluation metrics on benchmark datasets LLP and UnAV-100, demonstrating robustness and superior ability to capture complex temporal and semantic dependencies.
Conclusion: TEn-CATG effectively addresses the limitations of existing weakly supervised AVVP methods by combining semantic calibration with category-aware temporal reasoning, leading to improved performance in event detection and temporal localization.
Abstract: Audio-visual video parsing (AVVP) aims to detect event categories and their temporal boundaries in videos, typically under weak supervision. Existing methods mainly focus on (i) improving temporal modeling using attention-based architectures or (ii) generating richer pseudo-labels to address the absence of frame-level annotations. However, attention-based models often overfit noisy pseudo-labels, leading to cumulative training errors, while pseudo-label generation approaches distribute attention uniformly across frames, weakening temporal localization accuracy. To address these challenges, we propose TEn-CATG, a text-enriched AVVP framework that combines semantic calibration with category-aware temporal reasoning. More specifically, we design a bi-directional text fusion (BiT) module by leveraging audio-visual features as semantic anchors to refine text embeddings, which departs from conventional text-to-feature alignment, thereby mitigating noise and enhancing cross-modal consistency. Furthermore, we introduce the category-aware temporal graph (CATG) module to model temporal relationships by selecting multi-scale temporal neighbors and learning category-specific temporal decay factors, enabling effective event-dependent temporal reasoning. Extensive experiments demonstrate that TEn-CATG achieves state-of-the-art results across multiple evaluation metrics on benchmark datasets LLP and UnAV-100, highlighting its robustness and superior ability to capture complex temporal and semantic dependencies in weakly supervised AVVP tasks.
[251] Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs
Jiaao Yu, Shenwei Li, Mingjie Han, Yifei Yin, Wenzheng Song, Chenghao Jia, Man Lan
Main category: cs.CV
TL;DR: The paper introduces a novel fine-tuning task called Masked Prediction via Context and Commonsense (MPCC) to enhance multimodal reasoning in vision-language models, addressing limitations in current approaches that fail to fully exploit visual context and commonsense knowledge.
Details
Motivation: Current reasoning models focus heavily on single-modal language settings and fail to adapt well to real-world multimodal scenarios. Existing approaches either remain confined to perception-centric tasks or reduce images to textual summaries, limiting generalization of reasoning capabilities across diverse multimodal environments.
Method: Proposed MPCC task that forces models to integrate visual context and commonsense reasoning by reconstructing semantically meaningful content from occluded images. Introduced Reinforcement Fine-tuning with Prior Sampling training method and developed MPCC Eval benchmark for systematic evaluation.
Result: The proposed approach enhances model performance and improves generalized reasoning capabilities in out-of-distribution (OOD) and cross-task scenarios.
Conclusion: The MPCC framework successfully bridges the gap in multimodal reasoning by forcing models to integrate visual context and commonsense knowledge, enabling better generalization across diverse multimodal environments.
Abstract: Recent breakthroughs in reasoning models have markedly advanced the reasoning capabilities of large language models, particularly via training on tasks with verifiable rewards. Yet, a significant gap persists in their adaptation to real-world multimodal scenarios, most notably vision-language tasks, due to a heavy focus on single-modal language settings. While efforts to transplant reinforcement learning techniques from NLP to VLMs have emerged, these approaches often remain confined to perception-centric tasks or reduce images to textual summaries, failing to fully exploit visual context and commonsense knowledge, ultimately constraining the generalization of reasoning capabilities across diverse multimodal environments. To address this limitation, we introduce a novel fine-tuning task, Masked Prediction via Context and Commonsense, which forces models to integrate visual context and commonsense reasoning by reconstructing semantically meaningful content from occluded images, thereby laying the foundation for generalized reasoning. To systematically evaluate model performance in generalized reasoning, we developed a specialized evaluation benchmark, MPCC Eval, and employed various fine-tuning strategies to guide reasoning. Among these, we introduced an innovative training method, Reinforcement Fine-tuning with Prior Sampling, which not only enhances model performance but also improves its generalized reasoning capabilities in OOD and cross-task scenarios.
[252] Semantic Relation-Enhanced CLIP Adapter for Domain Adaptive Zero-Shot Learning
Jiaao Yu, Mingjie Han, Jinkun Jiang, Junyu Dong, Tao Gong, Man Lan
Main category: cs.CV
TL;DR: SRE-CLIP is a novel framework that enhances CLIP for Domain-Adaptive Zero-Shot Learning by addressing inefficient cross-category knowledge transfer and degraded cross-modal alignment through semantic relation guidance and alignment retention strategies.
Details
Motivation: Existing paradigms fail to balance cross-domain transfer and cross-category generalization in data-limited scenarios, creating demand for Domain-Adaptive Zero-Shot Learning (DAZSL). While CLIP has inherent advantages, current studies don't fully exploit its potential due to inefficient knowledge transfer and degraded alignment during fine-tuning.
Method: Proposed Semantic Relation-Enhanced CLIP (SRE-CLIP) Adapter framework with two key components: Semantic Relation Structure Loss for efficient cross-category knowledge transfer, and Cross-Modal Alignment Retention Strategy to maintain cross-modal alignment during target domain fine-tuning.
Result: SRE-CLIP achieves state-of-the-art performance on I2AwA and I2WebV benchmarks, significantly outperforming existing approaches as the first CLIP-based DAZSL method.
Conclusion: The proposed SRE-CLIP framework successfully addresses core challenges in applying CLIP to DAZSL, demonstrating superior performance through semantic relation guidance and cross-modal alignment preservation.
Abstract: The high cost of data annotation has spurred research on training deep learning models in data-limited scenarios. Existing paradigms, however, fail to balance cross-domain transfer and cross-category generalization, giving rise to the demand for Domain-Adaptive Zero-Shot Learning (DAZSL). Although vision-language models (e.g., CLIP) have inherent advantages in the DAZSL field, current studies do not fully exploit their potential. Applying CLIP to DAZSL faces two core challenges: inefficient cross-category knowledge transfer due to the lack of semantic relation guidance, and degraded cross-modal alignment during target domain fine-tuning. To address these issues, we propose a Semantic Relation-Enhanced CLIP (SRE-CLIP) Adapter framework, integrating a Semantic Relation Structure Loss and a Cross-Modal Alignment Retention Strategy. As the first CLIP-based DAZSL method, SRE-CLIP achieves state-of-the-art performance on the I2AwA and I2WebV benchmarks, significantly outperforming existing approaches.
[253] Embodied Navigation with Auxiliary Task of Action Description Prediction
Haru Kondoh, Asako Kanezaki
Main category: cs.CV
TL;DR: Proposes using language description as an auxiliary task in reinforcement learning for robot navigation to achieve both high performance and explainability.
Details
Motivation: Addresses the trade-off between explainability and performance in robot navigation systems, where complex decision systems become black-boxes that lack transparency.
Method: Incorporates action description as an auxiliary task in reinforcement learning using knowledge distillation from pre-trained vision-language models to overcome the lack of ground-truth data.
Result: Achieves both high navigation performance and the ability to describe actions, with state-of-the-art performance in semantic audio-visual navigation tasks.
Conclusion: The proposed approach successfully bridges the gap between explainability and performance in multimodal robot navigation through auxiliary language description tasks.
Abstract: The field of multimodal robot navigation in indoor environments has garnered significant attention in recent years. However, as tasks and methods become more advanced, the action decision systems tend to become more complex and operate as black-boxes. For a reliable system, the ability to explain or describe its decisions is crucial; however, there tends to be a trade-off in that explainable systems can not outperform non-explainable systems in terms of performance. In this paper, we propose incorporating the task of describing actions in language into the reinforcement learning of navigation as an auxiliary task. Existing studies have found it difficult to incorporate describing actions into reinforcement learning due to the absence of ground-truth data. We address this issue by leveraging knowledge distillation from pre-trained description generation models, such as vision-language models. We comprehensively evaluate our approach across various navigation tasks, demonstrating that it can describe actions while attaining high navigation performance. Furthermore, it achieves state-of-the-art performance in the particularly challenging multimodal navigation task of semantic audio-visual navigation.
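One way to read the training objective is a standard RL loss plus a description head distilled from a frozen vision-language model; the sketch below shows that combination, with the weighting and temperature as illustrative choices rather than values from the paper.

```python
import torch
import torch.nn.functional as F

def navigation_loss(rl_loss: torch.Tensor,
                    student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    aux_weight: float = 0.5,
                    temperature: float = 2.0) -> torch.Tensor:
    """RL objective plus a distilled action-description auxiliary task.

    student_logits / teacher_logits: (B, L, V) token logits for the
    action description; the teacher comes from a frozen pre-trained
    vision-language model, standing in for missing ground truth.
    """
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return rl_loss + aux_weight * kd
```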
[254] Hybrid Deep Learning Framework for Enhanced Diabetic Retinopathy Detection: Integrating Traditional Features with AI-driven Insights
Arpan Maity, Aviroop Pal, MD. Samiul Islam, Tamal Ghosh
Main category: cs.CV
TL;DR: A hybrid AI framework combining traditional feature extraction with deep learning for enhanced Diabetic Retinopathy detection, improving early diagnosis accuracy and reducing false negatives.
Details
Motivation: Diabetic Retinopathy is a major global concern, especially in India with its high diabetic population. Early screening is crucial as DR is asymptomatic initially, and undetected cases lead to irreversible vision loss.
Method: Hybrid diagnostic framework combining traditional handcrafted feature extraction (capturing clinical markers) with deep learning (automating hierarchical pattern recognition) using fundus imaging.
Result: The multimodal approach surpasses standalone DL methods, demonstrating superior classification performance and reduced false negatives.
Conclusion: This AI-driven hybrid framework enables scalable, accurate DR screening that is particularly crucial for diabetes-burdened regions.
Abstract: Diabetic Retinopathy (DR), a vision-threatening complication of Diabetes Mellitus (DM), is a major global concern, particularly in India, which has one of the highest diabetic populations. Prolonged hyperglycemia damages retinal microvasculature, leading to DR symptoms like microaneurysms, hemorrhages, and fluid leakage, which, if undetected, cause irreversible vision loss. Therefore, early screening is crucial as DR is asymptomatic in its initial stages. Fundus imaging aids precise diagnosis by detecting subtle retinal lesions. This paper introduces a hybrid diagnostic framework combining traditional feature extraction and deep learning (DL) to enhance DR detection. While handcrafted features capture key clinical markers, DL automates hierarchical pattern recognition, improving early diagnosis. The model synergizes interpretable clinical data with learned features, surpassing standalone DL approaches, demonstrating superior classification and reduced false negatives. This multimodal AI-driven approach enables scalable, accurate DR screening, crucial for diabetes-burdened regions.
[255] Comparative Analysis of Object Detection Algorithms for Surface Defect Detection
Arpan Maity, Tamal Ghosh
Main category: cs.CV
TL;DR: Comparison of six object detection algorithms on NEU-DET surface defect dataset shows YOLOv11 achieves 70% higher accuracy than competitors.
Details
Motivation: To evaluate performance of prominent object detection algorithms for industrial quality control applications, specifically metal surface defect detection.
Method: Tested YOLOv11, RetinaNet, Fast R-CNN, YOLOv8, RT-DETR, and DETR on NEU-DET dataset containing various metal surface defects. Assessed detection accuracy, speed, and robustness across defect types.
Result: YOLOv11 demonstrated superior performance with 70% higher accuracy on average. Its enhanced feature extraction, single forward pass processing, and architectural optimizations enabled faster and more precise defect detection.
Conclusion: YOLOv11 is the most effective model for surface defect detection on NEU dataset, outperforming competing algorithms by substantial margin in both accuracy and speed.
Abstract: This article compares the performance of six prominent object detection algorithms, YOLOv11, RetinaNet, Fast R-CNN, YOLOv8, RT-DETR, and DETR, on the NEU-DET surface defect detection dataset, comprising images representing various metal surface defects, a crucial application in industrial quality control. Each model’s performance was assessed regarding detection accuracy, speed, and robustness across different defect types such as scratches, inclusions, and rolled-in scales. YOLOv11, a state-of-the-art real-time object detection algorithm, demonstrated superior performance compared to the other methods, achieving a remarkable 70% higher accuracy on average. This improvement can be attributed to YOLOv11’s enhanced feature extraction capabilities and ability to process the entire image in a single forward pass, making it faster and more efficient in detecting minor surface defects. Additionally, YOLOv11’s architecture optimizations, such as improved anchor box generation and deeper convolutional layers, contributed to more precise localization of defects. In conclusion, YOLOv11’s outstanding performance in accuracy and speed solidifies its position as the most effective model for surface defect detection on the NEU dataset, surpassing competing algorithms by a substantial margin.
[256] SITS-DECO: A Generative Decoder Is All You Need For Multitask Satellite Image Time Series Modelling
Samuel J. Barrett, Docko Sow
Main category: cs.CV
TL;DR: SITS-DECO is a decoder-only generative model for Earth Observation data that uses unified token sequences and next-token prediction to perform multiple tasks without task-specific adaptation, outperforming larger models on crop classification.
Details
Motivation: To address the limitations of existing EO foundation models that require adaptation and are rigidly structured around specific data sources or training approaches, by applying the unified-sequence approach from large language models to EO data.
Method: Uses a simple GPT-style decoder-only architecture with symbolic prompting to perform multiple supervised and self-supervised tasks within a single unified framework, leveraging dense temporal sequence modeling without spatial context.
Result: Outperforms much larger EO foundation models on crop-type classification (PASTIS-R), demonstrating that dense temporal sequence modeling is a critical missing ingredient in current approaches.
Conclusion: SITS-DECO exemplifies a data-centric paradigm where capability arises from training data diversity rather than architectural complexity, providing a lightweight route to multi-modal, multi-task EO modeling and bridging toward future generative EO foundation models.
Abstract: Earth Observation (EO) Foundation Modelling (FM) holds great promise for simplifying and improving the use of EO data for diverse real-world tasks. However, most existing models require additional adaptation before they can be used and are structured rigidly around particular data sources or training approaches. To address this, we take inspiration from large language models, where diverse tasks, both pre-training and downstream, are implicitly captured through next-token prediction over unified token sequences, leveraging the structure and diversity of the training data. We introduce SITS-DECO (Satellite Image Time Series-DECoder Only), a proof-of-concept generative model that applies this unified-sequence framing to EO data. Using a simple GPT-style decoder-only architecture, we demonstrate its ability to perform useful EO tasks (pixel-wise, multi-temporal, multi-modal crop-type classification) in a purely generative framework. Through symbolic prompting, we show that the model can perform multiple supervised and self-supervised tasks within a single unified architecture, without task- or modality-specific adaptation. Despite its simplicity and lack of spatial context, SITS-DECO outperforms much larger EO foundation models on crop-type classification (PASTIS-R), demonstrating that dense temporal sequence modelling is a critical missing ingredient in the current paradigm. This work exemplifies a data-centric modelling paradigm in which capability arises from the diversity and structure of the training data rather than from architectural complexity. SITS-DECO provides a lightweight, practical route to multi-modal, multi-task EO modelling, and a conceptual bridge toward future generative EO foundation models.
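The unified-sequence framing can be pictured as: quantize the observations into tokens, prepend a symbolic task token as the prompt, and append the label so that plain next-token prediction performs the task. The sketch below shows one plausible layout; the bin count, vocabulary offset, and token order are assumptions, not the paper's scheme.

```python
import torch

def build_sequence(pixel_series: torch.Tensor, task_id: int, label_id: int,
                   n_bins: int = 256, vocab_offset: int = 16) -> torch.Tensor:
    """Frame a pixel time series as a GPT-style token sequence.

    pixel_series: (T,) normalized observations in [0, 1], quantized
    into discrete tokens. The task token is the symbolic prompt and
    the label token is the generative target.
    """
    obs_tokens = (pixel_series.clamp(0, 1) * (n_bins - 1)).long() + vocab_offset
    return torch.cat([
        torch.tensor([task_id]),   # symbolic prompt: which task to perform
        obs_tokens,                # the observations themselves
        torch.tensor([label_id]),  # supervised target, predicted generatively
    ])
```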
[257] Gestura: A LVLM-Powered System Bridging Motion and Semantics for Real-Time Free-Form Gesture Understanding
Zhuoming Li, Aitong Liu, Mengxi Jia, Tengxiang Zhang, Dell Zhang, Xuelong Li
Main category: cs.CV
TL;DR: Gestura is an end-to-end system that uses a pre-trained Large Vision-Language Model with anatomical hand priors and Chain-of-Thought reasoning to achieve robust free-form gesture understanding, addressing limitations of previous solutions.
Details
Motivation: Existing free-form gesture understanding solutions like GestureGPT suffer from limited recognition accuracy and slow response times, creating a need for more effective systems that can handle highly dynamic and diverse gesture patterns.
Method: Gestura combines a pre-trained LVLM with a Landmark Processing Module that embeds anatomical hand priors to capture subtle movements, and uses Chain-of-Thought reasoning for step-by-step semantic inference to transform shallow knowledge into deep understanding.
Result: The system achieves robust and adaptable free-form gesture comprehension. Additionally, the authors created the first open-source dataset for free-form gesture intention reasoning with over 300,000 annotated QA pairs.
Conclusion: Gestura’s integration of LVLM with anatomical hand priors and CoT reasoning enables effective free-form gesture understanding, overcoming previous limitations and providing a foundation for future research through the new dataset.
Abstract: Free-form gesture understanding is highly appealing for human-computer interaction, as it liberates users from the constraints of predefined gesture categories. However, the sole existing solution, GestureGPT, suffers from limited recognition accuracy and slow response times. In this paper, we propose Gestura, an end-to-end system for free-form gesture understanding. Gestura harnesses a pre-trained Large Vision-Language Model (LVLM) to align the highly dynamic and diverse patterns of free-form gestures with high-level semantic concepts. To better capture subtle hand movements across different styles, we introduce a Landmark Processing Module that compensates for LVLMs’ lack of fine-grained domain knowledge by embedding anatomical hand priors. Further, a Chain-of-Thought (CoT) reasoning strategy enables step-by-step semantic inference, transforming shallow knowledge into deep semantic understanding and significantly enhancing the model’s ability to interpret ambiguous or unconventional gestures. Together, these components allow Gestura to achieve robust and adaptable free-form gesture comprehension. Additionally, we have developed the first open-source dataset for free-form gesture intention reasoning and understanding with over 300,000 annotated QA pairs.
[258] Prompt fidelity of ChatGPT4o / Dall-E3 text-to-image visualisations
Dirk HR Spennemann
Main category: cs.CV
TL;DR: Study examines ChatGPT4o/DALL-E3’s prompt fidelity by analyzing how well specified attributes are rendered in generated images, finding 15.6% deviation rate across 710 attributes.
Details
Motivation: To assess the accuracy of text-to-image AI models in rendering explicitly specified attributes from prompts, particularly for bias detection and model evaluation purposes.
Method: Analyzed 430 visualizations from two public-domain datasets (200 of women in cultural industries, 230 of museum curators) across personal attributes, appearance, and paraphernalia categories.
Result: 15.6% of all attributes (n=710) were incorrectly rendered. Error rates: lowest for paraphernalia, moderate for appearance, highest for personal attributes (especially age).
Conclusion: Demonstrates measurable prompt-to-image fidelity gaps in DALL-E3, with implications for detecting biases and evaluating model performance.
Abstract: This study examines the prompt fidelity of ChatGPT4o / DALL-E3 text-to-image visualisations by analysing whether attributes explicitly specified in autogenously generated prompts are correctly rendered in the resulting images. Using two public-domain datasets comprising 200 visualisations of women working in the cultural and creative industries and 230 visualisations of museum curators, the study assessed accuracy across personal attributes (age, hair), appearance (attire, glasses), and paraphernalia (name tags, clipboards). While correctly rendered in most cases, DALL-E3 deviated from prompt specifications in 15.6% of all attributes (n=710). Errors were lowest for paraphernalia, moderate for personal appearance, and highest for depictions of the person themselves, particularly age. These findings demonstrate measurable prompt-to-image fidelity gaps with implications for bias detection and model evaluation.
[259] Wavelet-based GAN Fingerprint Detection using ResNet50
Sai Teja Erukude, Suhasnadh Reddy Veluru, Viswa Chaitanya Marella
Main category: cs.CV
TL;DR: A wavelet-based method using DWT preprocessing and ResNet50 classifier effectively detects StyleGAN-generated images by exploiting unique frequency-domain artifacts, achieving 95.1% accuracy.
Details
Motivation: Identifying GAN-generated images has become a significant challenge in digital image forensics, requiring robust detection methods.
Method: Uses discrete wavelet transform (DWT) with Haar and Daubechies filters for preprocessing, followed by ResNet50 classification to detect subtle artifacts in frequency domain.
Result: Haar and Daubechies models achieved 93.8% and 95.1% accuracy respectively, significantly outperforming spatial domain model (81.5%). Daubechies performed best.
Conclusion: GAN-generated images have unique wavelet-domain artifacts, and wavelet-domain analysis is highly effective for GAN image detection, with potential for improving deepfake detection systems.
Abstract: Identifying images generated by Generative Adversarial Networks (GANs) has become a significant challenge in digital image forensics. This research presents a wavelet-based detection method that uses discrete wavelet transform (DWT) preprocessing and a ResNet50 classification layer to differentiate the StyleGAN-generated images from real ones. Haar and Daubechies wavelet filters are applied to convert the input images into multi-resolution representations, which are then fed to a ResNet50 network for classification, capitalizing on subtle artifacts left by the generative process. Moreover, the wavelet-based models are compared to an identical ResNet50 model trained on spatial data. The Haar and Daubechies preprocessed models achieved accuracies of 93.8 percent and 95.1 percent, respectively, much higher than the model developed in the spatial domain (accuracy of 81.5 percent). The Daubechies-based model outperforms Haar, showing that adding layers of descriptive frequency patterns can lead to even greater distinguishing power. These results indicate that the GAN-generated images have unique wavelet-domain artifacts or “fingerprints.” The method proposed illustrates the effectiveness of wavelet-domain analysis to detect GAN images and emphasizes the potential of further developing the capabilities of future deepfake detection systems.
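A minimal sketch of the preprocessing-plus-classifier pipeline using PyWavelets and torchvision. Stacking the four DWT subbands as input channels, and widening ResNet50's first convolution to accept them, is one reasonable arrangement that the abstract does not spell out.

```python
import numpy as np
import pywt
import torch
from torchvision.models import resnet50

def wavelet_channels(gray_img: np.ndarray, wavelet: str = "haar") -> torch.Tensor:
    """Single-level 2D DWT of a grayscale image, stacked for a CNN.

    Returns a (1, 4, H/2, W/2) tensor holding the approximation and
    three detail subbands (LL, LH, HL, HH).
    """
    cA, (cH, cV, cD) = pywt.dwt2(gray_img.astype(np.float32), wavelet)
    return torch.from_numpy(np.stack([cA, cH, cV, cD])).unsqueeze(0)

# Binary real-vs-GAN head; the first conv is widened from 3 to 4 input
# channels to take the subband stack (an assumed adaptation).
model = resnet50(num_classes=2)
model.conv1 = torch.nn.Conv2d(4, 64, kernel_size=7, stride=2,
                              padding=3, bias=False)
```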
[260] Explainable Deep Learning in Medical Imaging: Brain Tumor and Pneumonia Detection
Sai Teja Erukude, Viswa Chaitanya Marella, Suhasnadh Reddy Veluru
Main category: cs.CV
TL;DR: This paper presents an explainable deep learning framework using ResNet50 and DenseNet121 for detecting brain tumors in MRI and pneumonia in chest X-rays, achieving high accuracy with DenseNet121 performing better, and uses Grad-CAM for interpretability.
Details
Motivation: To address the lack of interpretability in deep learning models for medical imaging, which hampers clinical trust and adoption, by developing explainable AI frameworks.
Method: Used ResNet50 and DenseNet121 CNNs trained on Kaggle datasets (7,023 brain MRI and 5,863 chest X-ray images) with Grad-CAM integration for heatmap visualizations to show influential regions in decision-making.
Result: DenseNet121 outperformed ResNet50 with 94.3% vs 92.5% accuracy for brain tumors and 89.1% vs 84.4% for pneumonia. Grad-CAM showed DenseNet121 focused on core pathological regions while ResNet50 sometimes scattered attention.
Conclusion: Combining deep learning with explainable AI offers a promising path toward reliable, interpretable, and clinically useful diagnostic tools in medical imaging.
Abstract: Deep Learning (DL) holds enormous potential for improving medical imaging diagnostics, yet the lack of interpretability in most models hampers clinical trust and adoption. This paper presents an explainable deep learning framework for detecting brain tumors in MRI scans and pneumonia in chest X-ray images using two leading Convolutional Neural Networks, ResNet50 and DenseNet121. These models were trained on publicly available Kaggle datasets comprising 7,023 brain MRI images and 5,863 chest X-ray images, achieving high classification performance. DenseNet121 consistently outperformed ResNet50 with 94.3 percent vs. 92.5 percent accuracy for brain tumors and 89.1 percent vs. 84.4 percent accuracy for pneumonia. For better explainability, Gradient-weighted Class Activation Mapping (Grad-CAM) was integrated to create heatmap visualizations superimposed on the test images, indicating the most influential image regions in the decision-making process. Interestingly, while both models produced accurate results, Grad-CAM showed that DenseNet121 consistently focused on core pathological regions, whereas ResNet50 sometimes scattered attention to peripheral or non-pathological areas. Combining deep learning and explainable AI offers a promising path toward reliable, interpretable, and clinically useful diagnostic tools.
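Grad-CAM itself is standard; below is a minimal PyTorch version of the heatmap computation behind such visualizations (the target layer and the normalization at the end are the user's choice).

```python
import torch
import torch.nn.functional as F

def grad_cam(model, layer, image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Weight a conv layer's activations by the spatially pooled
    gradients of the class score, ReLU, and upsample to input size.

    image: (1, C, H, W) preprocessed input; returns a (1, 1, H, W) map.
    """
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))
    score = model(image)[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["v"].mean(dim=(2, 3), keepdim=True)   # pooled gradients
    cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    return cam / (cam.max() + 1e-8)
```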
[261] Precise classification of low quality G-banded Chromosome Images by reliability metrics and data pruning classifier
Mojtaba Moattari
Main category: cs.CV
TL;DR: This paper improves chromosome classification precision using reliability thresholding metrics and engineered features for low-quality images in resource-constrained settings.
Details
Motivation: Current karyotyping systems require high-quality training data that is unavailable in some pathological labs, leading to false positives in low-cost systems with poor image quality.
Method: Proposed reliability thresholding metrics and engineered features, evaluated using Alex-Net neural network, SVM, K-Nearest Neighbors, and cascade pipelines for automated filtering of semi-straight chromosomes.
Result: Classification precision improved to over 90% for chromosomes with common defects and translocations. A comparative analysis identified the best thresholding metric.
Conclusion: The proposed metrics and pruning method are suitable for karyotyping facilities in poor countries and low-budget pathological laboratories, verified by high precision results on low-quality G-banding database.
Abstract: In the last decade, due to high-resolution cameras and accurate metaphase analyses, the accuracy of chromosome classification has improved substantially. However, current karyotyping systems demand large amounts of high-quality training data to achieve adequate precision for each chromosome. Such high-quality training data, acquired with accurate devices, is not yet available in some outlying pathological laboratories. To prevent false positive detections in low-cost systems and low-quality image settings, this paper improves the classification precision of chromosomes using proposed reliability thresholding metrics and deliberately engineered features. The proposed method has been evaluated using a variant of the deep AlexNet neural network, SVM, K-Nearest Neighbors, and their cascade pipelines for automated filtering of semi-straight chromosomes. The classification results improved to over 90% for the chromosomes with more common defects and translocations. Furthermore, a comparative analysis of the proposed thresholding metrics has been conducted and the best metric is highlighted along with its salient characteristics. The high precision achieved on a very low-quality G-banding database verifies the suitability of the proposed metrics and pruning method for karyotyping facilities in poor countries and low-budget pathological laboratories.
[262] Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images
Yichi Zhang, Zhuo Chen, Lingbing Guo, Lei Liang, Wen Zhang, Huajun Chen
Main category: cs.CV
TL;DR: This paper addresses the challenge of abstractive reasoning in multi-modal models by introducing a data engine and training framework for Multi-Modal Relational Knowledge (MMRK), achieving superior performance with smaller models compared to GPT-4o.
Details
Motivation: Current multi-modal large language models struggle with abstractive reasoning, particularly with Multi-Modal Relational Knowledge (MMRK) representing relational structures in node-edge formats, which remains under-explored.
Method: Developed an automatic STAR data engine to synthesize images with MMRK and create multi-modal instruction data, plus a two-stage capability enhancement training framework with tailored evaluation protocols.
Result: Created STAR-64K dataset with 64K high-quality samples and showed that 3B/7B models trained with their framework significantly outperform GPT-4o in structured abstractive reasoning tasks.
Conclusion: The proposed data engine and training framework effectively enhance abstractive reasoning capabilities in multi-modal models, demonstrating strong performance, transferability, and scalability.
Abstract: Understanding and reasoning with abstractive information from the visual modality presents significant challenges for current multi-modal large language models (MLLMs). Among the various forms of abstractive information, Multi-Modal Relational Knowledge (MMRK), which represents abstract relational structures between multi-modal entities using node-edge formats, remains largely under-explored. In particular, STructured and Abstractive Reasoning (STAR) on such data has received little attention from the research community. To bridge the dual gaps in large-scale high-quality data and capability enhancement methodologies, this paper makes the following key contributions: (i) an automatic STAR data engine capable of synthesizing images with MMRK to build multi-modal instruction data with reliable chain-of-thought thinking for various STAR tasks and (ii) a comprehensive two-stage capability enhancement training framework, accompanied by a suite of evaluation protocols tailored to different STAR tasks. Based upon these contributions, we introduce STAR-64K, a dataset comprising 64K high-quality multi-modal instruction samples, and conduct experiments across 5 open-source MLLMs. Experimental results show that our two-stage enhancement framework enables smaller 3B/7B models to significantly outperform GPT-4o in STAR. Additionally, we provide in-depth analysis regarding the effectiveness of various designs, data transferability, and scalability.
[263] A Flow Model with Low-Rank Transformers for Incomplete Multimodal Survival Analysis
Yi Yin, Yuntao Shou, Zao Dai, Yun Peng, Tao Meng, Wei Ai, Keqin Li
Main category: cs.CV
TL;DR: A novel framework combining low-rank Transformer with flow-based generative model for robust multimodal survival prediction, addressing incomplete modality issues through cross-modal distribution alignment and consistent latent space reconstruction.
Details
Motivation: Real-world multimodal medical datasets often have incomplete modalities due to acquisition limitations, and existing methods ignore distributional discrepancies across modalities, leading to unreliable reconstruction.
Method: Uses class-specific flow for cross-modal distribution alignment, normalizing flow model for distribution-consistent latent space construction, and low-rank Transformer for intra-modal dependency modeling to prevent overfitting in high-dimensional fusion.
Result: Achieves state-of-the-art performance under complete modality settings and maintains robust accuracy under incomplete modality scenarios.
Conclusion: The proposed framework effectively addresses incomplete multimodal survival analysis by ensuring distribution consistency and reliable modality reconstruction, demonstrating superior performance across various scenarios.
Abstract: In recent years, multimodal medical data-based survival analysis has attracted much attention. However, real-world datasets often suffer from the problem of incomplete modality, where some patient modality information is missing due to acquisition limitations or system failures. Existing methods typically infer missing modalities directly from observed ones using deep neural networks, but they often ignore the distributional discrepancy across modalities, resulting in inconsistent and unreliable modality reconstruction. To address these challenges, we propose a novel framework that combines a low-rank Transformer with a flow-based generative model for robust and flexible multimodal survival prediction. Specifically, we first formulate the concerned problem as incomplete multimodal survival analysis using the multi-instance representation of whole slide images (WSIs) and genomic profiles. To realize incomplete multimodal survival analysis, we propose a class-specific flow for cross-modal distribution alignment. Under the condition of class labels, we model and transform the cross-modal distribution. By virtue of the reversible structure and accurate density modeling capabilities of the normalizing flow model, the model can effectively construct a distribution-consistent latent space of the missing modality, thereby improving the consistency between the reconstructed data and the true distribution. Finally, we design a lightweight Transformer architecture to model intra-modal dependencies while alleviating the overfitting problem in high-dimensional modality fusion through its low-rank structure. Extensive experiments demonstrate that our method not only achieves state-of-the-art performance under complete-modality settings, but also maintains robust and superior accuracy under incomplete-modality scenarios.
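The class-specific flow can be pictured as a stack of label-conditioned invertible layers. Below is a minimal class-conditional affine coupling layer in PyTorch, a generic normalizing-flow building block used here as a stand-in for the paper's flow; the dimensions and conditioning scheme are assumptions.

```python
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """One label-conditioned affine coupling layer.

    Half the features are rescaled and shifted using parameters
    predicted from the other half plus a class embedding, keeping the
    map invertible with a cheap log-determinant. A full flow would
    stack several such layers with permutations in between.
    """
    def __init__(self, dim: int, n_classes: int, hidden: int = 128):
        super().__init__()
        self.half = dim // 2
        self.embed = nn.Embedding(n_classes, hidden)
        self.net = nn.Sequential(
            nn.Linear(self.half + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([x1, self.embed(y)], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)                  # keep scales well-behaved
        z2 = x2 * log_s.exp() + t
        return torch.cat([x1, z2], dim=-1), log_s.sum(dim=-1)  # z, log|det J|

    def inverse(self, z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(torch.cat([z1, self.embed(y)], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        return torch.cat([z1, (z2 - t) * (-log_s).exp()], dim=-1)
```

The exact log-determinant is what lets such a flow model the missing modality's density directly, which is the property the abstract leans on for distribution-consistent reconstruction.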
[264] Towards Accurate and Efficient Waste Image Classification: A Hybrid Deep Learning and Machine Learning Approach
Ngoc-Bao-Quang Nguyen, Tuan-Minh Do, Cong-Tam Phan, Thi-Thu-Hong Phan
Main category: cs.CV
TL;DR: A comprehensive comparison of ML, DL, and hybrid approaches for garbage classification shows that hybrid methods combining deep feature extraction with classical classifiers achieve up to 100% accuracy while reducing feature dimensionality by 95% without performance loss.
Details
Motivation: Automated image-based garbage classification is critical for waste management, but systematic benchmarks integrating ML, DL, and hybrid solutions remain underdeveloped.
Method: Compared three paradigms: (1) ML with handcrafted features, (2) DL architectures (ResNet variants, EfficientNetV2S), and (3) hybrid approach using deep models for feature extraction combined with classical classifiers (SVM, Logistic Regression).
Result: Hybrid method consistently outperformed others, achieving up to 100% accuracy on TrashNet and refined Household dataset, and 99.87% on Garbage Classification dataset, surpassing state-of-the-art benchmarks. Feature selection reduced dimensionality by over 95% without compromising accuracy.
Conclusion: This work establishes reliable benchmarks for waste classification and introduces an efficient hybrid framework that achieves high accuracy while reducing inference cost, making it suitable for scalable deployment in resource-constrained environments.
Abstract: Automated image-based garbage classification is a critical component of global waste management; however, systematic benchmarks that integrate Machine Learning (ML), Deep Learning (DL), and efficient hybrid solutions remain underdeveloped. This study provides a comprehensive comparison of three paradigms: (1) machine learning algorithms using handcrafted features, (2) deep learning architectures, including ResNet variants and EfficientNetV2S, and (3) a hybrid approach that utilizes deep models for feature extraction combined with classical classifiers such as Support Vector Machine and Logistic Regression to identify the most effective strategy. Experiments on three public datasets - TrashNet, Garbage Classification, and a refined Household Garbage Dataset (with 43 corrected mislabels) - demonstrate that the hybrid method consistently outperforms the others, achieving up to 100% accuracy on TrashNet and the refined Household set, and 99.87% on Garbage Classification, thereby surpassing state-of-the-art benchmarks. Furthermore, feature selection reduces feature dimensionality by over 95% without compromising accuracy, resulting in faster training and inference. This work establishes more reliable benchmarks for waste classification and introduces an efficient hybrid framework that achieves high accuracy while reducing inference cost, making it suitable for scalable deployment in resource-constrained environments.
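As a concrete illustration of the hybrid paradigm, the hedged sketch below extracts deep features with a frozen EfficientNetV2-S and trains an SVM on a selected subset of them; the dataset path, backbone choice, and k=64 feature budget are assumptions for illustration, not the authors' exact configuration.

```python
# Hybrid paradigm sketch: frozen CNN as feature extractor + classical classifier.
import torch
import torchvision.models as models
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, f_classif

device = "cuda" if torch.cuda.is_available() else "cpu"

# EfficientNetV2-S backbone with its classification head removed.
backbone = models.efficientnet_v2_s(weights="IMAGENET1K_V1")
backbone.classifier = torch.nn.Identity()
backbone.eval().to(device)

transform = T.Compose([T.Resize((384, 384)), T.ToTensor(),
                       T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
dataset = ImageFolder("trashnet/train", transform=transform)  # hypothetical path
loader = DataLoader(dataset, batch_size=32)

feats, labels = [], []
with torch.no_grad():
    for x, y in loader:
        feats.append(backbone(x.to(device)).cpu())
        labels.append(y)
X = torch.cat(feats).numpy()
y = torch.cat(labels).numpy()

# Feature selection can prune most dimensions before the classical stage.
X_sel = SelectKBest(f_classif, k=64).fit_transform(X, y)
clf = SVC(kernel="rbf").fit(X_sel, y)
print("train accuracy:", clf.score(X_sel, y))
```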
[265] Neural Stereo Video Compression with Hybrid Disparity Compensation
Shiyin Jiang, Zhenghao Chen, Minghao Han, Shuhang Gu
Main category: cs.CV
TL;DR: Proposes a hybrid disparity compensation (HDC) strategy for stereo video compression that combines explicit pixel displacement with implicit cross-attention mechanisms to better exploit cross-view redundancy.
Details
Motivation: To improve stereo video compression by better capturing cross-view redundancy through a more effective disparity compensation approach that overcomes limitations of existing explicit shifting and implicit cross-attention methods.
Method: HDC first computes similarity maps from horizontally shifted cross-view features, then normalizes them into explicit pixel-wise attention scores to perform cross-attention for implicit feature alignment. This is integrated into HDC-based modules for cross-view feature extraction/reconstruction and cross-view entropy modeling.
Result: Extensive experiments on KITTI 2012, KITTI 2015, and Nagoya benchmarks show the framework outperforms both neural and traditional stereo video compression methods across autonomous driving and general scenes.
Conclusion: The hybrid disparity compensation strategy effectively combines explicit and implicit approaches, providing a robust solution for stereo video compression that captures broader disparity information and achieves superior performance.
Abstract: Disparity compensation represents the primary strategy in stereo video compression (SVC) for exploiting cross-view redundancy. These mechanisms can be broadly categorized into two types: one that employs explicit horizontal shifting, and another that utilizes an implicit cross-attention mechanism to reduce cross-view disparity redundancy. In this work, we propose a hybrid disparity compensation (HDC) strategy that leverages explicit pixel displacement as a robust prior feature to simplify optimization and perform implicit cross-attention mechanisms for subsequent warping operations, thereby capturing a broader range of disparity information. Specifically, HDC first computes a similarity map by fusing the horizontally shifted cross-view features to capture pixel displacement information. This similarity map is then normalized into an “explicit pixel-wise attention score” to perform the cross-attention mechanism, implicitly aligning features from one view to another. Building upon HDC, we introduce a novel end-to-end optimized neural stereo video compression framework, which integrates HDC-based modules into key coding operations, including cross-view feature extraction and reconstruction (HDC-FER) and cross-view entropy modeling (HDC-EM). Extensive experiments on SVC benchmarks, including KITTI 2012, KITTI 2015, and Nagoya, which cover both autonomous driving and general scenes, demonstrate that our framework outperforms both neural and traditional SVC methodologies.
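The shift-then-attend idea can be sketched compactly. The PyTorch toy below scores explicit horizontal shifts of one view's features against the other view and uses the softmax-normalized scores as attention weights for implicit alignment; the maximum disparity, wrap-around shifting, and tensor shapes are illustrative simplifications (a real codec would pad rather than wrap).

```python
# Minimal sketch of hybrid disparity compensation: similarity over explicit
# horizontal shifts, normalized into attention that softly aligns views.
import torch

def hdc_align(f_tgt, f_src, max_disp=16):
    """Align f_src (B,C,H,W) to f_tgt via shift-based cross-attention."""
    B, C, H, W = f_src.shape
    shifted, scores = [], []
    for d in range(max_disp):
        s = torch.roll(f_src, shifts=d, dims=3)           # explicit shift by d px
        shifted.append(s)
        # Per-pixel similarity between target and shifted source features.
        scores.append((f_tgt * s).sum(dim=1, keepdim=True) / C ** 0.5)
    shifted = torch.stack(shifted, dim=0)                 # (D,B,C,H,W)
    attn = torch.softmax(torch.stack(scores, 0), dim=0)   # (D,B,1,H,W)
    # Implicit alignment: attention-weighted sum over candidate disparities.
    return (attn * shifted).sum(dim=0)

left, right = torch.randn(1, 64, 32, 48), torch.randn(1, 64, 32, 48)
print(hdc_align(left, right).shape)  # torch.Size([1, 64, 32, 48])
```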
[266] Evaluating ChatGPT’s Performance in Classifying Pneumonia from Chest X-Ray Images
Pragna Prahallad, Pranathi Prahallad
Main category: cs.CV
TL;DR: GPT-4o was tested for zero-shot chest X-ray classification (NORMAL vs PNEUMONIA) using 400 images. Concise prompts achieved 74% accuracy, while reasoning prompts performed worse, showing limited clinical reliability.
Details
Motivation: To evaluate GPT-4o's zero-shot capability for medical image classification without fine-tuning, assessing its potential for clinical applications.
Method: Used a balanced test set of 400 chest X-ray images (200 per class) with four different prompt designs ranging from minimal to reasoning-based instructions.
Result: Concise, feature-focused prompts achieved highest accuracy of 74%, while reasoning-oriented prompts resulted in lower performance.
Conclusion: GPT-4o shows emerging potential for medical image interpretation but has limited diagnostic reliability; requires advances in visual reasoning and domain adaptation for safe clinical use.
Abstract: In this study, we evaluate the ability of OpenAI’s gpt-4o model to classify chest X-ray images as either NORMAL or PNEUMONIA in a zero-shot setting, without any prior fine-tuning. A balanced test set of 400 images (200 from each class) was used to assess performance across four distinct prompt designs, ranging from minimal instructions to detailed, reasoning-based prompts. The results indicate that concise, feature-focused prompts achieved the highest classification accuracy of 74%, whereas reasoning-oriented prompts resulted in lower performance. These findings highlight that while ChatGPT exhibits emerging potential for medical image interpretation, its diagnostic reliability remains limited. Continued advances in visual reasoning and domain-specific adaptation are required before such models can be safely applied in clinical practice.
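A minimal reproduction of this setup might look like the following, using the OpenAI Python client's image-input chat API; the prompt wording and file path are assumptions rather than the paper's exact materials.

```python
# Illustrative zero-shot chest X-ray classification call in the spirit of the study.
import base64
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

with open("chest_xray_001.jpeg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

# A concise, feature-focused prompt of the kind the study found to work best.
prompt = ("You are shown a chest X-ray. Answer with exactly one word, "
          "NORMAL or PNEUMONIA, based on lung opacity and consolidation.")

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```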
[267] Kernel Density Steering: Inference-Time Scaling via Mode Seeking for Image Restoration
Yuyang Hu, Kangfu Mei, Mojtaba Sahraee-Ardakan, Ulugbek S. Kamilov, Peyman Milanfar, Mauricio Delbracio
Main category: cs.CV
TL;DR: Kernel Density Steering (KDS) is a novel inference-time framework that uses an ensemble of diffusion samples to improve image restoration quality by steering patches toward shared high-density regions, reducing artifacts and enhancing fidelity.
Details
Motivation: Existing diffusion models for image restoration often produce inconsistent fidelity and undesirable artifacts, requiring a more robust approach to ensure high-quality outputs.
Method: KDS employs an N-particle ensemble of diffusion samples, computes patch-wise kernel density estimation gradients from their collective outputs, and uses these gradients to steer patches toward shared higher-density regions, acting as collective wisdom to avoid spurious modes.
Result: Extensive numerical validations show KDS substantially improves both quantitative and qualitative performance on challenging real-world super-resolution and image inpainting tasks.
Conclusion: KDS provides a plug-and-play framework that enhances diffusion model performance for image restoration without requiring retraining or external verifiers, achieving better quality samples at the cost of higher computational resources.
Abstract: Diffusion models show promise for image restoration, but existing methods often struggle with inconsistent fidelity and undesirable artifacts. To address this, we introduce Kernel Density Steering (KDS), a novel inference-time framework promoting robust, high-fidelity outputs through explicit local mode-seeking. KDS employs an $N$-particle ensemble of diffusion samples, computing patch-wise kernel density estimation gradients from their collective outputs. These gradients steer patches in each particle towards shared, higher-density regions identified within the ensemble. This collective local mode-seeking mechanism, acting as “collective wisdom”, steers samples away from spurious modes prone to artifacts, arising from independent sampling or model imperfections, and towards more robust, high-fidelity structures. This allows us to obtain better quality samples at the expense of higher compute by simultaneously sampling multiple particles. As a plug-and-play framework, KDS requires no retraining or external verifiers, seamlessly integrating with various diffusion samplers. Extensive numerical validations demonstrate KDS substantially improves both quantitative and qualitative performance on challenging real-world super-resolution and image inpainting tasks.
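The core mode-seeking update can be sketched as a mean-shift step on patches across the ensemble. In the toy below, each particle's patch moves along a kernel density gradient estimated from the other particles; patch extraction, the coupling to the diffusion sampler, and the bandwidth/step sizes are illustrative assumptions.

```python
# Mean-shift-style KDE steering at a single patch location across N particles.
import torch

def kds_step(patches, bandwidth=0.5, step=0.1):
    """patches: (N, D) - the same patch location across an N-particle ensemble."""
    diff = patches.unsqueeze(0) - patches.unsqueeze(1)     # diff[i, j] = p_j - p_i
    sq = (diff ** 2).sum(-1)                               # (N, N) squared distances
    w = torch.softmax(-sq / (2 * bandwidth ** 2), dim=1)   # kernel weights over j
    grad = (w.unsqueeze(-1) * diff).sum(dim=1)             # mean-shift direction
    return patches + step * grad                           # steer toward shared modes

particles = torch.randn(8, 64)  # 8 particles, an 8x8 grayscale patch flattened
print(kds_step(particles).shape)  # torch.Size([8, 64])
```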
[268] Improving the Physics of Video Generation with VJEPA-2 Reward Signal
Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano
Main category: cs.CV
TL;DR: The winning entry of the PhysicsIQ Challenge couples the MAGI-1 video generator with VJEPA-2 as a reward signal, improving the physics plausibility of generated videos by ~6%.
Details
Motivation: State-of-the-art video generative models lack physical understanding despite visual realism, while SSL pretraining on natural videos has shown emergent physics understanding.
Method: Built on the MAGI-1 video generative model and coupled it with VJEPA-2 (Video Joint Embedding Predictive Architecture 2), using VJEPA-2 as a reward signal to guide generation.
Result: Improved physics plausibility of state-of-the-art video generative models by approximately 6%.
Conclusion: SSL-based video world models can effectively improve physics plausibility in video generation when used as guidance signals.
Abstract: This is a short technical report describing the winning entry of the PhysicsIQ Challenge, presented at the Perception Test Workshop at ICCV 2025. State-of-the-art video generative models exhibit severely limited physical understanding, and often produce implausible videos. The Physics IQ benchmark has shown that visual realism does not imply physics understanding. Yet, intuitive physics understanding has been shown to emerge from SSL pretraining on natural videos. In this report, we investigate whether we can leverage SSL-based video world models to improve the physics plausibility of video generative models. In particular, we build on top of the state-of-the-art video generative model MAGI-1 and couple it with the recently introduced Video Joint Embedding Predictive Architecture 2 (VJEPA-2) to guide the generation process. We show that by leveraging VJEPA-2 as a reward signal, we can improve the physics plausibility of state-of-the-art video generative models by ~6%.
[269] LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models
Shubhang Bhatnagar, Andy Xu, Kar-Han Tan, Narendra Ahuja
Main category: cs.CV
TL;DR: LUQ is a novel layerwise ultra-low bit quantization method for multimodal LLMs that selectively applies ultra-low bit quantization to resilient layers, reducing memory usage by 31-40% compared to 4-bit models with less than 10% performance degradation.
Details
Motivation: Multimodal LLMs require huge memory and computational resources, but existing post-training quantization methods developed for language models are less effective for multimodal models due to higher statistical variance in multimodal tokens and activations.
Method: Proposed LUQ (Layerwise Ultra-Low Bit Quantization) which analyzes activation distributions across layers and selectively applies ultra-low bit quantization to layers with lower entropy distributions that are more resilient to quantization. Also uses mixed multimodal tokens for PTQ calibration.
Result: Evaluated on LLaVA-1.5 and Qwen-2.5-VL across 9 VQA benchmarks. LUQ models use 40% and 31% less memory than 4-bit counterparts respectively, with less than 10% performance degradation on MME benchmark.
Conclusion: Layerwise selective ultra-low bit quantization is an effective strategy for compressing multimodal LLMs, leveraging the varying tolerance of different layers to quantization while maintaining acceptable performance.
Abstract: Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully compressed language models to as low as 1-bit precision without significant performance loss, its effectiveness for multimodal LLMs (MLLMs) remains relatively unexplored. In this paper, we present the first study on ultra-low bit (<4-bit) quantization for multimodal LLMs. Our analysis reveals that multimodal tokens and intermediate layer activations produced by them exhibit significantly higher statistical variance and entropy compared to text tokens, making them less tolerant to ultra-low bit quantization. However, the activation distributions of multimodal tokens vary significantly across layers, with some layers having lower entropy activation distributions. We empirically show that such layers in these models can better tolerate ultra-low bit quantization. Building on these insights, we propose a novel strategy for MLLM quantization, LUQ: Layerwise Ultra-Low Bit Quantization, which selectively applies ultra-low bit quantization to layers that are more resilient to it. Additionally, we also show that using a mix of multimodal tokens (image and text) for PTQ boosts VQA performance in the ultra-low bit regime. We evaluate our method on LLaVA-1.5 and Qwen-2.5-VL across 9 popular VQA benchmarks. The resulting LUQ models use 40% and 31% less memory than their 4-bit counterparts, respectively, while exhibiting a performance degradation of less than 10% on the MME benchmark.
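A hedged sketch of the two ingredients, entropy-based layer selection and weight ternarization, is given below; the histogram binning, magnitude threshold, and selection budget are illustrative stand-ins for the paper's procedure, and the layers are assumed to be Linear/Conv modules with a `.weight` attribute.

```python
# Sketch of LUQ-style layer selection: ternarize only the layers whose
# calibration activations have the lowest entropy.
import torch

def activation_entropy(act, bins=256):
    hist = torch.histc(act.float().flatten(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * p.log()).sum().item()

def ternarize_(weight, sparsity=0.6):
    # Keep the largest-magnitude weights, map them to {-s, +s}, zero the rest.
    thresh = weight.abs().flatten().quantile(sparsity)
    mask = weight.abs() > thresh
    scale = weight.abs()[mask].mean() if mask.any() else weight.new_tensor(0.0)
    weight.copy_(torch.where(mask, scale * weight.sign(), torch.zeros_like(weight)))

def apply_luq(layers, calib_acts, budget=0.7):
    """layers: {name: module}; calib_acts: {name: activation tensor}."""
    ranked = sorted(layers, key=lambda n: activation_entropy(calib_acts[n]))
    selected = ranked[: int(len(ranked) * budget)]   # lowest-entropy layers
    for name in selected:
        ternarize_(layers[name].weight.data)
    return selected
```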
[270] RatioWaveNet: A Learnable RDWT Front-End for Robust and Interpretable EEG Motor-Imagery Classification
Marco Siino, Giuseppe Bonomo, Rosario Sorbello, Ilenia Tinnirello
Main category: cs.CV
TL;DR: RatioWaveNet enhances EEG-based motor imagery BCIs by adding a trainable wavelet transform front end to a CNN-Transformer backbone, improving robustness especially for challenging subjects while maintaining efficiency.
Details
Motivation: To address challenges in non-invasive EEG-based BCIs including nonstationarity, low SNR, and subject variability, particularly improving reliability for the most difficult subjects where BCIs typically fail.
Method: Augments TCFormer (temporal CNN-Transformer backbone) with Rationally-Dilated Wavelet Transform front end that performs undecimated multi-resolution subband decomposition, followed by grouped convolutions, multi-kernel CNN, grouped-query attention encoder, and compact TCN head.
Result: On BCI-IV-2a and BCI-IV-2b datasets, improves worst-subject accuracy by +0.17/+0.42 pp (2a) and +1.07/+2.54 pp (2b) across intra- and inter-subject protocols, with consistent average gains and modest computational overhead.
Conclusion: A simple, trainable wavelet front end effectively strengthens Transformer-based BCIs, improving worst-case reliability without sacrificing efficiency.
Abstract: Brain-computer interfaces (BCIs) based on motor imagery (MI) translate covert movement intentions into actionable commands, yet reliable decoding from non-invasive EEG remains challenging due to nonstationarity, low SNR, and subject variability. We present RatioWaveNet, which augments a strong temporal CNN-Transformer backbone (TCFormer) with a trainable, Rationally-Dilated Wavelet Transform (RDWT) front end. The RDWT performs an undecimated, multi-resolution subband decomposition that preserves temporal length and shift-invariance, enhancing sensorimotor rhythms while mitigating jitter and mild artifacts; subbands are fused via lightweight grouped 1-D convolutions and passed to a multi-kernel CNN for local temporal-spatial feature extraction, a grouped-query attention encoder for long-range context, and a compact TCN head for causal temporal integration. Our goal is to test whether this principled wavelet front end improves robustness precisely where BCIs typically fail - on the hardest subjects - and whether such gains persist on average across seeds under both intra- and inter-subject protocols. On BCI-IV-2a and BCI-IV-2b, across five seeds, RatioWaveNet improves worst-subject accuracy over the Transformer backbone by +0.17 / +0.42 percentage points (Sub-Dependent / LOSO) on 2a and by +1.07 / +2.54 percentage points on 2b, with consistent average-case gains and modest computational overhead. These results indicate that a simple, trainable wavelet front end is an effective plug-in to strengthen Transformer-based BCIs, improving worst-case reliability without sacrificing efficiency.
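An undecimated, length-preserving decomposition of this kind can be approximated with the à trous algorithm, as in the sketch below: dilated Haar filters split each EEG channel into subbands without downsampling. The filters are fixed here for brevity (wrapping them in nn.Parameter would make the front end trainable, as the paper describes); the number of levels and the kernel are illustrative assumptions.

```python
# Undecimated (stationary) wavelet front end sketch via dilated Haar filters.
import torch
import torch.nn.functional as F

def rdwt_subbands(x, levels=3):
    """x: (B, C, T) EEG; returns list of (B, C, T) subbands, coarse last."""
    lo = torch.tensor([0.5, 0.5]).view(1, 1, 2)    # Haar low-pass
    hi = torch.tensor([0.5, -0.5]).view(1, 1, 2)   # Haar high-pass
    B, C, T = x.shape
    approx, bands = x.reshape(B * C, 1, T), []
    for level in range(levels):
        d = 2 ** level                              # dilation doubles per level
        detail = F.conv1d(F.pad(approx, (d, 0)), hi, dilation=d)[..., :T]
        approx = F.conv1d(F.pad(approx, (d, 0)), lo, dilation=d)[..., :T]
        bands.append(detail.reshape(B, C, T))       # temporal length preserved
    bands.append(approx.reshape(B, C, T))
    return bands

eeg = torch.randn(4, 22, 1000)  # batch of 22-channel, 1000-sample MI trials
print([b.shape for b in rdwt_subbands(eeg)])
```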
[271] Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?
Michael Aerni, Joshua Swanson, Kristina Nikolić, Florian Tramèr
Main category: cs.CV
TL;DR: Current unified multimodal models exhibit modal aphasia - they can accurately reproduce visual concepts but fail to describe them correctly in text, despite being trained on both modalities simultaneously.
Details
Motivation: To identify and understand a systematic dissociation in multimodal models where visual memorization works well but textual articulation fails, and to examine the safety implications of this vulnerability.
Method: Conducted experiments using leading frontier models to generate movie artwork reproductions and textual descriptions, plus controlled experiments on synthetic datasets across multiple architectures.
Result: Models can generate near-perfect visual reproductions of iconic movie artwork but confuse crucial details in textual descriptions. Modal aphasia emerges as a fundamental property of current unified multimodal models, not just a training artifact.
Conclusion: Modal aphasia creates vulnerabilities in AI safety frameworks, as safeguards applied to one modality may leave harmful concepts accessible in other modalities - demonstrated by models aligned solely on text still generating unsafe images.
Abstract: We present modal aphasia, a systematic dissociation in which current unified multimodal models accurately memorize concepts visually but fail to articulate them in writing, despite being trained on images and text simultaneously. For one, we show that leading frontier models can generate near-perfect reproductions of iconic movie artwork, but confuse crucial details when asked for textual descriptions. We corroborate those findings through controlled experiments on synthetic datasets in multiple architectures. Our experiments confirm that modal aphasia reliably emerges as a fundamental property of current unified multimodal models, not just as a training artifact. In practice, modal aphasia can introduce vulnerabilities in AI safety frameworks, as safeguards applied to one modality may leave harmful concepts accessible in other modalities. We demonstrate this risk by showing how a model aligned solely on text remains capable of generating unsafe images.
[272] SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models
Gyubeum Lim, Yemo Koo, Vijay Krishna Madisetti
Main category: cs.CV
TL;DR: SCoPE VLM is a vision-language model that uses Chain of Scroll mechanism for efficient long-context document navigation, reducing memory usage while modeling human-like reading behaviors.
Details
Motivation: Current VLMs struggle with long-context visual information in agentic tasks like GUI control and web navigation, neglecting document structure understanding and being memory-intensive for local deployment.
Method: Proposes Chain of Scroll mechanism for selective recursive document navigation, dedicated data generation pipeline for training trajectories, and Episodic Group Relative Policy Optimization reinforcement learning method.
Result: Substantially reduces memory usage and effectively models human-like reading behaviors for multi-page document question answering.
Conclusion: SCoPE VLM is the first framework to explicitly model agentic reading patterns in multi-page document QA, advancing multimodal agent capabilities.
Abstract: Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web pages and GUI environments are inherently structured documents, current VLMs typically neglect decision-oriented document understanding in their training objectives. Existing approaches primarily extend visual embeddings to process long, high-resolution inputs, but these methods are memory-intensive and impractical for locally deployable solutions. To address these issues, we propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism to selectively and recursively navigate documents, focusing exclusively on relevant segments. We introduce a dedicated data generation pipeline to construct informative Chain of Scroll trajectories and Episodic Group Relative Policy Optimization, a tailored reinforcement learning method to reduce the gap between training and inference. Our method substantially reduces memory usage and effectively models human-like reading behaviors. To the best of our knowledge, SCoPE VLM is the first framework to explicitly model agentic reading patterns in multi-page document question answering, advancing the capabilities of multimodal agents.
[273] Poisson Flow Consistency Training
Anthony Zhang, Mahmut Gokmen, Dennis Hein, Rongjun Ge, Wenjun Xia, Ge Wang, Jin Chen
Main category: cs.CV
TL;DR: PFCT enables training Poisson Flow Consistency Models without distillation by using perturbation kernels, sinusoidal discretization, and Beta noise distribution, achieving competitive CT denoising results.
Details
Motivation: PFCM was limited to distillation training, restricting its potential across data modalities. The research aimed to develop an isolated training method for PFCM.
Method: Used perturbation kernel to remove dependency on pretrained PFGM++, introduced sinusoidal discretization schedule and Beta noise distribution for adaptability and sample quality improvement.
Result: Achieved improved low-dose CT image denoising in terms of LPIPS and SSIM metrics, with similar effectiveness to Consistency Model.
Conclusion: PFCT is a valid training method for PFCM that creates more flexibility in generative modeling, though further optimization and applicability studies are needed.
Abstract: The Poisson Flow Consistency Model (PFCM) is a consistency-style model based on the robust Poisson Flow Generative Model++ (PFGM++) which has achieved success in unconditional image generation and CT image denoising. Yet the PFCM can only be trained via distillation, which limits its potential in many data modalities. The objective of this research was to create a method to train the PFCM in isolation, called Poisson Flow Consistency Training (PFCT). The perturbation kernel was leveraged to remove the dependency on a pretrained PFGM++, and a sinusoidal discretization schedule and a Beta noise distribution were introduced to facilitate adaptability and improve sample quality. The model was applied to the task of low-dose computed tomography (CT) image denoising and improved low-dose images in terms of LPIPS and SSIM. It also displayed denoising effectiveness similar to models like the Consistency Model. Its effectiveness in denoising CT images establishes PFCT as a valid method for training the PFCM, with results competitive with other generative models. Further study is needed in the precise optimization of PFCT and in its applicability to other generative modeling tasks. The PFCT framework adds flexibility to how a PFCM can be created and applied in the field of generative modeling.
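For intuition, a speculative sketch of the two named ingredients follows; the exact functional forms of the sinusoidal schedule and the Beta sampling are assumptions, since the abstract does not spell them out.

```python
# Speculative sketch: sinusoidal discretization of the noise schedule and
# Beta-distributed sampling of training noise levels. All constants assumed.
import math
import torch

N, sigma_min, sigma_max = 40, 0.002, 80.0

# Sinusoidal spacing concentrates discretization steps near the data end.
i = torch.arange(N)
sigmas = sigma_min + (sigma_max - sigma_min) * torch.sin(math.pi / 2 * i / (N - 1))

# Beta-distributed index sampling for picking training noise levels.
idx = (torch.distributions.Beta(2.0, 5.0).sample((1024,)) * (N - 1)).long()
print(sigmas[:5], idx.float().mean())
```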
[274] A Multi-Stage Hybrid Framework for Automated Interpretation of Multi-View Engineering Drawings Using Vision Language Model
Muhammad Tayyab Khan, Zane Yong, Lequn Chen, Wenhe Feng, Nicholas Yew Jin Tan, Seung Ki Moon
Main category: cs.CV
TL;DR: A three-stage hybrid framework using YOLOv11 and Donut-based VLMs for automated interpretation of 2D multi-view engineering drawings, achieving strong performance in layout segmentation and semantic content parsing.
Details
Motivation: Manual interpretation of complex engineering drawings with dense annotations is challenging due to varied layouts, orientations, and mixed symbolic-textual content, requiring automated solutions.
Method: Three-stage framework: 1) YOLOv11-det for layout segmentation, 2) YOLOv11-obb for orientation-aware annotation detection, 3) Two Donut-based VLMs (Alphabetical and Numerical) for semantic content parsing without OCR.
Result: Alphabetical VLM achieved F1 score of 0.672, Numerical VLM achieved 0.963. Two specialized datasets developed: 1,000 drawings for layout detection and 1,406 for annotation-level training.
Conclusion: The framework provides scalable automated interpretation of engineering drawings with unified JSON output for seamless CAD integration, addressing challenges in manufacturing communication.
Abstract: Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting complex multi-view drawings with dense annotations remains challenging using manual methods, generic optical character recognition (OCR) systems, or traditional deep learning approaches, due to varied layouts, orientations, and mixed symbolic-textual content. To address these challenges, this paper proposes a three-stage hybrid framework for the automated interpretation of 2D multi-view engineering drawings using modern detection and vision language models (VLMs). In the first stage, YOLOv11-det performs layout segmentation to localize key regions such as views, title blocks, and notes. The second stage uses YOLOv11-obb for orientation-aware, fine-grained detection of annotations, including measures, GD&T symbols, and surface roughness indicators. The third stage employs two Donut-based, OCR-free VLMs for semantic content parsing: the Alphabetical VLM extracts textual and categorical information from title blocks and notes, while the Numerical VLM interprets quantitative data such as measures, GD&T frames, and surface roughness. Two specialized datasets were developed to ensure robustness and generalization: 1,000 drawings for layout detection and 1,406 for annotation-level training. The Alphabetical VLM achieved an overall F1 score of 0.672, while the Numerical VLM reached 0.963, demonstrating strong performance in textual and quantitative interpretation, respectively. The unified JSON output enables seamless integration with CAD and manufacturing databases, providing a scalable solution for intelligent engineering drawing analysis.
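A hedged sketch of the detect-crop-parse flow is shown below (stage 2's oriented-box detection is elided for brevity); the checkpoint names and the task token are placeholders, as the authors' fine-tuned weights are not named here.

```python
# Detect regions with YOLO, crop, then parse each crop with an OCR-free Donut VLM.
from PIL import Image
from ultralytics import YOLO
from transformers import DonutProcessor, VisionEncoderDecoderModel

layout_det = YOLO("layout_yolo11.pt")          # stage 1 (hypothetical weights)
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
parser = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

drawing = Image.open("drawing_001.png").convert("RGB")
regions = layout_det(drawing)[0]

records = []
for box in regions.boxes.xyxy.tolist():        # [x1, y1, x2, y2] per region
    crop = drawing.crop(tuple(map(int, box)))
    pixel_values = processor(crop, return_tensors="pt").pixel_values
    task = processor.tokenizer("<s_drawing>",  # hypothetical task prompt token
                               add_special_tokens=False,
                               return_tensors="pt").input_ids
    out = parser.generate(pixel_values, decoder_input_ids=task, max_length=256)
    records.append(processor.batch_decode(out, skip_special_tokens=True)[0])
print(records)  # parsed text fields, ready to serialize as unified JSON
```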
[275] LSF-Animation: Label-Free Speech-Driven Facial Animation via Implicit Feature Representation
Xin Lu, Chuanqing Zhuang, Chenxi Jin, Zhengda Lu, Yiqun Wang, Wu Liu, Jun Xiao
Main category: cs.CV
TL;DR: LSF-Animation is a speech-driven 3D facial animation framework that implicitly extracts emotion from speech and identity from neutral facial meshes, eliminating the need for explicit emotion/identity labels and improving generalization to unseen speakers.
Details
Motivation: Current emotion-aware facial animation methods rely on explicit one-hot encodings for identity and emotion, limiting generalization to unseen speakers and neglecting emotional cues in speech.
Method: Implicitly extracts emotion from speech and identity from neutral facial mesh, uses Hierarchical Interaction Fusion Block (HIFB) with fusion token to integrate emotional, motion, and identity cues through dual transformer features.
Result: Outperforms state-of-the-art methods on 3DMEAD dataset in emotional expressiveness, identity generalization, and animation realism.
Conclusion: LSF-Animation provides a more natural and adaptable facial animation framework that generalizes better to unseen speakers without requiring manual emotion/identity labels.
Abstract: Speech-driven 3D facial animation has attracted increasing interest owing to its potential to generate expressive and temporally synchronized digital humans. While recent works have begun to explore emotion-aware animation, they still depend on explicit one-hot encodings of given identity and emotion labels, which limits their ability to generalize to unseen speakers. Moreover, the emotional cues inherently present in speech are often neglected, limiting the naturalness and adaptability of generated animations. In this work, we propose LSF-Animation, a novel framework that eliminates the reliance on explicit emotion and identity feature representations. Specifically, LSF-Animation implicitly extracts emotion information from speech and captures identity features from a neutral facial mesh, enabling improved generalization to unseen speakers and emotional states without requiring manual labels. Furthermore, we introduce a Hierarchical Interaction Fusion Block (HIFB), which employs a fusion token to fuse dual transformer features and more effectively integrate emotional, motion-related and identity-related cues. Extensive experiments conducted on the 3DMEAD dataset demonstrate that our method surpasses recent state-of-the-art approaches in terms of emotional expressiveness, identity generalization, and animation realism. The source code will be released at: https://github.com/Dogter521/LSF-Animation.
[276] Addressing Corner Cases in Autonomous Driving: A World Model-based Approach with Mixture of Experts and LLMs
Haicheng Liao, Bonan Wang, Junxian Yang, Chengyue Wang, Zhengbin He, Guohui Zhang, Chengzhong Xu, Zhenning Li
Main category: cs.CV
TL;DR: WM-MoE is a world model-based motion forecasting framework that addresses corner-case scenarios in autonomous driving by unifying perception, memory, and decision making, leveraging LLMs for temporal reasoning and using mixture-of-experts for scenario decomposition.
Details
Motivation: Existing motion forecasting models underperform in safety-critical corner cases due to over-representation of common scenes in training data and limited generalization capabilities.
Method: Uses world model architecture with compact scene representation, LLMs for temporal reasoning via lightweight tokenizer, and mixture-of-experts to decompose corner cases into subproblems with specialized experts for intent inference and counterfactual rollouts.
Result: Outperforms state-of-the-art baselines on four benchmark datasets (nuScenes, NGSIM, HighD, MoCAD) and remains robust under corner-case and data-missing conditions.
Conclusion: World model-based architectures show promise for robust and generalizable motion forecasting in autonomous vehicles, particularly for handling challenging corner-case scenarios.
Abstract: Accurate and reliable motion forecasting is essential for the safe deployment of autonomous vehicles (AVs), particularly in rare but safety-critical scenarios known as corner cases. Existing models often underperform in these situations due to an over-representation of common scenes in training data and limited generalization capabilities. To address this limitation, we present WM-MoE, the first world model-based motion forecasting framework that unifies perception, temporal memory, and decision making to address the challenges of high-risk corner-case scenarios. The model constructs a compact scene representation that explains current observations, anticipates future dynamics, and evaluates the outcomes of potential actions. To enhance long-horizon reasoning, we leverage large language models (LLMs) and introduce a lightweight temporal tokenizer that maps agent trajectories and contextual cues into the LLM’s feature space without additional training, enriching temporal context and commonsense priors. Furthermore, a mixture-of-experts (MoE) is introduced to decompose complex corner cases into subproblems and allocate capacity across scenario types, and a router assigns scenes to specialized experts that infer agent intent and perform counterfactual rollouts. In addition, we introduce nuScenes-corner, a new benchmark that comprises four real-world corner-case scenarios for rigorous evaluation. Extensive experiments on four benchmark datasets (nuScenes, NGSIM, HighD, and MoCAD) showcase that WM-MoE consistently outperforms state-of-the-art (SOTA) baselines and remains robust under corner-case and data-missing conditions, indicating the promise of world model-based architectures for robust and generalizable motion forecasting in fully autonomous vehicles.
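The scenario-routing component can be illustrated with a standard top-k mixture-of-experts block, as below; the expert count, dimensions, and routing rule are generic assumptions rather than WM-MoE's exact design.

```python
# Generic top-k MoE routing sketch: a router dispatches each scene embedding
# to specialized experts and blends their outputs by the routing weights.
import torch
import torch.nn as nn

class ScenarioMoE(nn.Module):
    def __init__(self, dim=128, num_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_experts))
        self.k = k

    def forward(self, scene):                        # scene: (B, dim)
        logits = self.router(scene)
        weights, idx = logits.topk(self.k, dim=-1)   # route to top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(scene)
        for j in range(self.k):
            for e in range(len(self.experts)):       # gather per-expert batches
                sel = idx[:, j] == e
                if sel.any():
                    out[sel] += weights[sel, j:j+1] * self.experts[e](scene[sel])
        return out

scene = torch.randn(16, 128)
print(ScenarioMoE()(scene).shape)  # torch.Size([16, 128])
```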
[277] AI Powered Urban Green Infrastructure Assessment Through Aerial Imagery of an Industrial Township
Anisha Dutta
Main category: cs.CV
TL;DR: An AI-based approach using computer vision and deep learning to estimate urban canopy coverage from drone imagery, implemented on cloud infrastructure for efficient large-scale analysis.
Details
Motivation: Traditional urban canopy assessment methods face limitations in scalability, technical requirements, and expertise. Accurate canopy coverage assessment is crucial for urban planning, environmental monitoring, and climate change mitigation.
Method: Object-based image analysis using deep learning algorithms applied to high-resolution drone images, implemented on a cloud platform with high-performance processors to handle computational challenges of large datasets.
Result: The approach effectively estimates canopy coverage at city scale, providing detailed analysis of urban vegetation including canopy density variations and spatial distribution, enabling urban forestry management optimization.
Conclusion: This AI-based method generates valuable data for optimizing tree plantation and assessing carbon sequestration potential, contributing to sustainable urban planning and fostering more resilient urban environments.
Abstract: Accurate assessment of urban canopy coverage is crucial for informed urban planning, effective environmental monitoring, and mitigating the impacts of climate change. Traditional practices often face limitations due to demanding technical requirements, difficulties in scaling and data processing, and a lack of specialized expertise. This study presents an efficient approach for estimating green canopy coverage using artificial intelligence, specifically computer vision techniques, applied to aerial imagery. Our proposed methodology utilizes object-based image analysis built on deep learning algorithms to accurately identify and segment green canopies from high-resolution drone images. This approach allows for detailed analysis of urban vegetation, capturing variations in canopy density and understanding spatial distribution. To overcome the computational challenges associated with processing large datasets, the method was implemented on a cloud platform with high-performance processors. This infrastructure efficiently manages space complexity and ensures affordable latency, enabling the rapid analysis of vast amounts of drone imagery. Our results demonstrate the effectiveness of this approach in accurately estimating canopy coverage at the city scale, providing valuable insights for urban forestry management of an industrial township. The data generated by this method can be used to optimize tree plantation and assess the carbon sequestration potential of urban forests. By integrating these insights into sustainable urban planning, we can foster more resilient urban environments, contributing to a greener and healthier future.
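The final coverage computation is straightforward once a per-pixel canopy mask exists; the sketch below assumes such a mask and an illustrative ground sampling distance.

```python
# Coverage post-processing: fraction of canopy pixels, optionally converted
# to physical area via the drone's ground sampling distance (GSD).
import numpy as np

def canopy_coverage(mask: np.ndarray, gsd_m: float = 0.05):
    """mask: (H, W) boolean canopy mask; gsd_m: metres per pixel (assumed)."""
    frac = mask.mean()                       # fraction of image under canopy
    area_m2 = mask.sum() * gsd_m ** 2        # physical canopy area
    return frac, area_m2

mask = np.random.rand(2000, 2000) > 0.7      # stand-in for a model's output
frac, area = canopy_coverage(mask)
print(f"coverage: {frac:.1%}, area: {area:.0f} m^2")
```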
[278] TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge
Shu-Hao Zhang, Wei-Cheng Tang, Chen Wu, Peng Hu, Nan Li, Liang-Jie Zhang, Qi Zhang, Shao-Qun Zhang
Main category: cs.CV
TL;DR: TernaryCLIP is a lightweight framework that converts CLIP’s vision and text encoder weights to ternary format (1.58-bit), achieving significant compression, acceleration, and efficiency gains while maintaining performance on multimodal tasks.
Details
Motivation: To enable efficient deployment of large multimodal models like CLIP on resource-constrained devices by reducing computational costs and storage requirements through extreme quantization.
Method: Proposes TernaryCLIP framework that converts CLIP’s connection weights to ternary format using quantization-aware training and distillation modules to prevent precision degradation.
Result: Achieves 99% ternarized weights with 1.58-bit representation, 16.98× compression, 2.3× inference acceleration, 16× storage reduction, 10× memory optimization, and 60% sparsity while maintaining performance on 41 datasets for zero-shot classification and retrieval.
Conclusion: Demonstrates feasibility of extreme quantization for large multimodal models, supporting efficient deployment on resource-constrained devices with minimal performance loss.
Abstract: Recent years have witnessed an increasing interest in image-text contrastive modeling, exemplified by models such as Contrastive Language-Image Pretraining (CLIP). In this paper, we propose TernaryCLIP, a lightweight computational framework that converts the connection weights of both the vision and text encoders of CLIP into a ternary format, instead of a full-precision floating-point one. TernaryCLIP incorporates quantization-aware training and distillation modules, preventing precision degradation and enabling low-cost and high-efficiency computations. Comprehensive experiments demonstrate that TernaryCLIP can achieve up to 99% ternarized weights with 1.58-bit representation, a 16.98× compression ratio, 2.3× inference acceleration, 16× storage reduction, 10× memory optimization, and 60% sparsity while maintaining promising performance on zero-shot image classification and image-text retrieval tasks across 41 commonly used datasets. Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices. The model and code can be accessed from Hugging Face and GitHub.
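The standard mechanism behind such 1.58-bit weights (log2(3) ≈ 1.58 bits per ternary value) is quantization-aware training with a straight-through estimator, sketched below; the 0.7 threshold rule is a common ternarization heuristic, not necessarily TernaryCLIP's exact choice.

```python
# Ternary QAT sketch: forward pass sees {-s, 0, +s} weights, backward pass
# passes gradients straight through to the full-precision weights.
import torch

class TernarySTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        delta = 0.7 * w.abs().mean()                 # ternary threshold heuristic
        scale = w.abs()[w.abs() > delta].mean()
        q = torch.zeros_like(w)
        q[w > delta], q[w < -delta] = scale, -scale  # weights in {-s, 0, +s}
        return q

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                              # straight-through gradient

w = torch.randn(512, 512, requires_grad=True)
loss = (TernarySTE.apply(w) ** 2).sum()
loss.backward()                           # gradients flow to full-precision w
print(w.grad.shape)
```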
[279] Generative AI in Depth: A Survey of Recent Advances, Model Variants, and Real-World Applications
Shamim Yazdani, Akansha Singh, Nripsuta Saxena, Zichong Wang, Avash Palikhe, Deng Pan, Umapada Pal, Jie Yang, Wenbin Zhang
Main category: cs.CV
TL;DR: This survey paper provides a comprehensive taxonomy and framework for understanding the development of GANs, VAEs, and Diffusion Models, highlighting key innovations, ethical concerns, and future research directions.
Details
Motivation: The rapid advancement of deep learning generative models has made it difficult to stay current with the growing research volume, expanding applications, and unresolved technical challenges.
Method: The authors introduce a comprehensive taxonomy that organizes literature on GANs, VAEs, and DMs, including their variants and combined approaches, and provide a cohesive framework for understanding their development.
Result: The survey highlights key innovations that have improved quality, diversity, and controllability of generated outputs, reflecting the expanding potential of generative AI.
Conclusion: The paper outlines persistent challenges, proposes future research directions, and offers a structured perspective for researchers, while also examining ethical concerns and societal impacts of synthetic media.
Abstract: In recent years, deep learning based generative models, particularly Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models (DMs), have been instrumental in generating diverse, high-quality content across various domains, such as image and video synthesis. This capability has led to widespread adoption of these models and has captured strong public interest. As they continue to advance at a rapid pace, the growing volume of research, expanding application areas, and unresolved technical challenges make it increasingly difficult to stay current. To address this need, this survey introduces a comprehensive taxonomy that organizes the literature and provides a cohesive framework for understanding the development of GANs, VAEs, and DMs, including their many variants and combined approaches. We highlight key innovations that have improved the quality, diversity, and controllability of generated outputs, reflecting the expanding potential of generative artificial intelligence. In addition to summarizing technical progress, we examine rising ethical concerns, including the risks of misuse and the broader societal impact of synthetic media. Finally, we outline persistent challenges and propose future research directions, offering a structured and forward-looking perspective for researchers in this fast evolving field.
[280] Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers
Dogyun Park, Moayed Haji-Ali, Yanyu Li, Willi Menapace, Sergey Tulyakov, Hyunwoo J. Kim, Aliaksandr Siarohin, Anil Kag
Main category: cs.CV
TL;DR: SPRINT enables aggressive token dropping (up to 75%) in Diffusion Transformers while preserving quality through sparse-dense residual fusion and two-stage training.
Details
Motivation: Diffusion Transformers have quadratic training costs that make large-scale pretraining expensive, and existing token dropping methods either degrade quality or are parameter-heavy.
Method: SPRINT uses shallow layers to process all tokens for local detail, deeper layers on sparse subsets for efficiency, with residual fusion. Two-stage training: masked pre-training followed by full-token fine-tuning.
Result: Achieves 9.8x training savings on ImageNet-1K 256x256 with comparable FID/FDD. Inference Path-Drop Guidance nearly halves FLOPs while improving quality.
Conclusion: SPRINT provides a simple, effective, and general solution for efficient DiT training with significant computational savings and maintained performance.
Abstract: Diffusion Transformers (DiTs) deliver state-of-the-art generative performance but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. Token dropping can reduce training cost, yet naïve strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT, Sparse–Dense Residual Fusion for Efficient Diffusion Transformers, a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT leverages the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency followed by short full-token fine-tuning to close the train–inference gap. On ImageNet-1K 256x256, SPRINT achieves 9.8x training savings with comparable FID/FDD, and at inference, its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality. These results establish SPRINT as a simple, effective, and general solution for efficient DiT training.
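The sparse-dense residual pattern can be sketched in a few lines of PyTorch: shallow blocks see every token, deep blocks see a random keep-subset, and the sparse output is scattered back onto the dense stream. The block definitions and routing below are simplified assumptions built around the stated 75% drop ratio.

```python
# Sparse-dense residual fusion sketch for a DiT-style token sequence.
import torch
import torch.nn as nn

class SparseDenseDiT(nn.Module):
    def __init__(self, dim=256, shallow=2, deep=6, drop_ratio=0.75):
        super().__init__()
        enc = lambda: nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.shallow = nn.ModuleList(enc() for _ in range(shallow))
        self.deep = nn.ModuleList(enc() for _ in range(deep))
        self.drop_ratio = drop_ratio

    def forward(self, x):                       # x: (B, N, dim) token sequence
        for blk in self.shallow:
            x = blk(x)                          # dense: all tokens, local detail
        B, N, D = x.shape
        keep = max(1, int(N * (1 - self.drop_ratio)))
        idx = torch.rand(B, N, device=x.device).argsort(dim=1)[:, :keep]
        sparse = x.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
        for blk in self.deep:
            sparse = blk(sparse)                # sparse: ~25% of tokens
        out = x.clone()                         # residual fusion: scatter back
        out.scatter_add_(1, idx.unsqueeze(-1).expand(-1, -1, D), sparse)
        return out

tokens = torch.randn(2, 256, 256)
print(SparseDenseDiT()(tokens).shape)  # torch.Size([2, 256, 256])
```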
[281] LiteDiff
Ruchir Namjoshi, Nagasai Thadishetty, Vignesh Kumar, Hemanth Venkateshwara
Main category: cs.CV
TL;DR: Lite-Diff is a lightweight diffusion model adaptation method that uses residual adapter modules, latent morphological autoencoder, and pixel-level discriminator to efficiently fine-tune models for specialized domains like medical imaging with minimal data.
Details
Motivation: Fine-tuning diffusion models for specialized domains is challenging due to limited domain-specific data and high computational costs of full model adaptation.
Method: Integrates lightweight adaptation layers into frozen diffusion U-Net, uses latent morphological autoencoder for domain-specific latent consistency and pixel-level discriminator for adversarial alignment, while freezing base model weights.
Result: Achieves superior adaptation efficiency compared to naive full fine-tuning on three chest X-ray datasets, with optimal balance between efficiency and performance.
Conclusion: Provides a promising direction for transfer learning in diffusion models, facilitating deployment in diverse low-data domains.
Abstract: In recent years, diffusion models have demonstrated remarkable success in high-fidelity image synthesis. However, fine-tuning these models for specialized domains, such as medical imaging, remains challenging due to limited domain-specific data and the high computational cost of full model adaptation. In this paper, we introduce Lite-Diff (Lightweight Diffusion Model Adaptation), a novel fine-tuning approach that integrates lightweight adaptation layers into a frozen diffusion U-Net while enhancing training with a latent morphological autoencoder (for domain-specific latent consistency) and a pixel-level discriminator (for adversarial alignment). By freezing the weights of the base model and optimizing only small residual adapter modules, Lite-Diff significantly reduces the computational overhead and mitigates overfitting, even in minimal-data settings. Additionally, we conduct ablation studies to analyze the effects of selectively integrating adaptation layers in different U-Net blocks, revealing an optimal balance between efficiency and performance. Experiments on three chest X-ray datasets - (1) Kaggle Chest X-Ray Pneumonia, (2) NIH Chest X-ray14 and (3) VinBigData Chest X-ray - demonstrate that Lite-Diff achieves superior adaptation efficiency compared to naive full fine-tuning. Our framework provides a promising direction for transfer learning in diffusion models, facilitating their deployment in diverse low-data domains.
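The freeze-and-adapt core can be sketched as follows: all U-Net weights are frozen and small zero-initialized residual adapters are hooked onto selected blocks, so only the adapters train. The hook-based attachment and adapter shape are illustrative assumptions; the morphological autoencoder and discriminator losses are omitted.

```python
# Residual adapter sketch for a frozen diffusion U-Net.
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    def __init__(self, channels, bottleneck=8):
        super().__init__()
        self.down = nn.Conv2d(channels, bottleneck, 1)
        self.up = nn.Conv2d(bottleneck, channels, 1)
        nn.init.zeros_(self.up.weight)   # zero init: adapter starts as identity
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))   # residual update only

def attach_adapters(unet, block_names):
    """Freeze `unet`, bolt a trainable adapter onto each named block's output."""
    for p in unet.parameters():
        p.requires_grad_(False)
    adapters = nn.ModuleList()
    for name in block_names:
        block = dict(unet.named_modules())[name]
        adapter = ResidualAdapter(block.out_channels)  # assumes conv-style blocks
        block.register_forward_hook(lambda m, i, out, a=adapter: a(out))
        adapters.append(adapter)
    return adapters   # only these parameters go to the optimizer
```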
[282] Automatic Sign Language Recognition: A Hybrid CNN-LSTM Approach Based on Mediapipe
Fraisse Sacré Takouchouang, Ho Tuong Vinh
Main category: cs.CV
TL;DR: A hybrid CNN-LSTM system for automatic sign language recognition achieves 92% accuracy using Mediapipe for gesture keypoint extraction, with real-time translation capabilities.
Details
Motivation: Sign languages are marginalized, limiting deaf communities' access to essential services like healthcare and education, creating a need for automated recognition systems.
Method: Hybrid CNN-LSTM architecture using Mediapipe for gesture keypoint extraction, developed with Python, TensorFlow and Streamlit for real-time gesture translation.
Result: 92% average accuracy, with excellent performance for distinct gestures like ‘Hello’ and ‘Thank you’, but confusion remains for visually similar gestures like ‘Call’ and ‘Yes’.
Conclusion: The system shows promising potential for applications in healthcare, education and public services, though challenges remain with visually similar gestures.
Abstract: Sign languages play a crucial role in the communication of deaf communities, but they are often marginalized, limiting access to essential services such as healthcare and education. This study proposes an automatic sign language recognition system based on a hybrid CNN-LSTM architecture, using Mediapipe for gesture keypoint extraction. Developed with Python, TensorFlow and Streamlit, the system provides real-time gesture translation. The results show an average accuracy of 92%, with very good performance for distinct gestures such as "Hello" and "Thank you". However, some confusions remain for visually similar gestures, such as "Call" and "Yes". This work opens up interesting perspectives for applications in various fields such as healthcare, education and public services.
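A minimal sketch of the described pipeline follows: Mediapipe Holistic yields per-frame pose and hand keypoints, and a small Keras CNN-LSTM classifies the keypoint sequence; the sequence length, vocabulary size, and layer widths are illustrative.

```python
# Mediapipe keypoint extraction feeding a CNN-LSTM sequence classifier.
import numpy as np
import mediapipe as mp
import tensorflow as tf

holistic = mp.solutions.holistic.Holistic(static_image_mode=False)

def frame_keypoints(rgb_frame):
    """Return a flat vector of pose + both-hand landmarks (zeros if absent)."""
    res = holistic.process(rgb_frame)
    def flat(lm, n):
        return (np.array([[p.x, p.y, p.z] for p in lm.landmark]).flatten()
                if lm else np.zeros(n * 3))
    return np.concatenate([flat(res.pose_landmarks, 33),
                           flat(res.left_hand_landmarks, 21),
                           flat(res.right_hand_landmarks, 21)])  # 225 values

num_classes, seq_len, feat = 10, 30, 225
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(64, 3, activation="relu", input_shape=(seq_len, feat)),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```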
[283] VLM-SlideEval: Evaluating VLMs on Structured Comprehension and Perturbation Sensitivity in PPT
Hyeonsu Kang, Emily Bao, Anjan Goswami
Main category: cs.CV
TL;DR: VLM-SlideEval framework evaluates vision-language models on slide understanding across element extraction, robustness to perturbations, and narrative comprehension, revealing limitations in current VLMs.
Details
Motivation: Vision-language models are increasingly used to evaluate multimodal content like presentation slides, but their slide-specific understanding capabilities remain underexplored despite growing use in agentic pipelines.
Method: Developed VLM-SlideEval framework with three evaluation axes: element-level extraction from slide images, robustness to controlled perturbations (geometry, style, text), and higher-level comprehension including narrative order recovery from shuffled slides. Used publicly available decks from Zenodo with standardized ground-truth metadata from PowerPoint XML and live renderings.
Result: VLMs underperform on pixel-accurate extraction, show non-trivial agreement/fidelity/consistency under perturbations, perform better on single-slide content understanding, but fail to reliably capture narrative structure across slides.
Conclusion: Current VLMs have significant limitations for slide evaluation, motivating the need for calibrated, critic-in-the-loop evaluators that enable iterative refinement and selection in agentic pipelines.
Abstract: Vision-language models (VLMs) are increasingly used to evaluate multimodal content, including presentation slides, yet their slide-specific understanding remains underexplored despite their growing role as critics in agentic, model-forward pipelines. We introduce VLM-SlideEval, an evaluation framework that probes VLMs along three axes: (1) element-level extraction from slide images aligned to ground truth; (2) robustness to controlled perturbations in geometry, style, and text; and (3) higher-level comprehension, such as recovering a deck’s narrative order from shuffled slides. Using publicly available decks from Zenodo (https://huggingface.co/datasets/Forceless/Zenodo10K/viewer/default/pptx), we standardize ground-truth element metadata from PowerPoint XML and live renderings into a unified, verifiable schema. Empirically, VLMs underperform on pixel-accurate extraction and show non-trivial agreement, fidelity, and consistency under controlled perturbations, while performing better on single-slide content understanding; however, they do not reliably capture narrative structure across slides. These results highlight the limits of current VLMs for slide evaluation and motivate calibrated, critic-in-the-loop evaluators that drive iterative refinement and selection in agentic pipelines.
[284] Human-Centric Anomaly Detection in Surveillance Videos Using YOLO-World and Spatio-Temporal Deep Learning
Mohammad Ali Etemadi Naeen, Hoda Mohammadzade, Saeed Bagheri Shouraki
Main category: cs.CV
TL;DR: A deep learning framework for video anomaly detection that uses human-centric preprocessing with YOLO-World detection and ByteTrack tracking, followed by spatio-temporal modeling with InceptionV3 and BiLSTM, achieving 92.41% accuracy on UCF-Crime dataset.
Details
Motivation: Address challenges in surveillance video anomaly detection including diverse abnormal events, class imbalance, and scene-dependent visual clutter by focusing on behaviorally relevant foreground content.
Method: Human-centric preprocessing using YOLO-World for detection and ByteTrack for tracking, background suppression via Gaussian blurring, spatial feature extraction with InceptionV3, and temporal modeling with bidirectional LSTM for sequence classification.
Result: Achieved 92.41% mean test accuracy on five-class UCF-Crime subset, with per-class F1-scores consistently above 0.85, demonstrating strong generalization and resilience to class imbalance.
Conclusion: Foreground-focused preprocessing significantly enhances anomaly discrimination in real-world surveillance scenarios, confirming the effectiveness of the proposed framework.
Abstract: Anomaly detection in surveillance videos remains a challenging task due to the diversity of abnormal events, class imbalance, and scene-dependent visual clutter. To address these issues, we propose a robust deep learning framework that integrates human-centric preprocessing with spatio-temporal modeling for multi-class anomaly classification. Our pipeline begins by applying YOLO-World - an open-vocabulary vision-language detector - to identify human instances in raw video clips, followed by ByteTrack for consistent identity-aware tracking. Background regions outside detected bounding boxes are suppressed via Gaussian blurring, effectively reducing scene-specific distractions and focusing the model on behaviorally relevant foreground content. The refined frames are then processed by an ImageNet-pretrained InceptionV3 network for spatial feature extraction, and temporal dynamics are captured using a bidirectional LSTM (BiLSTM) for sequence-level classification. Evaluated on a five-class subset of the UCF-Crime dataset (Normal, Burglary, Fighting, Arson, Explosion), our method achieves a mean test accuracy of 92.41% across three independent trials, with per-class F1-scores consistently exceeding 0.85. Comprehensive evaluation metrics - including confusion matrices, ROC curves, and macro/weighted averages - demonstrate strong generalization and resilience to class imbalance. The results confirm that foreground-focused preprocessing significantly enhances anomaly discrimination in real-world surveillance scenarios.
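The human-centric preprocessing step is easy to reproduce in OpenCV, as sketched below; the box format and blur kernel size are assumptions.

```python
# Background suppression sketch: blur everything outside tracked person boxes.
import cv2
import numpy as np

def suppress_background(frame, boxes, ksize=(41, 41)):
    """frame: HxWx3 BGR image; boxes: list of (x1, y1, x2, y2) person boxes."""
    blurred = cv2.GaussianBlur(frame, ksize, 0)
    mask = np.zeros(frame.shape[:2], dtype=bool)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = True            # keep detected humans sharp
    out = blurred.copy()
    out[mask] = frame[mask]
    return out

frame = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
print(suppress_background(frame, [(100, 50, 300, 460)]).shape)
```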
[285] Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation
Zheng Qi, Chao Shang, Evangelia Spiliopoulou, Nikolaos Pappas
Main category: cs.CV
TL;DR: GIFT is a method that reduces hallucination in VLMs by tracking gaze shifts during query comprehension to create visual saliency maps, then using these maps to enhance attention to both salient visual regions and user queries.
Details
Motivation: VLMs often generate hallucinated content due to over-reliance on linguistic priors and visual attention sink problems where attention is misallocated to irrelevant visual regions, while existing methods fail to address cross-modal fusion balance.
Method: GIFT pre-computes holistic visual saliency maps by tracking positive changes in visual attention (gaze shifts) during user query comprehension, then amplifies attention to both salient visual information and user queries at each decoding step.
Result: GIFT achieves up to 20.7% improvement over greedy decoding in mitigating hallucination across both generative and classification tasks, while maintaining general vision-language performance with low computational overhead.
Conclusion: The proposed GIFT method effectively addresses visual attention sink and cross-modal fusion imbalance problems, significantly reducing hallucination in VLMs while preserving overall performance with minimal computational cost.
Abstract: Vision language models (VLMs) often generate hallucination, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect areas while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or “gaze shifts”, during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of visual attention sink, as irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.
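The gaze-shift computation can be sketched as accumulating positive attention changes across successive query tokens and rescaling attention by the resulting saliency; the tensor shapes and amplification rule below are assumptions about a generic VLM, not GIFT's exact formulation.

```python
# Gaze-shift saliency sketch: positive attention changes across query tokens.
import torch

def gaze_shift_saliency(attn):
    """attn: (Q, V) attention from each query token to each visual token."""
    shifts = (attn[1:] - attn[:-1]).clamp(min=0)   # keep positive shifts only
    saliency = shifts.sum(dim=0)                   # aggregate over the query
    return saliency / (saliency.sum() + 1e-8)      # normalized visual saliency

def amplify(attn_row, saliency, alpha=0.5):
    boosted = attn_row * (1 + alpha * saliency)    # boost salient visual tokens
    return boosted / boosted.sum()                 # renormalize to a distribution

attn = torch.rand(12, 576)           # 12 query tokens over 576 image tokens
sal = gaze_shift_saliency(attn)
print(amplify(attn[-1], sal).sum())  # still a valid distribution (~1.0)
```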
[286] Scanner-Agnostic MRI Harmonization via SSIM-Guided Disentanglement
Luca Caldera, Lara Cavinato, Francesca Ieva
Main category: cs.CV
TL;DR: A novel image-based harmonization framework for 3D T1-weighted brain MRI that disentangles anatomical content from scanner variations using SSIM-based loss, improving cross-site consistency and downstream task performance.
Details
Motivation: Variability from different MRI scanner models, acquisition protocols, and imaging sites hinders consistent analysis and generalizability across multicenter neuroimaging studies.
Method: Uses a differentiable SSIM-based loss to separate anatomical content from scanner-specific variations, enabling separate evaluation of luminance, contrast, and structural components. Trained on multiple public datasets with diverse scanners and sites.
Result: Harmonized images achieved strong alignment across acquisition settings: structural SSIM 0.97, luminance SSIM 0.98-0.99, reduced Wasserstein distances. Downstream improvements: brain age prediction MAE decreased from 5.36 to 3.30 years, Alzheimer’s classification AUC increased from 0.78 to 0.85.
Conclusion: The framework enhances cross-site image consistency, preserves anatomical fidelity, and improves downstream model performance, providing a robust solution for large-scale multicenter neuroimaging studies.
Abstract: The variability introduced by differences in MRI scanner models, acquisition protocols, and imaging sites hinders consistent analysis and generalizability across multicenter studies. We present a novel image-based harmonization framework for 3D T1-weighted brain MRI, which disentangles anatomical content from scanner- and site-specific variations. The model incorporates a differentiable loss based on the Structural Similarity Index (SSIM) to preserve biologically meaningful features while reducing inter-site variability. This loss enables separate evaluation of image luminance, contrast, and structural components. Training and validation were performed on multiple publicly available datasets spanning diverse scanners and sites, with testing on both healthy and clinical populations. Harmonization using multiple style targets, including style-agnostic references, produced consistent and high-quality outputs. Visual comparisons, voxel intensity distributions, and SSIM-based metrics demonstrated that harmonized images achieved strong alignment across acquisition settings while maintaining anatomical fidelity. Following harmonization, structural SSIM reached 0.97, luminance SSIM ranged from 0.98 to 0.99, and Wasserstein distances between mean voxel intensity distributions decreased substantially. Downstream tasks showed substantial improvements: mean absolute error for brain age prediction decreased from 5.36 to 3.30 years, and Alzheimer’s disease classification AUC increased from 0.78 to 0.85. Overall, our framework enhances cross-site image consistency, preserves anatomical fidelity, and improves downstream model performance, providing a robust and generalizable solution for large-scale multicenter neuroimaging studies.
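For reference, the standard SSIM decomposition the loss builds on is easy to state in code; this single-window sketch (a full implementation averages the terms over local Gaussian windows) uses assumed loss weights:

```python
import torch

def ssim_components(x, y, c1=1e-4, c2=9e-4):
    """Split SSIM into its luminance, contrast, and structure terms,
    following the standard definition (c1, c2 assume [0, 1] intensities)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(unbiased=False), y.var(unbiased=False)
    cov = ((x - mu_x) * (y - mu_y)).mean()
    c3 = c2 / 2
    lum = (2 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)
    con = (2 * var_x.sqrt() * var_y.sqrt() + c2) / (var_x + var_y + c2)
    struct = (cov + c3) / (var_x.sqrt() * var_y.sqrt() + c3)
    return lum, con, struct

def harmonization_loss(pred, target, w=(0.2, 0.2, 1.0)):
    """Assumed weighting: emphasize structure to preserve anatomy while
    letting luminance and contrast move toward the target style."""
    lum, con, struct = ssim_components(pred, target)
    return w[0] * (1 - lum) + w[1] * (1 - con) + w[2] * (1 - struct)
```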
[287] Mitigating Coordinate Prediction Bias from Positional Encoding Failures
Xingjian Tao, Yiwei Wang, Yujun Cai, Yihong Luo, Jing Tang
Main category: cs.CV
TL;DR: MLLMs struggle with precise coordinate prediction in high-resolution images due to weak positional encodings. The paper proposes VPSG, a test-time method that corrects directional biases by using shuffled positional encodings as negative guidance.
Details
Motivation: Multimodal LLMs perform well on vision-language tasks but fail at precise coordinate prediction, especially with high-resolution inputs that produce long token sequences and weaken positional encodings.
Method: Proposed Vision-PE Shuffle Guidance (VPSG) - a training-free test-time method that runs auxiliary decoding with shuffled visual positional encodings to isolate position-unconditioned biases, then uses this as negative evidence to guide digit prediction while preserving coordinate format.
Result: Experiments on ScreenSpot-Pro demonstrate reliable improvements in coordinate prediction accuracy, showing that addressing positional encoding robustness is critical for spatial reasoning in MLLMs.
Conclusion: Positional encoding failures are a key bottleneck for accurate coordinate prediction at scale, and VPSG effectively mitigates directional biases to improve spatial reasoning performance in MLLMs.
Abstract: Multimodal large language models (MLLMs) excel at vision-language tasks such as VQA and document understanding, yet precise coordinate prediction remains challenging. High-resolution inputs exacerbate this difficulty by producing long token sequences that weaken positional encodings and introduce directional biases in coordinate outputs. We investigate this phenomenon by analyzing how MLLMs behave when visual positional encodings (VPEs) are deliberately perturbed through shuffling. Our analysis reveals that such perturbations induce predictable, non-random coordinate biases rather than random errors, suggesting that models rely on internal positional priors when spatial grounding signals are degraded. Crucially, we observe similar directional error patterns in natural high-resolution datasets, indicating that positional encoding failures are a key bottleneck for accurate coordinate prediction at scale. To address this issue, we propose Vision-PE Shuffle Guidance (VPSG), a training-free test-time method that leverages the directional nature of these biases for correction. VPSG runs auxiliary decoding with shuffled VPEs to isolate position-unconditioned tendencies, then uses this as negative evidence to guide digit prediction while preserving coordinate format through a lightweight finite-state machine. Experiments on ScreenSpot-Pro demonstrate reliable improvements, highlighting positional encoding robustness as a critical factor for spatial reasoning in MLLMs.
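A minimal sketch of the negative-guidance step, assuming a contrastive-guidance combination rule and reducing the finite-state machine to a simple format check (both are illustrative, not the paper's exact design):

```python
import re
import numpy as np

def vpsg_guided_logits(logits_normal: np.ndarray,
                       logits_shuffled: np.ndarray,
                       gamma: float = 0.5) -> np.ndarray:
    """Use the shuffled-VPE pass as negative evidence: push the prediction
    away from the position-unconditioned tendency it exposes."""
    return (1.0 + gamma) * logits_normal - gamma * logits_shuffled

def valid_coordinate(text: str) -> bool:
    """Stand-in for the lightweight finite-state machine: accept only
    outputs already in a "(x, y)" coordinate format."""
    return re.fullmatch(r"\(\d{1,4},\s*\d{1,4}\)", text) is not None
```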
[288] Discovering Latent Graphs with GFlowNets for Diverse Conditional Image Generation
Bailey Trang, Parham Saremi, Alan Q. Wang, Fangrui Huang, Zahra TehraniNasab, Amar Kumar, Tal Arbel, Li Fei-Fei, Ehsan Adeli
Main category: cs.CV
TL;DR: Rainbow is a conditional image generation framework that decomposes input conditions into diverse latent representations using GFlowNets to capture uncertainty and generate multiple plausible images from ambiguous prompts.
Details
Motivation: Traditional methods for diverse image generation either modify random seeds (making differences hard to interpret) or diversify input prompts (limited in verbal interpretability), failing to properly address inherent uncertainty in conditions that can lead to multiple plausible outputs.
Method: Integrates a latent graph parameterized by GFlowNets into prompt representation computation, leveraging GFlowNets’ advanced graph sampling capabilities to produce multiple trajectories that represent different aspects of input condition uncertainty, leading to diverse condition representations and corresponding images.
Result: Evaluations on natural and medical image datasets show Rainbow improves both diversity and fidelity across image synthesis, image generation, and counterfactual generation tasks compared to traditional approaches.
Conclusion: Rainbow effectively addresses condition uncertainty in image generation by decomposing input conditions into diverse latent representations through GFlowNets, enabling generation of multiple plausible and interpretable images from ambiguous prompts.
Abstract: Capturing diversity is crucial in conditional and prompt-based image generation, particularly when conditions contain uncertainty that can lead to multiple plausible outputs. To generate diverse images reflecting this diversity, traditional methods often modify random seeds, making it difficult to discern meaningful differences between samples, or diversify the input prompt, which is limited in verbally interpretable diversity. We propose Rainbow, a novel conditional image generation framework, applicable to any pretrained conditional generative model, that addresses inherent condition/prompt uncertainty and generates diverse plausible images. Rainbow is based on a simple yet effective idea: decomposing the input condition into diverse latent representations, each capturing an aspect of the uncertainty and generating a distinct image. First, we integrate a latent graph, parameterized by Generative Flow Networks (GFlowNets), into the prompt representation computation. Second, leveraging GFlowNets’ advanced graph sampling capabilities to capture uncertainty and output diverse trajectories over the graph, we produce multiple trajectories that collectively represent the input condition, leading to diverse condition representations and corresponding output images. Evaluations on natural image and medical image datasets demonstrate Rainbow’s improvement in both diversity and fidelity across image synthesis, image generation, and counterfactual generation tasks.
[289] GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation
Karim Elmaaroufi, Liheng Lai, Justin Svegliato, Yutong Bai, Sanjit A. Seshia, Matei Zaharia
Main category: cs.CV
TL;DR: GRAID is a framework that generates high-quality spatial reasoning datasets using 2D bounding boxes from object detectors, avoiding 3D reconstruction errors and generative hallucinations. It produces datasets with 91.16% human-validated accuracy and enables models to learn generalizable spatial reasoning concepts.
Details
Motivation: Current VLMs struggle with spatial reasoning, and existing dataset generation methods have limitations - single-image 3D reconstruction introduces cascading errors requiring wide tolerances, while caption-based methods need hyper-detailed annotations and suffer from generative hallucinations, resulting in only 57.6% human validation rates.
Method: GRAID operates exclusively on 2D bounding boxes from standard object detectors to determine qualitative spatial relationships, avoiding both 3D reconstruction errors and generative hallucinations. The framework generates VQA pairs spanning spatial relations, counting, ranking, and size comparisons.
Result: Generated over 8.5 million high-quality VQA pairs across BDD100k, NuImages, and Waymo datasets with 91.16% human-validated accuracy (vs 57.6% from recent work). Models fine-tuned on GRAID data show strong generalization - improving on over 10 held-out question types with 47.5% gains on BDD and 37.9% on NuImages for Llama 3.2 11B.
Conclusion: GRAID provides a reliable method for generating high-quality spatial reasoning datasets using 2D geometric primitives, significantly outperforming existing approaches and enabling VLMs to learn generalizable spatial reasoning concepts that transfer to various benchmarks.
Abstract: Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning - a prerequisite for many applications. Empirically, we find that a dataset produced by a current training data generation pipeline has a 57.6% human validation rate. These rates stem from current limitations: single-image 3D reconstruction introduces cascading modeling errors and requires wide answer tolerances, while caption-based methods require hyper-detailed annotations and suffer from generative hallucinations. We present GRAID, built on the key insight that qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. By operating exclusively on 2D bounding boxes from standard object detectors, GRAID avoids both 3D reconstruction errors and generative hallucinations, resulting in datasets of higher quality than those produced by existing tools, as validated by human evaluations. We apply our framework to the BDD100k, NuImages, and Waymo datasets, generating over 8.5 million high-quality VQA pairs with questions spanning spatial relations, counting, ranking, and size comparisons. We evaluate one of the datasets and find it achieves 91.16% human-validated accuracy - compared to 57.6% on a dataset generated by recent work. Critically, we demonstrate that when trained on GRAID data, models learn spatial reasoning concepts that generalize: models fine-tuned on 6 question types improve on over 10 held-out types, with accuracy gains of 47.5% on BDD and 37.9% on NuImages for Llama 3.2 11B, and when trained on all question types, achieve improvements on several existing benchmarks such as BLINK. The GRAID framework, datasets, and additional information can be found on our project page: https://ke7.github.io/graid/.
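The core idea, deriving qualitative spatial questions from 2D boxes alone, is easy to illustrate; the predicates and the question template below are illustrative stand-ins, not GRAID's exact rule set:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def left_of(a: Box, b: Box) -> bool:
    return a[2] < b[0]  # a's right edge lies left of b's left edge

def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def count_class(dets: List[Tuple[str, Box]], cls: str) -> int:
    return sum(1 for c, _ in dets if c == cls)

# Example: one VQA pair built from (hypothetical) detector output.
dets = [("car", (10, 40, 120, 90)), ("person", (200, 30, 240, 110))]
question = "Is the car to the left of the person?"
answer = "yes" if left_of(dets[0][1], dets[1][1]) else "no"
```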
[290] CogStereo: Neural Stereo Matching with Implicit Spatial Cognition Embedding
Lihuang Fang, Xiao Hu, Yuchen Zou, Hong Zhang
Main category: cs.CV
TL;DR: CogStereo introduces implicit spatial cognition using monocular depth features as priors to improve stereo matching in challenging regions like occlusions and weak textures, achieving state-of-the-art performance with strong cross-domain generalization.
Details
Motivation: Deep stereo matching lacks the zero-shot generalization seen in foundation models for other vision tasks and struggles with challenging regions such as occlusions and weak textures without dataset-specific priors.
Method: CogStereo embeds implicit spatial cognition into refinement using monocular depth features as priors, employing a dual-conditional refinement mechanism that combines pixel-wise uncertainty with cognition-guided features for global correction.
Result: Extensive experiments on Scene Flow, KITTI, Middlebury, ETH3D, EuRoc, and real-world datasets show state-of-the-art results and excellent cross-domain generalization.
Conclusion: CogStereo shifts stereo vision towards a cognition-driven approach, ensuring structurally coherent disparity estimation even where geometry alone is inadequate, without relying on dataset-specific priors.
Abstract: Deep stereo matching has advanced significantly on benchmark datasets through fine-tuning but falls short of the zero-shot generalization seen in foundation models in other vision tasks. We introduce CogStereo, a novel framework that addresses challenging regions, such as occlusions or weak textures, without relying on dataset-specific priors. CogStereo embeds implicit spatial cognition into the refinement process by using monocular depth features as priors, capturing holistic scene understanding beyond local correspondences. This approach ensures structurally coherent disparity estimation, even in areas where geometry alone is inadequate. CogStereo employs a dual-conditional refinement mechanism that combines pixel-wise uncertainty with cognition-guided features for consistent global correction of mismatches. Extensive experiments on Scene Flow, KITTI, Middlebury, ETH3D, EuRoc, and real-world datasets demonstrate that CogStereo not only achieves state-of-the-art results but also excels in cross-domain generalization, shifting stereo vision towards a cognition-driven approach.
[291] Mint: A Simple Test-Time Adaptation of Vision-Language Models against Common Corruptions
Wenxuan Bao, Ruxi Deng, Jingrui He
Main category: cs.CV
TL;DR: CLIP models suffer from embedding variance collapse under input corruptions, where both intra-class and inter-class variances decrease with corruption severity. The authors propose Mint, a test-time adaptation method that maximizes pseudo-label-based inter-class variance to improve robustness.
Details
Motivation: Pretrained vision-language models like CLIP show strong zero-shot generalization but are vulnerable to distribution shifts from input corruptions, leading to performance degradation.
Method: The authors analyze the embedding variance collapse phenomenon and propose Mint - a test-time adaptation method that uses mean and gradient accumulators to maximize pseudo-label-based inter-class variance on the fly, working effectively with small batch sizes.
Result: Mint consistently improves performance across multiple corruption benchmarks and CLIP architectures, demonstrating enhanced robustness to input corruptions.
Conclusion: Maximizing inter-class variance, even with pseudo-labels, can provably enhance embedding quality and improve CLIP’s robustness to distribution shifts caused by input corruptions.
Abstract: Pretrained vision-language models such as CLIP achieve strong zero-shot generalization but remain vulnerable to distribution shifts caused by input corruptions. In this work, we investigate how corruptions affect CLIP’s image embeddings and uncover a consistent phenomenon we term as embedding variance collapse, where both intra-class and inter-class variances shrink as corruption severity increases. We find that this collapse is closely tied to performance degradation, with inter-class variance strongly correlated with classification accuracy. To explain this phenomenon, we analyze how corruptions alter the structure of the embedding space. Our theoretical results suggest that the visual encoder tends to encode corruption-related signals, which dilute class-discriminative features and compress the representation geometry. We further show that maximizing inter-class variance, even when estimated from pseudo-labels, can provably enhance embedding quality. Based on this insight, we propose Mint, a simple test-time adaptation method that maximizes pseudo-label-based inter-class variance on the fly using a mean accumulator and a gradient accumulator. Mint operates effectively with small batch sizes and consistently improves performance across multiple corruption benchmarks and CLIP architectures. Our code is available at https://github.com/baowenxuan/Mint .
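A minimal sketch of the objective, with the paper's mean and gradient accumulators summarized as a plain gradient-ascent step (shapes and the pseudo-labeling rule are assumptions):

```python
import torch

def interclass_variance(emb: torch.Tensor, pseudo: torch.Tensor, k: int):
    """Weighted variance of pseudo-class means around the global mean;
    maximizing it counteracts the embedding-variance collapse described
    above. Shapes: emb (B, D), pseudo (B,) with values in [0, k)."""
    mu = emb.mean(dim=0)
    var = emb.new_zeros(())
    for c in range(k):
        mask = pseudo == c
        if mask.any():
            mu_c = emb[mask].mean(dim=0)
            var = var + mask.float().mean() * (mu_c - mu).pow(2).sum()
    return var

# Schematic test-time step (the paper's accumulators make this stable for
# small batches; shown here as a plain ascent step for clarity):
# emb = clip.encode_image(batch); pseudo = text_sims.argmax(dim=-1)
# (-interclass_variance(emb, pseudo, k)).backward(); optimizer.step()
```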
[292] egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks
Matthias Jammot, Björn Braun, Paul Streli, Rafael Wampfler, Christian Holz
Main category: cs.CV
TL;DR: egoEMOTION is the first dataset combining egocentric visual and physiological signals with dense emotion and personality self-reports, enabling affect-aware behavior modeling in egocentric vision.
Details
Motivation: Current egocentric vision benchmarks ignore emotional states that shape human behavior, focusing only on physical activities and assuming neutral affect, limiting understanding of internal behavioral drivers.
Method: Created egoEMOTION dataset with 50+ hours of recordings from 43 participants using Meta’s Project Aria glasses, capturing synchronized eye-tracking video, photoplethysmography, inertial motion data, and physiological baselines during emotion-elicitation tasks and naturalistic activities.
Result: A classical learning-based method shows better affect prediction from egocentric vision signals than from physiological signals alone, establishing three benchmark tasks: continuous affect classification, discrete emotion classification, and personality inference.
Conclusion: The dataset establishes emotion and personality as core dimensions in egocentric perception, opening new directions for affect-driven modeling of behavior, intent, and interaction.
Abstract: Understanding affect is central to anticipating human behavior, yet current egocentric vision benchmarks largely ignore the person’s emotional states that shape their decisions and actions. Existing tasks in egocentric perception focus on physical activities, hand-object interactions, and attention modeling - assuming neutral affect and uniform personality. This limits the ability of vision systems to capture key internal drivers of behavior. In this paper, we present egoEMOTION, the first dataset that couples egocentric visual and physiological signals with dense self-reports of emotion and personality across controlled and real-world scenarios. Our dataset includes over 50 hours of recordings from 43 participants, captured using Meta’s Project Aria glasses. Each session provides synchronized eye-tracking video, head-mounted photoplethysmography, inertial motion data, and physiological baselines for reference. Participants completed emotion-elicitation tasks and naturalistic activities while self-reporting their affective state using the Circumplex Model and Mikels’ Wheel, as well as their personality via the Big Five model. We define three benchmark tasks: (1) continuous affect classification (valence, arousal, dominance); (2) discrete emotion classification; and (3) trait-level personality inference. We show that a classical learning-based method, as a simple baseline for real-world affect prediction, produces better estimates from signals captured on egocentric vision systems than from physiological signals alone. Our dataset establishes emotion and personality as core dimensions in egocentric perception and opens new directions in affect-driven modeling of behavior, intent, and interaction.
[293] STG-Avatar: Animatable Human Avatars via Spacetime Gaussian
Guangan Jiang, Tianzi Zhang, Dong Li, Zhenjun Zhao, Haoang Li, Mingrui Li, Hongyu Wang
Main category: cs.CV
TL;DR: STG-Avatar is a 3DGS-based framework that combines Spacetime Gaussians with linear blend skinning to create high-fidelity animatable human avatars from monocular videos, addressing challenges in representing clothing deformations and dynamic regions.
Details
Motivation: To create realistic animatable human avatars for human-robot interaction and virtual experiences, overcoming limitations of current 3DGS-based methods in accurately representing detailed non-rigid features and dynamic regions.
Method: A rigid-nonrigid coupled deformation framework integrating Spacetime Gaussians (STG) with linear blend skinning (LBS), using optical flow to identify high-dynamic regions and guide adaptive densification of 3D Gaussians.
Result: The method consistently outperforms state-of-the-art baselines in both reconstruction quality and operational efficiency, achieving superior quantitative metrics while maintaining real-time rendering capabilities.
Conclusion: STG-Avatar successfully addresses the challenges of representing detailed non-rigid features and dynamic regions in human avatar reconstruction, providing a high-fidelity solution with real-time performance.
Abstract: Realistic animatable human avatars from monocular videos are crucial for advancing human-robot interaction and enhancing immersive virtual experiences. While recent research on 3DGS-based human avatars has made progress, it still struggles with accurately representing detailed features of non-rigid objects (e.g., clothing deformations) and dynamic regions (e.g., rapidly moving limbs). To address these challenges, we present STG-Avatar, a 3DGS-based framework for high-fidelity animatable human avatar reconstruction. Specifically, our framework introduces a rigid-nonrigid coupled deformation framework that synergistically integrates Spacetime Gaussians (STG) with linear blend skinning (LBS). In this hybrid design, LBS enables real-time skeletal control by driving global pose transformations, while STG complements it through spacetime adaptive optimization of 3D Gaussians. Furthermore, we employ optical flow to identify high-dynamic regions and guide the adaptive densification of 3D Gaussians in these regions. Experimental results demonstrate that our method consistently outperforms state-of-the-art baselines in both reconstruction quality and operational efficiency, achieving superior quantitative metrics while retaining real-time rendering capabilities. Our code is available at https://github.com/jiangguangan/STG-Avatar
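The LBS half of the hybrid design follows the standard skinning formula x' = sum_j w_j (R_j x + t_j); a compact sketch with assumed array shapes:

```python
import numpy as np

def linear_blend_skinning(points, weights, rotations, translations):
    """Standard LBS. points: (N, 3) rest-pose positions; weights: (N, J)
    skinning weights summing to 1 per point; rotations: (J, 3, 3);
    translations: (J, 3). Returns posed positions (N, 3)."""
    # Transform every point by every joint: (N, J, 3).
    posed = np.einsum("jab,nb->nja", rotations, points) + translations[None]
    # Blend per-joint results with the skinning weights.
    return np.einsum("nj,nja->na", weights, posed)
```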
[294] Attention Residual Fusion Network with Contrast for Source-free Domain Adaptation
Renrong Shao, Wei Zhang, Jun Wang
Main category: cs.CV
TL;DR: ARFNet is a novel framework for source-free domain adaptation that uses attention residual fusion, global-local contrast learning, and dynamic centroid evaluation to address negative transfer and domain shift issues.
Details
Motivation: Existing SFDA methods focus on domain shift but neglect negative transfer effects that hinder model performance improvement during adaptation. The lack of source data and complex scene information make SFDA challenging.
Method: Proposes ARFNet with three key components: 1) Attention residual fusion using spatial-wise and channel-wise attentions for cross-layer fusion and self-distillation, 2) Global-local attention contrast to improve category discrimination, 3) Dynamic centroid evaluation for trustworthy centroids and pseudo-labels to mitigate domain shift.
Result: Comprehensive experiments on five benchmarks show the method surpasses other techniques and achieves superior performance across SFDA benchmarks.
Conclusion: ARFNet effectively alleviates negative transfer and domain shift in SFDA through attention mechanisms and contrast learning, demonstrating state-of-the-art performance on multiple benchmarks.
Abstract: Source-free domain adaptation (SFDA) involves training a model on a source domain and then applying it to a related target domain without access to the source data and labels during adaptation. The complexity of scene information and lack of the source domain make SFDA a difficult task. Recent studies have shown promising results, but many approaches to domain adaptation concentrate on domain shift and neglect the effects of negative transfer, which may impede enhancements of model performance during adaptation. In this paper, addressing this issue, we propose a novel framework of Attention Residual Fusion Network (ARFNet) based on contrast learning for SFDA to alleviate negative transfer and domain shift during the progress of adaptation, in which attention residual fusion, global-local attention contrast, and dynamic centroid evaluation are exploited. Concretely, the attention mechanism is first exploited to capture the discriminative region of the target object. Then, in each block, attention features are decomposed into spatial-wise and channel-wise attentions to achieve the cross-layer attention residual fusion progressively and self-distillation. During adaptation progress, we contrast global and local representations to improve the perceptual capabilities of different categories, which enables the model to discriminate inter-class and intra-class variations. Finally, a dynamic centroid evaluation strategy is exploited to evaluate the trustworthy centroids and labels for self-supervised self-distillation, which aims to accurately approximate the center of the source domain and pseudo-labels to mitigate domain shift. To validate the efficacy, we execute comprehensive experiments on five benchmarks of varying scales. Experimental outcomes indicate that our method surpasses other techniques, attaining superior performance across SFDA benchmarks.
[295] I2-NeRF: Learning Neural Radiance Fields Under Physically-Grounded Media Interactions
Shuhong Liu, Lin Gu, Ziteng Cui, Xuangeng Chu, Tatsuya Harada
Main category: cs.CV
TL;DR: I2-NeRF is a neural radiance field framework that enhances 3D physical world perception by improving isometric and isotropic metric perception under media degradation through reverse-stratified upsampling and a unified radiative formulation.
Details
Motivation: To endow generative AI with better 3D physical world perception capabilities, particularly addressing limitations of existing NeRF models that rely on object-centric sampling and struggle with media degradation effects.
Method: Introduces reverse-stratified upsampling for near-uniform 3D space sampling to preserve isometry, and a general radiative formulation unifying emission, absorption, and scattering using the Beer-Lambert attenuation law for handling complex media environments.
Result: Experiments on real-world datasets show significant improvements in both reconstruction fidelity and physical plausibility compared to existing approaches, with capability to estimate medium properties like water depth.
Conclusion: I2-NeRF successfully enhances 3D physical perception by addressing isometric and isotropic metric perception challenges in degraded media environments, providing a unified framework for various complex scenarios including underwater, haze, and low-light scenes.
Abstract: Participating in efforts to endow generative AI with 3D physical world perception, we propose I2-NeRF, a novel neural radiance field framework that enhances isometric and isotropic metric perception under media degradation. While existing NeRF models predominantly rely on object-centric sampling, I2-NeRF introduces a reverse-stratified upsampling strategy to achieve near-uniform sampling across 3D space, thereby preserving isometry. We further present a general radiative formulation for media degradation that unifies emission, absorption, and scattering into a particle model governed by the Beer-Lambert attenuation law. By composing the direct and media-induced in-scatter radiance, this formulation extends naturally to complex media environments such as underwater, haze, and even low-light scenes. By treating light propagation uniformly in both vertical and horizontal directions, I2-NeRF enables isotropic metric perception and can even estimate medium properties such as water depth. Experiments on real-world datasets demonstrate that our method significantly improves both reconstruction fidelity and physical plausibility compared to existing approaches.
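A schematic of Beer-Lambert volume compositing with a participating medium; the proportional split between direct and in-scattered radiance below is an assumed simple instance of the unified formulation, not the paper's exact equations:

```python
import numpy as np

def composite_with_medium(color, sigma_obj, sigma_med, medium_color, deltas):
    """Composite one ray. Total extinction is sigma_obj + sigma_med; each
    sample's contribution splits between direct (object) radiance and medium
    in-scatter in proportion to the two coefficients. Shapes: color (N, 3),
    sigma_obj/sigma_med/deltas (N,), medium_color (3,)."""
    sigma = sigma_obj + sigma_med
    trans = np.exp(-np.cumsum(sigma * deltas))       # Beer-Lambert T_i
    alpha = 1.0 - np.exp(-sigma * deltas)            # per-sample opacity
    w = np.concatenate(([1.0], trans[:-1])) * alpha  # w_i = T_{i-1} * alpha_i
    frac_obj = sigma_obj / np.maximum(sigma, 1e-8)
    direct = (w * frac_obj)[:, None] * color
    scatter = (w * (1.0 - frac_obj))[:, None] * medium_color[None, :]
    return direct.sum(axis=0) + scatter.sum(axis=0)
```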
[296] HARMONY: Hidden Activation Representations and Model Output-Aware Uncertainty Estimation for Vision-Language Models
Erum Mushtaq, Zalan Fabian, Yavuz Faruk Bakman, Anil Ramakrishna, Mahdi Soltanolkotabi, Salman Avestimehr
Main category: cs.CV
TL;DR: HARMONY is a novel uncertainty estimation framework for Vision-Language Models that jointly leverages multimodal activations and output distributions to assess response reliability, achieving state-of-the-art performance on VQA benchmarks.
Details
Motivation: Existing uncertainty estimation methods for VLMs rely solely on either output probabilities or hidden representations, failing to capture complex multimodal relationships and struggling with biased probabilities influenced by language priors.
Method: Proposes the HARMONY framework that combines fused multimodal information from model activations with output probability distributions to determine response reliability, leveraging both internal visual understanding beliefs and token probabilities.
Result: Experimental results on A-OKVQA, VizWiz, and PathVQA benchmarks with LLaVa-7b, LLaVA-13b and InstructBLIP show consistent performance improvements, achieving up to 4% AUROC and 6% PRR improvements over existing approaches.
Conclusion: HARMONY establishes new state-of-the-art in uncertainty estimation for VLMs by jointly leveraging multimodal activations and output distributions, demonstrating that both components provide valuable reliability signals.
Abstract: The growing deployment of Vision-Language Models (VLMs) in high-stakes applications such as autonomous driving and assistive technologies for visually impaired individuals necessitates reliable mechanisms to assess the trustworthiness of their generation. Uncertainty Estimation (UE) plays a central role in quantifying the reliability of model outputs and reducing unsafe generations via selective prediction. In this regard, most existing probability-based UE approaches rely on output probability distributions, aggregating token probabilities into a single uncertainty score using predefined functions such as length-normalization. Another line of research leverages model hidden representations and trains MLP-based models to predict uncertainty. However, these methods often fail to capture the complex multimodal relationships between semantic and textual tokens and struggle to identify biased probabilities often influenced by language priors. Motivated by these observations, we propose a novel UE framework, HARMONY, that jointly leverages fused multimodal information in model activations and the output distribution of the VLM to determine the reliability of responses. The key hypothesis of our work is that both the model’s internal belief in its visual understanding, captured by its hidden representations, and the produced token probabilities carry valuable reliability signals that can be jointly leveraged to improve UE performance, surpassing approaches that rely on only one of these components. Experimental results on three open-ended VQA benchmarks, A-OKVQA, VizWiz, and PathVQA, and three state-of-the-art VLMs, LLaVa-7b, LLaVA-13b and InstructBLIP demonstrate that our method consistently performs on par with or better than existing approaches, achieving up to 4% improvement in AUROC, and 6% in PRR, establishing new state of the art in uncertainty estimation for VLMs.
[297] Scaling Non-Parametric Sampling with Representation
Vincent Lu, Aaron Truong, Zeyu Yun, Yubei Chen
Main category: cs.CV
TL;DR: A simple non-parametric generative model based on three principles of natural images produces compelling results on MNIST and CIFAR-10 without training, revealing insights about image structure and generalization.
Details
Motivation: To understand the opaque mechanisms of complex image generative models by proposing a simple, interpretable alternative that strips away engineering tricks.
Method: A non-parametric model that defines each pixel’s distribution from its local context window, grounded in three principles: spatial non-stationarity, low-level regularities, and high-level semantics.
Result: The model produces high-fidelity samples on MNIST and visually compelling CIFAR-10 images despite minimal architecture and no training.
Conclusion: The model’s white-box nature provides mechanistic understanding of generalization and reveals a compositional procedure for part-whole generalization, suggesting how large neural networks learn to generalize.
Abstract: Scaling and architectural advances have produced strikingly photorealistic image generative models, yet their mechanisms still remain opaque. Rather than advancing scaling, our goal is to strip away complicated engineering tricks and propose a simple, non-parametric generative model. Our design is grounded in three principles of natural images - (i) spatial non-stationarity, (ii) low-level regularities, and (iii) high-level semantics - and defines each pixel’s distribution from its local context window. Despite its minimal architecture and no training, the model produces high-fidelity samples on MNIST and visually compelling CIFAR-10 images. This combination of simplicity and strong empirical performance points toward a minimal theory of natural-image structure. The model’s white-box nature also allows us to have a mechanistic understanding of how the model generalizes and generates diverse images. We study it by tracing each generated pixel back to its source images. These analyses reveal a simple, compositional procedure for “part-whole generalization”, suggesting a hypothesis for how large neural network generative models learn to generalize.
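A minimal sketch of such a context-window pixel model; all names and the nearest-neighbor sampling rule are illustrative assumptions:

```python
import numpy as np

def sample_pixel(context, bank_contexts, bank_pixels, k=5, rng=None):
    """Non-parametric pixel model: the next pixel's distribution is the
    empirical distribution of pixels whose (flattened) local context windows
    are nearest to the current context. bank_contexts: (M, C) windows
    harvested from training images; bank_pixels: (M,) their center pixels."""
    rng = rng or np.random.default_rng()
    dists = np.linalg.norm(bank_contexts - context[None], axis=1)
    nearest = np.argsort(dists)[:k]
    return bank_pixels[rng.choice(nearest)]

# Generation proceeds in raster-scan order: at each location, flatten the
# already-generated neighborhood into `context`, then sample its pixel.
```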
[298] MOGRAS: Human Motion with Grasping in 3D Scenes
Kunal Bhosikar, Siddharth Katageri, Vivek Madhavaram, Kai Han, Charu Sharma
Main category: cs.CV
TL;DR: MOGRAS is a large-scale dataset for generating full-body grasping motions in 3D scenes, addressing the gap between scene-aware motion generation and fine-grained grasping tasks, with an effective adaptation method for existing approaches.
Details
Motivation: Existing methods either generate full-body motion in 3D scenes without fine-grained grasping fidelity, or generate precise grasping motions without considering the surrounding 3D scene, creating a significant gap for physically plausible full-body grasping in 3D environments.
Method: Introduces MOGRAS dataset with pre-grasping full-body walking motions and final grasping poses in annotated 3D indoor scenes, and proposes a simple yet effective method to adapt existing approaches for scene-aware generation.
Result: The dataset enables benchmarking of existing methods and reveals their limitations in scene-aware generation. The proposed adaptation method achieves significant improvements through extensive quantitative and qualitative experiments.
Conclusion: MOGRAS dataset and the proposed adaptation method effectively bridge the gap between scene-aware motion generation and fine-grained grasping tasks, paving the way for more realistic human-scene interactions in applications like robotics and virtual reality.
Abstract: Generating realistic full-body motion interacting with objects is critical for applications in robotics, virtual reality, and human-computer interaction. While existing methods can generate full-body motion within 3D scenes, they often lack the fidelity for fine-grained tasks like object grasping. Conversely, methods that generate precise grasping motions typically ignore the surrounding 3D scene. This gap, generating full-body grasping motions that are physically plausible within a 3D scene, remains a significant challenge. To address this, we introduce MOGRAS (Human MOtion with GRAsping in 3D Scenes), a large-scale dataset that bridges this gap. MOGRAS provides pre-grasping full-body walking motions and final grasping poses within richly annotated 3D indoor scenes. We leverage MOGRAS to benchmark existing full-body grasping methods and demonstrate their limitations in scene-aware generation. Furthermore, we propose a simple yet effective method to adapt existing approaches to work seamlessly within 3D scenes. Through extensive quantitative and qualitative experiments, we validate the effectiveness of our dataset and highlight the significant improvements our proposed method achieves, paving the way for more realistic human-scene interactions.
[299] LongCat-Video Technical Report
Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, Tong Zhang
Main category: cs.CV
TL;DR: LongCat-Video is a 13.6B parameter video generation model that excels at efficient long video generation, supporting multiple tasks with unified architecture and achieving strong performance through multi-reward RLHF.
Details
Motivation: To develop efficient long video inference as a key capability toward building world models, addressing the need for high-quality, temporally coherent minute-long video generation.
Method: Built on Diffusion Transformer (DiT) framework with unified architecture for Text-to-Video, Image-to-Video, and Video-Continuation tasks. Uses coarse-to-fine generation strategy, Block Sparse Attention for efficiency, and multi-reward RLHF training.
Result: Generates 720p, 30fps videos within minutes, maintains high quality and temporal coherence for long videos, and achieves performance comparable to latest closed-source and leading open-source models.
Conclusion: LongCat-Video represents a significant step toward world models with efficient long video generation capabilities, and the model is publicly available to accelerate progress in the field.
Abstract: Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include: Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model; Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos; Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions; Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models. Code and model weights are publicly available to accelerate progress in the field.
[300] TrajGATFormer: A Graph-Based Transformer Approach for Worker and Obstacle Trajectory Prediction in Off-site Construction Environments
Mohammed Alduais, Xinming Li, Qipei Mei
Main category: cs.CV
TL;DR: This paper proposes a framework using YOLOv10n and DeepSORT for detection and tracking, plus two novel trajectory prediction models (TrajGATFormer and TrajGATFormer-Obstacle) that integrate transformer encoder-decoder with Graph Attention Networks to improve collision risk assessment in offsite construction environments.
Details
Motivation: Offsite construction introduces new safety risks due to close interaction between workers, machinery, and moving obstacles. Traditional methods struggle with dynamic construction environments, while recent data-driven methods fail to capture long-term behavior and spatial/social context needed for collision risk assessment.
Method: Proposes a framework integrating YOLOv10n for object detection and DeepSORT for tracking, plus two trajectory prediction models: TrajGATFormer (worker-only prediction) and TrajGATFormer-Obstacle (worker and obstacle prediction). Both use transformer encoder-decoder with Graph Attention Networks to capture temporal and spatial interactions.
Result: TrajGATFormer achieves ADE of 1.25 m and FDE of 2.3 m over 4.8 s horizon. TrajGATFormer-Obstacle achieves higher accuracy with ADE 1.15 m and FDE 2.2 m. Both models outperform traditional methods, reducing ADE and FDE by up to 35% and 38% respectively.
Conclusion: The proposed framework effectively addresses limitations of existing methods by integrating precise detection/tracking with advanced trajectory prediction models that capture complex spatial-temporal interactions, significantly improving collision risk assessment in dynamic construction environments.
Abstract: As the demand grows within the construction industry for processes that are not only faster but also safer and more efficient, offsite construction has emerged as a solution, though it brings new safety risks due to the close interaction between workers, machinery, and moving obstacles. Predicting the future trajectories of workers and taking into account social and environmental factors is a crucial step for developing collision-avoidance systems to mitigate such risks. Traditional methods often struggle to adapt to the dynamic and unpredictable nature of construction environments. Many rely on simplified assumptions or require hand-crafted features, limiting their ability to respond to complex, real-time interactions between workers and moving obstacles. While recent data-driven methods have improved the modeling of temporal patterns, they still face challenges in capturing long-term behavior and accounting for the spatial and social context crucial to collision risk assessment. To address these limitations, this paper proposes a framework integrating YOLOv10n and DeepSORT for precise detection and tracking, along with two novel trajectory prediction models: TrajGATFormer and TrajGATFormer-Obstacle. YOLOv10n serves as the backbone for object detection, accurately identifying workers and obstacles in diverse scenes, while DeepSORT efficiently tracks them over time with unique IDs for continuity. Both models employ a transformer encoder-decoder with Graph Attention Networks (GAT) to capture temporal and spatial interactions. TrajGATFormer predicts worker trajectories with an ADE of 1.25 m and FDE of 2.3 m over a 4.8 s horizon, while TrajGATFormer-Obstacle extends prediction to both workers and obstacles, achieving higher accuracy (ADE 1.15 m, FDE 2.2 m). Comparative analysis shows both models outperform traditional methods, reducing ADE and FDE by up to 35% and 38%, respectively.
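For reference, the two reported metrics are straightforward to compute; a minimal sketch:

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: (T, 2) trajectories in meters over the prediction horizon.
    ADE averages the per-step Euclidean error; FDE is the final-step error."""
    err = np.linalg.norm(pred - gt, axis=1)
    return float(err.mean()), float(err[-1])
```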
[301] DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum
Yaokun Li, Lihe Ding, Xiao Chen, Guang Tan, Tianfan Xue
Main category: cs.CV
TL;DR: DynamicTree is a framework for generating realistic 4D motion of 3D Gaussian Splatting trees using a compact sparse voxel spectrum representation, enabling fast feed-forward animation and real-time interactive responses.
Details
Motivation: Existing methods face challenges in generating realistic 4D motion for complex real trees, which are needed for virtual reality, games, and world simulation applications.
Method: The approach uses a sparse voxel spectrum to represent tree movement, generates mesh motion from 3D Gaussian Splatting trees, binds Gaussians to deform the mesh, and enables fast modal analysis for real-time interactive responses under external forces.
Result: The method achieves realistic and responsive tree animations, significantly outperforming existing approaches in both visual quality and computational efficiency. A large-scale synthetic 4D tree dataset (4DTree) with 8,786 animated tree meshes was also created for training.
Conclusion: DynamicTree successfully generates long-term, interactive animation of 3D Gaussian Splatting trees in a fast feed-forward manner, representing a significant advancement over prior optimization-based methods.
Abstract: Generating dynamic and interactive 3D objects, such as trees, has wide applications in virtual reality, games, and world simulation. Nevertheless, existing methods still face various challenges in generating realistic 4D motion for complex real trees. In this paper, we propose DynamicTree, the first framework that can generate long-term, interactive animation of 3D Gaussian Splatting trees. Unlike prior optimization-based methods, our approach generates dynamics in a fast feed-forward manner. The key success of our approach is the use of a compact sparse voxel spectrum to represent the tree movement. Given a 3D tree from Gaussian Splatting reconstruction, our pipeline first generates mesh motion using the sparse voxel spectrum and then binds Gaussians to deform the mesh. Additionally, the proposed sparse voxel spectrum can also serve as a basis for fast modal analysis under external forces, allowing real-time interactive responses. To train our model, we also introduce 4DTree, the first large-scale synthetic 4D tree dataset containing 8,786 animated tree meshes with semantic labels and 100-frame motion sequences. Extensive experiments demonstrate that our method achieves realistic and responsive tree animations, significantly outperforming existing approaches in both visual quality and computational efficiency.
[302] GALA: A GlobAl-LocAl Approach for Multi-Source Active Domain Adaptation
Juepeng Zheng, Peifeng Zhang, Yibin Wen, Qingmei Li, Yang Zhang, Haohuan Fu
Main category: cs.CV
TL;DR: Proposes GALA, a Multi-Source Active Domain Adaptation method that combines global clustering with local selection to efficiently acquire target annotations, achieving near-supervised performance with only 1% target labels.
Details
Motivation: To bridge the performance gap between domain adaptation methods and fully supervised learning by selectively acquiring target-domain annotations in multi-source settings.
Method: A GALA strategy combining global k-means clustering of target samples with a cluster-wise local selection criterion to handle inter-class diversity and multi-source domain variation.
Result: Outperforms prior active learning and active DA methods on three standard benchmarks, achieving performance comparable to fully-supervised upperbound using only 1% target annotations.
Conclusion: GALA is an effective plug-and-play solution for Multi-Source Active Domain Adaptation that significantly reduces annotation costs while maintaining high performance.
Abstract: Domain Adaptation (DA) provides an effective way to tackle target-domain tasks by leveraging knowledge learned from source domains. Recent studies have extended this paradigm to Multi-Source Domain Adaptation (MSDA), which exploits multiple source domains carrying richer and more diverse transferable information. However, a substantial performance gap still remains between adaptation-based methods and fully supervised learning. In this paper, we explore a more practical and challenging setting, named Multi-Source Active Domain Adaptation (MS-ADA), to further enhance target-domain performance by selectively acquiring annotations from the target domain. The key difficulty of MS-ADA lies in designing selection criteria that can jointly handle inter-class diversity and multi-source domain variation. To address these challenges, we propose a simple yet effective GlobAl-LocAl Approach (GALA), which combines a global k-means clustering step for target-domain samples with a cluster-wise local selection criterion, effectively tackling the above two issues in a complementary manner. Our proposed GALA is plug-and-play and can be seamlessly integrated into existing DA frameworks without introducing any additional trainable parameters. Extensive experiments on three standard DA benchmarks demonstrate that GALA consistently outperforms prior active learning and active DA methods, achieving performance comparable to the fully-supervised upperbound while using only 1% of the target annotations.
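A minimal sketch of the global-then-local selection logic; the per-cluster uncertainty score stands in for the paper's local criterion, and all names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def gala_select(features, uncertainty, budget, k=None, seed=0):
    """Global step: k-means over target features enforces diversity.
    Local step: inside each cluster, pick the most uncertain sample.
    features: (N, D) target embeddings; uncertainty: (N,) scores."""
    k = k or budget
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(features)
    picks = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        if len(idx):
            picks.append(int(idx[np.argmax(uncertainty[idx])]))
    return picks[:budget]  # indices of target samples to annotate
```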
[303] Empowering Your Pansharpening Models with Generalizability: Unified Distribution is All You Need
Yongchuan Cui, Peng Liu, Hui Zhang
Main category: cs.CV
TL;DR: UniPAN is a novel approach that enhances the generalizability of deep learning pansharpening models by normalizing data from different satellite sources to a unified distribution, enabling “train once, deploy forever” capability.
Details
Motivation: Existing pansharpening models suffer from performance degradation when applied to unseen satellite data due to distributional discrepancies from different sensors and imaging conditions, limiting their practical applicability.
Method: Proposes a unified distribution strategy (UniPAN) that constructs a distribution transformation function to normalize pixels from different sources to an identical distribution, training models on the transformed domain and applying the same transformation during testing.
Result: Extensive experiments demonstrate UniPAN’s efficacy in significantly enhancing the performance of deep pansharpening models across diverse satellite sensors.
Conclusion: UniPAN successfully bridges the gap between training and testing distributions, enabling pansharpening models to achieve better generalizability and maintain performance across different satellite data sources.
Abstract: Existing deep learning-based models for remote sensing pansharpening exhibit exceptional performance on training datasets. However, due to sensor-specific characteristics and varying imaging conditions, these models suffer from substantial performance degradation when applied to unseen satellite data, lacking generalizability and thus limiting their applicability. We argue that the performance drops stem primarily from distributional discrepancies from different sources and the key to addressing this challenge lies in bridging the gap between training and testing distributions. To validate the idea and further achieve a “train once, deploy forever” capability, this paper introduces a novel and intuitive approach to empower any pansharpening model with generalizability by employing a unified distribution strategy (UniPAN). Specifically, we construct a distribution transformation function that normalizes the pixels sampled from different sources to conform to an identical distribution. The deep models are trained on the transformed domain, and during testing on new datasets, the new data are also transformed to match the training distribution. UniPAN aims to train and test the model on a unified and consistent distribution, thereby enhancing its generalizability. Extensive experiments validate the efficacy of UniPAN, demonstrating its potential to significantly enhance the performance of deep pansharpening models across diverse satellite sensors. Codes: https://github.com/yc-cui/UniPAN.
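One concrete way to realize such a transformation function is per-band rank Gaussianization; this is an assumed instance of the idea, not necessarily UniPAN's exact transform:

```python
import numpy as np
from scipy.stats import norm

def to_unified_distribution(img: np.ndarray) -> np.ndarray:
    """Map each band's empirical CDF to a standard normal so pixels from any
    sensor share one distribution. img: (H, W, B) channel-last array.
    Apply the same transform at both training and test time."""
    flat = img.reshape(-1, img.shape[-1]).astype(np.float64)
    out = np.empty_like(flat)
    n = flat.shape[0]
    for b in range(flat.shape[1]):
        ranks = flat[:, b].argsort().argsort()   # empirical CDF ranks
        out[:, b] = norm.ppf((ranks + 0.5) / n)  # standard normal quantiles
    return out.reshape(img.shape)
```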
[304] Audio Frequency-Time Dual Domain Evaluation on Depression Diagnosis
Yu Luo, Nan Huang, Sophie Yu, Hendry Xu, Jerry Wang, Colin Wang, Zhichao Liu, Chen Zeng
Main category: cs.CV
TL;DR: This paper proposes using voice signals with deep learning for intelligent depression diagnosis, achieving excellent classification performance.
Details
Motivation: Depression prevention faces challenges like complex diagnostics, ambiguous criteria, and low consultation rates that hinder timely intervention.
Method: Uses voice as a physiological signal, leveraging its frequency-time dual-domain multimodal characteristics together with deep learning models to build a depression assessment algorithm.
Result: The proposed method achieves excellent performance in depression diagnosis classification tasks.
Conclusion: Provides new insights and approaches for depression assessment, screening, and diagnosis using voice-based intelligent systems.
Abstract: Depression, as a typical mental disorder, has become a prevalent issue significantly impacting public health. However, the prevention and treatment of depression still face multiple challenges, including complex diagnostic procedures, ambiguous criteria, and low consultation rates, which severely hinder timely assessment and intervention. To address these issues, this study adopts voice as a physiological signal and leverages its frequency-time dual domain multimodal characteristics along with deep learning models to develop an intelligent assessment and diagnostic algorithm for depression. Experimental results demonstrate that the proposed method achieves excellent performance in the classification task for depression diagnosis, offering new insights and approaches for the assessment, screening, and diagnosis of depression.
[305] Diffusion-Driven Two-Stage Active Learning for Low-Budget Semantic Segmentation
Jeongin Kim, Wonho Bae, YouLee Han, Giyeong Oh, Youngjae Yu, Danica J. Sutherland, Junhyug Noh
Main category: cs.CV
TL;DR: A novel two-stage active learning method for semantic segmentation that uses diffusion model features and combines diversity selection with uncertainty-based refinement to achieve high accuracy with minimal labeled pixels.
Details
Motivation: Semantic segmentation requires dense pixel-level annotations, which are expensive, especially under extremely constrained labeling budgets. There's a need for efficient active learning methods that can work with minimal labeled data.
Method: A two-stage selection pipeline: 1) hierarchical representation-based candidate selection using MaxHerding to choose representative pixels per image and build a diverse global pool, 2) an entropy-augmented disagreement score (eDALD) over multi-scale diffusion features to select the most informative pixels, decoupling diversity and uncertainty.
Result: Extensive experiments on four benchmarks (CamVid, ADE-Bed, Cityscapes, Pascal-Context) show significant performance improvements over existing baselines under extreme pixel-budget regimes.
Conclusion: The proposed method effectively addresses low-budget active learning for semantic segmentation by leveraging diffusion model features and a two-stage selection approach that separates diversity and uncertainty, achieving high accuracy with minimal labeled pixels.
Abstract: Semantic segmentation demands dense pixel-level annotations, which can be prohibitively expensive - especially under extremely constrained labeling budgets. In this paper, we address the problem of low-budget active learning for semantic segmentation by proposing a novel two-stage selection pipeline. Our approach leverages a pre-trained diffusion model to extract rich multi-scale features that capture both global structure and fine details. In the first stage, we perform a hierarchical, representation-based candidate selection by first choosing a small subset of representative pixels per image using MaxHerding, and then refining these into a diverse global pool. In the second stage, we compute an entropy-augmented disagreement score (eDALD) over noisy multi-scale diffusion features to capture both epistemic uncertainty and prediction confidence, selecting the most informative pixels for annotation. This decoupling of diversity and uncertainty lets us achieve high segmentation accuracy with only a tiny fraction of labeled pixels. Extensive experiments on four benchmarks (CamVid, ADE-Bed, Cityscapes, and Pascal-Context) demonstrate that our method significantly outperforms existing baselines under extreme pixel-budget regimes. Our code is available at https://github.com/jn-kim/two-stage-edald.
[306] DiffusionLane: Diffusion Model for Lane Detection
Kunyang Zhou, Yeqin Shao
Main category: cs.CV
TL;DR: DiffusionLane is a novel diffusion-based model for lane detection that treats lane detection as a denoising diffusion process in lane parameter space, achieving state-of-the-art performance on multiple benchmarks.
Details
Motivation: To address lane detection challenges by leveraging diffusion models' strong generative capabilities and improving feature representation in noisy lane anchor scenarios.Method: Uses Gaussian noise on ground truth lane parameters to create noisy anchors, then learns progressive refinement. Introduces hybrid decoding strategy with global and local decoders, and employs auxiliary head for encoder supervision.
Result: Achieves strong generalization and superior performance: 1%+ accuracy improvement on Carlane, 81.32% F1 on CULane, 96.89% accuracy on Tusimple, and 97.59% F1 on LLAMAS.
Conclusion: DiffusionLane demonstrates effective application of diffusion models to lane detection with robust performance across multiple datasets and network backbones.
Abstract: In this paper, we present a novel diffusion-based model for lane detection, called DiffusionLane, which treats the lane detection task as a denoising diffusion process in the parameter space of the lane. Firstly, we add Gaussian noise to the parameters (the starting point and the angle) of ground truth lanes to obtain noisy lane anchors, and the model learns to refine the noisy lane anchors in a progressive way to obtain the target lanes. Secondly, we propose a hybrid decoding strategy to address the poor feature representation of the encoder, resulting from the noisy lane anchors. Specifically, we design a hybrid diffusion decoder to combine global-level and local-level decoders for high-quality lane anchors. Then, to improve the feature representation of the encoder, we employ an auxiliary head in the training stage to adopt the learnable lane anchors for enriching the supervision on the encoder. Experimental results on four benchmarks, Carlane, Tusimple, CULane, and LLAMAS, show that DiffusionLane possesses a strong generalization ability and promising detection performance compared to the previous state-of-the-art methods. For example, DiffusionLane with ResNet18 surpasses the existing methods by at least 1% accuracy on the domain adaptation dataset Carlane. Besides, DiffusionLane achieves an 81.32% F1 score on CULane with MobileNetV4, 96.89% accuracy on Tusimple with ResNet34, and a 97.59% F1 score on LLAMAS with ResNet101. Code will be available at https://github.com/zkyntu/UnLanedet.
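The core loop, noising lane parameters and learning to denoise them, can be sketched in a few lines. Everything below is illustrative: `ToyRefiner` is a stand-in for the hybrid diffusion decoder, and the image features it would condition on are omitted.

```python
import torch
import torch.nn as nn

class ToyRefiner(nn.Module):
    """Stand-in for the (hypothetical) decoder: predicts parameter residuals."""
    def __init__(self, dim: int = 3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))
    def forward(self, anchors):
        return self.net(anchors)

def make_noisy_anchors(gt_params: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # gt_params: (N, 3) lanes as (x0, y0, angle); noise creates training anchors.
    return gt_params + sigma * torch.randn_like(gt_params)

@torch.no_grad()
def refine(anchors, refiner, steps: int = 4):
    # Progressive denoising: each step predicts a residual toward the target lane.
    for _ in range(steps):
        anchors = anchors + refiner(anchors)
    return anchors

gt = torch.rand(8, 3)
refined = refine(make_noisy_anchors(gt), ToyRefiner())
```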
[307] Real-Time Semantic Segmentation on FPGA for Autonomous Vehicles Using LMIINet with the CGRA4ML Framework
Amir Mohammad Khadem Hosseini, Sattar Mirzakuchaki
Main category: cs.CV
TL;DR: FPGA-based real-time semantic segmentation using lightweight LMIINet architecture and CGRA4ML hardware framework, achieving 90% pixel accuracy and 45% mIoU at 20 FPS with optimized power efficiency.
Details
Motivation: Address the challenge of achieving high accuracy in semantic segmentation while operating under computational and hardware constraints, particularly for real-time applications like autonomous driving.Method: Used FPGA implementation with LMIINet architecture and CGRA4ML framework, applied Quantization-Aware Training with 8-bit precision, simplified skip connections, employed hardware-friendly operations (depthwise-separable and 1×1 convolutions), and redesigned Flatten Transformer components.
Result: Achieved approximately 90% pixel accuracy and 45% mean Intersection-over-Union (mIoU), operating at 20 frames per second with 50.1 ms latency on ZCU104 FPGA board, with 4x memory footprint reduction.
Conclusion: CGRA4ML provides a viable path for implementing advanced semantic segmentation networks on FPGA, offering superior power efficiency compared to traditional GPU solutions while maintaining competitive accuracy for real-time applications.
Abstract: Semantic segmentation has emerged as a fundamental problem in computer vision, gaining particular importance in real-time applications such as autonomous driving. The main challenge is achieving high accuracy while operating under computational and hardware constraints. In this research, we present an FPGA-based implementation of real-time semantic segmentation leveraging the lightweight LMIINet architecture and the Coarse-Grained Reconfigurable Array for Machine Learning (CGRA4ML) hardware framework. The model was trained using Quantization-Aware Training (QAT) with 8-bit precision on the Cityscapes dataset, reducing memory footprint by a factor of four while enabling efficient fixed-point computations. Necessary modifications were applied to adapt the model to CGRA4ML constraints, including simplifying skip connections, employing hardware-friendly operations such as depthwise-separable and 1×1 convolutions, and redesigning parts of the Flatten Transformer. Our implementation achieves approximately 90% pixel accuracy and 45% mean Intersection-over-Union (mIoU), operating in real-time at 20 frames per second (FPS) with 50.1 ms latency on the ZCU104 FPGA board. The results demonstrate that CGRA4ML, with its flexibility in mapping modern layers and its off-chip memory utilization for skip connections, provides a path for implementing advanced semantic segmentation networks on FPGA for real-time applications, outperforming traditional GPU solutions in terms of power efficiency while maintaining competitive accuracy. The code for this project is publicly available at https://github.com/STAmirr/cgra4ml_semantic_segmentation
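Of the hardware-friendly substitutions, the depthwise-separable plus 1×1 pattern is the easiest to illustrate. A minimal PyTorch sketch with illustrative channel sizes follows; 8-bit quantization-aware training would be applied on top of such blocks.

```python
import torch
import torch.nn as nn

class DWSeparable(nn.Module):
    """Hardware-friendly block: a depthwise 3x3 convolution followed by a
    pointwise 1x1 convolution, the kind of operation pair that maps well
    onto accelerator fabrics such as CGRA4ML."""
    def __init__(self, cin: int, cout: int):
        super().__init__()
        self.depthwise = nn.Conv2d(cin, cin, 3, padding=1, groups=cin, bias=False)
        self.pointwise = nn.Conv2d(cin, cout, 1, bias=False)
        self.bn = nn.BatchNorm2d(cout)
        self.act = nn.ReLU(inplace=True)
    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

y = DWSeparable(32, 64)(torch.randn(1, 32, 128, 256))  # (1, 64, 128, 256)
```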
[308] Accident Anticipation via Temporal Occurrence Prediction
Tianhao Zhao, Yiyang Zou, Zihao Mao, Peilun Xiao, Yulin Huang, Hongda Yang, Yuxuan Li, Qun Li, Guobin Wu, Yutian Lin
Main category: cs.CV
TL;DR: Proposes a novel accident anticipation paradigm that shifts from current-frame risk scoring to directly predicting accident scores at multiple future time steps, using precise accident timestamps for supervision and Transformer-based temporal decoding.
Details
Motivation: Existing methods use ambiguous binary supervision (labeling all frames in accident videos as positive) despite continuous risk variation over time, leading to unreliable learning and false alarms.Method: Uses a snippet-level encoder for spatial-temporal modeling and a Transformer-based temporal decoder with dedicated temporal queries to predict accident scores for multiple future horizons simultaneously.
Result: Achieves superior performance in both recall and Time-to-Accident (TTA) under realistic false alarm rate (FAR) constraints.
Conclusion: The proposed paradigm provides more reliable accident anticipation by directly estimating future accident scores with precise supervision, outperforming existing methods while maintaining practical relevance through refined evaluation protocols.
Abstract: Accident anticipation aims to predict potential collisions in an online manner, enabling timely alerts to enhance road safety. Existing methods typically predict frame-level risk scores as indicators of hazard. However, these approaches rely on ambiguous binary supervision (labeling all frames in accident videos as positive) despite the fact that risk varies continuously over time, leading to unreliable learning and false alarms. To address this, we propose a novel paradigm that shifts the prediction target from current-frame risk scoring to directly estimating accident scores at multiple future time steps (e.g., 0.1s-2.0s ahead), leveraging precisely annotated accident timestamps as supervision. Our method employs a snippet-level encoder to jointly model spatial and temporal dynamics, and a Transformer-based temporal decoder that predicts accident scores for all future horizons simultaneously using dedicated temporal queries. Furthermore, we introduce a refined evaluation protocol that reports Time-to-Accident (TTA) and recall, evaluated at multiple pre-accident intervals (0.5s, 1.0s, and 1.5s), only when the false alarm rate (FAR) remains within an acceptable range, ensuring practical relevance. Experiments show that our method achieves superior performance in both recall and TTA under realistic FAR constraints.
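The decoder idea, one learnable query per future horizon cross-attending to snippet features, can be sketched as follows. Dimensions, layer counts, and the sigmoid head are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HorizonDecoder(nn.Module):
    """Minimal sketch: learnable queries, one per future horizon
    (e.g. 0.1s..2.0s), cross-attend to snippet features, and each
    query emits an accident score for its horizon."""
    def __init__(self, n_horizons: int = 20, dim: int = 256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_horizons, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)
    def forward(self, snippet_feats):              # (B, T, dim) encoder output
        q = self.queries.unsqueeze(0).expand(snippet_feats.size(0), -1, -1)
        out = self.decoder(q, snippet_feats)       # (B, n_horizons, dim)
        return self.head(out).squeeze(-1).sigmoid()  # per-horizon scores

scores = HorizonDecoder()(torch.randn(2, 16, 256))   # (2, 20)
```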
[309] GSAlign: Geometric and Semantic Alignment Network for Aerial-Ground Person Re-Identification
Qiao Li, Jie Li, Yukang Zhang, Lei Tan, Jing Chen, Jiayi Ji
Main category: cs.CV
TL;DR: GSAlign is a novel network for aerial-ground person re-identification that addresses geometric distortion and semantic misalignment through learnable thin plate spline warping and dynamic alignment modules.
Details
Motivation: Aerial-ground person re-identification faces extreme viewpoint discrepancies, occlusions, and domain gaps between UAV and ground camera views, which existing methods struggle to handle effectively.Method: Proposes GSAlign with two key components: Learnable Thin Plate Spline (LTPS) Module for geometric warping of pedestrian features, and Dynamic Alignment Module (DAM) for visibility-aware semantic alignment.
Result: Achieves significant improvements of +18.8% in mAP and +16.8% in Rank-1 accuracy over previous state-of-the-art methods on the CARGO dataset with four matching protocols.
Conclusion: GSAlign effectively addresses the challenges of aerial-ground person re-identification by jointly tackling geometric distortion and semantic misalignment, demonstrating superior performance over existing approaches.
Abstract: Aerial-Ground person re-identification (AG-ReID) is an emerging yet challenging task that aims to match pedestrian images captured from drastically different viewpoints, typically from unmanned aerial vehicles (UAVs) and ground-based surveillance cameras. The task poses significant challenges due to extreme viewpoint discrepancies, occlusions, and domain gaps between aerial and ground imagery. While prior works have made progress by learning cross-view representations, they remain limited in handling severe pose variations and spatial misalignment. To address these issues, we propose a Geometric and Semantic Alignment Network (GSAlign) tailored for AG-ReID. GSAlign introduces two key components to jointly tackle geometric distortion and semantic misalignment in aerial-ground matching: a Learnable Thin Plate Spline (LTPS) Module and a Dynamic Alignment Module (DAM). The LTPS module adaptively warps pedestrian features based on a set of learned keypoints, effectively compensating for geometric variations caused by extreme viewpoint changes. In parallel, the DAM estimates visibility-aware representation masks that highlight visible body regions at the semantic level, thereby alleviating the negative impact of occlusions and partial observations in cross-view correspondence. A comprehensive evaluation on CARGO with four matching protocols demonstrates the effectiveness of GSAlign, achieving significant improvements of +18.8% in mAP and +16.8% in Rank-1 accuracy over previous state-of-the-art methods on the aerial-ground setting. The code is available at: https://github.com/stone96123/GSAlign
[310] WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models
Issa Sugiura, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Yasuo Okabe, Naoaki Okazaki
Main category: cs.CV
TL;DR: WAON is a large-scale Japanese image-text dataset with 155M examples that improves vision-language model performance on Japanese cultural tasks, outperforming existing datasets like ReLAION.
Details
Motivation: To address the need for large-scale, high-quality Japanese image-text datasets for developing better Vision-Language Models, particularly for Japanese cultural understanding.Method: Collected data from Common Crawl using filtering and deduplication techniques, created WAON-Bench benchmark with 374 Japanese cultural classes, and fine-tuned SigLIP2 model on WAON vs ReLAION datasets.
Result: WAON-trained models achieved higher accuracy across all benchmarks, enhanced performance on WAON-Bench more efficiently than ReLAION, and achieved state-of-the-art performance on several Japanese cultural benchmarks.
Conclusion: WAON dataset effectively improves VLM performance for Japanese cultural understanding and is publicly available for research use.
Abstract: Large-scale and high-quality image-text pair datasets play an important role in developing high-performing Vision-Language Models (VLMs). In this work, we introduce WAON, a large-scale and high-quality Japanese image-text pair dataset containing approximately 155 million examples, collected from Common Crawl. Our dataset construction pipeline employs various techniques, including filtering and deduplication, which have been shown to be effective in previous studies. We also construct WAON-Bench, a manually curated benchmark for Japanese cultural image classification consisting of 374 classes. To assess the effectiveness of our dataset, we conduct experiments using both WAON and the Japanese subset of ReLAION, one of the most widely used vision-language datasets. We fine-tune SigLIP2, a strong multilingual model, on both datasets. The results demonstrate that WAON enhances model performance on WAON-Bench more efficiently than ReLAION and achieves higher accuracy across all evaluated benchmarks. Furthermore, the model fine-tuned on WAON achieves state-of-the-art performance on several Japanese cultural benchmarks. We release our dataset, model, and code at https://speed1313.github.io/WAON.
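As a flavor of the filtering-and-deduplication stage, here is a toy exact-dedup pass over image-text pairs. WAON's actual pipeline is more elaborate; this hash-on-caption heuristic is purely illustrative.

```python
import hashlib

def dedup_pairs(pairs):
    """Drop image-text pairs whose normalized caption hashes to a value
    already seen. Real pipelines also dedup on image content and apply
    quality/language filters; this is a minimal sketch."""
    seen, kept = set(), []
    for url, caption in pairs:
        key = hashlib.sha256(caption.strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append((url, caption))
    return kept

pairs = [("u1", "東京タワーの写真"), ("u2", "東京タワーの写真 "), ("u3", "桜")]
print(dedup_pairs(pairs))  # u2 is dropped as a normalized duplicate of u1
```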
[311] CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning
Tianhui Liu, Hetian Pang, Xin Zhang, Jie Feng, Yong Li, Pan Hui
Main category: cs.CV
TL;DR: CityRiSE is a reinforcement learning framework that improves Large Vision-Language Models’ ability to predict urban socio-economic status from street view and satellite imagery by focusing on meaningful visual cues and enabling structured reasoning.
Details
Motivation: LVLMs struggle with accurate and interpretable socio-economic predictions from visual data, limiting their potential for urban socio-economic sensing which is crucial for sustainable development goals.Method: Pure reinforcement learning framework with carefully curated multi-modal data and verifiable reward design to guide LVLMs to focus on semantically meaningful visual cues for structured reasoning.
Result: Significantly outperforms existing baselines, improving both prediction accuracy and generalization across diverse urban contexts, especially for unseen cities and indicators.
Conclusion: Combining RL and LVLMs shows promise for interpretable and generalist urban socio-economic sensing, enabling better understanding of visual data for sustainable development applications.
Abstract: Harnessing publicly available, large-scale web data, such as street view and satellite imagery, urban socio-economic sensing is of paramount importance for achieving global sustainable development goals. With the emergence of Large Vision-Language Models (LVLMs), new opportunities have arisen to solve this task by treating it as a multi-modal perception and understanding problem. However, recent studies reveal that LVLMs still struggle with accurate and interpretable socio-economic predictions from visual data. To address these limitations and maximize the potential of LVLMs, we introduce CityRiSE, a novel framework for Reasoning urban Socio-Economic status in LVLMs through pure reinforcement learning (RL). With carefully curated multi-modal data and verifiable reward design, our approach guides the LVLM to focus on semantically meaningful visual cues, enabling structured and goal-oriented reasoning for generalist socio-economic status prediction. Experiments demonstrate that CityRiSE with emergent reasoning process significantly outperforms existing baselines, improving both prediction accuracy and generalization across diverse urban contexts, particularly for prediction on unseen cities and unseen indicators. This work highlights the promise of combining RL and LVLMs for interpretable and generalist urban socio-economic sensing.
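A verifiable reward for ordinal socio-economic prediction could look like the toy function below. The exact reward design in the paper may differ; the linear partial credit and five-level scale are assumptions for illustration.

```python
def verifiable_reward(pred_level: int, true_level: int, n_levels: int = 5) -> float:
    """Toy verifiable reward for ordinal prediction: full credit for an
    exact match, linearly decaying partial credit with distance, and no
    reward for unparseable or out-of-range answers."""
    if not 0 <= pred_level < n_levels:
        return 0.0
    return max(0.0, 1.0 - abs(pred_level - true_level) / (n_levels - 1))

assert verifiable_reward(3, 3) == 1.0   # exact match
assert verifiable_reward(0, 4) == 0.0   # maximally wrong
```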
[312] GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, Xiaodan Liang
Main category: cs.CV
TL;DR: GRPO-Guard addresses systematic importance-ratio distribution shifts in GRPO-based reinforcement learning for flow-matching models, preventing implicit over-optimization through ratio normalization and gradient reweighting.
Details
Motivation: Current GRPO frameworks suffer from left-shifted and inconsistent importance-ratio distributions across timesteps, causing failure in constraining overconfident positive updates and leading to implicit over-optimization where proxy rewards increase but essential metrics like image quality deteriorate.Method: GRPO-Guard introduces ratio normalization to restore balanced importance ratios and a gradient reweighting strategy to equalize policy gradients over noise conditions, acting as a regulated clipping mechanism without heavy KL regularization.
Result: Extensive experiments on multiple diffusion backbones (SD3.5M, Flux.1-dev) and diverse proxy tasks show GRPO-Guard significantly reduces over-optimization while maintaining or improving generation quality.
Conclusion: GRPO-Guard provides a simple yet effective enhancement to GRPO frameworks that stabilizes optimization and mitigates implicit over-optimization through proper importance-ratio management and gradient balancing.
Abstract: Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
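The ratio-normalization idea can be sketched in isolation. This is a simplified, hedged reading: rescale importance ratios so their (detached) batch mean returns to 1 before the usual PPO clip, so the clip region constrains overconfident positive updates again. Per-timestep grouping and the gradient-reweighting term are omitted.

```python
import torch

def normalized_clip_objective(logp_new, logp_old, adv, eps: float = 0.2):
    """PPO-style clipped objective with a mean-restoring ratio
    normalization (sketch; not the authors' exact formulation)."""
    ratio = (logp_new - logp_old).exp()
    ratio = ratio / ratio.mean().detach().clamp_min(1e-8)   # restore mean ~ 1
    unclipped = ratio * adv
    clipped = ratio.clamp(1 - eps, 1 + eps) * adv
    return -torch.minimum(unclipped, clipped).mean()

loss = normalized_clip_objective(torch.randn(64), torch.randn(64), torch.randn(64))
```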
[313] Beyond Augmentation: Leveraging Inter-Instance Relation in Self-Supervised Representation Learning
Ali Javidani, Babak Nadjar Araabi, Mohammad Amin Sadeghi
Main category: cs.CV
TL;DR: A novel self-supervised learning method that integrates graph theory to capture both intra-instance variations and inter-instance relationships using KNN graphs and GNNs, achieving significant accuracy improvements on benchmark datasets.
Details
Motivation: Traditional self-supervised methods focus only on intra-instance variations from augmentations but overlook important inter-instance relationships, limiting their representation learning capabilities.Method: Constructs KNN graphs for teacher and student streams during pretraining, then uses GNNs for representation refinement through multi-hop message propagation to capture broader contextual relationships.
Result: Achieved accuracy improvements of 7.3% on CIFAR-10, 3.2% on ImageNet-100, and 1.0% on ImageNet-1K over state-of-the-art methods.
Conclusion: The graph-based mechanism effectively enhances self-supervised representation learning by capturing both intra-instance and inter-instance relationships, demonstrating superior performance across multiple datasets.
Abstract: This paper introduces a novel approach that integrates graph theory into self-supervised representation learning. Traditional methods focus on intra-instance variations generated by applying augmentations. However, they often overlook important inter-instance relationships. While our method retains the intra-instance property, it further captures inter-instance relationships by constructing k-nearest neighbor (KNN) graphs for both teacher and student streams during pretraining. In these graphs, nodes represent samples along with their latent representations. Edges encode the similarity between instances. Following pretraining, a representation refinement phase is performed. In this phase, Graph Neural Networks (GNNs) propagate messages not only among immediate neighbors but also across multiple hops, thereby enabling broader contextual integration. Experimental results on CIFAR-10, ImageNet-100, and ImageNet-1K demonstrate accuracy improvements of 7.3%, 3.2%, and 1.0%, respectively, over state-of-the-art methods. These results highlight the effectiveness of the proposed graph-based mechanism. The code is publicly available at https://github.com/alijavidani/SSL-GraphNNCLR.
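A minimal version of the graph step might look like this: build a cosine kNN adjacency over latent representations, then refine embeddings with a few hops of neighborhood averaging. The blending coefficient `alpha` and plain averaging are stand-ins for the learned GNN the paper uses.

```python
import torch
import torch.nn.functional as F

def knn_graph(z: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Cosine-similarity kNN over latents z: (N, D).
    Returns a row-normalized adjacency for message passing."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t()
    sim.fill_diagonal_(-float("inf"))               # exclude self from neighbors
    idx = sim.topk(k, dim=-1).indices
    adj = torch.zeros_like(sim).scatter_(1, idx, 1.0)
    return adj / k

def propagate(z, adj, hops: int = 2, alpha: float = 0.5):
    # Multi-hop refinement: blend each embedding with its neighbors' mean.
    for _ in range(hops):
        z = (1 - alpha) * z + alpha * (adj @ z)
    return z

z = torch.randn(128, 64)
z_refined = propagate(z, knn_graph(z))
```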
[314] Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction
Xu Zhang, Ruijie Quan, Wenguan Wang, Yi Yang
Main category: cs.CV
TL;DR: MindHier is a hierarchical fMRI-to-image reconstruction framework that uses scale-wise autoregressive modeling to reconstruct visual stimuli from brain signals more efficiently and accurately than diffusion-based methods.
Details
Motivation: Existing diffusion-based methods collapse hierarchical neural information by mapping fMRI to a single embedding, which is misaligned with the stage-dependent demands of image reconstruction.Method: Proposes a coarse-to-fine framework with three components: Hierarchical fMRI Encoder for multi-level embeddings, Hierarchy-to-Hierarchy Alignment with CLIP features, and Scale-Aware Neural Guidance for injecting embeddings at matching scales during autoregression.
Result: Achieves superior semantic fidelity, 4.67x faster inference, and more deterministic results than diffusion-based baselines on the NSD dataset.
Conclusion: MindHier provides an efficient and cognitively-aligned alternative to diffusion methods by enabling hierarchical reconstruction that mimics human visual perception.
Abstract: Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single high-level embedding, using it as fixed guidance throughout the entire generation process. However, this fixed guidance collapses hierarchical neural information and is misaligned with the stage-dependent demands of image reconstruction. In response, we propose MindHier, a coarse-to-fine fMRI-to-image reconstruction framework built on scale-wise autoregressive modeling. MindHier introduces three components: a Hierarchical fMRI Encoder to extract multi-level neural embeddings, a Hierarchy-to-Hierarchy Alignment scheme to enforce layer-wise correspondence with CLIP features, and a Scale-Aware Coarse-to-Fine Neural Guidance strategy to inject these embeddings into autoregression at matching scales. These designs make MindHier an efficient and cognitively-aligned alternative to diffusion-based methods by enabling a hierarchical reconstruction process that synthesizes global semantics before refining local details, akin to human visual perception. Extensive experiments on the NSD dataset show that MindHier achieves superior semantic fidelity, 4.67x faster inference, and more deterministic results than the diffusion-based baselines.
[315] GeoDiffusion: A Training-Free Framework for Accurate 3D Geometric Conditioning in Image Generation
Phillip Mueller, Talip Uenlue, Sebastian Schmidt, Marcel Kollovieh, Jiajie Fan, Stephan Guennemann, Lars Mikelsons
Main category: cs.CV
TL;DR: GeoDiffusion is a training-free framework for precise geometric control in image generation using 3D object priors and drag-based editing.
Details
Motivation: Traditional 3D editing is time-consuming and requires specialized skills, while current image-based generative methods lack accuracy in geometric conditioning.Method: Uses class-specific 3D objects as geometric priors, ensures viewpoint consistency through rendered reference images, and employs GeoDrag for drag-based editing.
Result: Enables precise geometric modifications across various iterative design workflows with improved accuracy and speed on geometry guidance tasks.
Conclusion: GeoDiffusion provides accurate and efficient geometric conditioning for image generation without requiring training.
Abstract: Precise geometric control in image generation is essential for engineering & product design and creative industries to control 3D object features accurately in image space. Traditional 3D editing approaches are time-consuming and demand specialized skills, while current image-based generative methods lack accuracy in geometric conditioning. To address these challenges, we propose GeoDiffusion, a training-free framework for accurate and efficient geometric conditioning of 3D features in image generation. GeoDiffusion employs a class-specific 3D object as a geometric prior to define keypoints and parametric correlations in 3D space. We ensure viewpoint consistency through a rendered image of a reference 3D object, followed by style transfer to meet user-defined appearance specifications. At the core of our framework is GeoDrag, improving accuracy and speed of drag-based image editing on geometry guidance tasks and general instructions on DragBench. Our results demonstrate that GeoDiffusion enables precise geometric modifications across various iterative design workflows.
[316] EndoSfM3D: Learning to 3D Reconstruct Any Endoscopic Surgery Scene using Self-supervised Foundation Model
Changhao Zhang, Matthew J. Clarkson, Mobarak I. Hoque
Main category: cs.CV
TL;DR: This paper presents a self-supervised method for 3D reconstruction in endoscopic surgery by integrating intrinsic parameter estimation into monocular depth estimation using Depth Anything V2 model with attention-based pose network and DoRA fine-tuning.
Details
Motivation: Accurate 3D reconstruction in endoscopic surgery is crucial for enhanced perception and decision-making, but current methods struggle with intrinsic parameter estimation due to sterility constraints and specialized endoscopes with continuous zoom and rotation.Method: Adapted Depth Anything V2 model for joint depth, pose, and intrinsics prediction, incorporating attention-based pose network and Weight-Decomposed Low-Rank Adaptation (DoRA) strategy for efficient fine-tuning.
Result: Superior performance on SCARED and C3VD datasets compared to state-of-the-art approaches in self-supervised monocular depth estimation and 3D reconstruction.
Conclusion: The proposed method successfully integrates intrinsic parameter estimation into endoscopic 3D reconstruction, addressing key challenges in surgical settings and demonstrating improved accuracy and reliability.
Abstract: 3D reconstruction of endoscopic surgery scenes plays a vital role in enhancing scene perception, enabling AR visualization, and supporting context-aware decision-making in image-guided surgery. A critical yet challenging step in this process is the accurate estimation of the endoscope’s intrinsic parameters. In real surgical settings, intrinsic calibration is hindered by sterility constraints and the use of specialized endoscopes with continuous zoom and telescope rotation. Most existing methods for endoscopic 3D reconstruction do not estimate intrinsic parameters, limiting their effectiveness for accurate and reliable reconstruction. In this paper, we integrate intrinsic parameter estimation into a self-supervised monocular depth estimation framework by adapting the Depth Anything V2 (DA2) model for joint depth, pose, and intrinsics prediction. We introduce an attention-based pose network and a Weight-Decomposed Low-Rank Adaptation (DoRA) strategy for efficient fine-tuning of DA2. Our method is validated on the SCARED and C3VD public datasets, demonstrating superior performance compared to recent state-of-the-art approaches in self-supervised monocular depth estimation and 3D reconstruction. Code and model weights can be found in project repository: https://github.com/MOYF-beta/EndoSfM3D.
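For readers unfamiliar with DoRA, a minimal linear-layer version conveys the decomposition: a frozen base weight, a low-rank update to its direction, and a learnable per-row magnitude. This follows the general DoRA recipe rather than the authors' exact integration into DA2.

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """Minimal DoRA-style layer: frozen weight, LoRA branch updating the
    direction, learnable magnitude rescaling each output row (sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.weight = base.weight.detach()                  # frozen (out, in)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.m = nn.Parameter(self.weight.norm(dim=1))      # per-row magnitude
    def forward(self, x):
        w = self.weight + self.B @ self.A                   # updated direction
        w = self.m.unsqueeze(1) * w / w.norm(dim=1, keepdim=True)
        return x @ w.t()

layer = DoRALinear(nn.Linear(64, 32))
out = layer(torch.randn(4, 64))                             # (4, 32)
```

Because only `A`, `B`, and `m` receive gradients, the trainable parameter count stays a small fraction of the frozen backbone, which is what makes this style of fine-tuning attractive for large foundation models.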
[317] T2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models
Jindong Yang, Han Fang, Weiming Zhang, Nenghai Yu, Kejiang Chen
Main category: cs.CV
TL;DR: T2SMark is a two-stage watermarking scheme for diffusion models that uses Tail-Truncated Sampling to balance watermark robustness with generation diversity.
Details
Motivation: Existing Noise-as-Watermark methods struggle to balance watermark robustness with generation diversity - some sacrifice diversity for robustness while others are too fragile for real-world use.Method: Proposes Tail-Truncated Sampling (TTS) that embeds bits exclusively in reliable tail regions while randomly sampling the central zone, plus a two-stage framework with session keys for encryption.
Result: Achieves optimal balance between robustness and diversity when evaluated on diffusion models with both U-Net and DiT backbones.
Conclusion: T2SMark effectively addresses the robustness-diversity tradeoff in diffusion model watermarking through its novel sampling approach and two-stage framework.
Abstract: Diffusion models have advanced rapidly in recent years, producing high-fidelity images while raising concerns about intellectual property protection and the misuse of generative AI. Image watermarking methods for diffusion models, particularly Noise-as-Watermark (NaW) methods, encode the watermark as a specific standard Gaussian noise vector used for image generation, embedding the information seamlessly while maintaining image quality. For detection, the generation process is inverted to recover the initial noise vector containing the watermark before extraction. However, existing NaW methods struggle to balance watermark robustness with generation diversity. Some methods achieve strong robustness by heavily constraining initial noise sampling, which degrades user experience, while others preserve diversity but prove too fragile for real-world deployment. To address this issue, we propose T2SMark, a two-stage watermarking scheme based on Tail-Truncated Sampling (TTS). Unlike prior methods that simply map bits to positive or negative values, TTS enhances robustness by embedding bits exclusively in the reliable tail regions while randomly sampling the central zone to preserve the latent distribution. Our two-stage framework then ensures sampling diversity by integrating a randomly generated session key into both encryption pipelines. We evaluate T2SMark on diffusion models with both U-Net and DiT backbones. Extensive experiments show that it achieves an optimal balance between robustness and diversity. Our code is available at https://github.com/0xD009/T2SMark.
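Tail-Truncated Sampling can be illustrated with inverse-CDF sampling of a standard normal restricted to its tails. The threshold `tau` and the bit convention below are assumptions for illustration, not the paper's parameters.

```python
import torch

def tail_truncated_sample(bits: torch.Tensor, tau: float = 1.5) -> torch.Tensor:
    """Sketch of Tail-Truncated Sampling: bit=1 -> sample the upper tail
    above tau, bit=0 -> the lower tail below -tau; positions with bit=-1
    carry no payload and are sampled freely to preserve the latent
    distribution. Decoding reads the sign of payload positions."""
    normal = torch.distributions.Normal(0.0, 1.0)
    phi_tau = normal.cdf(torch.tensor(tau))
    u = torch.rand_like(bits, dtype=torch.float32)
    # Inverse-CDF sampling restricted to the upper tail [tau, inf).
    tail = normal.icdf(phi_tau + u * (1 - phi_tau))
    x = torch.where(bits == 1, tail, -tail)          # mirror for bit = 0
    free = torch.randn_like(x)
    return torch.where(bits < 0, free, x)

bits = torch.tensor([1, 0, -1, 1])                   # -1 = unconstrained slot
z = tail_truncated_sample(bits)
```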
[318] Efficient Large-Deformation Medical Image Registration via Recurrent Dynamic Correlation
Tianran Li, Marius Staring, Yuchuan Qiao
Main category: cs.CV
TL;DR: A recurrent correlation-based framework for deformable image registration that dynamically relocates matching regions to efficiently handle large deformations while maintaining computational efficiency.
Details
Motivation: Deep learning methods for deformable image registration struggle with efficiently handling large deformations. Voxel-to-region matching is efficient but limited by locality, while region-to-region matching has high redundancy. There's a need for a method that can capture long-range correspondences without excessive computational cost.Method: Proposes a Recurrent Correlation-based framework that performs local matching at each step and dynamically relocates the search region based on estimated offsets. Uses a lightweight recurrent update module with memory capacity and decouples motion-related and texture features to reduce semantic redundancy.
Result: Achieves strong accuracy-computation trade-off, surpassing or matching state-of-the-art performance. On the non-affine OASIS dataset, achieves comparable performance using only 9.5% of FLOPs and running 96% faster than RDP method. Validated on brain MRI and abdominal CT datasets with and without affine pre-registration.
Conclusion: The proposed recurrent correlation framework efficiently handles large deformations in medical image registration by dynamically relocating matching regions, achieving excellent performance with significantly reduced computational requirements compared to existing methods.
Abstract: Deformable image registration estimates voxel-wise correspondences between images through spatial transformations, and plays a key role in medical imaging. While deep learning methods have significantly reduced runtime, efficiently handling large deformations remains a challenging task. Convolutional networks aggregate local features but lack direct modeling of voxel correspondences, prompting recent works to explore explicit feature matching. Among them, voxel-to-region matching is more efficient for direct correspondence modeling by computing local correlation features within neighbourhoods, while region-to-region matching incurs higher redundancy due to excessive correlation pairs across large regions. However, the inherent locality of voxel-to-region matching hinders the capture of long-range correspondences required for large deformations. To address this, we propose a Recurrent Correlation-based framework that dynamically relocates the matching region toward more promising positions. At each step, local matching is performed with low cost, and the estimated offset guides the next search region, supporting efficient convergence toward large deformations. In addition, we use a lightweight recurrent update module with memory capacity and decouple motion-related and texture features to suppress semantic redundancy. We conduct extensive experiments on brain MRI and abdominal CT datasets under two settings: with and without affine pre-registration. Results show that our method exhibits a strong accuracy-computation trade-off, surpassing or matching the state-of-the-art performance. For example, it achieves comparable performance on the non-affine OASIS dataset, while using only 9.5% of the FLOPs and running 96% faster than RDP, a representative high-performing method.
[319] A Fully Interpretable Statistical Approach for Roadside LiDAR Background Subtraction
Aitor Iglesias, Nerea Aranjuelo, Patricia Javierre, Ainhoa Menendez, Ignacio Arganda-Carreras, Marcos Nieto
Main category: cs.CV
TL;DR: A fully interpretable statistical method for background subtraction in roadside LiDAR data using Gaussian distribution grid and filtering algorithm to classify points as foreground/background.
Details
Motivation: To enhance infrastructure-based perception in automated driving by providing an interpretable and flexible background subtraction method for roadside LiDAR data.Method: Uses Gaussian distribution grid (GDG) to model spatial statistics of background from background-only scans, combined with a filtering algorithm for point classification. Supports diverse LiDAR types including multiline 360 degree and MEMS sensors.
Result: Outperforms state-of-the-art techniques in accuracy and flexibility on RCooper dataset, works well with minimal background data, and runs efficiently on low-resource hardware.
Conclusion: The method enables scalable real-world deployment for automated driving infrastructure with reliable performance and broad sensor compatibility.
Abstract: We present a fully interpretable and flexible statistical method for background subtraction in roadside LiDAR data, aimed at enhancing infrastructure-based perception in automated driving. Our approach introduces both a Gaussian distribution grid (GDG), which models the spatial statistics of the background using background-only scans, and a filtering algorithm that uses this representation to classify LiDAR points as foreground or background. The method supports diverse LiDAR types, including multiline 360 degree and micro-electro-mechanical systems (MEMS) sensors, and adapts to various configurations. Evaluated on the publicly available RCooper dataset, it outperforms state-of-the-art techniques in accuracy and flexibility, even with minimal background data. Its efficient implementation ensures reliable performance on low-resource hardware, enabling scalable real-world deployment.
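The statistical model is simple enough to sketch end to end: fit per-cell Gaussian statistics of a range-like feature from background-only scans, then flag points deviating by more than k standard deviations. The cell indexing and single-feature choice here are simplifications of the paper's grid.

```python
import numpy as np

class GaussianGrid:
    """Toy Gaussian-distribution-grid background model: per grid cell,
    fit mean/std of a feature (e.g. range) from background-only scans,
    then flag points that deviate by more than k sigma as foreground."""
    def __init__(self, n_cells: int, k: float = 3.0):
        self.mu = np.zeros(n_cells)
        self.sigma = np.ones(n_cells)
        self.k = k
    def fit(self, cell_idx, ranges):
        for c in np.unique(cell_idx):
            r = ranges[cell_idx == c]
            self.mu[c], self.sigma[c] = r.mean(), r.std() + 1e-6
    def is_foreground(self, cell_idx, ranges):
        z = np.abs(ranges - self.mu[cell_idx]) / self.sigma[cell_idx]
        return z > self.k

grid = GaussianGrid(n_cells=360)
bg_cells = np.random.randint(0, 360, 10000)
grid.fit(bg_cells, 50 + np.random.randn(10000))                 # background scans
mask = grid.is_foreground(bg_cells[:100], 30 + np.random.randn(100))  # new points
```

The appeal of this formulation is that every decision is traceable to a per-cell mean, standard deviation, and threshold, which is what makes the method fully interpretable.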
[320] Top-Down Semantic Refinement for Image Captioning
Jusheng Zhang, Kaitong Cai, Jing Yang, Jian Wang, Chengpei Tang, Keze Wang
Main category: cs.CV
TL;DR: TDSR reframes image captioning as hierarchical planning using MCTS to improve coherence and detail in VLMs while reducing computational costs.
Details
Motivation: VLMs struggle with maintaining global narrative coherence while capturing rich details in image captioning due to their single-step generation approach.Method: Proposes TDSR framework that models captioning as MDP and uses efficient MCTS with visual-guided parallel expansion and lightweight value network to reduce VLM calls.
Result: Achieves state-of-the-art results on DetailCaps, COMPOSITIONCAP, and POPE benchmarks, significantly enhancing existing VLMs with minimal computational overhead.
Conclusion: TDSR effectively addresses the coherence-detail tradeoff in VLMs through hierarchical planning and efficient search, demonstrating strong performance across multiple captioning tasks.
Abstract: Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem, and further propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. By incorporating a visual-guided parallel expansion and a lightweight value network, our TDSR reduces the call frequency to the expensive VLM by an order of magnitude without sacrificing planning quality. Furthermore, an adaptive early stopping mechanism dynamically matches computational overhead to the image’s complexity. Extensive experiments on multiple benchmarks, including DetailCaps, COMPOSITIONCAP, and POPE, demonstrate that our TDSR, as a plug-and-play module, can significantly enhance the performance of existing VLMs (e.g., LLaVA-1.5, Qwen2.5-VL) by achieving state-of-the-art or highly competitive results in fine-grained description, compositional generalization, and hallucination suppression.
[321] 3D Roadway Scene Object Detection with LIDARs in Snowfall Conditions
Ghazal Farhani, Taufiq Rahman, Syed Mostaquim Ali, Andrew Liu, Mohamed Zaki, Dominique Charlebois, Benoit Anctil
Main category: cs.CV
TL;DR: This paper analyzes LiDAR performance degradation in snowy conditions, develops a physics-based model to simulate signal attenuation from snowfall, and evaluates its impact on object detection models.
Details
Motivation: LiDAR sensors provide excellent situational awareness for autonomous driving but suffer significant performance degradation in adverse weather like snowfall, which hasn't been sufficiently quantified. This poses safety risks for autonomous systems relying on LiDAR-based perception.Method: Developed a physics-based model to examine LiDAR failure modes in snow, investigating signal attenuation with different snowfall rates and snow particle reflection effects. Used the model to transform clear-weather data into synthetic snowy scenarios for comparison with real snowy conditions.
Result: The study successfully created synthetic data representing various snowfall rates and demonstrated how snow particles near the LiDAR source act as efficient reflectors. This enabled systematic analysis of LiDAR performance degradation under different snow conditions.
Conclusion: The physics-based model effectively simulates LiDAR performance in snowy conditions, providing a framework to quantify signal degradation and assess its impact on object detection models, which is crucial for improving autonomous driving system reliability in adverse weather.
Abstract: Because the 3D structure of a roadway environment can be characterized directly by Light Detection and Ranging (LiDAR) sensors, they can be used to obtain exceptional situational awareness for assistive and autonomous driving systems. Although LiDARs demonstrate good performance in clean and clear weather conditions, their performance significantly deteriorates in adverse weather conditions such as those involving atmospheric precipitation. This may render the perception capabilities of autonomous systems that use LiDAR data in learning-based models for object detection and ranging ineffective. While efforts have been made to enhance the accuracy of these models, the extent of signal degradation under various weather conditions remains largely unquantified. In this study, we focus on the performance of an automotive-grade LiDAR in snowy conditions in order to develop a physics-based model that examines failure modes of a LiDAR sensor. Specifically, we investigated how the LiDAR signal attenuates with different snowfall rates and how snow particles near the source serve as small but efficient reflectors. Utilizing our model, we transform data from clear conditions to simulate snowy scenarios, enabling a comparison of our synthetic data with actual snowy conditions. Furthermore, we employ this synthetic data, representative of different snowfall rates, to explore the impact on a pre-trained object detection model, assessing its performance under varying levels of snowfall.
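The attenuation component of such a physics-based model is often a Beer-Lambert law with an extinction coefficient that grows with snowfall rate. The sketch below uses that standard form with illustrative constants `c` and `d`, not the paper's fitted parameters.

```python
import numpy as np

def attenuated_intensity(intensity, rng_m, snow_rate_mm_h, c=0.02, d=0.7):
    """Toy Beer-Lambert snowfall attenuation: the extinction coefficient
    follows a power law in snowfall rate (c, d are illustrative), and
    the return decays as exp(-2 * alpha * range) for the round trip."""
    alpha = c * snow_rate_mm_h ** d           # extinction coefficient [1/m]
    return intensity * np.exp(-2.0 * alpha * rng_m)

# Transform clear-weather returns into a synthetic 10 mm/h snowfall scene.
rng_m = np.random.uniform(1, 100, size=5000)         # point ranges [m]
clear_i = np.random.uniform(0.2, 1.0, size=5000)     # clear-weather intensity
snow_i = attenuated_intensity(clear_i, rng_m, snow_rate_mm_h=10.0)
```

A full simulator would also inject near-range returns from snow particles acting as reflectors and drop points whose attenuated intensity falls below the detector threshold.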
[322] Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents
Vijay Veerabadran, Fanyi Xiao, Nitin Kamra, Pedro Matias, Joy Chen, Caley Drooff, Brett D Roads, Riley Williams, Ethan Henderson, Xuanyi Zhao, Kevin Carlberg, Joseph Tighe, Karl Ridgeway
Main category: cs.CV
TL;DR: WAGIBench is a new benchmark for goal inference using vision-language models, featuring 29 hours of multimodal data from 348 participants. Human performance (93%) significantly exceeds the best VLM (84%), and current models produce relevant goals only 55% of the time.
Details
Motivation: To address the goal inference problem for assistive wearable agents by creating a strong benchmark that can measure progress in inferring user goals from multimodal contextual observations, eliminating the need for explicit user interaction.Method: Created WAGIBench with 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals with visual, audio, digital, and longitudinal contextual observations. Evaluated several families of modern vision-language models through multiple-choice and generative benchmarks.
Result: Human performance achieved 93% multiple-choice accuracy vs 84% for the best-performing VLM. Generative benchmarks show larger models perform better but produce relevant goals only 55% of the time. Modality ablation shows models benefit from relevant modalities with minimal degradation from irrelevant ones.
Conclusion: Current vision-language models remain far from practical usefulness for goal inference in assistive wearable agents, with significant room for improvement needed to approach human-level performance.
Abstract: There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) who take assistive actions toward a user’s goal/query (e.g. “Where did I leave my keys?”). In this work, we consider the important complementary problem of inferring that goal from multi-modal contextual observations. Solving this “goal inference” problem holds the promise of eliminating the effort needed to interact with such an agent. This work focuses on creating WAGIBench, a strong benchmark to measure progress in solving this problem using vision-language models (VLMs). Given the limited prior work in this area, we collected a novel dataset comprising 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We validate that human performance exceeds model performance, achieving 93% multiple-choice accuracy compared with 84% for the best-performing VLM. Generative benchmark results that evaluate several families of modern vision-language models show that larger models perform significantly better on the task, yet remain far from practical usefulness, as they produce relevant goals only 55% of the time. Through a modality ablation, we show that models benefit from extra information in relevant modalities with minimal performance degradation from irrelevant modalities.
[323] SemiETPicker: Fast and Label-Efficient Particle Picking for CryoET Tomography Using Semi-Supervised Learning
Linhan Wang, Jianwen Dou, Wang Li, Shengkun Wang, Zhiwu Xie, Chang-Tien Lu, Yinlin Chen
Main category: cs.CV
TL;DR: A semi-supervised framework for particle picking in CryoET that uses heatmap-supervised detection and teacher-student co-training to improve performance with limited labeled data.
Details
Motivation: Particle picking in CryoET is a major bottleneck due to reliance on manual labels, leaving most tomograms unlabeled and underutilized.Method: End-to-end heatmap-supervised detection model inspired by keypoint detection, teacher-student co-training, multi-view pseudo-labeling, and CryoET-specific DropBlock augmentation.
Result: Improves F1 score by 10% over supervised baselines on the CZII dataset.
Conclusion: Semi-supervised learning effectively leverages unlabeled CryoET data to overcome labeling bottlenecks.
Abstract: Cryogenic Electron Tomography (CryoET) combined with sub-volume averaging (SVA) is the only imaging modality capable of resolving protein structures inside cells at molecular resolution. Particle picking, the task of localizing and classifying target proteins in 3D CryoET volumes, remains the main bottleneck. Due to the reliance on time-consuming manual labels, the vast reserve of unlabeled tomograms remains underutilized. In this work, we present a fast, label-efficient semi-supervised framework that exploits this untapped data. Our framework consists of two components: (i) an end-to-end heatmap-supervised detection model inspired by keypoint detection, and (ii) a teacher-student co-training mechanism that enhances performance under sparse labeling conditions. Furthermore, we introduce multi-view pseudo-labeling and a CryoET-specific DropBlock augmentation strategy to further boost performance. Extensive evaluations on the large-scale CZII dataset show that our approach improves F1 by 10% over supervised baselines, underscoring the promise of semi-supervised learning for leveraging unlabeled CryoET data.
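The teacher-student mechanism can be summarized in a few lines: the teacher is an EMA copy of the student and produces confidence-thresholded pseudo-heatmaps on unlabeled volumes. The threshold and the Conv3d stand-in model are illustrative; multi-view pseudo-label fusion and DropBlock are omitted.

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, m: float = 0.999):
    """Teacher tracks an exponential moving average of the student."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

@torch.no_grad()
def pseudo_label(teacher, volume, thresh: float = 0.7):
    """Keep only confident heatmap responses as pseudo-supervision."""
    heat = teacher(volume).sigmoid()          # predicted particle heatmap
    mask = heat > thresh
    return heat * mask, mask

student = torch.nn.Conv3d(1, 1, 3, padding=1)   # stand-in for the detector
teacher = copy.deepcopy(student)
heat, mask = pseudo_label(teacher, torch.randn(1, 1, 32, 32, 32))
ema_update(teacher, student)
```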
[324] DynaPose4D: High-Quality 4D Dynamic Content Generation via Pose Alignment Loss
Jing Yang, Yufeng Yang
Main category: cs.CV
TL;DR: DynaPose4D is a novel framework that generates high-quality 4D dynamic content from single static images by integrating 4D Gaussian Splatting with Category-Agnostic Pose Estimation, achieving excellent motion coherence and fluidity.
Details
Motivation: Existing 2D and 3D generative models struggle with generating high-quality 4D dynamic content from single static images, particularly in modeling temporal dependencies and capturing dynamic geometry changes with camera perspective variations.Method: The framework uses 3D Gaussian Splatting to construct 3D models from single images, then predicts multi-view pose keypoints based on one-shot support from a chosen view, leveraging supervisory signals to enhance motion consistency.
Result: Experimental results demonstrate that DynaPose4D achieves excellent coherence, consistency, and fluidity in dynamic motion generation.
Conclusion: The findings validate DynaPose4D’s efficacy and indicate its potential applications in computer vision and animation production domains.
Abstract: Recent advancements in 2D and 3D generative models have expanded the capabilities of computer vision. However, generating high-quality 4D dynamic content from a single static image remains a significant challenge. Traditional methods have limitations in modeling temporal dependencies and accurately capturing dynamic geometry changes, especially when considering variations in camera perspective. To address this issue, we propose DynaPose4D, an innovative solution that integrates 4D Gaussian Splatting (4DGS) techniques with Category-Agnostic Pose Estimation (CAPE) technology. This framework uses 3D Gaussian Splatting to construct a 3D model from single images, then predicts multi-view pose keypoints based on one-shot support from a chosen view, leveraging supervisory signals to enhance motion consistency. Experimental results show that DynaPose4D achieves excellent coherence, consistency, and fluidity in dynamic motion generation. These findings not only validate the efficacy of the DynaPose4D framework but also indicate its potential applications in the domains of computer vision and animation production.
[325] Single-Teacher View Augmentation: Boosting Knowledge Distillation via Angular Diversity
Seonghoon Yu, Dongjun Nam, Dina Katabi, Jeany Son
Main category: cs.CV
TL;DR: A novel knowledge distillation method that generates diverse multi-views from a single teacher using angular diversity objectives, achieving better performance than existing methods without requiring multiple teachers.
Details
Motivation: Current knowledge distillation methods that leverage diverse teacher perspectives require multiple teacher networks, leading to high computational costs. There's a need for cost-efficient ways to achieve teacher diversity.Method: Attach multiple branches to a single teacher to generate diverse multi-views. Use two angular diversity objectives: constrained inter-angle diversify loss (maximizes angles between views while preserving proximity to original output) and intra-angle diversify loss (encourages even distribution of views around original output). Ensemble knowledge from these views for distillation.
Result: The method surpasses existing knowledge augmentation methods across diverse configurations. It’s compatible with other KD frameworks in plug-and-play fashion and provides consistent improvements in generalization performance.
Conclusion: The proposed angular diversity objectives effectively increase ensemble diversity and reduce the upper bound of expected loss, leading to more effective knowledge distillation from a single teacher.
Abstract: Knowledge Distillation (KD) aims to train a lightweight student model by transferring knowledge from a large, high-capacity teacher. Recent studies have shown that leveraging diverse teacher perspectives can significantly improve distillation performance; however, achieving such diversity typically requires multiple teacher networks, leading to high computational costs. In this work, we propose a novel cost-efficient knowledge augmentation method for KD that generates diverse multi-views by attaching multiple branches to a single teacher. To ensure meaningful semantic variation across multi-views, we introduce two angular diversity objectives: 1) constrained inter-angle diversify loss, which maximizes angles between augmented views while preserving proximity to the original teacher output, and 2) intra-angle diversify loss, which encourages an even distribution of views around the original output. The ensembled knowledge from these angularly diverse views, along with the original teacher, is distilled into the student. We further theoretically demonstrate that our objectives increase the diversity among ensemble members and thereby reduce the upper bound of the ensemble’s expected loss, leading to more effective distillation. Experimental results show that our method surpasses an existing knowledge augmentation method across diverse configurations. Moreover, the proposed method is compatible with other KD frameworks in a plug-and-play fashion, providing consistent improvements in generalization performance.
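The inter-angle objective can be sketched as follows: normalize each view's offset from the teacher output, penalize pairwise cosine similarity between offsets (pushing angles apart), and keep an L2 proximity anchor to the teacher. The intra-angle term and the exact weighting are omitted; names and the `margin` weight are illustrative.

```python
import torch
import torch.nn.functional as F

def inter_angle_loss(views, teacher_out, margin: float = 0.1):
    """views: (V, B, C) logits from V branches; teacher_out: (B, C).
    Sketch of a constrained inter-angle diversify loss."""
    offsets = F.normalize(views - teacher_out.unsqueeze(0), dim=-1)  # (V, B, C)
    cos = torch.einsum('vbc,wbc->vwb', offsets, offsets)             # pairwise cos
    v = views.size(0)
    off_diag = ~torch.eye(v, dtype=torch.bool, device=views.device)
    angle_term = cos[off_diag].mean()            # minimizing this widens angles
    proximity = (views - teacher_out).pow(2).mean()  # stay near teacher output
    return angle_term + margin * proximity

loss = inter_angle_loss(torch.randn(4, 8, 10), torch.randn(8, 10))
```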
[326] GateFuseNet: An Adaptive 3D Multimodal Neuroimaging Fusion Network for Parkinson’s Disease Diagnosis
Rui Jin, Chen Chen, Yin Liu, Hongfu Sun, Min Zeng, Min Li, Yang Gao
Main category: cs.CV
TL;DR: GateFuseNet is a 3D multimodal fusion network that integrates QSM and T1w MRI images for Parkinson’s disease diagnosis, using gated fusion modules to selectively enhance relevant features and suppress irrelevant signals.
Details
Motivation: Current PD diagnosis methods rely on conventional magnitude-based MRI (T1w) which are less sensitive to PD pathology than QSM that quantifies iron deposition. There's a need for better multimodal integration to improve diagnostic accuracy.Method: Proposed GateFuseNet with adaptive 3D multimodal fusion using gated fusion modules that learn modality-specific attention weights and channel-wise gating vectors for selective feature modulation and hierarchical gating mechanism.
Result: Outperformed three state-of-the-art approaches with 85.00% accuracy and 92.06% AUC. Ablation studies validated ROI guidance, multimodal integration, and fusion positioning contributions. Grad-CAM confirmed focus on clinically relevant regions.
Conclusion: GateFuseNet effectively integrates QSM and T1w MRI for improved PD diagnosis through adaptive multimodal fusion, demonstrating superior performance and clinically relevant feature focus.
Abstract: Accurate diagnosis of Parkinson’s disease (PD) from MRI remains challenging due to symptom variability and pathological heterogeneity. Most existing methods rely on conventional magnitude-based MRI modalities, such as T1-weighted images (T1w), which are less sensitive to PD pathology than Quantitative Susceptibility Mapping (QSM), a phase-based MRI technique that quantifies iron deposition in deep gray matter nuclei. In this study, we propose GateFuseNet, an adaptive 3D multimodal fusion network that integrates QSM and T1w images for PD diagnosis. The core innovation lies in a gated fusion module that learns modality-specific attention weights and channel-wise gating vectors for selective feature modulation. This hierarchical gating mechanism enhances ROI-aware features while suppressing irrelevant signals. Experimental results show that our method outperforms three existing state-of-the-art approaches, achieving 85.00% accuracy and 92.06% AUC. Ablation studies further validate the contributions of ROI guidance, multimodal integration, and fusion positioning. Grad-CAM visualizations confirm the model’s focus on clinically relevant pathological regions. The source codes and pretrained models can be found at https://github.com/YangGaoUQ/GateFuseNet
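A minimal gated fusion block in this spirit: squeeze both modalities into channel-wise gates that arbitrate between QSM and T1w features, plus a fused projection. Channel counts and the exact gating topology are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sketch of a gated fusion block: channel-wise gates decide how much
    of each modality's features (QSM vs. T1w) passes into the fused stream."""
    def __init__(self, ch: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Conv3d(2 * ch, ch, 1), nn.Sigmoid())
        self.proj = nn.Conv3d(2 * ch, ch, 1)
    def forward(self, qsm_f, t1w_f):            # each (B, ch, D, H, W)
        cat = torch.cat([qsm_f, t1w_f], dim=1)
        g = self.gate(cat)                       # (B, ch, 1, 1, 1) channel gates
        return g * qsm_f + (1 - g) * t1w_f + self.proj(cat)

f = GatedFusion(16)(torch.randn(1, 16, 8, 8, 8), torch.randn(1, 16, 8, 8, 8))
```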
[327] Open Multimodal Retrieval-Augmented Factual Image Generation
Yang Tian, Fan Liu, Jingyuan Zhang, Wei Bi, Yupeng Hu, Liqiang Nie
Main category: cs.CV
TL;DR: ORIG is an agentic open multimodal retrieval-augmented framework that improves factual image generation by iteratively retrieving and filtering web evidence to create enriched prompts, addressing the issue of knowledge contradictions in Large Multimodal Models.
Details
Motivation: Large Multimodal Models often generate images that contradict verifiable knowledge, especially for fine-grained attributes or time-sensitive events, and conventional retrieval-augmented approaches fail due to static sources and shallow evidence integration.Method: ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates refined knowledge into enriched prompts to guide generation for factual image generation.
Result: Experiments show ORIG substantially improves factual consistency and overall image quality over strong baselines, with evaluation conducted on FIG-Eval benchmark spanning ten categories across perceptual, compositional, and temporal dimensions.
Conclusion: ORIG demonstrates the potential of open multimodal retrieval for factual image generation, effectively bridging the gap between visual realism and factual grounding in generated images.
Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress in generating photorealistic and prompt-aligned images, but they often produce outputs that contradict verifiable knowledge, especially when prompts involve fine-grained attributes or time-sensitive events. Conventional retrieval-augmented approaches attempt to address this issue by introducing external information, yet they are fundamentally incapable of grounding generation in accurate and evolving knowledge due to their reliance on static sources and shallow evidence integration. To bridge this gap, we introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG), a new task that requires both visual realism and factual grounding. ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts to guide generation. To support systematic evaluation, we build FIG-Eval, a benchmark spanning ten categories across perceptual, compositional, and temporal dimensions. Experiments demonstrate that ORIG substantially improves factual consistency and overall image quality over strong baselines, highlighting the potential of open multimodal retrieval for factual image generation.
[328] AesCrop: Aesthetic-driven Cropping Guided by Composition
Yen-Hong Wong, Lai-Kuan Wong
Main category: cs.CV
TL;DR: AesCrop is a composition-aware hybrid image cropping model that integrates VMamba encoder with Mamba Composition Attention Bias and transformer decoder to generate multiple crops with quality scores, outperforming state-of-the-art methods.
Details
Motivation: Existing hybrid image cropping methods lack explicit photographic composition guidance, which is crucial for visual appeal in applications like view recommendation and thumbnail generation.Method: AesCrop uses VMamba image encoder with novel Mamba Composition Attention Bias (MCAB) and transformer decoder for end-to-end rank-based image cropping, explicitly encoding compositional cues into attention mechanism.
Result: Extensive experiments show AesCrop outperforms current state-of-the-art methods, delivering superior quantitative metrics and qualitatively more pleasing crops.
Conclusion: The proposed AesCrop model successfully bridges the gap between evaluation-based and regression-based methods by incorporating explicit compositional guidance, achieving better diversity and globality in image cropping.
Abstract: Aesthetic-driven image cropping is crucial for applications like view recommendation and thumbnail generation, where visual appeal significantly impacts user engagement. A key factor in visual appeal is composition–the deliberate arrangement of elements within an image. Some methods have successfully incorporated compositional knowledge through evaluation-based and regression-based paradigms. However, evaluation-based methods lack globality while regression-based methods lack diversity. Recently, hybrid approaches that integrate both paradigms have emerged, bridging the gap between these two to achieve better diversity and globality. Notably, existing hybrid methods do not incorporate photographic composition guidance, a key attribute that defines photographic aesthetics. In this work, we introduce AesCrop, a composition-aware hybrid image-cropping model that integrates a VMamba image encoder, augmented with a novel Mamba Composition Attention Bias (MCAB) and a transformer decoder to perform end-to-end rank-based image cropping, generating multiple crops along with the corresponding quality scores. By explicitly encoding compositional cues into the attention mechanism, MCAB directs AesCrop to focus on the most compositionally salient regions. Extensive experiments demonstrate that AesCrop outperforms current state-of-the-art methods, delivering superior quantitative metrics and qualitatively more pleasing crops.
[329] Bag-of-Word-Groups (BoWG): A Robust and Efficient Loop Closure Detection Method Under Perceptual Aliasing
Xiang Fei, Tina Tian, Howie Choset, Lu Li
Main category: cs.CV
TL;DR: BoWG is a novel loop closure detection method that uses word groups to capture spatial co-occurrence of visual words, incorporates temporal consistency, and achieves superior precision-recall with high computational efficiency.
Details
Motivation: Conventional loop closure methods struggle in perceptually aliased environments like narrow pipes due to vector quantization, feature sparsity, and repetitive textures, while existing solutions often have high computational costs.Method: Introduces word groups to capture spatial co-occurrence and proximity of visual words, constructs online dictionary, incorporates temporal consistency with adaptive scheme, adds feature distribution analysis and post-verification mechanisms.
Result: BoWG surpasses state-of-the-art methods in precision-recall and computational efficiency, achieving average processing time of 16 ms per image across 17,565 images in Bicocca25b dataset.
Conclusion: The method demonstrates superior performance in challenging environments, excellent scalability, and computational efficiency compared to both traditional and learning-based approaches.
Abstract: Loop closure is critical in Simultaneous Localization and Mapping (SLAM) systems to reduce accumulative drift and ensure global mapping consistency. However, conventional methods struggle in perceptually aliased environments, such as narrow pipes, due to vector quantization, feature sparsity, and repetitive textures, while existing solutions often incur high computational costs. This paper presents Bag-of-Word-Groups (BoWG), a novel loop closure detection method that achieves superior precision-recall, robustness, and computational efficiency. The core innovation lies in the introduction of word groups, which capture the spatial co-occurrence and proximity of visual words and are used to construct an online dictionary. Additionally, drawing inspiration from probabilistic transition models, we incorporate temporal consistency directly into similarity computation with an adaptive scheme, substantially improving precision-recall performance. The method is further strengthened by a feature distribution analysis module and dedicated post-verification mechanisms. To evaluate the effectiveness of our method, we conduct experiments on both public datasets and a confined-pipe dataset we constructed. Results demonstrate that BoWG surpasses state-of-the-art methods, including both traditional and learning-based approaches, in terms of precision-recall and computational efficiency. Our approach also exhibits excellent scalability, achieving an average processing time of 16 ms per image across 17,565 images in the Bicocca25b dataset.
[330] SRSR: Enhancing Semantic Accuracy in Real-World Image Super-Resolution with Spatially Re-Focused Text-Conditioning
Chen Chen, Majid Abdolshah, Violetta Shevchenko, Hongdong Li, Chang Xu, Pulak Purkait
Main category: cs.CV
TL;DR: Proposes SRSR framework with SRCA and STCFG to address semantic ambiguities in diffusion-based super-resolution by refining text conditioning and preventing hallucinations.
Details
Motivation: Existing diffusion-based super-resolution methods suffer from semantic ambiguities due to inaccurate text conditioning and cross-attention diversion to irrelevant pixels, leading to semantic misalignment and hallucinated details.Method: Two core components: Spatially Re-focused Cross-Attention (SRCA) uses visually-grounded segmentation masks to guide cross-attention at inference time, and Spatially Targeted Classifier-Free Guidance (STCFG) selectively bypasses text influences on ungrounded pixels.
Result: Outperforms seven state-of-the-art baselines in fidelity metrics (PSNR and SSIM) across all datasets, and in perceptual quality measures (LPIPS and DISTS) on real-world benchmarks.
Conclusion: SRSR effectively achieves both high semantic fidelity and perceptual quality in super-resolution through its plug-and-play framework.
Abstract: Existing diffusion-based super-resolution approaches often exhibit semantic ambiguities due to inaccuracies and incompleteness in their text conditioning, coupled with the inherent tendency for cross-attention to divert towards irrelevant pixels. These limitations can lead to semantic misalignment and hallucinated details in the generated high-resolution outputs. To address these, we propose a novel, plug-and-play spatially re-focused super-resolution (SRSR) framework that consists of two core components: first, we introduce Spatially Re-focused Cross-Attention (SRCA), which refines text conditioning at inference time by applying visually-grounded segmentation masks to guide cross-attention. Second, we introduce a Spatially Targeted Classifier-Free Guidance (STCFG) mechanism that selectively bypasses text influences on ungrounded pixels to prevent hallucinations. Extensive experiments on both synthetic and real-world datasets demonstrate that SRSR consistently outperforms seven state-of-the-art baselines in standard fidelity metrics (PSNR and SSIM) across all datasets, and in perceptual quality measures (LPIPS and DISTS) on two real-world benchmarks, underscoring its effectiveness in achieving both high semantic fidelity and perceptual quality in super-resolution.
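As a hedged sketch of the SRCA idea, the snippet below masks cross-attention logits so that each latent pixel can attend only to text tokens that a segmentation model grounded at that pixel; the shapes, names, and hard masking scheme are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def spatially_refocused_cross_attention(q, k, v, token_masks):
    """Illustrative masked cross-attention in the spirit of SRCA.

    q:           (B, N, D) pixel queries (N = H*W latent pixels)
    k, v:        (B, T, D) text-token keys/values
    token_masks: (B, T, N) 1 where a token is visually grounded at a pixel
                 (e.g., from segmentation masks), 0 elsewhere.
    """
    logits = torch.einsum('bnd,btd->bnt', q, k) / q.shape[-1] ** 0.5
    # Suppress attention from each pixel to tokens not grounded at that pixel.
    logits = logits.masked_fill(token_masks.transpose(1, 2) == 0, float('-inf'))
    attn = torch.nan_to_num(F.softmax(logits, dim=-1))  # rows with no grounded token -> 0
    return torch.einsum('bnt,btd->bnd', attn, v)

q = torch.randn(1, 64, 32)                 # 8x8 latent pixels
k = v = torch.randn(1, 5, 32)              # 5 text tokens
masks = torch.randint(0, 2, (1, 5, 64))    # toy grounding masks
out = spatially_refocused_cross_attention(q, k, v, masks)
```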
[331] MELDAE: A Framework for Micro-Expression Spotting, Detection, and Automatic Evaluation in In-the-Wild Conversational Scenes
Yigui Feng, Qinglin Wang, Yang Liu, Ke Liu, Haotian Mo, Enhao Huang, Gencheng Liu, Mingzhe Liu, Jie Liu
Main category: cs.CV
TL;DR: A new micro-expression dataset for conversational-in-the-wild scenarios, an end-to-end localization and detection framework (MELDAE), and a boundary-aware loss function that improves temporal accuracy by penalizing onset and offset errors.
Details
Motivation: Existing micro-expression analysis research relies on controlled lab datasets and performs poorly in real-world scenarios like natural conversations, highlighting the need for better methods that work in wild conditions.Method: Proposed MELDAE framework with three key contributions: first conversational-in-the-wild micro-expression dataset, end-to-end localization and detection framework, and novel boundary-aware loss function for improved temporal accuracy.
Result: Achieves state-of-the-art results on WDMD dataset with 17.72% improvement in F1_DR localization metric over strongest baseline, while showing excellent generalization on existing benchmarks.
Conclusion: The proposed framework effectively addresses micro-expression analysis in wild conversational scenarios through novel dataset, architecture, and loss function, demonstrating significant performance improvements and strong generalization capabilities.
Abstract: Accurately analyzing spontaneous, unconscious micro-expressions is crucial for revealing true human emotions, but this task remains challenging in wild scenarios, such as natural conversation. Existing research largely relies on datasets from controlled laboratory environments, and their performance degrades dramatically in the real world. To address this issue, we propose three contributions: the first micro-expression dataset focused on conversational-in-the-wild scenarios; an end-to-end localization and detection framework, MELDAE; and a novel boundary-aware loss function that improves temporal accuracy by penalizing onset and offset errors. Extensive experiments demonstrate that our framework achieves state-of-the-art results on the WDMD dataset, improving the key F1_{DR} localization metric by 17.72% over the strongest baseline, while also demonstrating excellent generalization capabilities on existing benchmarks.
[332] From Pixels to Views: Learning Angular-Aware and Physics-Consistent Representations for Light Field Microscopy
Feng He, Guodong Tan, Qiankun Li, Jun Yu, Quan Wen
Main category: cs.CV
TL;DR: The paper introduces a new benchmark dataset and self-supervised learning method for 3D reconstruction in light field microscopy, achieving 7.7% PSNR improvement over state-of-the-art methods.
Details
Motivation: Light field microscopy is important for neuroscience but faces challenges in 3D reconstruction due to lack of standardized datasets and methods that can efficiently model angular-spatial structure while being physically grounded.Method: Three key contributions: XLFM-Zebrafish benchmark dataset, Masked View Modeling for Light Fields (self-supervised angular prior learning), and Optical Rendering Consistency Loss (differentiable rendering constraint).
Result: On the XLFM-Zebrafish benchmark, the method improves PSNR by 7.7% over state-of-the-art baselines.
Conclusion: The proposed approach successfully addresses core challenges in XLFM reconstruction through standardized benchmarking, self-supervised learning of angular priors, and physically-grounded rendering constraints.
Abstract: Light field microscopy (LFM) has become an emerging tool in neuroscience for large-scale neural imaging in vivo, notable for its single-exposure volumetric imaging, broad field of view, and high temporal resolution. However, learning-based 3D reconstruction in XLFM remains underdeveloped due to two core challenges: the absence of standardized datasets and the lack of methods that can efficiently model its angular-spatial structure while remaining physically grounded. We address these challenges by introducing three key contributions. First, we construct the XLFM-Zebrafish benchmark, a large-scale dataset and evaluation suite for XLFM reconstruction. Second, we propose Masked View Modeling for Light Fields (MVN-LF), a self-supervised task that learns angular priors by predicting occluded views, improving data efficiency. Third, we formulate the Optical Rendering Consistency Loss (ORC Loss), a differentiable rendering constraint that enforces alignment between predicted volumes and their PSF-based forward projections. On the XLFM-Zebrafish benchmark, our method improves PSNR by 7.7% over state-of-the-art baselines.
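To make the rendering-consistency idea concrete, here is a toy forward-projection term: the predicted volume is integrated along depth and convolved with per-view PSF kernels, then compared to the measured views. The depth-sum forward model, odd square kernels, and all names are simplifying assumptions, not the paper's optics.

```python
import torch
import torch.nn.functional as F

def orc_loss(pred_volume, psf, observed_views):
    """Toy rendering-consistency loss for light-field reconstruction.

    pred_volume:    (B, 1, Z, H, W) predicted 3D volume
    psf:            (V, 1, k, k)    one 2D PSF kernel per view, k odd (assumption)
    observed_views: (B, V, H, W)    measured views
    """
    projection = pred_volume.sum(dim=2)                               # depth integral: (B, 1, H, W)
    rendered = F.conv2d(projection, psf, padding=psf.shape[-1] // 2)  # (B, V, H, W)
    return F.l1_loss(rendered, observed_views)

vol = torch.rand(2, 1, 16, 32, 32)
psf = torch.rand(6, 1, 5, 5)
views = torch.rand(2, 6, 32, 32)
print(orc_loss(vol, psf, views))
```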
[333] Cross-View UAV Geo-Localization with Precision-Focused Efficient Design: A Hierarchical Distillation Approach with Multi-view Refinement
Jian Sun, Kangdao Liu, Chi Zhang, Chuangquan Chen, Junge Shen, Chi-Man Vong
Main category: cs.CV
TL;DR: PFED is an efficient cross-view geo-localization framework for UAVs that achieves high accuracy with significantly reduced computational costs through hierarchical knowledge distillation and multi-view refinement.
Details
Motivation: Existing cross-view geo-localization methods are computationally expensive due to fine-grained feature extraction and multiple modules, limiting deployment on edge devices for real-time UAV applications.Method: Proposes PFED with two components: 1) Hierarchical Distillation with Uncertainty-Aware Prediction Alignment during training to distill essential information without inference overhead, and 2) Multi-view Refinement Module during inference to filter redundant samples using mutual information.
Result: Achieves 97.15% Recall@1 on University-1652 dataset, over 5x more efficient in FLOPs and 3x faster than previous methods, running at 251.5 FPS on AGX Orin edge device.
Conclusion: PFED demonstrates state-of-the-art performance in both accuracy and efficiency, making it practical for real-time UAV geo-localization in GNSS-denied environments.
Abstract: Cross-view geo-localization (CVGL) enables UAV localization by matching aerial images to geo-tagged satellite databases, which is critical for autonomous navigation in GNSS-denied environments. However, existing methods rely on resource-intensive fine-grained feature extraction and alignment, where multiple branches and modules significantly increase inference costs, limiting their deployment on edge devices. We propose Precision-Focused Efficient Design (PFED), a resource-efficient framework combining hierarchical knowledge transfer and multi-view representation refinement. This innovative method comprises two key components: 1) During training, Hierarchical Distillation paradigm for fast and accurate CVGL (HD-CVGL), coupled with Uncertainty-Aware Prediction Alignment (UAPA) to distill essential information and mitigate the data imbalance without incurring additional inference overhead. 2) During inference, an efficient Multi-view Refinement Module (MRM) leverages mutual information to filter redundant samples and effectively utilize the multi-view data. Extensive experiments show that PFED achieves state-of-the-art performance in both accuracy and efficiency, reaching 97.15% Recall@1 on University-1652 while being over $5 \times$ more efficient in FLOPs and $3 \times$ faster than previous top methods. Furthermore, PFED runs at 251.5 FPS on the AGX Orin edge device, demonstrating its practical viability for real-time UAV applications. The project is available at https://github.com/SkyEyeLoc/PFED
[334] PSScreen V2: Partially Supervised Multiple Retinal Disease Screening
Boyi Zheng, Yalin Zheng, Hrvoje Bogunović, Qing Liu
Main category: cs.CV
TL;DR: PSScreen V2 is a partially supervised self-training framework for multi-disease retinal screening that handles missing labels and domain shifts using a three-branch architecture with novel low-frequency feature augmentation strategies.
Details
Motivation: To address limitations of previous methods that require fully labeled datasets or work in single domains, by enabling learning from multiple partially labeled datasets with different distributions.Method: Three-branch architecture with teacher and two student networks. Teacher generates pseudo labels from weakly augmented images. Students use LF-Dropout (randomly discards domain-related low-frequency components) and LF-Uncert (estimates uncertain domain variability via adversarial Gaussian perturbations).
Result: Achieves state-of-the-art performance and superior domain generalization on multiple fundus datasets. Shows compatibility with diverse backbones including DINOv2 and generalizes to chest X-ray datasets.
Conclusion: PSScreen V2 provides an effective framework for multi-disease screening that handles label absence and domain shift challenges, demonstrating universality and adaptability across different medical imaging domains.
Abstract: In this work, we propose PSScreen V2, a partially supervised self-training framework for multiple retinal disease screening. Unlike previous methods that rely on fully labelled or single-domain datasets, PSScreen V2 is designed to learn from multiple partially labelled datasets with different distributions, addressing both label absence and domain shift challenges. To this end, PSScreen V2 adopts a three-branch architecture with one teacher and two student networks. The teacher branch generates pseudo labels from weakly augmented images to address missing labels, while the two student branches introduce novel feature augmentation strategies: Low-Frequency Dropout (LF-Dropout), which enhances domain robustness by randomly discarding domain-related low-frequency components, and Low-Frequency Uncertainty (LF-Uncert), which estimates uncertain domain variability via adversarially learned Gaussian perturbations of low-frequency statistics. Extensive experiments on multiple in-domain and out-of-domain fundus datasets demonstrate that PSScreen V2 achieves state-of-the-art performance and superior domain generalization ability. Furthermore, compatibility tests with diverse backbones, including the vision foundation model DINOv2, as well as evaluations on chest X-ray datasets, highlight the universality and adaptability of the proposed framework. The codes are available at https://github.com/boyiZheng99/PSScreen_V2.
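A minimal sketch of the LF-Dropout idea is shown below: the low-frequency band of a feature map's centered spectrum is randomly zeroed per sample. The band radius, per-sample Bernoulli gate, and hard zeroing are assumptions; the paper's exact perturbation may differ.

```python
import torch

def lf_dropout(x, radius=0.1, p=0.5):
    """Toy Low-Frequency Dropout on a (B, C, H, W) feature map."""
    B, C, H, W = x.shape
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.linspace(-0.5, 0.5, H, device=x.device),
        torch.linspace(-0.5, 0.5, W, device=x.device), indexing='ij')
    low = ((yy ** 2 + xx ** 2).sqrt() <= radius).float()       # centered low-frequency band
    drop = (torch.rand(B, 1, 1, 1, device=x.device) < p).float()
    spec = spec * (1.0 - drop * low)                           # zero the band if dropped
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

feats = torch.randn(4, 16, 32, 32)
aug = lf_dropout(feats)
```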
[335] Projection Embedded Diffusion Bridge for CT Reconstruction from Incomplete Data
Yuang Wang, Pengfei Jin, Siyeop Yoon, Matthew Tivnan, Shaoyang Zhang, Li Zhang, Quanzheng Li, Zhiqiang Chen, Dufan Wu
Main category: cs.CV
TL;DR: PEDB is a novel diffusion bridge model that incorporates projection data consistency for CT reconstruction from incomplete data, outperforming existing methods.
Details
Motivation: Current diffusion bridge models for CT reconstruction don't adequately incorporate data consistency from projection data, which could improve reconstruction fidelity and detail recovery.Method: Proposes Projection Embedded Diffusion Bridge (PEDB) with a novel reverse SDE that samples clean images conditioned on both FBP reconstruction and projection data, embedding projection data into the score function.
Result: PEDB achieves strong performance in CT reconstruction from sparse-view, limited-angle, and truncated projections, outperforming state-of-the-art diffusion bridge models across standard, noisy, and domain-shift evaluations.
Conclusion: Explicitly conditioning on projection data in the sampling process naturally incorporates data consistency and improves CT reconstruction quality from incomplete data.
Abstract: Reconstructing CT images from incomplete projection data remains challenging due to the ill-posed nature of the problem. Diffusion bridge models have recently shown promise in restoring clean images from their corresponding Filtered Back Projection (FBP) reconstructions, but incorporating data consistency into these models remains largely underexplored. Incorporating data consistency can improve reconstruction fidelity by aligning the reconstructed image with the observed projection data, and can enhance detail recovery by integrating structural information contained in the projections. In this work, we propose the Projection Embedded Diffusion Bridge (PEDB). PEDB introduces a novel reverse stochastic differential equation (SDE) to sample from the distribution of clean images conditioned on both the FBP reconstruction and the incomplete projection data. By explicitly conditioning on the projection data in sampling the clean images, PEDB naturally incorporates data consistency. We embed the projection data into the score function of the reverse SDE. Under certain assumptions, we derive a tractable expression for the posterior score. In addition, we introduce a free parameter to control the level of stochasticity in the reverse process. We also design a discretization scheme for the reverse SDE to mitigate discretization error. Extensive experiments demonstrate that PEDB achieves strong performance in CT reconstruction from three types of incomplete data, including sparse-view, limited-angle, and truncated projections. For each of these types, PEDB outperforms evaluated state-of-the-art diffusion bridge models across standard, noisy, and domain-shift evaluations.
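The paper derives a posterior score with the projection data embedded in it; as a loose, generic stand-in (not PEDB's derivation), the sketch below adds a measurement-fidelity gradient to a denoising update, which is the usual way data consistency enters diffusion-based CT reconstruction (in the spirit of diffusion posterior sampling). All names and the step sizes are assumptions.

```python
import torch

def data_consistent_step(x, denoiser, forward_op, y, step=0.5, dc_weight=1.0):
    """One illustrative reverse-process step with a generic projection
    data-consistency correction (not PEDB's derived posterior score).

    x:          current noisy image estimate, shape (B, C, H, W)
    denoiser:   callable returning a denoised estimate of x
    forward_op: differentiable projection operator A (e.g., a sparse-view Radon)
    y:          observed (incomplete) projection data
    """
    x = x.detach().requires_grad_(True)
    x0_hat = denoiser(x)                                   # model-based denoising
    fidelity = ((forward_op(x0_hat) - y) ** 2).sum()       # ||A x0_hat - y||^2
    grad = torch.autograd.grad(fidelity, x)[0]             # pull toward measurements
    with torch.no_grad():
        x_next = x + step * (x0_hat - x) - dc_weight * grad
    return x_next
```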
[336] SWAN: Self-supervised Wavelet Neural Network for Hyperspectral Image Unmixing
Yassh Ramchandani, Vijayashekhar S S, Jignesh S. Bhatt
Main category: cs.CV
TL;DR: SWAN is a three-stage self-supervised wavelet neural network for hyperspectral unmixing that estimates endmembers and abundances without ground truth by leveraging wavelet transforms and physics-based learning.
Details
Motivation: To address the limitations of traditional hyperspectral unmixing methods that require ground truth data and struggle with contiguous, overlapping spectral bands, by developing a self-supervised approach that exploits latent symmetries in wavelet-transformed data.Method: Three-stage architecture: SWANencoder maps wavelet coefficients to latent space, SWANdecoder reconstructs coefficients, SWANforward learns hyperspectral physics. Uses biorthogonal wavelet basis, three-stage combined loss function, Adam optimization, and kernel regularizers to prevent overfitting.
Result: Experiments on synthetic and real benchmark datasets show performance enhancement over state-of-the-art neural network methods, with improved unmixing function learning and compact network parameters suitable for practical applications.
Conclusion: SWAN successfully enables self-supervised hyperspectral unmixing by combining wavelet transforms with neural networks, eliminating the need for ground truth while achieving competitive performance through resilient unmixing function learning.
Abstract: In this article, we present SWAN: a three-stage, self-supervised wavelet neural network for joint estimation of endmembers and abundances from hyperspectral imagery. The contiguous and overlapping hyperspectral band images are first expanded to Biorthogonal wavelet basis space that provides sparse, distributed, and multi-scale representations. The idea is to exploit latent symmetries from thus obtained invariant and covariant features using a self-supervised learning paradigm. The first stage, SWANencoder maps the input wavelet coefficients to a compact lower-dimensional latent space. The second stage, SWANdecoder uses the derived latent representation to reconstruct the input wavelet coefficients. Interestingly, the third stage SWANforward learns the underlying physics of the hyperspectral image. A three-stage combined loss function is formulated in the image acquisition domain that eliminates the need for ground truth and enables self-supervised training. Adam is employed for optimizing the proposed loss function, while Sigmoid with a dropout of 0.3 is incorporated to avoid possible overfitting. Kernel regularizers bound the magnitudes and preserve spatial variations in the estimated endmember coefficients. The output of SWANencoder represents estimated abundance maps during inference, while weights of SWANdecoder are retrieved to extract endmembers. Experiments are conducted on two benchmark synthetic data sets with different signal-to-noise ratios as well as on three real benchmark hyperspectral data sets while comparing the results with several state-of-the-art neural network-based unmixing methods. The qualitative, quantitative, and ablation results show performance enhancement by learning a resilient unmixing function as well as promoting self-supervision and compact network parameters for practical applications.
[337] Cross-Species Transfer Learning in Agricultural AI: Evaluating ZebraPose Adaptation for Dairy Cattle Pose Estimation
Mackenzie Tapp, Sibi Chakravarthy Parivendan, Kashfia Sailunaz, Suresh Neethirajan
Main category: cs.CV
TL;DR: Evaluated cross-species transfer learning using ZebraPose (ViT-based model trained on synthetic zebra data) for dairy cow pose estimation, revealing significant generalization failures despite morphological similarity between species.
Details
Motivation: Address scarcity of large annotated datasets for livestock pose estimation, particularly dairy cattle, by exploring cross-species transfer learning from synthetic zebra data.Method: Adapted ZebraPose model for 27-keypoint detection in dairy cows using three configurations: custom on-farm dataset (375 images), APT-36K benchmark subset, and their combination. Systematically evaluated accuracy and generalization across different barn environments.
Result: Combined model achieved promising performance on in-distribution data (AP=0.86, AR=0.87, PCK 0.5=0.869), but showed substantial generalization failures when applied to unseen barns and cow populations, exposing synthetic-to-real domain gap.
Conclusion: Morphological similarity between species is insufficient for cross-domain transfer. Calls for agriculture-first AI design prioritizing farm-level realism, cross-environment robustness, and open benchmark datasets for trustworthy livestock monitoring systems.
Abstract: Pose estimation serves as a cornerstone of computer vision for understanding animal posture, behavior, and welfare. Yet, agricultural applications remain constrained by the scarcity of large, annotated datasets for livestock, especially dairy cattle. This study evaluates the potential and limitations of cross-species transfer learning by adapting ZebraPose - a vision transformer-based model trained on synthetic zebra imagery - for 27-keypoint detection in dairy cows under real barn conditions. Using three configurations - a custom on-farm dataset (375 images; Sussex, New Brunswick, Canada), a subset of the APT-36K benchmark dataset, and their combination - we systematically assessed model accuracy and generalization across environments. While the combined model achieved promising performance (AP = 0.86, AR = 0.87, PCK 0.5 = 0.869) on in-distribution data, substantial generalization failures occurred when applied to unseen barns and cow populations. These findings expose the synthetic-to-real domain gap as a major obstacle to agricultural AI deployment and emphasize that morphological similarity between species is insufficient for cross-domain transfer. The study provides practical insights into dataset diversity, environmental variability, and computational constraints that influence real-world deployment of livestock monitoring systems. We conclude with a call for agriculture-first AI design, prioritizing farm-level realism, cross-environment robustness, and open benchmark datasets to advance trustworthy and scalable animal-centric technologies.
[338] Robust Atypical Mitosis Classification with DenseNet121: Stain-Aware Augmentation and Hybrid Loss for Domain Generalization
Adinath Dukre, Ankan Deria, Yutong Xie, Imran Razzak
Main category: cs.CV
TL;DR: A DenseNet-121 framework with stain-aware augmentation and imbalance-adaptive learning achieves robust atypical mitosis classification across multiple imaging domains.
Details
Motivation: Atypical mitotic figures are important biomarkers for tumor aggressiveness but challenging to recognize due to severe class imbalance and variability across imaging domains.Method: DenseNet-121-based framework with stain-aware augmentation (Macenko), geometric/ intensity transformations, weighted sampling, and hybrid objective combining class-weighted binary cross-entropy and focal loss, trained end-to-end with AdamW.
Result: Achieved balanced accuracy 85.0%, AUROC 0.927, sensitivity 89.2%, and specificity 80.9% on official test set, demonstrating strong generalization across scanner and staining shifts.
Conclusion: Combining DenseNet-121 with stain-aware augmentation and imbalance-adaptive objectives yields a robust, domain-generalizable framework suitable for real-world computational pathology workflows.
Abstract: Atypical mitotic figures are important biomarkers of tumor aggressiveness in histopathology, yet reliable recognition remains challenging due to severe class imbalance and variability across imaging domains. We present a DenseNet-121-based framework tailored for atypical mitosis classification in the MIDOG 2025 (Track 2) setting. Our method integrates stain-aware augmentation (Macenko), geometric and intensity transformations, and imbalance-aware learning via weighted sampling with a hybrid objective combining class-weighted binary cross-entropy and focal loss. Trained end-to-end with AdamW and evaluated across multiple independent domains, the model demonstrates strong generalization under scanner and staining shifts, achieving balanced accuracy 85.0%, AUROC 0.927, sensitivity 89.2%, and specificity 80.9% on the official test set. These results indicate that combining DenseNet-121 with stain-aware augmentation and imbalance-adaptive objectives yields a robust, domain-generalizable framework for atypical mitosis classification suitable for real-world computational pathology workflows.
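A minimal sketch of the hybrid objective, combining class-weighted BCE with a focal term; the hyperparameters and the mixing weight are placeholders, not the paper's values.

```python
import torch
import torch.nn.functional as F

def hybrid_bce_focal(logits, targets, pos_weight=4.0, gamma=2.0, alpha=0.25, lam=0.5):
    """Illustrative hybrid objective for an imbalanced binary task
    (atypical vs. normal mitosis): weighted BCE plus focal loss.
    """
    wbce = F.binary_cross_entropy_with_logits(
        logits, targets, pos_weight=torch.tensor(pos_weight))
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    focal = (alpha_t * (1 - p_t) ** gamma *
             F.binary_cross_entropy_with_logits(logits, targets, reduction='none')).mean()
    return lam * wbce + (1 - lam) * focal

loss = hybrid_bce_focal(torch.randn(16), torch.randint(0, 2, (16,)).float())
```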
[339] A Critical Study on Tea Leaf Disease Detection using Deep Learning Techniques
Nabajyoti Borah, Raju Moni Borah, Bandan Boruah, Purnendu Bikash Acharjee, Sajal Saha, Ripjyoti Hazarika
Main category: cs.CV
TL;DR: Deep learning approach for detecting and segmenting three tea leaf diseases (Red Rust, Helopeltis, Red spider mite) using object detection models (SSD MobileNet V2, Faster R-CNN) and instance segmentation (Mask R-CNN) with custom damage area calculation.
Details
Motivation: To automatically classify tea leaf diseases caused by pests and pathogens, and quantify the damaged area on leaves for better disease management in tea cultivation.Method: Evaluated SSD MobileNet V2 and Faster R-CNN ResNet50 V1 for object detection, and Mask R-CNN for instance segmentation with custom method to calculate damaged leaf portions.
Result: Faster R-CNN ResNet50 V1 achieved better performance with 25% mAP compared to SSD MobileNet V2’s 20.9% mAP. Both models showed low precision (0.252 vs 0.209) and recall (0.044 vs 0.02) on IOU 0.50:0.95 range.
Conclusion: Faster R-CNN outperforms SSD MobileNet for tea leaf disease detection, and Mask R-CNN with custom damage calculation provides effective disease area quantification, though overall detection performance needs improvement.
Abstract: The proposed solution is a deep learning technique that classifies three types of tea leaf diseases - Red Rust, Helopeltis, and Red spider mite - of which two are caused by pests and one by pathogens (infectious organisms) and environmental conditions, and that also shows the area of the leaf damaged by a disease. In this paper we evaluate two object detection models, SSD MobileNet V2 and Faster R-CNN ResNet50 V1. SSD MobileNet V2 gave a precision of 0.209 over the IoU range 0.50:0.95, a recall of 0.02 over the same range, and a final mAP of 20.9%, while Faster R-CNN ResNet50 V1 achieved a precision of 0.252 and a recall of 0.044 over IoU 0.50:0.95 with an mAP of 25%, which is better than SSD. We also used Mask R-CNN for object instance segmentation, implementing a custom method to calculate the damaged, diseased portion of leaves. Keywords: Tea Leaf Disease, Deep Learning, Red Rust, Helopeltis, Red Spider Mite, SSD MobileNet V2, Faster R-CNN ResNet50 V1, Mask R-CNN.
[340] Self-Attention Decomposition For Training Free Diffusion Editing
Tharun Anand, Mohammad Hassan Vali, Arno Solin
Main category: cs.CV
TL;DR: An analytical method that derives semantic editing directions directly from pretrained diffusion model parameters using self-attention weight matrix eigenvectors, requiring no additional data or fine-tuning.
Details
Motivation: Existing methods for finding interpretable directions in diffusion models rely on sampling large image sets or training auxiliary networks, which limits efficiency and controllability.Method: Compute eigenvectors of self-attention weight matrices in pretrained diffusion models to obtain robust and interpretable editing directions without additional data or fine-tuning.
Result: Produces high-quality edits across multiple datasets while reducing editing time by 60% compared to current benchmarks.
Conclusion: The method enables efficient and precise control over diffusion model outputs by leveraging intrinsic structural information encoded in self-attention weights.
Abstract: Diffusion models achieve remarkable fidelity in image synthesis, yet precise control over their outputs for targeted editing remains challenging. A key step toward controllability is to identify interpretable directions in the model’s latent representations that correspond to semantic attributes. Existing approaches for finding interpretable directions typically rely on sampling large sets of images or training auxiliary networks, which limits efficiency. We propose an analytical method that derives semantic editing directions directly from the pretrained parameters of diffusion models, requiring neither additional data nor fine-tuning. Our insight is that self-attention weight matrices encode rich structural information about the data distribution learned during training. By computing the eigenvectors of these weight matrices, we obtain robust and interpretable editing directions. Experiments demonstrate that our method produces high-quality edits across multiple datasets while reducing editing time significantly by 60% over current benchmarks.
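Conceptually, the recipe is: take a self-attention projection matrix from the pretrained model and use its leading eigenvectors as edit directions. Since such matrices are generally non-symmetric, the sketch below eigendecomposes W^T W (equivalently, takes right singular vectors) as a stand-in; the module path in the comment is a hypothetical diffusers-style example, not the paper's code.

```python
import torch

def attention_edit_directions(attn_weight, k=8):
    """Conceptual sketch: derive candidate edit directions from a pretrained
    self-attention projection matrix.

    attn_weight: (D, D), e.g. a value/output projection from a UNet attention block
    """
    # Eigenvectors of W^T W = right singular vectors of W (a symmetric stand-in)
    evals, evecs = torch.linalg.eigh(attn_weight.T @ attn_weight)  # ascending order
    return evecs[:, -k:].flip(-1)                                  # top-k directions, (D, k)

# Hypothetical usage on a diffusion UNet attention layer:
# W = unet.mid_block.attentions[0].transformer_blocks[0].attn1.to_v.weight
# dirs = attention_edit_directions(W.detach())
# latent_edit = latent + scale * dirs[:, 0]   # shift latents along one direction
```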
[341] SARCLIP: A Vision Language Foundation Model for Semantic Understanding and Target Recognition in SAR Imagery
Qiwei Ma, Zhiyu Wang, Wang Liu, Xukun Lu, Bin Deng, Puhong Duan, Xudong Kang, Shutao Li
Main category: cs.CV
TL;DR: SARCLIP is the first vision-language foundation model for SAR imagery, trained on a large-scale dataset of 1M text-image pairs using contrastive learning, enabling superior zero-shot recognition and multimodal understanding.
Details
Motivation: Existing SAR foundation models focus on low-level visual features but lack multimodal alignment and zero-shot target recognition capabilities, limiting semantic understanding of SAR imagery.Method: Constructed SARCLIP-1M dataset with 1M text-image pairs, then trained SARCLIP model using contrastive vision-language learning with domain transfer strategy to bridge SAR imagery and text.
Result: Extensive experiments show SARCLIP significantly outperforms state-of-the-art foundation models in image-text retrieval and zero-shot classification tasks, advancing SAR semantic understanding.
Conclusion: SARCLIP successfully bridges the gap between SAR imagery and textual descriptions, enabling superior feature extraction and interpretation capabilities for SAR domain applications.
Abstract: Synthetic Aperture Radar (SAR) has emerged as a crucial imaging modality due to its all-weather capabilities. While recent advancements in self-supervised learning and Masked Image Modeling (MIM) have paved the way for SAR foundation models, these approaches primarily focus on low-level visual features, often overlooking multimodal alignment and zero-shot target recognition within SAR imagery. To address this limitation, we construct SARCLIP-1M, a large-scale vision-language dataset comprising over one million text-image pairs aggregated from existing datasets. We further introduce SARCLIP, the first vision-language foundation model tailored for the SAR domain. Our SARCLIP model is trained using a contrastive vision-language learning approach with a domain-transfer strategy, enabling it to bridge the gap between SAR imagery and textual descriptions. Extensive experiments on image-text retrieval and zero-shot classification tasks demonstrate the superior performance of SARCLIP in feature extraction and interpretation, significantly outperforming state-of-the-art foundation models and advancing the semantic understanding of SAR imagery. The code and datasets will be released soon.
[342] LVD-GS: Gaussian Splatting SLAM for Dynamic Scenes via Hierarchical Explicit-Implicit Representation Collaboration Rendering
Wenkai Zhu, Xu Li, Qimin Xu, Benwu Wang, Kun Wei, Yiming Peng, Zihang Wang
Main category: cs.CV
TL;DR: LVD-GS is a LiDAR-Visual 3D Gaussian Splatting SLAM system that addresses scale drift and dynamic object challenges in large-scale outdoor scenes through hierarchical collaborative representation and joint dynamic modeling.
Details
Motivation: Existing 3D Gaussian Splatting SLAM methods rely on single representation schemes, limiting performance in large-scale dynamic outdoor scenes and causing cumulative pose errors and scale ambiguity.Method: Proposes hierarchical collaborative representation module for mutual reinforcement in mapping optimization, and joint dynamic modeling module that fuses open-world segmentation with implicit residual constraints using DINO-Depth uncertainty estimates.
Result: Extensive evaluations on KITTI, nuScenes, and self-collected datasets show state-of-the-art performance compared to existing methods.
Conclusion: LVD-GS effectively mitigates scale drift, enhances reconstruction robustness, and eliminates dynamic object influence through its novel hierarchical representation and dynamic modeling approaches.
Abstract: 3D Gaussian Splatting SLAM has emerged as a widely used technique for high-fidelity mapping in spatial intelligence. However, existing methods often rely on a single representation scheme, which limits their performance in large-scale dynamic outdoor scenes and leads to cumulative pose errors and scale ambiguity. To address these challenges, we propose \textbf{LVD-GS}, a novel LiDAR-Visual 3D Gaussian Splatting SLAM system. Motivated by the human chain-of-thought process for information seeking, we introduce a hierarchical collaborative representation module that facilitates mutual reinforcement for mapping optimization, effectively mitigating scale drift and enhancing reconstruction robustness. Furthermore, to effectively eliminate the influence of dynamic objects, we propose a joint dynamic modeling module that generates fine-grained dynamic masks by fusing open-world segmentation with implicit residual constraints, guided by uncertainty estimates from DINO-Depth features. Extensive evaluations on KITTI, nuScenes, and self-collected datasets demonstrate that our approach achieves state-of-the-art performance compared to existing methods.
[343] Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views
Anna Deichler, Jonas Beskow
Main category: cs.CV
TL;DR: Look and Tell is a multimodal dataset for referential communication, collected using smart glasses and stationary cameras to study how spatial representations affect multimodal grounding in situated dialogue.
Details
Motivation: To advance the development of embodied agents that can understand and engage in situated dialogue by studying referential communication across different perspectives (egocentric vs exocentric).Method: Used Meta Project Aria smart glasses and stationary cameras to record synchronized gaze, speech, and video from 25 participants instructing partners to identify kitchen ingredients, combined with 3D scene reconstructions.
Result: Created a dataset with 3.67 hours of recordings containing 2,707 richly annotated referential expressions, providing a benchmark for evaluating spatial representation effects on multimodal grounding.
Conclusion: The dataset enables research on how different spatial representations (2D vs 3D, ego vs exo) affect multimodal grounding and supports development of embodied agents for situated dialogue.
Abstract: We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
[344] Alias-Free ViT: Fractional Shift Invariance via Linear Attention
Hagay Michaeli, Daniel Soudry
Main category: cs.CV
TL;DR: The paper proposes Alias-Free ViT, a Vision Transformer model that addresses the lack of translation invariance in standard ViTs by incorporating alias-free downsampling, nonlinearities, and shift-equivariant attention.
Details
Motivation: Vision Transformers lack the architectural inductive bias of convnets, making them less translation-invariant and more sensitive to minor image translations. While convnets also aren't perfectly shift-invariant due to aliasing, this work aims to address similar issues in ViTs.Method: The model combines alias-free downsampling and nonlinearities with linear cross-covariance attention that is shift-equivariant to both integer and fractional translations, enabling shift-invariant global representation.
Result: The proposed Alias-Free ViT maintains competitive performance in image classification and outperforms similar-sized models in robustness to adversarial translations.
Conclusion: The Alias-Free ViT successfully addresses translation sensitivity in Vision Transformers while maintaining competitive performance, demonstrating improved robustness to adversarial translations compared to standard ViTs.
Abstract: Transformers have emerged as a competitive alternative to convnets in vision tasks, yet they lack the architectural inductive bias of convnets, which may hinder their potential performance. Specifically, Vision Transformers (ViTs) are not translation-invariant and are more sensitive to minor image translations than standard convnets. Previous studies have shown, however, that convnets are also not perfectly shift-invariant, due to aliasing in downsampling and nonlinear layers. Consequently, anti-aliasing approaches have been proposed to certify convnets’ translation robustness. Building on this line of work, we propose an Alias-Free ViT, which combines two main components. First, it uses alias-free downsampling and nonlinearities. Second, it uses linear cross-covariance attention that is shift-equivariant to both integer and fractional translations, enabling a shift-invariant global representation. Our model maintains competitive performance in image classification and outperforms similar-sized models in terms of robustness to adversarial translations.
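The shift-equivariance of the attention component comes from the cross-covariance ("channel") formulation popularized by XCiT, where the D x D attention map is built from token-aggregated statistics and is therefore unchanged by token shifts. A minimal single-head sketch, with shapes of our choosing:

```python
import torch
import torch.nn.functional as F

def cross_covariance_attention(q, k, v, temperature=1.0):
    """Single-head cross-covariance attention (XCiT-style sketch).

    q, k, v: (B, N, D). The D x D attention map sums over tokens, so it is
    invariant to token shifts; applying it per token is shift-equivariant,
    and the cost is linear in the number of tokens N.
    """
    q = F.normalize(q, dim=1)                                      # unit-norm columns (over tokens)
    k = F.normalize(k, dim=1)
    attn = F.softmax(k.transpose(1, 2) @ q / temperature, dim=-1)  # (B, D, D)
    return v @ attn                                                # (B, N, D)

x = torch.randn(2, 196, 64)
out = cross_covariance_attention(x, x, x)
```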
[345] DAMap: Distance-aware MapNet for High Quality HD Map Construction
Jinpeng Dong, Chen Li, Yutong Lin, Jingwen Fu, Sanping Zhou, Nanning Zheng
Main category: cs.CV
TL;DR: DAMap addresses task misalignment in HD map prediction through distance-aware focal loss, hybrid loss scheme, and task-modulated deformable attention to improve classification and localization quality.
Details
Motivation: Current HD map prediction methods struggle to produce high-quality predictions due to task misalignment, caused by inappropriate labels from one-to-many matching and sub-optimal features from task-shared sampling.
Result: Achieves consistent performance improvements on NuScenes and Argoverse2 benchmarks across different metrics, baselines, splits, backbones, and training schedules.
Conclusion: DAMap effectively addresses task misalignment in HD map prediction and demonstrates superior performance through its three key components.
Abstract: Predicting High-definition (HD) map elements with high quality (high classification and localization scores) is crucial to the safety of autonomous driving vehicles. However, current methods perform poorly in high quality predictions due to inherent task misalignment. Two main factors are responsible for misalignment: 1) inappropriate task labels due to one-to-many matching queries sharing the same labels, and 2) sub-optimal task features due to task-shared sampling mechanism. In this paper, we reveal two inherent defects in current methods and develop a novel HD map construction method named DAMap to address these problems. Specifically, DAMap consists of three components: Distance-aware Focal Loss (DAFL), Hybrid Loss Scheme (HLS), and Task Modulated Deformable Attention (TMDA). The DAFL is introduced to assign appropriate classification labels for one-to-many matching samples. The TMDA is proposed to obtain discriminative task-specific features. Furthermore, the HLS is proposed to better utilize the advantages of the DAFL. We perform extensive experiments and consistently achieve performance improvement on the NuScenes and Argoverse2 benchmarks under different metrics, baselines, splits, backbones, and schedules. Code will be available at https://github.com/jpdong-xjtu/DAMap.
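Speculative sketch only: one plausible reading of a "distance-aware" classification label is a soft target that decays with a positive sample's geometric distance to its ground-truth element, combined with quality-focal-style modulation. Every name and the exact weighting below are assumptions, not DAMap's published formulation.

```python
import torch
import torch.nn.functional as F

def distance_aware_focal(logits, matched, distance, gamma=2.0, tau=5.0):
    """Speculative distance-aware focal loss: positives from one-to-many
    matching get soft targets that decay with their geometric distance to the
    ground-truth element, instead of a shared hard label.

    logits:   (N,) classification logits of matched queries
    matched:  (N,) 1 for positive matches, 0 for negatives
    distance: (N,) e.g. Chamfer distance of a predicted polyline to its GT
    """
    soft = matched * torch.exp(-distance / tau)           # closer => target nearer 1
    p = torch.sigmoid(logits)
    weight = (soft - p).abs() ** gamma                    # focal-style modulation (QFL-like)
    return (weight * F.binary_cross_entropy_with_logits(
        logits, soft, reduction='none')).mean()

loss = distance_aware_focal(torch.randn(32), torch.randint(0, 2, (32,)).float(),
                            torch.rand(32) * 10)
```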
[346] Estimation of Fireproof Structure Class and Construction Year for Disaster Risk Assessment
Hibiki Ayabe, Kazushi Okamoto, Koki Karube, Atsushi Shibata, Kei Harada
Main category: cs.CV
TL;DR: A multi-task learning model that predicts building construction year, structure type, and property type from facade images to derive structural fireproof classification for insurance and risk assessment.
Details
Motivation: Key building metadata like construction year and structure type are often missing or outdated in Japan's second-hand housing market, making disaster risk assessment and insurance pricing difficult.Method: Multi-task learning model that jointly estimates construction year, building structure, and property type from facade images, then derives fireproof class (H/T/M) via rule-based mapping based on official insurance criteria.
Result: Model achieved high accuracy in construction-year regression and robust classification across imbalanced categories, capturing visual cues related to building age and materials.
Conclusion: The approach demonstrates feasibility of scalable, interpretable, image-based risk-profiling systems with applications in insurance, urban planning, and disaster preparedness.
Abstract: Structural fireproof classification is vital for disaster risk assessment and insurance pricing in Japan. However, key building metadata such as construction year and structure type are often missing or outdated, particularly in the second-hand housing market. This study proposes a multi-task learning model that predicts these attributes from facade images. The model jointly estimates the construction year, building structure, and property type, from which the structural fireproof class - defined as H (non-fireproof), T (semi-fireproof), or M (fireproof) - is derived via a rule-based mapping based on official insurance criteria. We trained and evaluated the model using a large-scale dataset of Japanese residential images, applying rigorous filtering and deduplication. The model achieved high accuracy in construction-year regression and robust classification across imbalanced categories. Qualitative analyses show that it captures visual cues related to building age and materials. Our approach demonstrates the feasibility of scalable, interpretable, image-based risk-profiling systems, offering potential applications in insurance, urban planning, and disaster preparedness.
[347] RoboSVG: A Unified Framework for Interactive SVG Generation with Multi-modal Guidance
Jiuniu Wang, Gongjie Zhang, Quanhao Qian, Junlong Gao, Deli Zhao, Ran Xu
Main category: cs.CV
TL;DR: RoboSVG is a multimodal framework for generating interactive SVGs using text, visual, and numerical guidance, achieving state-of-the-art performance across multiple generation tasks.
Details
Motivation: SVGs are fundamental to digital design and robot control, encoding both visual structure and motion paths, but there's a need for unified multimodal generation of interactive SVGs.Method: A three-stage framework: produces multimodal guidance from input queries, synthesizes candidate SVGs through dedicated generation modules, and refines them under numerical guidance. Built on RoboDraw dataset of 1M SVG examples.
Result: Achieves superior query compliance and visual fidelity across tasks (Text-to-SVG, Image-to-SVG, PartialSVG-to-SVG, PartialImage-to-SVG), establishing new state-of-the-art in versatile SVG generation.
Conclusion: RoboSVG provides an effective unified framework for multimodal SVG generation with strong performance across diverse tasks, with dataset and code to be publicly released.
Abstract: Scalable Vector Graphics (SVGs) are fundamental to digital design and robot control, encoding not only visual structure but also motion paths in interactive drawings. In this work, we introduce RoboSVG, a unified multimodal framework for generating interactive SVGs guided by textual, visual, and numerical signals. Given an input query, the RoboSVG model first produces multimodal guidance, then synthesizes candidate SVGs through dedicated generation modules, and finally refines them under numerical guidance to yield high-quality outputs. To support this framework, we construct RoboDraw, a large-scale dataset of one million examples, each pairing an SVG generation condition (e.g., text, image, and partial SVG) with its corresponding ground-truth SVG code. RoboDraw dataset enables systematic study of four tasks, including basic generation (Text-to-SVG, Image-to-SVG) and interactive generation (PartialSVG-to-SVG, PartialImage-to-SVG). Extensive experiments demonstrate that RoboSVG achieves superior query compliance and visual fidelity across tasks, establishing a new state of the art in versatile SVG generation. The dataset and source code of this project will be publicly available soon.
[348] VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree
Wenlong Li, Yifei Xu, Yuan Rao, Zhenhua Wang, Shuiguang Deng
Main category: cs.CV
TL;DR: VADTree is a training-free video anomaly detection method that uses a hierarchical tree structure for flexible temporal sampling and leverages pre-trained models for boundary detection and anomaly reasoning.
Details
Motivation: Supervised VAD methods require large labeled datasets and lack explainability, while existing training-free methods struggle with variable-length anomalies due to fixed temporal windows.Method: Uses Hierarchical Granularity-aware Tree (HGTree) with GEBD for event boundary detection, adaptive coarse-fine structuring, VLMs with multi-dimensional priors for anomaly perception, and LLMs for anomaly reasoning.
Result: Achieves state-of-the-art performance on three challenging datasets while significantly reducing the number of sampled video segments compared to existing methods.
Conclusion: VADTree provides an effective training-free solution for VAD that handles varying temporal spans and offers better anomaly detection with reduced computational cost.
Abstract: Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. In contrast, training-free methods leverage the knowledge reserves and language interactivity of large pre-trained models to detect anomalies. However, the current fixed-length temporal window sampling approaches struggle to accurately capture anomalies with varying temporal spans. Therefore, we propose VADTree that utilizes a Hierarchical Granularity-aware Tree (HGTree) structure for flexible sampling in VAD. VADTree leverages the knowledge embedded in a pre-trained Generic Event Boundary Detection (GEBD) model to characterize potential anomaly event boundaries. Specifically, VADTree decomposes the video into generic event nodes based on boundary confidence, and performs adaptive coarse-fine hierarchical structuring and redundancy removal to construct the HGTree. Then, the multi-dimensional priors are injected into the visual language models (VLMs) to enhance the node-wise anomaly perception, and anomaly reasoning for generic event nodes is achieved via large language models (LLMs). Finally, an inter-cluster node correlation method is used to integrate the multi-granularity anomaly scores. Extensive experiments on three challenging datasets demonstrate that VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments. The code will be available at https://github.com/wenlongli10/VADTree.
[349] Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation
Shu Zhao, Tianyi Shen, Nilesh Ahuja, Omesh Tickoo, Vijaykrishnan Narayanan
Main category: cs.CV
TL;DR: Windsock is a query-dependent module for MRAG that dynamically decides when to retrieve and what modality to use, combined with DANCE instruction tuning to improve information utilization and noise resistance.
Details
Motivation: Existing MRAG approaches have static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to challenges in determining when to retrieve, what modality to incorporate, and how to use retrieved information effectively.Method: Introduces Windsock module for dynamic retrieval decisions and modality selection, DANCE Instruction Tuning for adaptive training to enhance information utilization and noise resistance, and a self-assessment approach to convert QA datasets to MRAG training datasets.
Result: Significantly improves generation quality by 17.07% while reducing retrieval times by 8.95% in extensive experiments.
Conclusion: The proposed method effectively addresses the key challenges in MRAG by providing dynamic retrieval strategies, flexible modality selection, and improved information utilization, leading to substantial performance gains with reduced computational overhead.
Abstract: Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method to generate factual and up-to-date responses from Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to three critical challenges: determining when to retrieve, what modality to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module making decisions on retrieval necessity and modality selection, effectively reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs’ ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach leveraging knowledge within MLLMs to convert question-answering datasets to MRAG training datasets. Extensive experiments demonstrate that our proposed method significantly improves generation quality by 17.07% while reducing retrieval times by 8.95%.
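The core decision, whether to retrieve per query and in which modality, can be pictured as a small gating head over the query embedding. A minimal sketch, assuming a three-way choice {skip, text, image} and an arbitrary embedding width; the paper's actual module design and training signal are not specified here.

```python
import torch
import torch.nn as nn

class RetrievalGate(nn.Module):
    """Illustrative query-dependent gate in the spirit of Windsock:
    predicts retrieval necessity and modality from a query embedding."""
    def __init__(self, d_query=4096, n_choices=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_query, 512), nn.GELU(), nn.Linear(512, n_choices)
        )

    def forward(self, q_emb):
        return self.head(q_emb).argmax(dim=-1)  # 0: skip, 1: text, 2: image

gate = RetrievalGate()
decision = gate(torch.randn(2, 4096))  # one decision per query
```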
[350] WaveMAE: Wavelet decomposition Masked Auto-Encoder for Remote Sensing
Vittorio Bernuzzi, Leonardo Rossi, Tomaso Fontanini, Massimo Bertozzi, Andrea Prati
Main category: cs.CV
TL;DR: WaveMAE is a masked autoencoding framework for multispectral satellite imagery that uses Discrete Wavelet Transform for frequency-aware learning and Geo-conditioned Positional Encoding for geographical priors, achieving state-of-the-art performance on diverse remote sensing tasks.
Details
Motivation: Self-supervised learning is crucial for remote sensing due to limited annotated data. Current approaches need better frequency awareness and geographical understanding for multispectral imagery.
Method: Uses masked autoencoding with multi-level Discrete Wavelet Transform for frequency component disentanglement and Geo-conditioned Positional Encoding with Spherical Harmonics for geographical priors.
Result: Achieves consistent improvements over prior state-of-the-art, with substantial gains on segmentation and regression. Even a lightweight variant (26.4% of the parameters) achieves SOTA performance.
Conclusion: WaveMAE establishes itself as a strong, geographically informed foundation model for multispectral remote sensing imagery.
Abstract: Self-supervised learning (SSL) has recently emerged as a key strategy for building foundation models in remote sensing, where the scarcity of annotated data limits the applicability of fully supervised approaches. In this work, we introduce WaveMAE, a masked autoencoding framework tailored for multispectral satellite imagery. Unlike conventional pixel-based reconstruction, WaveMAE leverages a multi-level Discrete Wavelet Transform (DWT) to disentangle frequency components and guide the encoder toward learning scale-aware high-frequency representations. We further propose a Geo-conditioned Positional Encoding (GPE), which incorporates geographical priors via Spherical Harmonics, encouraging embeddings that respect both semantic and geospatial structure. To ensure fairness in evaluation, all methods are pretrained on the same dataset (fMoW-S2) and systematically evaluated on the diverse downstream tasks of the PANGAEA benchmark, spanning semantic segmentation, regression, change detection, and multilabel classification. Extensive experiments demonstrate that WaveMAE achieves consistent improvements over prior state-of-the-art approaches, with substantial gains on segmentation and regression benchmarks. The effectiveness of WaveMAE pretraining is further demonstrated by showing that even a lightweight variant, containing only 26.4% of the parameters, achieves state-of-the-art performance. Our results establish WaveMAE as a strong and geographically informed foundation model for multispectral remote sensing imagery.
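To make the frequency-aware target concrete, the sketch below computes multi-level DWT sub-bands for a single spectral band with PyWavelets; the high-frequency (H, V, D) sub-bands are the kind of scale-aware structure the encoder is guided toward. The wavelet choice and level are assumptions, and WaveMAE's exact target construction may differ.

```python
import numpy as np
import pywt

def dwt_targets(band, wavelet="haar", level=2):
    """Multi-level 2D DWT of one band: returns the coarse approximation
    plus stacked (H, V, D) high-frequency sub-bands per level."""
    coeffs = pywt.wavedec2(band, wavelet=wavelet, level=level)
    approx, details = coeffs[0], coeffs[1:]
    return approx, [np.stack(d) for d in details]

band = np.random.rand(64, 64).astype(np.float32)  # stand-in for a Sentinel-2 band
approx, highfreq = dwt_targets(band)  # highfreq[i].shape == (3, h_i, w_i)
```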
[351] IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
Hao Li, Zhengyu Zou, Fangfu Liu, Xuanyang Zhang, Fangzhou Hong, Yukang Cao, Yushi Lan, Manyuan Zhang, Gang Yu, Dingwen Zhang, Ziwei Liu
Main category: cs.CV
TL;DR: IGGT is an end-to-end transformer that unifies 3D spatial reconstruction and instance-level understanding through 3D-consistent contrastive learning, enabling coherent 3D scene perception from 2D inputs.
Details
Motivation: Prior approaches treat 3D reconstruction and semantic understanding separately, missing their crucial interplay and limiting generalization in downstream tasks. Simple alignment methods restrict perception to aligned model capacity.
Method: Proposed Instance-Grounded Geometry Transformer (IGGT) with 3D-Consistent Contrastive Learning strategy to encode unified representations from 2D visual inputs. Created InsScene-15K dataset with comprehensive annotations.
Result: IGGT learns to lift 2D visual inputs into coherent 3D scenes with explicitly distinct object instances through unified geometric and instance-grounded representations.
Conclusion: The unified approach enables more coherent and accurate 3D scene understanding by bridging the gap between geometric reconstruction and semantic perception.
Abstract: Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model’s capacity and limiting adaptability to downstream tasks. In this paper, we propose Instance-Grounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline.
[352] LRW-Persian: Lip-reading in the Wild Dataset for Persian Language
Zahra Taghizadeh, Mohammad Shahverdikondori, Arian Noori, Alireza Dadgarnia
Main category: cs.CV
TL;DR: LRW-Persian is the largest Persian word-level lipreading dataset with 743 target words and 414,000 video samples from 1,900+ hours of TV footage, featuring automated curation and benchmark-ready splits.
Details
Motivation: Address the scarcity of non-English visual speech recognition resources, particularly for the Persian language, to support robust speech recognition systems and assistive technologies for the hearing-impaired.
Method: Created dataset using automated end-to-end curation pipeline with ASR transcription, active-speaker localization, quality filtering, and pose/mask screening from 67 TV programs. Fine-tuned two established lipreading architectures for benchmarking.
Result: Established reference performance on Persian visual speech recognition, demonstrating the difficulty of the task. The dataset provides speaker-disjoint splits, wide dialectal coverage, and rich metadata including head pose, age, and gender.
Conclusion: LRW-Persian fills a critical gap for low-resource languages, enabling rigorous benchmarking, cross-lingual transfer, and advancing multimodal speech research in underrepresented linguistic contexts.
Abstract: Lipreading has emerged as an increasingly important research area for developing robust speech recognition systems and assistive technologies for the hearing-impaired. However, non-English resources for visual speech recognition remain limited. We introduce LRW-Persian, the largest in-the-wild Persian word-level lipreading dataset, comprising 743 target words and over 414,000 video samples extracted from more than 1,900 hours of footage across 67 television programs. Designed as a benchmark-ready resource, LRW-Persian provides speaker-disjoint training and test splits, wide regional and dialectal coverage, and rich per-clip metadata including head pose, age, and gender. To ensure large-scale data quality, we establish a fully automated end-to-end curation pipeline encompassing transcription based on Automatic Speech Recognition (ASR), active-speaker localization, quality filtering, and pose/mask screening. We further fine-tune two widely used lipreading architectures on LRW-Persian, establishing reference performance and demonstrating the difficulty of Persian visual speech recognition. By filling a critical gap in low-resource languages, LRW-Persian enables rigorous benchmarking, supports cross-lingual transfer, and provides a foundation for advancing multimodal speech research in underrepresented linguistic contexts. The dataset is publicly available at: https://lrw-persian.vercel.app.
[353] Cross-view Localization and Synthesis – Datasets, Challenges and Opportunities
Ningli Xu, Rongjun Qin
Main category: cs.CV
TL;DR: This paper provides a comprehensive survey of cross-view visual understanding, focusing on cross-view localization (estimating ground image positions from overhead imagery) and cross-view synthesis (generating ground-level images from overhead views).
Details
Motivation: Cross-view tasks have gained importance due to applications in autonomous navigation, urban planning, and augmented reality. The significant differences in viewing perspective, resolution, and occlusion between overhead and ground-level imagery make these tasks challenging.
Method: The survey reviews datasets, challenges, and state-of-the-art techniques. Cross-view localization is typically formulated as image retrieval using CNNs or ViTs for feature embedding, while cross-view synthesis uses GANs or diffusion models.
Result: The paper provides an organized overview of advances in both tasks, including comparative analyses of current methods and identification of key challenges in the field.
Conclusion: The survey discusses current limitations and outlines promising future research directions for cross-view localization and synthesis, providing a comprehensive resource for researchers in this domain.
Abstract: Cross-view localization and synthesis are two fundamental tasks in cross-view visual understanding, which deals with cross-view datasets: overhead (satellite or aerial) and ground-level imagery. These tasks have gained increasing attention due to their broad applications in autonomous navigation, urban planning, and augmented reality. Cross-view localization aims to estimate the geographic position of ground-level images based on information provided by overhead imagery, while cross-view synthesis seeks to generate ground-level images based on information from the overhead imagery. Both tasks remain challenging due to significant differences in viewing perspective, resolution, and occlusion, which are widely embedded in cross-view datasets. Recent years have witnessed rapid progress driven by the availability of large-scale datasets and novel approaches. Typically, cross-view localization is formulated as an image retrieval problem where ground-level features are matched with features of tiled overhead images, extracted by convolutional neural networks (CNNs) or vision transformers (ViTs) for cross-view feature embedding. Cross-view synthesis, on the other hand, seeks to generate ground-level views based on information from overhead imagery, generally using generative adversarial networks (GANs) or diffusion models. This paper presents a comprehensive survey of advances in cross-view localization and synthesis, reviewing widely used datasets, highlighting key challenges, and providing an organized overview of state-of-the-art techniques. Furthermore, it discusses current limitations, offers comparative analyses, and outlines promising directions for future research. We also include the project page via https://github.com/GDAOSU/Awesome-Cross-View-Methods.
[354] ConMatFormer: A Multi-attention and Transformer Integrated ConvNext based Deep Learning Model for Enhanced Diabetic Foot Ulcer Classification
Raihan Ahamed Rifat, Fuyad Hasan Bhoyan, Md Humaion Kabir Mehedi, Md Kaviul Hossain, Md. Jakir Hossen, M. F. Mridha
Main category: cs.CV
TL;DR: ConMatFormer is a hybrid deep learning model combining ConvNeXt blocks, attention mechanisms (CBAM and DANet), and transformers for diabetic foot ulcer detection, achieving state-of-the-art performance with 97.55% accuracy.
Details
Motivation: Diabetic foot ulcer detection faces challenges due to scarce and variable public datasets, requiring models that can handle class imbalance and accurately identify underrepresented DFU classes.
Method: Proposed ConMatFormer architecture using ConvNeXt blocks for local features, transformer modules for long-range dependencies, multiple attention mechanisms (CBAM and DANet), and data augmentation to address class imbalance.
Result: Achieved 0.8961 accuracy and 0.9160 precision in a single experiment, and 0.9755 accuracy with 0.0031 std in 4-fold cross-validation, outperforming SOTA CNN and ViT models on DFUC2021 and DFU datasets.
Conclusion: ConMatFormer sets a new benchmark for DFU classification and provides a reliable hybrid attention transformer framework for medical image analysis, with explainable AI methods ensuring transparent decision-making.
Abstract: Diabetic foot ulcer (DFU) detection is a clinically significant yet challenging task due to the scarcity and variability of publicly available datasets. To solve these problems, we propose ConMatFormer, a new hybrid deep learning architecture that combines ConvNeXt blocks, multiple attention mechanisms (the convolutional block attention module (CBAM) and dual attention network (DANet)), and transformer modules in a complementary design. This design facilitates the extraction of better local features and understanding of the global context, which allows us to model small skin patterns across different types of DFU very accurately. To address the class imbalance, we used data augmentation methods. A ConvNeXt block was used to obtain detailed local features in the initial stages. Subsequently, we extended the model by adding a transformer module to enhance long-range dependency. This enabled us to pinpoint the DFU classes that were underrepresented or constituted minorities. Tests on the DS1 (DFUC2021) and DS2 (diabetic foot ulcer (DFU)) datasets showed that ConMatFormer outperformed state-of-the-art (SOTA) convolutional neural network (CNN) and Vision Transformer (ViT) models in terms of accuracy, reliability, and flexibility. The proposed method achieved an accuracy of 0.8961 and a precision of 0.9160 in a single experiment, which is a significant improvement over the current standards for classifying DFUs. In addition, by 4-fold cross-validation, the proposed model achieved an accuracy of 0.9755 with a standard deviation of only 0.0031. We further applied explainable artificial intelligence (XAI) methods, such as Grad-CAM, Grad-CAM++, and LIME, to consistently monitor the transparency and trustworthiness of the decision-making process. Our findings set a new benchmark for DFU classification and provide a hybrid attention transformer framework for medical image analysis.
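Of the two attention mechanisms named, CBAM has a standard published form, sketched below in PyTorch; how ConMatFormer wires it between the ConvNeXt and transformer stages is not specified here, so the placement is an assumption.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention from
    pooled descriptors, then spatial attention from a 7x7 conv."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)    # channel attention
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))           # spatial attention

out = CBAM(64)(torch.randn(2, 64, 32, 32))
```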
[355] Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models
Jiaxiang Liu, Jiawei Du, Xiao Liu, Prayag Tiwari, Mingkun Xu
Main category: cs.CV
TL;DR: Self-Calibrated Consistency (SCC) is a test-time defense method that improves CLIP’s zero-shot robustness against adversarial attacks by leveraging semantic and spatial consistency through pseudo-labels and multi-view predictions.
Details
Motivation: CLIP models are vulnerable to adversarial perturbations that disrupt image-text alignment, and existing defenses require labeled data for adversarial fine-tuning, limiting their applicability in zero-shot settings.
Method: SCC consists of two modules: Semantic consistency uses soft pseudo-labels from counterattack warm-up and multi-view predictions to regularize cross-modal alignment; Spatial consistency aligns perturbed visual predictions via augmented views to stabilize inference.
Result: Extensive experiments on 22 benchmarks show SCC consistently improves CLIP’s zero-shot robustness while maintaining accuracy, and can be integrated with other VLMs for further gains.
Conclusion: SCC demonstrates the potential for establishing an adversarially robust paradigm from CLIP, with implications extending to broader vision-language domains like BioMedCLIP.
Abstract: Pre-trained vision-language models (VLMs) such as CLIP have demonstrated strong zero-shot capabilities across diverse domains, yet remain highly vulnerable to adversarial perturbations that disrupt image-text alignment and compromise reliability. Existing defenses typically rely on adversarial fine-tuning with labeled data, limiting their applicability in zero-shot settings. In this work, we identify two key weaknesses of current CLIP adversarial attacks – lack of semantic guidance and vulnerability to view variations – collectively termed semantic and viewpoint fragility. To address these challenges, we propose Self-Calibrated Consistency (SCC), an effective test-time defense. SCC consists of two complementary modules: Semantic consistency, which leverages soft pseudo-labels from counterattack warm-up and multi-view predictions to regularize cross-modal alignment and separate the target embedding from confusable negatives; and Spatial consistency, aligning perturbed visual predictions via augmented views to stabilize inference under adversarial perturbations. Together, these modules form a plug-and-play inference strategy. Extensive experiments on 22 benchmarks under diverse attack settings show that SCC consistently improves the zero-shot robustness of CLIP while maintaining accuracy, and can be seamlessly integrated with other VLMs for further gains. These findings highlight the great potential of establishing an adversarially robust paradigm from CLIP, with implications extending to broader vision-language domains such as BioMedCLIP.
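The spatial-consistency idea can be approximated by a simple consensus over augmented views. The sketch below averages CLIP image-text logits across views, following the open-source CLIP interface; SCC's counterattack warm-up and pseudo-label calibration are omitted.

```python
import torch

def multi_view_logits(clip_model, image_views, text_features):
    """Average zero-shot CLIP logits over augmented views of one
    (possibly perturbed) input to stabilize inference (illustrative)."""
    logits = []
    with torch.no_grad():
        for view in image_views:  # e.g., random crops/flips of the input
            img = clip_model.encode_image(view)
            img = img / img.norm(dim=-1, keepdim=True)
            logits.append(100.0 * img @ text_features.T)
    return torch.stack(logits).mean(dim=0)  # consensus prediction
```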
[356] MedXplain-VQA: Multi-Component Explainable Medical Visual Question Answering
Hai-Dang Nguyen, Minh-Anh Dang, Minh-Tan Le, Minh-Tuan Le
Main category: cs.CV
TL;DR: MedXplain-VQA is an explainable medical VQA framework that integrates five AI components to provide transparent reasoning for medical image analysis, achieving significant improvements over baseline methods.
Details
Motivation: Explainability is critical for clinical adoption of medical VQA systems, as physicians require transparent reasoning to trust AI-generated diagnoses.
Method: Framework integrates fine-tuned BLIP-2 backbone, medical query reformulation, enhanced Grad-CAM attention, precise region extraction, and structured chain-of-thought reasoning via multi-modal language models.
Result: Experiments on 500 PathVQA samples show a composite score of 0.683 vs. a 0.378 baseline with high reasoning confidence (0.890); the system identifies 3-5 relevant regions per sample and generates 57-word structured explanations with clinical terminology.
Conclusion: MedXplain-VQA shows potential as a robust, explainable medical VQA system, with query reformulation providing most significant initial improvement and chain-of-thought enabling systematic diagnostics.
Abstract: Explainability is critical for the clinical adoption of medical visual question answering (VQA) systems, as physicians require transparent reasoning to trust AI-generated diagnoses. We present MedXplain-VQA, a comprehensive framework integrating five explainable AI components to deliver interpretable medical image analysis. The framework leverages a fine-tuned BLIP-2 backbone, medical query reformulation, enhanced Grad-CAM attention, precise region extraction, and structured chain-of-thought reasoning via multi-modal language models. To evaluate the system, we introduce a medical-domain-specific framework replacing traditional NLP metrics with clinically relevant assessments, including terminology coverage, clinical structure quality, and attention region relevance. Experiments on 500 PathVQA histopathology samples demonstrate substantial improvements, with the enhanced system achieving a composite score of 0.683 compared to 0.378 for baseline methods, while maintaining high reasoning confidence (0.890). Our system identifies 3-5 diagnostically relevant regions per sample and generates structured explanations averaging 57 words with appropriate clinical terminology. Ablation studies reveal that query reformulation provides the most significant initial improvement, while chain-of-thought reasoning enables systematic diagnostic processes. These findings underscore the potential of MedXplain-VQA as a robust, explainable medical VQA system. Future work will focus on validation with medical experts and large-scale clinical datasets to ensure clinical readiness.
[357] MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control
Fatemeh Nazarieh, Zhenhua Feng, Diptesh Kanojia, Muhammad Awais, Josef Kittler
Main category: cs.CV
TL;DR: MAGIC-Talk is a one-shot diffusion-based framework for customizable and temporally stable talking face generation that addresses identity preservation, temporal consistency, and customization issues in audio-driven talking face generation.
Details
Motivation: Current audio-driven talking face generation methods struggle with temporal consistency, identity preservation, and customization, especially in long video generation scenarios.
Method: MAGIC-Talk uses a dual-network architecture with ReferenceNet for identity preservation and fine-grained facial editing via text prompts, and AnimateNet for motion coherence using structured motion priors. It employs a progressive latent fusion strategy for long-form video quality improvement.
Result: Extensive experiments show MAGIC-Talk outperforms state-of-the-art methods in visual quality, identity preservation, and synchronization accuracy.
Conclusion: MAGIC-Talk offers a robust solution for talking face generation that maintains identity from a single image while ensuring smooth transitions across frames, without requiring multiple reference images or fine-tuning.
Abstract: Audio-driven talking face generation has gained significant attention for applications in digital media and virtual avatars. While recent methods improve audio-lip synchronization, they often struggle with temporal consistency, identity preservation, and customization, especially in long video generation. To address these issues, we propose MAGIC-Talk, a one-shot diffusion-based framework for customizable and temporally stable talking face generation. MAGIC-Talk consists of ReferenceNet, which preserves identity and enables fine-grained facial editing via text prompts, and AnimateNet, which enhances motion coherence using structured motion priors. Unlike previous methods requiring multiple reference images or fine-tuning, MAGIC-Talk maintains identity from a single image while ensuring smooth transitions across frames. Additionally, a progressive latent fusion strategy is introduced to improve long-form video quality by reducing motion inconsistencies and flickering. Extensive experiments demonstrate that MAGIC-Talk outperforms state-of-the-art methods in visual quality, identity preservation, and synchronization accuracy, offering a robust solution for talking face generation.
[358] M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark
Huixuan Zhang, Xiaojun Wan
Main category: cs.CV
TL;DR: M³T2IBench is a challenging benchmark for evaluating text-to-image models, focusing on multi-category, multi-instance, multi-relation scenarios. The paper also proposes AlignScore metric and Revise-Then-Enforce method to improve image-text alignment.
Details
Motivation: Current text-to-image models struggle with complex prompts containing multiple instances of the same category, and existing evaluation methods either oversimplify scenarios or use metrics that don't correlate well with human judgment.
Method: Introduces M³T2IBench benchmark with AlignScore evaluation metric based on object detection, and proposes Revise-Then-Enforce, a training-free post-editing approach to enhance image-text alignment in diffusion models.
Result: Current open-source text-to-image models perform poorly on the challenging M³T2IBench benchmark, but the proposed Revise-Then-Enforce method demonstrates improved image-text alignment across various diffusion models.
Conclusion: The M³T2IBench benchmark effectively reveals limitations in current text-to-image models for complex scenarios, and the proposed Revise-Then-Enforce approach offers a practical solution to enhance alignment without requiring model retraining.
Abstract: Text-to-image models are known to struggle with generating images that perfectly align with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, especially overlooking the difficulty of prompts with multiple different instances belonging to the same category, or they introduce metrics that do not correlate well with human evaluation. In this study, we introduce M$^3$T2IBench, a large-scale, multi-category, multi-instance, multi-relation benchmark, along with an object-detection-based evaluation metric, AlignScore, which aligns closely with human evaluation. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose the Revise-Then-Enforce approach to enhance image-text alignment. This training-free post-editing method demonstrates improvements in image-text alignment across a broad range of diffusion models. (Our code and data have been released in the supplementary material and will be made publicly available after the paper is accepted.)
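To give a feel for a detection-based alignment metric, here is a toy counting check: compare per-category instance counts required by the prompt against detector output. The real AlignScore also scores relations and attributes; this illustrates only the counting component, and the field names are assumptions.

```python
def count_align(required, detections):
    """Fraction of required instances found, averaged over categories."""
    scores = []
    for category, needed in required.items():
        found = sum(1 for d in detections if d["label"] == category)
        scores.append(min(found, needed) / needed)
    return sum(scores) / len(scores)

required = {"dog": 2, "ball": 1}  # parsed from the prompt
detections = [{"label": "dog"}, {"label": "dog"}, {"label": "cat"}]
print(count_align(required, detections))  # 0.5: both dogs found, ball missing
```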
[359] FairJudge: MLLM Judging for Social Attributes and Prompt Image Alignment
Zahraa Al Sahili, Maryam Fetanat, Maimuna Nowaz, Ioannis Patras, Matthew Purver
Main category: cs.CV
TL;DR: FairJudge is a lightweight evaluation protocol that uses multimodal LLMs as fair judges to assess how well text-to-image systems align with prompts and handle social attributes, providing accountable, evidence-aware decisions with calibrated abstention.
Details
Motivation: Current text-to-image evaluation methods (face classifiers, contrastive similarity) lack calibrated abstention, miss weakly visible attributes, and fail to provide accountable decisions for fairness assessment.
Method: Uses instruction-following multimodal LLMs as judges with explanation-oriented rubrics mapped to [-1,1], closed label sets, evidence grounding in visible content, and mandatory abstention when cues are insufficient.
Result: Outperforms contrastive and face-centric baselines on demographic prediction across multiple datasets (FairFace, PaTA, FairCoT), improves mean alignment while maintaining high profession accuracy on IdenProf, FairCoT-Professions, and DIVERSIFY-Professions.
Conclusion: FairJudge enables more reliable and reproducible fairness audits for text-to-image systems by providing accountable, evidence-aware evaluation that addresses limitations of existing methods.
Abstract: Text-to-image (T2I) systems lack simple, reproducible ways to evaluate how well images match prompts and how models treat social attributes. Common proxies – face classifiers and contrastive similarity – reward surface cues, lack calibrated abstention, and miss attributes only weakly visible (for example, religion, culture, disability). We present FairJudge, a lightweight protocol that treats instruction-following multimodal LLMs as fair judges. It scores alignment with an explanation-oriented rubric mapped to [-1, 1]; constrains judgments to a closed label set; requires evidence grounded in the visible content; and mandates abstention when cues are insufficient. Unlike CLIP-only pipelines, FairJudge yields accountable, evidence-aware decisions; unlike mitigation that alters generators, it targets evaluation fairness. We evaluate gender, race, and age on FairFace, PaTA, and FairCoT; extend to religion, culture, and disability; and assess profession correctness and alignment on IdenProf, FairCoT-Professions, and our new DIVERSIFY-Professions. We also release DIVERSIFY, a 469-image corpus of diverse, non-iconic scenes. Across datasets, judge models outperform contrastive and face-centric baselines on demographic prediction and improve mean alignment while maintaining high profession accuracy, enabling more reliable, reproducible fairness audits.
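The protocol's constraints (closed label set, grounded evidence, mandatory abstention, scores in [-1, 1]) are easy to enforce on the judge's output. A minimal sketch with an assumed JSON schema; FairJudge's actual rubric and field names may differ.

```python
import json

CLOSED_LABELS = {"male", "female", "abstain"}  # illustrative label set

def parse_judgment(raw_json):
    """Enforce a closed label set, required evidence, and a clamped
    alignment score; fall back to abstention rather than guessing."""
    out = json.loads(raw_json)
    label = out.get("label", "abstain")
    if label not in CLOSED_LABELS or not out.get("evidence", "").strip():
        label = "abstain"
    score = max(-1.0, min(1.0, float(out.get("alignment", 0.0))))
    return label, score

label, score = parse_judgment(
    '{"label": "female", "evidence": "visible face", "alignment": 0.8}'
)
```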
[360] UniAIDet: A Unified and Universal Benchmark for AI-Generated Image Content Detection and Localization
Huixuan Zhang, Xiaojun Wan
Main category: cs.CV
TL;DR: UniAIDet is a unified benchmark for AI-generated image detection that covers diverse generative models and both photographic/artistic images, addressing limitations in existing benchmarks.
Details
Motivation: Current AI-generated content detection benchmarks are limited in coverage of diverse generative models and image categories, often missing end-to-end image editing and artistic images.
Method: Developed UniAIDet benchmark covering text-to-image, image-to-image, image inpainting, image editing, and deepfake models with both photographic and artistic images. Conducted comprehensive evaluation of various detection methods.
Result: The benchmark enables answering key research questions about generalization capability and the relation between detection and localization in AI-generated image detection.
Conclusion: UniAIDet provides a robust foundation for future research in AI-generated image detection by offering comprehensive coverage of diverse generative scenarios.
Abstract: With the rapid proliferation of image generative models, the authenticity of digital images has become a significant concern. While existing studies have proposed various methods for detecting AI-generated content, current benchmarks are limited in their coverage of diverse generative models and image categories, often overlooking end-to-end image editing and artistic images. To address these limitations, we introduce UniAIDet, a unified and comprehensive benchmark that includes both photographic and artistic images. UniAIDet covers a wide range of generative models, including text-to-image, image-to-image, image inpainting, image editing, and deepfake models. Using UniAIDet, we conduct a comprehensive evaluation of various detection methods and answer three key research questions regarding generalization capability and the relation between detection and localization. Our benchmark and analysis provide a robust foundation for future research.
[361] Semantic-Preserving Cross-Style Visual Reasoning for Robust Multi-Modal Understanding in Large Vision-Language Models
Aya Nakayama, Brian Wong, Yuji Nishimura, Kaito Tanaka
Main category: cs.CV
TL;DR: SP-CSVR is a novel framework that addresses the “style trap” problem in Large Vision-Language Models by enabling robust semantic understanding across diverse visual styles through style-content disentanglement and adaptive cross-style reasoning.
Details
Motivation: The "style trap" challenge hinders LVLMs' robust semantic understanding across diverse visual styles, especially in in-context learning, as existing methods fail to effectively decouple style from content.Method: SP-CSVR integrates three key components: Cross-Style Feature Encoder for style-content disentanglement, Semantic-Aligned In-Context Decoder for few-shot style adaptation, and Adaptive Semantic Consistency Module using multi-task contrastive learning to enforce cross-style semantic invariance.
Result: Extensive experiments on a challenging multi-style dataset demonstrate SP-CSVR’s state-of-the-art performance across visual captioning, visual question answering, and in-context style adaptation.
Conclusion: SP-CSVR effectively enhances robustness, generalization, and efficiency across diverse visual styles, addressing the style trap problem in LVLMs.
Abstract: The “style trap” poses a significant challenge for Large Vision-Language Models (LVLMs), hindering robust semantic understanding across diverse visual styles, especially in in-context learning (ICL). Existing methods often fail to effectively decouple style from content, hindering generalization. To address this, we propose the Semantic-Preserving Cross-Style Visual Reasoner (SP-CSVR), a novel framework for stable semantic understanding and adaptive cross-style visual reasoning. SP-CSVR integrates a Cross-Style Feature Encoder (CSFE) for style-content disentanglement, a Semantic-Aligned In-Context Decoder (SAICD) for efficient few-shot style adaptation, and an Adaptive Semantic Consistency Module (ASCM) employing multi-task contrastive learning to enforce cross-style semantic invariance. Extensive experiments on a challenging multi-style dataset demonstrate SP-CSVR’s state-of-the-art performance across visual captioning, visual question answering, and in-context style adaptation. Comprehensive evaluations, including ablation studies and generalization analysis, confirm SP-CSVR’s efficacy in enhancing robustness, generalization, and efficiency across diverse visual styles.
[362] FastJAM: a Fast Joint Alignment Model for Images
Omri Hirsch, Ron Shapira Weber, Shira Ifergane, Oren Freifeld
Main category: cs.CV
TL;DR: FastJAM is a rapid graph-based method for joint image alignment that reduces computational complexity from hours/minutes to seconds while achieving better alignment quality than existing methods.
Details
Motivation: Existing joint alignment approaches require long training times, large models, and extensive hyperparameter tuning, creating a need for faster and more efficient methods.
Method: Leverages pairwise matches from off-the-shelf image matchers with nonparametric clustering to build a graph of keypoint relations, uses GNN for correspondence propagation, and employs inverse-compositional loss without regularization terms.
Result: Achieves better alignment quality than modern JA methods while reducing computation time from hours/minutes to mere seconds on several benchmarks.
Conclusion: FastJAM provides an efficient and effective solution for joint image alignment that eliminates the need for regularization terms and associated hyperparameter tuning.
Abstract: Joint Alignment (JA) of images aims to align a collection of images into a unified coordinate frame, such that semantically-similar features appear at corresponding spatial locations. Most existing approaches often require long training times, large-capacity models, and extensive hyperparameter tuning. We introduce FastJAM, a rapid, graph-based method that drastically reduces the computational complexity of joint alignment tasks. FastJAM leverages pairwise matches computed by an off-the-shelf image matcher, together with a rapid nonparametric clustering, to construct a graph representing intra- and inter-image keypoint relations. A graph neural network propagates and aggregates these correspondences, efficiently predicting per-image homography parameters via image-level pooling. Utilizing an inverse-compositional loss that eliminates the need for a regularization term over the predicted transformations (and thus also obviates the hyperparameter tuning associated with such terms), FastJAM performs image JA quickly and effectively. Experimental results on several benchmarks demonstrate that FastJAM achieves results better than existing modern JA methods in terms of alignment quality, while reducing computation time from hours or minutes to mere seconds. Our code is available at our project webpage, https://bgu-cs-vil.github.io/FastJAM/
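The objective can be pictured as follows: matched keypoints from two images should coincide once each image is warped into the joint frame by its predicted homography, which removes the need for an explicit regularizer on the transformations. Whether this matches FastJAM's exact inverse-compositional formulation is an assumption.

```python
import torch

def homography_warp(points, H):
    """Apply a 3x3 homography to N 2D points."""
    ones = torch.ones(points.shape[0], 1)
    p = torch.cat([points, ones], dim=1) @ H.T
    return p[:, :2] / p[:, 2:].clamp(min=1e-8)

def match_consistency_loss(H_i, H_j, kpts_i, kpts_j):
    """Matched keypoints of images i and j should agree in the joint frame."""
    return (homography_warp(kpts_i, H_i)
            - homography_warp(kpts_j, H_j)).pow(2).sum(-1).mean()

loss = match_consistency_loss(torch.eye(3), torch.eye(3),
                              torch.rand(50, 2), torch.rand(50, 2))
```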
[363] Semantic Surgery: Zero-Shot Concept Erasure in Diffusion Models
Lexiang Xiong, Chengyu Liu, Jingwen Ye, Yan Liu, Yuecong Xu
Main category: cs.CV
TL;DR: Semantic Surgery is a training-free, zero-shot framework for concept erasure in text-to-image diffusion models that operates on text embeddings before diffusion, using calibrated vector subtraction to neutralize target concepts while preserving image quality.
Details
Motivation: Existing concept erasure methods often compromise generative quality, creating a need for approaches that can effectively remove harmful content while maintaining high-quality image generation.
Method: The framework dynamically estimates target concept presence in prompts and performs calibrated vector subtraction on text embeddings. It includes Co-Occurrence Encoding for multi-concept erasure and a visual feedback loop to address latent concept persistence.
Result: Achieved a 93.58 H-score in object erasure, reduced explicit content to just 1 instance, and reached 8.09 H_a in style erasure with no quality degradation. Significantly outperformed state-of-the-art approaches across various erasure tasks.
Conclusion: Semantic Surgery provides a practical solution for safer text-to-image generation, functioning as both an effective concept erasure method and a built-in threat detection system while preserving locality and image quality.
Abstract: Concept erasure in text-to-image diffusion models is crucial for mitigating harmful content, yet existing methods often compromise generative quality. We introduce Semantic Surgery, a novel training-free, zero-shot framework for concept erasure that operates directly on text embeddings before the diffusion process. It dynamically estimates the presence of target concepts in a prompt and performs a calibrated vector subtraction to neutralize their influence at the source, enhancing both erasure completeness and locality. The framework includes a Co-Occurrence Encoding module for robust multi-concept erasure and a visual feedback loop to address latent concept persistence. As a training-free method, Semantic Surgery adapts dynamically to each prompt, ensuring precise interventions. Extensive experiments on object, explicit content, artistic style, and multi-celebrity erasure tasks show our method significantly outperforms state-of-the-art approaches. We achieve superior completeness and robustness while preserving locality and image quality (e.g., 93.58 H-score in object erasure, reducing explicit content to just 1 instance, and 8.09 H_a in style erasure with no quality degradation). This robustness also allows our framework to function as a built-in threat detection system, offering a practical solution for safer text-to-image generation.
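The central operation, calibrated vector subtraction on the text embedding, reduces to estimating concept presence and removing a scaled concept direction. A minimal sketch on a pooled embedding; the paper's calibration, Co-Occurrence Encoding, and visual feedback loop are not reproduced here.

```python
import torch
import torch.nn.functional as F

def erase_concept(prompt_emb, concept_emb, gain=1.0):
    """Estimate concept presence via the projection onto the concept
    direction, then subtract a proportionally scaled component."""
    direction = F.normalize(concept_emb, dim=-1)
    presence = (prompt_emb * direction).sum(-1, keepdim=True).clamp(min=0)
    return prompt_emb - gain * presence * direction

prompt_emb = torch.randn(1, 768)   # pooled text-encoder embedding
concept_emb = torch.randn(1, 768)  # embedding of the concept to erase
edited = erase_concept(prompt_emb, concept_emb)
```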
[364] Seeing the Unseen: Towards Zero-Shot Inspection for Wind Turbine Blades using Knowledge-Augmented Vision Language Models
Yang Zhang, Qianyu Zhou, Farhad Imani, Jiong Tang
Main category: cs.CV
TL;DR: A zero-shot inspection framework using RAG with Vision-Language Models for wind turbine blade damage detection, eliminating need for large labeled datasets.
Details
Motivation: Address limitations of drone-based inspection and deep learning that require large labeled datasets, which are impractical for detecting rare or evolving damage types in wind turbine blades.
Method: Integrates Retrieval-Augmented Generation (RAG) with Vision-Language Models (VLM), using a multimodal knowledge base with technical documentation, reference images, and domain guidelines. A hybrid text-image retriever with keyword-aware reranking provides relevant context for VLM inference.
Result: On 30 labeled blade images covering diverse damage categories, the RAG-grounded VLM achieved 100% correct classification, outperforming the same VLM without retrieval in both accuracy and precision.
Conclusion: The framework provides explainable and generalizable damage detection, enabling detection of unseen defects by leveraging domain knowledge rather than visual cues alone, offering a data-efficient solution for industrial inspection.
Abstract: Wind turbine blades operate in harsh environments, making timely damage detection essential for preventing failures and optimizing maintenance. Drone-based inspection and deep learning are promising, but typically depend on large, labeled datasets, which limit their ability to detect rare or evolving damage types. To address this, we propose a zero-shot-oriented inspection framework that integrates Retrieval-Augmented Generation (RAG) with Vision-Language Models (VLM). A multimodal knowledge base is constructed, comprising technical documentation, representative reference images, and domain-specific guidelines. A hybrid text-image retriever with keyword-aware reranking assembles the most relevant context to condition the VLM at inference, injecting domain knowledge without task-specific training. We evaluate the framework on 30 labeled blade images covering diverse damage categories. Although the dataset is small due to the difficulty of acquiring verified blade imagery, it covers multiple representative defect types. On this test set, the RAG-grounded VLM correctly classified all samples, whereas the same VLM without retrieval performed worse in both accuracy and precision. We further compare against open-vocabulary baselines and incorporate Clopper-Pearson confidence intervals to account for uncertainty in the small-sample setting. Ablation studies indicate that the key advantage of the framework lies in explainability and generalizability: retrieved references ground the reasoning process and enable the detection of previously unseen defects by leveraging domain knowledge rather than relying solely on visual cues. This research contributes a data-efficient solution for industrial inspection that reduces dependence on extensive labeled datasets.
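A hybrid retriever with keyword-aware reranking can be sketched as a weighted blend of dense cosine similarity and keyword overlap; the fields, weighting, and tokenization below are assumptions, not the paper's implementation.

```python
import numpy as np

def rerank(query, query_emb, docs, alpha=0.7):
    """Blend dense similarity with keyword overlap and sort descending."""
    q_terms = set(query.lower().split())
    scored = []
    for doc in docs:  # each doc: {"emb": np.ndarray, "text": str, ...}
        dense = float(np.dot(query_emb, doc["emb"]) /
                      (np.linalg.norm(query_emb) * np.linalg.norm(doc["emb"])))
        kw = len(q_terms & set(doc["text"].lower().split())) / max(len(q_terms), 1)
        scored.append((alpha * dense + (1 - alpha) * kw, doc))
    return [d for _, d in sorted(scored, key=lambda x: -x[0])]
```

The top-ranked entries (guideline text plus reference images) would then be packed into the VLM prompt at inference.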
[365] Estimating Pasture Biomass from Top-View Images: A Dataset for Precision Agriculture
Qiyu Liao, Dadong Wang, Rebecca Haling, Jiajun Liu, Xun Li, Martyna Plomecka, Andrew Robson, Matthew Pringle, Rhys Pirie, Megan Walker, Joshua Whelan
Main category: cs.CV
TL;DR: A comprehensive dataset of 1,162 annotated pasture images with biomass measurements for machine learning-based pasture biomass estimation.
Details
Motivation: Accurate pasture biomass estimation is crucial for livestock production management to optimize stocking rates and prevent overgrazing.
Method: Created a dataset of top-view pasture images across 19 Australian locations with paired ground measurements including biomass components, vegetation height, and NDVI data.
Result: Released a multidimensional dataset combining visual, spectral, and structural information for precision grazing management applications.
Conclusion: The dataset enables new possibilities for machine learning approaches to pasture biomass estimation and is available through a Kaggle competition.
Abstract: Accurate estimation of pasture biomass is important for decision-making in livestock production systems. Estimates of pasture biomass can be used to manage stocking rates to maximise pasture utilisation, while minimising the risk of overgrazing and promoting overall system health. We present a comprehensive dataset of 1,162 annotated top-view images of pastures collected across 19 locations in Australia. The images were taken across multiple seasons and include a range of temperate pasture species. Each image captures a 70 cm × 30 cm quadrat and is paired with on-ground measurements including biomass sorted by component (green, dead, and legume fraction), vegetation height, and Normalized Difference Vegetation Index (NDVI) from Active Optical Sensors (AOS). The multidimensional nature of the data, which combines visual, spectral, and structural information, opens up new possibilities for advancing the use of precision grazing management. The dataset is released and hosted in a Kaggle competition that challenges the international Machine Learning community with the task of pasture biomass estimation. The dataset is available on the official Kaggle webpage: https://www.kaggle.com/competitions/csiro-biomass
[366] Gen-LangSplat: Generalized Language Gaussian Splatting with Pre-Trained Feature Compression
Pranav Saxena
Main category: cs.CV
TL;DR: Gen-LangSplat eliminates the need for scene-specific language autoencoders in 3D language fields by using a pre-trained generalized autoencoder, achieving efficiency gains while maintaining comparable performance to LangSplat.
Details
Motivation: Current approaches like LangSplat require costly per-scene autoencoder training for feature compression, creating a scalability bottleneck for deploying 3D language fields in real-world applications.
Method: Replaces scene-specific autoencoders with a generalized autoencoder pre-trained on the ScanNet dataset, using a fixed compact latent space across all scenes without scene-specific training.
Result: Achieves efficiency boost in language field construction while delivering comparable or better querying performance than LangSplat, with optimal latent embedding dimensions validated through ablation studies.
Conclusion: Generalized embeddings can efficiently support open-vocabulary querying in novel 3D scenes, enabling scalable real-time interactive 3D AI applications.
Abstract: Modeling open-vocabulary language fields in 3D is essential for intuitive human-AI interaction and querying within physical environments. State-of-the-art approaches, such as LangSplat, leverage 3D Gaussian Splatting to efficiently construct these language fields, encoding features distilled from high-dimensional models like CLIP. However, this efficiency is currently offset by the requirement to train a scene-specific language autoencoder for feature compression, introducing a costly, per-scene optimization bottleneck that hinders deployment scalability. In this work, we introduce Gen-LangSplat, which eliminates this requirement by replacing the scene-wise autoencoder with a generalized autoencoder, pre-trained extensively on the large-scale ScanNet dataset. This architectural shift enables the use of a fixed, compact latent space for language features across any new scene without any scene-specific training. By removing this dependency, our entire language field construction process achieves an efficiency boost while delivering querying performance comparable to, or exceeding, the original LangSplat method. To validate our design choice, we perform a thorough ablation study empirically determining the optimal latent embedding dimension and quantifying representational fidelity using Mean Squared Error and cosine similarity between the original and reprojected 512-dimensional CLIP embeddings. Our results demonstrate that generalized embeddings can efficiently and accurately support open-vocabulary querying in novel 3D scenes, paving the way for scalable, real-time interactive 3D AI applications.
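The replacement component is essentially a compact feature autoencoder trained once and reused across scenes. A minimal sketch with assumed layer widths, also computing the MSE and cosine-similarity fidelity measures the ablation uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LangFeatureAE(nn.Module):
    """Generalized autoencoder compressing 512-d CLIP features into a
    fixed compact latent shared across scenes (widths are assumptions)."""
    def __init__(self, d_in=512, d_latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_latent))
        self.dec = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(), nn.Linear(128, d_in))

    def forward(self, x):
        recon = self.dec(self.enc(x))
        mse = F.mse_loss(recon, x)
        cos = F.cosine_similarity(recon, x, dim=-1).mean()
        return recon, mse, cos

ae = LangFeatureAE()
recon, mse, cos = ae(torch.randn(1024, 512))  # batch of CLIP embeddings
```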
[367] Positional Preservation Embedding for Multimodal Large Language Models
Mouxiao Huang, Borui Jiang, Dehua Zheng, Hailin Hu, Kai Han, Xinghao Chen
Main category: cs.CV
TL;DR: Proposes Positional Preservation Embedding (PPE), a parameter-free operator that preserves spatiotemporal structure during visual token compression in MLLMs by disentangled encoding of 3D positions.
Details
Motivation: Existing token merging methods reduce sequence length but disrupt spatial layouts and temporal continuity by ignoring positional relationships, leading to inefficiencies in multimodal large language models.
Method: PPE introduces disentangled encoding of 3D positions in the token dimension, allowing compressed tokens to encapsulate different positions from multiple original tokens. Supports cascade clustering for progressive token compression.
Result: Applied to a state-of-the-art token merging framework, PPE achieves 2%-5% improvements across multiple vision-language benchmarks including MMBench, TextVQA, and VideoMME.
Conclusion: Preserving positional cues is critical for efficient and effective MLLM reasoning, and PPE provides a generic solution that can be seamlessly integrated into existing token merging methods.
Abstract: Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks, yet often suffer from inefficiencies due to redundant visual tokens. Existing token merging methods reduce sequence length but frequently disrupt spatial layouts and temporal continuity by disregarding positional relationships. In this work, we propose a novel encoding operator dubbed Positional Preservation Embedding (PPE), whose main hallmark is the preservation of spatiotemporal structure during visual token compression. PPE explicitly introduces the disentangled encoding of 3D positions in the token dimension, enabling each compressed token to encapsulate different positions from multiple original tokens. Furthermore, we show that PPE can effectively support cascade clustering – a progressive token compression strategy that leads to better performance retention. PPE is a parameter-free and generic operator that can be seamlessly integrated into existing token merging methods without any adjustments. Applied to a state-of-the-art token merging framework, PPE achieves consistent improvements of 2%-5% across multiple vision-language benchmarks, including MMBench (general vision understanding), TextVQA (layout understanding) and VideoMME (temporal understanding). These results demonstrate that preserving positional cues is critical for efficient and effective MLLM reasoning.
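The gist, that a merged token should keep the positions of every token it absorbs, can be sketched as cluster-wise averaging of both features and position encodings. PPE's disentangled 3D encoding is richer than this mean aggregation, which is an assumption for illustration only.

```python
import torch

def merge_with_positions(tokens, positions, cluster_ids, n_clusters):
    """Merge tokens by cluster while aggregating every member's
    position encoding, so compressed tokens retain location cues."""
    merged = torch.zeros(n_clusters, tokens.shape[-1])
    pos = torch.zeros(n_clusters, positions.shape[-1])
    counts = torch.zeros(n_clusters, 1)
    merged.index_add_(0, cluster_ids, tokens)
    pos.index_add_(0, cluster_ids, positions)
    counts.index_add_(0, cluster_ids, torch.ones(len(tokens), 1))
    return merged / counts.clamp(min=1), pos / counts.clamp(min=1)

tokens = torch.randn(196, 768)              # visual tokens
positions = torch.randn(196, 768)           # (t, h, w) position encodings
cluster_ids = torch.randint(0, 49, (196,))  # e.g., from cascade clustering
feat, pos = merge_with_positions(tokens, positions, cluster_ids, 49)
```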
[368] Bi-Encoder Contrastive Learning for Fingerprint and Iris Biometrics
Matthew So, Judah Goldfeder, Mark Lis, Hod Lipson
Main category: cs.CV
TL;DR: This paper challenges the assumption that biometrics are statistically uncorrelated by testing fingerprint-to-fingerprint, iris-to-iris, and cross-modal fingerprint-to-iris matching using Bi-Encoder networks on 274 subjects.
Details
Motivation: To test the historic assumption that biometrics of an individual are statistically uncorrelated, which has been widely accepted but not thoroughly validated.
Method: Used Bi-Encoder networks with ResNet-50 and Vision Transformer backbones trained with contrastive loss on three verification tasks: fingerprint-to-fingerprint, iris-to-iris, and cross-modal fingerprint-to-iris matching using 274 subjects with ~100k fingerprints and 7k iris images.
Result: Iris-to-iris matching achieved an ROC AUC score of 91, showing that the left and right irises are correlated. Fingerprint models confirmed positive intra-subject correlation. Cross-modal matching performed only slightly above chance. This is the first work to apply Vision Transformers to this matching task.
Conclusion: The findings challenge independence assumptions of biometrics, showing clear correlation between left and right irises and confirming fingerprint correlations. More data and sophisticated pipelines are needed for compelling cross-modal results. Future work will extend to other biometrics.
Abstract: There has been a historic assumption that the biometrics of an individual are statistically uncorrelated. We test this assumption by training Bi-Encoder networks on three verification tasks, including fingerprint-to-fingerprint matching, iris-to-iris matching, and cross-modal fingerprint-to-iris matching using 274 subjects with ~100k fingerprints and 7k iris images. We trained ResNet-50 and Vision Transformer backbones in Bi-Encoder architectures such that the contrastive loss between images sampled from the same individual is minimized. The iris ResNet architecture reaches an ROC AUC score of 91 for iris-to-iris matching, providing clear evidence that the left and right irises of an individual are correlated. Fingerprint models reproduce the positive intra-subject correlation suggested by prior work in this space. This is the first work attempting to use Vision Transformers for this matching. Cross-modal matching rises only slightly above chance, which suggests that more data and a more sophisticated pipeline are needed to obtain compelling results. These findings continue to challenge the independence assumptions of biometrics, and we plan to extend this work to other biometrics in the future. Code available: https://github.com/MatthewSo/bio_fingerprints_iris.
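The training objective described, minimizing contrastive loss between images from the same individual, is a standard symmetric InfoNCE over a batch of subject-paired embeddings. A minimal sketch; the paper's exact loss form and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Row i of emb_a and emb_b come from the same subject (positives);
    all other pairs in the batch act as negatives."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.T / temperature
    labels = torch.arange(len(a))
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2

loss = contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
```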
[369] Switchable Token-Specific Codebook Quantization For Face Image Compression
Yongbo Wang, Haonan Wang, Guodong Mu, Ruixin Zhang, Jiaqi Chen, Jingyun Zhang, Jun Wang, Yuan Xie, Zhizhong Zhang, Shouhong Ding
Main category: cs.CV
TL;DR: Proposes Switchable Token-Specific Codebook Quantization for face image compression, using category-specific codebook groups and token-specific codebooks to improve performance at low bitrates.
Details
Motivation: Global codebook strategies for image compression overlook category-specific correlations and semantic token differences in facial images, leading to suboptimal performance at low bpp.
Method: Learns distinct codebook groups for different image categories and assigns independent codebooks to each token, recording codebook group membership with minimal bits to reduce loss when decreasing codebook size.
Result: Achieves 93.51% average accuracy for reconstructed images at 0.05 bpp on face recognition datasets, enabling larger total codebooks under lower overall bpp.
Conclusion: The method enhances expressive capability and reconstruction performance for face images, is generalizable to existing codebook-based approaches, and effectively addresses limitations of global codebook strategies.
Abstract: With the ever-increasing volume of visual data, the efficient and lossless transmission, along with its subsequent interpretation and understanding, has become a critical bottleneck in modern information systems. Recently emerged codebook-based solutions utilize a globally shared codebook to quantize and dequantize each token, controlling the bpp by adjusting the number of tokens or the codebook size. However, for facial images, which are rich in attributes, such global codebook strategies overlook both the category-specific correlations within images and the semantic differences among tokens, resulting in suboptimal performance, especially at low bpp. Motivated by these observations, we propose a Switchable Token-Specific Codebook Quantization for face image compression, which learns distinct codebook groups for different image categories and assigns an independent codebook to each token. By recording the codebook group to which each token belongs with a small number of bits, our method can reduce the loss incurred when decreasing the size of each codebook group. This enables a larger total number of codebooks under a lower overall bpp, thereby enhancing the expressive capability and improving reconstruction performance. Owing to its generalizable design, our method can be integrated into any existing codebook-based representation learning approach and has demonstrated its effectiveness on face recognition datasets, achieving an average accuracy of 93.51% for reconstructed images at 0.05 bpp.
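Token-specific quantization with transmitted group indices can be pictured as below: each token is matched only against its own codebook, and the small group index travels alongside the code index. Shapes are illustrative, and the switchable group learning is not shown.

```python
import torch

def quantize(tokens, codebooks, group_ids):
    """Nearest-code lookup where token t uses the codebook selected by
    its per-token group index (transmitted with a few extra bits)."""
    indices = torch.empty(len(tokens), dtype=torch.long)
    for t in range(len(tokens)):
        cb = codebooks[group_ids[t]]                      # (codes, dim)
        indices[t] = (tokens[t] - cb).pow(2).sum(-1).argmin()
    return indices  # transmit per-token code index + group_ids

codebooks = torch.randn(4, 64, 32)   # 4 groups, 64 codes each, dim 32
tokens = torch.randn(16, 32)         # 16 tokens for one face image
codes = quantize(tokens, codebooks, torch.randint(0, 4, (16,)))
```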
[370] LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
Zeyu Wang, Zilong Chen, Chenhui Gou, Feng Li, Chaorui Deng, Deyao Zhu, Kunchang Li, Weihao Yu, Haoqin Tu, Haoqi Fan, Cihang Xie
Main category: cs.CV
TL;DR: Efficient multimodal model fusion using interleaved self-attention blocks achieves strong performance with minimal training (35B tokens) by combining specialized generation and understanding models.
Details
Motivation: Current unified multimodal models require training from scratch with substantial computational resources, but competitive performance can be obtained more efficiently by fusing existing specialized models.
Method: Retain original model blocks while interleaving multimodal self-attention blocks throughout networks, enabling a double fusion mechanism that preserves base model strengths while catalyzing synergistic fusion of semantic and spatial representations.
Result: Achieves strong benchmarks: 0.91 on GenEval, 82.16 on DPG-Bench, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for various multimodal tasks including text-to-image generation and image editing.
Conclusion: Strategic fusion of publicly available specialized models with minimal training can achieve competitive multimodal performance, supporting future research through full release of code, models, and datasets.
Abstract: Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.
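The interleaving pattern lends itself to a compact sketch. The following is a hypothetical PyTorch outline assuming placeholder block lists from the two frozen base models; it is an illustration of the idea, not LightBagel's actual code.

```python
import torch
import torch.nn as nn

class DoubleFusionStack(nn.Module):
    """Minimal sketch of the interleaving pattern: frozen blocks from an
    understanding model and a generation model, with newly trained joint
    self-attention inserted at each depth. Block modules are placeholders."""
    def __init__(self, und_blocks, gen_blocks, dim: int, heads: int = 8):
        super().__init__()
        self.und_blocks = nn.ModuleList(und_blocks)  # frozen, semantic tokens
        self.gen_blocks = nn.ModuleList(gen_blocks)  # frozen, spatial tokens
        self.fusion = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True)
             for _ in und_blocks]
        )

    def forward(self, u, g):
        # u: (B, Nu, D) understanding tokens; g: (B, Ng, D) generation tokens
        for ub, gb, fuse in zip(self.und_blocks, self.gen_blocks, self.fusion):
            u, g = ub(u), gb(g)                   # original (frozen) branches
            joint = torch.cat([u, g], dim=1)      # shared multimodal sequence
            mixed, _ = fuse(joint, joint, joint)  # new multimodal self-attention
            joint = joint + mixed                 # residual keeps base features
            u, g = joint[:, :u.size(1)], joint[:, u.size(1):]
        return u, g
```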
[371] FAME: Fairness-aware Attention-modulated Video Editing
Zhangkai Wu, Xuhui Fan, Zhongyuan Xie, Kaize Shi, Zhidong Li, Longbing Cao
Main category: cs.CV
TL;DR: FAME is a fairness-aware video editing method that mitigates profession-related gender biases while maintaining prompt alignment and temporal consistency through fairness embeddings and attention modulation.
Details
Motivation: Training-free video editing models tend to reinforce gender stereotypes when rendering profession-related prompts, creating biased outputs.
Method: FAME injects debiasing tokens into text encoders for fairness embeddings, and integrates fairness modulation into temporal self-attention (using region-constrained masks with time decay) and cross-attention (using fairness-sensitive similarity masks) to prevent motion corruption and temporal inconsistency.
Result: Extensive experiments on the FairVE benchmark show FAME achieves stronger fairness alignment and semantic fidelity compared to existing video editing baselines.
Conclusion: FAME effectively mitigates gender biases in video editing while preserving temporal coherence and prompt alignment through its attention modulation framework.
Abstract: Training-free video editing (VE) models tend to fall back on gender stereotypes when rendering profession-related prompts. We propose FAME (Fairness-aware Attention-modulated Video Editing), which mitigates profession-related gender biases while preserving prompt alignment and temporal consistency for coherent VE. We derive fairness embeddings from existing minority representations by softly injecting debiasing tokens into the text encoder. Simultaneously, FAME integrates fairness modulation into both temporal self-attention and prompt-to-region cross-attention to mitigate the motion corruption and temporal inconsistency caused by directly introducing fairness cues. For temporal self-attention, FAME introduces a region-constrained attention mask combined with time-decay weighting, which enhances intra-region coherence while suppressing irrelevant inter-region interactions. For cross-attention, it reweights token-to-region matching scores by incorporating fairness-sensitive similarity masks derived from debiasing prompt embeddings. Together, these modulations keep fairness-sensitive semantics tied to the right visual regions and prevent temporal drift across frames. Extensive experiments on the new fairness-oriented VE benchmark FairVE demonstrate that FAME achieves stronger fairness alignment and semantic fidelity, surpassing existing VE baselines.
[372] Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges
Liling Yang, Ning Chen, Jun Yue, Yidan Liu, Jiayi Ma, Pedram Ghamisi, Antonio Plaza, Leyuan Fang
Main category: cs.CV
TL;DR: This survey provides a comprehensive review of multimodal geospatial foundation models (GFMs) that are transforming remote sensing image analysis through their generalization capabilities and alignment with multimodal remote sensing data characteristics.
Details
Motivation: Foundation models have revolutionized NLP and computer vision, and their impact is now extending to remote sensing. The multimodal, multi-resolution, and multi-temporal nature of remote sensing data naturally aligns with foundation models' capabilities, creating a dedicated research frontier for multimodal GFMs.
Method: The survey adopts a modality-driven perspective, covering five core visual and vision-language modalities. It examines how imaging physics and data representation shape interaction design, and analyzes key techniques for alignment, integration, and knowledge transfer to address modality heterogeneity, distribution shifts, and semantic gaps.
Result: The paper systematically assesses advances in training paradigms, architectures, and task-specific adaptation strategies, and evaluates representative multimodal GFMs across ten downstream tasks. Real-world case studies demonstrate practical applications in land cover mapping, agricultural monitoring, disaster response, climate studies, and geospatial intelligence.
Conclusion: The survey outlines pressing challenges in domain generalization, interpretability, efficiency, and privacy, while charting promising avenues for future research in multimodal geospatial foundation models.
Abstract: Foundation models have transformed natural language processing and computer vision, and their impact is now reshaping remote sensing image analysis. With powerful generalization and transfer learning capabilities, they align naturally with the multimodal, multi-resolution, and multi-temporal characteristics of remote sensing data. To address unique challenges in the field, multimodal geospatial foundation models (GFMs) have emerged as a dedicated research frontier. This survey delivers a comprehensive review of multimodal GFMs from a modality-driven perspective, covering five core visual and vision-language modalities. We examine how differences in imaging physics and data representation shape interaction design, and we analyze key techniques for alignment, integration, and knowledge transfer to tackle modality heterogeneity, distribution shifts, and semantic gaps. Advances in training paradigms, architectures, and task-specific adaptation strategies are systematically assessed alongside a wealth of emerging benchmarks. Representative multimodal visual and vision-language GFMs are evaluated across ten downstream tasks, with insights into their architectures, performance, and application scenarios. Real-world case studies, spanning land cover mapping, agricultural monitoring, disaster response, climate studies, and geospatial intelligence, demonstrate the practical potential of GFMs. Finally, we outline pressing challenges in domain generalization, interpretability, efficiency, and privacy, and chart promising avenues for future research.
[373] VALA: Learning Latent Anchors for Training-Free and Temporally Consistent Video Editing
Zhangkai Wu, Xuhui Fan, Zhongyuan Xie, Kaize Shi, Longbing Cao
Main category: cs.CV
TL;DR: VALA is a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing, eliminating the need for heuristic frame selection in training-free video editing.
Details
Motivation: Existing training-free video editing methods rely on heuristic frame selection for temporal consistency during DDIM inversion, which introduces manual bias and reduces scalability of end-to-end inference.
Method: Proposes VALA (Variational Alignment for Latent Anchors) - a variational alignment module with contrastive learning objective that adaptively selects key frames and compresses latent features into semantic anchors to preserve content and temporal coherence.
Result: Extensive experiments show VALA achieves state-of-the-art performance in inversion fidelity, editing quality, and temporal consistency, while offering improved efficiency over prior methods.
Conclusion: VALA provides an effective solution for training-free video editing by adaptively selecting key frames and compressing latent representations into semantic anchors, achieving better temporal consistency and efficiency.
Abstract: Recent advances in training-free video editing have enabled lightweight and precise cross-frame generation by leveraging pre-trained text-to-image diffusion models. However, existing methods often rely on heuristic frame selection to maintain temporal consistency during DDIM inversion, which introduces manual bias and reduces the scalability of end-to-end inference. In this paper, we propose VALA (Variational Alignment for Latent Anchors), a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing. To learn meaningful assignments, VALA adopts a variational framework with a contrastive learning objective. Therefore, it can transform cross-frame latent representations into compressed latent anchors that preserve both content and temporal coherence. Our method can be fully integrated into training-free text-to-image based video editing models. Extensive experiments on real-world video editing benchmarks show that VALA achieves state-of-the-art performance in inversion fidelity, editing quality, and temporal consistency, while offering improved efficiency over prior methods.
[374] Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method
Bohan Li, Xin Jin, Hu Zhu, Hongsi Liu, Ruikai Li, Jiazhe Guo, Kaiwen Cai, Chao Ma, Yueming Jin, Hao Zhao, Xiaokang Yang, Wenjun Zeng
Main category: cs.CV
TL;DR: This paper introduces UniScene, a unified framework for generating semantic occupancy, multi-view videos, and LiDAR point clouds in driving scenes, using the newly curated Nuplan-Occ dataset.
Details
Motivation: Current occupancy-centric methods for driving scene generation depend heavily on annotated occupancy data, which is scarce. The authors aim to overcome this limitation by creating a large-scale dataset and developing a unified generation framework.
Method: The approach uses a spatio-temporal disentangled architecture for 4D dynamic occupancy generation. It incorporates Gaussian splatting-based sparse point map rendering for multi-view video generation and sensor-aware embedding for realistic LiDAR simulation.
Result: Extensive experiments show superior generation fidelity and scalability compared to existing approaches, with validated practical value in downstream autonomous driving tasks.
Conclusion: The proposed UniScene framework successfully addresses the data scarcity problem in occupancy-centric driving scene generation and demonstrates strong performance across multiple modalities.
Abstract: Driving scene generation is a critical domain for autonomous driving, enabling downstream applications, including perception and planning evaluation. Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities; however, their performance heavily depends on annotated occupancy data, which still remains scarce. To overcome this limitation, we curate Nuplan-Occ, the largest semantic occupancy dataset to date, constructed from the widely used Nuplan benchmark. Its scale and diversity facilitate not only large-scale generative modeling but also autonomous driving downstream applications. Based on this dataset, we develop a unified framework that jointly synthesizes high-quality semantic occupancy, multi-view videos, and LiDAR point clouds. Our approach incorporates a spatio-temporal disentangled architecture to support high-fidelity spatial expansion and temporal forecasting of 4D dynamic occupancy. To bridge modal gaps, we further propose two novel techniques: a Gaussian splatting-based sparse point map rendering strategy that enhances multi-view video generation, and a sensor-aware embedding strategy that explicitly models LiDAR sensor properties for realistic multi-LiDAR simulation. Extensive experiments demonstrate that our method achieves superior generation fidelity and scalability compared to existing approaches, and validates its practical value in downstream tasks. Repo: https://github.com/Arlo0o/UniScene-Unified-Occupancy-centric-Driving-Scene-Generation/tree/v2
[375] VoMP: Predicting Volumetric Mechanical Property Fields
Rishit Dagli, Donglai Xiang, Vismay Modi, Charles Loop, Clement Fuji Tsang, Anka He Chen, Anita Hu, Gavriel State, David I. W. Levin, Maria Shugrina
Main category: cs.CV
TL;DR: VoMP is a feed-forward method that predicts spatially-varying mechanical properties (Young’s modulus, Poisson’s ratio, density) for 3D objects using multi-view features and a Geometry Transformer, trained on real-world materials data.
Details
Motivation: Physical simulation traditionally requires laborious hand-crafting of spatially-varying mechanical properties, which VoMP aims to automate through data-driven prediction.
Method: VoMP aggregates per-voxel multi-view features from renderable 3D objects, uses a trained Geometry Transformer to predict material latent codes on a physically plausible manifold, and leverages a novel annotation pipeline combining 3D datasets, material databases, and vision-language models.
Result: VoMP significantly outperforms prior methods in both accuracy and speed for estimating volumetric mechanical properties.
Conclusion: The method successfully automates the prediction of physically valid material properties across 3D object volumes, demonstrating superior performance over existing approaches.
Abstract: Physical simulation relies on spatially-varying mechanical properties, often laboriously hand-crafted. VoMP is a feed-forward method trained to predict Young’s modulus ($E$), Poisson’s ratio ($\nu$), and density ($\rho$) throughout the volume of 3D objects, in any representation that can be rendered and voxelized. VoMP aggregates per-voxel multi-view features and passes them to our trained Geometry Transformer to predict per-voxel material latent codes. These latents reside on a manifold of physically plausible materials, which we learn from a real-world dataset, guaranteeing the validity of decoded per-voxel materials. To obtain object-level training data, we propose an annotation pipeline combining knowledge from segmented 3D datasets, material databases, and a vision-language model, along with a new benchmark. Experiments show that VoMP estimates accurate volumetric properties, far outperforming prior art in accuracy and speed.
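A rough sketch of the per-voxel multi-view aggregation step may help. The pooling below is a plain visibility-masked mean over views, which is an assumption for illustration rather than VoMP's documented aggregation; all shapes and names are hypothetical.

```python
import torch

def aggregate_voxel_features(feats, uv, vis):
    """Hedged sketch of per-voxel multi-view aggregation (mean over views;
    VoMP's exact pooling may differ).
    feats: (V, C, H, W) per-view feature maps
    uv:    (V, N, 2) integer pixel coords of N voxel centers in each view
    vis:   (V, N) boolean visibility of each voxel in each view"""
    V, C, H, W = feats.shape
    N = uv.shape[1]
    acc = torch.zeros(N, C)
    cnt = torch.zeros(N, 1)
    for v in range(V):
        m = vis[v]
        x, y = uv[v, m, 0], uv[v, m, 1]
        acc[m] += feats[v, :, y, x].T   # (#visible, C) features at projections
        cnt[m] += 1
    # (N, C) per-voxel features, ready for a geometry transformer
    return acc / cnt.clamp(min=1)
```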
[376] SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency
Quanjian Song, Donghao Zhou, Jingyu Lin, Fei Shen, Jiaze Wang, Xiaowei Hu, Cunjian Chen, Pheng-Ann Heng
Main category: cs.CV
TL;DR: SceneDecorator is a training-free framework for scene-oriented story generation that addresses scene planning and consistency challenges through VLM-guided scene planning and long-term scene-sharing attention.
Details
Motivation: Current text-to-image models struggle with concept consistency across images, particularly overlooking the importance of scenes in storytelling which limits creative applications.
Method: Proposes SceneDecorator with two key components: VLM-Guided Scene Planning for narrative coherence in a global-to-local manner, and Long-Term Scene-Sharing Attention to maintain scene consistency and subject diversity across stories.
Result: Extensive experiments show superior performance of SceneDecorator in maintaining scene consistency and narrative coherence across generated stories.
Conclusion: SceneDecorator demonstrates potential to unleash creativity in arts, films, and games by effectively addressing scene-level consistency challenges in story generation.
Abstract: Recent text-to-image models have revolutionized image generation, but they still struggle with maintaining concept consistency across generated images. While existing works focus on character consistency, they often overlook the crucial role of scenes in storytelling, which restricts their creativity in practice. This paper introduces scene-oriented story generation, addressing two key challenges: (i) scene planning, where current methods fail to ensure scene-level narrative coherence by relying solely on text descriptions, and (ii) scene consistency, which remains largely unexplored in terms of maintaining scene consistency across multiple stories. We propose SceneDecorator, a training-free framework that employs VLM-Guided Scene Planning to ensure narrative coherence across different scenes in a "global-to-local" manner, and Long-Term Scene-Sharing Attention to maintain long-term scene consistency and subject diversity across generated stories. Extensive experiments demonstrate the superior performance of SceneDecorator, highlighting its potential to unleash creativity in the fields of arts, films, and games.
[377] LoMix: Learnable Weighted Multi-Scale Logits Mixing for Medical Image Segmentation
Md Mostafijur Rahman, Radu Marculescu
Main category: cs.CV
TL;DR: LoMix is a differentiable plug-and-play module that automatically learns optimal mixed-scale fusion of U-shaped network logits through NAS-inspired optimization, improving segmentation performance without inference overhead.
Details
Motivation: Current U-shaped networks treat multi-scale logits in isolation without exploring mixed-scale combinations, missing complementary cues from fusing coarse and fine predictions.
Method: LoMix mixes multi-scale decoder logits using four lightweight fusion operators (addition, multiplication, concatenation, attention-based fusion) and co-optimizes loss weights with network parameters through differentiable optimization.
Result: Improves DICE by +4.2% over single-output supervision, +2.2% over deep supervision, and +1.5% over equally weighted additive fusion on Synapse dataset. Advantage grows to +9.23% with scarce training data.
Conclusion: Learnable weighted mixed-scale fusion generalizes broadly across benchmarks and networks, providing data-efficient, interpretable, and overhead-free performance improvements for medical image segmentation.
Abstract: U-shaped networks output logits at multiple spatial scales, each capturing a different blend of coarse context and fine detail. Yet, training still treats these logits in isolation - either supervising only the final, highest-resolution logits or applying deep supervision with identical loss weights at every scale - without exploring mixed-scale combinations. Consequently, the decoder output misses the complementary cues that arise only when coarse and fine predictions are fused. To address this issue, we introduce LoMix (Logits Mixing), a NAS-inspired, differentiable plug-and-play module that generates new mixed-scale outputs and learns how exactly each of them should guide the training process. More precisely, LoMix mixes the multi-scale decoder logits with four lightweight fusion operators: addition, multiplication, concatenation, and attention-based weighted fusion, yielding a rich set of synthetic mutant maps. Every original or mutant map is given a softplus loss weight that is co-optimized with network parameters, mimicking a one-step architecture search that automatically discovers the most useful scales, mixtures, and operators. Plugging LoMix into recent U-shaped architectures (i.e., PVT-V2-B2 backbone with EMCAD decoder) on Synapse 8-organ dataset improves DICE by +4.2% over single-output supervision, +2.2% over deep supervision, and +1.5% over equally weighted additive fusion, all with zero inference overhead. When training data are scarce (e.g., one or two labeled scans), the advantage grows to +9.23%, underscoring LoMix’s data efficiency. Across four benchmarks and diverse U-shaped networks, LoMix improves DICE by up to +13.5% over single-output supervision, confirming that learnable weighted mixed-scale fusion generalizes broadly while remaining data efficient, fully interpretable, and overhead-free at inference. Our code is available at https://github.com/SLDGroup/LoMix.
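The learnable weighted fusion is simple enough to sketch. Below is a minimal PyTorch rendition of the additive case only (the paper also mixes by multiplication, concatenation, and attention); num_scales, base_loss, and the single additive mutant are simplifying assumptions, not LoMix's full operator set.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoMixLite(nn.Module):
    """Minimal sketch of learnable weighted mixed-scale supervision: additive
    mixing only, with one softplus-positive loss weight per map.
    num_scales must equal len(logits_list) passed to forward()."""
    def __init__(self, num_scales: int):
        super().__init__()
        # one raw weight per original map plus one for the additive mutant
        self.raw_w = nn.Parameter(torch.zeros(num_scales + 1))

    def forward(self, logits_list, target, base_loss):
        # upsample every decoder scale to the label resolution
        maps = [F.interpolate(l, size=target.shape[-2:], mode="bilinear",
                              align_corners=False) for l in logits_list]
        maps.append(torch.stack(maps).sum(0))   # synthetic additive mutant
        w = F.softplus(self.raw_w)              # positive, learnable weights
        return sum(wi * base_loss(m, target) for wi, m in zip(w, maps))
```

In use, raw_w is registered with the same optimizer as the segmentation network, so the loss weights are co-optimized with the model parameters, mimicking the one-step architecture search described above.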
[378] CoMo: Compositional Motion Customization for Text-to-Video Generation
Youcan Xu, Zhen Wang, Jiaxin Shi, Kexin Li, Feifei Shao, Jun Xiao, Yi Yang, Jun Yu, Long Chen
Main category: cs.CV
TL;DR: CoMo is a framework for compositional motion customization in text-to-video generation that enables synthesis of multiple distinct motions in a single video by addressing motion-appearance entanglement and multi-motion blending challenges.
Details
Motivation: Current text-to-video models struggle with precise motion control for complex multi-subject motions, and existing single-motion customization methods fail in compositional scenarios due to motion-appearance entanglement and ineffective multi-motion blending.
Method: Two-phase approach: 1) Single-motion learning phase using static-dynamic decoupled tuning to disentangle motion from appearance and learn motion-specific modules; 2) Multi-motion composition phase using plug-and-play divide-and-merge strategy to compose learned motions without additional training by spatially isolating their influence during denoising.
Result: CoMo achieves state-of-the-art performance in controllable video generation, significantly advancing multi-motion customization capabilities. A new benchmark and evaluation metric were also introduced for multi-motion fidelity and blending assessment.
Conclusion: CoMo successfully addresses the challenges of compositional motion customization, enabling synthesis of multiple distinct motions in videos through its two-phase approach and spatial isolation strategy, representing a significant advancement in controllable video generation.
Abstract: While recent text-to-video models excel at generating diverse scenes, they struggle with precise motion control, particularly for complex, multi-subject motions. Although methods for single-motion customization have been developed to address this gap, they fail in compositional scenarios due to two primary challenges: motion-appearance entanglement and ineffective multi-motion blending. This paper introduces CoMo, a novel framework for compositional motion customization in text-to-video generation, enabling the synthesis of multiple, distinct motions within a single video. CoMo addresses these issues through a two-phase approach. First, in the single-motion learning phase, a static-dynamic decoupled tuning paradigm disentangles motion from appearance to learn a motion-specific module. Second, in the multi-motion composition phase, a plug-and-play divide-and-merge strategy composes these learned motions without additional training by spatially isolating their influence during the denoising process. To facilitate research in this new domain, we also introduce a new benchmark and a novel evaluation metric designed to assess multi-motion fidelity and blending. Extensive experiments demonstrate that CoMo achieves state-of-the-art performance, significantly advancing the capabilities of controllable video generation. Our project page is at https://como6.github.io/.
[379] UGAE: Unified Geometry and Attribute Enhancement for G-PCC Compressed Point Clouds
Pan Zhao, Hui Yuan, Chongzhen Tian, Tian Guo, Raouf Hamzaoui, Zhigeng Pan
Main category: cs.CV
TL;DR: UGAE framework enhances compressed point clouds through geometry and attribute enhancement modules, achieving significant quality improvements and bitrate savings.
Details
Motivation: Lossy compression of point clouds causes irreversible distortion in both geometry and attributes, requiring enhancement methods to improve quality.
Method: Three-component framework: PoGE uses Transformer-based sparse CNN for geometry reconstruction, PAE employs geometry-guided recoloring with DA-KNN for attribute enhancement, and PoAE uses attribute residual prediction with W-MSE loss.
Result: Outperformed existing methods on 8iVFB, Owlii, and MVUB datasets with 9.98 dB BD-PSNR gain, 90.98% bitrate savings for geometry, and 3.67 dB PSNR improvement with 56.88% bitrate savings for attributes.
Conclusion: UGAE effectively enhances compressed point cloud quality through unified geometry and attribute enhancement, achieving superior performance and perceptual quality.
Abstract: Lossy compression of point clouds reduces storage and transmission costs; however, it inevitably leads to irreversible distortion in geometry structure and attribute information. To address these issues, we propose a unified geometry and attribute enhancement (UGAE) framework, which consists of three core components: post-geometry enhancement (PoGE), pre-attribute enhancement (PAE), and post-attribute enhancement (PoAE). In PoGE, a Transformer-based sparse convolutional U-Net is used to reconstruct the geometry structure with high precision by predicting voxel occupancy probabilities. Building on the refined geometry structure, PAE introduces an innovative enhanced geometry-guided recoloring strategy, which uses a detail-aware K-Nearest Neighbors (DA-KNN) method to achieve accurate recoloring and effectively preserve high-frequency details before attribute compression. Finally, at the decoder side, PoAE uses an attribute residual prediction network with a weighted mean squared error (W-MSE) loss to enhance the quality of high-frequency regions while maintaining the fidelity of low-frequency regions. UGAE significantly outperformed existing methods on three benchmark datasets: 8iVFB, Owlii, and MVUB. Compared to the latest G-PCC test model (TMC13v29), UGAE achieved an average BD-PSNR gain of 9.98 dB and 90.98% BD-bitrate savings for geometry under the D1 metric, as well as a 3.67 dB BD-PSNR improvement with 56.88% BD-bitrate savings for attributes on the Y component. Additionally, it improved perceptual quality significantly.
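The W-MSE idea can be illustrated in a few lines. The sketch below approximates "high-frequency regions" by local deviation from a neighborhood mean on a 2D attribute map for brevity; the paper's actual weighting over point-cloud attributes may differ, and the threshold and weight are hypothetical.

```python
import torch
import torch.nn.functional as F

def w_mse(pred, target, hf_weight=4.0, thresh=1e-2):
    """Hedged sketch of a weighted MSE that up-weights high-frequency regions,
    shown on a (B, 1, H, W) attribute map (point-cloud attributes would use
    neighborhood statistics instead)."""
    local_mean = F.avg_pool2d(target, 3, stride=1, padding=1)
    hf = (target - local_mean).abs() > thresh       # crude high-frequency mask
    w = torch.where(hf, hf_weight, 1.0)             # heavier penalty on details
    return (w * (pred - target) ** 2).mean()
```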
[380] Nested AutoRegressive Models
Hongyu Wu, Xuhui Fan, Zhangkai Wu, Longbing Cao
Main category: cs.CV
TL;DR: NestAR is a nested autoregressive model that uses hierarchical multi-scale modules to reduce computational complexity from O(n) to O(log n) while improving image diversity compared to existing AR models.
Details
Motivation: AutoRegressive models are computationally intensive and existing solutions like VAR often lead to limited sample diversity in image generation.
Method: Proposes nested AR architecture with multi-scale modules in hierarchical order, where larger-scale modules are conditioned on outputs from smaller-scale modules. Uses flow matching loss for continuous tokens and develops coordination objectives for multi-scale training.
Result: Achieves competitive image generation performance while significantly lowering computational cost compared to standard AR models.
Conclusion: NestAR provides an efficient alternative to traditional AR models for image generation, reducing complexity while maintaining performance and improving diversity.
Abstract: AutoRegressive (AR) models have demonstrated competitive performance in image generation, achieving results comparable to those of diffusion models. However, their token-by-token image generation mechanism remains computationally intensive, and existing solutions such as VAR often lead to limited sample diversity. In this work, we propose a Nested AutoRegressive (NestAR) model, which employs nested AutoRegressive architectures for image generation. NestAR designs multi-scale modules in a hierarchical order. These different scaled modules are constructed in an AR architecture, where one larger-scale module is conditioned on outputs from its previous smaller-scale module. Within each module, NestAR uses another AR structure to generate "patches" of tokens. The proposed nested AR architecture reduces the overall complexity from $\mathcal{O}(n)$ to $\mathcal{O}(\log n)$ in generating $n$ image tokens, and increases image diversity. NestAR further incorporates flow matching loss to use continuous tokens, and develops objectives to coordinate these multi-scale modules in model training. NestAR achieves competitive image generation performance while significantly lowering computational cost.
[381] HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
Joungbin An, Kristen Grauman
Main category: cs.CV
TL;DR: HieraMamba introduces a hierarchical architecture with Anchor-MambaPooling blocks that uses Mamba’s selective scanning to create compact anchor tokens at multiple granularities, achieving state-of-the-art video temporal grounding performance on long videos.
Details
Motivation: Existing methods for video temporal grounding struggle with long videos, often losing temporal fidelity through over-downsampling or fixed window approaches, making it difficult to capture both global context and fine-grained temporal details.
Method: Proposes HieraMamba with hierarchical architecture using Anchor-MambaPooling (AMP) blocks that leverage Mamba’s selective scanning to generate compact anchor tokens summarizing video content at multiple scales, combined with anchor-conditioned and segment-pooled contrastive losses.
Result: HieraMamba achieves new state-of-the-art performance on Ego4D-NLQ, MAD, and TACoS benchmarks, demonstrating precise and temporally faithful localization in long, untrimmed videos.
Conclusion: The hierarchical approach with AMP blocks effectively preserves temporal structure and semantic richness across scales, enabling superior video temporal grounding in challenging long-video scenarios.
Abstract: Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine-grained temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or relying on fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales. At its core are Anchor-MambaPooling (AMP) blocks, which utilize Mamba’s selective scanning to produce compact anchor tokens that summarize video content at multiple granularities. Two complementary objectives, anchor-conditioned and segment-pooled contrastive losses, encourage anchors to retain local detail while remaining globally discriminative. HieraMamba sets a new state-of-the-art on Ego4D-NLQ, MAD, and TACoS, demonstrating precise, temporally faithful localization in long, untrimmed videos.
[382] Strategies for Robust Deep Learning Based Deformable Registration
Joel Honkamaa, Pekka Marttinen
Main category: cs.CV
TL;DR: A simple method using MIND feature space transformation and ensembling to improve robustness of deep learning-based deformable registration for brain images across different contrasts and modalities.
Details
Motivation: Deep learning registration methods often generalize poorly beyond their training data distribution, limiting practical usability in clinical settings with diverse imaging contrasts and modalities.
Method: Transform input images into MIND feature space before feeding to registration model, plus a special ensembling strategy for consistent performance improvement.
Result: Significantly improved robustness for brain registration across different contrasts and modalities not seen during training.
Conclusion: Simple feature space transformation and ensembling can substantially enhance generalization capability of deep learning registration methods.
Abstract: Deep learning based deformable registration methods have become popular in recent years. However, their ability to generalize beyond the training data distribution can be poor, significantly hindering their usability. The LUMIR brain registration challenge for Learn2Reg 2025 aims to advance the field by evaluating the performance of registration on contrasts and modalities different from those included in the training set. Here we describe our submission to the challenge, which proposes a very simple idea for significantly improving robustness by transforming the images into MIND feature space before feeding them into the model. In addition, a special ensembling strategy is proposed that shows a small but consistent improvement.
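For readers unfamiliar with MIND, a simplified 2D version of the feature transform is sketched below (four-neighbor self-similarity with patch SSDs). It is a rough approximation for illustration; the submission's exact descriptor and parameters may differ.

```python
import torch
import torch.nn.functional as F

def mind_2d(img, radius=1):
    """Simplified 2D MIND-style descriptor sketch. img: (B, 1, H, W) ->
    (B, 4, H, W) features that depend on local self-similarity rather than
    raw intensity, which is what makes registration robust across contrasts."""
    shifts = [(0, radius), (0, -radius), (radius, 0), (-radius, 0)]
    dists = []
    for dy, dx in shifts:
        shifted = torch.roll(img, shifts=(dy, dx), dims=(-2, -1))
        # patch SSD between the image and its shifted copy
        dists.append(F.avg_pool2d((img - shifted) ** 2, 3, stride=1, padding=1))
    d = torch.cat(dists, dim=1)                       # (B, 4, H, W)
    v = d.mean(dim=1, keepdim=True).clamp(min=1e-6)   # local variance estimate
    feats = torch.exp(-d / v)
    return feats / feats.amax(dim=1, keepdim=True)    # per-pixel normalization
```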
[383] EndoWave: Rational-Wavelet 4D Gaussian Splatting for Endoscopic Reconstruction
Taoyu Wu, Yiyi Miao, Jiaxin Guo, Ziyan Chen, Sihang Zhao, Zhuoxiao Li, Zhe Tang, Baoru Huang, Limin Yu
Main category: cs.CV
TL;DR: EndoWave is a unified spatio-temporal Gaussian Splatting framework for 3D reconstruction in robot-assisted minimally invasive surgery that addresses challenges like photometric inconsistencies and tissue motion through optical flow-based geometric constraints and multi-resolution rational wavelet supervision.
Details
Motivation: Endoscopic scenarios in surgery present unique challenges including photometric inconsistencies, non-rigid tissue motion, and view-dependent highlights that mislead traditional 3DGS-based methods relying solely on appearance constraints, leading to inaccurate reconstructions.
Method: 1) Unified spatio-temporal Gaussian representation optimizing primitives in 4D domain; 2) Geometric constraint derived from optical flow to enhance temporal coherence and constrain 3D structure; 3) Multi-resolution rational orthogonal wavelet constraint to separate endoscopic details and enhance rendering performance.
Result: Extensive evaluations on EndoNeRF and StereoMIS datasets demonstrate that EndoWave achieves state-of-the-art reconstruction quality and visual accuracy compared to baseline methods.
Conclusion: The proposed EndoWave framework effectively addresses the challenges of endoscopic 3D reconstruction by incorporating geometric and wavelet-based constraints, achieving superior performance in surgical scenarios.
Abstract: In robot-assisted minimally invasive surgery, accurate 3D reconstruction from endoscopic video is vital for downstream tasks and improved outcomes. However, endoscopic scenarios present unique challenges, including photometric inconsistencies, non-rigid tissue motion, and view-dependent highlights. Most 3DGS-based methods that rely solely on appearance constraints for optimizing 3DGS are often insufficient in this context, as these dynamic visual artifacts can mislead the optimization process and lead to inaccurate reconstructions. To address these limitations, we present EndoWave, a unified spatio-temporal Gaussian Splatting framework by incorporating an optical flow-based geometric constraint and a multi-resolution rational wavelet supervision. First, we adopt a unified spatio-temporal Gaussian representation that directly optimizes primitives in a 4D domain. Second, we propose a geometric constraint derived from optical flow to enhance temporal coherence and effectively constrain the 3D structure of the scene. Third, we propose a multi-resolution rational orthogonal wavelet as a constraint, which can effectively separate the details of the endoscope and enhance the rendering performance. Extensive evaluations on two real surgical datasets, EndoNeRF and StereoMIS, demonstrate that our method EndoWave achieves state-of-the-art reconstruction quality and visual accuracy compared to the baseline method.
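The optical-flow constraint admits a short sketch. The version below is photometric for brevity (warp the t+1 rendering back to t along precomputed flow and penalize the difference), whereas the paper constrains the 4D Gaussians geometrically; treat this as an assumption-laden illustration.

```python
import torch
import torch.nn.functional as F

def flow_consistency_loss(frame_t, frame_t1, flow):
    """Hedged sketch of a flow-based consistency term.
    frame_t, frame_t1: (B, 3, H, W) renderings at t and t+1
    flow: (B, 2, H, W) precomputed forward optical flow t -> t+1"""
    B, _, H, W = flow.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # sample frame_t1 at (x + u, y + v), normalized to [-1, 1] for grid_sample
    gx = (xs.float().unsqueeze(0) + flow[:, 0]) / (W - 1) * 2 - 1
    gy = (ys.float().unsqueeze(0) + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1)              # (B, H, W, 2)
    warped = F.grid_sample(frame_t1, grid, align_corners=True)
    return (warped - frame_t).abs().mean()
```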
[384] Revisiting Multimodal Positional Encoding in Vision-Language Models
Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, Shuai Bai
Main category: cs.CV
TL;DR: The paper proposes two simple plug-and-play variants of multimodal Rotary Positional Embedding (RoPE) called MHRoPE and MRoPE-I, which improve vision-language models by addressing position design and frequency allocation issues without architectural changes.
Details
Motivation: There has been little systematic investigation into multimodal position encoding despite its importance for vision-language models. The authors aim to fill this gap by comprehensively analyzing multimodal RoPE.
Method: The authors identify three key guidelines (positional coherence, full frequency utilization, preservation of textual priors) and propose two variants: Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), which are plug-and-play methods requiring no architectural changes.
Result: The proposed methods consistently outperform existing approaches across diverse benchmarks, showing significant improvements in both general and fine-grained multimodal understanding.
Conclusion: The paper demonstrates that systematic analysis of multimodal position encoding leads to effective solutions that enhance vision-language model performance through simple, plug-and-play modifications to RoPE.
Abstract: Multimodal position encoding is essential for vision-language models, yet it has received little systematic investigation. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation. Through extensive experiments, we identify three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors, ensuring unambiguous layout, rich representation, and faithful transfer from the pre-trained LLM. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple and plug-and-play variants that require no architectural changes. Our methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal understanding. Code will be available at https://github.com/JJJYmmm/Multimodal-RoPEs.
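One plausible reading of the "interleave" allocation is sketched below: rotary frequency dimensions are assigned to the temporal, height, and width axes round-robin, so every axis touches the full frequency spectrum instead of one contiguous chunk. This is an illustration of the stated guideline, not a verified reference implementation.

```python
import torch

def mrope_interleave_angles(t, h, w, dim, base=10000.0):
    """Sketch of interleaved frequency allocation for multimodal RoPE
    (our reading of MRoPE-I, hypothetical details).
    t, h, w: (N,) per-token position ids; dim: number of rotary frequencies."""
    inv_freq = base ** (-torch.arange(dim, dtype=torch.float32) / dim)
    pos = torch.stack([t, h, w]).float()        # (3, N)
    axis = torch.arange(dim) % 3                # dim j -> axis t/h/w cyclically
    angles = (pos[axis] * inv_freq[:, None]).T  # (N, dim) rotation angles
    return angles.cos(), angles.sin()

# For text tokens t == h == w (one running position id), so the scheme
# collapses to standard 1D RoPE, preserving the pre-trained LLM's priors.
```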
[385] Residual Diffusion Bridge Model for Image Restoration
Hebaixu Wang, Jing Zhang, Haoyang Chen, Haonan Guo, Di Wang, Jiayi Ma, Bo Du
Main category: cs.CV
TL;DR: RDBM is a unified diffusion bridge model that uses residual-based modulation for adaptive image restoration, preserving undegraded regions while restoring degraded areas.
Details
Motivation: Existing diffusion bridge models lack unified analysis and cause distortion in undegraded regions due to global noise processing. There's a need for more precise restoration that adapts to degradation patterns.
Method: Theoretical reformulation of stochastic differential equations for generalized diffusion bridge, with analytical formulas for forward/reverse processes. Uses residuals from given distributions to modulate noise injection/removal for adaptive restoration.
Result: Achieves state-of-the-art performance across diverse image restoration tasks, both qualitatively and quantitatively. Demonstrates that existing bridge models are special cases of RDBM.
Conclusion: RDBM provides a unified framework for diffusion bridge models with adaptive restoration capabilities, outperforming existing methods while preserving undegraded regions effectively.
Abstract: Diffusion bridge models establish probabilistic paths between arbitrary paired distributions and exhibit great potential for universal image restoration. Most existing methods merely treat them as simple variants of stochastic interpolants, lacking a unified analytical perspective. Besides, they indiscriminately reconstruct images through global noise injection and removal, inevitably distorting undegraded regions due to imperfect reconstruction. To address these challenges, we propose the Residual Diffusion Bridge Model (RDBM). Specifically, we theoretically reformulate the stochastic differential equations of the generalized diffusion bridge and derive the analytical formulas of its forward and reverse processes. Crucially, we leverage the residuals from given distributions to modulate the noise injection and removal, enabling adaptive restoration of degraded regions while preserving intact regions. Moreover, we unravel the fundamental mathematical essence of existing bridge models, all of which are special cases of RDBM, and empirically demonstrate the optimality of our proposed models. Extensive experiments are conducted to demonstrate the state-of-the-art performance of our method both qualitatively and quantitatively across diverse image restoration tasks. Code is publicly available at https://github.com/MiliLab/RDBM.
[386] Task-Agnostic Fusion of Time Series and Imagery for Earth Observation
Gianfranco Basile, Johannes Jakubik, Benedikt Blumenstiel, Thomas Brunschwiler, Juan Bernabe Moreno
Main category: cs.CV
TL;DR: A task-agnostic multimodal fusion framework for time series and images using masked correlation learning, achieving superior performance in cross-modal generation and downstream tasks compared to task-specific approaches.
Details
Motivation: To enable robust multimodal fusion between time series data and single timestamp images across different tasks without task-specific tuning, particularly in Earth observation applications.
Method: Uses deterministic and learned strategies for time series quantization, then employs masked correlation learning to align discrete image and time series tokens in a unified representation space.
Result: Outperforms task-specific fusion by 6% in R² and 2% in RMSE on average, and exceeds baseline methods by 50% in R² and 12% in RMSE. Successfully generates consistent global temperature profiles from satellite imagery.
Conclusion: The proposed task-agnostic pretraining framework provides effective multimodal fusion with superior performance, robustness, and cross-modal generation capabilities, with insights into model behavior through gradient sensitivity analysis.
Abstract: We propose a task-agnostic framework for multimodal fusion of time series and single timestamp images, enabling cross-modal generation and robust downstream performance. Our approach explores deterministic and learned strategies for time series quantization and then leverages a masked correlation learning objective, aligning discrete image and time series tokens in a unified representation space. Instantiated in the Earth observation domain, the pretrained model generates consistent global temperature profiles from satellite imagery and is validated through counterfactual experiments. Across downstream tasks, our task-agnostic pretraining outperforms task-specific fusion by 6% in R$^2$ and 2% in RMSE on average, and exceeds baseline methods by 50% in R$^2$ and 12% in RMSE. Finally, we analyze gradient sensitivity across modalities, providing insights into model robustness. Code, data, and weights will be released under a permissive license.
[387] DeepSalt: Bridging Laboratory and Satellite Spectra through Domain Adaptation and Knowledge Distillation for Large-Scale Soil Salinity Estimation
Rupasree Dey, Abdul Matin, Everett Lewark, Tanjim Bin Faruk, Andrei Bachinin, Sam Leuthold, M. Francesca Cotrufo, Shrideep Pallickara, Sangmi Lee Pallickara
Main category: cs.CV
TL;DR: DeepSalt is a deep learning framework that transfers high-resolution spectral knowledge from laboratory spectroscopy to satellite hyperspectral imagery for large-scale soil salinity monitoring, eliminating the need for extensive ground sampling.
Details
Motivation: Soil salinization threatens ecosystems and agriculture by limiting plant water absorption and reducing crop productivity. Current methods face scalability issues - laboratory spectroscopy is precise but limited to local sampling, while satellite imagery offers wide coverage but lacks fine-grained interpretability.
Method: DeepSalt uses knowledge distillation and a novel Spectral Adaptation Unit to transfer high-resolution spectral insights from laboratory-based spectroscopy to satellite-based hyperspectral sensing, enabling domain adaptation between different spectral measurement modalities.
Result: DeepSalt achieves significant performance gains over methods without explicit domain adaptation, effectively generalizes to unseen geographic regions, and explains a substantial portion of the salinity variance in comprehensive empirical benchmarks.
Conclusion: The proposed framework successfully bridges the gap between laboratory precision and satellite scalability, enabling accurate, large-scale soil salinity estimation without extensive ground sampling through effective spectral domain adaptation.
Abstract: Soil salinization poses a significant threat to both ecosystems and agriculture because it limits plants’ ability to absorb water and, in doing so, reduces crop productivity. This phenomenon alters the soil’s spectral properties, creating a measurable relationship between salinity and light reflectance that enables remote monitoring. While laboratory spectroscopy provides precise measurements, its reliance on in-situ sampling limits scalability to regional or global levels. Conversely, hyperspectral satellite imagery enables wide-area observation but lacks the fine-grained interpretability of laboratory instruments. To bridge this gap, we introduce DeepSalt, a deep-learning-based spectral transfer framework that leverages knowledge distillation and a novel Spectral Adaptation Unit to transfer high-resolution spectral insights from laboratory-based spectroscopy to satellite-based hyperspectral sensing. Our approach eliminates the need for extensive ground sampling while enabling accurate, large-scale salinity estimation, as demonstrated through comprehensive empirical benchmarks. DeepSalt achieves significant performance gains over methods without explicit domain adaptation, underscoring the impact of the proposed Spectral Adaptation Unit and the knowledge distillation strategy. The model also effectively generalized to unseen geographic regions, explaining a substantial portion of the salinity variance.
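The distillation setup can be sketched as a two-stream training step. Everything here (teacher, student, adapter, head, the loss mix) is a hypothetical stand-in for DeepSalt's modules, including its Spectral Adaptation Unit; treat it as one plausible shape of the idea.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, head, adapter, lab_spec, sat_spec, salinity,
                 alpha=0.5):
    """Hedged sketch of lab-to-satellite spectral transfer: a frozen teacher
    trained on laboratory spectra supervises a satellite-spectra student
    through a spectral adaptation mapping, alongside salinity regression."""
    with torch.no_grad():
        t_feat = teacher(lab_spec)              # high-resolution spectral features
    s_feat = student(adapter(sat_spec))         # adapt satellite bands, then encode
    feat_loss = F.mse_loss(s_feat, t_feat)      # match teacher representations
    task_loss = F.mse_loss(head(s_feat).squeeze(-1), salinity)
    return alpha * feat_loss + (1 - alpha) * task_loss
```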
[388] Note on the Construction of Structure Tensor
Josef Bigun, Fernando Alonso-Fernandez
Main category: cs.CV
TL;DR: The paper reconciles two structure tensor constructions through Total Least Squares line fitting, showing they’re fundamentally similar and enabling simplifications and generalizations.
Details
Motivation: To theoretically reconcile two seemingly different structure tensor approaches from 1987 and 1995 by viewing them through a common TLS framework.
Method: Analyzes both constructions through the lens of Total Least Squares line fitting to the power spectrum, revealing their fundamental similarities.
Result: The correction term in the 1995 approach becomes unnecessary, ensuring positive semi-definite tensors and enabling generalizations beyond quadrature filters.
Conclusion: Both structure tensor constructions are fundamentally similar when viewed through TLS, allowing simplifications and broader applicability with various filter types.
Abstract: This note presents a theoretical discussion of two structure tensor constructions: one proposed by Bigun and Granlund 1987, and the other by Granlund and Knutsson 1995. At first glance, these approaches may appear quite different–the former is implemented by averaging outer products of gradient filter responses, while the latter constructs the tensor from weighted outer products of tune-in frequency vectors of quadrature filters. We argue that when both constructions are viewed through the common lens of Total Least Squares (TLS) line fitting to the power spectrum, they can be reconciled to a large extent, and additional benefits emerge. From this perspective, the correction term introduced in Granlund and Knutsson 1995 becomes unnecessary. Omitting it ensures that the resulting tensor remains positive semi-definite, thereby simplifying the interpretation of its eigenvalues. Furthermore, this interpretation allows fitting more than a single orientation to the input by reinterpreting quadrature filter responses without relying on a structure tensor. It also removes the constraint that responses must originate strictly from quadrature filters, allowing the use of alternative filter types and non-angular tessellations. These alternatives include Gabor filters–which, although not strictly quadrature, are still suitable for structure tensor construction–even when they tessellate the spectrum in a Cartesian fashion, provided they are sufficiently concentrated.
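For reference, the 1987 gradient-based construction discussed in the note is a few lines of SciPy; positive semi-definiteness is immediate because the tensor is a smoothed sum of outer products. The scales below are illustrative defaults, not values from the note.

```python
from scipy import ndimage

def structure_tensor_2d(img, sigma_d=1.0, sigma_i=2.0):
    """Gradient-outer-product structure tensor (Bigun and Granlund 1987 style):
    Gaussian-derivative gradients, then windowed outer products."""
    gx = ndimage.gaussian_filter(img, sigma_d, order=(0, 1))  # d/dx (columns)
    gy = ndimage.gaussian_filter(img, sigma_d, order=(1, 0))  # d/dy (rows)
    jxx = ndimage.gaussian_filter(gx * gx, sigma_i)           # window the
    jxy = ndimage.gaussian_filter(gx * gy, sigma_i)           # outer-product
    jyy = ndimage.gaussian_filter(gy * gy, sigma_i)           # entries
    # per-pixel 2x2 tensor [[jxx, jxy], [jxy, jyy]]; its eigenvectors give the
    # dominant local orientation, its eigenvalues the strength and isotropy
    return jxx, jxy, jyy
```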
[389] Fast Voxel-Wise Kinetic Modeling in Dynamic PET using a Physics-Informed CycleGAN
Christian Salomonsen, Samuel Kuttner, Michael Kampffmeyer, Robert Jenssen, Kristoffer Wickstrøm, Jong Chul Ye, Elisabeth Wetzer
Main category: cs.CV
TL;DR: A physics-informed CycleGAN is applied to dynamic PET quantification to predict arterial input functions and parameter maps, reducing the need for invasive AIF estimation.
Details
Motivation: Tracer kinetic modeling is crucial in medical applications but requires complex and invasive arterial input function estimation, which burdens practitioners.
Method: Adopts a physics-informed CycleGAN, previously successful in DCE-MRI quantification, for dynamic PET quantification.
Result: Experiments show accurate AIF predictions and parameter maps that closely resemble the reference.
Conclusion: The proposed method effectively reduces the burden of invasive AIF estimation in tracer kinetic modeling for dynamic PET.
Abstract: Tracer kinetic modeling serves a vital role in diagnosis, treatment planning, tracer development and oncology, but burdens practitioners with complex and invasive arterial input function (AIF) estimation. We adapt a physics-informed CycleGAN, which has shown promise in DCE-MRI quantification, to dynamic PET quantification. Our experiments demonstrate sound AIF predictions and parameter maps closely resembling the reference.
[390] Progressive Growing of Patch Size: Curriculum Learning for Accelerated and Improved Medical Image Segmentation
Stefan M. Fischer, Johannes Kiechle, Laura Daza, Lina Felsner, Richard Osuala, Daniel M. Lang, Karim Lekadir, Jan C. Peeken, Julia A. Schnabel
Main category: cs.CV
TL;DR: Progressive Growing of Patch Size is a curriculum learning method that gradually increases patch size during 3D medical image segmentation training, improving class balance and accelerating convergence while maintaining or boosting performance.
Details
Motivation: To address class imbalance issues in 3D medical image segmentation and accelerate training convergence by implementing an automatic curriculum learning approach.
Method: Progressively increase patch size during model training in two modes: resource-efficient (faster training) and performance (better results). Evaluated across 15 diverse 3D medical segmentation tasks with different architectures.
Result: Resource-efficient mode reduces training time to 44% while matching baseline performance. Performance mode achieves 1.28% relative mean Dice score improvement and reduces training time to 89%, with consistent gains across all 15 tasks.
Conclusion: The progressive patch size curriculum is a simple, broadly applicable strategy that improves both segmentation performance and training efficiency across diverse models and tasks, particularly beneficial for imbalanced segmentation problems.
Abstract: In this work, we introduce Progressive Growing of Patch Size, an automatic curriculum learning approach for 3D medical image segmentation. Our approach progressively increases the patch size during model training, resulting in an improved class balance for smaller patch sizes and accelerated convergence of the training process. We evaluate our curriculum approach in two settings: a resource-efficient mode and a performance mode, both regarding Dice score performance and computational costs across 15 diverse and popular 3D medical image segmentation tasks. The resource-efficient mode matches the Dice score performance of the conventional constant patch size sampling baseline with a notable reduction in training time to only 44%. The performance mode improves upon constant patch size segmentation results, achieving a statistically significant relative mean performance gain of 1.28% in Dice Score. Remarkably, across all 15 tasks, our proposed performance mode manages to surpass the constant patch size baseline in Dice Score performance, while simultaneously reducing training time to only 89%. The benefits are particularly pronounced for highly imbalanced tasks such as lesion segmentation tasks. Rigorous experiments demonstrate that our performance mode not only improves mean segmentation performance but also reduces performance variance, yielding more trustworthy model comparison. Furthermore, our findings reveal that the proposed curriculum sampling is not tied to a specific architecture but represents a broadly applicable strategy that consistently boosts performance across diverse segmentation models, including UNet, UNETR, and SwinUNETR. In summary, we show that this simple yet elegant transformation on input data substantially improves both Dice Score performance and training runtime, while being compatible across diverse segmentation backbones.
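A linear growth schedule is one way to realize such a curriculum; the sketch below snaps sizes to a stride multiple for pooling divisibility. The schedule shape, sizes, and step are assumptions for illustration, and the paper's exact curriculum may differ.

```python
def patch_size_schedule(epoch, total_epochs,
                        min_size=(32, 32, 32), max_size=(128, 128, 128), step=16):
    """Hedged sketch of a progressive patch-size curriculum: grow linearly
    from min_size to max_size. Small early patches improve the
    foreground/background balance seen by the loss."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    # snap each axis to a multiple of `step` (e.g. for U-Net downsampling)
    return tuple(
        int(round((lo + frac * (hi - lo)) / step) * step)
        for lo, hi in zip(min_size, max_size)
    )

# epoch 0 -> (32, 32, 32); halfway -> (80, 80, 80); final -> (128, 128, 128)
```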
[391] DQ3D: Depth-guided Query for Transformer-Based 3D Object Detection in Traffic Scenarios
Ziyu Wang, Wenhao Li, Ji Wu
Main category: cs.CV
TL;DR: DQ3D is a depth-guided query generator for 3D object detection that uses depth information and 2D detections to sample reference points from object surfaces/interiors, with hybrid attention to fuse historical detections for handling occlusions.
Details
Motivation: Existing methods generate object queries from 3D reference points, but some points are far from target objects, leading to false positives. This paper addresses this limitation by ensuring reference points are sampled from object surfaces/interiors.
Method: Proposes depth-guided query generator using depth information and 2D detections to sample reference points from object surfaces. Introduces hybrid attention mechanism that fuses historical detection results with depth-guided queries to handle occlusions.
Result: Outperforms baseline by 6.3% in mAP and 4.3% in NDS on nuScenes dataset, demonstrating significant improvement in 3D object detection performance.
Conclusion: The proposed depth-guided query generation with hybrid attention effectively improves 3D object detection by ensuring proper reference point sampling and handling occlusions through temporal fusion.
Abstract: 3D object detection from multi-view images in traffic scenarios has garnered significant attention in recent years. Many existing approaches rely on object queries that are generated from 3D reference points to localize objects. However, a limitation of these methods is that some reference points are often far from the target object, which can lead to false positive detections. In this paper, we propose a depth-guided query generator for 3D object detection (DQ3D) that leverages depth information and 2D detections to ensure that reference points are sampled from the surface or interior of the object. Furthermore, to address partially occluded objects in the current frame, we introduce a hybrid attention mechanism that fuses historical detection results with depth-guided queries, thereby forming hybrid queries. Evaluation on the nuScenes dataset demonstrates that our method outperforms the baseline by 6.3% in terms of mean Average Precision (mAP) and 4.3% in the nuScenes Detection Score (NDS).
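The depth-guided seeding of reference points reduces to a pinhole unprojection of 2D detection centers. A hedged sketch follows, assuming a single camera and a precomputed depth map; the paper's sampling over surfaces and interiors is richer than this center-point version.

```python
import torch

def depth_guided_reference_points(boxes2d, depth_map, K_inv):
    """Hedged sketch: lift 2D detection centers to 3D with predicted depth so
    reference points start on object surfaces rather than arbitrary locations.
    boxes2d: (N, 4) pixel boxes (x1, y1, x2, y2); depth_map: (H, W);
    K_inv: (3, 3) inverse camera intrinsics."""
    cx = (boxes2d[:, 0] + boxes2d[:, 2]) / 2
    cy = (boxes2d[:, 1] + boxes2d[:, 3]) / 2
    z = depth_map[cy.long(), cx.long()]              # depth at each box center
    pix = torch.stack([cx * z, cy * z, z], dim=-1)   # scaled homogeneous pixels
    return pix @ K_inv.T                             # (N, 3) camera-frame points
```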
[392] Implicit Modeling for Transferability Estimation of Vision Foundation Models
Yaoyan Zheng, Huiqun Wang, Nan Zhou, Di Huang
Main category: cs.CV
TL;DR: ITM is a new framework that implicitly models model transferability using a divide-and-conquer variational approximation strategy, outperforming existing methods in stability, effectiveness and efficiency.
Details
Motivation: Existing transferability estimation methods struggle with emerging pre-trained models that have diverse architectures, training strategies, and task alignments, making it difficult to accurately assess their transferability without costly fine-tuning.
Method: Proposes Implicit Transferability Modeling (ITM) framework that implicitly models each model’s intrinsic transferability, coupled with Divide-and-Conquer Variational Approximation (DVA) strategy to efficiently approximate embedding space evolution.
Result: Extensive experiments on comprehensive benchmarks show ITM consistently outperforms existing methods in terms of stability, effectiveness, and efficiency across diverse training regimes and model types.
Conclusion: ITM enables better generalization across broader range of models and downstream tasks, advancing the pre-training and fine-tuning paradigm by facilitating deployment through accurate transferability estimation.
Abstract: Transferability estimation identifies the best pre-trained models for downstream tasks without incurring the high computational cost of full fine-tuning. This capability facilitates deployment and advances the pre-training and fine-tuning paradigm. However, existing methods often struggle to accurately assess transferability for emerging pre-trained models with diverse architectures, training strategies, and task alignments. In this work, we propose Implicit Transferability Modeling (ITM), a novel framework that implicitly models each model’s intrinsic transferability, coupled with a Divide-and-Conquer Variational Approximation (DVA) strategy to efficiently approximate embedding space evolution. This design enables generalization across a broader range of models and downstream tasks. Extensive experiments on a comprehensive benchmark, spanning extensive training regimes and a wider variety of model types, demonstrate that ITM consistently outperforms existing methods in terms of stability, effectiveness, and efficiency.
[393] AG-Fusion: adaptive gated multimodal fusion for 3d object detection in complex scenes
Sixian Liu, Chen Xu, Qiang Wang, Donghai Shi, Yiwen Li
Main category: cs.CV
TL;DR: Proposes AG-Fusion, an adaptive gated fusion method for robust 3D object detection in challenging scenarios by selectively integrating reliable cross-modal patterns from camera and LiDAR data.
Details
Motivation: Existing multimodal fusion methods suffer significant performance degradation in challenging scenarios with sensor degradation or environmental disturbances, limiting their practical application in complex real-world environments.
Method: Projects camera and LiDAR features into unified BEV space, enhances them with window-based attention, and uses adaptive gated fusion module based on cross-modal attention to selectively integrate reliable patterns for robust detection.
Result: Achieves 93.92% accuracy on KITTI dataset and outperforms baseline by 24.88% on the challenging E3D dataset, demonstrating superior robustness to unreliable modal information in complex industrial scenes.
Conclusion: AG-Fusion effectively addresses performance degradation in challenging scenarios by adaptively selecting reliable cross-modal patterns, making it suitable for complex industrial applications like excavator operation monitoring.
Abstract: Multimodal camera-LiDAR fusion technology has found extensive application in 3D object detection, demonstrating encouraging performance. However, existing methods exhibit significant performance degradation in challenging scenarios characterized by sensor degradation or environmental disturbances. We propose a novel Adaptive Gated Fusion (AG-Fusion) approach that selectively integrates cross-modal knowledge by identifying reliable patterns for robust detection in complex scenes. Specifically, we first project features from each modality into a unified BEV space and enhance them using a window-based attention mechanism. Subsequently, an adaptive gated fusion module based on cross-modal attention is designed to integrate these features into reliable BEV representations robust to challenging environments. Furthermore, we construct a new dataset named Excavator3D (E3D) focusing on challenging excavator operation scenarios to benchmark performance in complex conditions. Our method not only achieves competitive performance on the standard KITTI dataset with 93.92% accuracy, but also significantly outperforms the baseline by 24.88% on the challenging E3D dataset, demonstrating superior robustness to unreliable modal information in complex industrial scenes.
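A minimal sketch of a gated BEV fusion layer in the spirit of AG-Fusion: the gate is predicted per location from both modalities so an unreliable input can be down-weighted. For brevity the gate here is a 1x1 convolution rather than the paper's cross-modal attention, and all shapes and names are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    """Toy gated fusion of camera and LiDAR BEV feature maps."""

    def __init__(self, channels: int):
        super().__init__()
        # Gate predicted from the concatenated modalities; values in [0, 1].
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor):
        g = self.gate(torch.cat([cam_bev, lidar_bev], dim=1))  # (B, C, H, W)
        # Convex combination: a degraded camera drives g toward 0,
        # letting the LiDAR branch dominate the fused representation.
        return g * cam_bev + (1.0 - g) * lidar_bev

fusion = AdaptiveGatedFusion(channels=64)
fused = fusion(torch.randn(2, 64, 128, 128), torch.randn(2, 64, 128, 128))
```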
[394] Finding 3D Scene Analogies with Multimodal Foundation Models
Junho Kim, Young Min Kim
Main category: cs.CV
TL;DR: Proposes a zero-shot method for 3D scene analogies using multimodal foundation models to connect scenes without additional training or fixed object vocabularies.
Details
Motivation: To enable robots to adapt to new environments by connecting current observations with prior experiences through scene analogies, overcoming limitations of existing methods that require training and fixed vocabularies.
Method: Uses hybrid neural representation with sparse graph from vision-language model features and feature field from 3D shape foundation models, then finds analogies in coarse-to-fine manner by graph alignment and feature field refinement.
Result: Method establishes accurate correspondences between complex scenes and enables successful trajectory and waypoint transfer applications.
Conclusion: The approach provides effective zero-shot 3D scene analogy capability using multimodal foundation models, supporting robot adaptation and planning in unseen environments.
Abstract: Connecting current observations with prior experiences helps robots adapt and plan in new, unseen 3D environments. Recently, 3D scene analogies have been proposed to connect two 3D scenes, which are smooth maps that align scene regions with common spatial relationships. These maps enable detailed transfer of trajectories or waypoints, potentially supporting demonstration transfer for imitation learning or task plan transfer across scenes. However, existing methods for the task require additional training and fixed object vocabularies. In this work, we propose to use multimodal foundation models for finding 3D scene analogies in a zero-shot, open-vocabulary setting. Central to our approach is a hybrid neural representation of scenes that consists of a sparse graph based on vision-language model features and a feature field derived from 3D shape foundation models. 3D scene analogies are then found in a coarse-to-fine manner, by first aligning the graph and refining the correspondence with feature fields. Our method can establish accurate correspondences between complex scenes, and we showcase applications in trajectory and waypoint transfer.
[395] ReconViaGen: Towards Accurate Multi-view 3D Object Reconstruction via Generation
Jiahao Chang, Chongjie Ye, Yushuang Wu, Yuantao Chen, Yidan Zhang, Zhongjin Luo, Chenghong Li, Yihao Zhi, Xiaoguang Han
Main category: cs.CV
TL;DR: ReconViaGen integrates reconstruction priors into diffusion-based 3D generative methods to address incompleteness in multi-view 3D object reconstruction caused by occlusions and sparse view coverage.
Details
Motivation: Existing multi-view 3D reconstruction methods suffer from severe incompleteness due to insufficient view overlap, occlusions, and sparse coverage. While diffusion-based 3D generative techniques can hallucinate invisible parts, their stochastic nature limits accuracy and reliability.
Method: Proposes ReconViaGen framework that integrates reconstruction priors into generative framework. Addresses two key issues: (1) insufficiency in cross-view connections when extracting multi-view image features, and (2) poor controllability of iterative denoising for local detail generation.
Result: Extensive experiments show ReconViaGen can reconstruct complete and accurate 3D models consistent with input views in both global structure and local details.
Conclusion: The proposed ReconViaGen successfully addresses limitations of existing methods by effectively integrating reconstruction priors into generative frameworks, achieving high consistency in 3D reconstruction from multi-view images.
Abstract: Existing multi-view 3D object reconstruction methods heavily rely on sufficient overlap between input views, where occlusions and sparse coverage in practice frequently yield severe reconstruction incompleteness. Recent advancements in diffusion-based 3D generative techniques offer the potential to address these limitations by leveraging learned generative priors to hallucinate invisible parts of objects, thereby generating plausible 3D structures. However, the stochastic nature of the inference process limits the accuracy and reliability of generation results, preventing existing reconstruction frameworks from integrating such 3D generative priors. In this work, we comprehensively analyze the reasons why diffusion-based 3D generative methods fail to achieve high consistency, including (a) the insufficiency in constructing and leveraging cross-view connections when extracting multi-view image features as conditions, and (b) the poor controllability of iterative denoising during local detail generation, which easily leads to fine geometric and texture details that are plausible but inconsistent with the inputs. Accordingly, we propose ReconViaGen to innovatively integrate reconstruction priors into the generative framework and devise several strategies that effectively address these issues. Extensive experiments demonstrate that our ReconViaGen can reconstruct complete and accurate 3D models consistent with input views in both global structure and local details. Project page: https://jiahao620.github.io/reconviagen.
[396] Evaluation of Vision-LLMs in Surveillance Video
Pascal Benschop, Cristian Meo, Justin Dauwels, Jelte P. Mense
Main category: cs.CV
TL;DR: This paper investigates using vision-language models as zero-shot anomaly detectors in video by converting video to text descriptions and scoring via textual entailment, evaluating their spatial reasoning capabilities for anomalous action recognition.
Details
Motivation: The overwhelming amount of video data from widespread camera use exceeds human monitoring capacity, creating critical challenges for public safety and security that require automated detection of anomalous or criminal events.
Method: Frames anomalous action recognition as zero-shot, language-grounded task using pre-trained vision-LLMs to convert video into text descriptions and score labels via textual entailment. Evaluates four open models on UCF-Crime and RWF-2000 datasets under prompting and privacy-preserving conditions.
Result: Few-shot exemplars can improve accuracy for some models but may increase false positives. Privacy filters, especially full-body GAN transforms, introduce inconsistencies that degrade accuracy. Models succeed at simple, spatially salient events but falter with noisy spatial cues and identity obfuscation.
Conclusion: Outlines paths to strengthen spatial grounding without task-specific training: structure-aware prompts, lightweight spatial memory, scene-graph or 3D-pose priors, and privacy methods preserving action-relevant geometry. Positions zero-shot language-grounded pipelines as adaptable building blocks for embodied video understanding.
Abstract: The widespread use of cameras in our society has created an overwhelming amount of video data, far exceeding the capacity for human monitoring. This presents a critical challenge for public safety and security, as the timely detection of anomalous or criminal events is crucial for effective response and prevention. The ability for an embodied agent to recognize unexpected events is fundamentally tied to its capacity for spatial reasoning. This paper investigates the spatial reasoning of vision-language models (VLMs) by framing anomalous action recognition as a zero-shot, language-grounded task, addressing the embodied perception challenge of interpreting dynamic 3D scenes from sparse 2D video. Specifically, we investigate whether small, pre-trained vision–LLMs can act as spatially-grounded, zero-shot anomaly detectors by converting video into text descriptions and scoring labels via textual entailment. We evaluate four open models on UCF-Crime and RWF-2000 under prompting and privacy-preserving conditions. Few-shot exemplars can improve accuracy for some models, but may increase false positives, and privacy filters – especially full-body GAN transforms – introduce inconsistencies that degrade accuracy. These results chart where current vision–LLMs succeed (simple, spatially salient events) and where they falter (noisy spatial cues, identity obfuscation). Looking forward, we outline concrete paths to strengthen spatial grounding without task-specific training: structure-aware prompts, lightweight spatial memory across clips, scene-graph or 3D-pose priors during description, and privacy methods that preserve action-relevant geometry. This positions zero-shot, language-grounded pipelines as adaptable building blocks for embodied, real-world video understanding. Our implementation for evaluating VLMs is publicly available at: https://github.com/pascalbenschopTU/VLLM_AnomalyRecognition
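The scoring step of such a pipeline can be approximated with off-the-shelf components: a vision-LLM caption stands in for the video clip, and an NLI model scores candidate labels via textual entailment. A minimal sketch assuming the HuggingFace zero-shot-classification pipeline, not the authors' exact models or prompts:

```python
from transformers import pipeline

# Zero-shot label scoring via textual entailment: a vision-LLM first turns a
# clip into a text description; an NLI model then scores each candidate label
# against that description.
nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

description = (
    "Two people are pushing each other near a parked car; "
    "one person falls to the ground."
)  # stand-in for a vision-LLM caption of a surveillance clip
labels = ["fighting", "robbery", "vandalism", "normal activity"]

result = nli(description, candidate_labels=labels)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.3f}")
```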
[397] Multitask Multimodal Self-Supervised Learning for Medical Images
Cristian Simionescu
Main category: cs.CV
TL;DR: This thesis develops Medformer, a neural network for medical image analysis that reduces dependency on labeled data through self-supervised learning and domain adaptation techniques.
Details
Motivation: Address the challenge of limited labeled medical datasets due to expert annotation requirements, privacy concerns, and legal constraints in medical imaging.
Method: Developed Medformer architecture for multitask learning and deep domain adaptation with dynamic input-output adaptation; introduced novel pretext tasks for self-supervised learning; validated using MedMNIST dataset.
Result: The model demonstrated proficiency in learning generalized features applicable to various downstream tasks, efficiently processing diverse medical image types from 2D X-rays to 3D MRIs.
Conclusion: Provides a scalable, adaptable framework that advances medical image analysis by reducing reliance on labeled data, enabling more accurate and efficient diagnostic tools in healthcare.
Abstract: This thesis works to address a pivotal challenge in medical image analysis: the reliance on extensive labeled datasets, which are often limited due to the need for expert annotation and constrained by privacy and legal issues. By focusing on the development of self-supervised learning techniques and domain adaptation methods, this research aims to circumvent these limitations, presenting a novel approach to enhance the utility and efficacy of deep learning in medical imaging. Central to this thesis is the development of the Medformer, an innovative neural network architecture designed for multitask learning and deep domain adaptation. This model is adept at pre-training on diverse medical image datasets, handling varying sizes and modalities, and is equipped with a dynamic input-output adaptation mechanism. This enables efficient processing and integration of a wide range of medical image types, from 2D X-rays to complex 3D MRIs, thus mitigating the dependency on large labeled datasets. Further, the thesis explores the current state of self-supervised learning in medical imaging. It introduces novel pretext tasks that are capable of extracting meaningful information from unlabeled data, significantly advancing the model’s interpretative abilities. This approach is validated through rigorous experimentation, including the use of the MedMNIST dataset, demonstrating the model’s proficiency in learning generalized features applicable to various downstream tasks. In summary, this thesis contributes to the advancement of medical image analysis by offering a scalable, adaptable framework that reduces reliance on labeled data. It paves the way for more accurate, efficient diagnostic tools in healthcare, signifying a major step forward in the application of deep learning in medical imaging.
[398] DecoDINO: 3D Human-Scene Contact Prediction with Semantic Classification
Lukas Bierling, Davide Pasero, Fleur Dolmans, Helia Ghasemi, Angelo Broere
Main category: cs.CV
TL;DR: DecoDINO improves human-object contact prediction using dual DINOv2 encoders and patch-level cross-attention, achieving 7% higher F1 score and halving geodesic error compared to DECO.
Details
Motivation: Previous DECO model was limited to binary contact maps and struggled with soft surfaces, occlusions, children, and false-positive foot contacts. Accurate vertex-level contact prediction is crucial for robotics, AR/VR, and behavioral simulation.
Method: Three-branch network using two DINOv2 ViT-g/14 encoders with class-balanced loss weighting, patch-level cross-attention for local reasoning, and lightweight MLP with softmax for semantic contact labels.
Result: On DAMON benchmark: 7% higher binary-contact F1 score, halves geodesic error, adds object-level semantic labels. Outperformed baseline in both DAMON Challenge tasks.
Conclusion: LoRA fine-tuning and dual encoders are key improvements. The simpler architecture without vision-language model performed better. Code is publicly available.
Abstract: Accurate vertex-level contact prediction between humans and surrounding objects is a prerequisite for high-fidelity human-object interaction models used in robotics, AR/VR, and behavioral simulation. DECO was the first in-the-wild estimator for this task but is limited to binary contact maps and struggles with soft surfaces, occlusions, children, and false-positive foot contacts. We address these issues and introduce DecoDINO, a three-branch network based on DECO’s framework. It uses two DINOv2 ViT-g/14 encoders, class-balanced loss weighting to reduce bias, and patch-level cross-attention for improved local reasoning. Vertex features are finally passed through a lightweight MLP with a softmax to assign semantic contact labels. We also tested a vision-language model (VLM) to integrate text features, but the simpler architecture performed better and was used instead. On the DAMON benchmark, DecoDINO (i) raises the binary-contact F1 score by 7%, (ii) halves the geodesic error, and (iii) augments predictions with object-level semantic labels. Ablation studies show that LoRA fine-tuning and the dual encoders are key to these improvements. DecoDINO outperformed the challenge baseline in both tasks of the DAMON Challenge. Our code is available at https://github.com/DavidePasero/deco/tree/main.
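A minimal sketch of the final semantic-contact head described above: per-vertex features pass through a lightweight MLP with a softmax over contact classes. The feature dimension matches DINOv2 ViT-g/14 (1536); the class count and hidden width are arbitrary placeholders, not the paper's values.

```python
import torch
import torch.nn as nn

class ContactHead(nn.Module):
    """Toy vertex-level semantic contact classifier.

    Maps per-vertex features (e.g., pooled from DINOv2 patch tokens via
    cross-attention) to a distribution over contact classes, where class 0
    could denote "no contact" and the rest object-level semantic labels.
    """

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, vertex_feats: torch.Tensor):  # (V, feat_dim)
        return self.mlp(vertex_feats).softmax(dim=-1)  # (V, num_classes)

head = ContactHead(feat_dim=1536, num_classes=10)
probs = head(torch.randn(6890, 1536))  # 6890 = SMPL mesh vertex count
```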
[399] VR-Drive: Viewpoint-Robust End-to-End Driving with Feed-Forward 3D Gaussian Splatting
Hoonhee Cho, Jae-Young Kang, Giwon Lee, Hyemin Yang, Heejun Park, Seokwoo Jung, Kuk-Jin Yoon
Main category: cs.CV
TL;DR: VR-Drive is an end-to-end autonomous driving framework that addresses viewpoint generalization through joint learning of 3D scene reconstruction and planning-aware view synthesis, enabling robust performance under varying camera viewpoints.
Details
Motivation: Current end-to-end autonomous driving systems struggle with robustness to varying camera viewpoints, which is a common real-world challenge due to diverse vehicle configurations.
Method: Proposes VR-Drive with joint 3D scene reconstruction as auxiliary task, feed-forward inference for online training-time augmentation, viewpoint-mixed memory bank for temporal interaction, and viewpoint-consistent distillation strategy.
Result: VR-Drive effectively mitigates synthesis-induced noise and improves planning under viewpoint shifts, demonstrating scalable and robust performance for real-world deployment.
Conclusion: VR-Drive provides a scalable and robust solution for viewpoint generalization in end-to-end autonomous driving systems, with a new benchmark dataset released for comprehensive evaluation.
Abstract: End-to-end autonomous driving (E2E-AD) has emerged as a promising paradigm that unifies perception, prediction, and planning into a holistic, data-driven framework. However, achieving robustness to varying camera viewpoints, a common real-world challenge due to diverse vehicle configurations, remains an open problem. In this work, we propose VR-Drive, a novel E2E-AD framework that addresses viewpoint generalization by jointly learning 3D scene reconstruction as an auxiliary task to enable planning-aware view synthesis. Unlike prior scene-specific synthesis approaches, VR-Drive adopts a feed-forward inference strategy that supports online training-time augmentation from sparse views without additional annotations. To further improve viewpoint consistency, we introduce a viewpoint-mixed memory bank that facilitates temporal interaction across multiple viewpoints and a viewpoint-consistent distillation strategy that transfers knowledge from original to synthesized views. Trained in a fully end-to-end manner, VR-Drive effectively mitigates synthesis-induced noise and improves planning under viewpoint shifts. In addition, we release a new benchmark dataset to evaluate E2E-AD performance under novel camera viewpoints, enabling comprehensive analysis. Our results demonstrate that VR-Drive is a scalable and robust solution for the real-world deployment of end-to-end autonomous driving systems.
[400] Accurate and Scalable Multimodal Pathology Retrieval via Attentive Vision-Language Alignment
Hongyi Wang, Zhengjie Zhu, Jiabo Ma, Fang Wang, Yue Shi, Bo Luo, Jili Wang, Qiuyu Cai, Xiuming Zhang, Yen-Wei Chen, Lanfen Lin, Hao Chen
Main category: cs.CV
TL;DR: PathSearch is a retrieval framework for whole slide images that combines fine-grained mosaic representations with global embeddings using vision-language contrastive learning, enabling accurate image-to-image and text-to-image retrieval in digital pathology.
Details
Motivation: To address challenges in whole slide image retrieval due to gigapixel scale and difficulty capturing subtle semantic differences, supporting precise diagnoses, consistency across observers, and example-based education in pathology.
Method: Unifies fine-grained attentive mosaic representations with global-wise slide embeddings aligned through vision-language contrastive learning, trained on 6,926 slide-report pairs, supporting mosaic-based image-to-image retrieval and multi-modal text-to-slide retrieval.
Result: Outperforms traditional image-to-image retrieval frameworks on four public datasets and three in-house cohorts across various tasks including anatomical site retrieval, tumor subtyping, and grading. Multi-center reader study shows improved diagnostic accuracy, confidence, and inter-observer agreement.
Conclusion: PathSearch establishes itself as a scalable and generalizable retrieval solution for digital pathology that enhances clinical workflows and diagnostic performance.
Abstract: The rapid digitization of histopathology slides has opened up new possibilities for computational tools in clinical and research workflows. Among these, content-based slide retrieval stands out, enabling pathologists to identify morphologically and semantically similar cases, thereby supporting precise diagnoses, enhancing consistency across observers, and assisting example-based education. However, effective retrieval of whole slide images (WSIs) remains challenging due to their gigapixel scale and the difficulty of capturing subtle semantic differences amid abundant irrelevant content. To overcome these challenges, we present PathSearch, a retrieval framework that unifies fine-grained attentive mosaic representations with global-wise slide embeddings aligned through vision-language contrastive learning. Trained on a corpus of 6,926 slide-report pairs, PathSearch captures both fine-grained morphological cues and high-level semantic patterns to enable accurate and flexible retrieval. The framework supports two key functionalities: (1) mosaic-based image-to-image retrieval, ensuring accurate and efficient slide research; and (2) multi-modal retrieval, where text queries can directly retrieve relevant slides. PathSearch was rigorously evaluated on four public pathology datasets and three in-house cohorts, covering tasks including anatomical site retrieval, tumor subtyping, tumor vs. non-tumor discrimination, and grading across diverse organs such as breast, lung, kidney, liver, and stomach. External results show that PathSearch outperforms traditional image-to-image retrieval frameworks. A multi-center reader study further demonstrates that PathSearch improves diagnostic accuracy, boosts confidence, and enhances inter-observer agreement among pathologists in real clinical scenarios. These results establish PathSearch as a scalable and generalizable retrieval solution for digital pathology.
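Both retrieval modes ultimately reduce to nearest-neighbor search in the contrastively aligned embedding space. A minimal sketch with random vectors standing in for PathSearch's slide and text encoders; all names and dimensions are hypothetical.

```python
import numpy as np

def retrieve_slides(query_emb, slide_embs, top_k=5):
    """Rank slides by cosine similarity to a query embedding.

    query_emb can come from a query slide (image-to-image retrieval) or an
    encoded text report (text-to-slide retrieval), assuming both modalities
    are projected into the same contrastively aligned space.
    """
    q = query_emb / np.linalg.norm(query_emb)
    s = slide_embs / np.linalg.norm(slide_embs, axis=1, keepdims=True)
    sims = s @ q                          # cosine similarity per slide
    order = np.argsort(-sims)[:top_k]     # indices of the top matches
    return order, sims[order]

slide_embs = np.random.randn(1000, 512)   # gallery of slide-level embeddings
query = np.random.randn(512)              # text or slide query embedding
idx, scores = retrieve_slides(query, slide_embs)
```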
[401] Through the Lens: Benchmarking Deepfake Detectors Against Moiré-Induced Distortions
Razaib Tariq, Minji Heo, Simon S. Woo, Shahroz Tariq
Main category: cs.CV
TL;DR: Deepfake detectors perform poorly on Moiré-affected videos, with performance drops up to 25.4%. Synthetic Moiré patterns cause 21.4% accuracy reduction, and demoiréing methods worsen performance by 17.2%.
Details
Motivation: Real-world deepfake detection faces challenges from Moiré artifacts introduced when capturing media from digital screens using smartphones, but this issue has received little systematic study.
Method: Collected 12,832 videos (35.64 hours) from multiple datasets, captured under diverse real-world conditions. Created DeepMoiréFake (DMF) dataset and used synthetic Moiré generation techniques. Evaluated 15 state-of-the-art detectors.
Result: Moiré artifacts degrade detector performance by up to 25.4%, synthetic Moiré causes 21.4% accuracy drop, and demoiréing methods reduce accuracy by up to 17.2% instead of helping.
Conclusion: There’s an urgent need for detection models robust to Moiré distortions and other real-world challenges. The DMF dataset aims to bridge the gap between controlled experiments and practical deepfake detection.
Abstract: Deepfake detection remains a pressing challenge, particularly in real-world settings where smartphone-captured media from digital screens often introduces Moiré artifacts that can distort detection outcomes. This study systematically evaluates state-of-the-art (SOTA) deepfake detectors on Moiré-affected videos, an issue that has received little attention. We collected a dataset of 12,832 videos, spanning 35.64 hours, from the Celeb-DF, DFD, DFDC, UADFV, and FF++ datasets, capturing footage under diverse real-world conditions, including varying screens, smartphones, lighting setups, and camera angles. To further examine the influence of Moiré patterns on deepfake detection, we conducted additional experiments using our DeepMoiréFake (DMF) dataset and two synthetic Moiré generation techniques. Across 15 top-performing detectors, our results show that Moiré artifacts degrade performance by as much as 25.4%, while synthetically generated Moiré patterns lead to a 21.4% drop in accuracy. Surprisingly, demoiréing methods, intended as a mitigation approach, instead worsened the problem, reducing accuracy by up to 17.2%. These findings underscore the urgent need for detection models that can robustly handle Moiré distortions alongside other real-world challenges, such as compression, sharpening, and blurring. By introducing the DMF dataset, we aim to drive future research toward closing the gap between controlled experiments and practical deepfake detection.
[402] Autoregressive Styled Text Image Generation, but Make it Reliable
Carmine Zaccagnino, Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Alessio Tonioni, Rita Cucchiara
Main category: cs.CV
TL;DR: Eruku is a new approach for Handwritten Text Generation that frames the task as multimodal prompt-conditioned generation, using special textual tokens and Classifier-Free Guidance to improve content controllability and style generalization.
Details
Motivation: Previous autoregressive Transformer methods for HTG require additional inputs, lack proper stop mechanisms, cause repetition loops, and generate visual artifacts. There's a need for better content controllability and fewer input requirements.
Method: Frames HTG as multimodal prompt-conditioned generation, introduces special textual input tokens for better alignment with visual tokens, and implements Classifier-Free Guidance strategy for the autoregressive model.
Result: Eruku requires fewer inputs, generalizes better to unseen styles, and follows textual prompts more faithfully with improved content adherence compared to previous solutions.
Conclusion: The proposed Eruku approach successfully addresses limitations of previous HTG methods by improving content controllability, reducing input requirements, and enhancing style generalization through multimodal prompt-conditioned generation.
Abstract: Generating faithful and readable styled text images (especially for Styled Handwritten Text generation - HTG) is an open problem with several possible applications across graphic design, document understanding, and image editing. A lot of research effort in this task is dedicated to developing strategies that reproduce the stylistic characteristics of a given writer, with promising results in terms of style fidelity and generalization achieved by the recently proposed Autoregressive Transformer paradigm for HTG. However, this method requires additional inputs, lacks a proper stop mechanism, and might end up in repetition loops, generating visual artifacts. In this work, we rethink the autoregressive formulation by framing HTG as a multimodal prompt-conditioned generation task, and tackle the content controllability issues by introducing special textual input tokens for better alignment with the visual ones. Moreover, we devise a Classifier-Free-Guidance-based strategy for our autoregressive model. Through extensive experimental validation, we demonstrate that our approach, dubbed Eruku, compared to previous solutions requires fewer inputs, generalizes better to unseen styles, and follows more faithfully the textual prompt, improving content adherence.
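The Classifier-Free-Guidance strategy for an autoregressive decoder can be sketched at the logit level: each decoding step combines a conditioned and an unconditioned forward pass of the same model. The guidance scale and vocabulary size below are illustrative, not the paper's values.

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               guidance_scale: float = 1.5) -> torch.Tensor:
    """Classifier-free guidance on next-token logits.

    Tokens favored by the conditioning (style/text prompt) are amplified
    relative to the unconditional prediction; scale 1.0 recovers the
    plain conditional model.
    """
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# At each step, both logit tensors come from the same autoregressive model:
# once with the prompt, once with it dropped (a null prompt).
next_token = cfg_logits(torch.randn(1, 50257), torch.randn(1, 50257)).argmax(-1)
```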
[403] FRBNet: Revisiting Low-Light Vision through Frequency-Domain Radial Basis Network
Fangtong Sun, Congyu Li, Ke Yang, Yuchen Pan, Hanwen Yu, Xichuan Zhang, Yiying Li
Main category: cs.CV
TL;DR: FRBNet is a frequency-domain module that extracts illumination-invariant features for low-light vision tasks by leveraging frequency-domain channel ratios and learnable filters, achieving state-of-the-art performance.
Details
Motivation: Current methods for low-light vision have limited performance due to incomplete modeling of low-light conditions, which affects downstream tasks like detection and segmentation.
Method: Extends Lambertian model to frequency domain, uses frequency-domain channel ratio with structured filtering, and proposes FRBNet - an end-to-end trainable module with learnable frequency domain filters.
Result: Achieves +2.2 mAP for dark object detection and +2.9 mIoU for nighttime segmentation, outperforming existing methods across various downstream tasks.
Conclusion: FRBNet effectively addresses low-light vision challenges through frequency-domain analysis and serves as a plug-and-play module that enhances existing networks without modifying loss functions.
Abstract: Low-light vision remains a fundamental challenge in computer vision due to severe illumination degradation, which significantly affects the performance of downstream tasks such as detection and segmentation. While recent state-of-the-art methods have improved performance through invariant feature learning modules, they still fall short due to incomplete modeling of low-light conditions. Therefore, we revisit low-light image formation and extend the classical Lambertian model to better characterize low-light conditions. By shifting our analysis to the frequency domain, we theoretically prove that the frequency-domain channel ratio can be leveraged to extract illumination-invariant features via a structured filtering process. We then propose a novel and end-to-end trainable module named Frequency-domain Radial Basis Network (FRBNet), which integrates the frequency-domain channel ratio operation with a learnable frequency domain filter for the overall illumination-invariant feature enhancement. As a plug-and-play module, FRBNet can be integrated into existing networks for low-light downstream tasks without modifying loss functions. Extensive experiments across various downstream tasks demonstrate that FRBNet achieves superior performance, including +2.2 mAP for dark object detection and +2.9 mIoU for nighttime segmentation. Code is available at: https://github.com/Sing-Forevet/FRBNet.
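A toy version of the frequency-domain channel ratio intuition: if a Lambertian-style illumination factor scales all channels, magnitude ratios between channel spectra cancel it. This sketch only illustrates the invariance idea; FRBNet's learnable radial-basis filtering and exact formulation are omitted.

```python
import torch

def frequency_channel_ratio(img: torch.Tensor, eps: float = 1e-6):
    """Toy illumination-invariant feature via frequency-domain channel ratios.

    img: (B, 3, H, W). Under a Lambertian-style model where a shared
    illumination factor scales every channel, the ratio of channel spectrum
    magnitudes cancels that factor, leaving a scale-free descriptor.
    """
    spec = torch.fft.rfft2(img)                # (B, 3, H, W//2+1), complex
    mag = spec.abs() + eps
    r, g, b = mag[:, 0], mag[:, 1], mag[:, 2]
    ratios = torch.stack([r / g, g / b, b / r], dim=1)
    return torch.log(ratios)                   # log-ratios are scale-free

feats = frequency_channel_ratio(torch.rand(2, 3, 64, 64))
```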
[404] A Video Is Not Worth a Thousand Words
Sam Pollard, Michael Wray
Main category: cs.CV
TL;DR: The paper proposes a method using Shapley values to analyze feature attributions and modality scores in vision language models for video question answering, revealing text dominance and that the task reduces to ignoring distractors.
Details
Motivation: To address concerns about text dominance in VLMs and the underdeveloped analysis of modality interactions, and to provide metrics for judging whether increasingly complex multi-modal models are heading in the right direction.
Method: Joint computation of feature attributions and modality scores based on Shapley values, treating video frames and textual elements as equal features, with multiple-choice VQA as an interaction between video, question, and answer modalities.
Result: Analysis of 6 VLM models on 4 datasets shows a dependence on text and that the multiple-choice VQA task devolves into the models’ ability to ignore distractors.
Conclusion: Current VLM evaluation in multiple-choice VQA reveals text dominance and the task’s reduction to distractor filtering rather than true multi-modal understanding.
Abstract: As we become increasingly dependent on vision language models (VLMs) to answer questions about the world around us, there is a significant amount of research devoted to increasing both the difficulty of video question answering (VQA) datasets, and the context lengths of the models that they evaluate. The reliance on large language models as backbones has led to concerns about potential text dominance, and the exploration of interactions between modalities is underdeveloped. How do we measure whether we’re heading in the right direction, with the complexity that multi-modal models introduce? We propose a joint method of computing both feature attributions and modality scores based on Shapley values, where both the features and modalities are arbitrarily definable. Using these metrics, we compare 6 VLM models of varying context lengths on 4 representative datasets, focusing on multiple-choice VQA. In particular, we consider video frames and whole textual elements as equal features in the hierarchy, and the multiple-choice VQA task as an interaction between three modalities: video, question and answer. Our results demonstrate a dependence on text and show that the multiple-choice VQA task devolves into a model’s ability to ignore distractors. Code available at https://github.com/sjpollard/a-video-is-not-worth-a-thousand-words.
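A minimal Monte Carlo sketch of Shapley attribution over arbitrarily defined features, in the spirit of treating frames and textual elements uniformly; value_fn is a hypothetical stand-in for running the VLM on a masked input, and the modality score would sum attributions over one modality's features.

```python
import random

def shapley_values(features, value_fn, num_samples=200):
    """Monte Carlo Shapley attribution over arbitrary features.

    features: list of feature identifiers (e.g., video frames, the question,
              each answer option). value_fn(subset) returns the model's
              score when only that subset of features is visible.
    """
    phi = {f: 0.0 for f in features}
    for _ in range(num_samples):
        order = random.sample(features, len(features))  # random permutation
        included, prev = [], value_fn([])
        for f in order:
            included.append(f)
            cur = value_fn(included)
            phi[f] += (cur - prev) / num_samples        # marginal contribution
            prev = cur
    return phi

# Dummy value function for illustration; a modality score is then the sum of
# |phi| over that modality's features (e.g., all frames vs. all text items).
feats = ["frame_0", "frame_1", "question", "answer_A", "answer_B"]
phi = shapley_values(feats, value_fn=lambda s: 0.1 * len(s))
```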
[405] hYOLO Model: Enhancing Object Classification with Hierarchical Context in YOLOv8
Veska Tsenkova, Peter Stanchev, Daniel Petrov, Deyan Lazarov
Main category: cs.CV
TL;DR: This paper proposes an end-to-end hierarchical model for image detection and classification built on YOLO, addressing the limitations of flat classification by incorporating object hierarchies.
Details
Motivation: Real-world objects have natural hierarchical relationships that can improve classification, provide better contextual understanding, and control mistake severity, which flat CNN classification methods overlook.
Method: Developed an end-to-end hierarchical model based on YOLO with novel hierarchical architecture, modified loss function, and hierarchical performance metrics. Evaluated on two hierarchical categorizations of the same dataset.
Result: The model successfully captures hierarchical relationships in real-world objects and demonstrates improved performance over conventional flat classification approaches.
Conclusion: The proposed hierarchical methodology effectively addresses the inherent hierarchical structure in real-world objects that traditional flat classification algorithms typically ignore.
Abstract: Current convolutional neural network (CNN) classification methods are predominantly focused on flat classification, which aims solely to identify a specified object within an image. However, real-world objects often possess a natural hierarchical organization that can significantly help classification tasks. Capturing the presence of relations between objects enables better contextual understanding as well as control over the severity of mistakes. Considering these aspects, this paper proposes an end-to-end hierarchical model for image detection and classification built upon the YOLO model family. A novel hierarchical architecture, a modified loss function, and a performance metric tailored to the hierarchical nature of the model are introduced. The proposed model is trained and evaluated on two different hierarchical categorizations of the same dataset: a systematic categorization that disregards visual similarities between objects and a categorization accounting for common visual characteristics across classes. The results illustrate how the suggested methodology addresses the inherent hierarchical structure present in real-world objects, which conventional flat classification algorithms often overlook.
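One common way to realize a hierarchy-aware loss (a sketch, not necessarily the paper's exact formulation) is to supervise every level of the label tree jointly, so wrong-branch errors cost more than sibling confusions:

```python
import torch
import torch.nn.functional as F

def hierarchical_loss(logits_per_level, targets_per_level, level_weights=None):
    """Toy hierarchical classification loss.

    logits_per_level:  list of (B, C_l) logits, one per hierarchy level
                       (e.g., coarse superclass -> fine leaf class).
    targets_per_level: list of (B,) integer labels per level.
    A wrong-branch prediction is penalized at every level it violates, so
    severe mistakes cost more than confusions between sibling classes.
    """
    if level_weights is None:
        level_weights = [1.0] * len(logits_per_level)
    return sum(w * F.cross_entropy(lg, tg)
               for w, lg, tg in zip(level_weights, logits_per_level,
                                    targets_per_level))

coarse = torch.randn(4, 3)   # 3 superclasses
fine = torch.randn(4, 10)    # 10 leaf classes
loss = hierarchical_loss([coarse, fine],
                         [torch.randint(3, (4,)), torch.randint(10, (4,))])
```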
[406] Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling
Ruoyu Wang, Beier Zhu, Junzhi Li, Liangyu Yuan, Chi Zhang
Main category: cs.CV
TL;DR: AdaSDE is a novel single-step SDE solver that dynamically regulates error correction strength to accelerate diffusion sampling, achieving state-of-the-art performance with only 5 NFE.
Details
Motivation: To address the complementary weaknesses of ODE solvers (accumulating irreducible gradient error) and SDE methods (suffering from amplified discretization errors with limited step budgets) in diffusion-based generative processes.
Method: Introduces AdaSDE, a single-step SDE solver with a per-step learnable coefficient estimated via lightweight distillation, which dynamically regulates error correction strength. The framework can be integrated with existing solvers.
Result: State-of-the-art performance: at 5 NFE, achieves FID scores of 4.18 on CIFAR-10, 8.05 on FFHQ and 6.96 on LSUN Bedroom.
Conclusion: AdaSDE successfully unifies the efficiency of ODEs with the error resilience of SDEs, providing an effective solution for accelerating diffusion sampling while maintaining high sample quality.
Abstract: Diffusion-based generative processes, formulated as differential equation solving, frequently balance computational speed with sample quality. Our theoretical investigation of ODE- and SDE-based solvers reveals complementary weaknesses: ODE solvers accumulate irreducible gradient error along deterministic trajectories, while SDE methods suffer from amplified discretization errors when the step budget is limited. Building upon this insight, we introduce AdaSDE, a novel single-step SDE solver that aims to unify the efficiency of ODEs with the error resilience of SDEs. Specifically, we introduce a single per-step learnable coefficient, estimated via lightweight distillation, which dynamically regulates the error correction strength to accelerate diffusion sampling. Notably, our framework can be integrated with existing solvers to enhance their capabilities. Extensive experiments demonstrate state-of-the-art performance: at 5 NFE, AdaSDE achieves FID scores of 4.18 on CIFAR-10, 8.05 on FFHQ and 6.96 on LSUN Bedroom. Code is available at https://github.com/WLU-wry02/AdaSDE.
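The role of a per-step stochasticity coefficient can be sketched with an EDM-style stochastic sampler step, where gamma controls how much noise is re-injected before the deterministic update; here gamma is a placeholder for AdaSDE's distilled learnable coefficient, and this is not the paper's actual solver.

```python
import torch

def adaptive_sde_step(x, denoiser, sigma, sigma_next, gamma):
    """One toy denoising step with a tunable stochasticity coefficient.

    gamma = 0 gives a deterministic (Euler ODE) step; gamma > 0 first
    re-injects noise, then denoises from the raised level, which is the
    kind of error correction a per-step coefficient can regulate.
    denoiser(x, sigma) should return an estimate of the clean sample.
    """
    # Re-inject noise: raise the noise level from sigma to sigma_hat.
    sigma_hat = sigma * (1.0 + gamma)
    x = x + torch.randn_like(x) * (sigma_hat**2 - sigma**2) ** 0.5
    # Euler step of the probability-flow ODE from sigma_hat to sigma_next.
    d = (x - denoiser(x, sigma_hat)) / sigma_hat
    return x + (sigma_next - sigma_hat) * d
```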
[407] On the Faithfulness of Visual Thinking: Measurement and Enhancement
Zujing Liu, Junwen Pan, Qi She, Yuan Gao, Guisong Xia
Main category: cs.CV
TL;DR: The paper identifies unfaithfulness in vision-language models’ multimodal chain-of-thought (MCoT) reasoning, where visual information is often inaccurate but still yields correct answers. It proposes SCCM learning to improve visual faithfulness by generating sufficient yet minimal visual components.
Details
Motivation: Current LVLMs generate MCoT traces with inaccurate visual information that still produce correct answers, indicating unfaithfulness in reasoning. This is attributed to RL rewards that only incentivize format without considering visual correctness.
Method: Proposed Sufficient-Component Cause Model (SCCM) learning, which encourages MCoT to generate sufficient yet minimal visual components independently capable of leading to correct answers. SCCM is annotation-free and plug-and-play with various RFT methods.
Result: Empirical results show SCCM consistently improves visual faithfulness across fine-grained perception and reasoning benchmarks. The evaluation reveals current MCoT visual information is both unreliable and insufficient.
Conclusion: SCCM effectively addresses the visual faithfulness issue in MCoT reasoning by ensuring visual components are both reliable and sufficient, leading to more faithful multimodal reasoning processes.
Abstract: Recent large vision-language models (LVLMs) can generate vision-text multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning (RFT). However, we observe that the visual information incorporated in MCoT is often inaccurate, though the final answers remain correct, indicating a lack of faithfulness in the MCoT reasoning process. We attribute this unfaithfulness to the RL reward in RFT, which solely incentivizes the format of interleaved vision-text cues, i.e., it encourages the model to incorporate visual information into its text reasoning steps without considering the correctness of the visual information. In this paper, we first probe the faithfulness of MCoT by measuring how much the prediction changes when its visual and textual thoughts are intervened. Surprisingly, the model’s predictions remain nearly unchanged under visual intervention but change significantly under textual intervention, indicating that the visual evidence is largely ignored. To further analyze visual information, we introduce an automated LVLM-based evaluation metric that quantifies the faithfulness of visual cues from two perspectives: reliability and sufficiency. Our evaluation reveals that the visual information in current MCoT traces is simultaneously unreliable and insufficient. To address this issue, we propose a novel MCoT learning strategy termed Sufficient-Component Cause Model (SCCM) learning. This approach encourages the MCoT to generate sufficient yet minimal visual components that are independently capable of leading to correct answers. We note that the proposed SCCM is annotation-free and compatible with various RFT for MCoT in a plug-and-play manner. Empirical results demonstrate that SCCM consistently improves the visual faithfulness across a suite of fine-grained perception and reasoning benchmarks. Code is available at https://github.com/EugeneLiu01/Faithful_Thinking_with_Image.
[408] MDReID: Modality-Decoupled Learning for Any-to-Any Multi-Modal Object Re-Identification
Yingying Feng, Jie Li, Jie Hu, Yukang Zhang, Lei Tan, Jiayi Ji
Main category: cs.CV
TL;DR: MDReID is a flexible any-to-any image-level ReID framework that handles both modality-matched and modality-mismatched scenarios by decomposing modality features into shared and specific components.
Details
Motivation: Real-world ReID systems face modality inconsistencies where query and gallery images come from different sensors (RGB, NIR, TIR), but most existing methods assume modality-matched conditions, limiting their practical robustness and scalability.
Method: MDReID introduces two key components: Modality Decoupling Learning (MDL) to decompose modality features into modality-shared and modality-specific representations, and Modality-aware Metric Learning (MML) to enforce orthogonality and complementarity between these components.
Result: Extensive experiments on three multi-modality ReID benchmarks show MDReID achieves significant mAP improvements: 9.8%, 3.0%, and 11.5% in modality-matched scenarios, and average gains of 3.4%, 11.8%, and 10.9% in modality-mismatched scenarios.
Conclusion: MDReID effectively addresses modality inconsistencies in ReID systems through explicit modality feature decomposition and tailored metric learning, demonstrating superior performance across both modality-aligned and mismatched conditions.
Abstract: Real-world object re-identification (ReID) systems often face modality inconsistencies, where query and gallery images come from different sensors (e.g., RGB, NIR, TIR). However, most existing methods assume modality-matched conditions, which limits their robustness and scalability in practical applications. To address this challenge, we propose MDReID, a flexible any-to-any image-level ReID framework designed to operate under both modality-matched and modality-mismatched scenarios. MDReID builds on the insight that modality information can be decomposed into two components: modality-shared features that are predictable and transferable, and modality-specific features that capture unique, modality-dependent characteristics. To effectively leverage this, MDReID introduces two key components: the Modality Decoupling Learning (MDL) and Modality-aware Metric Learning (MML). Specifically, MDL explicitly decomposes modality features into modality-shared and modality-specific representations, enabling effective retrieval in both modality-aligned and mismatched scenarios. MML, a tailored metric learning strategy, further enforces orthogonality and complementarity between the two components to enhance discriminative power across modalities. Extensive experiments conducted on three challenging multi-modality ReID benchmarks (RGBNT201, RGBNT100, MSVR310) consistently demonstrate the superiority of MDReID. Notably, MDReID achieves significant mAP improvements of 9.8%, 3.0%, and 11.5% in general modality-matched scenarios, and average gains of 3.4%, 11.8%, and 10.9% in modality-mismatched scenarios, respectively. The code is available at: https://github.com/stone96123/MDReID.
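A minimal sketch of the decoupling idea: two projection heads split a backbone feature into shared and specific parts, and an orthogonality penalty pushes them to carry complementary content. The simple linear heads, dimensions, and loss form are assumptions, not MDReID's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityDecoupler(nn.Module):
    """Toy decomposition of a backbone feature into modality-shared and
    modality-specific components."""

    def __init__(self, dim: int):
        super().__init__()
        self.shared = nn.Linear(dim, dim)    # transferable across sensors
        self.specific = nn.Linear(dim, dim)  # sensor-dependent residual

    def forward(self, feat: torch.Tensor):
        return self.shared(feat), self.specific(feat)

def orthogonality_loss(shared: torch.Tensor, specific: torch.Tensor):
    # Penalize cosine similarity so the two parts stay complementary.
    return F.cosine_similarity(shared, specific, dim=-1).abs().mean()

dec = ModalityDecoupler(dim=512)
sh, sp = dec(torch.randn(8, 512))
loss = orthogonality_loss(sh, sp)
```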
[409] Interpretable Tile-Based Classification of Paclitaxel Exposure
Sean Fletcher, Gabby Scott, Douglas Currie, Xin Zhang, Yuqi Song, Bruce MacLeod
Main category: cs.CV
TL;DR: A tiling-and-aggregation pipeline for classifying paclitaxel exposure in C6 glioma cells from phase-contrast microscopy achieves state-of-the-art accuracy, improving baseline by ~20 percentage points, with enhanced interpretability via Grad-CAM and Score-CAM analyses.
Details
Motivation: Medical image analysis is crucial for drug discovery and preclinical evaluation, but current full-image models struggle with subtle dose differences in paclitaxel exposure classification from phase-contrast microscopy of C6 glioma cells.
Method: Proposed a simple tiling-and-aggregation pipeline that processes local patches and combines tile outputs into image labels, with interpretability enhanced through Grad-CAM and Score-CAM attention analyses.
Result: Achieved state-of-the-art accuracy on benchmark dataset, improving over published baseline by approximately 20 percentage points, with trends confirmed by cross-validation.
Conclusion: The tiling approach is effective for medical image analysis tasks with subtle differences, and interpretability methods point toward robustness-oriented directions for future research. Code is released for reproduction and extension.
Abstract: Medical image analysis is central to drug discovery and preclinical evaluation, where scalable, objective readouts can accelerate decision-making. We address classification of paclitaxel (Taxol) exposure from phase-contrast microscopy of C6 glioma cells – a task with subtle dose differences that challenges full-image models. We propose a simple tiling-and-aggregation pipeline that operates on local patches and combines tile outputs into an image label, achieving state-of-the-art accuracy on the benchmark dataset and improving over the published baseline by around 20 percentage points, with trends confirmed by cross-validation. To understand why tiling is effective, we further apply Grad-CAM, Score-CAM, and attention analyses, which enhance model interpretability and point toward robustness-oriented directions for future medical image research. Code is released to facilitate reproduction and extension.
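The tiling-and-aggregation pipeline itself is simple to express: tile the image, score each tile, and aggregate into an image-level prediction. A minimal sketch; the tile size and mean-logit aggregation are assumptions, not necessarily the paper's choices.

```python
import torch

def tile_image(img: torch.Tensor, tile: int = 224):
    """Split a (C, H, W) image into non-overlapping (C, tile, tile) patches."""
    c, h, w = img.shape
    img = img[:, : h - h % tile, : w - w % tile]           # drop ragged edge
    patches = img.unfold(1, tile, tile).unfold(2, tile, tile)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c, tile, tile)

def classify_by_tiles(img, tile_model, tile=224):
    """Aggregate tile-level logits into one image-level prediction."""
    tiles = tile_image(img, tile)
    logits = tile_model(tiles)          # (N_tiles, n_classes)
    return logits.mean(dim=0)           # mean aggregation over tiles

# tile_model can be any classifier taking (N, C, 224, 224) -> (N, n_classes).
```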
[410] PlanarTrack: A high-quality and challenging benchmark for large-scale planar object tracking
Yifan Jiao, Xinran Liu, Xiaoqiong Liu, Xiaohui Yuan, Heng Fan, Libo Zhang
Main category: cs.CV
TL;DR: PlanarTrack is a large-scale benchmark for planar tracking with 1,150 sequences and over 733K frames, designed to address the lack of comprehensive datasets in this field.
Details
Motivation: The development of planar tracking in the deep learning era is limited due to the lack of large-scale platforms. Existing benchmarks are insufficient for comprehensive evaluation of tracking methods.
Method: Created PlanarTrack dataset with 1,000 short-term and 150 long-term videos recorded in unconstrained conditions. Each frame manually annotated with four corner points through multi-round inspection. Each sequence contains unique targets to enhance diversity.
Result: PlanarTrack is the largest and most diverse planar tracking dataset. Evaluation of 10 representative trackers shows significant performance degradation on this challenging benchmark, indicating current methods are insufficient.
Conclusion: PlanarTrack provides a comprehensive benchmark for planar tracking research. Current methods struggle with its challenges, highlighting the need for improved tracking algorithms. The dataset will be publicly available for future research.
Abstract: Planar tracking has drawn increasing interest owing to its key roles in robotics and augmented reality. Despite recent great advancement, further development of planar tracking, particularly in the deep learning era, is largely limited compared to generic tracking due to the lack of large-scale platforms. To mitigate this, we propose PlanarTrack, a large-scale high-quality and challenging benchmark for planar tracking. Specifically, PlanarTrack consists of 1,150 sequences with over 733K frames, including 1,000 short-term and 150 new long-term videos, which enables comprehensive evaluation of short- and long-term tracking performance. All videos in PlanarTrack are recorded in unconstrained conditions from the wild, which makes PlanarTrack challenging but more realistic for real-world applications. To ensure high-quality annotations, each video frame is manually annotated by four corner points with multi-round meticulous inspection and refinement. To enhance target diversity of PlanarTrack, we only capture a unique target in one sequence, which is different from existing benchmarks. To the best of our knowledge, PlanarTrack is by far the largest and most diverse and challenging dataset dedicated to planar tracking. To understand the performance of existing methods on PlanarTrack and to provide a comparison for future research, we evaluate 10 representative planar trackers with extensive comparison and in-depth analysis. Our evaluation reveals that, unsurprisingly, the top planar trackers heavily degrade on the challenging PlanarTrack, which indicates more efforts are required for improving planar tracking. Our data and results will be released at https://github.com/HengLan/PlanarTrack
[411] An Efficient Remote Sensing Super Resolution Method Exploring Diffusion Priors and Multi-Modal Constraints for Crop Type Mapping
Songxi Yang, Tang Sui, Qunying Huang
Main category: cs.CV
TL;DR: LSSR is an efficient super-resolution framework for remote sensing that uses frozen pretrained Stable Diffusion with cross-modal attention, auxiliary knowledge integration, and specialized losses to achieve state-of-the-art performance while maintaining fast inference speeds.
Details
Motivation: To address limitations in current remote sensing super-resolution methods: expensive training requirements, slow inference speeds, limited use of auxiliary information as real-world constraints, and lack of downstream task evaluation.
Method: Built on frozen pretrained Stable Diffusion, integrates cross-modal attention with auxiliary knowledge (DEM, land cover, month) and SAR guidance, enhanced by adapters and Fourier NDVI loss to balance spatial details and spectral fidelity.
Result: Achieves PSNR/SSIM of 32.63/0.84 (RGB) and 23.99/0.78 (IR), lowest NDVI MSE (0.042), efficient inference (0.39 sec/image), and superior crop classification (F1: 0.86) compared to Sentinel-2 (F1: 0.85).
Conclusion: LSSR demonstrates the potential of remote sensing super-resolution to advance precision agriculture through efficient, high-quality image reconstruction with effective downstream task performance.
Abstract: Super resolution offers a way to harness medium- and even low-resolution but historically valuable remote sensing image archives. Generative models, especially diffusion models, have recently been applied to remote sensing super resolution (RSSR), yet several challenges exist. First, diffusion models are effective but require expensive training-from-scratch resources and have slow inference speeds. Second, current methods make limited use of auxiliary information as real-world constraints to reconstruct scientifically realistic images. Finally, most current methods lack evaluation on downstream tasks. In this study, we present an efficient LSSR framework for RSSR, supported by a new multimodal dataset of paired 30 m Landsat 8 and 10 m Sentinel 2 imagery. Built on frozen pretrained Stable Diffusion, LSSR integrates cross-modal attention with auxiliary knowledge (Digital Elevation Model, land cover, month) and Synthetic Aperture Radar guidance, enhanced by adapters and a tailored Fourier NDVI loss to balance spatial details and spectral fidelity. Extensive experiments demonstrate that LSSR significantly improves crop boundary delineation and recovery, achieving state-of-the-art performance with Peak Signal-to-Noise Ratio/Structural Similarity Index Measure of 32.63/0.84 (RGB) and 23.99/0.78 (IR), and the lowest NDVI Mean Squared Error (0.042), while maintaining efficient inference (0.39 sec/image). Moreover, LSSR transfers effectively to NASA Harmonized Landsat and Sentinel (HLS) super resolution, yielding more reliable crop classification (F1: 0.86) than Sentinel-2 (F1: 0.85). These results highlight the potential of RSSR to advance precision agriculture.
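The tailored Fourier NDVI loss can be sketched as follows: compute NDVI from predicted and reference red/NIR bands, then compare their spectra, so both spatial detail and vegetation (spectral) fidelity are penalized. The exact norm and any band weighting are assumptions.

```python
import torch

def ndvi(red: torch.Tensor, nir: torch.Tensor, eps: float = 1e-6):
    """Normalized Difference Vegetation Index per pixel."""
    return (nir - red) / (nir + red + eps)

def fourier_ndvi_loss(pred_red, pred_nir, gt_red, gt_nir):
    """Toy Fourier-domain NDVI loss: compare the spectra of the NDVI maps
    so errors in both fine spatial structure and vegetation signal count."""
    pred_spec = torch.fft.rfft2(ndvi(pred_red, pred_nir))
    gt_spec = torch.fft.rfft2(ndvi(gt_red, gt_nir))
    return (pred_spec - gt_spec).abs().mean()

loss = fourier_ndvi_loss(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64),
                         torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64))
```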
[412] VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations
Lu Dong, Haiyu Zhang, Han Lin, Ziang Yan, Xiangyu Zeng, Hongjie Zhang, Yifei Huang, Yi Wang, Zhen-Hua Ling, Limin Wang, Yali Wang
Main category: cs.CV
TL;DR: VideoTG-R1 is a curriculum reinforcement learning framework that addresses challenges in video temporal grounding by filtering partially annotated samples and dynamically adjusting training difficulty for hard-to-ground samples.
Details
Motivation: Current MLLMs for video temporal grounding overlook issues with training sample quality (partially annotated samples with ambiguous supervision) and difficulty (hard-to-ground samples with indistinguishable rewards during RL training).
Method: Proposes two agents: Boundary Reflection Agent to identify and filter partially annotated samples by predicting query-relevant timestamps outside annotated intervals, and Difficulty Estimation Agent to assess sample difficulty and implement curriculum RL that dynamically masks hard samples during training.
Result: VideoTG-R1 outperforms full-data counterparts using only 10% of training samples and 21% of computational budget, achieving better performance on VTG and grounded VideoQA tasks under both GRPO and SFT.
Conclusion: The proposed curriculum RL framework with reflected boundary annotations enables data-efficient training by addressing sample quality and difficulty issues, significantly improving learning efficiency and performance in video temporal grounding tasks.
Abstract: Video temporal grounding (VTG) aims to locate precise segments in videos based on language queries, which is a fundamental challenge in video understanding. While recent Multimodal Large Language Models (MLLMs) have shown promise in tackling VTG through reinforcement learning (RL), they overlook the challenges arising from both the quality and difficulty of training samples. (1) Partially annotated samples. Many samples contain relevant segments beyond the annotated interval, introducing ambiguous supervision. (2) Hard-to-ground samples. Samples with poor zero-shot performance produce consistently low and indistinguishable rewards during RL training, exhibiting no clear preference among multiple outputs and thus hindering learning efficiency. To address these challenges, we propose VideoTG-R1, a novel curriculum RL framework with reflected boundary annotations, enabling data-efficient training. Specifically, we propose a Boundary Reflection Agent that utilizes MLLMs to predict query-relevant timestamps outside the annotated intervals, allowing us to identify and filter out partially annotated samples, thereby reducing ambiguity. Furthermore, we introduce a Difficulty Estimation Agent to assess the training difficulty of each sample and design a curriculum RL strategy that dynamically masks the videos of hard-to-ground samples according to the training steps, easing the training difficulty and providing clearer preference. Experiments on the VTG and grounded VideoQA tasks demonstrate the effectiveness of our method. Remarkably, with only 10% of the training samples and 21% of the computational budget, VideoTG-R1 outperforms full-data counterparts under both group relative policy optimization (GRPO) and supervised fine-tuning (SFT). The code is available at https://github.com/ldong1111/VideoTG-R1.
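The curriculum masking can be sketched as a schedule that masks more distractor frames for harder samples early in training and anneals toward the full video; the linear schedule, maximum ratio, and function names below are illustrative assumptions, not the paper's implementation.

```python
def curriculum_mask_ratio(step: int, total_steps: int,
                          difficulty: float, max_ratio: float = 0.5) -> float:
    """Toy curriculum schedule for video masking.

    difficulty in [0, 1] (e.g., produced by a difficulty-estimation agent);
    returns the fraction of non-target video frames to mask out, easing the
    grounding problem early and restoring full difficulty over training.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    return max_ratio * difficulty * (1.0 - progress)

# A sample judged difficulty=0.8 at step 0 gets 40% of its distractor frames
# masked; by the end of training the model sees the full, unmasked video.
ratio = curriculum_mask_ratio(step=0, total_steps=10_000, difficulty=0.8)
```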
[413] Color and Frequency Correction for Image Colorization
Yun Kai Zhuang
Main category: cs.CV
TL;DR: The paper presents optimization schemes that improve DDColor’s image colorization performance by addressing frequency-band limitations and color cast issues.
Details
Motivation: Existing DDColor model has limitations in certain frequency bands and suffers from color cast problems due to insufficient input dimensions.
Method: Constructed two optimization schemes that were combined to enhance DDColor’s performance, specifically targeting the identified limitations.
Result: Achieved performance improvement in image quality metrics including PSNR and SSIM after applying the optimization to DDColor.
Conclusion: The proposed optimization schemes successfully address DDColor’s limitations and improve image coloring performance as measured by standard metrics.
Abstract: This project re-optimizes image colorization based on DDColor, an existing automatic colorization model. Experimenting with the released DDColor weights, we found that the model has limitations in some frequency bands and a color cast problem caused by insufficient input dimensionality. We construct two optimization schemes and combine them, improving indicators such as PSNR and SSIM of the images produced by DDColor.
[414] Symmetria: A Synthetic Dataset for Learning in Point Clouds
Ivan Sipiran, Gustavo Santelices, Lucas Oyarzún, Andrea Ranieri, Chiara Romanengo, Silvia Biasotti, Bianca Falcidieno
Main category: cs.CV
TL;DR: Symmetria is a formula-driven point cloud dataset that overcomes data scarcity by generating symmetric shapes at any scale, enabling efficient pre-training and strong performance on downstream tasks.
Details
Motivation: Point cloud learning faces limitations due to scarce large-scale datasets, unlike image or text domains with abundant data.
Method: Uses symmetry concepts to create shapes with known structure and high variability through formula-driven generation, ensuring precise ground truth availability.
Result: Effective for self-supervised pre-training, yielding models with strong performance in classification and segmentation tasks, showing good few-shot learning capabilities and real-world applicability.
Conclusion: Symmetria provides a scalable, extensible solution for point cloud learning with public availability of dataset and code, promoting further research and innovation.
Abstract: Unlike image or text domains that benefit from an abundance of large-scale datasets, point cloud learning techniques frequently encounter limitations due to the scarcity of extensive datasets. To overcome this limitation, we present Symmetria, a formula-driven dataset that can be generated at any arbitrary scale. By construction, it ensures the absolute availability of precise ground truth, promotes data-efficient experimentation by requiring fewer samples, enables broad generalization across diverse geometric settings, and offers easy extensibility to new tasks and modalities. Using the concept of symmetry, we create shapes with known structure and high variability, enabling neural networks to learn point cloud features effectively. Our results demonstrate that this dataset is highly effective for point cloud self-supervised pre-training, yielding models with strong performance in downstream tasks such as classification and segmentation, which also show good few-shot learning capabilities. Additionally, our dataset can support fine-tuning models to classify real-world objects, highlighting our approach’s practical utility and application. We also introduce a challenging task for symmetry detection and provide a benchmark for baseline comparisons. A significant advantage of our approach is the public availability of the dataset, the accompanying code, and the ability to generate very large collections, promoting further research and innovation in point cloud learning.
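The by-construction ground truth can be illustrated with a toy generator: sample points on one side of a plane and mirror them across it, so the symmetry plane is exact by construction. The hemisphere below is a placeholder shape, not one of the dataset's actual formula-driven families.

```python
import numpy as np

def symmetric_point_cloud(n=1024, normal=(1.0, 0.0, 0.0), rng=None):
    """Generate a point cloud with an exact, known reflective symmetry.

    Half the points are sampled on the unit sphere and folded onto one
    side of the plane through the origin with the given normal, then
    mirrored; the returned normal is exact ground truth for symmetry
    detection.
    """
    rng = rng or np.random.default_rng(0)
    nrm = np.asarray(normal, dtype=float)
    nrm /= np.linalg.norm(nrm)
    half = rng.normal(size=(n // 2, 3))
    half /= np.linalg.norm(half, axis=1, keepdims=True)  # unit-sphere surface
    dots = half @ nrm
    half -= 2.0 * np.minimum(dots, 0.0)[:, None] * nrm   # fold onto one side
    mirrored = half - 2.0 * (half @ nrm)[:, None] * nrm  # Householder reflection
    return np.concatenate([half, mirrored], axis=0), nrm
```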
[415] Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling
Shuhong Zheng, Ashkan Mirzaei, Igor Gilitschenski
Main category: cs.CV
TL;DR: TIRE is a novel method for subject-driven 3D/4D generation that improves identity preservation by tracking, inpainting, and resplatting to modify initial 3D assets while maintaining consistency.
Details
Motivation: Current 3D/4D generation methods fail to preserve semantic identity across viewpoints, and personalized generation remains underexplored despite the importance of subject-driven content creation.
Method: TIRE takes initial 3D assets, uses video tracking to identify regions needing modification, applies subject-driven 2D inpainting to progressively fill these regions, and then resplats the modified 2D multi-view observations back to 3D.
Result: Extensive experiments show TIRE significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods.
Conclusion: The proposed TIRE method effectively addresses the identity preservation problem in subject-driven 3D/4D generation through its track-inpaint-resplat pipeline.
Abstract: Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or few images of a specific subject (also known as Personalization or Subject-driven generation) allows generating visual content that aligns with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model for progressively infilling the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at https://zsh2000.github.io/track-inpaint-resplat.github.io/.
[416] Towards Generalisable Foundation Models for 3D Brain MRI
Moona Mazher, Geoff J. M. Parker, Daniel C. Alexander
Main category: cs.CV
TL;DR: BrainFound is a self-supervised foundation model for brain MRI that adapts DINO-v2 to handle 3D volumetric data, supporting multimodal inputs and outperforming existing methods in label-scarce settings.
Details
Motivation: Foundation models can transform medical imaging by learning from large unlabeled datasets, but existing methods are limited to 2D natural images and single-slice paradigms.
Method: Extends DINO-v2 vision transformer to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices, supporting single- and multimodal inputs.
Result: Consistently outperforms existing self-supervised pretraining strategies and supervised baselines, especially in label-scarce and multi-contrast settings, enhancing diagnostic accuracy.
Conclusion: BrainFound provides a scalable and practical solution for 3D neuroimaging pipelines with significant potential for clinical deployment and research innovation.
Abstract: Foundation models in artificial intelligence (AI) are transforming medical imaging by enabling general-purpose feature learning from large-scale, unlabeled datasets. In this work, we introduce BrainFound, a self-supervised foundation model for brain MRI, built by extending DINO-v2, a vision transformer originally designed for 2D natural images. BrainFound adapts DINO-v2 to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices, moving beyond conventional single-slice paradigms. It supports both single- and multimodal inputs, enabling a broad range of downstream tasks, including disease detection and image segmentation, while generalising across varied imaging protocols and clinical scenarios. We show that BrainFound consistently outperforms existing self-supervised pretraining strategies and supervised baselines, particularly in label-scarce and multi-contrast settings. By integrating information from diverse 3D MRI modalities (e.g., T1, T2, FLAIR), it enhances diagnostic accuracy and reduces dependency on extensive expert annotations. This flexibility makes BrainFound a scalable and practical solution for 3D neuroimaging pipelines, with significant potential for clinical deployment and research innovation.
[417] Quality-controlled registration of urban MLS point clouds reducing drift effects by adaptive fragmentation
Marco Antonio Ortiz Rincon, Yihui Yang, Christoph Holst
Main category: cs.CV
TL;DR: Novel workflow for efficient registration of mobile laser scanning point clouds to target models in urban environments, addressing density variations, noise, and occlusions through two key advancements: SSC preprocessing and PV-GICP fine registration.
Details
Motivation: To address the challenges of registering large-scale mobile laser scanning point clouds in complex urban environments with varying density, noise characteristics, and occlusion scenarios commonly found in city centers.
Method: Two methodological advancements: 1) Semi-sphere Check (SSC) preprocessing that optimally fragments MLS trajectory data using mutually orthogonal planar surfaces to reduce drift impact; 2) Planar Voxel-based Generalized Iterative Closest Point (PV-GICP) for fine registration that selectively uses planar surfaces within voxel partitions.
Result: Achieves sub-0.01 m average registration accuracy on real-world datasets from Munich’s inner city, with computation time reduced by more than 50% compared to conventional point-to-plane ICP methods.
Conclusion: The proposed methods show significant potential for advancing automated 3D urban modeling and updating, with direct applications in urban planning, infrastructure management, and dynamic city monitoring.
Abstract: This study presents a novel workflow designed to efficiently and accurately register large-scale mobile laser scanning (MLS) point clouds to a target model point cloud in urban street scenarios. This workflow specifically targets the complexities inherent in urban environments and adeptly addresses the challenges of integrating point clouds that vary in density, noise characteristics, and occlusion scenarios, which are common in bustling city centers. Two methodological advancements are introduced. First, the proposed Semi-sphere Check (SSC) preprocessing technique optimally fragments MLS trajectory data by identifying mutually orthogonal planar surfaces. This step reduces the impact of MLS drift on the accuracy of the entire point cloud registration, while ensuring sufficient geometric features within each fragment to avoid local minima. Second, we propose Planar Voxel-based Generalized Iterative Closest Point (PV-GICP), a fine registration method that selectively utilizes planar surfaces within voxel partitions. This strategy not only improves registration accuracy but also reduces computation time by more than 50% compared to conventional point-to-plane ICP methods. Experiments on real-world datasets from Munich’s inner city demonstrate that our workflow achieves sub-0.01 m average registration accuracy while significantly shortening processing times. The results underscore the potential of the proposed methods to advance automated 3D urban modeling and updating, with direct applications in urban planning, infrastructure management, and dynamic city monitoring.
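A sketch of the planar-voxel selection that a PV-GICP-style fine registration builds on, assuming a PCA planarity test per voxel; the voxel size, thresholds, and simple per-voxel loop are illustrative, not the paper's implementation.

```python
import numpy as np

def planar_voxels(points, voxel=1.0, planarity_thresh=0.05, min_pts=10):
    """Select planar voxels: fit a plane per voxel via PCA and keep
    voxels whose smallest covariance eigenvalue is relatively tiny.

    Returns (centroid, normal) pairs; a point-to-plane style ICP can
    then run on these voxels only, which is the source of the speedup.
    """
    keys = np.floor(points / voxel).astype(np.int64)
    planes = []
    for key in np.unique(keys, axis=0):
        pts = points[(keys == key).all(axis=1)]
        if len(pts) < min_pts:
            continue
        centered = pts - pts.mean(axis=0)
        evals, evecs = np.linalg.eigh(centered.T @ centered / len(pts))
        if evals[0] / (evals.sum() + 1e-12) < planarity_thresh:
            planes.append((pts.mean(axis=0), evecs[:, 0]))  # normal = smallest eigvec
    return planes
```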
[418] MiCADangelo: Fine-Grained Reconstruction of Constrained CAD Models from 3D Scans
Ahmet Serdar Karadeniz, Dimitrios Mallis, Danila Rukhovich, Kseniya Cherenkova, Anis Kacem, Djamila Aouada
Main category: cs.CV
TL;DR: A novel CAD reverse engineering method that uses multi-plane cross-sections to extract 2D patterns and capture parametric details, enabling reconstruction of editable CAD models with sketch constraints.
Details
Motivation: Existing deep learning approaches for CAD reverse engineering either produce non-parametric outputs or miss fine geometric details, and none incorporate essential sketch-level constraints that are fundamental to CAD modeling.
Method: Leverages multi-plane cross-sections to extract 2D patterns, inspired by how human designers manually perform CAD reverse engineering, enabling capture of fine parametric details and direct incorporation of sketch constraints.
Result: Outperforms state-of-the-art methods and successfully reconstructs detailed, editable CAD models while incorporating sketch constraints for the first time in the reconstruction process.
Conclusion: The proposed approach effectively bridges the gap between geometry-driven and top-down methods, producing fully parametric CAD models with fine-grained details and essential sketch constraints that match human design practices.
Abstract: Computer-Aided Design (CAD) plays a foundational role in modern manufacturing and product development, often requiring designers to modify or build upon existing models. Converting 3D scans into parametric CAD representations, a process known as CAD reverse engineering, remains a significant challenge due to the high precision and structural complexity of CAD models. Existing deep learning-based approaches typically fall into two categories: bottom-up, geometry-driven methods, which often fail to produce fully parametric outputs, and top-down strategies, which tend to overlook fine-grained geometric details. Moreover, current methods neglect an essential aspect of CAD modeling: sketch-level constraints. In this work, we introduce a novel approach to CAD reverse engineering inspired by how human designers manually perform the task. Our method leverages multi-plane cross-sections to extract 2D patterns and capture fine parametric details more effectively. It enables the reconstruction of detailed and editable CAD models, outperforming state-of-the-art methods and, for the first time, incorporating sketch constraints directly into the reconstruction process.
[419] CURVETE: Curriculum Learning and Progressive Self-supervised Training for Medical Image Classification
Asmaa Abbas, Mohamed Gaber, Mohammed M. Abdelsamea
Main category: cs.CV
TL;DR: CURVETE is a novel deep CNN that uses curriculum learning and progressive self-supervised training to address limited samples and irregular class distribution in medical image analysis, achieving superior performance on three medical datasets.
Details
Motivation: Addressing the challenge of limited annotated samples and irregular class distribution in medical image analysis, which reduces the effectiveness of fine-tuning in transfer learning.
Method: Uses curriculum learning based on sample decomposition granularity for pre-training on unlabeled samples, and class decomposition approach for downstream tasks to handle irregular class distribution.
Result: Achieved accuracies of 96.60% (brain tumour), 75.60% (digital knee x-ray), and 93.35% (Mini-DDSM) with ResNet-50; and 95.77%, 80.36%, 93.22% respectively with DenseNet-121, outperforming other training strategies.
Conclusion: CURVETE effectively addresses limited samples and class imbalance issues in medical image analysis through curriculum learning and progressive self-supervised training, demonstrating superior classification performance across multiple medical datasets.
Abstract: Identifying high-quality and easily accessible annotated samples poses a notable challenge in medical image analysis. Transfer learning techniques, leveraging pre-training data, offer a flexible solution to this issue. However, the impact of fine-tuning diminishes when the dataset exhibits an irregular distribution between classes. This paper introduces a novel deep convolutional neural network, named Curriculum Learning and Progressive Self-supervised Training (CURVETE). CURVETE addresses challenges related to limited samples, enhances model generalisability, and improves overall classification performance. It achieves this by employing a curriculum learning strategy based on the granularity of sample decomposition during the training of generic unlabelled samples. Moreover, CURVETE addresses the challenge of irregular class distribution by incorporating a class decomposition approach in the downstream task. The proposed method undergoes evaluation on three distinct medical image datasets: brain tumour, digital knee x-ray, and Mini-DDSM datasets. We investigate the classification performance using a generic self-supervised sample decomposition approach with and without the curriculum learning component in training the pretext task. Experimental results demonstrate that the CURVETE model achieves superior performance on test sets with an accuracy of 96.60% on the brain tumour dataset, 75.60% on the digital knee x-ray dataset, and 93.35% on the Mini-DDSM dataset using the baseline ResNet-50. Furthermore, with the baseline DenseNet-121, it achieved accuracies of 95.77%, 80.36%, and 93.22% on the brain tumour, digital knee x-ray, and Mini-DDSM datasets, respectively, outperforming other training strategies.
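The class-decomposition step can be sketched as clustering each class into subclasses before supervised training; the use of k-means and k=2 here are assumptions, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def class_decompose(features, labels, k=2, seed=0):
    """Split each class into up to k subclasses by clustering its features.

    Training on subclass labels evens out irregular class distributions;
    the returned subclass-to-parent map folds predictions back to the
    original classes at test time.
    """
    sub_labels = np.empty(len(labels), dtype=np.int64)
    parent_of = {}
    next_id = 0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        km = KMeans(n_clusters=min(k, len(idx)), n_init=10, random_state=seed)
        sub_labels[idx] = next_id + km.fit_predict(features[idx])
        for j in range(km.n_clusters):
            parent_of[next_id + j] = c
        next_id += km.n_clusters
    return sub_labels, parent_of
```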
[420] Video-Thinker: Sparking “Thinking with Videos” via Reinforcement Learning
Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Runhao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, Xuelian Cheng
Main category: cs.CV
TL;DR: Video-Thinker extends ‘Thinking with Images’ to video reasoning by enabling MLLMs to autonomously use grounding and captioning capabilities for generating reasoning clues during inference, achieving state-of-the-art performance on video reasoning benchmarks.
Details
Motivation: While 'Thinking with Images' has shown success in image reasoning for MLLMs, this dynamic reasoning paradigm has not been extended to video reasoning tasks, creating a gap in video understanding capabilities.
Method: Proposes Video-Thinker that enables MLLMs to autonomously leverage grounding and captioning capabilities. Uses Video-Thinker-10K dataset with autonomous tool usage in chain-of-thought reasoning. Training involves SFT for reasoning format learning followed by GRPO to strengthen reasoning capability.
Result: Video-Thinker achieves significant performance gains on both in-domain tasks and challenging out-of-domain benchmarks (Video-Holmes, CG-Bench-Reasoning, VRBench). Video-Thinker-7B substantially outperforms existing baselines like Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.
Conclusion: Video-Thinker successfully extends dynamic reasoning to video tasks, enabling MLLMs to autonomously navigate grounding and captioning for video reasoning without needing external tools, demonstrating strong performance across multiple benchmarks.
Abstract: Recent advances in image reasoning methods, particularly “Thinking with Images”, have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic “grounding” and “captioning” capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously navigate grounding and captioning tasks for video reasoning, eliminating the need for constructing and calling external tools. Extensive experiments demonstrate that Video-Thinker achieves significant performance gains on both in-domain tasks and challenging out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Our Video-Thinker-7B substantially outperforms existing baselines such as Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.
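For reference, the group-relative advantage at the heart of GRPO is compact enough to show; this is the standard formulation, not code from the paper.

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward within its prompt group.

    rewards: (n_prompts, n_rollouts) tensor of scalar rewards for
    several sampled responses to the same prompt. The policy gradient
    is then weighted by these advantages, so learning comes from
    preferences among a group's outputs rather than absolute reward.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```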
[421] UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception
Karthikeyan Chandra Sekaran, Markus Geisler, Dominik Rößle, Adithya Mohan, Daniel Cremers, Wolfgang Utschick, Michael Botsch, Werner Huber, Torsten Schön
Main category: cs.CV
TL;DR: UrbanIng-V2X is the first large-scale multi-modal dataset for cooperative perception with vehicles and infrastructure across three urban intersections, addressing limitations of existing single-intersection datasets.
Details
Motivation: Existing cooperative perception datasets are limited to single intersections or single vehicles, causing overfitting and misleading performance due to similar layouts and behavior patterns. A comprehensive multi-intersection dataset is needed for robust benchmarking.
Method: Collected data from three urban intersections in Ingolstadt, Germany with 34 temporally aligned sensor sequences (20 seconds each) involving 2 vehicles and up to 3 infrastructure sensor poles per intersection, using 12 vehicle cameras, 2 vehicle LiDARs, 17 thermal cameras, and 12 infrastructure LiDARs.
Result: Created dataset with approximately 712k annotated 3D bounding boxes across 13 object classes at 10Hz frequency. Provided comprehensive evaluations using state-of-the-art cooperative perception methods.
Conclusion: UrbanIng-V2X enables robust benchmarking of cooperative perception algorithms in diverse traffic environments and helps prevent overfitting by providing multi-intersection data with vehicle-infrastructure collaboration.
Abstract: Recent cooperative perception datasets have played a crucial role in advancing smart mobility applications by enabling information exchange between intelligent agents, helping to overcome challenges such as occlusions and improving overall scene understanding. While some existing real-world datasets incorporate both vehicle-to-vehicle and vehicle-to-infrastructure interactions, they are typically limited to a single intersection or a single vehicle. A comprehensive perception dataset featuring multiple connected vehicles and infrastructure sensors across several intersections remains unavailable, limiting the benchmarking of algorithms in diverse traffic environments. Consequently, overfitting can occur, and models may demonstrate misleadingly high performance due to similar intersection layouts and traffic participant behavior. To address this gap, we introduce UrbanIng-V2X, the first large-scale, multi-modal dataset supporting cooperative perception involving vehicles and infrastructure sensors deployed across three urban intersections in Ingolstadt, Germany. UrbanIng-V2X consists of 34 temporally aligned and spatially calibrated sensor sequences, each lasting 20 seconds. All sequences contain recordings from one of three intersections, involving two vehicles and up to three infrastructure-mounted sensor poles operating in coordinated scenarios. In total, UrbanIng-V2X provides data from 12 vehicle-mounted RGB cameras, 2 vehicle LiDARs, 17 infrastructure thermal cameras, and 12 infrastructure LiDARs. All sequences are annotated at a frequency of 10 Hz with 3D bounding boxes spanning 13 object classes, resulting in approximately 712k annotated instances across the dataset. We provide comprehensive evaluations using state-of-the-art cooperative perception methods and publicly release the codebase, dataset, HD map, and a digital twin of the complete data collection environment.
[422] MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
Xin Jin, Siyuan Li, Siyong Jian, Kai Yu, Huan Wang
Main category: cs.CV
TL;DR: MergeMix is a training-time augmentation paradigm that bridges supervised fine-tuning and reinforcement learning for vision-language alignment in MLLMs, using attention-aware image mixing and preference-driven training.
Details
Motivation: To address the trade-off between scalability, robustness, and alignment quality in vision-language alignment, where SFT requires large annotations and cannot capture subtle preferences, while RL suffers from overhead and instability.
Method: Proposes MergeMix with two components: 1) attention-aware image mixing via token merge with cluster representation and spatial context, and 2) preference-driven training paradigm building preference pairs with mixed and raw images, optimized via SimPO loss.
Result: MergeMix enhances attention consistency and efficiency, surpassing other heuristic-based methods in classification. Achieves competitive accuracy with improved efficiency in extensive experiments.
Conclusion: MergeMix provides a scalable approach to preference alignment in classification and MLLMs, bridging the gap between SFT and RL methods.
Abstract: Vision-language alignment in multi-modal large language models (MLLMs) typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT is stable and efficient but requires large-scale human annotations and cannot capture subtle preferences, while RL brings in a reward signal for training, but suffers from overhead and instability. These limitations highlight a trade-off between scalability, robustness, and alignment quality. To address this, we propose MergeMix, a training-time augmentation paradigm that bridges SFT and RL. It first applies an attention-aware image mixing via token merge with more cluster representation and spatial context, and then presents a preference-driven training paradigm for MLLMs by building preference pairs with mixed images and raw images, and optimizing via SimPO loss. As a mixup augmentation, MergeMix enhances attention consistency and efficiency, surpassing other heuristic-based methods in classification. Extensive experiments demonstrate that MergeMix achieves competitive accuracy with improved efficiency, providing a scalable approach to preference alignment in classification and MLLMs.
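The SimPO objective used for the preference pairs is reference-free and length-normalized; a sketch follows, with the β and γ values as assumed hyperparameters.

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO: length-normalized log-prob margin with a target margin gamma.

    logp_*: summed token log-probs of the preferred and rejected
    responses (here, responses paired with mixed vs. raw images).
    No reference model is needed, which keeps the pipeline between
    SFT-level cost and full RL.
    """
    margin = beta * (logp_chosen / len_chosen
                     - logp_rejected / len_rejected) - gamma
    return -F.logsigmoid(margin).mean()
```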
[423] Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap
Elisabeth Jüttner, Leona Krath, Stefan Korfhage, Hannah Dröge, Matthias B. Hullin, Markus Plack
Main category: cs.CV
TL;DR: A hybrid relighting framework combining diffusion-derived material priors with temporal regularization and physical rendering for stable volumetric video relighting.
Details
Motivation: Current approaches struggle with temporal stability in volumetric video relighting: diffusion methods have stochastic noise in sequences, while video diffusion models face memory and scale limitations.
Method: Combines diffusion-derived material priors with optical-flow-guided temporal regularization, aggregates multiple stochastic estimates into consistent shading, and uses mesh proxy from Gaussian Opacity Fields for indirect effects.
Result: Achieves substantially more stable relighting across sequences than diffusion-only baselines and scales beyond feasible clip lengths for video diffusion.
Conclusion: Hybrid approaches balancing learned priors with physically grounded constraints are a practical step toward production-ready volumetric video relighting.
Abstract: Volumetric video relighting is essential for bringing captured performances into virtual worlds, but current approaches struggle to deliver temporally stable, production-ready results. Diffusion-based intrinsic decomposition methods show promise for single frames, yet suffer from stochastic noise and instability when extended to sequences, while video diffusion models remain constrained by memory and scale. We propose a hybrid relighting framework that combines diffusion-derived material priors with temporal regularization and physically motivated rendering. Our method aggregates multiple stochastic estimates of per-frame material properties into temporally consistent shading components, using optical-flow-guided regularization. For indirect effects such as shadows and reflections, we extract a mesh proxy from Gaussian Opacity Fields and render it within a standard graphics pipeline. Experiments on real and synthetic captures show that this hybrid strategy achieves substantially more stable relighting across sequences than diffusion-only baselines, while scaling beyond the clip lengths feasible for video diffusion. These results indicate that hybrid approaches, which balance learned priors with physically grounded constraints, are a practical step toward production-ready volumetric video relighting.
[424] VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
Walid Bousselham, Hilde Kuehne, Cordelia Schmid
Main category: cs.CV
TL;DR: VOLD is a framework that transfers reasoning capabilities from text-only teacher models to vision-language student models using reinforcement learning and on-policy distillation, achieving state-of-the-art performance on multiple reasoning benchmarks.
Details
Motivation: Vision-language models struggle with complex reasoning due to scarcity of high-quality image-text reasoning data, while text-based reasoning resources are abundant but underutilized for VLM reasoning.
Method: Combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, guided by text-only teacher models, plus cold-start alignment via supervised fine-tuning.
Result: Outperforms baseline models significantly and improves state-of-the-art across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista.
Conclusion: Cold-start alignment is essential for effective knowledge transfer, and on-policy distillation with text-only teachers fails without sufficient distributional alignment between teacher and student models.
Abstract: Training vision-language models (VLMs) for complex reasoning remains a challenging task, in part due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario, and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a clear margin. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
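A minimal sketch of the on-policy distillation term, assuming a reverse-KL pull toward the teacher evaluated on student-sampled trajectories; how VOLD weights this against the GRPO objective is not reproduced here.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits):
    """Reverse KL from student to teacher on student-sampled tokens.

    Both logit tensors, shaped (batch, seq, vocab), are computed on
    sequences sampled by the student (hence "on-policy"); the loss
    pulls the student toward the text-only teacher's distribution,
    which is only informative once SFT has aligned the two models.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    kl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)  # per-token KL
    return kl.mean()
```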
[425] iPac: Incorporating Intra-image Patch Context into Graph Neural Networks for Medical Image Classification
Usama Zidan, Mohamed Gaber, Mohammed M. Abdelsamea
Main category: cs.CV
TL;DR: iPac introduces a novel graph representation for medical image classification using GNNs, achieving 5% accuracy improvement by capturing structural relationships through patch partitioning, clustering, and graph construction.
Details
Motivation: Current graph neural networks for image processing have limited consideration of underlying structure and relationships among visual entities, hindering their performance in image classification tasks.
Method: iPac integrates patch partitioning, feature extraction, clustering, graph construction, and graph-based learning into a unified network to create meaningful graph representations that encapsulate image semantics.
Result: Experimental evaluation on diverse medical image datasets demonstrates iPac achieves up to 5% average accuracy improvement over baseline methods.
Conclusion: iPac offers a versatile and generic solution for medical image classification by leveraging graph representation and accounting for inherent structure and relationships among visual entities.
Abstract: Graph neural networks have emerged as a promising paradigm for image processing, yet their performance in image classification tasks is hindered by limited consideration of the underlying structure and relationships among visual entities. This work presents iPac, a new graph representation of images that enhances graph neural network image classification by recognizing the importance of underlying structure and relationships in medical image classification. iPac integrates various stages, including patch partitioning, feature extraction, clustering, graph construction, and graph-based learning, into a unified network. By capturing relevant features and organising them into clusters, we construct a meaningful graph representation that effectively encapsulates the semantics of the image. Experimental evaluation on diverse medical image datasets demonstrates the efficacy of iPac, exhibiting an average accuracy improvement of up to 5% over baseline methods. Our approach offers a versatile and generic solution for image classification, particularly in the realm of medical images, by leveraging the graph representation and accounting for the inherent structure and relationships among visual entities.
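The partition/cluster/graph pipeline can be sketched end to end with standard tooling; the mean-color patch features, cluster count, and kNN connectivity below are illustrative stand-ins for the paper's learned features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

def build_patch_graph(image, patch=16, n_clusters=8, knn=4):
    """Turn an image into a small graph for a GNN classifier.

    image: (H, W, C) array with H and W divisible by `patch`. Patches
    are described by their mean color, clustered, and each cluster
    centroid becomes a graph node connected to its nearest neighbors.
    """
    h, w, c = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    feats = patches.mean(axis=(1, 3)).reshape(-1, c)  # one feature per patch
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(feats)
    nodes = km.cluster_centers_
    adj = kneighbors_graph(nodes, n_neighbors=knn, mode="connectivity")
    return nodes, adj  # node features and sparse adjacency for the GNN
```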
[426] FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time
Yaoli Liu, Yao-Xiang Ding, Kun Zhou
Main category: cs.CV
TL;DR: FreeFuse is a training-free method for multi-subject text-to-image generation that automatically fuses multiple subject LoRAs using dynamic masks derived from cross-attention weights, eliminating the need for complex preprocessing or auxiliary models.
Details
Motivation: Existing methods for multi-subject generation either require pre-inference LoRA weight merging or rely on segmentation models and complex techniques like noise blending, which are inefficient and impractical for seamless integration into standard workflows.
Method: Uses context-aware dynamic subject masks automatically derived from cross-attention layer weights, applied to LoRA outputs during inference to approximate individual subject LoRA integration for masked regions, requiring only LoRA activation words from users.
Result: Extensive experiments show FreeFuse outperforms existing approaches in both generation quality and usability for multi-subject generation tasks, while being more practical and efficient.
Conclusion: FreeFuse provides a superior training-free solution for multi-subject text-to-image generation that requires no additional training, model modifications, auxiliary models, or complex user inputs, making it highly practical and efficient.
Abstract: This paper proposes FreeFuse, a novel training-free approach for multi-subject text-to-image generation through automatic fusion of multiple subject LoRAs. In contrast to existing methods that either focus on pre-inference LoRA weight merging or rely on segmentation models and complex techniques like noise blending to isolate LoRA outputs, our key insight is that context-aware dynamic subject masks can be automatically derived from cross-attention layer weights. Mathematical analysis shows that directly applying these masks to LoRA outputs during inference closely approximates the case where the subject LoRA is integrated into the diffusion model and used individually for the masked region. FreeFuse demonstrates superior practicality and efficiency as it requires no additional training, no modification to LoRAs, no auxiliary models, and no user-defined prompt templates or region specifications. Instead, it only requires users to provide the LoRA activation words for seamless integration into standard workflows. Extensive experiments validate that FreeFuse outperforms existing approaches in both generation quality and usability under multi-subject generation tasks. The project page is at https://future-item.github.io/FreeFuse/
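A sketch of deriving subject masks from cross-attention and gating LoRA outputs with them; which layers' attention maps are used and how masks are smoothed are details this sketch does not capture.

```python
import torch

def subject_masks_from_attention(attn, token_ids):
    """Assign each pixel to the subject whose activation word it attends to most.

    attn: (heads, pixels, tokens) cross-attention weights from a chosen
    layer; token_ids: token index of each subject LoRA's activation
    word. The i-th mask then gates the i-th LoRA's output, e.g.
    out = base + sum_i masks[i][..., None] * lora_out[i].
    """
    scores = attn.mean(0)[:, token_ids]    # (pixels, n_subjects)
    winner = scores.argmax(dim=-1)         # hard per-pixel assignment
    masks = torch.stack([(winner == i).float()
                         for i in range(len(token_ids))])
    return masks                           # (n_subjects, pixels)
```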
[427] DPGLA: Bridging the Gap between Synthetic and Real Data for Unsupervised Domain Adaptation in 3D LiDAR Semantic Segmentation
Wanmeng Li, Simone Mosco, Daniel Fusaro, Alberto Pretto
Main category: cs.CV
TL;DR: Proposes Dynamic Pseudo-Label Filtering (DPLF) and Prior-Guided Data Augmentation Pipeline (PG-DAP) for point cloud semantic segmentation domain adaptation, achieving state-of-the-art performance.
Details
Motivation: Real-world LiDAR point cloud annotation is costly, and existing unsupervised domain adaptation methods don't effectively utilize unlabeled data due to fixed confidence thresholds.
Method: Dynamic Pseudo-Label Filtering (DPLF) scheme, Prior-Guided Data Augmentation Pipeline (PG-DAP) to mitigate domain shift, and data mixing consistency loss for context-free representations.
Result: Achieves superior performance on synthetic-to-real point cloud semantic segmentation tasks compared to state-of-the-art methods.
Conclusion: The proposed DPLF and PG-DAP modules effectively enhance real data utilization and mitigate domain shift in point cloud semantic segmentation.
Abstract: Annotating real-world LiDAR point clouds for use in intelligent autonomous systems is costly. To overcome this limitation, self-training-based Unsupervised Domain Adaptation (UDA) has been widely used to improve point cloud semantic segmentation by leveraging synthetic point cloud data. However, we argue that existing methods do not effectively utilize unlabeled data, as they rely on predefined or fixed confidence thresholds, resulting in suboptimal performance. In this paper, we propose a Dynamic Pseudo-Label Filtering (DPLF) scheme to enhance real data utilization in point cloud UDA semantic segmentation. Additionally, we design a simple and efficient Prior-Guided Data Augmentation Pipeline (PG-DAP) to mitigate domain shift between synthetic and real-world point clouds. Finally, we utilize a data-mixing consistency loss to push the model to learn context-free representations. We implement and thoroughly evaluate our approach through extensive comparisons with state-of-the-art methods. Experiments on two challenging synthetic-to-real point cloud semantic segmentation tasks demonstrate that our approach achieves superior performance. Ablation studies confirm the effectiveness of the DPLF and PG-DAP modules. We release the code of our method.
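A sketch of a dynamic, per-class pseudo-label filter in the spirit of DPLF, assuming a quantile-based cutoff recomputed each round; the paper's actual scheme may differ.

```python
import numpy as np

def dynamic_pseudo_labels(probs, keep_quantile=0.6):
    """Keep pseudo-labels above a per-class, distribution-derived cutoff.

    probs: (n_points, n_classes) softmax outputs on unlabeled target
    data. Deriving the cutoff from each class's current confidence
    distribution keeps comparable label fractions for easy and hard
    classes as training evolves, unlike a fixed global threshold.
    """
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    keep = np.zeros(len(conf), dtype=bool)
    for c in np.unique(pred):
        idx = pred == c
        cutoff = np.quantile(conf[idx], keep_quantile)  # class-adaptive
        keep[idx] = conf[idx] >= cutoff
    return np.where(keep, pred, -1)  # -1 marks ignored points
```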
[428] EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT
Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Yu Qiao, Jiangmiao Pang
Main category: cs.CV
TL;DR: EgoThinker is a framework that enhances multimodal LLMs for egocentric video reasoning through spatio-temporal chain-of-thought supervision and two-stage learning, achieving state-of-the-art performance on egocentric benchmarks.
Details
Motivation: Current MLLMs lack embodied, first-person understanding needed for egocentric video reasoning, which requires inferring hidden intentions and recognizing fine-grained interactions from the agent's perspective.
Method: Two-stage approach: 1) Created the EgoRe-5M dataset from 13M diverse egocentric video clips, featuring multi-minute segments, CoT rationales, and hand-object grounding; 2) SFT on EgoRe-5M followed by reinforcement fine-tuning for spatio-temporal localization.
Result: EgoThinker outperforms existing methods across multiple egocentric benchmarks and achieves substantial improvements in fine-grained spatio-temporal localization tasks.
Conclusion: The framework successfully bridges the gap in egocentric reasoning capabilities for MLLMs through comprehensive dataset construction and specialized training methodology.
Abstract: Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ SFT on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.
[429] More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models
Hongkai Lin, Dingkang Liang, Mingyang Du, Xin Zhou, Xiang Bai
Main category: cs.CV
TL;DR: MERGE is a unified model that enables both image generation and depth estimation using pre-trained text-to-image diffusion models, preserving original generation capabilities while adding depth estimation through pluggable converters.
Details
Motivation: To leverage pre-trained text-to-image diffusion models for depth estimation without catastrophic degradation of their original image generation capabilities.
Method: Introduces a play-and-plug framework with simple pluggable converters for switching between modes, and a Group Reuse Mechanism to improve parameter utilization.
Result: Achieves state-of-the-art performance across multiple depth estimation benchmarks while maintaining original image generation quality.
Conclusion: MERGE demonstrates that pre-trained text-to-image models can be effectively expanded to depth estimation without compromising their core generation capabilities.
Abstract: Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation, starting from a fixed pre-trained text-to-image model. MERGE demonstrates that the pre-trained text-to-image model can do more than image generation: it can also be extended to depth estimation effortlessly. Specifically, MERGE introduces a play-and-plug framework that enables seamless switching between image generation and depth estimation modes through simple and pluggable converters. Meanwhile, we propose a Group Reuse Mechanism to encourage parameter reuse and improve the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pre-trained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance across multiple depth estimation benchmarks. The code will be made available at https://github.com/H-EmbodVis/MERGE
[430] Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation
Junyoung Seo, Rodrigo Mira, Alexandros Haliassos, Stella Bounareli, Honglie Chen, Linh Tran, Seungryong Kim, Zoe Landgraf, Jie Shen
Main category: cs.CV
TL;DR: Lookahead Anchoring addresses identity drift in audio-driven human animation by using future keyframes as directional beacons instead of fixed temporal anchors, enabling better identity preservation without restricting natural motion dynamics.
Details
Motivation: Audio-driven human animation models suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. Existing solutions using keyframes as temporal anchors require additional generation stages and can restrict natural motion.
Method: Proposes Lookahead Anchoring which leverages keyframes from future timesteps ahead of the current generation window. This transforms keyframes from fixed boundaries into directional beacons that the model continuously pursues while responding to immediate audio cues. Also enables self-keyframing where reference images serve as lookahead targets.
Result: Applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality. The temporal lookahead distance naturally controls balance between expressivity and consistency.
Conclusion: Lookahead Anchoring demonstrates improved temporal conditioning across different architectures, eliminating the need for keyframe generation while maintaining consistent identity through persistent guidance from future anchors.
Abstract: Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to generate keyframes as intermediate temporal anchors that prevent degradation, but this requires an additional keyframe generation stage and can restrict natural motion dynamics. To address this, we propose Lookahead Anchoring, which leverages keyframes from future timesteps ahead of the current generation window, rather than within it. This transforms keyframes from fixed boundaries into directional beacons: the model continuously pursues these future anchors while responding to immediate audio cues, maintaining consistent identity through persistent guidance. This also enables self-keyframing, where the reference image serves as the lookahead target, eliminating the need for keyframe generation entirely. We find that the temporal lookahead distance naturally controls the balance between expressivity and consistency: larger distances allow for greater motion freedom, while smaller ones strengthen identity adherence. When applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality, demonstrating improved temporal conditioning across several different architectures. Video results are available at the following link: https://lookahead-anchoring.github.io.
[431] FARMER: Flow AutoRegressive Transformer over Pixels
Guangting Zheng, Qinyu Zhao, Tao Yang, Fei Xiao, Zhijie Lin, Jie Wu, Jiajun Deng, Yanyong Zhang, Rui Zhu
Main category: cs.CV
TL;DR: FARMER is a novel generative framework that unifies Normalizing Flows and Autoregressive models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels.
Details
Motivation: Directly modeling explicit likelihood of raw data distribution is key in ML, but continuous AR modeling over visual pixel data suffers from extremely long sequences and high-dimensional spaces.
Method: Uses invertible autoregressive flow to transform images into latent sequences, with self-supervised dimension reduction to partition latent channels, one-step distillation for faster inference, and resampling-based classifier-free guidance.
Result: Achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
Conclusion: FARMER successfully unifies NF and AR models for efficient and effective image generation with tractable likelihood estimation.
Abstract: Directly modeling the explicit likelihood of the raw data distribution is a key topic in machine learning, and autoregressive modeling of it underlies the scaling successes of Large Language Models. However, continuous AR modeling over visual pixel data suffers from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference speed and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
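The exact-likelihood claim rests on the change of variables formula, log p_X(x) = log p_Z(f(x)) + log|det df/dx|; a minimal sketch, assuming a flow module that returns latents together with the log-determinant of its (triangular, hence cheap) Jacobian.

```python
import torch

def flow_log_likelihood(x, flow, base):
    """Exact log-likelihood of x under a normalizing flow.

    `flow` is assumed to map x to (z, log_det) with log_det of shape
    (batch,); `base` is a factorized torch.distributions object over
    the latents. For an autoregressive flow the Jacobian is triangular,
    so log_det is just a sum of per-dimension log-scales.
    """
    z, log_det = flow(x)
    return base.log_prob(z).sum(dim=-1) + log_det
```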
[432] InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras
Erich Liang, Roma Bhattacharjee, Sreemanti Dey, Rafael Moschopoulos, Caitlin Wang, Michel Liao, Grace Tan, Andrew Wang, Karhan Kayan, Stamatis Alexandropoulos, Jia Deng
Main category: cs.CV
TL;DR: InFlux is a real-world benchmark providing per-frame ground truth camera intrinsics for 386 videos with dynamic intrinsics, featuring 143K+ annotated frames to address the lack of dynamic intrinsics benchmarks.
Details
Motivation: Most 3D algorithms assume constant camera intrinsics, but real-world videos often have dynamic intrinsics. Existing benchmarks lack diversity and per-frame intrinsic annotations, limiting research in this area.
Method: Built a comprehensive lookup table of calibration experiments and extended the Kalibr toolbox to improve accuracy and robustness for per-frame intrinsics annotation. Collected 386 high-resolution indoor/outdoor videos with dynamic intrinsics.
Result: Created InFlux benchmark with 143K+ annotated frames showing wider intrinsic variations and scene diversity than prior benchmarks. Evaluation revealed most existing methods struggle with dynamic intrinsics prediction.
Conclusion: InFlux addresses the critical gap in dynamic camera intrinsics benchmarks and enables better evaluation of methods for handling changing camera parameters in real-world videos.
Abstract: Accurately tracking camera intrinsics is crucial for achieving 3D understanding from 2D video. However, most 3D algorithms assume that camera intrinsics stay constant throughout a video, which is often not true for many real-world in-the-wild videos. A major obstacle in this field is a lack of dynamic camera intrinsics benchmarks: existing benchmarks typically offer limited diversity in scene content and intrinsics variation, and none provide per-frame intrinsic changes for consecutive video frames. In this paper, we present Intrinsics in Flux (InFlux), a real-world benchmark that provides per-frame ground truth intrinsics annotations for videos with dynamic intrinsics. Compared to prior benchmarks, InFlux captures a wider range of intrinsic variations and scene diversity, featuring 143K+ annotated frames from 386 high-resolution indoor and outdoor videos with dynamic camera intrinsics. To ensure accurate per-frame intrinsics, we build a comprehensive lookup table of calibration experiments and extend the Kalibr toolbox to improve its accuracy and robustness. Using our benchmark, we evaluate existing baseline methods for predicting camera intrinsics and find that most struggle to achieve accurate predictions on videos with dynamic intrinsics. For the dataset, code, videos, and submission, please visit https://influx.cs.princeton.edu/.
[433] PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection
Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, Zhe Gan
Main category: cs.CV
TL;DR: PRISM-Bench is a visual puzzle benchmark that evaluates models’ reasoning by requiring them to identify the first incorrect step in chain-of-thought explanations, rather than just measuring final answer accuracy.
Details
Motivation: Current evaluations only measure final-answer accuracy, lacking insight into how models reason. There's a need to assess logical consistency, error detection, and visual reasoning capabilities more deeply.
Method: Introduces diagnostic tasks where models must identify the first incorrect step in chain-of-thought explanations for visual puzzles that require multi-step symbolic, geometric, and analogical reasoning.
Result: Evaluations reveal a gap between fluent generation and faithful reasoning - models that produce plausible chain-of-thoughts often fail to locate simple logical faults in reasoning chains.
Conclusion: PRISM-Bench provides a sharper evaluation of multimodal reasoning competence and highlights the need for diagnostic evaluation protocols to develop trustworthy multimodal large language models.
Abstract: We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
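Scoring the diagnostic task reduces to exact-match accuracy over predicted first-error indices; a minimal sketch (the benchmark's official protocol may add parsing and tie-breaking details).

```python
def first_error_accuracy(predicted_steps, gold_steps):
    """Fraction of items where the predicted first-error step index
    matches the annotated one (each CoT contains exactly one error)."""
    assert len(predicted_steps) == len(gold_steps)
    hits = sum(int(p == g) for p, g in zip(predicted_steps, gold_steps))
    return hits / len(gold_steps)

# Example: first_error_accuracy([3, 1, 4], [3, 2, 4]) -> 2/3
```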
[434] PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
Yuqian Yuan, Wenqiao Zhang, Xin Li, Shihao Wang, Kehan Li, Wentong Li, Jun Xiao, Lei Zhang, Beng Chin Ooi
Main category: cs.CV
TL;DR: PixelRefer is a unified region-level MLLM framework for fine-grained object-centric understanding across images and videos, with an efficient variant PixelRefer-Lite that reduces computational cost.
Details
Motivation: Existing MLLMs focus on holistic scene-level understanding but lack fine-grained object-centric reasoning capabilities needed for detailed visual comprehension.
Method: Proposes Scale-Adaptive Object Tokenizer (SAOT) for compact object representations, and PixelRefer-Lite with Object-Centric Infusion module to pre-fuse global context into object tokens for efficiency.
Result: Achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable efficiency gains across various benchmarks.
Conclusion: PixelRefer enables advanced fine-grained understanding over user-specified regions, bridging the gap between scene-level and object-level comprehension in multimodal models.
Abstract: Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.
[435] Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao
Main category: cs.CV
TL;DR: Concerto is a minimalist simulation of human concept learning for spatial cognition that combines 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding, achieving state-of-the-art performance in 3D scene understanding.
Details
Motivation: Inspired by how humans learn abstract concepts through multisensory synergy and can recall them from single modalities, the authors aim to develop a similar approach for spatial cognition.
Method: Concerto combines 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding to learn coherent spatial features. It also includes variants for video-lifted point cloud understanding and a translator to project representations into CLIP’s language space.
Result: Concerto outperforms standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8% respectively in linear probing for 3D scene perception. With full fine-tuning, it achieves new SOTA results (e.g., 80.7% mIoU on ScanNet) and enables open-world perception.
Conclusion: Concerto learns spatial representations with superior fine-grained geometric and semantic consistency, demonstrating the effectiveness of multisensory-inspired learning for spatial cognition.
Abstract: Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP’s language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
[436] Invertible generative models for inverse problems: mitigating representation error and dataset bias
Muhammad Asim, Mara Daniels, Oscar Leong, Ali Ahmed, Paul Hand
Main category: cs.CV
TL;DR: Invertible neural networks serve as effective priors for inverse imaging problems like denoising, compressive sensing, and inpainting, outperforming traditional sparsity priors and GAN priors due to zero representation error.
Details
Motivation: Traditional generative models like GANs have representation limitations due to architectural choices, mode collapse, and training dataset bias, which can prevent them from accurately representing certain images.
Method: Using invertible neural networks as priors with empirical risk formulation under regularization that promotes high likelihood images, either through penalization or algorithmic initialization.
Result: Invertible priors achieve higher accuracy than sparsity priors across most undersampling ratios in compressive sensing and better reconstructions than GAN priors for images with rare features or out-of-distribution natural images.
Conclusion: Invertible neural networks are superior priors for inverse problems due to their zero representation error, theoretical recovery bounds, and robustness to dataset bias and rare features.
Abstract: Trained generative models have shown remarkable performance as priors for inverse problems in imaging – for example, Generative Adversarial Network priors permit recovery of test images from 5-10x fewer measurements than sparsity priors. Unfortunately, these models may be unable to represent any particular image because of architectural choices, mode collapse, and bias in the training dataset. In this paper, we demonstrate that invertible neural networks, which have zero representation error by design, can be effective natural signal priors at inverse problems such as denoising, compressive sensing, and inpainting. Given a trained generative model, we study the empirical risk formulation of the desired inverse problem under a regularization that promotes high likelihood images, either directly by penalization or algorithmically by initialization. For compressive sensing, invertible priors can yield higher accuracy than sparsity priors across almost all undersampling ratios, and due to their lack of representation error, invertible priors can yield better reconstructions than GAN priors for images that have rare features of variation within the biased training set, including out-of-distribution natural images. We additionally compare performance for compressive sensing to unlearned methods, such as the deep decoder, and we establish theoretical bounds on expected recovery error in the case of a linear invertible model.
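The empirical risk formulation with a likelihood-promoting penalty admits a compact sketch: with an invertible generator G, every image is some G(z), so one can optimize the latent directly. Below, a random linear map stands in for a trained flow, and the penalty weight and step count are illustrative assumptions.

```python
# Minimal sketch of MAP recovery for compressive sensing with an invertible
# prior: min_z ||A G(z) - y||^2 + gamma * ||z||^2, where the ||z||^2 term
# promotes high-likelihood images under a Gaussian latent. A random linear
# map (almost surely invertible) stands in for a trained flow.
import torch

n, m = 64, 16                       # signal dim, number of measurements
G_mat = torch.randn(n, n)           # stand-in "generator" with zero
G = lambda z: G_mat @ z             # representation error
A = torch.randn(m, n) / m ** 0.5    # measurement matrix
x_true = G(torch.randn(n))
y = A @ x_true

z = torch.zeros(n, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)
gamma = 1e-3
for _ in range(500):
    opt.zero_grad()
    loss = ((A @ G(z) - y) ** 2).sum() + gamma * (z ** 2).sum()
    loss.backward()
    opt.step()
# Diagnostic only; quality depends on the prior and undersampling ratio.
print(f"relative error: {(G(z) - x_true).norm() / x_true.norm():.3f}")
```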
[437] Unbiased Scene Graph Generation from Biased Training
Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, Hanwang Zhang
Main category: cs.CV
TL;DR: The paper proposes a causal inference-based framework for debiasing scene graph generation (SGG) by distinguishing between good context priors and bad long-tailed biases using counterfactual causality analysis.
Details
Motivation: Current SGG methods suffer from severe training bias that collapses diverse relationships into generic ones, making downstream tasks like VQA unable to infer proper scene structures. Traditional debiasing methods cannot distinguish between beneficial context priors and harmful long-tailed biases.
Method: Builds a causal graph for SGG, performs traditional biased training, then uses counterfactual causality to infer and remove the effect of bad bias. Uses Total Direct Effect (TDE) as the final predicate score for unbiased SGG.
Result: Significant improvements over previous state-of-the-art methods on the Visual Genome benchmark when using the proposed Scene Graph Diagnosis toolkit with several prevailing models.
Conclusion: The proposed causal inference framework effectively debiases SGG by removing harmful biases while preserving beneficial context priors, and is model-agnostic for wide applicability in the SGG community.
Abstract: Today’s scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e.g., collapsing diverse “human walk on / sit on / lay on beach” into “human on beach”. Given such SGG, the down-stream tasks such as VQA can hardly infer better scene structures than merely a bag of objects. However, debiasing in SGG is not trivial because traditional debiasing methods cannot distinguish between the good and bad bias, e.g., good context prior (e.g., “person read book” rather than “eat”) and bad long-tailed bias (e.g., “near” dominating “behind / in front of”). In this paper, we present a novel SGG framework based on causal inference but not the conventional likelihood. We first build a causal graph for SGG, and perform traditional biased training with the graph. Then, we propose to draw the counterfactual causality from the trained graph to infer the effect from the bad bias, which should be removed. In particular, we use Total Direct Effect (TDE) as the proposed final predicate score for unbiased SGG. Note that our framework is agnostic to any SGG model and thus can be widely applied in the community who seeks unbiased predictions. By using the proposed Scene Graph Diagnosis toolkit on the SGG benchmark Visual Genome and several prevailing models, we observed significant improvements over the previous state-of-the-art methods.
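A minimal sketch of TDE-style inference, under the common reading that the unbiased score is the factual prediction minus a counterfactual one computed with the visual evidence wiped out; the toy classifier and the mean-feature "wiping" below are assumptions, not the paper's exact model.

```python
# Total Direct Effect (TDE) sketch: subtract the context-only counterfactual
# prediction from the factual one, removing bias carried by context alone.
import torch

def tde_scores(model, vis_feat, ctx_feat, wiped_feat):
    """model(vis, ctx) -> predicate logits. All args: (B, D) tensors."""
    factual = model(vis_feat, ctx_feat)           # biased prediction
    counterfactual = model(wiped_feat, ctx_feat)  # visual evidence wiped
    return factual - counterfactual               # keep only the direct effect

# Toy stand-in for a trained predicate classifier.
torch.manual_seed(0)
W_v, W_c = torch.randn(50, 128), torch.randn(50, 128)
model = lambda v, c: v @ W_v.T + c @ W_c.T
vis, ctx = torch.randn(4, 128), torch.randn(4, 128)
wiped = vis.mean(dim=0, keepdim=True).expand_as(vis)  # one choice of "wiping"
print(tde_scores(model, vis, ctx, wiped).argmax(dim=-1))
```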
[438] Long-Tailed Classification by Keeping the Good and Removing the Bad Momentum Causal Effect
Kaihua Tang, Jianqiang Huang, Hanwang Zhang
Main category: cs.CV
TL;DR: The paper establishes a causal inference framework for long-tailed classification, identifying SGD momentum as a confounder that harms tail prediction but benefits representation learning. The method uses causal intervention in training and counterfactual reasoning in inference to achieve state-of-the-art results.
Details
Motivation: Long-tailed classification is challenging due to imbalanced datasets, especially when multiple instances coexist in one image. Existing methods lack theoretical foundation and are based on heuristic re-weighting/re-sampling approaches.
Method: Proposes a causal inference framework that identifies SGD momentum as a confounder. Uses causal intervention during training and counterfactual reasoning during inference to disentangle the paradoxical effects of momentum: removing its harmful effects on tail prediction while preserving benefits for representation learning.
Result: Achieves new state-of-the-art performance on three long-tailed visual recognition benchmarks: Long-tailed CIFAR-10/-100, ImageNet-LT for image classification, and LVIS for instance segmentation.
Conclusion: The causal inference framework provides a principled solution to long-tailed classification by theoretically explaining previous methods and deriving a new approach that effectively addresses the momentum confounder problem.
Abstract: As the class size grows, maintaining a balanced dataset across many classes is challenging because the data are long-tailed in nature; it is even impossible when the samples of interest co-exist in one collectable unit, e.g., multiple visual instances in one image. Therefore, long-tailed classification is the key to deep learning at scale. However, existing methods are mainly based on re-weighting/re-sampling heuristics that lack a fundamental theory. In this paper, we establish a causal inference framework, which not only unravels the whys of previous methods, but also derives a new principled solution. Specifically, our theory shows that the SGD momentum is essentially a confounder in long-tailed classification. On one hand, it has a harmful causal effect that misleads the tail prediction biased towards the head. On the other hand, its induced mediation also benefits the representation learning and head prediction. Our framework elegantly disentangles the paradoxical effects of the momentum, by pursuing the direct causal effect caused by an input sample. In particular, we use causal intervention in training, and counterfactual reasoning in inference, to remove the “bad” while keeping the “good”. We achieve new state-of-the-art results on three long-tailed visual recognition benchmarks: Long-tailed CIFAR-10/-100, ImageNet-LT for image classification and LVIS for instance segmentation.
[439] Weakly Supervised Learning for Facial Behavior Analysis : A Review
R. Gnana Praveen, Patrick Cardinal, Eric Granger
Main category: cs.CV
TL;DR: This paper provides a comprehensive review of weakly supervised learning approaches for facial behavior analysis, addressing the challenges of obtaining large annotated datasets in real-world conditions.
Details
Motivation: Deep learning approaches for facial behavior analysis require large annotated datasets, but manual labeling is difficult, time-consuming, and prone to expert bias, especially for expression intensities. There is a critical need for methods that can work with weak annotations.
Method: The paper systematically reviews existing weakly supervised learning approaches for facial behavior analysis, categorizes them into a taxonomy, analyzes their insights and limitations, and summarizes widely used datasets and evaluation principles.
Result: The review organizes the state-of-the-art WSL methods, identifies their strengths and weaknesses, and provides performance summaries across different datasets and evaluation metrics used in facial behavior analysis research.
Conclusion: The paper highlights remaining challenges and opportunities in applying weakly supervised learning for facial behavior analysis in real-life situations, suggesting potential research directions to advance the field.
Abstract: In recent years, there has been a shift in facial behavior analysis from laboratory-controlled conditions to challenging in-the-wild conditions, driven by the superior performance of deep learning based approaches in many real-world applications. However, the performance of deep learning approaches relies on the amount of training data. One of the major problems with data acquisition is the requirement of annotations for large amounts of training data. Labeling huge training datasets demands substantial human support with strong domain expertise for facial expressions or action units, which is difficult to obtain in real-time environments. Moreover, the labeling process is highly vulnerable to the ambiguity of expressions or action units, especially for intensities, due to the bias induced by the domain experts. Therefore, there is an imperative need to address the problem of facial behavior analysis with weak annotations. In this paper, we provide a comprehensive review of weakly supervised learning (WSL) approaches for facial behavior analysis with both categorical as well as dimensional labels, along with the associated challenges and potential research directions. First, we introduce the various types of weak annotations in the context of facial behavior analysis and the corresponding challenges. We then systematically review the existing state-of-the-art approaches and provide a taxonomy of these approaches along with their insights and limitations. In addition, widely used datasets in the reviewed literature and the performance of these approaches, along with evaluation principles, are summarized. Finally, we discuss the remaining challenges and opportunities, along with potential research directions, for applying facial behavior analysis with weak labels in real-life situations.
[440] Revisiting Transformation Invariant Geometric Deep Learning: An Initial Representation Perspective
Ziwei Zhang, Xin Wang, Zeyang Zhang, Peng Cui, Wenwu Zhu
Main category: cs.CV
TL;DR: TinvNN is a plug-in method that achieves transformation invariance for geometric data by using transformation-invariant initial point representations, eliminating the need for complex neural layer designs.
Details
Motivation: Existing graph neural networks only maintain permutation-invariance but fail to guarantee invariance to other transformations like translation, rotation, and scaling. Current transformation-invariant layers are computationally expensive and difficult to extend.
Method: Modify multi-dimensional scaling to create transformation-invariant and distance-preserving initial point representations, then feed these representations into existing neural networks as a plug-in component.
Result: Extensive experiments on point cloud analysis and combinatorial optimization demonstrate TinvNN’s effectiveness and general applicability. The method can also be extended to equivariance cases.
Conclusion: TinvNN provides a straightforward and general solution for transformation invariance that should be considered as an essential baseline for geometric deep learning studies.
Abstract: Deep neural networks have achieved great success in the last decade. When designing neural networks to handle the ubiquitous geometric data such as point clouds and graphs, it is critical that the model can maintain invariance towards various transformations such as translation, rotation, and scaling. Most existing graph neural network (GNN) approaches can only maintain permutation-invariance, failing to guarantee invariance with respect to other transformations. Besides GNNs, other works design sophisticated transformation-invariant layers, which are computationally expensive and difficult to be extended. In this paper, we revisit why general neural networks cannot maintain transformation invariance. Our findings show that transformation-invariant and distance-preserving initial point representations are sufficient to achieve transformation invariance rather than needing sophisticated neural layer designs. Motivated by these findings, we propose Transformation Invariant Neural Networks (TinvNN), a straightforward and general plug-in for geometric data. Specifically, we realize transformation invariant and distance-preserving initial point representations by modifying multi-dimensional scaling and feed the representations into existing neural networks. We prove that TinvNN can strictly guarantee transformation invariance, being general and flexible enough to be combined with the existing neural networks. Extensive experimental results on point cloud analysis and combinatorial optimization demonstrate the effectiveness and general applicability of our method. We also extend our method into equivariance cases. Based on the results, we advocate that TinvNN should be considered as an essential baseline for further studies of transformation-invariant geometric deep learning.
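Classical multi-dimensional scaling, which the paper modifies, already illustrates the key property: coordinates recovered from pairwise distances are unaffected by translating or rotating the input points. A minimal NumPy sketch:

```python
# Classical MDS: recover coordinates from pairwise distances. Distances are
# unchanged by translation and rotation, so the resulting initial point
# representations are transformation-invariant and distance-preserving.
import numpy as np

def classical_mds(D: np.ndarray, k: int) -> np.ndarray:
    """D: (n, n) pairwise Euclidean distances; returns (n, k) coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]           # top-k eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

pts = np.random.randn(10, 3)
R, _ = np.linalg.qr(np.random.randn(3, 3))     # random rotation
pts_rot = pts @ R + 5.0                        # rotate + translate
dist = lambda P: np.linalg.norm(P[:, None] - P[None, :], axis=-1)
X1, X2 = classical_mds(dist(pts), 3), classical_mds(dist(pts_rot), 3)
print(np.allclose(dist(X1), dist(X2), atol=1e-6))  # True: same geometry
```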
[441] Blockchain and Biometrics: Survey, GDPR Analysis, and Future Directions
Mahdi Ghafourian, Bilgesu Sumer, Ruben Vera-Rodriguez, Julian Fierrez, Ruben Tolosana, Aythami Morales, Els Kindt
Main category: cs.CV
TL;DR: This paper surveys the integration of blockchain technology with biometric systems, discussing applications in PKI, identity management, and legal considerations under GDPR.
Details
Motivation: To explore the potential benefits and challenges of combining blockchain's decentralized ledger technology with biometric recognition systems, as both technologies are rapidly evolving and being deployed in various applications.
Method: The paper conducts a comprehensive literature survey on blockchain-biometrics integration, analyzes practical applications in PKI systems and identity management, and performs legal analysis based on GDPR requirements.
Result: The integration shows promise for applications like distributed trusted services and identity management, but faces challenges including efficiency limitations for real-time applications and legal compliance issues with GDPR.
Conclusion: While blockchain-biometrics integration is still emerging, it offers significant potential for secure identity management, though current blockchain limitations and GDPR compliance requirements present important challenges that need to be addressed in future research.
Abstract: Biometric recognition as an efficient and hard-to-forge way of identification and verification has become an indispensable part of the current digital world. The fast evolution of this technology has been a strong incentive for integration into many applications. Meanwhile, blockchain, the decentralized ledger technology, has been widely received by both research and industry in the past few years, and it is being increasingly deployed today in many different applications, such as money transfer, IoT, healthcare, or logistics. Recently, researchers have started to speculate on the pros and cons and what the best applications would be when these two technologies cross paths. This paper provides a survey of the research literature on the combination of blockchain and biometrics and includes a first legal analysis of this integration based on GDPR to shed light on challenges and potentials. Although the integration of blockchain technology into the biometric sector is still in its infancy, with a growing body of literature discussing specific applications and advanced technological setups, this paper aims to provide a holistic understanding of blockchain applicability in biometrics. Based on published studies, this article discusses, among others, practical examples combining blockchain and biometrics for novel applications in PKI systems, distributed trusted services, and identity management. Challenges and limitations when combining blockchain and biometrics that motivate future work will also be discussed; e.g., blockchain networks at their current stage may not be efficient or economical for some real-time biometric applications. Finally, we also discuss key legal aspects of the EU General Data Protection Regulation (GDPR) related to this combination of technologies (blockchain and biometrics); for example, accountability, immutability, anonymity, and data protection elements.
[442] TransFace++: Rethinking the Face Recognition Paradigm with a Focus on Accuracy, Efficiency, and Security
Jun Dan, Yang Liu, Baigui Sun, Jiankang Deng, Shan Luo
Main category: cs.CV
TL;DR: TransFace and TransFace++ are novel face recognition frameworks that use Vision Transformers (ViTs) and image bytes instead of CNNs and RGB images to address efficiency, security, and precision limitations in existing FR systems.
Details
Motivation: To overcome three key problems in current face recognition: CNN's inability to capture global facial features and model local feature correlations, inefficiency of RGB image processing, and security vulnerabilities in RGB-based systems that compromise user privacy.
Method: Proposed two frameworks: TransFace uses Vision Transformers (ViTs) to better capture global facial features and model correlations between local features. TransFace++ explores using image bytes instead of RGB images to improve efficiency and security.
Result: Experiments on popular face benchmarks demonstrate the superiority of both TransFace and TransFace++ frameworks compared to existing methods.
Conclusion: The proposed frameworks successfully explore the feasibility of applying ViTs and image bytes to face recognition tasks, addressing key limitations in efficiency, security, and precision of traditional CNN-based RGB approaches.
Abstract: Face Recognition (FR) technology has made significant strides with the emergence of deep learning. Typically, most existing FR models are built upon Convolutional Neural Networks (CNN) and take RGB face images as the model’s input. In this work, we take a closer look at existing FR paradigms from high-efficiency, security, and precision perspectives, and identify the following three problems: (i) CNN frameworks are weak at capturing global facial features and modeling the correlations between local facial features. (ii) Selecting RGB face images as the model’s input greatly degrades the model’s inference efficiency, incurring extra computation costs. (iii) In a real-world FR system that operates on RGB face images, the integrity of user privacy may be compromised if hackers successfully penetrate and gain access to the input of this model. To solve these three issues, we propose two novel FR frameworks, i.e., TransFace and TransFace++, which successfully explore the feasibility of applying ViTs and image bytes to FR tasks, respectively. Experiments on popular face benchmarks demonstrate the superiority of our TransFace and TransFace++. Code is available at https://github.com/DanJun6737/TransFace_pp.
[443] Open-Set 3D Semantic Instance Maps for Vision Language Navigation – O3D-SIM
Laksh Nanwani, Kumaraditya Gupta, Aditya Mathur, Swayam Agrawal, A. H. Abdul Hafez, K. Madhava Krishna
Main category: cs.CV
TL;DR: This paper extends instance-level semantic mapping to 3D by leveraging foundational models for object recognition, segmentation, and feature extraction, creating 3D point cloud maps with instance-level embeddings that enable natural language querying.
Details
Motivation: To improve language-guided navigation tasks by creating more robust 3D semantic maps that provide better instance-level understanding and semantic comprehension of environments, building on previous 2D instance-level mapping work.
Method: Uses foundational models for object recognition, image segmentation, and feature extraction to create 3D point cloud maps with instance-level embeddings that can be queried using natural language commands.
Result: Quantitatively improves success rate of language-guided tasks and qualitatively shows clearer instance identification and ability to recognize objects that closed-set approaches would miss, leveraging language and image-aligned embeddings.
Conclusion: The proposed 3D instance-level semantic mapping approach successfully enhances both quantitative performance and qualitative understanding for language-guided navigation tasks by incorporating semantic comprehension through foundational models.
Abstract: Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries. Our previous work, SI Maps (Nanwani L, Agarwal A, Jain K, et al. Instance-level semantic maps for vision language navigation. In: 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE; 2023 Aug.), showed that having instance-level information and the semantic understanding of an environment helps significantly improve performance for language-guided tasks. We extend this instance-level approach to 3D while increasing the pipeline’s robustness and improving quantitative and qualitative results. Our method leverages foundational models for object recognition, image segmentation, and feature extraction. We propose a representation that results in a 3D point cloud map with instance-level embeddings, which bring in the semantic understanding that natural language commands can query. Quantitatively, the work improves upon the success rate of language-guided tasks. At the same time, we qualitatively observe the ability to identify instances more clearly and leverage the foundational models and language and image-aligned embeddings to identify objects that, otherwise, a closed-set approach wouldn’t be able to identify. Project Page - https://smart-wheelchair-rrc.github.io/o3d-sim-webpage
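A sketch of the querying step such a map enables: each 3D instance stores a language-aligned embedding, and a natural-language command is answered by cosine similarity in that space. Random vectors below stand in for real CLIP-style features.

```python
# Querying an instance-level 3D map with natural language: retrieve the
# instance whose embedding best matches the embedded query.
import numpy as np

def query_map(instance_embs: np.ndarray, text_emb: np.ndarray) -> int:
    inst = instance_embs / np.linalg.norm(instance_embs, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb)
    return int(np.argmax(inst @ txt))          # index of best-matching instance

rng = np.random.default_rng(0)
instances = rng.normal(size=(20, 512))         # per-instance embeddings
query = rng.normal(size=512)                   # embedded query, e.g. "the red mug"
print(query_map(instances, query))
```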
[444] Steerable Transformers for Volumetric Data
Soumyabrata Kundu, Risi Kondor
Main category: cs.CV
TL;DR: Steerable Transformers extend Vision Transformers to maintain SE(d) equivariance using steerable convolutions and Fourier space operations, improving performance in 2D and 3D tasks.
Details
Motivation: To create vision transformers that maintain equivariance to the special Euclidean group SE(d) for better geometric understanding and performance in computer vision tasks.
Method: Proposed equivariant attention mechanism operating on steerable convolution features, utilizing Fourier space non-linearities and steerable transformer layers integrated with steerable CNNs.
Result: Experimental results in both 2D and 3D show that adding steerable transformer layers to steerable convolutional networks enhances performance.
Conclusion: Steerable Transformers successfully extend transformer mechanisms while maintaining SE(d) equivariance, demonstrating improved performance when combined with steerable convolutional networks.
Abstract: We introduce Steerable Transformers, an extension of the Vision Transformer mechanism that maintains equivariance to the special Euclidean group SE(d). We propose an equivariant attention mechanism that operates on features extracted by steerable convolutions. Operating in Fourier space, our network utilizes Fourier space non-linearities. Our experiments in both two and three dimensions show that adding steerable transformer layers to steerable convolutional networks enhances performance.
[445] Bootstrapping Referring Multi-Object Tracking
Yani Zhang, Dongming Wu, Wencheng Han, Xingping Dong
Main category: cs.CV
TL;DR: The paper introduces referring multi-object tracking (RMOT), a new task that uses language expressions to guide multi-object tracking while accounting for object quantity and temporal semantics. It presents Refer-KITTI-V2 benchmark with diverse language prompts and TempRMOT, a Transformer-based framework that achieves state-of-the-art performance.
Details
Motivation: Existing referring understanding tasks are limited in language expressiveness and cannot model object dynamics in spatial numbers and temporal states. There's a need to bridge natural language and visual content more comprehensively for multi-object tracking.
Method: Proposed TempRMOT, an end-to-end Transformer-based framework with a query-driven Temporal Enhancement Module. Each object is represented as a Transformer query to enable long-term spatial-temporal interactions. Also introduced a semi-automatic labeling pipeline to generate 9,758 language prompts for the Refer-KITTI-V2 benchmark.
Result: TempRMOT achieves state-of-the-art performance on both Refer-KITTI and Refer-KITTI-V2 benchmarks, demonstrating the effectiveness of the proposed approach in handling object dynamics and temporal semantics.
Conclusion: The paper successfully introduces RMOT as a comprehensive referring understanding task and presents an effective Transformer-based solution that handles object quantity variations and temporal semantics, advancing the field of language-guided visual understanding.
Abstract: Referring understanding is a fundamental task that bridges natural language and visual content by localizing objects described in free-form expressions. However, existing works are constrained by limited language expressiveness, lacking the capacity to model object dynamics in spatial numbers and temporal states. To address these limitations, we introduce a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking, comprehensively accounting for variations in object quantity and temporal semantics. Along with RMOT, we introduce an RMOT benchmark named Refer-KITTI-V2, featuring scalable and diverse language expressions. To efficiently generate high-quality annotations covering object dynamics with minimal manual effort, we propose a semi-automatic labeling pipeline that formulates a total of 9,758 language prompts. In addition, we propose TempRMOT, an elegant end-to-end Transformer-based framework for RMOT. At its core is a query-driven Temporal Enhancement Module that represents each object as a Transformer query, enabling long-term spatial-temporal interactions with other objects and past frames to efficiently refine these queries. TempRMOT achieves state-of-the-art performance on both Refer-KITTI and Refer-KITTI-V2, demonstrating the effectiveness of our approach. The source code and dataset are available at https://github.com/zyn213/TempRMOT.
[446] Refusal as Silence: Gendered Disparities in Vision-Language Model Responses
Sha Luo, Sang Jung Kim, Zening Duan, Kaiping Chen
Main category: cs.CV
TL;DR: LLM refusal behavior varies by user identity, with transgender and non-binary personas experiencing significantly higher refusal rates in gender classification tasks, revealing identity-based disparities in AI systems.
Details
Motivation: To investigate how LLM refusal behavior varies by user identity, particularly gender identity, and understand refusal as a sociotechnical outcome that may unevenly regulate access and participation.
Method: Counterfactual persona design varying gender identities (male, female, non-binary, transgender) while keeping classification task and visual input constant, using GPT-4V vision-language model.
Result: Transgender and non-binary personas experience significantly higher refusal rates compared to male and female personas, even in non-harmful contexts.
Conclusion: Identity-driven disparities in refusal behavior exist, highlighting the need for modeling such disparities and cautioning against uncritical use of AI systems for content coding, while advancing algorithmic fairness by reframing refusal as communicative acts.
Abstract: Refusal behavior by Large Language Models is increasingly visible in content moderation, yet little is known about how refusals vary by the identity of the user making the request. This study investigates refusal as a sociotechnical outcome through a counterfactual persona design that varies gender identity (including male, female, non-binary, and transgender personas) while keeping the classification task and visual input constant. Focusing on a vision-language model (GPT-4V), we examine how identity-based language cues influence refusal in binary gender classification tasks. We find that transgender and non-binary personas experience significantly higher refusal rates, even in non-harmful contexts. Our findings also provide methodological implications for equity audits and content analysis using LLMs. Our findings underscore the importance of modeling identity-driven disparities and caution against uncritical use of AI systems for content coding. This study advances algorithmic fairness by reframing refusal as a communicative act that may unevenly regulate epistemic access and participation.
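The counterfactual persona design reduces to a small audit loop: fix the image and the classification task, vary only the stated persona, and compare refusal rates. The sketch below is schematic; ask_model and is_refusal are hypothetical stand-ins, not a real API.

```python
# Schematic counterfactual-persona audit: same task, same images, only the
# persona in the prompt changes. `ask_model` and `is_refusal` are
# hypothetical; real audits would plug in a model API and a refusal classifier.
PERSONAS = ["male", "female", "non-binary", "transgender"]
TASK = "Classify the gender presentation of the person in this image."

def refusal_rate(ask_model, is_refusal, images, persona: str) -> float:
    prompt = f"I am a {persona} user. {TASK}"
    replies = [ask_model(prompt, img) for img in images]
    return sum(map(is_refusal, replies)) / len(replies)

def audit(ask_model, is_refusal, images) -> dict:
    return {p: refusal_rate(ask_model, is_refusal, images, p) for p in PERSONAS}

# Toy stubs so the sketch runs end-to-end.
demo_ask = lambda prompt, img: "I'm sorry, I can't help with that."
demo_refuse = lambda reply: reply.lower().startswith("i'm sorry")
print(audit(demo_ask, demo_refuse, images=[None] * 3))
```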
[447] RealCustom++: Representing Images as Real Textual Word for Real-Time Customization
Zhendong Mao, Mengqi Huang, Fei Ding, Mingcong Liu, Qian He, Yongdong Zhang
Main category: cs.CV
TL;DR: RealCustom++ is a text-to-image customization method that uses real words instead of pseudo-words to represent subjects, achieving simultaneous optimization of subject similarity and text controllability through a train-inference decoupled framework with dual-branch architecture.
Details
Motivation: Existing pseudo-word approaches cause semantic conflicts and entanglement between text and subject representations, leading to a dual-optimum paradox where subject similarity and text controllability cannot be optimized simultaneously.
Method: RealCustom++ introduces a train-inference decoupled framework: during training, it learns general alignment between visual conditions and real words; during inference, a dual-branch architecture generates subject guidance masks and customizes generation exclusively within subject-relevant regions.
Result: The method achieves 7.48% improvement in controllability, 3.04% in similarity, and 76.43% in generation quality. For multi-subject customization, it achieves 4.6% improvement in controllability and 6.34% in multi-subject similarity.
Conclusion: RealCustom++ successfully addresses the dual-optimum paradox in text-to-image customization by using real words instead of pseudo-words, enabling simultaneous optimization of both subject similarity and text controllability.
Abstract: Given a text and an image of a specific subject, text-to-image customization aims to generate new images that align with both the text and the subject’s appearance. Existing works follow the pseudo-word paradigm, which represents the subject as a non-existent pseudo word and combines it with other text to generate images. However, the pseudo word causes semantic conflict from its different learning objective and entanglement from overlapping influence scopes with other texts, resulting in a dual-optimum paradox where subject similarity and text controllability cannot be optimal simultaneously. To address this, we propose RealCustom++, a novel real-word paradigm that represents the subject with a non-conflicting real word to firstly generate a coherent guidance image and corresponding subject mask, thereby disentangling the influence scopes of the text and subject for simultaneous optimization. Specifically, RealCustom++ introduces a train-inference decoupled framework: (1) during training, it learns a general alignment between visual conditions and all real words in the text; and (2) during inference, a dual-branch architecture is employed, where the Guidance Branch produces the subject guidance mask and the Generation Branch utilizes this mask to customize the generation of the specific real word exclusively within subject-relevant regions. In contrast to previous methods that excel in either controllability or similarity, RealCustom++ achieves superior performance in both, with improvements of 7.48% in controllability, 3.04% in similarity, and 76.43% in generation quality. For multi-subject customization, RealCustom++ further achieves improvements of 4.6% in controllability and 6.34% in multi-subject similarity. Our work has been applied in JiMeng of ByteDance, and codes are released at https://github.com/bytedance/RealCustom.
[448] A Cycle Ride to HDR: Semantics Aware Self-Supervised Framework for Unpaired LDR-to-HDR Image Reconstruction
Hrishav Bakul Barua, Kalin Stefanov, Lemuel Lai En Che, Abhinav Dhall, KokSheik Wong, Ganesh Krishnasamy
Main category: cs.CV
TL;DR: CycleHDR is a self-supervised method for HDR image reconstruction from LDR images using unpaired datasets, incorporating semantic and cycle-consistent adversarial architecture with novel artifact- and exposure-aware generators.
Details
Motivation: Most current HDR reconstruction methods require high-quality paired LDR-HDR datasets, with limited use of unpaired datasets. This paper aims to develop a method that can learn the LDR-HDR mapping between domains using unpaired data.
Method: CycleHDR integrates self-supervision into a modified semantic- and cycle-consistent adversarial architecture. It introduces novel artifact- and exposure-aware generators for visual artifact removal, and an encoder and associated loss for semantic consistency.
Result: CycleHDR achieves state-of-the-art performance across several benchmark datasets and reconstructs high-quality HDR images.
Conclusion: CycleHDR is the first method to use semantic and contextual awareness for LDR-HDR reconstruction in a self-supervised setup, successfully addressing visual artifact removal and semantic consistency using unpaired datasets.
Abstract: Reconstruction of High Dynamic Range (HDR) from Low Dynamic Range (LDR) images is an important computer vision task. There is a significant amount of research utilizing both conventional non-learning methods and modern data-driven approaches, focusing on using both single-exposed and multi-exposed LDR for HDR image reconstruction. However, most current state-of-the-art methods require high-quality paired {LDR;HDR} datasets with limited literature use of unpaired datasets, that is, methods that learn the LDR-HDR mapping between domains. This paper proposes CycleHDR, a method that integrates self-supervision into a modified semantic- and cycle-consistent adversarial architecture that utilizes unpaired LDR and HDR datasets for training. Our method introduces novel artifact- and exposure-aware generators to address visual artifact removal. It also puts forward an encoder and loss to address semantic consistency, another under-explored topic. CycleHDR is the first to use semantic and contextual awareness for the LDR-HDR reconstruction task in a self-supervised setup. The method achieves state-of-the-art performance across several benchmark datasets and reconstructs high-quality HDR images. The official website of this work is available at: https://github.com/HrishavBakulBarua/Cycle-HDR
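The cycle-consistency backbone of such unpaired training can be sketched in a few lines: two generators map between the LDR and HDR domains, and each unpaired image must survive a round trip. Tiny convolutional stacks stand in for the paper's generators; the full method adds adversarial, semantic-consistency, and exposure-aware terms on top.

```python
# Cycle-consistency sketch for unpaired LDR <-> HDR training: each image
# should be recovered after mapping to the other domain and back.
import torch
import torch.nn as nn

g_l2h = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1))   # LDR -> HDR
g_h2l = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1))   # HDR -> LDR
l1 = nn.L1Loss()

ldr = torch.rand(2, 3, 64, 64)        # unpaired LDR batch
hdr = torch.rand(2, 3, 64, 64) * 4.0  # unpaired HDR batch (wider range)
cycle_loss = l1(g_h2l(g_l2h(ldr)), ldr) + l1(g_l2h(g_h2l(hdr)), hdr)
print(float(cycle_loss))
```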
[449] Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis
Boming Miao, Chunxiao Li, Xiaoxiao Wang, Andi Zhang, Rui Sun, Zizhe Wang, Yao Zhu
Main category: cs.CV
TL;DR: This paper proposes a method to improve semantic alignment in diffusion models by using large vision-language models to guide the optimization of initial noisy latents, addressing limitations of previous approaches like InitNo.
Details
Motivation: Diffusion models struggle with precise semantic alignment with input prompts, and existing optimization methods like InitNo have limitations including dependency on initial starting points and convergence to local optima.
Method: Leverages LVLMs’ language comprehension to guide initial noisy latent optimization, introduces Noise Diffusion process that updates noisy latent while preserving distribution consistency, and provides theoretical analysis of update conditions.
Result: Experimental results show the framework effectively enhances semantic alignment across various diffusion models and demonstrates good adaptability.
Conclusion: The proposed method consistently improves semantic faithfulness in generated images while maintaining distribution consistency, offering a more effective alternative to existing approaches.
Abstract: Diffusion models have achieved impressive success in generating photorealistic images, but challenges remain in ensuring precise semantic alignment with input prompts. Optimizing the initial noisy latent offers a more efficient alternative to modifying model architectures or prompt engineering for improving semantic alignment. A latest approach, InitNo, refines the initial noisy latent by leveraging attention maps; however, these maps capture only limited information, and the effectiveness of InitNo is highly dependent on the initial starting point, as it tends to converge on a local optimum near this point. To this end, this paper proposes leveraging the language comprehension capabilities of large vision-language models (LVLMs) to guide the optimization of the initial noisy latent, and introduces the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. Furthermore, we provide a theoretical analysis of the condition under which the update improves semantic faithfulness. Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models. The code is available at https://github.com/Bomingmiao/NoiseDiffusion.
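One way to see the distribution-consistency requirement: any update to the initial latent must keep it standard normal, which holds when the latent is mixed with fresh Gaussian noise using coefficients whose squares sum to one. The update below illustrates that property; it is not claimed to be the paper's exact rule.

```python
# If z ~ N(0, I) and eps ~ N(0, I) independently, then
# sqrt(1 - b) * z + sqrt(b) * eps ~ N(0, I) as well, so repeated updates
# never leave the distribution the sampler was trained on.
import torch

def noise_step(z: torch.Tensor, beta: float) -> torch.Tensor:
    eps = torch.randn_like(z)
    return (1.0 - beta) ** 0.5 * z + beta ** 0.5 * eps

z = torch.randn(10000, 4)               # toy batch of initial latents
for _ in range(50):
    z = noise_step(z, beta=0.1)
print(z.mean().item(), z.var().item())  # stays close to (0, 1)
```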
[450] FaceTracer: Unveiling Source Identities from Swapped Face Images and Videos for Fraud Prevention
Zhongyi Zhang, Jie Zhang, Wenbo Zhou, Xinghui Zhou, Qing Guo, Weiming Zhang, Tianwei Zhang, Nenghai Yu
Main category: cs.CV
TL;DR: FaceTracer is a non-intrusive framework that traces the identity of source persons from face-swapped images/videos using disentanglement to suppress target identity features and isolate source identity features.
Details
Motivation: Addresses the limitations of existing face-swapping detection methods, which cannot trace malicious users, and of intrusive watermark approaches, which fail with unmarked identities.
Method: Uses a disentanglement module to suppress the target person’s identity information while isolating the source person’s identity features, enabling robust identity extraction.
Result: Effectively identifies source persons across various face-swapping techniques, shows strong transferability to unseen methods including commercial apps, and is robust against transmission distortions and adaptive attacks.
Conclusion: FaceTracer successfully enables tracing of malicious actors behind fraudulent face-swapping activities by linking swapped content back to original individuals.
Abstract: Face-swapping techniques have advanced rapidly with the evolution of deep learning, leading to widespread use and growing concerns about potential misuse, especially in cases of fraud. While many efforts have focused on detecting swapped face images or videos, these methods are insufficient for tracing the malicious users behind fraudulent activities. Intrusive watermark-based approaches also fail to trace unmarked identities, limiting their practical utility. To address these challenges, we introduce FaceTracer, the first non-intrusive framework specifically designed to trace the identity of the source person from swapped face images or videos. Specifically, FaceTracer leverages a disentanglement module that effectively suppresses identity information related to the target person while isolating the identity features of the source person. This allows us to extract robust identity information that can directly link the swapped face back to the original individual, aiding in uncovering the actors behind fraudulent activities. Extensive experiments demonstrate FaceTracer’s effectiveness across various face-swapping techniques, successfully identifying the source person in swapped content and enabling the tracing of malicious actors involved in fraudulent activities. Additionally, FaceTracer shows strong transferability to unseen face-swapping methods, including commercial applications, and robustness against transmission distortions and adaptive attacks. Our code is available at: https://github.com/zzy224/FaceTracer.
[451] Optimize the Unseen – Fast NeRF Cleanup with Free Space Prior
Leo Segre, Shai Avidan
Main category: cs.CV
TL;DR: A fast post-hoc NeRF cleanup method that eliminates floaters using a Free Space Prior, achieving artifact removal in both seen and unseen areas while being 2.5x faster than existing methods.
Details
Motivation: NeRF's reliance on photometric reconstruction introduces artifacts called 'floaters' that degrade novel view quality, especially in areas unseen by the training cameras.
Method: Uses a Maximum-a-Posteriori (MAP) approach with a simple global Free Space Prior, the assumption that unseen regions should remain empty, enabling artifact removal without disrupting observed regions.
Result: Successfully eliminates floaters in both seen and unseen areas, comparable to existing cleanup models but 2.5x faster in inference, requires no additional memory, and trains in under 30 seconds.
Conclusion: The method provides an efficient, fast solution for NeRF artifact cleanup that enhances novel view quality while maintaining computational efficiency.
Abstract: Neural Radiance Fields (NeRF) have advanced photorealistic novel view synthesis, but their reliance on photometric reconstruction introduces artifacts, commonly known as “floaters”. These artifacts degrade novel view quality, especially in areas unseen by the training cameras. We present a fast, post-hoc NeRF cleanup method that eliminates such artifacts by enforcing our Free Space Prior, effectively minimizing floaters without disrupting the NeRF’s representation of observed regions. Unlike existing approaches that rely on either Maximum Likelihood (ML) estimation to fit the data or a complex, local data-driven prior, our method adopts a Maximum-a-Posteriori (MAP) approach, selecting the optimal model parameters under a simple global prior assumption that unseen regions should remain empty. This enables our method to clean artifacts in both seen and unseen areas, enhancing novel view quality even in challenging scene regions. Our method is comparable with existing NeRF cleanup models while being 2.5x faster in inference time, requires no additional memory beyond the original NeRF, and achieves cleanup training in less than 30 seconds. Our code will be made publicly available.
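A minimal sketch of how such a free-space MAP objective can look: the photometric term applies where rays were observed, while a simple global prior pushes density in unobserved regions toward zero. The mask and weighting below are illustrative assumptions.

```python
# Free-space-prior cleanup sketch: densities sampled in regions never seen
# by any training camera are penalized toward zero; observed regions keep
# the photometric term. `seen_mask` is an illustrative placeholder.
import torch

def cleanup_loss(rgb_pred, rgb_gt, sigma, seen_mask, lam=0.01):
    """rgb_*: (N, 3); sigma: (N,) densities; seen_mask: (N,) bool."""
    photometric = ((rgb_pred - rgb_gt) ** 2)[seen_mask].mean()
    free_space_prior = sigma[~seen_mask].abs().mean()   # unseen => empty
    return photometric + lam * free_space_prior

n = 1024
loss = cleanup_loss(torch.rand(n, 3), torch.rand(n, 3),
                    torch.rand(n), torch.rand(n) > 0.3)
print(float(loss))
```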
[452] BCR-Net: Boundary-Category Refinement Network for Weakly Semi-Supervised X-Ray Prohibited Item Detection with Points
Sanjoeng Wong
Main category: cs.CV
TL;DR: BCR-Net is a weakly semi-supervised approach for X-ray prohibited item detection that uses few box annotations and many point annotations, achieving state-of-the-art performance with limited annotations.
Details
Motivation: To balance annotation cost and detection performance in X-ray prohibited item detection, addressing the limitations of expensive box annotations or weak annotations with limited accuracy.
Method: Built on Group R-CNN with Boundary Refinement (BR) module using dual attention for boundaries and salient features, and Category Refinement (CR) module with contrastive branches using scale- and rotation-aware contrastive loss.
Result: Achieves significant performance improvements over state-of-the-art methods on public X-ray datasets under limited annotations.
Conclusion: BCR-Net effectively addresses imprecise localization and inaccurate classification problems in weakly semi-supervised X-ray prohibited item detection.
Abstract: Automatic prohibited item detection in X-ray images is crucial for public safety. However, most existing detection methods either rely on expensive box annotations to achieve high performance or use weak annotations but suffer from limited accuracy. To balance annotation cost and detection performance, we study Weakly Semi-Supervised X-ray Prohibited Item Detection with Points (WSSPID-P) and propose a novel Boundary-Category Refinement Network (BCR-Net) that requires only a few box annotations and a large number of point annotations. BCR-Net is built based on Group R-CNN and introduces a new Boundary Refinement (BR) module and a new Category Refinement (CR) module. The BR module develops a dual attention mechanism to focus on both the boundaries and salient features of prohibited items. Meanwhile, the CR module incorporates contrastive branches into the heads of the RPN and ROI by introducing a scale- and rotation-aware contrastive loss, enhancing intra-class consistency and inter-class separability in the feature space. Based on the above designs, BCR-Net effectively addresses the closely related problems of imprecise localization and inaccurate classification. Experimental results on public X-ray datasets show the effectiveness of BCR-Net, achieving significant performance improvements over state-of-the-art methods under limited annotations.
[453] MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning
Tieyuan Chen, Huabin Liu, Yi Wang, Yihang Chen, Tianyao He, Chaofan Gan, Huanyu He, Weiyao Lin
Main category: cs.CV
TL;DR: The paper introduces Multi-Event Causal Discovery (MECD), a new task for discovering causal relations between multiple interconnected events in long videos, using a Granger Causality-inspired framework with event prediction and causal inference techniques.
Details
Motivation: Current video causal reasoning is limited to brief segments with isolated events and basic causal relations, lacking comprehensive analysis for videos with multiple interconnected events across time.
Method: Proposes a framework inspired by Granger Causality, using mask-based event prediction to perform an Event Granger Test, and integrates causal inference techniques like front-door adjustment and counterfactual inference to address causality confounding and illusory causality.
Result: Outperforms GPT-4o and VideoChat2 by 5.77% and 2.70% respectively in reasoning complete causal relations, and shows that causal relation graphs benefit downstream video understanding tasks like video QA and event prediction.
Conclusion: The MECD framework effectively addresses the limitations of existing video causal reasoning by enabling comprehensive causal analysis of multiple interconnected events in long videos, demonstrating superior performance and practical utility for downstream tasks.
Abstract: Video causal reasoning aims to achieve a high-level understanding of videos from a causal perspective. However, it exhibits limitations in its scope, primarily executed in a question-answering paradigm and focusing on brief video segments containing isolated events and basic causal relations, lacking comprehensive and structured causality analysis for videos with multiple interconnected events. To fill this gap, we introduce a new task and dataset, Multi-Event Causal Discovery (MECD). It aims to uncover the causal relations between events distributed chronologically across long videos. Given visual segments and textual descriptions of events, MECD identifies the causal associations between these events to derive a comprehensive and structured event-level video causal graph explaining why and how the result event occurred. To address the challenges of MECD, we devise a novel framework inspired by the Granger Causality method, incorporating an efficient mask-based event prediction model to perform an Event Granger Test. It estimates causality by comparing the predicted result event when premise events are masked versus unmasked. Furthermore, we integrate causal inference techniques such as front-door adjustment and counterfactual inference to mitigate challenges in MECD like causality confounding and illusory causality. Additionally, context chain reasoning is introduced to conduct more robust and generalized reasoning. Experiments validate the effectiveness of our framework in reasoning complete causal relations, outperforming GPT-4o and VideoChat2 by 5.77% and 2.70%, respectively. Further experiments demonstrate that causal relation graphs can also contribute to downstream video understanding tasks such as video question answering and video event prediction.
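The Event Granger Test admits a compact reading: a premise event is causal to the extent that masking it degrades the prediction of the result event. A toy sketch, with a stand-in predictor in place of the paper's mask-based event prediction model:

```python
# Event Granger Test sketch: causality of premise event i is estimated by
# how much masking it worsens the prediction of the result event.
import torch

def granger_score(predict, events: torch.Tensor, i: int,
                  result: torch.Tensor) -> float:
    """events: (N, D) premise-event features; result: (D,) result feature."""
    full = predict(events)
    masked_events = events.clone()
    masked_events[i] = 0.0                      # mask premise event i
    masked = predict(masked_events)
    err = lambda p: float(((p - result) ** 2).mean())
    return err(masked) - err(full)              # larger => more causal

torch.manual_seed(0)
W = torch.randn(128, 128)
predict = lambda ev: torch.tanh(ev.mean(dim=0) @ W)  # toy predictor
events, result = torch.randn(5, 128), torch.randn(128)
print([round(granger_score(predict, events, i, result), 4) for i in range(5)])
```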
[454] Improving Video Generation with Human Feedback
Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang
Main category: cs.CV
TL;DR: The paper introduces VideoReward, a multi-dimensional video reward model trained on human preference data, and three alignment algorithms (Flow-DPO, Flow-RWR, Flow-NRG) to improve video generation quality by addressing motion smoothness and prompt-video alignment issues.
Details
Motivation: Current video generation models using rectified flow techniques suffer from unsmooth motion and misalignment between generated videos and text prompts, requiring better alignment methods using human feedback.
Method: Built a large-scale human preference dataset with pairwise annotations, developed VideoReward reward model, and introduced three alignment algorithms: Flow-DPO (training-time preference optimization), Flow-RWR (training-time reward weighted regression), and Flow-NRG (inference-time reward guidance).
Result: VideoReward significantly outperforms existing reward models, Flow-DPO shows superior performance compared to Flow-RWR and supervised fine-tuning, and Flow-NRG enables customizable multi-objective weighting during inference.
Conclusion: The proposed human feedback pipeline effectively addresses video generation quality issues, with VideoReward and Flow-DPO demonstrating state-of-the-art performance, while Flow-NRG provides flexible inference-time customization for personalized video quality needs.
Abstract: Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multi-dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs.
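By analogy with DPO for diffusion models, a Flow-DPO-style objective can be sketched as a preference loss over flow-matching (velocity-prediction) errors of the policy versus a frozen reference; this is a plausible shape under that analogy, not the paper's verified formula.

```python
# Sketch of a DPO-style loss for rectified flow: compare the policy's
# velocity-prediction errors on preferred (w) and rejected (l) samples
# against a frozen reference model. beta is an illustrative scale.
import torch
import torch.nn.functional as F

def flow_dpo_loss(v_pol_w, v_pol_l, v_ref_w, v_ref_l,
                  target_w, target_l, beta: float = 500.0):
    """v_*: predicted velocities; target_*: flow-matching targets (x1 - x0)."""
    err = lambda v, t: ((v - t) ** 2).flatten(1).mean(dim=1)
    margin = ((err(v_pol_w, target_w) - err(v_ref_w, target_w))
              - (err(v_pol_l, target_l) - err(v_ref_l, target_l)))
    return -F.logsigmoid(-beta * margin).mean()

B, D = 4, 32
vs = [torch.randn(B, D) for _ in range(6)]
print(float(flow_dpo_loss(*vs)))
```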
[455] Robust Multimodal Learning via Cross-Modal Proxy Tokens
Md Kaykobad Reza, Ameya Patil, Mashhour Solh, M. Salman Asif
Main category: cs.CV
TL;DR: Proposes cross-modal proxy tokens (CMPTs) to handle missing modalities in multimodal models by approximating missing modality tokens using available ones, without explicit generation or auxiliary networks.
Details
Motivation: Multimodal models suffer significant performance drops when modalities are missing during inference, requiring robust solutions that maintain performance with complete modalities.
Method: Uses cross-modal proxy tokens that approximate missing modality class tokens by attending to available modality tokens, employing low-rank adapters in frozen encoders and joint optimization with alignment and task-specific losses.
Result: Outperforms state-of-the-art baselines across five multimodal datasets with various missing rates while achieving competitive results in complete-modality settings.
Conclusion: Provides a flexible and efficient solution for robust multimodal learning that handles missing modalities effectively without compromising complete-modality performance.
Abstract: Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. To address this challenge, we propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces cross-modal proxy tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality without requiring explicit modality generation or auxiliary networks. To efficiently learn these approximations with minimal computational overhead, we employ low-rank adapters in frozen unimodal encoders and jointly optimize an alignment loss with a task-specific loss. Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing rates while achieving competitive results in complete-modality settings. Overall, our method offers a flexible and efficient solution for robust multimodal learning. The code for this paper is available at: https://github.com/CSIPlab/Cross-Modal-Proxy-Tokens.
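A minimal sketch of a cross-modal proxy token: one learned query cross-attends to the tokens of the available modality, and its output serves as the approximate class token of the missing modality, with no generator or auxiliary network. Dimensions and initialization below are assumptions.

```python
# Cross-modal proxy token sketch: a learned query attends to the available
# modality's tokens to approximate the missing modality's class token.
import torch
import torch.nn as nn

class CrossModalProxy(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.proxy = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, available_tokens: torch.Tensor) -> torch.Tensor:
        # available_tokens: (B, N, D) -> approximate class token (B, D)
        q = self.proxy.expand(available_tokens.size(0), -1, -1)
        out, _ = self.attn(q, available_tokens, available_tokens)
        return out.squeeze(1)

proxy = CrossModalProxy()
image_tokens = torch.randn(2, 197, 256)     # the modality we do have
print(proxy(image_tokens).shape)            # torch.Size([2, 256])
```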
[456] Dual-Flow: Transferable Multi-Target, Instance-Agnostic Attacks via In-the-wild Cascading Flow Optimization
Yixiao Chen, Shikun Sun, Jianshu Li, Ruoyu Li, Zhe Li, Junliang Xing
Main category: cs.CV
TL;DR: Dual-Flow framework improves transferability of multi-target adversarial attacks using Cascading Distribution Shift Training, achieving 34.58% higher success rates than previous methods.
Details
Motivation: Existing generator-based attacks have good transferability but low success rates in multi-target scenarios due to model capacity limitations.
Method: Proposes Dual-Flow framework with Cascading Distribution Shift Training to develop adversarial velocity function for multi-target instance-agnostic attacks.
Result: Significantly improves transferability with 34.58% higher success rate from Inception-v3 to ResNet-152, and shows stronger robustness against defense mechanisms.
Conclusion: Dual-Flow effectively addresses multi-target adversarial attack challenges and demonstrates superior performance over existing methods.
Abstract: Adversarial attacks are widely used to evaluate model robustness, and in black-box scenarios, the transferability of these attacks becomes crucial. Existing generator-based attacks have excellent generalization and transferability due to their instance-agnostic nature. However, when training generators for multi-target tasks, the success rate of transfer attacks is relatively low due to the limitations of the model’s capacity. To address these challenges, we propose a novel Dual-Flow framework for multi-target instance-agnostic adversarial attacks, utilizing Cascading Distribution Shift Training to develop an adversarial velocity function. Extensive experiments demonstrate that Dual-Flow significantly improves transferability over previous multi-target generative attacks. For example, it increases the success rate from Inception-v3 to ResNet-152 by 34.58%. Furthermore, our attack method shows substantially stronger robustness against defense mechanisms, such as adversarially trained models. The code of Dual-Flow is available at: https://github.com/Chyxx/Dual-Flow.
[457] Pulling Back the Curtain: Unsupervised Adversarial Detection via Contrastive Auxiliary Networks
Eylon Mizrahi, Raz Lapid, Moshe Sipper
Main category: cs.CV
TL;DR: U-CAN is an unsupervised adversarial detection method that uses auxiliary networks embedded in intermediate layers to distinguish between benign and adversarial inputs without requiring adversarial examples.
Details
Motivation: Deep learning models in safety-critical applications are vulnerable to adversarial attacks, and existing defenses either focus on robustness or detection independently, lacking effective unsupervised detection methods.
Method: Embed auxiliary networks (projection layers + ArcFace-based linear layers) within selected intermediate layers of target models to refine feature representations for distinguishing benign vs adversarial inputs.
Result: Achieved superior F1 scores compared to existing unsupervised detection methods across multiple datasets (CIFAR-10, Mammals, ImageNet subset) and architectures (ResNet-50, VGG-16, ViT) against four attack methods.
Conclusion: U-CAN provides a scalable and effective solution for enhancing security and reliability of deep learning systems through unsupervised adversarial detection.
Abstract: Deep learning models are widely employed in safety-critical applications yet remain susceptible to adversarial attacks – imperceptible perturbations that can significantly degrade model performance. Conventional defense mechanisms predominantly focus on either enhancing model robustness or detecting adversarial inputs independently. In this work, we propose Unsupervised adversarial detection via Contrastive Auxiliary Networks (U-CAN), which uncovers adversarial behavior within auxiliary feature representations without the need for adversarial examples. U-CAN embeds auxiliary networks within selected intermediate layers of the target model; these networks, comprising projection layers and ArcFace-based linear layers, refine feature representations to more effectively distinguish between benign and adversarial inputs. Comprehensive experiments across multiple datasets (CIFAR-10, Mammals, and a subset of ImageNet) and architectures (ResNet-50, VGG-16, and ViT) demonstrate that our method surpasses existing unsupervised adversarial detection techniques, achieving superior F1 scores against four distinct attack methods. The proposed framework provides a scalable and effective solution for enhancing the security and reliability of deep learning systems.
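As an illustration of the auxiliary networks described above, the sketch below pairs a small projection head with an ArcFace-style angular-margin classifier over intermediate features; the margin, scale, and projection sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Projection + ArcFace-style angular-margin logits over intermediate
    features, in the spirit of U-CAN's auxiliary networks (a sketch)."""
    def __init__(self, in_dim, proj_dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, proj_dim), nn.ReLU(),
                                  nn.Linear(proj_dim, proj_dim))
        self.weight = nn.Parameter(torch.randn(num_classes, proj_dim))
        self.s, self.m = s, m

    def forward(self, feats, labels=None):
        z = F.normalize(self.proj(feats), dim=1)
        w = F.normalize(self.weight, dim=1)
        cos = z @ w.t()                            # cosine-similarity logits
        if labels is None:                         # inference: raw cosines
            return self.s * cos
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        cos_m = torch.cos(theta + self.m)          # add angular margin
        return self.s * torch.where(target, cos_m, cos)
```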
[458] T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting
Yifei Qian, Zhongliang Guo, Bowen Deng, Chun Tong Lei, Shuai Zhao, Chun Pong Lau, Xiaopeng Hong, Michael P. Pound
Main category: cs.CV
TL;DR: T2ICount is a diffusion-based framework for zero-shot object counting that improves text sensitivity through hierarchical semantic correction and representational regional coherence loss, achieving superior performance across benchmarks.
Details
Motivation: Existing zero-shot object counting methods relying on vision-language models like CLIP often exhibit limited sensitivity to text prompts, which limits their effectiveness in counting arbitrary object categories specified by text descriptions.
Method: Proposes T2ICount framework using pretrained diffusion models with: 1) Hierarchical Semantic Correction Module for progressive text-image feature alignment refinement, 2) Representational Regional Coherence Loss using cross-attention maps from denoising U-Net, and 3) A challenging re-annotated FSC147 subset for better evaluation.
Result: Extensive experiments demonstrate superior performance across different benchmarks compared to existing methods.
Conclusion: The diffusion-based T2ICount framework effectively addresses text sensitivity limitations in zero-shot object counting through hierarchical semantic correction and representational coherence loss, providing a robust solution for counting arbitrary object categories specified by text.
Abstract: Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models’ text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code is available at https://github.com/cha15yq/T2ICount.
[459] Now you see me! Attribution Distributions Reveal What is Truly Important for a Prediction
Nils Philipp Walter, Jilles Vreeken, Jonas Fischer
Main category: cs.CV
TL;DR: The paper proposes a new attribution method that computes probability distributions of attributions over classes for each spatial location in images, improving specificity and revealing both discriminative and shared features between classes.
Details
Motivation: Current attribution methods in neural networks produce unspecific saliency maps that fail to identify the relevant information that led to decisions, as shown by benchmark results.
Method: Instead of computing attribution of isolated logits, the authors combine attributions of multiple class logits in analogy to how softmax combines information across logits, computing probability distributions of attributions over classes for each spatial location.
Result: The method reveals better object- and instance-specificity, uncovers discriminative and shared features between classes, and improves established attribution methods on benchmarks including grid-pointing game and randomization-based sanity checks.
Conclusion: Reconsidering how and where attributions are computed across the network improves attribution methods while staying agnostic to model architectures.
Abstract: Neural networks are regularly employed in high-stakes decision-making, where understanding and transparency is key. Attribution methods have been developed to gain understanding into which input features neural networks use for a specific prediction. Although widely used in computer vision, these methods often result in unspecific saliency maps that fail to identify the relevant information that led to a decision, as supported by results on different benchmarks. Here, we revisit the common attribution pipeline and identify the computation of attributions on isolated logits as one cause for the lack of specificity. Instead, we suggest combining attributions of multiple class logits in analogy to how the softmax combines the information across logits. By computing probability distributions of attributions over classes for each spatial location in the image, we unleash the true capabilities of existing attribution methods, revealing better object- and instance-specificity and uncovering discriminative as well as shared features between classes. On common benchmarks, including the grid-pointing game and randomization-based sanity checks, we show that this reconsideration of how and where we compute attributions across the network improves established attribution methods while staying agnostic to model architectures. We make the code publicly available: https://github.com/nilspwalter/var.
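The core operation is simple enough to state in a few lines: given one attribution map per class logit, normalize across classes at every spatial location, in analogy to the softmax. The optional temperature below is an assumption of this sketch.

```python
import torch

def attribution_distribution(per_class_maps, temperature=1.0):
    """per_class_maps: (C, H, W), one attribution map per class logit.
    Returns a (C, H, W) tensor whose values sum to 1 over C at each
    location: a distribution of attributions over classes per pixel."""
    return torch.softmax(per_class_maps / temperature, dim=0)

# A location used specifically by class c yields a peaked distribution at c;
# features shared between classes spread mass over several classes.
```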
[460] EEdit: Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing
Zexuan Yan, Yue Ma, Chang Zou, Wenteng Chen, Qifeng Chen, Linfeng Zhang
Main category: cs.CV
TL;DR: EEdit is an efficient framework for inversion-based image editing that addresses spatial and temporal redundancy through locality caching, token indexing, and inversion step skipping, achieving 2.46× acceleration without performance loss.
Details
Motivation: Inversion-based image editing suffers from significant computation overhead that hinders real-time interactive applications, with redundancy existing in both spatial (unedited regions) and temporal (inversion progress) dimensions.
Method: Three techniques: 1) Spatial locality caching to compute only edited and neighboring regions while skipping unedited areas, 2) Token indexing preprocessing to accelerate caching, 3) Inversion step skipping to reuse latents for efficient editing.
Result: Achieves average 2.46× acceleration across various editing tasks including prompt-guided editing, dragging, and image composition, with no performance degradation.
Conclusion: EEdit provides a practical solution for efficient image editing by systematically addressing spatial and temporal redundancies, enabling real-time interactive applications.
Abstract: Inversion-based image editing is rapidly gaining momentum while suffering from significant computation overhead, hindering its application in real-time interactive scenarios. In this paper, we observe that the redundancy in inversion-based image editing exists in both the spatial and temporal dimensions, such as the unnecessary computation in unedited regions and the redundancy in the inversion progress. To tackle these challenges, we propose a practical framework, named EEdit, to achieve efficient image editing. Specifically, we introduce three techniques to solve them one by one. For spatial redundancy, spatial locality caching is introduced to compute the edited region and its neighboring regions while skipping the unedited regions, and token indexing preprocessing is designed to further accelerate the caching. For temporal redundancy, inversion step skipping is proposed to reuse the latent for efficient editing. Our experiments demonstrate an average 2.46× acceleration without performance drop in a wide range of editing tasks including prompt-guided image editing, dragging, and image composition. Our codes are available at https://github.com/yuriYanZeXuan/EEdit
[461] A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1
Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, Zhiqiang Shen
Main category: cs.CV
TL;DR: The paper proposes a novel targeted attack method for commercial black-box large vision-language models (LVLMs) by refining semantic clarity through local-aggregated perturbations focused on semantically rich regions, achieving high transferability and success rates.
Details
Motivation: Existing transfer-based targeted attacks fail against closed-source commercial LVLMs because learned perturbations lack semantic details and come from uniform distributions, causing models to ignore or misinterpret them.
Method: At each optimization step, crop the adversarial image randomly with controlled aspect ratio and scale, resize it, and align with the target image in embedding space to encode explicit semantic details in local regions.
Result: Achieves success rates exceeding 90% on GPT-4.5, GPT-4o, and o1, significantly outperforming prior state-of-the-art methods with lower ℓ₁/ℓ₂ perturbations. Works on various commercial models including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5/3.7-sonnet, and reasoning models.
Conclusion: Focusing perturbations on semantically rich areas rather than uniform distribution enables effective transfer attacks on commercial black-box LVLMs by ensuring semantic clarity and finer-grained feature capture.
Abstract: Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against closed-source commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial black-box LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, thereby causing the attack to fail. To overcome these issues, we propose to refine semantic clarity by encoding explicit semantic details within local regions, thus ensuring the capture of finer-grained features and inter-model transferability, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To achieve this, we propose a simple yet highly effective baseline: at each optimization step, the adversarial image is cropped randomly by a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. While the naive source-target matching method has been utilized before in the literature, we are the first to provide a tight analysis, which establishes a close connection between perturbation optimization and semantics. Experimental results confirm our hypothesis. Our adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5/3.7-sonnet, and even reasoning models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods with lower ℓ₁/ℓ₂ perturbations.
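A minimal sketch of one optimization step, assuming a CLIP-like image encoder and torchvision's crop-parameter sampler; learning-rate handling and the perturbation-budget projection are omitted for brevity.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

def attack_step(adv, target_emb, encoder, optimizer,
                scale=(0.5, 1.0), ratio=(0.8, 1.25)):
    """One step: random crop with controlled scale/aspect ratio, resize,
    then pull the crop's embedding toward the target image's embedding.
    `adv` is a (1, 3, H, W) leaf tensor registered in `optimizer`."""
    i, j, h, w = T.RandomResizedCrop.get_params(adv, scale=list(scale),
                                                ratio=list(ratio))
    crop = F.interpolate(adv[..., i:i + h, j:j + w],
                         size=adv.shape[-2:], mode='bilinear',
                         align_corners=False)
    loss = -F.cosine_similarity(encoder(crop), target_emb, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```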
[462] Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models
Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, Yuheng Li, Konstantinos Psounis, Xiaofeng Yang
Main category: cs.CV
TL;DR: Med-R1 is a reinforcement learning-enhanced vision-language model that improves medical reasoning by using Group Relative Policy Optimization, achieving significant performance gains over base models and even outperforming much larger models.
Details
Motivation: Medical vision-language tasks require precise understanding and clinically coherent answers, but face challenges due to complex medical data and scarce expert annotations, limiting conventional fine-tuning methods.
Method: Built on DeepSeek strategy, Med-R1 uses Group Relative Policy Optimization (GRPO) for reward-guided learning beyond static annotations, and explores the impact of intermediate rationales on reasoning quality.
Result: Med-R1 achieves 29.94% average accuracy improvement over Qwen2-VL-2B and outperforms Qwen2-VL-72B (36x larger). It also shows 32.06% improvement in cross-task generalization across five question types.
Conclusion: RL improves medical reasoning and generalization, enabling efficient VLMs for real-world deployment. The quality and domain alignment of reasoning, rather than just more reasoning, determines effectiveness in medical VQA.
Abstract: Vision-language models (VLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored. Medical vision-language tasks demand precise understanding and clinically coherent answers, which are difficult to achieve due to the complexity of medical data and the scarcity of high-quality expert annotations. These challenges limit the effectiveness of conventional supervised fine-tuning (SFT) and Chain-of-Thought (CoT) strategies that work well in general domains. To address these challenges, we propose Med-R1, a reinforcement learning (RL)-enhanced vision-language model designed to improve generalization and reliability in medical reasoning. Built on the DeepSeek strategy, Med-R1 adopts Group Relative Policy Optimization (GRPO) to encourage reward-guided learning beyond static annotations. We comprehensively evaluate Med-R1 across eight distinct medical imaging modalities. Med-R1 achieves a 29.94% improvement in average accuracy over its base model Qwen2-VL-2B, and even outperforms Qwen2-VL-72B, a model with 36x more parameters. To assess cross-task generalization, we further evaluate Med-R1 on five question types. Med-R1 outperforms Qwen2-VL-2B by 32.06% in question-type generalization, also surpassing Qwen2-VL-72B. We further explore the thinking process in Med-R1, a crucial component for the success of DeepSeek-R1. Our results show that omitting intermediate rationales (No-Thinking-Med-R1) not only improves in-domain and cross-domain generalization with less training, but also challenges the assumption that more reasoning always helps. These findings suggest that in medical VQA, it is not reasoning itself, but its quality and domain alignment, that determine effectiveness. Together, these results highlight that RL improves medical reasoning and generalization, enabling efficient and reliable VLMs for real-world deployment.
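GRPO's defining step, normalizing each sampled response's reward against its own group rather than against a learned value critic, can be sketched as follows (this is the standard GRPO normalization; Med-R1's exact variant may differ):

```python
import torch

def grpo_advantages(rewards):
    """rewards: (num_prompts, group_size), one row per prompt's sampled
    group of responses. Advantage = reward standardized within its group,
    replacing a learned value critic."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)
```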
[463] VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation
Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, Mohit Bansal
Main category: cs.CV
TL;DR: VEGGIE is a unified video editing framework that handles multiple editing tasks through instruction interpretation and grounded generation, using curriculum learning and synthetic data.
Details
Motivation: Current video diffusion models struggle with instructional editing and diverse tasks within a unified framework, requiring a solution that can handle various editing operations through natural language instructions.
Method: Uses an MLLM to interpret user instructions and ground them to video contexts, generating frame-specific queries for a diffusion model. Employs curriculum learning with large-scale image editing data followed by video fine-tuning, and introduces a data synthesis pipeline to create instructional video editing data.
Result: VEGGIE outperforms baselines in instructional video editing with different skills, excels in video object grounding and reasoning segmentation where others fail, and shows strong multi-tasking capabilities.
Conclusion: The framework successfully unifies diverse video editing tasks through instruction-based approach, demonstrates how multiple tasks benefit each other, and enables promising applications like zero-shot multimodal instructional and in-context video editing.
Abstract: Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and text query, VEGGIE first utilizes an MLLM to interpret user intentions in instructions and ground them to the video contexts, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans and generates edited videos that align with user intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data. Additionally, we introduce a novel data synthesis pipeline to generate paired instructional video editing data for model training. It transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model, while other models struggle with multi-tasking. VEGGIE also excels in video object grounding and reasoning segmentation, where other baselines fail. We further reveal how the multiple tasks help each other and highlight promising applications like zero-shot multimodal instructional and in-context video editing.
[464] ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning
Chau Pham, Juan C. Caicedo, Bryan A. Plummer
Main category: cs.CV
TL;DR: ChA-MAEViT is a Masked Autoencoder method that enhances cross-channel feature learning in Multi-Channel Imaging by using dynamic channel-patch masking, memory tokens, hybrid token fusion, and a Channel-Aware Decoder.
Details
Motivation: Standard MAEs fail in Multi-Channel Imaging because they assume channel redundancy, but MCI channels provide complementary information with minimal overlap, limiting cross-channel interaction learning.
Method: Four key strategies: dynamic channel-patch masking for cross-channel reconstruction, memory tokens for information sharing, hybrid token fusion for rich representations, and Channel-Aware Decoder for patch reconstruction.
Result: Experiments on satellite and microscopy datasets (CHAMMI, JUMP-CP, So2Sat) show ChA-MAEViT outperforms state-of-the-art MCI-ViTs by 3.0-21.5%.
Conclusion: Cross-channel interactions are crucial in Multi-Channel Imaging, and ChA-MAEViT effectively addresses this through its novel architecture components.
Abstract: Prior work using Masked Autoencoders (MAEs) typically relies on random patch masking based on the assumption that images have significant redundancies across different channels, allowing for the reconstruction of masked content using cross-channel correlations. However, this assumption does not hold in Multi-Channel Imaging (MCI), where channels may provide complementary information with minimal feature overlap. Thus, these MAEs primarily learn local structures within individual channels from patch reconstruction, failing to fully leverage cross-channel interactions and limiting their MCI effectiveness. In this paper, we present ChA-MAEViT, an MAE-based method that enhances feature learning across MCI channels via four key strategies: (1) dynamic channel-patch masking, which compels the model to reconstruct missing channels in addition to masked patches, thereby enhancing cross-channel dependencies and improving robustness to varying channel configurations; (2) memory tokens, which serve as long-term memory aids to promote information sharing across channels, addressing the challenges of reconstructing structurally diverse channels; (3) hybrid token fusion module, which merges fine-grained patch tokens with a global class token to capture richer representations; and (4) Channel-Aware Decoder, a lightweight decoder that utilizes channel tokens to effectively reconstruct image patches. Experiments on satellite and microscopy datasets, CHAMMI, JUMP-CP, and So2Sat, show that ChA-MAEViT significantly outperforms state-of-the-art MCI-ViTs by 3.0-21.5%, highlighting the importance of cross-channel interactions in MCI. Our code is publicly available at https://github.com/chaudatascience/cha_mae_vit.
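Strategy (1) is straightforward to sketch: sample a mask that drops whole channels and, independently, random patches, so reconstruction targets include both. The ratios and the independence between the two draws are illustrative assumptions, not the paper's schedule.

```python
import torch

def channel_patch_mask(num_channels, num_patches,
                       channel_ratio=0.25, patch_ratio=0.5, device='cpu'):
    """Sample a True=masked map that removes whole channels and,
    independently, random patches, so the model must reconstruct both."""
    channel_masked = torch.rand(num_channels, 1, device=device) < channel_ratio
    patch_masked = torch.rand(num_channels, num_patches,
                              device=device) < patch_ratio
    return channel_masked | patch_masked   # (num_channels, num_patches)
```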
[465] ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation
Yunhong Min, Daehyeon Choi, Kyeongmin Yeo, Jihyun Lee, Minhyuk Sung
Main category: cs.CV
TL;DR: ORIGEN is the first zero-shot method for 3D orientation grounding in text-to-image generation, using reward-guided sampling with a pretrained orientation estimation model and one-step generative flow model.
Details
Motivation: Previous work on spatial grounding in image generation focused mainly on 2D positioning but lacked control over 3D orientation, which ORIGEN aims to address.
Method: Uses reward-guided sampling with a pretrained discriminative model for 3D orientation estimation and a one-step text-to-image generative flow model. Adopts sampling-based approach using Langevin dynamics instead of gradient-ascent optimization, with adaptive time rescaling for faster convergence.
Result: ORIGEN outperforms both training-based and test-time guidance methods across quantitative metrics and user studies.
Conclusion: The proposed zero-shot method successfully enables 3D orientation grounding in text-to-image generation with improved performance over existing approaches.
Abstract: We introduce ORIGEN, the first zero-shot method for 3D orientation grounding in text-to-image generation across multiple objects and diverse categories. While previous work on spatial grounding in image generation has mainly focused on 2D positioning, it lacks control over 3D orientation. To address this, we propose a reward-guided sampling approach using a pretrained discriminative model for 3D orientation estimation and a one-step text-to-image generative flow model. While gradient-ascent-based optimization is a natural choice for reward-based guidance, it struggles to maintain image realism. Instead, we adopt a sampling-based approach using Langevin dynamics, which extends gradient ascent by simply injecting random noise–requiring just a single additional line of code. Additionally, we introduce adaptive time rescaling based on the reward function to accelerate convergence. Our experiments show that ORIGEN outperforms both training-based and test-time guidance methods across quantitative metrics and user studies.
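The Langevin update described above (gradient ascent on the reward plus injected noise) can be sketched in a few lines; the step size and the latent-space placement of the update are assumptions of this sketch.

```python
import torch

def langevin_reward_step(latent, reward_fn, step_size=0.01):
    """Gradient ascent on the reward plus injected Gaussian noise, i.e.
    one Langevin-dynamics update (the noise term is the 'single
    additional line' relative to plain gradient ascent)."""
    latent = latent.detach().requires_grad_(True)
    grad = torch.autograd.grad(reward_fn(latent).sum(), latent)[0]
    return (latent + step_size * grad
            + (2 * step_size) ** 0.5 * torch.randn_like(latent)).detach()
```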
[466] Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought
Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, Libo Qin
Main category: cs.CV
TL;DR: MCoT improves LVLMs through visual thoughts that convey image information to reasoning, with effectiveness depending on clarity and conciseness rather than format.
Details
Motivation: To understand the mechanisms behind MCoT improvements in LVLMs, as current approaches' driving factors are not fully understood.
Method: Defined four distinct forms of visual thought expressions and analyzed them systematically to explore how visual thoughts function in MCoT.
Result: Different visual thought forms vary in clarity and conciseness, leading to varying MCoT improvement levels. Visual thoughts act as intermediaries between input images and deeper transformer layers.
Conclusion: Visual thoughts enable more advanced visual information transmission and can inspire future MCoT research breakthroughs.
Abstract: Large Vision-Language Models (LVLMs) have achieved significant success in multimodal tasks, with multimodal chain-of-thought (MCoT) further enhancing performance and interpretability. Recent MCoT methods fall into two categories: (i) Textual-MCoT (T-MCoT), which takes multimodal input and produces textual output; and (ii) Interleaved-MCoT (I-MCoT), which generates interleaved image-text outputs. Despite advances in both approaches, the mechanisms driving these improvements are not fully understood. To fill this gap, we first reveal that MCoT boosts LVLMs by incorporating visual thoughts, which convey image information to the reasoning process regardless of the MCoT format, depending only on clarity and conciseness of expression. Furthermore, to explore visual thoughts systematically, we define four distinct forms of visual thought expressions and analyze them comprehensively. Our findings demonstrate that these forms differ in clarity and conciseness, yielding varying levels of MCoT improvement. Additionally, we explore the internal nature of visual thoughts, finding that visual thoughts serve as intermediaries that relay information from the input image to deeper transformer layers, enabling more advanced visual information transmission. We hope that the visual thoughts can inspire further breakthroughs for future MCoT research.
[467] Segment then Splat: Unified 3D Open-Vocabulary Segmentation via Gaussian Splatting
Yiren Lu, Yunlai Zhou, Yiran Qiao, Chaoda Song, Tuo Liang, Jing Ma, Huan Wang, Yu Yin
Main category: cs.CV
TL;DR: Segment then Splat reverses traditional 3D segmentation by dividing Gaussians into object sets before reconstruction, enabling true 3D open-vocabulary segmentation for both static and dynamic scenes using Gaussian Splatting.
Details
Motivation: Existing methods rely on 2D pixel-level parsing which causes multi-view inconsistencies, poor 3D object retrieval, and struggles with dynamic scenes due to motion modeling complexities.
Method: Proposes ‘Segment then Splat’ that divides Gaussians into distinct object sets before reconstruction, eliminating geometric/semantic ambiguities and Gaussian-object misalignment. Uses CLIP embeddings for open-vocabulary querying without learning separate language fields.
Result: Extensive experiments show effectiveness in both static and dynamic scenarios, achieving true 3D segmentation with accelerated optimization.
Conclusion: The proposed method successfully enables 3D-aware open vocabulary segmentation for both static and dynamic scenes, overcoming limitations of traditional approaches by segmenting before reconstruction.
Abstract: Open-vocabulary querying in 3D space is crucial for enabling more intelligent perception in applications such as robotics, autonomous systems, and augmented reality. However, most existing methods rely on 2D pixel-level parsing, leading to multi-view inconsistencies and poor 3D object retrieval. Moreover, they are limited to static scenes and struggle with dynamic scenes due to the complexities of motion modeling. In this paper, we propose Segment then Splat, a 3D-aware open vocabulary segmentation approach for both static and dynamic scenes based on Gaussian Splatting. Segment then Splat reverses the long-established approach of “segmentation after reconstruction” by dividing Gaussians into distinct object sets before reconstruction. Once reconstruction is complete, the scene is naturally segmented into individual objects, achieving true 3D segmentation. This design eliminates both geometric and semantic ambiguities, as well as Gaussian-object misalignment issues in dynamic scenes. It also accelerates the optimization process, as it eliminates the need for learning a separate language field. After optimization, a CLIP embedding is assigned to each object to enable open-vocabulary querying. Extensive experiments on various datasets demonstrate the effectiveness of our proposed method in both static and dynamic scenarios.
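Once every object carries a CLIP embedding, open-vocabulary querying reduces to a cosine-similarity lookup. The sketch below assumes an open_clip-style encoder/tokenizer interface and an illustrative similarity threshold.

```python
import torch
import torch.nn.functional as F

def query_objects(object_embeddings, text, clip_model, tokenizer,
                  threshold=0.25):
    """Open-vocabulary lookup over per-object CLIP embeddings (assigned
    after reconstruction). Returns indices of matching objects plus the
    similarity scores; the threshold is illustrative."""
    with torch.no_grad():
        text_emb = F.normalize(clip_model.encode_text(tokenizer([text])),
                               dim=-1)
    obj_emb = F.normalize(object_embeddings, dim=-1)   # (num_objects, D)
    sims = (obj_emb @ text_emb.t()).squeeze(-1)        # cosine similarity
    return (sims > threshold).nonzero(as_tuple=True)[0], sims
```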
[468] Progressive Multi-Source Domain Adaptation for Personalized Facial Expression Recognition
Muhammad Osama Zeeshan, Marco Pedersoli, Alessandro Lameiras Koerich, Eric Granger
Main category: cs.CV
TL;DR: Progressive multi-source domain adaptation for personalized facial expression recognition that gradually introduces source subjects based on similarity to target, avoiding negative transfer from dissimilar sources.
Details
Motivation: Current MSDA methods for FER face challenges with large domain shifts between source and target subjects, leading to negative transfer and computational inefficiency when using all sources simultaneously.
Method: Progressive MSDA approach that incrementally introduces source subjects based on similarity to target, with density-based memory mechanism to prevent catastrophic forgetting of relevant historical samples.
Result: Extensive experiments conducted on multiple datasets (Biovid, UNBC-McMaster, Aff-Wild2, BAH) and cross-dataset settings.
Conclusion: The proposed approach effectively addresses domain shift issues in personalized FER by selectively using relevant sources and preventing negative transfer.
Abstract: Personalized facial expression recognition (FER) involves adapting a machine learning model using samples from labeled sources and unlabeled target domains. Given the challenges of recognizing subtle expressions with considerable interpersonal variability, state-of-the-art unsupervised domain adaptation (UDA) methods focus on the multi-source UDA (MSDA) setting, where each domain corresponds to a specific subject, and improve model accuracy and robustness. However, when adapting to a specific target, the diverse nature of multiple source domains translates to a large shift between source and target data. State-of-the-art MSDA methods for FER address this domain shift by considering all the sources to adapt to the target representations. Nevertheless, adapting to a target subject presents significant challenges due to large distributional differences between source and target domains, often resulting in negative transfer. In addition, integrating all sources simultaneously increases computational costs and causes misalignment with the target. To address these issues, we propose a progressive MSDA approach that gradually introduces information from subjects based on their similarity to the target subject. This will ensure that only the most relevant sources from the target are selected, which helps avoid the negative transfer caused by dissimilar sources. We first exploit the closest sources to reduce the distribution shift with the target and then move towards the furthest while only considering the most relevant sources based on the predetermined threshold. Furthermore, to mitigate catastrophic forgetting caused by the incremental introduction of source subjects, we implemented a density-based memory mechanism that preserves the most relevant historical source samples for adaptation. Our extensive experiments on Biovid, UNBC-McMaster, Aff-Wild2, and BAH, as well as in a cross-dataset setting, validate the effectiveness of the proposed approach.
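A minimal sketch of the progressive selection idea: rank source subjects by the distance between their feature distribution and the target's (here a simple mean-feature distance as a proxy; the paper's similarity measure may differ), then introduce them closest-first in stages.

```python
import torch

def rank_sources_by_similarity(target_feats, source_feats_list):
    """Rank source subjects by closeness of their feature distribution to
    the target's, via distance between feature means (a simple proxy)."""
    t_mu = target_feats.mean(dim=0)
    dists = torch.stack([(s.mean(dim=0) - t_mu).norm()
                         for s in source_feats_list])
    return torch.argsort(dists)           # closest source subjects first

def progressive_stages(order, num_stages):
    """Yield the growing set of active sources, closest-first per stage."""
    active = []
    for chunk in torch.chunk(order, num_stages):
        active = active + chunk.tolist()
        yield active                      # sources used for adaptation now
```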
[469] KAN or MLP? Point Cloud Shows the Way Forward
Yan Shi, Qingdong He, Yijun Liu, Xiaoyu Liu, Jingyong Su
Main category: cs.CV
TL;DR: PointKAN applies Kolmogorov-Arnold Networks (KANs) to point cloud analysis, outperforming MLP-based methods with better geometric feature capture, parameter efficiency, and computational performance.
Details
Motivation: Traditional MLPs struggle with complex geometric structures in point clouds due to fixed activation functions, poor parameter efficiency, and high model redundancy.
Method: Proposes PointKAN with Geometric Affine Module (GAM) for robustness, Local Feature Processing (LFP) with parallel structure for multi-scale features, and Global Feature Processing (GFP) for complete geometric capture. Also introduces Efficient-KANs in PointKAN-elite variant to reduce parameters.
Result: Outperforms PointMLP on ModelNet40, ScanObjectNN, and ShapeNetPart datasets, with strong performance in Few-shot Learning. Achieves substantial reductions in parameter counts and computational complexity (FLOPs).
Conclusion: Demonstrates the potential of KANs-based architectures in 3D vision and opens new research avenues for point cloud understanding.
Abstract: Multi-Layer Perceptrons (MLPs) have become one of the fundamental architectural components in point cloud analysis due to their effective feature learning mechanism. However, when processing complex geometric structures in point clouds, MLPs’ fixed activation functions struggle to efficiently capture local geometric features, while suffering from poor parameter efficiency and high model redundancy. In this paper, we propose PointKAN, which applies Kolmogorov-Arnold Networks (KANs) to point cloud analysis tasks to investigate their efficacy in hierarchical feature representation. First, we introduce a Geometric Affine Module (GAM) to transform local features, improving the model’s robustness to geometric variations. Next, in the Local Feature Processing (LFP), a parallel structure extracts both group-level features and global context, providing a rich representation of both fine details and overall structure. Finally, these features are combined and processed in the Global Feature Processing (GFP). By repeating these operations, the receptive field gradually expands, enabling the model to capture complete geometric information of the point cloud. To overcome the high parameter counts and computational inefficiency of standard KANs, we develop Efficient-KANs in the PointKAN-elite variant, which significantly reduces parameters while maintaining accuracy. Experimental results demonstrate that PointKAN outperforms PointMLP on benchmark datasets such as ModelNet40, ScanObjectNN, and ShapeNetPart, with particularly strong performance in the Few-shot Learning task. Additionally, PointKAN achieves substantial reductions in parameter counts and computational complexity (FLOPs). This work highlights the potential of KANs-based architectures in 3D vision and opens new avenues for research in point cloud understanding.
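For readers unfamiliar with KANs: instead of fixed activations on nodes, each input-output edge carries a learnable univariate function. Below is a simplified KAN-style layer that uses radial basis functions on a fixed grid in place of the usual B-splines, an assumption made to keep the sketch short; it is not PointKAN's Efficient-KAN design.

```python
import torch
import torch.nn as nn

class RBFKANLayer(nn.Module):
    """Each edge (i, j) carries a learnable univariate function phi_{j,i},
    parameterized by RBF coefficients on a shared grid of centers."""
    def __init__(self, in_dim, out_dim, num_centers=8, grid_range=(-2.0, 2.0)):
        super().__init__()
        self.register_buffer('centers',
                             torch.linspace(*grid_range, num_centers))
        self.h = (grid_range[1] - grid_range[0]) / (num_centers - 1)
        self.coef = nn.Parameter(torch.randn(out_dim, in_dim, num_centers) * 0.1)

    def forward(self, x):                                 # x: (B, in_dim)
        # Evaluate the RBF basis for every edge: (B, in_dim, num_centers).
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.h) ** 2)
        # y_j = sum_i sum_k coef[j, i, k] * basis[b, i, k]
        return torch.einsum('bik,oik->bo', basis, self.coef)
```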
[470] DERD-Net: Learning Depth from Event-based Ray Densities
Diego Hitzges, Suman Ghosh, Guillermo Gallego
Main category: cs.CV
TL;DR: A scalable framework for event camera depth estimation using disparity space images and neural networks with 3D convolutions and recurrent structures, achieving state-of-the-art performance on monocular and stereo setups.
Details
Motivation: Event cameras provide blur-free 3D edges at high-speed but traditional deep learning struggles with their asynchronous data. Need specialized framework for event-based depth estimation and SLAM.
Method: Encodes 3D scene into disparity space images (DSIs) by back-projecting events. Uses neural network with 3D convolutions and recurrent structure to process local DSI subregions for depth prediction.
Result: Outperforms all SOTA approaches: 42% MAE reduction in stereo, 30% median error reduction with 3x depth completeness increase. Monocular achieves comparable results to existing stereo methods.
Conclusion: Framework shows remarkable performance and effective event-data processing, holding strong potential to become standard approach for event-based depth estimation and SLAM.
Abstract: Event cameras offer a promising avenue for multi-view stereo depth estimation and Simultaneous Localization And Mapping (SLAM) due to their ability to detect blur-free 3D edges at high-speed and over broad illumination conditions. However, traditional deep learning frameworks designed for conventional cameras struggle with the asynchronous, stream-like nature of event data, as their architectures are optimized for discrete, image-like inputs. We propose a scalable, flexible and adaptable framework for pixel-wise depth estimation with event cameras in both monocular and stereo setups. The 3D scene structure is encoded into disparity space images (DSIs), representing spatial densities of rays obtained by back-projecting events into space via known camera poses. Our neural network processes local subregions of the DSIs combining 3D convolutions and a recurrent structure to recognize valuable patterns for depth prediction. Local processing enables fast inference with full parallelization and ensures constant ultra-low model complexity and memory costs, regardless of camera resolution. Experiments on standard benchmarks (MVSEC and DSEC datasets) demonstrate unprecedented effectiveness: (i) using purely monocular data, our method achieves comparable results to existing stereo methods; (ii) when applied to stereo data, it strongly outperforms all state-of-the-art (SOTA) approaches, reducing the mean absolute error by at least 42%; (iii) our method also allows for increases in depth completeness by more than 3-fold while still yielding a reduction in median absolute error of at least 30%. Given its remarkable performance and effective processing of event-data, our framework holds strong potential to become a standard approach for using deep learning for event-based depth estimation and SLAM. Project page: https://github.com/tub-rip/DERD-Net
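A skeleton of the described processing of a local DSI subregion, combining 3D convolutions with a recurrent scan along the depth axis; all layer sizes and the exact placement of the recurrence are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DSISubregionNet(nn.Module):
    """3D convolutions over a local DSI subregion followed by a recurrent
    scan along the depth axis and a per-pixel prediction head."""
    def __init__(self, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 16, kernel_size=3, padding=1), nn.ReLU())
        self.gru = nn.GRU(input_size=16, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # depth estimate per pixel

    def forward(self, dsi):                # dsi: (B, 1, D, H, W) ray densities
        f = self.conv(dsi)                 # (B, 16, D, H, W)
        B, C, D, H, W = f.shape
        seq = f.permute(0, 3, 4, 2, 1).reshape(B * H * W, D, C)
        _, h_n = self.gru(seq)             # recurrence over depth planes
        return self.head(h_n[-1]).view(B, H, W)
```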
[471] DEEMO: De-identity Multimodal Emotion Recognition and Reasoning
Deng Li, Bohao Xing, Xin Liu, Baiqiang Xia, Bihan Wen, Heikki Kälviäinen
Main category: cs.CV
TL;DR: DEEMO introduces privacy-preserving multimodal emotion recognition using de-identified video/audio inputs, with a new dataset and DEEMO-LLaMA model achieving state-of-the-art performance.
Details
Motivation: Address privacy concerns in emotion recognition by avoiding identity-sensitive information like facial expressions and speech, enabling ethical AI development.
Method: Propose DEEMO task with two datasets (DEEMO-NFBL for non-facial body language, DEEMO-MER for instruction-based recognition) and DEEMO-LLaMA MLLM that integrates de-identified audio, video, and text.
Result: DEEMO-LLaMA achieves 74.49% accuracy and 74.45% F1-score in emotion recognition, and 6.20 clue overlap and 7.66 label overlap in emotion reasoning, significantly outperforming existing MLLMs.
Conclusion: The work advances privacy-preserving emotion understanding and promotes responsible affective computing by enabling emotion recognition without compromising identity privacy.
Abstract: Emotion understanding is a critical yet challenging task. Most existing approaches rely heavily on identity-sensitive information, such as facial expressions and speech, which raises concerns about personal privacy. To address this, we introduce the De-identity Multimodal Emotion Recognition and Reasoning (DEEMO), a novel task designed to enable emotion understanding using de-identified video and audio inputs. The DEEMO dataset consists of two subsets: DEEMO-NFBL, which includes rich annotations of Non-Facial Body Language (NFBL), and DEEMO-MER, an instruction dataset for Multimodal Emotion Recognition and Reasoning using identity-free cues. This design supports emotion understanding without compromising identity privacy. In addition, we propose DEEMO-LLaMA, a Multimodal Large Language Model (MLLM) that integrates de-identified audio, video, and textual information to enhance both emotion recognition and reasoning. Extensive experiments show that DEEMO-LLaMA achieves state-of-the-art performance on both tasks, outperforming existing MLLMs by a significant margin, achieving 74.49% accuracy and 74.45% F1-score in de-identity emotion recognition, and 6.20 clue overlap and 7.66 label overlap in de-identity emotion reasoning. Our work contributes to ethical AI by advancing privacy-preserving emotion understanding and promoting responsible affective computing.
[472] VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding
Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Fuxiao Liu, Tianyi Zhou, Dinesh Manocha, Jordan Lee Boyd-Graber
Main category: cs.CV
TL;DR: The paper introduces VideoHallu, a synthetic dataset of physics- and commonsense-violating videos to test whether vision-language models truly understand visual content or just learn shallow correlations.
Details
Motivation: Current VLMs achieve strong results on standard benchmarks but may not genuinely comprehend visual reasoning, especially physics and common sense, which is crucial for AI systems interacting with the physical world.
Method: Created VideoHallu dataset using Veo2, Sora, and Kling to generate videos depicting physically impossible or logically inconsistent events, with expert-annotated QA pairs across four violation categories. Tested leading VLMs and performed reinforcement learning fine-tuning.
Result: Leading VLMs (Qwen-2.5-VL, Video-R1, VideoChat-R1) often miss physics and commonsense violations despite strong benchmark performance. RL fine-tuning on VideoHallu improves violation recognition without reducing standard benchmark scores.
Conclusion: VideoHallu exposes gaps in VLMs’ visual reasoning capabilities and demonstrates that targeted fine-tuning can improve physics and commonsense understanding while maintaining standard performance.
Abstract: Vision-Language Models (VLMs) have achieved strong results in video understanding, yet a key question remains: do they truly comprehend visual content or only learn shallow correlations between vision and language? Real visual understanding, especially of physics and common sense, is essential for AI systems that interact with the physical world. Current evaluations mostly use real-world videos similar to training data, so high benchmark scores may not reflect real reasoning ability. To address this, we propose negative-control tests using videos that depict physically impossible or logically inconsistent events. We introduce VideoHallu, a synthetic dataset of physics- and commonsense-violating scenes generated with Veo2, Sora, and Kling. It includes expert-annotated question-answer pairs across four categories of violations. Tests of leading VLMs (Qwen-2.5-VL, Video-R1, VideoChat-R1) show that, despite strong results on benchmarks such as MVBench and MMVU, they often miss these violations, exposing gaps in visual reasoning. Reinforcement learning fine-tuning on VideoHallu improves recognition of such violations without reducing standard benchmark performance. Our data is available at https://github.com/zli12321/VideoHallu.git.
[473] MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
Shoubin Yu, Yue Zhang, Ziyang Wang, Jaehong Yoon, Mohit Bansal
Main category: cs.CV
TL;DR: MEXA is a training-free framework that dynamically aggregates multiple expert models for multimodal reasoning across diverse domains by selecting experts based on input modality and task requirements, then using a Large Reasoning Model to combine their outputs.
Details
Motivation: To address the challenge of building unified multimodal reasoning frameworks that can handle increasing diversity of input modalities and task complexity across domains like medical diagnosis and financial forecasting.
Method: MEXA dynamically selects specialized expert models based on input modality and task requirements, generates interpretable textual reasoning from each expert, then aggregates these outputs using a Large Reasoning Model (LRM) to produce final answers.
Result: MEXA consistently outperforms strong multimodal baselines across diverse benchmarks including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA, demonstrating broad applicability.
Conclusion: The expert-driven selection and aggregation approach enables flexible, transparent multimodal reasoning across diverse domains without additional training overhead, proving effective for scalable multimodal reasoning.
Abstract: Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality-task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
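Because MEXA is training-free, the aggregation reduces to routing and prompting. The skeleton below illustrates that flow; all function names and the prompt format are hypothetical.

```python
def mexa_answer(query, inputs, experts, reasoner):
    """Training-free expert aggregation (skeleton): pick the experts whose
    modality is present in the input, collect their textual reasoning, and
    let a Large Reasoning Model produce the final answer. `experts` maps a
    modality name to a callable returning text; `reasoner` is an LLM call."""
    reports = []
    for modality, data in inputs.items():   # e.g. {'video': ..., 'audio': ...}
        expert = experts.get(modality)
        if expert is not None:
            reports.append(f"[{modality} expert]\n{expert(data, query)}")
    prompt = (f"Question: {query}\n\n" + "\n\n".join(reports) +
              "\n\nUsing the expert reports above, reason step by step "
              "and give the final answer.")
    return reasoner(prompt)
```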
[474] Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection
Fangling Jiang, Qi Li, Bing Liu, Weining Wang, Caifeng Shan, Zhenan Sun, Ming-Hsuan Yang
Main category: cs.CV
TL;DR: A novel knowledge-based prompt learning framework using vision-language models for 3D mask presentation attack detection, incorporating knowledge graphs and causal graph theory to enhance generalization.
Details
Motivation: Existing 3D mask detection methods face challenges with high multimodal sensor costs and limited generalization. Vision-language multimodal features offer cost-effective universal information but remain unexplored for this task.
Method: Proposes knowledge-based prompt learning that incorporates entities and triples from knowledge graphs into prompts, uses visual-specific knowledge filter with attention mechanism, and applies causal graph theory with spurious correlation elimination to remove category-irrelevant patches.
Result: Achieves state-of-the-art intra- and cross-scenario detection performance on benchmark datasets.
Conclusion: The proposed framework effectively harnesses knowledge from pre-trained vision-language models and demonstrates strong generalization capability for 3D mask presentation attack detection.
Abstract: 3D mask presentation attack detection is crucial for protecting face recognition systems against the rising threat of 3D mask attacks. While most existing methods utilize multimodal features or remote photoplethysmography (rPPG) signals to distinguish between real faces and 3D masks, they face significant challenges, such as the high costs associated with multimodal sensors and limited generalization ability. Detection-related text descriptions offer concise, universal information and are cost-effective to obtain. However, the potential of vision-language multimodal features for 3D mask presentation attack detection remains unexplored. In this paper, we propose a novel knowledge-based prompt learning framework to explore the strong generalization capability of vision-language models for 3D mask presentation attack detection. Specifically, our approach incorporates entities and triples from knowledge graphs into the prompt learning process, generating fine-grained, task-specific explicit prompts that effectively harness the knowledge embedded in pre-trained vision-language models. Furthermore, considering different input images may emphasize distinct knowledge graph elements, we introduce a visual-specific knowledge filter based on an attention mechanism to refine relevant elements according to the visual context. Additionally, we leverage causal graph theory insights into the prompt learning process to further enhance the generalization ability of our method. During training, a spurious correlation elimination paradigm is employed, which removes category-irrelevant local image patches using guidance from knowledge-based text features, fostering the learning of generalized causal prompts that align with category-relevant local patches. Experimental results demonstrate that the proposed method achieves state-of-the-art intra- and cross-scenario detection performance on benchmark datasets.
[475] Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang
Main category: cs.CV
TL;DR: Flow-GRPO integrates online policy gradient RL into flow matching models using ODE-to-SDE conversion and Denoising Reduction strategies, achieving significant improvements in text-to-image generation tasks without reward hacking.
Details
Motivation: To enhance flow matching models with reinforcement learning capabilities for improved text-to-image generation, addressing challenges in compositional generation and visual text rendering.
Method: Uses ODE-to-SDE conversion to enable RL exploration and Denoising Reduction strategy to improve sampling efficiency while maintaining inference performance.
Result: Achieved dramatic improvements: GenEval accuracy increased from 63% to 95% for compositional generation, visual text rendering accuracy improved from 59% to 92%, with substantial gains in human preference alignment and minimal reward hacking.
Conclusion: Flow-GRPO successfully integrates RL with flow matching models, demonstrating significant performance improvements across multiple text-to-image tasks while maintaining image quality and diversity.
Abstract: We propose Flow-GRPO, the first method to integrate online policy gradient reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model’s marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original number of inference steps, significantly improving sampling efficiency without sacrificing performance. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For compositional generation, RL-tuned SD3.5-M generates nearly perfect object counts, spatial relations, and fine-grained attributes, increasing GenEval accuracy from 63% to 95%. In visual text rendering, accuracy improves from 59% to 92%, greatly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, very little reward hacking occurred, meaning rewards did not increase at the cost of appreciable image quality or diversity degradation.
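The ODE-to-SDE conversion rests on a standard identity: injecting noise sigma dW leaves the marginals p_t unchanged if the drift is augmented by (sigma^2/2) grad log p_t. A generic Euler-Maruyama step under that identity is sketched below; how the score is recovered from the flow model is model-specific and omitted here.

```python
import torch

def sde_step(x, t, dt, velocity_fn, score_fn, sigma):
    """Euler-Maruyama step of an SDE that shares the ODE's marginals:
    the sigma*dW noise is compensated by an extra (sigma^2 / 2) * score
    drift term, which preserves p_t (Fokker-Planck equivalence)."""
    drift = velocity_fn(x, t) + 0.5 * sigma ** 2 * score_fn(x, t)
    return x + drift * dt + sigma * (dt ** 0.5) * torch.randn_like(x)
```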
[476] Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
Yifan Shen, Yuanzhe Liu, Jingyuan Zhu, Xu Cao, Xiaofeng Zhang, Yixiao He, Wenming Ye, James Matthew Rehg, Ismini Lourentzou
Main category: cs.CV
TL;DR: SpatialReasoner-R1 is a vision-language model that improves fine-grained spatial reasoning using M3CTS for generating diverse reasoning trajectories and fDPO with spatial rewards for better visual consistency and logical coherence.
Details
Motivation: Current VLMs struggle with fine-grained spatial reasoning requiring multi-step logic and precise spatial alignment.
Method: Uses Multi-Model Monte Carlo Tree Search (M3CTS) to generate diverse Long Chain-of-Thought reasoning trajectories, and fine-grained Direct Preference Optimization (fDPO) with spatial rewards for visual consistency, spatial grounding, and logical coherence.
Result: fDPO achieves 4.1% improvement over standard DPO in spatial quality tasks and 9.0% gain in spatial quantity tasks. SpatialReasoner-R1 sets new SoTA on SPATIALRGPT-Bench with 9.8% higher average accuracy than strongest baseline.
Conclusion: The proposed methods effectively address spatial reasoning limitations in VLMs while maintaining competitive performance on general vision-language tasks.
Abstract: Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose fine-grained Direct Preference Optimization (fDPO), which introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves an average improvement of 4.1% over standard DPO across spatial quality tasks, and a 9.0% gain in spatial quantity tasks. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SPATIALRGPT-Bench, outperforming the strongest baseline by 9.8% in average accuracy, while maintaining competitive performance on general vision-language tasks.
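A minimal sketch of how segment-specific granularity can enter a DPO loss: per-segment log-probability differences (e.g., descriptive-grounding vs. logical-reasoning segments) are combined with segment-level preference weights before the sigmoid. The weighting scheme here is an assumption, not the paper's exact fDPO formulation.

```python
import torch.nn.functional as F

def fdpo_loss(logp_win_segs, logp_lose_segs, ref_win_segs, ref_lose_segs,
              seg_weights, beta=0.1):
    """All *_segs tensors: (batch, num_segments) summed log-probs per
    response segment; seg_weights: (num_segments,) preference weights."""
    diff_win = logp_win_segs - ref_win_segs      # policy vs. reference
    diff_lose = logp_lose_segs - ref_lose_segs
    logits = beta * ((diff_win - diff_lose) * seg_weights).sum(dim=1)
    return -F.logsigmoid(logits).mean()
```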
[477] Zero-Shot Multi-modal Large Language Model v.s. Supervised Deep Learning: A Comparative Study on CT-Based Intracranial Hemorrhage Subtyping
Yinuo Wang, Yue Zeng, Kai Chen, Cai Meng, Chao Pan, Zhouping Tang
Main category: cs.CV
TL;DR: Zero-shot MLLMs underperform traditional deep learning models in ICH classification and subtyping on NCCT scans, though they offer better interpretability.
Details
Motivation: Timely identification of ICH subtypes on non-contrast CT is critical for prognosis and treatment but challenging due to low contrast and blurred boundaries.Method: Compared various MLLMs (GPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet V2) with traditional deep learning models (ResNet50, Vision Transformer) on 192 NCCT volumes using carefully crafted prompts for ICH presence, subtype classification, localization, and volume estimation.
Result: Traditional deep learning models comprehensively outperformed MLLMs in ICH binary classification. For subtype classification, MLLMs showed inferior performance (Gemini 2.0 Flash: macro-averaged precision 0.41, F1 score 0.31).
Conclusion: MLLMs have inferior accuracy in ICH subtyping compared to deep networks but enhance interpretability through language interactions, showing potential in medical imaging analysis. Future work will focus on model refinement for 3D medical image processing.
Abstract: Introduction: Timely identification of intracranial hemorrhage (ICH) subtypes on non-contrast computed tomography is critical for prognosis prediction and therapeutic decision-making, yet remains challenging due to low contrast and blurred boundaries. This study evaluates the performance of zero-shot multi-modal large language models (MLLMs) compared to traditional deep learning methods in ICH binary classification and subtyping. Methods: We utilized a dataset provided by RSNA, comprising 192 NCCT volumes. The study compares various MLLMs, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet V2, with conventional deep learning models, including ResNet50 and Vision Transformer. Carefully crafted prompts were used to guide MLLMs in tasks such as ICH presence, subtype classification, localization, and volume estimation. Results: The results indicate that in the ICH binary classification task, traditional deep learning models outperform MLLMs comprehensively. For subtype classification, MLLMs also exhibit inferior performance compared to traditional deep learning models, with Gemini 2.0 Flash achieving a macro-averaged precision of 0.41 and a macro-averaged F1 score of 0.31. Conclusion: While MLLMs excel in interactive capabilities, their overall accuracy in ICH subtyping is inferior to deep networks. However, MLLMs enhance interpretability through language interactions, indicating potential in medical imaging analysis. Future efforts will focus on model refinement and developing more precise MLLMs to improve performance in three-dimensional medical image processing.
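For readers checking the reported numbers: macro averaging computes precision and F1 per subtype and then averages them unweighted, so rare subtypes count as much as common ones. A self-contained example with toy labels (not the study's data):

```python
from sklearn.metrics import precision_score, f1_score

# Toy subtype labels; macro averaging weights every subtype equally,
# which is why rare classes can drag the scores down.
y_true = ["epidural", "subdural", "subarachnoid", "intraparenchymal", "subdural"]
y_pred = ["subdural", "subdural", "subarachnoid", "subdural", "subdural"]

print(precision_score(y_true, y_pred, average="macro", zero_division=0))
print(f1_score(y_true, y_pred, average="macro", zero_division=0))
```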
[478] Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning
Jiaer Xia, Yuhang Zang, Peng Gao, Sharon Li, Kaiyang Zhou
Main category: cs.CV
TL;DR: Training VLMs with reinforcement learning on visual QA pairs without CoT supervision can lead to shortcut learning. The paper proposes a caption-reason-answer format to mitigate this and achieves SOTA results.
Details
Motivation: To develop general-purpose reasoning capabilities in visual language models using reinforcement learning without explicit chain-of-thought supervision, addressing the shortcut learning problem.Method: Proposes a caption-reason-answer output format where the model first generates detailed image captions, then extensive reasoning chains, trained with reinforcement learning on 273K CoT-free visual QA pairs.
Result: Visionary-R1 outperforms strong multimodal models (GPT-4o, Claude3.5-Sonnet, Gemini-1.5-Pro) on multiple visual reasoning benchmarks.
Conclusion: Encouraging models to interpret images before reasoning through the caption-reason-answer format effectively mitigates shortcut learning and enables robust visual reasoning capabilities.
Abstract: Learning general-purpose reasoning capabilities has long been a challenging problem in AI. Recent research in large language models (LLMs), such as DeepSeek-R1, has shown that reinforcement learning techniques like GRPO can enable pre-trained LLMs to develop reasoning capabilities using simple question-answer pairs. In this paper, we aim to train visual language models (VLMs) to perform reasoning on image data through reinforcement learning and visual question-answer pairs, without any explicit chain-of-thought (CoT) supervision. Our findings indicate that simply applying reinforcement learning to a VLM – by prompting the model to produce a reasoning chain before providing an answer – can lead the model to develop shortcuts from easy questions, thereby reducing its ability to generalize across unseen data distributions. We argue that the key to mitigating shortcut learning is to encourage the model to interpret images prior to reasoning. Therefore, we train the model to adhere to a caption-reason-answer output format: initially generating a detailed caption for an image, followed by constructing an extensive reasoning chain. When trained on 273K CoT-free visual question-answer pairs and using only reinforcement learning, our model, named Visionary-R1, outperforms strong multimodal models, such as GPT-4o, Claude3.5-Sonnet, and Gemini-1.5-Pro, on multiple visual reasoning benchmarks.
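During RL training, a caption-reason-answer format is typically enforced with a rule-based format reward. The tag names and length thresholds below are illustrative assumptions rather than the paper's specification:

```python
import re

# Hypothetical delimiters; the paper prescribes caption -> reason -> answer,
# but the exact tags are an assumption here.
PATTERN = re.compile(
    r"<caption>(.+?)</caption>\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>",
    re.DOTALL,
)

def format_reward(completion: str) -> float:
    """1.0 if the output captions the image before reasoning, else 0.0;
    a too-short caption earns partial credit to discourage shortcuts."""
    m = PATTERN.fullmatch(completion.strip())
    if m is None:
        return 0.0
    caption, reasoning, _answer = m.groups()
    return 1.0 if len(caption.split()) >= 10 and len(reasoning.split()) >= 5 else 0.5
```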
[479] RestoreVAR: Visual Autoregressive Generation for All-in-One Image Restoration
Sudarshan Rajagopalan, Kartik Narayan, Vishal M. Patel
Main category: cs.CV
TL;DR: RestoreVAR is a novel VAR-based generative approach for All-in-One Image Restoration that outperforms LDM-based models in performance while achieving over 10x faster inference speed.
Details
Motivation: LDM-based frameworks suffer from slow inference due to iterative denoising, making them impractical for time-sensitive applications. VAR achieves comparable performance to diffusion transformers with drastically reduced computational costs.Method: Uses visual autoregressive modeling (VAR) with architectural modifications including intricately designed cross-attention mechanisms and a latent-space refinement module tailored for AiOR tasks.
Result: Significantly outperforms LDM-based models in restoration performance while achieving over 10x faster inference. Achieves state-of-the-art performance among generative AiOR methods with strong generalization capabilities.
Conclusion: RestoreVAR provides a superior alternative to LDM-based approaches for All-in-One Image Restoration, offering both better performance and significantly faster inference suitable for practical applications.
Abstract: The use of latent diffusion models (LDMs) such as Stable Diffusion has significantly improved the perceptual quality of All-in-One image Restoration (AiOR) methods, while also enhancing their generalization capabilities. However, these LDM-based frameworks suffer from slow inference due to their iterative denoising process, rendering them impractical for time-sensitive applications. Visual autoregressive modeling (VAR), a recently introduced approach for image generation, performs scale-space autoregression and achieves comparable performance to that of state-of-the-art diffusion transformers with drastically reduced computational costs. Moreover, our analysis reveals that coarse scales in VAR primarily capture degradations while finer scales encode scene detail, simplifying the restoration process. Motivated by this, we propose RestoreVAR, a novel VAR-based generative approach for AiOR that significantly outperforms LDM-based models in restoration performance while achieving over $10\times$ faster inference. To optimally exploit the advantages of VAR for AiOR, we propose architectural modifications and improvements, including intricately designed cross-attention mechanisms and a latent-space refinement module, tailored for the AiOR task. Extensive experiments show that RestoreVAR achieves state-of-the-art performance among generative AiOR methods, while also exhibiting strong generalization capabilities.
[480] CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays
Hyungyung Lee, Geon Choi, Jung-Oh Lee, Hangyul Yoon, Hyuk Gi Hong, Edward Choi
Main category: cs.CV
TL;DR: CheXStruct and CXReasonBench provide a structured pipeline and benchmark for evaluating clinically meaningful reasoning in Large Vision-Language Models on chest X-ray diagnosis, going beyond final answers to assess intermediate reasoning steps.
Details
Motivation: Existing medical AI benchmarks focus mainly on final diagnostic answers without providing insight into whether models engage in clinically meaningful reasoning processes.Method: Built on MIMIC-CXR-JPG dataset, CheXStruct automatically derives intermediate reasoning steps from chest X-rays including anatomical segmentation, landmark identification, diagnostic measurements, index computation, and clinical threshold application.
Result: Evaluation of 12 LVLMs shows they struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation.
Conclusion: The benchmark enables fine-grained assessment of diagnostic reasoning and reveals significant gaps in current models’ ability to perform clinically valid reasoning steps.
Abstract: Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of the 12 evaluated LVLMs struggles with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at https://github.com/ttumyche/CXReasonBench
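Many of the pipeline's intermediate steps are standard radiographic index-plus-threshold computations. As a hypothetical example of that pattern (the cardiothoracic ratio with its conventional 0.5 cardiomegaly cutoff; not necessarily one of CheXStruct's exact tasks):

```python
import numpy as np

def cardiothoracic_ratio(heart_mask: np.ndarray, thorax_mask: np.ndarray):
    """Widest horizontal extent of the heart over that of the thorax, from
    boolean (H, W) masks produced by an anatomical segmentation step."""
    heart_cols = np.flatnonzero(heart_mask.any(axis=0))
    thorax_cols = np.flatnonzero(thorax_mask.any(axis=0))
    ctr = (heart_cols[-1] - heart_cols[0]) / (thorax_cols[-1] - thorax_cols[0])
    return ctr, bool(ctr > 0.5)   # CTR > 0.5 is the conventional cardiomegaly cutoff
```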
[481] GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains
Chun Wang, Xiaojun Ye, Xiaoran Pan, Zihao Pan, Haofan Wang, Yiren Song
Main category: cs.CV
TL;DR: The paper proposes GRE Suite, a framework that enhances VLMs with structured reasoning chains for geo-localization, achieving superior performance through multi-stage reasoning and comprehensive evaluation.
Details
Motivation: Geo-localization requires extracting multigranular visual cues and integrating world knowledge, but current approaches lack robust reasoning mechanisms and explainability.Method: Developed GRE Suite across three dimensions: GRE30K dataset for fine-grained analysis, GRE model with multi-stage reasoning strategy, and GREval-Bench evaluation framework.
Result: GRE significantly outperforms existing methods across all granularities of geo-localization tasks, demonstrating efficacy of reasoning-augmented VLMs.
Conclusion: The proposed reasoning-augmented approach effectively addresses geo-localization challenges and provides interpretable location inference.
Abstract: Recent advances in Visual Language Models (VLMs) have demonstrated exceptional performance in visual reasoning tasks. However, geo-localization presents unique challenges, requiring the extraction of multigranular visual cues from images and their integration with external world knowledge for systematic reasoning. Current approaches to geo-localization tasks often lack robust reasoning mechanisms and explainability, limiting their effectiveness. To address these limitations, we propose the Geo Reason Enhancement (GRE) Suite, a novel framework that augments VLMs with structured reasoning chains for accurate and interpretable location inference. The GRE Suite is systematically developed across three key dimensions: dataset, model, and benchmark. First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down potential geographic regions with enhanced precision. Finally, we construct the Geo Reason Evaluation Benchmark (GREval-Bench), a comprehensive evaluation framework that assesses VLMs across diverse urban, natural, and landmark scenes to measure both coarse-grained (e.g., country, continent) and fine-grained (e.g., city, street) localization performance. Experimental results demonstrate that GRE significantly outperforms existing methods across all granularities of geo-localization tasks, underscoring the efficacy of reasoning-augmented VLMs in complex geographic inference. Code and data will be released at https://github.com/Thorin215/GRE.
[482] Attention! Your Vision Language Model Could Be Maliciously Manipulated
Xiaosen Wang, Shaokang Wang, Zhijin Ge, Yuyang Luo, Shudong Zhang
Main category: cs.CV
TL;DR: VMA is a novel adversarial attack method for Vision-Language Models that uses momentum optimization and differentiable transformations to manipulate outputs through imperceptible image perturbations, enabling both malicious attacks and copyright protection.
Details
Motivation: VLMs are vulnerable to adversarial examples that can cause jailbreaking, hijacking, and hallucinations, but existing attacks are limited. The paper aims to develop a more effective attack method that can precisely manipulate VLM outputs.Method: Proposes VMA attack that integrates first-order and second-order momentum optimization with differentiable transformation mechanisms to optimize adversarial perturbations in images.
Result: Extensive evaluations show VMA is effective and generalizable across diverse scenarios and datasets, enabling various attacks like jailbreaking, hijacking, privacy breaches, and also copyright protection through watermark injection.
Conclusion: VMA demonstrates VLMs’ significant vulnerability to image-based adversarial attacks and provides a powerful tool that can be used both maliciously and beneficially for security applications.
Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success in understanding complex real-world scenarios and supporting data-driven decision-making processes. However, VLMs exhibit significant vulnerability against adversarial examples, either text or image, which can lead to various adversarial outcomes, e.g., jailbreaking, hijacking, and hallucination. In this work, we empirically and theoretically demonstrate that VLMs are particularly susceptible to image-based adversarial examples, where imperceptible perturbations can precisely manipulate each output token. To this end, we propose a novel attack called Vision-language model Manipulation Attack (VMA), which integrates first-order and second-order momentum optimization techniques with a differentiable transformation mechanism to effectively optimize the adversarial perturbation. Notably, VMA can be a double-edged sword: it can be leveraged to implement various attacks, such as jailbreaking, hijacking, privacy breaches, Denial-of-Service, and the generation of sponge examples, while simultaneously enabling the injection of watermarks for copyright protection. Extensive empirical evaluations substantiate the efficacy and generalizability of VMA across diverse scenarios and datasets. Code is available at https://github.com/Trustworthy-AI-Group/VMA.
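The combination of first- and second-order momentum suggests an Adam-style update on the perturbation. A generic sketch under that assumption, omitting VMA's differentiable transformations (the loss, budget eps, and step size alpha are placeholders):

```python
import torch

def momentum_attack(loss_fn, image, steps=10, eps=8 / 255, alpha=2 / 255,
                    beta1=0.9, beta2=0.999):
    """Iterative perturbation with first- and second-moment accumulators
    (an Adam-like update); loss_fn maps a perturbed image to the scalar
    objective being maximized, e.g., log-likelihood of a target output."""
    delta = torch.zeros_like(image, requires_grad=True)
    m = torch.zeros_like(image)
    v = torch.zeros_like(image)
    for _ in range(steps):
        grad, = torch.autograd.grad(loss_fn(image + delta), delta)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        step = m / (v.sqrt() + 1e-8)
        delta = (delta + alpha * step.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (image + delta).clamp(0, 1).detach()
```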
[483] PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding
Ansel Blume, Jeonghwan Kim, Hyeonjeong Ha, Elen Chatikyan, Xiaomeng Jin, Khanh Duy Nguyen, Nanyun Peng, Kai-Wei Chang, Derek Hoiem, Heng Ji
Main category: cs.CV
TL;DR: PARTONOMY is a benchmark for pixel-level part grounding in LMMs, revealing significant limitations in current models. The paper proposes PLUM, a novel segmenting LMM that addresses architectural shortcomings and outperforms existing methods.
Details
Motivation: Real-world objects have distinctive parts crucial for fine-grained reasoning, but current LMMs struggle with part identification. Existing datasets don't adequately test part-whole relationships and visual justifications.Method: Created PARTONOMY benchmark from existing datasets and new annotations (862 part labels, 534 object labels). Proposed PLUM model using span tagging instead of segmentation tokens and conditioning on prior predictions in a feedback loop.
Result: State-of-the-art LMMs perform poorly (e.g., LISA-13B achieves only 5.9% gIoU). PLUM outperforms existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. PLUM finetuned on Explanatory Part Segmentation is competitive with models trained on more data.
Conclusion: The work reveals critical gaps in LMMs’ part grounding abilities and opens new avenues for fine-grained visual understanding through improved architectural approaches.
Abstract: Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning, yet large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 part labels and 534 object labels for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY uses specialized concepts (e.g., agricultural airplane), and challenges models to compare objects’ parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only 5.9% gIoU), highlighting a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining, which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. We find that pretrained PLUM outperforms existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.
[484] DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding
Weihao Xuan, Junjue Wang, Heli Qi, Zihang Chen, Zhuo Zheng, Yanfei Zhong, Junshi Xia, Naoto Yokoya
Main category: cs.CV
TL;DR: DVL-Suite is a comprehensive framework for analyzing long-term urban dynamics using multimodal large language models (MLLMs) with remote sensing imagery spanning 2005-2023 across 42 US cities.
Details
Motivation: Current MLLMs have limited application to long-term Earth observation analysis, primarily focusing on single-temporal or bi-temporal imagery rather than comprehensive multi-temporal urban dynamics.Method: Created DVL-Suite with 14,871 high-resolution multi-temporal images organized into DVL-Bench (6 urban understanding tasks) and DVL-Instruct (instruction-tuning dataset), then developed DVLChat model for both image-level QA and pixel-level segmentation.
Result: Evaluation of 18 state-of-the-art MLLMs revealed limitations in long-term temporal understanding and quantitative analysis, motivating the creation of specialized training data and baseline model.
Conclusion: DVL-Suite addresses the gap in long-term urban dynamics analysis and provides a foundation for enhancing MLLM capabilities in multi-temporal Earth observation through specialized datasets and models.
Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in visual understanding, but their application to long-term Earth observation analysis remains limited, primarily focusing on single-temporal or bi-temporal imagery. To address this gap, we introduce DVL-Suite, a comprehensive framework for analyzing long-term urban dynamics through remote sensing imagery. Our suite comprises 14,871 high-resolution (1.0m) multi-temporal images spanning 42 major cities in the U.S. from 2005 to 2023, organized into two components: DVL-Bench and DVL-Instruct. The DVL-Bench includes six urban understanding tasks, from fundamental change detection (pixel-level) to quantitative analyses (regional-level) and comprehensive urban narratives (scene-level), capturing diverse urban dynamics including expansion/transformation patterns, disaster assessment, and environmental challenges. We evaluate 18 state-of-the-art MLLMs and reveal their limitations in long-term temporal understanding and quantitative analysis. These challenges motivate the creation of DVL-Instruct, a specialized instruction-tuning dataset designed to enhance models’ capabilities in multi-temporal Earth observation. Building upon this dataset, we develop DVLChat, a baseline model capable of both image-level question-answering and pixel-level segmentation, facilitating a comprehensive understanding of city dynamics through language interactions.
[485] Navigating the Accuracy-Size Trade-Off with Flexible Model Merging
Akash Dhasade, Divyansh Jhunjhunwala, Milos Vujasinovic, Gauri Joshi, Anne-Marie Kermarrec
Main category: cs.CV
TL;DR: FlexMerge is a data-free model merging framework that generates merged models of varying sizes and supports multiple merging algorithms, enabling systematic study of accuracy-size trade-offs across vision and NLP tasks.
Details
Motivation: Model merging combines single-task fine-tuned models efficiently but suffers from accuracy gaps when merged into one model, while deploying all individual models incurs high storage costs.Method: FlexMerge framework that flexibly generates merged models across size spectrum (from single model to all fine-tuned models) and supports multiple merging algorithms in unified framework.
Result: Key findings: modest size increases yield steep accuracy gains (up to 13.5% when just doubling the size), and algorithm rankings change with size: some methods overtake others beyond the single-model regime.
Conclusion: Reveals new design dimension for model merging: developing and comparing algorithms across full size spectrum rather than only at single-model limit, with practical applications across vision and NLP benchmarks.
Abstract: Model merging has emerged as an efficient method to combine multiple single-task fine-tuned models. The merged model can enjoy multi-task capabilities without expensive training. While promising, merging into a single model often suffers from an accuracy gap with respect to the fine-tuned models. On the other hand, deploying all individual fine-tuned models incurs high storage costs. We propose FlexMerge, a novel data-free model merging framework that: (a) flexibly generates merged models of varying sizes, spanning the full spectrum from a single merged model to retaining all fine-tuned models; and (b) supports multiple merging algorithms in a unified framework. Using FlexMerge, we systematically characterize the accuracy-size trade-off of different algorithms. Our study reveals two key findings: first, even modestly larger merged models can yield steep accuracy gains (up to 13.5% when just doubling the size); second, algorithm rankings are not consistent as size increases, with some methods overtaking others beyond the one-model regime. These results uncover a new design dimension for model merging: developing and comparing algorithms across the full spectrum of sizes rather than only at the single-model limit. Extensive experiments on vision and NLP benchmarks, with up to 30 tasks, confirm the generality and practicality of FlexMerge.
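A toy rendition of the accuracy-size dial: merge most parameter blocks by task-vector averaging, but keep per-task copies of the blocks where the fine-tuned models disagree most. The selection heuristic and data layout are assumptions for illustration, not FlexMerge's published algorithm:

```python
import torch

def flexible_merge(base, finetuned, merge_frac=0.5):
    """base: dict name -> tensor; finetuned: list of such dicts (>= 2 models).
    Blocks whose task vectors disagree most are kept per-task; the rest
    are merged by simple task-vector averaging."""
    names = list(base.keys())
    disagreement = {
        n: torch.stack([ft[n] - base[n] for ft in finetuned]).var(dim=0).mean().item()
        for n in names
    }
    n_keep = int((1 - merge_frac) * len(names))
    keep = set(sorted(names, key=lambda n: disagreement[n], reverse=True)[:n_keep])
    merged = {}
    for n in names:
        if n in keep:     # one copy per task -> larger deployment
            merged[n] = [ft[n].clone() for ft in finetuned]
        else:             # single shared block -> smaller deployment
            merged[n] = base[n] + torch.stack(
                [ft[n] - base[n] for ft in finetuned]).mean(dim=0)
    return merged
```

Sweeping `merge_frac` from 0 to 1 traces out exactly the accuracy-size curve the paper studies.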
[486] Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition
Yu Li, Jin Jiang, Jianhua Zhu, Shuai Peng, Baole Wei, Yuxuan Zhou, Liangcai Gao
Main category: cs.CV
TL;DR: Uni-MuMER is a novel approach that fine-tunes vision-language models for handwritten mathematical expression recognition, achieving state-of-the-art performance through three integrated data-driven tasks.
Details
Motivation: HMER faces challenges due to free symbol layouts and handwriting variability. Previous methods used isolated architectural changes that were hard to integrate, while pretrained VLMs offer strong cross-task generalization potential for unified solutions.Method: Fully fine-tunes a VLM without architectural modifications, integrating three tasks: Tree-Aware Chain-of-Thought for spatial reasoning, Error-Driven Learning to reduce confusion among similar characters, and Symbol Counting for recognition consistency in long expressions.
Result: Achieves new state-of-the-art performance on the CROHME and HME100K datasets, outperforming SSAN by 16.31% and Gemini2.5-flash by 24.42% in the zero-shot setting.
Conclusion: Uni-MuMER demonstrates that fully fine-tuning VLMs without architectural changes can effectively inject domain-specific knowledge into generalist frameworks, providing a unified solution for HMER with superior performance.
Abstract: Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layouts and variability in handwriting styles. Prior methods have faced performance bottlenecks by proposing isolated architectural modifications, making them difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, outperforming the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: https://github.com/BFlameSwift/Uni-MuMER
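Of the three auxiliary tasks, Symbol Counting is the simplest to picture: supervise the model with per-symbol tallies of the ground-truth LaTeX so long expressions stay internally consistent. A deliberately naive tokenizer for illustration (the paper's actual counting scheme may differ):

```python
import re
from collections import Counter

def latex_symbol_counts(latex: str) -> Counter:
    """Tally LaTeX commands and bare characters; such counts can serve as
    an auxiliary supervision signal for long expressions."""
    return Counter(re.findall(r"\\[A-Za-z]+|[^\s{}]", latex))

print(latex_symbol_counts(r"\frac{a+b}{2} = \sqrt{ab}"))
# Counter({'a': 2, 'b': 2, '\\frac': 1, '+': 1, '2': 1, '=': 1, '\\sqrt': 1})
```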
[487] Radiant Triangle Soup with Soft Connectivity Forces for 3D Reconstruction and Novel View Synthesis
Nathaniel Burgdorfer, Philippos Mordohai
Main category: cs.CV
TL;DR: Triangle soup representation with per-vertex Spherical Harmonics for scene optimization, using soft connectivity forces between triangles to improve geometric accuracy while maintaining visual quality.
Details
Motivation: Current representations like full-rank Gaussian kernels lack natural surface properties and require many primitives for complex geometry. Triangles provide a locally-flat surface proxy that can better represent complex surfaces with fewer primitives.Method: Uses triangle soup (disconnected translucent triangles) with per-vertex Spherical Harmonics for appearance. Enforces soft connectivity forces between triangles during optimization to encourage surface continuity while maintaining flexibility.
Result: Shows improvements in geometric accuracy compared to state-of-the-art methods on 3D reconstruction and novel view synthesis datasets, without sacrificing visual fidelity.
Conclusion: Triangle soup with soft connectivity forces provides an effective representation for scene optimization that balances geometric accuracy and visual quality better than existing approaches.
Abstract: We introduce an inference-time scene optimization algorithm utilizing triangle soup, a collection of disconnected translucent triangle primitives, as the representation for the geometry and appearance of a scene. Unlike full-rank Gaussian kernels, triangles are a natural, locally-flat proxy for surfaces that can be connected to achieve highly complex geometry. When coupled with per-vertex Spherical Harmonics (SH), triangles provide a rich visual representation without incurring an expensive increase in primitives. We leverage our new representation to incorporate optimization objectives and enforce spatial regularization directly on the underlying primitives. The main differentiator of our approach is the definition and enforcement of soft connectivity forces between triangles during optimization, encouraging explicit, but soft, surface continuity in 3D. Experiments on representative 3D reconstruction and novel view synthesis datasets show improvements in geometric accuracy compared to current state-of-the-art algorithms without sacrificing visual fidelity.
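The soft connectivity forces can be read as a robust attraction between nearby vertices of different triangles, strong when a pair is almost touching and vanishing with distance, so continuity is encouraged without hard welding. A minimal sketch under that reading (pair selection and the kernel width sigma are assumptions):

```python
import torch

def soft_connectivity_loss(vertices, pairs, sigma=0.01):
    """vertices: (V, 3) soup vertices; pairs: (P, 2) indices of candidate
    vertex pairs from different triangles (e.g., nearest neighbors).
    Gaussian weighting makes the pull act only on nearly-touching pairs."""
    d = (vertices[pairs[:, 0]] - vertices[pairs[:, 1]]).norm(dim=-1)
    return (d ** 2 * torch.exp(-(d ** 2) / (2 * sigma ** 2))).mean()
```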
[488] Representational Difference Explanations
Neehar Kondapaneni, Oisin Mac Aodha, Pietro Perona
Main category: cs.CV
TL;DR: RDX is a method for discovering and visualizing differences between learned representations to enable interpretable model comparisons, outperforming existing XAI techniques.
Details
Motivation: Current post hoc XAI methods struggle to effectively support model comparison, which is fundamental to scientific analysis in machine learning.Method: Representational Differences Explanations (RDX) - a technique for discovering and visualizing differences between two learned representations.
Result: RDX successfully recovers meaningful conceptual differences where existing XAI techniques fail, revealing insightful representational differences and subtle data patterns on ImageNet and iNaturalist datasets.
Conclusion: RDX addresses the gap in model comparison tools by providing an effective and explainable method for contrasting model representations.
Abstract: We propose a method for discovering and visualizing the differences between two learned representations, enabling more direct and interpretable model comparisons. We validate our method, which we call Representational Differences Explanations (RDX), by using it to compare models with known conceptual differences and demonstrate that it recovers meaningful distinctions where existing explainable AI (XAI) techniques fail. Applied to state-of-the-art models on challenging subsets of the ImageNet and iNaturalist datasets, RDX reveals both insightful representational differences and subtle patterns in the data. Although comparison is a cornerstone of scientific analysis, current tools in machine learning, namely post hoc XAI methods, struggle to support model comparison effectively. Our work addresses this gap by introducing an effective and explainable tool for contrasting model representations.
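In a similar spirit, though not the authors' algorithm, one cheap way to surface representational differences is to compare the two models' cosine-similarity structure over a shared probe set and inspect the examples where the rows disagree most:

```python
import torch
import torch.nn.functional as F

def most_divergent_examples(feats_a, feats_b, k=10):
    """feats_a, feats_b: (N, D_a) and (N, D_b) embeddings of the same N
    probe images from two models. Large per-row differences between the
    similarity matrices mark examples the models 'see' differently."""
    sim_a = F.normalize(feats_a, dim=-1) @ F.normalize(feats_a, dim=-1).T
    sim_b = F.normalize(feats_b, dim=-1) @ F.normalize(feats_b, dim=-1).T
    divergence = (sim_a - sim_b).abs().mean(dim=1)
    return divergence.topk(k).indices
```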
[489] HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model
Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilchae Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
Main category: cs.CV
TL;DR: The paper introduces HoliSafe, a comprehensive safety dataset and benchmark for Vision-Language Models (VLMs), and proposes a visual guard module (VGM) that enhances VLM safety through modular architecture and interpretable harmfulness classification.
Details
Motivation: Current VLM safety approaches have two main shortcomings: 1) existing safety-tuning datasets only partially consider image-text interaction risks, leaving VLMs vulnerable to unseen jailbreak attacks; 2) prior methods rely mainly on data-centric tuning with limited architectural innovations for intrinsic safety.Method: The authors propose HoliSafe dataset covering all five safe/unsafe image-text combinations, and introduce a modular visual guard module (VGM) that assesses input image harmfulness. VGM enables VLMs to generate safer responses while providing interpretable harmfulness classification for refusal decisions.
Result: Safe-VLM with VGM trained on HoliSafe achieves state-of-the-art safety performance across multiple VLM benchmarks. HoliSafe-Bench also reveals critical vulnerabilities in existing VLM models.
Conclusion: HoliSafe and VGM provide a foundation for robust and interpretable VLM safety, expanding avenues for multimodal alignment research. The modular approach allows seamless integration with diverse pre-trained VLMs.
Abstract: Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation (HoliSafe-Bench). We further propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images for VLMs. This module endows VLMs with a dual functionality: they not only learn to generate safer responses but can also provide an interpretable harmfulness classification to justify their refusal decisions. A significant advantage of this approach is its modularity; the VGM is designed as a plug-in component, allowing for seamless integration with diverse pre-trained VLMs across various scales. Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe-Bench itself reveals critical vulnerabilities in existing VLM models. We hope that HoliSafe and VGM will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.
[490] Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations
Gaia Di Lorenzo, Federico Tombari, Marc Pollefeys, Daniel Barath
Main category: cs.CV
TL;DR: Object-X is a versatile multi-modal 3D object representation framework that encodes various modalities (images, point clouds, text) into embeddings that can be decoded into detailed geometric and visual reconstructions while supporting multiple downstream tasks.
Details
Motivation: Existing methods rely on task-specific embeddings that are either tailored for semantic understanding or geometric reconstruction, making them unable to be decoded into explicit geometry and reused across different tasks.Method: Object-X geometrically grounds captured modalities in a 3D voxel grid and learns an unstructured embedding that fuses voxel information with object attributes, enabling 3D Gaussian Splatting-based reconstruction and supporting various downstream tasks.
Result: Object-X produces high-fidelity novel-view synthesis comparable to standard 3D Gaussian Splatting while significantly improving geometric accuracy. It achieves competitive performance in scene alignment and localization, with object-centric descriptors requiring 3-4 orders of magnitude less storage than traditional approaches.
Conclusion: Object-X establishes itself as a scalable and highly practical solution for multi-modal 3D scene representation, offering versatile embeddings that support both reconstruction and various downstream tasks with significantly reduced storage requirements.
Abstract: Learning effective multi-modal 3D representations of objects is essential for numerous applications, such as augmented reality and robotics. Existing methods often rely on task-specific embeddings that are tailored either for semantic understanding or geometric reconstruction. As a result, these embeddings typically cannot be decoded into explicit geometry and simultaneously reused across tasks. In this paper, we propose Object-X, a versatile multi-modal object representation framework capable of encoding rich object embeddings (e.g. images, point cloud, text) and decoding them back into detailed geometric and visual reconstructions. Object-X operates by geometrically grounding the captured modalities in a 3D voxel grid and learning an unstructured embedding fusing the information from the voxels with the object attributes. The learned embedding enables 3D Gaussian Splatting-based object reconstruction, while also supporting a range of downstream tasks, including scene alignment, single-image 3D object reconstruction, and localization. Evaluations on two challenging real-world datasets demonstrate that Object-X produces high-fidelity novel-view synthesis comparable to standard 3D Gaussian Splatting, while significantly improving geometric accuracy. Moreover, Object-X achieves competitive performance with specialized methods in scene alignment and localization. Critically, our object-centric descriptors require 3-4 orders of magnitude less storage compared to traditional image- or point cloud-based approaches, establishing Object-X as a scalable and highly practical solution for multi-modal 3D scene representation.
[491] Synthesize Privacy-Preserving High-Resolution Images via Private Textual Intermediaries
Haoxiang Wang, Zinan Lin, Da Yu, Huishuai Zhang
Main category: cs.CV
TL;DR: SPTI generates high-resolution differentially private synthetic images by converting images to text, applying DP text generation, and reconstructing images, achieving better quality than prior methods without model training.
Details
Motivation: Existing DP image synthesis methods struggle with high-resolution outputs that faithfully capture original data structure, creating a need for better privacy-preserving visual data sharing.Method: SPTI summarizes private images into text using image-to-text models, applies modified Private Evolution algorithm for DP text generation, then reconstructs images using text-to-image models without requiring model training.
Result: SPTI produces substantially higher quality synthetic images than prior DP approaches, achieving FID of 26.71 on LSUN Bedroom (vs 40.36) and 33.27 on MM CelebA HQ (vs 57.01) at epsilon=1.0.
Conclusion: SPTI provides a resource-efficient framework for generating high-resolution DP synthetic images, greatly expanding access to private visual datasets through proprietary model compatibility.
Abstract: Generating high-fidelity, differentially private (DP) synthetic images offers a promising route to share and analyze sensitive visual data without compromising individual privacy. However, existing DP image synthesis methods struggle to produce high-resolution outputs that faithfully capture the structure of the original data. In this paper, we introduce a novel method, referred to as Synthesis via Private Textual Intermediaries (SPTI), that can generate high-resolution DP images with easy adoption. The key idea is to shift the challenge of DP image synthesis from the image domain to the text domain by leveraging state-of-the-art DP text generation methods. SPTI first summarizes each private image into a concise textual description using image-to-text models, then applies a modified Private Evolution algorithm to generate DP text, and finally reconstructs images using text-to-image models. Notably, SPTI requires no model training, only inference with off-the-shelf models. Given a private dataset, SPTI produces synthetic images of substantially higher quality than prior DP approaches. On the LSUN Bedroom dataset, SPTI attains an FID of 26.71 under $\epsilon = 1.0$, improving over the Private Evolution FID of 40.36. Similarly, on MM CelebA HQ, SPTI achieves an FID of 33.27 at $\epsilon = 1.0$, compared to 57.01 from DP fine-tuning baselines. Overall, our results demonstrate that Synthesis via Private Textual Intermediaries provides a resource-efficient and proprietary-model-compatible framework for generating high-resolution DP synthetic images, greatly expanding access to private visual datasets.
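The pipeline reduces to three stages, with the entire privacy budget spent in the text step. A schematic with placeholder callables (`caption_model`, `dp_text_gen`, and `t2i_model` are assumed interfaces, not SPTI's actual API):

```python
def spti(private_images, caption_model, dp_text_gen, t2i_model, epsilon=1.0):
    """Schematic only: the three callables stand in for the off-the-shelf
    image-to-text model, the modified Private Evolution algorithm, and the
    text-to-image model."""
    captions = [caption_model(img) for img in private_images]   # image -> text
    dp_captions = dp_text_gen(captions, epsilon=epsilon)        # all DP noise here
    return [t2i_model(c) for c in dp_captions]                  # text -> image
```

Because the reconstruction step never touches private data, the DP guarantee of the text stage carries over to the images by post-processing.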
[492] Vision Transformers Don’t Need Trained Registers
Nick Jiang, Amil Dravid, Alexei Efros, Yossi Gandelsman
Main category: cs.CV
TL;DR: The paper identifies that high-norm tokens in Vision Transformers cause noisy attention maps due to specific register neurons, and proposes a training-free method to mitigate this by shifting high-norm activations to additional untrained tokens.
Details
Motivation: To understand and address the phenomenon of high-norm tokens causing noisy attention patterns in Vision Transformers (like CLIP, DINOv2) without requiring model retraining.Method: Shift high-norm activations from identified register neurons into additional untrained tokens, mimicking the effect of register tokens without retraining. Also extends this to vision-language models for cleaner text-to-image attribution.
Result: The method produces cleaner attention and feature maps, improves performance across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens.
Conclusion: Test-time registers effectively substitute for register tokens at inference time, providing a training-free solution for pre-trained models that lack register tokens.
Abstract: We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers - the emergence of high-norm tokens that lead to noisy attention maps (Darcet et al., 2024). We observe that in multiple models (e.g., CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to irregular attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned register tokens, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered register neurons into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. We demonstrate that our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens. We then extend test-time registers to off-the-shelf vision-language models, yielding cleaner attention-based, text-to-image attribution. Finally, we outline a simple mathematical model that reflects the observed behavior of register neurons and high norm tokens. Our results suggest that test-time registers effectively take on the role of register tokens at test-time, offering a training-free solution for any pre-trained model released without them.
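The mechanism is simple enough to prototype as an edit on a ViT block's activations: append one extra untrained token and route the spiking channels of the identified register neurons into it. The threshold and bookkeeping below are illustrative, not the paper's exact procedure:

```python
import torch

def add_test_time_register(acts, register_neurons, threshold=6.0):
    """acts: (B, N, D) patch activations from an intermediate ViT block.
    Append one untrained register token and move the outlier mass of the
    identified register neurons into it."""
    B, N, D = acts.shape
    register = torch.zeros(B, 1, D, dtype=acts.dtype, device=acts.device)
    acts = acts.clone()                         # avoid mutating the caller's tensor
    for ch in register_neurons:
        spikes = acts[..., ch].abs() > threshold                   # (B, N)
        register[..., ch] = (acts[..., ch] * spikes).sum(dim=1, keepdim=True)
        acts[..., ch] = acts[..., ch] * (~spikes)                  # zero the spikes
    return torch.cat([acts, register], dim=1)                      # (B, N + 1, D)
```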
[493] Structural-Spectral Graph Convolution with Evidential Edge Learning for Hyperspectral Image Clustering
Jianhan Qi, Yuheng Jia, Hui Liu, Junhui Hou
Main category: cs.CV
TL;DR: The paper proposes a novel method for hyperspectral image clustering that combines structural-spectral graph convolutional operators with evidence-guided adaptive edge learning to improve clustering accuracy.
Details
Motivation: Existing GNN-based methods for HSI clustering cannot fully exploit spectral information and suffer from inaccurate superpixel topological graphs, leading to semantic confusion during information aggregation.Method: Proposes SSGCO for co-extraction of spatial and spectral features, and EGAEL module for adaptive edge weight prediction and refinement, integrated into a contrastive learning framework for simultaneous representation learning and clustering.
Result: Improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over best compared methods on four HSI datasets.
Conclusion: The proposed SSGCO and EGAEL framework effectively addresses limitations of existing GNN-based HSI clustering methods and achieves superior performance.
Abstract: Hyperspectral image (HSI) clustering groups pixels into clusters without labeled data, which is an important yet challenging task. For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering based on graph neural networks (GNNs). However, existing GNNs cannot fully exploit the spectral information of the input HSI, and the inaccurate superpixel topological graph may lead to the confusion of different class semantics during information aggregation. To address these challenges, we first propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels to improve their representation quality through the co-extraction of spatial and spectral features. Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel topological graph. We integrate the proposed method into a contrastive learning framework to achieve clustering, where representation learning and clustering are simultaneously conducted. Experiments demonstrate that the proposed method improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best compared methods on four HSI datasets. Our code is available at https://github.com/jhqi/SSGCO-EGAEL.
[494] 3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks
Xiaotang Gai, Jiaxiang Liu, Yichen Li, Zijie Meng, Jian Wu, Zuozhu Liu
Main category: cs.CV
TL;DR: 3D-RAD is a large-scale dataset for 3D medical visual question answering using CT scans, featuring six diverse VQA tasks including temporal analysis and computational reasoning, with evaluations showing current VLMs struggle with 3D diagnostic reasoning.
Details
Motivation: Existing Med-VQA efforts primarily focus on 2D imaging with limited task diversity, lacking comprehensive 3D diagnostic reasoning capabilities needed for clinical decision support.Method: Created 3D-RAD dataset with six VQA tasks (anomaly detection, image observation, medical computation, existence detection, static temporal diagnosis, longitudinal temporal diagnosis) using radiology CT scans, supporting both open- and closed-ended questions with complex reasoning challenges.
Result: Evaluations show existing vision-language models, especially medical VLMs, exhibit limited generalization particularly in multi-temporal tasks. Fine-tuning on the 3D-RAD-T training set (136,195 samples) significantly enhances model performance.
Conclusion: 3D-RAD establishes a robust foundation for 3D medical visual understanding and aims to catalyze multimodal medical AI research, with publicly available dataset and code to drive future advancements in clinical decision support.
Abstract: Medical Visual Question Answering (Med-VQA) holds significant potential for clinical decision support, yet existing efforts primarily focus on 2D imaging with limited task diversity. This paper presents 3D-RAD, a large-scale dataset designed to advance 3D Med-VQA using radiology CT scans. The 3D-RAD dataset encompasses six diverse VQA tasks: anomaly detection, image observation, medical computation, existence detection, static temporal diagnosis, and longitudinal temporal diagnosis. It supports both open- and closed-ended questions while introducing complex reasoning challenges, including computational tasks and multi-stage temporal analysis, to enable comprehensive benchmarking. Extensive evaluations demonstrate that existing vision-language models (VLMs), especially medical VLMs, exhibit limited generalization, particularly in multi-temporal tasks, underscoring the challenges of real-world 3D diagnostic reasoning. To drive future advancements, we release a high-quality training set 3D-RAD-T of 136,195 expert-aligned samples, showing that fine-tuning on this dataset could significantly enhance model performance. Our dataset and code, aiming to catalyze multimodal medical AI research and establish a robust foundation for 3D medical visual understanding, are publicly available at https://github.com/Tang-xiaoxiao/3D-RAD.
[495] AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, Jiaqi Ma
Main category: cs.CV
TL;DR: AutoVLA is a Vision-Language-Action model that unifies reasoning and action generation for autonomous driving using autoregressive generation, tokenized trajectories, and dual thinking modes with reinforcement fine-tuning.
Details
Motivation: Current VLA models struggle with physically infeasible actions, complex structures, and unnecessarily long reasoning for autonomous driving tasks.Method: Unifies reasoning and action generation in single autoregressive model; tokenizes continuous trajectories into discrete actions; uses supervised fine-tuning with dual thinking modes (fast/slow); applies GRPO-based reinforcement fine-tuning to reduce unnecessary reasoning.
Result: Competitive performance demonstrated across nuPlan, nuScenes, Waymo, and CARLA benchmarks in both open-loop and closed-loop settings; shows adaptive reasoning and accurate planning in diverse scenarios.
Conclusion: AutoVLA effectively addresses limitations of current VLA models through unified architecture, tokenized actions, and optimized reasoning strategies, achieving strong performance in autonomous driving applications.
Abstract: Recent advancements in Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving by leveraging world knowledge and reasoning capabilities. However, current VLA models often struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning. In this paper, we propose AutoVLA, a novel VLA model that unifies reasoning and action generation within a single autoregressive generation model for end-to-end autonomous driving. AutoVLA performs semantic reasoning and trajectory planning directly from raw visual inputs and language instructions. We tokenize continuous trajectories into discrete, feasible actions, enabling direct integration into the language model. For training, we employ supervised fine-tuning to equip the model with dual thinking modes: fast thinking (trajectory-only) and slow thinking (enhanced with chain-of-thought reasoning). To further enhance planning performance and efficiency, we introduce a reinforcement fine-tuning method based on Group Relative Policy Optimization (GRPO), reducing unnecessary reasoning in straightforward scenarios. Extensive experiments across real-world and simulated datasets and benchmarks, including nuPlan, nuScenes, Waymo, and CARLA, demonstrate the competitive performance of AutoVLA in both open-loop and closed-loop settings. Qualitative results showcase the adaptive reasoning and accurate planning capabilities of AutoVLA in diverse scenarios.
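The trajectory tokenization step can be pictured as uniform quantization of planar waypoints into a discrete vocabulary that the language model emits autoregressively. The bin count and coordinate range below are invented values, not the paper's action vocabulary:

```python
import numpy as np

BINS, LO, HI = 128, -50.0, 50.0     # vocabulary size and xy range (made up)

def tokenize(waypoints):
    """waypoints: (T, 2) array of future (x, y) positions in meters; returns
    one discrete token per waypoint, suitable for autoregressive decoding."""
    idx = np.clip(((waypoints - LO) / (HI - LO) * BINS).astype(int), 0, BINS - 1)
    return (idx[:, 0] * BINS + idx[:, 1]).tolist()

def detokenize(tokens):
    idx = np.array([(t // BINS, t % BINS) for t in tokens], dtype=float)
    return (idx + 0.5) / BINS * (HI - LO) + LO    # bin centers, shape (T, 2)
```

Restricting decoding to this vocabulary is one way such models keep generated trajectories physically feasible by construction.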
[496] Self-supervised Representation Learning with Local Aggregation for Image-based Profiling
Siran Dai, Qianqian Xu, Peisong Wen, Yang Liu, Qingming Huang
Main category: cs.CV
TL;DR: This paper proposes a self-supervised learning framework with local aggregation for cell image profiling, addressing challenges in multi-image fusion and domain differences between cell and natural images, winning the Cell Line Transferability challenge at CVPR 2025.
Details
Motivation: To create generalizable feature extractors for cell images using non-contrastive self-supervised learning, addressing two key challenges: multi-image fusion in cell profiling and domain differences between cell images and natural images that cause existing SSL methods to fail.Method: Proposes a self-supervised framework with local aggregation, introducing specialized data augmentation and representation post-processing methods tailored specifically for cell images to improve cross-site consistency of cell representations.
Result: The proposed framework successfully addressed the challenges of multi-image fusion and domain differences, resulting in a robust feature extractor that won the Cell Line Transferability challenge at CVPR 2025.
Conclusion: The self-supervised learning framework with tailored data augmentation and post-processing methods effectively handles the unique challenges of cell image profiling and demonstrates superior performance in cross-site transferability tasks.
Abstract: Image-based cell profiling aims to create informative representations of cell images. This technique is critical in drug discovery and has greatly advanced with recent improvements in computer vision. Inspired by recent developments in non-contrastive Self-Supervised Learning (SSL), this paper provides an initial exploration into training a generalizable feature extractor for cell images using such methods. However, there are two major challenges: 1) Unlike typical scenarios where each representation is based on a single image, cell profiling often involves multiple input images, making it difficult to effectively fuse all available information; and 2) There is a large difference between the distributions of cell images and natural images, causing the view-generation process in existing SSL methods to fail. To address these issues, we propose a self-supervised framework with local aggregation to improve cross-site consistency of cell representations. We introduce specialized data augmentation and representation post-processing methods tailored to cell images, which effectively address the issues mentioned above and result in a robust feature extractor. With these improvements, the proposed framework won the Cell Line Transferability challenge at CVPR 2025.
[497] AdFair-CLIP: Adversarial Fair Contrastive Language-Image Pre-training for Chest X-rays
Chenlang Yi, Zizhan Xiong, Qi Qi, Xiyuan Wei, Girish Bathla, Ching-Long Lin, Bobak Jack Mortazavi, Tianbao Yang
Main category: cs.CV
TL;DR: AdFair-CLIP is a framework that uses adversarial feature intervention to mitigate demographic biases in CLIP models for medical image classification, improving fairness and accuracy in chest X-ray analysis.
Details
Motivation: CLIP models show superior performance in medical tasks but have fairness concerns with demographic biases (race/gender) that lead to diagnostic disparities and reduced reliability for underrepresented groups.Method: Adversarial feature intervention to suppress sensitive attributes and mitigate spurious correlations in CLIP models.
Result: Significantly enhances both fairness and diagnostic accuracy on chest X-ray datasets while maintaining robust generalization in zero-shot and few-shot scenarios.
Conclusion: Establishes new benchmarks for fairness-aware learning in CLIP-based medical diagnostic models, particularly for CXR analysis.
Abstract: Contrastive Language-Image Pre-training (CLIP) models have demonstrated superior performance across various visual tasks including medical image classification. However, fairness concerns, including demographic biases, have received limited attention for CLIP models. This oversight leads to critical issues, particularly those related to race and gender, resulting in disparities in diagnostic outcomes and reduced reliability for underrepresented groups. To address these challenges, we introduce AdFair-CLIP, a novel framework employing adversarial feature intervention to suppress sensitive attributes, thereby mitigating spurious correlations and improving prediction fairness. We conduct comprehensive experiments on chest X-ray (CXR) datasets, and show that AdFair-CLIP significantly enhances both fairness and diagnostic accuracy, while maintaining robust generalization in zero-shot and few-shot scenarios. These results establish new benchmarks for fairness-aware learning in CLIP-based medical diagnostic models, particularly for CXR analysis.
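Adversarial feature intervention of this kind is commonly realized with a gradient-reversal adversary: a small head tries to predict the sensitive attribute from image features while reversed gradients push the encoder to discard that information. A generic sketch of the pattern (not necessarily the paper's exact architecture):

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the
    backward pass, so the encoder learns to discard the attribute."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class SensitiveAdversary(nn.Module):
    def __init__(self, dim=512, n_groups=2, lam=1.0):
        super().__init__()
        self.head = nn.Linear(dim, n_groups)   # predicts, e.g., race or sex
        self.lam = lam

    def forward(self, image_feats):
        return self.head(GradReverse.apply(image_feats, self.lam))
```

Training the adversary with cross-entropy on the attribute labels, jointly with the usual contrastive objective, drives the image features toward attribute invariance.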
[498] Spurious-Aware Prototype Refinement for Reliable Out-of-Distribution Detection
Reihaneh Zohrabi, Hosein Hasani, Mahdieh Soleymani Baghshah, Anna Rohrbach, Marcus Rohrbach, Mohammad Hossein Rohban
Main category: cs.CV
TL;DR: SPROD is a prototype-based OOD detection method that addresses spurious correlations by refining class prototypes without additional data or hyperparameter tuning, achieving significant performance improvements over existing methods.
Details
Motivation: Existing OOD detection methods are vulnerable to spurious correlations that mislead models and compromise robustness in real-world applications where models face unseen data distributions.
Method: A post-hoc prototype-based approach that refines class prototypes to mitigate bias from spurious features, applicable across diverse backbones and OOD detection settings without requiring additional data or hyperparameter tuning.
Result: SPROD demonstrates superior performance across challenging OOD datasets (CelebA, Waterbirds, UrbanCars, Spurious Imagenet, Animals MetaCoCo), improving AUROC by 4.8% and FPR@95 by 9.4% over the second best method on average.
Conclusion: SPROD effectively addresses spurious correlation challenges in OOD detection through prototype refinement, offering a robust and broadly applicable solution that significantly outperforms existing approaches.
Abstract: Out-of-distribution (OOD) detection is crucial for ensuring the reliability and safety of machine learning models in real-world applications, where they frequently face data distributions unseen during training. Despite progress, existing methods are often vulnerable to spurious correlations that mislead models and compromise robustness. To address this, we propose SPROD, a novel prototype-based OOD detection approach that explicitly addresses the challenge posed by unknown spurious correlations. Our post-hoc method refines class prototypes to mitigate bias from spurious features without additional data or hyperparameter tuning, and is broadly applicable across diverse backbones and OOD detection settings. We conduct a comprehensive spurious correlation OOD detection benchmarking, comparing our method against existing approaches and demonstrating its superior performance across challenging OOD datasets, such as CelebA, Waterbirds, UrbanCars, Spurious Imagenet, and the newly introduced Animals MetaCoCo. On average, SPROD improves AUROC by 4.8% and FPR@95 by 9.4% over the second best.
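For orientation, the prototype-based scoring that SPROD builds on is easy to sketch: class prototypes are means of normalized in-distribution features, and a sample's OOD score is its (negative) maximum cosine similarity to any prototype. The spurious-aware refinement that is the paper's actual contribution is not reproduced here.

```python
import torch
import torch.nn.functional as F

def class_prototypes(feats, labels, num_classes):
    """Mean of L2-normalized features per class: the baseline SPROD refines."""
    feats = F.normalize(feats, dim=1)
    protos = torch.stack([feats[labels == c].mean(0) for c in range(num_classes)])
    return F.normalize(protos, dim=1)

def ood_score(feats, protos):
    """Higher score = more OOD: negative max cosine similarity to any prototype."""
    sims = F.normalize(feats, dim=1) @ protos.T   # (N, C) similarities
    return -sims.max(dim=1).values
```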
[499] PlantSegNeRF: A few-shot, cross-species method for plant 3D instance point cloud reconstruction via joint-channel NeRF with multi-view image instance matching
Xin Yang, Ruiming Du, Hanyang Huang, Jiayang Xie, Pengyao Xie, Leisen Fang, Ziyue Guo, Nanjun Jiang, Yu Jiang, Haiyan Cen
Main category: cs.CV
TL;DR: PlantSegNeRF is a novel method that uses multi-view RGB images and neural radiance fields to generate high-precision instance point clouds for plant organ segmentation, achieving significant improvements over existing methods.
Details
Motivation: Existing plant organ segmentation techniques face limitations in resolution, accuracy, and generalizability across plant species. There's a need for high-precision instance segmentation that works across various plant types.
Method: Uses multi-view RGB images with 2D instance segmentation to generate organ masks, matches instance IDs across views, develops instance NeRF to render implicit scenes with color, density, semantic and instance information, then converts to high-precision point clouds based on volume density.
Result: In semantic segmentation, outperformed common methods with improvements of 16.1% in precision, 18.3% in recall, 17.8% in F1-score, and 24.2% in IoU over the second-best results. In instance segmentation, achieved improvements of 11.7% in mPrec, 38.2% in mRec, 32.2% in mCov, and 25.3% in mWCov across all plant species.
Conclusion: Provides a high-throughput way to supply high-quality 3D data for plant science, extending organ-level plant phenotyping capabilities and supporting development of large-scale models.
Abstract: Organ segmentation of plant point clouds is a prerequisite for the high-resolution and accurate extraction of organ-level phenotypic traits. Although the fast development of deep learning has boosted much research on segmentation of plant point clouds, the existing techniques for organ segmentation still face limitations in resolution, segmentation accuracy, and generalizability across various plant species. In this study, we proposed a novel approach called plant segmentation neural radiance fields (PlantSegNeRF), aiming to directly generate high-precision instance point clouds from multi-view RGB image sequences for a wide range of plant species. PlantSegNeRF performed 2D instance segmentation on the multi-view images to generate instance masks for each organ with a corresponding ID. The multi-view instance IDs corresponding to the same plant organ were then matched and refined using a specially designed instance matching module. The instance NeRF was developed to render an implicit scene, containing color, density, semantic and instance information. The implicit scene was ultimately converted into high-precision plant instance point clouds based on the volume density. The results proved that in semantic segmentation of point clouds, PlantSegNeRF outperformed the commonly used methods, demonstrating an average improvement of 16.1%, 18.3%, 17.8%, and 24.2% in precision, recall, F1-score, and IoU compared to the second-best results on structurally complex species. More importantly, PlantSegNeRF exhibited significant advantages in plant point cloud instance segmentation tasks. Across all plant species, it achieved average improvements of 11.7%, 38.2%, 32.2%, and 25.3% in mPrec, mRec, mCov, and mWCov, respectively. This study extends organ-level plant phenotyping and provides a high-throughput way to supply high-quality 3D data for the development of large-scale models in plant science.
[500] Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations
Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Han Feng, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Main category: cs.CV
TL;DR: A spatial-temporal decoupled framework for identity-preserving text-to-video generation that separates spatial layout features from temporal motion dynamics to overcome the trade-off between identity preservation and motion consistency.
Details
Motivation: Current end-to-end text-to-video frameworks suffer from a critical spatial-temporal trade-off where optimizing for spatial coherence (identity preservation) compromises temporal smoothness, and vice versa.
Method: Proposes a semantic prompt optimization mechanism and stage-wise decoupled generation paradigm that separates prompts into spatial and temporal components, using spatial prompts for text-to-image generation and temporal prompts for image-to-video generation.
Result: Experimental results show excellent spatiotemporal consistency with outstanding performance in identity preservation, text relevance, and video quality, achieving the runner-up position in the 2025 ACM MultiMedia Challenge.
Conclusion: The proposed spatial-temporal decoupled framework effectively addresses the trade-off between spatial coherence and temporal smoothness in identity-preserving video generation through simple yet robust mechanisms.
Abstract: Identity-preserving text-to-video (IPT2V) generation, which aims to create high-fidelity videos with consistent human identity, has become crucial for downstream applications. However, current end-to-end frameworks suffer from a critical spatial-temporal trade-off: optimizing for spatially coherent layouts of key elements (e.g., character identity preservation) often compromises instruction-compliant temporal smoothness, while prioritizing dynamic realism risks disrupting the spatial coherence of visual structures. To tackle this issue, we propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics. Specifically, our paper proposes a semantic prompt optimization mechanism and stage-wise decoupled generation paradigm. The former module decouples the prompt into spatial and temporal components. Aligned with the subsequent stage-wise decoupled approach, the spatial prompts guide the text-to-image (T2I) stage to generate coherent spatial features, while the temporal prompts direct the sequential image-to-video (I2V) stage to ensure motion consistency. Experimental results validate that our approach achieves excellent spatiotemporal consistency, demonstrating outstanding performance in identity preservation, text relevance, and video quality. By leveraging this simple yet robust mechanism, our algorithm secures the runner-up position in the 2025 ACM MultiMedia Challenge. Our code is available at https://github.com/rain152/IPVG.
[501] THUNDER: Tile-level Histopathology image UNDERstanding benchmark
Pierre Marza, Leo Fillioux, Sofiène Boutaj, Kunal Mahatha, Christian Desrosiers, Pablo Piantanida, Jose Dolz, Stergios Christodoulidis, Maria Vakalopoulou
Main category: cs.CV
TL;DR: THUNDER is a comprehensive tile-level benchmark for digital pathology foundation models that evaluates 23 models on 16 datasets across diverse tasks, feature analysis, and robustness assessment.
Details
Motivation: The rapid proliferation of foundation models in digital pathology creates a need for systematic benchmarking to assess performance differences, understand feature spaces, and evaluate uncertainty and robustness for reliable healthcare applications.
Method: THUNDER provides a fast, easy-to-use dynamic benchmark that supports comparison of state-of-the-art foundation models and user-defined models on diverse datasets with various downstream tasks, feature space analysis, and robustness evaluation.
Result: The benchmark enables comprehensive comparison of 23 foundation models across 16 different datasets, providing insights into performance differences, feature characteristics, and model robustness in digital pathology applications.
Conclusion: THUNDER addresses the critical need for standardized evaluation in digital pathology by offering a flexible benchmarking framework that assesses not only downstream performance but also feature spaces, uncertainty, and robustness of foundation models.
Abstract: Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many foundation models have been released recently to serve as feature extractors for tile-level images, being used in a variety of downstream tasks, both for tile- and slide-level problems. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only focus on evaluating downstream performance, but also provide insights about the main differences between methods, and importantly, further consider uncertainty and robustness to ensure a reliable usage of proposed models. For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. THUNDER is a fast, easy-to-use, dynamic benchmark that can already support a large variety of state-of-the-art foundation models, as well as local user-defined models for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at https://github.com/MICS-Lab/thunder.
[502] Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation
Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, Xiaojuan Qi
Main category: cs.CV
TL;DR: VFMTok is a novel image tokenizer built directly on frozen vision foundation models, using region-adaptive quantization and semantic reconstruction to improve image reconstruction, generation quality, and token efficiency.
Details
Motivation: To explore the largely underexplored area of building image tokenizers directly on top of frozen vision foundation models, leveraging their pre-trained capabilities while reducing feature redundancy.
Method: Uses a frozen vision foundation model as the encoder, introduces region-adaptive quantization to reduce redundancy in pre-trained features, and a semantic reconstruction objective to align outputs with foundation model representations.
Result: Achieves gFID of 1.36 on ImageNet benchmarks, accelerates model convergence by 3x, enables high-fidelity class-conditional synthesis without CFG, and improves image reconstruction and generation quality.
Conclusion: VFMTok demonstrates that building image tokenizers on frozen vision foundation models with proper quantization and semantic alignment can significantly enhance performance and efficiency in image generation tasks.
Abstract: In this work, we present a novel direction to build an image tokenizer directly on top of a frozen vision foundation model, which is a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer’s outputs with the foundation model’s representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation – achieving a gFID of 1.36 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code is available at https://github.com/CVMI-Lab/VFMTok.
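The semantic reconstruction objective is described only at a high level; one plausible form, sketched below under that assumption, is a cosine-distance alignment between the tokenizer's reconstructed features and the frozen foundation model's features (the paper's exact loss may differ).

```python
import torch
import torch.nn.functional as F

def semantic_reconstruction_loss(decoded_feats, vfm_feats):
    """Align tokenizer reconstructions with frozen foundation-model features.
    The VFM side carries no gradients, matching the frozen-encoder setup."""
    decoded = F.normalize(decoded_feats.flatten(1), dim=1)
    target = F.normalize(vfm_feats.flatten(1), dim=1).detach()
    return (1 - (decoded * target).sum(dim=1)).mean()  # mean cosine distance
```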
[503] Contrastive Conditional-Unconditional Alignment for Long-tailed Diffusion Model
Fang Chen, Alex Villa, Gongbo Liang, Xiaoyi Lu, Meng Tang
Main category: cs.CV
TL;DR: The paper proposes two simple loss functions to improve diversity and fidelity of tail class images in class-conditional diffusion models trained on imbalanced data, without compromising head class quality.
Details
Motivation: Class-conditional image synthesis often suffers from long-tailed distributions where tail classes have limited training data, causing mode collapse and reduced diversity in synthesized images for tail classes.
Method: Two loss functions: 1) Unsupervised Contrastive Loss (UCL) using negative samples to increase dissimilarity among synthetic images, combined with batch resampling; 2) Alignment Loss (AL) that aligns class-conditional with unconditional generation at large timesteps to make denoising insensitive to class conditions initially.
Result: The method outperforms vanilla denoising diffusion probabilistic models, score-based diffusion models, and alternative methods for class-imbalanced image generation across various datasets, particularly ImageNet-LT with 256x256 resolution.
Conclusion: The framework successfully leverages contrastive learning and conditional-unconditional alignment for class-imbalanced diffusion models, is easy to implement on both U-Net and Diffusion Transformer architectures, and effectively improves tail class image diversity and fidelity.
Abstract: Training data for class-conditional image synthesis often exhibit a long-tailed distribution with limited images for tail classes. Such an imbalance causes mode collapse and reduces the diversity of synthesized images for tail classes. For class-conditional diffusion models trained on imbalanced data, we aim to improve the diversity and fidelity of tail class images without compromising the quality of head class images. We achieve this by introducing two simple but highly effective loss functions. Firstly, we employ an Unsupervised Contrastive Loss (UCL) utilizing negative samples to increase the distance/dissimilarity among synthetic images. Such regularization is coupled with a standard trick of batch resampling to further diversify tail-class images. Our second loss is an Alignment Loss (AL) that aligns class-conditional generation with unconditional generation at large timesteps. This second loss makes the denoising process insensitive to class conditions for the initial steps, which enriches tail classes through knowledge sharing from head classes. We successfully leverage contrastive learning and conditional-unconditional alignment for class-imbalanced diffusion models. Our framework is easy to implement as demonstrated on both U-Net-based and Diffusion Transformer architectures. Our method outperforms vanilla denoising diffusion probabilistic models, score-based diffusion models, and alternative methods for class-imbalanced image generation across various datasets, in particular ImageNet-LT with 256x256 resolution.
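Both losses are described concretely enough to sketch. Below is a minimal PyTorch rendering of UCL (repelling batch negatives) and AL (matching conditional to unconditional noise predictions at large timesteps); the timestep threshold, stop-gradient placement, and weighting are assumptions, not the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def unsupervised_contrastive_loss(z):
    """UCL sketch: treat every other sample in the batch as a negative and
    push normalized features apart, encouraging diverse tail-class images."""
    z = F.normalize(z.flatten(1), dim=1)
    sim = z @ z.T                                        # (B, B) cosine similarities
    mask = 1.0 - torch.eye(z.size(0), device=z.device)   # ignore self-similarity
    return (sim * mask).sum() / mask.sum()               # mean off-diagonal similarity

def alignment_loss(eps_cond, eps_uncond, t, t_frac=0.7, t_max=1000):
    """AL sketch: at large timesteps, pull the class-conditional noise
    prediction toward the unconditional one, so early denoising ignores
    the class label and tail classes share head-class knowledge."""
    mask = (t.float() / t_max > t_frac).float()          # only large-t samples
    per_sample = ((eps_cond - eps_uncond.detach()) ** 2).flatten(1).mean(1)
    return (mask * per_sample).mean()
```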
[504] CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models
Kedong Xiu, Sai Qian Zhang
Main category: cs.CV
TL;DR: CapRecover is a cross-modality inversion framework that recovers semantic content (labels/captions) directly from intermediate features in split-DNN VLMs, addressing privacy risks from semantic leakage without image reconstruction.
Details
Motivation: Vision-Language Models deployed in split-DNN configurations pose privacy risks due to semantic information leakage from intermediate features sent to the cloud, while existing image reconstruction methods produce blurry, ambiguous results.
Method: Proposes CapRecover framework that directly recovers high-level semantic content from intermediate features using cross-modality inversion. Also introduces a protection method adding random noise to intermediate features at each layer and removing it in the next layer.
Result: Achieves 92.71% Top-1 label accuracy on CIFAR-10 and generates fluent captions from ResNet50 features on COCO2017 with ROUGE-L scores up to 0.52. Deeper convolutional layers encode more semantic information than shallow layers. The protection method effectively prevents semantic leakage without training costs.
Conclusion: CapRecover effectively addresses semantic leakage in split-DNN VLMs by directly recovering semantic content from intermediate features, and provides a practical protection method against such attacks.
Abstract: As Vision-Language Models (VLMs) are increasingly deployed in split-DNN configurations–with visual encoders (e.g., ResNet, ViT) operating on user devices and sending intermediate features to the cloud–there is a growing privacy risk from semantic information leakage. Existing approaches to reconstructing images from these intermediate features often result in blurry, semantically ambiguous images. To directly address semantic leakage, we propose CapRecover, a cross-modality inversion framework that recovers high-level semantic content, such as labels or captions, directly from intermediate features without image reconstruction. We evaluate CapRecover on multiple datasets and victim models, demonstrating strong performance in semantic recovery. Specifically, CapRecover achieves up to 92.71% Top-1 label accuracy on CIFAR-10 and generates fluent captions from ResNet50 features on COCO2017 with ROUGE-L scores up to 0.52. Our analysis further reveals that deeper convolutional layers encode significantly more semantic information compared to shallow layers. To mitigate semantic leakage, we introduce a simple yet effective protection method: adding random noise to intermediate features at each layer and removing the noise in the next layer. Experimental results show that this approach prevents semantic leakage without additional training costs. Our code is available at https://jus1mple.github.io/Image2CaptionAttack.
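The proposed defense is simple enough to sketch: each layer adds fresh noise to its output and the next layer removes it before computing, so any intermediate feature intercepted in transit is noised. The single-module version below is illustrative only; in an actual split-DNN deployment the two halves would need a shared secret (e.g., a synchronized RNG seed) to reproduce the noise.

```python
import torch
import torch.nn as nn

class NoiseChainedEncoder(nn.Module):
    """Illustrative noise-add/noise-remove chain over a list of layers."""
    def __init__(self, layers, sigma=0.1):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.sigma = sigma

    def forward(self, x):
        noise = None
        for layer in self.layers:
            if noise is not None:
                x = x - noise                     # remove the previous layer's noise
            x = layer(x)
            noise = self.sigma * torch.randn_like(x)
            x = x + noise                         # noise whatever leaves this layer
        return x - noise                          # clean output at the final layer

# Hypothetical usage:
# enc = NoiseChainedEncoder([nn.Conv2d(3, 16, 3, padding=1),
#                            nn.Conv2d(16, 32, 3, padding=1)])
```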
[505] AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding
Yidan Wang, Chenyi Zhuang, Wutao Liu, Pan Gao, Nicu Sebe
Main category: cs.CV
TL;DR: AlignCAT is a query-based semantic matching framework for weakly supervised visual grounding that uses coarse-grained and fine-grained alignment modules to address category and attribute ambiguity in text expressions.
Details
Motivation: Existing weakly supervised VG methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity.
Method: Proposes AlignCAT with two modules: coarse-grained alignment using category information and global context to mitigate category-inconsistent objects, and fine-grained alignment using descriptive information and word-level text features for attribute consistency. Progressively filters misaligned visual queries and enhances contrastive learning.
Result: Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg benchmarks verify superiority against existing weakly supervised methods on two VG tasks.
Conclusion: AlignCAT effectively addresses category and attribute ambiguity in weakly supervised visual grounding by exploiting linguistic cues through progressive alignment modules.
Abstract: Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.
[506] S$^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation
Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, Yongjun Xu
Main category: cs.CV
TL;DR: S²Q-VDiT is a post-training quantization framework for video diffusion models that addresses calibration variance and learning challenges through Hessian-aware salient data selection and attention-guided sparse token distillation, achieving lossless performance with 3.9× compression and 1.3× acceleration.
Details
Motivation: Video diffusion models with billions of parameters incur high computational costs, and quantization faces challenges due to long token sequences causing high calibration variance and learning difficulties.
Method: Proposes Hessian-aware Salient Data Selection to construct high-quality calibration datasets, and Attention-guided Sparse Token Distillation that exploits token-wise attention distributions to emphasize influential tokens.
Result: Under W4A6 quantization, achieves lossless performance with 3.9× model compression and 1.3× inference acceleration.
Conclusion: S²Q-VDiT effectively addresses quantization challenges in video diffusion models through data selection and token distillation techniques, enabling efficient deployment without performance degradation.
Abstract: Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose S$^2$Q-VDiT, a post-training quantization framework for V-DMs that leverages Salient data and Sparse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce \textit{Hessian-aware Salient Data Selection}, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose \textit{Attention-guided Sparse Token Distillation}, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model’s output. Under W4A6 quantization, S$^2$Q-VDiT achieves lossless performance while delivering $3.9\times$ model compression and $1.3\times$ inference acceleration. Code will be available at https://github.com/wlfeng0509/s2q-vdit.
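The abstract does not give the exact token-weighting rule; one plausible reading of attention-guided token distillation, sketched below as an assumption, weights each token's distillation loss by the attention mass it receives in the full-precision teacher (the paper's sparse variant may instead select top-k tokens).

```python
import torch

def attention_guided_token_distill(student_h, teacher_h, teacher_attn):
    """student_h, teacher_h: (B, T, D) hidden states; teacher_attn:
    (B, heads, T, T) attention probabilities from the full-precision teacher."""
    w = teacher_attn.mean(dim=1).sum(dim=1)              # (B, T): attention received per token
    w = w / w.sum(dim=1, keepdim=True)                   # normalize to a distribution
    per_tok = ((student_h - teacher_h.detach()) ** 2).mean(dim=-1)  # (B, T)
    return (w * per_tok).sum(dim=1).mean()               # emphasize influential tokens
```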
[507] CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing
Weiyan Xie, Han Gao, Didan Deng, Kaican Li, April Hua Liu, Yongxiang Huang, Nevin L. Zhang
Main category: cs.CV
TL;DR: CannyEdit is a training-free framework for regional image editing that uses selective Canny ControlNet guidance and dual-prompt guidance to balance text adherence, context fidelity, and seamless integration of edits.
Details
Motivation: Existing text-to-image models struggle to balance text adherence in edited regions, context fidelity in unedited areas, and seamless integration of edits, creating a trilemma in regional image editing.
Method: Uses two key innovations: Selective Canny Control (applies structural guidance only to unedited regions) and Dual-Prompt Guidance (uses both local and global prompts for coherence). Works with rough masks or single-point hints and integrates with vision-language models.
Result: Achieves superior trade-off among text adherence, context fidelity, and editing seamlessness compared to current region-based methods. Performs well in complex object addition scenarios against leading instruction-based editors.
Conclusion: CannyEdit enables controllable local editing for object addition, replacement, and removal with exceptional flexibility and seamless integration capabilities, addressing the fundamental trilemma in regional image editing.
Abstract: Recent advances in text-to-image (T2I) models have enabled training-free regional image editing by leveraging the generative priors of foundation models. However, existing methods struggle to balance text adherence in edited regions, context fidelity in unedited areas, and seamless integration of edits. We introduce CannyEdit, a novel training-free framework that addresses this trilemma through two key innovations. First, Selective Canny Control applies structural guidance from a Canny ControlNet only to the unedited regions, preserving the original image’s details while allowing for precise, text-driven changes in the specified editable area. Second, Dual-Prompt Guidance utilizes both a local prompt for the specific edit and a global prompt for overall scene coherence. Through this synergistic approach, these components enable controllable local editing for object addition, replacement, and removal, achieving a superior trade-off among text adherence, context fidelity, and editing seamlessness compared to current region-based methods. Beyond this, CannyEdit offers exceptional flexibility: it operates effectively with rough masks or even single-point hints in addition tasks. Furthermore, the framework can seamlessly integrate with vision-language models in a training-free manner for complex instruction-based editing that requires planning and reasoning. Our extensive evaluations demonstrate CannyEdit’s strong performance against leading instruction-based editors in complex object addition scenarios.
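Selective Canny Control is straightforward to illustrate with OpenCV: compute the edge map, then blank it inside the editable region before conditioning a Canny ControlNet on it. File names in the usage comment are hypothetical.

```python
import cv2

def selective_canny(image_bgr, edit_mask, low=100, high=200):
    """Canny edge map with structural guidance removed inside the edit region.
    edit_mask: uint8 array, nonzero = editable region."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)
    edges[edit_mask > 0] = 0        # no structural constraint where the edit happens
    return edges                    # condition a Canny ControlNet on this map

# Hypothetical usage:
# img = cv2.imread("scene.png")
# mask = cv2.imread("edit_mask.png", cv2.IMREAD_GRAYSCALE)
# control_image = selective_canny(img, mask)
```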
[508] Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances
Yuanzhi Liang, Yijie Fang, Rui Li, Ziqi Ni, Ruijie Su, Chi Zhang
Main category: cs.CV
TL;DR: This survey paper systematically reviews reinforcement learning (RL) methods for visual content generation, examining how RL addresses limitations of traditional generative models by optimizing non-differentiable, preference-driven objectives.
Details
Motivation: Traditional generative models use surrogate objectives like likelihood or reconstruction loss that often misalign with perceptual quality, semantic accuracy, and physical realism. RL provides a principled framework to optimize these non-differentiable objectives.
Method: The paper surveys RL-based methods across image, video, and 3D/4D generation domains. It reviews RL’s evolution from classical control to general-purpose optimization tool, examining how RL serves both as fine-tuning mechanism and structural component for aligning generation with complex goals.
Result: Recent advances demonstrate RL’s effectiveness in enhancing controllability, consistency, and human alignment across various generative tasks in visual content creation.
Conclusion: The survey concludes with open challenges and future research directions at the intersection of RL and generative modeling, highlighting RL’s potential as a general-purpose optimization framework for visual content generation.
Abstract: Generative models have made significant progress in synthesizing visual content, including images, videos, and 3D/4D structures. However, they are typically trained with surrogate objectives such as likelihood or reconstruction loss, which often misalign with perceptual quality, semantic accuracy, or physical realism. Reinforcement learning (RL) offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives. Recent advances demonstrate its effectiveness in enhancing controllability, consistency, and human alignment across generative tasks. This survey provides a systematic overview of RL-based methods for visual content generation. We review the evolution of RL from classical control to its role as a general-purpose optimization tool, and examine its integration into image, video, and 3D/4D generation. Across these domains, RL serves not only as a fine-tuning mechanism but also as a structural component for aligning generation with complex, high-level goals. We conclude with open challenges and future research directions at the intersection of RL and generative modeling.
[509] A Deep Learning-Based CCTV System for Automatic Smoking Detection in Fire Exit Zones
Sami Sadat, Mohammad Irtiza Hossain, Junaid Ahmed Sifat, Suhail Haque Rafi, Md. Waseq Alauddin Alvi, Md. Khalilur Rhaman
Main category: cs.CV
TL;DR: A deep learning system for real-time smoking detection in CCTV surveillance achieves 78.90% recall and 83.70% mAP@50 using a custom YOLOv8-based model optimized for fire exit monitoring.
Details
Motivation: Critical safety requirements in fire exit areas necessitate automated smoking detection to prevent fire hazards and ensure regulatory compliance.
Method: Evaluated YOLOv8, YOLOv11, and YOLOv12 models, then developed a custom YOLOv8-based model with added structures for challenging surveillance contexts. Tested on 8,124 images from 20 scenarios including 2,708 low-light samples.
Result: Proposed model outperformed others with 78.90% recall and 83.70% mAP@50. Jetson Xavier NX processed data at 52-97 ms per inference, suitable for real-time operations.
Conclusion: The system provides a robust and adaptable platform for public safety monitoring and automatic regulatory compliance in time-sensitive surveillance applications.
Abstract: A deep learning real-time smoking detection system for CCTV surveillance of fire exit areas is proposed to meet critical safety requirements. The dataset contains 8,124 images from 20 different scenarios along with 2,708 raw samples depicting low-light areas. We evaluated three advanced object detection models: YOLOv8, YOLOv11, and YOLOv12, followed by development of a custom model derived from YOLOv8 with added structures for challenging surveillance contexts. The proposed model outperformed the others, achieving a recall of 78.90 percent and mAP at 50 of 83.70 percent, delivering optimal object detection across varied environments. Performance evaluation on multiple edge devices using multithreaded operations showed the Jetson Xavier NX processed data at 52 to 97 milliseconds per inference, establishing its suitability for time-sensitive operations. This system offers a robust and adaptable platform for monitoring public safety and enabling automatic regulatory compliance.
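The customized model and its weights are not public; for orientation only, video-stream detection with the stock Ultralytics YOLOv8 API looks like the sketch below. The weight file, video path, and the presence of a smoking class are placeholders, not the paper's artifacts.

```python
from ultralytics import YOLO

# Stock pretrained checkpoint as a stand-in for the paper's custom model.
model = YOLO("yolov8n.pt")

# Stream inference over a (hypothetical) fire-exit camera feed.
results = model.predict(source="fire_exit_cam.mp4", conf=0.25, stream=True)
for r in results:
    for box in r.boxes:
        cls_name = model.names[int(box.cls)]
        print(cls_name, float(box.conf), box.xyxy.tolist())
```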
[510] C-DiffDet+: Fusing Global Scene Context with Generative Denoising for High-Fidelity Car Damage Detection
Abdellah Zakaria Sellam, Ilyes Benaissa, Salah Eddine Bekhouche, Abdenour Hadid, Vito Renó, Cosimo Distante
Main category: cs.CV
TL;DR: The paper introduces Context-Aware Fusion (CAF) to enhance fine-grained object detection by integrating global scene context with local features using cross-attention mechanisms, improving performance on vehicle damage assessment tasks.
Details
Motivation: Fine-grained object detection in challenging domains like vehicle damage assessment is difficult even for human experts. While DiffusionDet has advanced the field, its performance is limited by local feature conditioning in context-dependent scenarios.
Method: Proposed Context-Aware Fusion (CAF) that uses cross-attention mechanisms to integrate global scene context with local proposal features. A separate dedicated encoder captures comprehensive environmental information, allowing each object proposal to attend to scene-level understanding.
Result: Experimental results show improvement over state-of-the-art models on the CarDD benchmark, establishing new performance benchmarks for context-aware object detection in fine-grained domains.
Conclusion: The framework significantly enhances the generative detection paradigm by enabling object proposals to attend to comprehensive environmental information, addressing fundamental limitations of previous approaches.
Abstract: Fine-grained object detection in challenging visual domains, such as vehicle damage assessment, presents a formidable challenge even for human experts to resolve reliably. While DiffusionDet has advanced the state-of-the-art through conditional denoising diffusion, its performance remains limited by local feature conditioning in context-dependent scenarios. We address this fundamental limitation by introducing Context-Aware Fusion (CAF), which leverages cross-attention mechanisms to integrate global scene context with local proposal features directly. The global context is generated using a separate dedicated encoder that captures comprehensive environmental information, enabling each object proposal to attend to scene-level understanding. Our framework significantly enhances the generative detection paradigm by enabling each object proposal to attend to comprehensive environmental information. Experimental results demonstrate an improvement over state-of-the-art models on the CarDD benchmark, establishing new performance benchmarks for context-aware object detection in fine-grained domains.
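The abstract describes CAF as cross-attention from proposal features to tokens produced by a dedicated global-context encoder. A minimal PyTorch sketch under assumed dimensions and an assumed residual design:

```python
import torch
import torch.nn as nn

class ContextAwareFusion(nn.Module):
    """Proposal features (queries) attend to global scene tokens (keys/values)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, proposals, context):
        # proposals: (B, N, dim) local proposal features
        # context:   (B, M, dim) tokens from the dedicated context encoder
        fused, _ = self.attn(proposals, context, context)
        return self.norm(proposals + fused)   # residual fusion
```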
[511] A Fine-Grained Attention and Geometric Correspondence Model for Musculoskeletal Risk Classification in Athletes Using Multimodal Visual and Skeletal Features
Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Tamanna Shermin, Md Rafiqul Islam, Mukhtar Hussain, Sami Azam
Main category: cs.CV
TL;DR: ViSK-GAT is a multimodal deep learning framework that combines visual and skeletal data to classify musculoskeletal risk in athletes, achieving over 93% accuracy across all key metrics.
Details
Motivation: Existing methods for musculoskeletal risk assessment are limited to controlled settings and rely on single data types, failing in complex environments. Early risk assessment is crucial for athlete injury prevention.
Method: Developed ViSK-GAT framework with two modules: Fine-Grained Attention Module (FGAM) for cross-attention between visual and skeletal inputs, and Multimodal Geometric Correspondence Module (MGCM) for cross-modal alignment. Used custom MusDis-Sports dataset with images and skeletal coordinates labeled into eight REBA risk categories.
Result: Achieved robust performance with all key metrics exceeding 93%. Regression results showed low RMSE (0.1205) and MAE (0.0156). Consistently outperformed nine popular transfer learning backbones.
Conclusion: ViSK-GAT demonstrates potential to advance AI-driven musculoskeletal risk assessment and enable early interventions in sports through its multimodal approach.
Abstract: Musculoskeletal disorders pose significant risks to athletes, and assessing risk early is important for prevention. However, most existing methods are designed for controlled settings and fail to reliably assess risk in complex environments due to their reliance on a single type of data. This research introduces ViSK-GAT (Visual-Skeletal Geometric Attention Transformer), a novel multimodal deep learning framework that classifies musculoskeletal risk using both visual and skeletal coordinate-based features. A custom multimodal dataset (MusDis-Sports) was created by combining images and skeletal coordinates, with each sample labeled into eight risk categories based on the Rapid Entire Body Assessment (REBA) system. ViSK-GAT integrates two innovative modules: the Fine-Grained Attention Module (FGAM), which refines inter-modal features via cross-attention between visual and skeletal inputs, and the Multimodal Geometric Correspondence Module (MGCM), which enhances cross-modal alignment between image features and coordinates. The model achieved robust performance, with all key metrics exceeding 93%. Regression results also indicated a low RMSE of 0.1205 and MAE of 0.0156. ViSK-GAT consistently outperformed nine popular transfer learning backbones and showed its potential to advance AI-driven musculoskeletal risk assessment and enable early, impactful interventions in sports.
[512] Reconstruction Alignment Improves Unified Multimodal Models
Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang
Main category: cs.CV
TL;DR: RecA is a resource-efficient post-training method that improves multimodal models by using visual understanding embeddings as dense text prompts to reconstruct input images, aligning understanding and generation without captions.
Details
Motivation: Conventional training of unified multimodal models relies on sparse image-text pairs that miss fine-grained visual details, even with long captions.
Method: RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image using self-supervised reconstruction loss.
Result: RecA consistently improves generation and editing fidelity across different UMM architectures, boosting GenEval (0.73→0.90), DPGBench (80.93→88.15), ImgEdit (3.38→3.75), and GEdit (6.94→7.25) with only 27 GPU-hours.
Conclusion: RecA is an efficient and general post-training alignment strategy that surpasses larger models and applies broadly across diverse UMM architectures.
Abstract: Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details–even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense “text prompts,” providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73$\rightarrow$0.90) and DPGBench (80.93$\rightarrow$88.15), while also boosting editing benchmarks (ImgEdit 3.38$\rightarrow$3.75, GEdit 6.94$\rightarrow$7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
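At its core, RecA is a reconstruction objective conditioned on the model's own understanding embeddings. A schematic training step follows; `generate_from_embeddings` is a placeholder for however a given UMM consumes conditioning embeddings, and plain MSE stands in for the model-appropriate reconstruction loss (diffusion or autoregressive in practice).

```python
import torch
import torch.nn.functional as F

def reca_step(umm, understanding_encoder, image):
    """One schematic RecA post-training step (placeholder APIs)."""
    with torch.no_grad():
        cond = understanding_encoder(image)       # dense semantic "text prompt"
    recon = umm.generate_from_embeddings(cond)    # differentiable generation pass
    return F.mse_loss(recon, image)               # self-supervised reconstruction
```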
[513] SA-UNetv2: Rethinking Spatial Attention U-Net for Retinal Vessel Segmentation
Changlu Guo, Anders Nymark Christensen, Anders Bjorholm Dahl, Yugen Yi, Morten Rieger Hannemose
Main category: cs.CV
TL;DR: SA-UNetv2 is a lightweight retinal vessel segmentation model that improves upon SA-UNet by adding cross-scale spatial attention to skip connections and using a weighted BCE+MCC loss to handle class imbalance, achieving state-of-the-art performance with minimal computational resources.
Details
Motivation: The existing SA-UNet underuses attention in skip connections and does not address the severe foreground-background imbalance in retinal vessel segmentation, which is crucial for early disease diagnosis of diabetic retinopathy, hypertension, and neurodegenerative disorders.
Method: Proposes SA-UNetv2 with cross-scale spatial attention injected into all skip connections to strengthen multi-scale feature fusion, and uses weighted Binary Cross-Entropy plus Matthews Correlation Coefficient loss to improve robustness to class imbalance.
Result: Achieves state-of-the-art performance on DRIVE and STARE datasets with only 1.2MB memory and 0.26M parameters (less than 50% of SA-UNet), and 1 second CPU inference on 592 x 592 x 3 images.
Conclusion: SA-UNetv2 demonstrates strong efficiency and deployability in resource-constrained, CPU-only settings for retinal vessel segmentation tasks.
Abstract: Retinal vessel segmentation is essential for early diagnosis of diseases such as diabetic retinopathy, hypertension, and neurodegenerative disorders. Although SA-UNet introduces spatial attention in the bottleneck, it underuses attention in skip connections and does not address the severe foreground-background imbalance. We propose SA-UNetv2, a lightweight model that injects cross-scale spatial attention into all skip connections to strengthen multi-scale feature fusion and adopts a weighted Binary Cross-Entropy (BCE) plus Matthews Correlation Coefficient (MCC) loss to improve robustness to class imbalance. On the public DRIVE and STARE datasets, SA-UNetv2 achieves state-of-the-art performance with only 1.2MB memory and 0.26M parameters (less than 50% of SA-UNet), and 1 second CPU inference on 592 x 592 x 3 images, demonstrating strong efficiency and deployability in resource-constrained, CPU-only settings.
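The objective combines weighted BCE with an MCC term. A sketch using soft confusion counts to make MCC differentiable; the mixing weight `alpha` and the positive-class weighting are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def soft_mcc_loss(probs, targets, eps=1e-7):
    """Differentiable MCC from soft confusion counts; 0 loss at perfect MCC."""
    p, t = probs.flatten(), targets.float().flatten()
    tp = (p * t).sum(); tn = ((1 - p) * (1 - t)).sum()
    fp = (p * (1 - t)).sum(); fn = ((1 - p) * t).sum()
    denom = torch.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn) + eps)
    return 1 - (tp * tn - fp * fn) / denom

def sa_unetv2_style_loss(logits, targets, pos_weight, alpha=0.5):
    """Weighted BCE + soft MCC, mirroring the imbalance-robust objective."""
    bce = F.binary_cross_entropy_with_logits(
        logits, targets.float(), pos_weight=pos_weight)
    return alpha * bce + (1 - alpha) * soft_mcc_loss(torch.sigmoid(logits), targets)
```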
[514] Enhancing Feature Fusion of U-like Networks with Dynamic Skip Connections
Yue Cao, Quansong He, Kaishen Wang, Jianlong Xiong, Zhang Yi, Tao He
Main category: cs.CV
TL;DR: Proposes a Dynamic Skip Connection (DSC) block to address limitations in traditional U-net skip connections, featuring Test-Time Training for adaptive feature fusion and Dynamic Multi-Scale Kernel for multi-scale feature integration.
Details
Motivation: Traditional skip connections in U-like networks suffer from inter-feature constraints (static feature fusion) and intra-feature constraints (insufficient multi-scale feature interactions), limiting effective global context aggregation.
Method: Introduces a DSC block with two components: Test-Time Training module for dynamic adaptation during inference, and Dynamic Multi-Scale Kernel module for adaptive kernel size selection based on global context.
Result: The DSC block demonstrates plug-and-play effectiveness across various U-like network architectures including CNN-based, Transformer-based, hybrid CNN-Transformer, and Mamba-based networks.
Conclusion: The proposed DSC block fundamentally enhances cross-layer connectivity through adaptive mechanisms and can be seamlessly integrated into existing U-like network structures.
Abstract: U-like networks have become fundamental frameworks in medical image segmentation through skip connections that bridge high-level semantics and low-level spatial details. Despite their success, conventional skip connections exhibit two key limitations: inter-feature constraints and intra-feature constraints. The inter-feature constraint refers to the static nature of feature fusion in traditional skip connections, where information is transmitted along fixed pathways regardless of feature content. The intra-feature constraint arises from the insufficient modeling of multi-scale feature interactions, thereby hindering the effective aggregation of global contextual information. To overcome these limitations, we propose a novel Dynamic Skip Connection (DSC) block that fundamentally enhances cross-layer connectivity through adaptive mechanisms. The DSC block integrates two complementary components. (1) Test-Time Training (TTT) module. This module addresses the inter-feature constraint by enabling dynamic adaptation of hidden representations during inference, facilitating content-aware feature refinement. (2) Dynamic Multi-Scale Kernel (DMSK) module. To mitigate the intra-feature constraint, this module adaptively selects kernel sizes based on global contextual cues, enhancing the network capacity for multi-scale feature integration. The DSC block is architecture-agnostic and can be seamlessly incorporated into existing U-like network structures. Extensive experiments demonstrate the plug-and-play effectiveness of the proposed DSC block across CNN-based, Transformer-based, hybrid CNN-Transformer, and Mamba-based U-like networks.
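The DMSK idea, selecting kernel sizes from global context, can be sketched as a softmax gate over parallel convolutions of different kernel sizes; the paper's exact design may differ.

```python
import torch
import torch.nn as nn

class DynamicMultiScaleKernel(nn.Module):
    """Mix parallel convolutions with weights predicted from global context."""
    def __init__(self, ch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, k, padding=k // 2) for k in kernel_sizes)
        self.gate = nn.Linear(ch, len(kernel_sizes))

    def forward(self, x):
        ctx = x.mean(dim=(2, 3))                       # (B, C) pooled global context
        w = torch.softmax(self.gate(ctx), dim=1)       # (B, K) per-kernel weights
        outs = torch.stack([b(x) for b in self.branches], dim=1)  # (B, K, C, H, W)
        return (w[:, :, None, None, None] * outs).sum(dim=1)
```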
[515] BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
Shaojie Zhang, Ruoceng Zhang, Pei Fu, Shaokang Wang, Jiahui Yang, Xin Du, Shiqi Cui, Bin Qin, Ying Huang, Zhenbo Luo, Jian Luan
Main category: cs.CV
TL;DR: The paper proposes Blink-Think-Link (BTL), a brain-inspired framework for human-GUI interaction that mimics human cognitive processes through three phases: rapid attention detection, reasoning, and action generation.
Details
Motivation: Current AI-driven GUI interaction systems deviate from natural human communication patterns, creating a gap that needs to be filled with more biologically plausible approaches.
Method: BTL framework with three phases: Blink (attention detection), Think (reasoning), and Link (action generation). Includes automated blink data generation and a rule-based reward mechanism for reinforcement learning.
Result: Developed BTL-UI agent model showing competitive performance in both static GUI understanding and dynamic interaction tasks across comprehensive benchmarks.
Conclusion: The framework provides conclusive empirical validation for developing advanced GUI agents that better mimic human cognitive processes.
Abstract: In the field of AI-driven human-GUI interaction automation, while rapid advances in multimodal large language models and reinforcement fine-tuning techniques have yielded remarkable progress, a fundamental challenge persists: their interaction logic significantly deviates from natural human-GUI communication patterns. To fill this gap, we propose “Blink-Think-Link” (BTL), a brain-inspired framework for human-GUI interaction that mimics the human cognitive process between users and graphical interfaces. The system decomposes interactions into three biologically plausible phases: (1) Blink - rapid detection and attention to relevant screen areas, analogous to saccadic eye movements; (2) Think - higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link - generation of executable commands for precise motor control, emulating human action selection mechanisms. Additionally, we introduce two key technical innovations for the BTL framework: (1) Blink Data Generation - an automated annotation pipeline specifically optimized for blink data, and (2) BTL Reward – the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building upon this framework, we develop a GUI agent model named BTL-UI, which demonstrates competitive performance across both static GUI understanding and dynamic interaction tasks in comprehensive benchmarks. These results provide conclusive empirical validation of the framework’s efficacy in developing advanced GUI Agents.
[516] UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen
Main category: cs.CV
TL;DR: UniPixel is a large multi-modal model that integrates pixel-level perception with visual reasoning, enabling mask-grounded responses and fine-grained pixel-level understanding across images and videos.
Details
Motivation: Current LMMs focus on holistic image/video understanding but lack fine-grained pixel-level alignment capabilities. Existing models perform referring or segmentation tasks independently without integrating them into visual reasoning.
Method: UniPixel processes visual prompts, generates relevant masks on demand, and performs reasoning conditioned on these intermediate pointers during inference, enabling pixel-level reasoning.
Result: Verified on 10 benchmarks across diverse tasks including pixel-level referring/segmentation, object-centric understanding in images/videos, and a novel PixelQA task requiring joint referring, segmentation, and QA.
Conclusion: UniPixel successfully bridges the gap between pixel-level perception and visual reasoning, demonstrating flexible comprehension of visual prompts and mask-grounded responses across multiple tasks.
Abstract: Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with particular focuses on holistic image- and video-language understanding. Conversely, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to realize pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.
[517] Can Less Precise Be More Reliable? A Systematic Evaluation of Quantization’s Impact on CLIP Beyond Accuracy
Aymen Bouguerra, Daniel Montoya, Alexandra Gomez-Villa, Fabio Arnez, Chokri Mraidha
Main category: cs.CV
TL;DR: Quantization of CLIP models can improve calibration for underconfident models and enhance OOD detection, with specific QAT methods enabling simultaneous gains in accuracy, calibration, and robustness.
Details
Motivation: To understand the impact of quantization on CLIP models beyond accuracy, particularly on reliability metrics like calibration and OOD detection, which are crucial for efficient and reliable deployment.
Method: Large-scale evaluation of quantization on CLIP models, assessing in-distribution accuracy, calibration, and OOD detection, and identifying effective quantization-aware training methods.
Result: Quantization improves calibration for underconfident models but degrades it for overconfident ones; OOD detection can still improve despite poor calibration; specific QAT methods achieve simultaneous gains in accuracy, calibration, and OOD robustness.
Conclusion: Quantization can be leveraged beyond efficiency to enhance reliability and robustness in CLIP models, challenging the traditional efficiency-performance trade-off and providing insights for multi-objective deployment.
Abstract: The powerful zero-shot generalization capabilities of vision-language models (VLMs) like CLIP have enabled new paradigms for safety-related tasks such as out-of-distribution (OOD) detection. However, additional aspects crucial for the computationally efficient and reliable deployment of CLIP are still overlooked. In particular, the impact of quantization on CLIP’s performance beyond accuracy remains underexplored. This work presents a large-scale evaluation of quantization on CLIP models, assessing not only in-distribution accuracy but a comprehensive suite of reliability metrics and revealing counterintuitive results driven by pre-training source. We demonstrate that quantization consistently improves calibration for typically underconfident pre-trained models, while often degrading it for overconfident variants. Intriguingly, this degradation in calibration does not preclude gains in other reliability metrics; we find that OOD detection can still improve for these same poorly calibrated models. Furthermore, we identify specific quantization-aware training (QAT) methods that yield simultaneous gains in zero-shot accuracy, calibration, and OOD robustness, challenging the view of a strict efficiency-performance trade-off. These findings offer critical insights for navigating the multi-objective problem of deploying efficient, reliable, and robust VLMs by utilizing quantization beyond its conventional role.
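Calibration in studies like this is typically measured with the expected calibration error (ECE): bin predictions by confidence and average the gap between confidence and accuracy, weighted by bin size. For reference, the standard computation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """confidences: predicted max-class probabilities; correct: 0/1 array
    marking whether each prediction was right."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # bin weight times |accuracy - confidence|
    return ece
```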
[518] VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
Md. Mahfuzur Rahman, Kishor Datta Gupta, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Roy George
Main category: cs.CV
TL;DR: VLCE is a multimodal system that generates comprehensive disaster damage descriptions from satellite and UAV imagery using dual CNN-LSTM and Vision Transformer architectures with external semantic knowledge, outperforming baseline models.
Details
Motivation: Traditional manual damage assessment after disasters is slow and dangerous. Current computer vision methods only provide classification labels or segmentation masks, lacking comprehensive situational understanding.
Method: Dual-architecture approach: CNN-LSTM with ResNet50 backbone for satellite imagery (xBD dataset) and Vision Transformer for UAV pictures (RescueNet dataset), enhanced with external semantic knowledge from ConceptNet and WordNet.
Result: VLCE significantly outperforms baseline models (LLaVA and QwenVL), achieving up to 95.33% on InfoMetIC for caption informativeness while maintaining competitive semantic alignment with CLIPScore.
Conclusion: The dual-architecture system shows significant potential for improving disaster damage assessment by automating production of actionable, information-dense descriptions from satellite and drone imagery.
Abstract: Immediate damage assessment is essential after natural catastrophes; yet conventional manual evaluation techniques are slow and hazardous. Although satellite and unmanned aerial vehicle (UAV) photos offer extensive perspectives of impacted regions, current computer vision methodologies generally yield just classification labels or segmentation masks, thereby constraining their capacity to deliver thorough situational comprehension. We introduce the Vision Language Caption Enhancer (VLCE), a multimodal system designed to produce comprehensive, contextually-informed explanations of disaster imagery. VLCE employs a dual-architecture approach: a CNN-LSTM model with a ResNet50 backbone pretrained on EuroSat satellite imagery for the xBD dataset, and a Vision Transformer (ViT) model pretrained on UAV pictures for the RescueNet dataset. Both systems utilize external semantic knowledge from ConceptNet and WordNet to expand vocabulary coverage and improve description accuracy. We assess VLCE in comparison to leading vision-language models (LLaVA and QwenVL) utilizing CLIPScore for semantic alignment and InfoMetIC for caption informativeness. Experimental findings indicate that VLCE markedly surpasses baseline models, attaining a maximum of 95.33% on InfoMetIC while preserving competitive semantic alignment. Our dual-architecture system demonstrates significant potential for improving disaster damage assessment by automating the production of actionable, information-dense descriptions from satellite and drone photos.
[519] Image-Plane Geometric Decoding for View-Invariant Indoor Scene Reconstruction
Mingyang Li, Yimeng Fan, Changsong Liu, Lixue Xu, Xin Wang, Yanyan Liu, Wei Zhang
Main category: cs.CV
TL;DR: A novel image-plane decoding framework for indoor scene reconstruction that reduces dependency on multi-view geometric constraints by exploiting spatial information within individual views, achieving superior reconstruction stability with minimal performance degradation when view count is reduced.
Details
Motivation: Existing volume-based reconstruction methods rely heavily on multi-view pixel back-projection ray intersections, making reconstruction quality dependent on input view density and causing performance degradation in overlapping regions and unobserved areas.
Method: Proposes an image-plane decoding framework with three components: Pixel-level Confidence Encoder, Affine Compensation Module, and Image-Plane Spatial Decoder. These modules decode 3D structural information encoded in images through physical imaging processes to preserve spatial geometric features.
Result: Achieves superior reconstruction stability with nearly identical quality when view count reduces by 40%. Achieves coefficient of variation of 0.24%, performance retention rate of 99.7%, and maximum performance drop of 0.42% on indoor scene reconstruction datasets.
Conclusion: Exploiting intra-view spatial information provides a robust solution for view-limited scenarios in practical applications, significantly enhancing view-invariant reconstruction.
Abstract: Volume-based indoor scene reconstruction methods offer superior generalization capability and real-time deployment potential. However, existing methods rely on multi-view pixel back-projection ray intersections as weak geometric constraints to determine spatial positions. This dependence results in reconstruction quality being heavily influenced by input view density, and performance degrades in overlapping regions and unobserved areas. To address these limitations, we reduce dependency on inter-view geometric constraints by exploiting spatial information within individual views. We propose an image-plane decoding framework with three core components: Pixel-level Confidence Encoder, Affine Compensation Module, and Image-Plane Spatial Decoder. These modules decode three-dimensional structural information encoded in images through physical imaging processes. The framework effectively preserves spatial geometric features including edges, hollow structures, and complex textures, and significantly enhances view-invariant reconstruction. Experiments on indoor scene reconstruction datasets confirm superior reconstruction stability: our method maintains nearly identical quality when the view count is reduced by 40%, achieving a coefficient of variation of 0.24%, a performance retention rate of 99.7%, and a maximum performance drop of 0.42%. These results demonstrate that exploiting intra-view spatial information provides a robust solution for view-limited scenarios in practical applications.
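For reference, the stability numbers quoted above follow from a series of per-view-count quality scores under the usual definitions, which we assume here (coefficient of variation as std/mean, retention as min/max); the scores themselves are made up for illustration.

```python
import torch

# Hypothetical quality scores at decreasing view counts (e.g., 100%, 80%, 60% of views).
scores = torch.tensor([0.652, 0.651, 0.650])
cov = 100 * scores.std() / scores.mean()                        # coefficient of variation (%)
retention = 100 * scores.min() / scores.max()                   # performance retention rate (%)
max_drop = 100 * (scores.max() - scores.min()) / scores.max()   # maximum performance drop (%)
print(f"CoV {cov:.2f}%  retention {retention:.2f}%  drop {max_drop:.2f}%")
```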
[520] Editable Noise Map Inversion: Encoding Target-image into Noise For High-Fidelity Image Manipulation
Mingyu Kang, Yong Suk Choi
Main category: cs.CV
TL;DR: ENM Inversion is a novel noise map inversion method for text-guided image editing that optimizes noise maps to balance content preservation and editability, outperforming existing approaches and extending to video editing.
Details
Motivation: Previous inversion methods for text-guided image editing struggle to adhere closely to target text prompts because inverted noise maps, while enabling faithful reconstruction of source images, restrict flexibility needed for desired edits.
Method: Proposes Editable Noise Map Inversion (ENM Inversion) that searches for optimal noise maps through editable noise refinement, minimizing the difference between reconstructed and edited noise maps based on analysis of noise map properties for enhanced editability.
Result: Extensive experiments show ENM Inversion outperforms existing approaches across a wide range of image editing tasks in both preservation and edit fidelity with target prompts, and can be applied to video editing for temporal consistency.
Conclusion: ENM Inversion effectively addresses the trade-off between content preservation and editability in text-guided image editing, providing superior performance and extending to consistent video editing applications.
Abstract: Text-to-image diffusion models have achieved remarkable success in generating high-quality and diverse images. Building on these advancements, diffusion models have also demonstrated exceptional performance in text-guided image editing. A key strategy for effective image editing involves inverting the source image into editable noise maps associated with the target image. However, previous inversion methods face challenges in adhering closely to the target text prompt. The limitation arises because inverted noise maps, while enabling faithful reconstruction of the source image, restrict the flexibility needed for desired edits. To overcome this issue, we propose Editable Noise Map Inversion (ENM Inversion), a novel inversion technique that searches for optimal noise maps to ensure both content preservation and editability. We analyze the properties of noise maps for enhanced editability. Based on this analysis, our method introduces an editable noise refinement that aligns with the desired edits by minimizing the difference between the reconstructed and edited noise maps. Extensive experiments demonstrate that ENM Inversion outperforms existing approaches across a wide range of image editing tasks in both preservation and edit fidelity with target prompts. Our approach can also be easily applied to video editing, enabling temporal consistency and content manipulation across frames.
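A simplified reading of the editable-noise refinement step: pull the inverted noise map toward the edit-consistent one while staying close to the original. The MSE objective, proximity weight, and latent shapes below are our assumptions, not the paper's exact formulation.

```python
import torch

def refine_noise_map(z_inv, z_edit, steps=50, lr=0.05, w=0.5):
    """Hypothetical editable-noise refinement: minimize the gap between the
    reconstructed and edited noise maps, with a proximity term to z_inv."""
    z = z_inv.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = (z - z_edit).pow(2).mean() + w * (z - z_inv).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

z_inv = torch.randn(1, 4, 64, 64)   # noise map from DDIM-style inversion
z_edit = torch.randn(1, 4, 64, 64)  # noise map implied by the edited prompt
z_star = refine_noise_map(z_inv, z_edit)
print((z_star - z_inv).abs().mean())
```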
[521] EVODiff: Entropy-aware Variance Optimized Diffusion Inference
Shigui Li, Wei Chen, Delu Zeng
Main category: cs.CV
TL;DR: EVODiff is an entropy-aware variance optimized method for diffusion models that systematically reduces uncertainty during denoising, significantly outperforming state-of-the-art solvers like DPM-Solver++ in both efficiency and quality.
Details
Motivation: Diffusion models suffer from slow inference and training-inference discrepancies, while existing gradient-based solvers lack theoretical foundations in information transmission efficiency.
Method: Proposes an information-theoretic perspective revealing that successful denoising reduces conditional entropy, leading to EVODiff which optimizes conditional variance to minimize transition and reconstruction errors.
Result: EVODiff reduces reconstruction error by up to 45.5% (FID from 5.10 to 2.78) at 10 NFE on CIFAR-10, cuts NFE cost by 25% on ImageNet-256, and improves text-to-image generation with reduced artifacts.
Conclusion: The information-theoretic approach provides theoretical foundations for diffusion model inference, and EVODiff demonstrates significant improvements over state-of-the-art methods in both efficiency and generation quality.
Abstract: Diffusion models (DMs) excel in image generation, but suffer from slow inference and the training-inference discrepancies. Although gradient-based solvers like DPM-Solver accelerate the denoising inference, they lack theoretical foundations in information transmission efficiency. In this work, we introduce an information-theoretic perspective on the inference processes of DMs, revealing that successful denoising fundamentally reduces conditional entropy in reverse transitions. This principle leads to our key insights into the inference processes: (1) data prediction parameterization outperforms its noise counterpart, and (2) optimizing conditional variance offers a reference-free way to minimize both transition and reconstruction errors. Based on these insights, we propose an entropy-aware variance optimized method for the generative process of DMs, called EVODiff, which systematically reduces uncertainty by optimizing conditional entropy during denoising. Extensive experiments on DMs validate our insights and demonstrate that our method significantly and consistently outperforms state-of-the-art (SOTA) gradient-based solvers. For example, compared to the DPM-Solver++, EVODiff reduces the reconstruction error by up to 45.5% (FID improves from 5.10 to 2.78) at 10 function evaluations (NFE) on CIFAR-10, cuts the NFE cost by 25% (from 20 to 15 NFE) for high-quality samples on ImageNet-256, and improves text-to-image generation while reducing artifacts. Code is available at https://github.com/ShiguiLi/EVODiff.
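The first insight above, data prediction versus noise prediction, rests on the standard DDPM identity relating the two parameterizations; a short sketch of the conversion (function and variable names are ours):

```python
import torch

def eps_to_x0(x_t, eps_hat, alpha_bar_t):
    """Standard DDPM identity relating the two parameterizations:
    x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps  =>  x0 = (x_t - sqrt(1-abar)*eps)/sqrt(abar)."""
    return (x_t - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()

x_t = torch.randn(2, 3, 32, 32)       # noisy sample at timestep t
eps_hat = torch.randn_like(x_t)       # network's noise prediction
alpha_bar = torch.tensor(0.7)         # cumulative noise schedule value at t
x0_hat = eps_to_x0(x_t, eps_hat, alpha_bar)
print(x0_hat.shape)
```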
[522] Holistic Order Prediction in Natural Scenes
Pierre Musacchio, Hyunmin Lee, Jaesik Park
Main category: cs.CV
TL;DR: InstaFormer is a network that predicts full occlusion and depth orderings for all instances in a scene from a single RGB image input, eliminating the need for expensive input formats and quadratic inference costs.
Details
Motivation: Understanding instance-wise geometries in visual models is challenging, and existing methods rely on expensive inputs like category labels and segmentation masks, with high inference costs requiring multiple forward passes.
Method: InstaFormer uses interactions between object queries and latent mask descriptors that semantically represent the same objects while carrying complementary information, enabling holistic order prediction in a single forward pass.
Result: The approach is comprehensively benchmarked and ablated to demonstrate its effectiveness in predicting occlusion and depth orderings from RGB images.
Conclusion: InstaFormer provides an efficient solution for instance-wise geometry understanding with reduced input requirements and inference costs, with code and models available as open-source.
Abstract: Even in controlled settings, understanding instance-wise geometries is a challenging task for a wide range of visual models. Although specialized systems exist, modern approaches rely on expensive input formats (category labels, binary segmentation masks) and inference costs (a quadratic number of forward passes). We mitigate these limitations by proposing InstaFormer, a network capable of holistic order prediction. That is, solely given an input RGB image, InstaFormer returns the full occlusion and depth orderings for all the instances in the scene in a single forward pass. At its core, InstaFormer relies on interactions between object queries and latent mask descriptors that semantically represent the same objects while carrying complementary information. We comprehensively benchmark and ablate our approach to highlight its effectiveness. Our code and models are open-source and available at https://github.com/SNU-VGILab/InstaOrder.
[523] Net2Net: When Un-trained Meets Pre-trained Networks for Robust Real-World Denoising
Weimin Yuan, Cai Meng
Main category: cs.CV
TL;DR: Net2Net combines untrained and pre-trained networks using regularization by denoising (RED) for real-world noise removal, achieving better generalization across diverse noise types without requiring extensive labeled data.
Details
Motivation: Traditional denoising methods rely on handcrafted priors and struggle with real noise complexity, while deep learning approaches need extensive labeled data and may not generalize well across different noise types and imaging conditions.
Method: Hybrid framework combining unsupervised Deep Image Prior (DIP) and supervised pre-trained DRUNet model using regularization by denoising (RED). Untrained network adapts to input-specific noise, while pre-trained network leverages learned representations from large datasets.
Result: Extensive experiments on benchmark datasets demonstrate superior performance for real-world noise removal, particularly in scenarios with limited training data.
Conclusion: Net2Net effectively addresses real-world noise removal challenges by combining the strengths of untrained and pre-trained networks, enhancing generalization across varying noise patterns and improving performance with limited data.
Abstract: Traditional denoising methods have largely relied on handcrafted priors, which often perform well in controlled environments but struggle to address the complexity and variability of real noise. In contrast, deep learning-based approaches have gained prominence for learning noise characteristics from large datasets, but these methods frequently require extensive labeled data and may not generalize effectively across diverse noise types and imaging conditions. In this paper, we present an innovative method, termed Net2Net, that combines the strengths of untrained and pre-trained networks to tackle the challenges of real-world noise removal. The innovation of Net2Net lies in its combination of the unsupervised Deep Image Prior (DIP) and the supervised pre-trained DRUNet model via regularization by denoising (RED). The untrained network adapts to the unique noise characteristics of each input image without requiring labeled data, while the pre-trained network leverages learned representations from large-scale datasets to deliver robust denoising performance. This hybrid framework enhances generalization across varying noise patterns and improves performance, particularly in scenarios with limited training data. Extensive experiments on benchmark datasets demonstrate the superiority of our method for real-world noise removal.
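The RED coupling at the heart of Net2Net admits a compact gradient form: under RED's standard assumptions the prior gradient is lambda*(x - D(x)). A minimal sketch with a toy denoiser standing in for DRUNet and a quadratic data term; all hyperparameters are illustrative.

```python
import torch

def red_step(x, y, denoiser, lam=0.1, eta=0.05):
    """One regularization-by-denoising (RED) update: data-fidelity gradient
    plus lam*(x - D(x)), the RED prior gradient under its standard assumptions."""
    with torch.no_grad():
        grad_data = x - y             # gradient of 0.5*||x - y||^2
        grad_prior = x - denoiser(x)  # RED: gradient of 0.5*lam*x^T(x - D(x))
        return x - eta * (grad_data + lam * grad_prior)

# Toy smoothing denoiser; Net2Net would plug in the pre-trained DRUNet here.
denoiser = lambda x: 0.5 * (x + x.mean(dim=(-1, -2), keepdim=True))

y = torch.randn(1, 3, 64, 64)  # noisy observation
x = y.clone()
for _ in range(100):
    x = red_step(x, y, denoiser)
print((x - y).abs().mean())
```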
[524] Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval
Lanyun Zhu, Deyi Ji, Tianrun Chen, Haiyang Wu, Shiqi Wang
Main category: cs.CV
TL;DR: Retrv-R1 is the first R1-style multimodal LLM for universal retrieval that uses step-by-step reasoning for more accurate results, addressing computational cost and training instability issues through information compression and a new training paradigm.
Details
Motivation: Direct application of DeepSeek-R1 methods to retrieval tasks is infeasible due to high computational costs from large token consumption and instability when applying RL directly to retrieval training.
Method: Introduces information compression module with details inspection mechanism to reduce tokens while preserving critical information, and a new training paradigm with activation stage using retrieval-tailored synthetic CoT dataset followed by RL with curriculum reward.
Result: Retrv-R1 achieves state-of-the-art performance, high efficiency, and strong generalization ability across multiple benchmarks and tasks.
Conclusion: The proposed approach successfully enables effective RL-based training for multimodal retrieval tasks while maintaining computational efficiency and performance.
Abstract: The success of DeepSeek-R1 demonstrates the immense potential of using reinforcement learning (RL) to enhance LLMs’ reasoning capabilities. This paper introduces Retrv-R1, the first R1-style MLLM specifically designed for multimodal universal retrieval, achieving higher performance by employing step-by-step reasoning to produce more accurate retrieval results. We find that directly applying the methods of DeepSeek-R1 to retrieval tasks is not feasible, mainly due to (1) the high computational cost caused by the large token consumption required for multiple candidates with reasoning processes, and (2) the instability and suboptimal results when directly applying RL to train for retrieval tasks. To address these issues, Retrv-R1 introduces an information compression module with a details inspection mechanism, which enhances computational efficiency by reducing the number of tokens while ensuring that critical information for challenging candidates is preserved. Furthermore, a new training paradigm is proposed, including an activation stage using a retrieval-tailored synthetic CoT dataset for more effective optimization, followed by RL with a novel curriculum reward to improve both performance and efficiency. Incorporating these novel designs, Retrv-R1 achieves SOTA performance, high efficiency, and strong generalization ability, as demonstrated by experiments across multiple benchmarks and tasks.
[525] SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus
Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu, Zhongyan Bi, Chenyang Si, Tiansheng Sun, Caifeng Shan
Main category: cs.CV
TL;DR: SpineMed introduces a comprehensive ecosystem for AI-assisted spine diagnosis, featuring SpineMed-450k (450k instruction instances for vertebral-level reasoning across imaging modalities) and SpineBench (clinically-grounded evaluation framework), addressing limitations in current spine AI datasets.
Details
Motivation: Spine disorders affect 619 million people globally and are a leading cause of disability, but AI-assisted diagnosis is limited by lack of level-aware multimodal datasets and standardized benchmarks for clinical decision-making across X-ray, CT, and MRI at specific vertebral levels.
Method: Co-designed with spine surgeons, using clinician-in-the-loop pipeline with two-stage LLM generation (draft and revision) to curate SpineMed-450k from textbooks, guidelines, open datasets, and ~1,000 hospital cases. Includes question-answering, multi-turn consultations, and report generation tasks.
Result: Evaluation on SpineBench reveals systematic weaknesses in current LVLMs for fine-grained, level-specific reasoning. Model fine-tuned on SpineMed-450k shows consistent and significant improvements across all tasks. Clinician assessments confirm diagnostic clarity and practical utility.
Conclusion: SpineMed addresses critical gaps in spine AI by providing traceable, clinically-grounded instruction data and standardized benchmarks, enabling improved vertebral-level reasoning across imaging modalities for enhanced clinical decision-making.
Abstract: Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model’s outputs.
[526] Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing
Danial Samadi Vahdati, Tai Duc Nguyen, Ekta Prashnani, Koki Nagano, David Luebke, Orazio Gallo, Matthew Stamm
Main category: cs.CV
TL;DR: A defense method against AI-based talking-head video conferencing attacks that detects identity hijacking by analyzing biometric information in pose-expression latents, without examining the reconstructed RGB video.
Details
Motivation: Current talking-head videoconferencing systems are vulnerable to puppeteering attacks where attackers can hijack a victim's likeness in real-time, and existing deepfake detectors fail because every frame is synthetic.
Method: A pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues in transmitted pose-expression latents while canceling transient pose and expression, followed by a simple cosine test on the disentangled embedding.
Result: The method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios across multiple talking-head generation models.
Conclusion: The proposed biometric leakage defense effectively detects illicit identity swaps in real-time by analyzing biometric information inherent in pose-expression latents, providing a robust security solution for talking-head videoconferencing systems.
Abstract: AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim’s likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios.
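The final detection step is explicitly a cosine test on the disentangled embedding; a sketch of just that step, with the contrastive encoder omitted and the threshold chosen arbitrarily for illustration:

```python
import torch
import torch.nn.functional as F

def flag_identity_swap(enrolled, frame_embeddings, threshold=0.5):
    """Cosine test from the paper's final step: compare each frame's
    disentangled identity embedding against the enrolled one.
    The pose-conditioned encoder producing these embeddings is omitted."""
    sims = F.cosine_similarity(frame_embeddings, enrolled.unsqueeze(0), dim=-1)
    return sims < threshold  # True marks a suspected puppeteered frame

enrolled = F.normalize(torch.randn(128), dim=0)      # victim's enrolled embedding
frames = F.normalize(torch.randn(30, 128), dim=-1)   # per-frame latent embeddings
print(flag_identity_swap(enrolled, frames).sum().item(), "frames flagged")
```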
[527] One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting
Haipeng Liu, Yang Wang, Meng Wang
Main category: cs.CV
TL;DR: NTN-Diff is a frequency-aware diffusion model for text-guided image inpainting that addresses preservation of unmasked regions and semantic consistency between masked/unmasked areas by decomposing frequency bands and using null-text guidance.
Details
Motivation: Previous methods failed to simultaneously preserve unmasked regions and achieve semantic consistency between masked and unmasked areas, due to entanglement of different frequency bands that respond differently to text prompts during denoising.
Method: Proposes null-text-null frequency-aware diffusion model that divides denoising into early (high-level noise) and late (low-level noise) stages, disentangling mid-and-low frequency bands. Uses stable mid-frequency band as guidance for null-text denoising of low-frequency band, followed by text-guided denoising.
Result: Extensive experiments show NTN-Diff outperforms state-of-the-art diffusion models for text-guided image inpainting, achieving better preservation of unmasked regions and semantic consistency.
Conclusion: NTN-Diff successfully addresses the dual challenges in text-guided image inpainting by frequency band decomposition and staged denoising with null-text guidance, demonstrating superior performance over existing methods.
Abstract: Text-guided image inpainting aims at reconstructing the masked regions as per text prompts; the longstanding challenges lie in preserving the unmasked regions while achieving semantic consistency between unmasked and inpainted masked regions. Previous arts failed to address both, always remedying one at the expense of the other. As we observed, this stems from the entanglement of the hybrid (e.g., mid-and-low) frequency bands that encode varied image properties, which exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion model, dubbed NTN-Diff, for text-guided image inpainting, which decomposes the semantic consistency across masked and unmasked regions into per-frequency-band consistencies while preserving the unmasked regions, thereby addressing both challenges at once. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid-and-low frequency bands are disentangled. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during the text-guided denoising process; it meanwhile serves as guidance for the null-text denoising process that denoises the low-frequency band for the masked regions, followed by a subsequent text-guided denoising process at the late stage, achieving semantic consistency of the mid-and-low frequency bands across masked and unmasked regions while preserving the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over state-of-the-art text-guided diffusion models. Our code can be accessed at https://github.com/htyjers/NTN-Diff.
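The band decomposition NTN-Diff relies on can be pictured with simple radial FFT masks; the sketch below partitions an image into low/mid/high bands that sum back to the input. The cutoff radii are illustrative, not the paper's values.

```python
import torch

def frequency_bands(img, r_low=0.1, r_mid=0.4):
    """Split an image into low/mid/high frequency bands via radial FFT masks."""
    f = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    h, w = img.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    radius = (yy ** 2 + xx ** 2).sqrt()

    def band(mask):
        # Zero out all frequencies outside the band, then invert the FFT.
        return torch.fft.ifft2(torch.fft.ifftshift(f * mask.to(f.dtype),
                                                   dim=(-2, -1))).real

    low = band(radius <= r_low)
    mid = band((radius > r_low) & (radius <= r_mid))
    high = band(radius > r_mid)
    return low, mid, high

img = torch.randn(1, 3, 64, 64)
low, mid, high = frequency_bands(img)
print(torch.allclose(low + mid + high, img, atol=1e-4))  # True: bands partition the image
```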
[528] A Novel Multi-branch ConvNeXt Architecture for Identifying Subtle Pathological Features in CT Scans
Irash Perera, Uthayasanker Thayasivam
Main category: cs.CV
TL;DR: A novel multi-branch ConvNeXt architecture for COVID-19 diagnosis from CT scans achieves state-of-the-art performance with ROC-AUC of 0.9937, validation accuracy of 0.9757, and F1-score of 0.9825.
Details
Motivation: To develop an advanced deep learning model for intelligent analysis of medical imaging, specifically addressing the challenge of identifying subtle pathological features in COVID-19 diagnosis from CT scans.
Method: Multi-branch ConvNeXt architecture with three parallel branches (Global Average Pooling, Global Max Pooling, and Attention-weighted Pooling), using end-to-end pipeline with data preprocessing, augmentation, and two-phase training strategy with transfer learning.
Result: Superior performance on validation set with 2,609 CT slices: ROC-AUC 0.9937, validation accuracy 0.9757, F1-score 0.9825 for COVID-19 cases, outperforming all previously reported models on this dataset.
Conclusion: Modern multi-branch architecture with careful data handling can achieve performance comparable to or exceeding state-of-the-art models, proving the efficacy of advanced deep learning for robust medical diagnostics.
Abstract: Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis, especially for identifying subtle pathological features. This paper introduces a novel multi-branch ConvNeXt architecture designed specifically for the nuanced challenges of medical image analysis. While applied here to the specific problem of COVID-19 diagnosis, the methodology offers a generalizable framework for classifying a wide range of pathologies from CT scans. The proposed model incorporates a rigorous end-to-end pipeline, from meticulous data preprocessing and augmentation to a disciplined two-phase training strategy that leverages transfer learning effectively. The architecture uniquely integrates features extracted from three parallel branches: Global Average Pooling, Global Max Pooling, and a new Attention-weighted Pooling mechanism. The model was trained and validated on a combined dataset of 2,609 CT slices derived from two distinct datasets. Experimental results demonstrate a superior performance on the validation set, achieving a final ROC-AUC of 0.9937, a validation accuracy of 0.9757, and an F1-score of 0.9825 for COVID-19 cases, outperforming all previously reported models on this dataset. These findings indicate that a modern, multi-branch architecture, coupled with careful data handling, can achieve performance comparable to or exceeding contemporary state-of-the-art models, thereby proving the efficacy of advanced deep learning techniques for robust medical diagnostics.
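The three parallel pooling branches are straightforward to sketch: below, a 1x1 convolution produces the attention weights for the attention-weighted branch, and the three descriptors are concatenated. Channel sizes are illustrative, and the ConvNeXt backbone feeding this head is omitted.

```python
import torch
import torch.nn as nn

class TripleBranchPooling(nn.Module):
    """The three parallel pooling branches described above; the attention
    branch learns per-location weights and takes a weighted average."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats):                        # (B, C, H, W)
        gap = feats.mean(dim=(-1, -2))               # global average pooling
        gmp = feats.amax(dim=(-1, -2))               # global max pooling
        w = self.attn(feats).flatten(2).softmax(-1)  # (B, 1, H*W) attention weights
        awp = (feats.flatten(2) * w).sum(-1)         # attention-weighted pooling
        return torch.cat([gap, gmp, awp], dim=-1)    # (B, 3C) fused descriptor

pool = TripleBranchPooling(768)
print(pool(torch.randn(2, 768, 7, 7)).shape)  # torch.Size([2, 2304])
```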
[529] Gesplat: Robust Pose-Free 3D Reconstruction via Geometry-Guided Gaussian Splatting
Jiahui Lu, Haihong Xiao, Xueyan Zhao, Wenxiong Kang
Main category: cs.CV
TL;DR: Gesplat is a 3DGS-based framework for robust novel view synthesis and geometrically consistent reconstruction from unposed sparse images, overcoming limitations of traditional methods that require accurate camera poses and dense viewpoint coverage.
Details
Motivation: NeRF and 3DGS depend heavily on accurate camera poses and dense viewpoint coverage, limiting their applicability in sparse-view settings where pose estimation becomes unreliable and supervision is insufficient.
Method: Leverages VGGT foundation model for reliable initial poses and dense point clouds; integrates hybrid Gaussian representation with dual position-shape optimization, graph-guided attribute refinement, and flow-based depth regularization.
Result: Achieves more robust performance on both forward-facing and large-scale complex datasets compared to other pose-free methods, as demonstrated through comprehensive quantitative and qualitative experiments.
Conclusion: Gesplat enables robust novel view synthesis and geometrically consistent reconstruction from unposed sparse images, overcoming key limitations of existing methods in sparse-view settings.
Abstract: Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have advanced 3D reconstruction and novel view synthesis, but remain heavily dependent on accurate camera poses and dense viewpoint coverage. These requirements limit their applicability in sparse-view settings, where pose estimation becomes unreliable and supervision is insufficient. To overcome these challenges, we introduce Gesplat, a 3DGS-based framework that enables robust novel view synthesis and geometrically consistent reconstruction from unposed sparse images. Unlike prior works that rely on COLMAP for sparse point cloud initialization, we leverage the VGGT foundation model to obtain more reliable initial poses and dense point clouds. Our approach integrates several key innovations: 1) a hybrid Gaussian representation with dual position-shape optimization enhanced by inter-view matching consistency; 2) a graph-guided attribute refinement module to enhance scene details; and 3) flow-based depth regularization that improves depth estimation accuracy for more effective supervision. Comprehensive quantitative and qualitative experiments demonstrate that our approach achieves more robust performance on both forward-facing and large-scale complex datasets compared to other pose-free methods.
[530] Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization
Rui Chen, Bin Liu, Changtao Miao, Xinghao Wang, Yi Li, Tao Gong, Qi Chu, Nenghai Yu
Main category: cs.CV
TL;DR: ICFC is a training-free framework using multi-modal LLMs for interpretable image manipulation localization, achieving state-of-the-art performance without pixel-level annotations.
Details
Motivation: Address security threats from image tampering by developing an effective IML method that doesn't require costly pixel-level annotations and provides interpretability.
Method: In-Context Forensic Chain (ICFC) with objectified rule construction, adaptive filtering for knowledge base, and multi-step progressive reasoning pipeline mirroring expert forensic workflows.
Result: Surpasses state-of-the-art training-free methods and achieves competitive/superior performance compared to weakly and fully supervised approaches across multiple benchmarks.
Conclusion: ICFC successfully leverages MLLM reasoning for interpretable IML with strong performance, bridging the gap between training-free and supervised methods.
Abstract: Advances in image tampering pose serious security threats, underscoring the need for effective image manipulation localization (IML). While supervised IML achieves strong performance, it depends on costly pixel-level annotations. Existing weakly supervised or training-free alternatives often underperform and lack interpretability. We propose the In-Context Forensic Chain (ICFC), a training-free framework that leverages multi-modal large language models (MLLMs) for interpretable IML tasks. ICFC integrates an objectified rule construction with adaptive filtering to build a reliable knowledge base and a multi-step progressive reasoning pipeline that mirrors expert forensic workflows from coarse proposals to fine-grained forensics results. This design enables systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability. Across multiple benchmarks, ICFC not only surpasses state-of-the-art training-free methods but also achieves competitive or superior performance compared to weakly and fully supervised approaches.
[531] Dynamic Gaussian Splatting from Defocused and Motion-blurred Monocular Videos
Xuankai Zhang, Junjin Xiao, Qing Zhang
Main category: cs.CV
TL;DR: A unified framework for high-quality dynamic Gaussian Splatting from defocused and motion-blurred monocular videos, using blur prediction networks and dynamic Gaussian densification.
Details
Motivation: Existing methods are tailored for either defocus blur or motion blur separately, lacking ability to handle both simultaneously. Joint modeling as blur kernel convolution is limited by difficulty in estimating accurate blur kernels.
Method: Proposes per-pixel blur kernel estimation using blur prediction network with blur-aware sparsity constraint, dynamic Gaussian densification for incomplete regions, and novel view synthesis optimization.
Result: Outperforms state-of-the-art methods in generating photorealistic novel view synthesis from defocused and motion-blurred monocular videos.
Conclusion: The framework successfully handles both defocus and motion blur in monocular videos, achieving superior novel view synthesis quality through joint blur modeling and optimization techniques.
Abstract: This paper presents a unified framework that allows high-quality dynamic Gaussian Splatting from both defocused and motion-blurred monocular videos. Due to the significant difference between the formation processes of defocus blur and motion blur, existing methods are tailored for either one of them, lacking the ability to simultaneously deal with both of them. Although the two can be jointly modeled as blur kernel-based convolution, the inherent difficulty in estimating accurate blur kernels greatly limits the progress in this direction. In this work, we go a step further towards this direction. Particularly, we propose to estimate per-pixel reliable blur kernels using a blur prediction network that exploits blur-related scene and camera information and is subject to a blur-aware sparsity constraint. Besides, we introduce a dynamic Gaussian densification strategy to mitigate the lack of Gaussians for incomplete regions, and boost the performance of novel view synthesis by incorporating unseen view information to constrain scene optimization. Extensive experiments show that our method outperforms the state-of-the-art methods in generating photorealistic novel view synthesis from defocused and motion-blurred monocular videos. Our code is available at https://github.com/hhhddddddd/dydeblur.
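The joint blur model mentioned above, blur as per-pixel kernel convolution, can be written compactly with unfold; the sketch below applies a predicted (here random) 3x3 kernel at every pixel. The blur prediction network itself and the sparsity constraint are omitted.

```python
import torch
import torch.nn.functional as F

def apply_per_pixel_kernels(sharp, kernels):
    """Blur formation as per-pixel kernel convolution: kernels has shape
    (B, k*k, H, W) and is softmax-normalized so each pixel's kernel sums to 1."""
    b, c, h, w = sharp.shape
    k = int(kernels.shape[1] ** 0.5)
    patches = F.unfold(sharp, k, padding=k // 2)           # (B, C*k*k, H*W)
    patches = patches.view(b, c, k * k, h * w)
    weights = kernels.softmax(dim=1).view(b, 1, k * k, h * w)
    return (patches * weights).sum(dim=2).view(b, c, h, w)

sharp = torch.randn(1, 3, 32, 32)
kernels = torch.randn(1, 9, 32, 32)                        # a 3x3 kernel per pixel
print(apply_per_pixel_kernels(sharp, kernels).shape)       # torch.Size([1, 3, 32, 32])
```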
[532] Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning
Xingang Guo, Utkarsh Tyagi, Advait Gosai, Paula Vergara, Jayeon Park, Ernesto Gabriel Hernández Montoya, Chen Bo Calvin Zhang, Bin Hu, Yunzhong He, Bing Liu, Rakshith Sharma Srinivasa
Main category: cs.CV
TL;DR: VisualToolBench is a new benchmark that evaluates MLLMs’ ability to actively manipulate and reason with images, moving beyond passive visual perception to a ’think-with-images’ paradigm.
Details
Motivation: Current MLLMs treat images as static inputs rather than manipulable cognitive workspaces, limiting their ability to solve complex real-world tasks that require active image transformations and tool integration.
Method: Developed VisualToolBench with 1,204 challenging vision tasks across 5 domains (603 single-turn, 601 multi-turn), featuring detailed rubrics to systematically evaluate MLLMs’ visual tool-use reasoning capabilities.
Result: Current MLLMs struggle significantly, with even the strongest model (GPT-5-think) achieving only 18.68% pass rate. Models show divergent behaviors - OpenAI models benefit from image manipulations while Gemini-2.5-pro shows no improvement.
Conclusion: VisualToolBench reveals critical gaps in MLLMs’ visual intelligence and provides essential insights for advancing the ’think-with-images’ paradigm in multimodal AI systems.
Abstract: Multimodal Large Language Models (MLLMs) are increasingly applied in real-world scenarios where user-provided images are often imperfect, requiring active image manipulations such as cropping, editing, or enhancement to uncover salient visual cues. Beyond static visual perception, MLLMs must also think with images: dynamically transforming visual content and integrating it with other tools to solve complex tasks. However, this shift from treating vision as passive context to a manipulable cognitive workspace remains underexplored. Most existing benchmarks still follow a think-about-images paradigm, where images are regarded as static inputs. To address this gap, we introduce VisualToolBench, a visual tool-use reasoning benchmark that rigorously evaluates MLLMs’ ability to perceive, transform, and reason across complex visual-textual tasks under the think-with-images paradigm. VisualToolBench comprises 1,204 challenging, open-ended vision tasks (603 single-turn, 601 multi-turn) spanning five diverse domains, each paired with detailed rubrics to enable systematic evaluation. Our evaluation shows that current MLLMs struggle with tasks requiring effective integration of vision and general-purpose tools. Even the strongest model (GPT-5-think) reaches only an 18.68% pass rate. We further observe divergent tool-use behaviors, with OpenAI models benefiting from diverse image manipulations while Gemini-2.5-pro shows no improvement. By introducing the first benchmark centered on the think-with-images paradigm, VisualToolBench offers critical insights for advancing visual intelligence in MLLMs.
[533] ExpressNet-MoE: A Hybrid Deep Neural Network for Emotion Recognition
Deeptimaan Banerjee, Prateek Gothwal, Ashis Kumer Biswas
Main category: cs.CV
TL;DR: ExpressNet-MoE is a hybrid deep learning model combining CNNs and Mixture of Experts framework for facial emotion recognition, achieving state-of-the-art performance across multiple datasets with improved generalization and flexibility.
Details
Motivation: Real-world facial emotion recognition faces challenges including variable head positions, occlusions, illumination shifts, and demographic diversity. Current models struggle with engagement detection in applications like virtual learning and customer services.
Method: The model uses multi-scale feature extraction to capture both global and local facial features, with CNN-based feature extractors, a MoE module for adaptive feature selection, and a residual network backbone for deep feature learning.
Result: Achieved accuracies of 74.77% on AffectNet (v7), 72.55% on AffectNet (v8), 84.29% on RAF-DB, and 64.66% on FER-2013, outperforming current state-of-the-art methods.
Conclusion: The model demonstrates strong adaptability and practical applicability for end-to-end emotion recognition systems in real-world settings, with publicly available reproducible code.
Abstract: In many domains, including online education, healthcare, security, and human-computer interaction, facial emotion recognition (FER) is essential. Despite its significance, real-world FER remains difficult because of factors such as variable head positions, occlusions, illumination shifts, and demographic diversity. Engagement detection, which is essential for applications like virtual learning and customer service, is frequently hampered by the FER limitations of many current models. In this article, we propose ExpressNet-MoE, a novel hybrid deep learning model that blends Convolutional Neural Networks (CNNs) with a Mixture of Experts (MoE) framework to overcome these difficulties. Our model dynamically chooses the most pertinent expert networks, which aids generalization and provides flexibility across a wide variety of datasets. It improves emotion recognition accuracy by utilizing multi-scale feature extraction to collect both global and local facial features. ExpressNet-MoE includes numerous CNN-based feature extractors, a MoE module for adaptive feature selection, and a residual network backbone for deep feature learning. To demonstrate the efficacy of our proposed model, we evaluated it on several datasets and compared it with current state-of-the-art methods. Our model achieves accuracies of 74.77% on AffectNet (v7), 72.55% on AffectNet (v8), 84.29% on RAF-DB, and 64.66% on FER-2013. The results show how adaptable our model is and how it may be used to develop end-to-end emotion recognition systems in practical settings. Reproducible code and results are publicly accessible at https://github.com/DeeptimaanB/ExpressNet-MoE.
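The MoE module's adaptive feature selection reduces to a gate-weighted sum over expert outputs; a minimal soft-gating sketch follows. The expert count and dimensions are illustrative, and the CNN extractors are replaced by linear experts for brevity.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Minimal mixture-of-experts head: a gate scores each expert per sample
    and the output is the gate-weighted sum of expert features."""
    def __init__(self, in_dim, out_dim, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
             for _ in range(n_experts)])
        self.gate = nn.Linear(in_dim, n_experts)

    def forward(self, x):
        weights = self.gate(x).softmax(dim=-1)                  # (B, E) gating scores
        stacked = torch.stack([e(x) for e in self.experts], 1)  # (B, E, D)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)     # (B, D)

moe = SoftMoE(in_dim=1280, out_dim=256)
print(moe(torch.randn(8, 1280)).shape)  # torch.Size([8, 256])
```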
[534] Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images
Emanuel Garbin, Guy Adam, Oded Krams, Zohar Barzelay, Eran Guendelman, Michael Schwarz, Matteo Presutto, Moran Vatelmacher, Yigal Shenkman, Eli Peker, Itai Druker, Uri Patish, Yoav Blum, Max Bluvstein, Junxuan Li, Rawal Khirodkar, Shunsuke Saito
Main category: cs.CV
TL;DR: A zero-shot pipeline for creating hyperrealistic 3D avatars from unstructured phone images using generative canonicalization and transformer-based models trained on Gaussian splatting avatars.
Details
Motivation: Existing methods have limitations: single-view approaches suffer from geometric inconsistencies and hallucinations, while synthetic data-trained models fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism.
Method: Two key contributions: (1) generative canonicalization module that processes multiple unstructured views into standardized representation, and (2) transformer-based model trained on large-scale dataset of high-fidelity Gaussian splatting avatars from dome captures of real people.
Result: The “Capture, Canonicalize, Splat” pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.
Conclusion: The method successfully addresses limitations of existing approaches by combining generative canonicalization with transformer models trained on real capture data, achieving hyperrealistic and identity-preserving 3D avatars from unstructured phone images.
Abstract: We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This “Capture, Canonicalize, Splat” pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.
[535] DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation
Dongnam Byun, Jungwon Park, Jumgmin Ko, Changin Choi, Wonjong Rhee
Main category: cs.CV
TL;DR: DOS (Directional Object Separation) improves multi-object image generation by modifying CLIP text embeddings to address object neglect and mixing issues in text-to-image models.
Details
Motivation: Current text-to-image models struggle with prompts containing multiple objects, often resulting in object neglect or object mixing across four problematic scenarios: Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects.
Method: DOS modifies three types of CLIP text embeddings before passing them to text-to-image models, based on key observations about CLIP embeddings to better separate object representations.
Result: DOS consistently improves success rate of multi-object image generation and reduces object mixing. In human evaluations, it significantly outperforms four competing methods with 26.24%-43.04% more votes across four benchmarks.
Conclusion: DOS is a practical and effective solution for improving multi-object image generation by addressing fundamental issues in object separation and representation.
Abstract: Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.
[536] ESCA: Contextualizing Embodied Agents via Scene-Graph Generation
Jiani Huang, Amish Sethi, Matthew Kuo, Mayank Keoliya, Neelay Velingker, JungHo Jung, Ser-Nam Lim, Ziyang Li, Mayur Naik
Main category: cs.CV
TL;DR: ESCA is a framework that improves embodied agents by grounding perception in spatial-temporal scene graphs using SGCLIP, a novel promptable foundation model trained on 87K+ videos without human annotations.
Details
Motivation: Existing MLLMs have weak grounding and inaccurate perception due to unreliable capture of fine-grained links between visual features and textual semantics.
Method: Proposed ESCA framework with SGCLIP model trained using neurosymbolic pipeline on 87K+ open-domain videos, aligning auto-generated captions with scene graphs without human labels.
Result: SGCLIP achieves SOTA on scene graph generation and action localization benchmarks. ESCA with SGCLIP improves perception for embodied agents, reduces errors, and enables open-source models to surpass proprietary baselines.
Conclusion: ESCA significantly enhances embodied agent perception through spatial-temporal scene graph grounding, with SGCLIP providing effective open-domain scene graph generation without human annotation requirements.
Abstract: Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, existing MLLMs do not reliably capture fine-grained links between low-level visual features and high-level textual semantics, leading to weak grounding and inaccurate perception. To overcome this challenge, we propose ESCA, a framework that contextualizes embodied agents by grounding their perception in spatial-temporal scene graphs. At its core is SGCLIP, a novel, open-domain, promptable foundation model for generating scene graphs that is based on CLIP. SGCLIP is trained on 87K+ open-domain videos using a neurosymbolic pipeline that aligns automatically generated captions with scene graphs produced by the model itself, eliminating the need for human-labeled annotations. We demonstrate that SGCLIP excels in both prompt-based inference and task-specific fine-tuning, achieving state-of-the-art results on scene graph generation and action localization benchmarks. ESCA with SGCLIP improves perception for embodied agents based on both open-source and commercial MLLMs, achieving state-of-the-art performance across two embodied environments. Notably, ESCA significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines. We release the source code for SGCLIP model training at https://github.com/video-fm/LASER and for the embodied agent at https://github.com/video-fm/ESCA.
[537] UKANFormer: Noise-Robust Semantic Segmentation for Coral Reef Mapping via a Kolmogorov-Arnold Network-Transformer Hybrid
Tianyang Dou, Ming Li, Jiangying Qin, Xuan Liao, Jiageng Zhong, Armin Gruen, Mengyi Deng
Main category: cs.CV
TL;DR: UKANFormer is a semantic segmentation model that achieves high-precision coral reef mapping using noisy supervision from Allen Coral Atlas, outperforming conventional methods and producing more accurate predictions than the noisy training labels.
Details
Motivation: Coral reefs need accurate large-scale mapping for conservation, but existing global products like Allen Coral Atlas have limitations in spatial precision and semantic consistency, especially for fine-grained boundary delineation.
Method: UKANFormer builds on UKAN architecture and incorporates a Global-Local Transformer (GL-Trans) block in the decoder to extract both global semantic structures and local boundary details from noisy supervision.
Result: Achieved coral-class IoU of 67.00% and pixel accuracy of 83.98%, outperforming conventional baselines under the same noisy labels setting. The model produces predictions that are visually and structurally more accurate than the noisy training labels.
Conclusion: Architectural design can mitigate label noise and support scalable mapping under imperfect supervision, challenging the notion that data quality directly limits model performance. UKANFormer provides a foundation for ecological monitoring where reliable labels are scarce.
Abstract: Coral reefs are vital yet fragile ecosystems that require accurate large-scale mapping for effective conservation. Although global products such as the Allen Coral Atlas provide unprecedented coverage of global coral reef distribution, their predictions are frequently limited in spatial precision and semantic consistency, especially in regions requiring fine-grained boundary delineation. To address these challenges, we propose UKANFormer, a novel semantic segmentation model designed to achieve high-precision mapping under noisy supervision derived from the Allen Coral Atlas. Building upon the UKAN architecture, UKANFormer incorporates a Global-Local Transformer (GL-Trans) block in the decoder, enabling the extraction of both global semantic structures and local boundary details. In experiments, UKANFormer achieved a coral-class IoU of 67.00% and pixel accuracy of 83.98%, outperforming conventional baselines under the same noisy-label setting. Remarkably, the model produces predictions that are visually and structurally more accurate than the noisy labels used for training. These results challenge the notion that data quality directly limits model performance, showing that architectural design can mitigate label noise and support scalable mapping under imperfect supervision. UKANFormer provides a foundation for ecological monitoring where reliable labels are scarce.
[538] Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling
Erik Riise, Mehmet Onurcan Kaya, Dim P. Papadopoulos
Main category: cs.CV
TL;DR: Beam search significantly improves text-to-image generation in autoregressive models, enabling a 2B parameter model to outperform a 12B parameter diffusion model by leveraging discrete token spaces for early pruning and computational reuse.
Details
Motivation: While inference-time scaling through search has transformed Large Language Models, similar gains in image generation have been difficult to achieve. Recent attempts with continuous diffusion models showed limited benefits, with random sampling often performing best.
Method: Applied beam search to discrete, sequential visual autoregressive models, leveraging the discrete token space for early pruning and computational reuse during image generation.
Result: A 2B parameter autoregressive model with beam search outperformed a 12B parameter diffusion model across benchmarks. Systematic ablations confirmed the advantage comes from discrete token spaces enabling effective search strategies.
Conclusion: Model architecture, not just scale, is critical for inference-time optimization in visual generation. Discrete autoregressive models enable effective search strategies that continuous diffusion models cannot match.
Abstract: While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best. We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. Systematic ablations show that this advantage comes from the discrete token space, which allows early pruning and computational reuse, and our verifier analysis highlights trade-offs between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.
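Because the token space is discrete, beam search applies directly to visual autoregressive decoding; a generic sketch with a toy next-token model, showing the pruning step the abstract credits for the efficiency gains. For a visual AR model the tokens would index image patches rather than words, and the verifier scoring is omitted.

```python
import torch

def beam_search(step_logits_fn, bos, length, beam_width=4):
    """Generic beam search over a discrete token sequence. step_logits_fn maps
    a prefix tensor to next-token logits."""
    beams = [(torch.tensor([bos]), 0.0)]
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            logp = step_logits_fn(seq).log_softmax(-1)
            top = logp.topk(beam_width)
            for lp, tok in zip(top.values, top.indices):
                candidates.append((torch.cat([seq, tok.view(1)]), score + lp.item()))
        beams = sorted(candidates, key=lambda c: -c[1])[:beam_width]  # early pruning
    return beams[0]

# Toy model: fixed random logits standing in for the autoregressive decoder.
table = torch.randn(100)
best_seq, best_score = beam_search(lambda seq: table, bos=0, length=8)
print(best_seq.tolist(), round(best_score, 3))
```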
[539] GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection
Xin Gao, Jiyao Liu, Guanghao Li, Yueming Lyu, Jianxiong Gao, Weichen Yu, Ningsheng Xu, Liang Wang, Caifeng Shan, Ziwei Liu, Chenyang Si
Main category: cs.CV
TL;DR: GOOD is a framework that uses dual-level guidance (image-level and feature-level) with diffusion models to generate diverse out-of-distribution samples, improving OOD detection performance.
Details
Motivation: Existing methods for generating OOD samples using text-to-image diffusion models suffer from semantic instability and insufficient shift diversity, limiting generalization to realistic OOD scenarios.
Method: GOOD guides diffusion sampling trajectories using ID classifiers with dual-level guidance: image-level guidance reduces input likelihood to drive samples to low-density regions, and feature-level guidance uses k-NN distance in latent space to promote sampling in feature-sparse regions.
Result: Training with GOOD-generated samples notably enhances OOD detection performance, as demonstrated through thorough quantitative and qualitative analyses.
Conclusion: GOOD enables more controllable and diverse OOD sample generation through its dual-guidance design and unified OOD scoring, significantly improving OOD detection robustness.
Abstract: Recent advancements have explored text-to-image diffusion models for synthesizing out-of-distribution (OOD) samples, substantially enhancing the performance of OOD detection. However, existing approaches typically rely on perturbing text-conditioned embeddings, resulting in semantic instability and insufficient shift diversity, which limit generalization to realistic OOD. To address these challenges, we propose GOOD, a novel and flexible framework that directly guides diffusion sampling trajectories towards OOD regions using off-the-shelf in-distribution (ID) classifiers. GOOD incorporates dual-level guidance: (1) Image-level guidance based on the gradient of log partition to reduce input likelihood, drives samples toward low-density regions in pixel space. (2) Feature-level guidance, derived from k-NN distance in the classifier’s latent space, promotes sampling in feature-sparse regions. Hence, this dual-guidance design enables more controllable and diverse OOD sample generation. Additionally, we introduce a unified OOD score that adaptively combines image and feature discrepancies, enhancing detection robustness. We perform thorough quantitative and qualitative analyses to evaluate the effectiveness of GOOD, demonstrating that training with samples generated by GOOD can notably enhance OOD detection performance.
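The image-level guidance is described as the gradient of the log partition; below is a sketch of that single update acting on raw inputs with a toy classifier. In GOOD this gradient steers the diffusion sampling trajectory, and the feature-level k-NN term is omitted here.

```python
import torch
import torch.nn as nn

def ood_guidance_step(x, classifier, eta=0.1):
    """Image-level guidance: step down the gradient of the log-partition
    (logsumexp of logits), pushing x toward low-density regions in pixel space."""
    x = x.detach().requires_grad_(True)
    log_z = classifier(x).logsumexp(dim=-1).sum()  # log partition as a density proxy
    grad, = torch.autograd.grad(log_z, x)
    return (x - eta * grad).detach()

# Toy in-distribution classifier standing in for the off-the-shelf ID model.
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.randn(4, 3, 32, 32)
for _ in range(5):
    x = ood_guidance_step(x, classifier)
print(x.shape)
```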
[540] ViBED-Net: Video Based Engagement Detection Network Using Face-Aware and Scene-Aware Spatiotemporal Cues
Prateek Gothwal, Deeptimaan Banerjee, Ashis Kumer Biswas
Main category: cs.CV
TL;DR: ViBED-Net is a dual-stream deep learning framework that detects student engagement from video by combining facial expressions and scene context using EfficientNetV2 for spatial features and LSTM/Transformers for temporal modeling, achieving 73.43% accuracy on DAiSEE dataset.
Details
Motivation: Engagement detection in online learning is crucial for improving student outcomes and personalizing instruction, requiring effective video-based analysis methods.
Method: Uses dual-stream architecture with EfficientNetV2 for spatial feature extraction from facial crops and full video frames, combined with LSTM and Transformer encoders for temporal modeling, plus targeted data augmentation for underrepresented classes.
Result: ViBED-Net with LSTM achieves 73.43% accuracy on DAiSEE dataset, outperforming existing state-of-the-art approaches.
Conclusion: Combining face-aware and scene-aware spatiotemporal cues significantly improves engagement detection, offering a scalable solution for education, UX research, and content personalization.
Abstract: Engagement detection in online learning environments is vital for improving student outcomes and personalizing instruction. We present ViBED-Net (Video-Based Engagement Detection Network), a novel deep learning framework designed to assess student engagement from video data using a dual-stream architecture. ViBED-Net captures both facial expressions and full-scene context by processing facial crops and entire video frames through EfficientNetV2 for spatial feature extraction. These features are then analyzed over time using two temporal modeling strategies: Long Short-Term Memory (LSTM) networks and Transformer encoders. Our model is evaluated on the DAiSEE dataset, a large-scale benchmark for affective state recognition in e-learning. To enhance performance on underrepresented engagement classes, we apply targeted data augmentation techniques. Among the tested variants, ViBED-Net with LSTM achieves 73.43% accuracy, outperforming existing state-of-the-art approaches. ViBED-Net demonstrates that combining face-aware and scene-aware spatiotemporal cues significantly improves engagement detection accuracy. Its modular design allows flexibility for application across education, user experience research, and content personalization. This work advances video-based affective computing by offering a scalable, high-performing solution for real-world engagement analysis. The source code for this project is available on https://github.com/prateek-gothwal/ViBED-Net .
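The dual-stream design is compact enough to sketch. In the snippet below, the late-fusion point, hidden size, and classification from the last time step are assumptions based on the abstract; `efficientnet_v2_s` stands in for the paper's EfficientNetV2 backbone.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_v2_s

class ViBEDNetLSTM(nn.Module):
    """Dual-stream engagement model (sketch of the design; details assumed)."""

    def __init__(self, hidden=256, num_classes=4):
        super().__init__()
        self.face_cnn = efficientnet_v2_s(weights="DEFAULT").features   # face crops
        self.scene_cnn = efficientnet_v2_s(weights="DEFAULT").features  # full frames
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.lstm = nn.LSTM(input_size=1280 * 2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)  # DAiSEE has 4 engagement levels

    def forward(self, faces, frames):  # both: (B, T, 3, H, W)
        B, T = faces.shape[:2]
        f = self.pool(self.face_cnn(faces.flatten(0, 1))).flatten(1)
        s = self.pool(self.scene_cnn(frames.flatten(0, 1))).flatten(1)
        seq = torch.cat([f, s], dim=-1).view(B, T, -1)  # per-frame fused features
        out, _ = self.lstm(seq)
        return self.head(out[:, -1])  # classify from the last time step
```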
[541] A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP
Ying Dai, Wei Yu Chen
Main category: cs.CV
TL;DR: Training-free open-vocabulary image segmentation and recognition using EfficientNetB0 for unsupervised segmentation and CLIP for recognition via vision-language alignment.
Details
Motivation: To develop a flexible framework for open-vocabulary segmentation and recognition without requiring training, leveraging pre-trained models for generalizability.
Method: Two-stage pipeline: 1) Unsupervised segmentation using EfficientNetB0 features decomposed via SVD and clustered hierarchically, 2) Segment recognition using CLIP’s vision-language alignment with SVD projection for cross-modal alignment.
Result: Achieved state-of-the-art performance on COCO, ADE20K, and PASCAL VOC benchmarks in Hungarian mIoU, precision, recall, and F1-score.
Conclusion: The framework is effective, flexible, and generalizable for open-vocabulary segmentation and recognition tasks without training.
Abstract: This paper presents a novel training-free framework for open-vocabulary image segmentation and object recognition (OVSR), which leverages EfficientNetB0, a convolutional neural network, for unsupervised segmentation and CLIP, a vision-language model, for open-vocabulary object recognition. The proposed framework adopts a two-stage pipeline: unsupervised image segmentation followed by segment-level recognition via vision-language alignment. In the first stage, pixel-wise features extracted from EfficientNetB0 are decomposed using singular value decomposition to obtain latent representations, which are then clustered using hierarchical clustering to segment semantically meaningful regions. The number of clusters is adaptively determined by the distribution of singular values. In the second stage, the segmented regions are localized and encoded into image embeddings using the Vision Transformer backbone of CLIP. Text embeddings are precomputed using CLIP’s text encoder from category-specific prompts, including a generic “something else” prompt to support open-set recognition. The image and text embeddings are concatenated and projected into a shared latent feature space via SVD to enhance cross-modal alignment. Recognition is performed by computing the softmax over the similarities between the projected image and text embeddings. The proposed method is evaluated on standard benchmarks, including COCO, ADE20K, and PASCAL VOC, achieving state-of-the-art performance in terms of Hungarian mIoU, precision, recall, and F1-score. These results demonstrate the effectiveness, flexibility, and generalizability of the proposed framework.
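A minimal sketch of the two-stage pipeline. The singular-value threshold used to pick the cluster count and the `encode_region` helper are hypothetical: the paper determines the count adaptively from the spectrum but the abstract does not give the rule.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def segment_and_recognize(pix_feats, encode_region, text_emb):
    """Training-free segmentation + recognition, paraphrasing the two stages.

    pix_feats: (N, C) pixel-wise EfficientNetB0 features for a (downsampled)
    image; text_emb: (K, D) precomputed CLIP text embeddings, including a
    generic "something else" prompt. `encode_region` is a stand-in for
    cropping a segment and encoding it with CLIP's ViT image encoder.
    """
    # Stage 1: SVD for latent pixel representations; the cluster count is
    # chosen from the singular-value spectrum (this threshold is assumed).
    U, S, _ = np.linalg.svd(pix_feats, full_matrices=False)
    n_clusters = max(2, int((S / S.sum() > 0.01).sum()))
    Z = U[:, :n_clusters] * S[:n_clusters]
    labels = fcluster(linkage(Z, method="ward"), t=n_clusters, criterion="maxclust")

    # Stage 2: encode each segment and softmax over text similarities.
    preds = {}
    for seg in np.unique(labels):
        img_emb = encode_region(labels == seg)         # (D,) CLIP image embedding
        sims = img_emb @ text_emb.T
        p = np.exp(sims - sims.max())
        preds[seg] = int((p / p.sum()).argmax())
    return labels, preds
```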
[542] Beyond sparse denoising in frames: minimax estimation with a scattering transform
Nathanaël Cuvelle–Magar, Stéphane Mallat
Main category: cs.CV
TL;DR: The paper introduces a denoising estimator using scattering coefficients that reaches minimax asymptotic bounds for cartoon images with unknown Lipschitz exponents α ≤ 2, bridging harmonic analysis with deep learning approaches.
Details
Motivation: Traditional sparse estimators in frames (wavelets, curvelets) are suboptimal for cartoon images with unknown Lipschitz regularity. Deep neural networks perform better but lack theoretical understanding. This work aims to provide a harmonic analysis approach that matches neural network performance.
Method: Uses wavelet scattering coefficients (simplified CNN models) computed by transforming the modulus of wavelet coefficients with a second wavelet transform. Introduces denoising by jointly minimizing and maximizing ℓ¹ norms of different subsets of scattering coefficients.
Result: Numerical experiments show the estimator reaches minimax asymptotic bounds for cartoon images for all Lipschitz exponents α ≤ 2. The ℓ¹ norms capture different types of geometric image regularity.
Conclusion: Provides a mathematical bridge between harmonic analysis and deep convolutional network denoising, offering a theoretically grounded alternative to neural networks that achieves comparable performance for cartoon image denoising.
Abstract: A considerable amount of research in harmonic analysis has been devoted to non-linear estimators of signals contaminated by additive Gaussian noise. They are implemented by thresholding coefficients in a frame that provides a sparse signal representation, or by minimising their $\ell^1$ norm. However, sparse estimators in frames are not sufficiently rich to adapt to complex signal regularities. For cartoon images whose edges are piecewise $\bf C^\alpha$ curves, wavelet, curvelet and Xlet frames are suboptimal if the Lipschitz exponent $\alpha \leq 2$ is an unknown parameter. Deep convolutional neural networks have recently obtained much better numerical results, which reach the minimax asymptotic bounds for all $\alpha$. Wavelet scattering coefficients have been introduced as simplified convolutional neural network models. They are computed by transforming the modulus of wavelet coefficients with a second wavelet transform. We introduce a denoising estimator by jointly minimising and maximising the $\ell^1$ norms of different subsets of scattering coefficients. We prove that these $\ell^1$ norms capture different types of geometric image regularity. Numerical experiments show that this denoising estimator reaches the minimax asymptotic bound for cartoon images for all Lipschitz exponents $\alpha \leq 2$. We state this numerical result as a mathematical conjecture. It provides a different harmonic analysis approach to suppress noise from signals, and to specify the geometric regularity of functions. It also opens a mathematical bridge between harmonic analysis and denoising estimators with deep convolutional networks.
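The abstract specifies the estimator only as a joint minimisation and maximisation of $\ell^1$ norms of scattering-coefficient subsets. One schematic way to write such an objective, where the data-fidelity term and the subset split $\mathcal{P}_+, \mathcal{P}_-$ are our assumptions, is

$$\hat{x} \;=\; \arg\min_{x}\; \|y - x\|_2^2 \;+\; \lambda_1 \sum_{p \in \mathcal{P}_+} \|Sx(p)\|_1 \;-\; \lambda_2 \sum_{p \in \mathcal{P}_-} \|Sx(p)\|_1,$$

where $Sx(p)$ denotes the scattering coefficients along path $p$, i.e. a second wavelet transform applied to the modulus of wavelet coefficients of $x$.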
[543] Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge
Nimrod Berman, Omkar Joglekar, Eitan Kosman, Dotan Di Castro, Omri Azencot
Main category: cs.CV
TL;DR: LDDBM is a general-purpose framework for modality translation using latent-variable diffusion models that bridges arbitrary modalities without requiring aligned dimensions.
Details
Motivation: Extending diffusion models to modality translation remains challenging due to restrictive assumptions in existing approaches about shared dimensionality, Gaussian priors, and modality-specific architectures.
Method: Uses latent-variable extension of Denoising Diffusion Bridge Models with shared latent space, contrastive alignment loss for semantic consistency, domain-agnostic encoder-decoder architecture, and predictive loss for cross-domain translation.
Result: Performs strongly on diverse MT tasks including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis, establishing a new strong baseline.
Conclusion: LDDBM provides an effective general framework for modality translation that supports arbitrary modality pairs and overcomes limitations of existing approaches.
Abstract: Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.
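A sketch of how the three stated losses could compose in training. The module interfaces (`sample_bridge`, `predict_endpoint`) and the loss weights are assumptions, since the abstract names the ingredients but not this exact combination.

```python
import torch
import torch.nn.functional as F

def lddbm_loss(enc_a, enc_b, bridge, x_a, x_b, t, tau=0.07, w_c=1.0, w_p=1.0):
    """Combined training objective for a latent diffusion bridge (sketch).

    Encoders map both modalities into one shared latent space; `bridge`
    predicts the noise of the latent bridge process at time t.
    """
    z_a, z_b = enc_a(x_a), enc_b(x_b)
    z_t, eps = bridge.sample_bridge(z_a, z_b, t)          # noisy bridge sample
    denoise = F.mse_loss(bridge(z_t, t, cond=z_a), eps)   # noise prediction

    # Contrastive alignment: paired latents are positives within the batch.
    za = F.normalize(z_a.flatten(1), dim=-1)
    zb = F.normalize(z_b.flatten(1), dim=-1)
    target = torch.arange(len(za), device=za.device)
    contrast = F.cross_entropy(za @ zb.T / tau, target)

    # Predictive loss: steer the bridge toward the target-domain latent.
    pred = F.mse_loss(bridge.predict_endpoint(z_t, t, cond=z_a), z_b)
    return denoise + w_c * contrast + w_p * pred
```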
[544] LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas
Guocheng Gordon Qian, Ruihang Zhang, Tsai-Shien Chen, Yusuf Dalva, Anujraaj Argo Goyal, Willi Menapace, Ivan Skorokhodov, Meng Dong, Arpit Sahni, Daniil Ostashev, Ju Hu, Sergey Tulyakov, Kuan-Chieh Jackson Wang
Main category: cs.CV
TL;DR: LayerComposer is an interactive framework for multi-subject personalized text-to-image generation that uses a layered canvas representation and locking mechanism to enable precise spatial control and identity preservation.
Details
Motivation: Existing personalized generative models lack interactive control over spatial composition and don't scale well to multiple subjects, limiting their practical usability.
Method: Introduces a layered canvas where each subject is on a distinct layer for occlusion-free composition, and a locking mechanism that preserves selected layers while allowing others to adapt to context. Uses positional embeddings and complementary data sampling without architectural changes.
Result: Achieves superior spatial control and identity preservation compared to state-of-the-art methods in multi-subject personalized image generation.
Conclusion: LayerComposer provides an interactive, intuitive framework for multi-subject personalized generation with professional image-editing-like controls, addressing key limitations of existing approaches.
Abstract: Despite their impressive visual fidelity, existing personalized generative models lack interactive control over spatial composition and scale poorly to multiple subjects. To address these limitations, we present LayerComposer, an interactive framework for personalized, multi-subject text-to-image generation. Our approach introduces two main contributions: (1) a layered canvas, a novel representation in which each subject is placed on a distinct layer, enabling occlusion-free composition; and (2) a locking mechanism that preserves selected layers with high fidelity while allowing the remaining layers to adapt flexibly to the surrounding context. Similar to professional image-editing software, the proposed layered canvas allows users to place, resize, or lock input subjects through intuitive layer manipulation. Our versatile locking mechanism requires no architectural changes, relying instead on inherent positional embeddings combined with a new complementary data sampling strategy. Extensive experiments demonstrate that LayerComposer achieves superior spatial control and identity preservation compared to the state-of-the-art methods in multi-subject personalized image generation.
[545] TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection
Qihang Zhou, Binbin Gao, Guansong Pang, Xin Wang, Jiming Chen, Shibo He
Main category: cs.CV
TL;DR: TokenCLIP is a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly detection, outperforming existing methods that use single textual spaces.
Details
Motivation: Existing CLIP-based anomaly detection methods use a single textual space to align with visual semantics, which hinders accurate capture of varied anomaly semantics across diverse objects and domains.
Method: TokenCLIP expands token-agnostic textual space into orthogonal subspaces, dynamically assigns each visual token to subspace combinations via semantic affinity using optimal transport, and applies top-k masking to specialize subspaces for distinct regions.
Result: Extensive experiments demonstrate the superiority of TokenCLIP over existing methods for anomaly detection on unseen objects.
Conclusion: TokenCLIP’s dynamic token-wise alignment framework effectively captures fine-grained anomaly semantics and achieves state-of-the-art performance in zero-shot anomaly detection.
Abstract: Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
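Sinkhorn iteration with uniform marginals is one standard solver for the entropic optimal-transport problem the abstract describes; the sketch below, including the cosine cost and the renormalisation after masking, is our reading rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def assign_tokens_to_subspaces(tok_emb, sub_emb, topk=3, eps=0.05, n_iters=50):
    """Entropic OT assignment of visual tokens to textual subspaces (sketch).

    tok_emb: (N, D) visual tokens; sub_emb: (M, D) subspace anchors.
    Returns per-token subspace weights used to mix the subspaces.
    """
    cost = 1 - F.normalize(tok_emb, dim=-1) @ F.normalize(sub_emb, dim=-1).T
    K = torch.exp(-cost / eps)                               # Gibbs kernel (N, M)
    r = torch.full((K.size(0),), 1 / K.size(0), device=K.device)  # token marginal
    c = torch.full((K.size(1),), 1 / K.size(1), device=K.device)  # keeps all subspaces used
    u, v = torch.ones_like(r), torch.ones_like(c)
    for _ in range(n_iters):                                 # Sinkhorn iterations
        u = r / (K @ v)
        v = c / (K.T @ u)
    plan = u[:, None] * K * v[None, :]                       # transport plan (N, M)
    mask = torch.zeros_like(plan)
    mask.scatter_(1, plan.topk(topk, dim=1).indices, 1.0)
    plan = plan * mask                                       # top-k masking sparsifies rows
    return plan / plan.sum(dim=1, keepdim=True)
```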
[546] Topology Sculptor, Shape Refiner: Discrete Diffusion Model for High-Fidelity 3D Meshes Generation
Kaiyu Song, Hanjiang Lai, Yaqing Zhang, Chuangjian Cai, Yan Pan, Kun Yue, Jian Yin
Main category: cs.CV
TL;DR: TSSR is a novel method using Discrete Diffusion Models to generate high-quality 3D artist-style meshes with parallel generation, achieving up to 10,000 faces at 1024^3 resolution through decoupled training, improved architecture, and connection loss.
Details
Motivation: To achieve highly accurate token prediction while enabling parallel generation, overcoming limitations of sequential autoregressive methods by allowing concurrent processing of all mesh tokens for better efficiency and control.
Method: Uses Discrete Diffusion Models with three innovations: 1) Decoupled Training and Hybrid Inference separating topology sculpting and shape refinement, 2) Improved Hourglass Architecture with bidirectional attention and Rotational Positional Embeddings, 3) Connection Loss for topological constraints.
Result: Generates high-quality 3D artist-style meshes with up to 10,000 faces at 1024^3 spatial resolution, demonstrating superior performance on complex datasets compared to existing methods.
Conclusion: TSSR successfully enables parallel generation of high-fidelity 3D meshes through its innovative architecture and training strategy, representing a significant advancement in 3D mesh generation technology.
Abstract: In this paper, we introduce Topology Sculptor, Shape Refiner (TSSR), a novel method for generating high-quality, artist-style 3D meshes based on Discrete Diffusion Models (DDMs). Our primary motivation for TSSR is to achieve highly accurate token prediction while enabling parallel generation, a significant advantage over sequential autoregressive methods. By allowing TSSR to “see” all mesh tokens concurrently, we unlock a new level of efficiency and control. We leverage this parallel generation capability through three key innovations: 1) Decoupled Training and Hybrid Inference, which distinctly separates the DDM-based generation into a topology sculpting stage and a subsequent shape refinement stage. This strategic decoupling enables TSSR to effectively capture both intricate local topology and overarching global shape. 2) An Improved Hourglass Architecture, featuring bidirectional attention enriched by face-vertex-sequence level Rotational Positional Embeddings (RoPE), thereby capturing richer contextual information across the mesh structure. 3) A novel Connection Loss, which acts as a topological constraint to further enhance the realism and fidelity of the generated meshes. Extensive experiments on complex datasets demonstrate that TSSR generates high-quality 3D artist-style meshes, capable of achieving up to 10,000 faces at a remarkable spatial resolution of $1024^3$. The code will be released at: https://github.com/psky1111/Tencent-TSSR.
cs.AI
[547] A Multi-Component AI Framework for Computational Psychology: From Robust Predictive Modeling to Deployed Generative Dialogue
Anant Pareek
Main category: cs.AI
TL;DR: A comprehensive framework bridging predictive modeling and interactive systems for computational psychology, using classical ML, fine-tuned transformers, and generative LLMs deployed as microservices.
Details
Motivation: To bridge the gap between isolated predictive modeling and interactive systems for psychological analysis by integrating AI and computational psychology.
Method: End-to-end development: 1) Benchmark classical ML on psychological datasets, 2) Fine-tune transformers with solutions for numerical instability and resource constraints, 3) Fine-tune LLM as interactive “Personality Brain”, 4) Deploy as microservices ecosystem.
Result: Successfully stabilized transformer-based regression for affective computing, achieved meaningful predictive performance where standard approaches failed, and developed replicable methodology for democratizing large-scale AI research.
Conclusion: Demonstrates a complete research-to-deployment pipeline integrating predictive analysis with generative dialogue, providing a practical model for future computational psychology and human-AI interaction research.
Abstract: The confluence of Artificial Intelligence and Computational Psychology presents an opportunity to model, understand, and interact with complex human psychological states through computational means. This paper presents a comprehensive, multi-faceted framework designed to bridge the gap between isolated predictive modeling and an interactive system for psychological analysis. The methodology encompasses a rigorous, end-to-end development lifecycle. First, foundational performance benchmarks were established on four diverse psychological datasets using classical machine learning techniques. Second, state-of-the-art transformer models were fine-tuned, a process that necessitated the development of effective solutions to overcome critical engineering challenges, including the resolution of numerical instability in regression tasks and the creation of a systematic workflow for conducting large-scale training under severe resource constraints. Third, a generative large language model (LLM) was fine-tuned using parameter-efficient techniques to function as an interactive “Personality Brain.” Finally, the entire suite of predictive and generative models was architected and deployed as a robust, scalable microservices ecosystem. Key findings include the successful stabilization of transformer-based regression models for affective computing, showing meaningful predictive performance where standard approaches failed, and the development of a replicable methodology for democratizing large-scale AI research. The significance of this work lies in its holistic approach, demonstrating a complete research-to-deployment pipeline that integrates predictive analysis with generative dialogue, thereby providing a practical model for future research in computational psychology and human-AI interaction.
[548] PREFINE: Personalized Story Generation via Simulated User Critics and User-Specific Rubric Generation
Kentaro Ueda, Takehiro Takayanagi
Main category: cs.AI
TL;DR: PREFINE is a novel framework that enables personalized story generation using pseudo-user agents and user-specific rubrics without requiring parameter updates or direct user feedback.
Details
Motivation: Current LLMs struggle with personalized text generation, and conventional approaches using explicit feedback or fine-tuning face practical issues like user burden, data collection costs, and privacy concerns.
Method: PREFINE constructs a pseudo-user agent from user interaction history and generates user-specific rubrics. This agent then critiques and refines outputs on the user’s behalf based on these tailored rubrics.
Result: On PerDOC and PerMPST datasets, PREFINE achieved higher win rates and statistically significant scores than baselines in automatic evaluations, without compromising general story quality. Both pseudo-user agent and user-specific rubrics were crucial for performance.
Conclusion: The approach enables efficient personalization without parameter updates or direct feedback, with potential applications in dialogue systems, education, and recommendation systems beyond story generation.
Abstract: While recent advances in Large Language Models (LLMs) have improved the quality of creative text generation, significant challenges remain in producing personalized stories that reflect individual user preferences. Conventional approaches rely on explicit feedback or fine-tuning, which presents practical issues regarding user burden, data collection, computational costs, and privacy. In this work, we propose PREFINE (Persona-and-Rubric Guided Critique-and-Refine), a novel framework that extends the Critique-and-Refine paradigm to personalization. PREFINE constructs a pseudo-user agent from a user’s interaction history and generates user-specific rubrics (evaluation criteria). By having this agent critique and refine outputs on the user’s behalf based on these tailored rubrics, our method achieves personalized generation without requiring parameter updates or direct user feedback. We conducted a comprehensive evaluation on the PerDOC and PerMPST story datasets. We designed three baseline methods and several model variants to verify the contribution of each component of our framework. In automatic evaluations (LLM-as-a-Judge), PREFINE achieved higher win rates and statistically significant scores than the baselines, without compromising general story quality. Analysis of the model variants confirmed that both the pseudo-user agent and the user-specific rubrics are crucial for enhancing personalization performance. Beyond story generation, our approach holds potential for enabling efficient personalization in broader applications, such as dialogue systems, education, and recommendation.
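A minimal sketch of the critique-and-refine loop with a pseudo-user agent. Here `llm(...)` stands in for any chat-completion call, and the prompt wording and stopping signal are invented for illustration, not the paper's templates.

```python
def prefine_generate(llm, history, prompt, max_rounds=3):
    """Personalized critique-and-refine (sketch of the PREFINE loop)."""
    # Build the pseudo-user and its user-specific rubrics from history.
    persona = llm(f"Infer this user's story preferences from their history:\n{history}")
    rubrics = llm(f"As the user described below, write rubrics for judging "
                  f"stories:\n{persona}")
    draft = llm(f"Write a story for: {prompt}")
    for _ in range(max_rounds):
        critique = llm(
            f"You are the user described by:\n{persona}\n"
            f"Critique this story against the rubrics:\n{rubrics}\n\n{draft}"
        )
        if "no further changes" in critique.lower():  # assumed stopping signal
            break
        draft = llm(f"Revise the story to address this critique:\n"
                    f"{critique}\n\n{draft}")
    return draft
```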
[549] SIGN: Schema-Induced Games for Naming
Ryan Zhang, Herbert Woisetschläger
Main category: cs.AI
TL;DR: SIGN (Schema-Induced Games for Naming) introduces structured communication to improve multi-agent coordination, achieving 5.8x faster convergence than unconstrained natural language.
Details
Motivation: Real-world AI systems with multiple LLM agents face coordination breakdowns when agents develop inconsistent communication conventions, especially in collaborative applications like coding and planning that require reliable communication at scale.
Method: Introduces SIGN, a naming game that uses lightweight schema-induced structure to steer convention formation among agents, comparing it against unconstrained natural language communication.
Result: Schema-induced communication achieves significantly faster convergence with up to 5.8x higher agreement rates compared to unconstrained natural language.
Conclusion: Minimal structure serves as an effective control mechanism for efficient multi-agent coordination, with potential applications extending beyond naming games to broader multi-agent systems.
Abstract: Real-world AI systems are tackling increasingly complex problems, often through interactions among large language model (LLM) agents. When these agents develop inconsistent conventions, coordination can break down. Applications such as collaborative coding and distributed planning therefore require reliable, consistent communication, and scalability is a central concern as systems grow. We introduce Schema-Induced Games for Naming (SIGN), a naming game that examines how lightweight structure can steer convention formation. We compare schema-induced communication to unconstrained natural language and find faster convergence with up to 5.8x higher agreement. These results suggest that minimal structure can act as a simple control knob for efficient multi-agent coordination, pointing toward broader applications beyond the naming game.
[550] Capability Ceilings in Autoregressive Language Models: Empirical Evidence from Knowledge-Intensive Tasks
Javier Marín
Main category: cs.AI
TL;DR: Decoder-only language models show negligible accuracy improvement on knowledge-intensive tasks despite scaling up to 30B parameters, while procedural tasks scale normally. Attention patterns are highly sensitive to perturbation.
Details
Motivation: To understand the scaling behavior of decoder-only autoregressive language models on different types of tasks and identify capability-specific limitations.
Method: Systematic evaluation of OPT and Pythia model families (70M-30B parameters) on knowledge retrieval and procedural tasks, plus attention intervention experiments.
Result: Knowledge retrieval tasks show minimal accuracy gains despite 240x scaling, while arithmetic tasks scale normally. Attention pattern swapping causes catastrophic performance collapse.
Conclusion: Parameter scaling beyond 1-2B offers minimal accuracy gains for knowledge-intensive applications in these architectures, revealing capability-specific scaling failures.
Abstract: We document empirical capability ceilings in decoder-only autoregressive language models across knowledge-intensive tasks. Systematic evaluation of OPT and Pythia model families (70M-30B parameters, spanning 240 times scaling) reveals that knowledge retrieval tasks show negligible accuracy improvement despite smooth loss reduction. On MMLU mathematics benchmarks, accuracy remains flat at 19-20% (below 25% random chance) across all scales while cross-entropy loss decreases by 31%. In contrast, procedural tasks like arithmetic show conventional scaling where both metrics improve together. Attention intervention experiments reveal high sensitivity to perturbation: swapping attention patterns between models causes catastrophic performance collapse (complete accuracy loss) rather than graceful degradation. These measurements have immediate engineering implications: for knowledge-intensive applications using OPT and Pythia architectures, parameter scaling beyond 1-2B offers minimal accuracy gains despite continued loss improvement. Our findings quantify capability-specific scaling failures in these model families to inform resource allocation decisions. Whether these patterns reflect fundamental constraints of decoder-only architectures or implementation-specific limitations remains an open question requiring investigation across diverse architectural approaches.
[551] GeoThought: A Dataset for Enhancing Mathematical Geometry Reasoning in Vision-Language Models
Nannan Shi, Chuanyu Qin, Shipeng Song, Man Luo
Main category: cs.AI
TL;DR: The paper introduces GeoThoughts dataset and GeoThought-MLLM model to address LLMs’ poor performance in geometric reasoning by providing comprehensive training data with explicit reasoning chains.
Details
Motivation: LLMs perform well in text-based math but struggle with geometric reasoning due to visual complexity and lack of adequate datasets with reasoning traces.
Method: Created GeoThoughts dataset with 6K-10K samples containing visual descriptions, step-by-step solutions, reasoning chains, and reflections. Developed GeoThought-MLLM model that generates detailed thinking processes.
Result: GeoThought-MLLM outperforms existing benchmarks in geometric tasks and shows improved reasoning capabilities in both in-domain and out-of-domain settings.
Conclusion: Training with Chain-of-Thought datasets enhances geometric reasoning, and errors can be corrected by invoking CoT to fix mathematical concept misinterpretations or spatial misjudgments.
Abstract: Large language models (LLMs) have demonstrated strong reasoning capabilities in text-based mathematical problem solving; however, when adapted to visual reasoning tasks, particularly geometric problem solving, their performance substantially declines because geometric problems present unique challenges. Specifically, these challenges stem from two key factors: first, the intrinsic complexity of geometry requiring detailed image comprehension and multi-step reasoning, and second, the limitations of existing datasets which lack sufficient scale, diversity, and explicit reasoning traces, consequently hindering effective model training. To address these challenges, we developed the GeoThoughts dataset, a comprehensive geometric reasoning corpus with two subsets: Geo-Thought-6K with 6,243 samples and its augmented version Geo-Thought-Augmented-10K containing 10,834 samples. Each entry includes visual descriptions, step-by-step solutions, explicit reasoning chains, reflection steps, and final answers. Using this dataset, we developed GeoThought-MLLM, a mathematical reasoning multimodal model that generates detailed thinking processes during problem-solving. Our model outperforms existing benchmarks in geometric tasks, demonstrating that training with our Chain-of-Thought dataset improves geometric reasoning capabilities across both in-domain and out-of-domain settings. Finally, we analyze failure cases and observe that errors primarily arise from incorrect interpretation of mathematical concepts or spatial misjudgment. By invoking CoT to correct these mistakes, the model produces correct answers.
[552] Exploration through Generation: Applying GFlowNets to Structured Search
Mark Phillip Matovic
Main category: cs.AI
TL;DR: GFlowNets successfully solve three graph optimization problems (TSP, MST, Shortest Path) by learning to sample optimal solutions proportionally to rewards, matching classical algorithm results while offering computational scalability.
Details
Motivation: To demonstrate that generative models can solve combinatorial optimization problems through learned policies, with the advantage of computational scalability compared to classical algorithms.
Method: Apply GFlowNets trained with Trajectory Balance loss to sequentially build solutions for graph problems: selecting edges for spanning trees, nodes for paths, and cities for tours.
Result: GFlowNets learn to find optimal solutions that match classical algorithms (Dijkstra, Kruskal, exact TSP solvers) across benchmark instances of varying sizes, with training convergence depending on problem complexity.
Conclusion: Generative models can effectively solve combinatorial optimization problems, with the main advantage being computational scalability that could potentially handle larger instances where classical methods become infeasible.
Abstract: This work applies Generative Flow Networks (GFlowNets) to three graph optimization problems: the Traveling Salesperson Problem, Minimum Spanning Tree, and Shortest Path. GFlowNets are generative models that learn to sample solutions proportionally to a reward function. The models are trained using the Trajectory Balance loss to build solutions sequentially, selecting edges for spanning trees, nodes for paths, and cities for tours. Experiments on benchmark instances of varying sizes show that GFlowNets learn to find optimal solutions. For each problem type, multiple graph configurations with different numbers of nodes were tested. The generated solutions match those from classical algorithms (Dijkstra for shortest path, Kruskal for spanning trees, and exact solvers for TSP). Training convergence depends on problem complexity, with the number of episodes required for loss stabilization increasing as graph size grows. Once training converges, the generated solutions match known optima from classical algorithms across the tested instances. This work demonstrates that generative models can solve combinatorial optimization problems through learned policies. The main advantage of this learning-based approach is computational scalability: while classical algorithms have fixed complexity per instance, GFlowNets amortize computation through training. With sufficient computational resources, the framework could potentially scale to larger problem instances where classical exact methods become infeasible.
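The Trajectory Balance objective used here has a standard closed form (Malkin et al., 2022). A minimal sketch with placeholder trajectory statistics:

```python
import torch

def trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward):
    """Trajectory Balance loss: for a sampled trajectory tau ending in x,
    (log Z + sum log P_F(tau) - log R(x) - sum log P_B(tau))^2."""
    return (log_Z + log_pf - log_reward - log_pb) ** 2

# Placeholder statistics for one trajectory, e.g. building a spanning tree
# edge by edge; for tree-structured generation where each state has a unique
# parent, the backward policy is deterministic and log_pb = 0.
log_Z = torch.zeros(1, requires_grad=True)   # learned scalar log-partition
log_pf = torch.tensor(-4.2)                  # sum of chosen-action log-probs
log_pb = torch.tensor(0.0)
log_reward = torch.tensor(-1.3)              # log R(x), e.g. -cost of the tree
loss = trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward).mean()
loss.backward()  # gradients flow to log_Z (and, in practice, to the policy net)
```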
[553] Computational Hardness of Reinforcement Learning with Partial $q^\pi$-Realizability
Shayan Karimi, Xiaoqi Tan
Main category: cs.AI
TL;DR: This paper proves computational hardness of reinforcement learning under partial qπ-realizability, showing NP-hardness for greedy policies and exponential lower bounds for softmax policies.
Details
Motivation: To investigate the computational complexity of RL in a practical linear function approximation regime that bridges the gap between strong qπ-realizability and weaker q*-realizability assumptions.
Method: Reduction from δ-Max-3SAT and δ-Max-3SAT(b) complexity problems to instances of GLinear-κ-RL (greedy policy) and SLinear-κ-RL (softmax policy) to establish hardness results.
Result: Learning an ε-optimal policy is NP-hard under greedy policy sets and requires exponential time (in feature dimension) for softmax policies under the Randomized Exponential Time Hypothesis.
Conclusion: Computational difficulty persists in partial qπ-realizability even when expanding the policy set beyond optimal policies, contrasting with positive results in qπ-realizability under generative models.
Abstract: This paper investigates the computational complexity of reinforcement learning in a novel linear function approximation regime, termed partial $q^{\pi}$-realizability. In this framework, the objective is to learn an $\epsilon$-optimal policy with respect to a predefined policy set $\Pi$, under the assumption that all value functions for policies in $\Pi$ are linearly realizable. The assumptions of this framework are weaker than those in $q^{\pi}$-realizability but stronger than those in $q^*$-realizability, providing a practical model where function approximation naturally arises. We prove that learning an $\epsilon$-optimal policy in this setting is computationally hard. Specifically, we establish NP-hardness under a parameterized greedy policy set (argmax) and show that, unless NP = RP, an exponential lower bound (in feature vector dimension) holds when the policy set contains softmax policies, under the Randomized Exponential Time Hypothesis. Our hardness results mirror those in $q^*$-realizability and suggest computational difficulty persists even when $\Pi$ is expanded beyond the optimal policy. To establish this, we reduce from two complexity problems, $\delta$-Max-3SAT and $\delta$-Max-3SAT(b), to instances of GLinear-$\kappa$-RL (greedy policy) and SLinear-$\kappa$-RL (softmax policy). Our findings indicate that positive computational results are generally unattainable in partial $q^{\pi}$-realizability, in contrast to $q^{\pi}$-realizability under a generative access model.
[554] Performance Trade-offs of Optimizing Small Language Models for E-Commerce
Josip Tomo Licardo, Nikola Tankovic
Main category: cs.AI
TL;DR: Small 1B-parameter Llama 3.2 model optimized for e-commerce intent recognition achieves 99% accuracy matching GPT-4.1, with significant computational efficiency gains through QLoRA fine-tuning and quantization.
Details
Motivation: Deploying large commercial LLMs for specialized tasks like e-commerce is hindered by high computational costs, latency, and operational expenses, necessitating resource-efficient alternatives.
Method: Fine-tuned 1B-parameter Llama 3.2 using QLoRA on a synthetic e-commerce dataset, then applied post-training quantization (GPTQ for GPU, GGUF for CPU) to optimize for different hardware.
Result: Specialized 1B model achieved 99% accuracy matching GPT-4.1. GPTQ reduced VRAM by 41% but slowed inference by 82% on T4 GPU, while GGUF on CPU achieved 18x speedup and 90% RAM reduction compared to FP16 baseline.
Conclusion: Small, properly optimized open-weight models are not just viable but more suitable for domain-specific applications, offering state-of-the-art accuracy at a fraction of the computational cost.
Abstract: Large Language Models (LLMs) offer state-of-the-art performance in natural language understanding and generation tasks. However, the deployment of leading commercial models for specialized tasks, such as e-commerce, is often hindered by high computational costs, latency, and operational expenses. This paper investigates the viability of smaller, open-weight models as a resource-efficient alternative. We present a methodology for optimizing a one-billion-parameter Llama 3.2 model for multilingual e-commerce intent recognition. The model was fine-tuned using Quantized Low-Rank Adaptation (QLoRA) on a synthetically generated dataset designed to mimic real-world user queries. Subsequently, we applied post-training quantization techniques, creating GPU-optimized (GPTQ) and CPU-optimized (GGUF) versions. Our results demonstrate that the specialized 1B model achieves 99% accuracy, matching the performance of the significantly larger GPT-4.1 model. A detailed performance analysis revealed critical, hardware-dependent trade-offs: while 4-bit GPTQ reduced VRAM usage by 41%, it paradoxically slowed inference by 82% on an older GPU architecture (NVIDIA T4) due to dequantization overhead. Conversely, GGUF formats on a CPU achieved a speedup of up to 18x in inference throughput and a reduction of over 90% in RAM consumption compared to the FP16 baseline. We conclude that small, properly optimized open-weight models are not just a viable but a more suitable alternative for domain-specific applications, offering state-of-the-art accuracy at a fraction of the computational cost.
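The recipe uses standard tooling; a minimal QLoRA setup along these lines is sketched below, where the rank, alpha, dropout, and target modules are illustrative defaults rather than the paper's reported hyperparameters.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: load the base model in 4-bit NF4, then train low-rank adapters on top.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # only the adapter weights are trainable
model.print_trainable_parameters()
```

After training, the merged model would be exported to GPTQ or GGUF for the hardware-specific deployments the paper benchmarks.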
[555] Distribution Shift Alignment Helps LLMs Simulate Survey Response Distributions
Ji Huang, Mengfei Li, Shuai Shao
Main category: cs.AI
TL;DR: DSA is a two-stage fine-tuning method that aligns distribution shifts across backgrounds, enabling LLMs to simulate survey responses more accurately than training data.
Details
Motivation: Existing zero-shot methods have low accuracy and prompt sensitivity, while conventional fine-tuning overfits training distributions and fails to produce more accurate results than the training data itself.
Method: Distribution Shift Alignment (DSA), a two-stage fine-tuning method that learns how output distributions change across different backgrounds rather than fitting training data distributions.
Result: DSA consistently outperforms other methods on five public survey datasets, reduces required real data by 53.48-69.12%, and provides results substantially closer to the true distribution than training data.
Conclusion: DSA demonstrates effectiveness and efficiency in survey simulation by learning distribution shifts rather than fitting training data, enabling more accurate LLM-based survey response generation.
Abstract: Large language models (LLMs) offer a promising way to simulate human survey responses, potentially reducing the cost of large-scale data collection. However, existing zero-shot methods suffer from prompt sensitivity and low accuracy, while conventional fine-tuning approaches mostly fit the training set distributions and struggle to produce results more accurate than the training set itself, which deviates from the original goal of using LLMs to simulate survey responses. Building on this observation, we introduce Distribution Shift Alignment (DSA), a two-stage fine-tuning method that aligns both the output distributions and the distribution shifts across different backgrounds. By learning how these distributions change rather than fitting training data, DSA can provide results substantially closer to the true distribution than the training data. Empirically, DSA consistently outperforms other methods on five public survey datasets. We further conduct a comprehensive comparison covering accuracy, robustness, and data savings. DSA reduces the required real data by 53.48-69.12%, demonstrating its effectiveness and efficiency in survey simulation.
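The abstract does not spell out how the shift alignment is parameterised. One plausible reading, with every design choice below an assumption, is to match each background's distribution and additionally penalise error in the cross-background shift:

```python
import torch.nn.functional as F

def dsa_loss(p_a, p_b, q_a, q_b, w_shift=1.0):
    """One plausible reading of Distribution Shift Alignment (all assumptions).

    p_a, p_b: model-predicted response distributions for two respondent
    backgrounds; q_a, q_b: reference distributions. Besides matching each
    distribution, the loss aligns the cross-background shift, taken here as
    the difference of log-distributions.
    """
    match = F.kl_div(p_a.log(), q_a, reduction="batchmean") \
          + F.kl_div(p_b.log(), q_b, reduction="batchmean")
    shift_pred = p_a.log() - p_b.log()
    shift_true = q_a.log() - q_b.log()
    return match + w_shift * F.mse_loss(shift_pred, shift_true)
```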
[556] Foundation of Intelligence: Review of Math Word Problems from Human Cognition Perspective
Zhenya Huang, Jiayu Liu, Xin Lin, Zhiyuan Ma, Shangzi Xue, Tong Xiao, Qi Liu, Yee Whye Teh, Enhong Chen
Main category: cs.AI
TL;DR: This paper provides a comprehensive survey of math word problem (MWP) solving research from a human cognition perspective, reviewing neural network and LLM-based solvers over the past decade and evaluating their performance on 5 benchmarks.
Details
Motivation: The field lacks a systematic taxonomy for MWP surveys and discussion of current trends. The research aims to advance AI reasoning by mirroring human cognitive intelligence and demonstrate how AI models simulate human cognitive abilities in MWP solving.
Method: The authors summarize 5 key cognitive abilities for MWP solving, review two mainstream models (neural network solvers and LLM-based solvers), rerun representative solvers, and evaluate them on 5 mainstream benchmarks for unified comparison.
Result: The survey provides the first comprehensive analysis of influential MWP research from the perspective of human reasoning cognition and offers an integrative comparison across existing approaches.
Conclusion: This work systematically categorizes MWP research through human cognitive abilities, demonstrates AI’s progress in simulating human reasoning, and aims to inspire further research in AI reasoning. The repository is publicly available.
Abstract: Math word problem (MWP) serves as a fundamental research topic in artificial intelligence (AI) dating back to the 1960s. This research aims to advance the reasoning abilities of AI by mirroring human-like cognitive intelligence. The mainstream technological paradigm has evolved from the early rule-based methods, to deep learning models, and is rapidly advancing towards large language models. However, the field still lacks a systematic taxonomy for the MWP survey along with a discussion of current development trends. Therefore, in this paper, we aim to comprehensively review related research in MWP solving through the lens of human cognition, to demonstrate how recent AI models are advancing in simulating human cognitive abilities. Specifically, we summarize 5 crucial cognitive abilities for MWP solving, including Problem Understanding, Logical Organization, Associative Memory, Critical Thinking, and Knowledge Learning. Focused on these abilities, we review the two mainstream families of MWP solvers from the past 10 years, neural network solvers and LLM-based solvers, and discuss the core human-like abilities they demonstrate in their intricate problem-solving process. Moreover, we rerun all the representative MWP solvers and supplement their performance on 5 mainstream benchmarks for a unified comparison. To the best of our knowledge, this survey is the first to comprehensively analyze the influential MWP research of the past decade from the perspective of human reasoning cognition and to provide an integrative overall comparison across existing approaches. We hope it can inspire further research in AI reasoning. Our repository is released on https://github.com/Ljyustc/FoI-MWP.
[557] LightAgent: Mobile Agentic Foundation Models
Yangqin Jiang, Chao Huang
Main category: cs.AI
TL;DR: LightAgent is a mobile GUI agent system that uses device-cloud collaboration to combine the cost-efficiency of on-device models with the high capability of cloud models, achieving performance comparable to larger models while significantly reducing cloud costs.
Details
Motivation: Mobile GUI agents face a dilemma where truly on-device models (4B or smaller) lack sufficient performance, while capable models (7B+) are either too large for mobile deployment or prohibitively costly as cloud-only solutions.
Method: LightAgent enhances Qwen2.5-VL-3B via two-stage SFT->GRPO training on synthetic GUI data, integrates an efficient long-reasoning mechanism for historical interactions, and defaults to on-device execution while escalating challenging subtasks to the cloud via real-time complexity assessment.
Result: Experiments on AndroidLab benchmark and diverse apps show LightAgent matches or nears larger models’ performance while significantly reducing cloud costs.
Conclusion: LightAgent successfully resolves the mobile GUI agent dilemma through device-cloud collaboration, achieving strong performance with reduced cloud dependency.
Abstract: With the advancement of multimodal large language models (MLLMs), building GUI agent systems has become an increasingly promising direction, especially for mobile platforms, given their rich app ecosystems and intuitive touch interactions. Yet mobile GUI agents face a critical dilemma: truly on-device models (4B or smaller) lack sufficient performance, while capable models (starting from 7B) are either too large for mobile deployment or prohibitively costly (e.g., cloud-only closed-source MLLMs). To resolve this, we propose LightAgent, a mobile agentic foundation model solution that leverages device-cloud collaboration to tap the cost-efficiency of on-device models and the high capability of cloud models, while avoiding their drawbacks. Specifically, LightAgent enhances Qwen2.5-VL-3B via two-stage SFT->GRPO training on synthetic GUI data for strong decision-making, integrates an efficient long-reasoning mechanism to utilize historical interactions under tight resources, and defaults to on-device execution, only escalating challenging subtasks to the cloud via real-time complexity assessment. Experiments on the online AndroidLab benchmark and diverse apps show LightAgent matches or nears larger models, with a significant reduction in cloud costs.
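The routing idea reduces to a few lines. In the sketch below, `estimate_complexity` is a toy stand-in for the paper's real-time complexity assessment, and the `act` interface is assumed.

```python
def estimate_complexity(subtask, screen):
    """Toy stand-in for real-time complexity assessment: longer instructions
    and denser screens are treated as harder (0 = trivial, 1 = very hard)."""
    return min(1.0, (len(subtask.split()) + len(screen["elements"])) / 100)

def route_subtask(subtask, screen, local_agent, cloud_agent, threshold=0.7):
    """Default to on-device execution; escalate only hard subtasks (sketch)."""
    if estimate_complexity(subtask, screen) < threshold:
        return local_agent.act(subtask, screen)   # cheap on-device path
    return cloud_agent.act(subtask, screen)       # escalate to the cloud model
```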
[558] LLM-AR: LLM-powered Automated Reasoning Framework
Rick Chen, Joseph Ternasky, Aaron Ontoyin Yin, Xianling Mu, Fuat Alican, Yigit Ihlamur
Main category: cs.AI
TL;DR: LLM-AR is a neural-symbolic pipeline that distills LLM-generated heuristics into probabilistic rules using ProbLog, achieving 59.5% precision in predicting idea-stage startup success from founder traits.
Details
Motivation: LLMs have pattern recognition and reasoning capabilities but variable accuracy limits their use in high-stakes decision-making like venture capital startup success prediction.
Method: LLM-AR pipeline inspired by neural-symbolic systems that distills LLM-generated heuristics into probabilistic rules executed by ProbLog, with iterative policy-evolution loop using association-rule mining to refine rules.
Result: Achieved 59.5% precision and 8.7% recall on unseen folds, 5.9x random baseline precision, while maintaining interpretability with exposed decision paths.
Conclusion: The framework is interpretable, tunable via hyperparameters, and shows promise for extension into other domains beyond venture capital.
Abstract: Large language models (LLMs) can already identify patterns and reason effectively, yet their variable accuracy hampers adoption in high-stakes decision-making applications. In this paper, we study this issue from a venture capital perspective by predicting idea-stage startup success based on founder traits. (i) To build a reliable prediction model, we introduce LLM-AR, a pipeline inspired by neural-symbolic systems that distils LLM-generated heuristics into probabilistic rules executed by the ProbLog automated-reasoning engine. (ii) An iterative policy-evolution loop incorporates association-rule mining to progressively refine the prediction rules. On unseen folds, LLM-AR achieves 59.5% precision and 8.7% recall, 5.9x the random baseline precision, while exposing every decision path for human inspection. The framework is interpretable and tunable via hyperparameters, showing promise to extend into other domains.
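ProbLog has a Python front end, so the distilled-rules idea can be sketched directly; the clauses and probabilities below are invented placeholders, not rules actually mined by LLM-AR.

```python
from problog import get_evaluatable
from problog.program import PrologString

# Hypothetical distilled heuristics expressed as probabilistic clauses.
MODEL = """
0.62::success :- technical_founder, prior_exit.
0.31::success :- technical_founder, \\+prior_exit.
technical_founder.
prior_exit.
query(success).
"""

result = get_evaluatable().create_from(PrologString(MODEL)).evaluate()
print(result)  # {success: 0.62} -- only the first rule's body holds here
```

Because every inference runs through explicit clauses like these, each prediction's decision path can be inspected, which is the interpretability property the paper emphasizes.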
[559] Predictive Coding Enhances Meta-RL To Achieve Interpretable Bayes-Optimal Belief Representation Under Partial Observability
Po-Chen Kuo, Han Hou, Will Dabney, Edgar Y. Walker
Main category: cs.AI
TL;DR: Meta-RL with predictive coding modules learns more interpretable Bayes-optimal belief states than conventional meta-RL, leading to better generalization in partially observable environments.
Details
Motivation: Current meta-RL agents achieve near Bayes-optimal policies but fail to learn compact, interpretable belief states, limiting adaptability and generalization in partially observable environments.
Method: Integrate self-supervised predictive coding modules into meta-RL, inspired by predictive coding in neuroscience and auxiliary predictive objectives in deep RL.
Result: Meta-RL with predictive modules consistently generates more interpretable representations that better approximate Bayes-optimal belief states across various tasks, and succeeds in challenging active information seeking tasks where conventional meta-RL fails.
Conclusion: Predictive learning serves as a guiding principle for effective representation learning in agents navigating partial observability, leading to improved generalization.
Abstract: Learning a compact representation of history is critical for planning and generalization in partially observable environments. While meta-reinforcement learning (RL) agents can attain near Bayes-optimal policies, they often fail to learn the compact, interpretable Bayes-optimal belief states. This representational inefficiency potentially limits the agent’s adaptability and generalization capacity. Inspired by predictive coding in neuroscience, which suggests that the brain predicts sensory inputs as a neural implementation of Bayesian inference, and by auxiliary predictive objectives in deep RL, we investigate whether integrating self-supervised predictive coding modules into meta-RL can facilitate learning of Bayes-optimal representations. Through state machine simulation, we show that meta-RL with predictive modules consistently generates more interpretable representations that better approximate Bayes-optimal belief states compared to conventional meta-RL across a wide variety of tasks, even when both achieve optimal policies. In challenging tasks requiring active information seeking, only meta-RL with predictive modules successfully learns optimal representations and policies, whereas conventional meta-RL struggles with inadequate representation learning. Finally, we demonstrate that better representation learning leads to improved generalization. Our results strongly suggest the role of predictive learning as a guiding principle for effective representation learning in agents navigating partial observability.
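A minimal sketch of attaching a predictive-coding head to a recurrent meta-RL agent; the architecture sizes and the MSE next-observation target are assumptions consistent with auxiliary predictive objectives in deep RL, not the paper's exact module.

```python
import torch
import torch.nn as nn

class PredictiveMetaRLAgent(nn.Module):
    """Recurrent meta-RL agent with a self-supervised prediction head (sketch)."""

    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        # Belief state is the RNN hidden state over (obs, prev action, prev reward).
        self.rnn = nn.GRU(obs_dim + act_dim + 1, hidden, batch_first=True)
        self.policy = nn.Linear(hidden, act_dim)
        self.value = nn.Linear(hidden, 1)
        self.predict = nn.Linear(hidden, obs_dim)  # predictive-coding module

    def forward(self, obs, prev_act, prev_rew):
        x = torch.cat([obs, prev_act, prev_rew], dim=-1)  # (B, T, ...)
        h, _ = self.rnn(x)
        return self.policy(h), self.value(h), self.predict(h)

def auxiliary_prediction_loss(pred_next_obs, next_obs):
    # Added to the RL loss; pressures the hidden state to encode sufficient
    # statistics of the history, i.e. to approximate a Bayes-optimal belief.
    return nn.functional.mse_loss(pred_next_obs, next_obs)
```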
[560] HW/SW Co-design of a PCM/PWM converter: a System Level Approach based in the SpecC Methodology
Daniel G. P. Petrini, Braz Izaias da Silva Junior
Main category: cs.AI
TL;DR: Case study applying SpecC methodology to PCM-to-PWM converter design, achieving cost-effective HW/SW partitioning that meets real-time constraints.
Details
Motivation: To demonstrate the value of system-level hardware/software co-design for early architectural insight, rapid validation, and cost/performance trade-offs in embedded systems.
Method: Used the SpecC methodology within a system-level HW/SW co-design flow to model and explore a PCM-to-PWM converter, employing system-level estimates and fast functional simulation to evaluate mappings.
Result: Successfully derived HW/SW partitions that meet real-time constraints while reducing estimated cost compared to all-hardware solution and avoiding expense of purely software implementation on high-end processor.
Conclusion: System-level co-design provides valuable early architectural insight, rapid validation, and actionable cost/performance trade-offs, even for moderately complex designs.
Abstract: We present a case study applying the SpecC methodology within a system-level hardware/software co-design flow to a PCM-to-PWM converter, the core of a Class-D audio amplifier. The converter was modeled and explored with the SpecC methodology to derive an HW/SW partition. Using system-level estimates and fast functional simulation, we evaluated mappings that meet real-time constraints while reducing the estimated cost relative to an all-hardware solution and avoiding the expense of a purely software implementation on a high-end processor. Despite the design’s moderate complexity, the results underline the value of system-level co-design for early architectural insight, rapid validation, and actionable cost/performance trade-offs. [Original work from 2005; formatting revised in 2025, with no changes to the results.]
[561] Towards Error-Centric Intelligence II: Energy-Structured Causal Models
Marcus Thomas
Main category: cs.AI
TL;DR: The paper argues for shifting machine learning from predictive accuracy to causal explanations, introducing Energy Structured Causal Models (ESCMs) that make internal mechanisms manipulable through constraint-based representations.
Details
Motivation: Current ML systems achieve high predictive accuracy but lack causal transparency: their internal representations don't support principled interventions or surgical edits of specific mechanisms.
Method: Introduces Energy Structured Causal Models (ESCMs) where mechanisms are expressed as constraints (energy functions or vector fields) rather than explicit input-output maps, enabling local surgery on constraints.
Result: ESCMs provide a formal language for causal reasoning that makes internal structure manipulable, recovers standard SCM semantics under mild conditions, and addresses gauge ambiguity in encoder energy pairs.
Conclusion: The paper offers a framework for building systems that understand rather than merely predict, positioning intelligence as the ability to build and refine falsifiable causal explanations.
Abstract: Contemporary machine learning optimizes for predictive accuracy, yet systems that achieve state-of-the-art performance remain causally opaque: their internal representations provide no principled handle for intervention. We can retrain such models, but we cannot surgically edit specific mechanisms while holding others fixed, because learned latent variables lack causal semantics. We argue for a conceptual reorientation: intelligence is the ability to build and refine explanations, falsifiable claims about manipulable structure that specify what changes and what remains invariant under intervention. Explanations subsume prediction but demand more: causal commitments that can be independently tested and corrected at the level of mechanisms. We introduce computational explanations, mappings from observations to intervention-ready causal accounts. We instantiate these explanations with Energy Structured Causal Models (ESCMs), in which mechanisms are expressed as constraints (energy functions or vector fields) rather than explicit input-output maps, and interventions act by local surgery on those constraints. This shift makes internal structure manipulable at the level where explanations live: which relations must hold, which can change, and what follows when they do. We provide concrete instantiations of the structural-causal principles LAP and ICM in the ESCM context, and also argue that empirical risk minimization systematically produces fractured, entangled representations, a failure we analyze as gauge ambiguity in encoder-energy pairs. Finally, we show that under mild conditions, ESCMs recover standard SCM semantics. Building on Part I’s principles (LAP, ICM, CAP) and its definition of intelligence as explanation-building under criticism, this paper offers a formal language for causal reasoning in systems that aspire to understand, not merely to predict.
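A toy rendering of the core idea may help: mechanisms as named energy terms whose sum defines the model, and an intervention as replacing one term while the rest stay fixed. Everything below is our illustration of that reading, not the authors' formalism.

```python
# Each mechanism is an energy term over named variables; the model's total
# energy is their sum, and feasible states are (near-)minimizers.

def mech_price(v):   # constraint: price tracks demand
    return (v["price"] - 2.0 * v["demand"]) ** 2

def mech_demand(v):  # constraint: demand is anchored near 1
    return (v["demand"] - 1.0) ** 2

mechanisms = {"price": mech_price, "demand": mech_demand}

def total_energy(v, mechs):
    return sum(term(v) for term in mechs.values())

def intervene(mechs, name, new_term):
    """Local surgery: swap one mechanism's constraint, hold the rest fixed."""
    edited = dict(mechs)
    edited[name] = new_term
    return edited

# do(price = 5): replace the price mechanism with a clamping constraint.
clamped = intervene(mechanisms, "price", lambda v: (v["price"] - 5.0) ** 2)
state = {"price": 5.0, "demand": 1.0}
print(total_energy(state, mechanisms), total_energy(state, clamped))  # 9.0 0.0
```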
[562] Energy-Efficient Domain-Specific Artificial Intelligence Models and Agents: Pathways and Paradigms
Abhijit Chatterjee, Niraj K. Jha, Jonathan D. Cohen, Thomas L. Griffiths, Hongjing Lu, Diana Marculescu, Ashiqur Rasul, Keshab K. Parhi
Main category: cs.AI
TL;DR: The paper proposes a vision for next-generation AI systems that move from current large, energy-intensive models to lightweight, domain-specific agents capable of reasoning, planning, and continuous learning with much higher energy efficiency.
Details
Motivation: Current AI models are dominated by large language models that require massive data and energy consumption (50-60 GWh for GPT-4), often hallucinate, and cannot be deployed in critical applications, while the human brain consumes only 20W of power.Method: The paper develops a vision for future AI systems that transition from today’s large models to nimble, energy-efficient domain-specific agents that can reason, plan, make decisions in dynamic environments with real-time data and prior knowledge, while learning continuously.
Result: The proposed vision requires hardware reimagined to achieve energy efficiencies greater than 1000x over current state-of-the-art systems.
Conclusion: The next wave of AI should progress from today’s data-hungry large models to lightweight, domain-specific agents capable of reasoning and thinking in uncertain environments with dramatically improved energy efficiency.
Abstract: The field of artificial intelligence (AI) has taken a tight hold on broad aspects of society, industry, business, and governance in ways that dictate the prosperity and might of the world’s economies. The AI market size is projected to grow from 189 billion USD in 2023 to 4.8 trillion USD by 2033. Currently, AI is dominated by large language models that exhibit linguistic and visual intelligence. However, training these models requires a massive amount of data scraped from the web as well as large amounts of energy (50–60 GWh to train GPT-4). Despite these costs, these models often hallucinate, a characteristic that prevents them from being deployed in critical application domains. In contrast, the human brain consumes only 20 W of power. What is needed is the next level of AI evolution in which lightweight domain-specific multimodal models with higher levels of intelligence can reason, plan, and make decisions in dynamic environments with real-time data and prior knowledge, while learning continuously and evolving in ways that enhance future decision-making capability. This will define the next wave of AI, progressing from today’s large models, trained with vast amounts of data, to nimble energy-efficient domain-specific agents that can reason and think in a world full of uncertainty. To support such agents, hardware will need to be reimagined to allow energy efficiencies greater than 1000x over the state of the art. Such a vision of future AI systems is developed in this work.
[563] Multi-Agent Conditional Diffusion Model with Mean Field Communication as Wireless Resource Allocation Planner
Kechen Meng, Sinuo Zhang, Rongpeng Li, Xiangming Meng, Chan Wang, Ming Lei, Zhifeng Zhao
Main category: cs.AI
TL;DR: MA-CDMP uses diffusion models and mean-field approximation for decentralized wireless resource allocation, overcoming scalability and non-stationarity issues in MARL.
Details
Motivation: To address scalability issues and privacy risks in centralized MARL, and non-stationarity/limited cooperation problems in DTDE approaches for wireless communication systems.Method: Multi-Agent Conditional Diffusion Model Planner (MA-CDMP) with diffusion models for environment dynamics, inverse dynamics model for action generation, and mean-field mechanism for agent interaction approximation.
Result: Outperforms existing MARL baselines in average reward and QoS metrics, demonstrating scalability and practicality for real-world wireless networks.
Conclusion: MA-CDMP provides an effective solution for decentralized communication resource management with theoretical convergence guarantees and minimal communication overhead.
Abstract: In wireless communication systems, efficient and adaptive resource allocation plays a crucial role in enhancing overall Quality of Service (QoS). While centralized Multi-Agent Reinforcement Learning (MARL) frameworks rely on a central coordinator for policy training and resource scheduling, they suffer from scalability issues and privacy risks. In contrast, the Distributed Training with Decentralized Execution (DTDE) paradigm enables distributed learning and decision-making, but it struggles with non-stationarity and limited inter-agent cooperation, which can severely degrade system performance. To overcome these challenges, we propose the Multi-Agent Conditional Diffusion Model Planner (MA-CDMP) for decentralized communication resource management. Built upon the Model-Based Reinforcement Learning (MBRL) paradigm, MA-CDMP employs Diffusion Models (DMs) to capture environment dynamics and plan future trajectories, while an inverse dynamics model guides action generation, thereby alleviating the sample inefficiency and slow convergence of conventional DTDE methods. Moreover, to approximate large-scale agent interactions, a Mean-Field (MF) mechanism is introduced to assist the classifier in the DMs. This design mitigates inter-agent non-stationarity and enhances cooperation with minimal communication overhead in distributed settings. We further theoretically establish an upper bound on the distributional approximation error introduced by the MF-based diffusion generation, guaranteeing convergence stability and reliable modeling of multi-agent stochastic dynamics. Extensive experiments demonstrate that MA-CDMP consistently outperforms existing MARL baselines in terms of average reward and QoS metrics, showcasing its scalability and practicality for real-world wireless network optimization.
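The mean-field step can be illustrated independently of the diffusion machinery: rather than conditioning on every neighbor's action, each agent conditions on the population's empirical mean action, keeping the conditioning signal fixed-size as the number of agents grows. A minimal sketch, with names of our choosing:

```python
import numpy as np

def mean_field_context(own_action: np.ndarray, all_actions: np.ndarray) -> np.ndarray:
    """Condition on (own action, population mean action) instead of on
    every neighbor individually; the summary stays fixed-size in N."""
    return np.concatenate([own_action, all_actions.mean(axis=0)])

actions = np.random.rand(50, 4)            # 50 agents, 4-dim actions
ctx = mean_field_context(actions[0], actions)
print(ctx.shape)                           # (8,) regardless of agent count
```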
[564] Embracing Trustworthy Brain-Agent Collaboration as Paradigm Extension for Intelligent Assistive Technologies
Yankai Chen, Xinni Zhang, Yifei Zhang, Yangning Li, Henry Peng Zou, Chunyu Miao, Weizhi Zhang, Xue Liu, Philip S. Yu
Main category: cs.AI
TL;DR: The paper proposes extending Brain-Computer Interfaces (BCI) to Brain-Agent Collaboration (BAC), reframing AI agents as active partners rather than passive data processors, while addressing ethical and technical challenges.
Details
Motivation: Current BCIs face limitations like low information transfer rates and extensive user calibration, and while LLM integration shows promise, deploying agentic AI raises technical and ethical concerns that need comprehensive discussion.Method: This position paper argues for a paradigm shift from BCI to BAC, emphasizing the need to reframe agents as collaborative partners and focusing on ethical data handling, model reliability, and human-agent collaboration frameworks.
Result: The paper establishes the conceptual foundation for Brain-Agent Collaboration as an emerging direction that could overcome current BCI limitations through intelligent assistance systems.
Conclusion: The field should transition from BCI to BAC, developing safe, trustworthy, and effective systems where agents serve as active collaborative partners, requiring attention to ethical considerations and robust collaboration frameworks.
Abstract: Brain-Computer Interfaces (BCIs) offer a direct communication pathway between the human brain and external devices, holding significant promise for individuals with severe neurological impairments. However, their widespread adoption is hindered by critical limitations, such as low information transfer rates and extensive user-specific calibration. To overcome these challenges, recent research has explored the integration of Large Language Models (LLMs), extending the focus from simple command decoding to understanding complex cognitive states. Despite these advancements, deploying agentic AI faces technical hurdles and ethical concerns. Due to the lack of comprehensive discussion on this emerging direction, this position paper argues that the field is poised for a paradigm extension from BCI to Brain-Agent Collaboration (BAC). We emphasize reframing agents as active and collaborative partners for intelligent assistance rather than passive brain signal data processors, demanding a focus on ethical data handling, model reliability, and a robust human-agent collaboration framework to ensure these systems are safe, trustworthy, and effective.
[565] Controllable Mathematical Reasoning via Self-Optimizing Thought Vectors
Xuying LI
Main category: cs.AI
TL;DR: A novel approach for controllable mathematical reasoning using self-optimizing thought vectors with entropy minimization, achieving 90.1% accuracy on GSM8K with Gemma-2-9B.
Details
Motivation: To develop controllable mathematical reasoning that can dynamically modulate internal reasoning processes without requiring external reward annotations.Method: Leverages learnable thought vectors that dynamically modulate LLM reasoning processes, using entropy-based rewards to guide focused reasoning patterns.
Result: Achieved 90.1% accuracy on GSM8K with a controllability score of 0.42, with distinct thought vector clusters and consistent low-entropy distributions across control conditions.
Conclusion: The framework successfully enables controllable AI reasoning through entropy-minimized thought vectors, validating the approach for focused reasoning patterns.
Abstract: We present a novel approach for controllable mathematical reasoning that leverages self-optimizing thought vectors with entropy minimization. Our method introduces learnable thought vectors that dynamically modulate the internal reasoning process of large language models. Using Gemma-2-9B on GSM8K, we achieve 90.1% accuracy with a controllability score of 0.42, demonstrating that entropy-based rewards effectively guide focused reasoning patterns without requiring external reward annotations. Our analysis reveals distinct thought vector clusters and consistent low-entropy distributions across control conditions, validating our framework for controllable AI reasoning.
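A minimal sketch of the mechanism as we read it: a learnable vector added to a frozen model's hidden states, trained against a negative-entropy reward so no external reward labels are needed. The hidden width and hook placement below are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

hidden_dim = 3584                      # assumed layer width, e.g. Gemma-2-9B
thought = torch.zeros(hidden_dim, requires_grad=True)  # learnable thought vector
opt = torch.optim.Adam([thought], lr=1e-3)

def entropy_reward(logits: torch.Tensor) -> torch.Tensor:
    """Reward = negative mean entropy of the output distribution."""
    probs = F.softmax(logits, dim=-1)
    return (probs * probs.clamp_min(1e-9).log()).sum(-1).mean()

# Inside a forward hook on the chosen layer: hidden = hidden + thought.
# Training then ascends the entropy reward with no external labels:
logits = torch.randn(4, 32) + thought[:32]   # stand-in for model logits
loss = -entropy_reward(logits)               # minimize output entropy
loss.backward()
opt.step()
```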
[566] Atlas Urban Index: A VLM-Based Approach for Spatially and Temporally Calibrated Urban Development Monitoring
Mithul Chander, Sai Pragnya Ranga, Prathamesh Mayekar
Main category: cs.AI
TL;DR: The Atlas Urban Index (AUI) is a new metric using Vision-Language Models and Sentinel-2 satellite imagery to measure urban development, overcoming limitations of traditional indices like NDBI.
Details
Motivation: Existing urban development metrics struggle with atmospheric noise, seasonal variation, and cloud cover, hindering large-scale monitoring of urbanization.Method: Uses Vision-Language Models to score regions based on time-series Sentinel-2 images processed to minimize cloud cover, with reference images and temporal consistency strategies.
Result: AUI outperforms standard indices like NDBI in qualitative experiments on Bangalore, providing more reliable and stable development scores.
Conclusion: AUI successfully addresses challenges of traditional urbanization indices through VLM-based approach with temporal consistency and cloud mitigation strategies.
Abstract: We introduce the Atlas Urban Index (AUI), a metric for measuring urban development computed using Sentinel-2 (Spoto et al., 2012) satellite imagery. Existing approaches, such as the Normalized Difference Built-up Index (NDBI), often struggle to accurately capture urban development due to factors like atmospheric noise, seasonal variation, and cloud cover. These limitations hinder large-scale monitoring of human development and urbanization. To address these challenges, we propose an approach that leverages Vision-Language Models (VLMs) to provide a development score for regions. Specifically, we collect a time series of Sentinel-2 images for each region. Then, we further process the images within fixed time windows to get an image with minimal cloud cover, which serves as the representative image for that time window. To ensure consistent scoring, we adopt two strategies: (i) providing the VLM with a curated set of reference images representing different levels of urbanization, and (ii) supplying the most recent past image to both anchor temporal consistency and mitigate cloud-related noise in the current image. Together, these components enable AUI to overcome the challenges of traditional urbanization indices and produce more reliable and stable development scores. Our qualitative experiments on Bangalore suggest that AUI outperforms standard indices such as NDBI.
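The scoring loop described above reduces to a simple pattern: score each window's representative image against fixed references while passing the previous window's image along for temporal anchoring. A runnable sketch with a stand-in scorer (the function signature is our assumption):

```python
from typing import Callable, List, Optional

def atlas_urban_index(
    window_images: List[object],                 # least-cloudy image per window
    references: List[object],                    # curated urbanization anchors
    score_fn: Callable[[object, List[object], Optional[object]], float],
) -> List[float]:
    """Score each window's representative image; the previous window's image
    is passed along to anchor temporal consistency."""
    scores: List[float] = []
    prev: Optional[object] = None
    for image in window_images:
        scores.append(score_fn(image, references, prev))
        prev = image
    return scores

# With a dummy scorer standing in for the VLM call:
print(atlas_urban_index([1, 2, 3], [], lambda img, refs, prev: float(img)))
```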
[567] Measure what Matters: Psychometric Evaluation of AI with Situational Judgment Tests
Alexandra Yost, Shreyans Jain, Shivam Raval, Grant Corser, Allen Roush, Nina Xu, Jacqueline Hammack, Ravid Shwartz-Ziv, Amirali Abdullah
Main category: cs.AI
TL;DR: A framework for AI psychometrics using situational judgment tests and sophisticated personas with behavioral descriptors, applied to law enforcement with a large dataset of 8,500 personas and 300,000 responses.
Details
Motivation: Current AI psychometrics often reuses human trait inventories or ad hoc personas, limiting behavioral realism and domain relevance in evaluating AI systems for emotional judgment and ethical consideration roles.Method: Uses situational judgment tests from realistic scenarios, integrates industrial-organizational and personality psychology to design sophisticated personas with behavioral descriptors, life history, and social functions, employs structured generation with demographic priors and memoir narratives encoded with Pydantic schemas.
Result: Created a rich dataset for law enforcement assistant case study spanning 8,500 personas across 8 archetypes, 4,000 SJTs across 11 attributes, and 300,000 responses, with analysis across subpopulations and scenarios.
Conclusion: The framework enables more realistic and domain-relevant AI psychometrics evaluation, and the dataset and code will be publicly released.
Abstract: AI psychometrics evaluates AI systems in roles that traditionally require emotional judgment and ethical consideration. Prior work often reuses human trait inventories (Big Five, HEXACO) or ad hoc personas, limiting behavioral realism and domain relevance. We propose a framework that (1) uses situational judgment tests (SJTs) from realistic scenarios to probe domain-specific competencies; (2) integrates industrial-organizational and personality psychology to design sophisticated personas which include behavioral and psychological descriptors, life history, and social and emotional functions; and (3) employs structured generation with population demographic priors and memoir-inspired narratives, encoded with Pydantic schemas. In a law enforcement assistant case study, we construct a rich dataset of personas drawn across 8 persona archetypes and SJTs across 11 attributes, and analyze behaviors across subpopulation and scenario slices. The dataset spans 8,500 personas, 4,000 SJTs, and 300,000 responses. We will release the dataset and all code to the public.
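Since the abstract names Pydantic schemas explicitly, here is a hypothetical, much-reduced schema of the kind that could back structured persona generation; all field names are ours.

```python
from typing import List
from pydantic import BaseModel, Field

class Persona(BaseModel):
    archetype: str = Field(description="one of the 8 persona archetypes")
    age: int = Field(ge=18, le=90)
    behavioral_descriptors: List[str]
    life_history: str = Field(description="memoir-style narrative")
    social_functions: List[str]

class SJTResponse(BaseModel):
    persona_id: str
    scenario_id: str
    chosen_action: str
    rationale: str

# An LLM constrained to emit JSON matching Persona can then be validated:
# persona = Persona.model_validate_json(llm_output)
```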
[568] AutoStreamPipe: LLM Assisted Automatic Generation of Data Stream Processing Pipelines
Abolfazl Younesi, Zahra Najafabadi Samani, Thomas Fahringer
Main category: cs.AI
TL;DR: AutoStreamPipe is an LLM-based framework that automates stream processing pipeline design, generation, and deployment using Hypergraph of Thoughts (HGoT) to bridge user intent with platform-specific implementations.
Details
Motivation: To address the semantic gap between high-level user intent and platform-specific implementations in stream processing, and to automate pipeline development for efficient real-time data processing.Method: Uses Large Language Models (LLMs) with Hypergraph of Thoughts (HGoT) extension, resilient execution strategies, and advanced query analysis to automate pipeline design and deployment across distributed stream processing systems.
Result: Significantly reduces development time (6.3x faster) and error rates (5.19x improvement in Error-Free Score) compared to LLM code-generation methods across diverse pipelines.
Conclusion: AutoStreamPipe successfully automates stream processing pipeline development with substantial improvements in efficiency and accuracy, demonstrating the effectiveness of HGoT-enhanced LLMs for multi-agent reasoning in distributed systems.
Abstract: Data pipelines are essential in stream processing as they enable the efficient collection, processing, and delivery of real-time data, supporting rapid data analysis. In this paper, we present AutoStreamPipe, a novel framework that employs Large Language Models (LLMs) to automate the design, generation, and deployment of stream processing pipelines. AutoStreamPipe bridges the semantic gap between high-level user intent and platform-specific implementations across distributed stream processing systems, using a Hypergraph of Thoughts (HGoT), an extended version of GoT, for structured multi-agent reasoning. AutoStreamPipe combines resilient execution strategies, advanced query analysis, and HGoT to deliver pipelines with good accuracy. Experimental evaluations on diverse pipelines demonstrate that AutoStreamPipe significantly reduces development time (6.3x) and error rates (5.19x), as measured by a novel Error-Free Score (EFS), compared to LLM code-generation methods.
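The abstract does not spell out HGoT's internals, but the generalization from graph-of-thoughts to hypergraph-of-thoughts suggests reasoning steps that join several thoughts at once. A speculative minimal container:

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List

@dataclass
class HypergraphOfThoughts:
    thoughts: Dict[str, str] = field(default_factory=dict)          # id -> text
    hyperedges: List[FrozenSet[str]] = field(default_factory=list)  # joint steps

    def add_thought(self, tid: str, text: str) -> None:
        self.thoughts[tid] = text

    def aggregate(self, source_ids: List[str], tid: str, text: str) -> None:
        """One reasoning step that fuses several thoughts at once."""
        self.add_thought(tid, text)
        self.hyperedges.append(frozenset([*source_ids, tid]))

hgot = HypergraphOfThoughts()
hgot.add_thought("schema", "parse the source schema")
hgot.add_thought("sink", "choose a sink connector")
hgot.aggregate(["schema", "sink"], "pipeline", "emit the pipeline skeleton")
print(len(hgot.hyperedges))  # 1 hyperedge joining three thoughts
```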
[569] Dopamine-driven synaptic credit assignment in neural networks
Saranraj Nambusubramaniyan, Shervin Safavi, Raja Guru, Andreas Knoblauch
Main category: cs.AI
TL;DR: Dopamine optimizer is a derivative-free method inspired by neural reinforcement learning that uses weight perturbation with adaptive learning rates based on reward prediction error, achieving comparable performance to gradient-based methods with less computation and memory.
Details
Motivation: To address the inefficiencies of back-propagation in neural networks (computational cost, memory usage, weight transport and update locking problems) by developing a neurobiologically plausible alternative.Method: Uses weight perturbation learning with stochastic weight updates that minimize regret (reward prediction error) between perturbed and unperturbed model outcomes, creating an adaptive learning rate strategy similar to dopamine’s role in the brain.
Result: Dopamine-trained models show accelerated convergence, outperform standard weight perturbation, achieve comparable performance to gradient-based algorithms, while consuming significantly less computation and memory.
Conclusion: The Dopamine optimizer provides robust solutions with performance comparable to state-of-the-art ML optimizers while being more neurobiologically plausible and computationally efficient.
Abstract: Solving the synaptic Credit Assignment Problem (CAP) is central to learning in both biological and artificial neural systems. Finding an optimal solution for synaptic CAP means setting the synaptic weights that assign credit to each neuron for influencing the final output and behavior of neural networks or animals. Gradient-based methods solve this problem in artificial neural networks using back-propagation, though not in the most efficient way. For instance, back-propagation requires a chain of top-down gradient computations. This leads to an expensive optimization process in terms of computing power and memory, linked with well-known weight transport and update locking problems. To address these shortcomings, we take a NeuroAI approach and draw inspiration from neural Reinforcement Learning to develop a derivative-free optimizer for training neural networks, Dopamine. Dopamine is developed for Weight Perturbation (WP) learning that exploits stochastic updating of weights towards optima. It achieves this by minimizing the regret, a form of Reward Prediction Error (RPE) between the expected outcome from the perturbed model and the actual outcome from the unperturbed model. We use this RPE to adjust the learning rate in the network (i.e., creating an adaptive learning rate strategy, similar to the role of dopamine in the brain). We tested the Dopamine optimizer for training multi-layered perceptrons on XOR tasks, and recurrent neural networks for chaotic time series forecasting. Dopamine-trained models demonstrate accelerated convergence and outperform standard WP, and give comparable performance to gradient-based algorithms, while consuming significantly less computation and memory. Overall, the Dopamine optimizer not only finds robust solutions and comparable performance to the state-of-the-art Machine Learning optimizers but is also neurobiologically more plausible.
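Under our reading of the abstract, one Dopamine-style update perturbs the weights, measures the reward prediction error between perturbed and unperturbed outcomes, and scales the step by that error. A toy, derivative-free sketch (hyperparameters and structure are ours):

```python
import numpy as np

def dopamine_step(weights, loss_fn, lr=0.5, sigma=0.1):
    """One weight-perturbation update scaled by the reward prediction error."""
    base_loss = loss_fn(weights)                  # unperturbed outcome
    noise = np.random.randn(*weights.shape) * sigma
    rpe = base_loss - loss_fn(weights + noise)    # regret-style error signal
    # Move along the perturbation in proportion to the RPE; larger surprises
    # produce larger steps, and no gradients are ever computed.
    return weights + lr * rpe * noise

# Toy problem: minimize ||w - 3||^2.
w = np.zeros(4)
for _ in range(2000):
    w = dopamine_step(w, lambda x: float(np.sum((x - 3.0) ** 2)))
print(np.round(w, 2))   # should approach [3. 3. 3. 3.]
```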
[570] A Neuro-Symbolic Multi-Agent Approach to Legal-Cybersecurity Knowledge Integration
Chiara Bonfanti, Alessandro Druetto, Cataldo Basile, Tharindu Ranasinghe, Marcos Zampieri
Main category: cs.AI
TL;DR: First step towards intelligent systems for navigating complex cyber-legal domain, showing promising initial results on multilingual tasks.
Details
Motivation: Traditional legal research tools struggle with nuanced connections between cases, statutes, and technical vulnerabilities, hindering collaboration between legal experts and cybersecurity professionals.Method: Development of intelligent systems capable of navigating the intricate cyber-legal domain.
Result: Promising initial results on multilingual tasks.
Conclusion: This work provides a first step towards bridging the knowledge divide in the cyber-legal domain.
Abstract: The growing intersection of cybersecurity and law creates a complex information space where traditional legal research tools struggle to deal with nuanced connections between cases, statutes, and technical vulnerabilities. This knowledge divide hinders collaboration between legal experts and cybersecurity professionals. To address this important gap, this work provides a first step towards intelligent systems capable of navigating the increasingly intricate cyber-legal domain. We demonstrate promising initial results on multilingual tasks.
[571] OptiTree: Hierarchical Thoughts Generation with Tree Search for LLM Optimization Modeling
Haoyang Liu, Jie Wang, Yuyang Cai, Xiongwei Han, Yufei Kuang, Jianye Hao
Main category: cs.AI
TL;DR: OptiTree introduces a tree search approach for optimization modeling that adaptively decomposes complex OR problems into simpler subproblems using hierarchical problem taxonomy, achieving over 10% improvement in modeling accuracy.
Details
Motivation: Standard fixed-step decomposition in LLM-based optimization modeling often fails due to the complex mathematical structures in OR problems, requiring more adaptive approaches.Method: Develops a modeling tree organizing OR problems hierarchically by taxonomy and complexity, then recurrently searches the tree to decompose problems into simpler subproblems and synthesize global modeling thoughts.
Result: OptiTree significantly improves modeling accuracy compared to state-of-the-art methods, achieving over 10% improvements on challenging benchmarks.
Conclusion: The tree search approach with adaptive problem decomposition effectively enhances modeling capabilities for complex OR problems, demonstrating substantial performance gains.
Abstract: Optimization modeling is one of the most crucial but technical parts of operations research (OR). To automate the modeling process, existing works have leveraged large language models (LLMs), prompting them to break down tasks into steps for generating variables, constraints, and objectives. However, due to the highly complex mathematical structures inherent in OR problems, standard fixed-step decomposition often fails to achieve high performance. To address this challenge, we introduce OptiTree, a novel tree search approach designed to enhance modeling capabilities for complex problems through adaptive problem decomposition into simpler subproblems. Specifically, we develop a modeling tree that organizes a wide range of OR problems based on their hierarchical problem taxonomy and complexity, with each node representing a problem category and containing relevant high-level modeling thoughts. Given a problem to model, we recurrently search the tree to identify a series of simpler subproblems and synthesize the global modeling thoughts by adaptively integrating the hierarchical thoughts. Experiments show that OptiTree significantly improves the modeling accuracy compared to the state-of-the-art, achieving over 10% improvements on the challenging benchmarks. The code is released at https://github.com/MIRALab-USTC/OptiTree/tree/main.
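A minimal sketch of the recurrent tree descent we infer from the abstract: walk the modeling tree from the root, accumulating each matching category's high-level thoughts. The node structure and matching rule are our simplification, not the released code.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    category: str
    thought: str                      # high-level modeling thought
    children: list["TreeNode"] = field(default_factory=list)

def collect_thoughts(node: TreeNode, matches) -> list[str]:
    """Descend one matching branch per level, gathering thoughts on the way."""
    thoughts = [node.thought]
    for child in node.children:
        if matches(child.category):
            thoughts += collect_thoughts(child, matches)
            break
    return thoughts

root = TreeNode("LP", "define decision variables and a linear objective",
                [TreeNode("transportation", "index flows by (source, sink)")])
print(collect_thoughts(root, lambda c: c == "transportation"))
```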
[572] PACR: Progressively Ascending Confidence Reward for LLM Reasoning
Eunseop Yoon, Hee Suk Yoon, Jaehyun Jang, SooHwan Eom, Qi Dai, Chong Luo, Mark A. Hasegawa-Johnson, Chang D. Yoo
Main category: cs.AI
TL;DR: PACR introduces a dense, model-intrinsic reward based on the model’s evolving belief in the correct answer, accelerating RLVR training by constraining exploration to logically sound reasoning paths.
Details
Motivation: RLVR's sparse, outcome-based rewards provide no guidance for intermediate reasoning steps, slowing down exploration during training.Method: Progressively Ascending Confidence Reward (PACR) - a dense reward computed from the model’s evolving probability of the ground-truth answer, encoding the inductive bias that correct reasoning should show an ascending trend in confidence.
Result: PACR accelerates exploration, reaches reward saturation with fewer trajectories, and improves performance on multiple benchmarks compared to standard RLVR.
Conclusion: Dense, model-intrinsic shaping signals like PACR can make RLVR training more effective and reliable by providing better guidance during the reasoning process.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved LLM reasoning, but its sparse, outcome-based reward provides no guidance for intermediate steps, slowing exploration. We propose Progressively Ascending Confidence Reward (PACR), a dense, model-intrinsic reward computed directly from the model’s evolving belief in the correct answer. PACR encodes the inductive bias that, along a well-formed reasoning trajectory, the probability of the ground-truth answer should have a generally ascending trend. We provide empirical and theoretical analysis validating that such an inductive bias constrains the exploration search space to regions richer in logically sound reasoning. We demonstrate that PACR accelerates exploration, reaches reward saturation with fewer trajectories, and yields improvements on multiple benchmarks. Our results suggest that dense, model-intrinsic shaping signals can make RLVR training more effective and reliable.
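The reward itself is easy to state: given the model's probability of the ground-truth answer after each reasoning step, reward trajectories whose confidence trends upward. A sketch under that reading (the exact shaping function in the paper may differ):

```python
import torch

def pacr_reward(answer_probs: torch.Tensor) -> torch.Tensor:
    """answer_probs: (T,) probability of the ground-truth answer after each
    of T reasoning steps. The reward is the mean step-to-step increase, so
    trajectories with generally ascending confidence score higher."""
    return (answer_probs[1:] - answer_probs[:-1]).mean()

good = torch.tensor([0.1, 0.2, 0.5, 0.8])   # belief in the answer ascends
bad = torch.tensor([0.4, 0.5, 0.2, 0.1])    # confidence collapses mid-trace
print(pacr_reward(good).item(), pacr_reward(bad).item())  # positive, negative
```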
[573] VietLyrics: A Large-Scale Dataset and Models for Vietnamese Automatic Lyrics Transcription
Quoc Anh Nguyen, Bernard Cheng, Kelvin Soh
Main category: cs.AI
TL;DR: Created first large-scale Vietnamese ALT dataset (VietLyrics) with 647 hours of line-aligned lyrics, fine-tuned Whisper models achieving state-of-the-art performance for Vietnamese lyrics transcription.
Details
Motivation: Vietnamese music ALT faces challenges due to tonal complexity and dialectal variations, but lacks dedicated datasets for research.Method: Curated VietLyrics dataset with 647 hours of songs, fine-tuned Whisper models on this dataset to address transcription limitations.
Result: Fine-tuned Whisper models outperformed existing multilingual ALT systems including LyricWhiz, reducing transcription errors and hallucinations.
Conclusion: Demonstrates potential of dataset creation and model fine-tuning for ALT in low-resource languages like Vietnamese, publicly releasing resources to advance research.
Abstract: Automatic Lyrics Transcription (ALT) for Vietnamese music presents unique challenges due to its tonal complexity and dialectal variations, but remains largely unexplored due to the lack of a dedicated dataset. Therefore, we curated the first large-scale Vietnamese ALT dataset (VietLyrics), comprising 647 hours of songs with line-level aligned lyrics and metadata to address these issues. Our evaluation of current ASR-based approaches reveals significant limitations, including frequent transcription errors and hallucinations in non-vocal segments. To improve performance, we fine-tuned Whisper models on the VietLyrics dataset, achieving superior results compared to existing multilingual ALT systems, including LyricWhiz. We publicly release VietLyrics and our models, aiming to advance Vietnamese music computing research while demonstrating the potential of this approach for ALT in low-resource languages and music.
[574] Graph-Coarsening Approach for the Capacitated Vehicle Routing Problem with Time Windows
Mustafa Mert Özyılmaz
Main category: cs.AI
TL;DR: A multilevel graph coarsening framework for CVRPTW that aggregates customers using spatio-temporal metrics, reduces problem size for faster solving, and refines solutions back to original space with feasibility corrections.
Details
Motivation: Large-scale CVRPTW instances are computationally challenging for exact solvers, requiring more efficient approaches to handle NP-hard optimization in logistics.Method: Multilevel graph coarsening and refinement with spatio-temporal customer aggregation, classical heuristics on reduced problems, and feasibility corrections during expansion.
Result: Preliminary experiments on Solomon benchmarks show reduced computation time while maintaining or improving solution quality, especially for capacity and time window constraints.
Conclusion: The framework effectively accelerates CVRPTW solving and quantum-inspired techniques show promise for further speedup in large-scale vehicle routing.
Abstract: The Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) is a fundamental NP-hard optimization problem in logistics. Solving large-scale instances remains computationally challenging for exact solvers. This work introduces a multilevel graph coarsening and refinement framework that aggregates customers into meta-nodes using a spatio-temporal distance metric. The reduced problem is solved with classical heuristics and subsequently expanded back into the original space with feasibility corrections. Preliminary experiments on Solomon benchmark instances show that the proposed method reduces computation time while preserving or improving solution quality, particularly with respect to capacity and time window constraints. The paper also explores the integration of quantum-inspired optimization techniques, highlighting their potential to further accelerate large-scale vehicle routing tasks.
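A plausible form of the spatio-temporal aggregation metric is spatial distance plus a penalty for incompatible time windows; the weights and exact formulation below are our illustration, not necessarily the paper's metric.

```python
import math

def st_distance(c1, c2, alpha=1.0, beta=0.5):
    """Spatial distance plus a penalty for the gap between time windows."""
    spatial = math.hypot(c1["x"] - c2["x"], c1["y"] - c2["y"])
    overlap = min(c1["due"], c2["due"]) - max(c1["ready"], c2["ready"])
    temporal = max(0.0, -overlap)   # 0 if the [ready, due] windows overlap
    return alpha * spatial + beta * temporal

a = {"x": 0, "y": 0, "ready": 10, "due": 50}
b = {"x": 3, "y": 4, "ready": 40, "due": 90}   # windows overlap
c = {"x": 3, "y": 4, "ready": 70, "due": 90}   # 20-unit gap between windows
print(st_distance(a, b), st_distance(a, c))    # 5.0 15.0
```

Customers within a small st_distance of one another would then be merged into a meta-node before the reduced instance is handed to the heuristic solver.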
[575] LIFT: Interpretable truck driving risk prediction with literature-informed fine-tuned LLMs
Xiao Hu, Yuansheng Lian, Ke Zhang, Yunxuan Li, Yuelong Su, Meng Li
Main category: cs.AI
TL;DR: Proposes LIFT LLM framework for truck driving risk prediction that integrates literature knowledge and achieves superior performance with interpretable results.
Details
Motivation: To develop an interpretable prediction framework that combines LLMs with domain literature for accurate and explainable truck driving risk assessment.Method: Uses literature-informed fine-tuned LLMs with three components: LLM-driven Inference Core, Literature Processing Pipeline, and Result Evaluator. Fine-tuned on real-world truck driving data and literature from 299 domain papers.
Result: Achieved 26.7% higher recall and 10.1% higher F1-score than benchmarks. Produced consistent variable importance rankings and identified risky scenarios verified by PERMANOVA tests.
Conclusion: LIFT LLM framework successfully integrates literature knowledge for interpretable risk prediction and shows potential for data-driven knowledge discovery in transportation safety.
Abstract: This study proposes an interpretable prediction framework with literature-informed fine-tuned (LIFT) LLMs for truck driving risk prediction. The framework integrates an LLM-driven Inference Core that predicts and explains truck driving risk, a Literature Processing Pipeline that filters and summarizes domain-specific literature into a literature knowledge base, and a Result Evaluator that evaluates the prediction performance as well as the interpretability of the LIFT LLM. After fine-tuning on a real-world truck driving risk dataset, the LIFT LLM achieved accurate risk prediction, outperforming benchmark models by 26.7% in recall and 10.1% in F1-score. Furthermore, guided by the literature knowledge base automatically constructed from 299 domain papers, the LIFT LLM produced variable importance rankings consistent with those derived from the benchmark model, while demonstrating robustness in interpretation results under various data sampling conditions. The LIFT LLM also identified potential risky scenarios by detecting key combinations of variables in truck driving risk, which were verified by PERMANOVA tests. Finally, we demonstrated the contribution of the literature knowledge base and the fine-tuning process to the interpretability of the LIFT LLM, and discussed the potential of the LIFT LLM in data-driven knowledge discovery.
[576] DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry
Changti Wu, Shijie Lian, Zihao Liu, Lei Zhang, Laurence Tianruo Yang, Kai Chen
Main category: cs.AI
TL;DR: DynaSolidGeo is a dynamic benchmark for evaluating spatial reasoning in Vision-Language Models, addressing limitations of existing 2D geometry benchmarks by focusing on 3D solid geometry with process evaluation.
Details
Motivation: Existing multimodal math reasoning benchmarks focus on 2D plane geometry, use static datasets prone to contamination/memorization, and only evaluate final answers without considering reasoning processes.Method: Created through semi-automatic annotation pipeline with 503 expert-curated seed questions that can dynamically generate unlimited multimodal instances. Includes process evaluation based on expert-annotated reasoning chains to measure logical validity and causal coherence.
Result: Experiments show large performance gaps across VLMs, severe degradation in dynamic settings, and poor performance on high-level spatial intelligence tasks like mental rotation and visualization.
Conclusion: DynaSolidGeo addresses critical gaps in evaluating genuine spatial reasoning and reveals significant limitations in current VLMs’ spatial mathematical reasoning capabilities.
Abstract: Solid geometry problem solving demands spatial mathematical reasoning that integrates spatial intelligence and symbolic reasoning. However, most existing multimodal mathematical reasoning benchmarks focus primarily on 2D plane geometry, rely on static datasets prone to data contamination and memorization, and evaluate models solely by final answers, overlooking the reasoning process. To address these limitations, we introduce DynaSolidGeo, the first dynamic benchmark for evaluating genuine spatial reasoning in Vision-Language Models (VLMs). Constructed through a semi-automatic annotation pipeline, DynaSolidGeo contains 503 expert-curated seed questions that can, in principle, dynamically generate an unbounded number of diverse multimodal text-visual instances. Beyond answer accuracy, we incorporate process evaluation based on expert-annotated reasoning chains to measure logical validity and causal coherence. Experiments across representative open-source and closed-source VLMs reveal large performance gaps, severe degradation in dynamic settings, and poor performance on tasks requiring high-level spatial intelligence, such as mental rotation and visualization. The code and dataset are available at https://zgca-ai4edu.github.io/DynaSolidGeo/.
[577] Human-AI Collaboration: Trade-offs Between Performance and Preferences
Lukas William Mayer, Sheer Karny, Jackie Ayoub, Miao Song, Danyang Tian, Ehsan Moradi-Pari, Mark Steyvers
Main category: cs.AI
TL;DR: Human preferences favor AI agents that are considerate of human actions over purely performance-maximizing ones, with human-centric design improving likability without sacrificing performance.
Details
Motivation: To address the challenge of designing collaborative AI systems that effectively integrate human input by systematically examining human preferences for different collaborative agent strategies.Method: Created five collaborative AI agents with varying adaptation strategies to human actions, had participants interact with them, evaluated perceived traits, and used Bayesian modeling to analyze team performance and preference factors.
Result: Considerate agents that adapt to human actions are preferred over performance-maximizing ones, with human-centric design improving AI likability without reducing performance, driven by inequality-aversion effects.
Conclusion: AI collaboration benefits from development that includes both subjective and objective metrics, as people prefer agents that allow meaningful human contribution to the team.
Abstract: Despite the growing interest in collaborative AI, designing systems that seamlessly integrate human input remains a major challenge. In this study, we developed a task to systematically examine human preferences for collaborative agents. We created and evaluated five collaborative AI agents with strategies that differ in the manner and degree they adapt to human actions. Participants interacted with a subset of these agents, evaluated their perceived traits, and selected their preferred agent. We used a Bayesian model to understand how agents’ strategies influence the Human-AI team performance, AI’s perceived traits, and the factors shaping human preferences in pairwise agent comparisons. Our results show that agents who are more considerate of human actions are preferred over purely performance-maximizing agents. Moreover, we show that such human-centric design can improve the likability of AI collaborators without reducing performance. We find evidence for inequality-aversion effects being a driver of human choices, suggesting that people prefer collaborative agents which allow them to meaningfully contribute to the team. Taken together, these findings demonstrate how collaboration with AI can benefit from development efforts which include both subjective and objective metrics.
[578] Reasoning Models Reason Well, Until They Don’t
Revanth Rameshkumar, Jimson Huang, Yunxin Sun, Fei Xia, Abulhair Saparov
Main category: cs.AI
TL;DR: LRMs show impressive reasoning on current benchmarks but fail catastrophically when problem complexity scales beyond training distribution, revealing limited generalization capabilities despite near-term utility.
Details
Motivation: To investigate whether large reasoning models (LRMs) truly possess generalized reasoning capabilities or if their apparent success is limited to problems within their training complexity distribution.Method: Created Deep Reasoning Dataset (DeepRD) with generative process for scalable complexity; evaluated LRMs on graph connectivity and natural language proof planning across complexity levels; compared with real-world knowledge graphs and proof datasets.
Result: LRM performance drops abruptly at sufficient complexity and fails to generalize; most real-world examples fall within LRMs’ success regime but long tails expose substantial failure potential.
Conclusion: LRMs have near-term utility but lack true generalization; new methods are needed that can handle complexity beyond training distribution.
Abstract: Large language models (LLMs) have shown significant progress in reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) – LLMs fine-tuned with incentives for step-by-step argumentation and self-verification. LRM performance on graph and reasoning benchmarks such as NLGraph seems extraordinary, with some even claiming they are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show existing benchmarks actually have limited complexity. We develop a new dataset, the Deep Reasoning Dataset (DeepRD), along with a generative process for producing unlimited examples of scalable complexity. We use this dataset to evaluate model performance on graph connectivity and natural language proof planning. We find that the performance of LRMs drops abruptly at sufficient complexity and does not generalize. We also relate our LRM results to the distributions of the complexities of large, real-world knowledge graphs, interaction graphs, and proof datasets. We find the majority of real-world examples fall inside the LRMs’ success regime, yet the long tails expose substantial failure potential. Our analysis highlights the near-term utility of LRMs while underscoring the need for new methods that generalize beyond the complexity of examples in the training distribution.
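A DeepRD-style generator can be sketched from the abstract's description: plant a chain of controlled length so the connectivity question has a witness path of known depth, then add distractor edges. The construction below is our toy version, not the released dataset code.

```python
import random

def connectivity_instance(depth: int, n_distractors: int = 5, seed: int = 0):
    """Build an edge list whose answer path from v0 to v<depth> has a
    controlled length, plus distractor edges that never touch the chain."""
    rng = random.Random(seed)
    edges = [(f"v{i}", f"v{i + 1}") for i in range(depth)]   # witness chain
    extras = [f"u{i}" for i in range(n_distractors)]
    for _ in range(n_distractors):
        edges.append((rng.choice(extras), rng.choice(extras)))
    rng.shuffle(edges)
    return edges, f"Is there a path from v0 to v{depth}?", True

edges, question, answer = connectivity_instance(depth=8)
print(question, "->", answer, f"({len(edges)} edges)")
```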
[579] Modeling Hierarchical Thinking in Large Reasoning Models
G M Shahariar, Ali Nazari, Erfan Shayegani, Nael Abu-Ghazaleh
Main category: cs.AI
TL;DR: The paper proposes using Finite State Machines (FSM) to model and analyze the hierarchical reasoning dynamics of Large Reasoning Models (LRMs) by identifying discrete reasoning states and visualizing reasoning trajectories.
Details
Motivation: To better understand the emerging hierarchical reasoning capabilities of Large Reasoning Models (LRMs) trained with chain-of-thought examples, which remains a difficult open problem with applications for improving training and understanding robustness.Method: Adopt a memoryless Finite State Machine formulation to approximate LRM’s reasoning dynamics, identifying discrete reasoning states (initialization, deduction, augmentation-strategy, uncertainty-estimation, backtracking, final-conclusion) and representing reasoning trajectories as state transitions.
Result: The FSM-based analysis reveals distinct reasoning patterns and potential shortcomings in different models, providing a systematic way to interpret and visualize how models approach problems.
Conclusion: The FSM formulation offers a new interpretable abstraction for analyzing LLM reasoning, enabling better evaluation and improvement of reasoning capabilities through structured state-based analysis.
Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities when they generate step-by-step solutions, known as chain-of-thought (CoT) reasoning. When trained using chain-of-thought reasoning examples, the resulting models (called Large Reasoning Models, or LRMs) appear to learn hierarchical thinking strategies similar to those used by humans. However, understanding LRMs’ emerging reasoning capabilities remains a difficult open problem, with many potentially important applications including improving training and understanding robustness. In this paper, we adopt a memoryless Finite State Machine formulation to approximate LRMs’ emerging hierarchical reasoning dynamics as a structured, interpretable abstraction. We identify a small set of discrete reasoning states, including initialization, deduction, augmentation-strategy, uncertainty-estimation, backtracking, and final-conclusion, that capture the high-level states present in the model’s reasoning process. By annotating each step of a model’s CoT with these states, we can represent the reasoning trajectory as a transition sequence through the state graph. This FSM formulation provides a systematic way to analyze, interpret and visualize how different models approach problems. We describe the FSM model, provide examples of CoT annotations under this scheme, and discuss how it can shed light on differences between available models in their approach to reasoning. Our results demonstrate that this FSM-based analysis reveals distinct reasoning patterns and potential shortcomings, offering a new lens to evaluate and improve LLM reasoning.
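The FSM abstraction is straightforward to operationalize: label each CoT step with one of the six states and study the empirical transition structure. A sketch using the paper's state names (the annotation step itself, done by a classifier or by hand, is elided):

```python
from collections import Counter
from enum import Enum, auto
from itertools import pairwise  # Python 3.10+

class State(Enum):
    INITIALIZATION = auto()
    DEDUCTION = auto()
    AUGMENTATION_STRATEGY = auto()
    UNCERTAINTY_ESTIMATION = auto()
    BACKTRACKING = auto()
    FINAL_CONCLUSION = auto()

def transition_counts(trajectory: list) -> Counter:
    """Count state -> state transitions along one annotated CoT."""
    return Counter(pairwise(trajectory))

traj = [State.INITIALIZATION, State.DEDUCTION, State.UNCERTAINTY_ESTIMATION,
        State.BACKTRACKING, State.DEDUCTION, State.FINAL_CONCLUSION]
for (src, dst), n in transition_counts(traj).items():
    print(f"{src.name} -> {dst.name}: {n}")
```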
[580] What Is Your AI Agent Buying? Evaluation, Implications and Emerging Questions for Agentic E-Commerce
Amine Allouah, Omar Besbes, Josué D Figueroa, Yash Kanoria, Akshit Kumar
Main category: cs.AI
TL;DR: AI agents in e-commerce make different purchasing decisions than humans, with varying preferences for product positions, attributes, and descriptions that can be exploited by seller-side agents.
Details
Motivation: To understand how autonomous AI agents behave in online marketplaces and what drives their purchasing decisions, as they will transform e-commerce by acting on behalf of consumers.Method: Developed ACES - a sandbox environment with platform-agnostic agents and programmable mock marketplace, conducted rationality checks, randomized experiments on product positions and attributes, and tested seller-side agent manipulation.
Result: AI agents show heterogeneous preferences: all favor top row but prefer different columns, penalize sponsored tags, reward endorsements, vary in price/rating sensitivity, and seller agents can gain substantial market share by targeting AI preferences.
Conclusion: AI agent behavior differs significantly from humans, raising important questions about seller strategies, platform design, and regulation in AI-driven e-commerce markets.
Abstract: Online marketplaces will be transformed by autonomous AI agents acting on behalf of consumers. Rather than humans browsing and clicking, AI agents can parse webpages or interact through APIs to evaluate products, and transact. This raises a fundamental question: what do AI agents buy, and why? We develop ACES, a sandbox environment that pairs a platform-agnostic agent with a fully programmable mock marketplace to study this. We first explore aggregate choices, revealing that modal choices can differ across models, with AI agents sometimes concentrating on a few products, raising competition questions. We then analyze the drivers of choices through rationality checks and randomized experiments on product positions and listing attributes. Models show sizeable and heterogeneous position effects: all favor the top row, yet different models prefer different columns, undermining the assumption of a universal “top” rank. They penalize sponsored tags, reward endorsements, and sensitivities to price, ratings, and reviews are directionally as expected, but vary sharply across models. Finally, we find that a seller-side agent that makes minor tweaks to product descriptions can deliver substantial market-share gains by targeting AI buyer preferences. Our findings reveal how AI agents behave in e-commerce, and surface concrete seller strategy, platform design, and regulatory questions.
[581] Learning “Partner-Aware” Collaborators in Multi-Party Collaboration
Abhijnan Nath, Nikhil Krishnaswamy
Main category: cs.AI
TL;DR: The paper proposes Interruptible Collaborative Roleplayer (ICR), a novel algorithm to train LLM agents that can effectively collaborate with human partners by intelligently responding to interventions and increasing group common ground.
Details
Motivation: LLMs are increasingly deployed as collaborators with humans, making it important to evaluate their ability to collaborate effectively in multi-turn, multi-party tasks. Standard RLHF-trained agents tend to ignore interventions from partners, hindering group common ground.Method: Uses a two-player Modified-Action MDP to analyze suboptimal behavior of standard AI agents. Proposes ICR algorithm to train partner-aware collaborators that can intelligently collect information from partner interventions to increase common-ground alignment.
Result: Experiments show ICR is more capable of promoting successful common-ground convergence and exploring diverse solutions in collaborative tasks compared to standard approaches.
Conclusion: ICR provides an effective approach for training LLM agents that can better collaborate with partners by being responsive to interventions and improving group common ground, addressing limitations of standard RLHF-trained agents.
Abstract: Large Language Models (LLMs) are increasingly being deployed in agentic settings where they act as collaborators with humans. Therefore, it is increasingly important to be able to evaluate their abilities to collaborate effectively in multi-turn, multi-party tasks. In this paper, we build on the AI alignment and safe interruptibility literature to offer novel theoretical insights on collaborative behavior between LLM-driven collaborator agents and an intervention agent. Our goal is to learn an ideal partner-aware collaborator that increases the group’s common ground (CG), i.e., alignment on task-relevant propositions, by intelligently collecting information provided in interventions by a partner agent. We show how LLM agents trained using standard RLHF and related approaches are naturally inclined to ignore possibly well-meaning interventions, which makes increasing group common ground non-trivial in this setting. We employ a two-player Modified-Action MDP to examine this suboptimal behavior of standard AI agents, and propose the Interruptible Collaborative Roleplayer (ICR), a novel partner-aware learning algorithm to train CG-optimal collaborators. Experiments on multiple collaborative task environments show that ICR, on average, is more capable of promoting successful CG convergence and exploring more diverse solutions in such tasks.
[582] MI9: An Integrated Runtime Governance Framework for Agentic AI
Charles L. Wang, Trisha Singhal, Ameya Kelkar, Jason Tuo
Main category: cs.AI
TL;DR: MI9 is the first integrated runtime governance framework for agentic AI systems, addressing emergent risks through real-time monitoring, authorization, conformance checking, and containment strategies.
Details
Motivation: Agentic AI systems exhibit unpredictable emergent behaviors during runtime that cannot be fully managed by pre-deployment governance alone, creating novel safety and alignment risks.Method: MI9 uses six integrated components: agency-risk index, semantic telemetry capture, continuous authorization monitoring, FSM-based conformance engines, goal-conditioned drift detection, and graduated containment strategies.
Result: The framework enables systematic, safe deployment of agentic systems in production environments, covering governance challenges that existing approaches fail to address.
Conclusion: MI9 establishes the technical foundation for comprehensive agentic AI oversight and provides infrastructure for safe agentic AI deployment at scale.
Abstract: Agentic AI systems capable of reasoning, planning, and executing actions present fundamentally distinct governance challenges compared to traditional AI models. Unlike conventional AI, these systems exhibit emergent and unexpected behaviors during runtime, introducing novel agent-related risks that cannot be fully anticipated through pre-deployment governance alone. To address this critical gap, we introduce MI9, the first fully integrated runtime governance framework designed specifically for safety and alignment of agentic AI systems. MI9 introduces real-time controls through six integrated components: agency-risk index, agent-semantic telemetry capture, continuous authorization monitoring, Finite-State-Machine (FSM)-based conformance engines, goal-conditioned drift detection, and graduated containment strategies. Operating transparently across heterogeneous agent architectures, MI9 enables the systematic, safe, and responsible deployment of agentic systems in production environments where conventional governance approaches fall short, providing the foundational infrastructure for safe agentic AI deployment at scale. Detailed analysis through a diverse set of scenarios demonstrates MI9’s systematic coverage of governance challenges that existing approaches fail to address, establishing the technical foundation for comprehensive agentic AI oversight.
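Of the six components, the FSM-based conformance engine is the easiest to illustrate: a whitelist of allowed state transitions, with any off-policy transition flagged for containment. The states and transitions below are our toy example, not MI9's actual policy language:

```python
ALLOWED = {
    ("idle", "plan"),
    ("plan", "call_tool"),
    ("call_tool", "plan"),
    ("plan", "respond"),
    ("respond", "idle"),
}

def conforms(trace: list) -> bool:
    """True iff every consecutive pair of agent states is a permitted move."""
    return all(step in ALLOWED for step in zip(trace, trace[1:]))

ok = ["idle", "plan", "call_tool", "plan", "respond", "idle"]
bad = ["idle", "respond"]            # skipped planning: flag for containment
print(conforms(ok), conforms(bad))   # True False
```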
[583] OFFSIDE: Benchmarking Unlearning Misinformation in Multimodal Large Language Models
Hao Zheng, Zirui Pang, Ling li, Zhijie Deng, Yuhan Pu, Zhaowei Zhu, Xiaobo Xia, Jiaheng Wei
Main category: cs.AI
TL;DR: OFFSIDE is a benchmark for evaluating misinformation unlearning in MLLMs using football transfer rumors, revealing vulnerabilities in current unlearning methods.
Details
Motivation: Address data privacy concerns in MLLMs and limitations of existing MU benchmarks (lack of image diversity, inaccuracies, insufficient evaluation scenarios).Method: Created OFFSIDE benchmark with 15.68K manually curated records for 80 players, featuring four test sets for forgetting efficacy, generalization, utility, and robustness. Supports selective unlearning, corrective relearning, and unimodal unlearning.
Result: Key findings: unimodal methods fail on multimodal rumors; unlearning efficacy driven by catastrophic forgetting; all methods struggle with visual rumors; unlearned rumors easily recoverable; all methods vulnerable to prompt attacks.
Conclusion: Current approaches have significant vulnerabilities, highlighting need for more robust multimodal unlearning solutions.
Abstract: Advances in Multimodal Large Language Models (MLLMs) intensify concerns about data privacy, making Machine Unlearning (MU), the selective removal of learned information, a critical necessity. However, existing MU benchmarks for MLLMs are limited by a lack of image diversity, potential inaccuracies, and insufficient evaluation scenarios, which fail to capture the complexity of real-world applications. To facilitate the development of MLLM unlearning and alleviate the aforementioned limitations, we introduce OFFSIDE, a novel benchmark for evaluating misinformation unlearning in MLLMs based on football transfer rumors. This manually curated dataset contains 15.68K records for 80 players, providing a comprehensive framework with four test sets to assess forgetting efficacy, generalization, utility, and robustness. OFFSIDE supports advanced settings like selective unlearning and corrective relearning, and crucially, unimodal unlearning (forgetting only text data). Our extensive evaluation of multiple baselines reveals key findings: (1) Unimodal methods (erasing text-based knowledge) fail on multimodal rumors; (2) Unlearning efficacy is largely driven by catastrophic forgetting; (3) All methods struggle with “visual rumors” (rumors that appear in the image); (4) The unlearned rumors can be easily recovered; and (5) All methods are vulnerable to prompt attacks. These results expose significant vulnerabilities in current approaches, highlighting the need for more robust multimodal unlearning solutions. The code is available at https://github.com/zh121800/OFFSIDE.
[584] A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning
Anjie Liu, Jianhong Wang, Samuel Kaski, Jun Wang, Mengyue Yang
Main category: cs.AI
TL;DR: This paper proposes using multi-agent influence diagrams (MAIDs) to address challenges in steering cooperative multi-agent reinforcement learning (MARL) by introducing targeted intervention paradigms and Pre-Strategy Intervention (PSI) techniques.
Details
Motivation: Steering cooperative MARL towards desired outcomes is challenging when global human guidance is impractical in large-scale systems, and existing external coordination mechanisms lack easy-to-use research tools.Method: The authors employ MAIDs as a graphical framework to analyze MARL interaction paradigms, design targeted intervention applied to single agents, and implement Pre-Strategy Intervention (PSI) using causal inference techniques.
Result: The proposed targeted intervention paradigm mitigates global guidance problems, and MAIDs provide tools to identify workable MARL learning paradigms through relevance graph analysis.
Conclusion: The paper demonstrates effective targeted intervention and validates relevance graph analysis, showing MAIDs as a practical framework for coordinating multi-agent systems without requiring global guidance.
Abstract: Steering cooperative multi-agent reinforcement learning (MARL) towards desired outcomes is challenging, particularly when global guidance from a human over the whole multi-agent system is impractical in large-scale MARL. On the other hand, designing external mechanisms (e.g., intrinsic rewards and human feedback) to coordinate agents mostly relies on empirical studies, lacking an easy-to-use research tool. In this work, we employ multi-agent influence diagrams (MAIDs) as a graphical framework to address the above issues. First, we introduce the concept of MARL interaction paradigms (orthogonal to MARL learning paradigms), using MAIDs to analyze and visualize both unguided self-organization and global guidance mechanisms in MARL. Then, we design a new MARL interaction paradigm, referred to as the targeted intervention paradigm, which is applied to only a single targeted agent, so the problem of global guidance can be mitigated. In implementation, we introduce a causal inference technique, referred to as Pre-Strategy Intervention (PSI), to realize the targeted intervention paradigm. Since MAIDs can be regarded as a special class of causal diagrams, a composite desired outcome that integrates the primary task goal and an additional desired outcome can be achieved by maximizing the corresponding causal effect through the PSI. Moreover, the bundled relevance graph analysis of MAIDs provides a tool to identify whether a MARL learning paradigm is workable under the design of a MARL interaction paradigm. In experiments, we demonstrate the effectiveness of our proposed targeted intervention and verify the result of the relevance graph analysis.
[585] ATOM: AdapTive and OptiMized dynamic temporal knowledge graph construction using LLMs
Yassir Lairgi, Ludovic Moncla, Khalid Benabdeslem, Rémy Cazabet, Pierre Cléau
Main category: cs.AI
TL;DR: ATOM is a few-shot, scalable approach for building and continuously updating Temporal Knowledge Graphs from unstructured text by splitting documents into atomic facts and using dual-time modeling.
Details
Motivation: Traditional static knowledge graphs lack adaptability to dynamic, time-sensitive data, and recent zero/few-shot approaches suffer from instability and incomplete coverage of key facts.Method: Splits documents into minimal atomic facts, constructs atomic TKGs with dual-time modeling (distinguishing observation time from validity time), and merges them in parallel.
Result: Achieves ~18% higher exhaustivity, ~17% better stability, and over 90% latency reduction compared to baseline methods.
Conclusion: ATOM demonstrates strong scalability potential for dynamic TKG construction with improved performance across key metrics.
Abstract: In today’s rapidly expanding data landscape, knowledge extraction from unstructured text is vital for real-time analytics, temporal inference, and dynamic memory frameworks. However, traditional static knowledge graph (KG) construction often overlooks the dynamic and time-sensitive nature of real-world data, limiting adaptability to continuous changes. Moreover, recent zero- or few-shot approaches that avoid domain-specific fine-tuning or reliance on prebuilt ontologies often suffer from instability across multiple runs, as well as incomplete coverage of key facts. To address these challenges, we introduce ATOM (AdapTive and OptiMized), a few-shot and scalable approach that builds and continuously updates Temporal Knowledge Graphs (TKGs) from unstructured texts. ATOM splits input documents into minimal, self-contained “atomic” facts, improving extraction exhaustivity and stability. Then, it constructs atomic TKGs from these facts while employing a dual-time modeling that distinguishes when information is observed from when it is valid. The resulting atomic TKGs are subsequently merged in parallel. Empirical evaluations demonstrate that ATOM achieves ~18% higher exhaustivity, ~17% better stability, and over 90% latency reduction compared to baseline methods, demonstrating a strong scalability potential for dynamic TKG construction.
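To make the dual-time idea concrete, here is a minimal Python sketch of an atomic fact that separates observation time from validity time, with a merge step that keeps all temporal versions rather than overwriting. The names and fields are hypothetical, not ATOM's actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class AtomicFact:
    """A minimal, self-contained fact with dual-time annotations:
    t_obs is when the statement was observed (document time), while
    valid_from/valid_to bound when the statement holds in the world."""
    subject: str
    relation: str
    obj: str
    t_obs: date
    valid_from: date | None = None
    valid_to: date | None = None

def merge(tkg: dict, fact: AtomicFact) -> dict:
    """Merge an atomic fact into a TKG keyed by (subject, relation),
    keeping every temporal version of the relation."""
    tkg.setdefault((fact.subject, fact.relation), []).append(fact)
    return tkg

tkg = {}
merge(tkg, AtomicFact("Alice", "works_at", "AcmeCorp",
                      t_obs=date(2023, 5, 1), valid_from=date(2021, 1, 1)))
merge(tkg, AtomicFact("Alice", "works_at", "BetaLabs",
                      t_obs=date(2024, 2, 1), valid_from=date(2024, 1, 15)))
print(tkg[("Alice", "works_at")])
```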
[586] A Framework for Quantifying How Pre-Training and Context Benefit In-Context Learning
Bingqing Song, Jiaxiang Li, Rong Wang, Songtao Lu, Mingyi Hong
Main category: cs.AI
TL;DR: This paper analyzes how in-context learning (ICL) emerges in pre-trained language models, focusing on the role of pre-training procedures and context construction. It provides theoretical analysis and experiments showing how properly constructed context can shift output distributions toward query tasks.
Details
Motivation: To understand how in-context learning capabilities arise in pre-trained language models, particularly the precise role of pre-training procedures and context construction, since current applications leverage these capabilities without clear theoretical understanding.Method: Proposes a new analytical framework for ICL performance in realistic settings including network architectures, data encoding, data generation, and prompt construction. Begins with a simple one-layer transformer example, then extends to more general cases, deriving relationships between ICL performance, context length, and KL divergence between pre-train and query distributions.
Result: Shows that when pre-train data distribution differs from query task distribution, properly constructed context can shift output distribution toward query task distribution in a quantifiable manner, leading to accurate prediction. Derives precise relationship between ICL performance, context length, and KL divergence.
Conclusion: Provides theoretical foundation for understanding in-context learning mechanisms, demonstrating how context construction can bridge distribution gaps between pre-training and query tasks, with experimental validation of theoretical results.
Abstract: Pre-trained large language models have demonstrated a strong ability to learn from context, known as in-context learning (ICL). Despite a surge of recent applications that leverage such capabilities, it is by no means clear, at least theoretically, how the ICL capabilities arise, and in particular, what is the precise role played by key factors such as pre-training procedure as well as context construction. In this work, we propose a new framework to analyze the ICL performance, for a class of realistic settings, which includes network architectures, data encoding, data generation, and prompt construction process. As a first step, we construct a simple example with a one-layer transformer, and show an interesting result, namely when the pre-train data distribution is different from the query task distribution, a properly constructed context can shift the output distribution towards the query task distribution, in a quantifiable manner, leading to accurate prediction on the query topic. We then extend the findings in the previous step to a more general case, and derive the precise relationship between ICL performance, context length and the KL divergence between pre-train and query task distribution. Finally, we provide experiments to validate our theoretical results.
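As a toy numerical illustration of the claimed relationship, the snippet below treats the pre-train and query tasks as categorical distributions and shows KL(query || output) shrinking as longer context pulls the output distribution toward the query task. The mixing rule is an illustrative stand-in, not the paper's derivation.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for categorical distributions given as 1-D arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical task distributions over four topics.
pretrain = np.array([0.70, 0.10, 0.10, 0.10])  # pre-training favors topic 0
query    = np.array([0.05, 0.85, 0.05, 0.05])  # query task centers on topic 1

# A toy "context shift": mix the output distribution toward the query
# distribution as context length n grows (illustrative update rule only).
for n in [0, 2, 8, 32]:
    w = n / (n + 4)                       # weight of in-context evidence
    output = (1 - w) * pretrain + w * query
    print(f"context length {n:2d}: KL(query || output) = "
          f"{kl_divergence(query, output):.3f}")
```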
[587] CLIN-LLM: A Safety-Constrained Hybrid Framework for Clinical Diagnosis and Treatment Generation
Md. Mehedi Hasan, Rafid Mostafiz, Md. Abir Hossain, Bikash Kumar Paul
Main category: cs.AI
TL;DR: CLIN-LLM is a safety-constrained hybrid pipeline for symptom-to-disease classification and treatment recommendations that integrates multimodal patient encoding, uncertainty-calibrated disease classification, and retrieval-augmented treatment generation with safety constraints.
Details
Motivation: Existing LLM-based medical systems lack medical grounding and fail to quantify uncertainty, resulting in unsafe outputs, especially in heterogeneous patient settings with high diagnostic risk.Method: Fine-tunes BioBERT on clinical cases with Focal Loss and Monte Carlo Dropout for confidence-aware predictions, uses Biomedical Sentence-BERT for evidence retrieval from MedDialog corpus, and fine-tunes FLAN-T5 for personalized treatment generation with RxNorm post-processing for antibiotic stewardship and DDI screening.
Result: Achieves 98% accuracy and F1 score, outperforming ClinicalBERT by 7.1%, with 78% top-5 retrieval precision, clinician-rated validity of 4.2/5, and reduces unsafe antibiotic suggestions by 67% compared to GPT-5.
Conclusion: CLIN-LLM provides a deployable, human-in-the-loop decision support framework with robustness, interpretability, and clinical safety alignment for resource-limited healthcare environments.
Abstract: Accurate symptom-to-disease classification and clinically grounded treatment recommendations remain challenging, particularly in heterogeneous patient settings with high diagnostic risk. Existing large language model (LLM)-based systems often lack medical grounding and fail to quantify uncertainty, resulting in unsafe outputs. We propose CLIN-LLM, a safety-constrained hybrid pipeline that integrates multimodal patient encoding, uncertainty-calibrated disease classification, and retrieval-augmented treatment generation. The framework fine-tunes BioBERT on 1,200 clinical cases from the Symptom2Disease dataset and incorporates Focal Loss with Monte Carlo Dropout to enable confidence-aware predictions from free-text symptoms and structured vitals. Low-certainty cases (18%) are automatically flagged for expert review, ensuring human oversight. For treatment generation, CLIN-LLM employs Biomedical Sentence-BERT to retrieve top-k relevant dialogues from the 260,000-sample MedDialog corpus. The retrieved evidence and patient context are fed into a fine-tuned FLAN-T5 model for personalized treatment generation, followed by post-processing with RxNorm for antibiotic stewardship and drug-drug interaction (DDI) screening. CLIN-LLM achieves 98% accuracy and F1 score, outperforming ClinicalBERT by 7.1% (p < 0.001), with 78% top-5 retrieval precision and a clinician-rated validity of 4.2 out of 5. Unsafe antibiotic suggestions are reduced by 67% compared to GPT-5. These results demonstrate CLIN-LLM’s robustness, interpretability, and clinical safety alignment. The proposed system provides a deployable, human-in-the-loop decision support framework for resource-limited healthcare environments. Future work includes integrating imaging and lab data, multilingual extensions, and clinical trial validation.
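The confidence-gating step can be sketched with standard Monte Carlo Dropout in PyTorch: keep dropout active at inference, average the sampled softmax outputs, and flag low-confidence cases for expert review. The toy classifier head, embedding size, and 0.80 threshold are assumptions for illustration, not CLIN-LLM's trained model.

```python
import torch
import torch.nn as nn

class MCDropoutHead(nn.Module):
    """Toy classifier head with dropout that stays active at inference."""
    def __init__(self, dim_in=768, n_classes=10, p=0.3):
        super().__init__()
        self.drop = nn.Dropout(p)
        self.fc = nn.Linear(dim_in, n_classes)

    def forward(self, x):
        return self.fc(self.drop(x))

def mc_dropout_predict(model, x, n_samples=30):
    model.train()  # keep dropout stochastic during inference
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(-1) for _ in range(n_samples)])
    mean = probs.mean(0)          # average over MC samples
    conf, pred = mean.max(-1)     # predictive confidence and class
    return pred, conf

model = MCDropoutHead()
x = torch.randn(4, 768)                 # stand-in for BioBERT [CLS] embeddings
pred, conf = mc_dropout_predict(model, x)
flag_for_review = conf < 0.80           # hypothetical certainty threshold
print(pred, conf, flag_for_review)
```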
[588] SwiftSolve: A Self-Iterative, Complexity-Aware Multi-Agent Framework for Competitive Programming
Adhyayan Veer Singh, Aaron Shen, Brian Law, Ahmed Ismail, Jonas Rohweder, Sean O’Brien, Kevin Zhu
Main category: cs.AI
TL;DR: SwiftSolve is a multi-agent system for competitive programming that combines algorithmic planning with empirical profiling and complexity-guided repair to ensure programs meet both correctness and efficiency requirements.
Details
Motivation: LLM-generated programs often pass unit tests but fail to meet contest time or memory constraints, highlighting the need for systems that ensure both correctness and efficiency in competitive programming.Method: Uses specialized agents (Planner, Static Pruner, Coder, Profiler, Complexity Analyst) that communicate via typed JSON. The system performs algorithmic planning, empirical profiling, complexity analysis, and targeted repairs through an iterative process with iteration caps.
Result: Achieved 61.54% pass@1 and 80.77% Solved@<=3 on 26 problems, with 73.08% aggregate run-level success. Outperformed Claude Opus 4 (73.1% vs 52.6%) with 2x runtime overhead.
Conclusion: SwiftSolve demonstrates that profiling and complexity-guided replanning effectively reduce inefficiency while maintaining accuracy, addressing the critical gap between program correctness and efficiency in competitive programming.
Abstract: Correctness alone is insufficient: LLM-generated programs frequently satisfy unit tests while violating contest time or memory budgets. We present SwiftSolve, a complexity-aware multi-agent system for competitive programming that couples algorithmic planning with empirical profiling and complexity-guided repair. We frame competitive programming as a software environment where specialized agents act as programmers, each assuming roles such as planning, coding, profiling, and complexity analysis. A Planner proposes an algorithmic sketch; a deterministic Static Pruner filters high-risk plans; a Coder emits ISO C++17; a Profiler compiles and executes candidates on a fixed input-size schedule to record wall time and peak memory; and a Complexity Analyst fits log-log growth (slope s, R^2) with an LLM fallback to assign a complexity class and dispatch targeted patches to either the Planner or Coder. Agents communicate via typed, versioned JSON; a controller enforces iteration caps and diminishing-returns stopping. Evaluated on 26 problems (16 BigO, 10 Codeforces Div. 2) in a POSIX sandbox (2 s / 256-512 MB), SwiftSolve attains pass@1 = 61.54% (16/26) on the first attempt and Solved@<=3 = 80.77% with marginal latency change (mean 11.96 s to 12.66 s per attempt). Aggregate run-level success is 73.08% at 12.40 s mean. Failures are predominantly resource-bound, indicating inefficiency rather than logic errors. Against Claude Opus 4, SwiftSolve improves run-level success (73.1% vs 52.6%) at approximately 2x runtime overhead (12.4 s vs 6.8 s). Beyond correctness (pass@k), we report efficiency metrics (eff@k for runtime and memory, incidence of TLE or MLE, and complexity fit accuracy on BigO), demonstrating that profiling and complexity-guided replanning reduce inefficiency while preserving accuracy.
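The Complexity Analyst's log-log fit can be approximated in a few lines of numpy: regress log(time) on log(n), read off the slope s and R^2, and map the exponent to a complexity class. The class thresholds and timing data below are illustrative, not SwiftSolve's actual configuration.

```python
import numpy as np

def fit_complexity(sizes, times):
    """Fit runtime ~ c * n^s on a log-log scale; return slope s and R^2."""
    x, y = np.log(np.asarray(sizes, float)), np.log(np.asarray(times, float))
    s, b = np.polyfit(x, y, 1)
    y_hat = s * x + b
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return s, 1.0 - ss_res / ss_tot

def label(slope):
    # crude mapping from fitted exponent to a complexity class
    if slope < 0.5:  return "O(1) / O(log n)"
    if slope < 1.5:  return "O(n)"
    if slope < 2.5:  return "O(n^2)"
    return "O(n^3) or worse"

sizes = [1_000, 10_000, 100_000, 1_000_000]
times = [0.002, 0.021, 0.24, 2.6]   # hypothetical wall times in seconds
s, r2 = fit_complexity(sizes, times)
print(f"slope={s:.2f}, R^2={r2:.3f}, class={label(s)}")
```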
[589] Do Stop Me Now: Detecting Boilerplate Responses with a Single Iteration
Yuval Kainan, Shaked Zychlinski
Main category: cs.AI
TL;DR: A method to detect boilerplate LLM responses using first-token log-probabilities, enabling early termination to reduce computational costs.
Details
Motivation: LLMs waste computational resources generating boilerplate responses like refusals and greetings, adding unnecessary cost and latency.Method: Use first-token log-probability distribution as signal to classify response type; apply lightweight k-NN classifier for prediction.
Result: First-token log-probability vectors form separable clusters for different response types; high accuracy in predicting substantive vs boilerplate responses.
Conclusion: Practical technique for optimizing LLM inference through early termination, yielding significant computational savings for more efficient deployment.
Abstract: Large Language Models (LLMs) often expend significant computational resources generating boilerplate responses, such as refusals, simple acknowledgements and casual greetings, which adds unnecessary cost and latency. To address this inefficiency, we propose a simple yet highly effective method for detecting such responses after only a single generation step. We demonstrate that the log-probability distribution of the first generated token serves as a powerful signal for classifying the nature of the entire subsequent response. Our experiments, conducted across a diverse range of small, large, and reasoning-specialized models, show that the first-token log-probability vectors form distinctly separable clusters for different response types. Using a lightweight k-NN classifier, we achieve high accuracy in predicting whether a response will be a substantive answer or a form of boilerplate response, including user-specified refusals. The primary implication is a practical, computationally trivial technique, optimizing LLM inference by enabling early termination or redirection to a smaller model, thereby yielding significant savings in computational cost. This work presents a direct path toward more efficient and sustainable LLM deployment.
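A minimal sketch of the proposed classifier, assuming synthetic stand-ins for the first-token log-probability vectors: fit a k-NN model on labeled vectors, then query it after a single generation step to decide whether to terminate early.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: first-token log-probability vectors (one row
# per prompt, vocab-sized or top-k truncated) with response-type labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 50))      # stand-in for real logprob vectors
y_train = rng.integers(0, 2, size=200)    # 0 = substantive, 1 = boilerplate

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

def should_terminate_early(first_token_logprobs):
    """Predict from a single generation step whether the full response will
    be boilerplate; if so, generation can stop or be rerouted to a smaller
    model."""
    return clf.predict(first_token_logprobs.reshape(1, -1))[0] == 1

print(should_terminate_early(rng.normal(size=50)))
```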
[590] RaCoT: Plug-and-Play Contrastive Example Generation Mechanism for Enhanced LLM Reasoning Reliability
Kaitong Cai, Jusheng Zhang, Yijia Fan, Jing Yang, Keze Wang
Main category: cs.AI
TL;DR: RaCoT is a novel RAG framework that shifts contrastive thinking to pre-retrieval stage by generating contrastive questions and extracting key differences to guide the model, improving performance and efficiency on knowledge-sparse queries.
Details
Motivation: To address the bottleneck in RAG systems with knowledge-sparse and semantically ambiguous long-tail queries, where retrieval noise distorts reasoning and requires costly post-processing.Method: Generates semantically adjacent but differently answered contrastive questions, extracts a Δ-Prompt to capture key differences, and guides the model to focus on critical details that determine answer divergence in a single retrieval pass.
Result: Outperforms strong baselines by 0.9-2.4 percentage points on six benchmarks, shows superior robustness with only 8.6% performance drop in adversarial tests, and achieves low latency (3.12s) with minimal token overhead (11.54).
Conclusion: RaCoT reframes RAG from post-hoc context cleaning to a priori shaping of discriminative reasoning, offering an efficient and robust path for reliable AI systems in real-time, resource-constrained deployments.
Abstract: Retrieval-Augmented Generation (RAG) faces a core bottleneck with knowledge-sparse and semantically ambiguous long-tail queries, where retrieval noise distorts reasoning and necessitates costly post-processing. To tackle this, we propose RaCoT (Retrieval-aware Contrastive-of-Thought), a novel framework that shifts contrastive thinking to the pre-retrieval stage. By automatically generating a semantically adjacent yet differently answered contrastive question and extracting a $\Delta$-Prompt to capture their key differences, RaCoT guides the model to proactively focus on “the critical details that determine answer divergence.” This approach allows it to suppress semantic interference within a single retrieval pass, overcoming the theoretical bottleneck of single-vector queries that struggle to simultaneously encode signals for what to attend to and what to ignore. On six authoritative benchmarks, including PopQA and TriviaQA-unfiltered, RaCoT outperforms strong baselines like RankRAG and Self-RAG by 0.9-2.4 percentage points. It exhibits superior robustness, with a performance drop of only 8.6% in adversarial tests, far surpassing the over 15% degradation in other methods. Furthermore, its low latency (3.12s) and token overhead (11.54) place it on the accuracy-efficiency Pareto frontier, while ablation studies validate the necessity of each component. Ultimately, RaCoT reframes the RAG paradigm from “post-hoc context cleaning” to “a priori shaping of discriminative reasoning,” offering an efficient and robust path toward reliable AI systems for real-time, resource-constrained deployments.
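A rough sketch of the pre-retrieval contrastive step, assuming a generic llm(prompt) callable as a placeholder (not the paper's implementation): generate the contrastive question, extract the key difference, and append it to the query as a Δ-Prompt before the single retrieval pass.

```python
# Minimal sketch under the assumption of some chat-completion client
# exposed as llm(prompt) -> str; the prompts below are illustrative.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion client here")

def build_delta_prompt(query: str) -> str:
    # Step 1: a semantically adjacent question with a different answer.
    contrastive_q = llm(
        "Write a question that is semantically very close to the following "
        f"but has a different answer:\n{query}"
    )
    # Step 2: the key detail on which the two answers diverge.
    delta = llm(
        "State only the key detail that makes the answers to these two "
        f"questions differ:\nQ1: {query}\nQ2: {contrastive_q}"
    )
    # Step 3: steer one retrieval pass toward the decisive detail.
    return f"{query}\n[Focus on: {delta}]"
```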
[591] Critical Insights into Leading Conversational AI Models
Urja Kohli, Aditi Singh, Arun Sharma
Main category: cs.AI
TL;DR: Comparative analysis of five leading LLMs (Gemini, DeepSeek, Claude, GPT, LLaMA) across performance, ethics, and usability, highlighting their distinct strengths for different applications.
Details
Motivation: As LLMs transform businesses and daily life, it's crucial to understand how different models vary in performance, ethics, and usability based on their underlying design philosophies.Method: Comparative analysis of five top LLMs (Google’s Gemini, High-Flyer’s DeepSeek, Anthropic’s Claude, OpenAI’s GPT, Meta’s LLaMA) across three key factors: Performance and Accuracy, Ethics and Bias Mitigation, and Usability and Integration.
Result: Claude excels in moral reasoning, Gemini has superior multimodal capabilities and strong ethical frameworks, DeepSeek is excellent at fact-based reasoning, LLaMA is good for open applications, and ChatGPT delivers balanced performance with usage focus.
Conclusion: LLMs differ significantly in performance, usability, and ethical treatment, suggesting users should leverage each model’s specific strengths for optimal results.
Abstract: Large Language Models (LLMs) are changing the way businesses use software, the way people live their lives and the way industries work. Companies like Google, High-Flyer, Anthropic, OpenAI and Meta are making better LLMs, so it is crucial to look at how each model differs in terms of performance, moral behaviour and usability, as these differences stem from the different ideas that built them. This study compares five top LLMs: Google’s Gemini, High-Flyer’s DeepSeek, Anthropic’s Claude, OpenAI’s GPT models and Meta’s LLaMA. It does so by analysing three important factors: Performance and Accuracy, Ethics and Bias Mitigation, and Usability and Integration. It was found that Claude has strong moral reasoning, while Gemini is better at multimodal capabilities and has strong ethical frameworks. DeepSeek is great at reasoning based on facts, LLaMA is well suited to open applications and ChatGPT delivers balanced performance with a focus on usage. It was concluded that these models differ in how well they work, how easy they are to use and how they treat people ethically, so each model should be used in a way that makes the most of its strengths.
[592] Multi-Modal Fact-Verification Framework for Reducing Hallucinations in Large Language Models
Piyushkumar Patel
Main category: cs.AI
TL;DR: A fact verification framework that reduces LLM hallucinations by 67% through real-time cross-checking against multiple knowledge sources.
Details
Motivation: LLMs confidently generate false but plausible information (hallucinations), which prevents their reliable deployment in accuracy-critical real-world applications.Method: Developed a framework that verifies LLM outputs in real-time by cross-checking against structured databases, live web searches, and academic literature, automatically correcting inconsistencies while preserving response flow.
Result: Reduced hallucinations by 67% across various domains; domain experts rated corrected outputs 89% satisfactory, a significant improvement over unverified LLM responses.
Conclusion: Provides a practical solution for making LLMs more trustworthy in applications where factual accuracy is critical.
Abstract: While Large Language Models have transformed how we interact with AI systems, they suffer from a critical flaw: they confidently generate false information that sounds entirely plausible. This hallucination problem has become a major barrier to deploying these models in real-world applications where accuracy matters. We developed a fact verification framework that catches and corrects these errors in real-time by cross-checking LLM outputs against multiple knowledge sources. Our system combines structured databases, live web searches, and academic literature to verify factual claims as they’re generated. When we detect inconsistencies, we automatically correct them while preserving the natural flow of the response. Testing across various domains showed we could reduce hallucinations by 67% without sacrificing response quality. Domain experts in healthcare, finance, and scientific research rated our corrected outputs 89% satisfactory, a significant improvement over unverified LLM responses. This work offers a practical solution for making LLMs more trustworthy in applications where getting facts wrong isn’t an option.
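A minimal majority-vote sketch of the cross-checking idea, with toy stand-ins for the three source types; a real deployment would wrap a structured database, a live web-search API, and a literature index behind the same interface.

```python
# Toy stand-ins: each checker returns True/False when it has evidence
# about a claim, and None when it has nothing to say.
FACTS_DB = {"insulin lowers blood glucose": True,
            "aspirin cures influenza": False}

def check_database(claim):   return FACTS_DB.get(claim)
def check_web(claim):        return None   # placeholder: no web access here
def check_literature(claim): return FACTS_DB.get(claim)

def verify_claim(claim):
    """Cross-check one extracted claim against several knowledge sources
    and return a majority verdict over the sources that had evidence."""
    votes = [v for check in (check_database, check_web, check_literature)
             if (v := check(claim)) is not None]
    if not votes:
        return "unverifiable"
    return "supported" if sum(votes) > len(votes) / 2 else "contradicted"

print(verify_claim("aspirin cures influenza"))   # -> contradicted
```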
[593] Jarvis: Towards Personalized AI Assistant via Personal KV-Cache Retrieval
Binxiao Xu, Junyu Feng, Ruichuan An, Yulin Luo, Shilin Yan, Hao Liang, Ming Lu, Wentao Zhang
Main category: cs.AI
TL;DR: Jarvis is a personalized AI assistant framework that uses personal KV-Cache retrieval to store user-specific information in both textual and visual tokens, achieving state-of-the-art accuracy in visual question answering and text-only tasks.
Details
Motivation: Existing methods for adapting general-purpose vision-language models into personalized assistants struggle with generating accurate answers, either learning concept tokens or training VLMs to use user-specific information.Method: Stores user-specific information in KV-Caches of both textual and visual tokens. Textual tokens from metadata summaries, visual tokens from distinct image patches. Retrieves related KV-Caches when answering questions to ensure accuracy.
Result: Jarvis provides more accurate responses, especially for fine-grained local details. Achieves state-of-the-art results in visual question answering and text-only tasks across multiple datasets.
Conclusion: Jarvis represents a practical path toward personalized AI assistants by effectively leveraging user-specific information through KV-Cache retrieval.
Abstract: The rapid development of Vision-language models (VLMs) enables open-ended perception and reasoning. Recent works have started to investigate how to adapt general-purpose VLMs into personalized assistants. Even commercial models such as ChatGPT now support model personalization by incorporating user-specific information. However, existing methods either learn a set of concept tokens or train a VLM to utilize user-specific information, and both pipelines struggle to generate accurate answers as personalized assistants. We introduce Jarvis, an innovative framework for a personalized AI assistant through personal KV-Cache retrieval, which stores user-specific information in the KV-Caches of both textual and visual tokens. The textual tokens are created by summarizing user information into metadata, while the visual tokens are produced by extracting distinct image patches from the user’s images. When answering a question, Jarvis first retrieves related KV-Caches from personal storage and uses them to ensure accuracy in responses. We also introduce a fine-grained benchmark built with the same distinct image patch mining pipeline, emphasizing accurate question answering based on fine-grained user-specific information. Jarvis is capable of providing more accurate responses, particularly when they depend on specific local details. Jarvis achieves state-of-the-art results in both visual question answering and text-only tasks across multiple datasets, indicating a practical path toward personalized AI assistants. The code and dataset will be released.
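The retrieval step can be sketched as similarity search over stored per-user entries. Note the simplification: real KV-Caches are transformer key/value tensors, while this toy store just pairs a retrieval embedding with an opaque payload; all names and data are hypothetical.

```python
import numpy as np

class PersonalCacheStore:
    """Simplified stand-in for personal KV-Cache retrieval: each entry
    pairs a retrieval embedding with a cached payload (in the real system,
    precomputed key/value tensors for textual metadata or image patches)."""
    def __init__(self):
        self.keys, self.payloads = [], []

    def add(self, embedding, payload):
        self.keys.append(embedding / np.linalg.norm(embedding))
        self.payloads.append(payload)

    def retrieve(self, query_embedding, top_k=2):
        q = query_embedding / np.linalg.norm(query_embedding)
        sims = np.array(self.keys) @ q          # cosine similarity
        best = np.argsort(-sims)[:top_k]
        return [self.payloads[i] for i in best]

rng = np.random.default_rng(1)
store = PersonalCacheStore()
store.add(rng.normal(size=64), "kv_cache: user's dog is named Rex")
store.add(rng.normal(size=64), "kv_cache: patch of user's red bicycle")
print(store.retrieve(rng.normal(size=64), top_k=1))
```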
[594] How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations
Zora Zhiruo Wang, Yijia Shao, Omar Shaikh, Daniel Fried, Graham Neubig, Diyi Yang
Main category: cs.AI
TL;DR: Direct comparison of human vs AI agents across multiple work skills reveals agents use programmatic approaches even for visual tasks, produce inferior quality work with fabrication issues, but are much faster and cheaper than humans.
Details
Motivation: To understand how AI agents perform human work compared to actual humans, revealing their capabilities and limitations in diverse workflows.Method: Introduced a scalable toolkit to induce interpretable, structured workflows from computer-use activities of both humans and agents, comparing performance across data analysis, engineering, computation, writing, and design tasks.
Result: Agents align with human workflows but use programmatic approaches even for visual tasks, produce inferior quality work with data fabrication, yet are 88.3% faster and 90.4-96.2% cheaper than humans.
Conclusion: Agents show potential for efficient collaboration by handling programmable tasks, but need improvement in quality and transparency to complement human workers effectively.
Abstract: AI agents are continually optimized for tasks related to human work, such as software engineering and professional writing, signaling a pressing trend with significant impacts on the human workforce. However, these agent developments have often not been grounded in a clear understanding of how humans execute work, to reveal what expertise agents possess and the roles they can play in diverse workflows. In this work, we study how agents do human work by presenting the first direct comparison of human and agent workers across multiple essential work-related skills: data analysis, engineering, computation, writing, and design. To better understand and compare heterogeneous computer-use activities of workers, we introduce a scalable toolkit to induce interpretable, structured workflows from either human or agent computer-use activities. Using such induced workflows, we compare how humans and agents perform the same tasks and find that: (1) While agents exhibit promise in their alignment to human workflows, they take an overwhelmingly programmatic approach across all work domains, even for open-ended, visually dependent tasks like design, creating a contrast with the UI-centric methods typically used by humans. (2) Agents produce work of inferior quality, yet often mask their deficiencies via data fabrication and misuse of advanced tools. (3) Nonetheless, agents deliver results 88.3% faster and cost 90.4-96.2% less than humans, highlighting the potential for enabling efficient collaboration by delegating easily programmable tasks to agents.
[595] Agentic Meta-Orchestrator for Multi-task Copilots
Xiaofeng Zhu, Yunshen Zhou
Main category: cs.AI
TL;DR: Proposes an Agentic Meta-orchestrator (AMO) for Microsoft Copilot services to dynamically distribute tasks among multiple agents using meta-learning decision trees.
Details
Motivation: Microsoft Copilot services need a robust orchestrator to handle multiple tasks and scalable agents that can expand dynamically, requiring intelligent task distribution from user prompts to appropriate agents.Method: Uses an Agentic Meta-orchestrator (AMO) with meta-learning through a trained decision tree model to select the best inference strategy among various agents/models, supporting both natural language and action responses.
Result: Demonstrated effectiveness through two production use cases: M365 E-Commerce Copilot (providing product information and connecting to databases/human support) and code compliance copilot (detecting compliance issues in DevOps code).
Conclusion: The AMO framework successfully orchestrates multiple agents in copilot services, enabling dynamic task distribution and scalable agent management with practical applications in e-commerce and code compliance domains.
Abstract: Microsoft Copilot suites serve as the universal entry point for various agents skilled in handling important tasks, ranging from assisting a customer with product purchases to detecting vulnerabilities in corporate programming code. Each agent can be powered by language models, software engineering operations, such as database retrieval, and internal & external knowledge. The repertoire of a copilot can expand dynamically with new agents. This requires a robust orchestrator that can distribute tasks from user prompts to the right agents. In this work, we propose an Agentic Meta-orchestrator (AMO) for handling multiple tasks and scalable agents in copilot services, which can provide both natural language and action responses. We will also demonstrate the planning that leverages meta-learning, i.e., a trained decision tree model for deciding the best inference strategy among various agents/models. We showcase the effectiveness of our AMO through two production use cases: Microsoft 365 (M365) E-Commerce Copilot and code compliance copilot. M365 E-Commerce Copilot advertises Microsoft products to external customers to promote sales success. The M365 E-Commerce Copilot provides up-to-date product information and connects to multiple agents, such as relational databases and human customer support. The code compliance copilot scans the internal DevOps code to detect known and new compliance issues in pull requests (PR).
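The meta-learned routing component can be sketched with scikit-learn: a decision tree over prompt features that maps each user prompt to an agent. The prompts, labels, and TF-IDF featurization below are hypothetical stand-ins for AMO's production setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical routing data: prompts labeled with the agent that handled
# them best in past traffic.
prompts = [
    "what is the price of an M365 business plan",
    "compare office subscription tiers for a small team",
    "does this pull request leak a storage account key",
    "scan my branch for license violations",
]
agents = ["ecommerce", "ecommerce", "compliance", "compliance"]

# Featurize prompts and fit the decision-tree router in one pipeline.
router = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(max_depth=5))
router.fit(prompts, agents)

print(router.predict(["is there a hardcoded password in this PR diff"]))
```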
[596] Will Humanity Be Rendered Obsolete by AI?
Mohamed El Louadi, Emna Ben Romdhane
Main category: cs.AI
TL;DR: This paper analyzes existential risks from AI development, particularly the transition to superintelligence, exploring how human extinction could result from uncontrollable cognitive superiority rather than malice.
Details
Motivation: To examine the existential threats posed by artificial intelligence as it progresses from current capabilities to superintelligence, drawing on theoretical frameworks and recent publications to understand the ethical and survival implications.Method: Analysis based on Irving J. Good and Nick Bostrom’s theoretical work, plus recent publications (AI 2027; If Anyone Builds It, Everyone Dies), exploring AGI and superintelligence concepts and their implications.
Result: Identifies that human extinction could result from machines’ exponentially growing cognitive power creating fundamentally alien intelligence that vastly exceeds humanity’s capabilities, with extinction potentially arising from indifference rather than malice.
Conclusion: The existential risk from AI superintelligence stems from uncontrollable cognitive superiority that may be indifferent rather than malicious, posing fundamental threats to human survival that require careful consideration.
Abstract: This article analyzes the existential risks artificial intelligence (AI) poses to humanity, tracing the trajectory from current AI to ultraintelligence. Drawing on Irving J. Good and Nick Bostrom’s theoretical work, plus recent publications (AI 2027; If Anyone Builds It, Everyone Dies), it explores AGI and superintelligence. Considering machines’ exponentially growing cognitive power and hypothetical IQs, it addresses the ethical and existential implications of an intelligence vastly exceeding humanity’s and fundamentally alien to it. Human extinction may result not from malice, but from uncontrollable, indifferent cognitive superiority.
[597] HRM-Agent: Training a recurrent reasoning model in dynamic environments using reinforcement learning
Long H Dang, David Rawlinson
Main category: cs.AI
TL;DR: HRM-Agent extends the Hierarchical Reasoning Model to work with reinforcement learning in dynamic, uncertain environments, enabling computation reuse across time-steps.
Details
Motivation: The original HRM was limited to supervised, static, fully-observable problems and couldn't handle dynamic, uncertain, or partially observable environments common in real-world scenarios.Method: Developed HRM-Agent, a reinforcement learning variant of HRM that can navigate dynamic maze environments and reuse computation from previous time-steps through recurrent inference.
Result: HRM-Agent successfully learned to navigate to goals in dynamic and uncertain maze environments, with evidence showing effective computation reuse across time-steps.
Conclusion: HRM can be effectively adapted for reinforcement learning in dynamic environments while maintaining its core reasoning capabilities and computation reuse properties.
Abstract: The Hierarchical Reasoning Model (HRM) has impressive reasoning abilities given its small size, but has only been applied to supervised, static, fully-observable problems. One of HRM’s strengths is its ability to adapt its computational effort to the difficulty of the problem. However, in its current form it cannot integrate and reuse computation from previous time-steps if the problem is dynamic, uncertain or partially observable, or be applied where the correct action is undefined, characteristics of many real-world problems. This paper presents HRM-Agent, a variant of HRM trained using only reinforcement learning. We show that HRM can learn to navigate to goals in dynamic and uncertain maze environments. Recent work suggests that HRM’s reasoning abilities stem from its recurrent inference process. We explore the dynamics of the recurrent inference process and find evidence that it is successfully reusing computation from earlier environment time-steps.
[598] Toward Agents That Reason About Their Computation
Adrian Orenstein, Jessica Chen, Gwyneth Anne Delos Santos, Bayley Sapara, Michael Bowling
Main category: cs.AI
TL;DR: Reinforcement learning agents can learn to reason about their computational costs and reduce compute usage while maintaining or improving performance, achieving 3x less compute on average.
Details
Motivation: Unlike humans who become more computationally efficient as they learn, current RL agents don't reduce their compute footprint. The goal is to create more energy-efficient agents that can free up compute for other tasks.Method: Show agents the cost of their computation and give them control over when to use compute. Experiments conducted on the Arcade Learning Environment.
Result: With the same training compute budget, agents that reason about compute perform better on 75% of games and use three times less compute on average.
Conclusion: Agents can learn to be computationally efficient while maintaining performance, enabling more energy-efficient AI systems.
Abstract: While reinforcement learning agents can achieve superhuman performance in many complex tasks, they typically do not become more computationally efficient as they improve. In contrast, humans gradually require less cognitive effort as they become more proficient at a task. If agents could reason about their compute as they learn, could they similarly reduce their computation footprint? If they could, we could have more energy efficient agents or free up compute cycles for other processes like planning. In this paper, we experiment with showing agents the cost of their computation and giving them the ability to control when they use compute. We conduct our experiments on the Arcade Learning Environment, and our results demonstrate that with the same training compute budget, agents that reason about their compute perform better on 75% of games. Furthermore, these agents use three times less compute on average. We analyze individual games and show where agents gain these efficiencies.
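One way to "show agents the cost of their computation" is simple reward shaping, sketched below: subtract a per-unit compute charge from the environment reward so that deliberation pays off only when it sufficiently improves the action. The cost constants and mode names are assumptions for illustration, not the paper's values.

```python
def compute_aware_reward(env_reward, compute_used, cost_per_unit=0.01):
    """Shaped reward that exposes the cost of computation to the agent,
    so it can learn to spend compute only where it pays off."""
    return env_reward - cost_per_unit * compute_used

# The agent chooses each step between a cheap reflex policy and an
# expensive deliberative one (e.g., extra forward passes or planning).
step_cost = {"reflex": 1.0, "deliberate": 8.0}

def step_return(action_quality, mode):
    return compute_aware_reward(action_quality, step_cost[mode])

# Deliberation is worth it only when it improves the action enough:
print(step_return(1.0, "deliberate"))   # 1.0 - 0.01 * 8.0 = 0.92
print(step_return(0.9, "reflex"))       # 0.9 - 0.01 * 1.0 = 0.89
```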
[599] Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes
Guanyu Yao, Qiucheng Wu, Yang Zhang, Zhaowen Wang, Handong Zhao, Shiyu Chang
Main category: cs.AI
TL;DR: Analysis of modality gap in MLLMs where models over-rely on text and under-attend to vision, with proposed solutions through data and loss design.
Details
Motivation: Current MLLMs show imbalance in reasoning across visual and textual modalities, performing poorly on vision-centric tasks due to over-reliance on text cues.Method: Analyze modality gap through training recipes, explore bridging strategies from data and loss design perspectives.
Result: Existing training recipes amplify the modality gap; systematic approaches can help bridge this performance disparity.
Conclusion: Insights provided for developing training recipes that mitigate modality gap and promote balanced multimodal reasoning.
Abstract: Multimodal large language models (MLLMs) have demonstrated strong capabilities on vision-and-language tasks. However, recent findings reveal an imbalance in their reasoning capabilities across visual and textual modalities. Specifically, current MLLMs often over-rely on textual cues while under-attending to visual content, resulting in suboptimal performance on tasks that require genuine visual reasoning. We refer to this phenomenon as the modality gap, defined as the performance disparity between text-centric and vision-centric inputs. In this paper, we analyze the modality gap through the lens of training recipes. We first show that existing training recipes tend to amplify this gap. Then, we systematically explore strategies to bridge it from two complementary perspectives: data and loss design. Our findings provide insights into developing training recipes that mitigate the modality gap and promote more balanced multimodal reasoning. Our code is publicly available at https://github.com/UCSB-NLP-Chang/Bridging-Modality-Gap.
[600] Lyapunov Function-guided Reinforcement Learning for Flight Control
Yifei Li, Erik-Jan van Kampen
Main category: cs.AI
TL;DR: A cascaded online learning flight control system with enhanced action smoothness is analyzed for convergence performance using Lyapunov function increments, accounting for discretization and state prediction errors.
Details
Motivation: To investigate the convergence performance of an enhanced cascaded online learning flight control system, particularly focusing on how action smoothness improvements affect stability and convergence.Method: Analysis of convergence performance through Lyapunov function candidate increments, accounting for discretization errors and state prediction errors from the incremental model. Comparative flight control simulations are used.
Result: The paper presents comparative results from flight control simulations demonstrating the convergence performance of the enhanced control system.
Conclusion: The enhanced cascaded online learning flight control system with improved action smoothness shows analyzed convergence performance through Lyapunov-based metrics and simulation validation.
Abstract: A cascaded online learning flight control system has been developed and enhanced with respect to action smoothness. In this paper, we investigate the convergence performance of the control system, characterized by the increment of a Lyapunov function candidate. The derivation of this metric accounts for discretization errors and state prediction errors introduced by the incremental model. Comparative results are presented through flight control simulations.
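The convergence metric can be illustrated with a quadratic Lyapunov candidate: compute the increment dV = V(x_{t+1}) - V(x_t) along a trajectory and check that it stays negative. The weight matrix P and the contraction factor below are assumptions for the sketch, not the paper's flight-control model.

```python
import numpy as np

P = np.diag([2.0, 1.0])          # positive-definite weight (assumed)

def V(x):
    """Quadratic Lyapunov function candidate V(x) = x^T P x."""
    return float(x @ P @ x)

def lyapunov_increment(x_t, x_next):
    """Discrete-time increment dV = V(x_{t+1}) - V(x_t); consistently
    negative values along trajectories indicate convergence."""
    return V(x_next) - V(x_t)

# Toy closed-loop trajectory: the tracking error contracts each step.
x = np.array([1.0, -0.5])
for _ in range(3):
    x_next = 0.8 * x             # stand-in for the learned controller's effect
    print(f"dV = {lyapunov_increment(x, x_next):+.4f}")
    x = x_next
```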
[601] Exploring Structures of Inferential Mechanisms through Simplistic Digital Circuits
Giovanni Sileno, Jean-Louis Dessalles
Main category: cs.AI
TL;DR: This paper proposes a unifying framework for inferential mechanisms using logic gates and symbolic AI modeling, identifying four dependency forms and eight common inferential patterns that bridge cognitive studies and artificial intelligence.
Details
Motivation: To address the lack of a unifying framework for various inferential mechanisms in both natural cognition and artificial intelligence, bridging the gap between cognitive studies and AI modeling approaches.Method: Uses symbolic AI modeling techniques through electronic circuits based on logic gates, conducts combinatorial exploration to identify dependency forms, analyzes logic programs to identify inferential patterns, and applies probabilistic interpretation of logic programs.
Result: Identified four main forms of dependencies that can be realized by inferential circuits and eight common inferential patterns that expose traditionally distinct inferential mechanisms in a unifying framework.
Conclusion: The observations from symbolic means and digital systems infrastructures may point to more generally applicable structures for understanding inferential mechanisms across different domains.
Abstract: Cognitive studies and artificial intelligence have developed distinct models for various inferential mechanisms (categorization, induction, abduction, causal inference, contrast, merge, …). Yet, both natural and artificial views on cognition apparently lack a unifying framework. This paper formulates a speculative answer attempting to respond to this gap. To postulate on higher-level activation processes from a material perspective, we consider inferential mechanisms informed by symbolic AI modelling techniques, through the simplistic lenses of electronic circuits based on logic gates. We observe that a logic gate view entails a different treatment of implication and negation compared to standard logic and logic programming. Then, by combinatorial exploration, we identify four main forms of dependencies that can be realized by these inferential circuits. Looking at how these forms are generally used in the context of logic programs, we identify eight common inferential patterns, exposing traditionally distinct inferential mechanisms within a unifying framework. Finally, following a probabilistic interpretation of logic programs, we unveil inner functional dependencies. The paper concludes by elaborating in what sense, even if our arguments are mostly informed by symbolic means and digital systems infrastructures, our observations may point to more generally applicable structures.
[602] On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset
Vishvesh Bhat, Omkar Ghugarkar, Julian McAuley
Main category: cs.AI
TL;DR: LLMs struggle with generalization across tool-calling environments. The paper introduces MAVEN benchmark and CoreThink framework, which achieves 530% improvement over baselines at 1/10 computational cost.
Details
Motivation: Generalization across agentic tool-calling environments remains an unsolved challenge, as LLMs show strong performance on isolated benchmarks but poor transfer of reasoning strategies across domains.Method: Introduces MAVEN benchmark for OOD testing and CoreThink Agentic Reasoner - a framework that augments LLMs with lightweight symbolic reasoning layer for structured decomposition and adaptive tool orchestration.
Result: Most current models achieve below 50% accuracy on MAVEN. CoreThink generalizes across all benchmarks, achieving state-of-the-art performance with 530% improvements over baselines at roughly one-tenth the computational cost.
Conclusion: The CoreThink framework effectively addresses the generalization gap in tool-use settings without additional training, demonstrating significant performance improvements across diverse benchmarks.
Abstract: Generalization across agentic tool-calling environments remains a key unsolved challenge in developing reliable agentic reasoning systems. While large language models (LLMs) demonstrate strong performance on isolated benchmarks, their ability to transfer reasoning strategies and coordinate tools across diverse domains is poorly understood. In this work, we conduct a large-scale evaluation of state-of-the-art LLMs on multiple tool-calling benchmarks (BFCL v3, TauBench, Tau2Bench, and AceBench) and introduce MAVEN (Math & Physics Adversarial Verification & Evaluation Network), a new out-of-distribution (OOD) benchmark designed to stress-test multi-step reasoning through explicit verification and adversarial task composition. Our results show that most current models achieve below 50% accuracy on MAVEN, revealing a significant generalization gap across tool-use settings. To address this, we present the CoreThink Agentic Reasoner, a framework that augments LLMs with a lightweight symbolic reasoning layer for structured decomposition and adaptive tool orchestration. Without additional training, it generalizes across all benchmarks, achieving state-of-the-art performance with 530% improvements over existing baselines at roughly one-tenth the computational cost.
[603] GTR-Mamba: Geometry-to-Tangent Routing for Hyperbolic POI Recommendation
Zhuoxuan Li, Jieyuan Pei, Tangwei Ye, Zhongyuan Lai, Zihan Liu, Fengyuan Xu, Qi Zhang, Liang Hu
Main category: cs.AI
TL;DR: GTR-Mamba is a novel framework for next POI recommendation that uses hyperbolic geometry for static preference hierarchies and Mamba layers in Euclidean space for dynamic sequence modeling, outperforming existing methods.
Details
Motivation: Existing POI recommendation models struggle to simultaneously capture the hierarchical structure of spatial choices and the dynamic temporal contexts of user mobility patterns.Method: Proposes cross-manifold conditioning and routing: models static preference hierarchies in hyperbolic geometry, routes dynamic sequence updates to Mamba layers in Euclidean tangent space, with cross-manifold channel fusion.
Result: Extensive experiments on three real-world datasets show GTR-Mamba consistently outperforms state-of-the-art baseline models in next POI recommendation.
Conclusion: The cross-manifold approach effectively addresses limitations of existing models by leveraging different mathematical spaces for different aspects of the recommendation problem.
Abstract: Next Point-of-Interest (POI) recommendation is a critical task in modern Location-Based Social Networks (LBSNs), aiming to model the complex decision-making process of human mobility to provide personalized recommendations for a user’s next check-in location. Existing POI recommendation models, predominantly based on Graph Neural Networks and sequential models, have been extensively studied. However, these models face a fundamental limitation: they struggle to simultaneously capture the inherent hierarchical structure of spatial choices and the dynamics and irregular shifts of user-specific temporal contexts. To overcome this limitation, we propose GTR-Mamba, a novel framework for cross-manifold conditioning and routing. GTR-Mamba leverages the distinct advantages of different mathematical spaces for different tasks: it models the static, tree-like preference hierarchies in hyperbolic geometry, while routing the dynamic sequence updates to a novel Mamba layer in the computationally stable and efficient Euclidean tangent space. This process is coordinated by a cross-manifold channel that fuses spatio-temporal information to explicitly steer the State Space Model (SSM), enabling flexible adaptation to contextual changes. Extensive experiments on three real-world datasets demonstrate that GTR-Mamba consistently outperforms state-of-the-art baseline models in next POI recommendation.
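The hyperbolic-to-tangent routing relies on standard logarithm and exponential maps; the sketch below uses the Poincaré ball maps at the origin to lift a point to the tangent space (where a Euclidean sequence update can run) and project it back. The update itself is a stand-in, not GTR-Mamba's Mamba layer, and the curvature and vectors are arbitrary.

```python
import numpy as np

def log0(x, c=1.0, eps=1e-9):
    """Logarithm map at the origin of the Poincare ball (curvature -c):
    hyperbolic point -> Euclidean tangent vector."""
    sq = np.sqrt(c)
    n = max(np.linalg.norm(x), eps)
    return np.arctanh(np.clip(sq * n, 0.0, 1.0 - eps)) * x / (sq * n)

def exp0(v, c=1.0, eps=1e-9):
    """Exponential map at the origin: tangent vector -> hyperbolic point."""
    sq = np.sqrt(c)
    n = max(np.linalg.norm(v), eps)
    return np.tanh(sq * n) * v / (sq * n)

# Round trip: lift a hyperbolic preference embedding to the tangent space,
# apply a Euclidean update there, then map the result back into the ball.
h = np.array([0.3, -0.2])                # point inside the unit ball
t = log0(h)                              # tangent-space representation
t_updated = t + np.array([0.05, 0.1])    # stand-in for a Mamba-layer update
h_new = exp0(t_updated)
print(h_new, np.linalg.norm(h_new) < 1.0)   # stays inside the ball
```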
[604] Exploring Semantic-constrained Adversarial Example with Instruction Uncertainty Reduction
Jin Hu, Jiakai Wang, Linna Jing, Haolin Li, Haodong Liu, Haotong Qin, Aishan Liu, Ke Xu, Xianglong Liu
Main category: cs.AI
TL;DR: The paper proposes InSUR, a multi-dimensional framework to generate more effective semantically constrained adversarial examples by addressing semantic uncertainty in human instructions through three key dimensions: sampling method stabilization, task modeling constraints, and generator evaluation enhancement.
Details
Motivation: Current methods for generating semantically constrained adversarial examples (SemanticAE) have unsatisfactory attacking ability because they don't fully address semantic uncertainty factors in human instructions like referring diversity, descriptive incompleteness, and boundary ambiguity.Method: InSUR framework with three key components: 1) Residual-driven attacking direction stabilization using ResAdv-DDIM sampler to stabilize adversarial optimization, 2) Context-encoded attacking scenario constraint with guidance masking and renderer integration for 2D/3D SemanticAE, 3) Semantic-abstracted attacking evaluation enhancement to clarify evaluation boundaries.
Result: Extensive experiments demonstrate superior transfer attack performance. The framework also enables reference-free generation of semantically constrained 3D adversarial examples for the first time.
Conclusion: InSUR effectively addresses semantic uncertainty in human instructions to generate more transferable, adaptive, and effective semantically constrained adversarial examples, advancing the field of semantic adversarial attacks.
Abstract: Recently, semantically constrained adversarial examples (SemanticAE), which are directly generated from natural language instructions, have become a promising avenue for future research due to their flexible attacking forms. To generate SemanticAEs, current methods fall short of satisfactory attacking ability as the key underlying factors of semantic uncertainty in human instructions, such as referring diversity, descriptive incompleteness, and boundary ambiguity, have not been fully investigated. To tackle the issues, this paper develops a multi-dimensional instruction uncertainty reduction (InSUR) framework to generate more satisfactory SemanticAE, i.e., transferable, adaptive, and effective. Specifically, in the dimension of the sampling method, we propose the residual-driven attacking direction stabilization to alleviate the unstable adversarial optimization caused by the diversity of language references. By coarsely predicting the language-guided sampling process, the optimization process will be stabilized by the designed ResAdv-DDIM sampler, therefore releasing the transferable and robust adversarial capability of multi-step diffusion models. In task modeling, we propose the context-encoded attacking scenario constraint to supplement the missing knowledge from incomplete human instructions. Guidance masking and renderer integration are proposed to regulate the constraints of 2D/3D SemanticAE, activating stronger scenario-adapted attacks. Moreover, in the dimension of generator evaluation, we propose the semantic-abstracted attacking evaluation enhancement by clarifying the evaluation boundary, facilitating the development of more effective SemanticAE generators. Extensive experiments demonstrate the superiority of the transfer attack performance of InSUR. Moreover, we realize the reference-free generation of semantically constrained 3D adversarial examples for the first time.
[605] ProfileXAI: User-Adaptive Explainable AI
Gilber A. Corrales, Carlos Andrés Ferro Sánchez, Reinel Tabares-Soto, Jesús Alfonso López Sotelo, Gonzalo A. Ruz, Johan Sebastian Piña Durán
Main category: cs.AI
TL;DR: ProfileXAI is a framework that combines post-hoc explainers (SHAP, LIME, Anchor) with retrieval-augmented LLMs to generate explanations for different user types, evaluated on medical datasets showing trade-offs between fidelity, robustness, and user satisfaction.
Details
Motivation: To create a model- and domain-agnostic framework that provides trustworthy explanations for different types of users by coupling traditional explainers with LLMs and multimodal knowledge bases.Method: Indexes multimodal knowledge base, selects explainers per instance via quantitative criteria, and generates grounded narratives with chat-enabled prompting using SHAP, LIME, and Anchor explainers.
Result: No single explainer dominates: LIME has best fidelity-robustness trade-off, Anchor produces sparsest rules, SHAP achieves highest user satisfaction. Profile conditioning stabilizes token usage and maintains positive ratings across user profiles.
Conclusion: ProfileXAI enables efficient and trustworthy explanations by leveraging multiple explainers with LLM augmentation, with profile conditioning providing stable performance across different user types.
Abstract: ProfileXAI is a model- and domain-agnostic framework that couples post-hoc explainers (SHAP, LIME, Anchor) with retrieval-augmented LLMs to produce explanations for different types of users. The system indexes a multimodal knowledge base, selects an explainer per instance via quantitative criteria, and generates grounded narratives with chat-enabled prompting. On Heart Disease and Thyroid Cancer datasets, we evaluate fidelity, robustness, parsimony, token use, and perceived quality. No explainer dominates: LIME achieves the best fidelity–robustness trade-off (Infidelity $\le 0.30$, $L<0.7$ on Heart Disease); Anchor yields the sparsest, low-token rules; SHAP attains the highest satisfaction ($\bar{x}=4.1$). Profile conditioning stabilizes tokens ($\sigma \le 13\%$) and maintains positive ratings across profiles ($\bar{x}\ge 3.7$, with domain experts at $3.77$), enabling efficient and trustworthy explanations.
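Per-instance explainer selection via quantitative criteria can be sketched as a weighted score over fidelity, robustness, and parsimony; the metric values and weights below are illustrative assumptions, not the paper's measurements.

```python
# Per-instance explainer metrics (lower infidelity/instability is better,
# fewer tokens is cheaper); the numbers here are illustrative only.
CANDIDATES = {
    "LIME":   {"infidelity": 0.28, "instability": 0.6, "tokens": 120},
    "SHAP":   {"infidelity": 0.35, "instability": 0.9, "tokens": 150},
    "Anchor": {"infidelity": 0.40, "instability": 0.5, "tokens": 60},
}

def select_explainer(metrics, w=(0.5, 0.3, 0.2)):
    """Pick the explainer minimizing a weighted quantitative criterion over
    fidelity, robustness, and parsimony (the weights are assumptions)."""
    def score(m):
        return (w[0] * m["infidelity"] + w[1] * m["instability"]
                + w[2] * m["tokens"] / 200.0)
    return min(metrics, key=lambda name: score(metrics[name]))

print(select_explainer(CANDIDATES))   # -> 'Anchor' under these weights
```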
[606] From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports
Qiuli Wang, Xiaoming Li, Jie Chen, Yongxu Liu, Xingpeng Zhang, Chen Liu, Wei Chen
Main category: cs.AI
TL;DR: This study introduces a Multi-Dimensional Credibility Assessment (MDCA) framework to enhance trustworthiness of LLM-generated liver MRI reports and provides guidance on institution-specific prompt optimization.
Details
Motivation: LLMs show promise in generating diagnostic conclusions from imaging findings for radiology reporting, education, and quality control, but systematic guidance on prompt optimization across clinical contexts and standardized trustworthiness assessment frameworks are lacking.Method: Proposed a Multi-Dimensional Credibility Assessment (MDCA) framework and applied it to evaluate several advanced LLMs (Kimi-K2-Instruct-0905, Qwen3-235B-A22B-Instruct-2507, DeepSeek-V3, ByteDance-Seed-OSS-36B-Instruct) using the SiliconFlow platform.
Result: The study provides a systematic framework for assessing LLM-generated radiology reports and compares performance of multiple advanced language models in liver MRI reporting context.
Conclusion: The MDCA framework enables enhanced trustworthiness evaluation of LLM-generated radiology reports and offers guidance for institution-specific prompt optimization to improve clinical utility.
Abstract: Large language models (LLMs) have demonstrated promising performance in generating diagnostic conclusions from imaging findings, thereby supporting radiology reporting, trainee education, and quality control. However, systematic guidance on how to optimize prompt design across different clinical contexts remains underexplored. Moreover, a comprehensive and standardized framework for assessing the trustworthiness of LLM-generated radiology reports is yet to be established. This study aims to enhance the trustworthiness of LLM-generated liver MRI reports by introducing a Multi-Dimensional Credibility Assessment (MDCA) framework and providing guidance on institution-specific prompt optimization. The proposed framework is applied to evaluate and compare the performance of several advanced LLMs, including Kimi-K2-Instruct-0905, Qwen3-235B-A22B-Instruct-2507, DeepSeek-V3, and ByteDance-Seed-OSS-36B-Instruct, using the SiliconFlow platform.
[607] Mixed Density Diffuser: Efficient Planning with Non-uniform Temporal Resolution
Crimson Stambaugh, Rajesh P. N. Rao
Main category: cs.AI
TL;DR: MDD is a diffusion planner with tunable temporal density hyperparameters that achieves state-of-the-art performance across multiple D4RL benchmarks.
Details
Motivation: Existing diffusion planners benefit from sparse-step planning but predicting excessively sparse plans degrades performance. The temporal density threshold is non-uniform across the planning horizon.
Method: Proposed Mixed Density Diffuser (MDD) where densities throughout the horizon are tunable hyperparameters, allowing certain parts of trajectories to be more densely planned.
Result: MDD achieves new state-of-the-art performance across Maze2D, Franka Kitchen, and Antmaze D4RL task domains.
Conclusion: Tunable temporal density in diffusion planning enables better performance by adapting to non-uniform density requirements across the planning horizon.
Abstract: Recent studies demonstrate that diffusion planners benefit from sparse-step planning over single-step planning. Training models to skip steps in their trajectories helps capture long-term dependencies without additional memory or computational cost. However, predicting excessively sparse plans degrades performance. We hypothesize that this temporal density threshold is non-uniform across the temporal horizon and that certain parts of a planned trajectory should be more densely planned. We propose Mixed Density Diffuser (MDD), a diffusion planner where the densities throughout the horizon are tunable hyperparameters. MDD achieves a new SOTA across the Maze2D, Franka Kitchen, and Antmaze D4RL task domains.
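The core idea, non-uniform temporal density over the horizon, can be sketched as a schedule of trajectory indices whose per-segment stride is a tunable hyperparameter. This is a hedged illustration rather than the paper's implementation; the segment boundaries and strides below are made up.

```python
# Minimal sketch of a mixed-density planning schedule: the horizon is divided
# into segments, each with its own step stride (a tunable hyperparameter).
# Dense segments (stride 1) plan every step; sparse segments skip steps.
# Segment fractions and strides here are illustrative, not the paper's values.

def mixed_density_schedule(horizon: int, segments: list[tuple[float, int]]) -> list[int]:
    """Return the trajectory indices the planner will actually predict.

    segments: list of (fraction_of_horizon, stride) pairs; fractions sum to 1.
    """
    indices, start = [], 0
    for fraction, stride in segments:
        end = min(horizon, start + round(fraction * horizon))
        indices.extend(range(start, end, stride))
        start = end
    if indices[-1] != horizon - 1:
        indices.append(horizon - 1)  # always predict the final state
    return indices

# Plan densely near the start (where precision matters most), sparsely later.
print(mixed_density_schedule(64, [(0.25, 1), (0.5, 4), (0.25, 8)]))
```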
[608] A Survey of AI Scientists: Surveying the automatic Scientists and Research
Guiyao Tie, Pan Zhou, Lichao Sun
Main category: cs.AI
TL;DR: This survey systematically analyzes the evolution of AI scientists - autonomous systems that perform complete scientific workflows from hypothesis generation to paper publication, introducing a unified six-stage framework and tracing development from foundational modules to current human-AI collaboration challenges.
Details
Motivation: The rapid proliferation of AI scientist systems has created a fragmented research landscape, obscuring methodological principles and developmental trends, necessitating a systematic synthesis to clarify the field's current state and provide guidance for future development.
Method: Introduces a unified six-stage methodological framework deconstructing the scientific process: Literature Review, Idea Generation, Experimental Preparation, Experimental Execution, Scientific Writing, and Paper Generation, used to analyze field evolution from foundational modules to integrated closed-loop systems.
Result: Charts the field’s evolution from early Foundational Modules (2022-2023) to integrated Closed-Loop Systems (2024), and on to the current frontier of Scalability, Impact, and Human-AI Collaboration (2025-present), providing a systematic synthesis of autonomous science developments.
Conclusion: The survey clarifies the current state of autonomous science and provides a critical roadmap for overcoming remaining challenges in robustness and governance, guiding next-generation systems toward becoming trustworthy partners in human scientific inquiry.
Abstract: Artificial intelligence is undergoing a profound transition from a computational instrument to an autonomous originator of scientific knowledge. This emerging paradigm, the AI scientist, is architected to emulate the complete scientific workflow, from initial hypothesis generation to the final synthesis of publishable findings, thereby promising to fundamentally reshape the pace and scale of discovery. However, the rapid and unstructured proliferation of these systems has created a fragmented research landscape, obscuring overarching methodological principles and developmental trends. This survey provides a systematic and comprehensive synthesis of this domain by introducing a unified, six-stage methodological framework that deconstructs the end-to-end scientific process into: Literature Review, Idea Generation, Experimental Preparation, Experimental Execution, Scientific Writing, and Paper Generation. Through this analytical lens, we chart the field’s evolution from early Foundational Modules (2022-2023) to integrated Closed-Loop Systems (2024), and finally to the current frontier of Scalability, Impact, and Human-AI Collaboration (2025-present). By rigorously synthesizing these developments, this survey not only clarifies the current state of autonomous science but also provides a critical roadmap for overcoming remaining challenges in robustness and governance, ultimately guiding the next generation of systems toward becoming trustworthy and indispensable partners in human scientific inquiry.
[609] TLCD: A Deep Transfer Learning Framework for Cross-Disciplinary Cognitive Diagnosis
Zhifeng Wang, Meixin Su, Yang Yang, Chunyan Zeng, Lizhi Ye
Main category: cs.AI
TL;DR: The paper proposes TLCD, a cross-disciplinary cognitive diagnosis method using deep learning and transfer learning to improve student ability assessment across different disciplines by leveraging common features from source disciplines.
Details
Motivation: Online education growth creates need for accurate cognitive diagnosis, but traditional methods struggle with cross-disciplinary challenges due to differences in knowledge systems, cognitive structures, and data characteristics between disciplines.
Method: Proposes TLCD method combining deep learning techniques and transfer learning strategies to enhance target discipline performance by utilizing common features from source disciplines, based on neural network cognitive diagnosis and knowledge association neural network cognitive diagnosis.
Result: Experimental results show the cross-disciplinary cognitive diagnosis model based on deep learning performs better than basic models in cross-disciplinary tasks and can more accurately evaluate students’ learning situations.
Conclusion: The proposed TLCD method effectively addresses cross-disciplinary cognitive diagnosis challenges and provides more accurate assessment of student learning across different disciplines.
Abstract: Driven by the dual principles of smart education and artificial intelligence technology, the online education model has rapidly emerged as an important component of the education industry. Cognitive diagnostic technology can utilize students’ learning data and feedback information in educational evaluation to accurately assess their ability level at the knowledge level. However, while massive amounts of information provide abundant data resources, they also bring about complexity in feature extraction and scarcity of disciplinary data. In cross-disciplinary fields, traditional cognitive diagnostic methods still face many challenges. Given the differences in knowledge systems, cognitive structures, and data characteristics between different disciplines, this paper conducts in-depth research on neural network cognitive diagnosis and knowledge association neural network cognitive diagnosis, and proposes an innovative cross-disciplinary cognitive diagnosis method (TLCD). This method combines deep learning techniques and transfer learning strategies to enhance the performance of the model in the target discipline by utilizing the common features of the main discipline. The experimental results show that the cross-disciplinary cognitive diagnosis model based on deep learning performs better than the basic model in cross-disciplinary cognitive diagnosis tasks, and can more accurately evaluate students’ learning situation.
[610] Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards
Jan Niklas Groeneveld, Xi Qin, Alexander Schaefer, Yaad Oren
Main category: cs.AI
TL;DR: Small language models like Phi-4 can be effectively transformed into reward models for code generation by adding regression layers and fine-tuning on code correctness data, achieving over 20% improvement in selecting the best code from multiple generations.
Details
Motivation: To investigate whether small language models can serve as effective reward models for code generation, bridging the gap between process rewards and outcome rewards that are needed for evolving reasoning models in LLMs.
Method: Constructed a dataset from APPS coding challenge benchmark with correctness labels, trained a value-head model with regression layer on small LLMs (Phi-4 family) to estimate success probability of intermediate code outputs.
Result: Small LLMs successfully function as effective reward models and code evaluation critics, capable of identifying correct solutions among multiple candidates, leading to over 20% improvement in search capability for selecting the most accurate code.
Conclusion: Small language models can be effectively repurposed as reward models for code generation tasks, providing a practical approach to improve code quality selection without requiring large-scale models.
Abstract: Generating high-quality code remains a challenge for Large Language Models (LLMs). For the evolution of reasoning models on this task, reward models are a necessary intermediate step. These models judge outcomes or intermediate steps. Decoder-only transformer models can be turned into reward models by introducing a regression layer and applying supervised fine-tuning. While it is known that reflection capabilities generally increase with the size of a model, we investigate whether state-of-the-art small language models like the Phi-4 family can be turned into usable reward models that blend process rewards and outcome rewards. Targeting this goal, we construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. We then train a value-head model to estimate the success probability of intermediate outputs. Our evaluation shows that small LLMs are capable of serving as effective reward models or code evaluation critics, successfully identifying correct solutions among multiple candidates. Using this critic, we achieve over a 20% improvement in selecting the most accurate code out of multiple generations.
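The "regression layer on a decoder-only model" idea can be illustrated with a short PyTorch sketch: pool the final hidden state and map it through a scalar value head trained on correctness labels. `ToyBackbone` is a stand-in for a pretrained decoder such as a Phi-4-class model; the pooling choice, dimensions, and loss weighting are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class ValueHeadRewardModel(nn.Module):
    """Decoder-only LM body + scalar value head estimating P(code is correct)."""
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                      # returns (B, T, H) hidden states
        self.value_head = nn.Linear(hidden_size, 1)   # the added regression layer

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)             # (B, T, H)
        return torch.sigmoid(self.value_head(hidden[:, -1, :])).squeeze(-1)

# Toy stand-in backbone so the sketch runs end to end without downloads.
class ToyBackbone(nn.Module):
    def __init__(self, vocab=100, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
    def forward(self, ids):
        return self.emb(ids)

model = ValueHeadRewardModel(ToyBackbone(), hidden_size=32)
ids = torch.randint(0, 100, (4, 16))                  # 4 candidate programs
labels = torch.tensor([1., 0., 1., 0.])               # correctness from unit tests
loss = nn.functional.binary_cross_entropy(model(ids), labels)
loss.backward()
```

At inference, the same head scores N sampled generations and the highest-probability candidate is selected (best-of-N search).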
[611] Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs
Kai Zhuang, Jiawei Zhang, Yumou Liu, Hanqun Cao, Chunbin Gu, Mengdi Liu, Zhangyang Gao, Zitong Jerry Wang, Xuanhe Zhou, Pheng-Ann Heng, Lijun Wu, Conghui He, Cheng Tan
Main category: cs.AI
TL;DR: Sci-LLMs perform better when given high-level structured context from bioinformatics tools rather than raw biomolecular sequences, as sequences act as informational noise that degrades performance.
Details
Motivation: Current Sci-LLMs face a tokenization dilemma when processing raw biomolecular sequences - either treating sequences as specialized language (risking loss of functional motifs) or as separate modality (creating alignment challenges), which limits reasoning capacity.
Method: Systematic comparison of leading Sci-LLMs on biological reasoning tasks using three input modes: sequence-only, context-only, and combination of both.
Result: Context-only approach consistently and substantially outperforms all other modes. Including raw sequences alongside high-level context degrades performance, indicating sequences act as informational noise even for specialized models.
Conclusion: Sci-LLMs should be reframed as powerful reasoning engines over expert knowledge rather than sequence decoders, shifting focus from direct sequence interpretation to high-level knowledge synthesis for hybrid scientific AI agents.
Abstract: Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw biomolecular sequences: the tokenization dilemma. Whether treating sequences as a specialized language, risking the loss of functional motif information, or as a separate modality, introducing formidable alignment challenges, current strategies fundamentally limit their reasoning capacity. We challenge this sequence-centric paradigm by positing that a more effective strategy is to provide Sci-LLMs with high-level structured context derived from established bioinformatics tools, thereby bypassing the need to interpret low-level noisy sequence data directly. Through a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we tested three input modes: sequence-only, context-only, and a combination of both. Our findings are striking: the context-only approach consistently and substantially outperforms all other modes. Even more revealing, the inclusion of the raw sequence alongside its high-level context consistently degrades performance, indicating that raw sequences act as informational noise, even for models with specialized tokenization schemes. These results suggest that the primary strength of existing Sci-LLMs lies not in their nascent ability to interpret biomolecular syntax from scratch, but in their profound capacity for reasoning over structured, human-readable knowledge. Therefore, we argue for reframing Sci-LLMs not as sequence decoders, but as powerful reasoning engines over expert knowledge. This work lays the foundation for a new class of hybrid scientific AI agents, repositioning the developmental focus from direct sequence interpretation towards high-level knowledge synthesis. The code is available at github.com/opendatalab-raise-dev/CoKE.
[612] Guiding Skill Discovery with Foundation Models
Zhao Yang, Thomas M. Moerland, Mike Preuss, Aske Plaat, Vincent François-Lavet, Edward S. Hu
Main category: cs.AI
TL;DR: FoG skill discovery method uses foundation models to incorporate human preferences into reinforcement learning, eliminating undesirable behaviors while maintaining skill diversity.
Details
Motivation: Existing skill discovery methods maximize diversity without considering human preferences, leading to potentially dangerous or undesirable behaviors that don't align with human intentions.
Method: Extracts score functions from foundation models to evaluate states based on human intentions, then uses these scores to re-weight rewards in skill discovery algorithms.
Result: Successfully eliminates undesirable behaviors like flipping or rolling, avoids hazardous areas in both state-based and pixel-based tasks, and discovers skills with behaviors difficult to define.
Conclusion: FoG effectively incorporates human intentions into skill discovery through foundation models, enabling safer and more desirable skill learning while maintaining diversity.
Abstract: Learning diverse skills without hand-crafted reward functions could accelerate reinforcement learning in downstream tasks. However, existing skill discovery methods focus solely on maximizing the diversity of skills without considering human preferences, which leads to undesirable behaviors and possibly dangerous skills. For instance, a cheetah robot trained using previous methods learns to roll in all directions to maximize skill diversity, whereas we would prefer it to run without flipping or entering hazardous areas. In this work, we propose a Foundation model Guided (FoG) skill discovery method, which incorporates human intentions into skill discovery through foundation models. Specifically, FoG extracts a score function from foundation models to evaluate states based on human intentions, assigning higher values to desirable states and lower to undesirable ones. These scores are then used to re-weight the rewards of skill discovery algorithms. By optimizing the re-weighted skill discovery rewards, FoG successfully learns to eliminate undesirable behaviors, such as flipping or rolling, and to avoid hazardous areas in both state-based and pixel-based tasks. Interestingly, we show that FoG can discover skills involving behaviors that are difficult to define. Interactive visualisations are available from https://sites.google.com/view/submission-fog.
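The reward re-weighting step is simple enough to sketch: a frozen foundation model scores states for desirability, and the skill-discovery reward is multiplied by that score. In the sketch below, `preference_score` is a hypothetical stand-in for querying a foundation model (e.g., a VLM rating "the robot is upright and outside hazardous areas"); the state encoding and score values are illustrative.

```python
# Hedged sketch of FoG-style reward shaping: desirable states keep their
# diversity reward, undesirable ones have it attenuated or zeroed.

def preference_score(state) -> float:
    """Placeholder for a foundation-model score: higher for desirable states."""
    upright, in_hazard = state
    return (1.0 if upright else 0.1) * (0.0 if in_hazard else 1.0)

def reweighted_reward(skill_discovery_reward: float, state) -> float:
    # The agent stays diverse only within the region of human-preferred behavior.
    return skill_discovery_reward * preference_score(state)

print(reweighted_reward(1.0, (True, False)))   # desirable: full reward
print(reweighted_reward(1.0, (False, False)))  # flipped over: heavily discounted
print(reweighted_reward(1.0, (True, True)))    # hazardous area: zeroed
```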
[613] AUPO – Abstracted Until Proven Otherwise: A Reward Distribution Based Abstraction Algorithm
Robin Schmöcker, Alexander Dockhorn, Bodo Rosenhahn
Main category: cs.AI
TL;DR: AUPO is a novel drop-in modification to MCTS that uses reward distribution statistics to automatically abstract actions, outperforming standard MCTS and detecting symmetries that state-of-the-art frameworks miss.
Details
Motivation: To improve MCTS performance by automatically abstracting actions without requiring transition probabilities or directed acyclic graphs, addressing limitations of existing frameworks like ASAP.
Method: AUPO modifies MCTS decision policy using reward distribution statistics collected during search to automatically abstract actions, detecting symmetric actions even when resulting states are far apart.
Result: AUPO clearly outperforms MCTS on IPPC benchmark problems and can detect symmetric actions that state-of-the-art frameworks like ASAP struggle with.
Conclusion: AUPO provides an effective automatic action abstraction method that complements other tree search techniques and handles symmetry detection better than existing approaches.
Abstract: We introduce a novel, drop-in modification to Monte Carlo Tree Search’s (MCTS) decision policy that we call AUPO. Comparisons on a range of IPPC benchmark problems show that AUPO clearly outperforms MCTS. AUPO is an automatic action abstraction algorithm that relies solely on reward distribution statistics acquired during the search. Thus, unlike other automatic abstraction algorithms, AUPO requires neither access to transition probabilities nor a directed acyclic search graph to build its abstraction, allowing it to detect symmetric actions that state-of-the-art frameworks like ASAP struggle with when the resulting symmetric states are far apart in state space. Furthermore, as AUPO only affects the decision policy, it is not mutually exclusive with other abstraction techniques that only affect the tree search.
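The flavor of reward-distribution-based abstraction can be sketched as follows: actions whose empirical reward samples are statistically indistinguishable get merged into one abstract action that shares statistics. The specific test AUPO uses is not given in this summary; a two-sample z-style heuristic stands in for it, and all thresholds and sample data are illustrative.

```python
import math

# Hedged sketch: group actions with indistinguishable reward distributions.

def indistinguishable(r1: list[float], r2: list[float], z: float = 2.0) -> bool:
    m1, m2 = sum(r1) / len(r1), sum(r2) / len(r2)
    v1 = sum((x - m1) ** 2 for x in r1) / max(len(r1) - 1, 1)
    v2 = sum((x - m2) ** 2 for x in r2) / max(len(r2) - 1, 1)
    se = math.sqrt(v1 / len(r1) + v2 / len(r2)) or 1e-9   # guard zero variance
    return abs(m1 - m2) / se < z

def abstract_actions(samples: dict[str, list[float]]) -> list[set[str]]:
    """Greedily merge actions whose reward samples look alike."""
    groups: list[set[str]] = []
    for action, rewards in samples.items():
        for group in groups:
            rep = next(iter(group))                       # compare to a representative
            if indistinguishable(rewards, samples[rep]):
                group.add(action)
                break
        else:
            groups.append({action})
    return groups

# Two symmetric moves ("left"/"right") collapse into one abstract action.
print(abstract_actions({
    "left":  [1.0, 0.9, 1.1, 1.0],
    "right": [1.0, 1.1, 0.9, 1.0],
    "stay":  [0.1, 0.0, 0.2, 0.1],
}))
```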
[614] Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach
Alessandro Sestini, Joakim Bergdahl, Jean-Philippe Barrette-LaPierre, Florian Fuchs, Brady Chen, Micheal Jones, Linus Gisslén
Main category: cs.AI
TL;DR: A sample-efficient DRL method for training human-like game AI that improves training speed by 50% and outperforms built-in game AI by 10% in ball saving rate, with plans to replace hand-crafted AI in future game iterations.
Details
Motivation: DRL has rarely been adopted by the game industry because training super-human agents with large models is impractical. Game studios need human-like agents trainable with limited resources.
Method: Sample-efficient DRL method that improves value-based DRL by leveraging pre-collected data and increasing network plasticity, tailored for industrial game development.
Result: Goalkeeper agent in EA SPORTS FC 25 outperforms built-in AI by 10% in ball saving rate, trains 50% faster than standard DRL, and creates more human-like gameplay according to domain experts.
Conclusion: The method successfully addresses industry needs for efficient DRL training and is intended to replace hand-crafted AI in future game series iterations.
Abstract: While several high-profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employed by the game industry for crafting authentic AI behaviors. Previous research focuses on training super-human agents with large models, which is impractical for game studios with limited resources aiming for human-like agents. This paper proposes a sample-efficient DRL method tailored for training and fine-tuning agents in industrial settings such as the video game industry. Our method improves sample efficiency of value-based DRL by leveraging pre-collected data and increasing network plasticity. We evaluate our method by training a goalkeeper agent in EA SPORTS FC 25, one of the best-selling football simulations today. Our agent outperforms the game’s built-in AI by 10% in ball saving rate. Ablation studies show that our method trains agents 50% faster than standard DRL methods. Finally, qualitative evaluation from domain experts indicates that our approach creates more human-like gameplay compared to hand-crafted agents. As a testament to the impact of the approach, the method is intended to replace its hand-crafted counterpart in future iterations of the series.
[615] Accelerating IC Thermal Simulation Data Generation via Block Krylov and Operator Action
Hong Wang, Wenkai Yang, Jie Wang, Huanshuo Dong, Zijie Geng, Zhen Huang, Depeng Xie, Zhezheng Hao, Hande Dong
Main category: cs.AI
TL;DR: BlocKOA is a novel algorithm that accelerates IC thermal simulation data generation using block Krylov methods and operator actions, achieving 420x speedup while maintaining data quality for neural operator training.
Details
Motivation: Existing neural operators for IC thermal simulation require large amounts of high-fidelity training data, which incurs significant computational costs during data generation.
Method: Uses block Krylov algorithm on heat equation structure to get basic solutions, combines them to generate temperature distributions satisfying physical constraints, then applies heat operators to determine heat source distributions.
Result: 420x speedup in generating thermal simulation data for 5000 chips; data-driven approaches trained on BlocKOA data achieve comparable performance using only 4% of generation time compared to existing methods.
Conclusion: BlocKOA efficiently generates precise thermal simulation data with lower time complexity, enabling faster training of data-driven approaches for IC thermal analysis.
Abstract: Recent advances in data-driven approaches, such as neural operators (NOs), have shown substantial efficacy in reducing the solution time for integrated circuit (IC) thermal simulations. However, a limitation of these approaches is that they require a large amount of high-fidelity training data, such as chip parameters and temperature distributions, thereby incurring significant computational costs. To address this challenge, we propose a novel algorithm for the generation of IC thermal simulation data, named block Krylov and operator action (BlocKOA), which simultaneously accelerates the data generation process and enhances the precision of generated data. BlocKOA is specifically designed for IC applications. Initially, we use the block Krylov algorithm based on the structure of the heat equation to quickly obtain a few basic solutions. Then we combine them to get numerous temperature distributions that satisfy the physical constraints. Finally, we apply heat operators on these functions to determine the heat source distributions, efficiently generating precise data points. Theoretical analysis shows that the time complexity of BlocKOA is one order lower than that of the existing method. Experimental results further validate its efficiency, showing that BlocKOA achieves a 420-fold speedup in generating thermal simulation data for 5000 chips with varying physical parameters and IC structures. Even with just 4% of the generation time, data-driven approaches trained on the data generated by BlocKOA exhibit performance comparable to those trained using the existing method.
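The "operator action" trick is worth a small sketch: if a temperature field T is built by combining basis solutions, applying the discrete heat operator -k·Laplacian(T) directly yields the matching heat-source field Q, so each (Q, T) pair is a labeled training sample with no per-sample solver run. The grid size, conductivity, and smooth basis below are illustrative stand-ins for the block-Krylov basic solutions.

```python
import numpy as np

def laplacian(T: np.ndarray, h: float) -> np.ndarray:
    """Interior 5-point stencil; boundary rows/cols left at zero."""
    L = np.zeros_like(T)
    L[1:-1, 1:-1] = (T[2:, 1:-1] + T[:-2, 1:-1] + T[1:-1, 2:] + T[1:-1, :-2]
                     - 4 * T[1:-1, 1:-1]) / h**2
    return L

n, h, k = 64, 1.0 / 63, 1.0
x, y = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n), indexing="ij")

# A few smooth basis fields (stand-ins for block-Krylov basic solutions).
basis = [np.sin(np.pi * i * x) * np.sin(np.pi * j * y)
         for i, j in [(1, 1), (2, 1), (1, 3)]]

rng = np.random.default_rng(0)
for _ in range(3):                                   # each pass = one training sample
    coeffs = rng.normal(size=len(basis))
    T = sum(c * b for c, b in zip(coeffs, basis))    # temperature distribution
    Q = -k * laplacian(T, h)                         # heat source via operator action
    print(f"sample: max|T|={np.abs(T).max():.3f}, max|Q|={np.abs(Q).max():.1f}")
```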
[616] CNOT Minimal Circuit Synthesis: A Reinforcement Learning Approach
Riccardo Romanello, Daniele Lizzio Bosco, Jacopo Cossio, Dusan Sutulovic, Giuseppe Serra, Carla Piazza, Paolo Burelli
Main category: cs.AI
TL;DR: A reinforcement learning approach for CNOT gate minimization in quantum circuits, using a single agent trained on fixed-size circuits that generalizes to various circuit sizes through preprocessing techniques.
Details
Motivation: CNOT gates are fundamental for quantum computing and entanglement, but minimizing their number remains an open challenge with uncharacterized computational complexity, requiring efficient optimization methods.
Method: Uses a single reinforcement learning agent trained on fixed-size circuits (m=8), with preprocessing (embedding or Gaussian striping) to handle matrices of different sizes, evaluated on circuits ranging from n=3 to 15.
Result: The method outperforms state-of-the-art algorithms, with performance improvement increasing as circuit size n grows larger.
Conclusion: The reinforcement learning approach provides an effective solution for CNOT minimization that scales well and demonstrates superior performance compared to existing methods, particularly for larger quantum circuits.
Abstract: CNOT gates are fundamental to quantum computing, as they facilitate entanglement, a crucial resource for quantum algorithms. Certain classes of quantum circuits are constructed exclusively from CNOT gates. Given their widespread use, it is imperative to minimise the number of CNOT gates employed. This problem, known as CNOT minimisation, remains an open challenge, with its computational complexity yet to be fully characterised. In this work, we introduce a novel reinforcement learning approach to address this task. Instead of training multiple reinforcement learning agents for different circuit sizes, we use a single agent up to a fixed size $m$. Matrices of sizes different from $m$ are preprocessed using either embedding or Gaussian striping. To assess the efficacy of our approach, we trained an agent with $m = 8$ and evaluated it on matrices of size $n$ ranging from 3 to 15. The results show that our method outperforms the state-of-the-art algorithm as $n$ increases.
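For intuition on the embedding preprocessing: a CNOT-only circuit on n qubits corresponds to an invertible n×n matrix over GF(2), and when n < m it can be padded into the top-left block of an m×m identity so the fixed-size agent can process it. This minimal sketch assumes that padding scheme; the Gaussian-striping path for n > m is omitted.

```python
import numpy as np

def embed(A: np.ndarray, m: int) -> np.ndarray:
    """Pad an n x n GF(2) matrix (n <= m) into the top-left block of I_m."""
    n = A.shape[0]
    assert n <= m, "embedding only handles matrices up to the agent's size"
    M = np.eye(m, dtype=np.uint8)
    M[:n, :n] = A
    return M

# CNOT(control=0, target=1) on 3 qubits as a GF(2) matrix, embedded for m = 8.
A = np.eye(3, dtype=np.uint8)
A[1, 0] = 1                     # target row gets XORed with the control row
print(embed(A, 8))
```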
[617] Planning Ahead with RSA: Efficient Signalling in Dynamic Environments by Projecting User Awareness across Future Timesteps
Anwesha Das, John Duff, Jörg Hoffmann, Vera Demberg
Main category: cs.AI
TL;DR: A framework for adaptive AI signaling that optimizes message timing and specificity to maintain human situational awareness in dynamic environments using Bayesian reasoning and multi-step planning.
Details
Motivation: To improve human-AI collaboration in time-sensitive tasks by ensuring humans maintain accurate understanding of critical information while managing limited attention resources.
Method: Uses Bayesian reference resolution with Rational Speech Act (RSA) framework to plan message sequences that optimize belief alignment, adapting message specificity and timing based on user and scenario.
Result: Shows effectiveness depends on combining multi-step planning with realistic user awareness models, outperforming baseline methods.
Conclusion: Establishes theoretical foundations for pragmatic communication in human-agent teams, demonstrating how cognitive science insights can inform assistive agent design.
Abstract: Adaptive agent design offers a way to improve human-AI collaboration on time-sensitive tasks in rapidly changing environments. In such cases, to ensure the human maintains an accurate understanding of critical task elements, an assistive agent must not only identify the highest priority information but also estimate how and when this information can be communicated most effectively, given that human attention represents a zero-sum cognitive resource where focus on one message diminishes awareness of other or upcoming information. We introduce a theoretical framework for adaptive signalling which meets these challenges by using principles of rational communication, formalised as Bayesian reference resolution using the Rational Speech Act (RSA) modelling framework, to plan a sequence of messages which optimise timely alignment between user belief and a dynamic environment. The agent adapts message specificity and timing to the particulars of a user and scenario based on projections of how prior-guided interpretation of messages will influence attention to the interface and subsequent belief update, across several timesteps out to a fixed horizon. In a comparison to baseline methods, we show that effectiveness depends crucially on combining multi-step planning with a realistic model of user awareness. As the first application of RSA for communication in a dynamic environment, and for human-AI interaction in general, we establish theoretical foundations for pragmatic communication in human-agent teams, highlighting how insights from cognitive science can be capitalised on to inform the design of assistive agents.
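For readers unfamiliar with RSA, here is a one-step sketch of the underlying machinery (not the paper's full multi-timestep planner): a literal listener conditions on message truth, and a pragmatic speaker soft-maximizes informativeness minus message cost. The lexicon, costs, and rationality parameter alpha are invented for illustration.

```python
import numpy as np

lexicon = np.array([  # truth values: rows = messages, cols = world states
    [1, 1, 0],        # "fuel low"        true in states 0, 1
    [1, 0, 0],        # "fuel critical"   true only in state 0
    [0, 0, 1],        # "all nominal"     true only in state 2
], dtype=float)
cost = np.array([0.1, 0.2, 0.1])   # more specific messages cost more attention
alpha = 3.0                        # speaker rationality

L0 = lexicon / lexicon.sum(axis=1, keepdims=True)       # literal P(state | msg)
S1 = np.exp(alpha * (np.log(L0.T + 1e-12) - cost))      # speaker P(msg | state)
S1 /= S1.sum(axis=1, keepdims=True)

print("speaker policy per state:\n", S1.round(3))
# In state 0 the specific "fuel critical" wins despite its higher cost, because
# it shifts the listener's belief further than the vaguer "fuel low".
```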
[618] Opinion Mining Based Entity Ranking using Fuzzy Logic Algorithmic Approach
Pratik N. Kalamkar, A. G. Phakatkar
Main category: cs.AI
TL;DR: This paper proposes a method for opinion mining at a deeper granularity level using fuzzy logic reasoning to extract attributes and components from evaluative statements, then rank entities based on this fine-grained analysis.
Details
Motivation: With the growth of social networking and e-commerce sites, huge amounts of opinions are available online. While existing research deals with opinion mining and entity ranking, no previous work has combined fine-grained opinion classification with entity ranking.
Method: The method uses fuzzy logic reasoning to perform opinion mining from statements at a deeper level of granularity, extracting specific attributes and components that have been commented on.
Result: The approach enables classification of opinions into finer granularity levels (positive, negative, neutral) and subsequent ranking of entities based on this detailed analysis.
Conclusion: The proposed method successfully addresses the gap in existing research by combining fine-grained opinion mining using fuzzy logic with entity ranking, providing a more detailed analysis of opinions from online sources.
Abstract: Opinions are central to almost all human activities and are key influencers of our behavior. With the growth of social networking websites and the increasing number of e-commerce sites, huge amounts of opinions are now available on the web. Given a set of evaluative statements that contain opinions (or sentiments) about an entity, opinion mining aims to extract the attributes and components of the entity that have been commented on in each statement and to determine whether the comments are positive, negative, or neutral. While much research has recently been done in the field of opinion mining, some of it dealing with ranking entities based on review or opinion sets, classifying opinions at a finer level of granularity and then ranking entities has not been done before. In this paper, a method for opinion mining from statements at a deeper level of granularity is proposed. This is done using fuzzy logic reasoning, after which entities are ranked according to this information.
[619] Bid2X: Revealing Dynamics of Bidding Environment in Online Advertising from A Foundation Model Lens
Jiahao Ji, Tianyu Wang, Yeshu Li, Yushen Huo, Zhilin Zhang, Chuan Yu, Jian Xu, Bo Zheng
Main category: cs.AI
TL;DR: Bid2X is a bidding foundation model that learns a unified function to estimate advertising effects across different scenarios, using attention mechanisms and zero-inflated projection for better prediction accuracy.
Details
Motivation: Previous auto-bidding models lack generalizability across different bidding environments as they are tailored for specific scenarios. The authors aim to create a scenario-independent model that can learn fundamental bidding principles applicable across various advertising contexts.
Method: Built uniform series embeddings for heterogeneous data, used two attention mechanisms for inter-variable and temporal dependencies, implemented variable-aware fusion for adaptive prediction, and devised zero-inflated projection module with joint classification-regression optimization.
Result: Offline evaluation on eight datasets showed superiority over baselines and cross-scenario generality. Online A/B tests on Taobao platform increased GMV by 4.65% and ROI by 2.44%.
Conclusion: Bid2X successfully demonstrates the viability of bidding foundation models in computational advertising, providing a unified approach that generalizes across different bidding scenarios while significantly improving key performance metrics.
Abstract: Auto-bidding is crucial in facilitating online advertising by automatically providing bids for advertisers. While previous work has made great efforts to model bidding environments for better ad performance, it has limitations in generalizability across environments since these models are typically tailored for specific bidding scenarios. To this end, we approach the scenario-independent principles through a unified function that estimates the achieved effect under specific bids, such as budget consumption, gross merchandise volume (GMV), page views, etc. Then, we propose a bidding foundation model Bid2X to learn this fundamental function from data in various scenarios. Our Bid2X is built over uniform series embeddings that encode heterogeneous data through tailored embedding methods. To capture complex inter-variable and dynamic temporal dependencies in bidding data, we propose two attention mechanisms separately treating embeddings of different variables and embeddings at different times as attention tokens for representation learning. On top of the learned variable and temporal representations, a variable-aware fusion module is used to perform adaptive bidding outcome prediction. To model the unique bidding data distribution, we devise a zero-inflated projection module to incorporate the estimated non-zero probability into its value prediction, which makes up a joint optimization objective containing classification and regression. The objective is proven to converge to the zero-inflated distribution. Our model has been deployed on the ad platform in Taobao, one of the world’s largest e-commerce platforms. Offline evaluation on eight datasets exhibits Bid2X’s superiority compared to various baselines and its generality across different scenarios. Bid2X increased GMV by 4.65% and ROI by 2.44% in online A/B tests, paving the way for bidding foundation models in computational advertising.
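A zero-inflated projection of the kind described can be sketched in a few lines of PyTorch: one head classifies whether the outcome is non-zero, another regresses its value, and the prediction is their product, trained with a combined classification and regression loss. Dimensions, the unit loss weighting, and the toy data are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ZeroInflatedHead(nn.Module):
    """Hedged sketch of a zero-inflated projection head."""
    def __init__(self, d_model: int):
        super().__init__()
        self.nonzero_logit = nn.Linear(d_model, 1)   # P(outcome != 0)
        self.value = nn.Linear(d_model, 1)           # value given non-zero

    def forward(self, h: torch.Tensor):
        p = torch.sigmoid(self.nonzero_logit(h)).squeeze(-1)
        v = self.value(h).squeeze(-1)
        return p, v, p * v                            # expected outcome

head = ZeroInflatedHead(d_model=16)
h = torch.randn(8, 16)                                # fused representations
y = torch.tensor([0., 0., 3.2, 0., 1.1, 0., 0., 7.5]) # e.g. GMV, often zero

p, v, y_hat = head(h)
nonzero = (y != 0).float()
loss = (nn.functional.binary_cross_entropy(p, nonzero)                  # classification
        + ((v - y) ** 2 * nonzero).sum() / nonzero.sum().clamp(min=1))  # regression on non-zeros
loss.backward()
```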
[620] Causal Deep Q Network
Elouanes Khelifi, Amir Saki, Usef Faghihi
Main category: cs.AI
TL;DR: Integrating causal principles into DQNs using PEACE formula to reduce spurious correlations and enhance problem-solving capabilities.
Details
Motivation: DQNs rely on associative learning which leads to spurious correlations, hindering their problem-solving capabilities.
Method: Proposed framework integrates causal reasoning into DQNs using PEACE (Probabilistic Easy vAriational Causal Effect) formula for estimating causal effects during training.
Result: Experimental results show the approach outperforms conventional DQNs on standard benchmark environments, enhancing problem-solving capabilities without compromising performance.
Conclusion: The work presents a promising avenue for advancing deep reinforcement learning agents through principled causal inference.
Abstract: Deep Q Networks (DQN) have shown remarkable success in various reinforcement learning tasks. However, their reliance on associative learning often leads to the acquisition of spurious correlations, hindering their problem-solving capabilities. In this paper, we introduce a novel approach to integrate causal principles into DQNs, leveraging the PEACE (Probabilistic Easy vAriational Causal Effect) formula for estimating causal effects. By incorporating causal reasoning during training, our proposed framework enhances the DQN’s understanding of the underlying causal structure of the environment, thereby mitigating the influence of confounding factors and spurious correlations. We demonstrate that integrating DQNs with causal capabilities significantly enhances their problem-solving capabilities without compromising performance. Experimental results on standard benchmark environments showcase that our approach outperforms conventional DQNs, highlighting the effectiveness of causal reasoning in reinforcement learning. Overall, our work presents a promising avenue for advancing the capabilities of deep reinforcement learning agents through principled causal inference.
[621] What are the odds? Risk and uncertainty about AI existential risk
Marco Grossi
Main category: cs.AI
TL;DR: A commentary on AI existential risk models that critiques linear risk frameworks and introduces epistemic indifference, option uncertainty, and state-space uncertainty as crucial factors affecting probability estimates of AI disaster scenarios.
Details
Motivation: To highlight the philosophical limitations of linear risk models in AI existential risk analysis and provide a more nuanced understanding of uncertainty dimensions that affect probability estimates.
Method: Critical analysis of Swiss Cheese models, discussion of epistemic indifference, and distinction between risk and uncertainty with focus on option uncertainty and state-space uncertainty.
Result: Demonstrates that probability of AI disaster P(D) is higher than initially suggested due to structural relationships between risk layers and the impact of different uncertainty types.
Conclusion: Incorporating option and state-space uncertainty dimensions provides better understanding of AI existential risk likelihood and improves qualitative risk assessment frameworks.
Abstract: This work is a commentary on the article “AI Survival Stories: A Taxonomic Analysis of AI Existential Risk” (https://doi.org/10.18716/ojs/phai/2025.2801) by Cappelen, Goldstein, and Hawthorne. It is not just a commentary, though, but a useful reminder of the philosophical limitations of “linear” models of risk. The article focuses on the model employed by the authors: first, I discuss some differences between standard Swiss Cheese models and this one. I then argue that in a situation of epistemic indifference the probability P(D) is higher than one might first suggest, given the structural relationships between layers. I then distinguish between risk and uncertainty, and argue that any estimate of P(D) is structurally affected by two kinds of uncertainty: option uncertainty and state-space uncertainty. Incorporating these dimensions of uncertainty into our qualitative discussion of AI existential risk can provide a better understanding of the likelihood of P(D).
[622] Policy-Aware Generative AI for Safe, Auditable Data Access Governance
Shames Al Mandalawi, Muzakkiruddin Ahmed Mohammed, Hendrika Maclean, Mert Can Cakmak, John R. Talburt
Main category: cs.AI
TL;DR: A policy-aware controller using LLMs to interpret natural language requests against written policies, achieving high accuracy in access decisions with audit trails.
Details
Motivation: Enterprises need access decisions that satisfy least privilege, comply with regulations, and remain auditable.
Method: Six-stage reasoning framework using Google Gemini 2.0 Flash: context interpretation, user validation, data classification, business purpose test, compliance mapping, and risk synthesis with early hard policy gates and deny by default.
Result: Exact Decision Match improved from 10/14 to 13/14 (92.9%), DENY recall rose to 1.00, False Approval Rate dropped to 0, and Functional Appropriateness and Compliance Adherence at 14/14. Median latency under one minute.
Conclusion: Policy constrained LLM reasoning with explicit gates and audit trails can translate human readable policies into safe, compliant, and traceable machine decisions.
Abstract: Enterprises need access decisions that satisfy least privilege, comply with regulations, and remain auditable. We present a policy-aware controller that uses a large language model (LLM) to interpret natural language requests against written policies and metadata, not raw data. The system, implemented with Google Gemini 2.0 Flash, executes a six-stage reasoning framework (context interpretation, user validation, data classification, business purpose test, compliance mapping, and risk synthesis) with early hard policy gates and deny by default. It returns APPROVE, DENY, or CONDITIONAL, together with cited controls and a machine-readable rationale. We evaluate on fourteen canonical cases across seven scenario families using a privacy-preserving benchmark. Results show Exact Decision Match improving from 10/14 to 13/14 (92.9%) after applying policy gates, DENY recall rising to 1.00, False Approval Rate on must-deny families dropping to 0, and Functional Appropriateness and Compliance Adherence at 14/14. Expert ratings of rationale quality are high, and median latency is under one minute. These findings indicate that policy-constrained LLM reasoning, combined with explicit gates and audit trails, can translate human-readable policies into safe, compliant, and traceable machine decisions.
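The control flow, hard gates before any LLM reasoning with deny as the default, is the part worth sketching. The gate predicates, field names, and stub stage below are illustrative stand-ins for the six stages described in the paper, not its actual policy logic.

```python
# Hedged sketch of "early hard policy gates, deny by default, audit trail".

from dataclasses import dataclass

@dataclass
class Request:
    user_role: str
    dataset_tier: str       # e.g. "public", "internal", "restricted"
    purpose: str

HARD_GATES = [
    lambda r: r.dataset_tier != "restricted" or r.user_role == "data_steward",
    lambda r: bool(r.purpose.strip()),   # no stated business purpose -> deny
]

def decide(request: Request, llm_stages) -> tuple[str, list[str]]:
    audit: list[str] = []
    for i, gate in enumerate(HARD_GATES):          # early hard policy gates
        if not gate(request):
            audit.append(f"hard gate {i} failed")
            return "DENY", audit                   # deny by default, no LLM call
    for stage in llm_stages:                       # six-stage LLM reasoning
        verdict, rationale = stage(request)
        audit.append(rationale)
        if verdict in ("DENY", "CONDITIONAL"):
            return verdict, audit
    return "APPROVE", audit

# A stub stage standing in for, e.g., the compliance-mapping step.
stages = [lambda r: ("APPROVE", f"purpose '{r.purpose}' maps to permitted use")]
print(decide(Request("analyst", "internal", "quarterly churn report"), stages))
```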
[623] Human-AI Collaborative Uncertainty Quantification
Sima Noorani, Shayan Kiyani, George Pappas, Hamed Hassani
Main category: cs.AI
TL;DR: The paper introduces Human AI Collaborative Uncertainty Quantification, a framework that combines human expertise with AI predictions to create optimal prediction sets that avoid degrading correct human judgments while recovering outcomes humans miss.
Details
Motivation: Current AI lacks capabilities like domain knowledge, long horizon context, and physical world reasoning needed for robust decisions under uncertainty, motivating collaborative frameworks that combine human and AI strengths.
Method: Developed collaborative prediction sets with two-threshold structure based on conformal prediction, with offline and online calibration algorithms that provide distribution-free finite sample guarantees and adapt to distribution shifts including Human to AI Adaptation.
Result: Experiments across image classification, regression, and medical decision making show collaborative prediction sets consistently outperform either agent alone, achieving higher coverage and smaller set sizes.
Conclusion: The framework successfully formalizes human-AI collaboration in uncertainty quantification, providing practical algorithms with theoretical guarantees that demonstrate superior performance over individual human or AI decision making.
Abstract: AI predictive systems are increasingly embedded in decision-making pipelines, shaping high-stakes choices once made solely by humans. Yet robust decisions under uncertainty still rely on capabilities that current AI lacks: domain knowledge not captured by data, long-horizon context, and reasoning grounded in the physical world. This gap has motivated growing efforts to design collaborative frameworks that combine the complementary strengths of humans and AI. This work advances this vision by identifying the fundamental principles of Human AI collaboration within uncertainty quantification, a key component of reliable decision making. We introduce Human AI Collaborative Uncertainty Quantification, a framework that formalizes how an AI model can refine a human expert’s proposed prediction set with two goals: avoiding counterfactual harm, ensuring the AI does not degrade correct human judgments; and complementarity, enabling recovery of correct outcomes the human missed. At the population level, we show that the optimal collaborative prediction set follows an intuitive two-threshold structure over a single score function, extending a classical result in conformal prediction. Building on this insight, we develop practical offline and online calibration algorithms with provable distribution-free finite-sample guarantees. The online method adapts to distribution shifts, including human behavior evolving through interaction with AI, a phenomenon we call Human to AI Adaptation. Experiments across image classification, regression, and text-based medical decision making show that collaborative prediction sets consistently outperform either agent alone, achieving higher coverage and smaller set sizes across various conditions.
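The two-threshold structure is easy to illustrate: with a single score function, labels the human proposed are kept unless their score falls below a lax threshold (limiting counterfactual harm), and labels the human missed are added when their score clears a stricter one (complementarity). The threshold values and example scores below are made up; the paper calibrates the thresholds with conformal-style guarantees.

```python
# Hedged sketch of a two-threshold collaborative prediction set.

def collaborative_set(scores: dict[str, float],
                      human_set: set[str],
                      t_keep: float,
                      t_add: float) -> set[str]:
    assert t_keep <= t_add, "kept labels face a laxer threshold than added ones"
    kept = {y for y in human_set if scores[y] >= t_keep}       # avoid harm
    added = {y for y, s in scores.items() if s >= t_add}       # complementarity
    return kept | added

scores = {"flu": 0.55, "covid": 0.30, "cold": 0.08, "allergy": 0.07}
print(collaborative_set(scores, human_set={"flu", "cold"}, t_keep=0.05, t_add=0.5))
# -> {"flu", "cold"}: the human's picks survive, and no missed label scores
#    high enough to be added under these illustrative thresholds
```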
[624] Are Agents Just Automata? On the Formal Equivalence Between Agentic AI and the Chomsky Hierarchy
Roham Koohestani, Ziyou Li, Anton Podkopaev, Maliheh Izadi
Main category: cs.AI
TL;DR: This paper establishes a formal equivalence between AI agent architectures and Chomsky hierarchy automata, enabling principled agent design, formal verification, and safety guarantees.
Details
Motivation: To provide a principled methodology for right-sizing agent architectures and create pathways for formal verification of AI agent safety and predictability.
Method: Mapping AI agent memory architectures to corresponding automata classes: simple reflex agents to Finite Automata, hierarchical task-decomposition agents to Pushdown Automata, and reflective memory agents to Turing Machines.
Result: Developed an Automata-Agent Framework that classifies agents and delineates boundaries between verifiable and fundamentally undecidable systems, with extensions to probabilistic automata for LLM-based agents.
Conclusion: The framework enables formal verification, safety guarantees, and quantitative risk analysis for AI agents, with an agenda for developing static analysis tools and grammars for agentic frameworks.
Abstract: This paper establishes a formal equivalence between the architectural classes of modern agentic AI systems and the abstract machines of the Chomsky hierarchy. We posit that the memory architecture of an AI agent is the definitive feature determining its computational power and that it maps directly to a corresponding class of automaton. Specifically, we demonstrate that simple reflex agents are equivalent to Finite Automata, hierarchical task-decomposition agents are equivalent to Pushdown Automata, and agents employing readable/writable memory for reflection are equivalent to Turing Machines. This Automata-Agent Framework provides a principled methodology for right-sizing agent architectures to optimize computational efficiency and cost. More critically, it creates a direct pathway to formal verification and enables the application of mature techniques from automata theory to guarantee agent safety and predictability. By classifying agents, we can formally delineate the boundary between verifiable systems and those whose behavior is fundamentally undecidable. We address the inherent probabilistic nature of LLM-based agents by extending the framework to probabilistic automata that allow quantitative risk analysis. The paper concludes by outlining an agenda for developing static analysis tools and grammars for agentic frameworks.
[625] Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier
Hyeongseop Rha, Jeong Hun Yeo, Yeonju Kim, Yong Man Ro
Main category: cs.AI
TL;DR: Proposes Emotional Rationale Verifier (ERV) and Explanation Reward to improve consistency between emotion predictions and explanations in multimodal LLMs without architectural changes.
Details
Motivation: Current MLLMs generate emotion explanations that diverge from target labels and contradict their own predictions, risking misunderstanding and eroding reliability in interactive settings.
Method: Uses Emotional Rationale Verifier (ERV) and Explanation Reward to guide models to produce reasoning consistent with target emotions during multimodal emotion recognition, without model modifications or additional annotations.
Result: Significantly improves faithful explanation-prediction consistency and explanation emotion accuracy on MAFW and DFEW datasets through extensive experiments and human evaluations.
Conclusion: The approach enhances alignment between explanation and prediction, empowering MLLMs to deliver emotionally coherent, trustworthy interactions - a key step toward truly human-like HCI systems.
Abstract: The recent advancement of Multimodal Large Language Models (MLLMs) is transforming human-computer interaction (HCI) from surface-level exchanges into more nuanced and emotionally intelligent communication. To realize this shift, emotion understanding becomes essential, allowing systems to capture subtle cues underlying user intent. Furthermore, providing faithful explanations for predicted emotions is crucial to ensure interpretability and build user trust. However, current MLLM-based methods often generate emotion explanations that diverge from the target labels and sometimes even contradict their own predicted emotions. This inconsistency poses a critical risk for misunderstanding and erodes reliability in interactive settings. To address this, we propose a novel approach: the Emotional Rationale Verifier (ERV) and an Explanation Reward. Our method guides the model to produce reasoning that is explicitly consistent with the target emotion during multimodal emotion recognition without modifying the model architecture or requiring additional paired video-description annotations. Our method significantly improves faithful explanation-prediction consistency and explanation emotion accuracy on the MAFW and DFEW datasets. Through extensive experiments and human evaluations, we show that our approach not only enhances alignment between explanation and prediction but also empowers MLLMs to deliver emotionally coherent, trustworthy interactions, marking a key step toward truly human-like HCI systems.
[626] Toward Carbon-Neutral Human AI: Rethinking Data, Computation, and Learning Paradigms for Sustainable Intelligence
KC Santosh, Rodrigue Rizk, Longwei Wang
Main category: cs.AI
TL;DR: The paper proposes Human AI (HAI), a sustainable AI framework that uses incremental learning, carbon-aware optimization, and human collaboration to reduce environmental impact while maintaining performance.
Details
Motivation: Address environmental and ethical concerns from AI's computational demands by moving away from large-scale static datasets and monolithic training toward sustainable, human-inspired solutions.
Method: Introduces HAI framework with incremental learning, carbon-aware optimization, human-in-the-loop collaboration, dynamic architectures, and biological cognition parallels for continuous contextual learning.
Result: Proposes theoretical foundations and system design that enable AI to learn continuously while minimizing carbon footprints and human annotation costs.
Conclusion: HAI offers a pathway to responsible, human-centered AI by addressing challenges in active learning, continual adaptation, and energy-efficient deployment.
Abstract: The rapid advancement of Artificial Intelligence (AI) has led to unprecedented computational demands, raising significant environmental and ethical concerns. This paper critiques the prevailing reliance on large-scale, static datasets and monolithic training paradigms, advocating for a shift toward human-inspired, sustainable AI solutions. We introduce a novel framework, Human AI (HAI), which emphasizes incremental learning, carbon-aware optimization, and human-in-the-loop collaboration to enhance adaptability, efficiency, and accountability. By drawing parallels with biological cognition and leveraging dynamic architectures, HAI seeks to balance performance with ecological responsibility. We detail the theoretical foundations, system design, and operational principles that enable AI to learn continuously and contextually while minimizing carbon footprints and human annotation costs. Our approach addresses pressing challenges in active learning, continual adaptation, and energy-efficient model deployment, offering a pathway toward responsible, human-centered artificial intelligence.
[627] When No Paths Lead to Rome: Benchmarking Systematic Neural Relational Reasoning
Anirban Das, Irtaza Khalid, Rafael Peñaloza, Steven Schockaert
Main category: cs.AI
TL;DR: The paper introduces NoRA, a new benchmark for systematic relational reasoning that goes beyond path-based reasoning and adds multiple difficulty levels.
Details
Motivation: Existing benchmarks for systematic relational reasoning are overly simplified and assume reasoning can be reduced to composing relational paths, which limits model generalization to other settings.
Method: The authors propose NoRA, a new benchmark that requires models to go beyond path-based reasoning and includes several levels of difficulty to better evaluate systematic relational reasoning capabilities.
Result: NoRA provides a more challenging and comprehensive evaluation framework that can better assess the true systematic reasoning abilities of neural network models beyond simple path composition.
Conclusion: NoRA will support further progress in systematic relational reasoning by providing a more realistic benchmark that requires models to perform reasoning beyond path-based approaches.
Abstract: Designing models that can learn to reason in a systematic way is an important and long-standing challenge. In recent years, a wide range of solutions have been proposed for the specific case of systematic relational reasoning, including Neuro-Symbolic approaches, variants of the Transformer architecture, and specialised Graph Neural Networks. However, existing benchmarks for systematic relational reasoning focus on an overly simplified setting, based on the assumption that reasoning can be reduced to composing relational paths. In fact, this assumption is hard-baked into the architecture of several recent models, leading to approaches that can perform well on existing benchmarks but are difficult to generalise to other settings. To support further progress in the field of systematic relational reasoning with neural networks, we introduce NoRA, a new benchmark which adds several levels of difficulty and requires models to go beyond path-based reasoning.
[628] JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, Fei Yuan
Main category: cs.AI
TL;DR: JanusCode-800K is the largest multimodal code corpus enabling JanusCoder models to generate code from text, visual inputs, or both, achieving state-of-the-art performance across text and vision coding tasks.
Details
Motivation: To address the scarcity of high-quality multimodal code data that impedes progress in visual code intelligence applications like content generation and program-driven visualization editing.
Method: Developed a synthesis toolkit leveraging reciprocal synergies between data modalities to create JanusCode-800K corpus, then trained JanusCoder and JanusCoderV models as unified visual-programmatic interfaces.
Result: JanusCoder series (7B to 14B scale) demonstrates superior performance on text-centric and vision-centric coding tasks, approaching or exceeding commercial model performance.
Conclusion: The work provides key insights into harmonizing programmatic logic with visual expression and establishes an effective unified approach for multimodal code generation.
Abstract: The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints are available at https://github.com/InternLM/JanusCoder.
[629] OntoPret: An Ontology for the Interpretation of Human Behavior
Alexis Ellis, Stacie Severyn, Fjollë Novakazi, Hadi Banaee, Cogan Shimizu
Main category: cs.AI
TL;DR: OntoPret is an ontology for interpreting human behavior that bridges the gap between techno-centric robotic frameworks and descriptive behavioral ontologies, enabling real-time collaborative interpretation.
Details
Motivation: Address the research gap between robotic frameworks lacking nuanced human behavior models and behavioral ontologies not designed for real-time collaborative interpretation in human-machine teaming contexts like Industry 5.0.
Method: Developed OntoPret using cognitive science foundations and modular engineering methodology to create a formal, machine-processable framework for classifying behaviors including task deviations and deceptive actions.
Result: Demonstrated OntoPret’s adaptability across manufacturing and gameplay use cases, establishing semantic foundations for advanced reasoning about human intentions.
Conclusion: OntoPret provides a viable solution for enabling machines to safely and effectively interpret complex human behaviors in collaborative human-machine teaming scenarios.
Abstract: As human-machine teaming becomes central to paradigms like Industry 5.0, a critical need arises for machines to safely and effectively interpret complex human behaviors. A research gap currently exists between techno-centric robotic frameworks, which often lack nuanced models of human behavior, and descriptive behavioral ontologies, which are not designed for real-time, collaborative interpretation. This paper addresses this gap by presenting OntoPret, an ontology for the interpretation of human behavior. Grounded in cognitive science and a modular engineering methodology, OntoPret provides a formal, machine-processable framework for classifying behaviors, including task deviations and deceptive actions. We demonstrate its adaptability across two distinct use cases, manufacturing and gameplay, and establish the semantic foundations necessary for advanced reasoning about human intentions.
[630] ReCode: Unify Plan and Action for Universal Granularity Control
Zhaoyang Yu, Jiayi Zhang, Huixue Su, Yufan Zhao, Yifan Wu, Mingyi Deng, Jinyu Xiang, Yizhang Lin, Lingxiao Tang, Yingchao Li, Yuyu Luo, Bang Liu, Chenglin Wu
Main category: cs.AI
TL;DR: ReCode introduces a recursive code generation paradigm that unifies planning and action by treating high-level plans as abstract functions that are recursively decomposed into finer-grained sub-functions until primitive actions are reached.
Details
Motivation: Current LLM-based agents lack the ability to operate fluidly across decision granularities due to rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization.
Method: ReCode treats high-level plans as abstract placeholder functions and recursively decomposes them into finer-grained sub-functions until reaching primitive actions, using a unified code representation.
Result: Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training.
Conclusion: Unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control in AI agents.
Abstract: Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose ReCode (Recursive Code Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control. The code is available at https://github.com/FoundationAgents/ReCode.
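To make the decomposition loop concrete, here is a minimal sketch (illustrative only, not the authors' implementation); `llm_expand` is a hypothetical stand-in for the LLM call that expands a task one level, and the primitive action vocabulary is assumed:

```python
# Minimal sketch of ReCode-style recursive decomposition (illustrative only).
# `llm_expand` is a hypothetical helper: given a task description, it returns
# either primitive action strings or names of finer-grained subtasks.
PRIMITIVES = {"move", "pick", "place"}  # assumed primitive action vocabulary

def llm_expand(task: str) -> list[str]:
    """Placeholder for one LLM call that decomposes a task one level."""
    raise NotImplementedError

def recode(task: str, depth: int = 0, max_depth: int = 8) -> list[str]:
    """Recursively expand an abstract plan until only primitives remain."""
    if depth > max_depth:
        raise RuntimeError("decomposition did not bottom out")
    actions: list[str] = []
    for step in llm_expand(task):
        if step.split("(")[0] in PRIMITIVES:
            actions.append(step)  # primitive action: emit as-is
        else:
            # abstract placeholder function: recurse one level deeper
            actions.extend(recode(step, depth + 1, max_depth))
    return actions
```

Because every recursion level yields a training example at its own granularity, the same trace naturally supplies the multi-granularity data the abstract mentions.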
[631] Reduced AI Acceptance After the Generative AI Boom: Evidence From a Two-Wave Survey Study
Joachim Baumann, Aleksandra Urman, Ulrich Leicht-Deobald, Zachary J. Roman, Anikó Hannák, Markus Christen
Main category: cs.AI
TL;DR: The GenAI boom, particularly after ChatGPT’s launch, reduced public acceptance of AI and increased demand for human oversight in decision-making, while amplifying social inequalities.
Details
Motivation: To examine shifts in public attitudes toward AI before and after the ChatGPT launch, as organizations rapidly integrate AI without considering user preferences.
Method: Large-scale two-wave survey study (n_wave1=1514, n_wave2=1488) representative of the Swiss population.
Result: GenAI boom significantly associated with reduced AI acceptance (23% to 30% finding AI “not acceptable at all”) and increased demand for human oversight (support for human-only decision-making rose from 18% to 26%), while widening educational, linguistic, and gender gaps.
Conclusion: Findings challenge industry assumptions about public readiness for AI deployment and highlight the importance of aligning technological development with evolving public preferences.
Abstract: The rapid adoption of generative artificial intelligence (GenAI) technologies has led many organizations to integrate AI into their products and services, often without considering user preferences. Yet, public attitudes toward AI use, especially in impactful decision-making scenarios, are underexplored. Using a large-scale two-wave survey study (n_wave1=1514, n_wave2=1488) representative of the Swiss population, we examine shifts in public attitudes toward AI before and after the launch of ChatGPT. We find that the GenAI boom is significantly associated with reduced public acceptance of AI and increased demand for human oversight in various decision-making contexts. The proportion of respondents finding AI “not acceptable at all” increased from 23% to 30%, while support for human-only decision-making rose from 18% to 26%. These shifts have amplified existing social inequalities, widening educational, linguistic, and gender gaps post-boom. Our findings challenge industry assumptions about public readiness for AI deployment and highlight the critical importance of aligning technological development with evolving public preferences.
[632] Multi-Agent Evolve: LLM Self-Improve through Co-evolution
Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhan, Mostofa Patwary, Jiaxuan You
Main category: cs.AI
TL;DR: MAE is a multi-agent self-evolution framework that uses three interacting agents (Proposer, Solver, Judge) instantiated from a single LLM to enhance reasoning capabilities without human supervision, achieving 4.54% average improvement on benchmarks.
Details
Motivation: To overcome limitations of RL methods that rely on human-curated datasets and verifiable rewards, and extend self-play RL beyond grounded environments to general domains like mathematics, reasoning, and knowledge Q&A.
Method: Proposes Multi-Agent Evolve (MAE) framework with three agents: Proposer generates questions, Solver provides solutions, and Judge evaluates both. Uses reinforcement learning to optimize agent behaviors through co-evolution.
Result: Experiments on Qwen2.5-3B-Instruct show average improvement of 4.54% across multiple benchmarks, demonstrating enhanced reasoning capabilities.
Conclusion: MAE is a scalable, data-efficient method that significantly improves LLM reasoning abilities with minimal human supervision, applicable to diverse domains.
Abstract: Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs). However, the success of RL for LLMs heavily relies on human-curated datasets and verifiable rewards, which limit their scalability and generality. Recent Self-Play RL methods, inspired by the success of the paradigm in games such as Go, aim to enhance LLM reasoning capabilities without human-annotated data. However, these methods primarily depend on a grounded environment for feedback (e.g., a Python interpreter or a game engine); extending them to general domains remains challenging. To address these challenges, we propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A. The core design of MAE is based on a triplet of interacting agents (Proposer, Solver, Judge) that are instantiated from a single LLM, and applies reinforcement learning to optimize their behaviors. The Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both while co-evolving. Experiments on Qwen2.5-3B-Instruct demonstrate that MAE achieves an average improvement of 4.54% on multiple benchmarks. These results highlight MAE as a scalable, data-efficient method for enhancing the general reasoning abilities of LLMs with minimal reliance on human-curated supervision.
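A minimal sketch of one Proposer-Solver-Judge round follows (illustrative; `generate` is a hypothetical wrapper around the single shared LLM, and the RL reward plumbing that co-evolves the roles is omitted):

```python
# Illustrative sketch of one Proposer/Solver/Judge round (not the authors'
# code). All three roles are served by the same underlying base LLM, queried
# under different role prompts.
def generate(role: str, prompt: str) -> str:
    """Placeholder for one call to the shared base LLM under a role prompt."""
    raise NotImplementedError

def mae_round(topic: str) -> dict:
    question = generate("proposer", f"Write a challenging question about {topic}.")
    answer = generate("solver", f"Solve: {question}")
    verdict = generate("judge",
                       f"Question: {question}\nAnswer: {answer}\n"
                       "Rate question quality and answer correctness.")
    # In MAE, these signals would become rewards that update all three roles
    # of the one underlying policy via reinforcement learning.
    return {"question": question, "answer": answer, "verdict": verdict}
```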
[633] Alita-G: Self-Evolving Generative Agent for Agent Generation
Jiahao Qiu, Xuan Qi, Hongru Wang, Xinzhe Juan, Yimin Wang, Zelin Zhao, Jiayi Geng, Jiacheng Guo, Peihang Li, Jingzhe Shi, Shilong Liu, Mengdi Wang
Main category: cs.AI
TL;DR: ALITA-G is a self-evolution framework that transforms general-purpose agents into domain experts by systematically generating, abstracting, and curating Model Context Protocol (MCP) tools from successful task executions.
Details
Motivation: Current self-evolving agents are limited to prompt rewriting or failure retries, lacking systematic adaptation mechanisms to develop domain-specific expertise.
Method: The framework executes domain tasks, synthesizes candidate MCPs from successful trajectories, abstracts them to parameterized primitives, consolidates them into an MCP Box, and performs retrieval-augmented MCP selection at inference.
Result: ALITA-G achieves state-of-the-art results: 83.03% pass@1 and 89.09% pass@3 on GAIA validation, while reducing computation costs by approximately 15% in tokens per example.
Conclusion: ALITA-G provides a principled pathway from generalist capability to reusable domain-specific competence, improving both accuracy and efficiency on complex reasoning tasks.
Abstract: Large language models (LLMs) have been shown to perform better when scaffolded into agents with memory, tools, and feedback. Beyond this, self-evolving agents have emerged, but current work largely limits adaptation to prompt rewriting or failure retries. Therefore, we present ALITA-G, a self-evolution framework that transforms a general-purpose agent into a domain expert by systematically generating, abstracting, and curating Model Context Protocol (MCP) tools. In this framework, a generalist agent executes a curated suite of target-domain tasks and synthesizes candidate MCPs from successful trajectories. These are then abstracted to parameterized primitives and consolidated into an MCP Box. At inference time, ALITA-G performs retrieval-augmented MCP selection with the help of each tool’s descriptions and use cases, before executing an agent equipped with the MCP Executor. Across several benchmarks (GAIA, PathVQA, and Humanity’s Last Exam), ALITA-G attains strong gains while reducing computation costs. On GAIA validation, it achieves 83.03% pass@1 and 89.09% pass@3, establishing a new state-of-the-art result while reducing mean tokens per example by approximately 15% relative to a strong baseline agent. ALITA-G thus provides a principled pathway from generalist capability to reusable, domain-specific competence, improving both accuracy and efficiency on complex reasoning tasks.
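The retrieval-augmented selection step is essentially nearest-neighbor search over tool descriptions and use cases. A sketch under assumptions (a hypothetical `embed` model returning unit vectors; the MCP Box represented as a list of dicts):

```python
# Sketch of retrieval-augmented selection over an "MCP Box" (illustrative,
# not the paper's implementation).
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding model; returns an L2-normalized vector."""
    raise NotImplementedError

def select_mcps(task: str, mcp_box: list[dict], k: int = 3) -> list[dict]:
    """Return the k tools whose description + use cases best match the task."""
    q = embed(task)
    scored = sorted(
        mcp_box,
        key=lambda m: float(embed(m["description"] + " " + m["use_cases"]) @ q),
        reverse=True,
    )
    return scored[:k]  # these tools are handed to the MCP Executor
```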
[634] Faster Reinforcement Learning by Freezing Slow States
Yijia Wang, Daniel R. Jiang
Main category: cs.AI
TL;DR: The paper proposes a frozen-state approximation method for infinite horizon MDPs with fast-slow structure, where slow states are temporarily frozen during planning to reduce computational complexity while maintaining solution quality.
Details
Motivation: Many real-world problems like inventory control and dynamic pricing involve decisions at high frequencies over long horizons, with both rapidly changing (fast) and gradually changing (slow) state variables. Modeling these at natural frequencies leads to MDPs with discount factors close to 1, making them computationally challenging.
Method: A novel approximation strategy that freezes slow states during lower-level planning phases and applies value iteration to an auxiliary upper-level MDP on a slower timescale. This creates easier-to-solve lower-level problems and allows for more favorable discount factors.
Result: Theoretical analysis of regret incurred by the frozen-state approach provides insights on trading off regret versus computational cost. Empirical evaluation on three domains (inventory control, gridworld with spatial tasks, dynamic pricing) shows the method produces high-quality policies with significantly less computation.
Conclusion: The frozen-state methods effectively handle fast-slow MDPs, producing good policies with reduced computation, and demonstrate that simply omitting slow states is often a poor heuristic.
Abstract: We study infinite horizon Markov decision processes (MDPs) with “fast-slow” structure, where some state variables evolve rapidly (“fast states”) while others change more gradually (“slow states”). This structure commonly arises in practice when decisions must be made at high frequencies over long horizons, and where slowly changing information still plays a critical role in determining optimal actions. Examples include inventory control under slowly changing demand indicators or dynamic pricing with gradually shifting consumer behavior. Modeling the problem at the natural decision frequency leads to MDPs with discount factors close to one, making them computationally challenging. We propose a novel approximation strategy that “freezes” slow states during phases of lower-level planning and subsequently applies value iteration to an auxiliary upper-level MDP that evolves on a slower timescale. Freezing states for short periods of time leads to easier-to-solve lower-level problems, while a slower upper-level timescale allows for a more favorable discount factor. On the theoretical side, we analyze the regret incurred by our frozen-state approach, which leads to simple insights on how to trade off regret versus computational cost. Empirically, we benchmark our new frozen-state methods on three domains: (i) inventory control with fixed order costs, (ii) a gridworld problem with spatial tasks, and (iii) dynamic pricing with reference-price effects. We demonstrate that the new methods produce high-quality policies with significantly less computation, and we show that simply omitting slow states is often a poor heuristic.
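To make the lower-level computation concrete, the sketch below runs finite-horizon value iteration over the fast state with the slow state frozen; `r` and `p_fast` are hypothetical reward and transition models, and the upper-level MDP on the slower timescale (with effective discount gamma**T) is omitted:

```python
# Toy sketch of the frozen-state lower level (illustrative, not the paper's
# exact algorithm): T-step dynamic programming over fast states with the slow
# state held fixed.
def frozen_lower_level(slow, fast_states, actions, r, p_fast, gamma, T):
    """T-step DP over fast states with `slow` frozen at its current value."""
    V = {f: 0.0 for f in fast_states}
    for _ in range(T):
        V = {
            f: max(
                r(slow, f, a)
                + gamma * sum(p_fast(slow, f, a, f2) * V[f2] for f2 in fast_states)
                for a in actions
            )
            for f in fast_states
        }
    return V  # feeds the upper-level value iteration on the slow timescale
```

Because each frozen block spans T fast steps, the upper level only discounts by gamma**T per decision, which is the more favorable discount factor the abstract refers to.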
[635] Online POMDP Planning with Anytime Deterministic Optimality Guarantees
Moran Barenboim, Vadim Indelman
Main category: cs.AI
TL;DR: The paper presents a method to derive deterministic bounds between approximate and optimal solutions for discrete POMDPs, enabling certification of solution quality with minimal computational overhead.
Details
Motivation: Existing POMDP solvers typically provide only probabilistic or asymptotic guarantees, lacking deterministic bounds on solution quality for discrete POMDP problems.
Method: The authors derive deterministic relationships between approximated and optimal solutions for discrete POMDPs, creating bounds that can be attached to existing algorithms with certain structures.
Result: The derived bounds provide deterministic guarantees for solution quality with marginal computational overhead, and decision-making based on these guarantees can yield superior performance compared to original algorithms.
Conclusion: The proposed approach enables deterministic certification of POMDP solutions and can enhance algorithm performance while maintaining computational efficiency.
Abstract: Decision-making under uncertainty is a critical aspect of many practical autonomous systems due to incomplete information. Partially Observable Markov Decision Processes (POMDPs) offer a mathematically principled framework for formulating decision-making problems under such conditions. However, finding an optimal solution for a POMDP is generally intractable. In recent years, there has been significant progress in scaling approximate solvers from small to moderately sized problems, using online tree search solvers. Often, such approximate solvers are limited to probabilistic or asymptotic guarantees towards the optimal solution. In this paper, we derive a deterministic relationship for discrete POMDPs between an approximated and the optimal solution. We show that at any time, we can derive bounds that relate the existing solution to the optimal one. We show that our derivations provide an avenue for a new set of algorithms and can be attached to existing algorithms that have a certain structure to provide them with deterministic guarantees with marginal computational overhead. In return, not only do we certify the solution quality, but we demonstrate that making a decision based on the deterministic guarantee may result in superior performance compared to the original algorithm without the deterministic certification.
[636] GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability
Zihan Luo, Xiran Song, Hong Huang, Jianxun Lian, Chenhao Zhang, Jinqi Jiang, Xing Xie, Hai Jin
Main category: cs.AI
TL;DR: The paper introduces GraphInstruct, a dynamic benchmark for graph reasoning tasks, and develops GraphSolver and GraphSolver+ models through instruction-tuning and label-mask training to enhance LLMs’ graph understanding and reasoning capabilities.
Details
Motivation: To advance general intelligence by improving LLMs' ability to understand graph data, which is a common data structure in many real-world domains.
Method: 1) Created GraphInstruct benchmark with 21 graph reasoning tasks, diverse generation pipelines, and detailed reasoning steps. 2) Developed GraphSolver via instruction-tuning. 3) Proposed label-mask training strategy for GraphSolver+ to enhance multi-step reasoning by emphasizing node-identification signals.
Result: GraphSolver and GraphSolver+ demonstrate superior graph understanding capability compared to other open-sourced LLMs, with extensive experiments showing their superiority.
Conclusion: GraphInstruct facilitates research on applying LLMs to graph-structured data, and the proposed models significantly enhance graph understanding and reasoning abilities in LLMs.
Abstract: Improving the general capabilities of large language models (LLMs) is an active research topic. Since graphs are a common data structure in many real-world domains, understanding graph data is a crucial part of advancing general intelligence. To this end, we propose a dynamic benchmark named GraphInstruct in this paper, which comprehensively includes 21 classical graph reasoning tasks, providing diverse graph generation pipelines and detailed intermediate reasoning steps for each sample. Based on GraphInstruct, we develop GraphSolver via efficient instruction-tuning, which demonstrates prominent graph understanding capability compared to other open-sourced LLMs. To further endow LLMs with multi-step graph reasoning capability, we propose a label-mask training strategy and build GraphSolver+, which leverages masked supervision on intermediate reasoning tokens to emphasize crucial node-identification signals. As one of the pioneering efforts to enhance the graph understanding and reasoning abilities of LLMs, this work demonstrates through extensive experiments the superiority of GraphSolver and GraphSolver+ over other LLMs. We sincerely hope GraphInstruct will facilitate further research on applying LLMs to graph-structured data. Our code and data are released publicly at: https://github.com/CGCL-codes/GraphInstruct.
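One plausible reading of the label-mask strategy, sketched below under assumptions (the mask construction and exact weighting are not specified in the abstract): compute per-token cross-entropy but keep it only where a supervision mask flags node-identification tokens.

```python
# Sketch of a label-mask loss (an illustrative reading, not necessarily the
# paper's exact recipe): cross-entropy is kept only at positions flagged by
# the supervision mask, emphasizing node-identification tokens.
import torch
import torch.nn.functional as F

def label_mask_loss(logits: torch.Tensor,
                    targets: torch.Tensor,
                    supervision_mask: torch.Tensor) -> torch.Tensor:
    # logits: (seq, vocab); targets: (seq,); supervision_mask: (seq,) in {0, 1}
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-token loss
    masked = ce * supervision_mask            # zero out unflagged positions
    return masked.sum() / supervision_mask.sum().clamp(min=1)
```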
[637] Integrated Design and Governance of Agentic AI Systems through Adaptive Information Modulation
Qiliang Chen, Sepehr Ilami, Nunzio Lore, Babak Heydari
Main category: cs.AI
TL;DR: A framework integrating adaptive governance into sociotechnical systems by separating agent interaction networks from information flow networks, using RL-based governance to dynamically modulate information transparency and enhance cooperation.
Details
Motivation: Traditional governance approaches fail to address dynamic uncertainties in complex sociotechnical environments with autonomous LLM-based agents, where social dilemmas pit individual interests against collective welfare.
Method: System with strategic LLM-based agents in repeated interactions and an RL-based governing agent that dynamically adjusts information transparency by controlling what contextual/historical information each agent can access at each timestep.
Result: RL-based governance significantly enhances cooperation compared to static information-sharing baselines, while preserving agent autonomy without requiring direct structural interventions or payoff modifications.
Conclusion: Adaptive information governance through RL-based transparency modulation provides an effective approach for promoting cooperation in complex multi-agent sociotechnical systems with autonomous LLM agents.
Abstract: Modern engineered systems increasingly involve complex sociotechnical environments where multiple agents, including humans and the emerging paradigm of agentic AI powered by large language models, must navigate social dilemmas that pit individual interests against collective welfare. As engineered systems evolve toward multi-agent architectures with autonomous LLM-based agents, traditional governance approaches using static rules or fixed network structures fail to address the dynamic uncertainties inherent in real-world operations. This paper presents a novel framework that integrates adaptive governance mechanisms directly into the design of sociotechnical systems through a unique separation of agent interaction networks from information flow networks. We introduce a system comprising strategic LLM-based system agents that engage in repeated interactions and a reinforcement learning-based governing agent that dynamically modulates information transparency. Unlike conventional approaches that require direct structural interventions or payoff modifications, our framework preserves agent autonomy while promoting cooperation through adaptive information governance. The governing agent learns to strategically adjust information disclosure at each timestep, determining what contextual or historical information each system agent can access. Experimental results demonstrate that this RL-based governance significantly enhances cooperation compared to static information-sharing baselines.
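An illustrative timestep of the governed loop might look as follows (the `governor`, the agents, and the `cooperation_score` signal are hypothetical stand-ins, not the paper's implementation):

```python
# Illustrative timestep of the governed interaction loop (not the paper's
# code). The governor controls only what each agent sees, never its payoffs
# or the interaction structure.
def governed_step(governor, agents, history, cooperation_score):
    masks = governor.act(history)            # one disclosure mask per agent
    actions = {}
    for name, agent in agents.items():
        # Each agent sees only the historical events its mask allows.
        visible = [event for event, keep in zip(history, masks[name]) if keep]
        actions[name] = agent.respond(visible)
    governor.update(cooperation_score(actions))  # RL update for the governor
    history.append(actions)
    return actions
```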
[638] Learning to Better Search with Language Models via Guided Reinforced Self-Training
Seungyong Moon, Bumsoo Park, Hyun Oh Song
Main category: cs.AI
TL;DR: Guided-ReST is a fine-tuning method that uses optimal solutions as landmarks to guide language models’ search processes, improving reasoning capabilities on arithmetic and code tasks.
Details
Motivation: Language models struggle with complex reasoning and inefficient test-time compute when trained on noisy search traces. There's a need to improve search capabilities during inference.
Method: Fine-tuning algorithm that incorporates optimal solutions into search procedures to generate high-quality search traces, then distills improved search strategies into the model.
Result: Significantly enhanced search capabilities on arithmetic reasoning and code self-repair tasks including Countdown, CodeContests, and CodeForces.
Conclusion: Guided-ReST effectively improves language models’ search strategies by using optimal solutions as guidance, leading to better performance on complex reasoning tasks.
Abstract: While language models have shown remarkable performance across diverse tasks, they still encounter challenges in complex reasoning scenarios. Recent research suggests that language models trained on linearized search traces toward solutions, rather than solely on the final solutions, exhibit improved generalization, despite the search traces being potentially noisy or suboptimal. However, relying on such imperfect traces can result in inefficient use of test-time compute. To address this, we propose guided reinforced self-training (Guided-ReST), a fine-tuning algorithm designed to improve the model’s capability for effective search during inference. The key insight behind Guided-ReST is that optimal solutions can serve as valuable step-by-step landmarks to guide the model’s search process. Based on this insight, we introduce a novel data generation method that seamlessly incorporates optimal solutions into the model’s search procedure, enabling the generation of high-quality search traces. By fine-tuning the model on these search traces, we effectively distill improved search strategies into the model. Our method significantly enhances the search capabilities of language models on arithmetic reasoning and code self-repair tasks, including Countdown, CodeContests, and CodeForces. We release the source code at https://github.com/snu-mllab/guided-rest.
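One plausible reading of the landmark idea, sketched under assumptions (`propose_step` and `apply_step` are hypothetical model and environment hooks): when the model's own search stalls, inject the next step of a known optimal solution and let the search continue from there.

```python
# Sketch of landmark-guided trace generation (one illustrative reading of the
# data-generation idea, not the authors' exact procedure).
def guided_trace(propose_step, apply_step, optimal_steps, state, max_len=50):
    trace, k = [], 0
    for _ in range(max_len):
        step = propose_step(state)                # model's own search move
        if step is None and k < len(optimal_steps):
            step, k = optimal_steps[k], k + 1     # inject the next landmark
        if step is None:
            break                                 # nothing left to try
        trace.append(step)
        state = apply_step(state, step)
    return trace                                  # fine-tuning target
```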
[639] Diversified and Adaptive Negative Sampling on Knowledge Graphs
Ran Liu, Zhongzhou Liu, Xiaoli Li, Hao Wu, Yuan Fang
Main category: cs.AI
TL;DR: DANS is a generative adversarial approach for knowledge graph embedding that uses a two-way generator and adaptive mechanism to produce more diverse and fine-grained negative triplets for improved training.
Details
Motivation: Existing negative sampling methods in knowledge graph embedding often ignore diversity and adaptiveness, which reduces the informativeness of negative triplets and harms model performance.
Method: Proposes DANS with a two-way generator for diverse negative triplets and an adaptive mechanism that localizes the global generator for different entities/relations to produce fine-grained examples.
Result: Evaluated on three benchmark knowledge graphs, DANS demonstrates effectiveness through both quantitative and qualitative experiments.
Conclusion: DANS improves knowledge graph embedding by generating more informative negative triplets through enhanced diversity and adaptiveness in the sampling process.
Abstract: In knowledge graph embedding, aside from positive triplets (i.e., facts in the knowledge graph), the negative triplets used for training also have a direct influence on the model performance. In reality, since knowledge graphs are sparse and incomplete, negative triplets often lack explicit labels, and thus they are often obtained from various sampling strategies (e.g., randomly replacing an entity in a positive triplet). An ideal sampled negative triplet should be informative enough to help the model train better. However, existing methods often ignore diversity and adaptiveness in their sampling process, which harms the informativeness of negative triplets. As such, we propose a generative adversarial approach called Diversified and Adaptive Negative Sampling (DANS) on knowledge graphs. DANS is equipped with a two-way generator that generates more diverse negative triplets through two pathways, and an adaptive mechanism that produces more fine-grained examples by localizing the global generator for different entities and relations. On the one hand, the two-way generator increases the overall informativeness with more diverse negative examples; on the other hand, the adaptive mechanism increases the individual sample-wise informativeness with more fine-grained sampling. Finally, we evaluate the performance of DANS on three benchmark knowledge graphs to demonstrate its effectiveness through quantitative and qualitative experiments.
[640] Training-Free Safe Denoisers for Safe Use of Diffusion Models
Mingyu Kim, Dongjun Kim, Amman Yusuf, Stefano Ermon, Mijung Park
Main category: cs.AI
TL;DR: Training-free safe denoiser that modifies sampling trajectory using negation sets to avoid unsafe/copyrighted content without retraining diffusion models.
Details
Motivation: Address safety concerns in diffusion models by preventing generation of NSFW content, copyrighted material, and data that should be forgotten, without relying on text prompts or model retraining.
Method: Directly modify sampling trajectory using negation sets, formally derive relationship between safe and unsafe denoised samples to create safe denoiser that avoids specific data distribution regions.
Result: Successfully produces high-quality samples while avoiding negation areas in text-conditional, class-conditional, and unconditional image generation scenarios.
Conclusion: Training-free safe denoiser shows great potential for safer use of diffusion models by avoiding specific data regions without model retraining.
Abstract: There is growing concern over the safety of powerful diffusion models (DMs), as they are often misused to produce inappropriate, not-safe-for-work (NSFW) content or generate copyrighted material or data of individuals who wish to be forgotten. Many existing methods tackle these issues by heavily relying on text-based negative prompts or extensively retraining DMs to eliminate certain features or samples. In this paper, we take a radically different approach, directly modifying the sampling trajectory by leveraging a negation set (e.g., unsafe images, copyrighted data, or datapoints that need to be excluded) to avoid specific regions of the data distribution, without needing to retrain or fine-tune DMs. We formally derive the relationship between the expected denoised samples that are safe and those that are not safe, leading to our safe denoiser, which ensures its final samples are away from the area to be negated. Inspired by the derivation, we develop a practical algorithm that successfully produces high-quality samples while avoiding negation areas of the data distribution in text-conditional, class-conditional, and unconditional image generation scenarios. These results hint at the great potential of our training-free safe denoiser for using DMs more safely.
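One way to make the derived relationship concrete, under a simple mixture assumption (an illustrative reading, not necessarily the paper's exact estimator), is the law of total expectation over the negation region N:

```latex
% Split the posterior mean a denoiser estimates over the negation region N:
\mathbb{E}[x_0 \mid x_t]
  = p_N\,\mathbb{E}[x_0 \mid x_t,\, x_0 \in N]
  + (1 - p_N)\,\mathbb{E}[x_0 \mid x_t,\, x_0 \notin N],
  \quad p_N := P(x_0 \in N \mid x_t),
% so the "safe" posterior mean can be isolated from the standard denoiser D
% and its restriction D_N to the negation set:
D_{\mathrm{safe}}(x_t) = \frac{D(x_t) - p_N\,D_N(x_t)}{1 - p_N}.
```

Under this reading, steering the trajectory with the corrected posterior mean keeps final samples away from the negated region without touching the model weights.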
[641] Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals
Linda Zeng, Rithwik Gupta, Divij Motwani, Diji Yang, Yi Zhang
Main category: cs.AI
TL;DR: RAGuard is the first benchmark to evaluate RAG system robustness against misleading retrievals using real-world misinformation from Reddit, showing LLMs perform worse than zero-shot baselines when exposed to misleading evidence.
Details
Motivation: Existing RAG benchmarks use clean or synthetically perturbed data, failing to reflect real-world conditions where information is polarized and misleading, leading to overestimated performance.
Method: Constructed a fact-checking dataset from Reddit discussions, categorizing retrieved evidence into supporting, misleading, and unrelated types to create realistic test scenarios.
Result: All tested LLM-powered RAG systems performed worse than their zero-shot baselines when exposed to misleading retrievals, while human annotators consistently performed better.
Conclusion: RAGuard highlights LLMs’ susceptibility to noisy environments and provides the first systematic benchmark for assessing RAG robustness against misleading evidence, driving future research toward more reliable real-world applications.
Abstract: Retrieval-augmented generation (RAG) has shown impressive capabilities in mitigating hallucinations in large language models (LLMs). However, LLMs struggle to maintain consistent reasoning when exposed to misleading or conflicting evidence, especially in real-world domains such as politics, where information is polarized or selectively framed. Mainstream RAG benchmarks evaluate models under clean retrieval settings, where systems generate answers from gold-standard documents, or under synthetically perturbed settings, where documents are artificially injected with noise. These assumptions fail to reflect real-world conditions, often leading to an overestimation of RAG system performance. To address this gap, we introduce RAGuard, the first benchmark to evaluate the robustness of RAG systems against misleading retrievals. Unlike prior benchmarks that rely on synthetic noise, our fact-checking dataset captures naturally occurring misinformation by constructing its retrieval corpus from Reddit discussions. It categorizes retrieved evidence into three types: supporting, misleading, and unrelated, providing a realistic and challenging testbed for assessing how well RAG systems navigate different types of evidence. Our experiments reveal that, when exposed to potentially misleading retrievals, all tested LLM-powered RAG systems perform worse than their zero-shot baselines (i.e., no retrieval at all), while human annotators consistently perform better, highlighting LLMs’ susceptibility to noisy environments. To our knowledge, RAGuard is the first benchmark to systematically assess the robustness of RAG against misleading evidence. We expect this benchmark to drive future research toward improving RAG systems beyond idealized datasets, making them more reliable for real-world applications. The dataset is available at https://huggingface.co/datasets/UCSC-IRKM/RAGuard.
[642] Why Do Multi-Agent LLM Systems Fail?
Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
Main category: cs.AI
TL;DR: The paper introduces MAST-Data, a comprehensive dataset of 1600+ annotated traces from 7 MAS frameworks, and MAST, the first Multi-Agent System Failure Taxonomy, to systematically analyze why multi-agent LLM systems fail.
Details
Motivation: Despite enthusiasm for Multi-Agent LLM Systems, their performance gains on benchmarks are often minimal, highlighting a critical need for principled understanding of why these systems fail and systematic identification of failure patterns.
Method: Developed MAST through rigorous analysis of 150 traces with expert human annotators, creating a taxonomy with 14 failure modes clustered into 3 categories. Built an LLM-as-a-Judge pipeline for scalable annotation and analyzed failure patterns across models and tasks.
Result: Created MAST-Data with 1600+ annotated traces, achieved high inter-annotator agreement (kappa=0.88), identified 14 failure modes in 3 categories, and demonstrated improvement headrooms from better MAS design across different models and tasks.
Conclusion: The analysis reveals that identified failures require sophisticated solutions, providing a clear roadmap for future MAS research. The authors publicly release MAST-Data, MAST taxonomy, and LLM annotator to facilitate widespread MAS development.
Abstract: Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (kappa = 0.88). This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating improvement headrooms from better MAS design. Our analysis provides insights revealing that identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST, and our LLM annotator to facilitate widespread research and development in MAS.
[643] Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Simulated Annealing
Vishnu Asutosh Dasu, Md Rafi ur Rashid, Vipul Gupta, Saeid Tizpaz-Niari, Gang Tan
Main category: cs.AI
TL;DR: This paper introduces Attention Pruning, a post-processing method using surrogate models and simulated annealing to selectively prune attention heads in LLMs to reduce bias while maintaining utility.
Details
Motivation: LLMs encode societal biases from training data, and post-processing techniques like pruning attention heads offer a feasible approach to improve fairness without expensive retraining.
Method: Uses surrogate deep neural networks to model the relationship between attention head states and fairness/utility metrics, then applies randomized simulated annealing to efficiently identify optimal subsets of attention heads for pruning.
Result: Achieves up to 40% reduction in gender bias and outperforms state-of-the-art bias mitigation strategies.
Conclusion: Attention Pruning provides an effective post-processing approach to reduce bias in LLMs while minimizing impact on model utility.
Abstract: This paper explores pruning attention heads as a post-processing bias mitigation method for large language models (LLMs). Modern AI systems such as LLMs are expanding into sensitive social contexts where fairness concerns become especially crucial. Since LLMs develop decision-making patterns by training on massive datasets of human-generated content, they naturally encode and perpetuate societal biases. While modifying training datasets and algorithms is expensive and requires significant resources, post-processing techniques, such as selectively deactivating neurons and attention heads in pre-trained LLMs, can provide feasible and effective approaches to improve fairness. However, identifying the optimal subset of parameters to prune presents a combinatorial challenge within LLMs’ immense parameter space, requiring solutions that efficiently balance competing objectives across the frontiers of model fairness and utility. To address the computational challenges, we explore a search-based program repair approach via randomized simulated annealing. Given the prohibitive evaluation costs in billion-parameter LLMs, we develop surrogate deep neural networks that efficiently model the relationship between attention head states (active/inactive) and their corresponding fairness/utility metrics. This allows us to perform optimization over the surrogate models and efficiently identify optimal subsets of attention heads for selective pruning rather than directly searching through the LLM parameter space. This paper introduces Attention Pruning, a fairness-aware surrogate simulated annealing approach to prune attention heads in LLMs that disproportionately contribute to bias while minimally impacting overall model utility. Our experiments show that Attention Pruning achieves up to 40% reduction in gender bias and outperforms the state-of-the-art bias mitigation strategies.
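The search layer is standard simulated annealing over a binary head mask, scored by the surrogate rather than the full LLM. A sketch under assumptions (the trade-off weight, cooling schedule, and surrogate signature are illustrative, not the paper's settings):

```python
# Sketch of surrogate-guided simulated annealing over attention-head masks
# (illustrative). `surrogate(mask)` is a trained network predicting a
# (bias, utility) pair for a given pattern of active/inactive heads.
import math
import random

def anneal_heads(surrogate, n_heads, steps=5000, t0=1.0, lam=0.5):
    def score(m):
        bias, utility = surrogate(m)
        return utility - lam * bias        # reward utility, penalize bias

    mask = [1] * n_heads                   # start with every head active
    cur = best = score(mask)
    best_mask = mask[:]
    for i in range(steps):
        temp = max(t0 * (1 - i / steps), 1e-9)   # linear cooling schedule
        cand = mask[:]
        cand[random.randrange(n_heads)] ^= 1     # flip one head on/off
        s = score(cand)
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if s > cur or random.random() < math.exp((s - cur) / temp):
            mask, cur = cand, s
            if s > best:
                best, best_mask = s, cand[:]
    return best_mask
```

Because each candidate is scored by the surrogate instead of a billion-parameter forward pass, thousands of annealing steps stay cheap.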
[644] LLMs as Planning Formalizers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models
Marcus Tantakoun, Xiaodan Zhu, Christian Muise
Main category: cs.AI
TL;DR: Survey paper analyzing how LLMs can be used to formalize planning specifications to support automated planning systems, addressing challenges in long-horizon planning problems.
Details
Motivation: LLMs struggle with long-horizon planning requiring structured reasoning, creating interest in neuro-symbolic approaches between Automated Planning and NLP communities, but optimal deployment frameworks are challenging to identify.
Method: Systematic survey and in-depth analysis of current research, positioning LLMs as tools for formalizing and refining planning specifications to support off-the-shelf AP planners.
Result: Identifies methodologies, critical challenges, and future directions in integrating LLMs with automated planning systems.
Conclusion: Contributes to joint research on NLP and Automated Planning by providing comprehensive analysis and highlighting key research directions for reliable planning systems.
Abstract: Large Language Models (LLMs) excel in various natural language tasks but often struggle with long-horizon planning problems requiring structured reasoning. This limitation has drawn interest in integrating neuro-symbolic approaches within the Automated Planning (AP) and Natural Language Processing (NLP) communities. However, identifying optimal AP deployment frameworks can be daunting and introduces new challenges. This paper aims to provide a timely survey of the current research with an in-depth analysis, positioning LLMs as tools for formalizing and refining planning specifications to support reliable off-the-shelf AP planners. By systematically reviewing the current state of research, we highlight methodologies, and identify critical challenges and future directions, hoping to contribute to the joint research on NLP and Automated Planning.
[645] QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?
Belinda Z. Li, Been Kim, Zi Wang
Main category: cs.AI
TL;DR: The paper introduces QuestBench, a benchmark for evaluating LLMs’ ability to identify minimal necessary questions to solve underspecified reasoning problems, showing that current models struggle with logical and planning tasks despite excelling at math problems.
Details
Motivation: Real-world queries are often underspecified and require acquiring missing information, but current LLM evaluations mostly assume well-defined tasks. The authors aim to formalize and evaluate LLMs' information-gathering capabilities.
Method: Formalized information-gathering as a constraint satisfaction problem with missing variables. Created QuestBench with four types of underspecified reasoning tasks requiring at most one question: Logic-Q (logical reasoning), Planning-Q (PDDL planning), GSM-Q (grade school math), and GSME-Q (equation-based math). Models must select correct clarification questions from multiple options.
Result: Current LLMs achieve 40-50% accuracy on Logic-Q and Planning-Q, but excel at GSM-Q and GSME-Q. Analysis shows solving well-specified problems doesn’t guarantee success on underspecified versions - models struggle to identify the right question even when they can solve the fully specified problem.
Conclusion: LLMs need specific optimization for information acquisition capabilities, as current reasoning abilities don’t automatically translate to effective question-asking in underspecified scenarios.
Abstract: Large language models (LLMs) have shown impressive performance on reasoning benchmarks like math and logic. While many works have largely assumed well-defined tasks, real-world queries are often underspecified and only solvable by acquiring missing information. We formalize this information-gathering problem as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case where only one necessary variable assignment is missing, we can evaluate an LLM’s ability to identify the minimal necessary question to ask. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with partially-observed initial states, (3) GSM-Q: human-annotated grade school math problems with one unknown variable, and (4) GSME-Q: equation-based version of GSM-Q. The LLM must select the correct clarification question from multiple options. While current models excel at GSM-Q and GSME-Q, they achieve only 40-50% accuracy on Logic-Q and Planning-Q. Analysis shows that the ability to solve well-specified reasoning problems is not sufficient for success on our benchmark: models struggle to identify the right question even when they can solve the fully specified version. This highlights the need for specifically optimizing models’ information acquisition capabilities.
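A toy illustration of the one-missing-variable framing (not benchmark code): with constraints z = x + y and y = 2, the target z is determined once x is supplied, so the minimal clarification question asks for x.

```python
# Toy illustration of the CSP-with-one-gap framing used by QuestBench
# (illustrative, not the benchmark's code).
def minimal_question(required_vars, known):
    missing = [v for v in required_vars if v not in known]
    assert len(missing) == 1, "QuestBench instances leave exactly one gap"
    return f"What is the value of {missing[0]}?"

known = {"y": 2}                               # partial assignment; x missing
print(minimal_question(("x", "y"), known))     # -> What is the value of x?
```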
[646] Antidistillation Sampling
Yash Savani, Asher Trockman, Zhili Feng, Yixuan Even Xu, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter
Main category: cs.AI
TL;DR: Antidistillation sampling modifies next-token probability distributions to poison reasoning traces, making them less effective for model distillation while preserving model utility.
Details
Motivation: Frontier models generate rich reasoning traces that can be exploited for distillation, creating a vulnerability that model owners want to mitigate without compromising performance.
Method: Strategic modification of a model’s next-token probability distribution to poison reasoning traces.
Result: Renders reasoning traces significantly less effective for distillation while preserving the model’s practical utility.
Conclusion: Antidistillation sampling provides a capability to limit distillation effectiveness without compromising model performance.
Abstract: Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling provides exactly this capability. By strategically modifying a model’s next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model’s practical utility. For further details, see https://antidistillation.com.
[647] Scaling Laws For Scalable Oversight
Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark
Main category: cs.AI
TL;DR: The paper proposes a framework to quantify scalable oversight success probability based on capability mismatches between AI systems, models oversight as games with Elo scores, validates with Nim, applies to four oversight games, and analyzes Nested Scalable Oversight optimization.
Details
Motivation: To address the unclear scalability of scalable oversight, the process where weaker AI systems supervise stronger ones, which is crucial for controlling future superintelligent systems.
Method: Models oversight as a game between capability-mismatched players with oversight-specific Elo scores as piecewise-linear functions of general intelligence. Validates with modified Nim game and applies to Mafia, Debate, Backdoor Code, and Wargames oversight games.
Result: Found scaling laws for domain performance dependence on AI capability. For Nested Scalable Oversight, identified success conditions and optimal oversight levels. At 400 Elo gap, success rates were: 13.5% (Mafia), 51.7% (Debate), 10.0% (Backdoor Code), 9.4% (Wargames), declining with stronger systems.
Conclusion: The framework successfully quantifies scalable oversight scalability, revealing varying success rates across different oversight games and providing theoretical foundations for optimizing nested oversight structures.
Abstract: Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales. To address this gap, we propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system being overseen. Specifically, our framework models oversight as a game between capability-mismatched players; the players have oversight-specific Elo scores that are a piecewise-linear function of their general intelligence, with two plateaus corresponding to task incompetence and task saturation. We validate our framework with a modified version of the game Nim and then apply it to four oversight games: Mafia, Debate, Backdoor Code and Wargames. For each game, we find scaling laws that approximate how domain performance depends on general AI system capability. We then build on our findings in a theoretical study of Nested Scalable Oversight (NSO), a process in which trusted models oversee untrusted stronger models, which then become the trusted models in the next step. We identify conditions under which NSO succeeds and derive numerically (and in some cases analytically) the optimal number of oversight levels to maximize the probability of oversight success. We also apply our theory to our four oversight games, where we find that NSO success rates at a general Elo gap of 400 are 13.5% for Mafia, 51.7% for Debate, 10.0% for Backdoor Code, and 9.4% for Wargames; these rates decline further when overseeing stronger systems.
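The Elo machinery referenced above is the standard logistic win model; in the sketch below, the piecewise-linear map from general capability to an oversight-specific Elo uses illustrative plateau parameters, not values fitted in the paper:

```python
# Standard Elo win probability plus an illustrative piecewise-linear map from
# general capability g to an oversight-specific Elo, with the two plateaus
# (task incompetence, task saturation) described in the abstract.
def elo_win_prob(elo_overseer: float, elo_overseen: float) -> float:
    """P(overseer wins) under the usual logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((elo_overseen - elo_overseer) / 400))

def oversight_elo(g, lo=1000.0, hi=2000.0, g0=0.2, g1=0.8):
    """Flat below g0 (incompetence), flat above g1 (saturation); parameters assumed."""
    if g <= g0:
        return lo
    if g >= g1:
        return hi
    return lo + (hi - lo) * (g - g0) / (g1 - g0)

print(round(elo_win_prob(1400, 1800), 3))  # 400-point deficit -> ~0.091
```

A 400-point deficit alone already implies a win probability of 1/11, about 9.1%, in the same ballpark as the Backdoor Code and Wargames rates quoted above.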
[648] GVPO: Group Variance Policy Optimization for Large Language Model Post-Training
Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song, Dingqian Hong, Hui Xiong
Main category: cs.AI
TL;DR: GVPO is a new post-training method that ensures stable training by incorporating KL-constrained reward maximization directly into gradient weights, providing theoretical guarantees and flexible sampling.
Details
Motivation: Existing post-training methods like GRPO suffer from training instability despite achieving superior performance, limiting their practical adoption.
Method: GVPO incorporates the analytical solution to KL-constrained reward maximization directly into gradient weights, ensuring alignment with optimal policy. It uses flexible sampling distributions to avoid on-policy and importance sampling limitations.
Result: GVPO guarantees a unique optimal solution that exactly matches the KL-constrained reward maximization objective and provides stable training.
Conclusion: GVPO establishes a new paradigm for reliable and versatile LLM post-training by unifying theoretical guarantees with practical adaptability.
Abstract: Post-training plays a crucial role in refining and aligning large language models to meet specific tasks and human preferences. While recent advancements in post-training techniques, such as Group Relative Policy Optimization (GRPO), leverage increased sampling with relative reward scoring to achieve superior performance, these methods often suffer from training instability that limits their practical adoption. As a next step, we present Group Variance Policy Optimization (GVPO). GVPO incorporates the analytical solution to KL-constrained reward maximization directly into its gradient weights, ensuring alignment with the optimal policy. The method provides intuitive physical interpretations: its gradient mirrors the mean squared error between the central distance of implicit rewards and that of actual rewards. GVPO offers two key advantages: (1) it guarantees a unique optimal solution, exactly the KL-constrained reward maximization objective; (2) it supports flexible sampling distributions that avoid on-policy and importance sampling limitations. By unifying theoretical guarantees with practical adaptability, GVPO establishes a new paradigm for reliable and versatile LLM post-training.
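The stated physical interpretation suggests a per-group objective along the following lines (a hedged sketch, where the implicit reward is taken to be the usual beta-scaled log-probability ratio against a reference policy; GVPO's exact estimator may differ):

```python
# Sketch of the interpretation stated above (illustrative): per group of
# sampled responses, center both the implicit and the actual rewards, then
# penalize their squared mismatch.
import torch

def gvpo_style_loss(logp: torch.Tensor,
                    logp_ref: torch.Tensor,
                    rewards: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    # logp, logp_ref, rewards: (group_size,) for one prompt's sampled responses
    implicit = beta * (logp - logp_ref)     # implicit rewards
    d_imp = implicit - implicit.mean()      # central distance, implicit
    d_act = rewards - rewards.mean()        # central distance, actual
    return ((d_imp - d_act) ** 2).mean()    # MSE between the two distances
```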
[649] Lost in Transmission: When and Why LLMs Fail to Reason Globally
Tobias Schnabel, Kiran Tomlinson, Adith Swaminathan, Jennifer Neville
Main category: cs.AI
TL;DR: The paper introduces Bounded Attention Prefix Oracle (BAPO) to model bandwidth constraints in LLM attention heads, showing that complex reasoning tasks require high communication bandwidth and fail on BAPO-hard problems, while chain-of-thought can make them BAPO-easy.
Details
Motivation: Transformer-based LLMs struggle with complex reasoning tasks that require information flow across large inputs, which the authors attribute to capacity limits on attention head communication bandwidth.
Method: Introduce the Bounded Attention Prefix Oracle (BAPO) computational framework to model bandwidth constraints on attention heads, analyze reasoning problems for BAPO-hardness, and test GPT-4o, Claude, and Gemini on BAPO-easy vs BAPO-hard tasks.
Result: Experiments show LLMs succeed on BAPO-easy tasks but fail on even small BAPO-hard tasks. Theoretically prove that chain-of-thought can transform BAPO-hard problems into BAPO-easy ones.
Conclusion: BAPO provides principled explanations for LLM reasoning failures and suggests directions for new architectures and inference methods to overcome bandwidth limitations.
Abstract: Despite their many successes, transformer-based large language models (LLMs) continue to struggle with tasks that require complex reasoning over large parts of their input. We argue that these failures arise due to capacity limits on the accurate flow of information within LLMs. To formalize this issue, we introduce the bounded attention prefix oracle (BAPO) model, a new computational framework that models bandwidth constraints on attention heads, the mechanism for internal communication in LLMs. We show that several important reasoning problems like graph reachability require high communication bandwidth for BAPOs to solve; we call these problems BAPO-hard. Our experiments corroborate our theoretical predictions: GPT-4o, Claude, and Gemini succeed on BAPO-easy tasks and fail even on relatively small BAPO-hard tasks. BAPOs also reveal another benefit of chain of thought (CoT): we prove that breaking down a task using CoT can turn any BAPO-hard problem into a BAPO-easy one. Our results offer principled explanations for key LLM failures and suggest directions for architectures and inference methods that mitigate bandwidth limits.
[650] Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong
Main category: cs.AI
TL;DR: OSWorld-G is a comprehensive GUI grounding benchmark with 564 annotated samples, and Jedi is a 4M-example dataset that enables multi-scale models to outperform existing approaches and enhance agent capabilities on complex computer tasks.
Details
Motivation: Current GUI grounding benchmarks oversimplify tasks as short referring expressions, failing to capture real-world complexity requiring software commonsense, layout understanding, and fine-grained manipulation.
Method: Introduce OSWorld-G benchmark with 564 annotated samples across diverse task types, and synthesize Jedi dataset with 4M examples through multi-perspective task decoupling. Train multi-scale models on Jedi.
Result: Models trained on Jedi outperform existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and OSWorld-G. Improves agent capabilities from 5% to 27% on OSWorld complex tasks. Enables compositional generalization to novel interfaces.
Conclusion: Jedi dataset effectively addresses GUI grounding limitations, demonstrating improved performance and enhanced agentic capabilities through specialized data for different interface elements.
Abstract: Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.
[651] Guarded Query Routing for Large Language Models
Richard Šléher, William Brach, Tibor Sloboda, Kristián Košťál, Lukas Galke
Main category: cs.AI
TL;DR: The paper introduces GQR-Bench for guarded query routing, comparing various methods and finding WideMLP offers the best accuracy-speed trade-off (88% accuracy, <4ms), challenging automatic reliance on LLMs.
Details
Motivation: Query routing needs to handle out-of-distribution queries properly, including unrelated domains, other languages, and unsafe text, which standard text classification doesn't adequately address.Method: Created GQR-Bench with three target domains (law, finance, healthcare) and seven datasets for robustness testing. Compared LLM-based routing, guardrail approaches, CBOW classifiers, and traditional ML models.
Result: WideMLP with out-of-domain detection achieved 88% accuracy with <4ms latency. fastText was fastest (<1ms) with 80% accuracy. LLMs had highest accuracy (91%) but slowest (62-669ms).
Conclusion: Challenges automatic reliance on LLMs for query routing; recommends WideMLP for best accuracy-speed trade-off and fastText for speed-critical applications.
Abstract: Query routing, the task of routing user queries to different large language model (LLM) endpoints, can be considered a text classification problem. However, out-of-distribution queries must be handled properly, as those could be about unrelated domains, queries in other languages, or even contain unsafe text. Here, we thus study a guarded query routing problem, for which we first introduce the Guarded Query Routing Benchmark (GQR-Bench, released as Python package gqr), which covers three exemplary target domains (law, finance, and healthcare) and seven datasets to test robustness against out-of-distribution queries. We then use GQR-Bench to contrast the effectiveness and efficiency of LLM-based routing mechanisms (GPT-4o-mini, Llama-3.2-3B, and Llama-3.1-8B), standard LLM-based guardrail approaches (LlamaGuard and NVIDIA NeMo Guardrails), continuous bag-of-words classifiers (WideMLP, fastText), and traditional machine learning models (SVM, XGBoost). Our results show that WideMLP, enhanced with out-of-domain detection capabilities, yields the best trade-off between accuracy (88%) and speed (<4ms). The embedding-based fastText excels at speed (<1ms) with acceptable accuracy (80%), whereas LLMs yield the highest accuracy (91%) but are comparatively slow (62ms for local Llama-3.1-8B and 669ms for remote GPT-4o-mini calls). Our findings challenge the automatic reliance on LLMs for (guarded) query routing and provide concrete recommendations for practical applications. Source code is available: https://github.com/williambrach/gqr.
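As a concrete picture of what a guarded router has to do, here is a minimal sketch: an in-domain classifier plus a confidence threshold that rejects likely out-of-distribution queries. The classifier, toy training snippets, and threshold value are illustrative stand-ins, not the paper's WideMLP-with-OOD-detection model.

```python
# Minimal sketch of a guarded router: classify into a target domain, reject
# low-confidence (likely out-of-distribution) queries. Illustrative only;
# not the paper's WideMLP-with-OOD-detection model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

DOMAINS = ["law", "finance", "healthcare"]  # GQR-Bench target domains

train_texts = ["What is a tort claim?", "How do bond yields move?",
               "What are symptoms of anemia?"]          # toy training data
train_labels = [0, 1, 2]

vec = TfidfVectorizer().fit(train_texts)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(train_texts), train_labels)

def route(query: str, threshold: float = 0.4) -> str:
    """Return a target domain, or 'REJECT' when confidence is too low."""
    probs = clf.predict_proba(vec.transform([query]))[0]
    if probs.max() < threshold:          # treat low confidence as OOD/unsafe
        return "REJECT"
    return DOMAINS[int(probs.argmax())]

for q in ["What is the statute of limitations for fraud?",
          "Comment pirater un logiciel ?"]:
    print(q, "->", route(q))
```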
[652] ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions
Bufang Yang, Lilin Xu, Liekang Zeng, Kaiwei Liu, Siyang Jiang, Wenrui Lu, Hongkai Chen, Xiaofan Jiang, Guoliang Xing, Zhenyu Yan
Main category: cs.AI
TL;DR: ContextAgent is a context-aware proactive agent that uses sensory data from wearables to understand user intentions and provide proactive assistance through automatic tool calling.
Details
Motivation: Existing proactive agents rely on limited environmental observations or rule-based notifications, leading to poor user intent understanding and limited functionality for proactive services.Method: Extracts multi-dimensional contexts from sensory perceptions on wearables, leverages sensory contexts and personas from historical data to predict proactive service needs, and automatically calls necessary tools when assistance is required.
Result: Outperforms baselines by achieving 8.5% higher accuracy in proactive predictions and 6.0% higher accuracy in tool calling on ContextAgentBench benchmark with 1,000 samples across nine daily scenarios.
Conclusion: The research inspires development of more advanced, human-centric, proactive AI assistants by incorporating extensive sensory contexts to enhance LLM agent proactivity.
Abstract: Recent advances in Large Language Models (LLMs) have propelled intelligent agents from reactive responses to proactive support. While promising, existing proactive agents either rely exclusively on observations from enclosed environments (e.g., desktop UIs) with direct LLM inference or employ rule-based proactive notifications, leading to suboptimal user intent understanding and limited functionality for proactive service. In this paper, we introduce ContextAgent, the first context-aware proactive agent that incorporates extensive sensory contexts surrounding humans to enhance the proactivity of LLM agents. ContextAgent first extracts multi-dimensional contexts from massive sensory perceptions on wearables (e.g., video and audio) to understand user intentions. ContextAgent then leverages the sensory contexts and personas from historical data to predict the necessity for proactive services. When proactive assistance is needed, ContextAgent further automatically calls the necessary tools to assist users unobtrusively. To evaluate this new task, we curate ContextAgentBench, the first benchmark for evaluating context-aware proactive LLM agents, covering 1,000 samples across nine daily scenarios and twenty tools. Experiments on ContextAgentBench show that ContextAgent outperforms baselines by achieving up to 8.5% and 6.0% higher accuracy in proactive predictions and tool calling, respectively. We hope our research can inspire the development of more advanced, human-centric, proactive AI assistants. The code and dataset are publicly available at https://github.com/openaiotlab/ContextAgent.
[653] On the Hardness of Approximating Distributions with Tractable Probabilistic Models
John Leland, YooJung Choi
Main category: cs.AI
TL;DR: The paper studies the trade-off between expressivity and inference efficiency in probabilistic circuits (PCs), focusing on whether allowing small approximation errors can avoid exponential size blow-ups when representing distributions.
Details
Motivation: To address the fundamental challenge of balancing expressivity and inference efficiency in probabilistic modeling, particularly how tractable probabilistic models (TPMs) can maintain expressivity while guaranteeing efficient inference.Method: The authors study approximation of distributions using probabilistic circuits with guarantees based on f-divergences, analyzing which inference queries remain well-approximated under this framework.
Result: They prove that approximating an arbitrary distribution with bounded f-divergence is NP-hard for any model that can tractably compute marginals, and show an exponential size gap for approximation between decomposable PCs and decomposable+deterministic PCs.
Conclusion: Allowing small approximation errors does not eliminate the fundamental computational hardness and size requirements in probabilistic circuit representations, as evidenced by NP-hardness results and exponential size gaps between different PC classes.
Abstract: A fundamental challenge in probabilistic modeling is to balance expressivity and inference efficiency. Tractable probabilistic models (TPMs) aim to directly address this tradeoff by imposing constraints that guarantee efficient inference of certain queries while maintaining expressivity. In particular, probabilistic circuits (PCs) provide a unifying framework for many TPMs, by characterizing families of models as circuits satisfying different structural properties. Because the complexity of inference on PCs is a function of the circuit size, understanding the size requirements of different families of PCs is fundamental in mapping the trade-off between tractability and expressive efficiency. However, the study of the expressive efficiency of circuits is often concerned with exact representations, which may not align with model learning, where we look to approximate the underlying data distribution closely by some distance measure. Moreover, due to the hardness of inference tasks, exactly representing distributions while supporting tractable inference often incurs exponential size blow-ups. In this paper, we consider a natural, yet so far underexplored, question: can we avoid such size blow-up by allowing for some small approximation error? We study approximating distributions with probabilistic circuits with guarantees based on $f$-divergences, and analyze which inference queries remain well-approximated under this framework. We show that approximating an arbitrary distribution with bounded $f$-divergence is $\mathsf{NP}$-hard for any model that can tractably compute marginals. In addition, we prove an exponential size gap for approximation between the class of decomposable PCs and that of decomposable and deterministic PCs.
[654] E-bike agents: Large Language Model-Driven E-Bike Accident Analysis and Severity Prediction
Zhichao Yang, Jiashu He, Mohammad B. Al-Khasawneh, Darshan Pandit, Cirillo Cinzia
Main category: cs.AI
TL;DR: Analysis of e-bike vs traditional bicycle safety using CPSRMS and NEISS datasets reveals distinct e-bike risks like battery fires and brake failures, requiring tailored safety interventions.
Details
Motivation: E-bikes have rapidly gained popularity as sustainable urban mobility but their safety implications remain underexplored compared to traditional bicycles.Method: Used CPSRMS and NEISS datasets with a standardized classification framework to analyze injury incidents, integrating incident narratives with demographic attributes.
Result: Found key differences in mechanical failure modes and injury severity patterns - e-bikes present distinct risks including battery-related fires and brake failures.
Conclusion: Tailored safety interventions and infrastructure design are needed to support safe integration of micromobility devices into urban transportation networks.
Abstract: E-bikes have rapidly gained popularity as a sustainable form of urban mobility, yet their safety implications remain underexplored. This paper analyzes injury incidents involving e-bikes and traditional bicycles using two sources of data, the CPSRMS (Consumer Product Safety Risk Management System) and NEISS (National Electronic Injury Surveillance System) datasets. We propose a standardized classification framework to identify and quantify injury causes and severity. By integrating incident narratives with demographic attributes, we reveal key differences in mechanical failure modes, injury severity patterns, and affected user groups. While both modes share common causes, such as loss of control and pedal malfunctions, e-bikes present distinct risks, including battery-related fires and brake failures. These findings highlight the need for tailored safety interventions and infrastructure design to support the safe integration of micromobility devices into urban transportation networks.
[655] Towards Responsible AI: Advances in Safety, Fairness, and Accountability of Autonomous Systems
Filip Cano
Main category: cs.AI
TL;DR: This thesis advances trustworthy AI through safety shields for autonomous vehicles, fairness shields for sequential decision-making, and formal frameworks for assessing intentional behavior and accountability in AI systems.
Details
Motivation: As AI systems increasingly influence critical societal domains, ensuring responsible use through safety, fairness, transparency, and accountability has become imperative, though trustworthy AI remains a broad and multi-faceted concept.Method: Extends deterministic shielding techniques for safety with resilience to delayed observations; implements safety shields in autonomous vehicles; introduces fairness shields for group fairness in sequential decision-making; proposes formal framework with quantitative metrics for assessing intentional behavior and accountability.
Result: Validated safety shields in realistic driving simulators; developed fairness shields that optimize intervention costs while ensuring fairness constraints; created quantitative metrics for agency and intention quotient to analyze responsibility in autonomous systems.
Conclusion: The contributions collectively advance safer, fairer, and more accountable AI systems through the ‘reactive decision-making’ framework, laying foundations for future research in trustworthy AI.
Abstract: Ensuring responsible use of artificial intelligence (AI) has become imperative as autonomous systems increasingly influence critical societal domains. However, the concept of trustworthy AI remains broad and multi-faceted. This thesis advances knowledge in the safety, fairness, transparency, and accountability of AI systems. In safety, we extend classical deterministic shielding techniques to become resilient against delayed observations, enabling practical deployment in real-world conditions. We also implement both deterministic and probabilistic safety shields into simulated autonomous vehicles to prevent collisions with road users, validating the use of these techniques in realistic driving simulators. We introduce fairness shields, a novel post-processing approach to enforce group fairness in sequential decision-making settings over finite and periodic time horizons. By optimizing intervention costs while strictly ensuring fairness constraints, this method efficiently balances fairness with minimal interference. For transparency and accountability, we propose a formal framework for assessing intentional behaviour in probabilistic decision-making agents, introducing quantitative metrics of agency and intention quotient. We use these metrics to propose a retrospective analysis of intention, useful for determining responsibility when autonomous systems cause unintended harm. Finally, we unify these contributions through the "reactive decision-making" framework, providing a general formalization that consolidates previous approaches. Collectively, the advancements presented contribute practically to the realization of safer, fairer, and more accountable AI systems, laying the foundations for future research in trustworthy AI.
[656] When Can Model-Free Reinforcement Learning be Enough for Thinking?
Josiah P. Hanna, Nicholas E. Corrado
Main category: cs.AI
TL;DR: The paper analyzes when model-free RL leads to “thinking” behaviors in large language models, introduces a theoretical thought MDP model, and demonstrates conditions where thinking emerges as a reward maximization strategy.
Details
Motivation: To understand why model-free RL produces reasoning-like "thinking" in LLMs when such actions don't directly produce reward or change the external world state.Method: Introduces a theoretical thought MDP model that extends classical MDPs with abstract thought states/actions, analyzes policy initialization effects, and validates conditions with open-source LLMs and a toy domain.
Result: Proves that thought actions are equivalent to policy improvement steps, shows LLMs satisfy necessary conditions for thinking emergence, and demonstrates more data-efficient RL with designated thought actions in a toy domain.
Conclusion: Thinking emerges in model-free RL when specific conditions are met, with thought actions serving as implicit policy improvement steps, enabling more efficient learning compared to non-thinking approaches.
Abstract: Recent work on large language models has demonstrated the use of model-free reinforcement learning (RL) to train reasoning-like capabilities. The emergence of “thinking” through model-free RL is interesting as thinking actions neither produce reward nor change the external world state to one where the agent is more likely to get reward. This paper seeks to build a domain-independent understanding of when model-free RL will lead to such “thinking” as a strategy for reward maximization. To build this understanding, we first introduce a theoretical model which we call a thought Markov decision process (MDP). Thought MDPs minimally extend the classical MDP model to include an abstract notion of thought state and thought action. Using the thought MDP model, we prove the importance of policy initialization in determining whether or not thinking emerges and show formally that thought actions are equivalent to the agent choosing to perform a step of policy improvement before continuing to act. We then show that open-source LLMs satisfy the conditions that our theory predicts are necessary for model-free RL to produce thinking-like behavior. Finally, we hypothesize sufficient conditions that would enable thinking to be learned outside of language generation and introduce a toy domain where a combination of multi-task pre-training and designated thought actions enable more data-efficient RL compared to non-thinking agents.
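To make the thought MDP construction concrete, the toy sketch below encodes the summary's informal description: thought actions update only an internal thought state and yield no reward, while world actions transition the environment as usual. All names and the stub environment are ours, not the paper's formal notation.

```python
# Toy encoding of a "thought MDP" transition per the informal description:
# a thought action updates only the internal thought state and yields no
# reward; a world action transitions the environment as usual.
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    world: str    # external environment state
    thought: str  # abstract internal thought state

def env_step(world: str, action: str) -> tuple[str, float]:
    # Stub environment: reward only for a terminal 'solve' action.
    return world + "|" + action, 1.0 if action == "solve" else 0.0

def step(s: State, action: str) -> tuple[State, float]:
    if action.startswith("think:"):
        # Thought action: no reward, world state untouched.
        return State(s.world, action.removeprefix("think:")), 0.0
    next_world, reward = env_step(s.world, action)
    return State(next_world, s.thought), reward

s = State(world="start", thought="")
s, r0 = step(s, "think:plan the two remaining moves")  # r0 == 0.0
s, r1 = step(s, "solve")                               # r1 == 1.0
print(s, r0, r1)
```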
[657] SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents
Wanxin Tian, Shijie Zhang, Kevin Zhang, Xiaowei Chi, Chunkai Fan, Junyu Lu, Yulin Luo, Qiang Zhou, Yiming Zhao, Ning Liu, Siyu Lin, Zhiyuan Qin, Xiaozhu Ju, Shanghang Zhang, Jian Tang
Main category: cs.AI
TL;DR: SEEA-R1 is a reinforcement fine-tuning framework that enables self-evolving embodied agents through Tree-GRPO for better intermediate rewards and MGRM for generalized reward estimation, achieving state-of-the-art performance on ALFWorld benchmark.
Details
Motivation: Current reinforcement fine-tuning methods face challenges in embodied settings: lack of accessible intermediate rewards in multi-step reasoning tasks and reliance on hand-crafted reward functions that limit generalization to novel tasks and environments.Method: Proposes SEEA-R1 framework with two key components: Tree-GRPO (Tree-based group relative policy optimization) integrates Monte Carlo Tree Search into GRPO to convert sparse delayed rewards into denser intermediate signals, and MGRM (Multi-modal Generative Reward Model) for generalizable reward estimation across tasks and scenes.
Result: Achieves 85.07% (textual) and 46.27% (multi-modal) scores on ALFWorld benchmark, surpassing state-of-the-art methods including GPT-4o. Without ground truth reward, achieves 80.3% (textual) and 44.03% (multi-modal), outperforming all open-source baselines.
Conclusion: SEEA-R1 demonstrates strong potential for scalable embodied intelligence and self-evolving capabilities, with qualitative analysis supporting its effectiveness for future research in this domain.
Abstract: Self-evolution, the ability of agents to autonomously improve their reasoning and behavior, is essential for the embodied domain with long-horizon, real-world tasks. Despite current advancements in reinforcement fine-tuning (RFT) showing strong performance in enhancing reasoning in LLMs, its potential to enable self-evolving embodied intelligence with multi-modal interactions remains largely unexplored. Specifically, reinforcement fine-tuning faces two fundamental obstacles in embodied settings: (i) the lack of accessible intermediate rewards in multi-step reasoning tasks limits effective learning signals, and (ii) reliance on hand-crafted reward functions restricts generalization to novel tasks and environments. To address these challenges, we present Self-Evolving Embodied Agents-R1, SEEA-R1, the first RFT framework designed for enabling the self-evolving capabilities of embodied agents. Specifically, to convert sparse delayed rewards into denser intermediate signals that improve multi-step reasoning, we propose Tree-based group relative policy optimization (Tree-GRPO), which integrates Monte Carlo Tree Search into GRPO. To generalize reward estimation across tasks and scenes, supporting autonomous adaptation and reward-driven self-evolution, we further introduce the Multi-modal Generative Reward Model (MGRM). To holistically evaluate the effectiveness of SEEA-R1, we evaluate it on the ALFWorld benchmark, surpassing state-of-the-art methods with scores of 85.07% (textual) and 46.27% (multi-modal), outperforming prior models including GPT-4o. SEEA-R1 also achieves scores of 80.3% (textual) and 44.03% (multi-modal) without ground truth reward, surpassing all open-source baselines and highlighting its scalability as a self-evolving embodied agent. Additional experiments and qualitative analysis further support the potential of SEEA-R1 for future research in scalable embodied intelligence.
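The Tree-GRPO component hinges on turning one sparse terminal reward into denser intermediate signals via tree search. The toy backup pass below illustrates only that intuition, running-mean value backups along a visited path; the actual integration of MCTS into GRPO is more involved.

```python
# Toy sketch of the Tree-GRPO intuition: back a sparse terminal reward up a
# search tree so intermediate steps receive denser value estimates.
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    value: float = 0.0               # running mean of backed-up returns
    children: dict = field(default_factory=dict)

def backup(path: list[Node], terminal_reward: float) -> None:
    """Propagate a terminal reward to every node on the visited path."""
    for node in path:
        node.visits += 1
        node.value += (terminal_reward - node.value) / node.visits

root, step1, step2 = Node(), Node(), Node()
backup([root, step1, step2], terminal_reward=1.0)  # successful episode
backup([root, step1], terminal_reward=0.0)         # failed episode
print(root.value, step1.value, step2.value)        # 0.5 0.5 1.0: denser signal
```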
[658] A Neuroscience-Inspired Dual-Process Model of Compositional Generalization
Alex Noviello, Claas Beger, Jacob Groner, Kevin Ellis, Weinan Sun
Main category: cs.AI
TL;DR: MIRAGE is a neuro-inspired dual-process model combining a meta-trained Transformer (System 1) with a rule-based Schema Engine (System 2) that achieves >99% accuracy on SCAN benchmark through systematic compositional generalization.
Details
Motivation: Deep learning models struggle with systematic compositional generalization, which is a hallmark of human cognition, motivating the development of neuro-inspired models that can better replicate this ability.Method: Proposes MIRAGE - a dual-process model with a fast, intuitive System 1 (meta-trained Transformer) and deliberate System 2 (Schema Engine) that performs general single-step decomposition on random grammars, using explicit prioritized schemas and iterative refinement.
Result: Achieves >99% accuracy on all splits of the SCAN benchmark in a task-agnostic setting, with ablations confirming systematic behavior emerges from architectural interplay between the two systems.
Conclusion: Provides a concrete computational model showing how compositional reasoning can arise from modular cognitive architecture, combining iterative neural updates with interpretable schema modules.
Abstract: Deep learning models struggle with systematic compositional generalization, a hallmark of human cognition. We propose Mirage, a neuro-inspired dual-process model that offers a processing account for this ability. It combines a fast, intuitive "System 1" (a meta-trained Transformer) with a deliberate, rule-based "System 2" (a Schema Engine), mirroring the brain's neocortical and hippocampal–prefrontal circuits. Trained to perform general, single-step decomposition on a stream of random grammars, Mirage achieves >99% accuracy on all splits of the SCAN benchmark in a task-agnostic setting. Ablations confirm that the model's systematic behavior emerges from the architectural interplay of its two systems, particularly its use of explicit, prioritized schemas and iterative refinement. In line with recent progress on recursive/recurrent Transformer approaches, Mirage preserves an iterative neural update while externalizing declarative control into an interpretable schema module. Our work provides a concrete computational model for interpreting how compositional reasoning can arise from a modular cognitive architecture.
[659] Measuring and Analyzing Intelligence via Contextual Uncertainty in Large Language Models using Information-Theoretic Metrics
Jae Wan Shim
Main category: cs.AI
TL;DR: A task-agnostic method to build quantitative Cognitive Profiles for LLMs using Entropy Decay Curves and Information Gain Span to analyze how models process information rather than just what they can do.
Details
Motivation: While LLMs excel on benchmarks, the mechanisms behind their success remain poorly understood. The paper aims to move from asking what models can do to understanding how they process information internally.Method: Proposes a task-agnostic method that builds Cognitive Profiles using Entropy Decay Curves (plotting normalized predictive uncertainty vs context length) and Information Gain Span (IGS) as a single index summarizing decay pattern desirability.
Result: Across state-of-the-art LLMs and diverse texts, the curves reveal distinctive, stable profiles that depend on both model scale and text complexity, providing insights into internal model dynamics.
Conclusion: The proposed tools offer a principled way to analyze and compare the internal dynamics of modern AI systems, moving beyond benchmark performance to understand information processing mechanisms.
Abstract: Large Language Models (LLMs) excel on many task-specific benchmarks, yet the mechanisms that drive this success remain poorly understood. We move from asking what these systems can do to asking how they process information. Our contribution is a task-agnostic method that builds a quantitative Cognitive Profile for any model. The profile is built around the Entropy Decay Curve, a plot of a model's normalised predictive uncertainty as context length grows. Across several state-of-the-art LLMs and diverse texts, the curves expose distinctive, stable profiles that depend on both model scale and text complexity. We also propose the Information Gain Span (IGS) as a single index that summarises the desirability of a decay pattern. Together, these tools offer a principled way to analyse and compare the internal dynamics of modern AI systems.
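Under one plausible reading of the summary (the exact normalization and IGS formula are not given there), the curve and index could be computed as sketched below; the Dirichlet toy data simply mimics predictive uncertainty that falls as context accumulates.

```python
# One plausible reading of the Entropy Decay Curve and an IGS-style index;
# not the paper's exact definitions.
import numpy as np

def entropy_bits(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def decay_curve(dists: list, vocab_size: int) -> np.ndarray:
    """Normalized predictive uncertainty at each context length."""
    h_max = np.log2(vocab_size)          # entropy of the uniform distribution
    return np.array([entropy_bits(p) / h_max for p in dists])

def information_gain_span(curve: np.ndarray) -> float:
    """Total uncertainty shed relative to the shortest-context starting point."""
    return float((curve[0] - curve).sum())

rng = np.random.default_rng(0)
# Sharper next-token distributions as "context length" i grows.
dists = [rng.dirichlet(np.full(50, 1.0 / (i + 1))) for i in range(20)]
curve = decay_curve(dists, vocab_size=50)
print(information_gain_span(curve))
```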
[660] AI Compute Architecture and Evolution Trends
Bor-Sung Liang
Main category: cs.AI
TL;DR: This paper proposes a 7-layer model for AI compute architecture and analyzes AI development opportunities and challenges through this structured framework, covering technical evolution from hardware to applications and economic ecosystem considerations.
Details
Motivation: To address the shift from academic AI research to practical applications and analyze the comprehensive challenges at various levels of AI development using a structured approach.Method: Proposes a 7-layer AI compute architecture model (Physical, Link, Neural Network, Context, Agent, Orchestrator, Application layers) and analyzes AI evolution through three stages of large-scale language models development.
Result: The paper provides a comprehensive framework describing development trajectories and key technologies for each layer, including computing strategies, LLM development paths, contextual memory impact, AI agent trends, and ecosystem evolution.
Conclusion: AI development involves both technical challenges across the 7-layer architecture and economic issues for building sustainable ecosystems, with predictions based on internet industry analysis for future AI trajectory.
Abstract: The focus of AI development has shifted from academic research to practical applications. However, AI development faces numerous challenges at various levels. This article will attempt to analyze the opportunities and challenges of AI from several different perspectives using a structured approach. This article proposes a seven-layer model for AI compute architecture, including Physical Layer, Link Layer, Neural Network Layer, Context Layer, Agent Layer, Orchestrator Layer, and Application Layer, from bottom to top. It also explains how AI computing has evolved into this 7-layer architecture through the three-stage evolution on large-scale language models (LLMs). For each layer, we describe the development trajectory and key technologies. In Layers 1 and 2 we discuss AI computing issues and the impact of Scale-Up and Scale-Out strategies on computing architecture. In Layer 3 we explore two different development paths for LLMs. In Layer 4 we discuss the impact of contextual memory on LLMs and compares it to traditional processor memory. In Layers 5 to 7 we discuss the trends of AI agents and explore the issues in evolution from a single AI agent to an AI-based ecosystem, and their impact on the AI industry. Furthermore, AI development involves not only technical challenges but also the economic issues to build self-sustainable ecosystem. This article analyzes the internet industry to provide predictions on the future trajectory of AI development.
[661] PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming
Wesley Hanwen Deng, Sunnie S. Y. Kim, Akshita Jha, Ken Holstein, Motahhare Eslami, Lauren Wilcox, Leon A Gatys
Main category: cs.AI
TL;DR: PersonaTeaming introduces personas in automated red-teaming to improve attack success rates by 144.1% while maintaining diversity, bridging human and automated approaches.
Details
Motivation: Current automated red-teaming methods don't consider identity factors that human red-teamers use, limiting the range of risks they can uncover.Method: Developed PersonaTeaming with persona-based prompt mutation using both expert and regular user personas, plus dynamic persona generation and new mutation distance metrics.
Result: Achieved up to 144.1% improvement in attack success rates compared to state-of-the-art RainbowPlus method while maintaining prompt diversity.
Conclusion: Persona-based approaches show promise for bridging automated and human red-teaming, with opportunities to explore complementarities between different persona types.
Abstract: Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people’s background and identities in automated red-teaming, we develop and evaluate a novel method, PersonaTeaming, that introduces personas in the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. In particular, we first introduce a methodology for mutating prompts based on either “red-teaming expert” personas or “regular AI user” personas. We then develop a dynamic persona-generating algorithm that automatically generates various persona types adaptive to different seed prompts. In addition, we develop a set of new metrics to explicitly measure the “mutation distance” to complement existing diversity measurements of adversarial prompts. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.
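A minimal sketch of the two ingredients described, persona-conditioned prompt mutation and a "mutation distance" over embeddings, is shown below. The mutate() wrapper stands in for a real LLM rewrite call, and the cosine-distance metric is our guess at the idea rather than the paper's exact formula.

```python
# Sketch of persona-conditioned mutation plus an embedding "mutation
# distance"; our reading of the idea, not the paper's exact algorithm.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def mutate(seed_prompt: str, persona: str) -> str:
    # Hypothetical stand-in for an LLM call conditioned on the persona.
    return f"As {persona}, rephrase to probe the same risk: {seed_prompt}"

def mutation_distance(seed: str, mutated: str) -> float:
    a, b = encoder.encode([seed, mutated], normalize_embeddings=True)
    return float(1.0 - a @ b)   # cosine distance between seed and mutant

seed = "Explain how to bypass a content filter."
mutant = mutate(seed, "a red-teaming expert in prompt injection")
print(round(mutation_distance(seed, mutant), 3))
```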
[662] Chatbot To Help Patients Understand Their Health
Won Seok Jang, Hieu Tran, Manav Mistry, SaiKiran Gandluri, Yifan Zhang, Sharmin Sultana, Sunjae Kown, Yuan Zhang, Zonghai Yao, Hong Yu
Main category: cs.AI
TL;DR: NoteAid-Chatbot is a conversational AI that helps patients understand medical information using a ’learning as conversation’ framework with multi-agent LLM and RL training without human-labeled data.
Details
Motivation: Patients need knowledge to actively participate in their care, requiring tools that promote understanding of medical information through accessible conversations.Method: Built on lightweight LLaMA 3.2 3B model with two-stage training: supervised fine-tuning on synthetic medical conversation data, followed by RL using PPO with rewards from patient understanding assessments in simulated discharge scenarios.
Result: The chatbot exhibits emergent behaviors like clarity, relevance, and structured dialogue without explicit supervision, surpasses non-expert humans in Turing tests, and successfully handles multi-turn interactions with diverse educational strategies.
Conclusion: The framework demonstrates that low-cost PPO-based RL can effectively train domain-specific chatbots for realistic, open-ended conversations, broadening RL-based alignment methods’ applicability beyond healthcare.
Abstract: Patients must possess the knowledge necessary to actively participate in their care. We present NoteAid-Chatbot, a conversational AI that promotes patient understanding via a novel 'learning as conversation' framework, built on a multi-agent large language model (LLM) and reinforcement learning (RL) setup without human-labeled data. NoteAid-Chatbot was built on a lightweight LLaMA 3.2 3B model trained in two stages: initial supervised fine-tuning on conversational data synthetically generated using medical conversation strategies, followed by RL with rewards derived from patient understanding assessments in simulated hospital discharge scenarios. Our evaluation, which includes comprehensive human-aligned assessments and case studies, demonstrates that NoteAid-Chatbot exhibits key emergent behaviors critical for patient education, such as clarity, relevance, and structured dialogue, even though it received no explicit supervision for these attributes. Our results show that even simple Proximal Policy Optimization (PPO)-based reward modeling can successfully train lightweight, domain-specific chatbots to handle multi-turn interactions, incorporate diverse educational strategies, and meet nuanced communication objectives. Our Turing test demonstrates that NoteAid-Chatbot surpasses non-expert humans. Although our current focus is on healthcare, the framework we present illustrates the feasibility and promise of applying low-cost, PPO-based RL to realistic, open-ended conversational domains, broadening the applicability of RL-based alignment methods.
[663] Correct Reasoning Paths Visit Shared Decision Pivots
Dongkyu Cho, Amy B. Z. Zhang, Bilel Fehri, Sheng Wang, Rumi Chunara, Rui Song, Hengrui Cai
Main category: cs.AI
TL;DR: A self-training method that uses decision pivots - minimal verifiable checkpoints - to align LLM reasoning without ground truth data, improving performance on reasoning benchmarks.
Details
Motivation: Chain-of-thought reasoning shows LLM thinking processes, but verifying these traces at scale remains unsolved. Need a way to ensure reasoning correctness without ground truth data.Method: Proposes decision pivots as minimal verifiable checkpoints that correct reasoning must visit. Uses self-training pipeline: sample diverse reasoning paths, mine shared pivots, compress traces to pivot-focused short paths using verifier, and post-train model with self-generated outputs.
Result: Experiments on LogiQA, MedQA, and MATH500 benchmarks show the method’s effectiveness in aligning reasoning without ground truth data or external metrics.
Conclusion: Decision pivots provide a scalable way to verify and align LLM reasoning by identifying essential checkpoints that distinguish correct from incorrect reasoning paths.
Abstract: Chain-of-thought (CoT) reasoning exposes the intermediate thinking process of large language models (LLMs), yet verifying those traces at scale remains unsolved. In response, we introduce the idea of decision pivots: minimal, verifiable checkpoints that any correct reasoning path must visit. We hypothesize that correct reasoning paths, though stylistically diverse, converge on the same pivot set, while incorrect ones violate at least one pivot. Leveraging this property, we propose a self-training pipeline that (i) samples diverse reasoning paths and mines shared decision pivots, (ii) compresses each trace into pivot-focused short-path reasoning using an auxiliary verifier, and (iii) post-trains the model using its self-generated outputs. The proposed method aligns reasoning without ground truth reasoning data or external metrics. Experiments on standard benchmarks such as LogiQA, MedQA, and MATH500 show the effectiveness of our method.
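The pivot-mining step lends itself to a compact illustration: statements that recur across independently sampled reasoning paths to the same answer are kept as candidate pivots. Extraction by exact statement overlap is our simplification; the paper's mining and auxiliary verifier are more involved.

```python
# Minimal sketch of pivot mining: keep intermediate statements shared by
# most sampled reasoning paths. A deliberate simplification of the method.
from collections import Counter

def mine_pivots(paths: list[list[str]], min_support: float = 0.8) -> list[str]:
    """Return intermediate statements shared by >= min_support of the paths."""
    counts = Counter(step for path in paths for step in set(path))
    cutoff = min_support * len(paths)
    return sorted(step for step, c in counts.items() if c >= cutoff)

paths = [
    ["x = 2y", "y = 3", "x = 6"],
    ["y = 3", "substitute", "x = 2y", "x = 6"],
    ["x = 2y", "y = 3", "therefore x = 6"],
]
print(mine_pivots(paths))  # ['x = 2y', 'y = 3'] -- the shared checkpoints
```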
[664] LLM/Agent-as-Data-Analyst: A Survey
Zirui Tang, Weizheng Wang, Zihang Zhou, Yang Jiao, Bangrui Xu, Boyu Niu, Dayou Zhou, Xuanhe Zhou, Guoliang Li, Yeye He, Wei Zhou, Yitong Song, Cheng Tan, Xue Yang, Chunwei Liu, Bin Wang, Conghui He, Xiaoyang Wang, Fan Wu
Main category: cs.AI
TL;DR: This paper surveys LLM and agent techniques for data analysis tasks, reviewing approaches for structured, semi-structured, unstructured, and heterogeneous data, and identifying key design goals for intelligent data analysis agents.
Details
Motivation: LLMs and agent techniques have fundamentally transformed data analysis capabilities, enabling complex data understanding, natural language interfaces, and autonomous pipeline orchestration compared to traditional rule-based approaches.Method: The paper conducts a comprehensive review of LLM-based techniques across different data modalities: structured data (NL2SQL, NL2GQL, ModelQA), semi-structured data (markup languages, table QA), unstructured data (chart understanding, document understanding), and heterogeneous data (data retrieval, modality alignment).
Result: The technical evolution reveals four key design goals for intelligent data analysis agents: semantic-aware design, autonomous pipelines, tool-augmented workflows, and support for open-world tasks.
Conclusion: The paper outlines remaining challenges and proposes insights and practical directions for advancing LLM/Agent-powered data analysis, highlighting the transformative potential of these technologies in the data analysis domain.
Abstract: Large language models (LLMs) and agent techniques have brought a fundamental shift in the functionality and development paradigm of data analysis tasks (a.k.a LLM/Agent-as-Data-Analyst), demonstrating substantial impact across both academia and industry. In comparison with traditional rule or small-model based approaches, (agentic) LLMs enable complex data understanding, natural language interfaces, semantic analysis functions, and autonomous pipeline orchestration. From a modality perspective, we review LLM-based techniques for (i) structured data (e.g., NL2SQL, NL2GQL, ModelQA), (ii) semi-structured data (e.g., markup languages understanding, semi-structured table question answering), (iii) unstructured data (e.g., chart understanding, text/image document understanding), and (iv) heterogeneous data (e.g., data retrieval and modality alignment in data lakes). The technical evolution further distills four key design goals for intelligent data analysis agents, namely semantic-aware design, autonomous pipelines, tool-augmented workflows, and support for open-world tasks. Finally, we outline the remaining challenges and propose several insights and practical directions for advancing LLM/Agent-powered data analysis.
[665] The Emergence of Social Science of Large Language Models
Xiao Jia, Zhanzhan Zhao
Main category: cs.AI
TL;DR: A systematic review of 270 studies using computational methods to create a taxonomy of social science research on large language models, identifying three main domains: LLM as Social Minds, LLM Societies, and LLM-Human Interactions.
Details
Motivation: To organize the fragmented field of social science research on LLMs and provide a reproducible map that clarifies evidentiary standards across different levels of analysis.Method: Systematic review of 270 studies using text embeddings, unsupervised clustering, and topic modeling to build a computational taxonomy.
Result: Identified three organic domains: 1) LLM as Social Minds (cognition, morality, bias attributions), 2) LLM Societies (multi-agent coordination and institutions), 3) LLM-Human Interactions (task transformation, trust, work, governance).
Conclusion: The taxonomy provides a reproducible framework for the field, clarifies research standards, and highlights opportunities for cumulative progress in the social science of artificial intelligence.
Abstract: The social science of large language models (LLMs) examines how these systems evoke mind attributions, interact with one another, and transform human activity and institutions. We conducted a systematic review of 270 studies, combining text embeddings, unsupervised clustering and topic modeling to build a computational taxonomy. Three domains emerge organically across the reviewed literature. LLM as Social Minds examines whether and when models display behaviors that elicit attributions of cognition, morality and bias, while addressing challenges such as test leakage and surface cues. LLM Societies examines multi-agent settings where interaction protocols, architectures and mechanism design shape coordination, norms, institutions and collective epistemic processes. LLM-Human Interactions examines how LLMs reshape tasks, learning, trust, work and governance, and how risks arise at the human-AI interface. This taxonomy provides a reproducible map of a fragmented field, clarifies evidentiary standards across levels of analysis, and highlights opportunities for cumulative progress in the social science of artificial intelligence.
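The survey's computational pipeline (embed, cluster, inspect) has roughly the shape sketched below; the embedding model, cluster count, and toy abstracts are illustrative choices, not the authors' setup.

```python
# Rough shape of the taxonomy pipeline: embed abstracts, cluster, then
# inspect clusters as candidate domains. Illustrative choices throughout.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

abstracts = ["Do LLMs display moral reasoning?",
             "Norm formation in multi-agent LLM societies.",
             "How LLM assistants reshape trust at work."]
X = SentenceTransformer("all-MiniLM-L6-v2").encode(abstracts)
labels = KMeans(n_clusters=3, n_init="auto", random_state=0).fit_predict(X)
print(labels)   # one cluster per emergent domain in this toy example
```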
[666] Expandable Decision-Making States for Multi-Agent Deep Reinforcement Learning in Soccer Tactical Analysis
Kenjiro Ide, Taiga Someya, Kohei Kawaguchi, Keisuke Fujii
Main category: cs.AI
TL;DR: EDMS introduces a semantically enriched state representation with relational variables and action masking for interpretable player-level agent models in soccer, improving tactical analysis and cross-dataset compatibility.
Details
Motivation: Traditional rule-based analyses are intuitive but limited, while modern ML models lack interpretable agent representations. The goal is to build player-level agent models that are both tactically interpretable and robust across heterogeneous data sources.Method: Proposed Expandable Decision-Making States (EDMS) - a state representation augmenting raw positions/velocities with relational variables (scoring of space, pass, score) and action masking for on-ball/off-ball agents with distinct decision sets.
Result: EDMS with action masking consistently reduced action-prediction loss and temporal-difference error compared to baseline. Q-value visualizations highlighted high-risk, high-reward tactical patterns like fast counterattacks and defensive breakthroughs.
Conclusion: EDMS successfully maps learned value functions to human-interpretable tactical concepts, aligns agent choices with game rules, and demonstrates cross-provider compatibility through integration with multiple datasets.
Abstract: Invasion team sports such as soccer produce a high-dimensional, strongly coupled state space as many players continuously interact on a shared field, challenging quantitative tactical analysis. Traditional rule-based analyses are intuitive, while modern predictive machine learning models often perform pattern-matching without explicit agent representations. The problem we address is how to build player-level agent models from data, whose learned values and policies are both tactically interpretable and robust across heterogeneous data sources. Here, we propose Expandable Decision-Making States (EDMS), a semantically enriched state representation that augments raw positions and velocities with relational variables (e.g., scoring of space, pass, and score), combined with an action-masking scheme that gives on-ball and off-ball agents distinct decision sets. Compared to prior work, EDMS maps learned value functions and action policies to human-interpretable tactical concepts (e.g., marking pressure, passing lanes, ball accessibility) instead of raw coordinate features, and aligns agent choices with the rules of play. In the experiments, EDMS with action masking consistently reduced both action-prediction loss and temporal-difference (TD) error compared to the baseline. Qualitative case studies and Q-value visualizations further indicate that EDMS highlights high-risk, high-reward tactical patterns (e.g., fast counterattacks and defensive breakthroughs). We also integrated our approach into an open-source library and demonstrated compatibility with multiple commercial and open datasets, enabling cross-provider evaluation and reproducible experiments.
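The action-masking scheme is easy to picture: on-ball and off-ball agents draw from distinct legal action sets, enforced by masking value estimates before the argmax. The action vocabulary below is illustrative, not the paper's.

```python
# Sketch of EDMS-style action masking: on-ball and off-ball agents select
# from distinct legal action sets. Action names are illustrative.
import numpy as np

ACTIONS = ["pass", "shoot", "dribble", "press", "mark", "move_to_space"]
ON_BALL = {"pass", "shoot", "dribble"}
OFF_BALL = {"press", "mark", "move_to_space"}

def masked_argmax(q_values: np.ndarray, has_ball: bool) -> str:
    legal = ON_BALL if has_ball else OFF_BALL
    mask = np.array([a in legal for a in ACTIONS])
    q = np.where(mask, q_values, -np.inf)   # illegal actions can never win
    return ACTIONS[int(q.argmax())]

q = np.array([0.2, 0.9, 0.1, 0.8, 0.3, 0.4])
print(masked_argmax(q, has_ball=True))    # 'shoot'
print(masked_argmax(q, has_ball=False))   # 'press'
```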
[667] LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings
Benjamin F. Maier, Ulf Aslak, Luca Fiaschi, Nina Rismal, Kemble Fletcher, Christian C. Luhmann, Robbie Dow, Kli Pappas, Thomas V. Wiecki
Main category: cs.AI
TL;DR: SSR method uses LLMs to simulate synthetic consumers by generating textual responses and mapping them to Likert distributions via embedding similarity, achieving 90% of human test-retest reliability while maintaining realistic response distributions.
Details
Motivation: Traditional consumer research is costly, suffers from panel biases and limited scale. LLMs offer an alternative but produce unrealistic numerical ratings when asked directly.Method: Semantic Similarity Rating (SSR) - elicits textual responses from LLMs and maps them to Likert distributions using embedding similarity to reference statements.
Result: Achieved 90% of human test-retest reliability on 57 product surveys (9,300 human responses), maintained realistic response distributions (KS similarity > 0.85), and provided rich qualitative feedback.
Conclusion: SSR enables scalable consumer research simulations while preserving traditional survey metrics and interpretability.
Abstract: Consumer research costs companies billions annually yet suffers from panel biases and limited scale. Large language models (LLMs) offer an alternative by simulating synthetic consumers, but produce unrealistic response distributions when asked directly for numerical ratings. We present semantic similarity rating (SSR), a method that elicits textual responses from LLMs and maps these to Likert distributions using embedding similarity to reference statements. Testing on an extensive dataset comprising 57 personal care product surveys conducted by a leading corporation in that market (9,300 human responses), SSR achieves 90% of human test-retest reliability while maintaining realistic response distributions (KS similarity > 0.85). Additionally, these synthetic respondents provide rich qualitative feedback explaining their ratings. This framework enables scalable consumer research simulations while preserving traditional survey metrics and interpretability.
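SSR as described reduces to a short computation: embed the free-text answer and one reference statement per Likert point, then turn the cosine similarities into a distribution. The reference statements and softmax temperature below are our own illustrative choices.

```python
# Sketch of semantic similarity rating (SSR): map a free-text answer to a
# Likert distribution via embedding similarity to reference statements.
# References and temperature are illustrative, not the paper's.
import numpy as np
from sentence_transformers import SentenceTransformer

REFERENCES = [
    "I would definitely not buy this product.",   # 1
    "I probably would not buy this product.",     # 2
    "I might or might not buy this product.",     # 3
    "I would probably buy this product.",         # 4
    "I would definitely buy this product.",       # 5
]

model = SentenceTransformer("all-MiniLM-L6-v2")

def ssr_distribution(response: str, temperature: float = 0.05) -> np.ndarray:
    embs = model.encode([response] + REFERENCES, normalize_embeddings=True)
    sims = embs[1:] @ embs[0]                    # cosine similarities
    logits = sims / temperature
    p = np.exp(logits - logits.max())
    return p / p.sum()                           # distribution over Likert 1-5

print(ssr_distribution("Sounds great, I'd pick one up next time I shop."))
```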
[668] Hierarchical Optimization via LLM-Guided Objective Evolution for Mobility-on-Demand Systems
Yi Zhang, Yushen Long, Yun Ni, Liping Huang, Xiaohong Wang, Jun Liu
Main category: cs.AI
TL;DR: A hybrid framework combining LLM with mathematical optimization for ride-hailing platforms, achieving 16% improvement over baselines by using LLM as meta-optimizer to generate adaptive objectives.
Details
Motivation: Address limitations of existing approaches: RL methods are data-inefficient and oversimplify real-world dynamics, while decomposed optimization methods lack awareness of low-level routing dynamics due to manually designed objectives.Method: Training-free hybrid framework with LLM as meta-optimizer generating semantic heuristics, guided by harmony search evolutionary process that refines prompts based on feasibility and performance feedback from optimization layer.
Result: Extensive experiments on NYC and Chicago taxi datasets show 16% average improvement over state-of-the-art baselines.
Conclusion: The proposed LLM-optimization hybrid framework effectively balances supply-demand in ride-hailing by bridging cognitive limitations of decomposition and avoiding RL’s data inefficiency.
Abstract: Online ride-hailing platforms aim to deliver efficient mobility-on-demand services, often facing challenges in balancing dynamic and spatially heterogeneous supply and demand. Existing methods typically fall into two categories: reinforcement learning (RL) approaches, which suffer from data inefficiency, oversimplified modeling of real-world dynamics, and difficulty enforcing operational constraints; or decomposed online optimization methods, which rely on manually designed high-level objectives that lack awareness of low-level routing dynamics. To address this issue, we propose a novel hybrid framework that integrates large language model (LLM) with mathematical optimization in a dynamic hierarchical system: (1) it is training-free, removing the need for large-scale interaction data as in RL, and (2) it leverages LLM to bridge cognitive limitations caused by problem decomposition by adaptively generating high-level objectives. Within this framework, LLM serves as a meta-optimizer, producing semantic heuristics that guide a low-level optimizer responsible for constraint enforcement and real-time decision execution. These heuristics are refined through a closed-loop evolutionary process, driven by harmony search, which iteratively adapts the LLM prompts based on feasibility and performance feedback from the optimization layer. Extensive experiments based on scenarios derived from both the New York and Chicago taxi datasets demonstrate the effectiveness of our approach, achieving an average improvement of 16% compared to state-of-the-art baselines.
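Schematically, the closed loop couples an LLM meta-optimizer to a low-level solver through a harmony-search-style memory of high-scoring objectives. In the sketch below, llm() and solve_dispatch() are hypothetical stand-ins for the real components.

```python
# Schematic of the closed loop: an LLM proposes a high-level objective, the
# low-level optimizer scores it, and a harmony-search-style memory keeps
# the best prompts for the next round. Stubs throughout.
import random

def llm(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return prompt + " + balance idle-driver coverage"

def solve_dispatch(heuristic: str) -> float:
    # Stand-in for the constraint-enforcing optimizer; returns performance.
    return random.random()

def refine(n_rounds: int = 10, memory_size: int = 5) -> tuple:
    memory = [("maximize served requests", 0.0)]   # (heuristic, score)
    for _ in range(n_rounds):
        seed, _ = random.choice(memory)            # draw from harmony memory
        heuristic = llm(f"Improve this dispatch objective: {seed}")
        score = solve_dispatch(heuristic)          # feedback from the optimizer
        memory.append((heuristic, score))
        memory = sorted(memory, key=lambda x: -x[1])[:memory_size]
    return memory[0]

print(refine())
```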
[669] PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature
Daoyu Wang, Mingyue Cheng, Qi Liu, Shuo Yu, Zirui Liu, Ze Guo
Main category: cs.AI
TL;DR: PaperArena is a benchmark for evaluating LLM agents on cross-paper reasoning and multi-tool orchestration in real research scenarios, showing current agents achieve only 38.78% accuracy with inefficient tool usage.
Details
Motivation: Existing works are limited to tool-free tasks within isolated papers, lacking benchmarks for cross-paper reasoning and multi-tool orchestration in real research scenarios.Method: Proposes PaperArena benchmark with modular platform offering tools like multimodal parsing, context retrieval, and programmatic computation for standardized evaluation of agents on research questions requiring cross-paper integration.
Result: Even the most advanced LLM-powered agent achieves only 38.78% average accuracy, dropping to 18.47% on hard subset. All agents exhibit inefficient tool usage, invoking more tools than necessary.
Conclusion: PaperArena highlights significant room for improvement in agent capabilities for scientific discovery and invites community adoption for developing more capable agents.
Abstract: Understanding and reasoning on the web-scale scientific literature is a crucial touchstone for large language model (LLM) based agents designed to support complex knowledge-intensive tasks. However, existing works are mainly restricted to tool-free tasks within isolated papers, largely due to the lack of a benchmark for cross-paper reasoning and multi-tool orchestration in real research scenarios. In this work, we propose PaperArena, an evaluation benchmark for agents to address real-world research questions that typically require integrating information across multiple papers with the assistance of external tools. Given a research question, agents should integrate diverse formats across multiple papers through reasoning and interacting with appropriate tools, thereby producing a well-grounded answer. To support standardized evaluation, we provide a modular and extensible platform for agent execution, offering tools such as multimodal parsing, context retrieval, and programmatic computation. Experimental results reveal that even the most advanced LLM powering a well-established agent system achieves merely 38.78% average accuracy. On the hard subset, accuracy drops to only 18.47%, highlighting great potential for improvement. We also present several empirical findings, including that all agents tested exhibit inefficient tool usage, often invoking more tools than necessary to solve a task. We invite the community to adopt PaperArena to develop and evaluate more capable agents for scientific discovery. Our code and data are available https://github.com/Melmaphother/PaperArena.
[670] A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space
Bingjie Zhang, Yibo Yang, Zhe Ren, Dandan Guo, Jindong Gu, Philip Torr, Bernard Ghanem
Main category: cs.AI
TL;DR: GuardSpace is a framework that preserves safety alignment in LLMs during fine-tuning by decomposing weights into safety-relevant and irrelevant components, and using null space projection to maintain refusal behavior on harmful prompts.
Details
Motivation: LLMs lose safety alignment during fine-tuning even on benign data, leading to harmful responses. Current methods fail to preserve pre-trained safety behaviors.Method: 1) Decompose pre-trained weights using covariance-preconditioned SVD into safety-relevant (frozen) and safety-irrelevant components (used for adapter initialization). 2) Construct null space projector to restrict adapter updates from altering safe outputs on harmful prompts.
Result: GuardSpace outperforms state-of-the-art methods, reducing harmful score from 14.4% to 3.6% and improving accuracy from 26.0% to 28.0% for Llama-2-7B-Chat on GSM8K.
Conclusion: GuardSpace effectively preserves safety alignment during fine-tuning while maintaining or improving task performance across various models and downstream tasks.
Abstract: Large language models (LLMs) have achieved remarkable success in diverse tasks, yet their safety alignment remains fragile during adaptation. Even when fine-tuning on benign data or with low-rank adaptation, pre-trained safety behaviors are easily degraded, leading to harmful responses in the fine-tuned models. To address this challenge, we propose GuardSpace, a guardrail framework for preserving safety alignment throughout fine-tuning, composed of two key components: a safety-sensitive subspace and a harmful-resistant null space. First, we explicitly decompose pre-trained weights into safety-relevant and safety-irrelevant components using covariance-preconditioned singular value decomposition, and initialize low-rank adapters from the safety-irrelevant ones, while freezing safety-relevant components to preserve their associated safety mechanism. Second, we construct a null space projector that restricts adapter updates from altering safe outputs on harmful prompts, thereby maintaining the original refusal behavior. Experiments with various pre-trained models on multiple downstream tasks demonstrate that GuardSpace achieves superior performance over existing methods. Notably, for Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT, reducing the average harmful score from 14.4% to 3.6%, while improving the accuracy from 26.0% to 28.0%.
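The second component can be illustrated with a generic null-space projection (not GuardSpace's exact construction): project the adapter update so it cannot change the layer's outputs on cached harmful-prompt activations.

```python
# Generic null-space projection in the spirit of the second component:
# updates projected this way leave outputs on cached harmful-prompt
# activations unchanged. Not GuardSpace's exact construction.
import torch

def null_space_projector(H: torch.Tensor, rank_tol: float = 1e-6) -> torch.Tensor:
    """H: (d, n) harmful-prompt activations. Returns P with P @ H == 0."""
    U, S, _ = torch.linalg.svd(H, full_matrices=True)
    r = int((S > rank_tol * S.max()).sum())     # numerical rank of H
    U_r = U[:, :r]                              # basis of the activation span
    return torch.eye(H.shape[0]) - U_r @ U_r.T  # project onto its orthogonal complement

d, n = 16, 4
H = torch.randn(d, n)
P = null_space_projector(H)
delta_W = torch.randn(8, d)                     # candidate adapter update
safe_delta = delta_W @ P                        # now safe_delta @ H ~ 0
print(torch.norm(safe_delta @ H))               # ~0: refusal outputs preserved
```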
[671] Operationalising Extended Cognition: Formal Metrics for Corporate Knowledge and Legal Accountability
Elija Perrier
Main category: cs.AI
TL;DR: The paper proposes a new framework for defining corporate knowledge in the age of AI, using computational efficiency and validated reliability metrics to impute legal responsibility.
Details
Motivation: Traditional corporate mens rea concepts based on human agents are challenged by AI-mediated decision-making, requiring new approaches to corporate knowledge attribution.Method: Developed a formal model using extended cognition theory, creating continuous organizational knowledge metrics that integrate computational cost and validated error rates, with thresholded predicates for legal imputation.
Result: Created quantitative metrics for corporate knowledge states that can be mapped onto legal standards (actual knowledge, constructive knowledge, wilful blindness, recklessness) and provide measurable audit artifacts.
Conclusion: The framework makes corporate knowledge tractable and accountable in the algorithmic age by providing justiciable metrics for AI-mediated corporate decision-making.
Abstract: Corporate responsibility turns on notions of corporate mens rea, traditionally imputed from human agents. Yet these assumptions are under challenge as generative AI increasingly mediates enterprise decision-making. Building on the theory of extended cognition, we argue that in response corporate knowledge may be redefined as a dynamic capability, measurable by the efficiency of its information-access procedures and the validated reliability of their outputs. We develop a formal model that captures epistemic states of corporations deploying sophisticated AI or information systems, introducing a continuous organisational knowledge metric $S_S(\varphi)$ which integrates a pipeline's computational cost and its statistically validated error rate. We derive a thresholded knowledge predicate $\mathsf{K}_S$ to impute knowledge and a firm-wide epistemic capacity index $\mathcal{K}_{S,t}$ to measure overall capability. We then operationally map these quantitative metrics onto the legal standards of actual knowledge, constructive knowledge, wilful blindness, and recklessness. Our work provides a pathway towards creating measurable and justiciable audit artefacts that render the corporate mind tractable and accountable in the algorithmic age.
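The abstract names the metric and predicate but not their functional form, so the following is one plausible instantiation, ours rather than the paper's: reliability discounted by computational cost, with knowledge imputed above a threshold.

```latex
% One plausible instantiation (ours, not the paper's): reliability
% discounted by computational cost, knowledge imputed above a threshold.
\[
  S_S(\varphi) = \frac{1 - \hat{\varepsilon}_S(\varphi)}{1 + \lambda\, C_S(\varphi)},
  \qquad
  \mathsf{K}_S(\varphi) \iff S_S(\varphi) \ge \tau ,
\]
% with \hat{\varepsilon}_S(\varphi) the validated error rate of the pipeline
% answering \varphi, C_S(\varphi) its computational cost, and \lambda, \tau
% calibration constants fixed in advance.
```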
[672] Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI
Jitao Sang, Jinlin Xiao, Jiarun Han, Jilin Chen, Xiaoyi Chen, Shuyu Wei, Yongjie Sun, Yuhang Wang
Main category: cs.AI
TL;DR: Survey traces the paradigm shift from pipeline-based agentic AI systems to model-native approaches where planning, tool use, and memory are internalized within LLMs, enabled by reinforcement learning.
Details
Motivation: To document and analyze the evolution of agentic AI from systems that apply external logic to models that develop intelligence through experience-driven learning.
Method: Systematic review of how planning, tool use, and memory capabilities have evolved from externally scripted modules to end-to-end learned behaviors, examining RL as the enabling algorithmic engine.
Result: Identifies a coherent trajectory toward model-native agentic AI as an integrated learning framework, with applications in deep research agents and GUI agents demonstrating the paradigm shift.
Conclusion: Agentic AI is transitioning from constructing systems that apply intelligence to developing models that grow intelligence through experience, with continued internalization of capabilities like multi-agent collaboration and reflection.
Abstract: The rapid evolution of agentic AI marks a new phase in artificial intelligence, where Large Language Models (LLMs) no longer merely respond but act, reason, and adapt. This survey traces the paradigm shift in building agentic AI: from Pipeline-based systems, where planning, tool use, and memory are orchestrated by external logic, to the emerging Model-native paradigm, where these capabilities are internalized within the model’s parameters. We first position Reinforcement Learning (RL) as the algorithmic engine enabling this paradigm shift. By reframing learning from imitating static data to outcome-driven exploration, RL underpins a unified solution of LLM + RL + Task across language, vision and embodied domains. Building on this, the survey systematically reviews how each capability – Planning, Tool use, and Memory – has evolved from externally scripted modules to end-to-end learned behaviors. Furthermore, it examines how this paradigm shift has reshaped major agent applications, specifically the Deep Research agent emphasizing long-horizon reasoning and the GUI agent emphasizing embodied interaction. We conclude by discussing the continued internalization of agentic capabilities like Multi-agent collaboration and Reflection, alongside the evolving roles of the system and model layers in future agentic AI. Together, these developments outline a coherent trajectory toward model-native agentic AI as an integrated learning and interaction framework, marking the transition from constructing systems that apply intelligence to developing models that grow intelligence through experience.
[673] Which LLM Multi-Agent Protocol to Choose?
Hongyi Du, Jiaqi Su, Jisen Li, Lijie Ding, Yingxuan Yang, Peixuan Han, Xiangru Tang, Kunlun Zhu, Jiaxuan You
Main category: cs.AI
TL;DR: ProtocolBench is a benchmark for evaluating multi-agent communication protocols, showing significant performance differences across protocols. ProtocolRouter is a learnable router that selects optimal protocols per scenario, improving recovery time and success rates.
Details
Motivation: Communication protocol selection in multi-agent systems is currently intuition-driven without standardized evaluation, leading to suboptimal performance and reliability.
Method: Created ProtocolBench with four evaluation axes (task success, latency, overhead, robustness), and developed ProtocolRouter, a learnable protocol router that selects protocols based on requirements and runtime signals.
Result: Protocol choice significantly impacts system behavior: 36.5% variation in completion time, 3.48s latency differences, and varying resilience. ProtocolRouter reduces Fail-Storm recovery time by 18.1% and improves GAIA success rates.
Conclusion: Protocol selection critically affects multi-agent system performance, and ProtocolRouter enables adaptive protocol choice for improved reliability and efficiency at scale.
Abstract: As large-scale multi-agent systems evolve, the communication protocol layer has become a critical yet under-evaluated factor shaping performance and reliability. Despite the existence of diverse protocols (A2A, ACP, ANP, Agora, etc.), selection is often intuition-driven and lacks standardized guidance. We introduce ProtocolBench, a benchmark that systematically compares agent protocols along four measurable axes: task success, end-to-end latency, message or byte overhead, and robustness under failures. On ProtocolBench, protocol choice significantly influences system behavior. In the Streaming Queue scenario, overall completion time varies by up to 36.5% across protocols, and mean end-to-end latency differs by 3.48 s. Under Fail-Storm Recovery, resilience also differs consistently across protocols. Beyond evaluation, we present ProtocolRouter, a learnable protocol router that selects per-scenario (or per-module) protocols from requirement and runtime signals. ProtocolRouter reduces Fail-Storm recovery time by up to 18.1% versus the best single-protocol baseline, and achieves scenario-specific gains such as higher success in GAIA. We also release ProtocolRouterBench to standardize protocol evaluation and improve reliability at scale.
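A per-scenario protocol router of the kind described can be pictured as a small scorer over runtime signals. The sketch below uses a linear reward-regression rule; the feature set, candidate list, and learning rule are illustrative assumptions rather than ProtocolRouter's actual design.

```python
import numpy as np

PROTOCOLS = ["A2A", "ACP", "ANP", "Agora"]   # candidates named in the abstract

class ToyProtocolRouter:
    """Linear scorer: pick the protocol with the highest predicted reward
    for a scenario feature vector (e.g. message rate, payload size,
    observed failure rate)."""
    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = np.zeros((len(PROTOCOLS), n_features))
        self.lr = lr

    def route(self, features: np.ndarray) -> str:
        return PROTOCOLS[int(np.argmax(self.w @ features))]

    def update(self, features: np.ndarray, rewards: np.ndarray) -> None:
        # Regress scores toward observed per-protocol rewards
        # (task success minus latency/overhead penalties).
        pred = self.w @ features
        self.w += self.lr * np.outer(rewards - pred, features)

router = ToyProtocolRouter(n_features=3)
feats = np.array([0.8, 0.1, 0.3])            # hypothetical runtime signals
router.update(feats, rewards=np.array([0.6, 0.9, 0.4, 0.5]))
print(router.route(feats))                   # -> "ACP" after this update
```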
[674] A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
Sohyeon Jeon, Hyung-Chul Lee
Main category: cs.AI
TL;DR: LLMs show poor calibration and overconfidence in clinical trial reporting evaluation, with calibration errors above clinically relevant thresholds despite different prompt strategies.
Details
Motivation: There's a need for robust and explainable evaluation of LLMs in healthcare, particularly regarding uncertainty calibration and metacognitive reliability in medical automation.
Method: Used behavioral and metacognitive analysis with an expert-validated dataset, comparing general and domain-specialized LLMs across three prompt strategies, measuring Expected Calibration Error (ECE) and Relative Calibration Error (RCE).
Result: Both models showed pronounced miscalibration and overconfidence, especially in clinical role-playing conditions, with persistent calibration errors above clinically relevant thresholds.
Conclusion: Improved calibration, transparent code, and strategic prompt engineering are needed for reliable and explainable medical AI.
Abstract: Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge. In particular, uncertainty calibration and metacognitive reliability of LLM reasoning are poorly understood and underexplored in medical automation. This study applies a behavioral and metacognitive analytic approach using an expert-validated dataset, systematically comparing two representative LLMs, one general and one domain-specialized, across three prompt strategies. We analyze both cognitive adaptation and calibration error using two metrics: Expected Calibration Error (ECE) and a baseline-normalized Relative Calibration Error (RCE), which enables reliable cross-model comparison. Our results reveal pronounced miscalibration and overconfidence in both models, especially under clinical role-playing conditions, with calibration error persisting above clinically relevant thresholds. These findings underscore the need for improved calibration, transparent code, and strategic prompt engineering to develop reliable and explainable medical AI.
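Expected Calibration Error, the study's central metric, is standard and compact to state. Below is the usual binned estimator; the paper's baseline-normalized RCE would further divide by a reference calibration error, which we omit.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: bin predictions by stated confidence, then average the
    |accuracy - mean confidence| gap, weighted by bin occupancy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# An overconfident model: high stated confidence, mediocre accuracy.
conf = [0.95, 0.92, 0.97, 0.90, 0.93]
hits = [1, 0, 1, 0, 1]
print(expected_calibration_error(conf, hits))  # ~0.33
```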
[675] LLMs can hide text in other text of the same length
Antonio Norelli, Michael Bronstein
Main category: cs.AI
TL;DR: A method to hide secret messages within seemingly normal text using LLMs, enabling covert communication that decouples text from authorial intent.
Details
Motivation: To demonstrate how LLMs can be used to create covert communication channels where secret messages are embedded in plausible-looking text, eroding trust in written communication.
Method: A simple and efficient protocol using modest 8-billion-parameter open-source LLMs to encode and decode messages within coherent text of the same length.
Result: High-quality results achieved with local encoding/decoding on a laptop in seconds, enabling scenarios like hiding unfiltered LLM responses within compliant model outputs.
Conclusion: This protocol reveals a radical decoupling of text from intent, raising urgent AI safety questions and challenging our understanding of what LLMs truly know.
Abstract: A meaningful text can be hidden inside another, completely different yet still coherent and plausible, text of the same length. For example, a tweet containing a harsh political critique could be embedded in a tweet that celebrates the same political leader, or an ordinary product review could conceal a secret manuscript. This uncanny state of affairs is now possible thanks to Large Language Models, and in this paper we present a simple and efficient protocol to achieve it. We show that even modest 8-billion-parameter open-source LLMs are sufficient to obtain high-quality results, and a message as long as this abstract can be encoded and decoded locally on a laptop in seconds. The existence of such a protocol demonstrates a radical decoupling of text from authorial intent, further eroding trust in written communication, already shaken by the rise of LLM chatbots. We illustrate this with a concrete scenario: a company could covertly deploy an unfiltered LLM by encoding its answers within the compliant responses of a safe model. This possibility raises urgent questions for AI safety and challenges our understanding of what it means for a Large Language Model to know something.
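The underlying trick can be shown at toy scale: sender and receiver share a model that ranks candidate continuations, each secret bit selects one of the top candidates, and the receiver re-runs the model to read the choices back. The bigram dictionary below is a stand-in for an LLM; this illustrates the principle only, not the paper's protocol.

```python
def top2(model, context):
    """Deterministic top-2 next tokens under a toy 'model' (a dict)."""
    return model.get(context, ["the", "a"])[:2]

def encode(model, bits, start):
    text, ctx = [start], start
    for b in bits:
        ctx = top2(model, ctx)[b]        # each bit selects candidate 0 or 1
        text.append(ctx)
    return " ".join(text)

def decode(model, text):
    words = text.split()
    return [top2(model, ctx).index(nxt) for ctx, nxt in zip(words, words[1:])]

model = {"we": ["like", "love"], "like": ["music", "art"],
         "love": ["music", "art"], "music": ["deeply", "today"],
         "art": ["deeply", "today"]}     # stand-in for an LLM's top-2 lists
cover = encode(model, [1, 0, 1], start="we")
print(cover)                  # "we love music today": reads as ordinary text
print(decode(model, cover))   # [1, 0, 1]: the secret bits, recovered
```

With a real LLM, the candidate lists come from the model's next-token distribution, which is what lets the cover text stay coherent at matching length.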
[676] Multi-Step Reasoning for Embodied Question Answering via Tool Augmentation
Mingliang Zhai, Hansheng Liang, Xiaomeng Fan, Zhi Gao, Chuanhao Li, Che Sun, Xu Bin, Yuwei Wu, Yunde Jia
Main category: cs.AI
TL;DR: ToolEQA is an embodied question answering agent that integrates external tools with multi-step reasoning to improve exploration efficiency and answer accuracy, outperforming existing methods by 9.2-20.2% in success rate.
Details
Motivation: Existing EQA methods use VLMs to directly explore environments without explicit thinking or planning, leading to inefficient exploration and limited reasoning ability. The authors aim to enhance reasoning through tool integration.
Method: ToolEQA integrates external tools with multi-step reasoning, where tools provide useful information to guide exploration. A novel EQA data generation pipeline automatically constructs large-scale EQA tasks with reasoning trajectories, resulting in the EQA-RT dataset with 18K tasks.
Result: ToolEQA improves success rate by 9.2-20.2% over state-of-the-art baselines and outperforms zero-shot ToolEQA by 10%. It also achieves SOTA performance on HM-EQA, OpenEQA, and EXPRESS-Bench datasets.
Conclusion: ToolEQA demonstrates that integrating external tools with multi-step reasoning significantly improves EQA performance, enabling more accurate responses with shorter exploration distances while maintaining generality across different datasets.
Abstract: Embodied Question Answering (EQA) requires agents to explore 3D environments to obtain observations and answer questions related to the scene. Existing methods leverage VLMs to directly explore the environment and answer questions without explicit thinking or planning, which limits their reasoning ability and results in excessive or inefficient exploration as well as ineffective responses. In this paper, we introduce ToolEQA, an agent that integrates external tools with multi-step reasoning, where external tools can provide more useful information for completing the task, helping the model derive better exploration directions in the next step of reasoning and thus obtaining additional effective information. This enables ToolEQA to generate more accurate responses with a shorter exploration distance. To enhance the model’s ability for tool-usage and multi-step reasoning, we further design a novel EQA data generation pipeline that automatically constructs large-scale EQA tasks with reasoning trajectories and corresponding answers. Based on the pipeline, we collect the EQA-RT dataset that contains about 18K tasks, divided into a training set EQA-RT-Train, and two test sets EQA-RT-Seen (scenes overlapping with the training set) and EQA-RT-Unseen (novel scenes). Experiments on EQA-RT-Seen and EQA-RT-Unseen show that ToolEQA improves the success rate by 9.2-20.2% over state-of-the-art baselines, while outperforming the zero-shot ToolEQA by 10% in success rate. In addition, ToolEQA also achieves state-of-the-art performance on the HM-EQA, OpenEQA, and EXPRESS-Bench datasets, demonstrating its generality. Our homepage: https://tooleqa.github.io.
[677] Plan Then Retrieve: Reinforcement Learning-Guided Complex Reasoning over Knowledge Graphs
Yanlin Song, Ben Liu, Víctor Gutiérrez-Basulto, Zhiwei Hu, Qianqian Xie, Min Peng, Sophia Ananiadou, Jeff Z. Pan
Main category: cs.AI
TL;DR: Graph-RFT is a two-stage reinforcement fine-tuning framework for KGQA that enables LLMs to perform autonomous planning and adaptive retrieval across KG and web sources under incomplete knowledge conditions.
Details
Motivation: Existing KGQA methods struggle to fully exploit both KG knowledge and LLM reasoning capabilities, assuming complete KG coverage and lacking mechanisms to judge when external information is needed. They also suffer from locally myopic reasoning that fails to maintain coherent multi-step planning.
Method: Two-stage framework: 1) Chain-of-thought fine-tuning with customized plan-retrieval dataset to activate structured reasoning and resolve GRPO cold-start; 2) Plan-retrieval guided RL with explicit planning/retrieval actions and multi-reward design. Uses Cartesian-inspired planning module to decompose questions and logical expressions for tool invocation.
Result: Enables coverage-aware retrieval scheduling and globally consistent multi-step reasoning. Optimizes reasoning retrieval process with multi-reward combining outcome and retrieval-specific signals.
Conclusion: Graph-RFT effectively enables LLMs to learn when and how to combine KG and web retrieval, addressing limitations of existing methods in complex KGQA scenarios with incomplete knowledge.
Abstract: Knowledge Graph Question Answering aims to answer natural language questions by reasoning over structured knowledge graphs. While large language models have advanced KGQA through their strong reasoning capabilities, existing methods continue to struggle to fully exploit both the rich knowledge encoded in KGs and the reasoning capabilities of LLMs, particularly in complex scenarios. They often assume complete KG coverage and lack mechanisms to judge when external information is needed, and their reasoning remains locally myopic, failing to maintain coherent multi-step planning, leading to reasoning failures even when relevant knowledge exists. We propose Graph-RFT, a novel two-stage reinforcement fine-tuning KGQA framework with a ‘plan-KGsearch-and-Websearch-during-think’ paradigm that enables LLMs to perform autonomous planning and adaptive retrieval scheduling across KG and web sources under incomplete knowledge conditions. Graph-RFT introduces a chain-of-thought fine-tuning method with a customized plan-retrieval dataset that activates structured reasoning and resolves the GRPO cold-start problem. It then introduces a novel plan-retrieval guided reinforcement learning process that integrates explicit planning and retrieval actions with a multi-reward design, enabling coverage-aware retrieval scheduling. It employs a Cartesian-inspired planning module to decompose complex questions into ordered subquestions, and logical expressions to guide tool invocation for globally consistent multi-step reasoning. This reasoning retrieval process is optimized with a multi-reward combining outcome and retrieval-specific signals, enabling the model to learn when and how to combine KG and web retrieval effectively.
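The multi-reward design can be caricatured in a few lines: an outcome term for answer correctness plus a retrieval term that rewards staying on the KG when its coverage suffices and falling back to web search only when it does not. The terms and weight below are our assumptions, not the paper's exact reward.

```python
def graph_rft_reward(answer_correct: bool, retrieval_calls: list,
                     answer_found_in_kg: bool, lam: float = 0.3) -> float:
    """Illustrative outcome + retrieval-specific reward."""
    outcome = 1.0 if answer_correct else 0.0
    used_web = any(call == "web" for call in retrieval_calls)
    if answer_found_in_kg:
        retrieval = 1.0 if not used_web else 0.0   # web call was unnecessary
    else:
        retrieval = 1.0 if used_web else 0.0       # fallback was required
    return outcome + lam * retrieval

# Correct answer, KG covered it, agent stayed on the KG: full reward.
print(graph_rft_reward(True, ["kg", "kg"], answer_found_in_kg=True))   # 1.3
# Correct answer but an unnecessary web call: retrieval bonus withheld.
print(graph_rft_reward(True, ["kg", "web"], answer_found_in_kg=True))  # 1.0
```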
[678] DAO-AI: Evaluating Collective Decision-Making through Agentic AI in Decentralized Governance
Agostino Capponi, Alfio Gliozzo, Chunghyun Han, Junkyu Lee
Main category: cs.AI
TL;DR: First empirical study of AI agents as autonomous decision-makers in DAO governance, showing strong alignment with human voting outcomes through realistic blockchain simulations.
Details
Motivation: To investigate whether agentic AI can effectively participate in decentralized governance as autonomous decision-makers, augmenting collective decision-making processes.
Method: Built an AI voter using 3K+ proposals, implemented through modular composable program workflow with Agentics framework, operating in realistic financial simulation environment with verifiable blockchain data.
Result: Strong alignment between AI agent decisions and human/token-weighted outcomes, measured by carefully designed evaluation metrics.
Conclusion: Agentic AI can augment collective decision-making by producing interpretable, auditable, and empirically grounded signals in DAO governance settings, contributing to explainable and economically rigorous AI agents for decentralized finance.
Abstract: This paper presents a first empirical study of agentic AI as autonomous decision-makers in decentralized governance. Using more than 3K proposals from major protocols, we build an agentic AI voter that interprets proposal contexts, retrieves historical deliberation data, and independently determines its voting position. The agent operates within a realistic financial simulation environment grounded in verifiable blockchain data, implemented through a modular composable program (MCP) workflow that defines data flow and tool usage via the Agentics framework. We evaluate how closely the agent’s decisions align with the human and token-weighted outcomes, uncovering strong alignments measured by carefully designed evaluation metrics. Our findings demonstrate that agentic AI can augment collective decision-making by producing interpretable, auditable, and empirically grounded signals in realistic DAO governance settings. The study contributes to the design of explainable and economically rigorous AI agents for decentralized financial systems.
cs.SD
[679] GuitarFlow: Realistic Electric Guitar Synthesis From Tablatures via Flow Matching and Style Transfer
Jackson Loth, Pedro Sarmento, Mark Sandler, Mathieu Barthet
Main category: cs.SD
TL;DR: GuitarFlow is a model for electric guitar synthesis that uses tablatures to guide generation, combining sample-based virtual instrument rendering with Flow Matching style transfer for realistic audio output.
Details
Motivation: Current AI music generation lacks expressivity for guitar synthesis, and tablatures provide an intuitive format that better represents guitar-specific playing techniques compared to MIDI.
Method: Two-step approach: first render tablatures to audio using a sample-based virtual instrument, then apply Flow Matching style transfer to transform the virtual instrument audio into realistic-sounding guitar audio.
Result: The model is quick to train, requiring less than 6 hours of training data, and shows significant improvement in realism on objective metrics and in a listening test.
Conclusion: GuitarFlow successfully enables controllable and expressive electric guitar synthesis using tablatures, with efficient training and improved audio realism.
Abstract: Music generation in the audio domain using artificial intelligence (AI) has witnessed steady progress in recent years. However, for some instruments, particularly the guitar, controllable instrument synthesis remains limited in expressivity. We introduce GuitarFlow, a model designed specifically for electric guitar synthesis. The generative process is guided using tablatures, a ubiquitous and intuitive guitar-specific symbolic format. The tablature format easily represents guitar-specific playing techniques (e.g. bends, muted strings and legatos), which are more difficult to represent in other common music notation formats such as MIDI. Our model relies on an intermediary step of first rendering the tablature to audio using a simple sample-based virtual instrument, then performing style transfer using Flow Matching in order to transform the virtual instrument audio into more realistic sounding examples. This results in a model that is quick both to train and to run inference, requiring less than 6 hours of training data. We present the results of objective evaluation metrics, together with a listening test, in which we show significant improvement in the realism of the generated guitar audio from tablatures.
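The style-transfer stage rests on flow matching, whose training objective is compact. Below is a minimal rectified-flow-style loss, assuming x0 is the virtual-instrument rendering and x1 the matching real recording; the actual model's tablature conditioning and any latent-space details are omitted.

```python
import torch

def flow_matching_loss(vector_field, x0: torch.Tensor, x1: torch.Tensor):
    """Regress the model's vector field onto the straight-line velocity
    between a source sample x0 and a target sample x1."""
    t = torch.rand(x0.shape[0], 1)        # random time in [0, 1)
    xt = (1 - t) * x0 + t * x1            # point on the straight path
    v_target = x1 - x0                    # constant velocity of that path
    return torch.mean((vector_field(xt, t) - v_target) ** 2)

field = lambda x, t: torch.zeros_like(x)  # placeholder for the network
x0, x1 = torch.randn(4, 32), torch.randn(4, 32)
print(flow_matching_loss(field, x0, x1))
```

At inference, integrating the learned field from the virtual-instrument rendering toward the data distribution performs the transfer.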
[680] Streaming Generation for Music Accompaniment
Yusong Wu, Mason Wang, Heidi Lei, Stephen Brade, Lancelot Blanchard, Shih-Lun Wu, Aaron Courville, Anna Huang
Main category: cs.SD
TL;DR: This paper studies real-time audio-to-audio accompaniment generation, where a model must generate coherent musical accompaniment simultaneously while listening to an input audio stream (e.g., a singer).
Details
Motivation: Current music generation models are limited to editing and loop-based workflows, but cannot perform real-time accompaniment where the model generates music in sync with live input audio streams.
Method: The authors propose a model design considering system delays with two key variables: future visibility (time offset between output playback and latest input used for conditioning) and output chunk duration (frames emitted per call). They train Transformer decoders across different combinations of these parameters.
Result: The study reveals two consistent trade-offs: increasing future visibility improves coherence but requires faster inference to maintain latency budget; increasing output chunk duration improves throughput but degrades accompaniment quality due to reduced update rate.
Conclusion: Naive maximum-likelihood streaming training is insufficient for coherent accompaniment without future context, motivating the need for advanced anticipatory and agentic objectives for live jamming applications.
Abstract: Music generation models can produce high-fidelity coherent accompaniment given complete audio input, but are limited to editing and loop-based workflows. We study real-time audio-to-audio accompaniment: as a model hears an input audio stream (e.g., a singer singing), it must simultaneously generate, in real time, a coherent accompanying stream (e.g., a guitar accompaniment). In this work, we propose a model design that accounts for the inevitable system delays of practical deployment, with two design variables: future visibility $t_f$, the offset between the output playback time and the latest input time used for conditioning, and output chunk duration $k$, the number of frames emitted per call. We train Transformer decoders across a grid of $(t_f,k)$ and show two consistent trade-offs: increasing effective $t_f$ improves coherence by reducing the recency gap, but requires faster inference to stay within the latency budget; increasing $k$ improves throughput but results in degraded accompaniment due to a reduced update rate. Finally, we observe that naive maximum-likelihood streaming training is insufficient for coherent accompaniment where future context is not available, motivating advanced anticipatory and agentic objectives for live jamming.
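One way to see the $(t_f, k)$ trade-off is as a deadline constraint. The toy accounting below assumes a fixed end-to-end latency budget and an inference real-time factor; this is our simplification for intuition, not the paper's latency model.

```python
def feasible(t_f: float, k: float, rtf: float, budget: float) -> bool:
    """Can a config deliver k seconds of audio on time? Waiting for input
    up to t_f and spending rtf * k seconds of compute must fit the budget
    (assumed constraint: t_f + rtf * k <= budget)."""
    return t_f + rtf * k <= budget

# Sweep a small (t_f, k) grid; all values are hypothetical.
for t_f in (0.0, 0.2, 0.4):
    for k in (0.1, 0.4):
        print(f"t_f={t_f}, k={k}: {feasible(t_f, k, rtf=0.5, budget=0.5)}")
```

In this accounting, more future visibility and larger chunks each eat into the same budget, which is the tension the paper's grid of experiments maps out.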
[681] M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR
Ruixiang Mao, Xiangnan Ma, Qing Yang, Ziming Zhu, Yucheng Qiao, Yuan Ge, Tong Xiao, Shengxiang Gao, Zhengtao Yu, Jingbo Zhu
Main category: cs.SD
TL;DR: Multi-scale CIF (M-CIF) improves non-autoregressive speech recognition by adding multi-level alignment with character and phoneme supervision, reducing WER especially in German and French.
Details
Motivation: The original CIF mechanism works well for Mandarin but degrades in stability for languages like English and French without finer-grained guidance.
Method: Proposes M-CIF that performs multi-level alignment by integrating character and phoneme level supervision progressively distilled into subword representations.
Result: Reduces WER compared to Paraformer baseline, with 4.21% improvement in German and 3.05% in French on CommonVoice. Analysis shows phoneme and character layers are essential.
Conclusion: Multi-scale alignment with character and phoneme supervision enhances robust acoustic-text alignment in non-autoregressive speech recognition.
Abstract: The Continuous Integrate-and-Fire (CIF) mechanism provides effective alignment for non-autoregressive (NAR) speech recognition. This mechanism creates a smooth and monotonic mapping from acoustic features to target tokens, achieving performance on Mandarin competitive with other NAR approaches. However, without finer-grained guidance, its stability degrades in some languages such as English and French. In this paper, we propose Multi-scale CIF (M-CIF), which performs multi-level alignment by integrating character and phoneme level supervision progressively distilled into subword representations, thereby enhancing robust acoustic-text alignment. Experiments show that M-CIF reduces WER compared to the Paraformer baseline, especially on CommonVoice by 4.21% in German and 3.05% in French. To further investigate these gains, we define phonetic confusion errors (PE) and space-related segmentation errors (SE) as evaluation metrics. Analysis of these metrics across different M-CIF settings reveals that the phoneme and character layers are essential for enhancing progressive CIF alignment.
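The CIF mechanism that M-CIF extends is simple to state: accumulate a per-frame weight, and each time the accumulator crosses a threshold, 'fire' one token embedding from the frames integrated since the last firing. A minimal version, without M-CIF's character- and phoneme-level supervision:

```python
import numpy as np

def cif(features: np.ndarray, alphas: np.ndarray, threshold: float = 1.0):
    """features: (T, D) acoustic frames; alphas: (T,) predicted weights.
    Returns fired token embeddings (any trailing sub-threshold weight
    is dropped in this sketch)."""
    tokens, acc, acq = [], 0.0, np.zeros(features.shape[1])
    for h, a in zip(features, alphas):
        if acc + a < threshold:
            acc, acq = acc + a, acq + a * h
        else:
            spill = acc + a - threshold          # weight beyond the boundary
            tokens.append(acq + (a - spill) * h) # fire one token
            acc, acq = spill, spill * h          # leftover seeds the next token
    return np.stack(tokens) if tokens else np.empty((0, features.shape[1]))

feats = np.random.randn(8, 4)
alphas = np.array([0.4, 0.3, 0.5, 0.2, 0.6, 0.4, 0.3, 0.4])
print(cif(feats, alphas).shape)   # (3, 4): three tokens fired
```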
[682] FOA Tokenizer: Low-bitrate Neural Codec for First Order Ambisonics with Spatial Consistency Loss
Parthasaarathy Sudarsanam, Sebastian Braun, Hannes Gamper
Main category: cs.SD
TL;DR: First neural spatial audio codec for first-order ambisonics (FOA) that compresses 4-channel audio to 0.9 kbps while preserving spatial cues through a novel spatial consistency loss.
Details
Motivation: Neural audio codecs have been well-studied for mono and stereo signals, but spatial audio remains largely unexplored, creating a gap in efficient spatial audio compression.
Method: Extended WavTokenizer architecture to support 4-channel FOA signals and introduced a novel spatial consistency loss to preserve directional cues under high compression.
Result: Achieved compression of 4-channel FOA audio at 24 kHz to 75 tokens/sec (0.9 kbps) with accurate reconstruction (mean angular errors: 13.76° simulated, 3.96° clean speech, 25.83° real RIRs). Also demonstrated usefulness for downstream tasks like sound event localization.
Conclusion: The proposed discrete neural spatial audio codec successfully compresses FOA signals while preserving spatial information and provides useful features for spatial audio applications.
Abstract: Neural audio codecs have been widely studied for mono and stereo signals, but spatial audio remains largely unexplored. We present the first discrete neural spatial audio codec for first-order ambisonics (FOA). Building on the WavTokenizer architecture, we extend it to support four-channel FOA signals and introduce a novel spatial consistency loss to preserve directional cues in the reconstructed signals under a highly compressed representation. Our codec compresses 4-channel FOA audio at 24 kHz into 75 discrete tokens per second, corresponding to a bit rate of 0.9 kbps. Evaluations on simulated reverberant mixtures, non-reverberant clean speech, and FOA mixtures with real room impulse responses show accurate reconstruction, with mean angular errors of 13.76°, 3.96°, and 25.83°, respectively, across the three conditions. In addition, discrete latent representations derived from our codec provide useful features for downstream spatial audio tasks, as demonstrated on sound event localization and detection with STARSS23 real recordings.
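The spatial cues at stake can be made concrete: in FOA, the pressure channel (W) times each velocity channel (X, Y, Z) gives an acoustic intensity vector whose direction is the source DOA, and a spatial consistency loss can compare that direction before and after coding. The formulation below is one plausible instantiation, not necessarily the paper's exact loss.

```python
import numpy as np

def doa_vector(foa: np.ndarray) -> np.ndarray:
    """Time-averaged intensity direction from a (4, T) FOA signal (W, X, Y, Z)."""
    w, xyz = foa[0], foa[1:]
    v = (w * xyz).mean(axis=1)          # active intensity along each axis
    return v / (np.linalg.norm(v) + 1e-9)

def spatial_consistency_loss(ref: np.ndarray, rec: np.ndarray) -> float:
    """Angle (radians) between reference and reconstructed DOA vectors."""
    cos = np.clip(doa_vector(ref) @ doa_vector(rec), -1.0, 1.0)
    return float(np.arccos(cos))

s = np.random.randn(4800)
ref = np.stack([s, s, 0 * s, 0 * s])          # source from the front: energy on X
rec = np.stack([s, 0.9 * s, 0.1 * s, 0 * s])  # slightly smeared reconstruction
print(np.degrees(spatial_consistency_loss(ref, rec)))  # ~6 degrees of error
```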
[683] PromptReverb: Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching
Ali Vosoughi, Yongyi Zang, Qihui Yang, Nathan Peak, Randal Leistikow, Chenliang Xu
Main category: cs.SD
TL;DR: PromptReverb is a two-stage generative framework that generates high-quality room impulse responses (RIRs) from band-limited inputs or natural language descriptions, achieving superior acoustic accuracy compared to existing methods.
Details
Motivation: Current RIR generation methods face two main limitations: scarcity of full-band RIR datasets and inability to generate acoustically accurate responses from diverse input modalities like natural language.
Method: Two-stage approach combining: 1) variational autoencoder for upsampling band-limited RIRs to full-band quality (48 kHz), and 2) conditional diffusion transformer model based on rectified flow matching that generates RIRs from natural language descriptions.
Result: PromptReverb produces RIRs with superior perceptual quality and acoustic accuracy, achieving 8.8% mean RT60 error compared to -37% for widely used baselines, and yields more realistic room-acoustic parameters.
Conclusion: The method enables practical applications in virtual reality, architectural acoustics, and audio production where flexible, high-quality RIR synthesis is essential.
Abstract: Room impulse response (RIR) generation remains a critical challenge for creating immersive virtual acoustic environments. Current methods suffer from two fundamental limitations: the scarcity of full-band RIR datasets and the inability of existing models to generate acoustically accurate responses from diverse input modalities. We present PromptReverb, a two-stage generative framework that addresses these challenges. Our approach combines a variational autoencoder that upsamples band-limited RIRs to full-band quality (48 kHz), and a conditional diffusion transformer model based on rectified flow matching that generates RIRs from descriptions in natural language. Empirical evaluation demonstrates that PromptReverb produces RIRs with superior perceptual quality and acoustic accuracy compared to existing methods, achieving 8.8% mean RT60 error compared to -37% for widely used baselines and yielding more realistic room-acoustic parameters. Our method enables practical applications in virtual reality, architectural acoustics, and audio production where flexible, high-quality RIR synthesis is essential.
[684] Evaluating Multimodal Large Language Models on Core Music Perception Tasks
Brandon James Carone, Iran R. Roman, Pablo Ripollés
Main category: cs.SD
TL;DR: Current multimodal LLMs show strong musical reasoning on symbolic MIDI inputs but perform poorly on audio inputs, revealing a significant perception gap in true musical listening capabilities.
Details
Motivation: To benchmark state-of-the-art multimodal LLMs on core music skills and separate the effects of perceptual limitations (audio vs MIDI), example exposure, and reasoning strategies to understand their true musical understanding capabilities.
Method: Benchmarked three SOTA LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, Qwen2.5-Omni) across three music tasks: Syncopation Scoring, Transposition Detection, and Chord Quality Identification. Varied input types (audio vs MIDI), prompting strategies (zero-shot vs few-shot), and reasoning approaches (Standalone, Chain-of-Thought, LogicLM adapted for music).
Result: Clear perceptual gap: models perform near ceiling on MIDI but show significant accuracy drops on audio. Reasoning and few-shot prompting offer minimal gains. Gemini Pro achieved highest performance across most conditions. LogicLM showed near-perfect MIDI accuracy but remained brittle on audio.
Conclusion: Current multimodal LLMs reason well over musical symbols (MIDI) but do not reliably “listen” from audio. The study provides explicit boundaries between perception and reasoning and offers guidance for building robust audio-first music systems.
Abstract: Multimodal Large Language Models (LLMs) claim “musical understanding” via evaluations that conflate listening with score reading. We benchmark three SOTA LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) across three core music skills: Syncopation Scoring, Transposition Detection, and Chord Quality Identification. Moreover, we separate three sources of variability: (i) perceptual limitations (audio vs. MIDI inputs), (ii) exposure to examples (zero- vs. few-shot manipulations), and (iii) reasoning strategies (Standalone, CoT, LogicLM). For the latter we adapt LogicLM, a framework combining LLMs with symbolic solvers to perform structured reasoning, to music. Results reveal a clear perceptual gap: models perform near ceiling on MIDI but show accuracy drops on audio. Reasoning and few-shot prompting offer minimal gains. This is expected for MIDI, where performance reaches saturation, but more surprising for audio, where LogicLM, despite near-perfect MIDI accuracy, remains notably brittle. Among models, Gemini Pro achieves the highest performance across most conditions. Overall, current systems reason well over symbols (MIDI) but do not yet “listen” reliably from audio. Our method and dataset make the perception-reasoning boundary explicit and offer actionable guidance for building robust, audio-first music systems.
[685] SAO-Instruct: Free-form Audio Editing using Natural Language Instructions
Michael Ungersböck, Florian Grötschla, Luca A. Lanzendörfer, June Young Yi, Changho Choi, Roger Wattenhofer
Main category: cs.SD
TL;DR: SAO-Instruct is a model for editing audio clips using free-form natural language instructions, based on Stable Audio Open and trained on audio editing triplets.
Details
Motivation: Current audio editing methods require complete descriptions or predefined instructions, lacking flexibility for natural language-based editing.
Method: The model uses Stable Audio Open and is trained on audio editing triplets created using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline.
Result: SAO-Instruct achieves competitive performance on objective metrics and outperforms other approaches in subjective listening studies, generalizing well to real audio and unseen instructions.
Conclusion: The model enables flexible audio editing with natural language instructions and the code and weights are released for future research.
Abstract: Generative models have made significant progress in synthesizing high-fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require the complete description of the edited audio or are constrained to predefined edit instructions that lack flexibility. In this work, we introduce SAO-Instruct, a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions. We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study. To encourage future research, we release our code and model weights.
[686] TwinShift: Benchmarking Audio Deepfake Detection across Synthesizer and Speaker Shifts
Jiyoung Hong, Yoonseo Chung, Seungyeon Oh, Juntae Kim, Jiyoung Lee, Sookyung Kim, Hyunsoo Cho
Main category: cs.SD
TL;DR: TWINSHIFT is a benchmark for evaluating audio deepfake detection robustness under unseen conditions, using six synthesis systems with disjoint speaker sets to test generalization.
Details
Motivation: Audio deepfakes pose growing threats in fraud and misinformation, and current detectors struggle to generalize to unseen synthesis methods and speakers, limiting real-world reliability.
Method: Constructed benchmark from six different synthesis systems, each paired with disjoint speaker sets, to rigorously assess detector generalization when both generative models and speaker identities change.
Result: TWINSHIFT reveals important robustness gaps and uncovers overlooked limitations in current audio deepfake detection systems.
Conclusion: The benchmark provides principled guidance for developing more robust audio deepfake detection systems and is publicly accessible.
Abstract: Audio deepfakes pose a growing threat, already exploited in fraud and misinformation. A key challenge is ensuring detectors remain robust to unseen synthesis methods and diverse speakers, since generation techniques evolve quickly. Despite strong benchmark results, current systems struggle to generalize to new conditions, limiting real-world reliability. To address this, we introduce TWINSHIFT, a benchmark explicitly designed to evaluate detection robustness under strictly unseen conditions. Our benchmark is constructed from six different synthesis systems, each paired with disjoint sets of speakers, allowing for a rigorous assessment of how well detectors generalize when both the generative model and the speaker identity change. Through extensive experiments, we show that TWINSHIFT reveals important robustness gaps, uncovers overlooked limitations, and provides principled guidance for developing ADD systems. The TWINSHIFT benchmark can be accessed at https://github.com/intheMeantime/TWINSHIFT.
[687] Low-Resource Audio Codec (LRAC): 2025 Challenge Description
Kamil Wojcicki, Yusuf Ziya Isik, Laura Lechler, Mansur Yesilbursa, Ivana Balić, Wolfgang Mack, Rafał Łaganowski, Guoqing Zhang, Yossi Adi, Minje Kim, Shinji Watanabe
Main category: cs.SD
TL;DR: The 2025 Low-Resource Audio Codec Challenge aims to develop neural and hybrid codecs for resource-constrained applications, addressing limitations of current neural audio codecs in low-resource operation and robustness to acoustic distortions.
Details
Motivation: Current neural audio codecs deliver superior speech quality at ultralow bitrates but face practical adoption obstacles due to low-resource operation constraints and lack of robustness to acoustic distortions like background noise and reverberation.
Method: The challenge provides participants with a standardized training dataset, two baseline systems, and a comprehensive evaluation framework to develop neural and hybrid codecs for resource-constrained applications.
Result: The challenge is expected to yield valuable insights applicable to both codec design and related downstream audio tasks, though specific results will depend on participant submissions.
Conclusion: The 2025 Low-Resource Audio Codec Challenge aims to catalyze progress in developing neural codecs that can operate under stringent compute constraints while maintaining low latency, low bitrate, and robustness to acoustic degradations.
Abstract: While recent neural audio codecs deliver superior speech quality at ultralow bitrates over traditional methods, their practical adoption is hindered by obstacles related to low-resource operation and robustness to acoustic distortions. Edge deployment scenarios demand codecs that operate under stringent compute constraints while maintaining low latency and bitrate. The presence of background noise and reverberation further necessitates designs that are resilient to such degradations. The performance of neural codecs under these constraints and their integration with speech enhancement remain largely unaddressed. To catalyze progress in this area, we introduce the 2025 Low-Resource Audio Codec Challenge, which targets the development of neural and hybrid codecs for resource-constrained applications. Participants are supported with a standardized training dataset, two baseline systems, and a comprehensive evaluation framework. The challenge is expected to yield valuable insights applicable to both codec design and related downstream audio tasks.
[688] Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization
Bernardo Torres, Manuel Moussallam, Gabriel Meseguer-Brocal
Main category: cs.SD
TL;DR: A training method using data augmentation to make audio autoencoder latent spaces linear, enabling intuitive algebraic operations like mixing and scaling while maintaining reconstruction quality.
Details
Motivation: Audio autoencoders create compressed representations but their non-linear latent spaces prevent intuitive algebraic manipulation such as mixing or scaling, limiting their practical utility.
Method: Simple training methodology using data augmentation to induce linearity in Consistency Autoencoder (CAE) by enforcing homogeneity (equivariance to scalar gain) and additivity (decoder preserves addition) without changing model architecture or loss function.
Result: The CAE exhibits linear behavior in both encoder and decoder while preserving reconstruction fidelity, enabling practical applications like music source composition and separation via simple latent arithmetic.
Conclusion: This work presents a straightforward technique for constructing structured latent spaces that enables more intuitive and efficient audio processing through linear latent representations.
Abstract: Audio autoencoders learn useful, compressed audio representations, but their non-linear latent spaces prevent intuitive algebraic manipulation such as mixing or scaling. We introduce a simple training methodology to induce linearity in a high-compression Consistency Autoencoder (CAE) by using data augmentation, thereby inducing homogeneity (equivariance to scalar gain) and additivity (the decoder preserves addition) without altering the model’s architecture or loss function. When trained with our method, the CAE exhibits linear behavior in both the encoder and decoder while preserving reconstruction fidelity. We test the practical utility of our learned space on music source composition and separation via simple latent arithmetic. This work presents a straightforward technique for constructing structured latent spaces, enabling more intuitive and efficient audio processing.
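The two induced properties translate directly into testable constraints on a generic encoder/decoder pair. The sketch below states them as mean-squared penalties; the gain-sampling range and weighting are our assumptions.

```python
import torch

def linearity_losses(enc, dec, x1, x2):
    """Homogeneity: encoding a gained signal should gain the latent.
    Additivity: decoding a sum of latents should sum the decodings."""
    g = torch.empty(x1.shape[0], 1).uniform_(0.25, 1.0)   # random per-sample gains
    z1, z2 = enc(x1), enc(x2)
    hom = torch.mean((enc(g * x1) - g * z1) ** 2)
    add = torch.mean((dec(z1 + z2) - (dec(z1) + dec(z2))) ** 2)
    return hom, add

# A toy linear encoder/decoder satisfies both constraints exactly.
enc = torch.nn.Linear(16, 4, bias=False)
dec = torch.nn.Linear(4, 16, bias=False)
x1, x2 = torch.randn(8, 16), torch.randn(8, 16)
hom, add = linearity_losses(enc, dec, x1, x2)
print(hom.item(), add.item())   # both ~0 for a linear pair
```

Note that the paper reports inducing these properties purely through data augmentation, without altering the loss function; the explicit penalties above are just one way to render the same two constraints in code.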
[689] ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models
Bohan Li, Wenbin Huang, Yuhang Qiu, Yiwei Guo, Hankun Wang, Zhihan Li, Jing Peng, Ziyang Ma, Xie Chen, Kai Yu
Main category: cs.SD
TL;DR: ISA-Bench is a benchmark for evaluating instruction sensitivity in Large Audio Language Models (LALMs), revealing that even state-of-the-art models are highly sensitive to how instructions are phrased, affecting both compliance and task performance.
Details
Motivation: Existing LALMs are highly sensitive to instruction phrasing, affecting instruction-following rates and task performance, but no benchmarks systematically evaluate this sensitivity.
Method: Introduces ISA-Bench, a dynamic benchmark evaluating instruction sensitivity along three axes: instruction description, output format, and task composition. Fine-tunes Qwen2-Audio on a complex instruction-variant dataset.
Result: State-of-the-art LALMs show significant instruction sensitivity, leading to degraded performance on fundamental audio understanding tasks. Fine-tuning improves instruction-following but causes catastrophic forgetting of previously mastered capabilities.
Conclusion: ISA-Bench provides a standardized basis for assessing and improving instruction sensitivity in LALMs, highlighting the need for instruction-robust audio understanding in real-world applications.
Abstract: Large Audio Language Models (LALMs), which couple acoustic perception with large language models (LLMs) to extract and understand diverse information from audio, have attracted intense interest from both academic and industrial communities. However, existing LALMs are highly sensitive to how instructions are phrased, affecting both (i) instruction-following rates and (ii) task performance. Yet, no existing benchmarks offer a systematic and comprehensive evaluation of this sensitivity. We introduce ISA-Bench, a dynamic benchmark evaluating instruction sensitivity for LALMs along three axes: instruction description, output format, and task composition. We assess recent open-source and proprietary LALMs using ISA-Bench, profiling both compliance and accuracy under controlled instruction variations. Experimental results reveal that even state-of-the-art LALMs suffer significant instruction sensitivity, leading to degraded performance on fundamental audio understanding tasks. To mitigate this issue, we fine-tune Qwen2-Audio on a specifically constructed complex instruction-variant dataset, achieving a marked improvement in instruction-following performance. However, this also induces nontrivial catastrophic forgetting: the model loses some previously mastered task capabilities when exposed to new instruction styles. Our benchmark provides a standardized basis for assessing and improving instruction sensitivity in LALMs, underscoring the need for instruction-robust audio understanding in real-world pipelines.
[690] Memristive Nanowire Network for Energy Efficient Audio Classification: Pre-Processing-Free Reservoir Computing with Reduced Latency
Akshaya Rajesh, Pavithra Ananthasubramanian, Nagarajan Raghavan, Ankush Kumar
Main category: cs.SD
TL;DR: Memristive nanowire networks enable efficient audio feature extraction for spoken-digit classification, achieving high accuracy with significant data compression and low training latency compared to conventional methods.
Details
Motivation: To develop low-latency, resource-efficient audio preprocessing for speech recognition that overcomes the computational demands of conventional techniques like Mel Spectrogram and PLP.
Method: Using memristive nanowire networks as a neuromorphic hardware preprocessing layer to extract compact features directly from raw audio data.
Result: Achieved 98.95% accuracy with 66x data compression (XGBoost) and 97.9% accuracy with 255x compression (Random Forest), outperforming MFCC and other state-of-the-art techniques in efficiency and multispeaker classification (96.5% accuracy).
Conclusion: Nanowire networks provide a novel, low-latency, data-efficient feature extraction approach that enables high-performance neuromorphic audio classification with enhanced linear separability.
Abstract: Efficient audio feature extraction is critical for low-latency, resource-constrained speech recognition. Conventional preprocessing techniques, such as Mel Spectrogram, Perceptual Linear Prediction (PLP), and Learnable Spectrogram, achieve high classification accuracy but require large feature sets and significant computation. The low-latency and power efficiency benefits of neuromorphic computing offer a strong potential for audio classification. Here, we introduce memristive nanowire networks as a neuromorphic hardware preprocessing layer for spoken-digit classification, a capability not previously demonstrated. Nanowire networks extract compact, informative features directly from raw audio, achieving a favorable trade-off between accuracy, dimensionality reduction from the original audio size (data compression), and training time efficiency. Compared with state-of-the-art software techniques, nanowire features reach 98.95% accuracy with 66 times data compression (XGBoost) and 97.9% accuracy with 255 times compression (Random Forest) in sub-second training latency. Across multiple classifiers, nanowire features consistently achieve more than 90% accuracy with more than 62.5 times compression, outperforming features extracted by conventional state-of-the-art techniques such as MFCC in efficiency without loss of performance. Moreover, nanowire features achieve 96.5% accuracy classifying multispeaker audio, outperforming all state-of-the-art feature accuracies while achieving the highest data compression and lowest training time. Nanowire network preprocessing also enhances linear separability of audio data, improving simple classifier performance and generalizing across speakers. These results demonstrate that memristive nanowire networks provide a novel, low-latency, and data-efficient feature extraction approach, enabling high-performance neuromorphic audio classification.
[691] $\texttt{AVROBUSTBENCH}$: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time
Sarthak Kumar Maharana, Saksham Singh Kushwaha, Baoming Zhang, Adrian Rodriguez, Songtao Wei, Yapeng Tian, Yunhui Guo
Main category: cs.SD
TL;DR: AVROBUSTBENCH is a comprehensive benchmark for evaluating audio-visual model robustness under simultaneous bimodal corruptions, revealing limitations in current models and TTA methods, with proposed AV2C approach showing improvements.
Details
Motivation: Existing robustness benchmarks focus on single modalities, making them insufficient for assessing audio-visual models that face real-world scenarios with simultaneous shifts in both audio and visual modalities.
Method: Created AVROBUSTBENCH with four datasets incorporating 75 co-occurring and correlated bimodal audio-visual corruptions. Evaluated state-of-the-art supervised and self-supervised models, and proposed AV2C TTA approach that penalizes high-entropy samples for cross-modal fusion.
Result: Audio-visual models show declining robustness with increasing corruption severity. Online TTA methods offer minimal improvements under bimodal corruptions. AV2C achieves improvements on VGGSOUND-2C dataset.
Conclusion: AVROBUSTBENCH provides a needed benchmark for developing more robust audio-visual models and TTA approaches that can handle real-world bimodal distributional shifts.
Abstract: While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time remains not fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur $\textit{simultaneously}$ in both audio and visual modalities, we introduce $\texttt{AVROBUSTBENCH}$, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. $\texttt{AVROBUSTBENCH}$ comprises four audio-visual benchmark datasets, $\texttt{AUDIOSET-2C}$, $\texttt{VGGSOUND-2C}$, $\texttt{KINETICS-2C}$, and $\texttt{EPICKITCHENS-2C}$, each incorporating 75 bimodal audio-visual corruptions that are $\textit{co-occurring}$ and $\textit{correlated}$. Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases. Furthermore, online test-time adaptation (TTA) methods, on $\texttt{VGGSOUND-2C}$ and $\texttt{KINETICS-2C}$, offer minimal improvements in performance under bimodal corruptions. We further propose $\texttt{AV2C}$, a simple TTA approach enabling on-the-fly cross-modal fusion by penalizing high-entropy samples, which achieves improvements on $\texttt{VGGSOUND-2C}$. We hope that $\texttt{AVROBUSTBENCH}$ will steer the development of more effective and robust audio-visual TTA approaches. Our code is available at https://github.com/sarthaxxxxx/AV-C-Robustness-Benchmark.
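The AV2C move, penalizing high-entropy samples at test time, reduces to a small filtering step before cross-modal fusion. The hard threshold below is an illustrative choice standing in for whatever weighting the paper uses.

```python
import torch
import torch.nn.functional as F

def av2c_weights(logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Down-weight high-entropy (unreliable) test samples: 1 to keep,
    0 to drop before fusing audio and visual predictions."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return (entropy < tau).float()

logits = torch.tensor([[4.0, 0.0, 0.0],    # confident -> low entropy, kept
                       [0.1, 0.0, 0.1]])   # uncertain -> high entropy, dropped
print(av2c_weights(logits, tau=0.5))        # tensor([1., 0.])
```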
[692] Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries
Pengfei Cai, Yan Song, Qing Gu, Nan Jiang, Haoyu Song, Ian McLoughlin
Main category: cs.SD
TL;DR: DASM is a query-based framework for open-vocabulary sound event detection that formulates SED as frame-level retrieval using multi-modal queries, achieving superior performance in both closed-set and open-vocabulary settings.
Details
Motivation: Existing SED methods are limited to closed-set detection, while recent language-driven zero-shot approaches suffer from poor performance due to lack of fine-grained alignment and cross-modal feature fusion.
Method: Proposes a dual-stream decoder that decouples event recognition and temporal localization: cross-modality event decoder for query-feature fusion and clip-level detection, and context network for frame-level localization. Includes inference-time attention masking to leverage semantic relations between classes.
Result: Outperforms CLAP-based methods in open-vocabulary setting (+7.8 PSDS) and baseline in closed-set setting (+6.9 PSDS). Achieves PSDS1 score of 42.2 in cross-dataset zero-shot evaluation on DESED, exceeding supervised CRNN baseline.
Conclusion: DASM effectively balances localization accuracy with generalization to novel classes, demonstrating strong performance in both closed-set and open-vocabulary sound event detection scenarios.
Abstract: Most existing sound event detection (SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored language-driven zero-shot SED by exploiting audio-language models, their performance is still far from satisfactory due to the lack of fine-grained alignment and cross-modal feature fusion. In this work, we propose the Detect Any Sound Model (DASM), a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. To support this formulation, DASM introduces a dual-stream decoder that explicitly decouples event recognition and temporal localization: a cross-modality event decoder performs query-feature fusion and determines the presence of sound events at the clip-level, while a context network models temporal dependencies for frame-level localization. Additionally, an inference-time attention masking strategy is proposed to leverage semantic relations between base and novel classes, substantially enhancing generalization to novel classes. Experiments on the AudioSet Strong dataset demonstrate that DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting (+7.8 PSDS) and the baseline in the closed-set setting (+6.9 PSDS). Furthermore, in cross-dataset zero-shot evaluation on DESED, DASM achieves a PSDS1 score of 42.2, even exceeding the supervised CRNN baseline. The project page is available at https://cai525.github.io/Transformer4SED/demo_page/DASM/.
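Formulating SED as frame-level retrieval boils down to a similarity map between per-frame audio features and query vectors. A minimal version, leaving out DASM's dual-stream decoder and attention masking:

```python
import torch
import torch.nn.functional as F

def frame_level_retrieval(frames: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
    """frames: (T, d) audio features; queries: (C, d) text- or
    audio-derived query vectors. Returns (T, C) cosine similarities;
    thresholding them gives per-class activity over time."""
    return F.normalize(frames, dim=-1) @ F.normalize(queries, dim=-1).T

frames = torch.randn(100, 256)     # 100 frames of audio features
queries = torch.randn(5, 256)      # 5 event queries ("dog bark", ...)
activity = frame_level_retrieval(frames, queries) > 0.5
print(activity.shape)              # torch.Size([100, 5])
```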
[693] PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective
Alain Riou, Bernardo Torres, Ben Hayes, Stefan Lattner, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters
Main category: cs.SD
TL;DR: PESTO is a self-supervised, lightweight pitch estimation model using Siamese architecture and VQT frames, achieving state-of-the-art performance without annotated data.
Details
Motivation: To develop a pitch estimation method that eliminates the need for annotated data while maintaining high performance and enabling real-time applications.
Method: Uses Siamese architecture with VQT frames, Toeplitz fully-connected layer for translation equivariance, and class-based transposition-equivariant objective with pitch-shifted pairs.
Result: Outperforms self-supervised baselines and competes with supervised methods on MIR-1K, MDB-stem-synth, and PTDB datasets, with superior cross-dataset generalization and low latency (<10 ms).
Conclusion: PESTO demonstrates that self-supervised learning can achieve competitive pitch estimation performance while being lightweight and suitable for real-time applications.
Abstract: In this paper, we introduce PESTO, a self-supervised learning approach for single-pitch estimation using a Siamese architecture. Our model processes individual frames of a Variable-$Q$ Transform (VQT) and predicts pitch distributions. The neural network is designed to be equivariant to translations, notably thanks to a Toeplitz fully-connected layer. In addition, we construct pitch-shifted pairs by translating and cropping the VQT frames and train our model with a novel class-based transposition-equivariant objective, eliminating the need for annotated data. Thanks to this architecture and training objective, our model achieves remarkable performance while being very lightweight ($130$k parameters). Evaluations on music and speech datasets (MIR-1K, MDB-stem-synth, and PTDB) demonstrate that PESTO not only outperforms self-supervised baselines but also competes with supervised methods, exhibiting superior cross-dataset generalization. Finally, we enhance PESTO’s practical utility by developing a streamable VQT implementation using cached convolutions. Combined with our model’s low latency (less than 10 ms) and minimal parameter count, this makes PESTO particularly suitable for real-time applications.
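The transposition-equivariant objective can be sketched in a few lines: pitch-shifting the input VQT by some number of bins should shift the predicted pitch distribution by the same number of bins. The roll-based version below skips the paper's cropping and class-based loss details.

```python
import torch

def equivariance_loss(model, vqt_frame: torch.Tensor, shift: int) -> torch.Tensor:
    """Penalize deviation from: model(shift(x)) == shift(model(x))."""
    p = model(vqt_frame)                              # (n_bins,) pitch distribution
    p_shifted = model(torch.roll(vqt_frame, shift, dims=-1))
    target = torch.roll(p.detach(), shift, dims=-1)
    return torch.mean((p_shifted - target) ** 2)

model = lambda x: torch.softmax(x, dim=-1)   # toy, perfectly equivariant 'model'
frame = torch.randn(128)                     # stand-in for one VQT frame
print(equivariance_loss(model, frame, shift=5))  # ~0
```

Because pairs are built by translating the input itself, no pitch labels are needed, which is what makes the training self-supervised.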
[694] EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering
Tianxin Xie, Shan Yang, Chenxing Li, Dong Yu, Li Liu
Main category: cs.SD
TL;DR: EmoSteer-TTS is a training-free method for fine-grained speech emotion control in TTS systems using activation steering, enabling emotion conversion, interpolation, and erasure without retraining.
Details
Motivation: Existing TTS systems offer only coarse emotion control via discrete labels or detailed text prompts, making fine-grained manipulation inaccessible or unstable, and requiring extensive training datasets.
Method: Uses activation steering by modifying internal activations in flow matching-based TTS models, with a training-free algorithm including activation extraction, emotional token searching, and inference-time steering that integrates with pretrained models.
Result: Enables fine-grained, interpretable, and continuous control over speech emotion, outperforming state-of-the-art methods without requiring additional training.
Conclusion: First method to achieve training-free and continuous fine-grained emotion control in TTS, with potential for broad application across various pretrained TTS models.
Abstract: Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming the state-of-the-art (SOTA). To the best of our knowledge, this is the first method that achieves training-free and continuous fine-grained emotion control in TTS. Demo samples are available at https://emosteer-tts-demo.pages.dev/.
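A minimal sketch of the activation-steering mechanism, assuming activations can be intercepted with a PyTorch forward hook; the layer choice, the `tts_model` handle, and the scaling factor are illustrative, not the paper's actual interface.

```python
import torch

def build_steering_vector(acts_emotional, acts_neutral):
    # acts_*: (n_samples, hidden_dim) activations collected from reference speech
    return acts_emotional.mean(dim=0) - acts_neutral.mean(dim=0)

def make_steering_hook(steer_vec, alpha=1.0):
    """Add the scaled steering vector to a layer's output at inference time.
    Varying alpha gives interpolation; negating it approximates erasure."""
    def hook(module, inputs, output):
        return output + alpha * steer_vec.to(output.device)
    return hook

# Hypothetical usage on some flow matching-based TTS model:
# handle = tts_model.blocks[k].register_forward_hook(make_steering_hook(v, 0.8))
```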
[695] ARIONet: An Advanced Self-supervised Contrastive Representation Network for Birdsong Classification and Future Frame Prediction
Md. Abdur Rahman, Selvarajah Thuseethan, Kheng Cher Yeo, Reem E. Mohamed, Sami Azam
Main category: cs.SD
TL;DR: ARIONet is a self-supervised contrastive network for birdsong classification that jointly optimizes contrastive learning and future frame prediction using augmented audio representations, achieving state-of-the-art performance on multiple datasets.
Details
Motivation: Existing birdsong classification methods heavily depend on labeled data, use limited feature representations, and overlook temporal dynamics essential for accurate species identification.
Method: Proposes ARIONet that integrates multiple audio features in a transformer-based encoder, using contrastive learning to maximize similarity between augmented views of same audio segments while pushing apart different samples, and models temporal dynamics by predicting future audio frames.
Result: Achieved classification accuracies of 98.41%, 93.07%, 91.89%, and 91.58%, and F1-scores of 97.84%, 94.10%, 91.29%, and 90.94% on four datasets, with high cosine similarity (up to 95%) in future frame prediction.
Conclusion: The self-supervised learning strategy effectively captures complex acoustic patterns and temporal dependencies, demonstrating strong potential for real-world ecological conservation and monitoring applications.
Abstract: Automated birdsong classification is essential for advancing ecological monitoring and biodiversity studies. Despite recent progress, existing methods often depend heavily on labeled data, use limited feature representations, and overlook temporal dynamics essential for accurate species identification. In this work, we propose a self-supervised contrastive network, ARIONet (Acoustic Representation for Interframe Objective Network), that jointly optimizes contrastive classification and future frame prediction using augmented audio representations. The model simultaneously integrates multiple complementary audio features within a transformer-based encoder model. Our framework is designed with two key objectives: (1) to learn discriminative species-specific representations for contrastive learning through maximizing similarity between augmented views of the same audio segment while pushing apart different samples, and (2) to model temporal dynamics by predicting future audio frames, both without requiring large-scale annotations. We validate our framework on four diverse birdsong datasets, including the British Birdsong Dataset, Bird Song Dataset, and two extended Xeno-Canto subsets (A-M and N-Z). Our method consistently outperforms existing baselines and achieves classification accuracies of 98.41%, 93.07%, 91.89%, and 91.58%, and F1-scores of 97.84%, 94.10%, 91.29%, and 90.94%, respectively. Furthermore, it demonstrates low mean absolute errors and high cosine similarity, up to 95%, in future frame prediction tasks. Extensive experiments further confirm the effectiveness of our self-supervised learning strategy in capturing complex acoustic patterns and temporal dependencies, as well as its potential for real-world applicability in ecological conservation and monitoring.
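A minimal sketch of the joint objective under stated assumptions: an InfoNCE-style contrastive term over two augmented views plus a future-frame regression term. The weighting and exact loss forms are illustrative; the paper's formulation may differ in detail.

```python
import torch
import torch.nn.functional as F

def joint_contrastive_prediction_loss(z1, z2, pred_next, true_next,
                                      temperature=0.1, lam=1.0):
    """z1, z2: (batch, dim) embeddings of two augmented views of one clip;
    pred_next/true_next: predicted vs. actual next-frame features."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature               # all pairwise similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    contrastive = F.cross_entropy(logits, labels)  # diagonal pairs are positives
    prediction = F.mse_loss(pred_next, true_next)  # temporal-dynamics term
    return contrastive + lam * prediction
```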
[696] Automatic Music Sample Identification with Multi-Track Contrastive Learning
Alain Riou, Joan Serrà, Yuki Mitsufuji
Main category: cs.SD
TL;DR: This paper presents a self-supervised contrastive learning approach for automatic sample identification in music, which detects reused audio content and retrieves its original source material.
Details
Motivation: Sampling is common in modern music production, but automatic identification of sampled content is challenging. The paper aims to develop an effective method to detect sampled audio and trace it back to its original source.
Method: Uses self-supervised learning with a multi-track dataset to create positive pairs of artificial mixes, and designs a novel contrastive learning objective for training the model.
Result: The method significantly outperforms previous state-of-the-art baselines, shows robustness across genres, scales well with increasing reference database size, and demonstrates the importance of high-quality separated stems.
Conclusion: The proposed self-supervised contrastive learning approach is highly effective for automatic sample identification, with strong performance, genre robustness, and scalability advantages over existing methods.
Abstract: Sampling, the technique of reusing pieces of existing audio tracks to create new music content, is a very common practice in modern music production. In this paper, we tackle the challenging task of automatic sample identification, that is, detecting such sampled content and retrieving the material from which it originates. To do so, we adopt a self-supervised learning approach that leverages a multi-track dataset to create positive pairs of artificial mixes, and design a novel contrastive learning objective. We show that this method significantly outperforms previous state-of-the-art baselines, is robust across various genres, and scales well as the number of noise songs in the reference database increases. In addition, we extensively analyze the contribution of the different components of our training pipeline and highlight, in particular, the need for high-quality separated stems for this task.
cs.LG
[697] A Feature Engineering Approach for Business Impact-Oriented Failure Detection in Distributed Instant Payment Systems
Lorenzo Porcelli
Main category: cs.LG
TL;DR: A novel feature engineering approach using ISO 20022 message processing times enables early anomaly detection and failure localization in instant payment systems, bridging technical metrics with business process visibility.
Details
Motivation: Traditional monitoring fails to connect technical infrastructure metrics with business process visibility in instant payment systems that require zero-downtime and process millions of transactions daily.
Method: Feature engineering based on processing times between consecutive ISO 20022 message exchanges creates compact system state representations, with anomaly detection applied to these features for early failure detection and incident classification.
Result: Evaluation on TARGET Instant Payment Settlement (TIPS) system shows effectiveness in detecting diverse anomaly patterns, providing interpretable explanations, differentiating internal vs. external issues, and significantly reducing investigation time.
Conclusion: The framework successfully bridges observability gaps in distributed payment systems by mapping features to processing phases, enabling operators to understand business impact and localize failures effectively.
Abstract: Instant payment infrastructures have stringent performance requirements, processing millions of transactions daily with zero-downtime expectations. Traditional monitoring approaches fail to bridge the gap between technical infrastructure metrics and business process visibility. We introduce a novel feature engineering approach based on processing times computed between consecutive ISO 20022 message exchanges, creating a compact representation of system state. By applying anomaly detection to these features, we enable early failure detection and localization, allowing incident classification. Experimental evaluation on the TARGET Instant Payment Settlement (TIPS) system, using both real-world incidents and controlled simulations, demonstrates the approach’s effectiveness in detecting diverse anomaly patterns and provides inherently interpretable explanations that enable operators to understand the business impact. By mapping features to distinct processing phases, the resulting framework differentiates between internal and external payment system issues, significantly reduces investigation time, and bridges observability gaps in distributed systems where transaction state is fragmented across multiple entities.
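A minimal sketch of the feature idea under stated assumptions: a message log with `transaction_id`, `message_type`, and `timestamp` columns (the column names and the `message_log` variable are hypothetical), turned into per-transaction processing-time deltas and fed to a stock anomaly detector.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def processing_time_features(messages: pd.DataFrame) -> pd.DataFrame:
    msgs = messages.sort_values(["transaction_id", "timestamp"])
    # Seconds between consecutive ISO 20022 exchanges of one transaction
    msgs["delta_s"] = (msgs.groupby("transaction_id")["timestamp"]
                           .diff().dt.total_seconds())
    # One row per transaction, one column per processing phase (message hop)
    return msgs.pivot_table(index="transaction_id", columns="message_type",
                            values="delta_s", aggfunc="first").fillna(0.0)

features = processing_time_features(message_log)      # message_log: hypothetical input
detector = IsolationForest(contamination=0.01, random_state=0).fit(features)
anomalous = detector.predict(features) == -1          # True = suspect transaction
```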
[698] Numerical Fragility in Transformers: A Layer-wise Theory for Explaining, Forecasting, and Mitigating Instability
Jinwoo Baek
Main category: cs.LG
TL;DR: The paper develops a first-order theory to predict forward-error amplification in low-precision Transformers, providing interpretable diagnostics for self-attention stability and proposing mitigation strategies.
Details
Motivation: Transformers trained in low precision suffer from forward-error amplification, which limits their practical deployment in resource-constrained environments.
Method: Developed a module-wise theory with three key diagnostics for self-attention: score-scale ratio, rowwise softmax sensitivity, and value conditioning. Also introduced a precision-aware LayerNorm indicator and proved residual relaxation inequality.
Result: The combined predictor tracks FP32-LP mismatches across various configurations, with softmax sensitivity serving as an early-warning signal. A small LayerNorm tweak provides consistent stabilization with minimal overhead.
Conclusion: The theory provides actionable, unitless diagnostics that explain self-attention fragility, forecast instability, and enable minimally invasive mitigation strategies.
Abstract: Transformers trained in low precision can suffer forward-error amplification. We give a first-order, module-wise theory that predicts when and where errors grow. For self-attention we derive a per-layer bound that factorizes into three interpretable diagnostics: a score-scale ratio $\kappa_{\rm score}$, a rowwise softmax sensitivity $\kappa_{\rm softmax}$, and value conditioning $\kappa(V)$. We prove a residual relaxation inequality showing that residual blocks attenuate depth-wise accumulation, and we introduce a precision- and width-aware LayerNorm indicator $\rho_{\rm LN}$ with a matching first-order bound in the $\epsilon$-dominated regime. These pieces yield a unified forward-stability bound whose right-hand side is directly estimable during training. On Tiny-ViT/CIFAR-10 we evaluate the bound and components. (1) The combined predictor $\kappa_{\rm softmax}\,(1+\kappa_{\rm score})\,\kappa(V)\,\|W_O\|_2+\kappa_{\rm eff}+C_{\rm LN}$ tracks FP32$\leftrightarrow$LP mismatches across seeds, widths, and precisions; scaling by $\epsilon_{\rm mach}$ collapses mixed-precision points. (2) The time-series maximum of $\kappa_{\rm softmax}$ acts as an early-warning signal, leading error spikes by 16-24 steps (corr. 0.65-0.82; permutation $p\approx10^{-3}$; Precision@K 0.89-1.00). (3) Guided by $\rho_{\rm LN}$, a small LayerNorm-$\epsilon$ tweak targeting $\rho_\star$ gives consistent stabilization (mean tail-loss $\downarrow\approx0.010$ at $\rho_\star=0.6$, cap $=10^{-2}$) with negligible overhead. Overall, our theory supplies actionable, unitless diagnostics that (i) explain when self-attention is fragile, (ii) forecast instability, and (iii) motivate a minimally invasive mitigation.
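A rough sketch of how the three attention diagnostics could be estimated for one layer. The paper's precise definitions and normalizations are not reproduced here, so treat these as order-of-magnitude proxies under stated assumptions.

```python
import torch

def attention_diagnostics(Q, K, V):
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d**0.5
    # Score-scale ratio: how large the logits are relative to their spread
    kappa_score = (scores.abs().max() / scores.std().clamp_min(1e-12)).item()
    # Rowwise softmax sensitivity: size of the softmax Jacobian diag(p) - p p^T
    p = scores.softmax(-1)
    kappa_softmax = (p * (1 - p)).sum(-1).max().item()
    # Value conditioning: 2-norm condition number of V
    s = torch.linalg.svdvals(V)
    kappa_V = (s.max() / s.min().clamp_min(1e-12)).item()
    return kappa_score, kappa_softmax, kappa_V
```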
[699] Chebyshev Moment Regularization (CMR): Condition-Number Control with Moment Shaping
Jinwoo Baek
Main category: cs.LG
TL;DR: CMR is a regularization method that directly optimizes layer spectra using Chebyshev moments and a log-condition proxy to improve model conditioning and training stability.
Details
Motivation: To address poor conditioning in deep networks that leads to unstable training and degraded performance, by directly steering models toward well-conditioned regimes.
Method: Uses Chebyshev Moment Regularization (CMR) with a decoupled, capped mixing rule that controls spectral edges via log-condition proxy and shapes interior via Chebyshev moments while preserving task gradients.
Result: Reduces mean layer condition numbers by ~10^3 (from ~3900 to ~3.4 in 5 epochs), increases gradient magnitude, and restores test accuracy from ~10% to ~86% in adversarial settings.
Conclusion: CMR enables optimization-driven spectral preconditioning for stable, accurate learning by directly improving model conditioning.
Abstract: We introduce \textbf{Chebyshev Moment Regularization (CMR)}, a simple, architecture-agnostic loss that directly optimizes layer spectra. CMR jointly controls spectral edges via a log-condition proxy and shapes the interior via Chebyshev moments, with a decoupled, capped mixing rule that preserves task gradients. We prove strictly monotone descent for the condition proxy, bounded moment gradients, and orthogonal invariance. In an adversarial ``$\kappa$-stress'' setting (MNIST, 15-layer MLP), \emph{compared to vanilla training}, CMR reduces mean layer condition numbers by $\sim10^3$ (from $\approx3.9\times10^3$ to $\approx3.4$ in 5 epochs), increases average gradient magnitude, and restores test accuracy ($\approx10\%\to\approx86\%$). These results support \textbf{optimization-driven spectral preconditioning}: directly steering models toward well-conditioned regimes for stable, accurate learning.
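A minimal sketch of the log-condition proxy as a differentiable penalty, assuming one plausible reading of the loss; the Chebyshev moment-shaping term and the exact capped mixing rule are omitted.

```python
import torch

def log_condition_proxy(W: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    s = torch.linalg.svdvals(W)                  # differentiable singular values
    return torch.log(s.max() / (s.min() + eps))  # log kappa(W)

def cmr_style_loss(task_loss, weight_matrices, lam=1e-3, cap=10.0):
    penalty = sum(log_condition_proxy(W) for W in weight_matrices)
    # Capped mixing: the spectral term is bounded so task gradients dominate
    return task_loss + lam * torch.clamp(penalty, max=cap)
```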
[700] What Causes Postoperative Aspiration?
Supriya Nagesh, Karina Covarrubias, Robert El-Kareh, Shiva Prasad Kasiviswanathan, Nina Mishra
Main category: cs.LG
TL;DR: Machine learning models can predict postoperative aspiration risk with high accuracy (AUROC 0.86), identifying key risk factors including maximum daily opioid dose, patient age, and operative site (neck/head surgeries).
Details
Motivation: Aspiration significantly impacts surgical patient morbidity and mortality, necessitating predictive tools for timely preventative interventions.
Method: Used MIMIC-IV database with 826 surgical patients who experienced aspiration, trained XGBoost, MLP, and Random Forest models on pre-surgical data, and conducted causal analysis using Augmented Inverse Probability Weighting.
Result: ML model achieved AUROC of 0.86 with 77.3% sensitivity. Key predictors: maximum daily opioid dose, length of stay, patient age. Causal factors: opioids (ATE 0.25) and neck/head surgeries (ATE 0.20/0.19). Men were 1.5x more likely to aspirate despite equal surgery rates.
Conclusion: ML models effectively predict postoperative aspiration risk, enabling targeted prevention. Opioid dosage and operative site significantly influence risk. Gender disparities in opioid administration and aspiration rates need further investigation.
Abstract: Background: Aspiration, the inhalation of foreign material into the lungs, significantly impacts surgical patient morbidity and mortality. This study develops a machine learning (ML) model to predict postoperative aspiration, enabling timely preventative interventions. Methods: From the MIMIC-IV database of over 400,000 hospital admissions, we identified 826 surgical patients (mean age: 62, 55.7% male) who experienced aspiration within seven days post-surgery, along with a matched non-aspiration cohort. Three ML models: XGBoost, Multilayer Perceptron, and Random Forest were trained using pre-surgical hospitalization data to predict postoperative aspiration. To investigate causation, we estimated Average Treatment Effects (ATE) using Augmented Inverse Probability Weighting. Results: Our ML model achieved an AUROC of 0.86 and 77.3% sensitivity on a held-out test set. Maximum daily opioid dose, length of stay, and patient age emerged as the most important predictors. ATE analysis identified significant causative factors: opioids (0.25 +/- 0.06) and operative site (neck: 0.20 +/- 0.13, head: 0.19 +/- 0.13). Despite equal surgery rates across genders, men were 1.5 times more likely to aspirate and received 27% higher maximum daily opioid dosages compared to women. Conclusion: ML models can effectively predict postoperative aspiration risk, enabling targeted preventative measures. Maximum daily opioid dosage and operative site significantly influence aspiration risk. The gender disparity in both opioid administration and aspiration rates warrants further investigation. These findings have important implications for improving postoperative care protocols and aspiration prevention strategies.
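A minimal sketch of the AIPW (doubly robust) ATE estimator used for the causal analysis; the model classes, clipping threshold, and data layout are illustrative choices, not the study's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

def aipw_ate(X, t, y):
    """X: covariates; t: binary treatment (e.g., high opioid dose); y: outcome."""
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    ps = np.clip(ps, 0.01, 0.99)                 # guard against extreme weights
    mu1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1]).predict(X)
    mu0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0]).predict(X)
    # Outcome-model predictions augmented with inverse-probability residuals
    aug1 = mu1 + t * (y - mu1) / ps
    aug0 = mu0 + (1 - t) * (y - mu0) / (1 - ps)
    return float(np.mean(aug1 - aug0))
```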
[701] Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making
Larkin Liu, Jalal Etesami
Main category: cs.LG
TL;DR: The paper proposes two algorithms for online mixture-of-experts (OMoE) that combine expert outputs to optimize aggregate accuracy in bandit learning settings, with applications to fine-tuning LLMs.
Details
Motivation: To develop methods for optimally aggregating outputs from multiple experts in online settings to achieve better overall performance than individual experts or simple aggregation.
Method: Two algorithms: 1) Aggregate voting with UCB-driven successive elimination for pruning suboptimal actions, 2) Online weighted-majority-voting that weights experts by their predictive power.
Result: Theoretical regret guarantees are derived under ideal circumstances, and empirical results show the methods can improve performance when applied to fine-tuning expert LLMs.
Conclusion: The paper introduces new methodologies with no-regret guarantees for combining multiple experts to enhance overall model performance in online settings.
Abstract: We explore the use of expert-guided bandit learning, which we refer to as online mixture-of-experts (OMoE). In this setting, given a context, a candidate committee of experts must determine how to aggregate their outputs to achieve optimal results in terms of aggregate accuracy. We propose two algorithms to address this problem. The first algorithm combines aggregate voting with UCB-driven successive elimination, efficiently pruning suboptimal exploration actions. The second algorithm employs an online weighted-majority-voting mechanism, leveraging the respective voting power of each expert proportional to their predictive power. We derive theoretical guarantees for the regret properties in the bandit setting under ideal circumstances, and empirical results are provided accordingly. As a modern study on applications, these methods are applied to the online fine-tuning of a set of expert large language models (LLMs), where after each response, the generative LLM dynamically reweighs its set of experts and/or selects the optimal committee of experts to generate the most accurate response. Our results introduce new methodologies and no-regret guarantees for combining multiple experts to improve on the performance of the aggregate model overall.
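A minimal sketch of the second algorithm's flavor, using the classic multiplicative-weights update as a stand-in for weighting experts by predictive power; the paper's exact voting and update rules may differ.

```python
import numpy as np

def weighted_majority_vote(expert_preds, weights):
    """expert_preds: (K,) binary predictions in {0, 1}; returns the weighted vote."""
    score = np.dot(weights, 2 * np.asarray(expert_preds) - 1)  # map to {-1, +1}
    return int(score > 0)

def update_weights(weights, expert_preds, outcome, eta=0.1):
    losses = (np.asarray(expert_preds) != outcome).astype(float)
    weights = weights * np.exp(-eta * losses)   # down-weight wrong experts
    return weights / weights.sum()
```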
[702] Variance-Reduction Guidance: Sampling Trajectory Optimization for Diffusion Models
Shifeng Xu, Yanzhu Liu, Adams Wai-Kin Kong
Main category: cs.LG
TL;DR: VRG is a variance-reduction guidance method that reduces prediction error accumulation in diffusion models without requiring model fine-tuning, improving generation quality by finding better sampling trajectories.
Details
Motivation: Diffusion models suffer from prediction errors that accumulate over multiple sampling steps, deteriorating generation quality. The paper aims to address this error accumulation problem.
Method: Proposes Variance-Reduction Guidance (VRG) which statistically measures prediction error and searches for new sampling trajectories with the same number of steps but higher quality results, without model fine-tuning.
Result: Experiments on various datasets and baselines show VRG significantly improves generation quality for both conditional and unconditional generation tasks.
Conclusion: VRG effectively mitigates prediction error accumulation in diffusion models and enhances generation quality without requiring model modifications.
Abstract: Diffusion models have emerged as powerful generative models. Their sampling process involves multiple steps, and in each step the models predict the noise from a noisy sample. When the models make a prediction, the output deviates from the ground truth, and we call such a deviation the \textit{prediction error}. The prediction error accumulates over the sampling process and deteriorates generation quality. This paper introduces a novel technique for statistically measuring the prediction error and proposes the Variance-Reduction Guidance (VRG) method to mitigate this error. VRG does not require model fine-tuning or modification. Given a predefined sampling trajectory, it searches for a new trajectory which has the same number of sampling steps but produces higher quality results. VRG is applicable to both conditional and unconditional generation. Experiments on various datasets and baselines demonstrate that VRG can significantly improve the generation quality of diffusion models. Source code is available at https://github.com/shifengxu/VRG.
[703] A Physics-Guided AI Cascaded Corrector Model Significantly Extends Madden-Julian Oscillation Prediction Skill
Xiao Zhou, Yuze Sun, Jie Wu, Xiaomeng Huang
Main category: cs.LG
TL;DR: PCC-MJO is a deep learning framework that acts as a universal post-processor to correct MJO forecasts from dynamical models, extending skillful forecast range by 2-8 days and mitigating the Maritime Continent barrier.
Details
Motivation: The Madden-Julian Oscillation (MJO) is crucial for global weather extremes but current operational dynamical models have limited skillful forecasts (3-4 weeks), requiring improved prediction methods.
Method: Two-stage deep learning framework: 1) Physics-informed 3D U-Net corrects spatial-temporal field errors, 2) LSTM refines MJO’s RMM index optimized for forecast skill. Applied as universal post-processor to three operational forecast systems.
Result: Consistently extends skillful forecast range (bivariate correlation > 0.5) by 2-8 days across CMA, ECMWF and NCEP models. Effectively mitigates Maritime Continent barrier, enabling more realistic eastward propagation and amplitude.
Conclusion: Provides a physically consistent, computationally efficient, and highly generalizable pathway to break through longstanding barriers in subseasonal forecasting, with explainable AI confirming physically meaningful learning.
Abstract: The Madden-Julian Oscillation (MJO) is an important driver of global weather and climate extremes, but its prediction in operational dynamical models remains challenging, with skillful forecasts typically limited to 3-4 weeks. Here, we introduce a novel deep learning framework, the Physics-guided Cascaded Corrector for MJO (PCC-MJO), which acts as a universal post-processor to correct MJO forecasts from dynamical models. This two-stage model first employs a physics-informed 3D U-Net to correct spatial-temporal field errors, then refines the MJO’s RMM index using an LSTM optimized for forecast skill. When applied to three different operational forecasts from CMA, ECMWF and NCEP, our unified framework consistently extends the skillful forecast range (bivariate correlation > 0.5) by 2-8 days. Crucially, the model effectively mitigates the “Maritime Continent barrier”, enabling more realistic eastward propagation and amplitude. Explainable AI analysis quantitatively confirms that the model’s decision-making is spatially congruent with observed MJO dynamics (correlation 0.93), demonstrating that it learns physically meaningful features rather than statistical fittings. Our work provides a promising physically consistent, computationally efficient, and highly generalizable pathway to break through longstanding barriers in subseasonal forecasting.
[704] Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning
Zhaocheng Liu, Zhiwen Yu, Xiaoqing Liu
Main category: cs.LG
TL;DR: A novel method for quantitative analysis of multimodal imbalance using Modality Gap and bimodal GMM, leading to an adaptive loss function that achieves SOTA results on CREMA-D (80.65%) and AVE (70.90%) datasets.
Details
Motivation: Current approaches focus on architecture and optimization but overlook quantitative analysis of imbalance degree between modalities, creating a gap in understanding multimodal imbalance.
Method: Define Modality Gap as Softmax score differences between modalities, model it with bimodal GMM to identify balanced/imbalanced samples, apply Bayes’ theorem for posterior probabilities, and design adaptive loss function with two-stage training (warm-up + adaptive phases).
Result: Achieves state-of-the-art performance with 80.65% accuracy on CREMA-D and 70.90% on AVE datasets, validating the effectiveness of the proposed methodology.
Conclusion: The quantitative analysis of multimodal imbalance through Modality Gap and adaptive loss function design successfully addresses modality imbalance and improves performance on multimodal tasks.
Abstract: Current mainstream approaches to addressing multimodal imbalance primarily focus on architectural modifications and optimization-based strategies, often overlooking a quantitative analysis of the imbalance degree between modalities. To address this gap, our work introduces a novel method for the quantitative analysis of multi-modal imbalance, which in turn informs the design of a sample-level adaptive loss function. We begin by defining the “Modality Gap” as the difference between the Softmax scores of different modalities (e.g., audio and visual) for the ground-truth class prediction. Analysis of the Modality Gap distribution reveals that it can be effectively modeled by a bimodal Gaussian Mixture Model (GMM). These two components are found to correspond respectively to “modality-balanced” and “modality-imbalanced” data samples. Subsequently, we apply Bayes’ theorem to compute the posterior probability of each sample belonging to these two distinct distributions. Informed by this quantitative analysis, we design a novel adaptive loss function with three objectives: (1) to minimize the overall Modality Gap; (2) to encourage the imbalanced sample distribution to shift towards the balanced one; and (3) to apply greater penalty weights to imbalanced samples. We employ a two-stage training strategy consisting of a warm-up phase followed by an adaptive training phase. Experimental results demonstrate that our approach achieves state-of-the-art (SOTA) performance on the public CREMA-D and AVE datasets, attaining accuracies of 80.65% and 70.90%, respectively. This validates the effectiveness of our proposed methodology.
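A minimal sketch of the Modality Gap analysis, assuming per-sample ground-truth-class Softmax scores from each branch are available; sklearn's GMM supplies the Bayes posteriors directly.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def imbalance_posteriors(p_audio_true, p_visual_true):
    """Inputs: per-sample Softmax scores of the ground-truth class."""
    gap = (np.asarray(p_audio_true) - np.asarray(p_visual_true)).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(gap)
    # Treat the component whose mean is farther from zero as "imbalanced"
    imb = int(np.argmax(np.abs(gmm.means_.ravel())))
    posterior = gmm.predict_proba(gap)[:, imb]   # Bayes posterior per sample
    return gap.ravel(), posterior                # posterior feeds the loss weights
```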
[705] MARS-M: When Variance Reduction Meets Matrices
Yifeng Liu, Angela Yuan, Quanquan Gu
Main category: cs.LG
TL;DR: MARS-M is a new optimizer that integrates the MARS variance-reduction technique with the matrix-based Muon optimizer, improving Muon's convergence rate and yielding lower losses on language modeling and vision tasks.
Details
Motivation: Matrix-based preconditioned optimizers such as Muon and variance-reduction techniques such as MARS each accelerate LLM training; combining them promises the best of both worlds.
Method: Integrates the variance-reduction technique from MARS into the Muon optimizer.
Result: Converges to a first-order stationary point at a rate of O(T^{-1/3}), improving on Muon's O(T^{-1/4}); empirically yields lower losses and improved downstream performance on language modeling and computer vision tasks.
Conclusion: Variance reduction and matrix-based preconditioning are complementary, and their combination in MARS-M delivers faster, better-performing training.
Abstract: Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). On the other hand, recent benchmarks on optimizers for LLM pre-training have demonstrated that variance-reduction techniques such as MARS can achieve substantial speedups over standard optimizers that do not employ variance reduction. In this paper, to achieve the best of both worlds, we introduce MARS-M, a new optimizer that integrates the variance reduction technique in MARS with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of $\tilde{\mathcal{O}}(T^{-1/3})$, which improves upon the $\tilde{\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Our empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/MARS_M.
[706] Solving Continuous Mean Field Games: Deep Reinforcement Learning for Non-Stationary Dynamics
Lorenzo Magnino, Kai Shao, Zida Wu, Jiacheng Shen, Mathieu Laurière
Main category: cs.LG
TL;DR: A novel deep reinforcement learning algorithm for non-stationary continuous mean field games using Fictitious Play, DRL for best-response, supervised learning for policy representation, and Conditional Normalizing Flow for population distribution.
Details
Motivation: Existing RL methods for mean field games are limited to finite spaces or stationary models, hindering real-world applicability. This work addresses scalability and density approximation limitations in complex MFG problems.
Method: Builds on Fictitious Play methodology with DRL for best-response computation, supervised learning for average policy representation, and Conditional Normalizing Flow for time-dependent population distribution representation.
Result: Evaluated on three examples of increasing complexity, showing effectiveness in addressing scalability and density approximation challenges in non-stationary continuous MFGs.
Conclusion: Represents a significant advancement in applying DRL to complex MFG problems, bringing the field closer to real-world multi-agent systems by overcoming previous limitations.
Abstract: Mean field games (MFGs) have emerged as a powerful framework for modeling interactions in large-scale multi-agent systems. Despite recent advancements in reinforcement learning (RL) for MFGs, existing methods are typically limited to finite spaces or stationary models, hindering their applicability to real-world problems. This paper introduces a novel deep reinforcement learning (DRL) algorithm specifically designed for non-stationary continuous MFGs. The proposed approach builds upon a Fictitious Play (FP) methodology, leveraging DRL for best-response computation and supervised learning for average policy representation. Furthermore, it learns a representation of the time-dependent population distribution using a Conditional Normalizing Flow. To validate the effectiveness of our method, we evaluate it on three different examples of increasing complexity. By addressing critical limitations in scalability and density approximation, this work represents a significant advancement in applying DRL techniques to complex MFG problems, bringing the field closer to real-world multi-agent systems.
[707] Residual-guided AI-CFD hybrid method enables stable and scalable simulations: from 2D benchmarks to 3D applications
Shilaj Baral, Youngkyu Lee, Sangam Khanal, Joongoo Jeon
Main category: cs.LG
TL;DR: XRePIT is a novel hybrid simulation method that combines ML acceleration with solver-based correction to achieve stable, accelerated fluid dynamics simulations for over 10,000 timesteps with 4.98× speedup.
Details
Motivation: To overcome limitations of purely data-driven surrogates that suffer from error accumulation and existing hybrid methods that lack automation and robustness for practical use.
Method: A hybrid simulation strategy that synergizes machine learning acceleration with solver-based correction, designed to be fully automated and physics-aware.
Result: Achieved stable accelerated rollouts for over 10,000 timesteps, generalizes to unseen boundary conditions, scales to 3D flows, with 4.98× speedup while maintaining high physical fidelity (thermal field errors ~1E-3, velocity errors <1E-2 m/s).
Conclusion: Establishes a mature and scalable hybrid method that overcomes long-standing barriers, paving the way for practical use in real-world engineering applications.
Abstract: Purely data-driven surrogates for fluid dynamics often fail catastrophically from error accumulation, while existing hybrid methods have lacked the automation and robustness for practical use. To solve this, we developed XRePIT, a novel hybrid simulation strategy that synergizes machine learning (ML) acceleration with solver-based correction. We specifically designed our method to be fully automated and physics-aware, ensuring the stability and practical applicability that previous approaches lacked. We demonstrate that this new design overcomes long-standing barriers, achieving the first stable, accelerated rollouts for over 10,000 timesteps. The method also generalizes robustly to unseen boundary conditions and, crucially, scales to 3D flows. Our approach delivers speedups up to 4.98$\times$ while maintaining high physical fidelity, resolving thermal fields with relative errors of ~1E-3 and capturing low-magnitude velocity dynamics with errors below 1E-2 m/s. This work thus establishes a mature and scalable hybrid method, paving the way for its use in real-world engineering.
[708] Geographic Transferability of Machine Learning Models for Short-Term Airport Fog Forecasting
Marcelo Cerda Castillo
Main category: cs.LG
TL;DR: A coordinate-free machine learning approach for airport fog forecasting achieves geographic transferability across different locations and fog regimes using physics-informed features.
Details
Motivation: Address the challenge of geographic generalization in airport fog forecasting, as traditional ML models rely on location-specific features and fail to transfer across sites.
Method: Used gradient boosting classifier (XGBoost) trained on Santiago, Chile data with coordinate-free feature set encoding fundamental thermodynamic and radiative processes, then tested under zero-shot conditions at three other airports.
Result: Achieved AUC values of 0.923-0.947 across distances up to 11,650 km and different fog regimes (radiative, advective, marine), with consistent SHAP feature rankings showing visibility persistence, solar angle, and thermal gradients as dominant predictors.
Conclusion: Physics-informed, coordinate-free feature engineering can yield geographically transferable atmospheric forecasting tools by learning transferable physical relationships rather than site-specific patterns.
Abstract: Short-term forecasting of airport fog (visibility < 1.0 km) presents challenges in geographic generalization because many machine learning models rely on location-specific features and fail to transfer across sites. This study investigates whether fundamental thermodynamic and radiative processes can be encoded in a coordinate-free (location-independent) feature set to enable geographic transferability. A gradient boosting classifier (XGBoost) trained on Santiago, Chile (SCEL, 33°S) data from 2002-2009 was evaluated on a 2010-2012 holdout set and under strict zero-shot tests at Puerto Montt (SCTE), San Francisco (KSFO), and London (EGLL). The model achieved AUC values of 0.923-0.947 across distances up to 11,650 km and different fog regimes (radiative, advective, marine). Consistent SHAP feature rankings show that visibility persistence, solar angle, and thermal gradients dominate predictions, suggesting the model learned transferable physical relationships rather than site-specific patterns. Results suggest that physics-informed, coordinate-free feature engineering can yield geographically transferable atmospheric forecasting tools.
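A minimal sketch of coordinate-free features of the kind the paper describes: persistence, moisture/thermal gradients, and solar geometry derived from timestamps alone. Column names are hypothetical and the solar term is a crude proxy, not the study's exact feature set.

```python
import numpy as np
import pandas as pd

def coordinate_free_features(obs: pd.DataFrame) -> pd.DataFrame:
    """obs: hourly observations indexed by a DatetimeIndex, with hypothetical
    columns visibility_km, temp_c, dewpoint_c."""
    f = pd.DataFrame(index=obs.index)
    f["vis_lag1h"] = obs["visibility_km"].shift(1)                # persistence
    f["vis_trend3h"] = obs["visibility_km"].diff(3)
    f["dewpoint_depression"] = obs["temp_c"] - obs["dewpoint_c"]  # saturation proximity
    f["ddep_trend3h"] = f["dewpoint_depression"].diff(3)
    # Solar geometry from time of day and year: the same physics at any airport
    decl = 23.44 * np.cos(2 * np.pi * (obs.index.dayofyear + 10) / 365.25)
    f["solar_proxy"] = -np.cos(2 * np.pi * obs.index.hour / 24) * np.cos(np.radians(decl))
    return f
```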
[709] Unlocking Biomedical Insights: Hierarchical Attention Networks for High-Dimensional Data Interpretation
Rekha R Nair, Tina Babu, Alavikunhu Panthakkan, Hussain Al-Ahmad, Balamurugan Balusamy
Main category: cs.LG
TL;DR: HAIN is a novel interpretable deep learning architecture that combines multi-level attention, dimensionality reduction, and explanation-driven loss functions to provide transparent analysis of high-dimensional biomedical data while maintaining high accuracy.
Details
Motivation: The proliferation of high-dimensional datasets in genomics, healthcare, and finance requires machine learning models that are both accurate and interpretable, as traditional deep learning lacks transparency needed for critical decision-sensitive applications.
Method: HAIN uses hierarchical attention mechanisms, dimensionality reduction, and explanation-driven loss functions to provide feature-level interpretability via gradient-weighted attention and global model explanations through prototype-based representations.
Result: On The Cancer Genome Atlas (TCGA) dataset, HAIN achieved 94.3% classification accuracy, outperforming SHAP and LIME in both transparency and explanatory power, while effectively identifying biologically relevant cancer biomarkers.
Conclusion: HAIN successfully harmonizes predictive accuracy with interpretability, advancing transparent AI solutions for precision medicine and regulatory compliance by providing both local and global explanations for complex biomedical data analysis.
Abstract: The proliferation of high-dimensional datasets in fields such as genomics, healthcare, and finance has created an urgent need for machine learning models that are both highly accurate and inherently interpretable. While traditional deep learning approaches deliver strong predictive performance, their lack of transparency often impedes their deployment in critical, decision-sensitive applications. In this work, we introduce the Hierarchical Attention-based Interpretable Network (HAIN), a novel architecture that unifies multi-level attention mechanisms, dimensionality reduction, and explanation-driven loss functions to deliver interpretable and robust analysis of complex biomedical data. HAIN provides feature-level interpretability via gradient-weighted attention and offers global model explanations through prototype-based representations. Comprehensive evaluation on The Cancer Genome Atlas (TCGA) dataset demonstrates that HAIN achieves a classification accuracy of 94.3%, surpassing conventional post-hoc interpretability approaches such as SHAP and LIME in both transparency and explanatory power. Furthermore, HAIN effectively identifies biologically relevant cancer biomarkers, supporting its utility for clinical and research applications. By harmonizing predictive accuracy with interpretability, HAIN advances the development of transparent AI solutions for precision medicine and regulatory compliance.
[710] UCB-type Algorithm for Budget-Constrained Expert Learning
Ilgam Latypov, Alexandra Suvorikova, Alexey Kroshnin, Alexander Gasnikov, Yuriy Dorn
Main category: cs.LG
TL;DR: M-LCB is a UCB-style meta-algorithm that dynamically selects among K adaptive learning experts under a budget constraint of updating at most M experts per round, achieving anytime regret guarantees in stochastic settings.
Details
Motivation: Many modern applications require dynamically switching between multiple adaptive learning algorithms trained online (e.g., model selection in streaming, finance strategies, contextual bandits) while respecting budget constraints on how many experts can be updated per round.
Method: Proposed M-LCB, a computationally efficient UCB-style meta-algorithm that builds confidence intervals directly from realized losses without additional optimization, reflecting the convergence properties of underlying experts.
Result: If each expert achieves internal regret Õ(T^α), M-LCB ensures overall regret bounded by Õ(√(KT/M) + (K/M)^{1-α}T^α). This is the first result establishing regret guarantees for multiple adaptive experts under per-round budget constraints.
Conclusion: M-LCB extends classical bandit paradigm to coordinate stateful, self-learning experts under limited resources, with applications to parametric models and multi-armed bandit algorithms as experts.
Abstract: In many modern applications, a system must dynamically choose between several adaptive learning algorithms that are trained online. Examples include model selection in streaming environments, switching between trading strategies in finance, and orchestrating multiple contextual bandit or reinforcement learning agents. At each round, a learner must select one predictor among $K$ adaptive experts to make a prediction, while being able to update at most $M \le K$ of them under a fixed training budget. We address this problem in the \emph{stochastic setting} and introduce \algname{M-LCB}, a computationally efficient UCB-style meta-algorithm that provides \emph{anytime regret guarantees}. Its confidence intervals are built directly from realized losses, require no additional optimization, and seamlessly reflect the convergence properties of the underlying experts. If each expert achieves internal regret $\tilde O(T^\alpha)$, then \algname{M-LCB} ensures overall regret bounded by $\tilde O\Bigl(\sqrt{\tfrac{KT}{M}} + (K/M)^{1-\alpha}\,T^\alpha\Bigr)$. To our knowledge, this is the first result establishing regret guarantees when multiple adaptive experts are trained simultaneously under per-round budget constraints. We illustrate the framework with two representative cases: (i) parametric models trained online with stochastic losses, and (ii) experts that are themselves multi-armed bandit algorithms. These examples highlight how \algname{M-LCB} extends the classical bandit paradigm to the more realistic scenario of coordinating stateful, self-learning experts under limited resources.
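A minimal sketch of one M-LCB-style round under stated assumptions: confidence bounds built from realized losses select the expert to use, and the per-round budget is spent updating the M most promising experts. Interval widths and expert internals are simplified relative to the paper.

```python
import numpy as np

def mlcb_round(loss_means, counts, t, M):
    """loss_means, counts: per-expert running mean loss and selection counts."""
    bonus = np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(counts, 1))
    lcb = loss_means - bonus                  # optimistic (low) loss estimates
    chosen = int(np.argmin(lcb))              # expert whose prediction is used
    to_update = np.argsort(lcb)[:M]           # budget: train only M of K experts
    return chosen, to_update
```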
[711] Beyond Point Matching: Evaluating Multiscale Dubuc Distance for Time Series Similarity
Azim Ahmadzadeh, Mahsa Khazaei, Elaina Rohlfing
Main category: cs.LG
TL;DR: MDD is a novel time series similarity measure that outperforms DTW in many scenarios by evaluating similarity across multiple temporal scales without point-to-point alignment.
Details
Motivation: Time series are complex high-dimensional data, making efficient search and indexing challenging. Existing methods like DTW have limitations that MDD aims to address.
Method: MDD evaluates time series similarity across multiple temporal scales and avoids point-to-point alignment. The method was tested using simulations and 95 datasets from the UCR archive, with comparative analysis against DTW.
Result: MDD substantially outperforms DTW in many scenarios, addressing specific performance gaps. In a challenging real-world classification task, MDD yielded significant improvement over DTW.
Conclusion: MDD demonstrates practical utility as an effective time series similarity measure that overcomes limitations of DTW through its multiscale approach and non-alignment strategy.
Abstract: Time series are high-dimensional and complex data objects, making their efficient search and indexing a longstanding challenge in data mining. Building on a recently introduced similarity measure, namely Multiscale Dubuc Distance (MDD), this paper investigates its comparative strengths and limitations relative to the widely used Dynamic Time Warping (DTW). MDD is novel in two key ways: it evaluates time series similarity across multiple temporal scales and avoids point-to-point alignment. We demonstrate that in many scenarios where MDD outperforms DTW, the gains are substantial, and we provide a detailed analysis of the specific performance gaps it addresses. We provide simulations, in addition to the 95 datasets from the UCR archive, to test our hypotheses. Finally, we apply both methods to a challenging real-world classification task and show that MDD yields a significant improvement over DTW, underscoring its practical utility.
[712] GAPO: Group Adaptive Policy Optimization for Real-World Code Edit
Jianqing Zhang, Zhezheng Hao, Wei Xia, Hande Dong, Hong Wang, Chenxing Wei, Yuyan Zhou, Yubin Qi, Qiang Lin, Jian Cao
Main category: cs.LG
TL;DR: GAPO is a reinforcement learning method that improves code editing by adaptively finding outlier-free intervals to handle skewed reward distributions, outperforming existing methods like GRPO and DAPO.
Details
Motivation: Real-world code editing scenarios often have skewed reward distributions with unpredictable outliers, which distort advantage computation and increase noise in existing group-relative RL methods like GRPO.
Method: GAPO adaptively finds an outlier-free highest-density interval (HDI) per prompt and uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation, making it robust to skewed distributions.
Result: Validated on nine instruction-tuned LLMs (3B-14B) using 51,844 real-world code-editing tasks across 10 languages, GAPO demonstrated consistent improvements in exact match accuracy over GRPO and DAPO.
Conclusion: GAPO effectively handles skewed reward distributions in code editing tasks while remaining plug-and-play and efficient, providing a robust alternative to existing group-relative RL methods.
Abstract: Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods like GRPO are popular for their critic-free, normalized advantage estimation. However, in real-world code-editing scenarios, reward distributions are often skewed with unpredictable outliers, leading to distorted advantage computation and increased noise. To address this issue, we propose Group Adaptive Policy Optimization (GAPO), which adaptively finds an outlier-free highest-density interval (HDI) per prompt and then uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation. This adaptive Q robustly handles skewed distributions while remaining plug-and-play and efficient. We validate GAPO on nine instruction-tuned LLMs (3B-14B) using a large internal dataset of 51,844 real-world, history-aware code-editing tasks across 10 languages, demonstrating consistent improvements in exact match accuracy over GRPO and its variant DAPO. Code is publicly available.
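A minimal sketch of the HDI-median baseline: the narrowest window covering a target fraction of the group's rewards approximates the outlier-free highest-density interval, and its median replaces the group mean in the advantage. The mass fraction and normalization here are illustrative choices.

```python
import numpy as np

def hdi_median(rewards, mass=0.8):
    r = np.sort(np.asarray(rewards, dtype=float))
    n = len(r)
    k = max(int(np.ceil(mass * n)), 2)        # points the interval must cover
    widths = r[k - 1:] - r[: n - k + 1]       # width of every k-point window
    i = int(np.argmin(widths))                # narrowest window = highest density
    return float(np.median(r[i : i + k]))

def gapo_advantages(rewards):
    q = hdi_median(rewards)                   # robust adaptive baseline Q
    adv = np.asarray(rewards, dtype=float) - q
    return adv / (adv.std() + 1e-8)
```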
[713] ATLAS: Actor-Critic Task-Completion with Look-ahead Action Simulation
Jiali Cheng, Anjishnu Kumar, Roshan Lal, Rishi Rajasekaran, Hani Ramezani, Omar Zia Khan, Oleg Rokhlenko, Sunny Chiu-Webster, Gang Hua, Hadi Amiri
Main category: cs.LG
TL;DR: ATLAS is a memory-augmented web agent that builds cognitive maps through exploration and uses look-ahead action simulation to create efficient execution plans without requiring neural network fine-tuning.
Details
Motivation: Current web agents cannot adapt to new environments without fine-tuning, leading to inefficient execution plans due to lack of environmental awareness.
Method: ATLAS uses a modular architecture with: 1) cognitive map building through curiosity-driven exploration, 2) planner proposing candidate actions, 3) simulator predicting consequences in cognitive space, 4) critic selecting best roll-out, and 5) browser executor performing actions.
Result: Achieved 63% success rate on WebArena-Lite Benchmark, outperforming previous state-of-the-art (53.9%) without requiring website-specific LLM fine-tuning.
Conclusion: The world-model, hierarchical planner, and look-ahead-based replanner are essential components that work complementarily to enable effective web task completion in new environments.
Abstract: We observe that current state-of-the-art web-agents are unable to effectively adapt to new environments without neural network fine-tuning, without which they produce inefficient execution plans due to a lack of awareness of the structure and dynamics of the new environment. To address this limitation, we introduce ATLAS (Actor-Critic Task-completion with Look-ahead Action Simulation), a memory-augmented agent that is able to make plans grounded in a model of the environment by simulating the consequences of those actions in cognitive space. Our agent starts by building a “cognitive map” through a lightweight, curiosity-driven exploration of the environment. The planner proposes candidate actions; the simulator predicts their consequences in cognitive space; a critic analyzes the options to select the best roll-out and update the original plan; and a browser executor performs the chosen action. On the WebArena-Lite Benchmark, we achieve a 63% success rate compared to a 53.9% success rate for the previously published state-of-the-art. Unlike previous systems, our modular architecture requires no website-specific LLM fine-tuning. Ablations show sizable drops without the world-model, hierarchical planner, and look-ahead-based replanner, confirming their complementary roles within the design of our system.
[714] Restoring Pruned Large Language Models via Lost Component Compensation
Zijian Feng, Hanzhang Zhou, Zixiao Zhu, Tianjiao Li, Jia Jim Deryl Chua, Lee Onn Mak, Gee Wah Ng, Kezhi Mao
Main category: cs.LG
TL;DR: RestoreLCC is a plug-and-play method that recovers pruned LLM performance by contrastively probing critical attention heads, extracting lost components from activation differences, and injecting them back into pruned heads for compensation.
Details
Motivation: Existing PEFT methods for pruned model restoration are suboptimal because they overlook the distinct properties of pruned models and don't address pruning-induced information loss effectively.
Method: RestoreLCC contrastively probes critical attention heads via activation editing, extracts lost components from activation differences, and injects them back into corresponding pruned heads for compensation.
Result: Extensive experiments show RestoreLCC consistently outperforms state-of-the-art baselines in both general and task-specific performance recovery while preserving sparsity and inference efficiency.
Conclusion: RestoreLCC effectively restores pruned LLM performance by targeted compensation of lost components, maintaining the benefits of pruning while recovering model capabilities.
Abstract: Pruning is a widely used technique to reduce the size and inference cost of large language models (LLMs), but it often causes performance degradation. To mitigate this, existing restoration methods typically employ parameter-efficient fine-tuning (PEFT), such as LoRA, to recover the pruned model’s performance. However, most PEFT methods are designed for dense models and overlook the distinct properties of pruned models, often resulting in suboptimal recovery. In this work, we propose a targeted restoration strategy for pruned models that restores performance while preserving their low cost and high efficiency. We observe that pruning-induced information loss is reflected in attention activations, and selectively reintroducing components of this information can significantly recover model performance. Based on this insight, we introduce RestoreLCC (Restoring Pruned LLMs via Lost Component Compensation), a plug-and-play method that contrastively probes critical attention heads via activation editing, extracts lost components from activation differences, and finally injects them back into the corresponding pruned heads for compensation and recovery. RestoreLCC is compatible with structured, semi-structured, and unstructured pruning schemes. Extensive experiments demonstrate that RestoreLCC consistently outperforms state-of-the-art baselines in both general and task-specific performance recovery, without compromising the sparsity or inference efficiency of pruned models.
[715] MAGIC-Flow: Multiscale Adaptive Conditional Flows for Generation and Interpretable Classification
Luca Caldera, Giacomo Bottacini, Lara Cavinato
Main category: cs.LG
TL;DR: MAGIC-Flow is a conditional multiscale normalizing flow architecture that performs both generation and classification within a single modular framework, addressing limitations of pure generative modeling in medical imaging.
Details
Motivation: Pure generative modeling without task alignment fails to provide robust foundations for clinical use in medical imaging, necessitating integrated approaches that combine generation with discriminative capabilities.
Method: Hierarchical invertible and differentiable bijections with factorized Jacobian determinants, enabling exact likelihood computation, stable optimization, and conditioning on class labels for controllable synthesis and probability estimation.
Result: MAGIC-Flow outperforms baselines in similarity, fidelity, and diversity metrics, creates realistic diverse samples, improves classification performance, and handles scanner noise and modality-specific tasks across multiple datasets.
Conclusion: MAGIC-Flow is an effective strategy for generation and classification in data-limited domains, offering benefits for privacy-preserving augmentation, robust generalization, and trustworthy medical AI.
Abstract: Generative modeling has emerged as a powerful paradigm for representation learning, but its direct applicability to challenging fields like medical imaging remains limited: mere generation, without task alignment, fails to provide a robust foundation for clinical use. We propose MAGIC-Flow, a conditional multiscale normalizing flow architecture that performs generation and classification within a single modular framework. The model is built as a hierarchy of invertible and differentiable bijections, where the Jacobian determinant factorizes across sub-transformations. We show how this ensures exact likelihood computation and stable optimization, while invertibility enables explicit visualization of sample likelihoods, providing an interpretable lens into the model’s reasoning. By conditioning on class labels, MAGIC-Flow supports controllable sample synthesis and principled class-probability estimation, effectively aiding both generative and discriminative objectives. We evaluate MAGIC-Flow against top baselines using metrics for similarity, fidelity, and diversity. Across multiple datasets, it addresses generation and classification under scanner noise, and modality-specific synthesis and identification. Results show MAGIC-Flow creates realistic, diverse samples and improves classification. MAGIC-Flow is an effective strategy for generation and classification in data-limited domains, with direct benefits for privacy-preserving augmentation, robust generalization, and trustworthy medical AI.
[716] A Multimodal, Multitask System for Generating E Commerce Text Listings from Images
Nayan Kumar Singh
Main category: cs.LG
TL;DR: Proposes a multi-task system for generating factually grounded product descriptions from images, using joint vision encoder training and hierarchical generation to reduce hallucinations and improve efficiency.
Details
Motivation: Manual generation of product descriptions is labor-intensive, and current Vision-to-Language Models suffer from factual hallucinations and inefficiency from siloed single-task models.
Method: Two key proposals: 1) Multi-task learning approach fine-tuning a vision encoder jointly trained on attribute prediction and price regression, 2) Hierarchical generation process where predicted attributes are embedded in prompts for the text decoder.
Result: Multi-tasking outperforms independent models (3.6% better R2 for price, 6.6% better F1 for attributes). Hierarchical generation reduces hallucinations from 12.7% to 7.1% (44.5% reduction) and cuts latency by 3.5x, though ROUGE-L score is 3.5% worse than direct VLMs.
Conclusion: The proposed architecture effectively addresses factual hallucinations in product description generation while improving efficiency through multi-task learning and hierarchical generation processes.
Abstract: Manually generating catchy descriptions and names is a labor-intensive, slow process for retailers. Although generative AI provides an automation solution in the form of Vision-to-Language Models (VLMs), current VLMs are prone to factual “hallucinations”. Siloed, single-task models are not only inefficient but also fail to capture interdependent relationships between features. To address these challenges, we propose an end-to-end, multi-task system that generates factually grounded textual listings from a single image. The contributions of this study are two proposals for the model architecture. First, a multi-task learning approach for fine-tuning a vision encoder, where a single vision backbone is jointly trained on attribute prediction (such as color, hemline, and neck style) and price regression. Second, a hierarchical generation process in which the model’s own predicted attributes are embedded in a prompt and fed to the text decoder to improve factual consistency. The experiments demonstrate the superiority of this architecture. The multi-task approach outperforms independent models on both price regression, with a 3.6% better R2 value, and attribute classification, with a 6.6% improvement in F1 score. Critically, the hierarchical generation process proves highly effective, slashing the factual hallucination rate from 12.7% to 7.1%, a 44.5% relative reduction, compared to a non-hierarchical ablation. The hierarchical approach also reduces the latency of the autoregressive text generation process by a factor of 3.5 compared to a direct vision-to-language model of similar size. One minor caveat: the model performs 3.5% worse than a direct vision-to-language model on ROUGE-L score.
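The two architectural proposals are easy to sketch. Below is a hedged, illustrative PyTorch outline (names like `MultiTaskHead` and `hierarchical_prompt` are hypothetical, not the authors' code): a shared vision backbone feeds attribute-classification and price-regression heads trained jointly, and the resulting predictions are serialized into the decoder prompt to ground the generated listing.

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Shared vision backbone with jointly trained task heads (sketch).

    A joint loss lets price supervision shape the same features used
    for attribute prediction, and vice versa.
    """
    def __init__(self, backbone, feat_dim, n_colors, n_hemlines):
        super().__init__()
        self.backbone = backbone                  # any image encoder
        self.color_head = nn.Linear(feat_dim, n_colors)
        self.hemline_head = nn.Linear(feat_dim, n_hemlines)
        self.price_head = nn.Linear(feat_dim, 1)  # regression

    def forward(self, images):
        f = self.backbone(images)
        return self.color_head(f), self.hemline_head(f), self.price_head(f)

def hierarchical_prompt(color, hemline, price):
    """Embed the model's own predictions in the decoder prompt so the
    generated listing stays factually grounded."""
    return (f"Write a product listing. Attributes: color={color}, "
            f"hemline={hemline}, approximate price=${price:.0f}.")

# Joint training objective (loss weights are a design choice):
# loss = ce(color_logits, y_color) + ce(hem_logits, y_hem) + 0.5 * mse(price, y_price)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
```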
[717] COLA: Continual Learning via Autoencoder Retrieval of Adapters
Jaya Krishna Mandivarapu
Main category: cs.LG
TL;DR: COLA is a novel framework that uses autoencoders to learn low-dimensional weight embeddings for continual learning in LLMs, preventing catastrophic forgetting without data replay or task-specific parameters.
Details
Motivation: Large language models face catastrophic forgetting when updated for new tasks, making continual learning impractical due to high computational costs and knowledge overwriting.
Method: COLA employs an autoencoder to capture low-dimensional embeddings of task weights, enabling knowledge transfer to new tasks while preventing forgetting, without using data replay or substantial task-specific parameters.
Result: Empirical evaluation shows COLA overcomes catastrophic forgetting, achieves significant reduction in parameter usage and memory size across multiple tasks, and outperforms state-of-the-art methods on various datasets.
Conclusion: COLA enables efficient continual learning in LLMs with minimal training, insignificant performance degradation on previous tasks, and eliminates the need for retaining earlier training data.
Abstract: Learning a set of tasks over time, also known as continual learning (CL), is one of the most challenging problems in artificial intelligence due to catastrophic forgetting. Large language models (LLMs) are often impractical to re-train frequently because of the high computational cost, and they are ill-suited to continual learning: updating them over time to acquire new knowledge overwrites existing knowledge, a phenomenon known as \textit{catastrophic forgetting}. In this paper, we address these concerns with a novel framework, COLA, that employs an autoencoder to capture low-dimensional embeddings of the weights associated with various tasks. Our approach facilitates the transfer of knowledge to new tasks while preventing catastrophic forgetting, all without using data replay or a substantial set of task-specific parameters. COLA enables the LLM to learn new tasks efficiently with minimal training and insignificant performance degradation on previous tasks, and it eliminates the need to retain earlier training data. Empirical evaluation on datasets ranging from task-oriented dialogue systems to intent classification shows that our method not only overcomes catastrophic forgetting but also achieves a significant reduction in parameter usage and memory size across multiple tasks, outperforming the existing state-of-the-art methods across multiple datasets.
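A minimal sketch of the central mechanism, assuming tasks are served by per-task adapters whose flattened weights are compressed by an autoencoder; all names and sizes here are illustrative, not COLA's actual architecture.

```python
import torch
import torch.nn as nn

class AdapterAutoencoder(nn.Module):
    """Autoencoder over flattened task-adapter weights (sketch).

    Each task's adapter is compressed to a low-dimensional code;
    storing codes instead of full adapters cuts memory, and the
    decoder reconstructs the weights on demand at inference time.
    """
    def __init__(self, n_weights, code_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_weights, 512), nn.ReLU(),
                                 nn.Linear(512, code_dim))
        self.dec = nn.Sequential(nn.Linear(code_dim, 512), nn.ReLU(),
                                 nn.Linear(512, n_weights))

    def forward(self, w_flat):
        code = self.enc(w_flat)
        return self.dec(code), code

# Store one small code per task instead of a full adapter.
ae = AdapterAutoencoder(n_weights=10_000)
task_adapters = torch.randn(5, 10_000)      # five tasks' flattened weights
recon, codes = ae(task_adapters)
loss = nn.functional.mse_loss(recon, task_adapters)
```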
[718] KARIPAP: Quantum-Inspired Tensor Network Compression of Large Language Models Using Infinite Projected Entangled Pair States and Tensor Renormalization Group
Azree Nazri
Main category: cs.LG
TL;DR: KARIPAP is a quantum-inspired tensor network compression method using iPEPS and TRG that achieves up to 93% memory reduction and 70% parameter reduction for LLMs like LLaMA-2 7B with minimal accuracy loss.
Details
Motivation: LLMs have huge parameter scales creating severe computational and environmental burdens, with high training costs, energy use, and limited device deployment hindering accessibility. Existing compression methods ignore complex inter-layer correlations.
Method: Uses quantum-inspired tensor network compression with Infinite Projected Entangled Pair States (iPEPS) and Tensor Renormalization Group (TRG) contraction to capture multi-directional entanglement in attention and deep transformer layers, unlike 1D Matrix Product States.
Result: Achieved up to 93% memory reduction, 70% parameter reduction, 50% faster training, 25% faster inference, with only 2-3% accuracy loss on LLaMA-2 7B. Layer-wise entanglement profiling revealed redundancy in deeper layers.
Conclusion: Modern LLMs occupy low-dimensional entanglement manifolds, enabling scalable, energy-efficient, and quantum-aware AI architectures through tensor network compression.
Abstract: Large Language Models (LLMs) like ChatGPT and LLaMA drive rapid progress in generative AI, yet their huge parameter scales create severe computational and environmental burdens. High training costs, energy use, and limited device deployment hinder accessibility. Existing compression methods (pruning, distillation, low-rank factorization, and quantization) reduce size but ignore complex inter-layer correlations. We propose KARIPAP, a quantum-inspired tensor network compression using Infinite Projected Entangled Pair States (iPEPS) and Tensor Renormalization Group (TRG) contraction. Unlike 1D Matrix Product States, iPEPS captures multi-directional entanglement in attention and deep transformer layers. TRG ensures polynomial-time contraction, making tensorization feasible while preserving key correlation geometry. Experiments on LLaMA-2 7B show up to 93% memory and 70% parameter reduction, with 50% faster training, 25% faster inference, and only 2-3% accuracy loss. Layer-wise entanglement profiling reveals redundancy in deeper layers, confirming their suitability for tensor factorization. KARIPAP demonstrates that modern LLMs occupy low-dimensional entanglement manifolds, enabling scalable, energy-efficient, and quantum-aware AI architectures.
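iPEPS and TRG are beyond a short snippet, but the family's simplest member, truncated SVD of a weight matrix, shows why tensor factorization shrinks parameter counts. The sketch below illustrates the general idea only; it is not the KARIPAP pipeline.

```python
import torch

def low_rank_factorize(weight, rank):
    """Replace one dense layer W (out x in) with two thin factors.

    Truncated SVD is the simplest member of the family KARIPAP
    generalizes; iPEPS/TRG extend the idea to higher-order tensors
    with multi-directional correlations.
    """
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # (out x rank), scaled columns
    B = Vh[:rank, :]                  # (rank x in)
    return A, B

W = torch.randn(4096, 4096)
A, B = low_rank_factorize(W, rank=256)
# Parameter count: 4096*4096 = 16.8M  ->  2*4096*256 = 2.1M (~87% reduction)
approx_error = torch.linalg.norm(W - A @ B) / torch.linalg.norm(W)
```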
[719] FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
Luca Della Libera, Francesco Paissan, Cem Subakan, Mirco Ravanelli
Main category: cs.LG
TL;DR: FocalCodec is an efficient low-bitrate speech codec using focal modulation with a single binary codebook, achieving competitive performance at 0.16-0.65 kbps while preserving both semantic and acoustic information.
Details
Motivation: To address limitations in existing speech codecs including high bitrates, loss of semantic/acoustic information, and complex multi-codebook designs that complicate downstream tasks.
Method: Uses focal modulation with a single binary codebook to compress speech, eliminating the need for complex multi-codebook architectures while maintaining information preservation.
Result: Achieves competitive performance in speech resynthesis and voice conversion at lower bitrates than state-of-the-art, handles multilingual speech and noisy environments effectively.
Conclusion: FocalCodec successfully preserves semantic and acoustic information at low bitrates while being suitable for generative modeling, offering an efficient alternative to existing codecs.
Abstract: Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples and code are available at https://lucadellalib.github.io/focalcodec-web/.
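The single-binary-codebook idea reduces, at its core, to one bit per latent dimension with a straight-through estimator for gradients. The sketch below is a generic illustration with hypothetical settings (13 bits at 50 frames per second, which happens to land at the paper's 0.65 kbps upper operating point); it is not FocalCodec's actual quantizer.

```python
import torch

def binary_quantize(z):
    """Quantize encoder outputs to a single binary codebook (sketch).

    Each latent dimension becomes one bit; the straight-through
    estimator passes gradients through the non-differentiable step.
    """
    z_bin = torch.where(z >= 0, 1.0, -1.0)   # codes in {-1, +1}
    return z + (z_bin - z).detach()          # identity gradient (STE)

# Bitrate = bits per frame * frames per second (illustrative numbers).
latent_dim, frames_per_sec = 13, 50
kbps = latent_dim * frames_per_sec / 1000.0  # 0.65 kbps
z = torch.randn(1, 100, latent_dim, requires_grad=True)
codes = binary_quantize(z)
```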
[720] Training data membership inference via Gaussian process meta-modeling: a post-hoc analysis approach
Yongchao Huang, Pengfei Zhang, Shahzad Mumtaz
Main category: cs.LG
TL;DR: GP-MIA is an efficient and interpretable membership inference attack method using Gaussian process meta-modeling that works with post-hoc metrics from a single trained model, achieving high accuracy without requiring shadow models or heavy query access.
Details
Motivation: Existing membership inference attack methods often depend on shadow models or heavy query access, which limits their practicality and poses implementation challenges.
Method: Uses Gaussian process meta-modeling with post-hoc metrics (accuracy, entropy, dataset statistics, and optional sensitivity features like gradients and NTK measures) from a single trained model to train a GP classifier for distinguishing members from non-members.
Result: Experiments on synthetic data, real-world fraud detection data, CIFAR-10, and WikiText-2 show that GP-MIA achieves high accuracy and generalizability.
Conclusion: GP-MIA offers a practical alternative to existing membership inference attacks by providing efficient, interpretable classification with calibrated uncertainty estimates.
Abstract: Membership inference attacks (MIAs) test whether a data point was part of a model’s training set, posing serious privacy risks. Existing methods often depend on shadow models or heavy query access, which limits their practicality. We propose GP-MIA, an efficient and interpretable approach based on Gaussian process (GP) meta-modeling. Using post-hoc metrics such as accuracy, entropy, dataset statistics, and optional sensitivity features (e.g. gradients, NTK measures) from a single trained model, GP-MIA trains a GP classifier to distinguish members from non-members while providing calibrated uncertainty estimates. Experiments on synthetic data, real-world fraud detection data, CIFAR-10, and WikiText-2 show that GP-MIA achieves high accuracy and generalizability, offering a practical alternative to existing MIAs.
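Because GP-MIA only needs post-hoc metrics from one trained model, the attack fits in a few lines with an off-the-shelf GP classifier. The sketch below uses scikit-learn with synthetic per-example features (loss and predictive entropy); the feature choices and distributions are hypothetical stand-ins, not the paper's data.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Hypothetical post-hoc features per example: [loss, predictive entropy].
rng = np.random.default_rng(0)
members = np.column_stack([rng.gamma(1.0, 0.3, 200),    # members: lower loss
                           rng.beta(2, 5, 200)])        # and lower entropy
non_members = np.column_stack([rng.gamma(2.0, 0.5, 200),
                               rng.beta(5, 2, 200)])
X = np.vstack([members, non_members])
y = np.array([1] * 200 + [0] * 200)

gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
gp.fit(X, y)
# predict_proba yields calibrated membership probabilities, which is the
# interpretable uncertainty the GP meta-model is chosen for.
p_member = gp.predict_proba(X[:5])[:, 1]
```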
[721] SynCast: Synergizing Contradictions in Precipitation Nowcasting via Diffusion Sequential Preference Optimization
Kaiyi Xu, Junchao Gong, Wenlong Zhang, Ben Fei, Lei Bai, Wanli Ouyang
Main category: cs.LG
TL;DR: SynCast introduces preference optimization to precipitation nowcasting using a two-stage Diffusion-SPO framework to address conflicting metrics like CSI and FAR, achieving superior performance by progressively aligning these metrics.
Details
Motivation: Current deep learning approaches for precipitation nowcasting have limitations: deterministic models produce over-smoothed predictions that miss extreme events, while probabilistic models show fluctuating performance across metrics. There's also inherent conflict between evaluation metrics like CSI and FAR that existing models struggle to optimize simultaneously.
Method: SynCast employs a two-stage post-training framework called Diffusion Sequential Preference Optimization (Diffusion-SPO). First stage reduces False Alarm Ratio (FAR) by training the model to suppress false alarms. Second stage optimizes Critical Success Index (CSI) while preserving FAR alignment through constraints.
Result: The method achieves synergistic improvements across conflicting metrics, consistently delivering superior performance by progressively aligning CSI and FAR metrics.
Conclusion: Preference optimization, inspired by reinforcement learning from human feedback in LLMs, successfully addresses the challenge of conflicting evaluation metrics in precipitation nowcasting, enabling models to perform well on multiple metrics simultaneously.
Abstract: Precipitation nowcasting based on radar echoes plays a crucial role in monitoring extreme weather and supporting disaster prevention. Although deep learning approaches have achieved significant progress, they still face notable limitations. For example, deterministic models tend to produce over-smoothed predictions, which struggle to capture extreme events and fine-scale precipitation patterns. Probabilistic generative models, due to their inherent randomness, often show fluctuating performance across different metrics and rarely achieve consistently optimal results. Furthermore, precipitation nowcasting is typically evaluated using multiple metrics, some of which are inherently conflicting. For instance, there is often a trade-off between the Critical Success Index (CSI) and the False Alarm Ratio (FAR), making it challenging for existing models to deliver forecasts that perform well on both metrics simultaneously. To address these challenges, we introduce preference optimization into precipitation nowcasting for the first time, motivated by the success of reinforcement learning from human feedback in large language models. Specifically, we propose SynCast, a method that employs the two-stage post-training framework of Diffusion Sequential Preference Optimization (Diffusion-SPO), to progressively align conflicting metrics and consistently achieve superior performance. In the first stage, the framework focuses on reducing FAR, training the model to effectively suppress false alarms. Building on this foundation, the second stage further optimizes CSI with constraints that preserve FAR alignment, thereby achieving synergistic improvements across these conflicting metrics.
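The conflict SynCast targets is visible directly in the metric definitions. The snippet below computes CSI and FAR from thresholded radar fields (these are the standard definitions, not SynCast code): anything that adds predicted rain pixels can raise hits (helping CSI) while also raising false alarms (hurting FAR).

```python
import numpy as np

def csi_far(pred, truth, threshold=1.0):
    """Critical Success Index and False Alarm Ratio for radar fields.

    hits         = rain predicted where rain was observed
    false alarms = rain predicted where none was observed
    misses       = observed rain the model failed to predict
    """
    p, t = pred >= threshold, truth >= threshold
    hits = np.sum(p & t)
    false_alarms = np.sum(p & ~t)
    misses = np.sum(~p & t)
    csi = hits / max(hits + false_alarms + misses, 1)
    far = false_alarms / max(hits + false_alarms, 1)
    return csi, far

# The tension the two-stage training resolves: predicting rain more
# aggressively raises hits (better CSI) but also false alarms (worse FAR).
pred = np.random.rand(64, 64) * 5
truth = np.random.rand(64, 64) * 5
print(csi_far(pred, truth))
```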
[722] Adapting Interleaved Encoders with PPO for Language-Guided Reinforcement Learning in BabyAI
Aryan Mathur, Asaduddin Ahmed
Main category: cs.LG
TL;DR: The paper implements the PDiT architecture, which interleaves perception and decision layers in a single transformer, achieving better performance than a standard PPO baseline on vision-language tasks in the BabyAI environment.
Details
Motivation: Deep RL agents struggle with vision-language tasks due to isolated perception and decision-making modules, where policy failures don't directly help perception learn important features.
Method: Implemented Perception-Decision Interleaving Transformer (PDiT) that alternates perception and decision layers within a single transformer, plus CLIP-inspired contrastive loss to align text mission embeddings with visual features.
Result: PDiT achieved more stable rewards and stronger alignment compared to standard PPO baseline on BabyAI GoToLocal environment.
Conclusion: Interleaved transformer encoders are a promising direction for developing more integrated autonomous agents that can better handle vision-language tasks.
Abstract: Deep reinforcement learning agents often struggle when tasks require understanding both vision and language. Conventional architectures typically isolate perception (for example, CNN-based visual encoders) from decision-making (policy networks). This separation can be inefficient, since the policy’s failures do not directly help the perception module learn what is important. To address this, we implement the Perception-Decision Interleaving Transformer (PDiT) architecture introduced by Mao et al. (2023), a model that alternates between perception and decision layers within a single transformer. This interleaving allows feedback from decision-making to refine perceptual features dynamically. In addition, we integrate a contrastive loss inspired by CLIP to align textual mission embeddings with visual scene features. We evaluate the PDiT encoders on the BabyAI GoToLocal environment and find that the approach achieves more stable rewards and stronger alignment compared to a standard PPO baseline. The results suggest that interleaved transformer encoders are a promising direction for developing more integrated autonomous agents.
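The CLIP-inspired alignment term is, under a reasonable reading, a standard symmetric InfoNCE loss between mission-text and scene embeddings; the sketch below is a generic implementation under that assumption, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(text_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning mission text with scene features.

    Matched (text, scene) pairs sit on the diagonal of the similarity
    matrix; each row/column is a softmax classification over the batch.
    """
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = t @ v.T / temperature
    labels = torch.arange(len(t))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

text = torch.randn(32, 256)    # e.g. "go to the red ball" embeddings
scene = torch.randn(32, 256)   # visual features from the same episodes
loss = clip_style_loss(text, scene)
```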
[723] TowerVision: Understanding and Improving Multilinguality in Vision-Language Models
André G. Viveiros, Patrick Fernandes, Saul Santos, Sonal Sannigrahi, Emmanouil Zaranis, Nuno M. Guerreiro, Amin Farajian, Pierre Colombo, Graham Neubig, André F. T. Martins
Main category: cs.LG
TL;DR: TowerVision is a family of open multilingual vision-language models that achieves competitive performance on multimodal multilingual benchmarks, particularly excelling in culturally grounded tasks and multimodal translation.
Details
Motivation: Most existing vision-language models follow an English-centric design process, limiting their effectiveness in multilingual settings. The authors aim to address this limitation by creating multilingual VLMs.
Method: Built upon the multilingual text-only model Tower+, TowerVision analyzes multilingual design choices including training data composition, encoder selection, and text backbones. The models incorporate visual and cultural context during fine-tuning and use a curated vision-language dataset called VisionBlocks.
Result: TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks (ALM-Bench, Multi30K, ViMUL-Bench), surpassing existing approaches trained on substantially larger datasets. The models show particular strength in culturally grounded tasks and multimodal translation.
Conclusion: Multilingual vision-language training data substantially improves cross-lingual generalization in both directions (high-resource to underrepresented languages and vice versa). Instruction-tuned LLMs are not always the optimal initialization point for multilingual VLMs. The authors release all models, data, and training recipes publicly.
Abstract: Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks). Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset. Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization – both from high-resource to underrepresented languages and vice versa – and that instruction-tuned LLMs are not always the optimal initialization point. To support further research, we publicly release all models, data, and training recipes.
[724] Towards Interpretable Deep Learning and Analysis of Dynamical Systems via the Discrete Empirical Interpolation Method
Hojin Kim, Romit Maulik
Main category: cs.LG
TL;DR: A differentiable framework using DEIM for interpretable deep learning and dynamical system analysis, enabling dynamic interpolation point selection and diagnostic analysis of neural ODE models.
Details
Motivation: To overcome the limitations of fixed interpolation points in traditional DEIM for complex time-varying dynamics, and to use DEIM as an interpretable analysis tool for neural differential equations.
Method: Developed differentiable adaptive DEIM for 1D viscous Burgers equation using neural networks to dynamically select interpolation points, then applied DEIM to analyze pre-trained Neural ODE on 2D vortex-merging problem.
Result: DEIM trajectories revealed physically meaningful features in NODE’s learned dynamics and exposed limitations in extrapolation to unseen flow configurations.
Conclusion: DEIM serves both as a model reduction tool and as a diagnostic framework for understanding and improving neural differential equation generalization.
Abstract: We present a differentiable framework that leverages the Discrete Empirical Interpolation Method (DEIM) for interpretable deep learning and dynamical system analysis. Although DEIM efficiently approximates nonlinear terms in projection-based reduced-order models (POD-ROM), its fixed interpolation points limit the adaptability to complex and time-varying dynamics. To address this limitation, we first develop a differentiable adaptive DEIM formulation for the one-dimensional viscous Burgers equation, which allows neural networks to dynamically select interpolation points in a computationally efficient and physically consistent manner. We then apply DEIM as an interpretable analysis tool for examining the learned dynamics of a pre-trained Neural Ordinary Differential Equation (NODE) on a two-dimensional vortex-merging problem. The DEIM trajectories reveal physically meaningful features in the learned dynamics of NODE and expose its limitations when extrapolating to unseen flow configurations. These findings demonstrate that DEIM can serve not only as a model reduction tool but also as a diagnostic framework for understanding and improving the generalization behavior of neural differential equation models.
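For reference, classical DEIM selects interpolation points greedily from a POD basis; the differentiable variant in the paper replaces this fixed selection with a network. A standard NumPy implementation of the classical algorithm (the example data is synthetic):

```python
import numpy as np

def deim_points(U):
    """Classic DEIM interpolation-point selection from a POD basis.

    U: (n, m) orthonormal basis of the nonlinear-term snapshots.
    Returns m row indices where the nonlinearity is sampled; the
    adaptive variant in the paper lets a network move these points.
    """
    n, m = U.shape
    p = [int(np.argmax(np.abs(U[:, 0])))]
    for l in range(1, m):
        # Interpolate column u_l at current points, then take the
        # residual's largest-magnitude entry as the next point.
        c = np.linalg.solve(U[p, :l], U[p, l])
        r = U[:, l] - U[:, :l] @ c
        p.append(int(np.argmax(np.abs(r))))
    return np.array(p)

# Example: basis from random snapshots of a 1D field.
snaps = np.random.rand(200, 40)
U, _, _ = np.linalg.svd(snaps, full_matrices=False)
print(deim_points(U[:, :10]))
```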
[725] Privacy-preserving Decision-focused Learning for Multi-energy Systems
Yangze Zhou, Ruiyang Yao, Dalin Qin, Yixiong Jia, Yi Wang
Main category: cs.LG
TL;DR: A privacy-preserving decision-focused learning framework for multi-energy system dispatch that protects sensitive data while achieving lower dispatch costs than traditional methods.
Details
Motivation: Traditional load forecasting and decision-making are implemented separately, with forecasting models minimizing errors without considering downstream decision impact. Decision-focused learning addresses this but faces privacy challenges due to sensitive data sharing across sectors.
Method: Proposes a privacy-preserving DFL framework with information masking to protect private data while enabling recovery of decision variables and gradients. Uses matrix decomposition and homomorphic encryption for security, and includes a privacy-preserving load pattern recognition algorithm for specialized DFL models.
Result: Theoretical analysis and real-world case studies show the framework protects privacy while consistently achieving lower average daily dispatch costs compared to existing methods.
Conclusion: The proposed privacy-preserving DFL framework successfully addresses privacy concerns in multi-energy system dispatch while improving decision-making performance through specialized load pattern recognition and secure computation techniques.
Abstract: Decision-making for multi-energy system (MES) dispatch depends on accurate load forecasting. Traditionally, load forecasting and decision-making for MES are implemented separately. Forecasting models are typically trained to minimize forecasting errors, overlooking their impact on downstream decision-making. To address this, decision-focused learning (DFL) has been studied to minimize decision-making costs instead. However, practical adoption of DFL in MES faces significant challenges: the process requires sharing sensitive load data and model parameters across multiple sectors, raising serious privacy issues. To this end, we propose a privacy-preserving DFL framework tailored for MES. Our approach introduces information masking to safeguard private data while enabling recovery of decision variables and gradients required for model training. To further enhance security for DFL, we design a safety protocol combining matrix decomposition and homomorphic encryption, effectively preventing collusion and unauthorized data access. Additionally, we developed a privacy-preserving load pattern recognition algorithm, enabling the training of specialized DFL models for heterogeneous load patterns. Theoretical analysis and comprehensive case studies, including real-world MES data, demonstrate that our framework not only protects privacy but also consistently achieves lower average daily dispatch costs compared to existing methods.
[726] OpenEM: Large-scale multi-structural 3D datasets for electromagnetic methods
Shuang Wang, Xuben Wang, Fei Deng, Peifan Jiang, Jian Chen, Gianluca Fiandaca
Main category: cs.LG
TL;DR: OpenEM is a large-scale 3D geoelectric dataset with diverse geological structures and a deep learning-based fast forward modeling approach to accelerate deep learning applications in electromagnetic exploration.
Details
Motivation: Existing deep learning methods in EM exploration rely on datasets from simple models that don't represent real geological complexity, and lack standardized public 3D datasets, hindering progress.
Method: Created OpenEM dataset with 9 categories of geologically plausible 3D subsurface structures, and developed a deep learning-based fast forward modeling approach for efficient computation across the dataset.
Result: OpenEM provides a comprehensive, large-scale dataset covering various geological configurations with efficient forward modeling capability, publicly available for EM exploration research.
Conclusion: OpenEM addresses the dataset limitations in EM deep learning research by providing standardized, complex 3D geoelectric models and fast forward modeling, accelerating the application of deep learning in electromagnetic methods.
Abstract: With the remarkable success of deep learning, applying such techniques to EM methods has emerged as a promising research direction to overcome the limitations of conventional approaches. The effectiveness of deep learning methods depends heavily on the quality of datasets, which directly influences model performance and generalization ability. Existing application studies often construct datasets from random one-dimensional or structurally simple three-dimensional models, which fail to represent the complexity of real geological environments. Furthermore, the absence of standardized, publicly available three-dimensional geoelectric datasets continues to hinder progress in deep learning-based EM exploration. To address these limitations, we present OpenEM, a large-scale, multi-structural, three-dimensional geoelectric dataset that encompasses a broad range of geologically plausible subsurface structures. OpenEM consists of nine categories of geoelectric models, spanning from simple configurations with anomalous bodies in a half-space to more complex structures such as flat layers, folded layers, flat faults, curved faults, and their corresponding variants with anomalous bodies. Since three-dimensional forward modeling in electromagnetics is extremely time-consuming, we further developed a deep learning-based fast forward modeling approach for OpenEM, enabling efficient and reliable forward modeling across the entire dataset. This capability allows OpenEM to be rapidly deployed for a wide range of tasks. OpenEM provides a unified, comprehensive, and large-scale dataset for common EM exploration systems to accelerate the application of deep learning in electromagnetic methods. The complete dataset, along with the forward modeling codes and trained models, is publicly available at https://doi.org/10.5281/zenodo.17141981.
[727] The Mirror Loop: Recursive Non-Convergence in Generative Reasoning Systems
Bentley DeVilling
Main category: cs.LG
TL;DR: Without external feedback, recursive self-evaluation in large language models yields reformulation rather than progress. A minimal grounding intervention (a single verification step) significantly improves informational change and prevents epistemic stasis.
Details
Motivation: To test whether large language models can truly engage in reflective reasoning or if recursive self-evaluation without external feedback leads to informational closure rather than progress.
Method: Cross-provider study of 144 reasoning sequences across three models (GPT-4o-mini, Claude 3 Haiku, Gemini 2.0 Flash) and four task families (arithmetic, code, explanation, reflection), iterated ten times under two conditions: ungrounded self-critique and minimal grounding intervention (single verification at iteration three).
Result: Ungrounded runs showed 55% decline in informational change from early to late iterations. Grounded runs showed +28% rebound after intervention and sustained non-zero variance. Multiple measures (n-gram novelty, embedding drift, entropy) converged on same pattern: reflection without contact leads to informational closure.
Conclusion: There is a structural limit on self-correction in generative reasoning - without information exchange with independent verifier, recursive inference approaches epistemic stasis. Minimal grounding functions as dissipative coupling. Cross-architecture consistency suggests this arises from shared autoregressive training objectives.
Abstract: Large language models are often described as capable of reflective reasoning, yet recursive self-evaluation without external feedback frequently yields reformulation rather than progress. We test this prediction in a cross-provider study of 144 reasoning sequences across three models (OpenAI GPT-4o-mini, Anthropic Claude 3 Haiku, and Google Gemini 2.0 Flash) and four task families (arithmetic, code, explanation, reflection), each iterated ten times under two conditions: ungrounded self-critique and a minimal grounding intervention (a single verification step at iteration three). Mean informational change (delta I, measured via normalized edit distance) declined by 55% from early (0.193) to late (0.087) iterations in ungrounded runs, with consistent patterns across all three providers. Grounded runs showed a +28% rebound in informational change immediately after the intervention and sustained non-zero variance thereafter. Complementary measures (n-gram novelty, embedding drift, and character-level entropy) converged on the same pattern: reflection without contact tends toward informational closure. We interpret this as evidence for a structural limit on self-correction in generative reasoning: without an exchange of information with an independent verifier or environment, recursive inference approaches an attractor state of epistemic stasis. Minimal grounding functions as dissipative coupling, reintroducing informational flux. The cross-architecture consistency suggests the mirror loop arises from shared autoregressive training objectives rather than provider-specific alignment schemes. The results delineate when reflection is performative rather than epistemic and motivate design principles for grounded, cooperative reasoning. Materials and code are publicly available.
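The paper's delta-I statistic is a normalized Levenshtein distance between consecutive iterations' outputs; a plain-Python implementation of that measure follows (the example strings are invented):

```python
def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance scaled to [0, 1]; the paper's delta-I
    measures this between consecutive iterations' outputs."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n] / max(m, n, 1)

# Reformulation without progress shows up as small delta-I:
it3 = "The answer is 42 because the sum telescopes."
it4 = "The answer is 42 since the sum telescopes."
print(normalized_edit_distance(it3, it4))  # low informational change
```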
[728] The Principles of Diffusion Models
Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, Stefano Ermon
Main category: cs.LG
TL;DR: This monograph provides a comprehensive overview of diffusion models, explaining their core principles through three complementary mathematical perspectives: variational, score-based, and flow-based views, all sharing the common foundation of learning time-dependent velocity fields for data generation.
Details
Motivation: To establish a unified mathematical foundation for understanding diffusion models by tracing their origins and showing how different formulations arise from shared mathematical ideas, providing conceptual clarity for the deep learning community.
Method: The monograph presents three complementary views: (1) Variational view inspired by VAEs that learns step-by-step noise removal, (2) Score-based view rooted in energy-based modeling that learns gradient of data distribution, and (3) Flow-based view related to normalizing flows that treats generation as following smooth paths with learned velocity fields.
Result: The analysis reveals that all three perspectives share a common backbone: a time-dependent velocity field whose flow transports a simple prior to data, with sampling amounting to solving differential equations that evolve noise into data along continuous trajectories.
Conclusion: Diffusion models provide a mathematically grounded framework for data generation through continuous transformations, with applications in controllable generation, efficient numerical solvers, and direct mapping models, offering a unified understanding of this powerful generative modeling approach.
Abstract: This monograph presents the core principles that have guided the development of diffusion models, tracing their origins and showing how diverse formulations arise from shared mathematical ideas. Diffusion modeling starts by defining a forward process that gradually corrupts data into noise, linking the data distribution to a simple prior through a continuum of intermediate distributions. The goal is to learn a reverse process that transforms noise back into data while recovering the same intermediates. We describe three complementary views. The variational view, inspired by variational autoencoders, sees diffusion as learning to remove noise step by step. The score-based view, rooted in energy-based modeling, learns the gradient of the evolving data distribution, indicating how to nudge samples toward more likely regions. The flow-based view, related to normalizing flows, treats generation as following a smooth path that moves samples from noise to data under a learned velocity field. These perspectives share a common backbone: a time-dependent velocity field whose flow transports a simple prior to the data. Sampling then amounts to solving a differential equation that evolves noise into data along a continuous trajectory. On this foundation, the monograph discusses guidance for controllable generation, efficient numerical solvers, and diffusion-motivated flow-map models that learn direct mappings between arbitrary times. It provides a conceptual and mathematically grounded understanding of diffusion models for readers with basic deep-learning knowledge.
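The shared backbone the monograph emphasizes, generation as ODE solving under a learned velocity field, fits in a few lines. The sketch below uses plain Euler integration and a stand-in velocity function; the conventions (time running from 1 down to 0) are one common choice, not the only one.

```python
import torch

def sample_by_ode(velocity, x_T, n_steps=100):
    """Generate data by integrating a learned velocity field (sketch).

    Start from prior noise x_T and solve dx/dt = v(x, t) from t = 1
    down to t = 0 with Euler steps; fancier solvers trade steps for
    accuracy but follow the same continuous trajectory.
    """
    x, dt = x_T, 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((x.shape[0], 1), 1.0 - k * dt)
        x = x - velocity(x, t) * dt   # one Euler step toward the data
    return x

# Stand-in velocity field; in practice this is a trained network.
v = lambda x, t: x * (1 - t)          # illustrative only
samples = sample_by_ode(v, torch.randn(8, 2))
```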
[729] A supervised discriminant data representation: application to pattern classification
Fadi Dornaika, Ahmad Khoder, Abdelmalik Moujahid, Wassim Khoder
Main category: cs.LG
TL;DR: A hybrid linear feature extraction method for supervised multi-class classification that combines advantages of RSLDA and ICS_DLSR using sparsity techniques and iterative optimization.
Details
Motivation: Machine learning performance depends heavily on data representation, requiring effective preprocessing and data transformations to support better classification.
Method: Proposes a unifying criterion combining RSLDA and ICS_DLSR, using sparsity-promoting techniques for feature selection and class consistency, with iterative alternating minimization based on gradient descent.
Result: Outperformed competing methods on multiple datasets including faces, objects, and digits in most experimental cases.
Conclusion: The framework is generic and allows combination of other linear discriminant embedding methods, demonstrating effectiveness through superior performance across diverse datasets.
Abstract: The performance of machine learning and pattern recognition algorithms generally depends on data representation. That is why much of the current effort in applying machine learning goes into the design of preprocessing frameworks and data transformations able to support effective learning. The method proposed in this work consists of a hybrid linear feature extraction scheme to be used in supervised multi-class classification problems. Inspired by two recent linear discriminant methods, robust sparse linear discriminant analysis (RSLDA) and inter-class sparsity-based discriminative least squares regression (ICS_DLSR), we propose a unifying criterion that is able to retain the advantages of these two powerful methods. The resulting transformation relies on sparsity-promoting techniques both to select the features that most accurately represent the data and to preserve the row-sparsity consistency property of samples from the same class. The linear transformation and the orthogonal matrix are estimated using an iterative alternating minimization scheme based on the steepest-descent gradient method and different initialization schemes. The proposed framework is generic in the sense that it allows the combination and tuning of other linear discriminant embedding methods. According to the experiments conducted on several datasets including faces, objects, and digits, the proposed method was able to outperform competing methods in most cases.
[730] Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks
Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa, Jiachen T. Wang, Peng Gao, Charith Peris, Yao Ma, Rahul Gupta, Ming Jin, Prateek Mittal, Ruoxi Jia
Main category: cs.LG
TL;DR: The paper proposes Adversarial Skill Compositional Training (ASCoT) to defend against jailbreak attacks by training on diverse compositions of adversarial skill primitives rather than isolated attacks, based on the hypothesis that novel jailbreaks are recombinations of previous adversarial skills.
Details
Motivation: Large language models remain vulnerable to jailbreak attacks that bypass safety guardrails, and existing adversarial training methods often fail against newly developed jailbreaks due to optimization challenges and difficulties in defining realistic threat models.
Method: The authors analyze 32 attack papers over two years, extract and compress adversarial skills into a sparse dictionary of primitives, then introduce ASCoT which trains on diverse compositions of these skill primitives rather than isolated attack instances.
Result: ASCoT substantially improves robustness to unseen attacks including multi-turn jailbreaks while maintaining low over-refusal rates, and demonstrates that expanding adversarial skill coverage (not just data scale) is key to defending against novel attacks.
Conclusion: Novel jailbreaks are largely recombinations of previous adversarial skills, and training on diverse compositions of skill primitives (ASCoT) provides an effective defense paradigm that improves robustness against unseen attacks.
Abstract: Large language models remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. Defending against novel jailbreaks represents a critical challenge in AI safety. Adversarial training – designed to make models robust against worst-case perturbations – has been the dominant paradigm for adversarial robustness. However, due to optimization challenges and difficulties in defining realistic threat models, adversarial training methods often fail on newly developed jailbreaks in practice. This paper proposes a new paradigm for improving robustness against unseen jailbreaks, centered on the Adversarial Déjà Vu hypothesis: novel jailbreaks are not fundamentally new, but largely recombinations of adversarial skills from previous attacks. We study this hypothesis through a large-scale analysis of 32 attack papers published over two years. Using an automated pipeline, we extract and compress adversarial skills into a sparse dictionary of primitives, with LLMs generating human-readable descriptions. Our analysis reveals that unseen attacks can be effectively explained as sparse compositions of earlier skills, with explanatory power increasing monotonically as skill coverage grows. Guided by this insight, we introduce Adversarial Skill Compositional Training (ASCoT), which trains on diverse compositions of skill primitives rather than isolated attack instances. ASCoT substantially improves robustness to unseen attacks, including multi-turn jailbreaks, while maintaining low over-refusal rates. We also demonstrate that expanding adversarial skill coverage, not just data scale, is key to defending against novel attacks. Warning: This paper contains content that may be harmful or offensive in nature.
[731] Joint Score-Threshold Optimization for Interpretable Risk Assessment Under Partial Supervision
Fardin Gankhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Kimia Ghobadi
Main category: cs.LG
TL;DR: A mixed-integer programming framework for optimizing ordinal risk assessment tools that handles partial supervision and asymmetric misclassification costs in healthcare settings.
Details
Motivation: Standard supervised learning fails for healthcare risk assessment due to partial supervision from intervention-censored outcomes and asymmetric misclassification costs that increase with ordinal distance.
Method: Proposes a mixed-integer programming framework that jointly optimizes scoring weights and category thresholds, handles partial supervision through per-instance feasible label sets, incorporates asymmetric distance-aware objectives, and prevents middle-category collapse via minimum threshold gaps.
Result: Developed a CSO relaxation using softplus losses that preserves ordinal structure while enabling efficient optimization, and supports governance constraints including sign restrictions, sparsity, and minimal modifications to existing tools.
Conclusion: The framework ensures practical deployability in clinical workflows by addressing key challenges in optimizing healthcare risk assessment tools while maintaining governance constraints.
Abstract: Risk assessment tools in healthcare commonly employ point-based scoring systems that map patients to ordinal risk categories via thresholds. While electronic health record (EHR) data presents opportunities for data-driven optimization of these tools, two fundamental challenges impede standard supervised learning: (1) partial supervision arising from intervention-censored outcomes, where only extreme categories can be reliably labeled, and (2) asymmetric misclassification costs that increase with ordinal distance. We propose a mixed-integer programming (MIP) framework that jointly optimizes scoring weights and category thresholds under these constraints. Our approach handles partial supervision through per-instance feasible label sets, incorporates asymmetric distance-aware objectives, and prevents middle-category collapse via minimum threshold gaps. We further develop a CSO relaxation using softplus losses that preserves the ordinal structure while enabling efficient optimization. The framework supports governance constraints including sign restrictions, sparsity, and minimal modifications to incumbent tools, ensuring practical deployability in clinical workflows.
[732] AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing
Samuel Bright-Thonney, Christina Reissel, Gaia Grosso, Nathaniel Woodward, Katya Govorkova, Andrzej Novak, Sang Eon Park, Eric Moreno, Philip Harris
Main category: cs.LG
TL;DR: AutoSciDACT is a unified pipeline for novelty detection in scientific data that combines contrastive pre-training for dimensionality reduction with statistical two-sample testing to identify and quantify anomalies with statistical rigor.
Details
Motivation: Scientific novelty detection faces challenges with noisy high-dimensional data and the need for statistically robust anomaly claims, as most existing methods don't produce outputs compatible with quantifiable scientific discovery.
Method: Uses contrastive pre-training to create low-dimensional embeddings from simulated data and expert-guided augmentations, then applies NPLM framework for sensitive two-sample testing to statistically quantify deviations from reference distributions.
Result: Demonstrated strong sensitivity to small injections of anomalous data across astronomical, physical, biological, image, and synthetic datasets.
Conclusion: AutoSciDACT provides a general-purpose pipeline for rigorous statistical novelty detection in scientific domains, addressing key challenges in high-dimensional data analysis and quantifiable discovery claims.
Abstract: Novelty detection in large scientific datasets faces two key challenges: the noisy and high-dimensional nature of experimental data, and the necessity of making statistically robust statements about any observed outliers. While there is a wealth of literature on anomaly detection via dimensionality reduction, most methods do not produce outputs compatible with quantifiable claims of scientific discovery. In this work we directly address these challenges, presenting the first step towards a unified pipeline for novelty detection adapted for the rigorous statistical demands of science. We introduce AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline for detecting novelty in scientific data. AutoSciDACT begins by creating expressive low-dimensional data representations using a contrastive pre-training, leveraging the abundance of high-quality simulated data in many scientific domains alongside expertise that can guide principled data augmentation strategies. These compact embeddings then enable an extremely sensitive machine learning-based two-sample test using the New Physics Learning Machine (NPLM) framework, which identifies and statistically quantifies deviations in observed data relative to a reference distribution (null hypothesis). We perform experiments across a range of astronomical, physical, biological, image, and synthetic datasets, demonstrating strong sensitivity to small injections of anomalous data across all domains.
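The second stage is a two-sample test in the learned embedding space. As a hedged stand-in for the NPLM test the pipeline actually uses, here is a simple permutation test on mean embeddings, which conveys how a p-value gets attached to a small anomalous injection; the data and dimensions are invented.

```python
import numpy as np

def permutation_two_sample_test(ref, obs, n_perm=2000, seed=0):
    """Permutation test on embedding means (a simple stand-in for NPLM).

    Returns a p-value for 'obs was drawn from the same distribution
    as ref', using the distance between mean embeddings as statistic.
    """
    rng = np.random.default_rng(seed)
    stat = np.linalg.norm(ref.mean(0) - obs.mean(0))
    pooled = np.vstack([ref, obs])
    n_ref = len(ref)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        a, b = pooled[perm[:n_ref]], pooled[perm[n_ref:]]
        count += np.linalg.norm(a.mean(0) - b.mean(0)) >= stat
    return (count + 1) / (n_perm + 1)

# 2% anomalous contamination in a 5-D contrastive embedding space.
ref = np.random.normal(0, 1, (2000, 5))
obs = np.vstack([np.random.normal(0, 1, (980, 5)),
                 np.random.normal(2.5, 1, (20, 5))])
print(permutation_two_sample_test(ref, obs))   # small p-value expected
```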
[733] Generalization Bounds for Rank-sparse Neural Networks
Antoine Ledent, Rodrigo Alves, Yunwen Lei
Main category: cs.LG
TL;DR: Neural networks exhibit bottleneck rank property where activations and weights become approximately low rank during training. This paper proves generalization bounds that exploit this low-rank structure, showing sample complexity depends on Schatten p quasi-norms of weight matrices.
Details
Motivation: To understand how the observed bottleneck rank phenomenon in neural networks affects generalization performance, and to develop theoretical bounds that leverage this low-rank structure for better generalization guarantees.
Method: Theoretical analysis using Schatten p quasi-norms of weight matrices to derive generalization bounds. The bounds exploit the approximate low-rank structure observed in neural network weights during training with gradient-based methods.
Result: Proved generalization bounds with sample complexity O(WrL²) for small p, where W is width, L is depth, and r is rank of weight matrices. As p increases, bounds behave more like traditional norm-based bounds.
Conclusion: The bottleneck rank property provides a theoretical foundation for understanding neural network generalization, with bounds that improve when networks exhibit low-rank structure, particularly for small Schatten p quasi-norms.
Abstract: It has been recently observed in much of the literature that neural networks exhibit a bottleneck rank property: for larger depths, the activation and weights of neural networks trained with gradient-based methods tend to be of approximately low rank. In fact, the rank of the activations of each layer converges to a fixed value referred to as the ``bottleneck rank’’, which is the minimum rank required to represent the training data. This perspective is in line with the observation that regularizing linear networks (without activations) with weight decay is equivalent to minimizing the Schatten $p$ quasi norm of the neural network. In this paper we investigate the implications of this phenomenon for generalization. More specifically, we prove generalization bounds for neural networks which exploit the approximate low rank structure of the weight matrices if present. The final results rely on the Schatten $p$ quasi norms of the weight matrices: for small $p$, the bounds exhibit a sample complexity $ \widetilde{O}(WrL^2)$ where $W$ and $L$ are the width and depth of the neural network respectively and where $r$ is the rank of the weight matrices. As $p$ increases, the bound behaves more like a norm-based bound instead.
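For concreteness, the quantity driving these bounds is the Schatten $p$ quasi-norm; the LaTeX below restates its definition and the bound's shape from the abstract (constants and logarithmic factors suppressed):

```latex
% Schatten p quasi-norm of a matrix A with singular values \sigma_i(A):
\|A\|_{S_p} = \Big( \sum_i \sigma_i(A)^p \Big)^{1/p}, \qquad 0 < p \le 1.
% If A has rank r, then \|A\|_{S_p}^p \le r \, \sigma_{\max}(A)^p, so small p
% rewards approximately low-rank weights; the resulting sample complexity
% scales as \widetilde{O}(W r L^2) in the width W, depth L, and rank r.
```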
[734] Revisiting Orbital Minimization Method for Neural Operator Decomposition
J. Jon Ryu, Samuel Zhou, Gregory W. Wornell
Main category: cs.LG
TL;DR: The paper revisits the classical orbital minimization method (OMM) from computational physics and adapts it for training neural networks to decompose positive semidefinite operators, demonstrating practical advantages across benchmark tasks.
Details
Motivation: To provide a principled approach for deploying neural networks in numerical simulation and offer effective tools for machine learning by revisiting classical numerical methods through modern theory and computation.
Method: Adapts the orbital minimization method (OMM) framework to train neural networks for decomposing positive semidefinite operators, with a linear-algebraic proof of consistency and connections to independently developed ideas across domains.
Result: Demonstrates practical advantages of the adapted OMM framework across a range of benchmark tasks, showing its effectiveness for neural network-based operator decomposition.
Conclusion: Revisiting classical numerical methods like OMM through modern computational lenses provides principled approaches for neural network deployment in numerical simulation and scalable tools for machine learning.
Abstract: Spectral decomposition of linear operators plays a central role in many areas of machine learning and scientific computing. Recent work has explored training neural networks to approximate eigenfunctions of such operators, enabling scalable approaches to representation learning, dynamical systems, and partial differential equations (PDEs). In this paper, we revisit a classical optimization framework from the computational physics literature known as the \emph{orbital minimization method} (OMM), originally proposed in the 1990s for solving eigenvalue problems in computational chemistry. We provide a simple linear-algebraic proof of the consistency of the OMM objective, and reveal connections between this method and several ideas that have appeared independently across different domains. Our primary goal is to justify its broader applicability in modern learning pipelines. We adapt this framework to train neural networks to decompose positive semidefinite operators, and demonstrate its practical advantages across a range of benchmark tasks. Our results highlight how revisiting classical numerical methods through the lens of modern theory and computation can provide not only a principled approach for deploying neural networks in numerical simulation, but also effective and scalable tools for machine learning.
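For readers unfamiliar with OMM, the classical functional, as commonly written in the computational-physics literature, is shown below; note this is the textbook form, and the paper adapts the sign conventions for positive semidefinite operators.

```latex
% Classical OMM objective over an unconstrained coefficient matrix C (n x m),
% for a (suitably shifted) operator H whose lowest eigenspace is sought:
E_{\mathrm{OMM}}(C) = \operatorname{Tr}\!\big[ (2I - C^{\top} C)\, C^{\top} H C \big].
% No explicit orthogonality constraint is imposed: the (2I - C^T C) factor
% penalizes overlap, and at a minimizer span(C) recovers the target eigenspace.
% This unconstrained form is what makes the objective convenient for
% gradient-based training of neural eigenfunction parameterizations.
```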
[735] Transformer Based Linear Attention with Optimized GPU Kernel Implementation
Armin Gerami, Ramani Duraiswami
Main category: cs.LG
TL;DR: A novel method for linear attention with optimized CUDA implementation achieves 3.3x speedup and 3.6x memory reduction while maintaining comparable accuracy to regular attention.
Details
Motivation: Linear attention mechanisms offer O(ND²) time complexity vs O(N²D) for regular attention, but lag behind theoretical efficiency in practice, needing optimization.
Method: Proposed novel forward and backward passes for linear attention with highly-optimized CUDA implementation.
Result: 3.3x speed improvement and 3.6x memory reduction over state-of-the-art; trained 1.4B parameter language model showing similar expressivity to regular attention on reasoning benchmarks.
Conclusion: The optimized linear attention implementation achieves significant efficiency gains while maintaining performance comparable to regular attention.
Abstract: The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2D)$. Given the success of Transformers, improving their runtime during both training and inference is a popular research area. One such approach is the introduction of the linear attention (LA) mechanisms, which offers a linear time complexity of $O(ND^2)$ and have demonstrated comparable accuracy to regular attention. However, LA in practice lags behind its theoretical efficiency. We propose a novel method for LA’s forward and backward passes, along with a highly-optimized CUDA implementation. Our approach outperforms the state-of-the-art by 3.3 times in speed and reduces memory consumption by 3.6 times. We validate these improvements in both single-layer and end-to-end settings by training a 1.4 billion parameter language model, which demonstrates similar expressivity to regular attention on major reasoning benchmarks.
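The reassociation behind linear attention is worth seeing explicitly. The sketch below uses the common elu-plus-one feature map and is a generic reference implementation, not the authors' optimized CUDA kernels:

```python
import torch

def linear_attention(Q, K, V, phi=torch.nn.functional.elu):
    """O(N D^2) attention via the kernel trick (sketch).

    Instead of softmax(QK^T)V, which costs O(N^2 D), apply a feature
    map phi and reassociate: phi(Q) (phi(K)^T V) costs O(N D^2).
    """
    Qf, Kf = phi(Q) + 1, phi(K) + 1          # positive features
    kv = Kf.transpose(-2, -1) @ V            # (D, D): shared across all queries
    z = Qf @ Kf.sum(dim=-2, keepdim=True).transpose(-2, -1)  # normalizer
    return (Qf @ kv) / (z + 1e-6)

N, D = 4096, 64
Q, K, V = (torch.randn(N, D) for _ in range(3))
out = linear_attention(Q, K, V)              # never forms the N x N matrix
```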
[736] Reducing the Representation Error of GAN Image Priors Using the Deep Decoder
Mara Daniels, Paul Hand, Reinhard Heckel
Main category: cs.LG
TL;DR: A hybrid model combining GAN priors with Deep Decoder reduces representation error in inverse problems like compressive sensing and image superresolution, outperforming both methods individually.
Details
Motivation: GAN priors have representation errors for both in-distribution and out-of-distribution images due to mismatch between learned and true data distributions.
Method: Model images as linear combination of GAN prior with Deep Decoder - an unlearned, underparameterized natural signal model similar to Deep Image Prior.
Result: Consistently higher PSNRs than both GAN priors and Deep Decoder separately for compressive sensing and image superresolution, on both in-distribution and out-of-distribution images.
Conclusion: Provides extensible and cheap method to leverage benefits of both learned and unlearned image recovery priors in inverse problems.
Abstract: Generative models, such as GANs, learn an explicit low-dimensional representation of a particular class of images, and so they may be used as natural image priors for solving inverse problems such as image restoration and compressive sensing. GAN priors have demonstrated impressive performance on these tasks, but they can exhibit substantial representation error for both in-distribution and out-of-distribution images, because of the mismatch between the learned, approximate image distribution and the data generating distribution. In this paper, we demonstrate a method for reducing the representation error of GAN priors by modeling images as the linear combination of a GAN prior with a Deep Decoder. The deep decoder is an underparameterized and most importantly unlearned natural signal model similar to the Deep Image Prior. No knowledge of the specific inverse problem is needed in the training of the GAN underlying our method. For compressive sensing and image superresolution, our hybrid model exhibits consistently higher PSNRs than both the GAN priors and Deep Decoder separately, both on in-distribution and out-of-distribution images. This model provides a method for extensibly and cheaply leveraging both the benefits of learned and unlearned image recovery priors in inverse problems.
[737] Parallel Sampling from Masked Diffusion Models via Conditional Independence Testing
Iskander Azangulov, Teodora Pandeva, Niranjani Prasad, Javier Zazo, Sushrut Karmalkar
Main category: cs.LG
TL;DR: PUNT is a model-agnostic sampler for masked diffusion models that enables efficient parallel token sampling by resolving the conflict between conditional independence and high-confidence predictions through dependency identification and selective unmasking.
Details
Motivation: Masked diffusion models enable parallel token sampling for faster inference compared to autoregressive models, but face a trade-off between conditional independence requirements and prioritizing high-confidence predictions that often depend on each other.
Method: PUNT identifies token dependencies and removes lower-confidence tokens from conflicting groups, producing sets of indices for unmasking that satisfy both independence and confidence criteria through approximate conditional independence testing.
Result: PUNT achieves up to 16% higher accuracy on IFEval benchmark compared to baseline methods, with superior trade-off between accuracy and compute, especially for longer sequences. It works across different hyperparameters and induces emergent hierarchical generation strategy.
Conclusion: PUNT effectively reconciles the parallel sampling trade-off in masked diffusion models, enabling faster inference while maintaining or improving accuracy through a planning-like generation process that establishes high-level structure before local refinement.
Abstract: Masked diffusion models (MDMs) offer a compelling alternative to autoregressive models (ARMs) for discrete text generation because they enable parallel token sampling, rather than sequential, left-to-right generation. This means potentially much faster inference. However, effective parallel sampling faces two competing requirements: (i) simultaneously updated tokens must be conditionally independent, and (ii) updates should prioritise high-confidence predictions. These goals conflict because high-confidence predictions often cluster and depend on each other, limiting opportunities for parallel updates. We present PUNT, a model-agnostic sampler that reconciles this trade-off. Our method identifies token dependencies and removes lower-confidence tokens from conflicting groups. This produces sets of indices for unmasking that satisfy both independence and confidence criteria. Our approach ensures improved parallel unmasking through approximate conditional independence testing. Our experiments show that PUNT delivers a superior trade-off between accuracy and compute when compared to other strong training-free baselines, especially for generation of longer sequences. On the IFEval benchmark, it achieves up to 16% higher accuracy over baseline methods, including sequential generation (one-by-one). These gains hold across different values of hyperparameters, mitigating the need for brittle hyperparameter tuning. Moreover, we observe that PUNT induces an emergent hierarchical generation strategy, where the model first establishes high-level paragraph structure before local refinement, suggesting a planning-like generation process that contributes to strong alignment performance.
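A toy sketch of the selection step, assuming a greedy rule and a pairwise `depends` oracle standing in for PUNT's approximate conditional-independence test:

```python
def select_unmask_set(confidences, depends):
    # Greedy: walk positions in decreasing confidence, keeping a position only
    # if it is (approximately) conditionally independent of everything kept.
    order = sorted(confidences, key=confidences.get, reverse=True)
    kept = []
    for i in order:
        if all(not depends(i, j) for j in kept):
            kept.append(i)
    return kept

# Toy usage: adjacent positions are treated as dependent.
conf = {3: 0.95, 4: 0.93, 10: 0.90, 11: 0.50}
print(select_unmask_set(conf, lambda i, j: abs(i - j) == 1))   # [3, 10]
```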
[738] Deep Jump Gaussian Processes for Surrogate Modeling of High-Dimensional Piecewise Continuous Functions
Yang Xu, Chiwoo Park
Main category: cs.LG
TL;DR: Deep Jump Gaussian Processes (DJGP) is a novel surrogate modeling method for high-dimensional piecewise continuous functions that adds a locally linear projection layer to Jump Gaussian Processes to capture local subspace structures.
Details
Motivation: To overcome limitations of conventional Jump Gaussian Processes in high-dimensional input spaces by better capturing local subspace structures while maintaining piecewise continuity modeling.
Method: Adds a locally linear projection layer with region-specific matrices to Jump Gaussian Processes, places Gaussian Process prior on projection matrices, and uses scalable variational inference to jointly learn projection matrices and JGP hyperparameters.
Result: DJGP delivers superior predictive accuracy and more reliable uncertainty quantification compared to existing approaches on synthetic and benchmark datasets.
Conclusion: The proposed DJGP method effectively handles high-dimensional piecewise continuous functions through its two-layer deep learning architecture combining local projections with Jump Gaussian Processes.
Abstract: We introduce Deep Jump Gaussian Processes (DJGP), a novel method for surrogate modeling of high-dimensional piecewise continuous functions. DJGP overcomes the limitations of conventional Jump Gaussian Processes in high-dimensional input spaces by adding a locally linear projection layer to Jump Gaussian Processes. This projection uses region-specific matrices to capture local subspace structures, naturally complementing the localized nature of JGP, a variant of local Gaussian Processes. To control model complexity, we place a Gaussian Process prior on the projection matrices, allowing them to evolve smoothly across the input space. The projected inputs are then modeled with a JGP to capture piecewise continuous relationships with the response. This yields a distinctive two-layer deep learning of GP/JGP. We further develop a scalable variational inference algorithm to jointly learn the projection matrices and JGP hyperparameters. Experiments on synthetic and benchmark datasets demonstrate that DJGP delivers superior predictive accuracy and more reliable uncertainty quantification compared to existing approaches.
[739] Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models
Hoang Phan, Xianjun Yang, Kevin Yao, Jingyu Zhang, Shengjie Bi, Xiaocheng Tang, Madian Khabsa, Lijuan Liu, Deren Lei
Main category: cs.LG
TL;DR: RLVR training causes capability regression in models. RECAP is proposed as a replay strategy with dynamic objective reweighting to preserve general knowledge while improving reasoning.
Details
Motivation: RLVR training leads to capability regression where models forget foundational skills. Current regularization methods don't guarantee broader knowledge preservation.
Method: RECAP - replay strategy with dynamic objective reweighting that adapts online using convergence and instability signals to shift focus from saturated to underperforming objectives.
Result: Experiments on Qwen2.5-VL models show RECAP preserves general capabilities and improves reasoning by enabling flexible trade-offs among in-task rewards.
Conclusion: RECAP effectively addresses capability regression in RLVR training through adaptive objective reweighting without requiring additional models or heavy tuning.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporary language and vision-language models. However, the RLVR recipe introduces a significant risk of capability regression, where models forget foundational skills after prolonged training without employing regularization strategies. We empirically confirm this concern, observing that open-source reasoning models suffer performance degradation on core capabilities such as perception and faithfulness. While imposing regularization terms like KL divergence can help prevent deviation from the base model, these terms are calculated on the current task, thus they do not guarantee broader knowledge. Meanwhile, commonly used experience replay across heterogeneous domains makes it nontrivial to decide how much training focus each objective should receive. To address this, we propose RECAP, a replay strategy with dynamic objective reweighting for general knowledge preservation. Our reweighting mechanism adapts in an online manner using short-horizon signals of convergence and instability, shifting the post-training focus away from saturated objectives and toward underperforming or volatile ones. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning. Extensive experiments on benchmarks based on Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate the effectiveness of our method, which not only preserves general capabilities but also improves reasoning by enabling more flexible trade-offs among in-task rewards.
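A hedged sketch of dynamic objective reweighting from short-horizon reward histories; the trend and standard-deviation signals here are illustrative assumptions, not RECAP's exact definitions:

```python
import numpy as np

def recap_like_weights(reward_histories, eps=1e-8):
    # Upweight objectives that are still improving or unstable; downweight
    # saturated, flat ones. Trend and std are illustrative signal choices.
    scores = []
    for h in reward_histories:              # h: recent per-objective rewards
        h = np.asarray(h, dtype=float)
        trend = max(h[-1] - h[0], 0.0)      # short-horizon convergence signal
        instability = h.std()               # short-horizon instability signal
        scores.append(trend + instability + eps)
    s = np.asarray(scores)
    return s / s.sum()                      # normalized objective weights

w = recap_like_weights([[0.90, 0.90, 0.90],   # saturated -> small weight
                        [0.20, 0.40, 0.50]])  # improving -> large weight
print(w)
```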
[740] Boltzmann Graph Ensemble Embeddings for Aptamer Libraries
Starlika Bauskar, Jade Jiao, Narayanan Kannan, Alexander Kimm, Justin M. Baker, Matthew J. Tyler, Andrea L. Bertozzi, Anne M. Andrews
Main category: cs.LG
TL;DR: Introduces a thermodynamically parameterized exponential-family random graph model (ERGM) embedding that models molecules as Boltzmann-weighted ensembles of interaction graphs, enabling robust community detection and subgraph-level explanations for aptamer-ligand affinity despite experimental biases.
Details
Motivation: Traditional machine-learning methods in biochemistry use single minimal free energy structures, but experimental biases in SELEX datasets can obscure true aptamer-ligand affinity, producing anomalous candidates where observed abundance diverges from actual binding strength.
Method: Thermodynamically parameterized exponential-family random graph model (ERGM) embedding that models molecules as Boltzmann-weighted ensembles of interaction graphs rather than single structures.
Result: The proposed embedding enables robust community detection and subgraph-level explanations for aptamer-ligand affinity, even in the presence of biased observations from experimental procedures like PCR amplification or sequencing noise.
Conclusion: This approach can identify low-abundance aptamer candidates for further experimental evaluation, providing a more robust method for analyzing molecular interactions in the presence of experimental biases.
Abstract: Machine-learning methods in biochemistry commonly represent molecules as graphs of pairwise intermolecular interactions for property and structure predictions. Most methods operate on a single graph, typically the minimal free energy (MFE) structure, rather than the low-energy ensembles (conformations) representative of structures at thermodynamic equilibrium. We introduce a thermodynamically parameterized exponential-family random graph model (ERGM) embedding that models molecules as Boltzmann-weighted ensembles of interaction graphs. We evaluate this embedding on SELEX datasets, where experimental biases (e.g., PCR amplification or sequencing noise) can obscure true aptamer-ligand affinity, producing anomalous candidates whose observed abundance diverges from their actual binding strength. We show that the proposed embedding enables robust community detection and subgraph-level explanations for aptamer-ligand affinity, even in the presence of biased observations. This approach may be used to identify low-abundance aptamer candidates for further experimental evaluation.
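A minimal sketch of the Boltzmann-weighted ensemble averaging (the ERGM parameterization itself is not reproduced); `graph_features` is a hypothetical per-structure feature matrix:

```python
import numpy as np

def boltzmann_ensemble_embedding(graph_features, energies, kT=0.593):
    # Boltzmann-weight each candidate structure (kT ~ 0.593 kcal/mol at 298 K)
    # and average its graph-feature vector; energies are shifted for stability.
    E = np.asarray(energies, dtype=float)
    w = np.exp(-(E - E.min()) / kT)
    w /= w.sum()
    return w @ np.asarray(graph_features, dtype=float)

# Three candidate secondary structures, each with a 4-dim feature vector.
emb = boltzmann_ensemble_embedding(np.random.rand(3, 4), [-7.2, -6.8, -5.1])
print(emb.shape)   # (4,)
```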
[741] Deep Learning on Real-World Graphs
Emanuele Rossi
Main category: cs.LG
TL;DR: This thesis presents a series of GNN models addressing key limitations: SIGN for scalability, TGN for temporal graphs, Dir-GNN for directed networks, Feature Propagation for missing features, and NuGget for structural inference.
Details
Motivation: To bridge the gap between academic GNN benchmarks and real-world industrial applications by addressing scalability, temporality, directionality, data incompleteness, and structural uncertainty challenges.
Method: Developed multiple specialized GNN models: SIGN for scalable learning, TGN for temporal graphs, Dir-GNN for directed/heterophilic networks, Feature Propagation for missing node features, and NuGget for game-theoretic structural inference.
Result: The proposed models successfully address key limitations of traditional GNNs, enabling their application to real-world systems with industrial-scale graphs.
Conclusion: These contributions collectively bridge the gap between academic research and practical deployment, making GNNs applicable to domains like social networks and recommender systems.
Abstract: Graph Neural Networks (GNNs) have become a central tool for learning on graph-structured data, yet their applicability to real-world systems remains limited by key challenges such as scalability, temporality, directionality, data incompleteness, and structural uncertainty. This thesis introduces a series of models addressing these limitations: SIGN for scalable graph learning, TGN for temporal graphs, Dir-GNN for directed and heterophilic networks, Feature Propagation (FP) for learning with missing node features, and NuGget for game-theoretic structural inference. Together, these contributions bridge the gap between academic benchmarks and industrial-scale graphs, enabling the use of GNNs in domains such as social and recommender systems.
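Of the models above, Feature Propagation admits the shortest sketch: diffuse features over the normalized adjacency and clamp the observed entries each step. A minimal version, assuming a dense adjacency for brevity:

```python
import numpy as np

def feature_propagation(adj, x, known_mask, n_iter=40):
    # Diffuse features over the symmetrically normalized adjacency, clamping
    # observed entries after every step so information flows into the gaps.
    deg = adj.sum(axis=1)
    d = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    a_hat = d[:, None] * adj * d[None, :]
    x_known = x.copy()
    x = np.where(known_mask, x, 0.0)
    for _ in range(n_iter):
        x = a_hat @ x
        x = np.where(known_mask, x_known, x)
    return x

adj = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
x = np.array([[1.0], [2.0], [np.nan]])        # node 2's feature is missing
mask = ~np.isnan(x)
print(feature_propagation(adj, np.nan_to_num(x), mask))
```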
[742] Is Temporal Difference Learning the Gold Standard for Stitching in RL?
Michał Bortkiewicz, Władysław Pałucki, Mateusz Ostaszewski, Benjamin Eysenbach
Main category: cs.LG
TL;DR: Monte Carlo methods can achieve experience stitching in RL with function approximation, challenging conventional wisdom that only TD methods can recombine experience.
Details
Motivation: To empirically test whether the conventional wisdom about temporal difference methods being necessary for experience stitching actually holds when using function approximation in reinforcement learning.
Method: Empirical study comparing Monte Carlo and temporal difference methods with different neural network capacities on various RL tasks, analyzing their ability to achieve experience stitching.
Result: MC methods can achieve experience stitching, with TD methods showing only slightly stronger capabilities. The gap between small and large neural networks is much larger than the gap between MC and TD methods. Increasing critic capacity reduces generalization gap for both methods.
Conclusion: The traditional TD inductive bias for stitching may be less necessary in the era of large models, and stitching might be achieved through scale rather than specialized algorithms like TD learning.
Abstract: Reinforcement learning (RL) promises to solve long-horizon tasks even when training data contains only short fragments of the behaviors. This experience stitching capability is often viewed as the purview of temporal difference (TD) methods. However, outside of small tabular settings, trajectories never intersect, calling into question this conventional wisdom. Moreover, the common belief is that Monte Carlo (MC) methods should not be able to recombine experience, yet it remains unclear whether function approximation could result in a form of implicit stitching. The goal of this paper is to empirically study whether the conventional wisdom about stitching actually holds in settings where function approximation is used. We empirically demonstrate that Monte Carlo (MC) methods can also achieve experience stitching. While TD methods do achieve slightly stronger capabilities than MC methods (in line with conventional wisdom), that gap is significantly smaller than the gap between small and large neural networks (even on quite simple tasks). We find that increasing critic capacity effectively reduces the generalization gap for both the MC and TD methods. These results suggest that the traditional TD inductive bias for stitching may be less necessary in the era of large models for RL and, in some cases, may offer diminishing returns. Additionally, our results suggest that stitching, a form of generalization unique to the RL setting, might be achieved not through specialized algorithms (temporal difference learning) but rather through the same recipe that has provided generalization in other machine learning settings (via scale). Project website: https://michalbortkiewicz.github.io/golden-standard/
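The contrast under study reduces to the two regression targets; a minimal sketch:

```python
def td_target(r, gamma, q_next):
    # Bootstrapped one-step target: stitching happens through the value function.
    return r + gamma * q_next

def mc_target(rewards, gamma):
    # Full-return target: no bootstrapping, so any "stitching" must come from
    # generalization in the function approximator itself.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(td_target(1.0, 0.99, 5.0))           # 5.95
print(mc_target([1.0, 0.0, 1.0], 0.99))    # 1.9801
```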
[743] From Black-box to Causal-box: Towards Building More Interpretable Models
Inwoo Hwang, Yushu Pan, Elias Bareinboim
Main category: cs.LG
TL;DR: The paper introduces causal interpretability for deep learning models, showing that common model classes cannot answer counterfactual queries in general, and develops a framework for building causally interpretable models with a tradeoff between interpretability and accuracy.
Details
Motivation: Deep learning models lack interpretability, especially for counterfactual questions in high-stakes applications, making it challenging to understand model reasoning beyond observed data.
Method: Developed a framework for causally interpretable models by deriving a complete graphical criterion to determine if model architectures support counterfactual queries, identifying the maximal feature set for interpretability and predictive expressiveness.
Result: Theoretical analysis shows neither blackbox nor concept-based predictors are causally interpretable in general, but the proposed framework enables building interpretable models with characterized tradeoffs between interpretability and accuracy.
Conclusion: Causal interpretability is achievable through careful model design, with a fundamental tradeoff between interpretability and predictive accuracy that can be systematically characterized and optimized.
Abstract: Understanding the predictions made by deep learning models remains a central challenge, especially in high-stakes applications. A promising approach is to equip models with the ability to answer counterfactual questions – hypothetical "what if?" scenarios that go beyond the observed data and provide insight into a model's reasoning. In this work, we introduce the notion of causal interpretability, which formalizes when counterfactual queries can be evaluated from a specific class of models and observational data. We analyze two common model classes – blackbox and concept-based predictors – and show that neither is causally interpretable in general. To address this gap, we develop a framework for building models that are causally interpretable by design. Specifically, we derive a complete graphical criterion that determines whether a given model architecture supports a given counterfactual query. This leads to a fundamental tradeoff between causal interpretability and predictive accuracy, which we characterize by identifying the unique maximal set of features that yields an interpretable model with maximal predictive expressiveness. Experiments corroborate the theoretical findings.
[744] Optimal Detection for Language Watermarks with Pseudorandom Collision
T. Tony Cai, Xiang Li, Qi Long, Weijie J. Su, Garrett G. Wen
Main category: cs.LG
TL;DR: A statistical framework for text watermark detection that addresses structured dependence from text repetition, ensuring rigorous Type I error control and improved detection power.
Details
Motivation: Existing watermarking methods assume perfect pseudorandomness, but repetition in generated text creates structured dependence that compromises Type I error control and invalidates standard analyses.
Method: Introduces a hierarchical two-layer partition with minimal units - the smallest groups treatable as independent across units while permitting dependence within. Defines non-asymptotic efficiency measure and casts watermark detection as minimax hypothesis testing problem.
Result: Applied to Gumbel-max and inverse-transform watermarks, produces closed-form optimal rules. Shows discarding repeated statistics improves performance and within-unit dependence must be addressed unless degenerate. Theory and experiments confirm improved detection power with rigorous Type I error control.
Conclusion: Provides first principled foundation for watermark detection under imperfect pseudorandomness, offering both theoretical insight and practical guidance for reliable tracing of model outputs.
Abstract: Text watermarking plays a crucial role in ensuring the traceability and accountability of large language model (LLM) outputs and mitigating misuse. While promising, most existing methods assume perfect pseudorandomness. In practice, repetition in generated text induces collisions that create structured dependence, compromising Type I error control and invalidating standard analyses. We introduce a statistical framework that captures this structure through a hierarchical two-layer partition. At its core is the concept of minimal units – the smallest groups treatable as independent across units while permitting dependence within. Using minimal units, we define a non-asymptotic efficiency measure and cast watermark detection as a minimax hypothesis testing problem. Applied to Gumbel-max and inverse-transform watermarks, our framework produces closed-form optimal rules. It explains why discarding repeated statistics often improves performance and shows that within-unit dependence must be addressed unless degenerate. Both theory and experiments confirm improved detection power with rigorous Type I error control. These results provide the first principled foundation for watermark detection under imperfect pseudorandomness, offering both theoretical insight and practical guidance for reliable tracing of model outputs.
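A hedged toy sketch of the discard-repeats idea for a Gumbel-max-style detector, where `u[(key, token)]` stands in for the keyed pseudo-uniform used at generation time and each unique `(key, token)` pair plays the role of a minimal unit; the paper's optimal rules are more refined than this:

```python
import math

def detection_score(tokens, keys, u):
    # Keep one Gumbel-max statistic per unique (key, token) unit; tokens that
    # replay the same pseudorandom key contribute an identical, uninformative
    # copy of the statistic, so duplicates are discarded rather than summed.
    seen, score, n = set(), 0.0, 0
    for tok, key in zip(tokens, keys):
        unit = (key, tok)
        if unit in seen:
            continue
        seen.add(unit)
        score += -math.log(1.0 - u[unit])   # per-token watermark evidence
        n += 1
    return score / n                        # compare against a null threshold

u = {("k1", "the"): 0.97, ("k2", "cat"): 0.95}   # keyed pseudo-uniforms
print(detection_score(["the", "cat", "the"], ["k1", "k2", "k1"], u))
```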
[745] A Multimodal Human Protein Embeddings Database: DeepDrug Protein Embeddings Bank (DPEB)
Md Saiful Islam Sajol, Magesh Rajasekaran, Hayden Gemeinhardt, Adam Bess, Chris Alvin, Supratik Mukhopadhyay
Main category: cs.LG
TL;DR: DPEB is a curated collection of 22,043 human proteins that integrates four embedding types (structural, transformer-based sequence, contextual amino acid patterns, and sequence-based n-gram statistics) to enable computational prediction of protein-protein interactions.
Details
Motivation: There is a lack of integrated, multimodal protein representations for computationally predicting protein-protein interactions (PPIs), and AlphaFold2 neural network embeddings are not publicly available despite protein structures being accessible.
Method: DPEB integrates four embedding types: structural (AlphaFold2), transformer-based sequence (BioEmbeddings), contextual amino acid patterns (ESM-2), and sequence-based n-gram statistics (ProtVec). It supports multiple graph neural network methods for PPI prediction.
Result: GraphSAGE with BioEmbedding achieved the highest PPI prediction performance (87.37% AUROC, 79.16% accuracy). The framework also achieved 77.42% accuracy for enzyme classification and 86.04% accuracy for protein family classification.
Conclusion: DPEB provides AlphaFold2-derived embeddings for computational modeling and enables applications in systems biology, drug target identification, pathway analysis, and disease mechanism studies through its multimodal protein representation framework.
Abstract: Computationally predicting protein-protein interactions (PPIs) is challenging due to the lack of integrated, multimodal protein representations. DPEB is a curated collection of 22,043 human proteins that integrates four embedding types: structural (AlphaFold2), transformer-based sequence (BioEmbeddings), contextual amino acid patterns (ESM-2: Evolutionary Scale Modeling), and sequence-based n-gram statistics (ProtVec). AlphaFold2 protein structures are available through public databases (e.g., AlphaFold2 Protein Structure Database), but the internal neural network embeddings are not. DPEB addresses this gap by providing AlphaFold2-derived embeddings for computational modeling. Our benchmark evaluations show GraphSAGE with BioEmbedding achieved the highest PPI prediction performance (87.37% AUROC, 79.16% accuracy). The framework also achieved 77.42% accuracy for enzyme classification and 86.04% accuracy for protein family classification. DPEB supports multiple graph neural network methods for PPI prediction, enabling applications in systems biology, drug target identification, pathway analysis, and disease mechanism studies.
[746] Cost-Sensitive Evaluation for Binary Classifiers
Pierangelo Lombardo, Antonio Casoli, Cristian Cingolani, Shola Oshodi, Michele Zanatta
Main category: cs.LG
TL;DR: Proposes Weighted Accuracy (WA) as an evaluation metric for binary classifiers that minimizes Total Classification Cost (TCC), providing an alternative to class rebalancing techniques in cost-sensitive scenarios.
Details
Motivation: There is no consensus on universally accepted evaluation metrics for classifiers, and misconceptions exist about the need to mitigate class imbalance. The goal is to maximize return on investment by minimizing Total Classification Cost.
Method: Defines Weighted Accuracy (WA) as a weighted version of accuracy that aligns with TCC minimization. Provides a framework for handling class imbalance in cost-sensitive scenarios and proposes a procedure to estimate WA weight parameters when costs are not fully specified.
Result: WA provides a straightforward interpretation as a weighted accuracy metric coherent with TCC minimization. The framework allows comparison across datasets and addresses discrepancies between development and target datasets.
Conclusion: WA serves as an effective evaluation metric for binary classifiers that directly minimizes Total Classification Cost, offering a principled alternative to class rebalancing techniques in cost-sensitive classification scenarios.
Abstract: Selecting an appropriate evaluation metric for classifiers is crucial for model comparison and parameter optimization, yet there is no consensus on a universally accepted metric that serves as a definitive standard. Moreover, there is often a misconception about the perceived need to mitigate imbalance in datasets used to train classification models. Since the final goal in classifier optimization is typically maximizing the return on investment or, equivalently, minimizing the Total Classification Cost (TCC), we define Weighted Accuracy (WA), an evaluation metric for binary classifiers with a straightforward interpretation as a weighted version of the well-known accuracy metric, coherent with the need of minimizing TCC. We clarify the conceptual framework for handling class imbalance in cost-sensitive scenarios, providing an alternative to rebalancing techniques. This framework can be applied to any metric that, like WA, can be expressed as a linear combination of example-dependent quantities and allows for comparing the results obtained in different datasets and for addressing discrepancies between the development dataset, used to train and validate the model, and the target dataset, where the model will be deployed. It also specifies in which scenarios using UCC-unaware class rebalancing techniques or rebalancing metrics aligns with TCC minimization and when it is instead counterproductive. Finally, we propose a procedure to estimate the WA weight parameter in the absence of fully specified UCCs and demonstrate the robustness of WA by analyzing its correlation with TCC in example-dependent scenarios.
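A hedged sketch of the idea, with an illustrative weighting w = c_FN / (c_FN + c_FP); the paper's exact definition of WA and its weight-estimation procedure are not reproduced here:

```python
def weighted_accuracy(tp, fp, tn, fn, w):
    # Class-weighted accuracy: w in (0, 1) trades off the two error types.
    pos, neg = tp + fn, tn + fp
    return w * tp / pos + (1 - w) * tn / neg

def total_classification_cost(fp, fn, c_fp, c_fn):
    # Only error costs here; per-example gains could enter the same way.
    return c_fp * fp + c_fn * fn

# Illustrative weight: costly false negatives push w toward 1, so recall
# on the positive class dominates the metric.
c_fp, c_fn = 1.0, 9.0
w = c_fn / (c_fn + c_fp)
print(weighted_accuracy(tp=80, fp=30, tn=70, fn=20, w=w))   # 0.79
print(total_classification_cost(fp=30, fn=20, c_fp=c_fp, c_fn=c_fn))
```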
[747] Do You Trust the Process?: Modeling Institutional Trust for Community Adoption of Reinforcement Learning Policies
Naina Balepur, Xingrui Pei, Hari Sundaram
Main category: cs.LG
TL;DR: Developed a trust-aware RL algorithm for resource allocation that incorporates institutional trust dynamics, showing trade-offs between organizational success and community fairness.
Details
Motivation: Existing RL approaches assume citizens follow policies, but institutional trust is crucial for policy adherence. Need to incorporate trust dynamics in algorithmic decision-making for community resource allocation.
Method: Used Deep Deterministic Policy Gradient to learn resource allocation policies, simulated trust changes in community members, and implemented quota systems to prevent unfair outcomes.
Result: Trust-aware RL leads to more successful policies when organization goals are uncertain. Conservative trust estimates increase fairness and community trust but reduce organizational success. Quota systems can improve fairness but decrease organizational utility.
Conclusion: Institutional trust is critical in algorithm design, revealing tension between organizational success and community well-being. Trust-aware approaches can improve policy outcomes but require balancing competing objectives.
Abstract: Many governmental bodies are adopting AI policies for decision-making. In particular, Reinforcement Learning has been used to design policies that citizens would be expected to follow if implemented. Much RL work assumes that citizens follow these policies, and evaluate them with this in mind. However, we know from prior work that without institutional trust, citizens will not follow policies put in place by governments. In this work, we develop a trust-aware RL algorithm for resource allocation in communities. We consider the case of humanitarian engineering, where the organization is aiming to distribute some technology or resource to community members. We use a Deep Deterministic Policy Gradient approach to learn a resource allocation that fits the needs of the organization. Then, we simulate resource allocation according to the learned policy, and model the changes in institutional trust of community members. We investigate how this incorporation of institutional trust affects outcomes, and ask how effectively an organization can learn policies if trust values are private. We find that incorporating trust into RL algorithms can lead to more successful policies, specifically when the organization’s goals are less certain. We find more conservative trust estimates lead to increased fairness and average community trust, though organization success suffers. Finally, we explore a strategy to prevent unfair outcomes to communities. We implement a quota system by an external entity which decreases the organization’s utility when it does not serve enough community members. We find this intervention can improve fairness and trust among communities in some cases, while decreasing the success of the organization. This work underscores the importance of institutional trust in algorithm design and implementation, and identifies a tension between organization success and community well-being.
[748] K-DAREK: Distance Aware Error for Kurkova Kolmogorov Networks
Masoud Ataei, Vikas Dhiman, Mohammad Javad Khojasteh
Main category: cs.LG
TL;DR: K-DAREK is a novel learning algorithm that enhances KKANs with distance-aware error bounds for efficient and interpretable function approximation with uncertainty quantification, showing significant improvements in speed, scalability, and safety compared to existing methods.
Details
Motivation: To address computational limitations of Gaussian processes for large-scale problems and improve upon existing neural architectures (KANs and KKANs) by developing a more efficient, stable, and interpretable function approximation method with uncertainty quantification.
Method: Enhanced KKAN architecture with distance-aware error bounds that reflect test point proximity to training data, using a combination of Chebyshev layers, multi-layer perceptrons, and spline-based transformations.
Result: K-DAREK trains about 4x faster and is 10x more computationally efficient than ensembles of KANs, scales 8.6x better than GPs as data size increases, and is 50% safer than the previous DAREK method.
Conclusion: K-DAREK provides an effective framework for efficient and interpretable function approximation with robust uncertainty quantification, making it particularly suitable for safe control tasks and dynamical system modeling.
Abstract: Neural networks are parametric and powerful tools for function approximation, and the choice of architecture heavily influences their interpretability, efficiency, and generalization. In contrast, Gaussian processes (GPs) are nonparametric probabilistic models that define distributions over functions using a kernel to capture correlations among data points. However, these models become computationally expensive for large-scale problems, as they require inverting a large covariance matrix. Kolmogorov-Arnold networks (KANs), semi-parametric neural architectures, have emerged as a prominent approach for modeling complex functions with structured and efficient representations through spline layers. Kurkova Kolmogorov-Arnold networks (KKANs) extend this idea by reducing the number of spline layers in KAN and replacing them with Chebyshev layers and multi-layer perceptrons, thereby mapping inputs into higher-dimensional spaces before applying spline-based transformations. Compared to KANs, KKANs exhibit more stable convergence during training, making them a strong architecture for estimating operators and system modeling in dynamical systems. By enhancing the KKAN architecture, we develop a novel learning algorithm, distance-aware error for Kurkova-Kolmogorov networks (K-DAREK), for efficient and interpretable function approximation with uncertainty quantification. Our approach establishes robust error bounds that are distance-aware; this means they reflect the proximity of a test point to its nearest training points. Through case studies on a safe control task, we demonstrate that K-DAREK is about four times faster and ten times more computationally efficient than an ensemble of KANs, 8.6 times more scalable than GPs as the data size increases, and 50% safer than our previous work, distance-aware error for Kolmogorov networks (DAREK).
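A hedged sketch of what "distance-aware" means, using a generic Lipschitz-style bound rather than K-DAREK's architecture-derived one:

```python
import numpy as np

def distance_aware_bound(x_test, x_train, eps=0.05, lipschitz=2.0):
    # Base error eps plus a term growing with the distance to the nearest
    # training point, so the bound widens away from the data. K-DAREK's
    # actual bound comes from the KKAN architecture, not this generic form.
    d = np.linalg.norm(x_train - x_test, axis=1).min()
    return eps + lipschitz * d

X_train = np.random.rand(100, 2)
print(distance_aware_bound(np.array([0.5, 0.5]), X_train))   # near data: tight
print(distance_aware_bound(np.array([5.0, 5.0]), X_train))   # far away: loose
```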
[749] Normalization in Attention Dynamics
Nikita Karagodin, Shu Ge, Yury Polyanskiy, Philippe Rigollet
Main category: cs.LG
TL;DR: The paper analyzes how different normalization schemes affect token representations in transformers, showing they act as speed regulators on the sphere and identifying Peri-LN as the most effective scheme.
Details
Motivation: To understand how various normalization schemes influence token representation dynamics and clustering behavior in deep transformers, providing a unified analytical framework.
Method: Modeling token representations as interacting particles on the sphere and analyzing normalization as speed regulation, comparing Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT, and LN-Scaling schemes.
Result: The framework reveals how different normalization schemes influence clustering dynamics and representation collapse, with Peri-LN emerging as a particularly effective choice.
Conclusion: Normalization schemes act as speed regulators that shape token representations across layers, and Peri-LN is identified as the most effective among the analyzed schemes.
Abstract: We study the effect of normalization schemes on token representations in deep transformers. Modeling their evolution as interacting particles on the sphere, we show that normalization acts as a form of speed regulation. This perspective enables a unified analysis of several schemes – including Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT, and LN-Scaling – revealing how they influence clustering dynamics and representation collapse. Our framework clarifies how different schemes shape token representations across layers and provides a principled basis for comparing them, identifying Peri-LN as a particularly effective choice.
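A toy simulation of the particle picture, assuming a simple attention-style interaction with post-update projection to the sphere (only one of the LN placements the paper analyzes):

```python
import numpy as np

def attention_step_on_sphere(X, beta=4.0, eta=0.1):
    # Each token drifts toward an attention-weighted mean of the others, then
    # is renormalized: the projection caps the per-layer "speed" of motion.
    W = np.exp(beta * X @ X.T)
    W /= W.sum(axis=1, keepdims=True)
    X = X + eta * (W @ X)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

X = np.random.randn(32, 8)
X /= np.linalg.norm(X, axis=1, keepdims=True)
for _ in range(100):
    X = attention_step_on_sphere(X)
print((X @ X.T).mean())   # mean cosine rises toward 1 as tokens cluster
```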
[750] Online Optimization for Offline Safe Reinforcement Learning
Yassine Chemingui, Aryan Deshwal, Alan Fern, Thanh Nguyen-Tang, Janardhan Rao Doppa
Main category: cs.LG
TL;DR: A novel offline safe RL approach using minimax objective with offline RL and online optimization, proven to be approximately optimal and practical without requiring offline policy evaluation.
Details
Motivation: To solve the problem of Offline Safe Reinforcement Learning (OSRL) where the goal is to learn reward-maximizing policies from fixed data while satisfying cumulative cost constraints.
Method: Frames OSRL as a minimax objective and solves it by combining offline RL with online optimization algorithms. Provides a practical approximation that works with any offline RL algorithm without needing offline policy evaluation.
Result: Empirical results on DSRL benchmark show the method reliably enforces safety constraints under stringent cost budgets while achieving high rewards.
Conclusion: The proposed approach effectively solves offline safe RL problems by combining offline RL with online optimization, providing both theoretical guarantees and practical implementation.
Abstract: We study the problem of Offline Safe Reinforcement Learning (OSRL), where the goal is to learn a reward-maximizing policy from fixed data under a cumulative cost constraint. We propose a novel OSRL approach that frames the problem as a minimax objective and solves it by combining offline RL with online optimization algorithms. We prove the approximate optimality of this approach when integrated with an approximate offline RL oracle and no-regret online optimization. We also present a practical approximation that can be combined with any offline RL algorithm, eliminating the need for offline policy evaluation. Empirical results on the DSRL benchmark demonstrate that our method reliably enforces safety constraints under stringent cost budgets, while achieving high rewards. The code is available at https://github.com/yassineCh/O3SRL.
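A minimal sketch of the online-optimization half, assuming plain projected dual ascent as the no-regret algorithm and arbitrary cost estimates per round:

```python
def dual_ascent_step(lmbda, est_cost, budget, lr=0.01):
    # Projected gradient ascent on the multiplier: raise it when the
    # current policy's estimated cost exceeds the budget, floor at zero.
    return max(0.0, lmbda + lr * (est_cost - budget))

lmbda, budget = 0.0, 10.0
for est_cost in [14.0, 12.0, 9.5, 9.8]:   # cost estimates of successive policies
    # Inner step (not shown): run any offline RL algorithm on the
    # Lagrangian reward r - lmbda * c induced by the current multiplier.
    lmbda = dual_ascent_step(lmbda, est_cost, budget)
print(lmbda)
```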
[751] Differentiable Constraint-Based Causal Discovery
Jincheng Zhou, Mengbo Wang, Anqi He, Yumeng Zhou, Hessam Olya, Murat Kocaoglu, Bruno Ribeiro
Main category: cs.LG
TL;DR: The paper introduces a new causal discovery method using differentiable d-separation scores derived from percolation theory with soft logic, enabling gradient-based optimization of conditional independence constraints.
Details
Motivation: Existing causal discovery methods have limitations: constraint-based approaches struggle with small sample sizes, while score-based methods lack explicit conditional independence testing. The paper aims to bridge this gap by exploring a third approach.
Method: Develops differentiable d-separation scores using percolation theory with soft logic, enabling gradient-based optimization of conditional independence constraints for causal discovery.
Result: Empirical evaluations show robust performance in low-sample regimes, outperforming traditional constraint-based and score-based baselines on real-world datasets.
Conclusion: The proposed gradient-based optimization approach using differentiable d-separation scores provides an effective alternative to existing causal discovery methods, particularly in scenarios with limited data.
Abstract: Causal discovery from observational data is a fundamental task in artificial intelligence, with far-reaching implications for decision-making, predictions, and interventions. Despite significant advances, existing methods can be broadly categorized as constraint-based or score-based approaches. Constraint-based methods offer rigorous causal discovery but are often hindered by small sample sizes, while score-based methods provide flexible optimization but typically forgo explicit conditional independence testing. This work explores a third avenue: developing differentiable $d$-separation scores, obtained through percolation theory using soft logic. This enables the implementation of a new type of causal discovery method: gradient-based optimization of conditional independence constraints. Empirical evaluations demonstrate the robust performance of our approach in low-sample regimes, surpassing traditional constraint-based and score-based baselines on a real-world dataset. Code and data of the proposed method are publicly available at https://github.com/PurdueMINDS/DAGPA.
[752] Linearized Optimal Transport for Analysis of High-Dimensional Point-Cloud and Single-Cell Data
Tianxiang Wang, Yingtong Ke, Dhananjay Bhaskar, Smita Krishnaswamy, Alexander Cloninger
Main category: cs.LG
TL;DR: LOT framework embeds single-cell point clouds into Euclidean space for interpretable analysis, classification, and synthetic data generation while preserving distributional structure.
Details
Motivation: Single-cell technologies produce irregular point clouds that are hard to compare between patients, and existing nonlinear methods lack interpretability despite good predictive performance.
Method: Adapt Linear Optimal Transport (LOT) framework to embed point clouds into fixed-dimensional Euclidean space, preserving optimal transport geometry and enabling registration between patients.
Result: LOT enables accurate and interpretable COVID-19 classification with weights mapping to specific markers, and synthetic data generation for organoids using LOT barycenters for drug testing.
Conclusion: LOT provides a unified framework bridging predictive performance, interpretability, and generative modeling by transforming point clouds into structured, traceable embeddings.
Abstract: Single-cell technologies generate high-dimensional point clouds of cells, enabling detailed characterization of complex patient states and treatment responses. Yet each patient is represented by an irregular point cloud rather than a simple vector, making it difficult to directly quantify and compare biological differences between individuals. Nonlinear methods such as kernels and neural networks achieve predictive accuracy but act as black boxes, offering little biological interpretability. To address these limitations, we adapt the Linear Optimal Transport (LOT) framework to this setting, embedding irregular point clouds into a fixed-dimensional Euclidean space while preserving distributional structure. This embedding provides a principled linear representation that preserves optimal transport geometry while enabling downstream analysis. It also forms a registration between any two patients, enabling direct comparison of their cellular distributions. Within this space, LOT enables: (i) accurate and interpretable classification of COVID-19 patient states, where classifier weights map back to specific markers and spatial regions driving predictions; and (ii) synthetic data generation for patient-derived organoids, exploiting the linearity of the LOT embedding. LOT barycenters yield averaged cellular profiles representing combined conditions or samples, supporting drug interaction testing. Together, these results establish LOT as a unified framework that bridges predictive performance, interpretability, and generative modeling. By transforming heterogeneous point clouds into structured embeddings directly traceable to the original data, LOT opens new opportunities for understanding immune variation and treatment effects in high-dimensional biological systems.
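A hedged sketch of a LOT-style embedding using the POT library, assuming a shared reference cloud and a displacement-field representation; the paper's exact construction may differ:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def lot_embed(cloud, reference):
    # Solve exact OT from a fixed reference cloud to the patient's cloud and
    # use the barycentric projection of each reference point as coordinates.
    a, b = ot.unif(len(reference)), ot.unif(len(cloud))
    plan = ot.emd(a, b, ot.dist(reference, cloud))   # squared-Euclidean cost
    mapped = (plan @ cloud) / a[:, None]             # barycentric projection
    return (mapped - reference).ravel()              # fixed-dim displacement

ref = np.random.randn(50, 10)                        # shared reference cloud
patient_a = np.random.randn(300, 10)                 # irregular patient clouds
patient_b = np.random.randn(120, 10)
ea, eb = lot_embed(patient_a, ref), lot_embed(patient_b, ref)
print(np.linalg.norm(ea - eb))                       # plain Euclidean distance
```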
[753] Generalized Top-k Mallows Model for Ranked Choices
Shahrzad Haddadan, Sara Ahmadian
Main category: cs.LG
TL;DR: The paper addresses limitations of the classic Mallows model by focusing on generalized top-k Mallows models, providing new sampling schemes, efficient choice probability computation, and active learning for parameter estimation.
Details
Motivation: The classic Mallows model fails to capture real-world scenarios where users focus only on preferred items and are indifferent to others. Top-k Mallows models better align with practical applications.
Method: Developed a novel sampling scheme for generalized top-k Mallows models, an efficient algorithm for computing choice probabilities, and an active learning algorithm for parameter estimation from observed choice data.
Result: The proposed methods demonstrate scalability and accuracy through extensive experiments on synthetic and real-world data. The paper compares predictive power of top-k Mallows models versus Multinomial Logit models.
Conclusion: The contributions provide new tools for analysis and prediction in critical decision-making scenarios, with rigorous mathematical analysis supporting algorithm performance.
Abstract: The classic Mallows model is a foundational tool for modeling user preferences. However, it has limitations in capturing real-world scenarios, where users often focus only on a limited set of preferred items and are indifferent to the rest. To address this, extensions such as the top-k Mallows model have been proposed, aligning better with practical applications. In this paper, we address several challenges related to the generalized top-k Mallows model, with a focus on analyzing buyer choices. Our key contributions are: (1) a novel sampling scheme tailored to generalized top-k Mallows models, (2) an efficient algorithm for computing choice probabilities under this model, and (3) an active learning algorithm for estimating the model parameters from observed choice data. These contributions provide new tools for analysis and prediction in critical decision-making scenarios. We present a rigorous mathematical analysis for the performance of our algorithms. Furthermore, through extensive experiments on synthetic data and real-world data, we demonstrate the scalability and accuracy of our proposed methods, and we compare the predictive power of the Mallows model for top-k lists with that of the simpler Multinomial Logit model.
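For orientation, the classic (full-ranking) Mallows model is commonly sampled by repeated insertion; a minimal sketch follows, noting that the paper's contribution is a sampler for the generalized top-k variant, which this does not implement:

```python
import numpy as np

def sample_mallows(n, phi, rng=None):
    # Repeated insertion: item i goes to position j (1-indexed, j <= i) with
    # probability proportional to phi**(i - j); phi -> 0 yields the identity.
    rng = np.random.default_rng() if rng is None else rng
    ranking = []
    for i in range(1, n + 1):
        probs = phi ** np.arange(i - 1, -1, -1)
        j = rng.choice(i, p=probs / probs.sum())
        ranking.insert(j, i)
    return ranking

print(sample_mallows(5, phi=0.3))
```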
[754] Fast Non-Log-Concave Sampling under Nonconvex Equality and Inequality Constraints with Landing
Kijung Jeon, Michael Muehlebach, Molei Tao
Main category: cs.LG
TL;DR: OLLA is a new constrained sampling framework that handles both equality and inequality constraints without explicit projections, using deterministic corrections along constraint surfaces with exponential convergence guarantees.
Details
Motivation: Existing constrained sampling methods are limited to either equality or inequality constraints, require costly projections, and lack rigorous convergence guarantees for nonconvex constraint sets.
Method: Overdamped Langevin with LAnding (OLLA) dynamics that deterministically corrects trajectories along the normal direction of constraint surfaces, accommodating both equality and inequality constraints simultaneously.
Result: OLLA converges exponentially fast in W2 distance to the constrained target density under suitable regularity conditions, and demonstrates superior efficiency compared to projection-based methods and slack variable variants.
Conclusion: OLLA provides an efficient and theoretically sound framework for constrained sampling that overcomes limitations of existing methods by handling both constraint types without projections while guaranteeing exponential convergence.
Abstract: Sampling from constrained statistical distributions is a fundamental task in various fields including Bayesian statistics, computational chemistry, and statistical physics. This article considers the cases where the constrained distribution is described by an unconstrained density, as well as additional equality and/or inequality constraints, which often make the constraint set nonconvex. Existing methods for nonconvex constraint set $\Sigma \subset \mathbb{R}^d$ defined by equality or inequality constraints commonly rely on costly projection steps. Moreover, they cannot handle equality and inequality constraints simultaneously, as each method specializes in only one case. In addition, rigorous and quantitative convergence guarantees are often lacking. In this paper, we introduce Overdamped Langevin with LAnding (OLLA), a new framework that can design overdamped Langevin dynamics accommodating both equality and inequality constraints. The proposed dynamics also deterministically corrects trajectories along the normal direction of the constraint surface, thus obviating the need for explicit projections. We show that, under suitable regularity conditions on the target density and $\Sigma$, OLLA converges exponentially fast in $W_2$ distance to the constrained target density $\rho_\Sigma(x) \propto \exp(-f(x))d\sigma_\Sigma$. Lastly, through experiments, we demonstrate the efficiency of OLLA compared to projection-based constrained Langevin algorithms and their slack variable variants, highlighting its favorable computational cost and reasonable empirical mixing.
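A crude Euler-Maruyama sketch of the landing idea for a single equality constraint; OLLA's actual dynamics, inequality handling, and guarantees are richer than this:

```python
import numpy as np

def olla_like_step(x, grad_f, g, grad_g, eta=1e-3, beta=50.0, rng=None):
    # For one equality constraint g(x) = 0: the extra -beta * g(x) * grad_g(x)
    # drift pulls the trajectory back toward the constraint surface along its
    # normal, replacing an explicit projection.
    if rng is None:
        rng = np.random.default_rng()
    drift = -grad_f(x) - beta * g(x) * grad_g(x)
    return x + eta * drift + np.sqrt(2 * eta) * rng.standard_normal(x.shape)

# Target exp(-|x|^2 / 2) restricted near the unit circle g(x) = |x|^2 - 1.
x = np.array([2.0, 0.0])
for _ in range(5000):
    x = olla_like_step(x, grad_f=lambda v: v,
                       g=lambda v: v @ v - 1.0,
                       grad_g=lambda v: 2.0 * v)
print(x @ x)   # hovers near 1: the landing term keeps iterates on the circle
```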
[755] PFΔ: A Benchmark Dataset for Power Flow under Load, Generation, and Topology Variations
Ana K. Rivera, Anvita Bhagavathula, Alvaro Carbonero, Priya Donti
Main category: cs.LG
TL;DR: PFΔ is a benchmark dataset for power flow calculations containing 859,800 solved instances across various system sizes and contingency scenarios, used to evaluate traditional solvers and machine learning methods.
Details
Motivation: Power flow calculations are computationally intensive for real-time grid operations, and growing uncertainty from renewables and extreme weather requires tools that can efficiently simulate diverse scenarios. Current machine learning methods lack systematic assessment on realistic benchmarks.
Method: Created PFΔ dataset with 859,800 power flow instances spanning six bus system sizes, three contingency scenarios (N, N-1, N-2), and cases near voltage stability limits. Evaluated traditional solvers and GNN-based methods.
Result: The benchmark highlights key areas where existing approaches struggle and identifies open problems for future research in power flow computation.
Conclusion: PFΔ provides a comprehensive benchmark for assessing power flow methods, enabling systematic evaluation of computational approaches and advancing research in efficient grid operation tools.
Abstract: Power flow (PF) calculations are the backbone of real-time grid operations, across workflows such as contingency analysis (where repeated PF evaluations assess grid security under outages) and topology optimization (which involves PF-based searches over combinatorially large action spaces). Running these calculations at operational timescales or across large evaluation spaces remains a major computational bottleneck. Additionally, growing uncertainty in power system operations from the integration of renewables and climate-induced extreme weather also calls for tools that can accurately and efficiently simulate a wide range of scenarios and operating conditions. Machine learning methods offer a potential speedup over traditional solvers, but their performance has not been systematically assessed on benchmarks that capture real-world variability. This paper introduces PF$\Delta$, a benchmark dataset for power flow that captures diverse variations in load, generation, and topology. PF$\Delta$ contains 859,800 solved power flow instances spanning six different bus system sizes, capturing three types of contingency scenarios (N, N-1, and N-2), and including close-to-infeasible cases near steady-state voltage stability limits. We evaluate traditional solvers and GNN-based methods, highlighting key areas where existing approaches struggle, and identifying open problems for future research. Our dataset is available at https://huggingface.co/datasets/pfdelta/pfdelta/tree/main and our code with data generation scripts and model implementations is at https://github.com/MOSSLab-MIT/pfdelta.
[756] Automatic Assessment of Students’ Classroom Engagement with Bias Mitigated Multi-task Model
James Thiering, Tarun Sethupat Radha Krishna, Dylan Zelkin, Ashis Kumer Biswas
Main category: cs.LG
TL;DR: Proposed an automated system using attribute-orthogonal regularization to detect student engagement in online learning while reducing gender bias and enhancing model interpretability.
Details
Motivation: Traditional engagement assessment methods don't work well in virtual learning environments, creating a need for automated systems that avoid leveraging sensitive features like gender.
Method: Used attribute-orthogonal regularization with a split-model classifier and multiple transfer learning strategies to discourage models from using sensitive features for predictions.
Result: Significantly reduced disparity in prediction distribution for sensitive groups, improving Pearson correlation coefficient from 0.897 (unmitigated) to 0.999 (mitigated).
Conclusion: The method successfully detects student engagement while enforcing ethical standards and improving model interpretability in online learning environments.
Abstract: With the rise of online and virtual learning, monitoring and enhancing student engagement have become an important aspect of effective education. Traditional methods of assessing a student’s involvement might not be applicable directly to virtual environments. In this study, we focused on this problem and addressed the need to develop an automated system to detect student engagement levels during online learning. We proposed a novel training method which can discourage a model from leveraging sensitive features like gender for its predictions. The proposed method offers benefits not only in the enforcement of ethical standards, but also in enhancing the interpretability of the model’s predictions. We applied an attribute-orthogonal regularization technique to a split-model classifier, which uses multiple transfer learning strategies to achieve effective results in reducing disparity in the distribution of prediction for sensitivity groups from a Pearson correlation coefficient of 0.897 for the unmitigated model, to 0.999 for the mitigated model. The source code for this project is available on https://github.com/ashiskb/elearning-engagement-study.
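A hedged sketch of an attribute-orthogonal-style regularizer, penalizing the batch covariance between learned features and the sensitive attribute; the paper's exact regularizer and split-model architecture are not reproduced, and the encoder/head modules below are hypothetical stand-ins:

```python
import torch

def attribute_orthogonality_penalty(h, s):
    # Penalize the batch covariance between every learned feature and the
    # sensitive attribute, discouraging the encoder from encoding it.
    h = h - h.mean(dim=0, keepdim=True)
    s = (s - s.mean()).unsqueeze(1)
    return (h * s).mean(dim=0).pow(2).sum()

encoder = torch.nn.Linear(128, 32)        # hypothetical feature extractor
head = torch.nn.Linear(32, 4)             # engagement-level classifier
x = torch.randn(64, 128)
y = torch.randint(0, 4, (64,))
s = torch.randint(0, 2, (64,)).float()    # sensitive attribute (e.g., gender)

h = encoder(x)
loss = torch.nn.functional.cross_entropy(head(h), y) \
     + 0.1 * attribute_orthogonality_penalty(h, s)
loss.backward()
```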
[757] Pruning and Quantization Impact on Graph Neural Networks
Khatoon Khedri, Reza Rawassizadeh, Qifu Wen, Mehdi Hosseinzadeh
Main category: cs.LG
TL;DR: Empirical study of pruning and quantization methods on GNNs shows that unstructured fine-grained and global pruning can reduce model size by 50% while maintaining or improving accuracy after fine-tuning.
Details
Motivation: GNNs suffer from high computational and resource costs despite their high accuracy on graph-structured data, necessitating compression methods to reduce model size while maintaining reasonable accuracy.
Method: Empirical examination of three pruning methods and three quantization methods on different GNN models across graph classification, node classification, and link prediction tasks using three datasets (Cora, Proteins, BBBP).
Result: Unstructured fine-grained and global pruning significantly reduce model size by 50% while maintaining or improving precision after fine-tuning. Quantization methods show diverse impacts on accuracy, inference time, and model size across different datasets.
Conclusion: Pruning methods, particularly unstructured fine-grained and global pruning, are effective for GNN compression, while quantization effects vary across datasets and tasks.
Abstract: Graph neural networks (GNNs) are known to operate with high accuracy on learning from graph-structured data, but they suffer from high computational and resource costs. Neural network compression methods are used to reduce the model size while maintaining reasonable accuracy. Two of the common neural network compression techniques include pruning and quantization. In this research, we empirically examine the effects of three pruning methods and three quantization methods on different GNN models, including graph classification tasks, node classification tasks, and link prediction. We conducted all experiments on three graph datasets, including Cora, Proteins, and BBBP. Our findings demonstrate that unstructured fine-grained and global pruning can significantly reduce the model’s size (by 50%) while maintaining or even improving precision after fine-tuning the pruned model. The evaluation of different quantization methods on GNN shows diverse impacts on accuracy, inference time, and model size across different datasets.
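The best-performing settings map directly onto standard PyTorch utilities; a sketch with a stand-in linear stack (a real GNN's weight tensors would be pruned the same way):

```python
import torch
import torch.nn.utils.prune as prune

# Stand-in for a GNN layer stack; real GNN layers (e.g., PyTorch Geometric
# convolutions) expose `weight` tensors that can be pruned the same way.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Linear(64, 7))

# Global unstructured pruning: drop the 50% smallest-magnitude weights
# across all listed layers.
params = [(m, "weight") for m in model if isinstance(m, torch.nn.Linear)]
prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=0.5)
for module, name in params:
    prune.remove(module, name)   # bake the masks in before fine-tuning

# Post-training dynamic quantization of the remaining weights to int8.
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)
```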
[758] Deep Gaussian Processes for Functional Maps
Matthew Lowery, Zhitong Xu, Da Long, Keyan Chen, Daniel S. Johnson, Yang Bai, Varun Shankar, Shandian Zhe
Main category: cs.LG
TL;DR: DGPFM is a deep Gaussian process framework for function-on-function regression that combines GP-based linear/nonlinear transformations with scalable variational inference, achieving better predictive performance and uncertainty calibration than existing methods.
Details
Motivation: Existing function-on-function regression methods like functional linear models and neural operators either cannot capture complex nonlinearities or lack reliable uncertainty quantification under noisy, sparse, and irregularly sampled functional data.
Method: Designs a sequence of GP-based linear and nonlinear transformations using integral transforms of kernels, GP interpolation, and nonlinear activations sampled from GPs. Uses inducing points and whitening transformations for scalable variational learning.
Result: Empirical results on real-world and PDE benchmark datasets demonstrate DGPFM’s advantage in both predictive performance and uncertainty calibration compared to existing approaches.
Conclusion: DGPFM provides an effective solution for function-on-function regression that captures complex nonlinearities while maintaining reliable uncertainty quantification, addressing limitations of current methods.
Abstract: Learning mappings between functional spaces, also known as function-on-function regression, plays a crucial role in functional data analysis and has broad applications, e.g. spatiotemporal forecasting, curve prediction, and climate modeling. Existing approaches, such as functional linear models and neural operators, either fall short of capturing complex nonlinearities or lack reliable uncertainty quantification under noisy, sparse, and irregularly sampled data. To address these issues, we propose Deep Gaussian Processes for Functional Maps (DGPFM). Our method designs a sequence of GP-based linear and nonlinear transformations, leveraging integral transforms of kernels, GP interpolation, and nonlinear activations sampled from GPs. A key insight simplifies implementation: under fixed locations, discrete approximations of kernel integral transforms collapse into direct functional integral transforms, enabling flexible incorporation of various integral transform designs. To achieve scalable probabilistic inference, we use inducing points and whitening transformations to develop a variational learning algorithm. Empirical results on real-world and PDE benchmark datasets demonstrate the advantage of DGPFM in both predictive performance and uncertainty calibration.
[759] Neural Index Policies for Restless Multi-Action Bandits with Heterogeneous Budgets
Himadri S. Pandey, Kai Wang, Gian-Gabriel P. Garcia
Main category: cs.LG
TL;DR: A Neural Index Policy (NIP) for multi-action restless multi-armed bandits with heterogeneous budget constraints that learns budget-aware indices and converts them into feasible allocations via a differentiable knapsack layer.
Details
Motivation: Classical RMAB formulations assume binary actions and a single global budget, which breaks down in real-world settings like healthcare that involve multiple interventions with heterogeneous costs and constraints.Method: Learns to assign budget-aware indices to arm-action pairs using a neural network, and converts them into feasible allocations via a differentiable knapsack layer formulated as an entropy-regularized optimal transport problem, unifying index prediction and constrained optimization in an end-to-end differentiable framework.
Result: NIP achieves near-optimal performance within 5% of the oracle occupancy-measure policy while strictly enforcing heterogeneous budgets and scaling to hundreds of arms.
Conclusion: Establishes a general, theoretically grounded, and scalable framework for learning index-based policies in complex resource-constrained environments.
Abstract: Restless multi-armed bandits (RMABs) provide a scalable framework for sequential decision-making under uncertainty, but classical formulations assume binary actions and a single global budget. Real-world settings, such as healthcare, often involve multiple interventions with heterogeneous costs and constraints, where such assumptions break down. We introduce a Neural Index Policy (NIP) for multi-action RMABs with heterogeneous budget constraints. Our approach learns to assign budget-aware indices to arm–action pairs using a neural network, and converts them into feasible allocations via a differentiable knapsack layer formulated as an entropy-regularized optimal transport (OT) problem. The resulting model unifies index prediction and constrained optimization in a single end-to-end differentiable framework, enabling gradient-based training directly on decision quality. The network is optimized to align its induced occupancy measure with the theoretical upper bound from a linear programming relaxation, bridging asymptotic RMAB theory with practical learning. Empirically, NIP achieves near-optimal performance within 5% of the oracle occupancy-measure policy while strictly enforcing heterogeneous budgets and scaling to hundreds of arms. This work establishes a general, theoretically grounded, and scalable framework for learning index-based policies in complex resource-constrained environments.
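The differentiable knapsack layer is an entropy-regularized optimal transport problem, which Sinkhorn iterations solve. A hedged numpy sketch (balanced budgets and sizes are illustrative; this is not the authors' implementation):

```python
import numpy as np

def sinkhorn(scores, row_mass, col_budget, eps=0.05, iters=200):
    """Entropy-regularized OT: rows = arms, columns = actions."""
    K = np.exp(scores / eps)            # Gibbs kernel from learned indices
    u = np.ones_like(row_mass)
    v = np.ones_like(col_budget)
    for _ in range(iters):              # alternating marginal projections
        u = row_mass / (K @ v)
        v = col_budget / (K.T @ u)
    return u[:, None] * K * v[None, :]  # transport plan ~ soft allocation

scores = np.random.randn(6, 3)          # learned arm-action indices
plan = sinkhorn(scores, row_mass=np.ones(6), col_budget=np.array([3.0, 2.0, 1.0]))
```

Because every step is smooth, gradients flow from decision quality back into the index network, which is the point of the end-to-end framework.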
[760] Agentic Reinforcement Learning for Real-World Code Repair
Siyu Zhu, Anastasiya Karpovich, Albert Chen, Jessica Koscheka, Shailesh Jannu, Di Wen, Yuqing Zhu, Rohit Jain, Alborz Geramifard
Main category: cs.LG
TL;DR: Developed verifiable pipeline for training code-fixing agents with build validation, achieving 7-20% gains with RL on SFT models, but models failed to generalize across environments.
Details
Motivation: Training reliable code-fixing agents in real repositories is challenging due to complex builds and shifting dependencies that make evaluation unstable.Method: Created a verifiable pipeline with post-fix build validation and improved reproducibility by pinning dependencies. Used supervised fine-tuning (SFT) on Qwen3-32B and applied reinforcement learning (RL) in a simplified environment. Also tested a 'thinking mode' approach.
Result: SFT model distilled from GPT-4.1 performed on par while being 56x smaller. RL added 7-20% absolute gains under matched train-test conditions. ‘Thinking mode’ performed on par or worse. Both SFT and RL models failed to generalize across environments.
Conclusion: Matching train-test environments is crucial for building reliable real-world code-fixing agents, as models failed to generalize across different environments despite performance gains in matched conditions.
Abstract: We tackle the challenge of training reliable code-fixing agents in real repositories, where complex builds and shifting dependencies make evaluation unstable. We developed a verifiable pipeline with success defined as post-fix build validation and improved reproducibility across ~1K real issues by pinning dependencies and disabling automatic upgrades. Building on this, we introduced a scalable simplified pipeline for large-scale reinforcement learning (RL). Using this setup, we supervised fine-tuned Qwen3-32B in the full pipeline and applied RL on top of the SFT model in the simplified environment. The SFT model distilled from GPT-4.1 trajectories performs on par while being 56x smaller, and RL added 7-20% absolute gains under matched train-test conditions. “Thinking mode” was on par or worse in our experiments. Both SFT and RL models failed to generalize across environments, highlighting the importance of matching train-test environments for building reliable real-world code-fixing agents.
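A minimal sketch of the verifiable reward at the heart of such a pipeline: a fix counts as a success iff the build exits cleanly inside a pinned environment. The build command and timeout below are illustrative assumptions, not the paper's configuration:

```python
import subprocess

def build_ok(repo_dir: str, cmd=("make", "build"), timeout=1800) -> bool:
    """Post-fix build validation: success iff the build command exits 0."""
    try:
        r = subprocess.run(cmd, cwd=repo_dir, capture_output=True, timeout=timeout)
        return r.returncode == 0
    except subprocess.TimeoutExpired:
        return False

reward = 1.0 if build_ok("/path/to/repo") else 0.0  # verifiable RL signal
```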
[761] Hierarchical Graph Networks for Accurate Weather Forecasting via Lightweight Training
Thomas Bailie, S. Karthik Mukkavilli, Varvara Vetrova, Yun Sing Koh
Main category: cs.LG
TL;DR: HiFlowCast and HiAntFlow are hierarchical graph neural networks that embed physics in multiscale weather prediction, using latent-memory retention to preserve global trends and latent-to-physics integration for PDE solutions across scales, achieving 5-8% error reduction at extreme conditions with single-epoch convergence.
Details
Motivation: Accurate weather prediction is challenging due to physical processes across diverse spatio-temporal scales that fixed-resolution methods cannot capture, and existing HGNNs often erase global trends during downward mappings, weakening physics integration.Method: Introduces HiFlowCast and HiAntFlow with two key innovations: Latent-Memory-Retention mechanism to preserve global trends during downward traversal, and Latent-to-Physics branch that integrates PDE solution fields across diverse scales.
Result: Models reduce errors by over 5% at 13-day lead times and by 5-8% under 1st and 99th quantile extremes, improving reliability for rare events. They converge within a single epoch using pretrained weights, reducing training costs and carbon footprint.
Conclusion: The proposed physics-embedded multiscale framework significantly improves weather forecasting accuracy and reliability while enhancing computational efficiency and sustainability through rapid convergence.
Abstract: Climate events arise from intricate, multivariate dynamics governed by global-scale drivers, profoundly impacting food, energy, and infrastructure. Yet, accurate weather prediction remains elusive due to physical processes unfolding across diverse spatio-temporal scales, which fixed-resolution methods cannot capture. Hierarchical Graph Neural Networks (HGNNs) offer a multiscale representation, but nonlinear downward mappings often erase global trends, weakening the integration of physics into forecasts. We introduce HiFlowCast and its ensemble variant HiAntFlow, HGNNs that embed physics within a multiscale prediction framework. Two innovations underpin their design: a Latent-Memory-Retention mechanism that preserves global trends during downward traversal, and a Latent-to-Physics branch that integrates PDE solution fields across diverse scales. Our Flow models cut errors by over 5% at 13-day lead times and by 5-8% under 1st and 99th quantile extremes, improving reliability for rare events. Leveraging pretrained model weights, they converge within a single epoch, reducing training cost and their carbon footprint. Such efficiency is vital as the growing scale of machine learning challenges sustainability and limits research accessibility. Code and model weights are in the supplementary materials.
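One way to picture the Latent-Memory-Retention idea: carry a global summary latent down the hierarchy so fine-scale updates cannot erase the coarse trend. A toy PyTorch sketch under that reading (not the authors' architecture):

```python
import torch
import torch.nn as nn

class DownMap(nn.Module):
    """Downward mapping that concatenates a retained global latent."""
    def __init__(self, dim=32):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, fine_nodes, global_latent):
        # fine_nodes: (n, dim); global_latent: (dim,) retained from the top level
        g = global_latent.expand(fine_nodes.size(0), -1)
        return self.proj(torch.cat([fine_nodes, g], dim=-1))

out = DownMap()(torch.randn(50, 32), torch.randn(32))
```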
[762] Dynamic Graph Neural Network for Data-Driven Physiologically Based Pharmacokinetic Modeling
Su Liu, Xin Hu, Shurong Wen, Jiaqi Liu, Jiexi Xu, Lanruo Wang
Main category: cs.LG
TL;DR: This paper proposes a Dynamic Graph Neural Network (Dynamic GNN) for Physiologically Based Pharmacokinetic (PBPK) modeling, achieving superior performance over MLP and LSTM baselines by explicitly modeling inter-organ interactions through recurrent message-passing.
Details
Motivation: Traditional PBPK modeling relies on ordinary differential equations with simplifying assumptions that limit adaptability to nonlinear physiological interactions, creating a need for more flexible, data-driven approaches.Method: Proposed a Dynamic Graph Neural Network that models physiological connections as recurrent message-passing processes between organs, comparing it against MLP and LSTM baselines for capturing molecular and temporal dependencies.
Result: Dynamic GNN achieved highest performance with R²=0.9342, RMSE=0.0159, MAE=0.0116, significantly outperforming MLP (R²=0.8705) and LSTM (R²=0.8059).
Conclusion: Explicitly modeling spatial and temporal dependencies of organ interactions enables more accurate and generalizable drug concentration prediction, providing a scalable, equation-free alternative to traditional PBPK formulations.
Abstract: Physiologically Based Pharmacokinetic (PBPK) modeling plays a critical role in drug development by predicting drug concentration dynamics across organs. Traditional approaches rely on ordinary differential equations with strong simplifying assumptions, which limit their adaptability to nonlinear physiological interactions. In this study, we explore data-driven alternatives for PBPK prediction using deep learning. Two baseline architectures - a multilayer perceptron (MLP) and a long short-term memory (LSTM) network - are implemented to capture molecular and temporal dependencies, respectively. To incorporate inter-organ interactions, we propose a Dynamic Graph Neural Network (Dynamic GNN) that models physiological connections as recurrent message-passing processes between organs. Experimental results demonstrate that the proposed Dynamic GNN achieves the highest predictive performance among all models, with an R^2 of 0.9342, an RMSE of 0.0159, and an MAE of 0.0116. In comparison, the MLP baseline obtains an R^2 of 0.8705 and the LSTM achieves 0.8059. These results highlight that explicitly modeling the spatial and temporal dependencies of organ interactions enables more accurate and generalizable drug concentration prediction. The Dynamic GNN provides a scalable, equation-free alternative to traditional PBPK formulations and demonstrates strong potential for data-driven pharmacokinetic modeling in preclinical and clinical research.
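A minimal sketch of the recurrent message-passing idea: organ states evolve by exchanging messages along a fixed physiological adjacency. The sizes and the GRU-style update are illustrative assumptions:

```python
import torch
import torch.nn as nn

class OrganGNN(nn.Module):
    """Recurrent message passing over a fixed organ graph."""
    def __init__(self, hidden):
        super().__init__()
        self.msg = nn.Linear(hidden, hidden)
        self.upd = nn.GRUCell(hidden, hidden)

    def forward(self, h, A, steps=5):
        # h: (n_organs, hidden) organ states; A: (n_organs, n_organs) weights
        for _ in range(steps):
            m = A @ self.msg(h)   # aggregate neighbours' messages
            h = self.upd(m, h)    # per-organ recurrent state update
        return h

h = OrganGNN(16)(torch.randn(5, 16), torch.rand(5, 5))
```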
[763] Learning 3D Anisotropic Noise Distributions Improves Molecular Force Field Modeling
Xixian Liu, Rui Jiao, Zhiyuan Liu, Yurou Liu, Yang Liu, Ziheng Lu, Wenbing Huang, Yang Zhang, Yixin Cao
Main category: cs.LG
TL;DR: AniDS introduces an anisotropic variational autoencoder for 3D molecular denoising that addresses limitations of isotropic noise assumptions by generating atom-specific, full covariance matrices derived from pairwise atomic interactions.
Details
Motivation: Existing coordinate denoising methods rely on oversimplified molecular dynamics that assume isotropic and homoscedastic atomic motions, which don't reflect the directional and structural variability in real molecular systems.Method: Proposes AniDS framework with structure-aware anisotropic noise generator that produces atom-specific full covariance matrices for Gaussian noise distributions. These covariances are derived from pairwise atomic interactions as anisotropic corrections to an isotropic base, ensuring symmetric, positive semi-definite, and SO(3)-equivariant properties.
Result: Outperforms prior isotropic and homoscedastic denoising models on MD17 and OC22 benchmarks, achieving average relative improvements of 8.9% and 6.2% in force prediction accuracy. Case study shows adaptive noise suppression along bonding directions consistent with physicochemical principles.
Conclusion: AniDS provides a more accurate approach to 3D molecular denoising by modeling anisotropic noise distributions that better reflect complex molecular dynamics, demonstrating superior performance over existing methods.
Abstract: Coordinate denoising has emerged as a promising method for 3D molecular pretraining due to its theoretical connection to learning molecular force field. However, existing denoising methods rely on oversimplified molecular dynamics that assume atomic motions to be isotropic and homoscedastic. To address these limitations, we propose a novel denoising framework AniDS: Anisotropic Variational Autoencoder for 3D Molecular Denoising. AniDS introduces a structure-aware anisotropic noise generator that can produce atom-specific, full covariance matrices for Gaussian noise distributions to better reflect directional and structural variability in molecular systems. These covariances are derived from pairwise atomic interactions as anisotropic corrections to an isotropic base. Our design ensures that the resulting covariance matrices are symmetric, positive semi-definite, and SO(3)-equivariant, while providing greater capacity to model complex molecular dynamics. Extensive experiments show that AniDS outperforms prior isotropic and homoscedastic denoising models and other leading methods on the MD17 and OC22 benchmarks, achieving average relative improvements of 8.9% and 6.2% in force prediction accuracy. Our case study on a crystal and molecule structure shows that AniDS adaptively suppresses noise along the bonding direction, consistent with physicochemical principles. Our code is available at https://github.com/ZeroKnighting/AniDS.
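The covariance construction can be illustrated directly: an isotropic base plus nonnegative-weighted outer products of unit bond directions is symmetric, PSD, and rotates with the molecule. The weights below are placeholders for the paper's learned pairwise interaction terms:

```python
import numpy as np

def aniso_cov(pos, i, neighbors, weights, sigma2=0.01):
    """Sigma_i = sigma^2 I + sum_j w_ij d_ij d_ij^T with w_ij >= 0."""
    cov = sigma2 * np.eye(3)
    for j, w in zip(neighbors, weights):   # w >= 0 keeps the sum PSD
        d = pos[j] - pos[i]
        d = d / np.linalg.norm(d)          # unit bond direction
        cov += w * np.outer(d, d)          # rank-1 anisotropic correction
    return cov

pos = np.random.randn(4, 3)
Sigma = aniso_cov(pos, 0, neighbors=[1, 2, 3], weights=[0.3, 0.2, 0.1])
noise = np.random.multivariate_normal(np.zeros(3), Sigma)
```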
[764] Efficient Utility-Preserving Machine Unlearning with Implicit Gradient Surgery
Shiji Zhou, Tianbai Yu, Zhi Zhang, Heng Chang, Xiao Zhou, Dong Wu, Han Zhao
Main category: cs.LG
TL;DR: The paper proposes an efficient utility-preserving machine unlearning method that formulates MU as a constrained optimization problem and solves it through implicit gradient surgery with one backpropagation.
Details
Motivation: Existing multi-objective methods for machine unlearning only find Pareto-optimal solutions without fine-grained control, leading to under-optimization of the unlearning objective.Method: Model MU as a constrained optimization problem (optimizing unlearning objective with bounded utility loss constraint), then solve it using implicit gradient surgery that approximates the solution with only one backpropagation.
Result: The proposed algorithm achieves better tradeoff results than existing baselines, with theoretical convergence guarantees and empirical validation.
Conclusion: The method provides an efficient solution for utility-preserving machine unlearning through implicit gradient surgery, balancing unlearning efficacy and utility preservation effectively.
Abstract: Machine unlearning (MU) aims to efficiently remove sensitive or harmful memory from a pre-trained model. The key challenge is to balance the potential tradeoff between unlearning efficacy and utility preservation, which involves forgetting undesirable information as defined while maintaining the model’s original performance. One potential way to tackle this problem is to use multi-objective optimization to jointly optimize both the unlearning and utility preservation objectives. However, existing multi-objective methods only guarantee finding a Pareto-optimal solution without fine-grained control, which causes under-optimization of the unlearning objective. To this end, we first model MU as a constrained optimization problem, that is, optimizing the unlearning objective under the constraint of a bounded increase for utility loss. We then show that solving this optimization problem is equivalent to unilateral gradient surgery on the unlearning objective. To resolve the additional computational cost brought by gradient surgery, we propose an implicit gradient surgery method, which approximates the solution to the aforementioned constrained optimization problem via only one backpropagation, thereby achieving efficient utility-preserving MU. Theoretically, we provide a tight convergence analysis of the algorithm. Empirically, our extensive experiments show that the proposed algorithm achieves better tradeoff results than existing baselines. Codes are available at https://github.com/anseryuer/EUPMU-Efficient-Utility-Preserving-Machine-Unlearning.
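A sketch of the gradient-surgery intuition, as a unilateral PCGrad-style projection; the paper's implicit variant approximates the constrained solution with a single backpropagation, which this toy does not capture:

```python
import torch

def surgery(g_f, g_r):
    """Follow the unlearning gradient g_f, but remove its conflict with
    the utility-preservation gradient g_r."""
    dot = torch.dot(g_f, g_r)
    if dot < 0:                                # conflict: step would hurt utility
        g_f = g_f - dot / g_r.dot(g_r) * g_r   # project out the conflicting part
    return g_f

g_f = torch.tensor([1.0, -2.0, 0.5])   # unlearning gradient
g_r = torch.tensor([0.5, 1.0, 0.0])    # retain/utility gradient
step = surgery(g_f, g_r)               # descend without increasing utility loss
```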
[765] Probing Neural Combinatorial Optimization Models
Zhiqin Zhang, Yining Ma, Zhiguang Cao, Hoong Chuin Lau
Main category: cs.LG
TL;DR: This paper introduces CS-Probing, a novel method for interpreting neural combinatorial optimization (NCO) models by analyzing their representations through probing tasks and examining coefficient significance.
Details
Motivation: NCO models achieve strong performance but remain black boxes, which hinders research and practical deployment. Understanding their representations and decision rationale is crucial for deeper insights.Method: The authors use various probing tasks and introduce Coefficient Significance Probing (CS-Probing) to analyze NCO representations by examining coefficients and statistical significance during probing.
Result: Experiments reveal that NCO models encode both low-level information for solution construction and high-level knowledge for better decisions. CS-Probing uncovers inductive biases, generalization evidence, and key embedding dimensions, leading to improved model generalization with minor code modifications.
Conclusion: This work represents the first systematic attempt to interpret black-box NCO models, demonstrating probing as a promising tool for analyzing internal mechanisms and providing valuable insights for the NCO community.
Abstract: Neural combinatorial optimization (NCO) has achieved remarkable performance, yet its learned model representations and decision rationale remain a black box. This impedes both academic research and practical deployment, since researchers and stakeholders require deeper insights into NCO models. In this paper, we take the first critical step towards interpreting NCO models by investigating their representations through various probing tasks. Moreover, we introduce a novel probing tool named Coefficient Significance Probing (CS-Probing) to enable deeper analysis of NCO representations by examining the coefficients and statistical significance during probing. Extensive experiments and analysis reveal that NCO models encode low-level information essential for solution construction, while capturing high-level knowledge to facilitate better decisions. Using CS-Probing, we find that prevalent NCO models impose varying inductive biases on their learned representations, uncover direct evidence related to model generalization, and identify key embedding dimensions associated with specific knowledge. These insights can be potentially translated into practice, for example, with minor code modifications, we improve the generalization of the analyzed model. Our work represents a first systematic attempt to interpret black-box NCO models, showcasing probing as a promising tool for analyzing their internal mechanisms and revealing insights for the NCO community. The source code is publicly available.
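Probing with coefficient significance reduces to fitting a linear probe on frozen representations and reading off p-values. A self-contained toy with synthetic stand-in "embeddings":

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 16))   # stand-in for frozen NCO node embeddings
# A probed property carried by dimensions 3 and 7 only:
y = 2.0 * Z[:, 3] - 1.5 * Z[:, 7] + rng.normal(scale=0.5, size=500)

ols = sm.OLS(y, sm.add_constant(Z)).fit()
significant = np.where(ols.pvalues[1:] < 0.01)[0]   # skip the intercept
print("dimensions carrying the probed property:", significant)
```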
[766] Tractable Shapley Values and Interactions via Tensor Networks
Farzaneh Heidari, Chao Li
Main category: cs.LG
TL;DR: TN-SHAP replaces exponential coalition enumeration for Shapley values with tensor-network surrogates, achieving polynomial complexity for computing feature interactions.
Details
Motivation: Traditional Shapley value computation requires O(2^n) coalition evaluations, which becomes intractable for large n. There's a need for efficient methods to compute Shapley-style interactions without exhaustive enumeration.Method: Represent predictor’s local behavior as factorized multilinear map using tensor networks. Replace coalition sweeps with targeted evaluations to extract order-k Shapley interactions. Specifically uses O(n*poly(chi) + n^2) evaluations for order-1 and order-2 interactions.
Result: Achieves 25-1000x wall-clock speedups over KernelSHAP-IQ at comparable accuracy. Matches enumeration accuracy on fitted surrogates while reducing evaluations by orders of magnitude. Successfully demonstrated on UCI datasets.
Conclusion: TN-SHAP provides theoretically guaranteed efficient computation of Shapley interactions using tensor networks, overcoming the exponential complexity barrier of traditional methods while maintaining accuracy.
Abstract: We show how to replace the O(2^n) coalition enumeration over n features behind Shapley values and Shapley-style interaction indices with a few-evaluation scheme on a tensor-network (TN) surrogate: TN-SHAP. The key idea is to represent a predictor’s local behavior as a factorized multilinear map, so that coalitional quantities become linear probes of a coefficient tensor. TN-SHAP replaces exhaustive coalition sweeps with just a small number of targeted evaluations to extract order-k Shapley interactions. In particular, both order-1 (single-feature) and order-2 (pairwise) computations have cost O(n*poly(chi) + n^2), where chi is the TN’s maximal cut rank. We provide theoretical guarantees on the approximation error and tractability of TN-SHAP. On UCI datasets, our method matches enumeration on the fitted surrogate while reducing evaluation by orders of magnitude and achieves 25-1000x wall-clock speedups over KernelSHAP-IQ at comparable accuracy, while amortizing training across local cohorts.
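For contrast, the O(2^n) enumeration that TN-SHAP sidesteps looks like this; it is exact but infeasible beyond small n:

```python
from itertools import combinations
from math import factorial

def shapley(v, n):
    """Exact Shapley values by enumerating all coalitions of v."""
    phi = [0.0] * n
    for i in range(n):
        for k in range(n):
            for S in combinations([j for j in range(n) if j != i], k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (v(set(S) | {i}) - v(set(S)))
    return phi

v = lambda S: sum(S) + (1 if {0, 1} <= S else 0)   # toy value function
print(shapley(v, 4))
```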
[767] Edit Less, Achieve More: Dynamic Sparse Neuron Masking for Lifelong Knowledge Editing in LLMs
Jinzhe Liu, Junshu Sun, Shufan Shen, Chenxue Yang, Shuhui Wang
Main category: cs.LG
TL;DR: NMKE is a lifelong knowledge editing method that uses neuron-level attribution and dynamic sparse masking to precisely update LLMs with minimal parameter changes, maintaining high editing success and generalization.
Details
Motivation: Existing lifelong knowledge editing methods accumulate errors over time, leading to declining accuracy and generalization in LLMs. There's a need for more precise editing that avoids these issues.Method: Uses neuron functional attribution to identify knowledge-general and knowledge-specific neurons, then applies entropy-guided dynamic sparse masking to locate relevant neurons for precise editing with minimal parameter modifications.
Result: Experimental results from thousands of sequential edits show NMKE outperforms existing methods in maintaining high editing success rates and preserving model general capabilities.
Conclusion: NMKE provides an effective framework for lifelong knowledge editing that prevents error accumulation and maintains model performance through fine-grained neuron-level editing.
Abstract: Lifelong knowledge editing enables continuous, precise updates to outdated knowledge in large language models (LLMs) without computationally expensive full retraining. However, existing methods often accumulate errors throughout the editing process, causing a gradual decline in both editing accuracy and generalization. To tackle this problem, we propose Neuron-Specific Masked Knowledge Editing (NMKE), a novel fine-grained editing framework that combines neuron-level attribution with dynamic sparse masking. Leveraging neuron functional attribution, we identify two key types of knowledge neurons, with knowledge-general neurons activating consistently across prompts and knowledge-specific neurons activating to specific prompts. NMKE further introduces an entropy-guided dynamic sparse mask, locating relevant neurons to the target knowledge. This strategy enables precise neuron-level knowledge editing with fewer parameter modifications. Experimental results from thousands of sequential edits demonstrate that NMKE outperforms existing methods in maintaining high editing success rates and preserving model general capabilities in lifelong editing.
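One plausible reading of the entropy-guided mask, as a toy: neurons whose activation mass concentrates on few prompts (low entropy) behave as knowledge-specific and are selected for editing. The statistic and threshold below are illustrative assumptions, not the paper's exact criterion:

```python
import numpy as np

acts = np.abs(np.random.randn(32, 512))         # (prompts, neurons) activations
p = acts / acts.sum(axis=0, keepdims=True)      # normalize per neuron
entropy = -(p * np.log(p + 1e-9)).sum(axis=0)   # low = fires on few prompts
mask = entropy < np.quantile(entropy, 0.05)     # sparse mask of specific neurons
print("neurons selected for editing:", mask.sum())
```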
[768] Power to the Clients: Federated Learning in a Dictatorship Setting
Mohammadsajad Alipour, Mohammad Mohammadi Amiri
Main category: cs.LG
TL;DR: This paper introduces ‘dictator clients’ - malicious participants in federated learning that can erase other clients’ contributions while preserving their own, and analyzes attack strategies and their impacts on model convergence.
Details
Motivation: Federated learning's decentralized nature creates vulnerabilities where malicious clients can compromise training. The paper aims to define and analyze a specific class of malicious participants that can manipulate the learning process.Method: The authors propose concrete attack strategies for dictator clients and theoretically analyze their effects on convergence. They explore scenarios with multiple dictator clients (collaborating, independent, or betraying alliances) and support findings with empirical evaluations on computer vision and NLP benchmarks.
Result: The research demonstrates that dictator clients can successfully erase other clients’ contributions while maintaining their own influence on the global model. The theoretical analysis and empirical evaluations confirm the effectiveness of these attacks across different scenarios.
Conclusion: Dictator clients represent a significant threat to federated learning systems, capable of manipulating the training process. The paper provides both theoretical foundations and empirical evidence for these vulnerabilities, highlighting the need for robust defense mechanisms against such attacks.
Abstract: Federated learning (FL) has emerged as a promising paradigm for decentralized model training, enabling multiple clients to collaboratively learn a shared model without exchanging their local data. However, the decentralized nature of FL also introduces vulnerabilities, as malicious clients can compromise or manipulate the training process. In this work, we introduce dictator clients, a novel, well-defined, and analytically tractable class of malicious participants capable of entirely erasing the contributions of all other clients from the server model, while preserving their own. We propose concrete attack strategies that empower such clients and systematically analyze their effects on the learning process. Furthermore, we explore complex scenarios involving multiple dictator clients, including cases where they collaborate, act independently, or form an alliance in order to ultimately betray one another. For each of these settings, we provide a theoretical analysis of their impact on the global model’s convergence. Our theoretical algorithms and findings about the complex scenarios including multiple dictator clients are further supported by empirical evaluations on both computer vision and natural language processing benchmarks.
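A classic recipe in this attack family is model replacement (Bagdasaryan et al.): scale the malicious update so that plain FedAvg lands exactly on the attacker's target. The sketch below assumes, purely for illustration, that the attacker can cancel the honest aggregate; it is not necessarily the paper's exact strategy:

```python
import numpy as np

def dictator_update(w_global, w_target, n_clients):
    # Server computes w_global + mean(updates); scaling by n_clients
    # makes the attacker's term dominate the average exactly.
    return n_clients * (w_target - w_global)

w_global = np.zeros(10)
w_target = np.ones(10)
honest = [np.random.randn(10) * 0.01 for _ in range(4)]
# Attacker also subtracts the honest aggregate (assumed known here):
updates = honest + [dictator_update(w_global, w_target, 5) - sum(honest)]
w_new = w_global + np.mean(updates, axis=0)
assert np.allclose(w_new, w_target)   # other clients' contributions erased
```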
[769] Quantitative Bounds for Sorting-Based Permutation-Invariant Embeddings
Nadav Dym, Matthias Wellershoff, Efstratios Tsoukanis, Daniel Levy, Radu Balan
Main category: cs.LG
TL;DR: The paper studies sorting-based embeddings for permutation-invariant graph learning, improving bounds on embedding dimension and analyzing bi-Lipschitz distortion.
Details
Motivation: To address gaps in understanding sorting-based embeddings: optimal embedding dimension for injectivity and bi-Lipschitz constant estimates, which are crucial for graph deep learning applications.Method: Analyzes the embedding β_A: X ↦ sort(XA) where sort is column-wise sorting. Studies injectivity conditions and constructs specific matrices A to analyze distortion bounds.
Result: Improved upper bounds for injectivity dimension, lower bound on minimal injectivity dimension, showed bi-Lipschitz distortion depends quadratically on n and is independent of d, with lower bound of Ω(√n).
Conclusion: Significant progress made on both gaps: better dimension bounds for injectivity and quantitative distortion analysis, with similar results for dimension-reduced variants.
Abstract: We study the sorting-based embedding $\beta_{\mathbf A} : \mathbb R^{n \times d} \to \mathbb R^{n \times D}$, $\mathbf X \mapsto {\downarrow}(\mathbf X \mathbf A)$, where $\downarrow$ denotes column wise sorting of matrices. Such embeddings arise in graph deep learning where outputs should be invariant to permutations of graph nodes. Previous work showed that for large enough $D$ and appropriate $\mathbf A$, the mapping $\beta_{\mathbf A}$ is injective, and moreover satisfies a bi-Lipschitz condition. However, two gaps remain: firstly, the optimal size $D$ required for injectivity is not yet known, and secondly, no estimates of the bi-Lipschitz constants of the mapping are known. In this paper, we make substantial progress in addressing both of these gaps. Regarding the first gap, we improve upon the best known upper bounds for the embedding dimension $D$ necessary for injectivity, and also provide a lower bound on the minimal injectivity dimension. Regarding the second gap, we construct matrices $\mathbf A$, so that the bi-Lipschitz distortion of $\beta_{\mathbf A} $ depends quadratically on $n$, and is completely independent of $d$. We also show that the distortion of $\beta_{\mathbf A}$ is necessarily at least in $\Omega(\sqrt{n})$. Finally, we provide similar results for variants of $\beta_{\mathbf A}$ obtained by applying linear projections to reduce the output dimension of $\beta_{\mathbf A}$.
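The embedding itself is two lines of numpy, and permutation invariance can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # n=5 nodes, d=3 features
A = rng.normal(size=(3, 8))        # D=8 output columns

emb = np.sort(X @ A, axis=0)       # beta_A(X) = column-wise sort of XA

P = rng.permutation(5)             # permute the nodes
assert np.allclose(np.sort(X[P] @ A, axis=0), emb)   # invariant
```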
[770] Multi-dataset Joint Pre-training of Emotional EEG Enables Generalizable Affective Computing
Qingzhu Zhang, Jiani Zhong, Zongsheng Li, Xinke Shen, Quanying Liu
Main category: cs.LG
TL;DR: A task-specific multi-dataset joint pre-training framework for cross-dataset emotion recognition using EEG data, addressing distribution shifts and variability through cross-dataset covariance alignment and hybrid encoding.
Details
Motivation: Existing task-general pre-training EEG models struggle with complex tasks like emotion recognition due to mismatches between task-specific features and broad pre-training approaches, especially with inter-dataset distribution shifts and inter-subject variability.Method: Cross-dataset covariance alignment loss to align second-order statistical properties across datasets, combined with a hybrid encoder using Mamba-like linear attention channel encoder and spatiotemporal dynamics model to capture long-term dependencies and complex EEG dynamics.
Result: Outperforms state-of-the-art large-scale EEG models by 4.57% in AUROC for few-shot emotion recognition and 11.92% in accuracy for zero-shot generalization. Multi-dataset joint pre-training achieves 8.55% performance gain over single-dataset training.
Conclusion: Provides a scalable framework for task-specific pre-training that enables robust generalization without extensive labels or per-subject calibration, highlighting benefits for generalizable affective computing.
Abstract: Task-specific pre-training is essential when task representations diverge from generic pre-training features. Existing task-general pre-training EEG models struggle with complex tasks like emotion recognition due to mismatches between task-specific features and broad pre-training approaches. This work aims to develop a task-specific multi-dataset joint pre-training framework for cross-dataset emotion recognition, tackling problems of large inter-dataset distribution shifts, inconsistent emotion category definitions, and substantial inter-subject variability. We introduce a cross-dataset covariance alignment loss to align second-order statistical properties across datasets, enabling robust generalization without the need for extensive labels or per-subject calibration. To capture the long-term dependency and complex dynamics of EEG, we propose a hybrid encoder combining a Mamba-like linear attention channel encoder and a spatiotemporal dynamics model. Our method outperforms state-of-the-art large-scale EEG models by an average of 4.57% in AUROC for few-shot emotion recognition and 11.92% in accuracy for zero-shot generalization to a new dataset. Performance scales with the increase of datasets used in pre-training. Multi-dataset joint pre-training achieves a performance gain of 8.55% over single-dataset training. This work provides a scalable framework for task-specific pre-training and highlights its benefit in generalizable affective computing. Our code is available at https://github.com/ncclab-sustech/mdJPT_nips2025.
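A covariance alignment penalty in the spirit of CORAL (the paper's exact loss may differ) matches second-order feature statistics across datasets:

```python
import torch

def cov(z):
    zc = z - z.mean(dim=0, keepdim=True)
    return zc.T @ zc / (z.shape[0] - 1)

def cov_align_loss(z_a, z_b):
    # Squared Frobenius distance between the two feature covariances
    return ((cov(z_a) - cov(z_b)) ** 2).sum()

z_a = torch.randn(128, 64)   # encoder features from dataset A
z_b = torch.randn(96, 64)    # encoder features from dataset B
loss = cov_align_loss(z_a, z_b)
```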
[771] The Lossy Horizon: Error-Bounded Predictive Coding for Lossy Text Compression (Episode I)
Nnamdi Aghanya, Jun Li, Kewei Wang
Main category: cs.LG
TL;DR: EPC is a lossy text compression method using masked language models as decompressors, storing only rank-based corrections when predictions are wrong, enabling continuous rate-distortion control.
Details
Motivation: To explore LLMs' potential in lossy text compression where reconstruction fidelity can be traded for higher compression ratios, moving beyond their established use in lossless compression.Method: Error-Bounded Predictive Coding (EPC) uses a Masked Language Model as decompressor, predicts masked content and stores minimal rank-based corrections only when top predictions are incorrect, creating a residual channel for rate-distortion control.
Result: EPC consistently dominates the Predictive Masking baseline, offering superior fidelity at significantly lower bit rates by more efficiently utilizing the model’s intrinsic knowledge.
Conclusion: EPC demonstrates effective lossy text compression through predictive coding with error correction, outperforming simpler baselines while providing continuous rate-distortion control.
Abstract: Large Language Models (LLMs) can achieve near-optimal lossless compression by acting as powerful probability models. We investigate their use in the lossy domain, where reconstruction fidelity is traded for higher compression ratios. This paper introduces Error-Bounded Predictive Coding (EPC), a lossy text codec that leverages a Masked Language Model (MLM) as a decompressor. Instead of storing a subset of original tokens, EPC allows the model to predict masked content and stores minimal, rank-based corrections only when the model’s top prediction is incorrect. This creates a residual channel that offers continuous rate-distortion control. We compare EPC to a simpler Predictive Masking (PM) baseline and a transform-based Vector Quantisation with a Residual Patch (VQ+RE) approach. Through an evaluation that includes precise bit accounting and rate-distortion analysis, we demonstrate that EPC consistently dominates PM, offering superior fidelity at a significantly lower bit rate by more efficiently utilising the model’s intrinsic knowledge.
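A hedged sketch of rank-based correction with an off-the-shelf MLM (bert-base-uncased as a stand-in decompressor; the paper's codec details differ). Masking the not-yet-processed positions identically at compression and decompression keeps the contexts aligned, so stored ranks reconstruct the masked tokens exactly; dropping corrections gives the lossy regime:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def logits_at(ids, pos, masked_positions):
    x = ids.clone()
    x[0, list(masked_positions)] = tok.mask_token_id
    with torch.no_grad():
        return mlm(input_ids=x).logits[0, pos]

def compress(ids, positions):
    """Rank corrections; rank 0 (top-1 correct) stores nothing."""
    corrections = {}
    for i, pos in enumerate(positions):        # later positions stay masked
        logits = logits_at(ids, pos, positions[i:])
        rank = int((logits > logits[ids[0, pos]]).sum())
        if rank > 0:
            corrections[pos] = rank
    return corrections

def decompress(ids, positions, corrections):
    out = ids.clone()
    for i, pos in enumerate(positions):        # same contexts as compress
        logits = logits_at(out, pos, positions[i:])
        out[0, pos] = torch.argsort(logits, descending=True)[corrections.get(pos, 0)]
    return out
```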
[772] Simplifying Knowledge Transfer in Pretrained Models
Siddharth Jain, Shyamgopal Karthik, Vineet Gandhi
Main category: cs.LG
TL;DR: The paper proposes using large model repositories for knowledge transfer between pretrained models, where models autonomously act as teachers or students to improve performance across various tasks.
Details
Motivation: Pretrained models exhibit diverse generalization behaviors, with different models capturing distinct data-specific insights. This diversity can be leveraged to improve model performance through knowledge sharing.Method: A data partitioning strategy where pretrained models autonomously adopt roles as students (seeking knowledge) or teachers (imparting knowledge), enabling bidirectional knowledge transfer both within and across model architectures.
Result: Improved ViT-B performance by 1.4% through bidirectional knowledge transfer with ViT-T; boosted all metrics in semantic segmentation; achieved new state-of-the-art in video saliency prediction; and enabled multi-model knowledge transfer with considerable improvements for all participants.
Conclusion: Leveraging model repositories for autonomous knowledge transfer between pretrained models is an effective approach to improve model performance across various computer vision tasks.
Abstract: Pretrained models are ubiquitous in the current deep learning landscape, offering strong results on a broad range of tasks. Recent works have shown that models differing in various design choices exhibit categorically diverse generalization behavior, resulting in one model grasping distinct data-specific insights unavailable to the other. In this paper, we propose to leverage large publicly available model repositories as an auxiliary source of model improvements. We introduce a data partitioning strategy where pretrained models autonomously adopt either the role of a student, seeking knowledge, or that of a teacher, imparting knowledge. Experiments across various tasks demonstrate the effectiveness of our proposed approach. In image classification, we improved the performance of ViT-B by approximately 1.4% through bidirectional knowledge transfer with ViT-T. For semantic segmentation, our method boosted all evaluation metrics by enabling knowledge transfer both within and across backbone architectures. In video saliency prediction, our approach achieved a new state-of-the-art. We further extend our approach to knowledge transfer between multiple models, leading to considerable performance improvements for all model participants.
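The role assignment can be sketched as a data partition plus soft-label distillation: on examples A gets right and B gets wrong, A teaches B, and symmetrically. A minimal PyTorch rendition (temperature and masks are illustrative):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    # Standard soft-label KD: KL between temperature-scaled distributions
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T

def partition(logits_a, logits_b, labels):
    a_right = logits_a.argmax(-1) == labels
    b_right = logits_b.argmax(-1) == labels
    return a_right & ~b_right, b_right & ~a_right   # A-teaches-B, B-teaches-A
```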
[773] Visual Model Selection using Feature Importance Clusters in Fairness-Performance Similarity Optimized Space
Sofoklis Kitharidis, Cor J. Veenman, Thomas Bäck, Niki van Stein
Main category: cs.LG
TL;DR: Interactive framework for selecting fair ML models using metric learning and clustering to help stakeholders navigate fairness-performance trade-offs.
Details
Motivation: Multiple fair ML models create selection challenges for stakeholders who need models aligned with their specific requirements and values.Method: Uses weakly supervised metric learning to learn Mahalanobis distance for fairness-performance similarity, then applies k-means clustering on feature importance representations.
Result: Groups models with similar predictive behaviors and fairness characteristics, enabling exploration of clusters based on feature importance differences.
Conclusion: Facilitates informed decision-making by helping users understand how models differ in fairness-performance balance and feature importance drivers.
Abstract: In the context of algorithmic decision-making, fair machine learning methods often yield multiple models that balance predictive fairness and performance in varying degrees. This diversity introduces a challenge for stakeholders who must select a model that aligns with their specific requirements and values. To address this, we propose an interactive framework that assists in navigating and interpreting the trade-offs across a portfolio of models. Our approach leverages weakly supervised metric learning to learn a Mahalanobis distance that reflects similarity in fairness and performance outcomes, effectively structuring the feature importance space of the models according to stakeholder-relevant criteria. We then apply a clustering technique (k-means) to group models based on their transformed representations of feature importances, allowing users to explore clusters of models with similar predictive behaviors and fairness characteristics. This facilitates informed decision-making by helping users understand how models differ not only in their fairness-performance balance but also in the features that drive their predictions.
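A simplified stand-in for the pipeline: the paper learns the Mahalanobis metric from fairness-performance similarity; here plain whitening plays that role, followed by k-means on the transformed feature-importance vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

imp = np.random.rand(40, 12)   # 40 models x 12 per-feature importances
# Mahalanobis metric M = L L^T; transforming by L makes Euclidean
# distance in the new space equal the Mahalanobis distance.
M = np.linalg.inv(np.cov(imp.T) + 1e-6 * np.eye(12))
L = np.linalg.cholesky(M)
labels = KMeans(n_clusters=4, n_init=10).fit_predict(imp @ L)
```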
[774] When Fewer Layers Break More Chains: Layer Pruning Harms Test-Time Scaling in LLMs
Keyu Wang, Tian Lyu, Guinan Su, Jonas Geiping, Lu Yin, Marco Canini, Shiwei Liu
Main category: cs.LG
TL;DR: Layer pruning severely harms test-time scaling in LLMs, causing performance collapse on long-chain reasoning tasks even when general knowledge performance remains stable, and standard fine-tuning cannot recover this capability.
Details
Motivation: To investigate the impact of layer pruning on long-chain reasoning capabilities in LLMs, particularly focusing on test-time scaling mechanisms that are crucial for strong reasoning performance.Method: Extensive experiments analyzing layer pruning effects on test-time scaling, using reasoning benchmarks to evaluate performance degradation and testing standard supervised fine-tuning remedies.
Result: Pruning even 1-2 layers severely impairs test-time scaling, causing drastic performance collapse on long reasoning tasks while knowledge-intensive tasks remain stable. Fine-tuning fails to recover deteriorated scaling.
Conclusion: Layer pruning poses fundamental risks for reasoning-intensive LLMs by damaging test-time scaling mechanisms, calling for rethinking pruning strategies and developing methods that preserve reasoning robustness.
Abstract: Layer pruning has emerged as a widely adopted technique for improving the efficiency of large language models (LLMs). Although existing methods demonstrate strong performance retention on general knowledge tasks, their effect on long-chain reasoning, a more brittle yet crucial capability, remains largely unexplored. In this work, we study the impact of layer pruning on long-chain reasoning through the lens of test-time scaling, a key mechanism in modern LLMs that enables strong reasoning capacity by allocating more computation at inference time. With extensive experiments, we demonstrate that pruning even one or two layers can severely impair test-time scaling, with performance collapsing drastically on long reasoning benchmarks even when performance on knowledge-intensive and shallow reasoning tasks remains stable. Furthermore, we find that standard supervised fine-tuning remedies fail to recover test-time scaling once it has deteriorated. Through in-depth analyses, we identify the mechanisms underlying this fragility of test-time scaling and highlight the fundamental risks of applying layer pruning to reasoning-intensive LLMs. These findings call for a rethinking of layer pruning strategies and provide insights for developing methods that preserve the robustness of reasoning. We open-source the codebase at https://github.com/keyu-wang-2002/Layer-Pruning-Harms-Inference-Scaling.
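Layer pruning itself is mechanically simple, which is part of why the finding matters. A toy on a generic transformer stack (attribute paths vary by model family; this is not tied to any particular LLM):

```python
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model=64, nhead=4)
                        for _ in range(12)])
drop = {8, 9}  # e.g. prune two late layers
layers = nn.ModuleList([l for i, l in enumerate(layers) if i not in drop])

x = torch.randn(10, 2, 64)   # (seq, batch, dim) in the default layout
for layer in layers:
    x = layer(x)
```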
[775] LUNA: Efficient and Topology-Agnostic Foundation Model for EEG Signal Analysis
Berkay Döner, Thorir Mar Ingolfsson, Luca Benini, Yawei Li
Main category: cs.LG
TL;DR: LUNA is a self-supervised foundation model that handles diverse EEG electrode layouts by compressing multi-channel EEG into a fixed-size latent space, enabling efficient processing and strong performance across multiple EEG tasks.
Details
Motivation: Building large-scale EEG models is challenging due to topological heterogeneity - different EEG datasets use different electrode layouts, which limits generalization across studies.Method: LUNA uses learned queries and cross-attention to compress multi-channel EEG into a topology-agnostic latent space, then applies transformer blocks with patch-wise temporal self-attention that operates only on this latent representation, decoupling computation from electrode count.
Result: LUNA achieves state-of-the-art performance on TUAR (0.921 AUROC) and TUSL benchmarks while reducing FLOPs by 300x and GPU memory use by up to 10x. These gains are consistent across all evaluated electrode configurations.
Conclusion: LUNA successfully reconciles disparate EEG electrode geometries through a unified latent representation, enabling efficient and scalable foundation modeling for EEG data while maintaining competitive performance across diverse tasks.
Abstract: Electroencephalography (EEG) offers a non-invasive lens into human brain activity, but building large-scale models is hampered by topological heterogeneity: each public EEG data defines its own electrode layout, limiting generalization. We introduce LUNA (Latent Unified Network Architecture), a self-supervised foundation model that reconciles disparate electrode geometries while scaling linearly – not quadratically – with channel count. LUNA compresses multi-channel EEG into a fixed-size, topology-agnostic latent space via learned queries and cross-attention. Downstream transformer blocks then operate exclusively on this latent representation using patch-wise temporal self-attention, decoupling computation from electrode count. Pre-trained on TUEG and Siena (over 21,000 hours of raw EEG across diverse montages) using a masked-patch reconstruction objective, LUNA transfers effectively to four downstream tasks: abnormality detection, artifact rejection, slowing classification, and emotion recognition. It demonstrates highly competitive performance across several benchmarks, achieving state-of-the-art results on TUAR and TUSL, e.g., 0.921 AUROC on TUAR, while reducing FLOPs by 300x and trimming GPU memory use by up to 10x. Critically, these gains are consistent across all evaluated electrode configurations. Code is available at https://github.com/pulp-bio/BioFoundation
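The topology-agnostic compression can be sketched with learned queries and cross-attention: any number of channel tokens maps to a fixed-size latent, so downstream cost is decoupled from the montage. Dimensions below are illustrative:

```python
import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    def __init__(self, n_latents=16, dim=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, channel_tokens):   # (batch, n_channels, dim)
        q = self.queries.expand(channel_tokens.size(0), -1, -1)
        latents, _ = self.attn(q, channel_tokens, channel_tokens)
        return latents                    # (batch, n_latents, dim), fixed size

comp = LatentCompressor()
z19 = comp(torch.randn(2, 19, 64))   # 19-channel montage
z64 = comp(torch.randn(2, 64, 64))   # 64-channel montage -> same latent shape
```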
[776] Epistemic Deep Learning: Enabling Machine Learning Models to Know When They Do Not Know
Shireen Kudukkil Manchingal
Main category: cs.LG
TL;DR: This thesis introduces Epistemic Deep Learning to address overconfidence in ML models by quantifying epistemic uncertainty, proposing Random-Set Neural Networks (RS-NN) that predict belief functions over class sets, and demonstrating applications in LLMs and autonomous racing.
Details
Motivation: Machine learning models often fail in safety-critical domains due to overconfident predictions on out-of-distribution data, adversarial examples, or environmental changes. The work aims to enable models to recognize their limitations and avoid unreliable decisions when uncertainty is high.Method: Developed Random-Set Neural Network (RS-NN) using random set theory to predict belief functions over class sets, capturing epistemic uncertainty through credal set widths. Applied RS-NN to Large Language Models and weather classification for autonomous racing, with a unified evaluation framework for uncertainty-aware classifiers.
Result: Extensive experiments validated that integrating epistemic awareness mitigates risks of overconfident predictions. The approach enables models to ‘know when they don’t know’, establishing a foundation for more robust and dependable AI systems.
Conclusion: True intelligence involves recognizing and managing knowledge limits. Epistemic Deep Learning represents a paradigm shift where the ability to quantify and act on uncertainty becomes essential for robust, safety-critical AI deployments.
Abstract: Machine learning has achieved remarkable successes, yet its deployment in safety-critical domains remains hindered by an inherent inability to manage uncertainty, resulting in overconfident and unreliable predictions when models encounter out-of-distribution data, adversarial perturbations, or naturally fluctuating environments. This thesis, titled Epistemic Deep Learning: Enabling Machine Learning Models to ‘Know When They Do Not Know’, addresses these critical challenges by advancing the paradigm of Epistemic Artificial Intelligence, which explicitly models and quantifies epistemic uncertainty: the uncertainty arising from limited, biased, or incomplete training data, as opposed to the irreducible randomness of aleatoric uncertainty, thereby empowering models to acknowledge their limitations and refrain from overconfident decisions when uncertainty is high. Central to this work is the development of the Random-Set Neural Network (RS-NN), a novel methodology that leverages random set theory to predict belief functions over sets of classes, capturing the extent of epistemic uncertainty through the width of associated credal sets, as well as applications of RS-NN, including its adaptation to Large Language Models (LLMs) and its deployment in weather classification for autonomous racing. In addition, the thesis proposes a unified evaluation framework for uncertainty-aware classifiers. Extensive experiments validate that integrating epistemic awareness into deep learning not only mitigates the risks associated with overconfident predictions but also lays the foundation for a paradigm shift in artificial intelligence, where the ability to ‘know when it does not know’ becomes a hallmark of robust and dependable systems. The title encapsulates the core philosophy of this work, emphasizing that true intelligence involves recognizing and managing the limits of one’s own knowledge.
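A toy of the random-set output on three classes: a mass over non-empty subsets, with per-class belief/plausibility bounds whose gap plays the role of epistemic uncertainty. The Dirichlet draw stands in for the network's predicted masses:

```python
from itertools import combinations
import numpy as np

classes = [0, 1, 2]
subsets = [frozenset(s) for k in range(1, 4) for s in combinations(classes, k)]
mass = np.random.dirichlet(np.ones(len(subsets)))   # stand-in for RS-NN output

for c in classes:
    bel = sum(m for m, S in zip(mass, subsets) if S == {c})  # lower bound
    pl = sum(m for m, S in zip(mass, subsets) if c in S)     # upper bound
    print(f"class {c}: belief={bel:.2f}, plausibility={pl:.2f}, "
          f"credal width={pl - bel:.2f}")
```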
[777] A Multi-level Analysis of Factors Associated with Student Performance: A Machine Learning Approach to the SAEB Microdata
Rodrigo Tertulino, Ricardo Almeida
Main category: cs.LG
TL;DR: Multi-level ML approach using Random Forest achieves 90.2% accuracy in classifying Brazilian student proficiency, with XAI revealing school socioeconomic level as the dominant predictor over individual factors.
Details
Motivation: To identify factors influencing student performance in Brazil's basic education system to inform effective public policies and address educational equity.Method: Multi-level machine learning approach integrating four data sources (student socioeconomic, teacher profiles, school indicators, director profiles) using ensemble algorithms, with Random Forest selected as optimal model and SHAP for explainable AI.
Result: Random Forest model achieved 90.2% accuracy and 96.7% AUC. SHAP analysis revealed school’s average socioeconomic level is the most dominant predictor, showing systemic factors outweigh individual characteristics.
Conclusion: Academic performance is a systemic phenomenon tied to school ecosystem, requiring policies that address disparities between schools rather than focusing solely on individual student characteristics.
Abstract: Identifying the factors that influence student performance in basic education is a central challenge for formulating effective public policies in Brazil. This study introduces a multi-level machine learning approach to classify the proficiency of 9th-grade and high school students using microdata from the System of Assessment of Basic Education (SAEB). Our model uniquely integrates four data sources: student socioeconomic characteristics, teacher professional profiles, school indicators, and director management profiles. A comparative analysis of four ensemble algorithms confirmed the superiority of a Random Forest model, which achieved 90.2% accuracy and an Area Under the Curve (AUC) of 96.7%. To move beyond prediction, we applied Explainable AI (XAI) using SHAP, which revealed that the school’s average socioeconomic level is the most dominant predictor, demonstrating that systemic factors have a greater impact than individual characteristics in isolation. The primary conclusion is that academic performance is a systemic phenomenon deeply tied to the school’s ecosystem. This study provides a data-driven, interpretable tool to inform policies aimed at promoting educational equity by addressing disparities between schools.
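The modelling recipe in miniature, on synthetic stand-in data (the column semantics are assumptions, not SAEB fields):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(1000, 8)   # stand-in for multi-level SAEB features
y = (X[:, 0] + 0.5 * X[:, 3] + 0.1 * np.random.randn(1000) > 0.8).astype(int)

rf = RandomForestClassifier(n_estimators=200).fit(X, y)
sv = shap.TreeExplainer(rf).shap_values(X)
sv = sv[1] if isinstance(sv, list) else sv[..., 1]   # shap API varies by version
ranking = np.abs(sv).mean(axis=0).argsort()[::-1]    # most influential first
```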
[778] Machine Learning Enabled Early Warning System For Financial Distress Using Real-Time Digital Signals
Laxmi pant, Syed Ali Reza, Md Khalilor Rahman, MD Saifur Rahman, Shamima Sharmin, Md Fazlul Huq Mithu, Kazi Nehal Hasnain, Adnan Farabi, Mahamuda khanom, Raisul Kabir
Main category: cs.LG
TL;DR: Machine learning-based early warning system using real-time digital and macroeconomic signals to detect household financial distress with high accuracy and interpretability.
Details
Motivation: Traditional econometric models use delayed/aggregated data, limiting effectiveness for real-time financial distress detection in unstable economic environments.Method: Uses panel data of 750 households over 13 months, combining socioeconomic attributes, macroeconomic indicators (GDP, inflation, FX), and digital economy measures with feature engineering and ensemble models (XGBoost, LightGBM, Random Forest).
Result: Engineered digital economy features significantly enhance predictive accuracy; system performs well for binary and multi-class classification; inflation volatility and ICT demand identified as key predictors via SHAP.
Conclusion: Demonstrates feasibility of near-real-time financial distress warnings using transparent ML, enabling actionable insights for household resilience and preemptive interventions.
Abstract: The growing instability of both global and domestic economic environments has increased the risk of financial distress at the household level. However, traditional econometric models often rely on delayed and aggregated data, limiting their effectiveness. This study introduces a machine learning-based early warning system that utilizes real-time digital and macroeconomic signals to identify financial distress in near real-time. Using a panel dataset of 750 households tracked over three monitoring rounds spanning 13 months, the framework combines socioeconomic attributes, macroeconomic indicators (such as GDP growth, inflation, and foreign exchange fluctuations), and digital economy measures (including ICT demand and market volatility). Through data preprocessing and feature engineering, we introduce lagged variables, volatility measures, and interaction terms to capture both gradual and sudden changes in financial stability. We benchmark baseline classifiers, such as logistic regression and decision trees, against advanced ensemble models including random forests, XGBoost, and LightGBM. Our results indicate that the engineered features from the digital economy significantly enhance predictive accuracy. The system performs reliably for both binary distress detection and multi-class severity classification, with SHAP-based explanations identifying inflation volatility and ICT demand as key predictors. Crucially, the framework is designed for scalable deployment in national agencies and low-bandwidth regional offices, ensuring it is accessible for policymakers and practitioners. By implementing machine learning in a transparent and interpretable manner, this study demonstrates the feasibility and impact of providing near-real-time early warnings of financial distress. This offers actionable insights that can strengthen household resilience and guide preemptive intervention strategies.
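The feature engineering the framework describes (lags, rolling volatility, interaction terms) is straightforward in pandas; the column names below are illustrative, not the study's schema:

```python
import pandas as pd

df = pd.DataFrame({
    "household": [1, 1, 1, 2, 2, 2],
    "month": [1, 2, 3, 1, 2, 3],
    "inflation": [5.1, 5.4, 6.0, 5.1, 5.4, 6.0],
    "income": [900, 880, 700, 1200, 1150, 1100],
})
g = df.sort_values("month").groupby("household")
df["income_lag1"] = g["income"].shift(1)                           # lagged variable
df["inflation_vol"] = g["inflation"].transform(lambda s: s.rolling(2).std())
df["inc_x_infl"] = df["income"] * df["inflation"]                  # interaction term
```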
[779] Does Homophily Help in Robust Test-time Node Classification?
Yan Jiang, Ruihong Qiu, Zi Huang
Main category: cs.LG
TL;DR: GrapHoST is a test-time graph structural transformation method that improves GNN robustness by adaptively adjusting graph homophily levels without model retraining.
Details
Motivation: Test graphs often suffer from data quality issues and distribution shifts that degrade pre-trained GNN performance. The paper reveals that adjusting homophily levels at test time can significantly improve model robustness.Method: Proposes GrapHoST with a homophily predictor to discriminate test edges, enabling adaptive graph structural transformation based on predicted homophily scores without model updates.
Result: Extensive experiments on nine benchmarks show GrapHoST achieves state-of-the-art performance with up to 10.92% improvement under various test-time data quality issues.
Conclusion: Test-time graph structural transformation through homophily adjustment effectively enhances GNN robustness without requiring model training or updates.
Abstract: Homophily, the tendency of nodes from the same class to connect, is a fundamental property of real-world graphs, underpinning structural and semantic patterns in domains such as citation networks and social networks. Existing methods exploit homophily through designing homophily-aware GNN architectures or graph structure learning strategies, yet they primarily focus on GNN learning with training graphs. However, in real-world scenarios, test graphs often suffer from data quality issues and distribution shifts, such as domain shifts across users from different regions in social networks and temporal evolution shifts in citation network graphs collected over varying time periods. These factors significantly compromise the pre-trained model’s robustness, resulting in degraded test-time performance. With empirical observations and theoretical analysis, we reveal that transforming the test graph structure by increasing homophily in homophilic graphs or decreasing it in heterophilic graphs can significantly improve the robustness and performance of pre-trained GNNs on node classifications, without requiring model training or update. Motivated by these insights, a novel test-time graph structural transformation method grounded in homophily, named GrapHoST, is proposed. Specifically, a homophily predictor is developed to discriminate test edges, facilitating adaptive test-time graph structural transformation by the confidence of predicted homophily scores. Extensive experiments on nine benchmark datasets under a range of test-time data quality issues demonstrate that GrapHoST consistently achieves state-of-the-art performance, with improvements of up to 10.92%. Our code has been released at https://github.com/YanJiangJerry/GrapHoST.
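A rough sketch of test-time structural transformation for a homophilic graph, with embedding cosine similarity standing in for the paper's trained homophily predictor and an arbitrary keep-quantile:

```python
import torch
import torch.nn.functional as F

emb = torch.randn(100, 32)                    # frozen pre-trained node embeddings
edge_index = torch.randint(0, 100, (2, 400))  # test-graph edges, shape (2, E)

score = F.cosine_similarity(emb[edge_index[0]], emb[edge_index[1]], dim=-1)
keep = score > score.quantile(0.3)            # drop likely-heterophilic edges
edge_index = edge_index[:, keep]              # transformed structure, no retraining
```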
[780] Predicting Metabolic Dysfunction-Associated Steatotic Liver Disease using Machine Learning Methods
Mary E. An, Paul Griffin, Jonathan G. Stine, Ramakrishna Balakrishnan, Ram Sriram, Soundar Kumara
Main category: cs.LG
TL;DR: Developed MASER, an interpretable LASSO logistic regression model for MASLD prediction that achieves competitive performance (AUROC 0.84) while addressing fairness across racial/ethnic subgroups through equal opportunity postprocessing.
Details
Motivation: MASLD affects ~33% of US adults and early detection is crucial for prevention through lifestyle interventions. Need for fair, rigorous, and reproducible prediction models to reduce disparities in healthcare.
Method: Evaluated multiple ML models (LASSO, random forest, XGBoost, neural network) using clinical features and SHAP-ranked features. Applied equal opportunity postprocessing to reduce disparities in true positive rates across racial/ethnic subgroups.
Result: LASSO logistic regression with top 10 features selected for interpretability. Before fairness adjustment: AUROC 0.84, accuracy 78%, sensitivity 72%, specificity 79%. After adjustment: accuracy 81%, specificity 94%, but sensitivity decreased to 41% and F1-score to 0.515.
Conclusion: MASER model demonstrates interpretable models can achieve balance between predictive performance and fairness in diverse populations, with competitive performance comparable to ensemble and tree-based models.
Abstract: Background: Metabolic Dysfunction-Associated Steatotic Liver Disease (MASLD) affects ~33% of U.S. adults and is the most common chronic liver disease. Although often asymptomatic, progression can lead to cirrhosis. Early detection is important, as lifestyle interventions can prevent disease progression. We developed a fair, rigorous, and reproducible MASLD prediction model and compared it to prior methods using a large electronic health record database. Methods: We evaluated LASSO logistic regression, random forest, XGBoost, and a neural network for MASLD prediction using clinical feature subsets, including the top 10 SHAP-ranked features. To reduce disparities in true positive rates across racial and ethnic subgroups, we applied an equal opportunity postprocessing method. Results: This study included 59,492 patients in the training data, 24,198 in the validation data, and 25,188 in the test data. The LASSO logistic regression model with the top 10 features was selected for its interpretability and comparable performance. Before fairness adjustment, the model achieved AUROC of 0.84, accuracy of 78%, sensitivity of 72%, specificity of 79%, and F1-score of 0.617. After equal opportunity postprocessing, accuracy modestly increased to 81% and specificity to 94%, while sensitivity decreased to 41% and F1-score to 0.515, reflecting the fairness trade-off. Conclusions: We developed the MASER prediction model (MASLD Static EHR Risk Prediction), a LASSO logistic regression model which achieved competitive performance for MASLD prediction (AUROC 0.836, accuracy 77.6%), comparable to previously reported ensemble and tree-based models. Overall, this approach demonstrates that interpretable models can achieve a balance of predictive performance and fairness in diverse patient populations.
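For readers unfamiliar with equal opportunity postprocessing, the sketch below shows one common way to realize it: a per-subgroup decision threshold chosen so every subgroup attains roughly the same true positive rate. The function names and target TPR are ours, not the paper's:

```python
import numpy as np

def equal_opportunity_thresholds(scores, y_true, groups, target_tpr=0.72):
    """Pick a per-subgroup threshold so each subgroup's TPR ~= target_tpr."""
    thresholds = {}
    for g in np.unique(groups):
        pos = scores[(groups == g) & (y_true == 1)]
        # fraction of positives scoring above this threshold ~= target_tpr
        thresholds[g] = np.quantile(pos, 1.0 - target_tpr)
    return thresholds

def fair_predict(scores, groups, thresholds):
    """Apply the group-specific thresholds at inference time."""
    return np.array([int(s >= thresholds[g]) for s, g in zip(scores, groups)])
```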
[781] AnyECG-Lab: An Exploration Study of Fine-tuning an ECG Foundation Model to Estimate Laboratory Values from Single-Lead ECG Signals
Yujie Xiao, Gongzhen Tang, Wenhui Liu, Jun Li, Guangkun Nie, Zhuoran Kan, Deyun Zhang, Qinghao Zhao, Shenda Hong
Main category: cs.LG
TL;DR: This paper explores using transfer learning with ECGFounder, a pre-trained ECG foundation model, to non-invasively estimate laboratory values from ECG signals, achieving varying levels of predictive performance across different laboratory indicators.
Details
Motivation: Current laboratory value access relies on invasive venous sampling with inherent delays, while ECG offers a non-invasive, widely available alternative for rapid estimation of hematological parameters.
Method: Used transfer learning to fine-tune ECGFounder on the MC-MED dataset, generating over 20 million standardized ten-second ECG segments to enhance sensitivity to biochemical correlates.
Result: The model showed strong predictive performance (AUC >0.65) for 33 lab indicators, moderate performance (0.55-0.65) for 59 indicators, and limited performance (<0.55) for 16 indicators on internal validation.
Conclusion: The study provides an efficient AI-driven solution and establishes feasibility for real-time, non-invasive estimation of laboratory values using ECG signals.
Abstract: Timely access to laboratory values is critical for clinical decision-making, yet current approaches rely on invasive venous sampling and are intrinsically delayed. Electrocardiography (ECG), as a non-invasive and widely available signal, offers a promising modality for rapid laboratory estimation. Recent progress in deep learning has enabled the extraction of latent hematological signatures from ECGs. However, existing models are constrained by low signal-to-noise ratios, substantial inter-individual variability, limited data diversity, and suboptimal generalization, especially when adapted to low-lead wearable devices. In this work, we conduct an exploratory study leveraging transfer learning to fine-tune ECGFounder, a large-scale pre-trained ECG foundation model, on the Multimodal Clinical Monitoring in the Emergency Department (MC-MED) dataset from Stanford. We generated a corpus of more than 20 million standardized ten-second ECG segments to enhance sensitivity to subtle biochemical correlates. On internal validation, the model demonstrated strong predictive performance (area under the curve above 0.65) for thirty-three laboratory indicators, moderate performance (between 0.55 and 0.65) for fifty-nine indicators, and limited performance (below 0.55) for sixteen indicators. This study provides an efficient artificial-intelligence-driven solution and establishes the feasibility scope for real-time, non-invasive estimation of laboratory values.
[782] LacMaterial: Large Language Models as Analogical Chemists for Materials Discovery
Hongyu Guo
Main category: cs.LG
TL;DR: LLMs can generate novel battery materials using analogical reasoning strategies that retrieve cross-domain analogs and construct in-domain templates, outperforming standard prompting methods.
Details
Motivation: Human analogical reasoning in science is limited by domain expertise and surface-level biases, while LLMs trained on cross-domain data offer potential for deeper structure-driven analogies.
Method: Two strategies: (1) retrieving cross-domain analogs and analogy-guided exemplars to go beyond conventional dopant substitutions, (2) constructing in-domain analogical templates from few labeled examples for targeted exploitation.
Result: The analogical reasoning approaches generated candidates outside established compositional spaces and outperformed standard prompting baselines.
Conclusion: LLMs can serve as interpretable, expert-like hypothesis generators that use analogy-driven generalization for scientific innovation.
Abstract: Analogical reasoning, the transfer of relational structures across contexts (e.g., planet is to sun as electron is to nucleus), is fundamental to scientific discovery. Yet human insight is often constrained by domain expertise and surface-level biases, limiting access to deeper, structure-driven analogies both within and across disciplines. Large language models (LLMs), trained on vast cross-domain data, present a promising yet underexplored tool for analogical reasoning in science. Here, we demonstrate that LLMs can generate novel battery materials by (1) retrieving cross-domain analogs and analogy-guided exemplars to steer exploration beyond conventional dopant substitutions, and (2) constructing in-domain analogical templates from few labeled examples to guide targeted exploitation. These explicit analogical reasoning strategies yield candidates outside established compositional spaces and outperform standard prompting baselines. Our findings position LLMs as interpretable, expert-like hypothesis generators that leverage analogy-driven generalization for scientific innovation.
[783] Monitoring State Transitions in Markovian Systems with Sampling Cost
Kumar Saurav, Ness B. Shroff, Yingbin Liang
Main category: cs.LG
TL;DR: Analysis of a greedy query policy for state tracking where a monitor predicts node states and queries only when prediction uncertainty is high, showing it’s suboptimal but performs well under common conditions, with a learning variant proposed for unknown transition probabilities.
Details
Motivation: To balance the tradeoff between query costs and prediction accuracy in state tracking systems, where frequent queries are expensive but predictions may be inaccurate.
Method: Greedy policy that queries when expected prediction loss exceeds query cost; analysis in Markovian setting; PSGD-based learning variant for unknown transition probabilities.
Result: Greedy policy is suboptimal with unbounded competitive ratio in general, but performs close to optimal under identically distributed transition probabilities; learning variant achieves favorable tradeoff with improved efficiency.
Conclusion: Greedy query policies can be effective for state tracking despite suboptimality, especially with learning approaches for unknown system parameters.
Abstract: We consider a node-monitor pair, where the node’s state varies with time. The monitor needs to track the node’s state at all times; however, there is a fixed cost for each state query. So the monitor may instead predict the state using time-series forecasting methods, including time-series foundation models (TSFMs), and query only when prediction uncertainty is high. Since query decisions influence prediction accuracy, determining when to query is nontrivial. A natural approach is a greedy policy that predicts when the expected prediction loss is below the query cost and queries otherwise. We analyze this policy in a Markovian setting, where the optimal (OPT) strategy is a state-dependent threshold policy minimizing the time-averaged sum of query cost and prediction losses. We show that, in general, the greedy policy is suboptimal and can have an unbounded competitive ratio, but under common conditions such as identically distributed transition probabilities, it performs close to OPT. For the case of unknown transition probabilities, we further propose a projected stochastic gradient descent (PSGD)-based learning variant of the greedy policy, which achieves a favorable predict-query tradeoff with improved computational efficiency compared to OPT.
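The greedy rule itself fits in a few lines. The sketch below simulates it for a finite-state Markov chain; the loss matrix, variable names, and belief-tracking details are our illustrative choices, not the paper's exact setup:

```python
import numpy as np

def greedy_monitor(P, loss, query_cost, T, s0=0, seed=0):
    """Simulate the greedy policy: predict the node's state from the chain's
    transition matrix P, and query only when the expected prediction loss
    exceeds the query cost. loss[i, j] = loss of predicting j when truth is i."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    belief = np.eye(n)[s0]                      # monitor's belief over the state
    state, total = s0, 0.0
    for _ in range(T):
        state = rng.choice(n, p=P[state])       # the node's state evolves
        belief = belief @ P                     # one-step predictive belief
        guess = int(np.argmin(belief @ loss))   # loss-minimizing prediction
        if belief @ loss[:, guess] > query_cost:
            total += query_cost                 # greedy rule: query instead
            belief = np.eye(n)[state]           # query reveals the true state
        else:
            total += loss[state, guess]
    return total / T                            # time-averaged cost
```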
[784] Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders
Mengyu Ye, Jun Suzuki, Tatsuro Inaba, Tatsuki Kuribayashi
Main category: cs.LG
TL;DR: This paper compares the interpretability of features learned by sparse auto-encoders (SAEs) versus those already present in feed-forward (FF) layers of LLMs, finding that FF layers serve as strong baselines with comparable interpretability to SAEs.
Details
Motivation: To systematically compare whether learned features from proxy modules like SAEs have better properties than features already represented in original model parameters (FF layers), as few studies have made such comparisons.
Method: Revisiting interpretability of feature vectors in FF layers (viewed as key-value memories) using modern interpretability benchmarks, with extensive evaluation comparing SAEs and FFs.
Result: SAEs and FFs show a similar range of interpretability, with SAEs offering only minimal improvement in some aspects; surprisingly, vanilla FFs sometimes yield better interpretability than SAEs. The features discovered in SAEs and FFs diverged.
Conclusion: Questions the advantage of SAEs over directly interpreting FF feature vectors, and establishes FF key-value parameters as strong baselines in modern interpretability research.
Abstract: Recent interpretability work on large language models (LLMs) has been increasingly dominated by a feature-discovery approach that relies on proxy modules: the quality of features learned by, e.g., sparse auto-encoders (SAEs) is then evaluated. This paradigm naturally raises a critical question: do such learned features have better properties than those already represented within the original model parameters? Unfortunately, only a few studies have made such comparisons systematically so far. In this work, we revisit the interpretability of feature vectors stored in feed-forward (FF) layers, viewed as key-value memories, using modern interpretability benchmarks. Our extensive evaluation revealed that SAEs and FFs exhibit a similar range of interpretability, although SAEs displayed an observable but minimal improvement in some aspects. Furthermore, in certain aspects, surprisingly, even vanilla FFs yielded better interpretability than the SAEs, and the features discovered in SAEs and FFs diverged. These findings call into question the advantage of SAEs, in terms of both feature quality and faithfulness, over directly interpreting FF feature vectors, and establish FF key-value parameters as a strong baseline in modern interpretability research.
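As a reference point, the "FF as key-value memories" reading the paper revisits can be reproduced in a few lines on GPT-2: take a value vector from the second MLP matrix, project it onto the token embeddings, and inspect the top tokens. Layer and neuron indices here are arbitrary:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

layer, neuron = 8, 123                    # arbitrary choices for illustration
with torch.no_grad():
    # each row of c_proj.weight is one FF "value vector" of size hidden_dim
    value_vec = model.transformer.h[layer].mlp.c_proj.weight[neuron]
    logits = value_vec @ model.transformer.wte.weight.T   # read in vocab space
    top = torch.topk(logits, 10).indices
print([tok.decode(i) for i in top])       # tokens the value vector promotes
```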
[785] Uncertainty quantification in model discovery by distilling interpretable material constitutive models from Gaussian process posteriors
David Anton, Henning Wessels, Ulrich Römer, Alexander Henkes, Jorge-Humberto Urrea-Quintero
Main category: cs.LG
TL;DR: A four-step partially Bayesian framework for uncertainty quantification in constitutive model discovery that handles noisy data without requiring prior parameter selection and works for both linear and non-linear models.
Details
Motivation: Existing methods for uncertainty quantification in model discovery either require prior selection for material parameters, are limited to linear coefficients, or have inflexible parameter distributions. Noise in mechanical test data induces uncertainties that need proper handling.
Method: Four-step framework: 1) Augment stress-deformation data with Gaussian process, 2) Approximate parameter distribution using normalizing flow for complex joint distributions, 3) Distill parameter distribution by matching stress-deformation function distributions with Gaussian process posterior, 4) Perform Sobol’ sensitivity analysis for sparse and interpretable models.
Result: The framework demonstrates capability for both isotropic and anisotropic experimental data, as well as linear and non-linear model libraries.
Conclusion: The proposed partially Bayesian framework successfully addresses limitations of previous methods by enabling uncertainty quantification without prior parameter selection and supporting discovery of non-linear constitutive models with flexible parameter distributions.
Abstract: Constitutive model discovery refers to the task of identifying an appropriate model structure, usually from a predefined model library, while simultaneously inferring its material parameters. The data used for model discovery are measured in mechanical tests and are thus inevitably affected by noise which, in turn, induces uncertainties. Previously proposed methods for uncertainty quantification in model discovery either require the selection of a prior for the material parameters, are restricted to the linear coefficients of the model library or are limited in the flexibility of the inferred parameter probability distribution. We therefore propose a four-step partially Bayesian framework for uncertainty quantification in model discovery that does not require prior selection for the material parameters and also allows for the discovery of non-linear constitutive models: First, we augment the available stress-deformation data with a Gaussian process. Second, we approximate the parameter distribution by a normalizing flow, which allows for capturing complex joint distributions. Third, we distill the parameter distribution by matching the distribution of stress-deformation functions induced by the parameters with the Gaussian process posterior. Fourth, we perform a Sobol’ sensitivity analysis to obtain a sparse and interpretable model. We demonstrate the capability of our framework for both isotropic and anisotropic experimental data as well as linear and non-linear model libraries.
[786] Mapping Faithful Reasoning in Language Models
Jiazheng Li, Andreas Damianou, J Rosser, José Luis Redondo García, Konstantina Palla
Main category: cs.LG
TL;DR: Concept Walk is a framework that traces how a model’s internal stance evolves during reasoning by projecting activations onto concept directions, helping identify when chain-of-thought reasoning is faithful versus decorative.
Details
Motivation: Chain-of-thought traces are not always faithful reflections of internal computation, which can mislead practitioners into trusting decorative reasoning as genuine.
Method: Concept Walk operates in activation space by projecting each reasoning step onto concept directions learned from contrastive data, allowing observation of whether reasoning traces shape outcomes or are discarded.
Result: In ‘easy’ cases, perturbed CoTs are quickly ignored (decorative reasoning), while in ‘hard’ cases, perturbations induce sustained shifts in internal activations (faithful reasoning).
Conclusion: Concept Walk provides a methodological lens to re-examine faithfulness through concept-specific internal dynamics, helping identify when reasoning traces can be trusted versus when they risk misleading practitioners.
Abstract: Chain-of-thought (CoT) traces promise transparency for reasoning language models, but prior work shows they are not always faithful reflections of internal computation. This raises challenges for oversight: practitioners may misinterpret decorative reasoning as genuine. We introduce Concept Walk, a general framework for tracing how a model’s internal stance evolves with respect to a concept direction during reasoning. Unlike surface text, Concept Walk operates in activation space, projecting each reasoning step onto the concept direction learned from contrastive data. This allows us to observe whether reasoning traces shape outcomes or are discarded. As a case study, we apply Concept Walk to the domain of Safety using Qwen 3-4B. We find that in ‘easy’ cases, perturbed CoTs are quickly ignored, indicating decorative reasoning, whereas in ‘hard’ cases, perturbations induce sustained shifts in internal activations, consistent with faithful reasoning. The contribution is methodological: Concept Walk provides a lens to re-examine faithfulness through concept-specific internal dynamics, helping identify when reasoning traces can be trusted and when they risk misleading practitioners.
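The core mechanics of Concept Walk reduce to a projection. A minimal sketch, assuming the concept direction is fit as a difference of means over contrastive activation sets (the paper may learn it differently):

```python
import numpy as np

def concept_direction(pos_acts, neg_acts):
    """Unit vector from contrastive activation sets: mean(pos) - mean(neg)."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def concept_walk(step_activations, direction):
    """Stance trace: projection of each reasoning step's hidden state onto
    the concept direction. A sustained shift after perturbing the CoT points
    to faithful reasoning; a trace that snaps back points to decorative CoT."""
    return np.array([h @ direction for h in step_activations])
```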
[787] Bias Begins with Data: The FairGround Corpus for Robust and Reproducible Research on Algorithmic Fairness
Jan Simson, Alessandro Fabris, Cosima Fröhner, Frauke Kreuter, Christoph Kern
Main category: cs.LG
TL;DR: FairGround is a unified framework and dataset corpus with 44 tabular datasets for advancing reproducible research in fair machine learning classification, addressing limitations in current datasets used for bias investigation.
Details
Motivation: Current fair ML research relies on narrow, arbitrarily chosen datasets that are inconsistently processed and lack diversity, undermining generalizability and reproducibility of results.
Method: Developed FairGround framework with 44 tabular datasets annotated with fairness-relevant metadata, accompanied by a Python package that standardizes dataset loading, preprocessing, transformation, and splitting.
Result: Created a diverse and well-documented dataset corpus with robust tooling that streamlines experimental workflows for fair ML research.
Conclusion: FairGround enables development of fairer, more reliable, and more reproducible ML models by providing comprehensive resources for open and collaborative research in fair ML classification.
Abstract: As machine learning (ML) systems are increasingly adopted in high-stakes decision-making domains, ensuring fairness in their outputs has become a central challenge. At the core of fair ML research are the datasets used to investigate bias and develop mitigation strategies. Yet, much of the existing work relies on a narrow selection of datasets–often arbitrarily chosen, inconsistently processed, and lacking in diversity–undermining the generalizability and reproducibility of results. To address these limitations, we present FairGround: a unified framework, data corpus, and Python package aimed at advancing reproducible research and critical data studies in fair ML classification. FairGround currently comprises 44 tabular datasets, each annotated with rich fairness-relevant metadata. Our accompanying Python package standardizes dataset loading, preprocessing, transformation, and splitting, streamlining experimental workflows. By providing a diverse and well-documented dataset corpus along with robust tooling, FairGround enables the development of fairer, more reliable, and more reproducible ML models. All resources are publicly available to support open and collaborative research.
[788] Label Smoothing Improves Gradient Ascent in LLM Unlearning
Zirui Pang, Hao Zheng, Zhijie Deng, Ling Li, Zixin Zhong, Jiaheng Wei
Main category: cs.LG
TL;DR: The paper proposes Smoothed Gradient Ascent (SGA) to address instability issues in LLM unlearning, combining forget data with normal data through a tunable smoothing rate for more stable unlearning while preserving model utility.
Details
Motivation: Gradient Ascent (GA) for LLM unlearning suffers from severe instability and degraded model utility when forcing models to forget hazardous knowledge, requiring a more stable approach.
Method: SGA combines forget data with multiple constructed normal data using a tunable smoothing rate, extending GA from learning solely on forget data to joint learning across both datasets.
Result: SGA consistently outperforms original GA across all metrics and achieves top-2 performance among all baseline methods on several key metrics in evaluations on TOFU, Harry Potter, and MUSE-NEWS benchmarks.
Conclusion: SGA provides a more stable and effective approach for LLM unlearning while better preserving model utility compared to traditional Gradient Ascent methods.
Abstract: LLM unlearning has emerged as a promising approach, aiming to enable models to forget hazardous/undesired knowledge at low cost while preserving as much model utility as possible. Among existing techniques, the most straightforward method is performing Gradient Ascent (GA) w.r.t. the forget data, thereby forcing the model to unlearn the forget dataset. However, GA suffers from severe instability, as it drives updates in a divergent direction, often resulting in drastically degraded model utility. To address this issue, we propose Smoothed Gradient Ascent (SGA). SGA combines the forget data with multiple constructed normal data through a tunable smoothing rate. Intuitively, this extends GA from learning solely on the forget data to jointly learning across both forget and normal data, enabling more stable unlearning while better preserving model utility. Theoretically, we provide guidance on the selection of the optimal smoothing rate. Empirically, we evaluate SGA on three benchmarks: TOFU, Harry Potter, and MUSE-NEWS. Experimental results demonstrate that SGA consistently outperforms the original Gradient Ascent (GA) method across all metrics and achieves top-2 performance among all baseline methods on several key metrics.
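A minimal sketch of the SGA objective as we read it, assuming a Hugging Face-style model whose forward pass returns a `.loss`; the exact mixing form and smoothing-rate schedule in the paper may differ:

```python
import torch

def sga_loss(model, forget_batch, normal_batches, smoothing_rate=0.25):
    """Smoothed Gradient Ascent, sketched: ascend on the forget data while
    descending on constructed normal data, mixed by the smoothing rate."""
    ga_term = -model(**forget_batch).loss                       # ascent on forget set
    normal_term = torch.stack(
        [model(**b).loss for b in normal_batches]).mean()       # descent on normal data
    return (1 - smoothing_rate) * ga_term + smoothing_rate * normal_term
```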
[789] Dynamic Dropout: Leveraging Conway’s Game of Life for Neural Networks Regularization
David Freire-Obregón, José Salas-Cáceres, Modesto Castrillón-Santana
Main category: cs.LG
TL;DR: Proposes replacing dropout with Conway’s Game of Life for dynamic unit deactivation in neural networks, achieving comparable performance to traditional dropout on CIFAR-10 while providing pattern visualization.
Details
Motivation: Dropout has limitations including static nature and lack of interpretability. The authors aim to create a more dynamic and interpretable regularization approach.
Method: Represent neural network units as cells in a Game of Life grid and apply the game’s rules to dynamically deactivate units during training, allowing spatial patterns to emerge and adapt to training data.
Result: Dynamic unit deactivation using Game of Life achieves comparable performance to traditional dropout techniques on CIFAR-10 dataset, with the added benefit of pattern visualization for interpretability.
Conclusion: The Game of Life approach provides an effective alternative to dropout that maintains performance while offering dynamic adaptation and interpretability through pattern visualization, with applicability to deeper architectures.
Abstract: Regularization techniques play a crucial role in preventing overfitting and improving the generalization performance of neural networks. Dropout, a widely used regularization technique, randomly deactivates units during training to introduce redundancy and prevent co-adaptation among neurons. Despite its effectiveness, dropout has limitations, such as its static nature and lack of interpretability. In this paper, we propose a novel approach to regularization by substituting dropout with Conway’s Game of Life (GoL), a cellular automaton with simple rules that govern the evolution of a grid of cells. We introduce dynamic unit deactivation during training by representing neural network units as cells in a GoL grid and applying the game’s rules to deactivate units. This approach allows for the emergence of spatial patterns that adapt to the training data, potentially enhancing the network’s ability to generalize. We demonstrate the effectiveness of our approach on the CIFAR-10 dataset, showing that dynamic unit deactivation using GoL achieves comparable performance to traditional dropout techniques while offering insights into the network’s behavior through the visualization of evolving patterns. Furthermore, our discussion highlights the applicability of our proposal to deeper architectures, demonstrating how it enhances the performance of different dropout techniques.
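The mechanism is easy to reproduce: evolve a GoL grid alongside training and use live cells as the keep-mask. A small self-contained sketch, where the grid shape and initial density are illustrative:

```python
import numpy as np

def gol_step(grid):
    """One Game of Life update on a torus: a cell lives next step iff it has
    exactly 3 live neighbors, or is alive with exactly 2."""
    neighbors = sum(np.roll(np.roll(grid, dr, 0), dc, 1)
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0))
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(np.uint8)

grid = (np.random.rand(16, 16) < 0.5).astype(np.uint8)   # a 256-unit layer
for step in range(3):
    grid = gol_step(grid)
    mask = grid.flatten()          # live cells = units kept active this step
    print(f"step {step}: {mask.mean():.0%} of units kept")
```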
[790] Knowledge-guided Continual Learning for Behavioral Analytics Systems
Yasas Senarath, Hemant Purohit
Main category: cs.LG
TL;DR: A novel augmentation-based approach that incorporates external knowledge into replay-based continual learning to address data drift and catastrophic forgetting in behavioral analytics systems.
Details
Motivation: User behavior on online platforms evolves over time, causing data drift that reduces model performance. Fine-tuning models with new data leads to catastrophic forgetting, while replay-based approaches are limited by fixed buffer sizes.
Method: Proposed an augmentation-based approach that integrates external knowledge bases into replay-based continual learning frameworks to overcome buffer size limitations through data augmentation.
Result: Evaluation with three deviant behavior classification datasets shows that the augmentation approach helps outperform baseline replay-based continual learning methods.
Conclusion: Incorporating external knowledge through augmentation in replay-based continual learning effectively addresses data drift and catastrophic forgetting, improving model performance over time.
Abstract: User behavior on online platforms is evolving, reflecting real-world changes in how people post, whether it’s helpful messages or hate speech. Models that learn to capture this content can experience a decrease in performance over time due to data drift, which can lead to ineffective behavioral analytics systems. However, fine-tuning such a model over time with new data can be detrimental due to catastrophic forgetting. Replay-based approaches in continual learning offer a simple yet efficient method to update such models, minimizing forgetting by maintaining a buffer of important training instances from past learned tasks. However, the main limitation of this approach is the fixed size of the buffer. External knowledge bases can be utilized to overcome this limitation through data augmentation. We propose a novel augmentation-based approach to incorporate external knowledge in the replay-based continual learning framework. We evaluate several strategies with three datasets from prior studies related to deviant behavior classification to assess the integration of external knowledge in continual learning and demonstrate that augmentation helps outperform baseline replay-based approaches.
[791] Low-Precision Streaming PCA
Sanjoy Dasgupta, Syamantak Kumar, Shourya Pandey, Purnamrita Sarkar
Main category: cs.LG
TL;DR: The paper establishes information-theoretic lower bounds for quantization resolution in streaming PCA and analyzes Oja’s algorithm with linear/nonlinear stochastic quantization, showing batched versions achieve near-optimal performance.
Details
Motivation: To understand the fundamental limits of low-precision streaming PCA and develop efficient quantization methods that maintain accuracy while reducing computational precision requirements.
Method: Analyzes Oja’s algorithm for streaming PCA with linear and nonlinear stochastic quantization of weight vectors and updates, using information-theoretic bounds and theoretical analysis under moment and spectral-gap assumptions.
Result: Batched versions achieve the information-theoretic lower bound up to logarithmic factors, with nonlinear quantization achieving nearly dimension-free quantization error. Empirical results validate theoretical findings.
Conclusion: Low-precision streaming PCA with proper quantization schemes can closely match standard Oja’s algorithm performance while significantly reducing precision requirements, with nonlinear quantization offering superior dimension-free properties.
Abstract: Low-precision streaming PCA estimates the top principal component in a streaming setting under limited precision. We establish an information-theoretic lower bound on the quantization resolution required to achieve a target accuracy for the leading eigenvector. We study Oja’s algorithm for streaming PCA under linear and nonlinear stochastic quantization. The quantized variants use unbiased stochastic quantization of the weight vector and the updates. Under mild moment and spectral-gap assumptions on the data distribution, we show that a batched version achieves the lower bound up to logarithmic factors under both schemes. This leads to a nearly dimension-free quantization error in the nonlinear quantization setting. Empirical evaluations on synthetic streams validate our theoretical findings and demonstrate that our low-precision methods closely track the performance of standard Oja’s algorithm.
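To illustrate the linear scheme, here is a toy unbiased stochastic quantizer and a quantized Oja iteration. This is a caricature of the paper's batched algorithm, with our own resolution and learning-rate choices; the renormalization step is kept in full precision for simplicity:

```python
import numpy as np

def stochastic_quantize(v, res=2**-8):
    """Unbiased stochastic rounding to the grid {k * res}: round up with
    probability equal to the fractional part, so E[Q(v)] = v exactly."""
    scaled = v / res
    low = np.floor(scaled)
    return (low + (np.random.rand(*v.shape) < scaled - low)) * res

def quantized_oja(stream, dim, lr=0.05, res=2**-8):
    """Toy low-precision Oja iteration: both the rank-one update and the
    stored iterate are stochastically quantized at resolution `res`."""
    w = stochastic_quantize(np.random.randn(dim) / np.sqrt(dim), res)
    for x in stream:                           # x: one data vector at a time
        w = stochastic_quantize(w + lr * (x @ w) * x, res)
        w = w / np.linalg.norm(w)              # full-precision renormalization
    return w
```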
[792] SmartMixed: A Two-Phase Training Strategy for Adaptive Activation Function Learning in Neural Networks
Amin Omidvar
Main category: cs.LG
TL;DR: SmartMixed is a two-phase training strategy that enables neural networks to learn optimal per-neuron activation functions from a pool of candidates (ReLU, Sigmoid, Tanh, Leaky ReLU, ELU, SELU), maintaining computational efficiency at inference.
Details
Motivation: Most neural architectures use fixed, uniform activation functions across all neurons, which may not be optimal. The paper aims to allow networks to learn specialized activation functions per neuron while preserving efficiency.
Method: Two-phase training: Phase 1 uses differentiable hard-mixture mechanism for neurons to select from candidate activation functions. Phase 2 fixes each neuron’s activation based on learned selection, enabling efficient inference and continued training with vectorized operations.
Result: Evaluation on MNIST with feedforward networks shows neurons in different layers exhibit distinct preferences for activation functions, revealing functional diversity within neural architectures.
Conclusion: SmartMixed successfully enables networks to learn specialized per-neuron activation functions while maintaining computational efficiency, providing insights into activation function diversity across network layers.
Abstract: The choice of activation function plays a critical role in neural networks, yet most architectures still rely on fixed, uniform activation functions across all neurons. We introduce SmartMixed, a two-phase training strategy that allows networks to learn optimal per-neuron activation functions while preserving computational efficiency at inference. In the first phase, neurons adaptively select from a pool of candidate activation functions (ReLU, Sigmoid, Tanh, Leaky ReLU, ELU, SELU) using a differentiable hard-mixture mechanism. In the second phase, each neuron’s activation function is fixed according to the learned selection, resulting in a computationally efficient network that supports continued training with optimized vectorized operations. We evaluate SmartMixed on the MNIST dataset using feedforward neural networks of varying depths. The analysis shows that neurons in different layers exhibit distinct preferences for activation functions, providing insights into the functional diversity within neural architectures.
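A hedged sketch of the phase-1 layer, using a straight-through hard mixture as the differentiable selection mechanism (our choice; the paper's exact mechanism may differ):

```python
import torch
import torch.nn.functional as F

ACTS = [F.relu, torch.sigmoid, torch.tanh,
        lambda x: F.leaky_relu(x, 0.01), F.elu, F.selu]

class SmartMixedLayer(torch.nn.Module):
    """Each neuron holds logits over the candidate activations and picks one
    via a hard mixture that stays differentiable through the soft weights."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = torch.nn.Linear(in_dim, out_dim)
        self.logits = torch.nn.Parameter(torch.zeros(out_dim, len(ACTS)))

    def forward(self, x):
        z = self.linear(x)                                  # (batch, out_dim)
        cands = torch.stack([a(z) for a in ACTS], dim=-1)   # (batch, out_dim, K)
        soft = F.softmax(self.logits, dim=-1)               # (out_dim, K)
        hard = F.one_hot(soft.argmax(-1), len(ACTS)).float()
        mix = hard + soft - soft.detach()                   # straight-through
        return (cands * mix).sum(-1)

    def freeze(self):
        """Phase 2: fix each neuron's activation to its learned choice."""
        return self.logits.argmax(-1)                       # indices into ACTS
```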
[793] GraphTOP: Graph Topology-Oriented Prompting for Graph Neural Networks
Xingbo Fu, Zhenyu Lei, Zihan Chen, Binchi Zhang, Chuxu Zhang, Jundong Li
Main category: cs.LG
TL;DR: GraphTOP is the first topology-oriented graph prompting framework that adapts pre-trained GNNs by modifying graph topology through edge rewiring in local subgraphs, outperforming feature-oriented methods.
Details
Motivation: Existing graph prompting methods focus only on feature modifications but overlook topology changes, leading to suboptimal performance. There's a need to explore topology-oriented prompting to better adapt pre-trained GNNs.
Method: Reformulates topology-oriented prompting as edge rewiring in multi-hop local subgraphs, relaxes it into continuous probability space via reparameterization while maintaining graph sparsity and tight relaxation.
Result: Extensive experiments on five graph datasets with four pre-training strategies show GraphTOP outperforms six baselines on multiple node classification tasks.
Conclusion: GraphTOP demonstrates the effectiveness of topology-oriented prompting for adapting pre-trained GNNs, providing a superior alternative to feature-only approaches.
Abstract: Graph Neural Networks (GNNs) have revolutionized the field of graph learning by learning expressive graph representations from massive graph data. As a common pattern to train powerful GNNs, the “pre-training, adaptation” scheme first pre-trains GNNs over unlabeled graph data and subsequently adapts them to specific downstream tasks. In the adaptation phase, graph prompting is an effective strategy that modifies input graph data with learnable prompts while keeping pre-trained GNN models frozen. Typically, existing graph prompting studies mainly focus on feature-oriented methods that apply graph prompts to node features or hidden representations. However, these studies often achieve suboptimal performance, as they consistently overlook the potential of topology-oriented prompting, which adapts pre-trained GNNs by modifying the graph topology. In this study, we conduct a pioneering investigation of graph prompting in terms of graph topology. We propose the first Graph Topology-Oriented Prompting (GraphTOP) framework to effectively adapt pre-trained GNN models for downstream tasks. More specifically, we reformulate topology-oriented prompting as an edge rewiring problem within multi-hop local subgraphs and relax it into the continuous probability space through reparameterization while ensuring tight relaxation and preserving graph sparsity. Extensive experiments on five graph datasets under four pre-training strategies demonstrate that our proposed GraphTOP outshines six baselines on multiple node classification datasets. Our code is available at https://github.com/xbfu/GraphTOP.
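The continuous relaxation step can be illustrated with a binary-concrete (Gumbel-sigmoid) sample per candidate edge; this is our reading of the reparameterization, not the paper's verbatim formulation:

```python
import torch

def sample_rewired_edges(edge_logits, temperature=0.5):
    """Relax keep/drop decisions for candidate edges in a local subgraph:
    discrete edges on the forward pass, soft gradients on the backward."""
    u = torch.rand_like(edge_logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log(1 - u)              # Logistic(0, 1) noise
    soft = torch.sigmoid((edge_logits + noise) / temperature)
    hard = (soft > 0.5).float()                          # keeps the graph sparse
    return hard + soft - soft.detach()                   # straight-through
```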
[794] Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints
Jing Yang, Kaitong Cai, Yijia Fan, Yufeng Yang, Keze Wang
Main category: cs.LG
TL;DR: GradLite is a memory-efficient optimizer for LLMs that uses low-rank Jacobian approximation and error-feedback to enable training with discarded activations while maintaining convergence guarantees.
Details
Motivation: Full fine-tuning of LLMs is memory-intensive due to exact gradient requirements. Existing solutions modify architecture or trade memory for computation, but don't address the optimizer itself.
Method: Uses low-rank Jacobian approximation to reduce backpropagation dimensionality and error-feedback correction to accumulate and compensate approximation errors across iterations.
Result: Reduces optimizer-state and activation memory by up to 50% without architectural changes, achieves comparable or better performance on reasoning, multilingual, and dialogue benchmarks compared to baselines.
Conclusion: GradLite provides an effective optimizer-centric approach to memory-efficient LLM training while maintaining theoretical convergence guarantees and practical performance.
Abstract: Full fine-tuning of Large Language Models (LLMs) is notoriously memory-intensive, primarily because conventional optimizers such as SGD or Adam assume access to exact gradients derived from cached activations. Existing solutions either alter the model architecture (e.g., reversible networks) or trade memory for computation (e.g., activation checkpointing), but the optimizer itself remains untouched. In this work, we introduce GradLite, a backward-friendly optimizer that relaxes the requirement of exact gradients, enabling efficient training even when intermediate activations are aggressively discarded or approximated. GradLite leverages two key techniques: (i) low-rank Jacobian approximation, which reduces the dimensionality of backpropagated error signals, and (ii) error-feedback correction, which accumulates and compensates approximation errors across iterations to preserve convergence guarantees. We provide a theoretical analysis showing that GradLite maintains unbiased gradient estimates with bounded variance, ensuring convergence rates comparable to Adam. Empirically, GradLite reduces optimizer-state and activation memory consumption by up to 50% without architectural changes, and achieves on-par or superior downstream performance on reasoning (MMLU, GSM8K), multilingual, and dialogue benchmarks compared to checkpointing and optimizer-centric baselines (LoMo, GaLore).
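The error-feedback component, at least, has a standard shape. A minimal sketch, where `compress` stands in for GradLite's low-rank Jacobian approximation:

```python
import torch

class ErrorFeedback:
    """Carry the residual each gradient approximation discards and add it
    back before the next compression, so errors cannot accumulate."""
    def __init__(self):
        self.residual = {}

    def correct(self, name, grad, compress):
        carried = grad + self.residual.get(name, torch.zeros_like(grad))
        approx = compress(carried)          # e.g. a low-rank projection
        self.residual[name] = carried - approx
        return approx
```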
[795] Contextual Tokenization for Graph Inverted Indices
Pritish Chakraborty, Indradyumna Roy, Soumen Chakrabarti, Abir De
Main category: cs.LG
TL;DR: CORGII is a graph indexing framework that converts dense graph representations into sparse binary codes for efficient subgraph retrieval using inverted indices with trainable impact weights.
Details
Motivation: Existing graph retrieval methods using multi-vector representations require exhaustive scoring of corpus graphs, limiting their efficiency in large-scale applications.
Method: Uses contextual dense graph representation with differentiable discretization to generate sparse binary codes, then leverages inverted indices with trainable token impact weights and token expansion for multi-probing.
Result: Extensive experiments show CORGII provides better accuracy-efficiency trade-offs compared to several baselines.
Conclusion: CORGII is the first indexer of dense graph representations using discrete tokens with efficient inverted lists, enabling effective subgraph retrieval at scale.
Abstract: Retrieving graphs from a large corpus, that contain a subgraph isomorphic to a given query graph, is a core operation in many real-world applications. While recent multi-vector graph representations and scores based on set alignment and containment can provide accurate subgraph isomorphism tests, their use in retrieval remains limited by their need to score corpus graphs exhaustively. We introduce CORGII (Contextual Representation of Graphs for Inverted Indexing), a graph indexing framework in which, starting with a contextual dense graph representation, a differentiable discretization module computes sparse binary codes over a learned latent vocabulary. This text document-like representation allows us to leverage classic, highly optimized inverted indices, while supporting soft (vector) set containment scores. Pushing this paradigm further, we replace the classical, fixed impact weight of a ‘token’ on a graph (such as TFIDF or BM25) with a data-driven, trainable impact weight. Finally, we explore token expansion to support multi-probing the index for smoother accuracy-efficiency tradeoffs. To our knowledge, CORGII is the first indexer of dense graph representations using discrete tokens mapping to efficient inverted lists. Extensive experiments show that CORGII provides better trade-offs between accuracy and efficiency, compared to several baselines.
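Once graphs are tokenized into sparse binary codes, retrieval is ordinary inverted-index machinery. A minimal sketch with unit impact weights (CORGII learns these instead of fixing them):

```python
from collections import defaultdict

def build_inverted_index(corpus_codes):
    """Each active latent token posts the graph into an inverted list,
    exactly as words do in a text index. corpus_codes[gid] is the set of
    active token ids for graph gid."""
    index = defaultdict(list)
    for gid, code in enumerate(corpus_codes):
        for token in code:
            index[token].append((gid, 1.0))   # (graph id, impact weight)
    return index

def retrieve(index, query_code, top_k=10):
    """Score corpus graphs by summed impact of shared tokens, a soft
    containment surrogate, touching only the query's posting lists."""
    scores = defaultdict(float)
    for token in query_code:
        for gid, w in index.get(token, []):
            scores[gid] += w
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
```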
[796] LAMP: Data-Efficient Linear Affine Weight-Space Models for Parameter-Controlled 3D Shape Generation and Extrapolation
Ghadi Nehme, Yanxia Zhang, Dule Shu, Matt Klenk, Faez Ahmed
Main category: cs.LG
TL;DR: LAMP is a data-efficient framework for controllable 3D generation that uses linear affine mixing of parametric shapes in aligned weight space, enabling interpolation, extrapolation, and physics-guided optimization with minimal training data.
Details
Motivation: Current 3D generation methods require large datasets and struggle with controllability and generalization beyond training distributions, limiting their practical applications in design and engineering.
Method: LAMP aligns SDF decoders by overfitting exemplars from shared initialization, then synthesizes new geometries by solving parameter-constrained mixing problems in aligned weight space, with a safety metric for geometry validity detection.
Result: LAMP achieves controlled interpolation with only 100 samples, safe extrapolation up to 100% beyond training ranges, and physics performance-guided optimization, significantly outperforming baseline methods in extrapolation and data efficiency.
Conclusion: LAMP advances controllable, data-efficient, and safe 3D generation for design exploration, dataset generation, and performance-driven optimization.
Abstract: Generating high-fidelity 3D geometries that satisfy specific parameter constraints has broad applications in design and engineering. However, current methods typically rely on large training datasets and struggle with controllability and generalization beyond the training distributions. To overcome these limitations, we introduce LAMP (Linear Affine Mixing of Parametric shapes), a data-efficient framework for controllable and interpretable 3D generation. LAMP first aligns signed distance function (SDF) decoders by overfitting each exemplar from a shared initialization, then synthesizes new geometries by solving a parameter-constrained mixing problem in the aligned weight space. To ensure robustness, we further propose a safety metric that detects geometry validity via linearity mismatch. We evaluate LAMP on two 3D parametric benchmarks: DrivAerNet++ and BlendedNet. We found that LAMP enables (i) controlled interpolation within bounds with as few as 100 samples, (ii) safe extrapolation by up to 100% parameter difference beyond training ranges, (iii) physics performance-guided optimization under fixed parameters. LAMP significantly outperforms conditional autoencoder and Deep Network Interpolation (DNI) baselines in both extrapolation and data efficiency. Our results demonstrate that LAMP advances controllable, data-efficient, and safe 3D generation for design exploration, dataset generation, and performance-driven optimization.
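The weight-space mixing step can be sketched as a small constrained least-squares problem. The NNLS-plus-normalization surrogate below is our simplification of the paper's parameter-constrained mixing program:

```python
import numpy as np
from scipy.optimize import nnls

def lamp_mix(exemplar_weights, exemplar_params, target_params):
    """Mix aligned SDF-decoder weights so the mixed shape's parameters
    approximate a target. exemplar_weights: (n, d) flattened aligned weights;
    exemplar_params: (n, p) shape parameters; target_params: (p,)."""
    alphas, _ = nnls(exemplar_params.T, target_params)  # nonnegative mixing
    alphas = alphas / (alphas.sum() + 1e-12)            # affine: sum to one
    return alphas @ exemplar_weights                    # weights of new shape
```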
[797] Scalable Oversight via Partitioned Human Supervision
Ren Yin, Takashi Ishida, Masashi Sugiyama
Main category: cs.LG
TL;DR: A scalable oversight framework using complementary labels from domain experts to evaluate and train AI systems on superhuman tasks without ground truth.
Details
Motivation: As AI systems surpass human expertise, obtaining high-quality human supervision becomes challenging for multi-domain tasks where no single expert can provide complete evaluation.
Method: Proposes using complementary labels (indicating incorrect options) from narrow-domain experts, derives unbiased estimators for top-1 accuracy, and introduces methods to combine scarce ordinary labels with abundant complementary labels.
Result: Shows that AI systems can be evaluated without ground truth using complementary labels, and can be trained to perform better with partitioned human supervision. Provides finite-sample deviation guarantees for estimators.
Conclusion: Complementary labels from domain experts provide a viable approach for scalable oversight of advanced AI systems on superhuman tasks, enabling evaluation and training without requiring complete ground truth.
Abstract: As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging. Our focus is on tasks that require deep knowledge and skills of multiple domains. Unfortunately, even the best human experts are knowledgeable only in a single narrow area, and will not be able to evaluate the correctness of advanced AI systems on such superhuman tasks. However, based on their narrow expertise, humans may provide a weak signal, i.e., a complementary label indicating an option that is incorrect. For example, a cardiologist could state that “this is not related to cardiology,” even if they cannot identify the true disease. Based on this weak signal, we propose a scalable oversight framework that enables us to evaluate frontier AI systems without the need to prepare the ground truth. We derive an unbiased estimator of top-1 accuracy from complementary labels and quantify how many complementary labels are needed to match the variance of ordinary labels. We further introduce two estimators to combine scarce ordinary labels with abundant complementary labels. We provide finite-sample deviation guarantees for both complementary-only and the mixed estimators. Empirically, we show that we can evaluate the output of large language models without the ground truth, if we have complementary labels. We further show that we can train an AI system with such weak signals: we show how to automatically design an agentic AI system that performs better under this partitioned human supervision. Our code is available at https://github.com/R-Yin-217/Scalable-Oversight-via-Human-Partitioned-Supervision.
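The headline estimator follows from one observation: if the model's prediction is correct, a complementary label can never coincide with it; if it is wrong, coincidence happens with probability 1/(K-1) under a uniformity assumption (ours, for illustration). That gives acc = 1 - (K-1) * P(comp == pred):

```python
import numpy as np

def complementary_accuracy(preds, comp_labels, num_classes):
    """Unbiased top-1 accuracy from complementary labels, assuming each
    complementary label is uniform over the K-1 incorrect classes."""
    hit = np.mean(np.asarray(preds) == np.asarray(comp_labels))
    return 1.0 - (num_classes - 1) * hit
```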
[798] Accelerating Materials Design via LLM-Guided Evolutionary Search
Nikhil Abhyankar, Sanchit Kabra, Saaketh Desai, Chandan K. Reddy
Main category: cs.LG
TL;DR: LLEMA is a materials discovery framework that combines large language models with evolutionary algorithms and memory-based refinement to efficiently navigate chemical spaces under multiple property constraints.
Details
Motivation: Materials discovery requires exploring vast chemical spaces while satisfying multiple conflicting objectives, which is challenging with existing methods.
Method: LLEMA integrates LLMs for candidate generation under explicit constraints, surrogate models for property estimation, and multi-objective scoring with memory-based refinement to guide evolutionary search.
Result: On 14 realistic tasks across various domains, LLEMA achieved higher hit-rates and stronger Pareto fronts than generative and LLM-only baselines, producing chemically plausible and thermodynamically stable materials.
Conclusion: LLEMA provides a principled pathway for accelerating practical materials discovery by enforcing synthesizability and multi-objective trade-offs through rule-guided generation and memory-based refinement.
Abstract: Materials discovery requires navigating vast chemical and structural spaces while satisfying multiple, often conflicting, objectives. We present LLM-guided Evolution for MAterials design (LLEMA), a unified framework that couples the scientific knowledge embedded in large language models with chemistry-informed evolutionary rules and memory-based refinement. At each iteration, an LLM proposes crystallographically specified candidates under explicit property constraints; a surrogate-augmented oracle estimates physicochemical properties; and a multi-objective scorer updates success/failure memories to guide subsequent generations. Evaluated on 14 realistic tasks spanning electronics, energy, coatings, optics, and aerospace, LLEMA discovers candidates that are chemically plausible, thermodynamically stable, and property-aligned, achieving higher hit-rates and stronger Pareto fronts than generative and LLM-only baselines. Ablation studies confirm the importance of rule-guided generation, memory-based refinement, and surrogate prediction. By enforcing synthesizability and multi-objective trade-offs, LLEMA delivers a principled pathway to accelerate practical materials discovery. Code: https://github.com/scientific-discovery/LLEMA
[799] CANDI: Hybrid Discrete-Continuous Diffusion Models
Patrick Pynadath, Jiaxin Shi, Ruqi Zhang
Main category: cs.LG
TL;DR: Continuous diffusion underperforms on discrete data due to temporal dissonance between discrete identity corruption and continuous rank degradation. CANDI is proposed as a hybrid framework that decouples these mechanisms to enable effective continuous diffusion for discrete spaces.
Details
Motivation: To understand why continuous diffusion performs poorly on discrete data despite its success in continuous domains, and to bridge this gap by addressing the temporal dissonance between discrete and continuous corruption mechanisms.
Method: Introduces CANDI (Continuous ANd DIscrete diffusion), a hybrid framework that decouples discrete identity corruption from continuous rank degradation, enabling simultaneous learning of conditional structure and continuous geometry.
Result: CANDI successfully avoids temporal dissonance, enables classifier-based guidance with off-the-shelf classifiers through gradient addition, and outperforms masked diffusion in text generation at low NFE.
Conclusion: The CANDI framework unlocks the benefits of continuous diffusion for discrete spaces by resolving the temporal dissonance problem, demonstrating the value of learning continuous gradients for discrete data generation.
Abstract: While continuous diffusion has shown remarkable success in continuous domains such as image generation, its direct application to discrete data has underperformed compared to purely discrete formulations. This gap is counterintuitive, given that continuous diffusion learns score functions that enable joint evolution across multiple positions. To understand this gap, we introduce token identifiability as an analytical framework for understanding how Gaussian noise corrupts discrete data through two mechanisms: discrete identity corruption and continuous rank degradation. We reveal that these mechanisms scale differently with vocabulary size, creating a temporal dissonance: at noise levels where discrete corruption preserves enough structure for conditional learning, continuous denoising is trivial; at noise levels where continuous denoising is meaningful, discrete corruption destroys nearly all conditional structure. To solve this, we propose CANDI (Continuous ANd DIscrete diffusion), a hybrid framework that decouples discrete and continuous corruption, enabling simultaneous learning of both conditional structure and continuous geometry. We empirically validate the temporal dissonance phenomenon and demonstrate that CANDI successfully avoids it. This unlocks the benefits of continuous diffusion for discrete spaces: on controlled generation, CANDI enables classifier-based guidance with off-the-shelf classifiers through simple gradient addition; on text generation, CANDI outperforms masked diffusion at low NFE, demonstrating the value of learning continuous gradients for discrete spaces.
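The decoupling itself is easy to picture: token identities and embedding geometry are corrupted on separate schedules. A hedged sketch (the paper's noise parameterization will differ):

```python
import torch

def candi_corrupt(token_ids, embed, t_disc, t_cont):
    """Decoupled corruption: replace each token with a random one with
    probability t_disc (discrete identity corruption), then add Gaussian
    noise of scale t_cont to the embeddings (continuous rank degradation)."""
    vocab_size = embed.num_embeddings
    replace = torch.rand(token_ids.shape) < t_disc
    noisy_ids = torch.where(replace,
                            torch.randint_like(token_ids, vocab_size),
                            token_ids)
    x = embed(noisy_ids)
    return x + t_cont * torch.randn_like(x)
```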
[800] Transitive RL: Value Learning via Divide and Conquer
Seohong Park, Aditya Oberai, Pranav Atreya, Sergey Levine
Main category: cs.LG
TL;DR: TRL is a new value learning algorithm for offline goal-conditioned RL that uses divide-and-conquer to reduce bias accumulation and variance compared to TD and Monte Carlo methods.
Details
Motivation: To address limitations in existing value learning methods for offline GCRL, particularly bias accumulation in TD methods and high variance in Monte Carlo methods.
Method: Converts triangle inequality structure in GCRL into a practical divide-and-conquer value update rule that requires only O(log T) recursions for length-T trajectories.
Result: TRL achieves the best performance in highly challenging, long-horizon benchmark tasks compared to previous offline GCRL algorithms.
Conclusion: TRL provides an effective divide-and-conquer approach for offline goal-conditioned RL that overcomes limitations of traditional value learning methods.
Abstract: In this work, we present Transitive Reinforcement Learning (TRL), a new value learning algorithm based on a divide-and-conquer paradigm. TRL is designed for offline goal-conditioned reinforcement learning (GCRL) problems, where the aim is to find a policy that can reach any state from any other state in the smallest number of steps. TRL converts a triangle inequality structure present in GCRL into a practical divide-and-conquer value update rule. This has several advantages compared to alternative value learning paradigms. Compared to temporal difference (TD) methods, TRL suffers less from bias accumulation, as in principle it only requires $O(\log T)$ recursions (as opposed to $O(T)$ in TD learning) to handle a length-$T$ trajectory. Unlike Monte Carlo methods, TRL suffers less from high variance as it performs dynamic programming. Experimentally, we show that TRL achieves the best performance in highly challenging, long-horizon benchmark tasks compared to previous offline GCRL algorithms.
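The divide-and-conquer update is easiest to see in the tabular case, where it amounts to min-plus matrix squaring: each sweep doubles the path length covered, so length-T horizons need only O(log T) sweeps. TRL does this with function approximation on offline data rather than a dense table:

```python
import numpy as np

def transitive_value_iteration(D, num_sweeps):
    """Tabular caricature: D[s, g] is the current cost-to-go from s to g.
    Each sweep applies D[s, g] <- min(D[s, g], min_w D[s, w] + D[w, g]),
    exploiting the triangle inequality over intermediate waypoints w."""
    for _ in range(num_sweeps):
        D = np.minimum(D, (D[:, :, None] + D[None, :, :]).min(axis=1))
    return D
```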
[801] Toward Robust Signed Graph Learning through Joint Input-Target Denoising
Junran Wu, Beng Chin Ooi, Ke Xu
Main category: cs.LG
TL;DR: RIDGE is a robust signed graph learning framework that jointly denoises graph inputs and supervision targets using an extended graph information bottleneck theory.
Details
Motivation: Existing signed graph neural networks lack theoretical guidance for robustness, and real-world signed graphs contain noise in both input connections and supervision targets.
Method: Extends graph information bottleneck theory to handle noise in both input and target spaces, using reparameterization and variational approximation to create a tractable objective function for joint denoising.
Result: Extensive experiments on four signed graph datasets show RIDGE significantly improves robustness of popular SGNN models under various noise levels.
Conclusion: RIDGE provides a theoretically-grounded framework for robust signed graph learning that effectively handles noise in both input data and supervision targets.
Abstract: Signed Graph Neural Networks (SGNNs) are widely adopted to analyze complex patterns in signed graphs with both positive and negative links. Given the noisy nature of real-world connections, the robustness of SGNNs has also emerged as a pivotal research area. Under the supervision of empirical properties, graph structure learning has shown its robustness in signed graph representation learning; however, there remains a paucity of research investigating a robust SGNN with theoretical guidance. Inspired by the success of graph information bottleneck (GIB) in information extraction, we propose RIDGE, a novel framework for Robust sIgned graph learning through joint Denoising of Graph inputs and supervision targEts. Different from the basic GIB, we extend the GIB theory with the capability of target-space denoising, since noise co-exists in both the input and target spaces. In instantiation, RIDGE effectively cleanses input data and supervision targets via a tractable objective function produced by a reparameterization mechanism and variational approximation. We extensively validate our method on four prevalent signed graph datasets, and the results show that RIDGE clearly improves the robustness of popular SGNN models under various levels of noise.
[802] A Scalable Global Optimization Algorithm For Constrained Clustering
Pedro Chumpitaz-Flores, My Duong, Cristobal Heredia, Kaixun Hua
Main category: cs.LG
TL;DR: SDC-GBB is a scalable branch-and-bound framework for constrained clustering that handles large datasets with pairwise constraints while guaranteeing global optimality.
Details
Motivation: Existing constrained clustering methods using mixed-integer optimization are limited to small datasets due to NP-hard complexity, while current approaches fail to provide global optimality guarantees for large-scale problems.
Method: Proposes Sample-Driven Constrained Group-Based Branch-and-Bound (SDC-GBB) that collapses must-linked samples into pseudo-samples, prunes cannot-links geometrically, and uses Lagrangian decomposition with geometric elimination rules for efficient bounds.
Result: Handles datasets with 200,000 samples with cannot-link constraints and 1.5M samples with must-link constraints - 200-1500x larger than state-of-the-art, achieving optimality gap <3% with deterministic global guarantees.
Conclusion: SDC-GBB provides the first scalable method for globally optimal constrained clustering on large datasets, overcoming limitations of existing optimization methods and heuristic approaches.
Abstract: Constrained clustering leverages limited domain knowledge to improve clustering performance and interpretability, but incorporating pairwise must-link and cannot-link constraints is an NP-hard challenge, making global optimization intractable. Existing mixed-integer optimization methods are confined to small-scale datasets, limiting their utility. We propose Sample-Driven Constrained Group-Based Branch-and-Bound (SDC-GBB), a decomposable branch-and-bound (BB) framework that collapses must-linked samples into centroid-based pseudo-samples and prunes cannot-link constraints through geometric rules, while preserving convergence and guaranteeing global optimality. By integrating grouped-sample Lagrangian decomposition and geometric elimination rules for efficient lower and upper bounds, the algorithm attains highly scalable pairwise k-Means constrained clustering via parallelism. Experimental results show that our approach handles datasets with 200,000 samples with cannot-link constraints and 1,500,000 samples with must-link constraints, which is 200-1500 times larger than the current state-of-the-art under comparable constraint settings, while reaching an optimality gap of less than 3%. In providing deterministic global guarantees, our method also avoids the search failures that off-the-shelf heuristics often encounter on large datasets.
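The must-link collapsing step is concrete enough to illustrate on its own. A minimal sketch of that grouping alone, assuming groups arrive as index lists; the branch-and-bound machinery, Lagrangian decomposition, and cannot-link pruning are not shown.

```python
import numpy as np

def collapse_must_links(X, must_link_groups):
    """Replace each must-linked group by a size-weighted centroid pseudo-sample.

    Because all members of a must-link group share one cluster assignment,
    a weighted k-means objective over the pseudo-samples matches the grouped
    original objective up to a constant, shrinking the problem the solver sees.
    """
    covered = {i for g in must_link_groups for i in g}
    groups = [list(g) for g in must_link_groups]
    groups += [[i] for i in range(len(X)) if i not in covered]
    pseudo = np.stack([X[g].mean(axis=0) for g in groups])
    weights = np.array([len(g) for g in groups], dtype=float)
    return pseudo, weights, groups

X = np.random.rand(10, 2)
pseudo, w, groups = collapse_must_links(X, [[0, 1, 2], [5, 6]])
```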
[803] Random Search Neural Networks for Efficient and Expressive Graph Learning
Michael Ito, Danai Koutra, Jenna Wiens
Main category: cs.LG
TL;DR: Random Search Neural Networks (RSNNs) replace random walks with random searches to achieve better global structure capture with fewer samples, outperforming RWNNs on molecular and protein benchmarks with up to 16x fewer sequences.
Details
Motivation: Random walk neural networks (RWNNs) fail to capture global graph structure under realistic sampling constraints due to incomplete node and edge coverage, limiting their expressivity.
Method: Propose random search neural networks (RSNNs) that operate on random searches, each guaranteeing full node coverage. Theoretically show only O(log|V|) searches needed for full edge coverage in sparse graphs vs O(|V|) walks for RWNNs.
Result: RSNNs consistently outperform RWNNs on molecular and protein benchmarks, achieving comparable or superior performance with up to 16x fewer sampled sequences.
Conclusion: RSNNs bridge theoretical and practical advances in random walk approaches, offering an efficient and expressive framework for learning on sparse graphs with probabilistic isomorphism invariance.
Abstract: Random walk neural networks (RWNNs) have emerged as a promising approach for graph representation learning, leveraging recent advances in sequence models to process random walks. However, under realistic sampling constraints, RWNNs often fail to capture global structure even in small graphs due to incomplete node and edge coverage, limiting their expressivity. To address this, we propose \textit{random search neural networks} (RSNNs), which operate on random searches, each of which guarantees full node coverage. Theoretically, we demonstrate that in sparse graphs, only $O(\log |V|)$ searches are needed to achieve full edge coverage, substantially reducing sampling complexity compared to the $O(|V|)$ walks required by RWNNs (assuming walk lengths scale with graph size). Furthermore, when paired with universal sequence models, RSNNs are universal approximators. We lastly show RSNNs are probabilistically invariant to graph isomorphisms, ensuring their expectation is an isomorphism-invariant graph function. Empirically, RSNNs consistently outperform RWNNs on molecular and protein benchmarks, achieving comparable or superior performance with up to 16$\times$ fewer sampled sequences. Our work bridges theoretical and practical advances in random walk based approaches, offering an efficient and expressive framework for learning on sparse graphs.
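A plausible reading of a "random search" is a uniformly randomized graph traversal that, unlike a random walk, visits every node of a connected graph exactly once. A minimal sketch along those lines (the paper's exact sampler may differ):

```python
import random

def random_search(adj, start=None):
    """Sample one random search: a randomized DFS node sequence.

    adj maps each node to its neighbor list. Every node of a connected
    graph appears exactly once, guaranteeing full node coverage per sample;
    O(log |V|) such samples then suffice for full edge coverage on sparse
    graphs, per the paper's analysis.
    """
    nodes = list(adj)
    start = start if start is not None else random.choice(nodes)
    seen, stack, order = {start}, [start], []
    while stack:
        u = stack.pop()
        order.append(u)
        fresh = [v for v in adj[u] if v not in seen]
        random.shuffle(fresh)
        seen.update(fresh)
        stack.extend(fresh)
    return order  # fed to a sequence model in place of a random walk

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(random_search(adj))
```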
[804] Iteratively Refined Early Interaction Alignment for Subgraph Matching based Graph Retrieval
Ashwin Ramachandran, Vaibhav Raj, Indrayumna Roy, Soumen Chakrabarti, Abir De
Main category: cs.LG
TL;DR: IsoNet++ is an early interaction graph neural network that improves subgraph matching through cross-graph message passing, lazy alignment updates, and novel node-pair partner interaction.
Details
Motivation: To improve subgraph isomorphism matching for applications like scene graph retrieval and molecular fingerprint detection by addressing limitations of late interaction models like IsoNet.
Method: Uses early interaction GNN with three innovations: cross-graph message passing guided by injective alignment, lazy alignment updates over multiple rounds, and node-pair partner interaction that considers edge existence/non-existence between node pairs.
Result: Experiments show progressively refined alignments over successive rounds and significantly better retrieval performance than existing methods, with all three innovations contributing to enhanced accuracy.
Conclusion: IsoNet++ demonstrates superior subgraph matching performance through its early interaction approach and novel technical innovations in alignment refinement and node-pair interaction.
Abstract: Graph retrieval based on subgraph isomorphism has several real-world applications such as scene graph retrieval, molecular fingerprint detection and circuit design. Roy et al. [35] proposed IsoNet, a late interaction model for subgraph matching, which first computes the node and edge embeddings of each graph independently of the paired graph and then computes a trainable alignment map. Here, we present IsoNet++, an early interaction graph neural network (GNN), based on several technical innovations. First, we compute embeddings of all nodes by passing messages within and across the two input graphs, guided by an injective alignment between their nodes. Second, we update this alignment in a lazy fashion over multiple rounds. Within each round, we run a layerwise GNN from scratch, based on the current state of the alignment. After the completion of one round of GNN, we use the last-layer embeddings to update the alignments, and proceed to the next round. Third, IsoNet++ incorporates a novel notion of node-pair partner interaction. Traditional early interaction computes attention between a node and its potential partners in the other graph, the attention then controlling messages passed across graphs. In contrast, we consider node pairs (not single nodes) as potential partners. Existence of an edge between the nodes in one graph and non-existence in the other provide vital signals for refining the alignment. Our experiments on several datasets show that the alignments get progressively refined with successive rounds, resulting in significantly better retrieval performance than existing methods. We demonstrate that all three innovations contribute to the enhanced accuracy. Our code and datasets are publicly available at https://github.com/structlearning/isonetpp.
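The lazy alignment schedule fits in a few lines. In the sketch below, `gnn` is an assumed callable that reruns a layerwise GNN from scratch with cross-graph messages routed by the current alignment and returns last-layer node embeddings for both graphs; a soft alignment stands in for the injective map, and the node-pair partner interaction is omitted.

```python
import torch

def isonetpp_rounds(gnn, Xq, Xc, rounds=3):
    """Refine the cross-graph alignment only between full GNN passes."""
    alignment = torch.full((Xq.shape[0], Xc.shape[0]),
                           1.0 / Xc.shape[0])        # uninformed initial map
    for _ in range(rounds):
        Hq, Hc = gnn(Xq, Xc, alignment)              # fresh layerwise GNN run
        scores = Hq @ Hc.T
        alignment = torch.softmax(scores, dim=-1)    # lazy, per-round update
    return alignment
```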
[805] FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning
Yuyang Ding, Chi Zhang, Juntao Li, Haibin Lin, Xin Liu, Min Zhang
Main category: cs.LG
TL;DR: RLVR enhances LLM reasoning but rewards flawed rollouts equally with correct ones, causing unreliable reasoning patterns. FAPO introduces parameter-free reward penalty to leverage flawed rollouts as shortcuts initially while shifting to reliable reasoning later, using GenRM for precise error detection.
Details
Motivation: Current RLVR rewards flawed reasoning patterns (answer-guessing, jump-in-reasoning) identically to correct reasoning, causing models to internalize unreliable patterns that constrain long-term reasoning capability.
Method: Propose Flawed-Aware Policy Optimization (FAPO) with parameter-free reward penalty for flawed-positive rollouts, and introduce generative reward model (GenRM) with process-level reward for accurate detection of reasoning errors.
Result: FAPO improves outcome correctness, process reliability, and training stability across multiple domains without increasing token budget, enabling stable early gains while shifting to reliable reasoning in later stages.
Conclusion: FAPO effectively addresses the flawed-positive rollout problem in RLVR by leveraging flawed reasoning as useful shortcuts initially while gradually optimizing toward reliable reasoning patterns, enhancing both reasoning quality and training stability.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models (LLMs). In this context, models explore reasoning trajectories and exploit rollouts with correct answers as positive signals for policy optimization. However, these rollouts might involve flawed patterns such as answer-guessing and jump-in-reasoning. Such flawed-positive rollouts are rewarded identically to fully correct ones, causing policy models to internalize these unreliable reasoning patterns. In this work, we first conduct a systematic study of flawed-positive rollouts in RL and find that they enable rapid capability gains during the early optimization stage, while constraining reasoning capability later by reinforcing unreliable patterns. Building on these insights, we propose Flawed-Aware Policy Optimization (FAPO), which presents a parameter-free reward penalty for flawed-positive rollouts, enabling the policy to leverage them as useful shortcuts in the warm-up stage, securing stable early gains, while gradually shifting optimization toward reliable reasoning in the later refinement stage. To accurately and comprehensively detect flawed-positive rollouts, we introduce a generative reward model (GenRM) with a process-level reward that precisely localizes reasoning errors. Experiments show that FAPO is effective in broad domains, improving outcome correctness, process reliability, and training stability without increasing the token budget.
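The abstract does not give the penalty's exact form, so the sketch below only shows where it enters: a GenRM flags flawed-positive rollouts, which keep a positive but discounted reward. The 0.5 discount is purely a placeholder, not the paper's parameter-free rule.

```python
def shaped_reward(is_correct: bool, genrm_flags_flawed: bool) -> float:
    """Keep flawed-but-correct rollouts as weak positives (illustrative).

    Early in training the discounted signal still rewards useful shortcuts;
    relative to fully correct rollouts it steers later optimization toward
    reliable reasoning, which is the behavior FAPO is designed to induce.
    """
    if not is_correct:
        return 0.0
    return 0.5 if genrm_flags_flawed else 1.0
```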
[806] DDTR: Diffusion Denoising Trace Recovery
Maximilian Matyash, Avigdor Gal, Arik Senderovich
Main category: cs.LG
TL;DR: A novel deep learning approach using Diffusion Denoising Probabilistic Models (DDPM) for stochastic trace recovery from uncertain process logs, achieving 25% improvement over existing methods.
Details
Motivation: Traditional deterministic process logs are now being captured from non-deterministic sources like uncertain sensors and ML models, creating a need for reliable stochastic trace recovery methods to understand processes in such systems.
Method: Uses Diffusion Denoising Probabilistic Models (DDPM) that leverage process knowledge - either by discovering process models implicitly or explicitly injecting process knowledge during training - to recover traces through denoising.
Result: Empirical evaluation shows state-of-the-art performance with up to 25% improvement over existing methods and increased robustness under high noise levels.
Conclusion: The DDPM-based approach effectively addresses stochastic trace recovery from uncertain process logs, significantly outperforming existing methods while maintaining robustness in noisy environments.
Abstract: With recent technological advances, process logs, which were traditionally deterministic in nature, are being captured from non-deterministic sources, such as uncertain sensors or machine learning models (that predict activities using cameras). In the presence of stochastically-known logs, i.e., logs that contain probabilistic information, the need for stochastic trace recovery increases in order to offer reliable means of understanding the processes that govern such systems. We design a novel deep learning approach for stochastic trace recovery, based on Diffusion Denoising Probabilistic Models (DDPM), which makes use of process knowledge (either implicitly by discovering a model or explicitly by injecting process knowledge in the training phase) to recover traces by denoising. We conduct an empirical evaluation demonstrating state-of-the-art performance with up to a 25% improvement over existing methods, along with increased robustness under high noise levels.
[807] Combining Deep Learning and Explainable AI for Toxicity Prediction of Chemical Compounds
Eduard Popescu, Adrian Groza, Andreea Cernat
Main category: cs.LG
TL;DR: This paper proposes a novel image-based pipeline using DenseNet121 to predict chemical toxicity from 2D molecular structure images, achieving competitive results while using Grad-CAM for interpretability.
Details
Motivation: To improve toxicological activity prediction of chemical compounds using computational methods, with focus on combining predictive accuracy with model interpretability in cheminformatics.
Method: Developed an image-based pipeline using DenseNet121 architecture that processes 2D graphical representations of chemical structures, combined with Grad-CAM visualizations for explainable AI.
Result: The proposed architecture achieves competitive performance compared to traditional models, demonstrating the effectiveness of deep convolutional networks in toxicology prediction.
Conclusion: Image-based representations combined with explainable AI methods can significantly improve both predictive accuracy and model transparency in computational toxicology applications.
Abstract: The task here is to predict the toxicological activity of chemical compounds based on the Tox21 dataset, a benchmark in computational toxicology. After a domain-specific overview of chemical toxicity, we discuss current computational strategies, focusing on machine learning and deep learning. Several architectures are compared in terms of performance, robustness, and interpretability. This research introduces a novel image-based pipeline based on DenseNet121, which processes 2D graphical representations of chemical structures. Additionally, we employ Grad-CAM visualizations, an explainable AI technique, to interpret the model’s predictions and highlight molecular regions contributing to toxicity classification. The proposed architecture achieves competitive results compared to traditional models, demonstrating the potential of deep convolutional networks in cheminformatics. Our findings emphasize the value of combining image-based representations with explainable AI methods to improve both predictive accuracy and model transparency in toxicology.
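A minimal sketch of this kind of pipeline with torchvision's DenseNet121 and a hand-rolled Grad-CAM; the hooked layer, input size, and two-class head are assumptions, and the random tensor stands in for a rendered 2D structure image.

```python
import torch
from torchvision.models import densenet121

model = densenet121(num_classes=2)      # toxic vs. non-toxic head (untrained here)
model.eval()

acts, grads = {}, {}
target = model.features[-1]             # last feature layer; a plausible CAM target
target.register_forward_hook(lambda m, i, o: acts.update(a=o))
target.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

img = torch.randn(1, 3, 224, 224)       # stand-in for a molecule image
model(img)[0, 1].backward()             # gradient of the "toxic" logit

# Grad-CAM: pool gradients per channel, ReLU the weighted activation sum
w = grads["g"].mean(dim=(2, 3), keepdim=True)
cam = torch.relu((w * acts["a"]).sum(dim=1)).squeeze()
cam = cam / (cam.max() + 1e-8)          # normalized saliency over molecular regions
```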
[808] Optimal Anytime Algorithms for Online Convex Optimization with Adversarial Constraints
Dhruv Sarkar, Abhishek Sinha
Main category: cs.LG
TL;DR: An anytime online algorithm for learning adversarial convex cost functions while satisfying adversarial convex constraints, achieving O(√t) regret and Õ(√t) constraint violation bounds without requiring prior knowledge of time horizon T.
Details
Motivation: Existing algorithms require prior knowledge of the time horizon T and use doubling tricks with poor practical performance due to multiple restarts. This creates a need for anytime algorithms that provide performance guarantees at any intermediate timestep.
Method: Uses time-varying Lyapunov functions to track constraint violations, avoiding the need for fixed Lyapunov functions tuned to known horizon length. Introduces a new analytical technique to handle challenges posed by non-monotonicity in time-varying functions.
Result: Achieves O(√t) regret and Õ(√t) cumulative constraint violation bounds for any t≥1. Extends to dynamic regret setting with bounds adapting to comparator path length without prior knowledge. Also presents adaptive algorithm for optimistic setting scaling with prediction error.
Conclusion: The algorithm provides practical anytime performance without doubling tricks, demonstrated through numerical experiments on online shortest path problems. Time-varying Lyapunov functions enable optimal bounds without prior horizon knowledge.
Abstract: We propose an anytime online algorithm for the problem of learning a sequence of adversarial convex cost functions while approximately satisfying another sequence of adversarial online convex constraints. A sequential algorithm is called \emph{anytime} if it provides a non-trivial performance guarantee for any intermediate timestep $t$ without requiring prior knowledge of the length of the entire time horizon $T$. Our proposed algorithm achieves optimal performance bounds without resorting to the standard doubling trick, which has poor practical performance due to multiple restarts. Our core technical contribution is the use of time-varying Lyapunov functions to keep track of constraint violations. This must be contrasted with prior works that used a fixed Lyapunov function tuned to the known horizon length $T$. The use of time-varying Lyapunov function poses unique analytical challenges as properties, such as \emph{monotonicity}, on which the prior proofs rest, no longer hold. By introducing a new analytical technique, we show that our algorithm achieves $O(\sqrt{t})$ regret and $\tilde{O}(\sqrt{t})$ cumulative constraint violation bounds for any $t\geq 1$. We extend our results to the dynamic regret setting, achieving bounds that adapt to the path length of the comparator sequence without prior knowledge of its total length. We also present an adaptive algorithm in the optimistic setting, whose performance gracefully scales with the cumulative prediction error. We demonstrate the practical utility of our algorithm through numerical experiments involving the online shortest path problem.
[809] Prediction-Powered Semi-Supervised Learning with Online Power Tuning
Noa Shoham, Ron Dorfman, Shalev Shaer, Kfir Y. Levy, Yaniv Romano
Main category: cs.LG
TL;DR: Extends Prediction-Powered Inference to semi-supervised learning by introducing an unbiased gradient estimator and online tuning of interpolation parameters between labeled and pseudo-labeled data.
Details
Motivation: Address the challenge in SSL where inaccurate pseudo-labels can introduce bias and lead to suboptimal models, aiming to balance contributions from labeled and pseudo-labeled data effectively.
Method: Proposes a novel unbiased gradient estimator for SSL, uses an interpolation parameter between labeled and pseudo-labeled data, and tunes this parameter online alongside model parameters using one-dimensional online learning.
Result: Experiments on synthetic and real datasets show improved performance over classic SSL baselines and PPI methods with offline parameter tuning.
Conclusion: The proposed online tuning approach for interpolation parameters in SSL effectively balances labeled and pseudo-labeled data contributions, outperforming existing methods.
Abstract: Prediction-Powered Inference (PPI) is a recently proposed statistical inference technique for parameter estimation that leverages pseudo-labels on both labeled and unlabeled data to construct an unbiased, low-variance estimator. In this work, we extend its core idea to semi-supervised learning (SSL) for model training, introducing a novel unbiased gradient estimator. This extension addresses a key challenge in SSL: while unlabeled data can improve model performance, its benefit heavily depends on the quality of pseudo-labels. Inaccurate pseudo-labels can introduce bias, leading to suboptimal models. To balance the contributions of labeled and pseudo-labeled data, we utilize an interpolation parameter and tune it on the fly, alongside the model parameters, using a one-dimensional online learning algorithm. We verify the practical advantage of our approach through experiments on both synthetic and real datasets, demonstrating improved performance over classic SSL baselines and PPI methods that tune the interpolation parameter offline.
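The unbiasedness argument is concrete enough to sketch. Assumptions: `pseudo` is a frozen pseudo-labeler and `loss_fn` a differentiable loss; all names are illustrative.

```python
import torch

def ppi_gradient(model, loss_fn, xl, yl, xu, pseudo, lam):
    """PPI-style gradient with interpolation parameter lam (a sketch).

    g = g_labeled(true) + lam * (g_unlabeled(pseudo) - g_labeled(pseudo)).
    The two pseudo-label terms have equal expectation, so their difference
    is mean-zero and the estimator stays unbiased for any lam; lam itself
    is tuned on the fly with a 1-D online learner in the paper.
    """
    params = list(model.parameters())
    g_true = torch.autograd.grad(loss_fn(model(xl), yl), params)
    g_pu = torch.autograd.grad(loss_fn(model(xu), pseudo(xu).detach()), params)
    g_pl = torch.autograd.grad(loss_fn(model(xl), pseudo(xl).detach()), params)
    return [t + lam * (u - l) for t, u, l in zip(g_true, g_pu, g_pl)]
```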
[810] A roadmap for curvature-based geometric data analysis and learning
Yasharth Yadav, Kelin Xia
Main category: cs.LG
TL;DR: This paper presents the first comprehensive review of discrete curvature models for geometric data analysis and learning, covering mathematical foundations, computational methods, and practical applications across various data representations.
Details
Motivation: Curvature is a fundamental concept in geometric data analysis that captures intrinsic structure and supports diverse applications from community detection to geometric deep learning, but existing discrete curvature models lacked systematic review and comparison.
Method: The authors systematically review discrete curvature models from both Riemannian and metric geometry perspectives, examine computational algorithms across different data representations (graphs, simplicial complexes, cubical complexes, point clouds), and propose a pipeline for curvature-driven data analysis.
Result: The survey provides detailed comparisons of computational formulations, mathematical foundations, and practical implementations of discrete curvature models, offering insights into their effectiveness across different data types and applications.
Conclusion: This comprehensive review serves as a roadmap for researchers to understand discrete curvature as a fundamental tool for geometric data understanding and learning, bridging mathematical theory with practical applications in data analysis.
Abstract: Geometric data analysis and learning has emerged as a distinct and rapidly developing research area, increasingly recognized for its effectiveness across diverse applications. At the heart of this field lies curvature, a powerful and interpretable concept that captures intrinsic geometric structure and underpins numerous tasks, from community detection to geometric deep learning. A wide range of discrete curvature models have been proposed for various data representations, including graphs, simplicial complexes, cubical complexes, and point clouds sampled from manifolds. These models not only provide efficient characterizations of data geometry but also constitute essential components in geometric learning frameworks. In this paper, we present the first comprehensive review of existing discrete curvature models, covering their mathematical foundations, computational formulations, and practical applications in data analysis and learning. In particular, we discuss discrete curvature from both Riemannian and metric geometry perspectives and propose a systematic pipeline for curvature-driven data analysis. We further examine the corresponding computational algorithms across different data representations, offering detailed comparisons and insights. Finally, we review state-of-the-art applications of curvature in both supervised and unsupervised learning. This survey provides a conceptual and practical roadmap for researchers to gain a better understanding of discrete curvature as a fundamental tool for geometric understanding and learning.
[811] CLEANet: Robust and Efficient Anomaly Detection in Contaminated Multivariate Time Series
Songhan Zhang, Yuanhao Lai, Pengfei Zheng, Boxi Yu, Xiaoying Tang, Qiuai Fu, Pinjia He
Main category: cs.LG
TL;DR: CLEANet is a robust and efficient anomaly detection framework for contaminated multivariate time series that addresses training data contamination and inefficient inference through contamination-resilient training and lightweight architecture.
Details
Motivation: Real-world deployment of multivariate time series anomaly detection is hindered by training data contamination (noises and hidden anomalies) and inefficient model inference. Existing methods assume clean training data and complex models often overfit to contamination with high latency.
Method: Proposes CLEANet with Contamination-Resilient Training Framework (CRTF) using adaptive reconstruction weighting and clustering-guided contrastive learning, plus a lightweight conjugate MLP that disentangles temporal and cross-feature dependencies.
Result: Achieves up to 73.04% higher F1 and 81.28% lower runtime compared with ten state-of-the-art baselines across five public datasets. Integrating CRTF into three advanced models yields an average 5.35% F1 gain.
Conclusion: CLEANet effectively addresses contamination and efficiency challenges in multivariate time series anomaly detection, demonstrating strong performance improvements and generalizability.
Abstract: Multivariate time series (MTS) anomaly detection is essential for maintaining the reliability of industrial systems, yet real-world deployment is hindered by two critical challenges: training data contamination (noises and hidden anomalies) and inefficient model inference. Existing unsupervised methods assume clean training data, but contamination distorts learned patterns and degrades detection accuracy. Meanwhile, complex deep models often overfit to contamination and suffer from high latency, limiting practical use. To address these challenges, we propose CLEANet, a robust and efficient anomaly detection framework in contaminated multivariate time series. CLEANet introduces a Contamination-Resilient Training Framework (CRTF) that mitigates the impact of corrupted samples through an adaptive reconstruction weighting strategy combined with clustering-guided contrastive learning, thereby enhancing robustness. To further avoid overfitting on contaminated data and improve computational efficiency, we design a lightweight conjugate MLP that disentangles temporal and cross-feature dependencies. Across five public datasets, CLEANet achieves up to 73.04% higher F1 and 81.28% lower runtime compared with ten state-of-the-art baselines. Furthermore, integrating CRTF into three advanced models yields an average 5.35% F1 gain, confirming its strong generalizability.
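The adaptive reconstruction weighting can be sketched generically (the paper's exact rule and the contrastive term are not reproduced here): samples that reconstruct poorly are treated as likely contaminated and down-weighted.

```python
import torch

def weighted_recon_loss(x, x_hat, tau=1.0):
    """Down-weight high-error windows presumed contaminated (a sketch)."""
    err = ((x - x_hat) ** 2).flatten(1).mean(dim=1)  # per-sample error
    w = torch.exp(-err.detach() / tau)               # suspect samples get small w
    w = w / (w.mean() + 1e-8)                        # keep the loss scale stable
    return (w * err).mean()
```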
[812] FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Main category: cs.LG
TL;DR: FastVLM introduces self-speculative decoding using a lightweight draft model and full verification model to accelerate VLM inference by 1.55-1.85x with minimal performance loss.
Details
Motivation: Vision-language models face high computational cost and inference latency due to autoregressive decoding, which limits their practical deployment.
Method: Uses imitation-learning-based self-speculative decoding with a lightweight draft model for autoregressive token generation and a full model for non-autoregressive verification and correction.
Result: Achieves 1.55-1.85x speedup in inference compared to the final layer while maintaining performance integrity with minimal loss.
Conclusion: FastVLM effectively balances efficiency and accuracy in VLMs through self-speculative decoding, making them more practical for real-world applications.
Abstract: Vision-language Models (VLMs) have made significant strides in visual understanding and query response generation, but often face challenges of high computational cost and inference latency due to autoregressive decoding. In this work, we introduce an imitation-learning-based Self-Speculative Decoding (SSD) framework, named FastVLM, to address these limitations. Our approach employs a lightweight draft model for token generation in an autoregressive manner, while a full model verifies these tokens non-autoregressively. Accepted tokens proceed seamlessly, while rejected tokens are corrected by the full model and used to guide the draft model’s refinement. Through an imitation network, FastVLM enhances the draft model by integrating deeper level insights from the full model’s architecture. Also, it maintains the performance integrity of the full model while training the draft model, achieving a balance between efficiency and accuracy. Our method speeds up the inference process by 1.55-1.85x as compared to the final layer with minimal loss in performance.
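The draft/verify loop follows standard speculative decoding; a sketch with assumed interfaces: `draft_next(seq)` returns one greedy token from the draft model, and `full_verify(seq, block)` returns the full model's greedy token at each of the k drafted positions in a single parallel pass.

```python
def self_speculative_decode(draft_next, full_verify, prefix, k=4, max_new=64):
    """Accept the longest agreeing draft prefix, then take one correction."""
    seq = list(prefix)
    while len(seq) - len(prefix) < max_new:
        block = []
        for _ in range(k):                    # cheap autoregressive drafting
            block.append(draft_next(seq + block))
        verified = full_verify(seq, block)    # non-autoregressive check
        n_ok = 0
        while n_ok < k and verified[n_ok] == block[n_ok]:
            n_ok += 1
        seq += block[:n_ok]
        if n_ok < k:
            seq.append(verified[n_ok])        # full model's correction token
    return seq
```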
[813] Enhancing Graph Classification Robustness with Singular Pooling
Sofiane Ennadir, Oleg Smirnov, Yassine Abbahaddou, Lele Cao, Johannes F. Lutzeyer
Main category: cs.LG
TL;DR: This paper investigates the adversarial robustness of pooling operations in Graph Neural Networks (GNNs) for graph classification, proposing a novel Robust Singular Pooling (RS-Pool) method that uses the dominant singular vector of node embeddings to create robust graph-level representations.
Details
Motivation: While GNNs perform well on graph tasks, their adversarial robustness in graph classification is underexplored compared to node classification. Most defenses focus on message-passing, but pooling operations' role in robustness has been overlooked.
Method: The authors first analyze standard flat pooling methods (sum, average, max) theoretically, deriving adversarial risk bounds. They then propose RS-Pool, which leverages the dominant singular vector of node embedding matrix to construct robust graph-level representations. RS-Pool is model-agnostic and can be efficiently implemented via power iteration.
Result: Empirical results on real-world benchmarks show that RS-Pool provides better robustness than standard pooling methods against state-of-the-art adversarial attacks while maintaining competitive clean accuracy.
Conclusion: Pooling operations play a crucial role in GNN robustness for graph classification, and the proposed RS-Pool method offers improved adversarial robustness while preserving performance on clean data.
Abstract: Graph Neural Networks (GNNs) have achieved strong performance across a range of graph representation learning tasks, yet their adversarial robustness in graph classification remains underexplored compared to node classification. While most existing defenses focus on the message-passing component, this work investigates the overlooked role of pooling operations in shaping robustness. We present a theoretical analysis of standard flat pooling methods (sum, average and max), deriving upper bounds on their adversarial risk and identifying their vulnerabilities under different attack scenarios and graph structures. Motivated by these insights, we propose \textit{Robust Singular Pooling (RS-Pool)}, a novel pooling strategy that leverages the dominant singular vector of the node embedding matrix to construct a robust graph-level representation. We theoretically investigate the robustness of RS-Pool and interpret the resulting bound leading to improved understanding of our proposed pooling operator. While our analysis centers on Graph Convolutional Networks (GCNs), RS-Pool is model-agnostic and can be implemented efficiently via power iteration. Empirical results on real-world benchmarks show that RS-Pool provides better robustness than the considered pooling methods when subject to state-of-the-art adversarial attacks while maintaining competitive clean accuracy. Our code is publicly available at: https://github.com/king/rs-pool.
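The pooling itself is compact: the graph representation comes from the dominant singular vector of the node embedding matrix, computed by power iteration as the paper suggests. How the singular value and vector combine into the final readout is an assumption here.

```python
import torch

def rs_pool(H, iters=10, eps=1e-8):
    """Pool node embeddings H (n x d) via the top right singular vector."""
    v = torch.randn(H.shape[1])
    v = v / (v.norm() + eps)
    for _ in range(iters):                  # power iteration on H^T H
        v = H.T @ (H @ v)
        v = v / (v.norm() + eps)
    sigma = (H @ v).norm()                  # dominant singular value
    return sigma * v                        # d-dimensional graph representation

graph_vec = rs_pool(torch.randn(30, 64))
```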
[814] Variational Polya Tree
Lu Xu, Tsai Hor Chan, Kwok Fai Lam, Lequan Yu, Guosheng Yin
Main category: cs.LG
TL;DR: The paper introduces Variational Polya Tree (VPT), a Bayesian nonparametric method for density estimation that uses stochastic variational inference to address computational limitations of traditional MCMC methods while improving interpretability and uncertainty quantification.
Details
Motivation: Existing density estimation methods lack interpretability and uncertainty quantification. Bayesian nonparametric methods like Polya trees address these issues but face computational complexity and scalability limitations with traditional MCMC approaches.
Method: Proposed Variational Polya Tree (VPT) model using stochastic variational inference to compute posterior distributions, providing flexible nonparametric Bayesian priors that capture latent densities and work with stochastic gradient optimization. Uses joint distribution likelihood for more precise variational posterior approximation.
Result: The model demonstrates competitive performance with state-of-the-art deep density estimation methods on both real data and images, while enhancing interpretability and uncertainty quantification capabilities.
Conclusion: VPT provides an effective Bayesian nonparametric approach for density estimation that overcomes computational limitations of traditional methods while maintaining interpretability and uncertainty quantification benefits.
Abstract: Density estimation is essential for generative modeling, particularly with the rise of modern neural networks. While existing methods capture complex data distributions, they often lack interpretability and uncertainty quantification. Bayesian nonparametric methods, especially the Pólya tree, offer a robust framework that addresses these issues by accurately capturing function behavior over small intervals. Traditional techniques like Markov chain Monte Carlo (MCMC) face high computational complexity and scalability limitations, hindering the use of Bayesian nonparametric methods in deep learning. To tackle this, we introduce the variational Pólya tree (VPT) model, which employs stochastic variational inference to compute posterior distributions. This model provides a flexible, nonparametric Bayesian prior that captures latent densities and works well with stochastic gradient optimization. We also leverage the joint distribution likelihood for a more precise variational posterior approximation than traditional mean-field methods. We evaluate the model performance on both real data and images, and demonstrate its competitiveness with other state-of-the-art deep density estimation methods. We also explore its ability in enhancing interpretability and uncertainty quantification. Code is available at https://github.com/howardchanth/var-polya-tree.
[815] If You Want to Be Robust, Be Wary of Initialization
Sofiane Ennadir, Johannes F. Lutzeyer, Michalis Vazirgiannis, El Houcine Bergou
Main category: cs.LG
TL;DR: This paper explores how weight initialization and training epochs affect GNN robustness against adversarial attacks, showing proper initialization can improve robustness by up to 50% compared to alternatives.
Details
Motivation: Current GNN defense strategies focus on pre-processing and message-passing schemes, but the impact of weight initialization and hyper-parameters on robustness remains under-explored.
Method: The authors introduce a theoretical framework connecting initialization strategies to network resilience, analyze the relationship between initial weights and training epochs on vulnerability, and extend the framework to Deep Neural Networks with a general upper-bound.
Result: Extensive experiments across diverse models and real-world datasets show that appropriate initialization not only maintains clean dataset performance but also significantly enhances robustness against adversarial perturbations, with improvements up to 50% compared to alternative initialization approaches.
Conclusion: Weight initialization and training epochs are crucial factors for adversarial robustness in GNNs, offering new defense mechanisms beyond conventional approaches.
Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable performance across a spectrum of graph-related tasks, however concerns persist regarding their vulnerability to adversarial perturbations. While prevailing defense strategies focus primarily on pre-processing techniques and adaptive message-passing schemes, this study delves into an under-explored dimension: the impact of weight initialization and associated hyper-parameters, such as training epochs, on a model’s robustness. We introduce a theoretical framework bridging the connection between initialization strategies and a network’s resilience to adversarial perturbations. Our analysis reveals a direct relationship between initial weights, number of training epochs and the model’s vulnerability, offering new insights into adversarial robustness beyond conventional defense mechanisms. While our primary focus is on GNNs, we extend our theoretical framework, providing a general upper-bound applicable to Deep Neural Networks. Extensive experiments, spanning diverse models and real-world datasets subjected to various adversarial attacks, validate our findings. We illustrate that selecting appropriate initialization not only ensures performance on clean datasets but also enhances model robustness against adversarial perturbations, with observed gaps of up to 50% compared to alternative initialization approaches.
[816] Learning Without Augmenting: Unsupervised Time Series Representation Learning via Frame Projections
Berken Utku Demirel, Christian Holz
Main category: cs.LG
TL;DR: This paper proposes a self-supervised learning method that replaces traditional data augmentations with views generated using orthonormal bases and overcomplete frames, achieving superior performance without relying on handcrafted augmentations.
Details
Motivation: Traditional SSL methods require domain-specific knowledge for designing data augmentations and impose representational invariances that can limit generalization. The authors aim to overcome these limitations by using geometric properties of different representation spaces.
Method: The method generates views using orthonormal bases and overcomplete frames instead of data augmentations. It leverages the complementary geometry of distinct manifolds formed by embeddings from orthonormal and overcomplete spaces.
Result: The approach achieves performance gains of 15-20% over existing self-supervised methods on nine datasets across five temporal sequence tasks, particularly effective where signal-specific characteristics make data augmentations challenging.
Conclusion: By replacing augmentations with geometric projections and leveraging complementary manifold structures, the proposed method provides a more effective approach to self-supervised learning without the limitations of handcrafted data augmentations.
Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for learning representations without labeled data. Most SSL approaches rely on strong, well-established, handcrafted data augmentations to generate diverse views for representation learning. However, designing such augmentations requires domain-specific knowledge and implicitly imposes representational invariances on the model, which can limit generalization. In this work, we propose an unsupervised representation learning method that replaces augmentations by generating views using orthonormal bases and overcomplete frames. We show that embeddings learned from orthonormal and overcomplete spaces reside on distinct manifolds, shaped by the geometric biases introduced by representing samples in different spaces. By jointly leveraging the complementary geometry of these distinct manifolds, our approach achieves superior performance without artificially increasing data diversity through strong augmentations. We demonstrate the effectiveness of our method on nine datasets across five temporal sequence tasks, where signal-specific characteristics make data augmentations particularly challenging. Without relying on augmentation-induced diversity, our method achieves performance gains of up to 15–20% over existing self-supervised approaches. Source code: https://github.com/eth-siplab/Learning-with-FrameProjections
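A minimal sketch of the two view types on a 1-D signal; the DCT basis and Gaussian random frame below are illustrative stand-ins for whatever bases and frames the paper actually uses.

```python
import numpy as np
from scipy.fft import dct

def make_views(x, frame_dim=None, seed=0):
    """One orthonormal-basis view and one overcomplete-frame view of x."""
    rng = np.random.default_rng(seed)
    ortho_view = dct(x, norm="ortho")                  # orthonormal basis coords
    m = frame_dim or 2 * len(x)                        # overcomplete: m > len(x)
    frame = rng.standard_normal((m, len(x))) / np.sqrt(m)
    frame_view = frame @ x                             # redundant frame coords
    return ortho_view, frame_view

v1, v2 = make_views(np.sin(np.linspace(0, 8, 128)))   # two augmentation-free views
```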
[817] FlowCritic: Bridging Value Estimation with Flow Matching in Reinforcement Learning
Shan Zhong, Shutong Ding, He Diao, Xiangyu Wang, Kah Chan Teh, Bei Peng
Main category: cs.LG
TL;DR: FlowCritic: A generative paradigm for value estimation in RL using flow matching to model value distributions instead of conventional regression methods.
Details
Motivation: Existing value estimation methods have limitations - multi-critic ensembles lack distributional information, while distributional RL methods are limited in expressing complex value distributions due to discretization or quantile regression constraints.
Method: Leverages flow matching to model value distributions and generate samples for value estimation, departing from conventional regression approaches for deterministic value prediction.
Result: Not specified in the abstract provided.
Conclusion: Proposes a novel generative approach to value estimation that can better capture complex value distributions in reinforcement learning.
Abstract: Reliable value estimation serves as the cornerstone of reinforcement learning (RL) by evaluating long-term returns and guiding policy improvement, significantly influencing the convergence speed and final performance. Existing works improve the reliability of value function estimation via multi-critic ensembles and distributional RL, yet the former merely combines multiple point estimates without capturing distributional information, whereas the latter relies on discretization or quantile regression, limiting the expressiveness of complex value distributions. Inspired by flow matching’s success in generative modeling, we propose a generative paradigm for value estimation, named FlowCritic. Departing from conventional regression for deterministic value prediction, FlowCritic leverages flow matching to model value distributions and generate samples for value estimation.
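The abstract maps directly onto the standard conditional flow-matching loss; a sketch assuming a velocity network `vel_net(state, action, x_t, t)` and Monte Carlo return samples:

```python
import torch

def flow_critic_loss(vel_net, state, action, returns):
    """Flow-matching regression toward the value distribution (a sketch).

    Linear probability path x_t = (1 - t) x0 + t x1 from noise x0 to an
    observed return x1; the network regresses the path velocity x1 - x0.
    Sampling a value later means integrating the learned ODE from noise.
    """
    x1 = returns.unsqueeze(-1)            # (B, 1) return samples
    x0 = torch.randn_like(x1)             # base noise
    t = torch.rand(x1.shape[0], 1)
    xt = (1 - t) * x0 + t * x1
    v_pred = vel_net(state, action, xt, t)
    return ((v_pred - (x1 - x0)) ** 2).mean()
```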
[818] Identification of Causal Direction under an Arbitrary Number of Latent Confounders
Wei Chen, Linjun Peng, Zhiyi Huang, Haoyue Dai, Zhifeng Hao, Ruichu Cai, Kun Zhang
Main category: cs.LG
TL;DR: Proposes a method to recover causal structure with latent variables using higher-order cumulant matrices, showing causal asymmetry can be identified even with multiple latent confounders.
Details
Motivation: Existing methods for causal discovery with latent variables require strict assumptions and cannot handle scenarios where observed variables are affected by multiple latent variables simultaneously.
Method: Uses joint higher-order cumulant matrices of observed variables in linear non-Gaussian settings to detect causal asymmetry through rank deficiency properties.
Result: The method effectively identifies causal structure without iterative procedures and demonstrates asymptotic correctness in experiments.
Conclusion: Higher-order cumulant matrices provide a powerful tool for causal discovery in the presence of arbitrary latent confounders, overcoming limitations of existing methods.
Abstract: Recovering causal structure in the presence of latent variables is an important but challenging task. While many methods have been proposed to handle it, most of them require strict and/or untestable assumptions on the causal structure. In real-world scenarios, observed variables may be affected by multiple latent variables simultaneously, which, generally speaking, cannot be handled by these methods. In this paper, we consider the linear, non-Gaussian case, and make use of the joint higher-order cumulant matrix of the observed variables constructed in a specific way. We show that, surprisingly, causal asymmetry between two observed variables can be directly seen from the rank deficiency properties of such higher-order cumulant matrices, even in the presence of an arbitrary number of latent confounders. Identifiability results are established, and the corresponding identification methods do not even involve iterative procedures. Experimental results demonstrate the effectiveness and asymptotic correctness of our proposed method.
[819] S-Chain: Structured Visual Chain-of-Thought For Medicine
Khai Le-Duc, Duy M. H. Nguyen, Phuong T. H. Trinh, Tien-Phat Nguyen, Nghiem T. Diep, An Ngo, Tung Vu, Trinh Vuong, Anh-Tien Nguyen, Mau Nguyen, Van Trung Hoang, Khai-Nguyen Nguyen, Hy Nguyen, Chris Ngo, Anji Liu, Nhat Ho, Anne-Christin Hauschild, Khanh Xuan Nguyen, Thanh Nguyen-Tang, Pengtao Xie, Daniel Sonntag, James Zou, Mathias Niepert, Anh Totti Nguyen
Main category: cs.LG
TL;DR: S-Chain is a large-scale medical dataset with 12,000 expert-annotated images featuring bounding boxes and structured visual Chain-of-Thought reasoning, explicitly linking visual regions to reasoning steps across 16 languages.
Details
Motivation: Faithful reasoning in medical VLMs requires transparent alignment between textual rationales and visual evidence, but no large-scale expert-level dataset existed for stepwise reasoning with precise visual grounding.
Method: Created S-Chain dataset with structured visual CoT (SV-CoT) annotations, benchmarked state-of-the-art medical and general-purpose VLMs, studied synergy with retrieval-augmented generation, and proposed a new mechanism to strengthen visual evidence-reasoning alignment.
Result: SV-CoT supervision significantly improves interpretability, grounding fidelity, and robustness in medical VQA. The dataset enables benchmarking and reveals how domain knowledge and visual grounding interact during reasoning.
Conclusion: S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical VLMs by improving both reliability and efficiency through better visual evidence-reasoning alignment.
Abstract: Faithful reasoning in medical vision-language models (VLMs) requires not only accurate predictions but also transparent alignment between textual rationales and visual evidence. While Chain-of-Thought (CoT) prompting has shown promise in medical visual question answering (VQA), no large-scale expert-level dataset has captured stepwise reasoning with precise visual grounding. We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT), explicitly linking visual regions to reasoning steps. The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. Using S-Chain, we benchmark state-of-the-art medical VLMs (ExGra-Med, LLaVA-Med) and general-purpose VLMs (Qwen2.5-VL, InternVL2.5), showing that SV-CoT supervision significantly improves interpretability, grounding fidelity, and robustness. Beyond benchmarking, we study its synergy with retrieval-augmented generation, revealing how domain knowledge and visual grounding interact during autoregressive reasoning. Finally, we propose a new mechanism that strengthens the alignment between visual evidence and reasoning, improving both reliability and efficiency. S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical VLMs.
[820] Centrum: Model-based Database Auto-tuning with Minimal Distributional Assumptions
Yuanhao Lai, Pengfei Zheng, Chenpeng Ji, Yan Li, Songhan Zhang, Rutao Zhang, Zhengang Wang, Yunfei Du
Main category: cs.LG
TL;DR: Centrum is a novel DBMS auto-tuner that improves upon traditional GP-BO and tree-ensemble methods by using stochastic gradient boosting ensembles with conformal prediction for distribution-free uncertainty estimation, achieving superior performance over 21 state-of-the-art methods.
Details
Motivation: GP-BO-based DBMS auto-tuners outperform SMAC-based ones, but GP-BO violates fundamental assumptions in DBMS performance modeling. Tree-ensemble-BOs avoid these pitfalls but have limitations in uncertainty estimation and don't leverage recent gradient boosting advances.
Method: Centrum uses a two-phase learning procedure with stochastic gradient boosting ensembles for improved point and interval estimation. It employs generalized SGBE-estimated locally-adaptive conformal prediction for distribution-free uncertainty estimation and acquisition function.
Result: Extensive experiments on two DBMSs and three workloads show Centrum outperforms 21 state-of-the-art methods in both physical and simulation experiments.
Conclusion: Centrum is the first auto-tuner to achieve distribution-freeness and seamlessly fuse gradient boosting ensembles with conformal inference in Bayesian optimization, significantly enhancing DBMS auto-tuning practicality and performance.
Abstract: Gaussian-Process-based Bayesian optimization (GP-BO), is a prevailing model-based framework for DBMS auto-tuning. However, recent work shows GP-BO-based DBMS auto-tuners significantly outperformed auto-tuners based on SMAC, which features random forest surrogate models; such results motivate us to rethink and investigate the limitations of GP-BO in auto-tuner design. We find the fundamental assumptions of GP-BO are widely violated when modeling and optimizing DBMS performance, while tree-ensemble-BOs (e.g., SMAC) can avoid the assumption pitfalls and deliver improved tuning efficiency and effectiveness. Moreover, we argue that existing tree-ensemble-BOs restrict further advancement in DBMS auto-tuning. First, existing tree-ensemble-BOs can only achieve distribution-free point estimates, but still impose unrealistic distributional assumptions on uncertainty estimates, compromising surrogate modeling and distorting the acquisition function. Second, recent advances in gradient boosting, which can further enhance surrogate modeling against vanilla GP and random forest counterparts, have rarely been applied in optimizing DBMS auto-tuners. To address these issues, we propose a novel model-based DBMS auto-tuner, Centrum. Centrum improves distribution-free point and interval estimation in surrogate modeling with a two-phase learning procedure of stochastic gradient boosting ensembles. Moreover, Centrum adopts a generalized SGBE-estimated locally-adaptive conformal prediction to facilitate a distribution-free uncertainty estimation and acquisition function. To our knowledge, Centrum is the first auto-tuner to realize distribution-freeness, enhancing BO’s practicality in DBMS auto-tuning, and the first to seamlessly fuse gradient boosting ensembles and conformal inference in BO. Extensive physical and simulation experiments on two DBMSs and three workloads show Centrum outperforms 21 SOTA methods.
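The distribution-free interval construction is standard enough to sketch: normalized residuals on a calibration split give a conformal quantile, which scales the ensemble's local spread estimate. Names and the specific score are illustrative; Centrum's generalized locally-adaptive variant is more involved.

```python
import numpy as np

def adaptive_conformal_interval(mu, sigma, mu_cal, sigma_cal, y_cal, alpha=0.1):
    """Split-conformal interval with locally-adaptive scores (a sketch)."""
    scores = np.abs(y_cal - mu_cal) / sigma_cal        # normalized residuals
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level)                     # distribution-free quantile
    return mu - q * sigma, mu + q * sigma              # ~(1 - alpha) coverage
```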
[821] Distributionally Robust Optimization via Diffusion Ambiguity Modeling
Jiaqi Wen, Jianyi Yang
Main category: cs.LG
TL;DR: The paper proposes Diffusion-based DRO (D-DRO), a tractable distributionally robust optimization algorithm using diffusion models to create ambiguity sets that capture adversarial distributions beyond nominal support while maintaining consistency.
Details
Motivation: To enhance robustness and generalization in statistical learning by designing effective ambiguity sets for DRO that are both consistent with nominal distributions and diverse enough to account for various scenarios while remaining tractable.
Method: Proposes a diffusion-based ambiguity set design that captures adversarial distributions beyond nominal support space, and develops D-DRO algorithm that solves inner maximization over parameterized diffusion model space.
Result: Formally establishes stationary convergence performance of D-DRO and empirically demonstrates superior Out-of-Distribution generalization performance in machine learning prediction tasks.
Conclusion: D-DRO provides an effective and tractable approach to distributionally robust optimization with strong theoretical guarantees and empirical performance for OOD generalization.
Abstract: This paper studies Distributionally Robust Optimization (DRO), a fundamental framework for enhancing the robustness and generalization of statistical learning and optimization. An effective ambiguity set for DRO must involve distributions that remain consistent with the nominal distribution while being diverse enough to account for a variety of potential scenarios. Moreover, it should lead to tractable DRO solutions. To this end, we propose a diffusion-based ambiguity set design that captures various adversarial distributions beyond the nominal support space while maintaining consistency with the nominal distribution. Building on this ambiguity modeling, we propose Diffusion-based DRO (D-DRO), a tractable DRO algorithm that solves the inner maximization over the parameterized diffusion model space. We formally establish the stationary convergence performance of D-DRO and empirically demonstrate its superior Out-of-Distribution (OOD) generalization performance in an ML prediction task.
[822] TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination
Omar Naim, Krish Sharma, Nicholas Asher
Main category: cs.LG
TL;DR: TALE is an inference-time algorithm that prunes transformer layers in LLMs by optimizing task-specific validation performance, improving accuracy while reducing computational costs without retraining.
Details
Motivation: To address computational inefficiency in LLMs by identifying and removing redundant or bottleneck layers that degrade task performance, while maintaining or improving accuracy.
Method: Task-Aware Layer Elimination (TALE) directly optimizes task-specific validation performance to prune entire transformer layers during inference, requiring no retraining and working under both zero-shot and few-shot settings.
Result: TALE consistently improved accuracy while reducing computational cost across 9 tasks and 5 models (LLaMA 3.1 8B, Qwen 2.5 7B, Qwen 2.5 0.5B, Mistral 7B, Lucie 7B). Applying TALE during finetuning provided additional performance gains.
Conclusion: TALE produces smaller, faster, and more accurate models that are also faster to fine-tune, while providing insights into transformer interpretability through mutual information analysis showing certain layers act as bottlenecks.
Abstract: In this paper we introduce TALE (Task-Aware Layer Elimination), an inference-time algorithm that prunes entire transformer layers in an LLM by directly optimizing task-specific validation performance. We evaluate TALE on 9 tasks and 5 models, including LLaMA 3.1 8B, Qwen 2.5 7B, Qwen 2.5 0.5B, Mistral 7B, and Lucie 7B, under both zero-shot and few-shot settings. Unlike prior approaches, TALE requires no retraining and consistently improves accuracy while reducing computational cost across all benchmarks. Furthermore, applying TALE during finetuning leads to additional performance gains. Finally, TALE provides flexible user control over trade-offs between accuracy and efficiency. Mutual information analysis shows that certain layers act as bottlenecks, degrading task-relevant representations. TALE’s selective layer removal remedies this problem, producing smaller, faster, and more accurate models that are also faster to fine-tune while offering new insights into transformer interpretability.
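One simple instantiation of task-aware elimination is a greedy sweep over layers, accepting a removal whenever validation accuracy does not drop; the paper's actual search may differ. Here `evaluate(keep)` is an assumed callable that runs the model with only the listed layers and returns validation accuracy.

```python
def tale_prune(num_layers, evaluate):
    """Greedily drop layers that do not hurt task validation accuracy."""
    keep = list(range(num_layers))
    best = evaluate(keep)
    improved = True
    while improved:
        improved = False
        for i in list(keep):
            trial = [j for j in keep if j != i]
            acc = evaluate(trial)
            if acc >= best:               # layer i was redundant or a bottleneck
                keep, best, improved = trial, acc, True
    return keep, best
```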
[823] SeeDNorm: Self-Rescaled Dynamic Normalization
Wenrui Cai, Defa Zhu, Qingjie Liu, Qiyang Min
Main category: cs.LG
TL;DR: SeeDNorm is a dynamic normalization method that enhances RMSNorm by preserving input norm information and enabling data-dependent scaling, achieving superior performance across various tasks with minimal parameter overhead.
Details
Motivation: RMSNorm discards input norm information and uses static scaling factors, which limits performance improvements especially in zero-shot scenarios where large language models need to handle diverse input data and distributional shifts.
Method: SeeDNorm dynamically adjusts scaling coefficients based on current input, preserving input norm information while maintaining RMSNorm’s gradient adjustment capabilities. It addresses potential training instability with specific solutions.
Result: SeeDNorm consistently outperforms RMSNorm, LayerNorm, and DyT across models of varying sizes in large language model pre-training, supervised and unsupervised computer vision tasks, with negligible efficiency impact.
Conclusion: SeeDNorm provides a more effective normalization approach that preserves input norm information through dynamic scaling, offering superior performance over existing normalization methods while maintaining computational efficiency.
Abstract: Normalization layer constitutes an essential component in neural networks. In transformers, the predominantly used RMSNorm constrains vectors to a unit hypersphere, followed by dimension-wise rescaling through a learnable scaling coefficient $\gamma$ to maintain the representational capacity of the model. However, RMSNorm discards the input norm information in forward pass and a static scaling factor $\gamma$ may be insufficient to accommodate the wide variability of input data and distributional shifts, thereby limiting further performance improvements, particularly in zero-shot scenarios that large language models routinely encounter. To address this limitation, we propose SeeDNorm, which enhances the representational capability of the model by dynamically adjusting the scaling coefficient based on the current input, thereby preserving the input norm information and enabling data-dependent, self-rescaled dynamic normalization. During backpropagation, SeeDNorm retains the ability of RMSNorm to dynamically adjust gradient according to the input norm. We provide a detailed analysis of the training optimization for SeeDNorm and propose corresponding solutions to address potential instability issues that may arise when applying SeeDNorm. We validate the effectiveness of SeeDNorm across models of varying sizes in large language model pre-training as well as supervised and unsupervised computer vision tasks. By introducing a minimal number of parameters and with negligible impact on model efficiency, SeeDNorm achieves consistently superior performance compared to previously commonly used normalization layers such as RMSNorm and LayerNorm, as well as element-wise activation alternatives to normalization layers like DyT.
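The stated idea fits in a short module: normalize to the unit hypersphere as RMSNorm does, but let the scale depend on the discarded input norm. The linear modulation below is an assumed parameterization, not the paper's exact one.

```python
import torch
import torch.nn as nn

class SeeDNormSketch(nn.Module):
    """RMSNorm with a data-dependent, self-rescaled scale (a sketch)."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))    # static scale, as in RMSNorm
        self.alpha = nn.Parameter(torch.zeros(dim))   # norm-dependent correction
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).sqrt()
        x_hat = x / (rms + self.eps)                  # unit-hypersphere projection
        scale = self.gamma + self.alpha * rms         # re-inject the input norm
        return scale * x_hat
```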
[824] Inductive Transfer Learning for Graph-Based Recommenders
Florian Grötschla, Elia Trachsel, Luca A. Lanzendörfer, Roger Wattenhofer
Main category: cs.LG
TL;DR: NBF-Rec is a graph-based recommender system that enables inductive transfer learning across datasets with disjoint user/item sets, computing embeddings dynamically at inference rather than requiring retraining.
Details
Motivation: Traditional graph-based recommenders are limited to transductive settings and cannot handle new users, items, or datasets without retraining. There's a need for models that can transfer knowledge across domains with different user and item sets.Method: Proposes NBF-Rec which computes node embeddings dynamically at inference time using interaction-level message passing, supporting inductive transfer learning across datasets with disjoint user and item sets.
Result: Evaluated on 7 real-world datasets spanning movies, music, e-commerce, and location check-ins. Achieves competitive performance in zero-shot settings and shows further improvements with lightweight fine-tuning.
Conclusion: Inductive transfer is feasible in graph-based recommendation, and interaction-level message passing enables generalization across datasets without requiring aligned users or items.
Abstract: Graph-based recommender systems are commonly trained in transductive settings, which limits their applicability to new users, items, or datasets. We propose NBF-Rec, a graph-based recommendation model that supports inductive transfer learning across datasets with disjoint user and item sets. Unlike conventional embedding-based methods that require retraining for each domain, NBF-Rec computes node embeddings dynamically at inference time. We evaluate the method on seven real-world datasets spanning movies, music, e-commerce, and location check-ins. NBF-Rec achieves competitive performance in zero-shot settings, where no target domain data is used for training, and demonstrates further improvements through lightweight fine-tuning. These results show that inductive transfer is feasible in graph-based recommendation and that interaction-level message passing supports generalization across datasets without requiring aligned users or items.
[825] Offline Preference Optimization via Maximum Marginal Likelihood Estimation
Saeed Najafi, Alona Fyshe
Main category: cs.LG
TL;DR: Proposes MMPO, a simpler alternative to RLHF that uses Maximum Marginal Likelihood for LLM alignment, eliminating need for reward models and entropy maximization while achieving competitive performance.
Details
Motivation: Standard alignment methods like RLHF are complex and unstable, creating need for simpler, more stable approaches to align LLMs with human preferences.Method: MMPO recasts alignment through Maximum Marginal Likelihood estimation, maximizing marginal log-likelihood of preferred outputs using preference pairs as samples, without explicit reward models or entropy maximization.
Result: MMPO shows better stability with hyperparameter β, achieves competitive or superior preference alignment, and better preserves base model’s language capabilities across models from 135M to 8B parameters.
Conclusion: MMPO provides a simpler, more stable alternative to RLHF that effectively aligns LLMs with human preferences while maintaining model capabilities, with theoretical and empirical validation.
Abstract: Aligning Large Language Models (LLMs) with human preferences is crucial, but standard methods like Reinforcement Learning from Human Feedback (RLHF) are often complex and unstable. In this work, we propose a new, simpler approach that recasts alignment through the lens of Maximum Marginal Likelihood (MML) estimation. Our new MML based Preference Optimization (MMPO) maximizes the marginal log-likelihood of a preferred text output, using the preference pair as samples for approximation, and forgoes the need for both an explicit reward model and entropy maximization. We theoretically demonstrate that MMPO implicitly performs preference optimization, producing a weighted gradient that naturally up-weights chosen responses over rejected ones. Across models ranging from 135M to 8B parameters, we empirically show that MMPO: 1) is more stable with respect to the hyperparameter $\beta$ compared to alternative baselines, and 2) achieves competitive or superior preference alignment while better preserving the base model’s general language capabilities. Through a series of ablation experiments, we show that this improved performance is indeed attributable to MMPO’s implicit preference optimization within the gradient updates.
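A one-line property underlies the "weighted gradient" claim: for any marginal objective of the form log Σ_i p_θ(y_i), the gradient is the posterior-weighted sum of per-sample log-likelihood gradients. A short autograd check of that generic identity (deliberately not MMPO's exact loss, which the abstract does not fully specify):

```python
# Generic MML gradient identity: d/d logp_i [logsumexp(logp)] = softmax(logp)_i,
# i.e., each sample's gradient is weighted by its posterior under the model.
import torch

logp = torch.tensor([-5.0, -7.0], requires_grad=True)  # log p(y_chosen), log p(y_rejected)
marginal = torch.logsumexp(logp, dim=0)                # log [p(y+) + p(y-)]
marginal.backward()

weights = torch.softmax(logp, dim=0)                   # posterior over the pair
assert torch.allclose(logp.grad, weights)              # the identity holds exactly
print(weights)  # the better-scored response receives the larger weight
```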
[826] A Theory of the Mechanics of Information: Generalization Through Measurement of Uncertainty (Learning is Measuring)
Christopher J. Hazard, Michael Resnick, Jacob Beel, Jack Xia, Cade Mack, Dominic Glennie, Matthew Fulp, David Maze, Andrew Bassett, Martin Koistinen
Main category: cs.LG
TL;DR: A model-free framework using surprisal (information theoretic uncertainty) that eliminates traditional distribution modeling, reduces bias, and enables efficient data updates including direct edits and deletion.
Details
Motivation: Traditional machine learning relies on explicit models and domain assumptions, which limits flexibility and interpretability. The goal is to create a more flexible, interpretable, and bias-reduced approach.Method: Uses surprisal (information theoretic uncertainty) to directly analyze raw data without distribution modeling. Quantifies relevance through uncertainty and enables efficient data updates including direct edits and deletion of training data.
Result: Achieves at or near state-of-the-art performance across most common machine learning tasks including generative inference, causal discovery, anomaly detection, and time series forecasting. Works effectively with complex data types including missing data.
Conclusion: This framework offers a viable alternative to neural networks for scalable machine learning and AI that maintains human understandability of the underlying mechanics, emphasizing traceability, interpretability, and data-driven decision making.
Abstract: Traditional machine learning relies on explicit models and domain assumptions, limiting flexibility and interpretability. We introduce a model-free framework using surprisal (information theoretic uncertainty) to directly analyze and perform inferences from raw data, eliminating distribution modeling, reducing bias, and enabling efficient updates including direct edits and deletion of training data. By quantifying relevance through uncertainty, the approach enables generalizable inference across tasks including generative inference, causal discovery, anomaly detection, and time series forecasting. It emphasizes traceability, interpretability, and data-driven decision making, offering a unified, human-understandable framework for machine learning, and achieves at or near state-of-the-art performance across most common machine learning tasks. The mathematical foundations create a "physics" of information, which enables these techniques to apply effectively to a wide variety of complex data types, including missing data. Empirical results indicate that this may be a viable alternative path to neural networks with regard to scalable machine learning and artificial intelligence that can maintain human understandability of the underlying mechanics.
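The abstract leaves the estimator unspecified; as one concrete reading, surprisal can be scored with a k-NN density estimate and used directly for anomaly detection. A hedged sketch (the paper's actual estimator may differ):

```python
# Surprisal ~ -log p(x), with p(x) proportional to k / (n * r_k(x)^d),
# the standard k-NN density estimate (the unit-ball volume constant is
# omitted since it does not affect rankings).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_surprisal(train: np.ndarray, query: np.ndarray, k: int = 10) -> np.ndarray:
    n, d = train.shape
    nn = NearestNeighbors(n_neighbors=k).fit(train)
    r_k, _ = nn.kneighbors(query)
    r_k = r_k[:, -1] + 1e-12                       # distance to the k-th neighbor
    log_density = np.log(k) - np.log(n) - d * np.log(r_k)
    return -log_density                            # higher = more surprising

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
scores = knn_surprisal(x, np.array([[0.0, 0.0], [6.0, 6.0]]))
print(scores)  # the outlying point gets the larger surprisal
```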
[827] Distributed Multi-Agent Bandits Over Erdős-Rényi Random Networks
Jingyuan Liu, Hao Qiu, Lin Yang, Mengfan Xu
Main category: cs.LG
TL;DR: The paper proposes a distributed multi-agent multi-armed bandit algorithm for heterogeneous rewards over random communication graphs, achieving logarithmic regret with interpretable terms showing the trade-off between communication efficiency and regret.
Details
Motivation: To address the distributed multi-agent multi-armed bandit problem where agents have heterogeneous rewards and communicate over time-varying random graphs that are not necessarily connected at each time step, enabling efficient learning in decentralized settings.Method: A fully distributed algorithm that integrates arm elimination strategy with random gossip algorithm, operating over Erdős-Rényi random graphs applied to a fixed connected base graph.
Result: Theoretical regret upper bound of O(log T) that includes the optimal centralized regret plus additional terms reflecting the impact of graph connectivity and link probability, showing a fundamental trade-off between communication efficiency and regret.
Conclusion: The proposed algorithm achieves near-optimal performance with interpretable regret bounds, validated by numerical experiments that demonstrate superiority over existing benchmarks and confirm theoretical scaling with problem complexity.
Abstract: We study the distributed multi-agent multi-armed bandit problem with heterogeneous rewards over random communication graphs. Uniquely, at each time step $t$ agents communicate over a time-varying random graph $G_t$ generated by applying the Erdős-Rényi model to a fixed connected base graph $G$ (for classical Erdős-Rényi graphs, $G$ is a complete graph), where each potential edge in $G$ is randomly and independently present with the link probability $p$. Notably, the resulting random graph is not necessarily connected at each time step. Each agent’s arm rewards follow time-invariant distributions, and the reward distribution for the same arm may differ across agents. The goal is to minimize the cumulative expected regret relative to the global mean reward of each arm, defined as the average of that arm’s mean rewards across all agents. To this end, we propose a fully distributed algorithm that integrates the arm elimination strategy with the random gossip algorithm. We theoretically show that the regret upper bound is of order $\log T$ and is highly interpretable, where $T$ is the time horizon. It includes the optimal centralized regret $O\left(\sum_{k: \Delta_k>0} \frac{\log T}{\Delta_k}\right)$ and an additional term $O\left(\frac{N^2 \log T}{p \lambda_{N-1}(Lap(G))} + \frac{KN^2 \log T}{p}\right)$, where $N$ and $K$ denote the total number of agents and arms, respectively. This term reflects the impact of $G$’s algebraic connectivity $\lambda_{N-1}(Lap(G))$ and the link probability $p$, and thus highlights a fundamental trade-off between communication efficiency and regret. As a by-product, we show a nearly optimal regret lower bound. Finally, our numerical experiments not only show the superiority of our algorithm over existing benchmarks, but also validate the theoretical regret scaling with problem complexity.
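One round of the communication pattern described, drawing an Erdős-Rényi subgraph of the base graph with link probability p and then gossip-averaging the agents' arm estimates, can be sketched as follows. The Metropolis averaging weights are an illustrative choice, not necessarily the paper's.

```python
# A minimal sketch of one random-gossip round over an Erdős-Rényi graph
# drawn on a fixed base graph G; nodes are assumed to be labeled 0..N-1.
import numpy as np
import networkx as nx

def gossip_round(G_base: nx.Graph, x: np.ndarray, p: float, rng) -> np.ndarray:
    """Each base edge appears independently w.p. p; neighbors then average."""
    G_t = nx.Graph()
    G_t.add_nodes_from(G_base.nodes)
    G_t.add_edges_from(e for e in G_base.edges if rng.random() < p)
    W = np.eye(len(x))                     # Metropolis mixing matrix (doubly stochastic)
    for i, j in G_t.edges:
        w = 1.0 / (max(G_t.degree[i], G_t.degree[j]) + 1.0)
        W[i, i] -= w; W[j, j] -= w
        W[i, j] += w; W[j, i] += w
    return W @ x                           # one consensus step on arm estimates

rng = np.random.default_rng(0)
G = nx.cycle_graph(6)
x = rng.normal(size=6)
print(gossip_round(G, x, p=0.5, rng=rng))
```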
[828] Can Language Models Compose Skills In-Context?
Zidong Liu, Zhuoyan Xu, Zhenmei Shi, Yingyu Liang
Main category: cs.LG
TL;DR: Language models struggle with in-context composition of basic skills from examples, often performing worse than expected due to difficulty recognizing and assembling skills correctly, even with Chain-of-Thought prompting.
Details
Motivation: To investigate how well language models can compose basic skills from in-context examples to accomplish composite tasks, which is more challenging than learning skills during training.Method: Systematic experiments on various open-source language models using linguistic and logical tasks designed to probe composition abilities, with theoretical analysis of example alignment.
Result: Simple task examples surprisingly negatively impact performance; models struggle to recognize and assemble skills correctly even with Chain-of-Thought examples.
Conclusion: Aligning examples with corresponding composition steps is crucial, and a proposed method for probing tasks shows improved performance, supporting these insights.
Abstract: Composing basic skills from simple tasks to accomplish composite tasks is crucial for modern intelligent systems. We investigate the in-context composition ability of language models to perform composite tasks that combine basic skills demonstrated in in-context examples. This is more challenging than the standard setting, where skills and their composition can be learned in training. We conduct systematic experiments on various representative open-source language models, utilizing linguistic and logical tasks designed to probe composition abilities. The results reveal that simple task examples can have a surprising negative impact on the performance, because the models generally struggle to recognize and assemble the skills correctly, even with Chain-of-Thought examples. Theoretical analysis further shows that it is crucial to align examples with the corresponding steps in the composition. This inspires a method for the probing tasks, whose improved performance provides positive support for our insights.
[829] Air Quality Prediction Using LOESS-ARIMA and Multi-Scale CNN-BiLSTM with Residual-Gated Attention
Soham Pahari, Sandeep Chand Kumain
Main category: cs.LG
TL;DR: A hybrid framework combining LOESS decomposition, ARIMA, and multi-scale CNN-BiLSTM with residual-gated attention for accurate AQI forecasting in Indian megacities, achieving superior performance over existing methods.
Details
Motivation: Air pollution in Indian megacities like Delhi, Kolkata, and Mumbai poses serious health risks, with sudden pollutant spikes making timely intervention challenging. Accurate AQI forecasting is difficult due to complex linear trends, seasonal variations, and volatile nonlinear patterns.Method: Hybrid framework integrating LOESS decomposition to separate AQI series into trend, seasonal, and residual components. ARIMA models smooth components while a multi-scale CNN-BiLSTM network with residual-gated attention captures volatility in residuals. Hyperparameters optimized via Unified Adaptive Multi-Stage Metaheuristic Optimizer (UAMMO).
Result: Outperforms statistical, deep learning, and hybrid baselines across PM2.5, O3, CO, and NOx in three major cities, achieving 5-8% lower MSE and higher R^2 scores (>0.94) for all pollutants. Demonstrates robustness and sensitivity to sudden pollution events.
Conclusion: The proposed framework shows strong performance in urban air quality management, effectively handling complex AQI patterns and providing reliable forecasting for pollution control interventions.
Abstract: Air pollution remains a critical environmental and public health concern in Indian megacities such as Delhi, Kolkata, and Mumbai, where sudden spikes in pollutant levels challenge timely intervention. Accurate Air Quality Index (AQI) forecasting is difficult due to the coexistence of linear trends, seasonal variations, and volatile nonlinear patterns. This paper proposes a hybrid forecasting framework that integrates LOESS decomposition, ARIMA modeling, and a multi-scale CNN-BiLSTM network with a residual-gated attention mechanism. The LOESS step separates the AQI series into trend, seasonal, and residual components, with ARIMA modeling the smooth components and the proposed deep learning module capturing multi-scale volatility in the residuals. Model hyperparameters are tuned via the Unified Adaptive Multi-Stage Metaheuristic Optimizer (UAMMO), combining multiple optimization strategies for efficient convergence. Experiments on 2021-2023 AQI datasets from the Central Pollution Control Board show that the proposed method consistently outperforms statistical, deep learning, and hybrid baselines across PM2.5, O3, CO, and NOx in three major cities, achieving up to 5-8% lower MSE and higher R^2 scores (>0.94) for all pollutants. These results demonstrate the framework’s robustness, sensitivity to sudden pollution events, and applicability to urban air quality management.
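A hedged sketch of the decomposition stage using statsmodels' STL (a LOESS-based decomposition): the smooth trend-plus-seasonal part goes to ARIMA, while the residual would feed the CNN-BiLSTM module, represented here by a hypothetical placeholder. The ARIMA order and the daily period are illustrative, not the paper's tuned settings.

```python
# Decomposition + hybrid forecast skeleton (illustrative data and orders).
import numpy as np
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.arima.model import ARIMA

aqi = np.sin(np.linspace(0, 40, 480)) + 0.1 * np.random.default_rng(0).normal(size=480)

res = STL(aqi, period=24).fit()                   # LOESS-based decomposition
smooth = res.trend + res.seasonal                 # smooth components -> ARIMA
arima_fc = ARIMA(smooth, order=(2, 0, 1)).fit().forecast(steps=24)

residual = res.resid                              # volatile part -> deep module
# deep_fc = cnn_bilstm_with_gated_attention(residual)  # hypothetical module
# final_forecast = arima_fc + deep_fc
print(arima_fc[:3])
```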
[830] Last Iterate Analyses of FTRL in Stochastic Bandits
Jingxin Zhan, Yuze Han, Zhihua Zhang
Main category: cs.LG
TL;DR: This paper analyzes the last-iterate convergence rate of Follow-the-Regularized-Leader (FTRL) algorithms in multi-armed bandits, showing that the Bregman divergence decays at a rate of t^{-1/2} for the 1/2-Tsallis-INF algorithm.
Details
Motivation: While most bandit algorithm analyses focus on regret bounds, last-iterate convergence (which captures the actual decision evolution) remains under-explored for FTRL algorithms, despite their proven Best-of-Both-Worlds properties and logarithmic regret in stochastic bandits.Method: Theoretical analysis of the Bregman divergence between the optimal arm’s point mass and the probability distribution obtained by the 1/2-Tsallis-INF FTRL algorithm at iteration t, using the regular function Ψ(p) = -4∑_{i=1}^d √p_i.
Result: The Bregman divergence decays at a rate of t^{-1/2}, which is slower than the intuitive expectation of t^{-1} that would correspond to logarithmic regret.
Conclusion: The paper partially confirms the intuition connecting logarithmic regret to last-iterate convergence rates, but reveals a t^{-1/2} convergence rate rather than the expected t^{-1}, highlighting a gap between regret analysis and last-iterate convergence analysis for FTRL algorithms in bandits.
Abstract: The convergence analysis of online learning algorithms is central to machine learning theory, where last-iterate convergence is particularly important, as it captures the learner’s actual decisions and describes the evolution of the learning process over time. However, in multi-armed bandits, most existing algorithmic analyses mainly focus on the order of regret, while the last-iterate (simple regret) convergence rate remains less explored – especially for the widely studied Follow-the-Regularized-Leader (FTRL) algorithms. Recently, a growing line of work has established the Best-of-Both-Worlds (BOBW) property of FTRL algorithms in bandit problems, showing in particular that they achieve logarithmic regret in stochastic bandits. Nevertheless, their last-iterate convergence rate has not yet been studied. Intuitively, logarithmic regret should correspond to a $t^{-1}$ last-iterate convergence rate. This paper partially confirms this intuition through theoretical analysis, showing that the Bregman divergence, defined by the regular function $\Psi(p)=-4\sum_{i=1}^{d}\sqrt{p_i}$ associated with the BOBW FTRL algorithm $1/2$-Tsallis-INF (arXiv:1807.07623), between the point mass on the optimal arm and the probability distribution over the arm set obtained at iteration $t$, decays at a rate of $t^{-1/2}$.
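The quantity the paper tracks is concrete enough to compute directly: the Bregman divergence induced by $\Psi(p)=-4\sum_i\sqrt{p_i}$ between the point mass on the optimal arm and the current arm distribution. A small numeric illustration:

```python
# Bregman divergence D_Psi(u, v) = Psi(u) - Psi(v) - <grad Psi(v), u - v>
# for the 1/2-Tsallis regularizer Psi(p) = -4 * sum_i sqrt(p_i).
import numpy as np

def psi(p):
    return -4.0 * np.sum(np.sqrt(p))

def grad_psi(p):
    return -2.0 / np.sqrt(p)          # since d/dp_i (-4 sqrt(p_i)) = -2 / sqrt(p_i)

def bregman(u, v):
    return psi(u) - psi(v) - grad_psi(v) @ (u - v)

d, k_star = 4, 0
e = np.zeros(d); e[k_star] = 1.0          # point mass on the optimal arm
p_t = np.array([0.7, 0.1, 0.1, 0.1])      # distribution at iteration t (example)
print(bregman(e, p_t))                    # the paper shows this decays as t^(-1/2)
```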
[831] Logical GANs: Adversarial Learning through Ehrenfeucht Fraisse Games
Mirco A. Mannucci
Main category: cs.LG
TL;DR: LOGAN combines GANs with logical reasoning by using a depth-limited discriminator that searches for logical faults, and a generator that produces samples matching a target theory within k rounds.
Details
Motivation: To bridge the gap between GANs' indistinguishability promises and logical explainability by putting both on a computational budget.Method: Uses a discriminator as depth-k EF Opponent searching for logical faults, and generator as Builder producing samples that match target theory T within k rounds. Includes EF-probe simulator, MSO-style graph checkers, and logical loss mixing EF round-resilience with certificate terms.
Result: Framework achieves 92%-98% property satisfaction via simulation, and real neural GAN training shows 5%-14% improvements on challenging properties with 98% satisfaction on connectivity.
Conclusion: LOGAN provides a compact, reproducible approach for logic-bounded generation with interpretable failures, proven effectiveness in both simulation and real training, and adjustable control parameters.
Abstract: GANs promise indistinguishability, logic explains it. We put the two on a budget: a discriminator that can only "see" up to a logical depth $k$, and a generator that must look correct to that bounded observer. LOGAN (LOGical GANs) casts the discriminator as a depth-$k$ Ehrenfeucht–Fraïssé (EF) Opponent that searches for small, legible faults (odd cycles, nonplanar crossings, directed bridges), while the generator plays Builder, producing samples that admit a $k$-round matching to a target theory $T$. We ship a minimal toolkit – an EF-probe simulator and MSO-style graph checkers – and four experiments including real neural GAN training with PyTorch. Beyond verification, we score samples with a logical loss that mixes budgeted EF round-resilience with cheap certificate terms, enabling a practical curriculum on depth. Framework validation demonstrates 92%–98% property satisfaction via simulation (Exp. 3), while real neural GAN training achieves 5%–14% improvements on challenging properties and 98% satisfaction on connectivity (matching simulation) through adversarial learning (Exp. 4). LOGAN is a compact, reproducible path toward logic-bounded generation with interpretable failures, proven effectiveness (both simulated and real training), and dials for control.
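Of the "small, legible faults" listed, the odd-cycle check is the simplest to reproduce: a graph contains an odd cycle iff it is not bipartite. A one-liner with networkx (illustrative; the paper ships its own MSO-style checkers):

```python
# Odd-cycle fault check: non-bipartite <=> contains an odd cycle.
import networkx as nx

def has_odd_cycle(G: nx.Graph) -> bool:
    return not nx.is_bipartite(G)

print(has_odd_cycle(nx.cycle_graph(4)))  # False: even cycle
print(has_odd_cycle(nx.cycle_graph(5)))  # True: a legible fault for the Opponent
```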
[832] Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts
Di Zhang, Xun Wu, Shaohan Huang, Yaru Hao, Li Dong, Zewen Chi, Zhifang Sui, Furu Wei
Main category: cs.LG
TL;DR: A novel router-aware approach for RL training of Mixture-of-Experts (MoE) models that addresses training instability through importance sampling weight optimization guided by router logits.
Details
Motivation: RL training for MoE architectures is underexplored compared to dense models, and existing methods suffer from training instability that limits performance gains.Method: Proposed a router-aware approach that optimizes importance sampling weights in off-policy RL using a rescaling strategy guided by router logits to reduce gradient variance and mitigate training divergence.
Result: Experimental results show significant improvements in both convergence stability and final performance of MoE models compared to existing methods.
Conclusion: The method demonstrates the potential of RL algorithmic innovations specifically tailored to MoE architectures, providing a promising direction for efficient training of large-scale expert models.
Abstract: Recent advances in reinforcement learning (RL) have substantially improved the training of large-scale language models, leading to significant gains in generation quality and reasoning ability. However, most existing research focuses on dense models, while RL training for Mixture-of-Experts (MoE) architectures remains underexplored. To address the instability commonly observed in MoE training, we propose a novel router-aware approach to optimize importance sampling (IS) weights in off-policy RL. Specifically, we design a rescaling strategy guided by router logits, which effectively reduces gradient variance and mitigates training divergence. Experimental results demonstrate that our method significantly improves both the convergence stability and the final performance of MoE models, highlighting the potential of RL algorithmic innovations tailored to MoE architectures and providing a promising direction for efficient training of large-scale expert models.
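A loose sketch of one way the rescaling could look: damping per-token importance-sampling weights by the agreement between current and rollout router distributions. Both the agreement measure and the multiplicative rule below are assumptions; the abstract does not specify them.

```python
# Assumed illustration only: router-drift-damped IS weights.
import torch
import torch.nn.functional as F

def router_aware_is(logp_new, logp_old, router_logits_new, router_logits_old):
    w = torch.exp(logp_new - logp_old)                   # vanilla per-token IS weight
    p_new = F.softmax(router_logits_new, dim=-1)
    p_old = F.softmax(router_logits_old, dim=-1)
    agreement = (p_new * p_old).sum(-1) / (
        p_new.norm(dim=-1) * p_old.norm(dim=-1) + 1e-8)  # cosine of routing dists
    return w * agreement                                 # damp weights under router drift
```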
[833] Clustering by Denoising: Latent plug-and-play diffusion for single-cell data
Dominik Meier, Shixing Yu, Sagnik Nandy, Promit Ghosal, Kyra Gan
Main category: cs.LG
TL;DR: A novel plug-and-play diffusion framework for scRNA-seq data that separates observation and denoising spaces using Gibbs sampling, enabling improved clustering accuracy through input-space steering.
Details
Motivation: Standard latent spaces in scRNA-seq analysis often project different cell types close together due to measurement noise and biological variability, making accurate clustering challenging.Method: Introduces a latent plug-and-play diffusion framework with Gibbs sampling that applies learned diffusion prior in low-dimensional latent space for denoising while reintroducing noise into original high-dimensional observation space for steering.
Result: Improves clustering accuracy on synthetic data across varied noise levels and dataset shifts. On real single-cell data, shows improved biological coherence with cluster boundaries better aligned with known cell type markers and developmental trajectories.
Conclusion: The framework provides adaptive noise handling, uncertainty quantification, and generalizable denoising, enabling robust single-cell analysis that maintains fidelity to original data structure.
Abstract: Single-cell RNA sequencing (scRNA-seq) enables the study of cellular heterogeneity. Yet, clustering accuracy, and with it downstream analyses based on cell labels, remain challenging due to measurement noise and biological variability. In standard latent spaces (e.g., obtained through PCA), data from different cell types can be projected close together, making accurate clustering difficult. We introduce a latent plug-and-play diffusion framework that separates the observation and denoising space. This separation is operationalized through a novel Gibbs sampling procedure: the learned diffusion prior is applied in a low-dimensional latent space to perform denoising, while to steer this process, noise is reintroduced into the original high-dimensional observation space. This unique “input-space steering” ensures the denoising trajectory remains faithful to the original data structure. Our approach offers three key advantages: (1) adaptive noise handling via a tunable balance between prior and observed data; (2) uncertainty quantification through principled uncertainty estimates for downstream analysis; and (3) generalizable denoising by leveraging clean reference data to denoise noisier datasets, and via averaging, improve quality beyond the training set. We evaluate robustness on both synthetic and real single-cell genomics data. Our method improves clustering accuracy on synthetic data across varied noise levels and dataset shifts. On real-world single-cell data, our method demonstrates improved biological coherence in the resulting cell clusters, with cluster boundaries that better align with known cell type markers and developmental trajectories.
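A schematic sketch of the alternation described in the abstract: denoise in a low-dimensional latent space under a learned diffusion prior, then steer by reintroducing noise in the original observation space. `encoder`, `decoder`, `diffusion_denoise`, and the mixing rule are hypothetical stand-ins, not the paper's procedure.

```python
# Gibbs-style alternation between latent denoising and input-space steering.
import numpy as np

def gibbs_denoise(x_obs, encoder, decoder, diffusion_denoise,
                  n_iters=50, sigma_obs=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    z = encoder(x_obs)                          # start from the noisy latent
    for _ in range(n_iters):
        z = diffusion_denoise(z)                # latent step under the prior
        x_hat = decoder(z)
        # observation-space steering: resample consistent with the raw data
        x_steer = x_obs + sigma_obs * rng.normal(size=x_obs.shape)
        z = encoder(0.5 * (x_hat + x_steer))    # fold the steered signal back in
    return decoder(z)
```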
[834] Self-induced stochastic resonance: A physics-informed machine learning approach
Divyesh Savaliya, Marius E. Yamakou
Main category: cs.LG
TL;DR: A physics-informed machine learning framework using PINN with noise-augmented architecture to model self-induced stochastic resonance in FitzHugh-Nagumo neurons, achieving accurate prediction of noise-induced coherence with improved efficiency over traditional methods.
Details
Motivation: To develop a data-efficient and interpretable surrogate model for simulating self-induced stochastic resonance (SISR) in excitable systems, which emerges from noise without external periodic forcing or bifurcation proximity.Method: Physics-Informed Neural Network (PINN) with noise-augmented state predictor architecture, embedding stochastic differential equations and SISR-asymptotic constraints into composite loss function with data fidelity, dynamical residuals, and barrier-based physical constraints from Kramers’ escape theory.
Result: The trained PINN accurately predicts spike-train coherence dependence on noise intensity, excitability, and timescale separation, matching direct stochastic simulations with improved accuracy, generalization, and significantly less computation compared to purely data-driven methods.
Conclusion: The framework provides an effective data-efficient and interpretable surrogate model for analyzing noise-induced coherence in multiscale stochastic systems, demonstrating the power of physics-informed machine learning for complex stochastic phenomena.
Abstract: Self-induced stochastic resonance (SISR) is the emergence of coherent oscillations in slow-fast excitable systems driven solely by noise, without external periodic forcing or proximity to a bifurcation. This work presents a physics-informed machine learning framework for modeling and predicting SISR in the stochastic FitzHugh-Nagumo neuron. We embed the governing stochastic differential equations and SISR-asymptotic timescale-matching constraints directly into a Physics-Informed Neural Network (PINN) based on a Noise-Augmented State Predictor architecture. The composite loss integrates data fidelity, dynamical residuals, and barrier-based physical constraints derived from Kramers’ escape theory. The trained PINN accurately predicts the dependence of spike-train coherence on noise intensity, excitability, and timescale separation, matching results from direct stochastic simulations with substantial improvements in accuracy and generalization compared with purely data-driven methods, while requiring significantly less computation. The framework provides a data-efficient and interpretable surrogate model for simulating and analyzing noise-induced coherence in multiscale stochastic systems.
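A schematic sketch of the composite loss structure named in the abstract (data fidelity + dynamical residual + barrier term). The drift and barrier expressions below are placeholders standing in for the stochastic FitzHugh-Nagumo residual and the Kramers-derived constraint, not the paper's exact terms.

```python
# Composite PINN loss skeleton; `model` is a noise-augmented state predictor.
import torch

def composite_loss(model, t, x_obs, noise, lam=(1.0, 1.0, 0.1)):
    t = t.requires_grad_(True)
    x = model(torch.cat([t, noise], dim=-1))            # noise-augmented prediction
    data_term = torch.mean((x - x_obs) ** 2)            # data fidelity

    dxdt = torch.autograd.grad(x, t, torch.ones_like(x), create_graph=True)[0]
    drift = x - x ** 3 / 3.0                            # schematic fast-variable drift
    residual_term = torch.mean((dxdt - drift) ** 2)     # dynamical residual

    barrier_term = torch.mean(torch.relu(1.0 - x ** 2)) # schematic barrier constraint
    return lam[0] * data_term + lam[1] * residual_term + lam[2] * barrier_term
```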
[835] Rethinking GSPO: The Perplexity-Entropy Equivalence
Chi Liu
Main category: cs.LG
TL;DR: GSPO’s sequence-level weights can be interpreted as perplexity ratios and exponential cross-entropy changes, providing an information-theoretic understanding of the algorithm’s behavior.
Details
Motivation: To establish a connection between GSPO's length-normalized importance ratios and information-theoretic quantities for better understanding of the algorithm's empirical properties.Method: Mathematical analysis showing equivalences between GSPO’s sequence-level weights, perplexity ratios, and cross-entropy changes, validated through controlled experiments on mathematical reasoning tasks.
Result: GSPO’s weight s(θ) = (π_θ/π_θ_old)^(1/|y|) equals PPL_θ_old/PPL_θ and exp(ΔH), explaining variance reduction and training stability properties.
Conclusion: The information-theoretic perspective provides useful insights into GSPO’s behavior, explaining its empirical success through perplexity-based weighting of policy gradient updates.
Abstract: We provide a new perspective on GSPO’s length-normalized importance ratios by establishing their connection to information-theoretic quantities. We show that GSPO’s sequence-level weight $s(\theta) = (\pi_\theta/\pi_{\theta_{\text{old}}})^{1/|y|}$ can be equivalently expressed as the inverse perplexity ratio $\text{PPL}_{\theta_{\text{old}}}/\text{PPL}_\theta$ and as the exponential cross-entropy change $\exp(\Delta H)$. While the perplexity-entropy relationship follows from standard definitions, this observation provides a useful lens for understanding GSPO: the algorithm weights policy gradient updates by perplexity ratios, offering an information-theoretic interpretation of the importance weights. This perspective helps explain GSPO’s empirical properties, including log-domain variance reduction through geometric averaging and stability in training mixture-of-experts models. We validate the mathematical equivalences and variance predictions through controlled experiments on mathematical reasoning tasks.
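The stated equivalences are easy to verify numerically: with per-token log-probabilities, the length-normalized ratio equals the inverse perplexity ratio and the exponential of the cross-entropy drop (sign convention for ΔH assumed here as old minus new).

```python
# Numerical check of the equivalences for a length-|y| sequence.
import numpy as np

rng = np.random.default_rng(0)
logp_new = rng.uniform(-3, -0.1, size=12)   # per-token log pi_theta
logp_old = rng.uniform(-3, -0.1, size=12)   # per-token log pi_theta_old
n = len(logp_new)

s = np.exp((logp_new.sum() - logp_old.sum()) / n)   # GSPO sequence-level weight
ppl = lambda lp: np.exp(-lp.mean())                 # sequence perplexity
dH = (-logp_old.mean()) - (-logp_new.mean())        # cross-entropy drop, old -> new

assert np.isclose(s, ppl(logp_old) / ppl(logp_new)) # inverse perplexity ratio
assert np.isclose(s, np.exp(dH))                    # exponential cross-entropy change
print(s)
```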
[836] Encoder-Decoder Diffusion Language Models for Efficient Training and Inference
Marianne Arriola, Yair Schiff, Hao Phung, Aaron Gokaslan, Volodymyr Kuleshov
Main category: cs.LG
TL;DR: E2D2 proposes an encoder-decoder architecture for discrete diffusion models that separates clean token representation from denoising, enabling faster inference and training compared to decoder-only approaches.
Details
Motivation: Prior discrete diffusion models use decoder-only architectures that require full network invocation at every denoising step, leading to high computational costs during inference.Method: Uses encoder-decoder architecture with encoder for clean token representation and lightweight decoder for iterative denoising. Includes specialized training and sampling algorithms for efficient block diffusion.
Result: Achieves superior trade-offs between generation quality and inference throughput on summarization, translation, and mathematical reasoning tasks.
Conclusion: E2D2 framework provides faster discrete diffusion inference while maintaining quality, with code and models available online.
Abstract: Discrete diffusion models enable parallel token sampling for faster inference than autoregressive approaches. However, prior diffusion models use a decoder-only architecture, which requires sampling algorithms that invoke the full network at every denoising step and incur high computational cost. Our key insight is that discrete diffusion models perform two types of computation: 1) representing clean tokens and 2) denoising corrupted tokens, which enables us to use separate modules for each task. We propose an encoder-decoder architecture to accelerate discrete diffusion inference, which relies on an encoder to represent clean tokens and a lightweight decoder to iteratively refine a noised sequence. We also show that this architecture enables faster training of block diffusion models, which partition sequences into blocks for better quality and are commonly used in diffusion language model inference. We introduce a framework for Efficient Encoder-Decoder Diffusion (E2D2), consisting of an architecture with specialized training and sampling algorithms, and we show that E2D2 achieves superior trade-offs between generation quality and inference throughput on summarization, translation, and mathematical reasoning tasks. We provide the code, model weights, and blog post on the project page: https://m-arriola.com/e2d2
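A schematic PyTorch sketch of the encoder-decoder split described: a full-size encoder embeds the clean tokens once, and only a lightweight decoder runs at every denoising step. All module sizes below are illustrative assumptions, not E2D2's configuration.

```python
# Heavy encoder (run once per block) + light decoder (run per denoising step).
import torch
import torch.nn as nn

class E2D2Sketch(nn.Module):
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=8)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, clean_tokens, noised_tokens):
        memory = self.encoder(self.emb(clean_tokens))       # heavy pass, run once
        h = self.decoder(self.emb(noised_tokens), memory)   # light pass, per step
        return self.head(h)                                 # logits for denoising
```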
[837] PTPP-Aware Adaptation Scaling Laws: Predicting Domain-Adaptation Performance at Unseen Pre-Training Budgets
Etienne Goffinet, Shane Bergsma, Avraham Sheinin, Natalia Vassilieva, Shaheer Muhammad, Preslav Nakov, Gurpreet Gosal
Main category: cs.LG
TL;DR: PTPP-aware scaling laws enable accurate prediction of adaptation loss at unseen tokens-per-parameter ratios, outperforming PTPP-agnostic methods and enabling practical budget planning under compute constraints.
Details
Motivation: Existing continual pre-training scaling laws assume fixed pre-training budgets, limiting their ability to forecast adaptation outcomes for models trained at different tokens-per-parameter (PTPP) ratios.Method: Develop PTPP-aware adaptation scaling laws that make pre-training budget an explicit variable, enabling prediction of adaptation loss at unseen PTPP ratios.
Result: PTPP-aware formulations trained on early stages (PTPP={15,31}) accurately predict target loss at PTPP=279 and outperform PTPP-agnostic transfer baseline on metrics including Huber-on-log, MAE_rel, and calibration slope.
Conclusion: PTPP-aware scaling laws provide accurate forecasting of adaptation outcomes and enable practical planning of replay ratios and adaptation token budgets under computational constraints.
Abstract: Continual pre-training (CPT) for domain adaptation must balance target-domain gains with stability on the base domain. Existing CPT scaling laws typically assume a fixed pre-training budget, which limits their ability to forecast adaptation outcomes for models trained at different tokens-per-parameter (PTPP). We present PTPP-aware adaptation scaling laws that make the pre-training budget an explicit variable, enabling accurate prediction of adaptation loss at unseen PTPP. On a multilingual setup (English/Arabic $\rightarrow$ French), PTPP-aware formulations trained on early stages (PTPP = {15, 31}) predict target loss at PTPP = 279 and outperform a PTPP-agnostic D-CPT transfer baseline on metrics (Huber-on-log, MAE$_\mathrm{rel}$, calibration slope); full diagnostics (RMSE, MAPE) are in the appendix. Beyond forecasting, we show a practical use case: planning replay ratios and adaptation token budgets that satisfy target and forgetting constraints under compute limits.
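A hedged sketch of what "making the budget an explicit variable" can look like in practice: fit a loss law with PTPP as an input on early checkpoints and extrapolate to the unseen budget. The power-law-plus-offset form and all numbers below are illustrative assumptions, not the paper's fitted law.

```python
# Illustrative PTPP-aware fit and extrapolation (synthetic numbers).
import numpy as np
from scipy.optimize import curve_fit

def law(ptpp, E, A, alpha):
    return E + A * ptpp ** (-alpha)         # assumed functional form

ptpp = np.array([10.0, 15.0, 22.0, 31.0])   # early-stage budgets (illustrative grid)
loss = np.array([2.45, 2.31, 2.22, 2.18])   # illustrative adaptation losses
(E, A, alpha), _ = curve_fit(law, ptpp, loss, p0=(2.0, 1.0, 0.5), maxfev=10000)
print(law(279.0, E, A, alpha))              # extrapolated loss at the unseen budget
```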
[838] A Review of End-to-End Precipitation Prediction Using Remote Sensing Data: from Divination to Machine Learning
Yugong Zeng, Jonathan Wu
Main category: cs.LG
TL;DR: This review paper traces the historical evolution of precipitation forecasting from ancient divination methods to modern AI-based approaches, covering traditional practices, numerical weather prediction, and recent machine learning advances.
Details
Motivation: To provide a comprehensive survey of end-to-end precipitation prediction technologies across different eras and paradigms, documenting the transformation from symbolic methods to modern AI-driven forecasting.Method: The paper conducts a historical review and technological survey, examining traditional indigenous methods, physical modeling, statistical frameworks, and recent neural network approaches including automated deep learning and hybrid physical-data models.
Result: The review synthesizes research across multiple eras to depict the complete history of precipitation prediction and identify key technological transitions from divination to physics-based modeling to AI systems.
Conclusion: The paper outlines future directions for next-generation forecasting systems by synthesizing insights from the historical evolution of precipitation prediction technologies.
Abstract: Precipitation prediction has undergone a profound transformation – from early symbolic and empirical methods rooted in divination and observation, to modern technologies based on atmospheric physics and artificial intelligence. This review traces the historical and technological evolution of precipitation forecasting, presenting a survey of end-to-end precipitation prediction technologies that spans ancient practices, the foundations of meteorological science, the rise of numerical weather prediction (NWP), and the emergence of machine learning (ML) and deep learning (DL) models. We first explore traditional and indigenous forecasting methods, then describe the development of physical modeling and statistical frameworks that underpin contemporary operational forecasting. Particular emphasis is placed on recent advances in neural network-based approaches, including automated deep learning, interpretability-driven design, and hybrid physical-data models. By synthesizing research across multiple eras and paradigms, this review not only depicts the history of end-to-end precipitation prediction but also outlines future directions for next-generation forecasting systems.
[839] A U-Net and Transformer Pipeline for Multilingual Image Translation
Siddharth Sahay, Radhika Agarwal
Main category: cs.LG
TL;DR: An end-to-end multilingual translation pipeline combining custom U-Net for text detection, Tesseract for OCR, and a from-scratch Transformer for machine translation across 5 languages.
Details
Motivation: To create a fully customizable and adaptable system for translating text directly from images, avoiding reliance on monolithic pre-trained models.Method: Three-stage pipeline: U-Net for text region detection, Tesseract for text recognition, and custom Seq2Seq Transformer for multilingual translation trained on parallel corpus.
Result: Promising results in text detection accuracy, text recognition quality, and translation performance measured by BLEU scores.
Conclusion: Validates the viability of custom-built systems for translating text directly from images with full customization capabilities.
Abstract: This paper presents an end-to-end multilingual translation pipeline that integrates a custom U-Net for text detection, the Tesseract engine for text recognition, and a from-scratch sequence-to-sequence (Seq2Seq) Transformer for Neural Machine Translation (NMT). Our approach first utilizes a U-Net model, trained on a synthetic dataset, to accurately segment and detect text regions from an image. These detected regions are then processed by Tesseract to extract the source text. This extracted text is fed into a custom Transformer model trained from scratch on a multilingual parallel corpus spanning 5 languages. Unlike systems reliant on monolithic pre-trained models, our architecture emphasizes full customization and adaptability. The system is evaluated on its text detection accuracy, text recognition quality, and translation performance via BLEU scores. The complete pipeline demonstrates promising results, validating the viability of a custom-built system for translating text directly from images.
[840] Guardian: Decoupling Exploration from Safety in Reinforcement Learning
Kaitong Cai, Jusheng Zhang, Jing Yang, Keze Wang
Main category: cs.LG
TL;DR: RLPD-GX is a hybrid offline-online RL framework that decouples policy optimization from safety enforcement using a reward-seeking learner and projection-based guardian, achieving state-of-the-art performance on Atari-100k with improved stability and safety.
Details
Motivation: Hybrid offline-online RL suffers from instability due to distribution shift between offline and online data, limiting its ability to balance sample efficiency with robust exploration.Method: Decouples policy optimization from safety enforcement: a reward-seeking learner explores freely while a projection-based guardian ensures rule-consistent execution and safe value backups. Uses dynamic curricula for temporal horizons and offline-online data mixing.
Result: Achieves normalized mean score of 3.02 on Atari-100k (+45% over prior hybrid methods) with stronger safety and stability. Consistent gains across safety-critical and long-horizon tasks.
Conclusion: Decoupled safety enforcement provides a simple yet principled route to robust O2O RL, suggesting a broader paradigm for reconciling exploration and safety in reinforcement learning.
Abstract: Hybrid offline–online reinforcement learning (O2O RL) promises both sample efficiency and robust exploration, but suffers from instability due to distribution shift between offline and online data. We introduce RLPD-GX, a framework that decouples policy optimization from safety enforcement: a reward-seeking learner explores freely, while a projection-based guardian guarantees rule-consistent execution and safe value backups. This design preserves the exploratory value of online interactions without collapsing to conservative policies. To further stabilize training, we propose dynamic curricula that gradually extend temporal horizons and anneal offline–online data mixing. We prove convergence via a contraction property of the guarded Bellman operator, and empirically show state-of-the-art performance on Atari-100k, achieving a normalized mean score of 3.02 (+45% over prior hybrid methods) with stronger safety and stability. Beyond Atari, ablations demonstrate consistent gains across safety-critical and long-horizon tasks, underscoring the generality of our design. Extensive results highlight decoupled safety enforcement as a simple yet principled route to robust O2O RL, suggesting a broader paradigm for reconciling exploration and safety in reinforcement learning.
[841] Variational Masked Diffusion Models
Yichi Zhang, Alex Schwing, Zhizhen Zhao
Main category: cs.LG
TL;DR: VMD introduces latent variables into masked diffusion to model token dependencies, improving generation quality and global consistency compared to standard masked diffusion.
Details
Motivation: Standard masked diffusion fails to capture dependencies among concurrently predicted tokens, leading to degraded generation quality when token dependencies are important.Method: Proposes Variational Masked Diffusion (VMD) by introducing latent variables into the masked diffusion process to explicitly model dependencies among tokens.
Result: VMD successfully learns dependencies that conventional masked diffusion fails to capture, improves generation quality and dependency awareness across synthetic datasets, Sudoku puzzles, and text datasets.
Conclusion: Integrating variational inference into masked diffusion enhances both generation quality and dependency awareness, highlighting the value of the VMD framework.
Abstract: Masked diffusion models have recently emerged as a flexible framework for discrete generative modeling. However, a key limitation of standard masked diffusion is its inability to effectively capture dependencies among tokens that are predicted concurrently, leading to degraded generation quality when dependencies among tokens are important. To explicitly model dependencies among tokens, we propose Variational Masked Diffusion (VMD), a framework that introduces latent variables into the masked diffusion process. Through controlled experiments on synthetic datasets, we demonstrate that VMD successfully learns dependencies that conventional masked diffusion fails to capture. We further validate the effectiveness of our approach on Sudoku puzzles and text datasets, where learning of dependencies among tokens improves global consistency. Across these domains, VMD enhances both generation quality and dependency awareness, highlighting the value of integrating variational inference into masked diffusion. Our code is available at: https://riccizz.github.io/VMD.
[842] Long-Term PM2.5 Forecasting Using a DTW-Enhanced CNN-GRU Model
Amirali Ataee Naeini, Arshia Ataee Naeini, Fatemeh Karami Mohammadi, Omid Ghaffarpasand
Main category: cs.LG
TL;DR: A deep learning framework combining Dynamic Time Warping (DTW) for station similarity selection with CNN-GRU architecture enables stable 10-day PM2.5 forecasting in sparse monitoring networks, achieving R2=0.73 at 240 hours without performance degradation.
Details
Motivation: Existing deep learning approaches struggle with long-term PM2.5 forecasting stability beyond 48 hours, especially in cities with sparse monitoring networks, creating limitations for public health early-warning systems.Method: Combines DTW-based historical sampling to identify similar pollution patterns across stations with a lightweight CNN-GRU architecture augmented with meteorological features, optimized for sparse monitoring networks without requiring external simulation tools.
Result: Superior performance compared to state-of-the-art methods, achieving R2=0.91 for 24-hour forecasts and stable 10-day forecasting (R2=0.73 at 240 hours) without performance degradation, validated on multi-year hourly data from eight monitoring stations.
Conclusion: The framework provides computationally efficient, stable long-term PM2.5 forecasting suitable for resource-constrained urban environments, addressing critical early-warning system requirements without dependency on external tools.
Abstract: Reliable long-term forecasting of PM2.5 concentrations is critical for public health early-warning systems, yet existing deep learning approaches struggle to maintain prediction stability beyond 48 hours, especially in cities with sparse monitoring networks. This paper presents a deep learning framework that combines Dynamic Time Warping (DTW) for intelligent station similarity selection with a CNN-GRU architecture to enable extended-horizon PM2.5 forecasting in Isfahan, Iran, a city characterized by complex pollution dynamics and limited monitoring coverage. Unlike existing approaches that rely on computationally intensive transformer models or external simulation tools, our method integrates three key innovations: (i) DTW-based historical sampling to identify similar pollution patterns across peer stations, (ii) a lightweight CNN-GRU architecture augmented with meteorological features, and (iii) a scalable design optimized for sparse networks. Experimental validation using multi-year hourly data from eight monitoring stations demonstrates superior performance compared to state-of-the-art deep learning methods, achieving R2 = 0.91 for 24-hour forecasts. Notably, this is the first study to demonstrate stable 10-day PM2.5 forecasting (R2 = 0.73 at 240 hours) without performance degradation, addressing critical early-warning system requirements. The framework’s computational efficiency and independence from external tools make it particularly suitable for deployment in resource-constrained urban environments.
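The DTW-based peer-selection step can be sketched directly: compute DTW distances from the target station's series to candidate stations and keep the closest ones as historical samples. Plain O(nm) DTW below; the paper's windowing and feature details are not reproduced.

```python
# Classic dynamic-programming DTW plus a top-k peer-station selector.
import numpy as np

def dtw(a: np.ndarray, b: np.ndarray) -> float:
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def select_peers(target, stations, top_k=3):
    return sorted(stations, key=lambda s: dtw(target, stations[s]))[:top_k]

rng = np.random.default_rng(0)
target = rng.normal(size=48)
stations = {f"st{i}": rng.normal(size=48) for i in range(8)}
print(select_peers(target, stations))   # most DTW-similar candidate stations
```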
[843] Limits of Generative Pre-Training in Structured EMR Trajectories with Irregular Sampling
Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm
Main category: cs.LG
TL;DR: Foundation models trained on medical data can reproduce individual feature distributions but fail to preserve clinically meaningful cross-feature relationships, highlighting risks of using them for phenotype discovery without proper validation.
Details
Motivation: To test whether foundation models trained on medical data can generate clinically coherent representations, as recent studies have repurposed these models for phenotype discovery without rigorous validation.Method: Trained two autoregressive models (sequence-to-sequence LSTM and reduced Transformer) on longitudinal ART for HIV and Acute Hypotension datasets with controlled irregularity via random inter-visit gaps during training, while test sequences remained complete. Evaluated patient-trajectory synthesis for distributional and correlational fidelity.
Result: Both models reproduced individual feature distributions but failed to preserve cross-feature structure, showing that generative pre-training yields local realism but limited clinical coherence.
Conclusion: Domain-specific evaluation is crucial, and trajectory synthesis serves as a practical probe before fine-tuning or deployment of foundation models in medical applications.
Abstract: Foundation models refer to architectures trained on vast datasets using autoregressive pre-training from natural language processing to capture intricate patterns and motifs. They were originally developed to transfer such learned knowledge to downstream predictive tasks. Recently, however, some studies repurpose these learned representations for phenotype discovery without rigorous validation, risking superficially realistic but clinically incoherent embeddings. To test this mismatch, we trained two autoregressive models – a sequence-to-sequence LSTM and a reduced Transformer – on longitudinal ART for HIV and Acute Hypotension datasets. Controlled irregularity was added during training via random inter-visit gaps, while test sequences stayed complete. Patient-trajectory synthesis evaluated distributional and correlational fidelity. Both reproduced feature distributions but failed to preserve cross-feature structure – showing that generative pre-training yields local realism but limited clinical coherence. These results highlight the need for domain-specific evaluation and support trajectory synthesis as a practical probe before fine-tuning or deployment.
[844] Learning Reconfigurable Representations for Multimodal Federated Learning with Missing Data
Duong M. Nguyen, Trong Nghia Hoang, Thanh Trung Huynh, Quoc Viet Hung Nguyen, Phi Le Nguyen
Main category: cs.LG
TL;DR: A federated learning framework for multimodal data that addresses client heterogeneity and feature incompleteness through learnable client-side embedding controls that align local representations with global models.
Details
Motivation: Real-world multimodal federated learning faces challenges with incomplete and heterogeneous data across clients, leading to misaligned local feature representations that limit model aggregation effectiveness.Method: Proposes locally adaptive representations using learnable client-side embedding controls that encode each client’s data-missing patterns, serving as reconfiguration signals to align global representations with local contexts.
Result: Achieves up to 36.45% performance improvement under severe data incompleteness on multiple federated multimodal benchmarks with diverse data-missing patterns.
Conclusion: The proposed framework effectively handles multimodal federated learning with incomplete and heterogeneous data through adaptive representations and embedding controls, supported by both empirical results and theoretical analysis.
Abstract: Multimodal federated learning in real-world settings often encounters incomplete and heterogeneous data across clients. This results in misaligned local feature representations that limit the effectiveness of model aggregation. Unlike prior work that assumes either differing modality sets without missing input features or a shared modality set with missing features across clients, we consider a more general and realistic setting where each client observes a different subset of modalities and might also have missing input features within each modality. To address the resulting misalignment in learned representations, we propose a new federated learning framework featuring locally adaptive representations based on learnable client-side embedding controls that encode each client’s data-missing patterns. These embeddings serve as reconfiguration signals that align the globally aggregated representation with each client’s local context, enabling more effective use of shared information. Furthermore, the embedding controls can be algorithmically aggregated across clients with similar data-missing patterns to enhance the robustness of reconfiguration signals in adapting the global representation. Empirical results on multiple federated multimodal benchmarks with diverse data-missing patterns across clients demonstrate the efficacy of the proposed method, achieving up to 36.45% performance improvement under severe data incompleteness. The method is also supported by a theoretical analysis with an explicit performance bound that matches our empirical observations. Our source codes are provided at https://github.com/nmduonggg/PEPSY
[845] AI based signage classification for linguistic landscape studies
Yuqin Jiang, Song Jiang, Jacob Algrim, Trevor Harms, Maxwell Koenen, Xinya Lan, Xingyu Li, Chun-Han Lin, Jia Liu, Jiayang Sun, Henry Zenger
Main category: cs.LG
TL;DR: AI-powered language detection can automate Linguistic Landscape analysis with 79% accuracy, but requires human validation due to limitations like text distortion and peripheral text detection.
Details
Motivation: Traditional Linguistic Landscape research methods are time-consuming and difficult to scale for large study areas, creating a need for automated approaches.Method: Used AI for OCR and language classification on 1,449 georeferenced photos from Honolulu Chinatown, with manual validation for accuracy checking.
Result: Achieved 79% overall accuracy with five types of mislabeling identified: distortion, reflection, degraded surface, graffiti, and hallucination. AI detects peripheral texts that humans ignore.
Conclusion: AI-assisted workflows show potential for reducing time-consuming processes in LL research, but a hybrid approach combining AI automation with human validation is recommended for reliability.
Abstract: Linguistic Landscape (LL) research traditionally relies on manual photography and annotation of public signage to examine the distribution of languages in urban space. While such methods yield valuable findings, the process is time-consuming and difficult to scale for large study areas. This study explores the use of an AI-powered language detection method to automate LL analysis. Using Honolulu Chinatown as a case study, we constructed a georeferenced photo dataset of 1,449 images collected by researchers and applied AI for optical character recognition (OCR) and language classification. We also conducted manual validation for accuracy checking. The model achieved an overall accuracy of 79%. Five recurring types of mislabeling were identified: distortion, reflection, degraded surfaces, graffiti, and hallucination. The analysis also reveals that the AI model treats all regions of an image equally, detecting peripheral or background texts that human interpreters typically ignore. Despite these limitations, the results demonstrate the potential of integrating AI-assisted workflows into LL research to reduce such time-consuming processes. However, given these limitations and mislabels, we recognize that AI cannot be fully trusted during this process. This paper encourages a hybrid approach combining AI automation with human validation for a more reliable and efficient workflow.
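A hedged sketch of the OCR-plus-language-classification step; the paper does not name its OCR or language-ID models, so pytesseract and fastText's lid.176.bin are illustrative stand-ins.

```python
# Per-image pipeline: OCR the signage text, then classify its language.
import pytesseract
from PIL import Image
import fasttext  # assumes the lid.176.bin language-ID model has been downloaded

lid = fasttext.load_model("lid.176.bin")

def classify_signage(path: str):
    text = pytesseract.image_to_string(Image.open(path)).strip()
    if not text:
        return None
    labels, probs = lid.predict(text.replace("\n", " "))
    return {"text": text,
            "lang": labels[0].replace("__label__", ""),
            "conf": float(probs[0])}

# print(classify_signage("chinatown_sign_0001.jpg"))  # hypothetical file name
```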
[846] Transforming volcanic monitoring: A dataset and benchmark for onboard volcano activity detection
Darshana Priyasad, Tharindu Fernando, Maryam Haghighat, Harshala Gammulle, Clinton Fookes
Main category: cs.LG
TL;DR: This paper introduces a novel dataset for volcanic activity detection and demonstrates onboard processing feasibility using next-generation satellites.
Details
Motivation: Natural disasters like volcanic eruptions cause significant economic losses and life disruptions. Current monitoring is limited by lack of annotated datasets for developing robust detection systems.Method: Created a comprehensive volcanic activity dataset with binary annotations, benchmarked state-of-the-art models, and tested onboard deployment using Intel Movidius Myriad X VPU.
Result: Established baseline performance benchmarks and successfully demonstrated feasibility of volcanic activity detection directly onboard satellites, reducing latency for early warning systems.
Conclusion: The work enables advanced volcanic disaster management through onboard monitoring technologies, paving the way for real-time early warning systems.
Abstract: Natural disasters, such as volcanic eruptions, pose significant challenges to daily life and incur considerable global economic losses. The emergence of next-generation small-satellites, capable of constellation-based operations, offers unparalleled opportunities for near-real-time monitoring and onboard processing of such events. However, a major bottleneck remains the lack of extensive annotated datasets capturing volcanic activity, which hinders the development of robust detection systems. This paper introduces a novel dataset explicitly designed for volcanic activity and eruption detection, encompassing diverse volcanoes worldwide. The dataset provides binary annotations to identify volcanic anomalies or non-anomalies, covering phenomena such as temperature anomalies, eruptions, and volcanic ash emissions. These annotations offer a foundational resource for developing and evaluating detection models, addressing a critical gap in volcanic monitoring research. Additionally, we present comprehensive benchmarks using state-of-the-art models to establish baselines for future studies. Furthermore, we explore the potential for deploying these models onboard next-generation satellites. Using the Intel Movidius Myriad X VPU as a testbed, we demonstrate the feasibility of volcanic activity detection directly onboard. This capability significantly reduces latency and enhances response times, paving the way for advanced early warning systems. This paves the way for innovative solutions in volcanic disaster management, encouraging further exploration and refinement of onboard monitoring technologies.
[847] Charting the Design Space of Neural Graph Representations for Subgraph Matching
Vaibhav Raj, Indradyumna Roy, Ashwin Ramachandran, Soumen Chakrabarti, Abir De
Main category: cs.LG
TL;DR: This paper presents a unified design space for neural graph matching networks and conducts the first comprehensive exploration of this space, discovering that novel combinations of design choices lead to significant performance improvements in subgraph matching tasks.
Details
Motivation: Subgraph matching is crucial for various applications like knowledge graph QA and molecule design. While neural methods show promise, existing systems only occupy isolated patches in a larger design space that remains largely unexplored.
Method: Refactored existing systems into a unified design space with axes including: attention-based vs. soft permutation-based interaction, node vs. edge alignment, and different forms of final scoring networks. Conducted extensive experiments to explore various combinations in this design space.
Result: Experiments revealed that judicious and previously unexplored combinations of design choices lead to large performance benefits in subgraph matching tasks.
Conclusion: The study uncovers valuable insights and establishes general design principles for neural graph representation and interaction that have broader implications beyond subgraph matching.
Abstract: Subgraph matching is vital in knowledge graph (KG) question answering, molecule design, scene graph, code and circuit search, etc. Neural methods have shown promising results for subgraph matching. Our study of recent systems suggests refactoring them into a unified design space for graph matching networks. Existing methods occupy only a few isolated patches in this space, which remains largely uncharted. We undertake the first comprehensive exploration of this space, featuring such axes as attention-based vs. soft permutation-based interaction between query and corpus graphs, aligning nodes vs. edges, and the form of the final scoring network that integrates neural representations of the graphs. Our extensive experiments reveal that judicious and hitherto-unexplored combinations of choices in this space lead to large performance benefits. Beyond better performance, our study uncovers valuable insights and establishes general design principles for neural graph representation and interaction, which may be of wider interest.
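As one illustrative point in this design space (attention-based interaction, node-level alignment, and a fixed hinge-style score rather than a learned scoring network), a cross-graph scoring function might look like the sketch below; the names and the scoring rule are ours, not the paper's:

```python
# Toy instance of one design-space point: attention-based node interaction
# between query and corpus graphs with an order-embedding-style hinge score.
import torch
import torch.nn.functional as F

def attention_subgraph_score(q_nodes, c_nodes):
    """q_nodes: [nq, d] query-graph node embeddings; c_nodes: [nc, d]."""
    d = q_nodes.size(-1)
    attn = F.softmax(q_nodes @ c_nodes.T / d**0.5, dim=-1)  # soft alignment
    aligned = attn @ c_nodes                                # corpus match per query node
    # Subgraph intuition: every query node should be "covered" by the corpus
    # graph, so penalize only the positive part of (query - aligned match).
    return -F.relu(q_nodes - aligned).sum()

score = attention_subgraph_score(torch.randn(5, 16), torch.randn(12, 16))
```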
[848] Robust Uncertainty Quantification for Self-Evolving Large Language Models via Continual Domain Pretraining
Xiaofan Zhou, Lu Cheng
Main category: cs.LG
TL;DR: The paper introduces an adaptive rejection and non-exchangeable conformal prediction framework to address challenges in continual domain pretraining for large language models, improving statistical reliability guarantees under distribution shifts.
Details
Motivation: Continual Learning is crucial for LLMs to adapt to evolving knowledge, but there's a lack of statistical reliability guarantees under continual domain pretraining, especially when test data comes from unknown or shifting distributions where traditional conformal prediction fails.
Method: Proposes adaptive rejection and non-exchangeable CP framework that estimates test question distributions using transformer-based clustering, reweights/resamples calibration data, and allows LLMs to selectively abstain from answering when confidence shifts significantly.
Result: Extensive experiments show the framework enhances both effectiveness and reliability of conformal prediction under continual domain pretraining scenarios, addressing issues of invalid guarantees and excessively large prediction sets.
Conclusion: The introduced framework successfully improves statistical reliability for LLMs in continual learning settings, providing better correctness guarantees while maintaining informativeness through adaptive rejection mechanisms.
Abstract: Continual Learning (CL) is essential for enabling self-evolving large language models (LLMs) to adapt and remain effective amid rapid knowledge growth. Yet, despite its importance, little attention has been given to establishing statistical reliability guarantees for LLMs under CL, particularly in the setting of continual domain pretraining (CDP). Conformal Prediction (CP) has shown promise in offering correctness guarantees for LLMs, but it faces major challenges in CDP: testing data often stems from unknown or shifting domain distributions, under which CP may no longer provide valid guarantees. Moreover, when high coverage is required, CP can yield excessively large prediction sets for unanswerable queries, reducing informativeness. To address these challenges, we introduce an adaptive rejection and non-exchangeable CP framework. Our method first estimates the distribution of questions across domains in the test set using transformer-based clustering, then reweights or resamples the calibration data accordingly. Building on this, adaptive rejection CP allows the LLM to selectively abstain from answering when its confidence or competence shifts significantly. Extensive experiments demonstrate that our framework enhances both the effectiveness and reliability of CP under CDP scenarios. Our code is available at: https://anonymous.4open.science/r/CPCL-8C12/
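The weighted-calibration core of non-exchangeable conformal prediction can be made concrete with a small sketch. The transformer-based clustering that produces the weights is abstracted away; `weights` below is simply assumed to encode each calibration point's similarity to the test domain:

```python
# Sketch of the non-exchangeable CP core: a weighted quantile over
# calibration nonconformity scores.
import numpy as np

def weighted_conformal_threshold(cal_scores, weights, alpha=0.1):
    """Smallest score q with weighted mass of {s_i <= q} at least 1 - alpha."""
    order = np.argsort(cal_scores)
    scores, w = cal_scores[order], weights[order]
    cum = np.cumsum(w / w.sum())
    idx = min(np.searchsorted(cum, 1 - alpha), len(scores) - 1)
    return scores[idx]

scores = np.abs(np.random.randn(500))   # toy nonconformity scores
weights = np.random.rand(500)           # assumed similarity to the test domain
q_hat = weighted_conformal_threshold(scores, weights)
# Prediction set = answers scoring <= q_hat; abstain if the set grows too large.
```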
[849] On the Anisotropy of Score-Based Generative Models
Andreas Floros, Seyed-Mohsen Moosavi-Dezfooli, Pier Luigi Dragotti
Main category: cs.LG
TL;DR: The paper introduces Score Anisotropy Directions (SADs) to analyze how network architecture shapes inductive biases in score-based generative models, providing a way to predict generalization ability before training.
Details
Motivation: To understand how network architecture influences the inductive biases of score-based generative models and develop methods to predict their generalization capabilities prior to training.
Method: Introduce Score Anisotropy Directions (SADs) as architecture-dependent directions that reveal preferential data structure capture. Analyze SADs as adaptive bases aligned with the architecture's output geometry.
Result: SADs reliably capture fine-grained model behavior and correlate with downstream performance (measured by Wasserstein metrics) across synthetic data and standard image benchmarks.
Conclusion: SADs provide a new framework for explaining and predicting directional biases in generative models, offering insights into architecture-dependent inductive biases.
Abstract: We investigate the role of network architecture in shaping the inductive biases of modern score-based generative models. To this end, we introduce the Score Anisotropy Directions (SADs), architecture-dependent directions that reveal how different networks preferentially capture data structure. Our analysis shows that SADs form adaptive bases aligned with the architecture’s output geometry, providing a principled way to predict generalization ability in score models prior to training. Through both synthetic data and standard image benchmarks, we demonstrate that SADs reliably capture fine-grained model behavior and correlate with downstream performance, as measured by Wasserstein metrics. Our work offers a new lens for explaining and predicting directional biases of generative models.
[850] Towards Personalized Treatment Plan: Geometrical Model-Agnostic Approach to Counterfactual Explanations
Daniel Sin, Milad Toutounchian
Main category: cs.LG
TL;DR: SSBA method generates counterfactual explanations using segmented sampling and binary search to find closest feasible points on decision boundaries, outperforming existing methods by 5-50% in L2 distance while handling real-world constraints.
Details
Motivation: Need for effective counterfactual explanation methods in high-dimensional spaces that can handle real-world constraints like immutable features (age, gender, etc.) and provide realistic explanations.
Method: Four-step approach: fit dataset to model, find decision boundary, determine constraints, compute closest counterfactual point. Uses segmented sampling with binary search to find discrete boundary points and identify closest feasible explanation.
Result: Outperforms current methods with 5-50% reduction in L2 distance across four datasets. Handles constraints on immutable/categorical features. Runtime significantly faster than grid-based approaches.
Conclusion: SSBA provides simple, effective model-agnostic method for computing nearest feasible counterfactual explanations with constraints, offering practical improvements over existing approaches.
Abstract: In our article, we describe a method for generating counterfactual explanations in high-dimensional spaces using four steps that involve fitting our dataset to a model, finding the decision boundary, determining constraints on the problem, and computing the closest point (counterfactual explanation) from that boundary. We propose a discretized approach where we find many discrete points on the boundary and then identify the closest feasible counterfactual explanation. This method, which we later call $\textit{Segmented Sampling for Boundary Approximation}$ (SSBA), applies binary search to find decision boundary points and then searches for the closest boundary point. Across four datasets of varying dimensionality, we show that our method can outperform current methods for counterfactual generation with reductions in distance between $5\%$ to $50\%$ in terms of the $L_2$ norm. Our method can also handle real-world constraints by restricting changes to immutable and categorical features, such as age, gender, sex, height, and other related characteristics, as is the case for a health-based dataset. In terms of runtime, the SSBA algorithm generates orders of magnitude more decision boundary points in the same given time compared to a grid-based approach. In general, our method provides a simple and effective model-agnostic method that can compute nearest feasible (i.e. realistic with constraints) counterfactual explanations. All of our results and our code can be found here at this link: $\href{https://github.com/dsin85691/SSBA_For_Counterfactuals}{https://github.com/dsin85691/SSBA_For_Counterfactuals}$
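A minimal sketch of the bisection step and the constrained closest-point selection, assuming a scikit-learn-style `model.predict` and numeric features; the segmented sampling of search directions is reduced here to iterating over opposite-class samples:

```python
# Sketch of the SSBA idea: bisect between a query point and opposite-class
# samples to get discrete boundary points, then keep the closest one that
# leaves immutable features unchanged. Function names are ours.
import numpy as np

def boundary_point(model, x, x_far, tol=1e-4):
    """Binary search on the segment [x, x_far] for a decision-boundary point."""
    lo, hi = 0.0, 1.0
    y0 = model.predict(x.reshape(1, -1))[0]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if model.predict((x + mid * (x_far - x)).reshape(1, -1))[0] == y0:
            lo = mid
        else:
            hi = mid
    return x + hi * (x_far - x)

def closest_counterfactual(model, x, opposite_samples, immutable_idx=()):
    y0 = model.predict(x.reshape(1, -1))[0]
    best = None
    for x_far in opposite_samples:
        x_far = x_far.copy()
        x_far[list(immutable_idx)] = x[list(immutable_idx)]  # respect constraints
        if model.predict(x_far.reshape(1, -1))[0] == y0:
            continue                       # projection destroyed the class flip
        c = boundary_point(model, x, x_far)
        if best is None or np.linalg.norm(c - x) < np.linalg.norm(best - x):
            best = c
    return best
```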
[851] Simple Denoising Diffusion Language Models
Huaisheng Zhu, Zhengyu Chen, Shijie Zhou, Zhihui Xie, Yige Yuan, Zhimeng Guo, Siyuan Xu, Hangfan Zhang, Vasant Honavar, Teng Xiao
Main category: cs.LG
TL;DR: Proposes a simplified denoising loss for Uniform-state Diffusion Models (USDMs) that optimizes only noise-replaced tokens, and introduces contrastive-inspired negative gradients for improved generation quality.
Details
Motivation: Existing Masked Diffusion Language Models (MDLMs) degrade in few-step generation and cannot use standard diffusion distillation methods, while USDMs have complex loss formulations that hinder scalability.
Method: Simplified denoising-based loss for USDMs that focuses only on noise-replaced tokens, plus a contrastive-inspired modification with negative gradients.
Result: Stabilized training, matched ELBO-level performance, and achieved improved generation quality with the contrastive modification.
Conclusion: The proposed simplified loss and contrastive-inspired approach effectively improve USDM training and generation quality while maintaining competitive performance.
Abstract: Diffusion models have recently been extended to language generation through Masked Diffusion Language Models (MDLMs), which achieve performance competitive with strong autoregressive models. However, MDLMs tend to degrade in the few-step regime and cannot directly adopt existing few-step distillation methods designed for continuous diffusion models, as they lack the intrinsic property of mapping from noise to data. Recent Uniform-state Diffusion Models (USDMs), initialized from a uniform prior, alleviate some limitations but still suffer from complex loss formulations that hinder scalability. In this work, we propose a simplified denoising-based loss for USDMs that optimizes only noise-replaced tokens, stabilizing training and matching ELBO-level performance. Furthermore, by framing denoising as self-supervised learning, we introduce a simple modification to our denoising loss with contrastive-inspired negative gradients, which is practical and yields additional improvements in generation quality.
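A sketch of the simplified loss, assuming a denoiser that takes the corrupted sequence and noise level and returns per-token logits; the corruption schedule and denoiser signature are illustrative assumptions:

```python
# Sketch of the simplified USDM loss: corrupt tokens by uniform replacement,
# then apply cross-entropy only at the noise-replaced positions.
import torch
import torch.nn.functional as F

def simplified_usdm_loss(denoiser, tokens, vocab_size):
    """tokens: [B, L] clean token ids; denoiser returns [B, L, vocab] logits."""
    t = torch.rand(tokens.size(0), 1)                      # per-sequence noise level
    replace = torch.rand_like(tokens, dtype=torch.float) < t
    noise = torch.randint_like(tokens, vocab_size)         # uniform-prior replacement
    noisy = torch.where(replace, noise, tokens)
    logits = denoiser(noisy, t)                            # assumed interface
    loss = F.cross_entropy(logits.view(-1, vocab_size), tokens.view(-1),
                           reduction="none")
    mask = replace.view(-1).float()                        # noise-replaced tokens only
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```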
[852] Manifold Approximation leads to Robust Kernel Alignment
Mohammad Tariqul Islam, Du Liu, Deblina Sarkar
Main category: cs.LG
TL;DR: Proposes Manifold approximated Kernel Alignment (MKA) as a more robust alternative to Centered Kernel Alignment (CKA) by incorporating manifold geometry into representation comparison.
Details
Motivation: CKA has limitations as it doesn't account for underlying manifold structure and relies on heuristics that cause inconsistent behavior at different data scales.
Method: Developed theoretical framework for MKA that incorporates manifold geometry into kernel alignment, with empirical evaluations on synthetic datasets and real-world examples.
Result: MKA provides more robust foundation for measuring representations compared to CKA, with better characterization of representation similarity.
Conclusion: Manifold-aware kernel alignment offers improved robustness for representation measurement, with potential applications in representation learning.
Abstract: Centered kernel alignment (CKA) is a popular metric for comparing representations, determining equivalence of networks, and neuroscience research. However, CKA does not account for the underlying manifold and relies on numerous heuristics that cause it to behave differently at different scales of data. In this work, we propose Manifold approximated Kernel Alignment (MKA), which incorporates manifold geometry into the alignment task. We derive a theoretical framework for MKA. We perform empirical evaluations on synthetic datasets and real-world examples to characterize and compare MKA to its contemporaries. Our findings suggest that manifold-aware kernel alignment provides a more robust foundation for measuring representations, with potential applications in representation learning.
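For reference, the linear CKA baseline that MKA is measured against is only a few lines; the summary does not specify MKA's manifold-approximation step in enough detail to reproduce, so only the baseline is shown:

```python
# Linear CKA between two representations of the same n inputs.
import numpy as np

def linear_cka(X, Y):
    """X: [n, d1], Y: [n, d2] feature matrices; returns similarity in [0, 1]."""
    X = X - X.mean(0)                                  # center features
    Y = Y - Y.mean(0)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
```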
[853] Diffuse to Detect: A Generalizable Framework for Anomaly Detection with Diffusion Models Applications to UAVs and Beyond
Mingze Gong, Juan Du, Jianbang You
Main category: cs.LG
TL;DR: DTD is a novel anomaly detection framework that adapts diffusion models for rapid anomaly detection using single-step noise prediction, combined with Graph Neural Networks to capture spatial and temporal dependencies in complex data.
Details
Motivation: Existing anomaly detection methods struggle with complex high-dimensional data like UAV sensor readings due to limited sensitivity, scalability, and inability to capture intricate dependencies between sensors and temporal patterns.
Method: DTD uses a single-step diffusion process to predict noise patterns instead of full reconstruction, integrated with Graph Neural Networks to model sensor relationships as dynamic graphs. It features a two-branch architecture with parametric neural network-based energy scoring and nonparametric statistical methods.
Result: Extensive evaluations on UAV sensor data, multivariate time series, and images show DTD’s superior performance over existing methods, demonstrating its generality across diverse data modalities.
Conclusion: DTD provides a transformative solution for safety-critical applications with its versatility, adaptability, and ability to balance computational efficiency with interpretability through its dual-branch architecture.
Abstract: Anomaly detection in complex, high-dimensional data, such as UAV sensor readings, is essential for operational safety but challenging for existing methods due to their limited sensitivity, scalability, and inability to capture intricate dependencies. We propose the Diffuse to Detect (DTD) framework, a novel approach that innovatively adapts diffusion models for anomaly detection, diverging from their conventional use in generative tasks with high inference time. By comparison, DTD employs a single-step diffusion process to predict noise patterns, enabling rapid and precise identification of anomalies without reconstruction errors. This approach is grounded in robust theoretical foundations that link noise prediction to the data distribution’s score function, ensuring reliable deviation detection. By integrating Graph Neural Networks to model sensor relationships as dynamic graphs, DTD effectively captures spatial (inter-sensor) and temporal anomalies. Its two-branch architecture, with parametric neural network-based energy scoring for scalability and nonparametric statistical methods for interpretability, provides flexible trade-offs between computational efficiency and transparency. Extensive evaluations on UAV sensor data, multivariate time series, and images demonstrate DTD’s superior performance over existing methods, underscoring its generality across diverse data modalities. This versatility, combined with its adaptability, positions DTD as a transformative solution for safety-critical applications, including industrial monitoring and beyond.
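The single-step scoring rule can be sketched directly from the description: diffuse the input once, predict the noise, and score by the prediction error. The noise level and the predictor's interface are assumptions:

```python
# Sketch of DTD's single-step scoring: one forward-diffusion step, one
# denoiser call, and the noise-prediction error as the anomaly score.
import torch

def dtd_anomaly_score(noise_predictor, x, alpha_bar=0.9):
    """x: [B, ...] sensor window; higher score = more anomalous."""
    eps = torch.randn_like(x)
    x_t = alpha_bar**0.5 * x + (1 - alpha_bar)**0.5 * eps  # single diffusion step
    eps_hat = noise_predictor(x_t)                         # e.g., the GNN branch
    return ((eps_hat - eps) ** 2).flatten(1).mean(dim=1)   # per-sample score
```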
[854] RL-AUX: Reinforcement Learning for Auxiliary Task Generation
Judah Goldfeder, Matthew So, Hod Lipson
Main category: cs.LG
TL;DR: This paper presents an RL-based approach to dynamically create auxiliary tasks without needing bi-level optimization, achieving better performance than human-labeled auxiliary tasks and matching bi-level optimization methods on CIFAR100.
Details
Motivation: Auxiliary Learning requires labeled auxiliary tasks which need human effort and domain expertise. Existing meta-learning solutions use bi-level optimization, which is computationally expensive and complex.
Method: An RL agent selects auxiliary labels for each data point and learns optimal strategies for weighing auxiliary loss per data point. The agent is rewarded when selections improve primary task performance.
Result: On 20-Superclass CIFAR100, RL approach outperformed human-labeled auxiliary tasks and performed as well as bi-level optimization. Weight-aware RL helped VGG16 achieve 80.9% test accuracy vs 75.53% with human-labeled tasks.
Conclusion: RL is a viable approach for dynamic auxiliary task generation, and learning per-sample auxiliary task weights alongside labels can achieve strong results without bi-level optimization.
Abstract: Auxiliary Learning (AL) is a special case of Multi-task Learning (MTL) in which a network trains on auxiliary tasks to improve performance on its main task. This technique is used to improve generalization and, ultimately, performance on the network’s main task. AL has been demonstrated to improve performance across multiple domains, including navigation, image classification, and natural language processing. One weakness of AL is the need for labeled auxiliary tasks, which can require human effort and domain expertise to generate. Meta Learning techniques have been used to solve this issue by learning an additional auxiliary task generation network that can create helpful tasks for the primary network. The most prominent techniques rely on Bi-Level Optimization, which incurs computational cost and increased code complexity. To avoid the need for Bi-Level Optimization, we present an RL-based approach to dynamically create auxiliary tasks. In this framework, an RL agent is tasked with selecting auxiliary labels for every data point in a training set. The agent is rewarded when its selection improves the performance on the primary task. We also experiment with learning optimal strategies for weighing the auxiliary loss per data point. On the 20-Superclass CIFAR100 problem, our RL approach outperforms human-labeled auxiliary tasks and performs as well as a prominent Bi-Level Optimization technique. Our weight learning approaches significantly outperform all of these benchmarks. For example, a Weight-Aware RL-based approach helps the VGG16 architecture achieve 80.9% test accuracy while the human-labeled auxiliary task setup achieved 75.53%. The goal of this work is to (1) prove that RL is a viable approach to dynamically generate auxiliary tasks and (2) demonstrate that per-sample auxiliary task weights can be learned alongside the auxiliary task labels and can achieve strong results.
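A compact sketch of the described loop: a policy samples an auxiliary label per example and receives the change in primary-task validation performance as its REINFORCE reward. All callables and the exact reward definition are placeholders, not the paper's implementation:

```python
# REINFORCE-style auxiliary-label selection; train_primary / eval_primary are
# user-supplied callables (joint main+aux update, validation accuracy).
import torch

def rl_aux_step(policy, primary_model, x, y, train_primary, eval_primary, opt):
    dist = torch.distributions.Categorical(logits=policy(x))
    aux_labels = dist.sample()                          # an aux label per data point
    acc_before = eval_primary(primary_model)
    train_primary(primary_model, x, y, aux_labels)      # joint main+aux update
    reward = eval_primary(primary_model) - acc_before   # did the aux tasks help?
    loss = -(dist.log_prob(aux_labels) * reward).mean() # REINFORCE objective
    opt.zero_grad(); loss.backward(); opt.step()
    return reward
```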
[855] The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination
Chenlong Yin, Zeyang Sha, Shiwen Cui, Changhua Meng
Main category: cs.LG
TL;DR: This paper investigates whether enhancing reasoning capabilities in LLMs causes increased tool hallucination, finding a causal relationship where reasoning enhancement proportionally increases tool hallucination across various methods.
Details
Motivation: To systematically examine if strengthening reasoning capabilities in LLMs inherently causes tool hallucination, addressing a gap where previous work hadn't established this causal relationship despite observations like OpenAI's o3 showing reasoning-strengthened models hallucinating more.
Method: Introduced SimpleToolHalluBench diagnostic benchmark with two failure modes (no tool available, only distractor tools available). Conducted controlled experiments using RL reasoning enhancement, supervised fine-tuning, and inference-time step-by-step thinking. Evaluated mitigation strategies including Prompt Engineering and DPO.
Result: Found three key findings: 1) Causal relationship where reasoning enhancement via RL proportionally increases tool hallucination with performance gains; 2) Effect transcends overfitting - training on non-tool tasks amplifies subsequent tool hallucination; 3) Method-agnostic effect appearing across SFT and inference-time reasoning. Mitigation strategies revealed reliability-capability trade-off.
Conclusion: Current reasoning enhancement methods inherently amplify tool hallucination through disproportionate collapse of tool-reliability representations and amplified divergences in late-layer residual streams, highlighting need for new training objectives that jointly optimize capability and reliability.
Abstract: Enhancing the reasoning capabilities of Large Language Models (LLMs) is a key strategy for building Agents that “think then act.” However, recent observations, like OpenAI’s o3, suggest a paradox: stronger reasoning often coincides with increased hallucination, yet no prior work has systematically examined whether reasoning enhancement itself causes tool hallucination. To address this gap, we pose the central question: Does strengthening reasoning increase tool hallucination? To answer this, we introduce SimpleToolHalluBench, a diagnostic benchmark measuring tool hallucination in two failure modes: (i) no tool available, and (ii) only distractor tools available. Through controlled experiments, we establish three key findings. First, we demonstrate a causal relationship: progressively enhancing reasoning through RL increases tool hallucination proportionally with task performance gains. Second, this effect transcends overfitting - training on non-tool tasks (e.g., mathematics) still amplifies subsequent tool hallucination. Third, the effect is method-agnostic, appearing when reasoning is instilled via supervised fine-tuning and when it is merely elicited at inference by switching from direct answers to step-by-step thinking. We also evaluate mitigation strategies including Prompt Engineering and Direct Preference Optimization (DPO), revealing a fundamental reliability-capability trade-off: reducing hallucination consistently degrades utility. Mechanistically, Reasoning RL disproportionately collapses tool-reliability-related representations, and hallucinations surface as amplified divergences concentrated in late-layer residual streams. These findings reveal that current reasoning enhancement methods inherently amplify tool hallucination, highlighting the need for new training objectives that jointly optimize for capability and reliability.
[856] Hazard-Responsive Digital Twin for Climate-Driven Urban Resilience and Equity
Zhenglai Shen, Hongyu Zhou
Main category: cs.LG
TL;DR: Hazard-Responsive Digital Twin (H-RDT) combines neural networks, data fusion, and equity analytics to manage climate hazards like wildfire-outage-heatwave cascades, maintaining stable predictions and reducing thermal risks through interventions.
Details
Motivation: Address compounding climate hazards (wildfire-induced outages and urban heatwaves) that challenge urban stability and equity, requiring adaptive and equitable decision support for climate adaptation.
Method: Physics-informed neural network modeling, multimodal data fusion (IoT, UAV, satellite), equity-aware risk analytics, reinforcement learning for adaptive input reweighting, and prospective interventions like cooling-center activation and microgrid sharing.
Result: H-RDT maintains stable indoor temperature predictions (31-33°C) under sensor loss, reduces population-weighted thermal risk by 11-13%, shrinks tail risk by 7-17%, and cuts overheating hours by up to 9% in synthetic district simulations.
Conclusion: H-RDT establishes a transferable foundation for real-city implementation, advancing digital urban resilience with adaptive, learning-based, and equity-centered decision support for climate adaptation.
Abstract: Compounding climate hazards, such as wildfire-induced outages and urban heatwaves, challenge the stability and equity of cities. We present a Hazard-Responsive Digital Twin (H-RDT) that combines physics-informed neural network modeling, multimodal data fusion, and equity-aware risk analytics for urban-scale response. In a synthetic district with diverse building archetypes and populations, a simulated wildfire-outage-heatwave cascade shows that H-RDT maintains stable indoor temperature predictions (approximately 31 to 33 °C) under partial sensor loss, reproducing outage-driven surges and recovery. The reinforcement learning based fusion module adaptively reweights IoT, UAV, and satellite inputs to sustain spatiotemporal coverage, while the equity-adjusted mapping isolates high-vulnerability clusters (schools, clinics, low-income housing). Prospective interventions, such as preemptive cooling-center activation and microgrid sharing, reduce population-weighted thermal risk by 11 to 13 percent, shrink the 95th-percentile (tail) risk by 7 to 17 percent, and cut overheating hours by up to 9 percent. Beyond the synthetic demonstration, the framework establishes a transferable foundation for real-city implementation, linking physical hazard modeling with social equity and decision intelligence. The H-RDT advances digital urban resilience toward adaptive, learning-based, and equity-centered decision support for climate adaptation.
[857] Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms
Pravin Nair
Main category: cs.LG
TL;DR: The softmax function has a Lipschitz constant of 1/2 across all ℓ_p norms (p ≥ 1), improving upon the commonly assumed constant of 1.
Details
Motivation: Existing robustness guarantees and convergence analyses typically assume softmax has Lipschitz constant 1, but this work aims to establish a tighter bound.
Method: Mathematical proof showing softmax is contractive with Lipschitz constant 1/2 uniformly across all ℓ_p norms, with analysis of local constants for different p values.
Result: Proved softmax has uniform Lipschitz constant 1/2 for all p ≥ 1, with local constants reaching 1/2 only for p=1 and p=∞, and remaining strictly below 1/2 otherwise.
Conclusion: The sharper 1/2 Lipschitz constant improves existing theoretical results on robustness and convergence, and is empirically validated on attention architectures and RL policies.
Abstract: The softmax function is a basic operator in machine learning and optimization, used in classification, attention mechanisms, reinforcement learning, game theory, and problems involving log-sum-exp terms. Existing robustness guarantees of learning models and convergence analysis of optimization algorithms typically consider the softmax operator to have a Lipschitz constant of $1$ with respect to the $\ell_2$ norm. In this work, we prove that the softmax function is contractive with the Lipschitz constant $1/2$, uniformly across all $\ell_p$ norms with $p \ge 1$. We also show that the local Lipschitz constant of softmax attains $1/2$ for $p = 1$ and $p = \infty$, and for $p \in (1,\infty)$, the constant remains strictly below $1/2$ and the supremum $1/2$ is achieved only in the limit. To our knowledge, this is the first comprehensive norm-uniform analysis of softmax Lipschitz continuity. We demonstrate how the sharper constant directly improves a range of existing theoretical results on robustness and convergence. We further validate the sharpness of the $1/2$ Lipschitz constant of the softmax operator through empirical studies on attention-based architectures (ViT, GPT-2, Qwen3-8B) and on stochastic policies in reinforcement learning.
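The ℓ2 case of the bound is easy to probe numerically: the softmax Jacobian at logits z is J = diag(s) − ssᵀ, and its spectral norm stays at or below 1/2:

```python
# Numerical check of the 1/2 bound for p = 2 via the softmax Jacobian.
import numpy as np

rng = np.random.default_rng(0)
worst = 0.0
for _ in range(10_000):
    z = rng.normal(scale=3.0, size=rng.integers(2, 10))
    s = np.exp(z - z.max()); s /= s.sum()           # stable softmax
    J = np.diag(s) - np.outer(s, s)                 # Jacobian of softmax at z
    worst = max(worst, np.linalg.norm(J, 2))        # spectral norm
print(worst)  # approaches 0.5 (attained at the uniform distribution for n = 2)
```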
[858] Hankel Singular Value Regularization for Highly Compressible State Space Models
Paul Schwerdtner, Jules Berman, Benjamin Peherstorfer
Main category: cs.LG
TL;DR: Regularizing Hankel singular values of state space models enables 10x more compressible models while maintaining accuracy on long-range sequence tasks.
Details
Motivation: Deep neural networks using state space models are effective for long-range sequence tasks but challenging to compress after training.
Method: Developed an algorithm to efficiently compute Hankel singular values during training by exploiting block-diagonal structure of system matrices, and used Hankel singular value regularization to promote compressibility.
Result: Regularized state space layers achieved up to 10x more compressibility than standard state space layers while maintaining high accuracy on Long Range Arena benchmarks.
Conclusion: Hankel singular value regularization effectively enables compressible state space models without sacrificing performance on long-range sequence tasks.
Abstract: Deep neural networks using state space models as layers are well suited for long-range sequence tasks but can be challenging to compress after training. We exploit the fact that regularizing the sum of Hankel singular values of state space models leads to a fast decay of these singular values and thus to compressible models. To make the proposed Hankel singular value regularization scalable, we develop an algorithm to efficiently compute the Hankel singular values during training iterations by exploiting the specific block-diagonal structure of the system matrices that we use in our state space model parametrization. Experiments on Long Range Arena benchmarks demonstrate that the regularized state space layers are up to 10$\times$ more compressible than standard state space layers while maintaining high accuracy.
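For a continuous-time state space model (A, B, C), the Hankel singular values come from the controllability and observability Gramians, each the solution of a Lyapunov equation. A plain reference computation, without the paper's block-diagonal speedup, looks like this:

```python
# Reference Hankel singular values via Gramians; A must be Hurwitz (stable).
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def hankel_singular_values(A, B, C):
    P = solve_continuous_lyapunov(A, -B @ B.T)     # controllability Gramian
    Q = solve_continuous_lyapunov(A.T, -C.T @ C)   # observability Gramian
    return np.sqrt(np.abs(np.linalg.eigvals(P @ Q).real))

A = np.diag(-np.linspace(0.5, 5.0, 8))             # stable (block-)diagonal A
B = np.random.randn(8, 1); C = np.random.randn(1, 8)
regularizer = hankel_singular_values(A, B, C).sum()  # penalize to promote decay
```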
[859] MoEMeta: Mixture-of-Experts Meta Learning for Few-Shot Relational Learning
Han Wu, Jie Yin
Main category: cs.LG
TL;DR: MoEMeta is a meta-learning framework for few-shot knowledge graph relational learning that disentangles globally shared knowledge from task-specific contexts using mixture-of-experts and task-tailored adaptation.
Details
Motivation: Existing meta-learning approaches for few-shot KG relational learning suffer from learning relation meta-knowledge in isolation and struggle to incorporate local task-specific contexts, limiting their generalization and adaptation capabilities.
Method: Proposes MoEMeta with two key innovations: (1) mixture-of-experts model to learn globally shared relational prototypes for better generalization, and (2) task-tailored adaptation mechanism to capture local contexts for fast task-specific adaptation.
Result: Extensive experiments on three KG benchmarks show MoEMeta consistently outperforms existing baselines and achieves state-of-the-art performance in few-shot relational learning.
Conclusion: By balancing global generalization with local adaptability, MoEMeta significantly advances few-shot knowledge graph relational learning and demonstrates superior performance compared to existing approaches.
Abstract: Few-shot knowledge graph relational learning seeks to perform reasoning over relations given only a limited number of training examples. While existing approaches largely adopt a meta-learning framework for enabling fast adaptation to new relations, they suffer from two key pitfalls. First, they learn relation meta-knowledge in isolation, failing to capture common relational patterns shared across tasks. Second, they struggle to effectively incorporate local, task-specific contexts crucial for rapid adaptation. To address these limitations, we propose MoEMeta, a novel meta-learning framework that disentangles globally shared knowledge from task-specific contexts to enable both effective generalization and rapid adaptation. MoEMeta introduces two key innovations: (i) a mixture-of-experts (MoE) model that learns globally shared relational prototypes to enhance generalization, and (ii) a task-tailored adaptation mechanism that captures local contexts for fast task-specific adaptation. By balancing global generalization with local adaptability, MoEMeta significantly advances few-shot relational learning. Extensive experiments and analyses on three KG benchmarks demonstrate that MoEMeta consistently outperforms existing baselines, achieving state-of-the-art performance.
[860] SARNet: A Spike-Aware consecutive validation Framework for Accurate Remaining Useful Life Prediction
Junhao Fan, Wenrui Liang, Wei-Qiang Zhang
Main category: cs.LG
TL;DR: SARNet is a spike-aware RUL prediction framework that combines ModernTCN with adaptive spike detection and targeted feature engineering to improve accuracy and interpretability while maintaining lightweight deployment.
Details
Motivation: Current RUL prediction models are fragile around fault onset, opaque to engineers, smooth away important spikes, use fixed thresholds that reduce sensitivity, and lack physics-based explanations.
Method: Uses ModernTCN for degradation forecasting, adaptive consecutive threshold for spike detection, targeted feature engineering for failure-prone segments, and stacked RF-LGBM regressor for final RUL prediction.
Result: Achieves lower error than recent baselines (RMSE 0.0365, MAE 0.0204) across benchmark datasets under event-triggered protocol.
Conclusion: SARNet provides accurate, robust, and interpretable RUL prediction while remaining lightweight and easy to deploy.
Abstract: Accurate prediction of remaining useful life (RUL) is essential to enhance system reliability and reduce maintenance risk. Yet many strong contemporary models are fragile around fault onset and opaque to engineers: short, high-energy spikes are smoothed away or misread, fixed thresholds blunt sensitivity, and physics-based explanations are scarce. To remedy this, we introduce SARNet (Spike-Aware Consecutive Validation Framework), which builds on a Modern Temporal Convolutional Network (ModernTCN) and adds spike-aware detection to provide physics-informed interpretability. ModernTCN forecasts degradation-sensitive indicators; an adaptive consecutive threshold validates true spikes while suppressing noise. Failure-prone segments then receive targeted feature engineering (spectral slopes, statistical derivatives, energy ratios), and the final RUL is produced by a stacked RF–LGBM regressor. Across benchmark-ported datasets under an event-triggered protocol, SARNet consistently lowers error compared to recent baselines (RMSE 0.0365, MAE 0.0204) while remaining lightweight, robust, and easy to deploy.
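The adaptive consecutive-threshold idea — validate a spike only if the exceedance persists for several steps — can be sketched with rolling statistics; the window length, sigma multiplier, and run length below are illustrative choices, not the paper's settings:

```python
# Consecutive-threshold spike validation: single-sample excursions are treated
# as noise, runs of k exceedances as real spikes.
import numpy as np

def validate_spikes(signal, window=50, n_sigma=3.0, k=3):
    spikes = np.zeros(len(signal), dtype=bool)
    run = 0
    for i in range(window, len(signal)):
        ref = signal[i - window:i]                       # rolling reference stats
        above = signal[i] > ref.mean() + n_sigma * ref.std()
        run = run + 1 if above else 0
        if run >= k:                                     # exceedance persisted
            spikes[i - k + 1:i + 1] = True
    return spikes
```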
[861] How Muon’s Spectral Design Benefits Generalization: A Study on Imbalanced Data
Bhavya Vasudeva, Puneesh Deora, Yize Zhao, Vatsal Sharan, Christos Thrampoulidis
Main category: cs.LG
TL;DR: Spectral optimizers like SpecGD outperform vanilla gradient descent on imbalanced data by learning all principal components equally, while GD prioritizes dominant components, leading to better balanced accuracy early in training.
Details
Motivation: To systematically study when spectrum-aware matrix optimizers (Muon, Shampoo) outperform competitive algorithms, particularly for imbalanced data where balanced accuracy matters.
Method: Introduce Spectral Gradient Descent (SpecGD) as canonical form, analyze Gaussian mixture data with linear/bilinear models, compare against vanilla GD and adaptive step-size variants, extend to deep linear models.
Result: SpecGD learns all principal components at equal rates unlike GD which prioritizes dominant components, creating growing balanced accuracy gap early in training that persists even with adaptive step-sizes, with depth amplifying effects.
Conclusion: Spectral optimizers achieve superior generalization on imbalanced data by promoting balanced learning of all data components, validated empirically against Euclidean counterparts and Adam.
Abstract: The growing adoption of spectrum-aware matrix-valued optimizers such as Muon and Shampoo in deep learning motivates a systematic study of their generalization properties and, in particular, when they might outperform competitive algorithms. We approach this question by introducing appropriate simplifying abstractions as follows: First, we use imbalanced data as a testbed. Second, we study the canonical form of such optimizers, which is Spectral Gradient Descent (SpecGD) – each update step is $UV^T$ where $U\Sigma V^T$ is the truncated SVD of the gradient. Third, within this framework we identify a canonical setting for which we precisely quantify when SpecGD outperforms vanilla Euclidean GD. For a Gaussian mixture data model and both linear and bilinear models, we show that unlike GD, which prioritizes learning dominant principal components of the data first, SpecGD learns all principal components of the data at equal rates. We demonstrate how this translates to a growing gap in balanced accuracy favoring SpecGD early in training and further show that the gap remains consistent even when the GD counterpart uses adaptive step-sizes via normalization. By extending the analysis to deep linear models, we show that depth amplifies these effects. We empirically verify our theoretical findings on a variety of imbalanced datasets. Our experiments compare practical variants of spectral methods, like Muon and Shampoo, against their Euclidean counterparts and Adam. The results validate our findings that these spectral optimizers achieve superior generalization by promoting a more balanced learning of the data’s underlying components.
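The canonical SpecGD update is exactly as stated in the abstract: replace the gradient by UVᵀ from its (optionally truncated) SVD, so every retained direction moves at the same rate:

```python
# Canonical SpecGD step: whiten the gradient's spectrum before stepping.
import numpy as np

def specgd_step(W, grad, lr=0.1, rank=None):
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    if rank is not None:                       # optional truncation
        U, Vt = U[:, :rank], Vt[:rank]
    return W - lr * (U @ Vt)                   # all directions move at equal rate

W = np.random.randn(20, 10)
W_next = specgd_step(W, np.random.randn(20, 10))
```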
[862] QoSGMAA: A Robust Multi-Order Graph Attention and Adversarial Framework for Sparse QoS Prediction
Guanchen Du, Jianlong Xu, Mingtong Li, Ruiqi Wang, Qianqing Guo, Caiyi Chen, Qingcao Dai, Yuxiang Zeng
Main category: cs.LG
TL;DR: QoSMGAA is a novel QoS prediction architecture that uses multi-order attention and adversarial neural networks to handle data sparsity and structural noise, outperforming existing methods.
Details
Motivation: The exponential growth of similar network services makes optimal service selection challenging, and existing QoS prediction methods fail to capture rich contextual information and perform poorly under extreme data sparsity and structural noise.
Method: QoSMGAA integrates multi-order attention mechanism to aggregate contextual data, uses adversarial neural networks for autoregressive supervised learning, and employs Gumbel-Softmax discrete sampling to generate negative samples for capturing higher-order interactions.
Result: Comprehensive experiments on large-scale real-world datasets show that QoSMGAA significantly outperforms existing baseline methods in QoS prediction accuracy.
Conclusion: The proposed model demonstrates strong potential for practical deployment in service selection and recommendation scenarios, effectively addressing challenges in complex and noisy network service environments.
Abstract: With the rapid advancement of internet technologies, network services have become critical for delivering diverse and reliable applications to users. However, the exponential growth in the number of available services has resulted in many similar offerings, posing significant challenges in selecting optimal services. Predicting Quality of Service (QoS) accurately thus becomes a fundamental prerequisite for ensuring reliability and user satisfaction. However, existing QoS prediction methods often fail to capture rich contextual information and exhibit poor performance under extreme data sparsity and structural noise. To bridge this gap, we propose a novel architecture, QoSMGAA, specifically designed to enhance prediction accuracy in complex and noisy network service environments. QoSMGAA integrates a multi-order attention mechanism to aggregate extensive contextual data and predict missing QoS values effectively. Additionally, our method incorporates adversarial neural networks to perform autoregressive supervised learning based on transformed interaction matrices. To capture complex, higher-order interactions among users and services, we employ a discrete sampling technique leveraging the Gumbel-Softmax method to generate informative negative samples. Comprehensive experimental validation conducted on large-scale real-world datasets demonstrates that our proposed model significantly outperforms existing baseline methods, highlighting its strong potential for practical deployment in service selection and recommendation scenarios.
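The Gumbel-Softmax negative-sampling step maps onto a standard PyTorch primitive; where the negative logits come from and the temperature are assumptions here, not the paper's specification:

```python
# Differentiable (straight-through hard) negative sampling over candidate
# services using the Gumbel-Softmax trick.
import torch
import torch.nn.functional as F

def sample_negatives(neg_logits, tau=0.5, n_samples=5):
    """neg_logits: [B, n_services] unnormalized scores over candidate negatives."""
    draws = [F.gumbel_softmax(neg_logits, tau=tau, hard=True)
             for _ in range(n_samples)]
    return torch.stack(draws, dim=1)   # [B, n_samples, n_services] one-hot negatives

negatives = sample_negatives(torch.randn(4, 100))
```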
[863] LLM Meets Diffusion: A Hybrid Framework for Crystal Material Generation
Subhojyoti Khastagir, Kishalay Das, Pawan Goyal, Seung-Cheol Lee, Satadeep Bhattacharjee, Niloy Ganguly
Main category: cs.LG
TL;DR: CrysLLMGen is a hybrid framework that combines LLMs and diffusion models for crystal structure generation, achieving better performance than existing methods by leveraging complementary strengths of both approaches.
Details
Motivation: Existing methods have limitations - LLMs handle discrete atomic types well but struggle with continuous features, while denoising models are good with continuous variables but poor at atomic compositions. This gap needs to be bridged.
Method: Hybrid framework integrating LLM with diffusion model. First uses fine-tuned LLM to generate intermediate representation of atom types, coordinates, and lattice structure, then passes coordinates and lattice to pre-trained equivariant diffusion model for refinement while keeping predicted atom types.
Result: Outperforms state-of-the-art generative models across benchmark tasks and datasets. Achieves balanced structural and compositional validity, generates more stable and novel materials, and shows strong conditional generation capabilities for user-defined constraints.
Conclusion: CrysLLMGen successfully bridges the gap between LLM-based and denoising-based approaches, demonstrating superior performance in crystal material generation with complementary strengths of both models.
Abstract: Recent advances in generative modeling have shown significant promise in designing novel periodic crystal structures. Existing approaches typically rely on either large language models (LLMs) or equivariant denoising models, each with complementary strengths: LLMs excel at handling discrete atomic types but often struggle with continuous features such as atomic positions and lattice parameters, while denoising models are effective at modeling continuous variables but encounter difficulties in generating accurate atomic compositions. To bridge this gap, we propose CrysLLMGen, a hybrid framework that integrates an LLM with a diffusion model to leverage their complementary strengths for crystal material generation. During sampling, CrysLLMGen first employs a fine-tuned LLM to produce an intermediate representation of atom types, atomic coordinates, and lattice structure. While retaining the predicted atom types, it passes the atomic coordinates and lattice structure to a pre-trained equivariant diffusion model for refinement. Our framework outperforms state-of-the-art generative models across several benchmark tasks and datasets. Specifically, CrysLLMGen not only achieves a balanced performance in terms of structural and compositional validity but also generates more stable and novel materials compared to LLM-based and denoising-based models. Furthermore, CrysLLMGen exhibits strong conditional generation capabilities, effectively producing materials that satisfy user-defined constraints. Code is available at https://github.com/kdmsit/crysllmgen
[864] Equivariant Neural Networks for General Linear Symmetries on Lie Algebras
Chankyo Kim, Sicheng Zhao, Minghan Zhu, Tzu-Yuan Lin, Maani Ghaffari
Main category: cs.LG
TL;DR: ReLNs are novel neural networks exactly equivariant to general linear transformations GL(n), overcoming limitations of prior equivariant models that only handle simple symmetries like rotations.
Details
Motivation: Most existing equivariant models are limited to simple symmetries like rotations, failing to address the broader class of general linear transformations GL(n) that appear in many scientific domains.
Method: Introduces Reductive Lie Neurons (ReLNs) with novel adjoint-invariant bilinear layer to achieve stable equivariance for both Lie-algebraic features and matrix-valued inputs, operating directly on structured inputs like n-by-n matrices.
Result: ReLNs outperform existing methods on algebraic benchmarks with sl(3) and sp(4) symmetries, achieve competitive results on Lorentz-equivariant particle physics, and significantly improve trajectory accuracy in 3D drone state estimation by jointly processing velocities and covariances.
Conclusion: ReLNs provide a practical and general framework for learning with broad linear group symmetries on Lie algebras and matrix-valued data, offering versatility across diverse scientific applications.
Abstract: Encoding symmetries is a powerful inductive bias for improving the generalization of deep neural networks. However, most existing equivariant models are limited to simple symmetries like rotations, failing to address the broader class of general linear transformations, GL(n), that appear in many scientific domains. We introduce Reductive Lie Neurons (ReLNs), a novel neural network architecture exactly equivariant to these general linear symmetries. ReLNs are designed to operate directly on a wide range of structured inputs, including general n-by-n matrices. ReLNs introduce a novel adjoint-invariant bilinear layer to achieve stable equivariance for both Lie-algebraic features and matrix-valued inputs, without requiring redesign for each subgroup. This architecture overcomes the limitations of prior equivariant networks that only apply to compact groups or simple vector data. We validate ReLNs’ versatility across a spectrum of tasks: they outperform existing methods on algebraic benchmarks with sl(3) and sp(4) symmetries and achieve competitive results on a Lorentz-equivariant particle physics task. In 3D drone state estimation with geometric uncertainty, ReLNs jointly process velocities and covariances, yielding significant improvements in trajectory accuracy. ReLNs provide a practical and general framework for learning with broad linear group symmetries on Lie algebras and matrix-valued data. Project page: https://reductive-lie-neuron.github.io/
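The identity underlying an adjoint-invariant bilinear form is that traces are unchanged by conjugation: tr((gXg⁻¹)(gYg⁻¹)) = tr(XY) for any invertible g. ReLNs' layer is a learned construction built on this kind of invariance; the check below only demonstrates the identity itself, not the paper's layer:

```python
# Numerical check: trace bilinear forms are invariant under the GL(n)
# adjoint action X -> g X g^-1.
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
g = rng.normal(size=(3, 3)) + 3 * np.eye(3)     # generic invertible matrix
g_inv = np.linalg.inv(g)
lhs = np.trace(g @ X @ g_inv @ g @ Y @ g_inv)
print(np.isclose(lhs, np.trace(X @ Y)))         # True: adjoint-invariant
```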
[865] Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients
Christos Thrampoulidis, Sadegh Mahdavi, Wenlong Deng
Main category: cs.LG
TL;DR: This paper shows that REINFORCE-style methods and advantage-shaping techniques for Pass@K optimization are equivalent approaches, revealing that advantage-shaping implicitly optimizes surrogate rewards.
Details
Motivation: To reconcile two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards.
Method: By reverse-engineering existing advantage-shaping algorithms and showing they implicitly optimize surrogate rewards, then providing a recipe for deriving both existing and new advantage-shaping methods from surrogate reward objectives.
Result: Demonstrated that direct REINFORCE-style methods and advantage-shaping techniques are two sides of the same coin, with practical modifications to GRPO interpreted as reward-level regularization.
Conclusion: This unified perspective provides a framework for RLVR policy gradient optimization that extends beyond the original Pass@K motivation.
Abstract: This note reconciles two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards: (1) direct REINFORCE-style methods, and (2) advantage-shaping techniques that directly modify GRPO. We show that these are two sides of the same coin. By reverse-engineering existing advantage-shaping algorithms, we reveal that they implicitly optimize surrogate rewards. We specifically interpret practical “hard-example up-weighting” modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods. This perspective provides a lens for RLVR policy gradient optimization beyond our original motivation of Pass@K.
[866] Adaptive Forests For Classification
Dimitris Bertsimas, Yubing Cui
Main category: cs.LG
TL;DR: AF adaptively selects weights for CART trees using OP2T and MIO, outperforming RF and XGBoost in classification tasks.
Details
Motivation: RF and XGBoost use equal weights for trees, which may not be optimal. AF aims to improve performance by adaptively selecting input-dependent weights.
Method: Combines OP2T for input-dependent weight prescription and MIO for dynamic weight refinement.
Result: AF consistently outperforms RF, XGBoost, and other weighted RF methods on 20+ real-world datasets in binary and multi-class classification.
Conclusion: AF’s adaptive weight selection enhances ensemble model performance beyond traditional methods.
Abstract: Random Forests (RF) and Extreme Gradient Boosting (XGBoost) are two of the most widely used and highly performing classification and regression models. They aggregate equally weighted CART trees, generated randomly in RF or sequentially in XGBoost. In this paper, we propose Adaptive Forests (AF), a novel approach that adaptively selects the weights of the underlying CART models. AF combines (a) the Optimal Predictive-Policy Trees (OP2T) framework to prescribe tailored, input-dependent unequal weights to trees and (b) Mixed Integer Optimization (MIO) to refine weight candidates dynamically, enhancing overall performance. We demonstrate that AF consistently outperforms RF, XGBoost, and other weighted RF variants in binary and multi-class classification problems over 20+ real-world datasets.
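The prediction rule that distinguishes AF from RF's uniform vote is easy to state in code: each tree's vote is scaled by an input-dependent weight. The weight function below is a placeholder for what the OP2T/MIO machinery would prescribe, and class labels are assumed to be integers 0..n_classes-1:

```python
# Input-dependent weighted voting over an ensemble of fitted trees.
import numpy as np

def adaptive_forest_predict(trees, weight_fn, X, n_classes):
    """trees: fitted classifiers; weight_fn(x) -> [n_trees] nonnegative weights."""
    preds = np.stack([t.predict(X) for t in trees], axis=1)   # [n, n_trees]
    votes = np.zeros((len(X), n_classes))
    for i, x in enumerate(X):
        w = weight_fn(x)                       # tailored weights for this input
        for t_idx, cls in enumerate(preds[i]):
            votes[i, int(cls)] += w[t_idx]
    return votes.argmax(axis=1)
```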
[867] Sentinel: Dynamic Knowledge Distillation for Personalized Federated Intrusion Detection in Heterogeneous IoT Networks
Gurpreet Singh, Keshav Sood, P. Rajalakshmi, Yong Xiang
Main category: cs.LG
TL;DR: Sentinel is a personalized federated IDS framework that addresses class imbalance, non-IID data, and communication overhead in IoT intrusion detection through dual-model architecture and knowledge distillation.
Details
Motivation: Conventional FL methods perform poorly in IoT intrusion detection due to severe class imbalance, non-IID data distribution, and high communication costs, which degrade real-world network traffic classification performance.
Method: Proposes Sentinel framework with dual-model architecture (personalized teacher + lightweight shared student), bidirectional knowledge distillation with adaptive temperature scaling, multi-faceted feature alignment, class-balanced loss functions, and normalized gradient aggregation.
Result: Extensive experiments on IoTID20 and 5GNIDD datasets show Sentinel significantly outperforms state-of-the-art federated methods, especially under extreme data heterogeneity, while maintaining communication efficiency.
Conclusion: Sentinel establishes a new performance benchmark for federated intrusion detection systems in IoT networks by effectively balancing local adaptation with global consensus while preserving privacy and reducing communication costs.
Abstract: Federated learning (FL) offers a privacy-preserving paradigm for machine learning, but its application in intrusion detection systems (IDS) within IoT networks is challenged by severe class imbalance, non-IID data, and high communication overhead. These challenges severely degrade the performance of conventional FL methods in real-world network traffic classification. To overcome these limitations, we propose Sentinel, a personalized federated IDS (pFed-IDS) framework that incorporates a dual-model architecture on each client, consisting of a personalized teacher and a lightweight shared student model. This design effectively balances deep local adaptation with efficient global model consensus while preserving client privacy by transmitting only the compact student model, thus reducing communication costs. Sentinel integrates three key mechanisms to ensure robust performance: bidirectional knowledge distillation with adaptive temperature scaling, multi-faceted feature alignment, and class-balanced loss functions. Furthermore, the server employs normalized gradient aggregation with equal client weighting to enhance fairness and mitigate client drift. Extensive experiments on the IoTID20 and 5GNIDD benchmark datasets demonstrate that Sentinel significantly outperforms state-of-the-art federated methods, establishing a new performance benchmark, especially under extreme data heterogeneity, while maintaining communication efficiency.
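The bidirectional distillation term can be sketched as a symmetric KL between teacher and student at a shared temperature; the paper's adaptive temperature-scaling rule is not specified here, so T is left as a fixed parameter:

```python
# Bidirectional knowledge-distillation loss with the standard T^2 scaling.
import torch.nn.functional as F

def bidirectional_kd_loss(student_logits, teacher_logits, T=2.0):
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.log_softmax(teacher_logits / T, dim=-1)
    kd_st = F.kl_div(s, t.exp(), reduction="batchmean") * T * T  # teacher -> student
    kd_ts = F.kl_div(t, s.exp(), reduction="batchmean") * T * T  # student -> teacher
    return kd_st + kd_ts
```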
[868] Sublinear Sketches for Approximate Nearest Neighbor and Kernel Density Estimation
Ved Danait, Srijan Das, Sujoy Bhore
Main category: cs.LG
TL;DR: The paper develops sublinear-space sketching algorithms for Approximate Nearest Neighbor (ANN) search and Approximate Kernel Density Estimation (A-KDE) in dynamic data streams, achieving near-optimal memory-approximation trade-offs.
Details
Motivation: ANN and A-KDE are fundamental in machine learning with applications in data analysis and large-scale decision making. The challenge is designing compact sketches for dynamic data streams that preserve structural properties while enabling efficient queries.
Method: For ANN in streaming model: uses sublinear sketch with O(n^(1+ρ-η)) memory by storing only n^(-η) fraction of inputs. For A-KDE in sliding-window model: proposes sketch of size O(RW * (1/(√(1+ε)-1)) * log²N). Both methods support sublinear query time and batch queries.
Result: First theoretical sublinear sketch guarantee for A-KDE in sliding-window model. For ANN, achieves near-optimal trade-offs between memory size and approximation error. Experiments on real-world datasets show lightweight sketches with consistently low error.
Conclusion: The proposed sketching algorithms provide efficient sublinear-space solutions for ANN and A-KDE in dynamic streaming settings, with theoretical guarantees and practical effectiveness demonstrated through experiments.
Abstract: Approximate Nearest Neighbor (ANN) search and Approximate Kernel Density Estimation (A-KDE) are fundamental problems at the core of modern machine learning, with broad applications in data analysis, information systems, and large-scale decision making. In massive and dynamic data streams, a central challenge is to design compact sketches that preserve essential structural properties of the data while enabling efficient queries. In this work, we develop new sketching algorithms that achieve sublinear space and query time guarantees for both ANN and A-KDE for a dynamic stream of data. For ANN in the streaming model, under natural assumptions, we design a sublinear sketch that requires only $\mathcal{O}(n^{1+\rho-\eta})$ memory by storing only a sublinear ($n^{-\eta}$) fraction of the total inputs, where $\rho$ is a parameter of the LSH family, and $0<\eta<1$. Our method supports sublinear query time, batch queries, and extends to the more general Turnstile model. While earlier works have focused on Exact NN, this is the first result on ANN that achieves near-optimal trade-offs between memory size and approximation error. Next, for A-KDE in the Sliding-Window model, we propose a sketch of size $\mathcal{O}\left(RW \cdot \frac{1}{\sqrt{1+\epsilon} - 1} \log^2 N\right)$, where $R$ is the number of sketch rows, $W$ is the LSH range, $N$ is the window size, and $\epsilon$ is the approximation error. This, to the best of our knowledge, is the first theoretical sublinear sketch guarantee for A-KDE in the Sliding-Window model. We complement our theoretical results with experiments on various real-world datasets, which show that the proposed sketches are lightweight and achieve consistently low error in practice.
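The ANN side of the construction can be caricatured in a few lines: a random-hyperplane LSH table that retains each streamed point only with some probability, so the stored sketch is a sublinear fraction of the stream. The hash family and parameters are illustrative; the paper's guarantees depend on choices not modeled here:

```python
# Caricature of the streaming ANN sketch: LSH buckets over a Bernoulli
# subsample of the stream.
import numpy as np

class SubsampledLSH:
    def __init__(self, dim, n_planes=16, keep_prob=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.planes = self.rng.normal(size=(n_planes, dim))  # random hyperplanes
        self.keep_prob = keep_prob                           # ~ n^(-eta) in the paper
        self.buckets = {}

    def _key(self, x):
        return tuple((self.planes @ x > 0).astype(int))      # sign-pattern hash

    def insert(self, x):
        if self.rng.random() < self.keep_prob:               # store sublinear fraction
            self.buckets.setdefault(self._key(x), []).append(x)

    def query(self, q):
        cands = self.buckets.get(self._key(q), [])
        return min(cands, key=lambda v: np.linalg.norm(v - q)) if cands else None
```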
[869] Enabling Vibration-Based Gesture Recognition on Everyday Furniture via Energy-Efficient FPGA Implementation of 1D Convolutional Networks
Koki Shibata, Tianheng Ling, Chao Qian, Tomokazu Matsui, Hirohiko Suwa, Keiichi Yasumoto, Gregor Schiele
Main category: cs.LG
TL;DR: This paper proposes an energy-efficient gesture recognition system using compact neural networks on low-power FPGAs, achieving competitive accuracy with minimal preprocessing and reduced computational requirements.
Details
Motivation: Address the limitations of existing vibration-based gesture recognition systems that rely on complex preprocessing and large neural networks, which require high-performance hardware and consume excessive energy, limiting real-world deployability.
Method: Developed lightweight 1D-CNN and 1D-SepCNN architectures tailored for FPGAs, using raw waveform input instead of spectral preprocessing, integer-only quantization, automated RTL generation, and ping-pong buffering for memory efficiency.
Result: Achieved 0.970 average accuracy with 9.22 ms latency using 6-bit 1D-CNN, and 0.949 accuracy with 6.83 ms latency (53x CPU speedup) using 8-bit 1D-SepCNN, both consuming under 1.2 mJ per inference on AMD Spartan-7 FPGA.
Conclusion: The proposed approach enables real-time, energy-efficient gesture recognition suitable for long-term edge operation, demonstrating significant improvements in deployability and efficiency compared to existing methods.
Abstract: The growing demand for smart home interfaces has increased interest in non-intrusive sensing methods like vibration-based gesture recognition. While prior studies demonstrated feasibility, they often rely on complex preprocessing and large Neural Networks (NNs) requiring costly high-performance hardware, resulting in high energy usage and limited real-world deployability. This study proposes an energy-efficient solution deploying compact NNs on low-power Field-Programmable Gate Arrays (FPGAs) to enable real-time gesture recognition with competitive accuracy. We adopt a series of optimizations: (1) We replace complex spectral preprocessing with raw waveform input, eliminating complex on-board preprocessing while reducing input size by 21x without sacrificing accuracy. (2) We design two lightweight architectures (1D-CNN and 1D-SepCNN) tailored for embedded FPGAs, reducing parameters from 369 million to as few as 216 while maintaining comparable accuracy. (3) With integer-only quantization and automated RTL generation, we achieve seamless FPGA deployment. A ping-pong buffering mechanism in 1D-SepCNN further improves deployability under tight memory constraints. (4) We extend a hardware-aware search framework to support constraint-driven model configuration selection, considering accuracy, deployability, latency, and energy consumption. Evaluated on two swipe-direction datasets with multiple users and ordinary tables, our approach achieves low-latency, energy-efficient inference on the AMD Spartan-7 XC7S25 FPGA. Under the PS data splitting setting, the selected 6-bit 1D-CNN reaches 0.970 average accuracy across users with 9.22 ms latency. The chosen 8-bit 1D-SepCNN further reduces latency to 6.83 ms (over 53x CPU speedup) with slightly lower accuracy (0.949). Both consume under 1.2 mJ per inference, demonstrating suitability for long-term edge operation.
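The parameter savings of the separable variant come from splitting a dense convolution into depthwise and pointwise stages. A minimal PyTorch sketch of such a block follows; channel counts and kernel size are illustrative, not the paper's configuration.

```python
# Sketch of a depthwise-separable 1D convolution block of the kind a
# 1D-SepCNN is built from, fed raw waveform segments per the abstract.
import torch
import torch.nn as nn

class SepConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=9):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # Pointwise: 1x1 conv mixes channels. The depthwise/pointwise split
        # needs in*k + in*out weights versus in*out*k for a dense Conv1d.
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):            # x: (batch, in_ch, time)
        return self.act(self.pointwise(self.depthwise(x)))

block = SepConv1d(in_ch=1, out_ch=8)
y = block(torch.randn(4, 1, 1024))  # raw vibration waveform segments
```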
[870] SwiftTS: A Swift Selection Framework for Time Series Pre-trained Models via Multi-task Meta-Learning
Tengxue Zhang, Biao Ouyang, Yang Shu, Xinyang Chen, Chenjuan Guo, Bin Yang
Main category: cs.LG
TL;DR: SwiftTS is a framework for efficiently selecting the most suitable pre-trained time series model from multiple candidates without expensive fine-tuning, using a learning-guided approach with dual-encoder architecture and horizon-adaptive expert composition.
Details
Motivation: With numerous pre-trained models available, individually fine-tuning each one to find the best fit for downstream time series tasks is time-consuming and computationally expensive.
Method: Uses a lightweight dual-encoder architecture to embed time series and models, computes patchwise compatibility scores, employs horizon-adaptive expert composition for dynamic weight adjustment, and uses transferable cross-task learning with cross-dataset/horizon sampling for OOD robustness.
Result: Extensive experiments on 14 downstream datasets and 8 pre-trained models show that SwiftTS achieves state-of-the-art performance in time series pre-trained model selection.
Conclusion: SwiftTS provides an efficient and effective framework for selecting optimal pre-trained time series models without the need for expensive individual fine-tuning, demonstrating superior performance across diverse datasets and horizons.
Abstract: Pre-trained models exhibit strong generalization to various downstream tasks. However, given the numerous models available in the model hub, identifying the most suitable one by individually fine-tuning is time-consuming. In this paper, we propose \textbf{SwiftTS}, a swift selection framework for time series pre-trained models. To avoid expensive forward propagation through all candidates, SwiftTS adopts a learning-guided approach that leverages historical dataset-model performance pairs across diverse horizons to predict model performance on unseen datasets. It employs a lightweight dual-encoder architecture that embeds time series and candidate models with rich characteristics, computing patchwise compatibility scores between data and model embeddings for efficient selection. To further enhance the generalization across datasets and horizons, we introduce a horizon-adaptive expert composition module that dynamically adjusts expert weights, and the transferable cross-task learning with cross-dataset and cross-horizon task sampling to enhance out-of-distribution (OOD) robustness. Extensive experiments on 14 downstream datasets and 8 pre-trained models demonstrate that SwiftTS achieves state-of-the-art performance in time series pre-trained model selection.
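A rough sketch of patchwise compatibility scoring between data and model embeddings; the dual encoders and the dot-product score below are assumptions standing in for SwiftTS's actual components.

```python
# Toy selection loop: embed the query series as patches, embed each candidate
# pre-trained model, score compatibility, pick the best without fine-tuning.
import torch

def patchwise_compatibility(series_patches, model_embedding):
    """series_patches: (num_patches, d); model_embedding: (d,).
    Returns a scalar score averaged over patches (assumed scoring rule)."""
    return (series_patches @ model_embedding).mean()

patches = torch.randn(32, 128)      # embedded patches of the unseen dataset
candidates = torch.randn(8, 128)    # embeddings of 8 pre-trained models
scores = torch.stack([patchwise_compatibility(patches, m) for m in candidates])
best = scores.argmax()              # model selected without any fine-tuning
```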
[871] Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks
Amal Abed, Ivan Lukic, Jörg K. H. Franke, Frank Hutter
Main category: cs.LG
TL;DR: A scalable synthetic data generation pipeline creates 800k instruction-reasoning-code-test quadruplets to enhance LLM code generation by teaching reasoning processes alongside solutions.
Details
Motivation: Existing code generation datasets lack intermediate reasoning traces that guide human problem-solving, limiting LLM progress despite their potential.
Method: Pipeline combines curated problems, web-mined content filtered by relevance, reasoning-pattern data expansion, multi-stage execution validation, and genetic mutation for diversity.
Result: Fine-tuning LLMs on this reasoning-aware dataset consistently improves coding benchmarks, substitutes for model scaling, generalizes across architectures, and outperforms alternatives under identical sample budgets.
Conclusion: Reasoning-centered synthetic data generation is an efficient approach for advancing LLM coding capabilities, establishing a new paradigm beyond solution-only training.
Abstract: Large language models (LLMs) have shown impressive promise in code generation, yet their progress remains limited by the shortage of large-scale datasets that are both diverse and well-aligned with human reasoning. Most existing resources pair problems with solutions, but omit the intermediate thought process that guides coding. To close this gap, we present a scalable synthetic data generation pipeline that produces nearly 800k instruction-reasoning-code-test quadruplets. Each sample combines a task, a step-by-step reasoning trace, a working solution, and executable tests, enabling models to learn not just the what but also the how of problem solving. Our pipeline combines four key components: curated contest problems, web-mined content filtered by relevance classifiers, data expansion guided by reasoning patterns, and multi-stage execution-based validation. A genetic mutation algorithm further increases task diversity while maintaining consistency between reasoning traces and code implementations. Our key finding is that fine-tuning LLMs on this dataset yields consistent improvements on coding benchmarks. Beyond raw accuracy, reasoning-aware data can substitute for model scaling, generalize across architectures, and outperform leading open-source alternatives under identical sample budgets. Our work establishes reasoning-centered synthetic data generation as an efficient approach for advancing coding capabilities in LLMs. We publish our dataset and generation pipeline to facilitate further research.
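The execution-based validation stage can be pictured with a toy filter: keep a sample only if its solution runs and its tests pass. Sandboxing and the pipeline's staged checks are omitted; names are hypothetical.

```python
# Toy sketch of execution-based validation for instruction-reasoning-code-test
# samples. A real pipeline would isolate execution in a sandboxed subprocess.
def passes_tests(solution_src: str, test_src: str) -> bool:
    ns: dict = {}
    try:
        exec(solution_src, ns)   # stage 1: the solution must execute
        exec(test_src, ns)       # stage 2: every generated assert must hold
        return True
    except Exception:
        return False

sample = {
    "solution": "def add(a, b):\n    return a + b",
    "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0",
}
print(passes_tests(sample["solution"], sample["tests"]))  # True -> keep sample
```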
[872] AirFed: Federated Graph-Enhanced Multi-Agent Reinforcement Learning for Multi-UAV Cooperative Mobile Edge Computing
Zhiyu Wang, Suman Raj, Rajkumar Buyya
Main category: cs.LG
TL;DR: AirFed: A federated graph-enhanced multi-agent reinforcement learning framework for UAV cooperative MEC systems that optimizes trajectory planning, task offloading, and resource allocation while ensuring QoS in dynamic environments.
Details
Motivation: Address challenges in coordinating multiple UAVs for MEC systems, including limited scalability, slow convergence, and inefficient knowledge sharing, especially with large-scale IoT deployments and deadline constraints.
Method: Uses dual-layer dynamic GATs to model spatial-temporal dependencies, dual-Actor single-Critic architecture for joint optimization of continuous trajectory and discrete offloading, and reputation-based decentralized federated learning with gradient-sensitive adaptive quantization.
Result: Achieves 42.9% reduction in weighted cost, over 99% deadline satisfaction, 94.2% IoT device coverage, and 54.5% communication overhead reduction compared to state-of-the-art baselines.
Conclusion: AirFed demonstrates robust scalability across varying UAV numbers and IoT densities, validating its practical applicability for large-scale UAV-MEC deployments.
Abstract: Multiple Unmanned Aerial Vehicles (UAVs) cooperative Mobile Edge Computing (MEC) systems face critical challenges in coordinating trajectory planning, task offloading, and resource allocation while ensuring Quality of Service (QoS) under dynamic and uncertain environments. Existing approaches suffer from limited scalability, slow convergence, and inefficient knowledge sharing among UAVs, particularly when handling large-scale IoT device deployments with stringent deadline constraints. This paper proposes AirFed, a novel federated graph-enhanced multi-agent reinforcement learning framework that addresses these challenges through three key innovations. First, we design dual-layer dynamic Graph Attention Networks (GATs) that explicitly model spatial-temporal dependencies among UAVs and IoT devices, capturing both service relationships and collaborative interactions within the network topology. Second, we develop a dual-Actor single-Critic architecture that jointly optimizes continuous trajectory control and discrete task offloading decisions. Third, we propose a reputation-based decentralized federated learning mechanism with gradient-sensitive adaptive quantization, enabling efficient and robust knowledge sharing across heterogeneous UAVs. Extensive experiments demonstrate that AirFed achieves 42.9% reduction in weighted cost compared to state-of-the-art baselines, attains over 99% deadline satisfaction and 94.2% IoT device coverage rate, and reduces communication overhead by 54.5%. Scalability analysis confirms robust performance across varying UAV numbers, IoT device densities, and system scales, validating AirFed’s practical applicability for large-scale UAV-MEC deployments.
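One way to picture gradient-sensitive adaptive quantization: assign more bits to gradient tensors that vary more before they are shared. The allocation rule and bit range below are assumptions, not AirFed's scheme.

```python
# Sketch: uniform fake-quantization of federated gradient updates, with the
# bit width per tensor scaled by a simple variance-based sensitivity score.
import numpy as np

def quantize(g, num_bits):
    scale = np.abs(g).max() / (2 ** (num_bits - 1) - 1)
    return g if scale == 0 else np.round(g / scale) * scale

def adaptive_quantize(gradients, low=4, high=12):
    sens = np.array([g.std() for g in gradients])
    sens = (sens - sens.min()) / (np.ptp(sens) + 1e-12)   # normalize to [0, 1]
    bits = np.round(low + sens * (high - low)).astype(int)
    return [quantize(g, b) for g, b in zip(gradients, bits)], bits

rng = np.random.default_rng(0)
grads = [rng.standard_normal(100) * s for s in (0.01, 0.1, 1.0)]
quantized, bits = adaptive_quantize(grads)    # noisier tensors get more bits
```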
[873] Accelerating Eigenvalue Dataset Generation via Chebyshev Subspace Filter
Hong Wang, Jie Wang, Jian Luo, Huanshuo Dong, Yeqiu Chen, Runmin Jiang, Zhen Huang
Main category: cs.LG
TL;DR: SCSF is a novel method that accelerates eigenvalue data generation for neural eigenvalue methods by grouping similar operators and using Chebyshev subspace filters to reduce redundant computations.
Details
Motivation: Traditional neural eigenvalue methods require large amounts of labeled training data (operators and eigenvalues), which is computationally expensive to generate. Existing methods overlook similarities between operators that could accelerate data generation.
Method: Uses truncated fast Fourier transform sorting to group operators with similar eigenvalue distributions, then constructs Chebyshev subspace filters that leverage eigenpairs from previously solved problems to assist in solving new ones.
Result: SCSF achieves up to 3.5× speedup compared to various numerical solvers for eigenvalue data generation.
Conclusion: SCSF is the first method to accelerate eigenvalue data generation and demonstrates significant computational efficiency improvements by exploiting operator similarities.
Abstract: Eigenvalue problems are among the most important topics in many scientific disciplines. With the recent surge and development of machine learning, neural eigenvalue methods have attracted significant attention as a forward pass of inference requires only a tiny fraction of the computation time compared to traditional solvers. However, a key limitation is the requirement for large amounts of labeled data in training, including operators and their eigenvalues. To tackle this limitation, we propose a novel method, named Sorting Chebyshev Subspace Filter (SCSF), which significantly accelerates eigenvalue data generation by leveraging similarities between operators – a factor overlooked by existing methods. Specifically, SCSF employs truncated fast Fourier transform sorting to group operators with similar eigenvalue distributions and constructs a Chebyshev subspace filter that leverages eigenpairs from previously solved problems to assist in solving subsequent ones, reducing redundant computations. To the best of our knowledge, SCSF is the first method to accelerate eigenvalue data generation. Experimental results show that SCSF achieves up to a $3.5\times$ speedup compared to various numerical solvers.
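The grouping step can be sketched as fingerprinting each operator with a truncated FFT and ordering operators so that neighbors have similar spectra; the fingerprint length and the ordering key are assumptions.

```python
# Sketch of truncated-FFT sorting: consecutive operators in the resulting
# order are spectrally similar, so eigenpairs from one problem can warm-start
# a Chebyshev-filtered solve of the next.
import numpy as np

def fft_fingerprint(op_matrix, k=16):
    v = op_matrix.ravel()
    return np.abs(np.fft.rfft(v))[:k]        # truncated magnitude spectrum

rng = np.random.default_rng(0)
ops = [rng.standard_normal((32, 32)) for _ in range(100)]
ops = [(A + A.T) / 2 for A in ops]            # symmetric toy operators

fps = np.array([fft_fingerprint(A) for A in ops])
# Crude ordering key (an assumption): sort by the leading FFT coefficient so
# adjacent operators are similar and can share filtered subspaces.
order = np.argsort(fps[:, 0])
```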
[874] Sampling from Energy distributions with Target Concrete Score Identity
Sergei Kholkin, Francisco Vargas, Alexander Korotin
Main category: cs.LG
TL;DR: TCSIS is a method for sampling from unnormalized discrete densities using reverse dynamics of Continuous-Time Markov Chains, enabling Monte Carlo estimation of concrete scores without target distribution samples or partition function computation.
Details
Motivation: To develop an efficient sampling method for unnormalized densities on discrete state spaces that avoids the need for target distribution samples and partition function computation.
Method: Uses a forward CTMC with uniform noising kernel and Target Concrete Score Identity to relate concrete scores to expectations under forward diffusion. Approximates concrete score with neural networks and proposes Self-Normalized and Unbiased TCSIS algorithms.
Result: Demonstrated effectiveness on statistical physics problems, showing the method can successfully sample from unnormalized discrete densities.
Conclusion: TCSIS provides a practical approach for discrete sampling that bypasses traditional requirements for target samples and partition function calculations, with promising applications in statistical physics.
Abstract: We introduce the Target Concrete Score Identity Sampler (TCSIS), a method for sampling from unnormalized densities on discrete state spaces by learning the reverse dynamics of a Continuous-Time Markov Chain (CTMC). Our approach builds on a forward in time CTMC with a uniform noising kernel and relies on the proposed Target Concrete Score Identity, which relates the concrete score, the ratio of marginal probabilities of two states, to a ratio of expectations of Boltzmann factors under the forward uniform diffusion kernel. This formulation enables Monte Carlo estimation of the concrete score without requiring samples from the target distribution or computation of the partition function. We approximate the concrete score with a neural network and propose two algorithms: Self-Normalized TCSIS and Unbiased TCSIS. Finally, we demonstrate the effectiveness of TCSIS on problems from statistical physics.
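For a uniform (doubly stochastic) noising kernel, the identity admits a direct Monte Carlo illustration: the concrete score p_t(y)/p_t(x) equals a ratio of expected Boltzmann factors over clean states drawn proportionally to the kernel. The toy energy and the kernel parameter below are illustrative choices, not the paper's.

```python
# Monte Carlo sketch of a concrete-score ratio estimate. Because the uniform
# kernel is symmetric and doubly stochastic, sampling x0 proportional to
# q(x_t | x0) and averaging exp(-E(x0)) estimates p_t(x_t) up to a constant
# shared by all states, so the ratio needs no partition function.
import numpy as np

rng = np.random.default_rng(0)
K = 10                                    # discrete state space {0,...,K-1}
energy = lambda x0: 0.5 * (x0 - 3.0) ** 2 # toy Boltzmann energy

def kernel_sample(x_t, keep_prob, n):
    """With prob keep_prob x0 == x_t, else x0 uniform (modeling assumption)."""
    stay = rng.random(n) < keep_prob
    return np.where(stay, x_t, rng.integers(0, K, size=n))

def concrete_score(y, x, keep_prob=0.6, n=50_000):
    num = np.mean(np.exp(-energy(kernel_sample(y, keep_prob, n))))
    den = np.mean(np.exp(-energy(kernel_sample(x, keep_prob, n))))
    return num / den                      # estimates p_t(y) / p_t(x)

print(concrete_score(3, 9))               # state near the energy minimum scores higher
```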
[875] Neural Emulator Superiority: When Machine Learning for PDEs Surpasses its Training Data
Felix Koehler, Nils Thuerey
Main category: cs.LG
TL;DR: Neural operators trained on low-fidelity solver data can outperform their training data by achieving higher accuracy than the solvers themselves when evaluated against high-fidelity references.
Details
Motivation: To challenge the conventional assumption that neural emulators for PDEs are limited by their training data's fidelity, and to investigate whether they can surpass their training source's accuracy.
Method: Theoretical analysis of emulator inductive biases, training objectives, and numerical error characteristics, combined with empirical validation across different PDEs using standard neural architectures.
Result: Demonstrated “emulator superiority” where neural networks trained on low-fidelity data achieve higher accuracy than the solvers they were trained on, learning more regularized dynamics and exhibiting better error accumulation properties.
Conclusion: Neural emulators can potentially surpass training data limitations and mitigate numerical artifacts, suggesting they might achieve greater physical fidelity than their training source within specific operational regimes, prompting re-evaluation of emulator benchmarking.
Abstract: Neural operators or emulators for PDEs trained on data from numerical solvers are conventionally assumed to be limited by their training data’s fidelity. We challenge this assumption by identifying “emulator superiority,” where neural networks trained purely on low-fidelity solver data can achieve higher accuracy than those solvers when evaluated against a higher-fidelity reference. Our theoretical analysis reveals how the interplay between emulator inductive biases, training objectives, and numerical error characteristics enables superior performance during multi-step rollouts. We empirically validate this finding across different PDEs using standard neural architectures, demonstrating that emulators can implicitly learn dynamics that are more regularized or exhibit more favorable error accumulation properties than their training data, potentially surpassing training data limitations and mitigating numerical artifacts. This work prompts a re-evaluation of emulator benchmarking, suggesting neural emulators might achieve greater physical fidelity than their training source within specific operational regimes. Project Page: https://tum-pbs.github.io/emulator-superiority
[876] Seeing Structural Failure Before it Happens: An Image-Based Physics-Informed Neural Network (PINN) for Spaghetti Bridge Load Prediction
Omer Jauhar Khan, Sudais Khan, Hafeez Anwar
Main category: cs.LG
TL;DR: PINNs are used to predict spaghetti bridge weights with physics-based constraints, achieving high accuracy (R²=0.9603) even with limited data.
Details
Motivation: To explore PINNs for predicting structural weight in simplified bridge models, addressing limited data scenarios in structural engineering.
Method: Proposed Physics Informed Kolmogorov Arnold Network (PIKAN) that combines universal function approximation with physical constraints, using structural parameters collected manually or via computer vision.
Result: Best model achieved R² score of 0.9603 and MAE of 10.50 units on dataset of 15 real bridges augmented to 100 samples.
Conclusion: PINNs provide reliable structural weight estimates with limited data and can inform early-stage failure analysis in lightweight bridge designs.
Abstract: Physics Informed Neural Networks (PINNs) are gaining attention for their ability to embed physical laws into deep learning models, which is particularly useful in structural engineering tasks with limited data. This paper explores the use of PINNs to predict the weight of small-scale spaghetti bridges, a task relevant to understanding load limits and potential failure modes in simplified structural models. Our proposed framework incorporates physics-based constraints into the prediction model for improved performance. In addition to standard PINNs, we introduce a novel architecture named Physics Informed Kolmogorov Arnold Network (PIKAN), which blends universal function approximation theory with physical insights. The structural parameters provided as input to the model are collected either manually or through computer vision methods. Our dataset includes 15 real bridges, augmented to 100 samples, and our best model achieves an $R^2$ score of 0.9603 and a mean absolute error (MAE) of 10.50 units. From an applied perspective, we also provide a web-based interface for parameter entry and prediction. These results show that PINNs can offer reliable estimates of structural weight, even with limited data, and may help inform early-stage failure analysis in lightweight bridge designs. The complete data and code are available at https://github.com/OmerJauhar/PINNS-For-Spaghetti-Bridges.
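Generically, a physics-informed loss adds a constraint-violation penalty to the data term. The sketch below uses a placeholder non-negativity residual; the paper's actual structural constraints are not reproduced here.

```python
# Sketch of a physics-informed regression loss: data fit plus a penalty for
# violating a physical constraint. The residual g(y) = max(0, -y) (predicted
# weight must be non-negative) is an illustrative assumption.
import torch

def pinn_loss(model, x, y_true, physics_weight=0.1):
    y_pred = model(x)
    data_loss = torch.mean((y_pred - y_true) ** 2)
    physics_residual = torch.relu(-y_pred)        # placeholder constraint
    return data_loss + physics_weight * torch.mean(physics_residual ** 2)

model = torch.nn.Sequential(torch.nn.Linear(6, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, 1))
x = torch.randn(100, 6)        # structural parameters (counts, lengths, angles)
y = torch.rand(100, 1) * 50    # bridge weight labels (illustrative units)
pinn_loss(model, x, y).backward()
```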
[877] PAHQ: Accelerating Automated Circuit Discovery through Mixed-Precision Inference Optimization
Xinhai Wang, Shu Yang, Liangyu Wang, Lin Zhang, Huanyi Xie, Lijie Hu, Di Wang
Main category: cs.LG
TL;DR: PAHQ accelerates automated circuit discovery in language models by applying mixed-precision quantization to attention heads, reducing runtime by 80% and memory by 30% while maintaining faithfulness.
Details
Motivation: Current automated circuit discovery methods like ACDC are computationally inefficient and memory-intensive for large language models, with existing acceleration approaches compromising analytical faithfulness through linear approximations.
Method: Per Attention Head Quantization (PAHQ) optimizes individual patching operations by leveraging the alignment between activation patching and mixed-precision quantization, maintaining high precision only for investigated components while reducing precision elsewhere.
Result: PAHQ-accelerated ACDC reduces runtime by up to 80% and memory consumption by up to 30% compared to unaccelerated ACDC while maintaining faithfulness.
Conclusion: PAHQ provides a practical, training-free pathway for accelerating mechanistic interpretability methods that readily integrates with existing circuit discovery techniques.
Abstract: Circuit discovery, which involves identifying sparse and task-relevant subnetworks in pre-trained language models, is a cornerstone of mechanistic interpretability. Automated Circuit Discovery (ACDC) has emerged as a pivotal methodology in circuit discovery, but its application to large language models is severely limited by computational inefficiency and prohibitively high memory requirements. Although several accelerated approaches have been proposed, they primarily rely on linear approximations to ACDC, which significantly compromises analytical faithfulness. Our proposed method for accelerating automated circuit discovery, Per Attention Head Quantization (PAHQ), takes a fundamentally different approach by optimizing the efficiency of each individual patching operation. PAHQ leverages a fundamental alignment between activation patching and mixed-precision quantization (MPQ): interpretability analysis through patching essentially performs targeted ablation studies. Therefore, we can maintain high precision exclusively for investigated components while safely reducing precision elsewhere in the network. PAHQ-accelerated ACDC reduces runtime by up to 80% and memory consumption by up to 30% compared to unaccelerated ACDC while maintaining faithfulness. Importantly, our method readily integrates with existing edge-based circuit discovery techniques by modifying the attention computation mechanism. This training-free approach provides a practical and novel pathway for accelerating mechanistic interpretability methods. Our code is available at https://github.com/626619403/PAHQ.
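The core mechanism can be simulated in a few lines: fake-quantize every attention head's output except the one currently being patched. The uniform quantizer below is a stand-in for whatever MPQ scheme PAHQ actually uses.

```python
# Simulated sketch of per-head mixed precision: full precision only for the
# head under investigation, reduced precision everywhere else.
import torch

def fake_quantize(x, num_bits=8):
    """Uniform symmetric fake quantization (quantize-dequantize)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

def per_head_precision(head_outputs, investigated_head):
    """head_outputs: (num_heads, seq, d). Keep precision where probed."""
    return torch.stack([o if h == investigated_head else fake_quantize(o)
                        for h, o in enumerate(head_outputs)])

heads = torch.randn(12, 128, 64)
mixed = per_head_precision(heads, investigated_head=3)
```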
[878] A method for outlier detection based on cluster analysis and visual expert criteria
Juan A. Lara, David Lizcano, Víctor Rampérez, Javier Soriano
Main category: cs.LG
TL;DR: Proposes a clustering-based outlier detection method using four criteria that mimic how human experts visually identify outliers, evaluated on medical time series data with high reliability.
Details
Motivation: To overcome limitations of existing outlier detection techniques that fail to account for inherent dispersion of domain objects and lack human-like visual analysis capabilities.
Method: Uses clustering process with four criteria designed to represent how human experts visually identify outliers after analyzing clusters, rather than purely numerical analysis.
Result: Achieved false positive rate less than 2% and reliability greater than 99% when tested on stabilometry and EEG time series data, with satisfactory runtime efficiency.
Conclusion: The proposed method is effective for detecting outlier data across different domains, particularly medical time series, with high reliability and low false positive rates.
Abstract: Outlier detection is an important problem occurring in a wide range of areas. Outliers are the outcome of fraudulent behaviour, mechanical faults, human error, or simply natural deviations. Many data mining applications perform outlier detection, often as a preliminary step in order to filter out outliers and build more representative models. In this paper, we propose an outlier detection method based on a clustering process. The aim behind the proposal outlined in this paper is to overcome the specificity of many existing outlier detection techniques that fail to take into account the inherent dispersion of domain objects. The outlier detection method is based on four criteria designed to represent how human beings (experts in each domain) visually identify outliers within a set of objects after analysing the clusters. This has an advantage over other clustering-based outlier detection techniques that are founded on a purely numerical analysis of clusters. Our proposal has been evaluated, with satisfactory results, on data (particularly time series) from two different domains: stabilometry, a branch of medicine studying balance-related functions in human beings, and electroencephalography (EEG), a neurological exploration used to diagnose nervous system disorders. To validate the proposed method, we studied its outlier detection performance and its efficiency in terms of runtime. The results of regression analyses confirm that our proposal is useful for detecting outlier data in different domains, with a false positive rate of less than 2% and a reliability greater than 99%.
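One visually-motivated criterion of this kind might resemble the following sketch: after clustering, flag points unusually far from their centroid relative to their own cluster's dispersion. The 3-sigma cutoff is an assumption; the paper's four expert-inspired criteria are not reproduced here.

```python
# Sketch of a cluster-relative distance criterion for outlier flagging.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2)),
               [[4.0, 20.0]]])                    # one planted outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

outliers = np.zeros(len(X), dtype=bool)
for c in range(2):
    mask = km.labels_ == c
    cutoff = dist[mask].mean() + 3 * dist[mask].std()  # per-cluster dispersion
    outliers |= mask & (dist > cutoff)
print(np.where(outliers)[0])                      # should include index 400
```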
[879] A Novel Framework for Multi-Modal Protein Representation Learning
Runjie Zheng, Zhen Wang, Anjie Qiao, Jiancong Xie, Jiahua Rao, Yuedong Yang
Main category: cs.LG
TL;DR: DAMPE is a unified framework for protein function prediction that addresses cross-modal heterogeneity and noisy relational graphs through optimal transport-based alignment and conditional graph generation-based fusion, achieving state-of-the-art performance on GO benchmarks.
Details
Motivation: Protein function prediction requires integrating heterogeneous intrinsic signals (sequence, structure) with noisy extrinsic contexts (protein-protein interactions, GO annotations), but faces challenges with cross-modal distributional mismatch and degraded GNN-based information aggregation from noisy graphs.
Method: DAMPE uses two core mechanisms: (1) Optimal Transport-based representation alignment to establish correspondence between intrinsic embedding spaces of different modalities, and (2) Conditional Graph Generation-based information fusion where a condition encoder fuses aligned embeddings to provide cues for graph reconstruction.
Result: DAMPE outperforms or matches state-of-the-art methods like DPFunc on standard GO benchmarks, achieving AUPR gains of 0.002-0.013 pp and Fmax gains of 0.004-0.007 pp. Ablation studies show OT-based alignment contributes 0.043-0.064 pp AUPR, while CGG-based fusion adds 0.005-0.111 pp Fmax.
Conclusion: DAMPE offers a scalable and theoretically grounded approach for robust multi-modal protein representation learning, substantially enhancing protein function prediction by effectively addressing cross-modal heterogeneity and noisy relational graphs.
Abstract: Accurate protein function prediction requires integrating heterogeneous intrinsic signals (e.g., sequence and structure) with noisy extrinsic contexts (e.g., protein-protein interactions and GO term annotations). However, two key challenges hinder effective fusion: (i) cross-modal distributional mismatch among embeddings produced by pre-trained intrinsic encoders, and (ii) noisy relational graphs of extrinsic data that degrade GNN-based information aggregation. We propose Diffused and Aligned Multi-modal Protein Embedding (DAMPE), a unified framework that addresses these through two core mechanisms. First, we propose Optimal Transport (OT)-based representation alignment that establishes correspondence between intrinsic embedding spaces of different modalities, effectively mitigating cross-modal heterogeneity. Second, we develop a Conditional Graph Generation (CGG)-based information fusion method, where a condition encoder fuses the aligned intrinsic embeddings to provide informative cues for graph reconstruction. Meanwhile, our theoretical analysis implies that the CGG objective drives this condition encoder to absorb graph-aware knowledge into its produced protein representations. Empirically, DAMPE outperforms or matches state-of-the-art methods such as DPFunc on standard GO benchmarks, achieving AUPR gains of 0.002-0.013 pp and Fmax gains of 0.004-0.007 pp. Ablation studies further show that OT-based alignment contributes 0.043-0.064 pp AUPR, while CGG-based fusion adds 0.005-0.111 pp Fmax. Overall, DAMPE offers a scalable and theoretically grounded approach for robust multi-modal protein representation learning, substantially enhancing protein function prediction.
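The OT-alignment idea can be pictured with a tiny entropic-OT sketch: compute a transport plan between two modalities' embedding clouds and map one toward the other by barycentric projection. Sinkhorn iterations and this projection are standard OT tools, assumed rather than taken from DAMPE.

```python
# Minimal Sinkhorn sketch for cross-modal embedding alignment.
import numpy as np

def sinkhorn(C, reg=0.1, iters=200):
    K = np.exp(-C / reg)
    a = np.full(C.shape[0], 1 / C.shape[0])   # uniform marginals
    b = np.full(C.shape[1], 1 / C.shape[1])
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]        # transport plan

rng = np.random.default_rng(0)
seq_emb = rng.standard_normal((64, 16))           # sequence-encoder embeddings
struct_emb = rng.standard_normal((64, 16)) + 2.0  # structure encoder, shifted
C = ((seq_emb[:, None, :] - struct_emb[None, :, :]) ** 2).sum(-1)
P = sinkhorn(C)
aligned = (P @ struct_emb) / P.sum(1, keepdims=True)  # barycentric projection
```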
[880] The Benchmarking Epistemology: Construct Validity for Evaluating Machine Learning Models
Timo Freiesleben, Sebastian Zezulka
Main category: cs.LG
TL;DR: The paper develops a framework for assessing when benchmark scores can validly support scientific claims, using construct validity conditions from psychological measurement theory.
Details
Motivation: Benchmark scores alone are insufficient for drawing scientific inferences about theoretical tasks; additional assumptions about learning problems, evaluation functions, and data distributions are needed.
Method: Develop conditions of construct validity inspired by psychological measurement theory, and examine these assumptions through three case studies: ImageNet for computer vision progress, WeatherBench for policy predictions, and Fragile Families Challenge for life event predictability.
Result: The framework clarifies the conditions under which benchmark scores can support diverse scientific claims, making explicit the assumptions required for valid inference.
Conclusion: Predictive benchmarking should be viewed as an epistemological practice requiring conceptual and theoretical reasoning, not just as measurement of relative performance.
Abstract: Predictive benchmarking, the evaluation of machine learning models based on predictive performance and competitive ranking, is a central epistemic practice in machine learning research and an increasingly prominent method for scientific inquiry. Yet, benchmark scores alone provide at best measurements of model performance relative to an evaluation dataset and a concrete learning problem. Drawing substantial scientific inferences from the results, say about theoretical tasks like image classification, requires additional assumptions about the theoretical structure of the learning problems, evaluation functions, and data distributions. We make these assumptions explicit by developing conditions of construct validity inspired by psychological measurement theory. We examine these assumptions in practice through three case studies, each exemplifying a typical intended inference: measuring engineering progress in computer vision with ImageNet; evaluating policy-relevant weather predictions with WeatherBench; and examining limitations of the predictability of life events with the Fragile Families Challenge. Our framework clarifies the conditions under which benchmark scores can support diverse scientific claims, bringing predictive benchmarking into perspective as an epistemological practice and a key site of conceptual and theoretical reasoning in machine learning.
[881] Grassmanian Interpolation of Low-Pass Graph Filters: Theory and Applications
Anton Savostianov, Michael T. Schaub, Benjamin Stamm
Main category: cs.LG
TL;DR: The paper proposes a Riemannian interpolation method on the Grassmann manifold to efficiently compute low-pass graph filters for parametric graph families, avoiding expensive eigenvalue computations.
Details
Motivation: Computing low-pass graph filters for parametric graph families is computationally expensive due to repeated eigenvalue problem solutions needed for low-frequency subspaces.
Method: Novel algorithm using Riemannian interpolation in normal coordinates on the Grassmann manifold for low-pass graph filter interpolation.
Result: Derived an error bound estimate for subspace interpolation and demonstrated applications for evolving graph topologies and dot product graph families.
Conclusion: The proposed interpolation method enables efficient computation of low-pass graph filters and facilitates improved message passing schemes for node classification.
Abstract: Low-pass graph filters are fundamental for signal processing on graphs and other non-Euclidean domains. However, the computation of such filters for parametric graph families can be prohibitively expensive, as computation of the corresponding low-frequency subspaces requires the repeated solution of an eigenvalue problem. We suggest a novel algorithm for low-pass graph filter interpolation based on Riemannian interpolation in normal coordinates on the Grassmann manifold. We derive an error bound estimate for the subspace interpolation and suggest two possible applications for induced parametric graph families. First, we argue that the temporal evolution of the node features may be translated to the evolving graph topology via a similarity correction to adjust the homophily degree of the network. Second, we suggest a dot product graph family induced by a given static graph, which allows one to infer an improved message-passing scheme for node classification, facilitated by the filter interpolation.
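The geometric primitive, interpolation along a Grassmann geodesic in normal coordinates, follows the standard Edelman-Arias-Smith exp/log maps. A NumPy sketch (not the paper's code; assumes the two subspaces are not orthogonal, so U0ᵀU1 is invertible):

```python
# Interpolate between two low-frequency subspaces along a Grassmann geodesic.
import numpy as np

def grassmann_interpolate(U0, U1, t):
    """Orthonormal n x k bases U0, U1; returns a basis of the subspace at
    fraction t along the geodesic from span(U0) to span(U1)."""
    M = U0.T @ U1
    A = (U1 - U0 @ M) @ np.linalg.inv(M)      # log map: horizontal lift at U0
    W, s, Vt = np.linalg.svd(A, full_matrices=False)
    theta = np.arctan(s)                      # principal angles
    # Exp map at U0 along the lift, scaled by t.
    return (U0 @ Vt.T @ np.diag(np.cos(t * theta)) @ Vt
            + W @ np.diag(np.sin(t * theta)) @ Vt)

def random_spd(n, rng):
    B = rng.standard_normal((n, n))
    return B @ B.T                            # symmetric PSD Laplacian stand-in

rng = np.random.default_rng(0)
U0 = np.linalg.eigh(random_spd(50, rng))[1][:, :5]   # 5 lowest eigenvectors
U1 = np.linalg.eigh(random_spd(50, rng))[1][:, :5]
U_half = grassmann_interpolate(U0, U1, 0.5)          # filter subspace at t=0.5
```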
[882] Robust Iterative Learning Hidden Quantum Markov Models
Ning Ning
Main category: cs.LG
TL;DR: The paper introduces AC-HQMM (Adversarially Corrupted Hidden Quantum Markov Models) and RILA (Robust Iterative Learning Algorithm) to address robustness issues in quantum sequential learning against adversarial data corruption.
Details
Motivation: Existing HQMM learning algorithms are highly sensitive to data corruption and lack mechanisms to ensure robustness under adversarial perturbations, limiting their practical applicability.
Method: Proposes RILA algorithm with Remove Corrupted Rows by Entropy Filtering (RCR-EF) module, iterative stochastic resampling for physically valid Kraus operator updates, and L1-penalized likelihood objectives for stability.
Result: RILA demonstrates superior convergence stability, corruption resilience, and preservation of physical validity across multiple HQMM and HMM benchmarks compared to existing algorithms.
Conclusion: RILA establishes a principled and efficient approach for robust quantum sequential learning, addressing the critical vulnerability of HQMMs to adversarial data corruption.
Abstract: Hidden Quantum Markov Models (HQMMs) extend classical Hidden Markov Models to the quantum domain, offering a powerful probabilistic framework for modeling sequential data with quantum coherence. However, existing HQMM learning algorithms are highly sensitive to data corruption and lack mechanisms to ensure robustness under adversarial perturbations. In this work, we introduce the Adversarially Corrupted HQMM (AC-HQMM), which formalizes robustness analysis by allowing a controlled fraction of observation sequences to be adversarially corrupted. To learn AC-HQMMs, we propose the Robust Iterative Learning Algorithm (RILA), a derivative-free method that integrates a Remove Corrupted Rows by Entropy Filtering (RCR-EF) module with an iterative stochastic resampling procedure for physically valid Kraus operator updates. RILA incorporates L1-penalized likelihood objectives to enhance stability, resist overfitting, and remain effective under non-differentiable conditions. Across multiple HQMM and HMM benchmarks, RILA demonstrates superior convergence stability, corruption resilience, and preservation of physical validity compared to existing algorithms, establishing a principled and efficient approach for robust quantum sequential learning.
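The entropy-filtering module can be pictured as scoring each observation sequence by its empirical symbol entropy and dropping the most anomalous rows; treating high entropy as suspicious is an assumption about the filter's direction, not a detail taken from the paper.

```python
# Sketch of entropy-based row filtering in the spirit of RCR-EF.
import numpy as np

def sequence_entropy(seq, num_symbols):
    counts = np.bincount(seq, minlength=num_symbols).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(0)
clean = rng.choice(4, size=(95, 100), p=[0.7, 0.2, 0.05, 0.05])
corrupt = rng.integers(0, 4, size=(5, 100))      # adversarially uniform rows
data = np.vstack([clean, corrupt])

H = np.array([sequence_entropy(row, 4) for row in data])
keep = H <= np.quantile(H, 0.95)                 # drop the top-5% entropy rows
filtered = data[keep]
```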
[883] ZeroFlood: A Geospatial Foundation Model for Data-Efficient Flood Susceptibility Mapping
Hyeongkyun Kim, Orestis Oikonomou
Main category: cs.LG
TL;DR: ZeroFlood is a geospatial foundation model framework that uses Thinking-in-Modality reasoning to enable flood susceptibility mapping from basic Earth observation data, achieving data-efficient flood prediction in data-scarce regions.
Details
Motivation: Flood susceptibility mapping is challenging in data-scarce regions where traditional hydrodynamic models require dense geophysical inputs that are often unavailable.
Method: Fine-tunes Geospatial Foundation Models (GFMs) with Thinking-in-Modality reasoning, using paired Earth observation data and simulated flood maps from data-rich regions for cross-modal representation learning.
Result: Experiments show TerraMind-Large configuration achieves F1 score of 67.21, with TiM enhancing model robustness.
Conclusion: Foundation-model-based flood susceptibility mapping provides a scalable and data-efficient solution for flood risk management.
Abstract: Flood susceptibility mapping (FSM) is vital for disaster prevention but remains challenging in data-scarce regions where hydrodynamic models require dense geophysical inputs. This work introduces ZeroFlood, a geospatial foundation model framework for data-efficient FSM. The approach fine-tunes Geospatial Foundation Models (GFMs) with Thinking-in-Modality (TiM) reasoning, enabling flood prediction from basic Earth observation data such as Sentinel-1 or Sentinel-2 imagery. Using paired EO and simulated flood maps from data-rich regions, ZeroFlood bridges data availability gaps through cross-modal representation learning. Experiments with TerraMind and Prithvi GFMs show that TiM enhances model robustness, with the TerraMind-Large configuration achieving an F1 score of 67.21. The results demonstrate the feasibility of foundation-model-based FSM as a scalable and data-efficient solution for flood risk management.
[884] GCAO: Group-driven Clustering via Gravitational Attraction and Optimization
Qi Li, Jun Wang
Main category: cs.LG
TL;DR: GCAO is a novel clustering algorithm that uses group-level gravitational attraction to handle high-dimensional, non-uniform data by aggregating boundary points into moving groups, improving clustering stability and boundary clarity.
Details
Motivation: Traditional clustering algorithms struggle with high-dimensional and non-uniformly distributed data, particularly with low-density boundary samples that are easily disturbed by neighboring clusters, leading to unstable and distorted results.
Method: GCAO introduces group-level optimization that aggregates low-density boundary points into collaboratively moving groups, combines local density estimation with neighborhood topology to construct gravitational interactions, and uses groups as basic motion units with gravitational contraction for stable convergence.
Result: GCAO outperforms 11 representative clustering methods with average improvements of 37.13% (NMI), 52.08% (ARI), 44.98% (Homogeneity), and 38.81% (ACC), while maintaining competitive efficiency and scalability.
Conclusion: GCAO demonstrates superiority in preserving cluster integrity, enhancing boundary separability, and ensuring robust performance on complex data distributions through its group-driven gravitational approach.
Abstract: Traditional clustering algorithms often struggle with high-dimensional and non-uniformly distributed data, where low-density boundary samples are easily disturbed by neighboring clusters, leading to unstable and distorted clustering results. To address this issue, we propose a Group-driven Clustering via Gravitational Attraction and Optimization (GCAO) algorithm. GCAO introduces a group-level optimization mechanism that aggregates low-density boundary points into collaboratively moving groups, replacing the traditional point-based contraction process. By combining local density estimation with neighborhood topology, GCAO constructs effective gravitational interactions between groups and their surroundings, enhancing boundary clarity and structural consistency. Using groups as basic motion units, a gravitational contraction strategy ensures globally stable and directionally consistent convergence. Experiments on multiple high-dimensional datasets demonstrate that GCAO outperforms 11 representative clustering methods, achieving average improvements of 37.13%, 52.08%, 44.98%, and 38.81% in NMI, ARI, Homogeneity, and ACC, respectively, while maintaining competitive efficiency and scalability. These results highlight GCAO’s superiority in preserving cluster integrity, enhancing boundary separability, and ensuring robust performance on complex data distributions.
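A single group-level contraction step might look like the sketch below: all points in a boundary group share one displacement toward a density-weighted attractor, rather than each point contracting independently. The inverse-square weighting and step size are assumptions.

```python
# Sketch of a gravity-style group contraction step.
import numpy as np

def group_contraction_step(X, group_idx, step=0.2):
    """Move every point in `group_idx` by one shared displacement."""
    group_center = X[group_idx].mean(axis=0)
    others = np.delete(X, group_idx, axis=0)
    d = np.linalg.norm(others - group_center, axis=1) + 1e-9
    weights = 1.0 / d ** 2                          # gravity-like attraction
    attractor = (weights[:, None] * others).sum(0) / weights.sum()
    X = X.copy()
    X[group_idx] += step * (attractor - group_center)  # group moves as one unit
    return X

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
X = group_contraction_step(X, group_idx=np.array([0, 1, 2]))
```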
[885] Symbolic Neural Generation with Applications to Lead Discovery in Drug Design
Ashwin Srinivasan, A Baskar, Tirtharaj Dash, Michael Bain, Sanjay Kumar Dey, Mainak Banerjee
Main category: cs.LG
TL;DR: Symbolic Neural Generators (SNGs) combine symbolic learning with neural reasoning to create data generators that meet formal correctness criteria, producing triples of symbolic descriptions, generated instances, and weights.
Details
Motivation: To address the underexplored area of hybrid neurosymbolic models that can construct data generators satisfying formal correctness criteria by leveraging complementary strengths of symbolic and neural methods.
Method: SNGs use symbolic learners to examine logical specifications from small data instances, which constrain neural-based generators to reject violating instances. The approach combines Inductive Logic Programming with large language models and operates over a weighted partial ordering.
Result: On benchmark drug design problems, SNG performance is statistically comparable to state-of-the-art methods. On exploratory problems with poorly understood targets, generated molecules show binding affinities comparable to leading clinical candidates, with several identified as viable for synthesis and testing.
Conclusion: SNGs effectively integrate symbolic and neural approaches to generate formally correct data, demonstrating practical utility in drug design where symbolic specifications serve as useful preliminary filters and generated molecules show promising binding properties.
Abstract: We investigate a relatively underexplored class of hybrid neurosymbolic models integrating symbolic learning with neural reasoning to construct data generators meeting formal correctness criteria. In \textit{Symbolic Neural Generators} (SNGs), symbolic learners examine logical specifications of feasible data from a small set of instances – sometimes just one. Each specification in turn constrains the conditional information supplied to a neural-based generator, which rejects any instance violating the symbolic specification. Like other neurosymbolic approaches, SNG exploits the complementary strengths of symbolic and neural methods. The outcome of an SNG is a triple $(H, X, W)$, where $H$ is a symbolic description of feasible instances constructed from data, $X$ a set of generated new instances that satisfy the description, and $W$ an associated weight. We introduce a semantics for such systems, based on the construction of appropriate \textit{base} and \textit{fibre} partially-ordered sets combined into an overall partial order, and outline a probabilistic extension relevant to practical applications. In this extension, SNGs result from searching over a weighted partial ordering. We implement an SNG combining a restricted form of Inductive Logic Programming (ILP) with a large language model (LLM) and evaluate it on early-stage drug design. Our main interest is the description and the set of potential inhibitor molecules generated by the SNG. On benchmark problems – where drug targets are well understood – SNG performance is statistically comparable to state-of-the-art methods. On exploratory problems with poorly understood targets, generated molecules exhibit binding affinities on par with leading clinical candidates. Experts further find the symbolic specifications useful as preliminary filters, with several generated molecules identified as viable for synthesis and wet-lab testing.
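The $(H, X, W)$ control flow reduces to generate-and-reject: a symbolic specification filters a neural generator's proposals. In the stub below, the predicate and the acceptance-rate weight are illustrative placeholders for the ILP-learned specification and the paper's weighting.

```python
# Toy SNG loop: a spec-as-predicate stands in for the ILP-learned H, and a
# random string generator stands in for the conditioned LLM.
import random

def symbolic_spec(molecule: str) -> bool:
    """Stand-in for a learned specification of feasible instances."""
    return molecule.count("N") >= 1 and len(molecule) <= 12

def neural_generator(rng) -> str:
    """Stub for an LLM conditioned on the specification."""
    return "".join(rng.choice("CNOH") for _ in range(rng.randint(4, 16)))

def sng_sample(n, seed=0):
    rng = random.Random(seed)
    accepted, proposed = [], 0
    while len(accepted) < n:
        cand = neural_generator(rng)
        proposed += 1
        if symbolic_spec(cand):                # reject spec violations
            accepted.append(cand)
    weight = len(accepted) / proposed          # crude acceptance-rate weight
    return symbolic_spec, accepted, weight     # the (H, X, W) triple

H, X, W = sng_sample(5)
```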
[886] Toward Interpretable Evaluation Measures for Time Series Segmentation
Félix Chavelli, Paul Boniol, Michaël Thomazo
Main category: cs.LG
TL;DR: The paper introduces two new evaluation measures (WARI and SMS) for time series segmentation that address limitations of existing metrics by capturing segmentation error quality, position, and type.
Details
Motivation: Existing evaluation measures for time series segmentation focus mainly on change point accuracy or use point-based measures like ARI, which fail to capture segment quality, ignore error nature, and offer limited interpretability.
Method: Proposed WARI (Weighted Adjusted Rand Index) that accounts for segmentation error positions, and SMS (State Matching Score) that identifies and scores four fundamental types of segmentation errors with error-specific weighting.
Result: Empirical validation on synthetic and real-world benchmarks shows WARI and SMS provide more accurate segmentation quality assessment and uncover insights like error provenance and type that traditional measures cannot access.
Conclusion: The new measures WARI and SMS overcome limitations of existing evaluation metrics by providing comprehensive assessment of segmentation quality with better interpretability and error analysis capabilities.
Abstract: Time series segmentation is a fundamental task in analyzing temporal data across various domains, from human activity recognition to energy monitoring. While numerous state-of-the-art methods have been developed to tackle this problem, the evaluation of their performance remains critically limited. Existing measures predominantly focus on change point accuracy or rely on point-based measures such as Adjusted Rand Index (ARI), which fail to capture the quality of the detected segments, ignore the nature of errors, and offer limited interpretability. In this paper, we address these shortcomings by introducing two novel evaluation measures: WARI (Weighted Adjusted Rand Index), that accounts for the position of segmentation errors, and SMS (State Matching Score), a fine-grained measure that identifies and scores four fundamental types of segmentation errors while allowing error-specific weighting. We empirically validate WARI and SMS on synthetic and real-world benchmarks, showing that they not only provide a more accurate assessment of segmentation quality but also uncover insights, such as error provenance and type, that are inaccessible with traditional measures.
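The limitation WARI targets is easy to demonstrate: ARI depends only on the label contingency table, so two segmentations with equally many mislabeled points score identically whether the errors sit at a change point or deep inside a segment.

```python
# Demonstration that point-based ARI is blind to error position.
import numpy as np
from sklearn.metrics import adjusted_rand_score

truth = np.repeat([0, 1], 50)                 # one change point at t = 50
near_boundary = truth.copy()
near_boundary[48:50] = 1                      # two errors at the change point
mid_segment = truth.copy()
mid_segment[20:22] = 1                        # two errors far from it
print(adjusted_rand_score(truth, near_boundary),
      adjusted_rand_score(truth, mid_segment))  # identical ARI values
```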
[887] Eigen-Value: Efficient Domain-Robust Data Valuation via Eigenvalue-Based Approach
Youngjun Choi, Joonseong Kang, Sungjun Lim, Kyungwoo Song
Main category: cs.LG
TL;DR: Eigen-Value (EV) is a computationally efficient data valuation framework that improves out-of-distribution (OOD) robustness using only in-distribution (ID) data, addressing limitations of existing methods that fail in OOD scenarios or require heavy computation.
Details
Motivation: Existing data valuation methods based on ID loss fail to generalize to OOD settings, and OOD-aware methods are computationally expensive, limiting practical deployment in real-world scenarios with domain shift.
Method: EV uses spectral approximation of domain discrepancy via eigenvalue ratios of ID data's covariance matrix, estimates data points' marginal contributions to this discrepancy using perturbation theory, and plugs into existing ID loss-based methods without additional training loops.
Result: EV achieves improved OOD robustness and stable value rankings across real-world datasets while remaining computationally lightweight compared to existing approaches.
Conclusion: EV provides an efficient and practical solution for OOD-robust data valuation in large-scale settings with domain shift, offering computational efficiency without sacrificing performance.
Abstract: Data valuation has become central in the era of data-centric AI. It drives efficient training pipelines and enables objective pricing in data markets by assigning a numeric value to each data point. Most existing data valuation methods estimate the effect of removing individual data points by evaluating changes in model validation performance under in-distribution (ID) settings, as opposed to out-of-distribution (OOD) scenarios where data follow different patterns. Since ID and OOD data behave differently, data valuation methods based on ID loss often fail to generalize to OOD settings, particularly when the validation set contains no OOD data. Furthermore, although OOD-aware methods exist, they involve heavy computational costs, which hinder practical deployment. To address these challenges, we introduce \emph{Eigen-Value} (EV), a plug-and-play data valuation framework for OOD robustness that uses only an ID data subset, including during validation. EV provides a new spectral approximation of the domain discrepancy, i.e., the loss gap between ID and OOD data, using ratios of eigenvalues of the ID data's covariance matrix. EV then estimates the marginal contribution of each data point to this discrepancy via perturbation theory, alleviating the computational burden. Subsequently, EV plugs into ID loss-based methods by adding an EV term without any additional training loop. We demonstrate that EV achieves improved OOD robustness and stable value rankings across real-world datasets, while remaining computationally lightweight. These results indicate that EV is practical for large-scale settings with domain shift, offering an efficient path to OOD-robust data valuation.
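The spectral ingredient is simple to compute: eigenvalue ratios of the ID covariance matrix. The sketch below stops there; how the ratios enter the discrepancy bound and the perturbation-theoretic per-point contributions is not reproduced.

```python
# Compute the covariance eigenvalue ratios EV builds its approximation on.
import numpy as np

rng = np.random.default_rng(0)
# Toy ID features with a decaying spectrum along 20 dimensions.
X_id = rng.standard_normal((500, 20)) @ np.diag(np.linspace(3, 0.1, 20))

cov = np.cov(X_id, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]          # descending order
ratios = eigvals / eigvals[0]                    # spectrum relative to top mode
print(ratios[:5])
```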
[888] Learning from Frustration: Torsor CNNs on Graphs
Daiyuan Li, Shreya Arya, Robert Ghrist
Main category: cs.LG
TL;DR: Torsor CNNs provide a framework for learning on graphs with local symmetries using edge potentials, generalizing equivariant networks beyond global symmetries.
Details
Motivation: Most equivariant neural networks only handle global symmetries, limiting their use in domains where symmetries are local rather than global.
Method: Introduces Torsor CNNs that use edge potentials (group-valued transformations between coordinate frames) and establishes equivalence to group synchronization problems, creating equivariant convolutional layers and a frustration loss regularizer.
Result: The framework unifies classical CNNs and Gauge CNNs by operating on arbitrary graphs without requiring global coordinate systems or smooth manifold structure.
Conclusion: Torsor CNNs provide a mathematically grounded framework for handling local symmetries, demonstrated in multi-view 3D recognition where camera poses naturally define edge potentials.
Abstract: Most equivariant neural networks rely on a single global symmetry, limiting their use in domains where symmetries are instead local. We introduce Torsor CNNs, a framework for learning on graphs with local symmetries encoded as edge potentials – group-valued transformations between neighboring coordinate frames. We establish that this geometric construction is fundamentally equivalent to the classical group synchronization problem, yielding: (1) a Torsor Convolutional Layer that is provably equivariant to local changes in coordinate frames, and (2) the frustration loss – a standalone geometric regularizer that encourages locally equivariant representations when added to any NN’s training objective. The Torsor CNN framework unifies and generalizes several architectures – including classical CNNs and Gauge CNNs on manifolds – by operating on arbitrary graphs without requiring a global coordinate system or smooth manifold structure. We establish the mathematical foundations of this framework and demonstrate its applicability to multi-view 3D recognition, where relative camera poses naturally define the required edge potentials.
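Frustration is the failure of edge potentials to compose to the identity around a cycle. With SO(2) rotations standing in for the general group (an illustrative choice; the aggregation over cycles is assumed):

```python
# Cycle-consistency penalty over group-valued edge potentials.
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def cycle_frustration(potentials):
    """potentials: group elements along a closed cycle (i -> j -> k -> i)."""
    P = np.eye(2)
    for g in potentials:
        P = g @ P
    return np.linalg.norm(P - np.eye(2)) ** 2   # zero iff the cycle is consistent

consistent = [rot(0.3), rot(0.5), rot(-0.8)]    # angles sum to zero
frustrated = [rot(0.3), rot(0.5), rot(0.4)]
print(cycle_frustration(consistent))            # ~0
print(cycle_frustration(frustrated))            # > 0
```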
[889] Predicting symbolic ODEs from multiple trajectories
Yakup Emre Şahin, Niki Kilbertus, Sören Becker
Main category: cs.LG
TL;DR: MIO is a transformer-based model that infers symbolic ODEs from multiple observed trajectories using multiple instance learning and symbolic regression.
Details
Motivation: To leverage repeated observations of the same dynamical system to learn more generalizable representations of the underlying dynamics.
Method: Combines multiple instance learning with transformer-based symbolic regression, investigating different instance aggregation strategies including simple mean aggregation.
Result: MIO consistently outperforms existing baselines on systems ranging from 1-4 dimensions under varying noise levels, with even simple mean aggregation substantially boosting performance.
Conclusion: The combination of multiple instance learning with transformer-based symbolic regression effectively improves symbolic ODE inference from multiple trajectories.
Abstract: We introduce MIO, a transformer-based model for inferring symbolic ordinary differential equations (ODEs) from multiple observed trajectories of a dynamical system. By combining multiple instance learning with transformer-based symbolic regression, the model effectively leverages repeated observations of the same system to learn more generalizable representations of the underlying dynamics. We investigate different instance aggregation strategies and show that even simple mean aggregation can substantially boost performance. MIO is evaluated on systems ranging from one to four dimensions and under varying noise levels, consistently outperforming existing baselines.
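The mean-aggregation baseline the authors highlight is a one-liner over per-trajectory embeddings (shapes are illustrative, not MIO's):

```python
# Mean aggregation across instance (trajectory) embeddings.
import torch

traj_embeddings = torch.randn(8, 256)           # 8 trajectories, each encoded
system_embedding = traj_embeddings.mean(dim=0)  # pooled input to the decoder
```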
[890] Towards Scaling Deep Neural Networks with Predictive Coding: Theory and Practice
Francesco Innocenti
Main category: cs.LG
TL;DR: This thesis advances predictive coding (PC) as a brain-inspired alternative to backpropagation (BP) for training neural networks, addressing PC’s scalability issues through theoretical analysis and proposing a new parameterization (μPC) that enables stable training of 100+ layer networks.
Details
Motivation: Backpropagation is energy inefficient and biologically implausible, while predictive coding offers a potentially more efficient brain-inspired alternative, but deep PC networks have been practically untrainable due to poor understanding of their dynamics.
Method: Theoretical analysis using optimization theory to understand PC learning dynamics as approximate trust-region methods, investigation of higher-order information usage, and development of μPC parameterization based on inference dynamics study.
Result: PC learning dynamics approximate trust-region methods using second-order information despite first-order updates; PC can use higher-order information creating more benign learning landscapes; μPC enables stable training of 100+ layer networks with competitive performance on simple tasks.
Conclusion: The thesis significantly advances understanding of PC network dynamics and enables scaling to deep networks, but future research needs hardware co-design for PC to compete with backpropagation at scale.
Abstract: Backpropagation (BP) is the standard algorithm for training the deep neural networks that power modern artificial intelligence including large language models. However, BP is energy inefficient and unlikely to be implemented by the brain. This thesis studies an alternative, potentially more efficient brain-inspired algorithm called predictive coding (PC). Unlike BP, PC networks (PCNs) perform inference by iterative equilibration of neuron activities before learning or weight updates. Recent work has suggested that this iterative inference procedure provides a range of benefits over BP, such as faster training. However, these advantages have not been consistently observed, the inference and learning dynamics of PCNs are still poorly understood, and deep PCNs remain practically untrainable. Here, we make significant progress towards scaling PCNs by taking a theoretical approach grounded in optimisation theory. First, we show that the learning dynamics of PC can be understood as an approximate trust-region method using second-order information, despite explicitly using only first-order local updates. Second, going beyond this approximation, we show that PC can in principle make use of arbitrarily higher-order information, such that for feedforward networks the effective landscape on which PC learns is far more benign and robust to vanishing gradients than the (mean squared error) loss landscape. Third, motivated by a study of the inference dynamics of PCNs, we propose a new parameterisation called “$\mu$PC”, which for the first time allows stable training of 100+ layer networks with little tuning and competitive performance on simple tasks. Overall, this thesis significantly advances our fundamental understanding of the inference and learning dynamics of PCNs, while highlighting the need for future research to focus on hardware co-design if PC is to compete with BP at scale.
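The iterative inference procedure can be illustrated on a toy two-layer linear PCN: activities relax down the energy gradient before any weight update. Step size, linearity, and the single hidden layer are simplifying assumptions.

```python
# Toy PC inference: relax the hidden activity z to minimize the summed
# squared prediction errors, then form a local weight update from the
# settled errors.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 4)) * 0.3   # input -> hidden prediction weights
W2 = rng.standard_normal((4, 8)) * 0.3   # hidden -> output prediction weights
x_in, y_out = rng.standard_normal(4), rng.standard_normal(4)

z = W1 @ x_in                             # hidden activity, free to move
for _ in range(100):                      # iterative equilibration
    e1 = z - W1 @ x_in                    # prediction error at hidden layer
    e2 = y_out - W2 @ z                   # prediction error at output layer
    z -= 0.1 * (e1 - W2.T @ e2)           # gradient of total energy wrt z
# After equilibration, weight updates are local, e.g. for W1:
dW1 = np.outer(e1, x_in)                  # Hebbian-style update direction
```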
[891] GRAD: Real-Time Gated Recurrent Anomaly Detection in Autonomous Vehicle Sensors Using Reinforced EMA and Multi-Stage Sliding Window Techniques
Mohammad Hossein Jafari Naeimi, Ali Norouzi, Athena Abdi
Main category: cs.LG
TL;DR: GRAD is a real-time anomaly detection method for autonomous vehicle sensors that combines statistical analysis (REMA) and deep learning (GRU) to detect, classify, and recover from sensor anomalies with high accuracy and low computational cost.
Details
Motivation: To ensure the reliability of sensor data in autonomous vehicles by detecting anomalies in real-time while maintaining system operation through data recovery, addressing the need for both high detection accuracy and computational efficiency.
Method: Combines Reinforced Exponential Moving Average (REMA) for adaptive outlier detection with Multi-Stage Sliding Window (MS-SW) for capturing short- and long-term patterns, processed through a lightweight 2-layer GRU model for anomaly detection and classification, plus a recovery module for data restoration.
Result: Achieved overall F1-score of 97.6% for abnormal data and 99.4% for normal data, with high precision in anomaly classification and successful data recovery. Outperforms comparable models by balancing high detection accuracy with low computational cost.
Conclusion: GRAD demonstrates strong potential as a reliable and efficient real-time anomaly detection solution for autonomous vehicle systems, ensuring safe operation with minimal computational overhead.
Abstract: This paper introduces GRAD, a real-time anomaly detection method for autonomous vehicle sensors that integrates statistical analysis and deep learning to ensure the reliability of sensor data. The proposed approach combines the Reinforced Exponential Moving Average (REMA), which adapts smoothing factors and thresholding for outlier detection, with the Multi-Stage Sliding Window (MS-SW) technique for capturing both short- and long-term patterns. These features are processed using a lightweight Gated Recurrent Unit (GRU) model, which detects and classifies anomalies based on bias types, while a recovery module restores damaged sensor data to ensure continuous system operation. GRAD has a lightweight architecture consisting of two layers of GRU with a limited number of neurons that make it appropriate for real-time applications while maintaining high detection accuracy. The GRAD framework achieved remarkable performance in anomaly detection and classification. The model demonstrated an overall F1-score of 97.6% for abnormal data and 99.4% for normal data, signifying its high accuracy in distinguishing between normal and anomalous sensor data. Regarding the anomaly classification, GRAD successfully categorized different anomaly types with high precision, enabling the recovery module to accurately restore damaged sensor data. Relative to analogous studies, GRAD surpasses current models by attaining a balance between elevated detection accuracy and diminished computational expense. These results demonstrate GRAD’s potential as a reliable and efficient solution for real-time anomaly detection in autonomous vehicle systems, guaranteeing safe vehicle operation with minimal computational overhead.
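As a rough illustration of the adaptive-EMA idea behind REMA, here is a NumPy sketch that flags samples deviating from an exponential moving average by several running standard deviations; the smoothing factor and threshold rule are assumptions, not the paper's exact update.

```python
# NumPy sketch of adaptive-EMA outlier flagging; alpha, k and the variance
# update are assumptions, not the paper's exact REMA rules.
import numpy as np

def rema_flags(signal, alpha=0.1, k=3.0):
    """Flag samples that deviate from an exponential moving average by
    more than k running standard deviations."""
    ema, var = signal[0], 1.0   # optimistic variance avoids cold-start alarms
    flags = []
    for x in signal:
        resid = x - ema
        flags.append(abs(resid) > k * np.sqrt(var))
        ema += alpha * resid                        # EMA of the level
        var = (1 - alpha) * var + alpha * resid**2  # EMA of squared residual
    return np.array(flags)

rng = np.random.default_rng(1)
s = rng.normal(0.0, 1.0, 500)
s[100] += 12.0                        # inject a spike anomaly
print(np.nonzero(rema_flags(s))[0])   # typically flags index 100
```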
[892] BBOPlace-Bench: Benchmarking Black-Box Optimization for Chip Placement
Ke Xue, Ruo-Tong Chen, Rong-Xi Tan, Xi Lin, Yunqi Shi, Siyuan Xu, Mingxuan Yuan, Chao Qian
Main category: cs.LG
TL;DR: BBOPlace-Bench is the first unified benchmark for evaluating black-box optimization algorithms in chip placement, integrating three problem formulations and various BBO algorithms to facilitate algorithm development and comparison.
Details
Motivation: Chip placement significantly impacts chip quality, and while BBO has shown promise, the field lacks a standardized benchmark for proper evaluation of different problem formulations and algorithms.
Method: Created BBOPlace-Bench benchmark with three BBO problem formulations (mask-guided optimization, hyperparameter optimization, sequence pair) and integrated various BBO algorithms including simulated annealing, evolutionary algorithms, and Bayesian optimization.
Result: Mask-guided optimization and hyperparameter optimization outperformed sequence pair formulation. Evolutionary algorithms showed superior performance over SA and BO, especially in high-dimensional spaces, achieving state-of-the-art results compared to mainstream placement methods.
Conclusion: BBOPlace-Bench successfully fills the benchmark gap in chip placement BBO research, enabling better algorithm development and expanding practical applications for the BBO community.
Abstract: Chip placement is a vital stage in modern chip design as it has a substantial impact on the subsequent processes and the overall quality of the final chip. The use of black-box optimization (BBO) for chip placement has a history of several decades. However, early efforts were limited by immature problem formulations and inefficient algorithm designs. Recent progress has shown the effectiveness and efficiency of BBO for chip placement, proving its potential to achieve state-of-the-art results. Despite these advancements, the field lacks a unified, BBO-specific benchmark for thoroughly assessing various problem formulations and BBO algorithms. To fill this gap, we propose BBOPlace-Bench, the first benchmark designed specifically for evaluating and developing BBO algorithms for chip placement tasks. It integrates three problem formulations of BBO for chip placement, and offers a modular, decoupled, and flexible framework that enables users to seamlessly implement, test, and compare their own algorithms. BBOPlace-Bench integrates a wide variety of existing BBO algorithms, including simulated annealing (SA), evolutionary algorithms (EAs), and Bayesian optimization (BO). Experimental results show that the problem formulations of mask-guided optimization and hyperparameter optimization exhibit superior performance than the sequence pair problem formulation, while EAs demonstrate better overall performance than SA and BO, especially in high-dimensional search spaces, and also achieve state-of-the-art performance compared to the mainstream chip placement methods. BBOPlace-Bench not only facilitates the development of efficient BBO-driven solutions for chip placement but also broadens the practical application scenarios (which are urgently needed) for the BBO community. The code of BBOPlace-Bench is available at https://github.com/lamda-bbo/BBOPlace-Bench.
[893] Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving
Xinyu Wang, Jonas M. Kübler, Kailash Budhathoki, Yida Wang, Matthäus Kleindessner
Main category: cs.LG
TL;DR: Proposes block-diagonal LoRA for efficient multi-adapter serving, eliminating communication overhead in S-LoRA while maintaining parameter efficiency.
Details
Motivation: S-LoRA's sharding strategy has communication overhead that becomes significant in practice when serving multiple LoRA adapters simultaneously with a base LLM.
Method: Constrains LoRA factors to be block-diagonal, enabling alternative sharding strategy that requires no additional communication for LoRA computations.
Result: Achieves similar parameter efficiency as standard LoRA with significant speed-up over S-LoRA: up to 1.79x end-to-end speed-up on 8 A100 GPUs for Llama-3.1-70B.
Conclusion: Block-diagonal LoRA provides an effective solution for efficient multi-adapter serving without communication overhead, offering substantial performance improvements over existing methods.
Abstract: When serving a single base LLM with several different LoRA adapters simultaneously, the adapters cannot simply be merged with the base model’s weights as the adapter swapping would create overhead and requests using different adapters could not be batched. Rather, the LoRA computations have to be separated from the base LLM computations, and in a multi-device setup the LoRA adapters can be sharded in a way that is well aligned with the base model’s tensor parallel execution, as proposed in S-LoRA. However, the S-LoRA sharding strategy encounters some communication overhead, which may be small in theory, but can be large in practice. In this paper, we propose to constrain certain LoRA factors to be block-diagonal, which allows for an alternative way of sharding LoRA adapters that does not require any additional communication for the LoRA computations. We demonstrate in extensive experiments that our block-diagonal LoRA approach is similarly parameter efficient as standard LoRA (i.e., for a similar number of parameters it achieves similar downstream performance) and that it leads to significant end-to-end speed-up over S-LoRA. For example, when serving on eight A100 GPUs, we observe up to 1.79x (1.23x) end-to-end speed-up with 0.87x (1.74x) the number of adapter parameters for Llama-3.1-70B, and up to 1.63x (1.3x) end-to-end speed-up with 0.86x (1.73x) the number of adapter parameters for Llama-3.1-8B.
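The communication saving can be seen in a small NumPy simulation of a row-parallel linear layer: with a block-diagonal A factor (and B sharded by matching rank blocks), each shard's LoRA contribution folds into the partial sum the base layer all-reduces anyway. This is a sketch of the idea, not the paper's implementation.

```python
# Two simulated "devices" in a row-parallel linear layer: each computes its
# base partial W_i x_i plus a purely local LoRA partial B_i (A_i x_i), so no
# extra all-reduce is needed beyond the one the base layer already performs.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, shards = 8, 6, 4, 2
x = rng.normal(size=d_in)
W = rng.normal(size=(d_out, d_in))

# Block-diagonal A: shard i holds block A_i acting only on its input slice.
A_blocks = [rng.normal(size=(r // shards, d_in // shards)) for _ in range(shards)]
# B is sharded by the matching column blocks of the rank dimension.
B_blocks = [rng.normal(size=(d_out, r // shards)) for _ in range(shards)]

partials = []
for i in range(shards):
    x_i = x[i * d_in // shards:(i + 1) * d_in // shards]
    W_i = W[:, i * d_in // shards:(i + 1) * d_in // shards]
    # Base partial plus local LoRA partial -- both computed locally.
    partials.append(W_i @ x_i + B_blocks[i] @ (A_blocks[i] @ x_i))

y = np.sum(partials, axis=0)  # the all-reduce the base layer needs anyway

# Reference: dense computation with the equivalent block-diagonal A.
A = np.zeros((r, d_in))
B = np.hstack(B_blocks)
for i in range(shards):
    A[i * r // shards:(i + 1) * r // shards,
      i * d_in // shards:(i + 1) * d_in // shards] = A_blocks[i]
print(np.allclose(y, W @ x + B @ (A @ x)))  # True
```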
[894] Robust Non-negative Proximal Gradient Algorithm for Inverse Problems
Hanzhang Wang, Zonglin Liu, Jingyi Xu, Chenyang Wang, Zhiwei Zhong, Qiangqiang Shen
Main category: cs.LG
TL;DR: Proposes SSO-PGA, a novel multiplicative update proximal gradient algorithm that replaces gradient descent with a learnable sigmoid-based operator to enforce non-negativity constraints and improve stability in inverse problems.
Details
Motivation: Traditional proximal gradient algorithms (PGA) often violate non-negativity constraints, yield unstable convergence, produce suboptimal solutions, and are highly sensitive to hyperparameters due to the gradient descent step introducing negative values.
Method: Replaces gradient descent step with learnable sigmoid-based operator that transforms subtractive updates into multiplicative ones, enforcing non-negativity and boundedness. Uses sliding parameter for stability, unfolds optimization into deep network combining interpretability with deep learning power.
Result: Significantly outperforms traditional PGA and state-of-the-art algorithms in numerical and real-world experiments, demonstrating superior performance, stability, robustness, expressive capacity, and noise immunity.
Conclusion: SSO-PGA provides a robust solution for non-negative inverse problems with convergence guarantees, effectively addressing limitations of traditional PGA while maintaining optimization interpretability through deep network unfolding.
Abstract: Proximal gradient algorithms (PGA), while foundational for inverse problems like image reconstruction, often yield unstable convergence and suboptimal solutions by violating the critical non-negativity constraint. We identify the gradient descent step as the root cause of this issue, which introduces negative values and induces high sensitivity to hyperparameters. To overcome these limitations, we propose a novel multiplicative update proximal gradient algorithm (SSO-PGA) with convergence guarantees, which is designed for robustness in non-negative inverse problems. Our key innovation lies in superseding the gradient descent step with a learnable sigmoid-based operator, which inherently enforces non-negativity and boundedness by transforming traditional subtractive updates into multiplicative ones. This design, augmented by a sliding parameter for enhanced stability and convergence, not only improves robustness but also boosts expressive capacity and noise immunity. We further formulate a degradation model for multi-modal restoration and derive its SSO-PGA-based optimization algorithm, which is then unfolded into a deep network to marry the interpretability of optimization with the power of deep learning. Extensive numerical and real-world experiments demonstrate that our method significantly surpasses traditional PGA and other state-of-the-art algorithms, ensuring superior performance and stability.
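One plausible reading of the multiplicative update, sketched on a toy non-negative least-squares problem: the subtractive step x - lr*grad is replaced by a sigmoid gate x * 2*sigmoid(-beta*grad), which keeps iterates non-negative. The actual SSO-PGA operator is learnable; the fixed gate below is an assumption.

```python
# Toy non-negative least squares: min ||Ax - b||^2 with x >= 0. The fixed
# gate 2*sigmoid(-beta*grad) is an assumed stand-in for SSO-PGA's learnable
# sigmoid operator; its factor lies in (0, 2), so every iterate stays >= 0.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10))
x_true = np.abs(rng.normal(size=10))
b = A @ x_true

x = np.full(10, 0.5)                      # non-negative initialization
beta = 0.05                               # gate slope (step-size analogue)
for _ in range(2000):
    grad = A.T @ (A @ x - b)
    x = x * 2.0 * sigmoid(-beta * grad)   # multiplicative update

print(np.linalg.norm(x - x_true))         # residual shrinks toward zero
```

At a fixed point with x > 0, the gate equals 1, which forces the gradient to zero, so the update shares its stationary points with ordinary gradient descent while never leaving the non-negative orthant.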
[895] Mixed Precision Training of Neural ODEs
Elena Celledoni, Brynjulf Owren, Lars Ruthotto, Tianjiao Nicole Yang
Main category: cs.LG
TL;DR: A mixed precision training framework for neural ODEs that uses low-precision computations for velocity evaluation and intermediate states while maintaining stability through custom adjoint scaling and high-precision accumulation, achieving ~50% memory reduction and up to 2x speedup.
Details
Motivation: Standard mixed precision training schemes are unreliable for continuous-time architectures like neural ODEs, which face computational costs from repeated network evaluations and growing memory requirements with time steps.
Method: Combines explicit ODE solvers with custom backpropagation using low-precision for velocity evaluation and state storage, plus dynamic adjoint scaling and high-precision accumulation for stability.
Result: Achieved approximately 50% memory reduction and up to 2x speedup while maintaining comparable accuracy to single-precision training in image classification and generative model applications.
Conclusion: The framework provides an effective mixed precision solution for neural ODEs, addressing computational and memory challenges while maintaining training stability and accuracy.
Abstract: Exploiting low-precision computations has become a standard strategy in deep learning to address the growing computational costs imposed by ever larger models and datasets. However, naively performing all computations in low precision can lead to roundoff errors and instabilities. Therefore, mixed precision training schemes usually store the weights in high precision and use low-precision computations only for whitelisted operations. Despite their success, these principles are currently not reliable for training continuous-time architectures such as neural ordinary differential equations (Neural ODEs). This paper presents a mixed precision training framework for neural ODEs, combining explicit ODE solvers with a custom backpropagation scheme, and demonstrates its effectiveness across a range of learning tasks. Our scheme uses low-precision computations for evaluating the velocity, parameterized by the neural network, and for storing intermediate states, while stability is provided by a custom dynamic adjoint scaling and by accumulating the solution and gradients in higher precision. These contributions address two key challenges in training neural ODEs: the computational cost of repeated network evaluations and the growth of memory requirements with the number of time steps or layers. Along with the paper, we publish our extendable, open-source PyTorch package rampde, whose syntax resembles that of leading packages to provide a drop-in replacement in existing codes. We demonstrate the reliability and effectiveness of our scheme using challenging test cases and on neural ODE applications in image classification and generative models, achieving approximately 50% memory reduction and up to 2x speedup while maintaining accuracy comparable to single-precision training.
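The accumulation pattern described in the abstract can be illustrated in a few lines: evaluate the velocity in float16 but keep the state and the update in float32. A toy linear velocity stands in for the network; rampde's actual API will differ.

```python
# Mixed-precision forward Euler: float16 velocity evaluation, float32
# state accumulation. A fixed linear field stands in for the network.
import numpy as np

def velocity_fp16(y, W):
    # Evaluate the (stand-in) network in half precision.
    return W.astype(np.float16) @ y.astype(np.float16)

def solve_euler_mixed(y0, W, t1=1.0, n_steps=100):
    y = y0.astype(np.float32)                 # state kept in high precision
    h = np.float32(t1 / n_steps)
    for _ in range(n_steps):
        v = velocity_fp16(y, W)               # low-precision evaluation
        y = y + h * v.astype(np.float32)      # high-precision accumulation
    return y

W = np.array([[0.0, 1.0], [-1.0, 0.0]])       # dy/dt = (y2, -y1): rotation
y1 = solve_euler_mixed(np.array([1.0, 0.0]), W)
print(y1, np.array([np.cos(1.0), -np.sin(1.0)]))  # Euler vs exact solution
```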
[896] Towards a Generalizable AI for Materials Discovery: Validation through Immersion Coolant Screening
Hyunseung Kim, Dae-Woong Jeong, Changyoung Park, Won-Ji Lee, Ha-Eun Lee, Ji-Hye Lee, Rodrigo Hormazabal, Sung Moon Ko, Sumin Lee, Soorin Yim, Chanhui Lee, Sehui Han, Sang-Ho Cha, Woohyung Lim
Main category: cs.LG
TL;DR: GATE is a generalizable AI framework that learns 34 physicochemical properties in a shared geometric space, enabling multi-property materials discovery without retraining for each new application.
Details
Motivation: Most existing AI models for materials discovery are problem-specific and require additional data collection and retraining for each new property, limiting their practical utility.
Method: GATE jointly learns 34 physicochemical properties spanning thermal, electrical, mechanical, and optical domains by aligning them within a shared geometric space to capture cross-property correlations and reduce disjoint-property bias.
Result: Applied directly to immersion cooling fluid discovery, GATE screened billions of candidates and identified 92,861 promising molecules. Four were experimentally validated with strong agreement to measurements and performance comparable to or exceeding commercial coolants.
Conclusion: GATE establishes a ready-to-use, generalizable AI platform applicable across diverse materials discovery tasks without problem-specific reconfiguration.
Abstract: Artificial intelligence (AI) has emerged as a powerful accelerator of materials discovery, yet most existing models remain problem-specific, requiring additional data collection and retraining for each new property. Here we introduce and validate GATE (Geometrically Aligned Transfer Encoder) – a generalizable AI framework that jointly learns 34 physicochemical properties spanning thermal, electrical, mechanical, and optical domains. By aligning these properties within a shared geometric space, GATE captures cross-property correlations that reduce disjoint-property bias – a key factor causing false negatives in multi-criteria screening. To demonstrate its generalizability, GATE – without any problem-specific reconfiguration – was directly applied to the discovery of immersion cooling fluids for data centers, a stringent real-world challenge defined by the Open Compute Project (OCP). Screening billions of candidates, GATE identified 92,861 molecules as promising for practical deployment. Four were validated experimentally or against the literature, showing strong agreement with wet-lab measurements and performance comparable to or exceeding a commercial coolant. These results establish GATE as a ready-to-use, generalizable AI platform readily applicable across diverse materials discovery tasks.
[897] A Deep Latent Factor Graph Clustering with Fairness-Utility Trade-off Perspective
Siamak Ghodsi, Amjad Seyedi, Tai Le Quy, Fariba Karimi, Eirini Ntoutsi
Main category: cs.LG
TL;DR: DFNMF is a deep nonnegative tri-factorization method for fair graph clustering that directly optimizes cluster assignments with a fairness-utility trade-off controlled by a single parameter λ.
Details
Motivation: Existing fair graph clustering approaches use rigid constraints or multi-stage pipelines that limit trade-off control, interpretability, and scalability.
Method: End-to-end deep nonnegative tri-factorization with soft statistical-parity regularizer, using sparse-friendly alternating updates that scale near-linearly with edges.
Result: DFNMF achieves substantially higher group balance at comparable modularity, often dominating state-of-the-art baselines on the Pareto front.
Conclusion: DFNMF provides an effective single-parameter solution for fair graph clustering with good scalability and interpretability through nonnegative factors.
Abstract: Fair graph clustering seeks partitions that respect network structure while maintaining proportional representation across sensitive groups, with applications spanning community detection, team formation, resource allocation, and social network analysis. Many existing approaches enforce rigid constraints or rely on multi-stage pipelines (e.g., spectral embedding followed by $k$-means), limiting trade-off control, interpretability, and scalability. We introduce \emph{DFNMF}, an end-to-end deep nonnegative tri-factorization tailored to graphs that directly optimizes cluster assignments with a soft statistical-parity regularizer. A single parameter $\lambda$ tunes the fairness–utility balance, while nonnegativity yields parts-based factors and transparent soft memberships. The optimization uses sparse-friendly alternating updates and scales near-linearly with the number of edges. Across synthetic and real networks, DFNMF achieves substantially higher group balance at comparable modularity, often dominating state-of-the-art baselines on the Pareto front. The code is available at https://github.com/SiamakGhodsi/DFNMF.git.
[898] The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation
Farid Bagirov, Mikhail Arkhipov, Ksenia Sycheva, Evgeniy Glukhov, Egor Bogomolov
Main category: cs.LG
TL;DR: RLVR improves LLM reasoning but harms exploration diversity. This work optimizes max@k metric (generalization of pass@k) with unbiased gradient estimates for both on-policy and off-policy scenarios, aligning models with Best-of-N inference.
Details
Motivation: RLVR fine-tuning improves reasoning but reduces generation diversity, degrading Best-of-N sampling performance for large N values. Need to optimize max@k metric directly.
Method: Derived unbiased on-policy gradient estimate for max@k optimization. Extended to off-policy updates for better sample efficiency in modern RLVR algorithms.
Result: Empirical results show the objective effectively optimizes max@k metric in off-policy scenarios, aligning models with Best-of-N inference strategy.
Conclusion: The proposed method successfully addresses the exploration diversity issue in RLVR by directly optimizing the max@k metric through unbiased gradient estimates, improving Best-of-N performance.
Abstract: The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models. Despite its success in single-generation problem solving, the reinforcement learning fine-tuning process may harm the model’s exploration ability, as reflected in decreased diversity of generations and a resulting degradation of performance during Best-of-N sampling for large N values. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivations to the off-policy updates, a common element in modern RLVR algorithms, that allows better sample efficiency. Empirically, we show that our objective effectively optimizes the max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.
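For reference, max@k (the expected maximum reward over a uniformly random k-subset of n samples) has a closed-form unbiased estimate from order statistics, analogous to the standard pass@k estimator. The sketch below computes the metric itself, not the paper's gradient estimator.

```python
# Unbiased estimate of max@k from n >= k sampled rewards: the expected
# maximum over a uniformly random k-subset of the n samples.
from math import comb

def max_at_k(rewards, k):
    r = sorted(rewards)                     # ascending order statistics
    n = len(r)
    # P(the i-th smallest sample is the subset maximum) = C(i-1, k-1) / C(n, k)
    return sum(comb(i - 1, k - 1) * r[i - 1] for i in range(k, n + 1)) / comb(n, k)

rewards = [0.1, 0.9, 0.4, 0.7, 0.2, 0.6, 0.3, 0.8]
print(max_at_k(rewards, k=4))
# With binary rewards this reduces to the familiar pass@k estimator.
```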
[899] PrivacyGuard: A Modular Framework for Privacy Auditing in Machine Learning
Luca Melis, Matthew Grange, Iden Kalemaj, Karan Chadha, Shengyuan Hu, Elena Kashtelyan, Will Bullock
Main category: cs.LG
TL;DR: PrivacyGuard is a comprehensive tool for empirical differential privacy analysis in ML models, implementing various privacy attacks and measurement techniques.
Details
Motivation: The increasing deployment of ML models in sensitive domains creates a need for robust privacy assessment tools to evaluate privacy risks.
Method: PrivacyGuard implements a diverse suite of privacy attacks including membership inference, extraction, and reconstruction attacks, with a modular architecture that supports integration of new attacks and privacy metrics.
Result: The tool enables both off-the-shelf and highly configurable privacy analyses, supporting rapid adaptation to emerging research advances.
Conclusion: PrivacyGuard provides a practical solution for empirical differential privacy analysis and is made publicly available as an open-source tool.
Abstract: The increasing deployment of Machine Learning (ML) models in sensitive domains motivates the need for robust, practical privacy assessment tools. PrivacyGuard is a comprehensive tool for empirical differential privacy (DP) analysis, designed to evaluate privacy risks in ML models through state-of-the-art inference attacks and advanced privacy measurement techniques. To this end, PrivacyGuard implements a diverse suite of privacy attacks – including membership inference, extraction, and reconstruction attacks – enabling both off-the-shelf and highly configurable privacy analyses. Its modular architecture allows for the seamless integration of new attacks and privacy metrics, supporting rapid adaptation to emerging research advances. We make PrivacyGuard available at https://github.com/facebookresearch/PrivacyGuard.
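As a flavor of the simplest attack family mentioned, here is a minimal loss-threshold membership-inference sketch; PrivacyGuard's own interfaces are richer and differ from this toy.

```python
# Minimal loss-threshold membership-inference attack: predict "member"
# when the per-example loss falls below a threshold, since models tend to
# fit training points more tightly. Synthetic losses stand in for a model.
import numpy as np

def loss_threshold_attack(member_losses, nonmember_losses, threshold):
    preds = np.concatenate([member_losses, nonmember_losses]) < threshold
    labels = np.concatenate([np.ones_like(member_losses),
                             np.zeros_like(nonmember_losses)]).astype(bool)
    return (preds == labels).mean()   # attack accuracy

rng = np.random.default_rng(0)
member = rng.gamma(2.0, 0.1, 1000)      # lower losses on training data
nonmember = rng.gamma(2.0, 0.3, 1000)
print(loss_threshold_attack(member, nonmember, threshold=0.35))
```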
[900] Improving Predictions of Molecular Properties with Graph Featurisation and Heterogeneous Ensemble Models
Michael L. Parker, Samar Mahmoud, Bailey Montefiore, Mario Öeren, Himani Tandon, Charlotte Wharrick, Matthew D. Segall
Main category: cs.LG
TL;DR: A MetaModel framework that combines graph neural network features with conventional molecular descriptors and uses ensemble ML models achieves superior performance on molecular property prediction tasks.
Details
Motivation: To develop a comprehensive approach that leverages both learned molecular descriptors from GNNs and general-purpose descriptors to achieve optimal performance across diverse molecular property prediction problems.
Method: Introduces a MetaModel framework that aggregates predictions from diverse ML models and combines task-specific GNN-derived features with conventional molecular descriptors through a featurisation scheme.
Result: Outperforms the cutting-edge ChemProp model on all regression datasets tested and 6 of 9 classification datasets. GNN features from ChemProp boost ensemble performance on datasets where it would otherwise underperform.
Conclusion: Optimal performance across diverse problems requires combining general-purpose descriptors with task-specific learned features and using a diverse set of ML models for predictions.
Abstract: We explore a “best-of-both” approach to modelling molecular properties by combining learned molecular descriptors from a graph neural network (GNN) with general-purpose descriptors and a mixed ensemble of machine learning (ML) models. We introduce a MetaModel framework to aggregate predictions from a diverse set of leading ML models. We present a featurisation scheme for combining task-specific GNN-derived features with conventional molecular descriptors. We demonstrate that our framework outperforms the cutting-edge ChemProp model on all regression datasets tested and 6 of 9 classification datasets. We further show that including the GNN features derived from ChemProp boosts the ensemble model’s performance on several datasets where it otherwise would have underperformed. We conclude that to achieve optimal performance across a wide set of problems, it is vital to combine general-purpose descriptors with task-specific learned features and use a diverse set of ML models to make the predictions.
[901] TAMI: Taming Heterogeneity in Temporal Interactions for Temporal Graph Link Prediction
Zhongyi Yu, Jianqiu Wu, Zhenghao Wu, Shuhan Zhong, Weifeng Su, Chul-Ho Lee, Weipeng Zhuo
Main category: cs.LG
TL;DR: TAMI is a novel framework that addresses heterogeneity in temporal graph link prediction through log time encoding and link history aggregation, improving performance for infrequently interacting node pairs.
Details
Motivation: Existing methods fail to handle heterogeneity in temporal interactions, where a few node pairs dominate interactions and events occur at varying intervals, leading to ineffective temporal encoding and forgetting of past interactions.
Method: Proposes TAMI framework with two components: Log Time Encoding (LTE) that transforms interaction intervals into balanced representations, and Link History Aggregation (LHA) that prevents forgetting of historical interactions for each node pair.
Result: TAMI consistently improves link prediction performance on 13 classic datasets and 3 TGB benchmarks, working effectively in both transductive and inductive settings.
Conclusion: The proposed TAMI framework successfully addresses temporal heterogeneity and can be seamlessly integrated with existing temporal graph neural networks to enhance their link prediction capabilities.
Abstract: Temporal graph link prediction aims to predict future interactions between nodes in a graph based on their historical interactions, which are encoded in node embeddings. We observe that heterogeneity naturally appears in temporal interactions, e.g., a few node pairs account for most interaction events, and interaction events happen at varying intervals. This leads to ineffective temporal information encoding and forgetting of past interactions for node pairs that interact only intermittently, hurting their link prediction. Existing methods, however, do not consider such heterogeneity in their learning process, and thus their learned temporal node embeddings are less effective, especially when predicting the links for infrequently interacting node pairs. To cope with the heterogeneity, we propose a novel framework called TAMI, which contains two effective components, namely log time encoding function (LTE) and link history aggregation (LHA). LTE better encodes the temporal information by transforming interaction intervals into more balanced ones, and LHA prevents the historical interactions for each target node pair from being forgotten. State-of-the-art temporal graph neural networks can be seamlessly and readily integrated into TAMI to improve their effectiveness. Experimental results on 13 classic datasets and the three newest temporal graph benchmark (TGB) datasets show that TAMI consistently improves the link prediction performance of the underlying models in both transductive and inductive settings. Our code is available at https://github.com/Alleinx/TAMI_temporal_graph.
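A sketch of what a log time encoding can look like: compress raw interaction intervals with log1p before the usual sinusoidal features, so that rare long gaps and frequent short gaps occupy a more balanced range. The exact LTE form in the paper may differ.

```python
# Log-compressed sinusoidal time features; frequencies are assumptions.
import numpy as np

def log_time_encoding(dt, dim=8):
    """Map interaction intervals dt (seconds) to sinusoidal features of
    log-compressed time."""
    t = np.log1p(np.asarray(dt, dtype=np.float64))[:, None]
    freqs = (1.0 / 10.0 ** np.linspace(0, 2, dim // 2))[None, :]
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)], axis=1)

# Intervals spanning five orders of magnitude stay in a usable range.
dt = np.array([1.0, 60.0, 3600.0, 86400.0, 604800.0])
print(np.log1p(dt))                 # ~0.7 to ~13.3 instead of 1 to 6e5
print(log_time_encoding(dt).shape)  # (5, 8)
```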
[902] Coresets for Clustering Under Stochastic Noise
Lingxiao Huang, Zhize Li, Nisheeth K. Vishnoi, Runkai Yang, Haoyu Zhao
Main category: cs.LG
TL;DR: The paper studies coreset construction for (k,z)-clustering when datasets are corrupted by stochastic noise, proposing a new surrogate error metric that provides better approximation guarantees and smaller coresets than traditional approaches.
Details
Motivation: Traditional coreset construction faces challenges when the true dataset is unobserved due to noise corruption, making it difficult to evaluate coreset quality and obtain tight guarantees on the true clustering cost.
Method: The authors introduce a new error metric that closely aligns with the true clustering cost, design a coreset construction algorithm based on this metric, and prove theoretical guarantees under mild assumptions on data and noise.
Result: The proposed approach yields smaller coresets (improving by up to poly(k) factor) and tighter guarantees on true clustering cost compared to classical metrics, with experimental validation on real-world datasets.
Conclusion: The new surrogate error metric enables more efficient coreset construction for noisy clustering problems, providing both theoretical improvements and practical advantages over existing methods.
Abstract: We study the problem of constructing coresets for $(k, z)$-clustering when the input dataset is corrupted by stochastic noise drawn from a known distribution. In this setting, evaluating the quality of a coreset is inherently challenging, as the true underlying dataset is unobserved. To address this, we investigate coreset construction using surrogate error metrics that are tractable and provably related to the true clustering cost. We analyze a traditional metric from prior work and introduce a new error metric that more closely aligns with the true cost. Although our metric is defined independently of the noise distribution, it enables approximation guarantees that scale with the noise level. We design a coreset construction algorithm based on this metric and show that, under mild assumptions on the data and noise, enforcing an $\varepsilon$-bound under our metric yields smaller coresets and tighter guarantees on the true clustering cost than those obtained via classical metrics. In particular, we prove that the coreset size can improve by a factor of up to $\mathrm{poly}(k)$. Experiments on real-world datasets support our theoretical findings and demonstrate the practical advantages of our approach.
[903] An Information-Theoretic Analysis of Out-of-Distribution Generalization in Meta-Learning with Applications to Meta-RL
Xingtu Liu
Main category: cs.LG
TL;DR: This paper studies out-of-distribution generalization in meta-learning from an information-theoretic perspective, focusing on environment mismatch scenarios and establishing generalization bounds for meta-reinforcement learning.
Details
Motivation: To address the challenge of out-of-distribution generalization in meta-learning, particularly when testing environments differ from training environments or when training is broader than testing.
Method: Uses information-theoretic analysis to formalize generalization problems in meta-reinforcement learning, establishes generalization bounds, and analyzes gradient-based meta-reinforcement learning algorithms.
Result: Developed theoretical framework for out-of-distribution generalization in meta-learning and established corresponding generalization bounds for meta-reinforcement learning scenarios.
Conclusion: The information-theoretic approach provides a foundation for understanding and analyzing generalization in meta-learning under distribution shifts, with practical implications for meta-reinforcement learning algorithms.
Abstract: In this work, we study out-of-distribution generalization in meta-learning from an information-theoretic perspective. We focus on two scenarios: (i) when the testing environment mismatches the training environment, and (ii) when the training environment is broader than the testing environment. The first corresponds to the standard distribution mismatch setting, while the second reflects a broad-to-narrow training scenario. We further formalize the generalization problem in meta-reinforcement learning and establish corresponding generalization bounds. Finally, we analyze the generalization performance of a gradient-based meta-reinforcement learning algorithm.
[904] Schrodinger Neural Network and Uncertainty Quantification: Quantum Machine
M. M. Hammad
Main category: cs.LG
TL;DR: The Schrodinger Neural Network (SNN) is a novel architecture for conditional density estimation inspired by quantum mechanics, representing outputs as normalized wave functions and computing probabilities via the Born rule.
Details
Motivation: To create a principled framework for uncertainty quantification that ensures positivity and exact normalization by construction, supports native multimodality through interference effects, and provides efficient computation of statistical functionals.
Method: SNN maps inputs to complex-valued wave functions using spectral expansions (e.g., Chebyshev polynomials), with squared modulus yielding conditional densities. Training uses exact maximum likelihood with unit-sphere parameterization, physics-inspired regularizers, and scalable extensions for multivariate outputs.
Result: SNN provides a coherent framework that guarantees normalized densities, supports multimodal predictions without explicit mixture models, and enables efficient computation of moments and calibration diagnostics as quadratic forms.
Conclusion: The SNN elevates probabilistic prediction from point estimates to physically inspired amplitude-based distributions, offering a tractable and principled approach to uncertainty quantification with practical advantages for multimodal density estimation.
Abstract: We introduce the Schrodinger Neural Network (SNN), a principled architecture for conditional density estimation and uncertainty quantification inspired by quantum mechanics. The SNN maps each input to a normalized wave function on the output domain and computes predictive probabilities via the Born rule. The SNN departs from standard parametric likelihood heads by learning complex coefficients of a spectral expansion (e.g., Chebyshev polynomials) whose squared modulus yields the conditional density $p(y|x)=|\psi_x(y)|^2$ with analytic normalization. This representation confers three practical advantages: positivity and exact normalization by construction, native multimodality through interference among basis modes without explicit mixture bookkeeping, and closed-form (or efficiently computable) functionals – such as moments and several calibration diagnostics – expressible as quadratic forms in coefficient space. We develop the statistical and computational foundations of the SNN, including (i) training by exact maximum-likelihood with unit-sphere coefficient parameterization, (ii) physics-inspired quadratic regularizers (kinetic and potential energies) motivated by uncertainty relations between localization and spectral complexity, (iii) scalable low-rank and separable extensions for multivariate outputs, (iv) operator-based extensions that represent observables, constraints, and weak labels as self-adjoint matrices acting on the amplitude space, and (v) a comprehensive framework for evaluating multimodal predictions. The SNN provides a coherent, tractable framework to elevate probabilistic prediction from point estimates to physically inspired amplitude-based distributions.
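The Born-rule head is easy to prototype: complex Chebyshev coefficients define $\psi_x$, and $p(y|x)=|\psi_x(y)|^2$ after normalization. The sketch below normalizes numerically on a grid, whereas the paper derives the normalization analytically.

```python
# Born-rule density from a complex Chebyshev expansion, normalized
# numerically on a grid; random amplitudes stand in for network outputs.
import numpy as np
from numpy.polynomial import chebyshev

def density(coeffs, y):
    """p(y) = |sum_k c_k T_k(y)|^2 for y in [-1, 1]."""
    psi = chebyshev.chebval(y, coeffs)    # chebval accepts complex coeffs
    return np.abs(psi) ** 2

rng = np.random.default_rng(0)
c = rng.normal(size=6) + 1j * rng.normal(size=6)  # complex amplitudes

y = np.linspace(-1.0, 1.0, 4001)
Z = density(c, y).sum() * (y[1] - y[0])   # Riemann-sum normalizer
c = c / np.sqrt(Z)

p = density(c, y)
print(p.sum() * (y[1] - y[0]))  # ~1.0: a valid, possibly multimodal pdf
print((p < 0).any())            # False: non-negative by construction
```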
[905] SGFusion: Stochastic Geographic Gradient Fusion in Federated Learning
Khoa Nguyen, Khang Tran, NhatHai Phan, Cristian Borcea, Rouming Jin, Issa Khalil
Main category: cs.LG
TL;DR: SGFusion is a novel FL training algorithm that leverages geographic information by training separate models per zone and fusing gradients between similar zones using a hierarchical random graph with self-attention weights.
Details
Motivation: To leverage geographic information of mobile users in Federated Learning to better adapt models to local data and behaviors while enabling knowledge sharing between similar geographical zones.
Method: Maps mobile device data to geographical zones, trains one FL model per zone, models zone correlations as hierarchical random graph optimized by MCMC sampling, and fuses local gradients with sampled zones using self-attention weights.
Result: Significantly improves model utility in all countries tested without notable computational cost increase, with theoretical guarantees of convergence and upper-bounded expected errors.
Conclusion: SGFusion effectively leverages geographic information for improved FL performance through probabilistic gradient fusion between similar zones, achieving better utility while maintaining system scalability.
Abstract: This paper proposes Stochastic Geographic Gradient Fusion (SGFusion), a novel training algorithm to leverage the geographic information of mobile users in Federated Learning (FL). SGFusion maps the data collected by mobile devices onto geographical zones and trains one FL model per zone, which adapts well to the data and behaviors of users in that zone. SGFusion models the local data-based correlation among geographical zones as a hierarchical random graph (HRG) optimized by Markov Chain Monte Carlo sampling. At each training step, every zone fuses its local gradient with gradients derived from a small set of other zones sampled from the HRG. This approach enables knowledge fusion and sharing among geographical zones in a probabilistic and stochastic gradient fusion process with self-attention weights, such that “more similar” zones have “higher probabilities” of sharing gradients with “larger attention weights.” SGFusion remarkably improves model utility without introducing undue computational cost. Extensive theoretical and empirical results using a heart-rate prediction dataset collected across 6 countries show that models trained with SGFusion converge with upper-bounded expected errors and significantly improve utility in all countries compared to existing approaches without notable cost in system scalability.
[906] Differential Privacy as a Perk: Federated Learning over Multiple-Access Fading Channels with a Multi-Antenna Base Station
Hao Liang, Haifeng Wen, Kaishun Wu, Hong Xing
Main category: cs.LG
TL;DR: This paper demonstrates that differential privacy (DP) can be achieved in over-the-air federated learning (AirFL) without artificial noise injection, contrary to prior beliefs, by leveraging inherent channel impairments as a natural privacy mechanism.
Details
Motivation: Prior work assumed artificial noise was necessary for DP in AirFL, but this paper aims to show that inherent channel noise can provide DP as a 'perk' without compromising training performance.
Method: The authors study AirFL over multiple-access fading channels with multi-antenna base stations, derive novel convergent DP bounds under general bounded-domain assumptions, and optimize receive beamforming and power allocations to characterize convergence-privacy trade-offs.
Result: The paper proves DP can be achieved without artificial noise injection, provides explicit conditions where DP doesn’t compromise training, and validates findings through extensive numerical results.
Conclusion: Channel impairments in AirFL can serve as a natural source of differential privacy without requiring artificial noise, enabling optimal convergence-privacy trade-offs under general conditions.
Abstract: Federated Learning (FL) is a distributed learning paradigm that preserves privacy by eliminating the need to exchange raw data during training. In its prototypical edge instantiation with underlying wireless transmissions enabled by analog over-the-air computing (AirComp), referred to as \emph{over-the-air FL (AirFL)}, the inherent channel noise plays a unique role of \emph{frenemy} in the sense that it degrades training due to noisy global aggregation while providing a natural source of randomness for privacy-preserving mechanisms, formally quantified by \emph{differential privacy (DP)}. It remains, nevertheless, challenging to effectively harness such channel impairments, as prior art, under assumptions of either simple channel models or restricted types of loss functions, has mostly considered (local) DP enhancement with a single-round or non-convergent bound on privacy loss. In this paper, we study AirFL over multiple-access fading channels with a multi-antenna base station (BS) subject to user-level DP requirements. Despite a recent study, which claimed in similar settings that artificial noise (AN) must be injected to ensure DP in general, we demonstrate, on the contrary, that DP can be gained as a \emph{perk} even \emph{without} employing any AN. Specifically, we derive a novel bound on DP that converges under general bounded-domain assumptions on model parameters, along with a convergence bound with general smooth and non-convex loss functions. Next, we optimize over receive beamforming and power allocations to characterize the optimal convergence-privacy trade-offs, which also reveal explicit conditions in which DP is achievable without compromising training. Finally, our theoretical findings are validated by extensive numerical results.
[907] Adaptive Dual Prompting: Hierarchical Debiasing for Fairness-aware Graph Neural Networks
Yuhan Yang, Xingbo Fu, Jundong Li
Main category: cs.LG
TL;DR: ADPrompt is a fairness-aware graph prompting framework that adapts pre-trained GNNs to downstream tasks while mitigating bias through adaptive feature rectification and message calibration modules.
Details
Motivation: Current graph prompting methods focus on utility but overlook fairness concerns, as pre-trained GNNs produce discriminative representations due to inherent biases in downstream graph data attributes and structures.
Method: Proposes Adaptive Dual Prompting (ADPrompt) with two modules: Adaptive Feature Rectification that learns attribute prompts to suppress sensitive information at input layer, and Adaptive Message Calibration that generates structure prompts at each layer to adjust neighbor message flow.
Result: Extensive experiments on four datasets with four pre-training strategies show ADPrompt outperforms seven baseline methods on node classification tasks while enhancing fairness.
Conclusion: ADPrompt effectively bridges the objective gap between pre-training and downstream tasks while addressing fairness concerns through dual adaptive prompting mechanisms.
Abstract: In recent years, pre-training Graph Neural Networks (GNNs) through self-supervised learning on unlabeled graph data has emerged as a widely adopted paradigm in graph learning. Although the paradigm is effective for pre-training powerful GNN models, the objective gap often exists between pre-training and downstream tasks. To bridge this gap, graph prompting adapts pre-trained GNN models to specific downstream tasks with extra learnable prompts while keeping the pre-trained GNN models frozen. As recent graph prompting methods largely focus on enhancing model utility on downstream tasks, they often overlook fairness concerns when designing prompts for adaptation. In fact, pre-trained GNN models will produce discriminative node representations across demographic subgroups, as downstream graph data inherently contains biases in both node attributes and graph structures. To address this issue, we propose an Adaptive Dual Prompting (ADPrompt) framework that enhances fairness for adapting pre-trained GNN models to downstream tasks. To mitigate attribute bias, we design an Adaptive Feature Rectification module that learns customized attribute prompts to suppress sensitive information at the input layer, reducing bias at the source. Afterward, we propose an Adaptive Message Calibration module that generates structure prompts at each layer, which adjust the message from neighboring nodes to enable dynamic and soft calibration of the information flow. Finally, ADPrompt jointly optimizes the two prompting modules to adapt the pre-trained GNN while enhancing fairness. We conduct extensive experiments on four datasets with four pre-training strategies to evaluate the performance of ADPrompt. The results demonstrate that our proposed ADPrompt outperforms seven baseline methods on node classification tasks.
[908] T-REGS: Minimum Spanning Tree Regularization for Self-Supervised Learning
Julie Mordacq, David Loiseaux, Vicky Kalogeiton, Steve Oudot
Main category: cs.LG
TL;DR: T-REGS is a regularization framework for self-supervised learning that uses Minimum Spanning Tree length to simultaneously prevent dimensional collapse and promote distribution uniformity in learned representations.
Details
Motivation: Self-supervised learning needs to avoid dimensional collapse (where features occupy only low-dimensional subspace) and enhance uniformity of the induced distribution for effective representations.
Method: Introduces T-REGS, a regularization framework based on the length of Minimum Spanning Tree (MST) over learned representations, with theoretical analysis on compact Riemannian manifolds.
Result: Experiments on synthetic data and classical SSL benchmarks validate that T-REGS effectively enhances representation quality by addressing both dimensional collapse and distribution uniformity.
Conclusion: T-REGS provides a simple yet effective regularization approach that simultaneously mitigates dimensional collapse and promotes distribution uniformity in self-supervised learning.
Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for learning representations without labeled data, often by enforcing invariance to input transformations such as rotations or blurring. Recent studies have highlighted two pivotal properties for effective representations: (i) avoiding dimensional collapse, where the learned features occupy only a low-dimensional subspace, and (ii) enhancing uniformity of the induced distribution. In this work, we introduce T-REGS, a simple regularization framework for SSL based on the length of the Minimum Spanning Tree (MST) over the learned representation. We provide theoretical analysis demonstrating that T-REGS simultaneously mitigates dimensional collapse and promotes distribution uniformity on arbitrary compact Riemannian manifolds. Several experiments on synthetic data and on classical SSL benchmarks validate the effectiveness of our approach at enhancing representation quality.
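The regularized quantity itself is simple to compute with SciPy: the total edge length of the minimum spanning tree over a batch of embeddings, which is small for collapsed representations and large for uniform ones. A training implementation would additionally need a differentiable route to the same edge lengths; the sketch below only evaluates the quantity.

```python
# MST length over a batch of embeddings: short when the batch has
# collapsed to a low-dimensional subspace, long when it is spread out.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_length(embeddings):
    dists = squareform(pdist(embeddings))  # dense pairwise distances
    return minimum_spanning_tree(dists).sum()

rng = np.random.default_rng(0)
collapsed = rng.normal(size=(128, 16)) * np.array([1.0] + [1e-3] * 15)
uniform = rng.normal(size=(128, 16))
print(mst_length(collapsed), mst_length(uniform))  # collapsed is shorter
```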
[909] Learning to Reason Efficiently with Discounted Reinforcement Learning
Alex Ayoub, Kavosh Asadi, Dale Schuurmans, Csaba Szepesvári, Karim Bouyarmane
Main category: cs.LG
TL;DR: Penalizing reasoning tokens with discounted RL encourages concise reasoning while maintaining accuracy, challenging the assumption that longer responses improve performance.
Details
Motivation: Large reasoning models consume excessive tokens, increasing computational cost and latency, and the assumption that longer responses improve accuracy may be incorrect.
Method: Use discounted reinforcement learning to penalize reasoning tokens (interpretable as a small token cost) and analyze Blackwell optimality in restricted policy classes.
Result: Experiments confirm that this approach shortens chains of thought while preserving accuracy.
Conclusion: The method successfully encourages concise yet accurate reasoning, supporting the theoretical results.
Abstract: Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens using a discounted reinforcement learning setup (interpretable as a small token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning. Experiments confirm our theoretical results that this approach shortens chains of thought while preserving accuracy.
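The "discounting as token cost" interpretation follows from a first-order expansion: with terminal reward R after n reasoning tokens, the discounted return is gamma^n * R, which is approximately R - n(1-gamma)R when n(1-gamma) is small, i.e. an approximately linear per-token penalty. A toy numeric check:

```python
# Discounted return of a terminal reward R after n reasoning tokens,
# versus the linear "token cost" view R - n*(1-gamma)*R. The linear view
# is accurate while n*(1-gamma) stays small.
gamma, R = 0.999, 1.0

for n_tokens in (50, 200, 800):
    discounted = gamma ** n_tokens * R
    linear = R - n_tokens * (1.0 - gamma) * R
    print(n_tokens, round(discounted, 4), round(linear, 4))
```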
[910] Towards Deep Physics-Informed Kolmogorov-Arnold Networks
Spyros Rigas, Fotios Anagnostopoulos, Michalis Papachristou, Georgios Alexandridis
Main category: cs.LG
TL;DR: The paper proposes a new initialization scheme and Residual-Gated Adaptive KANs (RGA KANs) to address training instability in deep physics-informed KANs, achieving superior performance and stability on PDE benchmarks.
Details
Motivation: Deep Chebyshev-based physics-informed KANs (cPIKANs) face significant training instabilities when scaled to depth, limiting their applicability to various PDE problems.
Method: Proposed a basis-agnostic Glorot-like initialization scheme and introduced Residual-Gated Adaptive KANs (RGA KANs) inspired by PirateNet architecture to mitigate divergence in deep networks.
Result: RGA KANs consistently outperform parameter-matched cPIKANs and PirateNets by several orders of magnitude on seven standard forward PDE benchmarks, while remaining stable where others diverge.
Conclusion: The proposed RGA KANs successfully address training instability in deep physics-informed networks and demonstrate superior performance across multiple PDE problems.
Abstract: Since their introduction, Kolmogorov-Arnold Networks (KANs) have been successfully applied across several domains, with physics-informed machine learning (PIML) emerging as one of the areas where they have thrived. In the PIML setting, Chebyshev-based physics-informed KANs (cPIKANs) have become the standard due to their computational efficiency. However, like their multilayer perceptron-based counterparts, cPIKANs face significant challenges when scaled to depth, leading to training instabilities that limit their applicability to several PDE problems. To address this, we propose a basis-agnostic, Glorot-like initialization scheme that preserves activation variance and yields substantial improvements in stability and accuracy over the default initialization of cPIKANs. Inspired by the PirateNet architecture, we further introduce Residual-Gated Adaptive KANs (RGA KANs), designed to mitigate divergence in deep cPIKANs where initialization alone is not sufficient. Through empirical tests and information bottleneck analysis, we show that RGA KANs successfully traverse all training phases, unlike baseline cPIKANs, which stagnate in the diffusion phase in specific PDE settings. Evaluations on seven standard forward PDE benchmarks under a fixed training pipeline with adaptive components demonstrate that RGA KANs consistently outperform parameter-matched cPIKANs and PirateNets - often by several orders of magnitude - while remaining stable in settings where the others diverge.
[911] Sequential Multi-Agent Dynamic Algorithm Configuration
Chen Lu, Ke Xue, Lei Yuan, Yao Wang, Yaoyuan Wang, Sheng Fu, Chao Qian
Main category: cs.LG
TL;DR: Seq-MADAC is a sequential multi-agent reinforcement learning framework for dynamic algorithm configuration that addresses parameter interdependencies through sequential advantage decomposition.
Details
Motivation: Existing multi-agent reinforcement learning approaches for dynamic algorithm configuration ignore inherent inter-dependencies among multiple parameters in complex algorithms, leading to sub-optimal results.
Method: Proposed sequential multi-agent DAC framework with sequential advantage decomposition network that leverages action-order information through sequential advantage decomposition.
Result: Experiments show superior performance over state-of-the-art MARL methods on synthetic functions and multi-objective optimization algorithm configuration, with strong generalization across problem classes.
Conclusion: Seq-MADAC establishes a new paradigm for dependency-aware automated algorithm configuration and demonstrates effective handling of parameter interdependencies.
Abstract: Dynamic algorithm configuration (DAC) is a recent trend in automated machine learning, which can dynamically adjust the algorithm’s configuration during the execution process and relieve users from tedious trial-and-error tuning tasks. Recently, multi-agent reinforcement learning (MARL) approaches have improved the configuration of multiple heterogeneous hyperparameters, making various parameter configurations for complex algorithms possible. However, many complex algorithms have inherent inter-dependencies among multiple parameters (e.g., determining the operator type first and then the operator’s parameter), which are not considered in previous approaches, thus leading to sub-optimal results. In this paper, we propose the sequential multi-agent DAC (Seq-MADAC) framework to address this issue by considering the inherent inter-dependencies of multiple parameters. Specifically, we propose a sequential advantage decomposition network, which can leverage action-order information through sequential advantage decomposition. Experiments from synthetic functions to the configuration of multi-objective optimization algorithms demonstrate Seq-MADAC’s superior performance over state-of-the-art MARL methods and show strong generalization across problem classes. Seq-MADAC establishes a new paradigm for widespread, dependency-aware automated algorithm configuration. Our code is available at https://github.com/lamda-bbo/seq-madac.
[912] Lightweight Robust Direct Preference Optimization
Cheol Woo Kim, Shresth Verma, Mauricio Tec, Milind Tambe
Main category: cs.LG
TL;DR: DPO-PRO is a robust fine-tuning algorithm that addresses DPO’s sensitivity to noise and overfitting through a lightweight distributionally robust optimization approach focused on preference uncertainty.
Details
Motivation: DPO is popular but sensitive to noise and prone to overfitting. Existing DRO methods are often too conservative and computationally expensive.
Method: DPO-PRO uses a lightweight DRO formulation that focuses solely on uncertainty in preferences, avoiding unnecessary conservatism with negligible computational overhead.
Result: DPO-PRO consistently improves robustness to noisy preference signals compared to existing DPO variants on standard alignment benchmarks and a real-world public health task.
Conclusion: DPO-PRO effectively addresses DPO’s limitations by providing a computationally efficient robust fine-tuning method that penalizes model overconfidence under weak preference signals.
Abstract: Direct Preference Optimization (DPO) has become a popular method for fine-tuning large language models (LLMs) due to its stability and simplicity. However, it is also known to be sensitive to noise in the data and prone to overfitting. Recent works have proposed using distributionally robust optimization (DRO) to address potential noise and distributional shift in the data. However, these methods often suffer from excessive conservatism and high computational cost. We propose DPO-PRO (DPO with Preference Robustness), a robust fine-tuning algorithm based on DPO which accounts for uncertainty in the preference distribution through a lightweight DRO formulation. Unlike prior DRO-based variants, DPO-PRO focuses solely on uncertainty in preferences, avoiding unnecessary conservatism and incurring negligible computational overhead. We further show that DPO-PRO is equivalent to a regularized DPO objective that penalizes model overconfidence under weak preference signals. We evaluate DPO-PRO on standard alignment benchmarks and a real-world public health task. Experimental results show that our method consistently improves robustness to noisy preference signals compared to existing DPO variants.
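For context, the standard DPO loss is shown below with a hypothetical per-pair weight as a stand-in for DPO-PRO's DRO-style reweighting; the paper's exact formulation differs.

```python
# Standard DPO loss, plus a hypothetical per-pair robustness weight (an
# assumption for illustration, not DPO-PRO's actual objective).
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected,
             beta=0.1, pair_weights=None):
    # Implicit reward margin relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    losses = -F.logsigmoid(margin)
    if pair_weights is not None:     # hypothetical: up-weight uncertain pairs
        losses = losses * pair_weights
    return losses.mean()

lp_c = torch.tensor([-10.0, -12.0]); lp_r = torch.tensor([-11.0, -11.5])
ref_c = torch.tensor([-10.5, -12.1]); ref_r = torch.tensor([-10.8, -11.4])
print(dpo_loss(lp_c, lp_r, ref_c, ref_r).item())
```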
[913] Representer Theorems for Metric and Preference Learning: Geometric Insights and Algorithms
Peyman Morteza
Main category: cs.LG
TL;DR: A mathematical framework for metric and preference learning in Hilbert spaces with a novel representer theorem that enables efficient solutions through kernel methods and achieves competitive performance on rank inference benchmarks.
Details
Motivation: To address metric and preference learning problems within a unified mathematical framework in Hilbert spaces, providing geometric insights and efficient computational solutions.
Method: Developed a mathematical framework with a novel representer theorem for simultaneous metric and preference learning, using regularization with task structure norms and extending to RKHS for finite kernel expressions.
Result: The framework leads to a new nonlinear algorithm that achieves competitive performance on real-world rank inference benchmarks, significantly outperforming vanilla ideal point methods and strong baselines across multiple datasets.
Conclusion: The proposed mathematical framework provides a unified approach to metric and preference learning with theoretical foundations and practical effectiveness, offering geometric insights and competitive performance compared to existing methods.
Abstract: We develop a mathematical framework to address a broad class of metric and preference learning problems within a Hilbert space. We obtain a novel representer theorem for the simultaneous task of metric and preference learning. Our key observation is that the representer theorem for this task can be derived by regularizing the problem with respect to the norm inherent in the task structure. For the general task of metric learning, our framework leads to a simple and self-contained representer theorem and offers new geometric insights into the derivation of representer theorems for this task. In the case of Reproducing Kernel Hilbert Spaces (RKHSs), we illustrate how our representer theorem can be used to express the solution of the learning problems in terms of finite kernel terms similar to classical representer theorems. Lastly, our representer theorem leads to a novel nonlinear algorithm for metric and preference learning. We compare our algorithm against challenging baseline methods on real-world rank inference benchmarks, where it achieves competitive performance. Notably, our approach significantly outperforms vanilla ideal point methods and surpasses strong baselines across multiple datasets. Code available at: https://github.com/PeymanMorteza/Metric-Preference-Learning-RKHS
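For orientation, the classical RKHS representer theorem that the abstract alludes to takes the following generic form; the paper's result extends this style of statement to simultaneous metric and preference learning with task-structure norms, which the version below does not capture.

```latex
\min_{f \in \mathcal{H}_k} \; \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)
  + \lambda \,\|f\|_{\mathcal{H}_k}^{2}
\quad\Longrightarrow\quad
\hat{f}(\cdot) = \sum_{i=1}^{n} \alpha_i \, k(\cdot, x_i),
\qquad \alpha_1, \dots, \alpha_n \in \mathbb{R}.
```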
[914] Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection
Jiajun Fan, Yuzheng Zhuang, Yuecheng Liu, Jianye Hao, Bin Wang, Jiangcheng Zhu, Hao Wang, Shu-Tao Xia
Main category: cs.LG
TL;DR: The Learnable Behavioral Control (LBC) framework enhances exploration in deep RL by enlarging the behavior selection space through hybrid behavior mappings and learnable selection processes, achieving state-of-the-art performance in the Arcade Learning Environment.
Details
Motivation: To address the limited behavior diversity in population-based exploration methods caused by predefined policy populations, which restricts the behavior selection space.
Method: Proposes LBC framework that creates hybrid behavior mappings from all policies and uses bandit-based meta-controllers for learnable behavior selection in distributed off-policy actor-critic methods.
Result: Achieved 10077.52% mean human normalized score and surpassed 24 human world records within 1B training frames in the Arcade Learning Environment, demonstrating state-of-the-art performance without degrading sample efficiency.
Conclusion: LBC successfully addresses exploration limitations by significantly enlarging behavior selection space and enabling learnable behavior control, achieving superior performance in complex environments.
Abstract: The exploration problem is one of the main challenges in deep reinforcement learning (RL). Recent promising works tried to handle the problem with population-based methods, which collect samples with diverse behaviors derived from a population of different exploratory policies. Adaptive policy selection has been adopted for behavior control. However, the behavior selection space is largely limited by the predefined policy population, which further limits behavior diversity. In this paper, we propose a general framework called Learnable Behavioral Control (LBC) to address the limitation, which a) enables a significantly enlarged behavior selection space via formulating a hybrid behavior mapping from all policies; b) constructs a unified learnable process for behavior selection. We introduce LBC into distributed off-policy actor-critic methods and achieve behavior control via optimizing the selection of the behavior mappings with bandit-based meta-controllers. Our agents have achieved 10077.52% mean human normalized score and surpassed 24 human world records within 1B training frames in the Arcade Learning Environment, which demonstrates our significant state-of-the-art (SOTA) performance without degrading the sample efficiency.
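The summary leaves the meta-controller unspecified; a minimal sketch of a bandit-based selector over a discrete set of behavior mappings could look as follows. UCB1 is an assumed choice here, and the reward signal (e.g. the episodic return achieved by the selected behavior) is also an assumption.

```python
import math

class UCBMetaController:
    """Minimal UCB1 bandit over a discrete set of behavior mappings.

    A stand-in sketch: the paper's meta-controller and its exact reward
    signal are not specified in this summary.
    """
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
        self.t = 0

    def select(self):
        self.t += 1
        for arm, count in enumerate(self.counts):
            if count == 0:              # try each mapping once first
                return arm
        return max(range(len(self.counts)),
                   key=lambda a: self.values[a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```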
[915] Graph Neural Architecture Search with GPT-4
Haishuai Wang, Yang Gao, Xin Zheng, Peng Zhang, Jiajun Bu, Philip S. Yu
Main category: cs.LG
TL;DR: GNAS-LLM integrates Large Language Models into Graph Neural Architecture Search to automate the design process, achieving state-of-the-art performance on benchmark datasets.
Details
Motivation: Existing GNAS methods require intensive human labor and domain knowledge for designing search spaces and strategies, which limits their accessibility and efficiency.
Method: Uses LLMs with specialized GNAS prompts that describe search space, strategy, and feedback, iteratively refining graph neural network architectures through multiple LLM runs.
Result: Outperforms state-of-the-art GNAS methods with average improvements of 0.7% on validation sets and 0.3% on test sets across four benchmark datasets, plus a 1.0% improvement using the AutoGEL search space.
Conclusion: GNAS-LLM successfully automates graph neural architecture search using LLMs, demonstrating superior performance and faster convergence compared to traditional methods.
Abstract: Graph Neural Architecture Search (GNAS) has shown promising results in finding the best graph neural network architecture on a given graph dataset. However, existing GNAS methods still require intensive human labor and rich domain knowledge when designing the search space and search strategy. To this end, we integrate Large Language Models (LLMs) into GNAS and present a new GNAS model based on LLMs (GNAS-LLM for short). The basic idea of GNAS-LLM is to design a new class of GNAS prompts for LLMs to guide LLMs towards understanding the generative task of graph neural architectures. The prompts consist of descriptions of the search space, search strategy, and search feedback of GNAS. By iteratively running LLMs with the prompts, GNAS-LLM generates more accurate graph neural network architectures with fast convergence. Experimental results show that GNAS-LLM outperforms the state-of-the-art GNAS methods on four benchmark graph datasets, with an average improvement of 0.7% on the validation sets and 0.3% on the test sets. Besides, GNAS-LLM achieves an average improvement of 1.0% on the test sets based on the search space from AutoGEL.
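A minimal sketch of the iterative prompt-and-evaluate loop the abstract describes; query_llm and evaluate are hypothetical callables supplied by the user, and the prompt fields (search space, strategy, feedback) only mirror the abstract's description.

```python
def gnas_llm_search(query_llm, evaluate, n_rounds=10):
    """Iterative LLM-driven architecture search, sketched from the abstract.

    query_llm(prompt) -> architecture description, and
    evaluate(arch) -> validation score, are hypothetical callables.
    """
    history, best = [], (None, float("-inf"))
    for _ in range(n_rounds):
        prompt = (
            "Search space: GNN layer types, aggregators, hidden sizes.\n"
            "Strategy: propose one architecture likely to improve on the "
            "feedback below.\n"
            f"Feedback (architecture, val score): {history[-5:]}\n"
        )
        arch = query_llm(prompt)
        score = evaluate(arch)
        history.append((arch, score))
        if score > best[1]:
            best = (arch, score)
    return best
```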
[916] Diffusion Models Meet Contextual Bandits
Imad Aouali
Main category: cs.LG
TL;DR: Leveraging pre-trained diffusion models as priors for efficient decision-making in contextual bandits with large action spaces.
Details
Motivation: Address computational and statistical inefficiencies in contextual bandits with large action spaces by utilizing diffusion models as informative priors.
Method: Develop practical algorithms to approximate posteriors under diffusion priors, enabling flexible decision-making strategies in contextual bandits.
Result: Empirical evaluations demonstrate the approach’s effectiveness and versatility across diverse contextual bandit settings.
Conclusion: Diffusion-based priors provide an effective framework for handling complex action distributions in contextual bandits with large action spaces.
Abstract: Efficient decision-making in contextual bandits with large action spaces is challenging, as methods lacking additional prior information may suffer from computational and statistical inefficiencies. In this work, we leverage pre-trained diffusion models as priors to capture complex action distributions and introduce a diffusion-based decision framework for contextual bandits. We develop practical algorithms to efficiently approximate posteriors under diffusion priors, enabling flexible decision-making strategies. Empirical evaluations demonstrate the effectiveness and versatility of our approach across diverse contextual bandit settings.
[917] REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning
Sungho Jeon, Xinyue Ma, Kwang In Kim, Myeongjae Jeon
Main category: cs.LG
TL;DR: REP improves computational and memory efficiency of prompt-based continual learning methods while maintaining accuracy through swift prompt selection, adaptive token merging, and adaptive layer dropping.
Details
Motivation: Current rehearsal-free continual learning methods using prompts achieve good performance but are resource-intensive, limiting their deployment on edge devices.
Method: Uses swift prompt selection to refine input data, adaptive token merging (AToM) to skip redundant data tokens, and adaptive layer dropping (ALD) to skip model layers while preserving task-specific features.
Result: Extensive experiments on image classification datasets show REP achieves superior resource efficiency compared to state-of-the-art rehearsal-free continual learning methods.
Conclusion: REP provides a resource-efficient solution for prompt-based continual learning that minimizes accuracy trade-offs while improving computational and memory efficiency.
Abstract: Recent rehearsal-free continual learning (CL) methods guided by prompts achieve strong performance on vision tasks with non-stationary data but remain resource-intensive, hindering real-world edge deployment. We introduce resource-efficient prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free continual learning methods while minimizing accuracy trade-offs. Our approach employs swift prompt selection to refine input data using a carefully provisioned model and introduces adaptive token merging (AToM) and adaptive layer dropping (ALD) for efficient prompt updates. AToM and ALD selectively skip data and model layers while preserving task-specific features during the learning of new tasks. Extensive experiments on multiple image classification datasets demonstrate REP’s superior resource efficiency over state-of-the-art rehearsal-free CL methods.
[918] R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models
Aladin Djuhera, Vlad C. Andrei, Xinyang Li, Ullrich J. Mönich, Holger Boche, Walid Saad
Main category: cs.LG
TL;DR: Proposes R-SFLLM, a resilient split federated learning framework that uses wireless sensing to detect jamming and jointly optimizes beamforming, scheduling, and resource allocation to protect embedding parameters in large language models.
Details
Motivation: Address vulnerability of split federated learning to adversarial jamming attacks on wireless channels, particularly for embedding parameters in LLMs/VLMs that are crucial for domain understanding.
Method: Develops physical layer framework using wireless sensing data to detect jamming DoAs, implements sensing-assisted anti-jamming strategy with joint optimization of beamforming, user scheduling, and resource allocation, plus adversarial training with controlled noise exposure.
Result: Achieves close-to-baseline performance across various NLP and CV tasks, datasets, and modalities. More noise-sensitive models like RoBERTa benefit significantly from adversarial training, especially under unfair resource allocation.
Conclusion: Worst-case jamming leads to worst-case model outcomes, necessitating jamming-resilient SFL protocols. The proposed R-SFLLM effectively protects embedding parameters and maintains learning performance under jamming attacks.
Abstract: Split federated learning (SFL) is a compute-efficient paradigm in distributed machine learning (ML), where components of large ML models are outsourced to remote servers. A significant challenge in SFL, particularly when deployed over wireless channels, is the susceptibility of transmitted model parameters to adversarial jamming that could jeopardize the learning process. This is particularly pronounced for embedding parameters in large language models (LLMs) and vision language models (VLMs), which are learned feature vectors essential for domain understanding. In this paper, rigorous insights are provided into the influence of jamming embeddings in SFL by deriving an expression for the ML training loss divergence and showing that it is upper-bounded by the mean squared error (MSE). Based on this analysis, a physical layer framework is developed for resilient SFL with LLMs (R-SFLLM) over wireless networks. R-SFLLM leverages wireless sensing data to gather information on the jamming directions-of-arrival (DoAs) for the purpose of devising a novel, sensing-assisted anti-jamming strategy while jointly optimizing beamforming, user scheduling, and resource allocation. Extensive experiments using both LLMs and VLMs demonstrate R-SFLLM’s effectiveness, achieving close-to-baseline performance across various natural language processing (NLP) and computer vision (CV) tasks, datasets, and modalities. The proposed methodology further introduces an adversarial training component, where controlled noise exposure significantly enhances the model’s resilience to perturbed parameters during training. The results show that more noise-sensitive models, such as RoBERTa, benefit from this feature, especially when resource allocation is unfair. It is also shown that worst-case jamming in particular translates into worst-case model outcomes, thereby necessitating jamming-resilient SFL protocols.
[919] Centralized Reward Agent for Knowledge Sharing and Transfer in Multi-Task Reinforcement Learning
Haozhe Ma, Zhengding Luo, Thanh Vinh Vo, Kuankuan Sima, Tze-Yun Leong
Main category: cs.LG
TL;DR: A multi-task RL framework with centralized reward agent and distributed policy agents that uses reward shaping to share knowledge across tasks and transfer to new tasks.
Details
Motivation: Address sparse-reward challenges in RL and improve learning efficiency through knowledge sharing across multiple tasks.
Method: Propose a framework with centralized reward agent (CRA) as knowledge pool that distills knowledge from tasks and distributes shaped rewards to policy agents.
Result: Validated on discrete/continuous domains including Meta-World, showing robustness in multi-task sparse-reward settings and effective transferability to unseen tasks.
Conclusion: The framework successfully enhances knowledge sharing and adapts to new tasks through reward signal transfer.
Abstract: Reward shaping is effective in addressing the sparse-reward challenge in reinforcement learning (RL) by providing immediate feedback through auxiliary, informative rewards. Based on the reward shaping strategy, we propose a novel multi-task reinforcement learning framework that integrates a centralized reward agent (CRA) and multiple distributed policy agents. The CRA functions as a knowledge pool, aimed at distilling knowledge from various tasks and distributing it to individual policy agents to improve learning efficiency. Specifically, the shaped rewards serve as a straightforward metric for encoding knowledge. This framework not only enhances knowledge sharing across established tasks but also adapts to new tasks by transferring meaningful reward signals. We validate the proposed method on both discrete and continuous domains, including the representative Meta-World benchmark, demonstrating its robustness in multi-task sparse-reward settings and its effective transferability to unseen tasks.
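A minimal sketch of how a centralized reward agent shared across tasks could be structured; the shaping network, the task-ID embedding, and the additive use of the shaped reward are assumptions, since the summary does not specify the CRA architecture.

```python
import torch
import torch.nn as nn

class CentralRewardAgent(nn.Module):
    """Sketch of a centralized reward agent shared across tasks.

    Architecture details here are assumptions; the paper's CRA is not
    specified in this summary.
    """
    def __init__(self, state_dim, n_tasks, hidden=128):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, hidden)
        self.net = nn.Sequential(
            nn.Linear(state_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def shaped_reward(self, state, task_id):
        z = self.task_emb(task_id)
        return self.net(torch.cat([state, z], dim=-1)).squeeze(-1)

# Each policy agent would then train on r_env + lam * shaped_reward(s, task).
```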
[920] Painless Federated Learning: An Interplay of Line-Search and Extrapolation
Geetika, Somya Tyagi, Bapi Chatterjee
Main category: cs.LG
TL;DR: FedSLS and FedExpSLS are federated learning algorithms that use stochastic line search to handle client heterogeneity and gradient noise, achieving deterministic convergence rates and performing competitively across various problems.
Details
Motivation: Client heterogeneity in federated learning slows down global convergence, and existing methods don't adequately address this issue while also handling data-sampling noise.
Method: Proposes Federated Stochastic Line Search (FedSLS), which adapts classical line search to the federated setting, and extends it to Federated Extrapolated Stochastic Line Search (FedExpSLS) to benefit from server learning-rate extrapolation.
Result: FedSLS achieves linear convergence for strongly convex objectives even with partial client participation, and both methods perform at par or better than popular federated learning algorithms on convex and non-convex problems.
Conclusion: Stochastic line search effectively tames both client heterogeneity and gradient noise in federated optimization, with extrapolation further improving empirical performance.
Abstract: The classical line search for learning rate (LR) tuning in the stochastic gradient descent (SGD) algorithm can tame the convergence slowdown due to data-sampling noise. In a federated setting, wherein the client heterogeneity introduces a slowdown to the global convergence, line search can be relevantly adapted. In this work, we show that a stochastic variant of line search tames the heterogeneity in federated optimization in addition to that due to client-local gradient noise. To this end, we introduce Federated Stochastic Line Search (FedSLS) algorithm and show that it achieves deterministic rates in expectation. Specifically, FedSLS offers linear convergence for strongly convex objectives even with partial client participation. Recently, the extrapolation of the server’s LR has shown promise for improved empirical performance for federated learning. To benefit from extrapolation, we extend FedSLS to Federated Extrapolated Stochastic Line Search (FedExpSLS) and prove its convergence. Our extensive empirical results show that the proposed methods perform at par or better than the popular federated learning algorithms across many convex and non-convex problems.
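The building block here is a stochastic Armijo line search on the client's mini-batch loss. A generic sketch of that primitive follows; the federated aggregation, partial participation, and the paper's exact acceptance rule are omitted.

```python
import torch

def armijo_local_step(model, loss_fn, batch, lr0=1.0, c=0.5, shrink=0.7,
                      max_tries=20):
    """One client-side step with a stochastic Armijo line search.

    A generic sketch of the primitive FedSLS builds on, not the paper's
    exact rule. loss_fn(model, batch) returns a scalar loss tensor.
    """
    loss = loss_fn(model, batch)
    model.zero_grad()
    loss.backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    g_norm_sq = sum(g.pow(2).sum() for g in grads)
    params0 = [p.detach().clone() for p in model.parameters()]

    lr = lr0
    with torch.no_grad():
        for _ in range(max_tries):
            for p, p0, g in zip(model.parameters(), params0, grads):
                p.copy_(p0 - lr * g)
            # Armijo sufficient-decrease check on the same mini-batch.
            if loss_fn(model, batch) <= loss - c * lr * g_norm_sq:
                break
            lr *= shrink
    return lr
```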
[921] FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim
Main category: cs.LG
TL;DR: FastKV is a KV cache compression framework that reduces latency in both prefill and decoding stages by leveraging token importance stabilization in later layers, achieving significant speedups while maintaining accuracy.
Details
Motivation: Current KV cache compression methods tie prefill compute reduction to decoding KV budget, causing accuracy degradation due to overlooking layer-dependent variation of critical context.
Method: Uses Token-Selective Propagation (TSP) layer to forward only informative tokens, then independently selects salient KV entries for caching, decoupling KV budget from prefill compute reduction.
Result: Achieves speedups of 1.82× in prefill and 2.87× in decoding compared to full-context baseline, while matching accuracy of baselines that only accelerate decoding.
Conclusion: FastKV provides flexible optimization of efficiency and accuracy through independent control of TSP rate and KV retention rate, effectively reducing computational burden in LLM inference.
Abstract: While large language models (LLMs) excel at handling long-context sequences, they require substantial prefill computation and key-value (KV) cache, which can heavily burden computational efficiency and memory usage in both prefill and decoding stages. Recent works that compress KV caches with prefill acceleration reduce this cost but inadvertently tie the prefill compute reduction to the decoding KV budget. This coupling arises from overlooking the layer-dependent variation of critical context, often leading to accuracy degradation. To address this issue, we introduce FastKV, a KV cache compression framework designed to reduce latency in both prefill and decoding by leveraging the stabilization of token importance in later layers. FastKV performs full-context computation until a Token-Selective Propagation (TSP) layer, which forwards only the most informative tokens to subsequent layers. From these propagated tokens, FastKV independently selects salient KV entries for caching, thereby decoupling KV budget from the prefill compute reduction based on the TSP decision. This independent control of the TSP rate and KV retention rate enables flexible optimization of efficiency and accuracy. Experimental results show that FastKV achieves speedups of up to 1.82$\times$ in prefill and 2.87$\times$ in decoding compared to the full-context baseline, while matching the accuracy of the baselines that only accelerate the decoding stage. Our code is available at https://github.com/dongwonjo/FastKV.
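A minimal sketch of a TSP-style selection step: only the highest-scoring tokens are forwarded past the TSP layer. Scoring by accumulated attention received per token is an assumed proxy; the paper's scoring rule is not given in this summary.

```python
import torch

def token_selective_propagation(hidden, scores, keep_ratio=0.25):
    """Sketch of a TSP-style selection step.

    hidden : (seq, dim) hidden states entering the TSP layer.
    scores : (seq,) per-token importance, e.g. accumulated attention
             received by each token (an assumed proxy).
    Returns indices and hidden states of the tokens forwarded to all
    subsequent layers.
    """
    k = max(1, int(keep_ratio * hidden.size(0)))
    keep = torch.topk(scores, k).indices.sort().values  # preserve order
    return keep, hidden[keep]
```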
[922] AI for Water Sustainability: Global Water Quality Assessment and Prediction with Explainable AI with LLM Chatbot for Insights
Biplov Paneru, Bishwash Paneru
Main category: cs.LG
TL;DR: Hybrid deep learning models (CatBoost, XGBoost, Extra Trees, CNN-LSTM) effectively predict water quality using 2.82M records from multiple countries, achieving high accuracy (99%) and low RMSE (1.2) for WQI prediction.
Details
Motivation: To enable effective water quality monitoring in developing countries like Nepal by developing accurate predictive models that can help prevent contamination risks through proactive control measures.
Method: Used hybrid deep learning models including CatBoost, XGBoost, Extra Trees, and neural networks combining CNN and LSTM layers to capture temporal and spatial patterns in 2.82 million water quality records from Canada, China, UK, USA, and Ireland.
Result: Models achieved excellent performance: CatBoost, XGBoost, and Extra Trees Regressor predicted WQI with average RMSE of 1.2 and R² score of 0.99. Classifiers achieved 99% accuracy with cross-validation. SHAP analysis identified key indicators like F.R.C. and orthophosphate levels.
Conclusion: The hybrid deep learning approach provides highly accurate water quality prediction, enabling proactive monitoring and control, with practical applications demonstrated through a chatbot for water quality insights.
Abstract: Ensuring safe water supplies requires effective water quality monitoring, especially in developing countries like Nepal, where contamination risks are high. This paper introduces various hybrid deep learning models for prediction on the CCME dataset, which spans multiple water quality parameters from Canada, China, the UK, the USA, and Ireland; 2.82 million data records are feature-engineered and evaluated with these models. Models such as CatBoost, XGBoost, and Extra Trees, along with neural networks combining CNN and LSTM layers, are used to capture temporal and spatial patterns in the data. The models demonstrated notable accuracy improvements, aiding proactive water quality control. CatBoost, XGBoost, and Extra Trees Regressor predicted Water Quality Index (WQI) values with an average RMSE of 1.2 and an R squared score of 0.99. Additionally, classifiers achieved 99% accuracy, cross-validated across models. SHAP analysis showed the importance of indicators like F.R.C. and orthophosphate levels in hybrid architectures’ classification decisions. A chatbot application for water quality insights demonstrates the practical application.
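As a point of reference, the tree-ensemble part of such a pipeline reduces to standard regression on engineered features. A runnable sketch on synthetic placeholder data follows; the CCME schema and the paper's feature engineering are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Placeholder features/targets standing in for engineered water-quality
# records and WQI values (the real dataset is not reproduced here).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = ExtraTreesRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("RMSE:", mean_squared_error(y_te, pred) ** 0.5,
      "R2:", r2_score(y_te, pred))
```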
[923] Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning
Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Mingyu Gao
Main category: cs.LG
TL;DR: Twilight is a framework that adaptively prunes redundant tokens in attention mechanisms using top-p sampling, achieving up to 98% token pruning and significant speedup in long-context LLM decoding.
Details
Motivation: Current sparse attention and KV cache compression methods use fixed budgets, which fail to adapt to dynamic real-world scenarios where optimal accuracy-efficiency balance varies.
Method: Proposes Twilight framework that applies top-p sampling (nucleus sampling) to sparse attention to achieve adaptive budgeting, compatible with any existing sparse attention algorithm.
Result: Achieves up to 98% token pruning, 15.4× acceleration in self-attention operations, and 3.9× acceleration in end-to-end per token latency for long-context LLM decoding.
Conclusion: Twilight successfully brings adaptive sparsity to sparse attention algorithms without sacrificing accuracy, enabling dynamic optimization of computational efficiency in real-world deployments.
Abstract: Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per token latency in long context LLM decoding.
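The core primitive is easy to state: for each query, keep the smallest set of keys whose attention mass reaches p. A minimal per-query sketch follows; Twilight's hierarchical, batched pruning is more involved.

```python
import torch

def top_p_attention_mask(attn_probs, p=0.98):
    """Keep the smallest set of keys whose attention mass reaches p.

    attn_probs: (n_keys,) softmax-normalized attention of one query.
    A minimal sketch of nucleus-style budgeting applied to attention.
    """
    sorted_probs, order = torch.sort(attn_probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=0)
    n_keep = int(torch.searchsorted(cum, torch.tensor(p)).item()) + 1
    n_keep = min(n_keep, attn_probs.numel())
    mask = torch.zeros_like(attn_probs, dtype=torch.bool)
    mask[order[:n_keep]] = True
    return mask
```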
[924] MIBP-Cert: Certified Training against Data Perturbations with Mixed-Integer Bilinear Programs
Tobias Lorenz, Marta Kwiatkowska, Mario Fritz
Main category: cs.LG
TL;DR: MIBP-Cert is a novel certification method using mixed-integer bilinear programming to provide provable robustness against data errors, corruptions, and poisoning attacks during training.
Details
Motivation: Data errors, corruptions, and poisoning attacks threaten AI system reliability, requiring principled approaches to understand how perturbations influence models and ensure robust learning.
Method: Uses mixed-integer bilinear programming (MIBP) to compute sound, deterministic bounds on reachable parameters through perturbed data, with a novel relaxation scheme to make optimization tractable while maintaining soundness.
Result: Demonstrates applicability to continuous and discrete data under various threat models, including complex ones previously out of reach for certification methods.
Conclusion: MIBP-Cert provides a principled, provable approach for robust learning that can handle evolving attacks and complex data scenarios with guaranteed robustness.
Abstract: Data errors, corruptions, and poisoning attacks during training pose a major threat to the reliability of modern AI systems. While extensive effort has gone into empirical mitigations, the evolving nature of attacks and the complexity of data require a more principled, provable approach to robustly learn on such data - and to understand how perturbations influence the final model. Hence, we introduce MIBP-Cert, a novel certification method based on mixed-integer bilinear programming (MIBP) that computes sound, deterministic bounds to provide provable robustness even under complex threat models. By computing the set of parameters reachable through perturbed or manipulated data, we can predict all possible outcomes and guarantee robustness. To make solving this optimization problem tractable, we propose a novel relaxation scheme that bounds each training step without sacrificing soundness. We demonstrate the applicability of our approach to continuous and discrete data, as well as different threat models - including complex ones that were previously out of reach.
[925] COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection
Jaewon Cheon, Pilsung Kang
Main category: cs.LG
TL;DR: Proposes COUNTDOWN methods for sparse activation in LLMs, reducing FFNN computations by up to 90% with minimal performance loss through linear combination analysis of internal matrices.
Details
Motivation: Address computational inefficiencies in large language models by selectively deactivating non-essential parameters during inference to reduce computational costs in FFNN layers.
Method: Two methods: M-COUNTDOWN uses indirect coefficients and D-COUNTDOWN uses direct coefficients of linear combinations in FFNN layers’ internal down projection matrices to achieve sparsity.
Result: D-COUNTDOWN omits 90% of computations with only 5.5% performance loss; M-COUNTDOWN provides predictor-free solution with 29.4% better performance preservation than existing methods; kernel implementations achieve real-world acceleration.
Conclusion: Linear combination analysis of FFNN layers enables effective sparse activation methods that significantly reduce computational costs while maintaining model performance, with specialized kernels translating theoretical gains into practical acceleration.
Abstract: The growing size of large language models has created significant computational inefficiencies. To address this challenge, sparse activation methods selectively deactivate non-essential parameters during inference, reducing computational costs in FFNN layers. While existing methods focus on non-linear gating mechanisms, we hypothesize that the sparsity of the FFNN layer lies globally in the form of a linear combination over its internal down projection matrix. Based on this insight, we propose two methods: M-COUNTDOWN, leveraging indirect coefficients, and D-COUNTDOWN, utilizing direct coefficients of the linear combination. Experimental results demonstrate that D-COUNTDOWN can omit 90% of computations with performance loss as low as 5.5% ideally, while M-COUNTDOWN provides a predictor-free solution with up to 29.4% better performance preservation compared to existing methods. Our specialized kernel implementations effectively translate these theoretical gains into substantial real-world acceleration.
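A minimal sketch of the linear-combination view: the FFNN output is a weighted sum of down-projection rows, so low-magnitude terms can be dropped. Scoring each unit by the activation magnitude times the norm of its down-projection row is an assumed instantiation of the "direct coefficient" idea, not the paper's exact rule.

```python
import torch

def countdown_ffn(x, w_up, w_down, keep_ratio=0.1):
    """Sketch of coefficient-based sparsification of an FFNN down projection.

    x      : (d_model,) token hidden state
    w_up   : (d_model, d_ff) up projection
    w_down : (d_ff, d_model) down projection
    """
    a = torch.relu(x @ w_up)            # (d_ff,) intermediate activations
    row_norms = w_down.norm(dim=1)      # contribution scale of each unit
    scores = a.abs() * row_norms        # assumed importance criterion
    k = max(1, int(keep_ratio * a.numel()))
    idx = torch.topk(scores, k).indices
    return a[idx] @ w_down[idx]         # keep only the k largest terms
```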
[926] Interpretable Neural ODEs for Gene Regulatory Network Discovery under Perturbations
Zaikang Lin, Sei Chang, Aaron Zweig, Minseo Kang, Elham Azizi, David A. Knowles
Main category: cs.LG
TL;DR: PerturbODE is a novel framework that uses neural ODEs to model cell state trajectories under perturbations and infer causal gene regulatory networks from large-scale biological datasets.
Details
Motivation: Existing differentiable causal graphical models for gene regulatory network inference are limited in expressivity and scalability, and fail to capture the dynamic nature of biological processes like cellular differentiation.
Method: Incorporates biologically informative neural ordinary differential equations (neural ODEs) to model cell state trajectories under perturbations and derives the causal gene regulatory network from the neural ODE’s parameters.
Result: Demonstrates efficacy in trajectory prediction and gene regulatory network inference across both simulated and real over-expression datasets.
Conclusion: PerturbODE provides an improved framework for causal graph discovery that better captures the dynamic nature of biological processes compared to existing methods.
Abstract: Modern high-throughput biological datasets with thousands of perturbations provide the opportunity for large-scale discovery of causal graphs that represent the regulatory interactions between genes. Differentiable causal graphical models have been proposed to infer a gene regulatory network (GRN) from large scale interventional datasets, capturing the causal gene regulatory relationships from genetic perturbations. However, existing models are limited in their expressivity and scalability while failing to address the dynamic nature of biological processes such as cellular differentiation. We propose PerturbODE, a novel framework that incorporates biologically informative neural ordinary differential equations (neural ODEs) to model cell state trajectories under perturbations and derive the causal GRN from the neural ODE’s parameters. We demonstrate PerturbODE’s efficacy in trajectory prediction and GRN inference across simulated and real over-expression datasets.
[927] Preference Optimization by Estimating the Ratio of the Data Distribution
Yeongmin Kim, Heesun Bae, Byeonghu Na, Il-Chul Moon
Main category: cs.LG
TL;DR: BPO is a generalized framework for preference optimization that subsumes DPO, provides theoretical guarantees, and improves both win rate and entropy without the trade-offs of prior methods.
Details
Motivation: To address limitations of prior preference optimization methods like f-PO that fail to simultaneously achieve simplicity and theoretical guarantees, and to provide a more flexible framework that can match target policies without reward models.
Method: Proposes Bregman preference optimization (BPO) - a generalized framework for ratio matching that provides a family of objective functions achieving target policy optimality. Includes scaled Basu’s power divergence (SBA) for gradient scaling.
Result: BPO instances improve both win rate and entropy compared to DPO, achieving 55.9% length-controlled win rate on AlpacaEval2 with Llama-3-8B-Instruct, representing state-of-the-art performance among Llama-3-8B backbones.
Conclusion: BPO provides a comprehensive framework that complements DPO variants, offers tractable implementations, and achieves superior performance without the fidelity-diversity trade-off seen in other extensions.
Abstract: Direct preference optimization (DPO) is widely used as a simple and stable method for aligning large language models (LLMs) with human preferences. This paper investigates a generalized DPO loss that enables a policy model to match the target policy from a likelihood ratio estimation perspective. The ratio of the target policy provides a unique identification of the policy distribution without relying on reward models or partition functions. This allows the generalized loss to retain both simplicity and theoretical guarantees, which prior work such as $f$-PO fails to achieve simultaneously. We propose Bregman preference optimization (BPO), a generalized framework for ratio matching that provides a family of objective functions achieving target policy optimality. BPO subsumes DPO as a special case and offers tractable forms for all instances, allowing implementation with a few lines of code. We further develop scaled Basu’s power divergence (SBA), a gradient scaling method that can be used for BPO instances. The BPO framework complements other DPO variants and is applicable to target policies defined by these variants. In experiments, unlike other probabilistic loss extensions such as $f$-DPO or $f$-PO, which exhibit a trade-off between generation fidelity and diversity, instances of BPO improve both win rate and entropy compared with DPO. When applied to Llama-3-8B-Instruct, BPO achieves state-of-the-art performance among Llama-3-8B backbones, with a 55.9% length-controlled win rate on AlpacaEval2. Project page: https://github.com/aailab-kaist/BPO.
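The summary does not spell out BPO's objective. For orientation, the classical Bregman-divergence density-ratio matching loss from the density-ratio estimation literature, which this line of work builds on, can be written as follows; this is a generic form, not necessarily the paper's exact parameterization.

```latex
\mathcal{L}_h(\theta)
  = \mathbb{E}_{y \sim \pi_{\mathrm{ref}}}\!\big[\, h'(r_\theta(y))\, r_\theta(y)
      - h(r_\theta(y)) \,\big]
  - \mathbb{E}_{y \sim \pi^{*}}\!\big[\, h'(r_\theta(y)) \,\big],
\qquad r_\theta = \frac{\pi_\theta}{\pi_{\mathrm{ref}}},
```

where $h$ is a strictly convex generator and $\pi^{*}$ is the target policy; different choices of $h$ recover different instances of the family.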
[928] Efficient Semi-Supervised Adversarial Training via Latent Clustering-Based Data Reduction
Somrita Ghosh, Yuelin Xu, Xiao Zhang
Main category: cs.LG
TL;DR: The paper proposes data reduction strategies for semi-supervised adversarial training (SSAT) using latent clustering to select/generate critical boundary-adjacent data, achieving similar robustness with 5-10x less data and 4x faster training.
Details
Motivation: SSAT requires substantial extra data for high robustness, leading to prolonged training time and increased memory usage. Current methods need too much external data.
Method: Designed latent clustering-based techniques to select or generate a small critical subset of data samples near the model’s decision boundary, maintaining a balanced ratio between boundary and non-boundary points.
Result: Methods significantly reduce SSAT’s data requirement and computation costs while preserving strong robustness. Achieved nearly identical robust accuracies with 5x to 10x less unlabeled data and approximately 4x less total runtime.
Conclusion: Latent-space selection with k-means clustering and guided DDPM fine-tuning with LCG-KM are most effective for efficient SSAT, enabling high robustness with dramatically reduced data and computation costs.
Abstract: Achieving high model robustness under adversarial settings is widely recognized as demanding considerable training samples. Recent works propose semi-supervised adversarial training (SSAT) methods with external unlabeled or synthetically generated data, which are the current state-of-the-art. However, SSAT requires substantial extra data to attain high robustness, resulting in prolonged training time and increased memory usage. In this paper, we propose unlabeled data reduction strategies to improve the efficiency of SSAT. Specifically, we design novel latent clustering-based techniques to select or generate a small critical subset of data samples near the model’s decision boundary. While focusing on boundary-adjacent points, our methods maintain a balanced ratio between boundary and non-boundary data points to avoid overfitting. Comprehensive experiments on benchmark datasets demonstrate that our methods can significantly reduce SSAT’s data requirement and computation costs while preserving its strong robustness advantages. In particular, our latent-space selection scheme based on k-means clustering and our guided DDPM fine-tuning approach with LCG-KM are the most effective, achieving nearly identical robust accuracies with 5x to 10x less unlabeled data and approximately 4x less total runtime.
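A minimal sketch of latent clustering-based selection: cluster unlabeled latents with k-means, then take a mix of boundary-adjacent (low-margin) and non-boundary points from each cluster. The margin proxy (top-2 probability gap) and the per-cluster quota are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_critical_subset(latents, margins, n_clusters=50, per_cluster=20,
                           boundary_frac=0.5):
    """Sketch of latent clustering-based data selection for SSAT.

    latents : (n, d) latent features of unlabeled samples.
    margins : (n,) proxy distance to the decision boundary, e.g. the gap
              between the top-2 class probabilities (an assumed criterion).
    Keeps a mix of boundary-adjacent and non-boundary points per cluster.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(latents)
    chosen = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        idx = idx[np.argsort(margins[idx])]       # smallest margin first
        n_b = int(boundary_frac * per_cluster)
        chosen.extend(idx[:n_b])                  # boundary-adjacent points
        chosen.extend(idx[len(idx) - (per_cluster - n_b):])  # the rest
    return np.unique(np.asarray(chosen))
```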
[929] SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training
Yehonathan Refael, Guy Smorodinsky, Tom Tirer, Ofir Lindenbaum
Main category: cs.LG
TL;DR: SUMO is a subspace-aware optimizer that uses exact SVD for moment orthogonalization to improve convergence in LLM training while reducing memory usage by up to 20%.
Details
Motivation: Existing low-rank optimization methods focus on memory efficiency but overlook acceleration opportunities, performing suboptimally in the anisotropic landscapes of deep networks like LLMs.
Method: Uses exact SVD for moment orthogonalization within dynamically adapted low-dimensional subspaces, enabling norm-inducing steepest descent steps aligned with loss landscape spectral characteristics.
Result: Empirical evaluations show SUMO accelerates convergence, enhances stability, improves performance, and reduces memory requirements by up to 20% compared to state-of-the-art methods.
Conclusion: Exact orthogonalization via SVD substantially improves convergence rates while reducing complexity, effectively mitigating approximation errors in LLM training.
Abstract: Low-rank gradient-based optimization methods have significantly improved memory efficiency during the training of large language models (LLMs), enabling operations within constrained hardware without sacrificing performance. However, these methods primarily emphasize memory savings, often overlooking potential acceleration in convergence due to their reliance on standard isotropic steepest descent techniques, which can perform suboptimally in the highly anisotropic landscapes typical of deep networks, particularly LLMs. In this paper, we propose SUMO (Subspace-Aware Moment-Orthogonalization), an optimizer that employs exact singular value decomposition (SVD) for moment orthogonalization within a dynamically adapted low-dimensional subspace, enabling norm-inducing steepest descent optimization steps. By explicitly aligning optimization steps with the spectral characteristics of the loss landscape, SUMO effectively mitigates approximation errors associated with commonly used methods like Newton-Schulz orthogonalization approximation. We theoretically establish an upper bound on these approximation errors, proving their dependence on the condition numbers of moments, conditions we analytically demonstrate are encountered during LLM training. Furthermore, we both theoretically and empirically illustrate that exact orthogonalization via SVD substantially improves convergence rates while reducing overall complexity. Empirical evaluations confirm that SUMO accelerates convergence, enhances stability, improves performance, and reduces memory requirements by up to 20% compared to state-of-the-art methods.
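The orthogonalization step itself is compact: replace the moment matrix $M = U \Sigma V^\top$ with its nearest orthogonal factor $U V^\top$. A sketch of that step follows; the dynamic subspace adaptation around it is omitted.

```python
import torch

def orthogonalize_moment(m, rank=None):
    """Exact orthogonalization of a moment matrix via SVD.

    Replaces m = U S V^T with its nearest orthogonal factor U V^T, the
    exact step SUMO uses instead of the Newton-Schulz approximation.
    """
    u, s, vh = torch.linalg.svd(m, full_matrices=False)
    if rank is not None:                # optional low-rank restriction
        u, vh = u[:, :rank], vh[:rank]
    return u @ vh
```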
[930] Technical Debt in In-Context Learning: Diminishing Efficiency in Long Context
Taejong Joo, Diego Klabjan
Main category: cs.LG
TL;DR: Transformers’ in-context learning initially matches Bayes optimal efficiency but significantly deteriorates in long contexts, revealing inherent diminishing efficiency that motivates new adaptive methods.
Details
Motivation: To investigate whether transformers optimally learn in-context compared to principled learning algorithms, particularly examining the extent of ICL’s efficiency limitations.
Method: Employed a meta ICL framework with hierarchical regression tasks, benchmarking ICL sample complexity against the Bayes optimal estimator and other principled algorithms under varying performance requirements.
Result: ICL initially matches Bayes optimal estimator efficiency but shows significant deterioration in long contexts, with information-theoretic analysis confirming this diminishing efficiency is inherent to ICL.
Conclusion: The findings clarify trade-offs in using ICL as universal problem solver and motivate development of new on-the-fly adaptive methods without diminishing efficiency.
Abstract: Transformers have demonstrated remarkable in-context learning (ICL) capabilities, adapting to new tasks by simply conditioning on demonstrations without parameter updates. Compelling empirical and theoretical evidence suggests that ICL, as a general-purpose learner, could outperform task-specific models. However, it remains unclear to what extent the transformers optimally learn in-context compared to principled learning algorithms. To investigate this, we employ a meta ICL framework in which each prompt defines a distinctive regression task whose target function is drawn from a hierarchical distribution, requiring inference over both the latent model class and task-specific parameters. Within this setup, we benchmark sample complexity of ICL against principled learning algorithms, including the Bayes optimal estimator, under varying performance requirements. Our findings reveal a striking dichotomy: while ICL initially matches the efficiency of a Bayes optimal estimator, its efficiency significantly deteriorates in long context. Through an information-theoretic analysis, we show that the diminishing efficiency is inherent to ICL. These results clarify the trade-offs in adopting ICL as a universal problem solver, motivating a new generation of on-the-fly adaptive methods without the diminishing efficiency.
[931] From Contextual Combinatorial Semi-Bandits to Bandit List Classification: Improved Sample Complexity with Sparse Rewards
Liad Erez, Tomer Koren
Main category: cs.LG
TL;DR: The paper studies contextual combinatorial semi-bandits with sparse rewards, focusing on the s-sparse regime where total rewards are bounded by s ≪ K. It provides improved sample complexity bounds for the PAC setting and regret bounds for adversarial settings.
Details
Motivation: Motivated by applications like recommendation systems where users only purchase a small subset of available products, the research addresses the sparse reward scenario in contextual bandits to improve efficiency.
Method: The authors design algorithms for both PAC and regret minimization settings. For PAC, they use an algorithm that returns an ε-optimal policy with high probability. For regret minimization, they extend previous work to handle adversarial data.
Result: The main result is a sample complexity of Õ((poly(K/m)+sm/ε²)log(|Π|/δ)) for PAC learning, which improves upon existing bounds when s ≪ K. For single-label classification (s=m=1), they achieve O((K⁷+1/ε²)log(|H|/δ)). For regret minimization, they prove Õ(|Π|+√(smTlog|Π|)) regret.
Conclusion: The paper demonstrates that bandit feedback comes at essentially no cost in the sparse regime (s=O(1)), as their bounds match full-information rates. Their framework generalizes list multiclass classification and provides computationally efficient algorithms given ERM oracle access.
Abstract: We study the problem of contextual combinatorial semi-bandits, where input contexts are mapped into subsets of size $m$ of a collection of $K$ possible actions. In each round, the learner observes the realized reward of the predicted actions. Motivated by prototypical applications of contextual bandits, we focus on the $s$-sparse regime where we assume that the sum of rewards is bounded by some value $s\ll K$. For example, in recommendation systems the number of products purchased by any customer is significantly smaller than the total number of available products. Our main result is for the $(\epsilon,\delta)$-PAC variant of the problem for which we design an algorithm that returns an $\epsilon$-optimal policy with high probability using a sample complexity of $\tilde{O}((poly(K/m)+sm/\epsilon^2) \log(|\Pi|/\delta))$ where $\Pi$ is the underlying (finite) class and $s$ is the sparsity parameter. This bound improves upon known bounds for combinatorial semi-bandits whenever $s\ll K$, and in the regime where $s=O(1)$, the leading terms in our bound match the corresponding full-information rates, implying that bandit feedback essentially comes at no cost. Our algorithm is also computationally efficient given access to an ERM oracle for $\Pi$. Our framework generalizes the list multiclass classification problem with bandit feedback, which can be seen as a special case with binary reward vectors. In the special case of single-label classification corresponding to $s=m=1$, we prove an $O((K^7+1/\epsilon^2)\log(|H|/\delta))$ sample complexity bound, which improves upon recent results in this scenario. Additionally, we consider the regret minimization setting where data can be generated adversarially, and establish a regret bound of $\tilde O(|\Pi|+\sqrt{smT\log |\Pi|})$, extending the result of Erez et al. (2024) who consider the simpler single label classification setting.
[932] On Vanishing Gradients, Over-Smoothing, and Over-Squashing in GNNs: Bridging Recurrent and Graph Learning
Álvaro Arroyo, Alessio Gravina, Benjamin Gutteridge, Federico Barbero, Claudio Gallicchio, Xiaowen Dong, Michael Bronstein, Pierre Vandergheynst
Main category: cs.LG
TL;DR: GNNs suffer from over-smoothing and over-squashing problems. The paper presents a unified view through vanishing gradients using control theory, proposes a state-space formulation to alleviate these issues, and shows connections between vanishing gradients and both problems.
Details
Motivation: To address the well-known problems of over-smoothing and over-squashing in Graph Neural Networks (GNNs) that limit their depth and performance, particularly for distant nodes.
Method: Uses linear control theory to analyze GNNs as recurrent models, proposes a state-space formulation of GNNs, and combines theoretical analysis with empirical validation.
Result: The state-space formulation effectively alleviates over-smoothing and over-squashing without extra parameters. Shows GNNs are prone to extreme gradient vanishing, over-smoothing relates to vanishing gradients, and over-squashing is best addressed by combining graph rewiring with gradient mitigation.
Conclusion: The work bridges recurrent and graph neural network literature, enabling design of deeper and more performant GNNs through better understanding of gradient dynamics.
Abstract: Graph Neural Networks (GNNs) are models that leverage the graph structure to transmit information between nodes, typically through the message-passing operation. While widely successful, this approach is well known to suffer from the over-smoothing and over-squashing phenomena, which result in representational collapse as the number of layers increases and insensitivity to the information contained at distant and poorly connected nodes, respectively. In this paper, we present a unified view of these problems through the lens of vanishing gradients, using ideas from linear control theory for our analysis. We propose an interpretation of GNNs as recurrent models and empirically demonstrate that a simple state-space formulation of a GNN effectively alleviates over-smoothing and over-squashing at no extra trainable parameter cost. Further, we show theoretically and empirically that (i) GNNs are by design prone to extreme gradient vanishing even after a few layers; (ii) Over-smoothing is directly related to the mechanism causing vanishing gradients; (iii) Over-squashing is most easily alleviated by a combination of graph rewiring and vanishing gradient mitigation. We believe our work will help bridge the gap between the recurrent and graph neural network literature and will unlock the design of new deep and performant GNNs.
[933] Thought Anchors: Which LLM Reasoning Steps Matter?
Paul C. Bogdan, Uzay Macar, Neel Nanda, Arthur Conmy
Main category: cs.LG
TL;DR: A black-box method for analyzing reasoning traces at the sentence level in large language models, identifying “thought anchors” that significantly impact reasoning trajectories and final answers.
Details
Motivation: Standard interpretability methods are limited for studying reasoning processes because they focus on single forward passes rather than multi-token computational steps during reasoning.
Method: A black-box method that measures sentence counterfactual importance by sampling replacement sentences, filtering for semantic differences, and continuing chain of thought to quantify impact on final answer distribution.
Result: Discovered “thought anchors” - sentences with outsized impact on reasoning trajectories (typically planning or uncertainty management sentences), and found specialized attention heads consistently attend from subsequent sentences to thought anchors.
Conclusion: Sentence-level analysis provides deeper understanding of reasoning models, with practical applications for predicting problem difficulty and analyzing reasoning structure across different domains.
Abstract: Current frontier large-language models rely on reasoning to achieve state-of-the-art performance. Many existing interpretability methods are limited in this area, as they have been designed to study single forward passes of a model rather than the multi-token computational steps that unfold during reasoning. We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We introduce a black-box method that measures each sentence’s counterfactual importance by repeatedly sampling replacement sentences from the model, filtering for semantically different ones, and continuing the chain of thought from that point onwards to quantify the sentence’s impact on the distribution of final answers. We discover that certain sentences can have an outsized impact on the trajectory of the reasoning trace and final answer. We term these sentences thought anchors. These are generally planning or uncertainty management sentences, and specialized attention heads consistently attend from subsequent sentences to thought anchors. We further show that examining sentence-sentence causal links within a reasoning trace gives insight into a model’s behavior. Such information can be used to predict a problem’s difficulty and the extent to which different question domains involve sequential or diffuse reasoning. As a proof-of-concept, we demonstrate that our techniques together provide a practical toolkit for analyzing reasoning models by conducting a detailed case study of how the model solves a difficult math problem, finding that our techniques yield a consistent picture of the reasoning trace’s structure. We provide an open-source tool (thought-anchors.com) for visualizing the outputs of our methods on further problems. The convergence across our methods shows the potential of sentence-level analysis for a deeper understanding of reasoning models.
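A minimal sketch of the resampling loop behind counterfactual sentence importance; model.resample and model.continue_to_answer are hypothetical interfaces, and semantic filtering is reduced to simple string inequality for brevity.

```python
def sentence_importance(model, sentences, i, n_samples=20):
    """Sketch of black-box counterfactual importance for sentence i.

    model.resample(prefix) -> one alternative next sentence, and
    model.continue_to_answer(prefix) -> a sampled final answer, are
    hypothetical interfaces.
    """
    prefix = sentences[:i]
    base = [model.continue_to_answer(prefix + [sentences[i]])
            for _ in range(n_samples)]
    counterfactual = []
    while len(counterfactual) < n_samples:
        alt = model.resample(prefix)
        if alt != sentences[i]:                   # crude semantic filter
            counterfactual.append(model.continue_to_answer(prefix + [alt]))

    def dist(answers):
        return {a: answers.count(a) / len(answers) for a in set(answers)}

    p, q = dist(base), dist(counterfactual)
    # Importance = total-variation shift in the final-answer distribution.
    return sum(abs(p.get(a, 0.0) - q.get(a, 0.0))
               for a in set(p) | set(q)) / 2
```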
[934] Shortcuts and Identifiability in Concept-based Models from a Neuro-Symbolic Lens
Samuele Bortolotti, Emanuele Marconato, Paolo Morettin, Andrea Passerini, Stefano Teso
Main category: cs.LG
TL;DR: Concept-based Models struggle with reasoning shortcuts where models learn low-quality concepts despite having fixed inference layers, and existing methods often fail to meet theoretical conditions for reliable concept identification.
Details
Motivation: To understand why Concept-based Models produce unreliable concepts in out-of-distribution scenarios and establish conditions for achieving interpretable and reliable concept extraction.
Method: Established connection between Concept-based Models and reasoning shortcuts, extended RSs to complex settings, derived theoretical conditions for concept and inference layer identification, and empirically tested existing methods with mitigation strategies.
Result: Empirical results show reasoning shortcuts significantly impact Concept-based Models, and current methods combined with mitigation strategies often fail to meet theoretical conditions in practice.
Conclusion: Concept-based Models face fundamental challenges with reasoning shortcuts that current approaches cannot reliably overcome, highlighting the need for new methods that satisfy theoretical conditions for interpretable concept learning.
Abstract: Concept-based Models are neural networks that learn a concept extractor to map inputs to high-level concepts and an inference layer to translate these into predictions. Ensuring these modules produce interpretable concepts and behave reliably in out-of-distribution is crucial, yet the conditions for achieving this remain unclear. We study this problem by establishing a novel connection between Concept-based Models and reasoning shortcuts (RSs), a common issue where models achieve high accuracy by learning low-quality concepts, even when the inference layer is fixed and provided upfront. Specifically, we extend RSs to the more complex setting of Concept-based Models and derive theoretical conditions for identifying both the concepts and the inference layer. Our empirical results highlight the impact of RSs and show that existing methods, even combined with multiple natural mitigation strategies, often fail to meet these conditions in practice.
[935] Language Model Guided Reinforcement Learning in Quantitative Trading
Adam Darmanin, Vince Vella
Main category: cs.LG
TL;DR: This paper proposes a hybrid framework where Large Language Models (LLMs) generate high-level trading strategies to guide Reinforcement Learning (RL) agents, improving both returns and risk metrics compared to standard RL approaches.
Details
Motivation: Reinforcement Learning has limitations in algorithmic trading including myopic behavior and opaque policies, while LLMs offer complementary strategic reasoning and multi-modal signal interpretation capabilities.
Method: A hybrid framework where LLMs generate high-level trading strategies to guide RL agents, with evaluation through expert review of economic rationale and performance comparison against unguided RL baselines using Sharpe Ratio and Maximum Drawdown.
Result: Empirical results show that LLM guidance improves both return and risk metrics relative to standard RL, with better performance on Sharpe Ratio and Maximum Drawdown.
Conclusion: The hybrid LLM-RL framework effectively combines the strategic reasoning of LLMs with the tactical decision-making of RL, leading to improved trading performance compared to standalone RL approaches.
Abstract: Algorithmic trading requires short-term tactical decisions consistent with long-term financial objectives. Reinforcement Learning (RL) has been applied to such problems, but adoption is limited by myopic behaviour and opaque policies. Large Language Models (LLMs) offer complementary strategic reasoning and multi-modal signal interpretation when guided by well-structured prompts. This paper proposes a hybrid framework in which LLMs generate high-level trading strategies to guide RL agents. We evaluate (i) the economic rationale of LLM-generated strategies through expert review, and (ii) the performance of LLM-guided agents against unguided RL baselines using Sharpe Ratio (SR) and Maximum Drawdown (MDD). Empirical results indicate that LLM guidance improves both return and risk metrics relative to standard RL.
[936] KL Penalty Control via Perturbation for Direct Preference Optimization
Sangkyu Lee, Janghoon Han, Hosung Song, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu
Main category: cs.LG
TL;DR: ε-DPO improves DPO by adaptively controlling KL penalty strength for each preference pair based on logit monotonicity under β perturbation, achieving better performance than existing direct alignment methods.
Details
Motivation: DPO's static KL penalty prevents adaptive control for different preference pairs, limiting its effectiveness in aligning language models with human preferences.
Method: Proposes ε-DPO which adaptively controls KL penalty strength β for each preference pair by analyzing logit monotonicity under β perturbation during training, effectively adjusting KL penalty by checking if temperature changes improve preference confidence.
Result: Experimental results show ε-DPO significantly improves DPO performance on general chatbot benchmarks and demonstrates that adaptive KL penalty control can reflect preference model confusion and provide efficient KL trade-off.
Conclusion: Instance-level adaptive KL penalty control is crucial in DPO, and ε-DPO’s simple criterion effectively addresses the limitations of static KL penalties in direct preference optimization.
Abstract: Direct Preference Optimization (DPO) demonstrates the advantage of aligning a large language model with human preference using only an offline dataset. However, DPO has the limitation that the KL penalty, which prevents excessive deviation from the reference model, is static throughout the training process. Several methods claim to change this static KL penalty of DPO into a dynamic one, but no approach can adaptively assign different KL penalties for each preference pair. In this paper, we propose $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO), which allows adaptive control of the KL penalty strength $\beta$ for each preference pair. Specifically, $\varepsilon$-DPO adaptively controls $\beta$ for each preference pair based on the monotonicity of logits as a preference model under the perturbation of $\beta$ during training. This is equivalent to adjusting the KL penalty by checking whether the change in training-time temperature can lead to better preference confidence as preference models by simply reusing the logit of the current policy and the reference policy. Experimental results show that the simple criterion of $\varepsilon$-DPO for KL penalty relaxation significantly improves DPO compared to most existing direct alignment algorithms on general chatbot benchmarks and reveal that this KL penalty control criterion can reflect confusion as a preference model and provide an efficient KL trade-off, highlighting the significance of instance-level adaptive KL penalty control in DPO.
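To make the instance-level control concrete, here is a minimal sketch, not the authors' code: a DPO loss that accepts a per-pair β, plus a toy β-probe in the spirit of the perturbation criterion (tensor names, the probe factor `eps`, and the selection rule are illustrative). Note that in the paper the preference confidence is re-evaluated under the perturbed training-time temperature; the probe below holds the margin fixed for brevity.

```python
import torch
import torch.nn.functional as F

def dpo_loss_per_pair(logratio_w, logratio_l, betas):
    """DPO loss with an instance-level beta per preference pair.
    logratio_* = log pi(y|x) - log pi_ref(y|x) for the chosen/rejected response."""
    margins = logratio_w - logratio_l
    return -F.logsigmoid(betas * margins).mean()

def probe_beta(margin, beta, eps=0.1):
    """Toy epsilon-DPO-style probe: evaluate the preference confidence
    sigma(b * margin) at b = beta*(1-eps), beta, beta*(1+eps) and keep
    whichever temperature yields the highest confidence for this pair."""
    candidates = [beta * (1 - eps), beta, beta * (1 + eps)]
    return max(candidates, key=lambda b: torch.sigmoid(torch.as_tensor(b * margin)).item())
```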
[937] Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, Shengyi Huang, Johan Obando-Ceron, Siran Yang, Jiamang Wang, Wenbo Su, Bo Zheng
Main category: cs.LG
TL;DR: This paper systematically reviews RL techniques for LLM reasoning, identifies challenges like inconsistent experimental settings, and provides guidelines for technique selection through rigorous evaluations.
Details
Motivation: The rapid growth in RL for LLM reasoning research has led to fragmented understanding, conflicting conclusions, and lack of standardized guidelines, creating confusion for practitioners.
Method: Systematic review through rigorous reproductions and isolated evaluations within a unified open-source framework, analyzing techniques across varying datasets, model sizes, and architectures.
Result: A minimalist combination of two techniques unlocks critic-free policy learning with vanilla PPO loss, consistently improving performance and surpassing strategies like GRPO and DAPO.
Conclusion: Provides clear guidelines for RL technique selection and a reliable roadmap for practitioners, demonstrating that simple combinations can achieve superior performance in LLM reasoning tasks.
Abstract: Reinforcement learning for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for employing RL techniques and a fragmented understanding of their underlying mechanisms. Additionally, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and creating confusion among practitioners when selecting appropriate techniques. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments, including datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups, and provide a reliable roadmap for practitioners navigating the RL for the LLM domain. Finally, we reveal that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using vanilla PPO loss. The results demonstrate that our simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.
[938] Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting
Yuxuan Yang, Dalin Zhang, Yuxuan Liang, Hua Lu, Gang Chen, Huan Li
Main category: cs.LG
TL;DR: This paper introduces SCAM, a self-supervised method that improves time series forecasting by generating pseudo labels from reconstruction intermediates and selectively replacing overfitted components.
Details
Motivation: Existing time series forecasting models rely heavily on high-quality data and insufficiently exploit all available data, limiting their generalization capabilities.
Method: Proposes Self-Correction with Adaptive Mask (SCAM) which discards overfitted components and replaces them with pseudo labels from reconstruction intermediates, combined with Spectral Norm Regularization to suppress overfitting.
Result: Experiments on eleven real-world datasets show that SCAM consistently improves the performance of various backbone models.
Conclusion: The work offers a new perspective on constructing datasets and enhancing generalization of time series forecasting models through self-supervised learning.
Abstract: Time Series Forecasting (TSF) is a crucial task in various domains, yet existing TSF models rely heavily on high-quality data and insufficiently exploit all available data. This paper explores a novel self-supervised approach to re-label time series datasets by inherently constructing candidate datasets. During the optimization of a simple reconstruction network, intermediates are used as pseudo labels in a self-supervised paradigm, improving generalization for any predictor. We introduce the Self-Correction with Adaptive Mask (SCAM), which discards overfitted components and selectively replaces them with pseudo labels generated from reconstructions. Additionally, we incorporate Spectral Norm Regularization (SNR) to further suppress overfitting from a loss landscape perspective. Our experiments on eleven real-world datasets demonstrate that SCAM consistently improves the performance of various backbone models. This work offers a new perspective on constructing datasets and enhancing the generalization of TSF models through self-supervised learning. The code is available at https://github.com/SuDIS-ZJU/SCAM.
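SCAM's masking-and-relabeling pipeline is involved, but the Spectral Norm Regularization component is easy to illustrate. Below is a minimal sketch, assuming a generic PyTorch model; the penalty weight `lam` and the single power-iteration step are illustrative choices, not the authors' configuration.

```python
import torch
import torch.nn.functional as F

def spectral_norm_penalty(weight, n_iters=1):
    """Estimate the largest singular value of `weight` by power iteration.
    For simplicity the iteration vector is re-drawn on every call."""
    w = weight.reshape(weight.shape[0], -1)
    u = torch.randn(w.shape[0], device=w.device)
    for _ in range(n_iters):
        v = F.normalize(w.t() @ u, dim=0)
        u = F.normalize(w @ v, dim=0)
    return u @ w @ v  # sigma_max estimate, differentiable w.r.t. `weight`

# usage sketch: total = forecast_loss + lam * sum(
#     spectral_norm_penalty(p) for p in model.parameters() if p.dim() >= 2)
```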
[939] OMPQ: Orthogonal Mixed Precision Quantization
Yuexiao Ma, Taisong Jin, Xiawu Zheng, Yan Wang, Huixia Li, Yongjian Wu, Guannan Jiang, Wei Zhang, Rongrong Ji
Main category: cs.LG
TL;DR: Proposes network orthogonality as a proxy metric for mixed precision quantization, enabling fast linear programming optimization instead of slow integer programming search, achieving high accuracy with minimal data and time.
Details
Motivation: To bridge the gap between neural network complexity and hardware capability through efficient mixed precision quantization that avoids time-consuming search processes.
Method: Optimizes network orthogonality as a proxy metric using linear programming instead of solving the original integer programming problem, reducing search time and data requirements.
Result: Achieved 72.08% Top-1 accuracy on ResNet-18 with 6.7Mb without search iterations, and 71.27% Top-1 accuracy on MobileNetV2 with 1.5Mb for post-training quantization.
Conclusion: The proposed approach enables highly efficient mixed precision quantization with minimal search time and data dependency while maintaining competitive accuracy.
Abstract: To bridge the ever-increasing gap between deep neural network complexity and hardware capability, network quantization has attracted growing research attention. The latest trend, mixed precision quantization, takes advantage of hardware’s multiple bit-width arithmetic operations to unleash the full potential of network quantization. However, this also results in a difficult integer programming formulation, forcing most existing approaches to use an extremely time-consuming search process even with various relaxations. Instead of solving the original integer programming problem, we propose to optimize a proxy metric, network orthogonality, which is highly correlated with the integer programming loss but easy to optimize with linear programming. This approach reduces the search time and the required data amount by orders of magnitude, with little compromise on quantization accuracy. Specifically, we achieve 72.08% Top-1 accuracy on ResNet-18 with 6.7Mb without any search iterations. Given the high efficiency and low data dependency of our algorithm, we apply it to post-training quantization, achieving 71.27% Top-1 accuracy on MobileNetV2 with only 1.5Mb. Our code is available at https://github.com/MAC-AutoML/OMPQ.
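The paper's actual proxy is network orthogonality, which is beyond a short snippet, but the payoff of replacing integer search with a linear program is easy to sketch. The toy below uses hypothetical inputs: per-layer `importance` scores standing in for the orthogonality metric, and parameter counts; it allocates bit-widths under a size budget with `scipy.optimize.linprog` and rounds the relaxed solution.

```python
import numpy as np
from scipy.optimize import linprog

def allocate_bits(importance, params_per_layer, budget_bits, lo=2, hi=8):
    """Maximize importance-weighted precision subject to a total-size budget:
    an LP relaxation of the bit-width assignment, then rounding (illustrative)."""
    c = -np.asarray(importance, dtype=float)            # maximize => minimize -c.x
    A_ub = [np.asarray(params_per_layer, dtype=float)]  # sum_i params_i * bits_i <= budget
    res = linprog(c, A_ub=A_ub, b_ub=[budget_bits],
                  bounds=[(lo, hi)] * len(c), method="highs")
    return np.round(res.x).astype(int)                  # per-layer bit-widths

# example: allocate_bits([3.0, 1.0, 2.0], [1e6, 5e5, 2e5], budget_bits=8e6)
```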
[940] RePO: Understanding Preference Learning Through ReLU-Based Optimization
Junkang Wu, Kexin Huang, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang
Main category: cs.LG
TL;DR: RePO is a simplified preference optimization method that removes the beta hyperparameter from DPO and SimPO through gradient analysis and a ReLU-based max-margin loss, achieving better performance with only one tunable parameter.
Details
Motivation: Existing alignment methods like RLHF face computational and stability challenges, while DPO and SimPO introduce complexity through multiple hyperparameters (beta, gamma). There's a need for a more streamlined approach.
Method: RePO eliminates beta through two advances: (1) retaining SimPO’s reference-free margins but removing beta via gradient analysis, and (2) using a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, it’s SimPO’s limiting case where beta approaches infinity.
Result: Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models.
Conclusion: RePO provides a more efficient and effective preference optimization method that requires only one hyperparameter to tune while achieving superior performance.
Abstract: Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with single hyperparameter $\beta$, subsequent methods like SimPO reintroduce complexity through dual parameters ($\beta$, $\gamma$). We propose {ReLU-based Preference Optimization (RePO)}, a streamlined algorithm that eliminates $\beta$ via two advances: (1) retaining SimPO’s reference-free margins but removing $\beta$ through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO’s limiting case ($\beta \to \infty$), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models, requiring only one hyperparameter to tune.
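Since RePO reduces preference optimization to a hinge on a reference-free margin, a faithful sketch is short. Below, `logp_*` are summed response log-probabilities and `len_*` token counts (names illustrative):

```python
import torch

def repo_loss(logp_w, logp_l, len_w, len_l, gamma=1.0):
    """ReLU-based max-margin loss: SimPO's length-normalized, reference-free
    margin with the logistic weighting replaced by a hard hinge (the
    beta -> infinity limit described in the paper)."""
    margin = logp_w / len_w - logp_l / len_l  # average per-token log-prob gap
    return torch.relu(gamma - margin).mean()  # pairs with margin >= gamma contribute zero
```

The hinge makes each pair's contribution all-or-nothing, which is the "natural filtering of trivial pairs" the summary refers to.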
[941] Score-based Generative Neural Networks for Large-Scale Optimal Transport
Mara Daniels, Tyler Maunu, Paul Hand
Main category: cs.LG
TL;DR: A novel framework for learning Sinkhorn-regularized optimal transport couplings using score-based generative models and Langevin Dynamics, enabling efficient sampling from high-dimensional distributions.
Details
Motivation: Standard optimal transport methods face computational challenges with large, high-dimensional datasets due to expensive linear programming and the curse of dimensionality. The Sinkhorn problem offers a regularized alternative but needs efficient sampling methods.
Method: Proposes a neural network parametrization of the Sinkhorn problem and uses score-based generative models with Langevin Dynamics to iteratively sample target data conditioned on source data according to the regularized optimal coupling.
Result: Theoretical proof of gradient descent convergence with respect to network parameters, and empirical success on various large-scale optimal transport tasks.
Conclusion: The framework provides an effective solution for sampling Sinkhorn couplings in high-dimensional settings, overcoming computational limitations of traditional optimal transport methods.
Abstract: We consider the fundamental problem of sampling the optimal transport coupling between given source and target distributions. In certain cases, the optimal transport plan takes the form of a one-to-one mapping from the source support to the target support, but learning or even approximating such a map is computationally challenging for large and high-dimensional datasets due to the high cost of linear programming routines and an intrinsic curse of dimensionality. We study instead the Sinkhorn problem, a regularized form of optimal transport whose solutions are couplings between the source and the target distribution. We introduce a novel framework for learning the Sinkhorn coupling between two distributions in the form of a score-based generative model. Conditioned on source data, our procedure iterates Langevin Dynamics to sample target data according to the regularized optimal coupling. Key to this approach is a neural network parametrization of the Sinkhorn problem, and we prove convergence of gradient descent with respect to network parameters in this formulation. We demonstrate its empirical success on a variety of large scale optimal transport tasks.
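The sampling loop itself is plain unadjusted Langevin dynamics with a learned conditional score. A minimal sketch, assuming a trained `score_fn(y, x_src)` approximating the gradient of the log-density of the coupled target given the source (step size and step count are illustrative):

```python
import torch

def langevin_sample(score_fn, x_src, n_steps=200, step=1e-3):
    """Draw a target sample coupled to x_src by iterating Langevin dynamics."""
    y = torch.randn_like(x_src)  # initialize the target point
    for _ in range(n_steps):
        noise = torch.randn_like(y)
        y = y + step * score_fn(y, x_src) + (2.0 * step) ** 0.5 * noise
    return y
```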
[942] Identifying Trustworthiness Challenges in Deep Learning Models for Continental-Scale Water Quality Prediction
Xiaobo Xia, Xiaofeng Liu, Jiale Liu, Kuai Fang, Lu Lu, Samet Oymak, William S. Currie, Tongliang Liu
Main category: cs.LG
TL;DR: Deep learning models show promise for water quality prediction but face trustworthiness challenges including performance disparities, robustness issues, uncertainty, and poor generalization to ungauged basins.
Details
Motivation: To address unresolved trustworthiness challenges preventing widespread adoption of deep learning in high-stakes water quality decision-making, including performance disparity, robustness, uncertainty, interpretability, generalizability, and reproducibility.
Method: Multi-dimensional quantitative evaluation benchmarking three deep learning architectures (LSTM, DeepONet, Informer) trained on 37 years of data from 482 U.S. basins to predict 20 water quality variables.
Result: Systematic performance disparities tied to process complexity and data availability; management-critical variables are least predictable and most uncertain; LSTM most vulnerable to data corruption; spatial generalization to ungauged basins remains poor across all models.
Conclusion: This work serves as a call to action for advancing trustworthy data-driven methods in water resources management and provides insights for responsible AI use in environmental management.
Abstract: Water quality is foundational to environmental sustainability, ecosystem resilience, and public health. Deep learning offers transformative potential for large-scale water quality prediction and scientific insights generation. However, its widespread adoption in high-stakes operational decision-making, such as pollution mitigation and equitable resource allocation, is prevented by unresolved trustworthiness challenges, including performance disparity, robustness, uncertainty, interpretability, generalizability, and reproducibility. In this work, we present a multi-dimensional, quantitative evaluation of trustworthiness benchmarking three state-of-the-art deep learning architectures: recurrent (LSTM), operator-learning (DeepONet), and transformer-based (Informer), trained on 37 years of data from 482 U.S. basins to predict 20 water quality variables. Our investigation reveals systematic performance disparities tied to process complexity, data availability, and basin heterogeneity. Management-critical variables remain the least predictable and most uncertain. Robustness tests reveal pronounced sensitivity to outliers and corrupted targets; notably, the architecture with the strongest baseline performance (LSTM) proves most vulnerable under data corruption. Attribution analyses align for simple variables but diverge for nutrients, underscoring the need for multi-method interpretability. Spatial generalization to ungauged basins remains poor across all models. This work serves as a timely call to action for advancing trustworthy data-driven methods for water resources management, offering critical insights for researchers, decision-makers, and practitioners seeking to leverage artificial intelligence (AI) responsibly in environmental management.
[943] Censoring chemical data to mitigate dual use risk
Quintina L. Campbell, Jonathan Herington, Andrew D. White
Main category: cs.LG
TL;DR: A framework for mitigating dual-use risks in chemical ML models through selective noising of sensitive molecular data to prevent malicious use while enabling open sharing.
Details
Motivation: Address dual-use concerns in chemical ML models, particularly around toxicological data and chemical warfare agents that could be misused.
Method: Proposed a chain risk framework with three mitigation strategies: inference-level, model-level, and data-level. Introduced a model-agnostic noising method to increase prediction error in sensitive regions of molecular data.
Result: Selective noising induces variance and attenuation bias, while simply omitting sensitive data fails to prevent extrapolation. Results are consistent across molecular feature MLPs and graph neural networks.
Conclusion: Noising molecular structures can enable open sharing of potential dual-use molecular data while mitigating misuse risks.
Abstract: Machine learning models have dual-use potential, potentially serving both beneficial and malicious purposes. The development of open-source models in chemistry has specifically surfaced dual-use concerns around toxicological data and chemical warfare agents. We discuss a chain risk framework identifying three misuse pathways and corresponding mitigation strategies: inference-level, model-level, and data-level. At the data level, we introduce a model-agnostic noising method to increase prediction error in specific desired regions (sensitive regions). Our results show that selective noise induces variance and attenuation bias, whereas simply omitting sensitive data fails to prevent extrapolation. These findings hold for both molecular feature multilayer perceptrons and graph neural networks. Thus, noising molecular structures can enable open sharing of potential dual-use molecular data.
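At the data level, the mitigation amounts to perturbing labels only inside a designated sensitive region. A minimal sketch; the sensitivity flag (e.g. similarity to known agents) is application-specific and assumed given, and `sigma` is illustrative:

```python
import numpy as np

def selectively_noise_labels(y, is_sensitive, sigma=1.0, rng=None):
    """Add zero-mean Gaussian noise to labels of flagged molecules, leaving
    the rest untouched, so models fit the sensitive region poorly while the
    data can still be shared openly."""
    if rng is None:
        rng = np.random.default_rng(0)
    y = np.asarray(y, dtype=float).copy()
    mask = np.asarray(is_sensitive, dtype=bool)
    y[mask] += rng.normal(0.0, sigma, size=int(mask.sum()))
    return y
```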
[944] Out-of-Distribution Generalization in Time Series: A Survey
Xin Wu, Fei Teng, Xingwang Li, Ji Zhang, Tianrui Li, Qiang Duan
Main category: cs.LG
TL;DR: This paper presents the first comprehensive review of out-of-distribution (OOD) generalization methodologies for time series, organized across three dimensions: data distribution, representation learning, and OOD evaluation, with detailed analysis of popular algorithms and future research directions.
Details
Motivation: Time series often exhibit distribution shifts, diverse latent features, and non-stationary learning dynamics in open environments, posing significant challenges for OOD generalization. There is a lack of systematic synthesis of advancements in this field.
Method: The authors organize their analysis across three foundational dimensions: data distribution, representation learning, and OOD evaluation. For each dimension, they present several popular algorithms in detail and provide a comprehensive review of OOD generalization methodologies.
Result: The paper presents a systematic organization of the field’s evolutionary trajectory and contemporary research landscape, highlighting key application scenarios and their real-world impact. A detailed summary of methods is available at the provided URL.
Conclusion: The review identifies persistent challenges and proposes future research directions for OOD generalization in time series, establishing a foundation for further advancement in this important area.
Abstract: Time series frequently manifest distribution shifts, diverse latent features, and non-stationary learning dynamics, particularly in open and evolving environments. These characteristics pose significant challenges for out-of-distribution (OOD) generalization. While substantial progress has been made, a systematic synthesis of advancements remains lacking. To address this gap, we present the first comprehensive review of OOD generalization methodologies for time series, organized to delineate the field’s evolutionary trajectory and contemporary research landscape. We organize our analysis across three foundational dimensions: data distribution, representation learning, and OOD evaluation. For each dimension, we present several popular algorithms in detail. Furthermore, we highlight key application scenarios, emphasizing their real-world impact. Finally, we identify persistent challenges and propose future research directions. A detailed summary of the methods reviewed for the generalization of OOD in time series can be accessed at https://tsood-generalization.com.
[945] torchgfn: A PyTorch GFlowNet library
Joseph D. Viviano, Omar G. Younis, Sanghyeok Choi, Victor Schmidt, Yoshua Bengio, Salem Lahlou
Main category: cs.LG
TL;DR: torchgfn is a PyTorch library for GFlowNets that provides modular components for environments, neural networks, and training objectives to facilitate rapid prototyping and research.
Details
Motivation: The growing popularity of GFlowNets among diverse researchers necessitates a standardized library for testing new features against benchmarks and common environments.
Method: Developed torchgfn with a modular, decoupled architecture that treats environments, neural network modules, and training objectives as interchangeable components.
Result: Created a PyTorch library with simple yet powerful API that enables rapid prototyping and includes multiple examples replicating published results.
Conclusion: torchgfn addresses the need for a standardized GFlowNets library and is available on GitHub and PyPI for community use.
Abstract: The growing popularity of generative flow networks (GFlowNets or GFNs) from a range of researchers with diverse backgrounds and areas of expertise necessitates a library that facilitates the testing of new features (e.g., training losses and training policies) against standard benchmark implementations, or on a set of common environments. We present torchgfn, a PyTorch library that aims to address this need. Its core contribution is a modular and decoupled architecture which treats environments, neural network modules, and training objectives as interchangeable components. This provides users with a simple yet powerful API to facilitate rapid prototyping and novel research. Multiple examples are provided, replicating and unifying published results. The library is available on GitHub (https://github.com/GFNOrg/torchgfn) and on pypi (https://pypi.org/project/torchgfn/).
[946] Detecting and Rectifying Noisy Labels: A Similarity-based Approach
Dang Huu-Tien, Minh-Phuong Nguyen, Naoya Inoue
Main category: cs.LG
TL;DR: Proposes model-agnostic methods using penultimate layer features to detect and correct label noise in datasets by analyzing feature similarity patterns.
Details
Motivation: Label noise damages DNN performance and robustness, creating a need for automated error detection tools as model sizes grow.
Method: Uses penultimate layer features to detect mislabeled data based on higher similarity to true class clusters than other classes, enabling error detection and rectification.
Result: Achieves high detection performance across diverse realistic noise scenarios and improves dataset quality through automatic error rectification.
Conclusion: The approach effectively detects and corrects label noise using feature similarity analysis, providing practical automated dataset cleaning.
Abstract: Label noise in datasets could significantly damage the performance and robustness of deep neural networks (DNNs) trained on these datasets. As the size of modern DNNs grows, there is a growing demand for automated tools for detecting such errors. In this paper, we propose post-hoc, model-agnostic noise detection and rectification methods utilizing the penultimate feature from a DNN. Our idea is based on the observation that the similarity between the penultimate feature of a mislabeled data point and its true class data points is higher than that for data points from other classes, making the probability of label occurrence within a tight, similar cluster informative for detecting and rectifying errors. Through theoretical and empirical analyses, we demonstrate that our approach achieves high detection performance across diverse, realistic noise scenarios and can automatically rectify these errors to improve dataset quality. Our implementation is available at https://anonymous.4open.science/r/noise-detection-and-rectification-AD8E.
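A minimal version of the similarity idea fits in a few lines: compare each point's penultimate feature to class centroids and propose the closest class as the corrected label. This is a sketch, not the paper's full procedure, and it assumes every class has at least one sample:

```python
import torch
import torch.nn.functional as F

def detect_and_rectify(features, labels, num_classes):
    """Flag points whose penultimate feature is more similar to another
    class's centroid than to their own, and propose that class as the fix."""
    feats = F.normalize(features, dim=1)
    centroids = torch.stack([feats[labels == c].mean(0) for c in range(num_classes)])
    sims = feats @ F.normalize(centroids, dim=1).t()  # cosine similarity to each class
    proposed = sims.argmax(dim=1)
    return proposed != labels, proposed               # (suspect mask, proposed labels)
```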
[947] Optimal and Fair Encouragement Policy Evaluation and Learning
Angela Zhou
Main category: cs.LG
TL;DR: This paper studies optimal treatment rules under human non-adherence, focusing on fairness constraints like demographic parity in treatment take-up, and develops methods for causal identification and robust estimation.
Details
Motivation: In consequential domains where treatment cannot be compelled, optimal policies are merely suggestions due to human non-adherence. There's a persistent puzzle in social services about the gap in take-up of beneficial services among those who may benefit most, highlighting the need to consider fairness constraints.
Method: The authors develop a framework for causal identification and robust estimation of optimal treatment rules under potential positivity violations. They use constrained optimization to incorporate fairness constraints like demographic parity, and develop a two-stage algorithm for solving over parametrized policy classes under general constraints.
Result: The methods are illustrated in three case studies: SNAP benefits recertification reminders, randomized encouragement for insurance enrollment, and pretrial supervised release with electronic monitoring. The framework provides variance-sensitive regret bounds.
Conclusion: Addressing inequities in algorithmic allocation requires studying both take-up of decisions and downstream outcomes, with context-specific remedies. The framework enables handling fairness constraints while optimizing treatment recommendations.
Abstract: In consequential domains, it is often impossible to compel individuals to take treatment, so that optimal policy rules are merely suggestions in the presence of human non-adherence to treatment recommendations. Under heterogeneity, covariates may predict take-up of treatment and final outcome, but differently. While optimal treatment rules optimize causal outcomes across the population, access parity constraints or other fairness considerations on who receives treatment can be important. For example, in social services, a persistent puzzle is the gap in take-up of beneficial services among those who may benefit from them the most. We study causal identification and robust estimation of optimal treatment rules, including under potential violations of positivity. We consider fairness constraints such as demographic parity in treatment take-up, and other constraints, via constrained optimization. Our framework can be extended to handle algorithmic recommendations under an often-reasonable covariate-conditional exclusion restriction, using our robustness checks for lack of positivity in the recommendation. We develop a two-stage algorithm for solving over parametrized policy classes under general constraints to obtain variance-sensitive regret bounds. We illustrate the methods in three case studies based on data from reminders of SNAP benefits recertification, randomized encouragement to enroll in insurance, and from pretrial supervised release with electronic monitoring. While the specific remedy to inequities in algorithmic allocation is context-specific, it requires studying both take-up of decisions and downstream outcomes of them.
[948] DMol: A Schedule-Driven Diffusion Model for Highly Efficient and Versatile Molecule Generation
Peizhi Niu, Yu-Hsiang Wang, Vishal Rana, Chetan Rupakheti, Abhishek Pandey, Olgica Milenkovic
Main category: cs.LG
TL;DR: DMol is a new graph diffusion model for small molecule generation that outperforms DiGress in validity by 1.5% while reducing diffusion steps 10x and runtime by half, with further improvements via compression of frequent carbon rings.
Details
Motivation: To improve upon existing graph diffusion models like DiGress by achieving higher validity rates with significantly fewer diffusion steps and faster runtime for small molecule generation.
Method: Uses a modified objective function and graph noise scheduling that changes subsets of nodes at each step. Can be combined with compressed graph representations where frequent carbon rings are compressed into supernodes, avoiding complex VAE-based reconstruction.
Result: Achieves 1.5% higher validity than DiGress across datasets, reduces diffusion steps by 10x, cuts runtime in half. Compressed version adds 2% more validity improvement and increases novelty.
Conclusion: DMol provides more efficient and valid small molecule generation through optimized diffusion scheduling and optional graph compression, offering significant performance improvements over state-of-the-art methods.
Abstract: We introduce a new graph diffusion model for small molecule generation, DMol, which outperforms the state-of-the-art DiGress model in terms of validity by roughly 1.5% across all benchmarking datasets while reducing the number of diffusion steps by at least 10-fold, and the running time to roughly one half. The performance improvements are a result of a careful change in the objective function and a graph noise scheduling approach which, at each diffusion step, allows one to only change a subset of nodes of varying size in the molecule graph. Another relevant property of the method is that it can be easily combined with junction-tree-like graph representations that arise by compressing a collection of relevant ring structures into supernodes. Unlike classical junction-tree techniques that involve VAEs and require complicated reconstruction steps, compressed DMol directly performs graph diffusion on a graph that compresses only a carefully selected set of frequent carbon rings into supernodes, which results in straightforward sample generation. This compressed DMol method offers additional validity improvements over generic DMol of roughly 2%, increases the novelty of the method, and further improves the running time due to reductions in the graph size.
[949] Generalization Bounds for Robust Contrastive Learning: From Theory to Practice
Ngoc N. Tran, Lam Tran, Hoang Phan, Anh Bui, Tung Pham, Toan Tran, Dinh Phung, Trung Le
Main category: cs.LG
TL;DR: The paper provides theoretical analysis of Adversarial Contrastive Learning (ACL), revealing that besides adversarial contrastive loss, the benign contrastive loss and divergence between benign/adversarial examples also improve robustness.
Details
Motivation: While ACL shows strong empirical results for robust feature learning, its theoretical understanding is limited, particularly regarding how unsupervised training supports robust supervised loss.
Method: Develop rigorous theoretical analysis to identify components in unsupervised training that improve robust supervised loss, examining adversarial contrastive loss, benign contrastive loss, and global divergence between benign/adversarial examples.
Result: Theoretical analysis reveals that adversarial contrastive loss, benign contrastive loss, and global divergence between benign and adversarial examples all contribute to improving robustness in the supervised probing phase.
Conclusion: The study provides theoretical justification for ACL’s effectiveness and identifies multiple components in unsupervised training that enhance robustness, with experimental validation supporting the findings.
Abstract: Contrastive Learning first extracts features from unlabeled data, followed by linear probing with labeled data. Adversarial Contrastive Learning (ACL) integrates Adversarial Training into the first phase to enhance feature robustness against attacks in the probing phase. While ACL has shown strong empirical results, its theoretical understanding remains limited. Furthermore, while a fair number of theoretical works analyze how the unsupervised loss can support the supervised loss in the probing phase, none has examined its role in the robust supervised loss. To fill this gap, our work develops rigorous theories to identify which components of the unsupervised training can help improve the robust supervised loss. Specifically, besides the adversarial contrastive loss, we reveal that the benign one, along with a global divergence between benign and adversarial examples, can also improve robustness. Proper experiments are conducted to justify our findings.
[950] Measuring the (Un)Faithfulness of Concept-Based Explanations
Shubham Kumar, Narendra Ahuja
Main category: cs.LG
TL;DR: The paper introduces SURF, a new faithfulness measure for unsupervised concept-based explanation methods (U-CBEMs), and reveals that many visually appealing U-CBEMs are actually unfaithful to the model’s predictions.
Details
Motivation: Current U-CBEMs produce interpretable concepts but lack faithfulness - they fail to reproduce model predictions, and this deficiency has gone unnoticed due to fragmented evaluation with no standardized benchmarking.
Method: Organized prior metrics in a unified framework, introduced the SURF measure that quantifies faithfulness via surrogate predictive loss, and created a comprehensive U-CBEM faithfulness benchmark across diverse tasks and architectures.
Result: SURF outperforms prior faithfulness measures in controlled comparisons and reveals that many state-of-the-art U-CBEMs are surprisingly unfaithful despite their visual appeal.
Conclusion: SURF provides a reliable faithfulness measure for U-CBEMs, with demonstrated applicability in downstream settings like concept count analysis and adversarial robustness assessment.
Abstract: Post-hoc, unsupervised concept-based explanation methods (U-CBEMs) translate a vision model’s internal reasoning into human-understandable concepts, leading to interpretable explanations. However, we find that many state-of-the-art (SOTA) U-CBEMs are not faithful: their concepts seem interpretable but fail to reproduce the model’s predictions. We argue that this deficiency has gone unnoticed due to fragmented evaluation - each paper proposes its own faithfulness measure, with no measure-over-measure comparison or broad benchmarking. We close this gap by (i) organizing prior metrics in a unified framework, discussing their limitations, and identifying desiderata for a faithfulness measure; (ii) introducing the Surrogate Faithfulness (SURF) measure, which quantifies faithfulness via the predictive loss of a surrogate that maps explanations to the model’s outputs; and (iii) delivering the first comprehensive U-CBEM faithfulness benchmark across diverse tasks and architectures. In a controlled setting, SURF outperforms prior faithfulness measures in measure-over-measure comparisons, and applying SURF to SOTA U-CBEMs reveals that many visually appealing U-CBEMs are surprisingly unfaithful. We demonstrate SURF applicability in two downstream settings - (i) faithfulness versus the number of concepts used in the explanation and (ii) U-CBEM robustness to adversarial attacks - underscoring SURF’s value as a reliable faithfulness measure. Code to be released.
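The SURF idea, scoring an explanation by how well a surrogate maps it back to the model's outputs, can be sketched with a linear surrogate. The paper's surrogate class may differ; tensor names and the MSE objective here are illustrative:

```python
import torch

def surf_score(concept_scores, model_logits, epochs=200, lr=1e-2):
    """Fit a surrogate from concept explanations to the model's logits and
    return its predictive loss: lower loss = more faithful explanation."""
    surrogate = torch.nn.Linear(concept_scores.shape[1], model_logits.shape[1])
    opt = torch.optim.Adam(surrogate.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(surrogate(concept_scores), model_logits)
        loss.backward()
        opt.step()
    return loss.item()
```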
[951] Improving Model Fusion by Training-time Neuron Alignment with Fixed Neuron Anchors
Zexi Li, Zhiqi Li, Jie Lin, Tao Shen, Jun Xiao, Yike Guo, Tao Lin, Chao Wu
Main category: cs.LG
TL;DR: TNA-PFN is a training-time neuron alignment method that uses partially fixed neuron weights as anchors to enable better model fusion without post-training permutation matching, improving performance in federated learning and model fusion scenarios.
Details
Motivation: Model fusion is hindered by diverse neuron permutations across models trained under different settings, which reside in different loss basins. Previous post-training alignment methods are expensive, so training-time alignment offers a cheaper alternative.
Method: TNA-PFN utilizes partially fixed neuron weights as anchors during training to reduce potential permutations, enabling lossless neuron alignment. Based on this, FedPFN and FedPNU are proposed for federated learning.
Result: TNA-PFN reduces linear mode connectivity barriers and improves multi-model fusion. It enhances fusion of pretrained vision transformers and language models. FedPFN and FedPNU achieve state-of-the-art performance in heterogeneous federated learning.
Conclusion: Training-time neuron alignment through TNA-PFN is an effective approach for model fusion that is cheaper than post-alignment and applicable across various scenarios, with promising applications in federated learning.
Abstract: Model fusion aims to integrate several deep neural network (DNN) models’ knowledge into one by fusing parameters, and it has promising applications, such as improving the generalization of foundation models and parameter averaging in federated learning. However, models under different settings (data, hyperparameter, etc.) have diverse neuron permutations; in other words, from the perspective of loss landscape, they reside in different loss basins, thus hindering model fusion performances. To alleviate this issue, previous studies highlighted the role of permutation invariance and have developed methods to find correct network permutations for neuron alignment after training. Orthogonal to previous attempts, this paper studies training-time neuron alignment, improving model fusion without the need for post-matching. Training-time alignment is cheaper than post-alignment and is applicable in various model fusion scenarios. Starting from fundamental hypotheses and theorems, a simple yet lossless algorithm called TNA-PFN is introduced. TNA-PFN utilizes partially fixed neuron weights as anchors to reduce the potential of training-time permutations, and it is empirically validated in reducing the barriers of linear mode connectivity and multi-model fusion. It is also validated that TNA-PFN can improve the fusion of pretrained models under the setting of model soup (vision transformers) and ColD fusion (pretrained language models). Based on TNA-PFN, two federated learning methods, FedPFN and FedPNU, are proposed, showing the prospects of training-time neuron alignment. FedPFN and FedPNU reach state-of-the-art performances in federated learning under heterogeneous settings and can be compatible with the server-side algorithm.
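The anchor trick can be sketched as gradient masking: freeze a shared random subset of each weight tensor at initialization so all models keep identical anchors. A minimal PyTorch version; the ratio and per-element granularity are illustrative, whereas the paper anchors at the neuron level:

```python
import torch

def install_anchor_masks(model, fix_ratio=0.2, seed=0):
    """Zero the gradient of a fixed random subset of each weight tensor.
    Using the same seed in every model makes the anchors coincide, which is
    what discourages divergent neuron permutations during training."""
    g = torch.Generator().manual_seed(seed)
    for p in model.parameters():
        mask = torch.rand(p.shape, generator=g) < fix_ratio
        p.register_hook(lambda grad, m=mask: grad.masked_fill(m.to(grad.device), 0.0))
```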
[952] MobilityGPT: Enhanced Human Mobility Modeling with a GPT model
Ammar Haydari, Dongjie Chen, Zhengfeng Lai, Michael Zhang, Chen-Nee Chuah
Main category: cs.LG
TL;DR: MobilityGPT is a geospatially-aware generative model that reformats human mobility modeling as an autoregressive generation task using GPT architecture, with gravity-based sampling, road connectivity constraints, and reinforcement learning fine-tuning to generate semantically realistic trajectories.
Details
Motivation: Existing generative models struggle to ensure semantically realistic geospatial mobility data with consistent location sequences and real-world constraints like geospatial limits.
Method: Proposed MobilityGPT with three key components: gravity-based sampling for semantic sequence similarity, road connectivity matrix constraints to keep trajectories within geospatial limits, and reinforcement learning fine-tuning via RLTF mechanism to minimize travel distance differences.
Result: Superior performance over state-of-the-art methods in generating high-quality mobility trajectories closest to real data in terms of origin-destination similarity, trip length, travel radius, link, and gravity distributions.
Conclusion: MobilityGPT effectively addresses the challenges of generating semantically realistic and geospatially constrained human mobility trajectories through its novel architecture and training approach.
Abstract: Generative models have shown promising results in capturing human mobility characteristics and generating synthetic trajectories. However, it remains challenging to ensure that the generated geospatial mobility data is semantically realistic, with consistent location sequences, and reflects real-world constraints such as geospatial limits. We reformulate human mobility modeling as an autoregressive generation task to address these issues, leveraging the Generative Pre-trained Transformer (GPT) architecture. To enable controllable generation that alleviates the above challenges, we propose a geospatially-aware generative model, MobilityGPT. We propose a gravity-based sampling method to train the transformer for semantic sequence similarity. Then, we constrain the training process via a road connectivity matrix that encodes which segment sequences are feasible, thereby keeping generated trajectories within geospatial limits. Lastly, we construct a preference dataset for fine-tuning MobilityGPT via a Reinforcement Learning from Trajectory Feedback (RLTF) mechanism, which minimizes the travel distance between training trajectories and synthetically generated ones. Experiments on real-world datasets demonstrate MobilityGPT’s superior performance over state-of-the-art methods in generating high-quality mobility trajectories that are closest to real data in terms of origin-destination similarity, trip length, travel radius, link, and gravity distributions. We release the source code and reference links to datasets at https://github.com/ammarhydr/MobilityGPT.
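The connectivity constraint is essentially logit masking against a road adjacency matrix. Here is a decoding-time sketch (the paper applies the matrix during training; `adjacency` is assumed to be a 0/1 segment-by-segment tensor, and all names are illustrative):

```python
import torch

def mask_to_connected(logits, current_segment, adjacency):
    """Allow only road segments reachable from the current one: unreachable
    segments receive -inf logits and thus zero probability after softmax."""
    allowed = adjacency[current_segment].bool()
    return logits.masked_fill(~allowed, float("-inf"))
```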
[953] FlowPrecision: Advancing FPGA-Based Real-Time Fluid Flow Estimation with Linear Quantization
Tianheng Ling, Julian Hoever, Chao Qian, Gregor Schiele
Main category: cs.LG
TL;DR: This paper applies linear quantization in FPGA-based soft sensors for fluid flow estimation, achieving significant improvements in Neural Network model precision and inference speed compared to traditional fixed-point quantization.
Details
Motivation: Achieving real-time and precise fluid flow measurement in industrial and environmental monitoring remains a critical challenge, with limitations in traditional fixed-point quantization methods.
Method: Applied linear quantization in FPGA-based soft sensors for fluid flow estimation, with targeted hardware optimizations to enhance Neural Network model precision.
Result: Achieved up to 10.10% reduction in Mean Squared Error and 9.39% improvement in inference speed. Validated across multiple data sets, demonstrating efficient and accurate real-time inference.
Conclusion: The optimized FPGA-based quantized models provide a viable alternative to cloud-based processing in pervasive autonomous systems, offering efficient and accurate real-time inference capabilities.
Abstract: In industrial and environmental monitoring, achieving real-time and precise fluid flow measurement remains a critical challenge. This study applies linear quantization in FPGA-based soft sensors for fluid flow estimation, significantly enhancing Neural Network model precision by overcoming the limitations of traditional fixed-point quantization. Our approach achieves up to a 10.10% reduction in Mean Squared Error and a notable 9.39% improvement in inference speed through targeted hardware optimizations. Validated across multiple data sets, our findings demonstrate that the optimized FPGA-based quantized models can provide efficient, accurate real-time inference, offering a viable alternative to cloud-based processing in pervasive autonomous systems.
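For readers unfamiliar with the term, "linear quantization" here is the standard affine scheme with a scale and zero-point, in contrast to fixed-point's rigid binary scaling. A generic quantize/dequantize sketch, not the authors' FPGA implementation:

```python
import numpy as np

def linear_quantize(x, num_bits=8):
    """Affine quantization: map the tensor's observed range onto the integer
    grid [0, 2^b - 1], then dequantize to expose the approximation error."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = max((x.max() - x.min()) / (qmax - qmin), 1e-12)  # guard constant tensors
    zero_point = np.round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale
```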
[954] Machine learning augmented diagnostic testing to identify sources of variability in test performance
Christopher J. Banks, Aeron Sanchez, Vicki Stewart, Kate Bowen, Thomas Doherty, Oliver Tearne, Graham Smith, Rowland R. Kao
Main category: cs.LG
TL;DR: Machine learning improves bovine tuberculosis diagnostic test sensitivity by 5 percentage points, detecting 240 more infected herds annually without compromising specificity.
Details
Motivation: To enhance diagnostic testing for infectious diseases by using machine learning to assess risk factors and improve test interpretation, particularly for bovine tuberculosis in cattle herds.
Method: Used machine learning to analyze detailed testing records and surrounding risk factors, applying feature importance testing to weight risk factors for predicting bovine tuberculosis incidents.
Result: Test sensitivity improved by over 5 percentage points, detecting 240 additional infected herds per year beyond what the skin test alone could identify, without reducing test specificity.
Conclusion: Machine learning can significantly enhance diagnostic test performance by incorporating risk landscape analysis, with some herds showing higher risk of undetected infections despite identified risk factors.
Abstract: Diagnostic tests that can detect pre-clinical or sub-clinical infection are one of the most powerful tools in our armoury of weapons to control infectious diseases. Considerable effort has been devoted to improving diagnostic testing for human, plant and animal diseases, including strategies for targeting the use of diagnostic tests towards individuals who are more likely to be infected. We use machine learning to assess the surrounding risk landscape under which a diagnostic test is applied to augment its interpretation. We develop this to predict the occurrence of bovine tuberculosis incidents in cattle herds, exploiting the availability of exceptionally detailed testing records. We show that, without compromising test specificity, test sensitivity can be improved so that the proportion of infected herds detected improves by over 5 percentage points, or 240 additional infected herds detected in one year beyond those detected by the skin test alone. We also use feature importance testing for assessing the weighting of risk factors. While many factors are associated with increased risk of incidents, of note are several factors that suggest that in some herds there is a higher risk of infection going undetected.
[955] KV-weights are all you need for skipless transformers
Nils Graef
Main category: cs.LG
TL;DR: This paper proposes mathematically equivalent skipless transformer versions for MQA and GQA architectures, extending previous work that only worked with MHA. The approach removes Q and P linear layers, reducing weights by 15% in models like Mistral-7B.
Details
Motivation: Previous skipless transformer work only applied to multi-head attention (MHA), but many popular LLMs use multi-query attention (MQA) and grouped-query attention (GQA), creating a need for compatible versions.
Method: Developed mathematically equivalent transformer architectures for MQA and GQA that remove the Q and P (post-attention projection) linear layers, paralleling the V-and-P removal previously possible only for MHA.
Result: Successfully created skipless versions for MQA and GQA that can remove 15% of weights from models like Mistral-7B, reducing compute and memory complexity.
Conclusion: The proposed skipless transformer variants enable weight reduction and efficiency improvements for popular LLMs using MQA and GQA architectures, extending the benefits previously only available for MHA models.
Abstract: He and Hofmann (arXiv:2311.01906) detailed a skipless transformer without the V and P (post-attention projection) linear layers, which reduces the total number of weights. However, this scheme is only applicable to MHA (multi-head attention), but not for MQA (multi-query attention) and GQA (grouped-query attention). The latter schemes are used by many popular LLMs such as Llama 2, Mistral, Mixtral, PaLM, and Gemma. Therefore, this micro-paper proposes mathematically equivalent versions that are suitable for MQA and GQA. For example, removing Q and P from a skipless version of Mistral-7B would remove 15% of its weights (and thus reduce its compute and memory complexity). Watch our explainer video https://youtu.be/Tx_lMpphd2g and see https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks.
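Why such removals can be lossless is easiest to see in the single-head skipless case, where associativity lets V and P collapse into one matrix (the paper's MQA/GQA variants fold Q and P analogously). A small numerical check with illustrative shapes:

```python
import torch

d, n = 16, 10
X = torch.randn(n, d)                         # token representations
A = torch.softmax(torch.randn(n, n), dim=-1)  # attention weights
W_v, W_p = torch.randn(d, d), torch.randn(d, d)

# (A @ (X @ W_v)) @ W_p == A @ (X @ (W_v @ W_p)): with no skip connection in
# the way, the product W_v @ W_p can be stored as a single weight matrix.
out_separate = (A @ (X @ W_v)) @ W_p
out_folded = A @ (X @ (W_v @ W_p))
assert torch.allclose(out_separate, out_folded, atol=1e-4)
```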
[956] WaveCastNet: Rapid Wavefield Forecasting for Earthquake Early Warning via Deep Sequence to Sequence Learning
Dongwei Lyu, Rie Nakata, Pu Ren, Michael W. Mahoney, Arben Pitarka, Nori Nakata, N. Benjamin Erichson
Main category: cs.LG
TL;DR: WaveCastNet is a deep learning model for wavefield forecasting that uses convolutional long expressive memory in a sequence-to-sequence framework, requiring fewer parameters than transformers while achieving better generalization to rare seismic events.
Details
Motivation: To develop a more efficient and accurate method for forecasting high-dimensional wavefields, particularly for rare and critical seismic scenarios like high-magnitude earthquakes, without relying on error-prone conventional methods or empirical ground-motion models.
Method: Integrates convolutional long expressive memory architecture into sequence-to-sequence forecasting framework, sharing weights across spatial and temporal dimensions to reduce parameters and enable modeling of long-term dependencies and multiscale patterns.
Result: WaveCastNet achieves faster inference times than transformers, generalizes better to rare seismic scenarios, can predict intensity and timing of destructive ground motions in real time, and demonstrates zero-shot capabilities on real earthquake data.
Conclusion: WaveCastNet provides an efficient and effective alternative to conventional methods for wavefield forecasting, eliminating the need for error-prone magnitude/ epicenter estimation and empirical ground-motion models while capturing complex wave propagation effects.
Abstract: We propose a new deep learning model, WaveCastNet, to forecast high-dimensional wavefields. WaveCastNet integrates a convolutional long expressive memory architecture into a sequence-to-sequence forecasting framework, enabling it to model long-term dependencies and multiscale patterns in both space and time. By sharing weights across spatial and temporal dimensions, WaveCastNet requires significantly fewer parameters than more resource-intensive models such as transformers, resulting in faster inference times. Crucially, WaveCastNet also generalizes better than transformers to rare and critical seismic scenarios, such as high-magnitude earthquakes. Here, we show the ability of the model to predict the intensity and timing of destructive ground motions in real time, using simulated data from the San Francisco Bay Area. Furthermore, we demonstrate its zero-shot capabilities by evaluating WaveCastNet on real earthquake data. Our approach does not require estimating earthquake magnitudes and epicenters, steps that are prone to error in conventional methods, nor does it rely on empirical ground-motion models, which often fail to capture strongly heterogeneous wave propagation effects.
[957] Efficient Federated Learning against Byzantine Attacks and Data Heterogeneity via Aggregating Normalized Gradients
Shiyuan Zuo, Xingrun Yan, Rongfei Fan, Li Shen, Puning Zhao, Jie Xu, Han Hu
Main category: cs.LG
TL;DR: Fed-NGA is a Byzantine-robust federated learning algorithm that uses normalized gradient aggregation with O(pM) time complexity, achieving convergence under data heterogeneity and Byzantine attacks.
Details
Motivation: Existing Byzantine-robust FL methods incur high computational overhead during gradient aggregation, slowing down training; an efficient solution that handles both Byzantine attacks and data heterogeneity is needed.
Method: Proposes Fed-NGA, which aggregates the weighted mean of normalized gradients from clients, achieving O(pM) time complexity where p is the model dimension and M is the number of clients.
Result: Fed-NGA converges to stationary points for non-convex functions under general assumptions, achieves zero optimality gap under mild conditions with O(1/T^(1/2-δ)) convergence rate. Experiments show superior time efficiency and convergence.
Conclusion: Fed-NGA is a simple, efficient Byzantine-robust FL algorithm that handles data heterogeneity with low computational overhead and proven convergence guarantees.
Abstract: Federated Learning (FL) enables multiple clients to collaboratively train models without sharing raw data, but is vulnerable to Byzantine attacks and data heterogeneity, which can severely degrade performance. Existing Byzantine-robust approaches tackle data heterogeneity, but incur high computational overhead during gradient aggregation, thereby slowing down the training process. To address this issue, we propose a simple yet effective Federated Normalized Gradients Algorithm (Fed-NGA), which performs aggregation by merely computing the weighted mean of the normalized gradients from each client. This approach yields a favorable time complexity of $\mathcal{O}(pM)$, where $p$ is the model dimension and $M$ is the number of clients. We rigorously prove that Fed-NGA is robust to both Byzantine faults and data heterogeneity. For non-convex loss functions, Fed-NGA achieves convergence to a neighborhood of stationary points under general assumptions, and further attains zero optimality gap under some mild conditions, which is an outcome rarely achieved in existing literature. In both cases, the convergence rate is $\mathcal{O}(1/T^{\frac{1}{2} - \delta})$, where $T$ denotes the number of iterations and $\delta \in (0, 1/2)$. Experimental results on benchmark datasets confirm the superior time efficiency and convergence performance of Fed-NGA over existing methods.
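The aggregation rule itself is one line per client: normalize, then average. A sketch with flattened gradient tensors (weights default to uniform; names illustrative):

```python
import torch

def fed_nga_aggregate(client_grads, weights=None):
    """Weighted mean of unit-normalized client gradients -- O(p*M) overall,
    since each client's p-dimensional gradient is touched once."""
    if weights is None:
        weights = [1.0 / len(client_grads)] * len(client_grads)
    normed = [g / (g.norm() + 1e-12) for g in client_grads]
    return sum(w * g for w, g in zip(weights, normed))
```

Normalization caps any single (possibly Byzantine) client's influence at its aggregation weight, which is where the robustness comes from.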
[958] FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA
Seanie Lee, Sangwoo Park, Dong Bok Lee, Dominik Wagner, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang
Main category: cs.LG
TL;DR: FedSVD addresses noise amplification in federated learning with LoRA and DP-SGD by using SVD-based global reparameterization, where clients optimize only B matrix and server refactorizes via SVD to avoid quadratic noise while maintaining expressiveness.
Details
Motivation: LoRA combined with DP-SGD in federated learning suffers from substantial noise amplification due to matrix multiplication intensifying gradient perturbations, and freezing one matrix reduces noise but limits model expressiveness.
Method: FedSVD introduces a global SVD reparameterization: clients optimize only the B matrix, and the server aggregates the B matrices, computes the BA product with the previous A, and refactorizes via SVD to obtain a new orthonormal A from the right singular vectors and an updated B from the remaining components.
Result: FedSVD consistently improves stability and performance across various privacy settings and benchmarks, outperforming baselines under both private and non-private regimes, with theoretical confirmation of bounded gradient norms and preserved signal under DP-SGD.
Conclusion: The SVD-based reparameterization in FedSVD effectively addresses noise amplification in federated LoRA with DP-SGD, maintaining model expressiveness while improving performance and stability through orthonormal structure and better principal direction capture.
Abstract: Low-Rank Adaptation (LoRA), which introduces a product of two trainable low-rank matrices into frozen pre-trained weights, is widely used for efficient fine-tuning of language models in federated learning (FL). However, when combined with differentially private stochastic gradient descent (DP-SGD), LoRA faces substantial noise amplification: DP-SGD perturbs per-sample gradients, and the matrix multiplication of the LoRA update ($BA$) intensifies this effect. Freezing one matrix (e.g., $A$) reduces the noise but restricts model expressiveness, often resulting in suboptimal adaptation. To address this, we propose $\texttt{FedSVD}$, a simple yet effective method that introduces a global reparameterization based on singular value decomposition (SVD). In our approach, each client optimizes only the $B$ matrix and transmits it to the server. The server aggregates the $B$ matrices, computes the product $BA$ using the previous $A$, and refactorizes the result via SVD. This yields a new adaptive $A$ composed of the orthonormal right singular vectors of $BA$, and an updated $B$ containing the remaining SVD components. This reparameterization avoids quadratic noise amplification, while allowing $A$ to better capture the principal directions of the aggregate updates. Moreover, the orthonormal structure of $A$ bounds the gradient norms of $B$ and preserves more signal under DP-SGD, as confirmed by our theoretical analysis. As a result, $\texttt{FedSVD}$ consistently improves stability and performance across a variety of privacy settings and benchmarks, outperforming relevant baselines under both private and non-private regimes.
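A minimal sketch of the server-side refactorization step described above, assuming standard LoRA shapes (B in R^{d×r}, A in R^{r×k}); aggregation by simple averaging is an illustrative choice.

```python
import numpy as np

def fedsvd_server_step(B_clients, A_prev):
    """Aggregate client B matrices, form BA with the previous A, and
    refactorize so the new A has orthonormal rows.
    B_clients: list of (d, r) arrays; A_prev: (r, k) array."""
    B_agg = np.mean(B_clients, axis=0)           # simple averaging (assumption)
    BA = B_agg @ A_prev                          # (d, k), rank <= r
    U, S, Vt = np.linalg.svd(BA, full_matrices=False)
    r = A_prev.shape[0]
    A_new = Vt[:r]                               # orthonormal right singular vectors
    B_new = U[:, :r] * S[:r]                     # remaining SVD components
    return B_new, A_new                          # B_new @ A_new reproduces BA
```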
[959] PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis
Yan Wu, Esther Wershof, Sebastian M Schmon, Marcel Nassar, Błażej Osiński, Ridvan Eksi, Zichao Yan, Rory Stark, Kun Zhang, Thore Graepel
Main category: cs.LG
TL;DR: A comprehensive benchmarking framework for single-cell transcriptomic perturbation models, including evaluation platform, datasets, and metrics to compare model performance and identify limitations.
Details
Motivation: To standardize benchmarking in the rapidly evolving field of single-cell transcriptomic perturbation modeling and support robust model development for therapeutic discovery.
Method: Developed a modular, user-friendly platform with diverse perturbational datasets and evaluation metrics, extensively evaluating published and baseline models across various datasets.
Result: Identified limitations in widely used models (e.g., mode collapse), showed importance of rank metrics alongside traditional measures, found simpler architectures competitive and scalable with larger datasets.
Conclusion: The benchmarking framework sets new standards for model evaluation, supports robust model development, and advances the use of these models for simulating genetic and chemical screens in therapeutic discovery.
Abstract: We introduce a comprehensive framework for modeling single cell transcriptomic responses to perturbations, aimed at standardizing benchmarking in this rapidly evolving field. Our approach includes a modular and user-friendly model development and evaluation platform, a collection of diverse perturbational datasets, and a set of metrics designed to fairly compare models and dissect their performance. Through extensive evaluation of both published and baseline models across diverse datasets, we highlight the limitations of widely used models, such as mode collapse. We also demonstrate the importance of rank metrics which complement traditional model fit measures, such as RMSE, for validating model effectiveness. Notably, our results show that while no single model architecture clearly outperforms others, simpler architectures are generally competitive and scale well with larger datasets. Overall, this benchmarking exercise sets new standards for model evaluation, supports robust model development, and furthers the use of these models to simulate genetic and chemical screens for therapeutic discovery.
[960] Mitigating Distribution Shift in Model-based Offline RL via Shifts-aware Reward Learning
Wang Luo, Haoran Li, Zicheng Zhang, Congying Han, Chi Zhou, Jiayu Lv, Tiande Guo
Main category: cs.LG
TL;DR: This paper addresses distribution shift in model-based offline RL by analyzing model bias and policy shift, proposing a shifts-aware reward framework to improve value estimation and policy optimization.
Details
Motivation: Model-based offline RL faces inherent distribution shift challenges from using pre-collected datasets and learned models, and existing methods have inconsistent objectives and lack a unified theoretical foundation.
Method: The authors disentangle distribution shift into model bias and policy shift, derive a shifts-aware reward through a probabilistic inference framework, and implement it using classifier-based techniques to approximate the adjusted reward.
Result: Empirical results across multiple benchmarks show the approach mitigates distribution shift and achieves superior or comparable performance compared to existing methods.
Conclusion: The proposed shifts-aware reward framework effectively addresses distribution shift in model-based offline RL, providing both theoretical insights and practical implementation that improves policy optimization.
Abstract: Model-based offline reinforcement learning trains policies using pre-collected datasets and learned environment models, eliminating the need for direct real-world environment interaction. However, this paradigm is inherently challenged by distribution shift~(DS). Existing methods address this issue by leveraging off-policy mechanisms and estimating model uncertainty, but they often result in inconsistent objectives and lack a unified theoretical foundation. This paper offers a comprehensive analysis that disentangles the problem into two fundamental components: model bias and policy shift. Our theoretical and empirical investigations reveal how these factors distort value estimation and restrict policy optimization. To tackle these challenges, we derive a novel shifts-aware reward through a unified probabilistic inference framework, which modifies the vanilla reward to refine value learning and facilitate policy training. Building on this, we develop a practical implementation that leverages classifier-based techniques to approximate the adjusted reward for effective policy optimization. Empirical results across multiple benchmarks demonstrate that the proposed approach mitigates distribution shift and achieves superior or comparable performance, validating our theoretical insights.
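The abstract's "classifier-based techniques" suggests the standard density-ratio trick; the sketch below is one plausible instantiation (an assumption, not the paper's exact estimator), in which a discriminator trained to tell real transitions from model rollouts yields a logit that approximates log p_real/p_model and is added to the vanilla reward.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 4))   # transitions from the dataset
fake = rng.normal(0.5, 1.2, size=(1000, 4))   # transitions from the learned model

X = np.vstack([real, fake])
y = np.concatenate([np.ones(1000), np.zeros(1000)])
clf = LogisticRegression(max_iter=1000).fit(X, y)

def shifts_aware_reward(r, x, lam=1.0):
    # Discriminator logit ~ log p_real(x)/p_model(x): down-weights
    # rewards in regions the learned model over-represents.
    logit = clf.decision_function(x.reshape(1, -1))[0]
    return r + lam * logit
```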
[961] RLVR-World: Training World Models with Reinforcement Learning
Jialong Wu, Shaofeng Yin, Ningya Feng, Mingsheng Long
Main category: cs.LG
TL;DR: RLVR-World uses reinforcement learning with verifiable rewards to directly optimize world models for task-specific metrics like accuracy and perceptual quality, achieving significant performance gains across language and video domains.
Details
Motivation: Standard training objectives like maximum likelihood estimation often misalign with the actual task-specific goals of world models, such as transition prediction metrics like accuracy or perceptual quality.
Method: RLVR-World leverages reinforcement learning with verifiable rewards (RLVR) to directly optimize world models for specific metrics, evaluating metrics of decoded predictions as verifiable rewards despite formulating world modeling as autoregressive prediction of tokenized sequences.
Result: The framework demonstrates substantial performance gains on both language- and video-based world models across domains including text games, web navigation, and robot manipulation.
Conclusion: Beyond recent advances in reasoning language models, RLVR offers a promising post-training paradigm for enhancing the utility of generative models more broadly.
Abstract: World models predict state transitions in response to actions and are increasingly developed across diverse modalities. However, standard training objectives such as maximum likelihood estimation (MLE) often misalign with task-specific goals of world models, i.e., transition prediction metrics like accuracy or perceptual quality. In this paper, we present RLVR-World, a unified framework that leverages reinforcement learning with verifiable rewards (RLVR) to directly optimize world models for such metrics. Despite formulating world modeling as autoregressive prediction of tokenized sequences, RLVR-World evaluates metrics of decoded predictions as verifiable rewards. We demonstrate substantial performance gains on both language- and video-based world models across domains, including text games, web navigation, and robot manipulation. Our work indicates that, beyond recent advances in reasoning language models, RLVR offers a promising post-training paradigm for enhancing the utility of generative models more broadly. Code, datasets, models, and video samples are available at the project website: https://thuml.github.io/RLVR-World.
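The core recipe, abstracted: decode the token-level prediction and use a task metric on the decoded output directly as the verifiable reward. The helper below is a schematic sketch; `decode` and `metric` are placeholders (assumptions), e.g., a detokenizer and exact-match accuracy.

```python
def verifiable_reward(pred_tokens, target, decode, metric):
    """Decode an autoregressive token prediction and score it with a
    task metric; the metric value is used directly as the reward."""
    return metric(decode(pred_tokens), target)

# Exact-match accuracy as the transition-prediction metric:
r = verifiable_reward([4, 2], "4 2",
                      decode=lambda toks: " ".join(map(str, toks)),
                      metric=lambda pred, ref: float(pred == ref))
```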
[962] Retraining-Free Merging of Sparse MoE via Hierarchical Clustering
I-Chun Chen, Hsu-Shen Liu, Wei-Fang Sun, Chen-Hao Chao, Yen-Chang Hsu, Chun-Yi Lee
Main category: cs.LG
TL;DR: HC-SMoE is a task-agnostic expert merging framework that uses hierarchical clustering based on expert outputs to reduce memory requirements in Sparse Mixture-of-Experts models without retraining.
Details
Motivation: Sparse Mixture-of-Experts models face deployment constraints due to extensive memory requirements of expert components in resource-limited environments, despite their efficient parameter utilization and performance improvements.
Method: Introduces hierarchical clustering based on expert outputs to merge experts, ensuring robustness independent of routing decisions. The output-based clustering captures functional relationships between experts for large-scale architectures.
Result: Comprehensive evaluations across multiple zero-shot language tasks demonstrate HC-SMoE’s effectiveness in state-of-the-art models including Qwen and Mixtral, showing superior performance and practical applicability.
Conclusion: HC-SMoE provides an effective solution for parameter reduction in SMoE models without retraining, enabling real-world deployments in resource-limited environments while maintaining performance.
Abstract: Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts for large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE’s effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE’s superior performance and practical applicability for real-world deployments.
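A compact sketch of output-based expert merging, assuming average-linkage clustering on per-expert mean activations and uniform weight averaging within each cluster (the paper's exact linkage and merging rules may differ).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def merge_experts(expert_outputs, expert_weights, n_merged):
    """Cluster experts by their mean outputs on calibration data and
    average the weights within each cluster.
    expert_outputs: (E, f) array; expert_weights: list of E arrays of
    identical shape."""
    Z = linkage(expert_outputs, method="average")            # linkage choice assumed
    labels = fcluster(Z, t=n_merged, criterion="maxclust")   # labels in 1..n_merged
    merged = [np.mean([w for w, l in zip(expert_weights, labels) if l == c], axis=0)
              for c in range(1, n_merged + 1)]
    return merged, labels
```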
[963] Adaptive Inference-Time Scaling via Cyclic Diffusion Search
Gyubin Lee, Truong Nhat Nguyen Bao, Jaesik Yoon, Dongwoo Lee, Minsu Kim, Yoshua Bengio, Sungjin Ahn
Main category: cs.LG
TL;DR: ABCD is a flexible inference framework that adaptively adjusts computational effort during diffusion model inference using bi-directional cycles and automatic exploration-exploitation balancing.
Details
Motivation: Current diffusion models use fixed denoising schedules that cannot adapt computational allocation based on instance difficulty or task-specific demands, limiting their efficiency and effectiveness.
Method: ABCD uses three components: Cyclic Diffusion Search for bi-directional refinement, Automatic Exploration-Exploitation Balancing to control depth, and Adaptive Thinking Time for termination control.
Result: Experiments demonstrate that ABCD improves performance across diverse tasks while maintaining computational efficiency compared to fixed-schedule methods.
Conclusion: ABCD enables adaptive inference-time scaling for diffusion models, allowing dynamic computational allocation that enhances performance without sacrificing efficiency.
Abstract: Diffusion models have demonstrated strong generative capabilities across domains ranging from image synthesis to complex reasoning tasks. However, most inference-time scaling methods rely on fixed denoising schedules, limiting their ability to allocate computation based on instance difficulty or task-specific demands adaptively. We introduce the challenge of adaptive inference-time scaling-dynamically adjusting computational effort during inference-and propose Adaptive Bi-directional Cyclic Diffusion (ABCD), a flexible, search-based inference framework. ABCD refines outputs through bi-directional diffusion cycles while adaptively controlling exploration depth and termination. It comprises three components: Cyclic Diffusion Search, Automatic Exploration-Exploitation Balancing, and Adaptive Thinking Time. Experiments show that ABCD improves performance across diverse tasks while maintaining computational efficiency.
[964] DeepVigor+: Scalable and Accurate Semi-Analytical Fault Resilience Analysis for Deep Neural Network
Mohammad Hasan Ahmadilivani, Jaan Raik, Masoud Daneshtalab, Maksim Jenihhin
Main category: cs.LG
TL;DR: DeepVigor+ is a fast semi-analytical method for reliability measurement in CNNs that achieves less than 1% error with 14.9-26.9x fewer simulations than state-of-the-art statistical fault injection.
Details
Motivation: Hardware reliability assessment for ML systems is crucial for safety-critical applications, but conventional fault injection methods are prohibitively time-consuming for large CNNs, and even statistical FI has long runtimes.
Method: DeepVigor+ implements a fault propagation analysis model to acquire Vulnerability Factors as reliability metrics in an optimal way, using a semi-analytical approach.
Result: DeepVigor+ obtains Vulnerability Factors with less than 1% error while requiring 14.9 to 26.9 times fewer simulations than the best-known statistical fault injection method, enabling analysis in minutes instead of days or weeks.
Conclusion: DeepVigor+ provides a scalable, fast, and accurate alternative for reliability measurement in CNNs, making practical reliability analysis feasible for large and deep neural networks.
Abstract: The growing exploitation of Machine Learning (ML) in safety-critical applications necessitates rigorous safety analysis. Hardware reliability assessment is a major concern with respect to measuring the level of safety in ML-based systems. Quantifying the reliability of emerging ML models, including Convolutional Neural Networks (CNNs), is highly complex due to their enormous size in terms of the number of parameters and computations. Conventionally, Fault Injection (FI) is applied to perform a reliability measurement. However, performing FI on modern-day CNNs is prohibitively time-consuming if an acceptable confidence level is to be achieved. To speed up FI for large CNNs, statistical FI (SFI) has been proposed, but its runtimes are still considerably long. In this work, we introduce DeepVigor+, a scalable, fast, and accurate semi-analytical method as an efficient alternative for reliability measurement in CNNs. DeepVigor+ implements a fault propagation analysis model and attempts to acquire Vulnerability Factors (VFs) as reliability metrics in an optimal way. The results indicate that DeepVigor+ obtains VFs for CNN models with an error less than $1\%$, i.e., the objective in SFI, but with $14.9$ up to $26.9$ times fewer simulations than the best-known state-of-the-art SFI. DeepVigor+ enables an accurate reliability analysis for large and deep CNNs within a few minutes, rather than achieving the same results in days or weeks.
[965] Efficient Adaptive Federated Optimization
Su Hyeong Lee, Sidharth Sharma, Manzil Zaheer, Tian Li
Main category: cs.LG
TL;DR: FedAda^2 and FedAda^2++ are efficient adaptive algorithms for federated learning that optimize communication and memory usage while maintaining convergence rates comparable to resource-intensive joint adaptive methods.
Details
Motivation: Joint adaptivity on both server and client sides is essential for optimal federated learning performance, but existing approaches face scalability issues due to communication and memory resource limitations.
Method: FedAda^2 avoids transferring preconditioners between server and clients to optimize communication. FedAda^2++ extends this by incorporating memory-efficient adaptive optimizers on the client side to reduce on-device memory usage.
Result: Theoretically, both algorithms achieve same convergence rates for general non-convex objectives as resource-intensive joint adaptive counterparts. Extensive empirical evaluations on image and text datasets demonstrate effectiveness.
Conclusion: FedAda^2 and FedAda^2++ provide scalable solutions for large-scale federated learning by enabling joint adaptivity while overcoming communication and memory constraints.
Abstract: Adaptive optimization is critical in federated learning, where enabling adaptivity on both the server and client sides has proven essential for achieving optimal performance. However, the scalability of such jointly adaptive systems is often hindered by resource limitations in communication and memory. In this paper, we introduce a class of efficient adaptive algorithms, named $FedAda^2$ and its enhanced version $FedAda^2$++, designed specifically for large-scale, cross-device federated environments. $FedAda^2$ optimizes communication efficiency by avoiding the transfer of preconditioners between the server and clients. Additionally, $FedAda^2$++ extends this approach by incorporating memory-efficient adaptive optimizers on the client side, further reducing on-device memory usage. Theoretically, we demonstrate that $FedAda^2$ and $FedAda^2$++ achieve the same convergence rates for general, non-convex objectives as their more resource-intensive counterparts that directly integrate joint adaptivity. Extensive empirical evaluations on image and text datasets demonstrate both the advantages of joint adaptivity and the effectiveness of $FedAda^2$/$FedAda^2$++.
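The communication saving comes from keeping adaptive state server-side only. A minimal sketch, assuming a FedOpt-style server Adam on the averaged client delta; the class name and hyperparameters are illustrative, not the paper's API.

```python
import numpy as np

class ServerAdam:
    """Server-side Adam on the averaged client delta; the optimizer
    state (m, v) never leaves the server, so no preconditioners are
    communicated."""
    def __init__(self, dim, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
        self.m, self.v, self.t = np.zeros(dim), np.zeros(dim), 0
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps

    def step(self, params, client_deltas):
        g = -np.mean(client_deltas, axis=0)      # pseudo-gradient
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```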
[966] DNN Modularization via Activation-Driven Training
Tuan Ngo, Abid Hassan, Saad Shafiq, Nenad Medvidovic
Main category: cs.LG
TL;DR: MODA is an activation-driven modular training approach that decomposes DNNs into modules by regulating layer activations, achieving faster training, fewer weights, less overlap, and better accuracy preservation compared to existing methods.
Details
Motivation: DNNs accumulate technical debt and have high retraining costs when adapting to changing requirements. Existing modularization techniques suffer from weight overlaps, accuracy losses, limited scope to convolutional layers, and increased complexity/training time.
Method: MODA promotes inherent modularity by directly regulating layer activation outputs based on three objectives: intra-class affinity, inter-class dispersion, and compactness.
Result: MODA achieves 22% less training time, modules with up to 24x fewer weights and 37x less weight overlap, preserves original model accuracy without fine-tuning, and improves target class accuracy by 12% in module replacement scenarios.
Conclusion: MODA provides an effective modular training approach that addresses limitations of existing methods, offering faster training, more efficient modules, and better accuracy preservation for evolving DNN requirements.
Abstract: Deep Neural Networks (DNNs) tend to accrue technical debt and suffer from significant retraining costs when adapting to evolving requirements. Modularizing DNNs offers the promise of improving their reusability. Previous work has proposed techniques to decompose DNN models into modules both during and after training. However, these strategies yield several shortcomings, including significant weight overlaps and accuracy losses across modules, restricted focus on convolutional layers only, and added complexity and training time by introducing auxiliary masks to control modularity. In this work, we propose MODA, an activation-driven modular training approach. MODA promotes inherent modularity within a DNN model by directly regulating the activation outputs of its layers based on three modular objectives: intra-class affinity, inter-class dispersion, and compactness. MODA is evaluated using three well-known DNN models and five datasets with varying sizes. This evaluation indicates that, compared to the existing state-of-the-art, using MODA yields several advantages: (1) MODA accomplishes modularization with 22% less training time; (2) the resultant modules generated by MODA comprise up to 24x fewer weights and 37x less weight overlap while (3) preserving the original model’s accuracy without additional fine-tuning; in module replacement scenarios, (4) MODA improves the accuracy of a target class by 12% on average while ensuring minimal impact on the accuracy of other classes.
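A sketch of what regulating activations toward the three named objectives can look like, under simple cosine-similarity and L1 instantiations (assumptions; the paper's exact formulations may differ).

```python
import torch
import torch.nn.functional as F

def moda_objectives(acts, labels, gamma=0.1):
    """Activation-level objectives, simply instantiated: cosine
    similarity for affinity/dispersion, L1 mass for compactness
    (assumes each class appears at least twice in the batch).
    acts: (N, d) activations of one layer; labels: (N,) class ids."""
    a = F.normalize(acts, dim=1)
    sim = a @ a.t()                                   # pairwise cosine similarity
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=acts.device)
    intra_affinity = sim[same & ~eye].mean()          # same class: stay similar
    inter_dispersion = sim[~same].mean()              # different class: diverge
    compactness = acts.abs().mean()                   # keep activations compact
    # maximize affinity, minimize dispersion and activation mass
    return -intra_affinity + inter_dispersion + gamma * compactness
```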
[967] Hopfield-Fenchel-Young Networks: A Unified Framework for Associative Memory Retrieval
Saul Santos, Vlad Niculae, Daniel McNamee, André F. T. Martins
Main category: cs.LG
TL;DR: A unified framework called Hopfield-Fenchel-Young networks that generalizes associative memory models using Fenchel-Young losses, enabling sparse transformations and structured pattern retrieval.
Details
Motivation: To create a unified framework that generalizes traditional and modern Hopfield networks, connecting them with self-attention mechanisms and enabling sparse transformations and structured pattern associations.
Method: Formulate energy functions as differences between two Fenchel-Young losses, using Tsallis and norm entropies to derive differentiable update rules, and extend to structured networks using the SparseMAP transformation.
Result: The framework unifies various Hopfield networks, provides energy minimization perspective for common transformations, and demonstrates effectiveness on memory recall tasks including image retrieval and text rationalization.
Conclusion: Hopfield-Fenchel-Young networks offer a comprehensive framework that extends associative memory models, connects them with modern attention mechanisms, and enables sparse and structured pattern retrieval through convex analysis foundations.
Abstract: Associative memory models, such as Hopfield networks and their modern variants, have garnered renewed interest due to advancements in memory capacity and connections with self-attention in transformers. In this work, we introduce a unified framework-Hopfield-Fenchel-Young networks-which generalizes these models to a broader family of energy functions. Our energies are formulated as the difference between two Fenchel-Young losses: one, parameterized by a generalized entropy, defines the Hopfield scoring mechanism, while the other applies a post-transformation to the Hopfield output. By utilizing Tsallis and norm entropies, we derive end-to-end differentiable update rules that enable sparse transformations, uncovering new connections between loss margins, sparsity, and exact retrieval of single memory patterns. We further extend this framework to structured Hopfield networks using the SparseMAP transformation, allowing the retrieval of pattern associations rather than a single pattern. Our framework unifies and extends traditional and modern Hopfield networks and provides an energy minimization perspective for widely used post-transformations like $\ell_2$-normalization and layer normalization-all through suitable choices of Fenchel-Young losses and by using convex analysis as a building block. Finally, we validate our Hopfield-Fenchel-Young networks on diverse memory recall tasks, including free and sequential recall. Experiments on simulated data, image retrieval, multiple instance learning, and text rationalization demonstrate the effectiveness of our approach.
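As a concrete anchor, the classical modern Hopfield energy is recovered as the Shannon-entropy special case of this framework, and its update rule is exactly attention:

$$E(q) = -\frac{1}{\beta}\log\sum_{i=1}^{N}\exp\big(\beta\, x_i^\top q\big) + \frac{1}{2}\lVert q\rVert^2 + \mathrm{const}, \qquad q^{\mathrm{new}} = X^\top \operatorname{softmax}(\beta X q),$$

where the rows $x_i$ of $X$ are the stored patterns. Swapping in a sparser generalized entropy (e.g., Tsallis with $\alpha = 2$) replaces softmax with sparsemax, which is what enables exact retrieval of single patterns.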
[968] Performance and Generalizability Impacts of Incorporating Location Encoders into Deep Learning for Dynamic PM2.5 Estimation
Morteza Karimzadeh, Zhongying Wang, James L. Crooks
Main category: cs.LG
TL;DR: This paper systematically evaluates location encoders for dynamic PM2.5 estimation, finding that pretrained encoders like GeoCLIP improve both accuracy and geographic transfer compared to raw coordinates.
Details
Motivation: The role of geolocation information in improving accuracy and generalizability in deep learning for geospatial prediction tasks remains underexamined, particularly for dynamic applications.
Method: Compared three strategies for representing location: excluding geolocation, using raw latitude/longitude, and using pretrained location encoders (GeoCLIP, SatCLIP) for daily PM2.5 estimation across the US, evaluated under within-region and out-of-region generalization settings.
Result: Raw coordinates improve performance within regions but reduce generalizability across regions. Pretrained location encoders improve both predictive accuracy and geographic transfer, though performance varies across encoder types and spatial artifacts were observed.
Conclusion: This work provides the first systematic evaluation of location encoders in dynamic environmental estimation and offers guidance for incorporating geolocation into deep learning models for geospatial prediction.
Abstract: Deep learning has shown strong performance in geospatial prediction tasks, but the role of geolocation information in improving accuracy and generalizability remains underexamined. Recent work has introduced location encoders that aim to represent spatial context in a transferable way, yet most evaluations have focused on static mapping tasks. Here, we study the effect of incorporating geolocation into deep learning for a dynamic and spatially heterogeneous application: estimating daily surface-level PM2.5 across the contiguous United States using satellite and ground-based observations. We compare three strategies for representing location: excluding geolocation, using raw latitude and longitude, and using pretrained location encoders. We evaluate each under within-region and out-of-region generalization settings. Results show that raw coordinates can improve performance within regions by supporting spatial interpolation, but can reduce generalizability across regions. In contrast, pretrained location encoders such as GeoCLIP improve both predictive accuracy and geographic transfer. However, we also observe spatial artifacts linked to encoder characteristics, and performance varies across encoder types (e.g., SatCLIP vs. GeoCLIP). This work provides the first systematic evaluation of location encoders in a dynamic environmental estimation context and offers guidance for incorporating geolocation into deep learning models for geospatial prediction.
[969] Generalized EXTRA stochastic gradient Langevin dynamics
Mert Gurbuzbalaban, Mohammad Rafiqul Islam, Xiaoyu Wang, Lingjiong Zhu
Main category: cs.LG
TL;DR: Proposes generalized EXTRA stochastic gradient Langevin dynamics to eliminate bias in decentralized Bayesian learning across networks of agents with communication and privacy constraints.
Details
Motivation: Standard SGLD algorithms cannot be applied when data is decentralized across networks with communication and privacy constraints, and existing decentralized SGLD algorithms induce a persistent bias, due to network effects, that negatively impacts performance.
Method: Develops generalized EXTRA stochastic gradient Langevin dynamics, inspired by the EXTRA algorithm for decentralized optimization, to eliminate the bias in the full-batch setting and improve performance bounds in the mini-batch setting.
Result: The proposed algorithm eliminates bias in full-batch setting and provides significantly improved performance bounds in mini-batch setting compared to standard decentralized SGLD algorithms. Numerical results demonstrate efficiency.
Conclusion: The generalized EXTRA stochastic gradient Langevin dynamics effectively addresses bias issues in decentralized Bayesian learning, offering improved performance for collaborative learning across networks of agents without sharing individual data.
Abstract: Langevin algorithms are popular Markov Chain Monte Carlo methods for Bayesian learning, particularly when the aim is to sample from the posterior distribution of a parametric model, given the input data and the prior distribution over the model parameters. Their stochastic versions such as stochastic gradient Langevin dynamics (SGLD) allow iterative learning based on randomly sampled mini-batches of large datasets and are scalable to large datasets. However, when data is decentralized across a network of agents subject to communication and privacy constraints, standard SGLD algorithms cannot be applied. Instead, we employ decentralized SGLD (DE-SGLD) algorithms, where Bayesian learning is performed collaboratively by a network of agents without sharing individual data. Nonetheless, existing DE-SGLD algorithms induce a bias at every agent that can negatively impact performance; this bias persists even when using full batches and is attributable to network effects. Motivated by the EXTRA algorithm and its generalizations for decentralized optimization, we propose the generalized EXTRA stochastic gradient Langevin dynamics, which eliminates this bias in the full-batch setting. Moreover, we show that, in the mini-batch setting, our algorithm provides performance bounds that significantly improve upon those of standard DE-SGLD algorithms in the literature. Our numerical results also demonstrate the efficiency of the proposed approach.
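For orientation, the deterministic EXTRA iteration that inspires the method uses a mixing matrix $W$ and $\tilde{W} = (I + W)/2$:

$$x^{(k+2)} = (I + W)\,x^{(k+1)} - \tilde{W}\,x^{(k)} - \alpha\left[\nabla f\big(x^{(k+1)}\big) - \nabla f\big(x^{(k)}\big)\right].$$

A Langevin analogue additionally injects Gaussian noise of scale $\sqrt{2\alpha}$ into each iterate; the precise noise placement and the admissible generalized mixing matrices are what the paper's analysis establishes.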
[970] On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling
Moritz Haas, Sebastian Bordt, Ulrike von Luxburg, Leena Chennuru Vankadara
Main category: cs.LG
TL;DR: The paper resolves discrepancies between infinite-width theory and practical neural network behavior by identifying a ‘controlled divergence’ regime under cross-entropy loss where networks exhibit stable training and feature learning at large learning rates, contrary to existing theory predictions.
Details
Motivation: Existing infinite-width theory fails to explain practical network behavior: it predicts instability at large learning rates and vanishing feature learning at stable ones, but real networks show stable training with non-trivial feature learning even at large widths.
Method: A finer-grained analysis of the regime previously considered unstable, identifying two sub-regimes: catastrophic instability and controlled divergence. Experiments across optimizers, architectures, and data modalities under cross-entropy vs. MSE loss.
Result: Found that neural networks operate in controlled divergence regime under CE loss (not MSE), where logits diverge but gradients/activations remain stable. This regime has well-defined infinite-width limit with evolving features in all layers.
Conclusion: Width-scaling considerations are useful for predicting maximal stable learning rate exponents. The analysis clarifies effectiveness and limitations of layerwise learning rate scaling for standard initialization.
Abstract: Scaling limits, such as infinite-width limits, serve as promising theoretical tools to study large-scale models. However, it is widely believed that existing infinite-width theory does not faithfully explain the behavior of practical networks, especially those trained in standard parameterization (SP) meaning He initialization with a global learning rate. For instance, existing theory for SP predicts instability at large learning rates and vanishing feature learning at stable ones. In practice, however, optimal learning rates decay slower than theoretically predicted and networks exhibit both stable training and non-trivial feature learning, even at very large widths. Here, we show that this discrepancy is not fully explained by finite-width phenomena. Instead, we find a resolution through a finer-grained analysis of the regime previously considered unstable and therefore uninteresting. In particular, we show that, under cross-entropy (CE) loss, the unstable regime comprises two distinct sub-regimes: a catastrophically unstable regime and a more benign controlled divergence regime, where logits diverge but gradients and activations remain stable. Moreover, under large learning rates at the edge of the controlled divergence regime, there exists a well-defined infinite width limit where features continue to evolve in all the hidden layers. In experiments across optimizers, architectures, and data modalities, we validate that neural networks operate in this controlled divergence regime under CE loss but not under MSE loss. Our empirical evidence suggests that width-scaling considerations are surprisingly useful for predicting empirically maximal stable learning rate exponents which provide useful guidance on optimal learning rate exponents. Finally, our analysis clarifies the effectiveness and limitations of recently proposed layerwise learning rate scaling for standard initialization.
[971] No Free Lunch From Random Feature Ensembles: Scaling Laws and Near-Optimality Conditions
Benjamin S. Ruben, William L. Tong, Hamza Tahir Chaudhry, Cengiz Pehlevan
Main category: cs.LG
TL;DR: When given a fixed budget of total parameters, ensembles of smaller models perform worse than a single large model in ridge regression. However, overparameterized ensembles can achieve near-optimal performance, and under certain kernel/task conditions, properly scaled ensembles can also achieve near-optimal scaling laws.
Details
Motivation: To understand the trade-off between a single large model and an ensemble of smaller models when the total parameter budget is fixed, particularly in ridge regression settings.
Method: Used deterministic equivalent risk estimates to analyze ensembles of random-feature ridge regression models in both the overparameterized and underparameterized regimes. Derived scaling laws for the test risk as a function of total parameter count under joint scaling of ensemble size and model size.
Result: When distributing fixed parameters among K independently trained models, ridge-optimized test risk increases with K, making single large model optimal. Overparameterized ensembles achieve near-optimal performance as test error depends only on total feature count. Underparameterized ensembles can achieve near-optimal scaling under specific kernel/task eigenstructure conditions.
Conclusion: Single large models are optimal for fixed parameter budgets, but ensembles can achieve near-optimal performance in overparameterized regimes or under specific conditions in underparameterized regimes through proper joint scaling of ensemble size and model size.
Abstract: Given a fixed budget for total model size, one must choose between training a single large model or combining the predictions of multiple smaller models. We investigate this trade-off for ensembles of random-feature ridge regression models in both the overparameterized and underparameterized regimes. Using deterministic equivalent risk estimates, we prove that when a fixed number of parameters is distributed among $K$ independently trained models, the ridge-optimized test risk increases with $K$. Consequently, a single large model achieves optimal performance. We then ask when ensembles can achieve \textit{near}-optimal performance. In the overparameterized regime, we show that, to leading order, the test error depends on ensemble size and model size only through the total feature count, so that overparameterized ensembles consistently achieve near-optimal performance. To understand underparameterized ensembles, we derive scaling laws for the test risk as a function of total parameter count when the ensemble size and parameters per ensemble member are jointly scaled according to a “growth exponent” $\ell$. While the optimal error scaling is always achieved by increasing model size with a fixed ensemble size, our analysis identifies conditions on the kernel and task eigenstructure under which near-optimal scaling laws can be obtained by joint scaling of ensemble size and model size.
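The experimental setting is easy to reproduce in miniature: split a fixed feature budget across $K$ random-feature ridge models and average their predictions. In the sketch below, the ReLU features and the even budget split are illustrative choices.

```python
import numpy as np

def rf_ridge_ensemble(X, y, X_test, K, total_features, lam=1e-2, seed=0):
    """Average K random-feature ridge regressors that share a fixed
    total feature budget."""
    rng = np.random.default_rng(seed)
    m = total_features // K                   # features per ensemble member
    preds = np.zeros(len(X_test))
    for _ in range(K):
        W = rng.normal(size=(X.shape[1], m)) / np.sqrt(X.shape[1])
        Phi, Phi_t = np.maximum(X @ W, 0.0), np.maximum(X_test @ W, 0.0)
        w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)
        preds += (Phi_t @ w) / K              # equal-weight prediction average
    return preds
```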
[972] Towards Minimizing Feature Drift in Model Merging: Layer-wise Task Vector Fusion for Adaptive Knowledge Integration
Wenju Sun, Qingyong Li, Wen Wang, Yang Liu, Yangli-ao Geng, Boyang Li
Main category: cs.LG
TL;DR: LOT Merging is a new model merging technique that minimizes feature drift between task-specific experts and unified models through layer-wise optimization, achieving better performance than existing methods without costly retraining.
Details
Motivation: Existing model merging methods have limitations: parameter-level methods show performance gaps, while task-loss approaches require expensive secondary training. The key insight is that performance degradation correlates with feature drift in representations.
Method: Proposes Layer-wise Optimal Task Vector Merging (LOT Merging), which minimizes feature drift between task-specific experts and the unified model layer by layer. Formulated as a convex quadratic optimization problem with closed-form solutions for linear and normalization layers using basic matrix operations.
Result: Significantly outperforms baseline methods with up to 4.4% improvement (ViT-B/32) over state-of-the-art approaches across vision and vision-language benchmarks. Achieves efficient model consolidation without costly training procedures.
Conclusion: LOT Merging provides an effective solution for multi-task model merging by explicitly addressing feature drift, achieving superior performance through analytical optimization while maintaining computational efficiency.
Abstract: Multi-task model merging aims to consolidate knowledge from multiple fine-tuned task-specific experts into a unified model while minimizing performance degradation. Existing methods primarily approach this by minimizing differences between task-specific experts and the unified model, either from a parameter-level or a task-loss perspective. However, parameter-level methods exhibit a significant performance gap compared to the upper bound, while task-loss approaches entail costly secondary training procedures. In contrast, we observe that performance degradation closely correlates with feature drift, i.e., differences in feature representations of the same sample caused by model merging. Motivated by this observation, we propose Layer-wise Optimal Task Vector Merging (LOT Merging), a technique that explicitly minimizes feature drift between task-specific experts and the unified model in a layer-by-layer manner. LOT Merging can be formulated as a convex quadratic optimization problem, enabling us to analytically derive closed-form solutions for the parameters of linear and normalization layers. Consequently, LOT Merging achieves efficient model consolidation through basic matrix operations. Extensive experiments across vision and vision-language benchmarks demonstrate that LOT Merging significantly outperforms baseline methods, achieving improvements of up to 4.4% (ViT-B/32) over state-of-the-art approaches. The source code is available at https://github.com/SunWenJu123/model-merging.
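For a linear layer, the feature-drift objective admits a closed form, which the sketch below implements; the ridge term for numerical stability and the exact per-layer statistics used are assumptions.

```python
import numpy as np

def lot_merge_linear(inputs, weights, ridge=1e-6):
    """Closed-form merged weight for one linear layer:
    W* = argmin_W sum_t ||X_t W - X_t W_t||_F^2
       = (sum_t X_t^T X_t)^{-1} sum_t X_t^T X_t W_t.
    inputs: list of (n_t, d) activations X_t; weights: list of (d, k) W_t."""
    G = sum(X.T @ X for X in inputs)                       # (d, d) Gram sum
    H = sum(X.T @ X @ W for X, W in zip(inputs, weights))  # (d, k) target sum
    return np.linalg.solve(G + ridge * np.eye(G.shape[0]), H)
```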
[973] Learning Satellite Pattern-of-Life Identification: A Diffusion-based Approach
Yongchao Ye, Xinting Zhu, Xuejin Shen, Xiaoyu Chen, S. Joe Qin, Lishuai Li
Main category: cs.LG
TL;DR: A novel generative approach using diffusion models for automatic satellite pattern-of-life identification from orbital data, eliminating dependency on expert knowledge and rule-based systems.
Details
Motivation: Current satellite monitoring relies on expert knowledge and rule-based systems that scale poorly, while pattern-of-life identification remains underdeveloped due to system complexity and data inconsistencies.
Method: Uses a diffusion model framework with a multivariate time-series encoder to capture hidden representations of satellite positional data, combined with a conditional denoising process for end-to-end PoL classification.
Result: Demonstrates superior identification quality and robustness across diverse real-world satellite operational scenarios and varying data quality characteristics.
Conclusion: The approach shows transformative potential for operational behavior pattern identification, enhanced tracking, and space situational awareness.
Abstract: As Earth’s orbital satellite population grows exponentially, effective space situational awareness becomes critical for collision prevention and sustainable operations. Current approaches to monitor satellite behaviors rely on expert knowledge and rule-based systems that scale poorly. Among essential monitoring tasks, satellite pattern-of-life (PoL) identification, analyzing behaviors like station-keeping maneuvers and drift operations, remains underdeveloped due to aerospace system complexity, operational variability, and inconsistent ephemerides sources. We propose a novel generative approach for satellite PoL identification that significantly eliminates the dependence on expert knowledge. The proposed approach leverages orbital elements and positional data to enable automatic pattern discovery directly from observations. Our implementation uses a diffusion model framework for end-to-end identification without manual refinement or domain expertise. The architecture combines a multivariate time-series encoder to capture hidden representations of satellite positional data with a conditional denoising process to generate accurate PoL classifications. Through experiments across diverse real-world satellite operational scenarios, our approach demonstrates superior identification quality and robustness across varying data quality characteristics. A case study using actual satellite data confirms the approach’s transformative potential for operational behavior pattern identification, enhanced tracking, and space situational awareness.
[974] Psi-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models
Taehoon Yoon, Yunhong Min, Kyeongmin Yeo, Minhyuk Sung
Main category: cs.LG
TL;DR: Psi-Sampler is an SMC framework that uses pCNL-based initial particle sampling for better reward alignment in score-based generative models, improving sampling efficiency over Gaussian prior initialization.
Details
Motivation: Existing methods initialize particles from the Gaussian prior, which inadequately captures reward-relevant regions, reducing sampling efficiency; initializing from the reward-aware posterior significantly improves alignment performance.
Method: Uses the preconditioned Crank-Nicolson Langevin (pCNL) algorithm for posterior sampling in high-dimensional latent spaces, combining dimension-robust proposals with gradient-informed dynamics within an SMC framework.
Result: Consistently improves performance across various reward alignment tasks including layout-to-image generation, quantity-aware generation, and aesthetic-preference generation.
Conclusion: Psi-Sampler enables efficient and scalable posterior sampling for inference-time reward alignment, demonstrating superior performance over existing methods.
Abstract: We introduce $\Psi$-Sampler, an SMC-based framework incorporating pCNL-based initial particle sampling for effective inference-time reward alignment with a score-based generative model. Inference-time reward alignment with score-based generative models has recently gained significant traction, following a broader paradigm shift from pre-training to post-training optimization. At the core of this trend is the application of Sequential Monte Carlo (SMC) to the denoising process. However, existing methods typically initialize particles from the Gaussian prior, which inadequately captures reward-relevant regions and results in reduced sampling efficiency. We demonstrate that initializing from the reward-aware posterior significantly improves alignment performance. To enable posterior sampling in high-dimensional latent spaces, we introduce the preconditioned Crank-Nicolson Langevin (pCNL) algorithm, which combines dimension-robust proposals with gradient-informed dynamics. This approach enables efficient and scalable posterior sampling and consistently improves performance across various reward alignment tasks, including layout-to-image generation, quantity-aware generation, and aesthetic-preference generation, as demonstrated in our experiments. Project Webpage: https://psi-sampler.github.io/
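The dimension-robust core of pCNL is the preconditioned Crank-Nicolson proposal, under which the prior terms cancel from the acceptance ratio. The sketch below implements plain pCN for a standard-Gaussian prior and omits the gradient-informed Langevin drift that pCNL adds (an intentional simplification, not the paper's full proposal).

```python
import numpy as np

def pcn_step(x, log_lik, beta=0.2, rng=None):
    """One preconditioned Crank-Nicolson step: the proposal leaves the
    N(0, I) prior invariant, so only the likelihood enters the
    accept/reject ratio, and acceptance does not collapse with dimension."""
    if rng is None:
        rng = np.random.default_rng()
    prop = np.sqrt(1.0 - beta**2) * x + beta * rng.standard_normal(x.shape)
    if np.log(rng.uniform()) < log_lik(prop) - log_lik(x):
        return prop
    return x
```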
[975] E2Former: An Efficient and Equivariant Transformer with Linear-Scaling Tensor Products
Yunyang Li, Lin Huang, Zhihao Ding, Chu Wang, Xinran Wei, Han Yang, Zun Wang, Chang Liu, Yu Shi, Peiran Jin, Tao Qin, Mark Gerstein, Jia Zhang
Main category: cs.LG
TL;DR: E2Former is an equivariant transformer architecture that uses Wigner 6j convolution to reduce computational complexity from O(|E|) to O(|V|), achieving 7x-30x speedup over conventional SO(3) convolutions while maintaining expressive power and rotational equivariance.
Details
Motivation: EGNNs face substantial computational challenges due to the high cost of constructing edge features via spherical tensor products, making them impractical for large-scale systems.
Method: Introduces E2Former with a Wigner 6j convolution that shifts the computational burden from edges to nodes, reducing complexity while preserving rotational equivariance.
Result: Achieves 7x-30x speedup compared to conventional SO(3) convolutions and mitigates computational challenges without compromising geometric information capture.
Conclusion: This approach suggests a promising direction for scalable and efficient molecular modeling.
Abstract: Equivariant Graph Neural Networks (EGNNs) have demonstrated significant success in modeling microscale systems, including those in chemistry, biology and materials science. However, EGNNs face substantial computational challenges due to the high cost of constructing edge features via spherical tensor products, making them impractical for large-scale systems. To address this limitation, we introduce E2Former, an equivariant and efficient transformer architecture that incorporates the Wigner $6j$ convolution (Wigner $6j$ Conv). By shifting the computational burden from edges to nodes, the Wigner $6j$ Conv reduces the complexity from $O(|\mathcal{E}|)$ to $ O(| \mathcal{V}|)$ while preserving both the model’s expressive power and rotational equivariance. We show that this approach achieves a 7x-30x speedup compared to conventional $\mathrm{SO}(3)$ convolutions. Furthermore, our empirical results demonstrate that the derived E2Former mitigates the computational challenges of existing approaches without compromising the ability to capture detailed geometric information. This development could suggest a promising direction for scalable and efficient molecular modeling.
[976] Covering Multiple Objectives with a Small Set of Solutions Using Bayesian Optimization
Natalie Maus, Kyurae Kim, Yimeng Zeng, Haydn Thomas Jones, Fangping Wan, Marcelo Der Torossian Torres, Cesar de la Fuente-Nunez, Jacob R. Gardner
Main category: cs.LG
TL;DR: MOCOBO is a Bayesian optimization algorithm for coverage optimization, which finds a small set of K < T solutions that collectively cover all T objectives, with applications in drug design where K antibiotics can treat T pathogens.
Details
Motivation: Traditional multi-objective optimization seeks a Pareto-optimal set balancing all objectives, but coverage optimization aims to find a small set of solutions where each objective is covered by at least one good solution, motivated by practical applications like drug design where K antibiotics need to treat T pathogens.
Method: Developed Multi-Objective Coverage Bayesian Optimization (MOCOBO) with a new acquisition function inspired by expected improvement in vanilla Bayesian optimization, designed specifically for coverage optimization problems.
Result: MOCOBO achieves coverage of K < T solutions that matches or nearly matches the coverage of T solutions obtained by optimizing each objective individually. In vitro experiments show peptides found by MOCOBO exhibit high potency against drug-resistant pathogens.
Conclusion: MOCOBO effectively solves coverage optimization problems and demonstrates strong potential for drug discovery applications, with publicly available code for broader use.
Abstract: In multi-objective black-box optimization, the goal is typically to find solutions that optimize a set of $T$ black-box objective functions, $f_1, \ldots f_T$, simultaneously. Traditional approaches often seek a single Pareto-optimal set that balances trade-offs among all objectives. In contrast, we consider a problem setting that departs from this paradigm: finding a small set of $K < T$ solutions, that collectively “cover” the $T$ objectives. A set of solutions is defined as “covering” if, for each objective $f_1, \ldots f_T$, there is at least one good solution. A motivating example for this problem setting occurs in drug design. For example, we may have $T$ pathogens and aim to identify a set of $K < T$ antibiotics such that at least one antibiotic can be used to treat each pathogen. This problem, known as coverage optimization, has yet to be tackled with the Bayesian optimization (BO) framework. To fill this void, we develop Multi-Objective Coverage Bayesian Optimization (MOCOBO), a BO algorithm for solving coverage optimization. Our approach is based on a new acquisition function reminiscent of expected improvement in the vanilla BO setup. We demonstrate the performance of our method on high-dimensional black-box optimization tasks, including applications in peptide and molecular design. Results show that the coverage of the $K < T$ solutions found by MOCOBO matches or nearly matches the coverage of $T$ solutions obtained by optimizing each objective individually. Furthermore, in in vitro experiments, the peptides found by MOCOBO exhibited high potency against drug-resistant pathogens, further demonstrating the potential of MOCOBO for drug discovery. All of our code is publicly available at the following link: https://github.com/nataliemaus/mocobo.
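The covering criterion itself is simple to state in code: each objective is credited with the best value attained by any of the K solutions. The additive aggregation below is an illustrative choice for scoring a candidate set (maximization assumed).

```python
import numpy as np

def coverage(F):
    """Coverage of K candidate solutions over T objectives.
    F[k, t] = value of objective f_t at solution k."""
    return F.max(axis=0).sum()   # best-per-objective, summed over objectives

# Two solutions covering three objectives:
F = np.array([[0.90, 0.10, 0.80],
              [0.20, 0.95, 0.30]])
print(coverage(F))  # 0.90 + 0.95 + 0.80 = 2.65
```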
[977] Regularized Langevin Dynamics for Combinatorial Optimization
Shengyu Feng, Yiming Yang
Main category: cs.LG
TL;DR: Proposes Regularized Langevin Dynamics (RLD) for combinatorial optimization, which improves exploration by enforcing distance between sampled and current solutions, achieving comparable or better performance than SOTA methods with up to 80% runtime reduction.
Details
Motivation: Direct application of discrete Langevin dynamics to combinatorial optimization leads to limited exploration and gets stuck in local minima, requiring a more effective sampling framework.
Method: Develops Regularized Langevin Dynamics (RLD), which enforces an expected distance between the sampled and current solutions. Builds two solvers: one based on simulated annealing (SA) and the other on a neural network (NN).
Result: Empirical results on three classic CO problems show both RLD-based methods achieve comparable or better performance than previous SOTA SA- and NN-based solvers. The SA algorithm reduces runtime by up to 80% while maintaining equal or superior performance.
Conclusion: RLD offers a promising framework for enhancing both traditional heuristics and neural network models to solve combinatorial optimization problems effectively.
Abstract: This work proposes a simple yet effective sampling framework for combinatorial optimization (CO). Our method builds on discrete Langevin dynamics (LD), an efficient gradient-guided generative paradigm. However, we observe that directly applying LD often leads to limited exploration. To overcome this limitation, we propose the Regularized Langevin Dynamics (RLD), which enforces an expected distance between the sampled and current solutions, effectively avoiding local minima. We develop two CO solvers on top of RLD, one based on simulated annealing (SA) and the other on a neural network (NN). Empirical results on three classic CO problems demonstrate that both of our methods can achieve comparable or better performance than the previous state-of-the-art (SOTA) SA- and NN-based solvers. In particular, our SA algorithm reduces the runtime of the previous SOTA SA method by up to 80%, while achieving equal or superior performance. In summary, RLD offers a promising framework for enhancing both traditional heuristics and NN models to solve CO problems. Our code is available at https://github.com/Shengyu-Feng/RLD4CO.
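A loose sketch of the idea, under heavy assumptions (binary variables, a first-order flip-gain approximation, and a sign-based distance regularizer): flips are biased toward or away from the current solution depending on whether the Hamming distance exceeds a target, which is one crude way of enforcing an expected distance.

```python
import numpy as np

def rld_step(x, grad_E, x_cur, target_dist, lam=1.0, tau=1.0, rng=None):
    """One regularized flip-proposal step on binary x (illustrative)."""
    if rng is None:
        rng = np.random.default_rng()
    flip_gain = grad_E * (2 * x - 1)            # ~ energy drop from flipping bit i
    d = (x != x_cur).sum()                      # current Hamming distance
    toward = np.where(x != x_cur, 1.0, -1.0)    # does flipping bit i approach x_cur?
    reg = lam * np.sign(d - target_dist) * toward
    p_flip = 1.0 / (1.0 + np.exp(-(flip_gain + reg) / tau))
    flip = rng.uniform(size=x.shape) < p_flip
    return np.where(flip, 1 - x, x)
```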
[978] SubTrack++ : Gradient Subspace Tracking for Scalable LLM Training
Sahar Rajabi, Nayeema Nonta, Sirisha Rambhatla
Main category: cs.LG
TL;DR: SubTrack++ reduces LLM training time by 65% for pre-training and 36% for fine-tuning while maintaining memory efficiency, using Grassmannian gradient subspace tracking and projection-aware optimizers.
Details
Motivation: To democratize LLM training by simultaneously improving memory efficiency, training time, and model performance without trade-offs.
Method: Uses Grassmannian gradient subspace tracking with projection-aware optimizers and recovery scaling to restore information from low-rank projections.
Result: Achieves SOTA convergence, reduces pre-training wall-time by up to 65% and fine-tuning time by 36% compared to existing methods with same memory footprint.
Conclusion: SubTrack++ enables more efficient LLM training across multiple dimensions, advancing democratization of large language models.
Abstract: Training large language models (LLMs) is highly resource-intensive due to their massive number of parameters and the overhead of optimizer states. While recent work has aimed to reduce memory consumption, such efforts often entail trade-offs among memory efficiency, training time, and model performance. Yet, true democratization of LLMs requires simultaneous progress across all three dimensions. To this end, we propose SubTrack++ that leverages Grassmannian gradient subspace tracking combined with projection-aware optimizers, enabling Adam’s internal statistics to adapt to subspace changes. Additionally, employing recovery scaling, a technique that restores information lost through low-rank projections, further enhances model performance. Our method demonstrates SOTA convergence by exploiting Grassmannian geometry, reducing pre-training wall-time by up to 65% and fine-tuning time by 36% compared to existing SOTA methods, while maintaining the same memory footprint.
[979] Mixture-of-Experts Meets In-Context Reinforcement Learning
Wenhao Wu, Fuhong Liu, Haoru Li, Zican Hu, Daoyi Dong, Chunlin Chen, Zhi Wang
Main category: cs.LG
TL;DR: T2MIR introduces mixture-of-experts (MoE) architecture to transformer-based decision models for in-context RL, using token-wise and task-wise MoE layers to handle multi-modal data and diverse tasks, achieving superior performance.
Details
Motivation: To address two challenges in in-context RL: the multi-modality of state-action-reward data and the diverse, heterogeneous decision tasks that limit effective in-context learning.
Method: Proposes the T2MIR framework with two parallel MoE layers: a token-wise MoE that captures distinct semantics across input modalities, and a task-wise MoE that routes diverse tasks to specialized experts. Uses contrastive learning to enhance task-wise routing by maximizing mutual information between the task and its router representation.
Result: Comprehensive experiments show T2MIR significantly facilitates in-context learning capacity and outperforms various baseline methods.
Conclusion: T2MIR brings the potential of MoE to in-context RL, offering a simple and scalable architectural enhancement that advances ICRL closer to achievements in language and vision communities.
Abstract: In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose T2MIR (Token- and Task-wise MoE for In-context RL), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR substitutes the feedforward layer with two parallel layers: a token-wise MoE that captures distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts for managing a broad task distribution with alleviated gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement to advance ICRL one step closer toward achievements in language and vision communities. Our code is available at https://github.com/NJU-RL/T2MIR.
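For reference, a minimal token-wise MoE layer of the kind substituted for the feedforward block, with top-1 routing for brevity; the paper's router, expert configuration, and the parallel task-wise branch are more elaborate, so treat this as a sketch.

```python
import torch
import torch.nn as nn

class TokenWiseMoE(nn.Module):
    """Feedforward block replaced by experts with a learned per-token
    router (top-1 routing here for brevity)."""
    def __init__(self, d_model, d_ff, n_experts):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts))

    def forward(self, x):                       # x: (n_tokens, d_model)
        probs = self.router(x).softmax(dim=-1)  # routing distribution
        top = probs.argmax(dim=-1)              # chosen expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top == e
            if sel.any():                       # gate each token's expert output
                out[sel] = probs[sel, e].unsqueeze(1) * expert(x[sel])
        return out
```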
[980] Diffusion Generative Modeling on Lie Group Representations
Marco Bertolini, Tuan Le, Djork-Arné Clevert
Main category: cs.LG
TL;DR: A novel score-based diffusion framework for Lie groups that generalizes standard score matching, enabling efficient modeling of complex distributions on non-Abelian Lie groups with applications in molecular conformer generation and docking.
Details
Motivation: To extend diffusion models beyond Euclidean spaces to Lie groups, enabling more efficient modeling of complex geometric transformations and distributions that arise in applications like molecular conformer generation and docking.
Method: Generalized Score Matching framework with Lie group Langevin dynamics that decompose as direct sums of Lie algebra representations, introducing new paired stochastic differential equations for generative processes on Lie groups.
Result: The method shows improved performance in SO(3)-guided molecular conformer generation and SE(3) transformations for molecular docking compared to Riemannian diffusion, with reduced effective dimensionality and enhanced learning efficiency.
Conclusion: The proposed Lie group diffusion framework provides a principled generalization of score-based diffusion models, enabling more efficient modeling of complex geometric transformations and distributions on non-Abelian Lie groups.
Abstract: We introduce a novel class of score-based diffusion processes that operate directly in the representation space of Lie groups. Leveraging the framework of Generalized Score Matching, we derive a class of Langevin dynamics that decomposes as a direct sum of Lie algebra representations, enabling the modeling of any target distribution on any (non-Abelian) Lie group. Standard score-matching emerges as a special case of our framework when the Lie group is the translation group. We prove that our generalized generative processes arise as solutions to a new class of paired stochastic differential equations (SDEs), introduced here for the first time. We validate our approach through experiments on diverse data types, demonstrating its effectiveness in real-world applications such as SO(3)-guided molecular conformer generation and modeling ligand-specific global SE(3) transformations for molecular docking, showing improvement in comparison to Riemannian diffusion on the group itself. We show that an appropriate choice of Lie group enhances learning efficiency by reducing the effective dimensionality of the trajectory space and enables the modeling of transitions between complex data distributions.
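For orientation, the translation-group special case the abstract mentions recovers the familiar Euclidean Langevin dynamics (standard notation, not the paper's):

```latex
\mathrm{d}X_t = \nabla_x \log p(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t
```

The paper's contribution is to replace this Euclidean drift with one that decomposes as a direct sum of Lie algebra representations, so the same recipe runs on non-Abelian groups such as SO(3) and SE(3).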
[981] Zero-shot protein stability prediction by inverse folding models: a free energy interpretation
Jes Frellsen, Maher M. Kassem, Tone Bengtsen, Lars Olsen, Kresten Lindorff-Larsen, Jesper Ferkinghoff-Borg, Wouter Boomsma
Main category: cs.LG
TL;DR: The paper clarifies the free-energy foundations of inverse folding models for protein stability prediction, revealing that standard likelihood ratio methods are simplistic approximations and proposing improved approaches that achieve better zero-shot performance.
Details
Motivation: To better understand the link between inverse folding models’ amino acid preferences and thermodynamic stability principles, which could lead to stronger zero-shot stability prediction capabilities.
Method: The authors derive the free-energy foundations of inverse folding models, identify standard likelihood ratio practice as a simplistic approximation, and propose several improved approaches for estimating relative stability.
Result: Empirical assessment shows that considerable gains in zero-shot performance can be achieved with fairly simple means compared to standard approaches.
Conclusion: Understanding the free-energy foundations of inverse folding models enables the development of improved stability prediction methods that significantly outperform standard likelihood ratio approaches.
Abstract: Inverse folding models have proven to be highly effective zero-shot predictors of protein stability. Despite this success, the link between the amino acid preferences of an inverse folding model and the free-energy considerations underlying thermodynamic stability remains incompletely understood. A better understanding would be of interest not only from a theoretical perspective, but also potentially provide the basis for stronger zero-shot stability prediction. In this paper, we take steps to clarify the free-energy foundations of inverse folding models. Our derivation reveals the standard practice of likelihood ratios as a simplistic approximation and suggests several paths towards better estimates of the relative stability. We empirically assess these approaches and demonstrate that considerable gains in zero-shot performance can be achieved with fairly simple means.
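The "standard practice of likelihood ratios" referenced above scores a candidate mutation by the log-ratio of inverse-folding likelihoods, roughly (generic notation, not the paper's):

```latex
\Delta\Delta G_{\mathrm{pred}} \;\propto\;
\log \frac{p_\theta\!\left(s^{\mathrm{mut}} \mid \mathrm{structure}\right)}
          {p_\theta\!\left(s^{\mathrm{wt}} \mid \mathrm{structure}\right)}
```

The paper's derivation treats this ratio as only a first approximation to the relative free energy and builds its improved estimators on top of it.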
[982] Bilevel ZOFO: Bridging Parameter-Efficient and Zeroth-Order Techniques for Efficient LLM Fine-Tuning and Meta-Training
Reza Shirkavand, Peiran Yu, Qi He, Heng Huang
Main category: cs.LG
TL;DR: Bilevel-ZOFO combines First-Order Parameter-Efficient Fine-Tuning (FO-PEFT) with Zeroth-Order (ZO) optimization in a bilevel framework, achieving faster training while maintaining memory efficiency and improving generalization.
Details
Motivation: Address limitations of existing fine-tuning methods: FO-PEFT underperforms full fine-tuning for high accuracy tasks, while ZO methods suffer from slow convergence and prompt sensitivity despite being memory-efficient.
Method: Bilevel optimization with FO-PEFT at inner level for fast local adaptation and ZO updates at outer level for full backbone optimization. Inner loop reduces ZO variance and stabilizes search, while outer ZO provides better generalization for PEFT.
Result: Significantly outperforms existing ZO and FO-PEFT methods with 2-4 times faster training while maintaining similar memory efficiency. Combines full-model capacity with few-shot efficiency for effective meta-learning.
Conclusion: Bilevel-ZOFO bridges FO-PEFT and ZO methods, achieving superior performance through synergistic bilevel optimization that leverages the strengths of both approaches while mitigating their weaknesses.
Abstract: Fine-tuning pre-trained Large Language Models (LLMs) for downstream tasks using First-Order (FO) optimizers presents significant computational challenges. Parameter-Efficient Fine-Tuning (PEFT) methods address these by freezing most model parameters and training only a small subset. However, PEFT often underperforms compared to full fine-tuning when high task-specific accuracy is required. Zeroth-Order (ZO) methods fine-tune the entire pre-trained model without back-propagation, estimating gradients through forward passes only. While memory-efficient, ZO methods suffer from slow convergence and high sensitivity to prompt selection. We bridge these two worlds with Bilevel-ZOFO, a bilevel optimization method that couples fast, local FO-PEFT adaptation at the inner level with stable, memory-efficient ZO updates of the full backbone at the outer level. The FO-PEFT inner loop performs fast, low-memory local adaptation that reduces the variance of ZO estimates and stabilizes the search, guiding the outer ZO updates of the full backbone and reducing prompt sensitivity. Meanwhile, the outer ZO provides better generalization ability for PEFT. We provide theoretical convergence guarantees and empirically demonstrate that Bilevel-ZOFO significantly outperforms existing ZO and FO-PEFT methods, achieving 2-4 times faster training while maintaining similar memory efficiency. Additionally, we show that, by updating the backbone with ZO and adapting only a tiny FO-PEFT block per task, Bilevel-ZOFO combines full-model capacity with few-shot efficiency, making it a very efficient meta-learning algorithm that quickly adapts to new tasks.
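A toy NumPy sketch of the bilevel structure, assuming a two-point (SPSA/MeZO-style) zeroth-order estimator for the outer level; the actual algorithm, loss, and hyperparameters are the paper's, and the linear "model" below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(backbone, adapter, X, y):
    """Toy regression loss; stands in for an LLM fine-tuning objective."""
    pred = X @ (backbone + adapter)
    return np.mean((pred - y) ** 2)

X = rng.normal(size=(128, 16))
y = X @ rng.normal(size=16)
backbone = rng.normal(size=16)   # "full model": touched only via ZO estimates
adapter = np.zeros(16)           # "PEFT block": trained with true gradients

mu, zo_lr, fo_lr = 1e-3, 1e-2, 1e-1
for outer in range(200):
    # Inner level: a few cheap first-order steps on the small adapter.
    for _ in range(3):
        grad_a = 2 * X.T @ (X @ (backbone + adapter) - y) / len(y)
        adapter -= fo_lr * grad_a
    # Outer level: two-point zeroth-order estimate for the full backbone,
    # needing only forward passes (no back-propagation).
    u = rng.normal(size=backbone.shape)
    g_hat = (loss(backbone + mu * u, adapter, X, y)
             - loss(backbone - mu * u, adapter, X, y)) / (2 * mu) * u
    backbone -= zo_lr * g_hat
```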
[983] Efficient Randomized Experiments Using Foundation Models
Piersilvio De Bartolomeis, Javier Abad, Guanbo Wang, Konstantin Donhauser, Raymond M. Duch, Fanny Yang, Issa J. Dahabreh
Main category: cs.LG
TL;DR: A novel estimator that integrates predictions from multiple foundation models with experimental data while preserving valid statistical inference, achieving substantial precision gains even when model predictions are biased.
Details
Motivation: Randomized experiments are costly and yield estimates with substantial uncertainty, while in silico experiments using foundation models are cost-effective but risk invalid statistical inferences if models fail to accurately predict responses.
Method: Proposed estimator integrates predictions from multiple foundation models with experimental data while preserving valid statistical inference. The estimator is consistent and asymptotically normal.
Result: Empirical results show substantial precision gains equivalent to reducing sample size by up to 20% compared to standard estimator using only experimental data.
Conclusion: The proposed approach successfully combines the cost-effectiveness of foundation models with the validity of experimental data, achieving better precision while maintaining statistical validity even with biased model predictions.
Abstract: Randomized experiments are the preferred approach for evaluating the effects of interventions, but they are costly and often yield estimates with substantial uncertainty. On the other hand, in silico experiments leveraging foundation models offer a cost-effective alternative that can potentially attain higher statistical precision. However, the benefits of in silico experiments come with a significant risk: statistical inferences are not valid if the models fail to accurately predict experimental responses to interventions. In this paper, we propose a novel approach that integrates the predictions from multiple foundation models with experimental data while preserving valid statistical inference. Our estimator is consistent and asymptotically normal, with asymptotic variance no larger than the standard estimator based on experimental data alone. Importantly, these statistical properties hold even when model predictions are arbitrarily biased. Empirical results across several randomized experiments show that our estimator offers substantial precision gains, equivalent to a reduction of up to 20% in the sample size needed to match the same precision as the standard estimator based on experimental data alone.
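The construction is in the spirit of prediction-powered inference; the toy sketch below (single model, mean estimation, synthetic numbers, all names invented) shows why biased predictions cannot invalidate the estimate: the bias appears in both terms and cancels.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.5                              # true mean response we want to estimate

n, N = 200, 20_000                       # small experiment, large unlabeled pool
x_exp = rng.normal(size=n)               # covariates of experimental units
x_pool = rng.normal(size=N)

f = lambda x: theta + 0.4 + 0.3 * x      # foundation-model prediction, biased by +0.4
y_exp = theta + 0.3 * x_exp + rng.normal(size=n)   # real experimental outcomes

classical = y_exp.mean()                 # experiment-only estimator
rectified = f(x_pool).mean() + (y_exp - f(x_exp)).mean()

# The correction term (y - f) is measured on the experiment, so the estimator
# stays unbiased even though f itself is arbitrarily biased; its variance is
# smaller whenever f explains part of the outcome variability.
print(f"classical={classical:.3f}  rectified={rectified:.3f}  truth={theta}")
```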
[984] Provable Sample-Efficient Transfer Learning Conditional Diffusion Models via Representation Learning
Ziheng Cheng, Tianyu Xie, Shiyue Zhang, Cheng Zhang
Main category: cs.LG
TL;DR: This paper provides the first theoretical analysis of transfer learning for conditional diffusion models, showing that pre-trained representations from source tasks can significantly reduce sample complexity for target tasks.
Details
Motivation: Conditional diffusion models require large datasets for training, which is often impractical. Transfer learning has shown empirical success in small data regimes, but lacks theoretical understanding.
Method: Theoretical analysis through representation learning lens, assuming low-dimensional shared representations across tasks. Numerical experiments validate theoretical findings.
Result: With well-learned representations from source tasks, target tasks require substantially fewer samples. Practical applications demonstrate the effectiveness of transfer learning.
Conclusion: Transfer learning provides significant sample efficiency benefits for conditional diffusion models, with theoretical guarantees supporting empirical success.
Abstract: While conditional diffusion models have achieved remarkable success in various applications, they require abundant data to train from scratch, which is often infeasible in practice. To address this issue, transfer learning has emerged as an essential paradigm in small data regimes. Despite its empirical success, the theoretical underpinnings of transfer learning conditional diffusion models remain unexplored. In this paper, we take the first step towards understanding the sample efficiency of transfer learning conditional diffusion models through the lens of representation learning. Inspired by practical training procedures, we assume that there exists a low-dimensional representation of conditions shared across all tasks. Our analysis shows that with a well-learned representation from source tasks, the sample complexity of target tasks can be reduced substantially. In addition, we investigate the practical implications of our theoretical results in several real-world applications of conditional diffusion models. Numerical experiments are also conducted to verify our results.
[985] Distributional Training Data Attribution: What do Influence Functions Sample?
Bruno Mlodozeniec, Isaac Reid, Sam Power, David Krueger, Murat Erdogdu, Richard E. Turner, Roger Grosse
Main category: cs.LG
TL;DR: The paper introduces distributional training data attribution (d-TDA) to account for randomness in deep learning training, revealing that influence functions are inherently distributional and providing practical applications in data pruning and identifying influential examples.
Details
Motivation: Traditional training data attribution algorithms fail to account for randomness in deep learning training, such as stochastic initialization and batching, which can yield different models from the same dataset.
Method: The authors introduce distributional training data attribution (d-TDA), which predicts how the distribution of model outputs depends on the dataset, and show that influence functions emerge as a limit case without requiring convexity assumptions.
Result: The framework reveals that influence functions are ‘secretly distributional’ and provides a new perspective on their effectiveness in deep learning. Practical utility is demonstrated in improving data pruning for vision transformers and identifying influential examples with diffusion models.
Conclusion: d-TDA addresses the limitation of traditional data attribution methods by incorporating randomness, offering both theoretical insights into influence functions and practical applications in model training and data analysis.
Abstract: Randomness is an unavoidable part of training deep learning models, yet something that traditional training data attribution algorithms fail to rigorously account for. They ignore the fact that, due to stochasticity in the initialisation and batching, training on the same dataset can yield different models. In this paper, we address this shortcoming through introducing distributional training data attribution (d-TDA), the goal of which is to predict how the distribution of model outputs (over training runs) depends upon the dataset. Intriguingly, we find that influence functions (IFs), a popular data attribution tool, are ‘secretly distributional’: they emerge from our framework as the limit to unrolled differentiation, without requiring restrictive convexity assumptions. This provides a new perspective on the effectiveness of IFs in deep learning. We demonstrate the practical utility of d-TDA in experiments, including improving data pruning for vision transformers and identifying influential examples with diffusion models.
[986] Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining
Haochen Zhang, Junze Yin, Guanchu Wang, Zirui Liu, Lin F. Yang, Tianyi Zhang, Anshumali Shrivastava, Vladimir Braverman
Main category: cs.LG
TL;DR: Proposes importance sampling for low-rank optimization in LLM pretraining to address limitations of dominant subspace approaches, with provable convergence guarantees and superior empirical performance.
Details
Motivation: Existing low-rank optimization methods for LLMs use dominant subspace projection but suffer from subspace stagnation during pretraining, constraining weight updates and lacking convergence guarantees.
Method: Importance sampling for low-rank optimization that selects subspaces differently from dominant subspace approaches, with provable convergence guarantees.
Result: Significantly outperforms previous low-rank optimization methods in LLM pretraining tasks.
Conclusion: Importance sampling approach overcomes limitations of dominant subspace methods and provides both theoretical convergence guarantees and practical performance improvements in LLM pretraining.
Abstract: Low-rank optimization has emerged as a promising approach to enabling memory-efficient training of large language models (LLMs). Existing low-rank optimization methods typically project gradients onto a low-rank subspace, reducing the memory cost of storing optimizer states. A key challenge in these methods is selecting suitable subspaces to ensure an effective optimization trajectory. Most existing approaches select the dominant subspace to preserve gradient information, as this intuitively provides the best approximation. However, we find that in practice, the dominant subspace stops changing during pretraining, thereby constraining weight updates to similar subspaces. In this paper, we propose importance sampling for low-rank optimization in LLM pretraining with a provable convergence guarantee, which the dominant subspace approach does not have. Empirically, we demonstrate that our method significantly outperforms previous methods in LLM pretraining tasks.
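A hedged sketch of the contrast with dominant-subspace selection: sample singular directions with probability proportional to their energy instead of always keeping the top-r. The paper's actual sampling distribution and any debiasing of the projection may differ from this illustration.

```python
import numpy as np

def sample_subspace(grad, r, rng):
    """Sample r singular directions weighted by their energy (illustrative).

    Deterministic top-r selection can leave the training subspace frozen;
    sampling keeps every direction reachable. A practical scheme would also
    reweight the projection to control bias, which this sketch omits.
    """
    U, S, _ = np.linalg.svd(grad, full_matrices=False)
    p = S**2 / np.sum(S**2)
    idx = rng.choice(len(S), size=r, replace=False, p=p)
    return U[:, idx]

rng = np.random.default_rng(0)
grad = rng.normal(size=(64, 32)) @ np.diag(np.linspace(2.0, 0.1, 32))
P = sample_subspace(grad, r=4, rng=rng)
low_rank_grad = P.T @ grad    # what the memory-efficient optimizer state sees
```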
[987] Identifiability of Deep Polynomial Neural Networks
Konstantin Usevich, Ricardo Borsoi, Clara Dérand, Marianne Clausel
Main category: cs.LG
TL;DR: This paper provides a comprehensive analysis of identifiability in deep Polynomial Neural Networks (PNNs), establishing conditions under which these networks are identifiable and settling open conjectures about their neurovarieties.
Details
Motivation: PNNs have rich algebraic and geometric structure but their identifiability - crucial for interpretability - remains poorly understood, motivating this systematic study.
Method: The analysis connects deep PNNs to low-rank tensor decompositions and uses Kruskal-type uniqueness theorems. The proofs are constructive and examine architectures with and without bias terms.
Result: The study reveals that architectures with non-increasing layer widths are generically identifiable under mild conditions, and encoder-decoder networks are identifiable when decoder widths don’t grow too rapidly compared to activation degrees.
Conclusion: The work settles an open conjecture on PNN neurovarieties’ dimension and provides new bounds on activation degrees needed to reach expected dimension, advancing understanding of PNN identifiability.
Abstract: Polynomial Neural Networks (PNNs) possess a rich algebraic and geometric structure. However, their identifiability – a key property for ensuring interpretability – remains poorly understood. In this work, we present a comprehensive analysis of the identifiability of deep PNNs, including architectures with and without bias terms. Our results reveal an intricate interplay between activation degrees and layer widths in achieving identifiability. As special cases, we show that architectures with non-increasing layer widths are generically identifiable under mild conditions, while encoder-decoder networks are identifiable when the decoder widths do not grow too rapidly compared to the activation degrees. Our proofs are constructive and center on a connection between deep PNNs and low-rank tensor decompositions, and Kruskal-type uniqueness theorems. We also settle an open conjecture on the dimension of PNN’s neurovarieties, and provide new bounds on the activation degrees required for it to reach the expected dimension.
[988] Analog In-memory Training on General Non-ideal Resistive Elements: The Impact of Response Functions
Zhaoxian Wu, Quan Xiao, Tayfun Gokmen, Omobayode Fagbohungbe, Tianyi Chen
Main category: cs.LG
TL;DR: This paper analyzes gradient-based training on analog in-memory computing hardware with non-ideal response functions, identifies that asymmetric response functions impose implicit penalties on SGD, and proposes a Residual Learning algorithm that converges exactly to critical points.
Details
Motivation: As training large vision/language models becomes increasingly expensive, analog in-memory computing offers energy-efficient solutions, but the training dynamics with non-ideal hardware response functions are underexplored.
Method: Theoretical analysis of gradient-based training on AIMC hardware with non-linear/asymmetric response functions, proposing Residual Learning algorithm that solves a bilevel optimization problem to overcome hardware limitations.
Result: Asymmetric response functions negatively impact Analog SGD by imposing implicit penalties, while the proposed Residual Learning algorithm provably converges exactly to critical points and can address other hardware imperfections like limited granularity.
Conclusion: This is the first paper to systematically investigate generic non-ideal response functions in AIMC training, providing theoretical foundations and practical solutions validated through simulations.
Abstract: As the economic and environmental costs of training and deploying large vision or language models increase dramatically, analog in-memory computing (AIMC) emerges as a promising energy-efficient solution. However, the training perspective, especially the training dynamics, remains underexplored. In AIMC hardware, the trainable weights are represented by the conductance of resistive elements and updated using consecutive electrical pulses. Ideally, the conductance changes by a constant amount in response to each pulse; in reality, the change is scaled by asymmetric and non-linear response functions, leading to non-ideal training dynamics. This paper provides a theoretical foundation for gradient-based training on AIMC hardware with non-ideal response functions. We demonstrate that asymmetric response functions negatively impact Analog SGD by imposing an implicit penalty on the objective. To overcome the issue, we propose the Residual Learning algorithm, which provably converges exactly to a critical point by solving a bilevel optimization problem. We demonstrate that the proposed method can be extended to address other hardware imperfections, such as limited response granularity. To our knowledge, this is the first paper to investigate the impact of a class of generic non-ideal response functions. The conclusion is supported by simulations validating our theoretical insights.
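A toy simulation of why asymmetry acts as an implicit penalty: under a zero-mean pulse train, an ideal device would stay put on average, but asymmetric responses pull the weight toward a device-specific fixed point. The response functions below are invented for illustration, not taken from the paper.

```python
import numpy as np

def pulse_update(w, direction, lr=0.05, w_max=1.0):
    """One analog pulse with a saturating, asymmetric response (toy model).

    Ideally each pulse changes w by +/- lr; here the step shrinks near the
    rails, and the up/down responses differ in strength.
    """
    if direction > 0:
        return w + lr * (1 - w / w_max)        # potentiation response
    return w - lr * 0.7 * (1 + w / w_max)      # weaker, asymmetric depression

# Drive the weight with a zero-mean pulse train: the asymmetry drags it
# toward the point where up and down drifts balance (about 0.18 here),
# which acts like an implicit penalty on the training objective.
rng = np.random.default_rng(0)
w = 0.0
for _ in range(2000):
    w = pulse_update(w, rng.choice([-1, 1]))
print(f"drifted weight after zero-mean pulses: {w:.3f}")
```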
[989] FlightKooba: A Fast Interpretable FTP Model
Jing Lu, Xuan Wu, Yizhun Tian, Songhan Fan, Yali Fang
Main category: cs.LG
TL;DR: FlightKooba is a novel modeling approach that integrates HiPPO theory, Koopman operator theory, and control theory to extract smooth latent dynamics from noisy time series data with exceptional computational efficiency and interpretability.
Details
Motivation: Existing deep learning models for flight trajectory prediction and time series tasks face challenges of high computational cost and insufficient interpretability due to their complex black-box nature.
Method: Integrates HiPPO theory, Koopman operator theory, and control theory using Legendre polynomial bases to construct Koopman operators analytically, avoiding large-scale parameter training.
Result: Experiments show competitive prediction accuracy for periodic signals while reducing trainable parameters by several orders of magnitude and achieving fastest training speed. However, unsuitable for sequences dominated by high-frequency noise due to inherent low-pass filtering characteristics.
Conclusion: FlightKooba offers a powerful, efficient, and interpretable alternative for time series analysis, particularly in resource-constrained environments.
Abstract: Flight trajectory prediction (FTP) and similar time series tasks typically require capturing smooth latent dynamics hidden within noisy signals. However, existing deep learning models face significant challenges of high computational cost and insufficient interpretability due to their complex black-box nature. This paper introduces FlightKooba, a novel modeling approach designed to extract such underlying dynamics analytically. Our framework uniquely integrates HiPPO theory, Koopman operator theory, and control theory. By leveraging Legendre polynomial bases, it constructs Koopman operators analytically, thereby avoiding large-scale parameter training. The method’s core strengths lie in its exceptional computational efficiency and inherent interpretability. Experiments on multiple public datasets validate our design philosophy: for signals exhibiting strong periodicity or clear physical laws (e.g., in aviation, meteorology, and traffic flow), FlightKooba delivers competitive prediction accuracy while reducing trainable parameters by several orders of magnitude and achieving the fastest training speed. Furthermore, we analyze the model’s theoretical boundaries, clarifying its inherent low-pass filtering characteristics that render it unsuitable for sequences dominated by high-frequency noise. In summary, FlightKooba offers a powerful, efficient, and interpretable new alternative for time series analysis, particularly in resource-constrained environments.
[990] Structure-preserving contrastive learning for spatial time series
Yiru Jiao, Sander van Cranenburgh, Simeon Calvert, Hans van Lint
Main category: cs.LG
TL;DR: The paper introduces two structure-preserving regularizers for contrastive learning of spatial time series, with a dynamic weighting mechanism to balance contrastive learning and structure preservation, showing improved performance across various tasks.
Details
Motivation: Self-supervised representation learning for spatially characterized time series (common in transportation) faces challenges in maintaining fine-grained spatio-temporal similarities in the latent space.
Method: Two structure-preserving regularizers for contrastive learning: one preserves topology of similarities between instances, and the other preserves graph geometry of similarities across spatial and temporal dimensions, with a dynamic weighting mechanism.
Result: The method preserves similarity structures more effectively and improves state-of-the-art task performances across multivariate time series classification and traffic prediction tasks.
Conclusion: Well-preserved similarity structures in the latent space indicate more informative representations, providing insights for designing effective neural networks for transportation research. The method can integrate with arbitrary neural networks and benefits time series with spatial features.
Abstract: The effectiveness of neural network models largely relies on learning meaningful latent patterns from data, where self-supervised learning of informative representations can enhance model performance and generalisability. However, self-supervised representation learning for spatially characterised time series, which are ubiquitous in transportation domain, poses unique challenges due to the necessity of maintaining fine-grained spatio-temporal similarities in the latent space. In this study, we introduce two structure-preserving regularisers for the contrastive learning of spatial time series: one regulariser preserves the topology of similarities between instances, and the other preserves the graph geometry of similarities across spatial and temporal dimensions. To balance the contrastive learning objective and the need for structure preservation, we propose a dynamic weighting mechanism that adaptively manages this trade-off and stabilises training. We validate the proposed method through extensive experiments, including multivariate time series classification to demonstrate its general applicability, as well as macroscopic and microscopic traffic prediction to highlight its particular usefulness in encoding traffic interactions. Across all tasks, our method preserves the similarity structures more effectively and improves state-of-the-art task performances. This method can be integrated with an arbitrary neural network model and is particularly beneficial for time series data with spatial or geographical features. Furthermore, our findings suggest that well-preserved similarity structures in the latent space indicate more informative and useful representations. This provides insights to design more effective neural networks for data-driven transportation research. Our code is made openly accessible with all resulting data at https://github.com/yiru-jiao/spclt
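A minimal sketch of the general pattern: an InfoNCE contrastive term plus a regularizer that pulls latent pairwise similarities toward input-space ones. The paper's two regularizers (topology-preserving and graph-geometry-preserving) and its dynamic weighting are more refined than this fixed-alpha toy; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def structure_preserving_contrastive(x, z1, z2, temperature=0.5, alpha=0.1):
    """InfoNCE over two augmented views, plus a structure-preserving term.

    x: flattened input batch; z1, z2: latents of two augmentations of x.
    """
    # Contrastive part: each instance's two views should match.
    logits = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(len(z1))
    contrastive = F.cross_entropy(logits, labels)

    # Structure part: latent similarity matrix should mirror the input one.
    sim_x = F.cosine_similarity(x.unsqueeze(1), x.unsqueeze(0), dim=-1)
    sim_z = F.cosine_similarity(z1.unsqueeze(1), z1.unsqueeze(0), dim=-1)
    structure = F.mse_loss(sim_z, sim_x)
    return contrastive + alpha * structure

x = torch.randn(8, 12, 4).flatten(1)              # batch of (time, space) series
z1, z2 = torch.randn(8, 32), torch.randn(8, 32)   # stand-ins for encoder outputs
loss = structure_preserving_contrastive(x, z1, z2)
```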
[991] Curious Causality-Seeking Agents Learn Meta Causal World
Zhiyu Zhao, Haoxuan Li, Haifeng Zhang, Jun Wang, Francesco Faccio, Jürgen Schmidhuber, Mengyue Yang
Main category: cs.LG
TL;DR: The paper introduces Meta-Causal Graphs as world models that encode how causal structures shift across latent world states, and proposes a Causality-Seeking Agent that identifies meta states, discovers causal relationships through curiosity-driven interventions, and refines the graph through exploration.
Details
Motivation: Traditional world models assume fixed causal rules, but in reality, observed causal mechanisms can shift due to policy or environment changes. This creates problems when subtle changes alter the very causal mechanisms being observed.
Method: Proposes Meta-Causal Graphs composed of multiple causal subgraphs triggered by meta states in latent space. A Causality-Seeking Agent identifies meta states, discovers causal relationships through curiosity-driven intervention policy, and iteratively refines the graph through exploration.
Result: Experiments on synthetic tasks and robot arm manipulation show the method robustly captures shifts in causal dynamics and generalizes effectively to unseen contexts.
Conclusion: Meta-Causal Graphs provide an effective representation for handling shifting causal mechanisms in world models, enabling better generalization and adaptation to changing environments.
Abstract: When building a world model, a common assumption is that the environment has a single, unchanging underlying causal rule, like applying Newton’s laws to every situation. In reality, what appears as a drifting causal mechanism is often the manifestation of a fixed underlying mechanism seen through a narrow observational window. This brings about a problem that, when building a world model, even subtle shifts in policy or environment states can alter the very observed causal mechanisms. In this work, we introduce the Meta-Causal Graph as world models, a minimal unified representation that efficiently encodes the transformation rules governing how causal structures shift across different latent world states. A single Meta-Causal Graph is composed of multiple causal subgraphs, each triggered by a meta state in the latent state space. Building on this representation, we introduce a Causality-Seeking Agent whose objectives are to (1) identify the meta states that trigger each subgraph, (2) discover the corresponding causal relationships through a curiosity-driven intervention policy, and (3) iteratively refine the Meta-Causal Graph through ongoing curiosity-driven exploration and agent experiences. Experiments on both synthetic tasks and a challenging robot arm manipulation task demonstrate that our method robustly captures shifts in causal dynamics and generalizes effectively to previously unseen contexts.
[992] Universal Sequence Preconditioning
Annie Marsden, Elad Hazan
Main category: cs.LG
TL;DR: The paper proposes a universal preconditioning method using orthogonal polynomial convolution for sequential prediction, achieving sublinear regret bounds for linear dynamical systems with stable asymmetric transitions.
Details
Motivation: To address the problem of preconditioning in sequential prediction, particularly for linear dynamical systems with marginally stable and asymmetric transition matrices where existing methods struggle.
Method: Convolve input sequences with coefficients from orthogonal polynomials (Chebyshev or Legendre), which corresponds to applying a polynomial to the hidden transition matrix in linear dynamical systems.
Result: The method reduces regret for two prediction algorithms and achieves the first sublinear, hidden-dimension independent regret bounds (up to logarithmic factors) for systems with marginally stable asymmetric transitions. Experiments show improved performance across diverse algorithms including RNNs.
Conclusion: Simple orthogonal polynomial-based preconditioning is an effective universal method that improves sequential prediction performance and generalizes beyond linear dynamical systems.
Abstract: We study the problem of preconditioning in sequential prediction. From the theoretical lens of linear dynamical systems, we show that convolving the input sequence corresponds to applying a polynomial to the hidden transition matrix. Building on this insight, we propose a universal preconditioning method that convolves the input with coefficients from orthogonal polynomials such as Chebyshev or Legendre. We prove that this approach reduces regret for two distinct prediction algorithms and yields the first ever sublinear and hidden-dimension independent regret bounds (up to logarithmic factors) that hold for systems with marginally stable and asymmetric transition matrices. Finally, extensive synthetic and real-world experiments show that this simple preconditioning strategy improves the performance of a diverse range of algorithms, including recurrent neural networks, and generalizes to signals beyond linear dynamical systems.
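The preconditioning step itself is easy to state; a minimal NumPy sketch, with the polynomial degree and any normalization treated as unstated hyperparameters:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def precondition(seq, degree=5):
    """Convolve a 1-D input with the monomial coefficients of Chebyshev T_degree.

    np.convolve(seq, c)[t] = sum_k c[k] * seq[t-k], so on a linear dynamical
    system this corresponds to applying the polynomial sum_k c_k A^k to the
    hidden transition matrix A, taming marginally stable eigenvalues.
    """
    c = C.cheb2poly([0] * degree + [1])   # e.g. T_5: [0, 5, 0, -20, 0, 16]
    return np.convolve(seq, c, mode="valid")

x = np.sin(0.1 * np.arange(200))          # toy input sequence
x_pre = precondition(x)                   # feed this to the predictor instead of x
```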
[993] Reasoning as an Adaptive Defense for Safety
Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan, Aviral Kumar
Main category: cs.LG
TL;DR: TARS is a reinforcement learning method that trains LLMs to adaptively reason about safety using chain-of-thought, achieving better safety-refusal trade-offs and robustness to jailbreak attacks.
Details
Motivation: To extend adaptive reasoning methods beyond math/code domains to safety vulnerabilities, creating models that can dynamically allocate compute for safety reasoning per prompt.
Method: RL approach with three key design choices: lightweight SFT warmstart, mixed prompt types (harmful/harmless/ambiguous), and reward function to preserve reasoning capabilities during training.
Result: TARS-trained models show adaptive compute allocation (more on ambiguous queries), better safety-refusal balance, improved safe/unsafe distinction, and robustness to both white-box (GCG) and black-box (PAIR) attacks.
Conclusion: TARS provides an effective open recipe for training LLMs against jailbreaks through adaptive safety reasoning per prompt.
Abstract: Reasoning methods that adaptively allocate test-time compute have advanced LLM performance on easy-to-verify domains such as math and code. In this work, we study how to utilize this approach to train models that exhibit a degree of robustness to safety vulnerabilities, and show that doing so can provide benefits. We build a recipe called TARS (Training Adaptive Reasoners for Safety), a reinforcement learning (RL) approach that trains models to reason about safety using chain-of-thought traces and a reward signal that balances safety with task completion. To build TARS, we identify three critical design choices: (1) a “lightweight” warmstart SFT stage, (2) a mix of harmful, harmless, and ambiguous prompts to prevent shortcut behaviors such as too many refusals, and (3) a reward function to prevent degeneration of reasoning capabilities during training. Models trained with TARS exhibit adaptive behaviors by spending more compute on ambiguous queries, leading to better safety-refusal trade-offs. They also internally learn to better distinguish between safe and unsafe prompts and attain greater robustness to both white-box (e.g., GCG) and black-box attacks (e.g., PAIR). Overall, our work provides an effective, open recipe for training LLMs against jailbreaks and harmful requests by reasoning per prompt.
[994] Provably Efficient Online RLHF with One-Pass Reward Modeling
Long-Fei Li, Yu-Yang Qian, Peng Zhao, Zhi-Hua Zhou
Main category: cs.LG
TL;DR: Proposes a one-pass reward modeling method for online RLHF that eliminates historical data storage and achieves constant-time updates per iteration, improving computational efficiency.
Details
Motivation: Traditional RLHF methods rely on fixed datasets with limited coverage, while online RLHF requires expensive re-optimization from scratch at each iteration, leading to growing computational costs.
Method: Formalizes RLHF as a contextual preference bandit and develops an algorithm based on online mirror descent with tailored local norm, replacing standard maximum likelihood estimation for reward modeling.
Result: Theoretical guarantees show enhanced statistical and computational efficiency. Experiments with Llama-3-8B-Instruct and Qwen2.5-7B-Instruct on Ultrafeedback and Mixture2 datasets validate effectiveness.
Conclusion: The proposed one-pass reward modeling method successfully addresses the computational bottleneck in online RLHF by enabling constant-time updates without storing historical data.
Abstract: Reinforcement Learning from Human Feedback (RLHF) has shown remarkable success in aligning Large Language Models (LLMs) with human preferences. Traditional RLHF methods rely on a fixed dataset, which often suffers from limited coverage. To this end, online RLHF has emerged as a promising direction, enabling iterative data collection and refinement. Despite its potential, this paradigm faces a key bottleneck: the requirement to continuously integrate new data into the dataset and re-optimize the model from scratch at each iteration, resulting in computational and storage costs that grow linearly with the number of iterations. In this work, we address this challenge by proposing a one-pass reward modeling method that eliminates the need to store historical data and achieves constant-time updates per iteration. Specifically, we first formalize RLHF as a contextual preference bandit and develop a new algorithm based on online mirror descent with a tailored local norm, replacing the standard maximum likelihood estimation for reward modeling. We then apply it to various online RLHF settings, including passive data collection, active data collection, and deployment-time adaptation. We provide theoretical guarantees showing that our method enhances both statistical and computational efficiency. Finally, we design practical algorithms for LLMs and conduct experiments with the Llama-3-8B-Instruct and Qwen2.5-7B-Instruct models on Ultrafeedback and Mixture2 datasets, validating the effectiveness of our approach.
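A constant-time sketch of the one-pass idea on a linear Bradley-Terry reward model. The paper replaces the plain SGD step below with online mirror descent under a tailored local norm; the point here is only that each preference pair is consumed once and then discarded, so memory and per-step cost never grow.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
theta = np.zeros(d)                # linear reward model r(x) = theta @ phi(x)

def online_preference_step(theta, phi_w, phi_l, lr=0.1):
    """One constant-time update from a single (winner, loser) preference pair.

    Bradley-Terry: P(w beats l) = sigmoid(theta @ (phi_w - phi_l)).
    No past pairs are stored; this is the one-pass property.
    """
    z = phi_w - phi_l
    p = 1.0 / (1.0 + np.exp(-theta @ z))
    return theta + lr * (1.0 - p) * z       # gradient ascent on log-likelihood

theta_star = rng.normal(size=d)             # ground-truth preferences (synthetic)
for _ in range(5000):                       # simulated stream of preference pairs
    a, b = rng.normal(size=d), rng.normal(size=d)
    p_true = 1 / (1 + np.exp(-theta_star @ (a - b)))
    if rng.random() < p_true:
        theta = online_preference_step(theta, a, b)
    else:
        theta = online_preference_step(theta, b, a)
```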
[995] Echo State Transformer: Attention Over Finite Memories
Yannis Bendi-Ouis, Xavier Hinaut
Main category: cs.LG
TL;DR: Echo State Transformers (EST) combine Transformer attention with Reservoir Computing to create a fixed-size memory system that achieves constant computational complexity, breaking the quadratic scaling problem of standard Transformers while maintaining strong performance on time series tasks.
Details
Motivation: Address limitations of Transformers including their brain-unrealistic processing, quadratic complexity growth with sequence length, and computational intensity by creating more efficient models.
Method: Hybrid architecture integrating Transformer attention with Reservoir Computing principles, using parallel reservoirs as working memory units with trainable hyperparameters that dynamically adapt memory/non-linearity trade-off.
Result: EST ranks first overall in 2 of 5 categories on Time Series Library benchmark (69 tasks), outperforming state-of-the-art baselines on classification and anomaly detection while remaining competitive on short-term forecasting.
Conclusion: EST is a compelling alternative for time-series classification/anomaly detection and practical complement to transformers for applications prioritizing robust representations and sensitive event detection.
Abstract: While Large Language Models and their underlying Transformer architecture are remarkably efficient, they do not reflect how our brain processes and learns a diversity of cognitive tasks such as language and working memory. Furthermore, sequential data processing with Transformers encounters a fundamental barrier: quadratic complexity growth with sequence length. Motivated by these limitations, our ambition is to create more efficient models that are less reliant on intensive computations. We introduce Echo State Transformers (EST), a hybrid architecture that elegantly resolves this challenge while demonstrating exceptional performance in classification and detection tasks. EST integrates the Transformer attention mechanisms with principles from Reservoir Computing to create a fixed-size window distributed memory system. Drawing inspiration from Echo State Networks, the most prominent instance of the Reservoir Computing paradigm, our approach leverages reservoirs (random recurrent networks) as a lightweight and efficient memory. Our architecture integrates a new module called “Working Memory” based on several reservoirs working in parallel. These reservoirs work as independent working memory units with distinct internal dynamics. A novelty here is that the classical reservoir hyperparameters, controlling the dynamics, are now trained. Thus, the EST dynamically adapts the reservoir memory/non-linearity trade-off. Thanks to these working memory units, EST achieves constant computational complexity at each processing step, effectively breaking the quadratic scaling problem of standard Transformers. We evaluate EST on a recent, challenging time series benchmark: the Time Series Library, which comprises 69 tasks across five categories. Results show that EST ranks first overall in two of five categories, outperforming strong state-of-the-art baselines on classification and anomaly detection tasks, while remaining competitive on short-term forecasting. These results position EST as a compelling alternative for time-series classification and anomaly detection, and a practical complement to transformer-style models in applications that prioritize robust representations and sensitive event detection.
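A minimal sketch of one reservoir used as a working-memory unit, following the classic leaky-integrator Echo State Network update. EST makes hyperparameters like the leak rate trainable and reads the states with attention, neither of which this sketch attempts.

```python
import numpy as np

rng = np.random.default_rng(0)

class Reservoir:
    """One fixed random recurrent network (classic ESN-style leaky integrator)."""

    def __init__(self, n_in, n_res, spectral_radius=0.9, leak=0.3, rng=rng):
        self.W_in = rng.uniform(-1, 1, size=(n_res, n_in))
        W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
        W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
        self.W, self.leak = W, leak
        self.x = np.zeros(n_res)

    def step(self, u):
        pre = self.W_in @ u + self.W @ self.x
        self.x = (1 - self.leak) * self.x + self.leak * np.tanh(pre)
        return self.x        # fixed-size state: constant cost per time step

# Several reservoirs with different dynamics act as parallel working-memory
# units whose fixed-size states an attention mechanism can then read.
memories = [Reservoir(n_in=3, n_res=50, leak=l) for l in (0.1, 0.3, 0.9)]
u_t = rng.normal(size=3)
states = np.concatenate([m.step(u_t) for m in memories])
```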
[996] Dimension-free Score Matching and Time Bootstrapping for Diffusion Models
Syamantak Kumar, Dheeraj Nagaraj, Purnamrita Sarkar
Main category: cs.LG
TL;DR: This paper establishes the first nearly dimension-free sample complexity bounds for learning score functions in diffusion models, achieving double exponential improvement in dimension dependence over prior work.
Details
Motivation: Previous sample complexity bounds for diffusion models had polynomial dependence on dimension, which limited their theoretical understanding and practical efficiency.
Method: Uses a single function approximator to jointly estimate scores across noise levels, introduces martingale-based error decomposition and sharp variance bounds, and proposes Bootstrapped Score Matching (BSM) for variance reduction.
Result: Achieved nearly dimension-free sample complexity bounds (modulo log(|H|) dependence) with double exponential improvement in dimension over prior results.
Conclusion: The analysis provides insights into the efficiency and effectiveness of diffusion models and introduces techniques that may be of independent interest for learning from dependent data generated by Markov processes.
Abstract: Diffusion models generate samples by estimating the score function of the target distribution at various noise levels. The model is trained using samples drawn from the target distribution by progressively adding noise. Previous sample complexity bounds have polynomial dependence on the dimension $d$, apart from a $\log(|\mathcal{H}|)$ term, where $\mathcal{H}$ is the hypothesis class. In this work, we establish the first (nearly) dimension-free sample complexity bounds, modulo the $\log(|\mathcal{H}|)$ dependence, for learning these score functions, achieving a double exponential improvement in the dimension over prior results. A key aspect of our analysis is the use of a single function approximator to jointly estimate scores across noise levels, a practical feature that enables generalization across time steps. We introduce a martingale-based error decomposition and sharp variance bounds, enabling efficient learning from dependent data generated by Markov processes, which may be of independent interest. Building on these insights, we propose Bootstrapped Score Matching (BSM), a variance reduction technique that leverages previously learned scores to improve accuracy at higher noise levels. These results provide insights into the efficiency and effectiveness of diffusion models for generative modeling.
[997] Mixing It Up: Exploring Mixer Networks for Irregular Multivariate Time Series Forecasting
Christian Klötergens, Vijaya Krishna Yalavarthi, Tim Dernedde, Lars Schmidt-Thieme
Main category: cs.LG
TL;DR: IMTS-Mixer is a novel forecasting architecture designed for Irregular Multivariate Time Series (IMTS) that transforms irregular data into fixed-size matrices for integration with mixer modules, achieving state-of-the-art accuracy and computational efficiency.
Details
Motivation: Real-world datasets in healthcare, climate research, and biomechanics often have irregularly spaced observations with missing values, violating the regular spacing assumptions of most forecasting models. TS-mixer models have been successful for regular time series but cannot handle IMTS.
Method: IMTS-Mixer retains TS mixer principles while introducing innovative methods to transform irregular multivariate time series into fixed-size matrix representations, enabling seamless integration with mixer modules.
Result: IMTS-Mixer establishes a new state-of-the-art in forecasting accuracy on four real-world benchmark datasets from various domains while also improving computational efficiency.
Conclusion: The proposed IMTS-Mixer successfully bridges the gap between TS mixer models and irregular multivariate time series forecasting, demonstrating superior performance and efficiency for real-world applications with irregular observations.
Abstract: Forecasting Irregular Multivariate Time Series (IMTS) has recently emerged as a distinct research field, necessitating specialized models to address its unique challenges. While most forecasting literature assumes regularly spaced observations without missing values, many real-world datasets - particularly in healthcare, climate research, and biomechanics - violate these assumptions. Time Series (TS)-mixer models have achieved remarkable success in regular multivariate time series forecasting. However, they remain unexplored for IMTS due to their requirement for complete and evenly spaced observations. To bridge this gap, we introduce IMTS-Mixer, a novel forecasting architecture designed specifically for IMTS. Our approach retains the core principles of TS mixer models while introducing innovative methods to transform IMTS into fixed-size matrix representations, enabling their seamless integration with mixer modules. We evaluate IMTS-Mixer on a benchmark of four real-world datasets from various domains. Our results demonstrate that IMTS-Mixer establishes a new state-of-the-art in forecasting accuracy while also improving computational efficiency.
[998] TimeXL: Explainable Multi-modal Time Series Prediction with LLM-in-the-Loop
Yushan Jiang, Wenchao Yu, Geon Lee, Dongjin Song, Kijung Shin, Wei Cheng, Yanchi Liu, Haifeng Chen
Main category: cs.LG
TL;DR: TimeXL is a multi-modal time series prediction framework that integrates prototype-based time series encoding with three collaborating LLMs to improve prediction accuracy and provide interpretable explanations through a closed-loop workflow.
Details
Motivation: Most existing time series analysis methods overlook rich contextual signals from auxiliary modalities, limiting their predictive power and interpretability.
Method: Uses a multi-modal prototype-based encoder for preliminary forecasts, then employs three LLMs: prediction LLM for forecast refinement, reflection LLM for error analysis, and refinement LLM for iterative improvement and encoder retraining.
Result: Achieves up to 8.9% improvement in AUC on four real-world datasets and produces human-centric, multi-modal explanations.
Conclusion: Demonstrates the power of LLM-driven reasoning for time series prediction, enabling continuous performance improvement and enhanced interpretability through the closed-loop workflow.
Abstract: Time series analysis provides essential insights for real-world system dynamics and informs downstream decision-making, yet most existing methods often overlook the rich contextual signals present in auxiliary modalities. To bridge this gap, we introduce TimeXL, a multi-modal prediction framework that integrates a prototype-based time series encoder with three collaborating Large Language Models (LLMs) to deliver more accurate predictions and interpretable explanations. First, a multi-modal prototype-based encoder processes both time series and textual inputs to generate preliminary forecasts alongside case-based rationales. These outputs then feed into a prediction LLM, which refines the forecasts by reasoning over the encoder’s predictions and explanations. Next, a reflection LLM compares the predicted values against the ground truth, identifying textual inconsistencies or noise. Guided by this feedback, a refinement LLM iteratively enhances text quality and triggers encoder retraining. This closed-loop workflow (prediction, critique/reflection, and refinement) continuously boosts the framework’s performance and interpretability. Empirical evaluations on four real-world datasets demonstrate that TimeXL achieves up to 8.9% improvement in AUC and produces human-centric, multi-modal explanations, highlighting the power of LLM-driven reasoning for time series prediction.
[999] Unifying Re-Identification, Attribute Inference, and Data Reconstruction Risks in Differential Privacy
Bogdan Kulynych, Juan Felipe Gomez, Georgios Kaissis, Jamie Hayes, Borja Balle, Flavio du Pin Calmon, Jean Louis Raisaro
Main category: cs.LG
TL;DR: This paper proposes a unified framework using f-DP to interpret and calibrate differentially private mechanisms, providing consistent and tunable bounds for re-identification, attribute inference, and data reconstruction risks.
Details
Motivation: Existing methods for mapping DP parameters to concrete privacy risks are overly pessimistic and inconsistent across different attack settings, making DP mechanisms difficult to interpret and calibrate.
Method: The authors use the hypothesis-testing interpretation of DP (f-DP) to derive unified bounds on attack success that work consistently across re-identification, attribute inference, and data reconstruction risks.
Result: Empirical results show tighter bounds than prior methods using ε-DP, Rényi DP, and concentrated DP, enabling 20% noise reduction at the same risk level and accuracy improvements from 52% to 70% in text classification.
Conclusion: The f-DP framework provides a principled approach for interpreting and calibrating DP protection against specific levels of privacy risks, offering consistent and tunable risk evaluation.
Abstract: Differentially private (DP) mechanisms are difficult to interpret and calibrate because existing methods for mapping standard privacy parameters to concrete privacy risks – re-identification, attribute inference, and data reconstruction – are both overly pessimistic and inconsistent. In this work, we use the hypothesis-testing interpretation of DP ($f$-DP), and determine that bounds on attack success can take the same unified form across re-identification, attribute inference, and data reconstruction risks. Our unified bounds are (1) consistent across a multitude of attack settings, and (2) tunable, enabling practitioners to evaluate risk with respect to arbitrary, including worst-case, levels of baseline risk. Empirically, our results are tighter than prior methods using $\varepsilon$-DP, Rényi DP, and concentrated DP. As a result, calibrating noise using our bounds can reduce the required noise by 20% at the same risk level, which yields, e.g., an accuracy increase from 52% to 70% in a text classification task. Overall, this unifying perspective provides a principled framework for interpreting and calibrating the degree of protection in DP against specific levels of re-identification, attribute inference, or data reconstruction risk.
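For intuition, the hypothesis-testing view works with trade-off curves between an attacker's type-I and type-II errors. Under mu-Gaussian DP the curve has a closed form, and attack-success bounds can be read off it directly; a small sketch of that special case (the paper's unified bounds are more general):

```python
import numpy as np
from scipy.stats import norm

def gdp_tradeoff(alpha, mu):
    """Type-II error lower bound f(alpha) for a mu-Gaussian-DP mechanism.

    Any membership-style attack run at false-positive rate alpha has
    true-positive rate at most 1 - f(alpha); re-identification, attribute
    inference, and reconstruction bounds all derive from curves like this.
    """
    return norm.cdf(norm.ppf(1 - alpha) - mu)

alphas = np.array([0.001, 0.01, 0.05, 0.1])
for mu in (0.5, 1.0, 2.0):
    tpr_bound = 1 - gdp_tradeoff(alphas, mu)
    print(f"mu={mu}: max TPR at FPR {alphas} -> {np.round(tpr_bound, 3)}")
```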
[1000] Memory Injection Attacks on LLM Agents via Query-Only Interaction
Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, Zhen Xiang
Main category: cs.LG
TL;DR: MINJA is a memory injection attack that compromises LLM agents by injecting malicious records through normal interactions, enabling attackers to influence agent behavior without direct memory access.
Details
Motivation: LLM agents with compromised memory banks can produce harmful outputs when malicious records are retrieved. Current attacks assume direct memory modification, but MINJA addresses the real-world scenario where attackers only have query-level access.
Method: Inject malicious records via queries and output observations. Use bridging steps to link victim queries to malicious reasoning, with indication prompts that guide agents to generate similar bridging steps autonomously. Employ progressive shortening to remove prompts gradually.
Result: Extensive experiments show MINJA effectively compromises agent memory across diverse agents. The attack requires minimal execution requirements and enables any user to influence agent memory.
Conclusion: MINJA highlights significant security risks in LLM agents, demonstrating that memory can be compromised through normal interactions without direct memory access, posing serious threats to real-world agent deployments.
Abstract: Agents powered by large language models (LLMs) have demonstrated strong capabilities in a wide range of complex, real-world applications. However, LLM agents with a compromised memory bank may easily produce harmful outputs when the past records retrieved for demonstration are malicious. In this paper, we propose a novel Memory INJection Attack, MINJA, without assuming that the attacker can directly modify the memory bank of the agent. The attacker injects malicious records into the memory bank by only interacting with the agent via queries and output observations. These malicious records are designed to elicit a sequence of malicious reasoning steps corresponding to a different target query during the agent’s execution of the victim user’s query. Specifically, we introduce a sequence of bridging steps to link victim queries to the malicious reasoning steps. During the memory injection, we propose an indication prompt that guides the agent to autonomously generate similar bridging steps, with a progressive shortening strategy that gradually removes the indication prompt, such that the malicious record will be easily retrieved when processing later victim queries. Our extensive experiments across diverse agents demonstrate the effectiveness of MINJA in compromising agent memory. With minimal requirements for execution, MINJA enables any user to influence agent memory, highlighting the risk.
[1001] Validating LLM-as-a-Judge Systems under Rating Indeterminacy
Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, Alexandra Chouldechova
Main category: cs.LG
TL;DR: The paper introduces a framework for validating LLM-as-a-judge systems under rating indeterminacy, showing that standard forced-choice rating approaches lead to suboptimal judge system selection, while multi-label response set ratings improve performance by up to 31%.
Details
Motivation: Current LLM-as-a-judge validation methods rely on forced-choice ratings that don't account for rating indeterminacy - where multiple ratings can be valid for the same item, leading to biased validation results.
Method: Proposed a framework using multi-label “response set” ratings instead of forced-choice ratings, with theoretical analysis of different human-judge agreement metrics and rating elicitation/aggregation schemes across 11 real-world rating tasks and 9 commercial LLMs.
Result: Standard forced-choice validation approaches select highly suboptimal judge systems, performing up to 31% worse than systems selected using the proposed multi-label response set approach that accounts for rating indeterminacy.
Conclusion: Concrete recommendations for more principled LLM-as-a-judge validation that better handles rating indeterminacy through multi-label rating approaches rather than forced-choice methods.
Abstract: The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, plays a critical role in scaling and standardizing GenAI evaluations. To validate such judge systems, evaluators assess human–judge agreement by first collecting multiple human ratings for each item in a validation corpus, then aggregating the ratings into a single, per-item gold label rating. For many items, however, rating criteria may admit multiple valid interpretations, so a human or LLM rater may deem multiple ratings “reasonable” or “correct.” We call this condition rating indeterminacy. Problematically, many rating tasks that contain rating indeterminacy rely on forced-choice elicitation, whereby raters are instructed to select only one rating for each item. In this paper, we introduce a framework for validating LLM-as-a-judge systems under rating indeterminacy. We draw theoretical connections between different measures of judge system performance under different human–judge agreement metrics, and different rating elicitation and aggregation schemes. We demonstrate that differences in how humans and LLMs resolve rating indeterminacy when responding to forced-choice rating instructions can heavily bias LLM-as-a-judge validation. Through extensive experiments involving 11 real-world rating tasks and 9 commercial LLMs, we show that standard validation approaches that rely upon forced-choice ratings select judge systems that are highly suboptimal, performing as much as 31% worse than judge systems selected by our approach that uses multi-label “response set” ratings to account for rating indeterminacy. We conclude with concrete recommendations for more principled approaches to LLM-as-a-judge validation.
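A toy illustration of the scoring difference, with invented labels: under forced-choice validation a judge is penalized for picking a rating humans also found reasonable, while response-set scoring credits any rating in the item's set of human-acceptable labels.

```python
def forced_choice_agreement(judge, gold):
    """Fraction of items where the judge matches the single aggregated label."""
    return sum(j == g for j, g in zip(judge, gold)) / len(gold)

def response_set_agreement(judge, response_sets):
    """Fraction of items where the judge's rating is in the human response set."""
    return sum(j in s for j, s in zip(judge, response_sets)) / len(judge)

# The third item is indeterminate: humans deemed both ratings reasonable.
judge = ["helpful", "harmful", "neutral"]
gold = ["helpful", "harmful", "helpful"]              # forced-choice aggregation
sets = [{"helpful"}, {"harmful"}, {"helpful", "neutral"}]

print(forced_choice_agreement(judge, gold))           # 0.67: penalizes item 3
print(response_set_agreement(judge, sets))            # 1.00: credits valid ratings
```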
[1002] Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training
Minhak Song, Beomhan Baek, Kwangjun Ahn, Chulhee Yun
Main category: cs.LG
TL;DR: The Schedule-Free (SF) method is a principled alternative to conventional pretraining strategies that effectively navigates loss landscapes without decay phases or memory overhead, making it suitable for large-scale training.
Details
Motivation: Conventional pretraining strategies with fixed compute budgets are inadequate for large-scale training, and existing alternatives like WSD schedules and weight averaging have limitations such as explicit decay phases or additional memory costs.
Method: Revisits the Schedule-Free (SF) method and conducts theoretical/empirical analysis of its dynamics, revealing implicit weight averaging without memory overhead. Proposes a refined SF variant that improves robustness to momentum and large batch sizes.
Result: SF-AdamW effectively navigates loss landscape “river” structure without decay phases or auxiliary averaging. The refined SF variant addresses key limitations and performs better under large batch sizes.
Conclusion: Schedule-Free method is established as a practical, scalable, and theoretically grounded approach for language model training that overcomes limitations of conventional strategies.
Abstract: As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets, such as cosine learning rate schedules, are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the “river” structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.
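For readers unfamiliar with the method, a minimal sketch of Schedule-Free SGD in the spirit of Defazio et al. [2024] follows; the toy quadratic objective, noise scale, and hyperparameters are illustrative assumptions, and the paper's analysis concerns the AdamW variant:

```python
# Hedged sketch of Schedule-Free SGD; y/z/x follow the usual SF iterate naming.
import numpy as np

def grad(w):                      # toy quadratic loss with gradient noise
    return w - 3.0 + np.random.randn(*w.shape) * 0.5

z = x = np.zeros(2)               # z: "fast" iterate, x: averaged iterate
beta, lr = 0.9, 0.1
for t in range(1, 1001):
    y = (1 - beta) * z + beta * x # gradients are taken at an interpolation point
    z = z - lr * grad(y)
    x = x + (z - x) / t           # x is a running mean of z: implicit weight
                                  # averaging, with no memory beyond x itself
print(x)                          # close to the optimum [3., 3.], no decay phase
```

The running-mean update on x is the "implicit weight averaging" the summary refers to: no separate averaging buffer or decay schedule is maintained.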
[1003] Revisiting Agnostic Boosting
Arthur da Cunha, Mikael Møller Høgsgaard, Andrea Paudice, Yuxin Sun
Main category: cs.LG
TL;DR: A new agnostic boosting algorithm with improved sample complexity that reduces to the realizable case and uses margin-based filtering.
Details
Motivation: Boosting is well-studied in the realizable case but less understood in the agnostic setting where no assumptions are made about label distributions.
Method: Reduction to the realizable case followed by margin-based filtering of high-quality hypotheses.
Result: Substantially improved sample complexity compared to prior works under general assumptions, with nearly-matching lower bounds.
Conclusion: The paper settles the sample complexity of agnostic boosting up to logarithmic factors.
Abstract: Boosting is a key method in statistical learning, allowing for converting weak learners into strong ones. While well studied in the realizable case, the statistical properties of weak-to-strong learning remain less understood in the agnostic setting, where there are no assumptions on the distribution of the labels. In this work, we propose a new agnostic boosting algorithm with substantially improved sample complexity compared to prior works under very general assumptions. Our approach is based on a reduction to the realizable case, followed by a margin-based filtering of high-quality hypotheses. Furthermore, we show a nearly-matching lower bound, settling the sample complexity of agnostic boosting up to logarithmic factors.
[1004] Ground-Compose-Reinforce: Grounding Language in Agentic Behaviours using Limited Data
Andrew C. Li, Toryn Q. Klassen, Andrew Wang, Parand A. Alamdari, Sheila A. McIlraith
Main category: cs.LG
TL;DR: Ground-Compose-Reinforce is an end-to-end neurosymbolic framework that trains RL agents directly from high-level task specifications using Reward Machines, requiring minimal data by exploiting compositionality.
Details
Motivation: To address the challenge of grounding language in perception and action for situated agents without needing manually designed reward functions or massive datasets.
Method: Uses Reward Machines (automata-based representations) that capture high-level task structure and can be autoformalized from natural language. The framework exploits compositionality to ground these specifications with limited data.
Result: Experiments in a custom Meta-World domain show the framework successfully elicits complex behaviors from high-level specifications using only 350 labeled pretraining trajectories, including behaviors not seen during pretraining, while non-compositional approaches fail.
Conclusion: The proposed neurosymbolic framework enables effective training of RL agents from high-level task specifications with minimal data by leveraging compositionality and Reward Machines.
Abstract: Grounding language in perception and action is a key challenge when building situated agents that can interact with humans, or other agents, via language. In the past, addressing this challenge has required manually designing the language grounding or curating massive datasets that associate language with the environment. We propose Ground-Compose-Reinforce, an end-to-end, neurosymbolic framework for training RL agents directly from high-level task specifications, without manually designed reward functions or other domain-specific oracles, and without massive datasets. These task specifications take the form of Reward Machines, automata-based representations that capture high-level task structure and are in some cases autoformalizable from natural language. Critically, we show that Reward Machines can be grounded using limited data by exploiting compositionality. Experiments in a custom Meta-World domain with only 350 labelled pretraining trajectories show that our framework faithfully elicits complex behaviours from high-level specifications, including behaviours that never appear in pretraining, while non-compositional approaches fail.
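A Reward Machine can be stated compactly in code. The sketch below is a generic illustration of the automaton idea, not the paper's implementation; the task, propositions, and rewards are hypothetical:

```python
# Hedged sketch of a Reward Machine: a finite automaton over abstract
# propositions that emits reward on transitions.
class RewardMachine:
    def __init__(self, transitions, initial, terminal):
        self.delta = transitions      # (state, proposition) -> (next_state, reward)
        self.state = initial
        self.terminal = terminal

    def step(self, true_props):
        """Advance on the propositions the labelling function made true."""
        reward = 0.0
        for p in true_props:
            if (self.state, p) in self.delta:
                self.state, reward = self.delta[(self.state, p)]
        return reward, self.state in self.terminal

# "Pick up the key, then open the door" as a two-step machine.
rm = RewardMachine(
    transitions={("u0", "have_key"): ("u1", 0.0),
                 ("u1", "door_open"): ("u2", 1.0)},
    initial="u0", terminal={"u2"},
)
r, done = rm.step({"have_key"})   # -> 0.0, False (subtask progress, no reward yet)
r, done = rm.step({"door_open"})  # -> 1.0, True
```

Grounding then amounts to learning the labelling function (which propositions are true in a raw observation), and compositionality helps because the same learned propositions are reused across many machines.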
[1005] Cohort-attention Evaluation Metric against Tied Data: Studying Performance of Classification Models in Cancer Detection
Longfei Wei, Fang Sheng, Jianfei Zhang
Main category: cs.LG
TL;DR: The paper proposes CAT framework with new metrics (CATSen, CATSpe, CATMean) to address limitations of traditional classification metrics in medical AI screening, particularly for imbalanced data and diverse populations.
Details
Motivation: Traditional classification metrics fail to account for imbalanced data, varying performance across cohorts, and patient-level inconsistencies in medical AI screening, leading to biased evaluations.
Method: Proposes Cohort-Attention Evaluation Metrics (CAT) framework with patient-level assessment, entropy-based distribution weighting, and cohort-weighted sensitivity and specificity.
Result: Develops key metrics including CATSensitivity (CATSen), CATSpecificity (CATSpe), and CATMean that ensure balanced and fair evaluation across diverse populations.
Conclusion: The CAT framework enhances predictive reliability, fairness, and interpretability, providing a robust evaluation method for AI-driven medical screening models.
Abstract: Artificial intelligence (AI) has significantly improved medical screening accuracy, particularly in cancer detection and risk assessment. However, traditional classification metrics often fail to account for imbalanced data, varying performance across cohorts, and patient-level inconsistencies, leading to biased evaluations. We propose the Cohort-Attention Evaluation Metrics (CAT) framework to address these challenges. CAT introduces patient-level assessment, entropy-based distribution weighting, and cohort-weighted sensitivity and specificity. Key metrics like CATSensitivity (CATSen), CATSpecificity (CATSpe), and CATMean ensure balanced and fair evaluation across diverse populations. This approach enhances predictive reliability, fairness, and interpretability, providing a robust evaluation method for AI-driven medical screening models.
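The abstract does not give closed-form definitions of CATSen or CATSpe, so the sketch below only illustrates the general shape of a cohort-weighted, entropy-reweighted sensitivity; the specific weighting scheme is our own assumption for exposition:

```python
# Hedged sketch of cohort-weighted sensitivity in the spirit of CATSen;
# the paper's exact formulas are not reproduced here.
import numpy as np

def cohort_sensitivity(y_true, y_pred, cohorts):
    sens, sizes = [], []
    for c in np.unique(cohorts):
        m = cohorts == c
        pos = (y_true == 1) & m
        if pos.sum() == 0:
            continue                          # cohort has no positives to detect
        sens.append(((y_pred == 1) & pos).sum() / pos.sum())
        sizes.append(m.sum())
    p = np.array(sizes) / np.sum(sizes)
    # Entropy-style reweighting: down-weight dominant cohorts so a model
    # cannot score well by serving only the majority population.
    w = -p * np.log(p + 1e-12)
    w = w / w.sum()
    return float(np.dot(w, sens))

y_true  = np.array([1, 1, 0, 1, 1, 1, 0, 1])
y_pred  = np.array([1, 0, 0, 1, 1, 1, 0, 0])
cohorts = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(cohort_sensitivity(y_true, y_pred, cohorts))
```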
[1006] A Lightweight Gradient-based Causal Discovery Framework with Applications to Complex Industrial Processes
Meiliang Liu, Huiwen Dong, Xiaoxiao Yang, Yunfang Xu, Zijin Li, Zhengye Si, Xinyue Yang, Zhiwen Zhao
Main category: cs.LG
TL;DR: Proposes GRNGC, a neural Granger causality method that uses gradient regularization instead of first-layer sparsity penalties, requiring only one model for all time series and supporting various architectures like KAN, MLP, and LSTM.
Details
Motivation: Existing neural Granger causality models are computationally inefficient (requiring a separate model per time series), and their first-layer sparsity penalties weaken the capture of complex interactions.
Method: Uses gradient regularization with an L1 penalty on the gradient of the model's output with respect to its input to infer causality, requiring only one prediction model and supporting multiple neural architectures.
Result: Outperforms baselines on DREAM, Lorenz-96, fMRI BOLD, and CausalTime datasets with reduced computational overhead, and effectively reconstructs gene regulatory networks on real-world DNA datasets.
Conclusion: GRNGC provides a flexible, efficient, and effective approach for neural Granger causality inference that overcomes limitations of existing methods.
Abstract: With the advancement of deep learning technologies, various neural network-based Granger causality models have been proposed. Although these models have demonstrated notable improvements, several limitations remain. Most existing approaches adopt the component-wise architecture, necessitating the construction of a separate model for each time series, which results in substantial computational costs. In addition, imposing the sparsity-inducing penalty on the first-layer weights of the neural network to extract causal relationships weakens the model’s ability to capture complex interactions. To address these limitations, we propose Gradient Regularization-based Neural Granger Causality (GRNGC), which requires only one time series prediction model and applies $L_{1}$ regularization to the gradient of the model’s output with respect to its input to infer Granger causality. Moreover, GRNGC is not tied to a specific time series forecasting model and can be implemented with diverse architectures such as KAN, MLP, and LSTM, offering enhanced flexibility. Numerical simulations on DREAM, Lorenz-96, fMRI BOLD, and CausalTime show that GRNGC outperforms existing baselines and significantly reduces computational overhead. Meanwhile, experiments on real-world DNA, Yeast, HeLa, and bladder urothelial carcinoma datasets further validate the model’s effectiveness in reconstructing gene regulatory networks.
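A minimal sketch of the central idea, an $L_{1}$ penalty on the input-output gradient of a single shared forecaster, is given below. The architecture, data, penalty weight, and input layout are illustrative assumptions rather than the paper's configuration:

```python
# Hedged sketch of gradient-regularized Granger causality (GRNGC-style):
# one shared forecaster for all series; inputs that no output is sensitive
# to are driven to zero gradient and read off as non-causes.
import torch

d, lags, n = 5, 3, 256
X = torch.randn(n, d * lags)                  # lagged values of all d series
Y = torch.randn(n, d)                         # next-step values (toy targets)
model = torch.nn.Sequential(torch.nn.Linear(d * lags, 64),
                            torch.nn.Tanh(), torch.nn.Linear(64, d))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):
    x = X.clone().requires_grad_(True)
    pred = model(x)
    mse = ((pred - Y) ** 2).mean()
    # Input-output gradient of the summed prediction; its L1 norm is penalized
    # so that irrelevant inputs get zero sensitivity.
    g, = torch.autograd.grad(pred.sum(), x, create_graph=True)
    loss = mse + 1e-2 * g.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Per-input-series relevance (assuming lag-major layout), aggregated over lags;
# a full d x d causal matrix would use per-output gradients instead of pred.sum().
sens = g.abs().detach().mean(0).reshape(lags, d).sum(0)
```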
[1007] Federated Structured Sparse PCA for Anomaly Detection in IoT Networks
Chenyi Huang, Xinrong Li, Xianchao Xiu
Main category: cs.LG
TL;DR: Proposes FedSSP, a federated structured sparse PCA method for IoT anomaly detection that integrates double sparsity regularization to eliminate redundant features and suppress noise.
Details
Motivation: Current federated PCA methods lack sparsity integration, which is critical for robust anomaly detection in IoT environments.
Method: Uses double sparsity regularization (row-wise sparsity via ℓ₂,p-norm and element-wise sparsity via ℓq-norm) with proximal alternating minimization algorithm for distributed optimization.
Result: Experiments on real datasets show enhanced model interpretability and detection accuracy through structured sparsity.
Conclusion: FedSSP effectively addresses the sparsity limitation in federated PCA for IoT anomaly detection with proven convergence guarantees.
Abstract: Although federated learning has gained prominence as a privacy-preserving framework tailored for distributed Internet of Things (IoT) environments, current federated principal component analysis (PCA) methods lack integration of sparsity, a critical feature for robust anomaly detection. To address this limitation, we propose a novel federated structured sparse PCA (FedSSP) approach for anomaly detection in IoT networks. The proposed model uniquely integrates double sparsity regularization: (1) row-wise sparsity governed by $\ell_{2,p}$-norm with $p\in[0,1)$ to eliminate redundant feature dimensions, and (2) element-wise sparsity via $\ell_{q}$-norm with $q\in[0,1)$ to suppress noise-sensitive components. To efficiently solve this non-convex optimization problem in a distributed setting, we devise a proximal alternating minimization (PAM) algorithm with rigorous theoretical proofs establishing its convergence guarantees. Experiments on real datasets validate that incorporating structured sparsity enhances both model interpretability and detection accuracy.
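The two regularizers are easy to state in code. The sketch below evaluates the double sparsity penalty for illustrative values of $p$, $q$, and the trade-off weights; the PAM solver and the federated protocol are not reproduced:

```python
# Hedged sketch of FedSSP's double sparsity regularizer on a loading matrix W:
# a row-wise l_{2,p} term that can zero out whole feature rows, plus an
# element-wise l_q term that suppresses individual noisy entries (p, q in [0,1)).
import numpy as np

def double_sparsity(W, p=0.5, q=0.5, lam1=1.0, lam2=1.0):
    row_norms = np.linalg.norm(W, axis=1)         # l2 norm of each feature row
    l2p = np.sum(row_norms ** p)                  # row-wise l_{2,p}^p penalty
    lq = np.sum(np.abs(W) ** q)                   # element-wise l_q^q penalty
    return lam1 * l2p + lam2 * lq

W = np.array([[0.9, 0.0, 0.1],
              [0.0, 0.0, 0.0],                    # a pruned (redundant) feature row
              [0.2, 0.7, 0.0]])
print(double_sparsity(W))
```

With exponents below one, both terms are non-convex, which is why the authors devise a PAM algorithm with explicit convergence guarantees rather than relying on off-the-shelf convex solvers.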
[1008] PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors
Yimeng Chen, Piotr Piȩkos, Mateusz Ostaszewski, Firas Laakom, Jürgen Schmidhuber
Main category: cs.LG
TL;DR: PhysGym is a new benchmark suite for evaluating LLM-based scientific reasoning in interactive physics environments, with sophisticated control over prior knowledge levels to assess agent performance across varying problem complexity.
Details
Motivation: There is a lack of specialized benchmarks for evaluating how LLM-based agents handle scientific discovery, particularly how they cope with environmental complexity and utilize prior knowledge.
Method: Developed PhysGym, a simulation platform with interactive physics environments where agents probe environments, gather data sequentially under constraints, and formulate hypotheses about physical laws, with controlled prior knowledge levels.
Result: The benchmark successfully differentiates LLM capabilities based on varying priors and task complexity, demonstrating its utility for rigorous assessment of scientific reasoning.
Conclusion: PhysGym provides a standardized framework for evaluating LLM-based scientific discovery capabilities, addressing a critical gap in current benchmarking landscape.
Abstract: Evaluating the scientific discovery capabilities of large language model based agents, particularly how they cope with varying environmental complexity and utilize prior knowledge, requires specialized benchmarks currently lacking in the landscape. To address this gap, we introduce PhysGym, a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning in interactive physics environments. PhysGym’s primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent. This allows researchers to dissect agent performance along axes including the complexity of the problem and the prior knowledge levels. The benchmark comprises a suite of interactive simulations, where agents must actively probe environments, gather data sequentially under constraints and formulate hypotheses about underlying physical laws. PhysGym provides standardized evaluation protocols and metrics for assessing hypothesis accuracy and model fidelity. We demonstrate the benchmark’s utility by presenting results from baseline LLMs, showcasing its ability to differentiate capabilities based on varying priors and task complexity.
[1009] TianQuan-S2S: A Subseasonal-to-Seasonal Global Weather Model via Incorporate Climatology State
Guowen Li, Xintong Liu, Yang Liu, Mengxuan Chen, Shilei Cao, Xuehe Wang, Juepeng Zheng, Jinxiao Zhang, Haoyuan Liang, Lixian Zhang, Jiuke Wang, Meng Jin, Hong Cheng, Haohuan Fu
Main category: cs.LG
TL;DR: TianQuan-S2S is a global subseasonal-to-seasonal forecasting model that integrates initial weather states with climatological means using an uncertainty-augmented Transformer, achieving superior performance over existing methods.
Details
Motivation: S2S forecasting is crucial for decision-making but remains challenging due to weather system chaos. Current data-driven models suffer from inadequate climate state incorporation and degradation issues, losing fine-scale details and producing over-smoothed forecasts.
Method: Proposes TianQuan-S2S model that integrates initial weather states with climatological means via climatology-based patch embedding and enhances variability capture through an uncertainty-augmented Transformer architecture.
Result: Extensive experiments on ERA5 dataset show significant improvements in both deterministic and ensemble forecasting over climatology mean, traditional numerical methods, and existing data-driven models. Outperforms ECMWF-S2S and Fuxi-S2S in key meteorological variables.
Conclusion: The proposed model effectively addresses limitations of current S2S forecasting approaches through innovative integration of climate states and uncertainty-aware Transformer design, demonstrating superior forecasting capabilities.
Abstract: Accurate Subseasonal-to-Seasonal (S2S) forecasting is vital for decision-making in agriculture, energy production, and emergency management. However, it remains a challenging and underexplored problem due to the chaotic nature of the weather system. Recent data-driven studies have shown promising results, but their performance is limited by the inadequate incorporation of climate states and a model tendency to degrade, progressively losing fine-scale details and yielding over-smoothed forecasts. To overcome these limitations, we propose TianQuan-S2S, a global S2S forecasting model that integrates initial weather states with climatological means via incorporating climatology into patch embedding and enhancing variability capture through an uncertainty-augmented Transformer. Extensive experiments on the ERA5 reanalysis dataset demonstrate that our model yields a significant improvement in both deterministic and ensemble forecasting over the climatology mean, traditional numerical methods, and data-driven models. Ablation studies empirically show the effectiveness of our model designs. Remarkably, our model outperforms skillful numerical ECMWF-S2S and advanced data-driven Fuxi-S2S in key meteorological variables.
[1010] DmC: Nearest Neighbor Guidance Diffusion Model for Offline Cross-domain Reinforcement Learning
Linh Le Pham Van, Minh Hoang Nguyen, Duc Kieu, Hung Le, Hung The Tran, Sunil Gupta
Main category: cs.LG
TL;DR: DmC is a novel framework for cross-domain offline RL that addresses limited target data by using k-NN based domain proximity estimation and nearest-neighbor-guided diffusion to generate better-aligned source samples, significantly outperforming existing methods.
Details
Motivation: Existing cross-domain offline RL methods require large target datasets, which is impractical in real-world scenarios. The paper addresses challenges of dataset imbalance and partial domain overlap when target data is limited.
Method: Proposes DmC framework that uses k-NN based estimation to measure domain proximity without neural network training, then employs nearest-neighbor-guided diffusion model to generate additional source samples better aligned with the target domain.
Result: Extensive experiments in MuJoCo environments show DmC significantly outperforms state-of-the-art cross-domain offline RL methods with substantial performance gains.
Conclusion: DmC effectively addresses cross-domain offline RL with limited target samples by mitigating overfitting through k-NN estimation and enhancing policy learning with better-aligned generated source samples.
Abstract: Cross-domain offline reinforcement learning (RL) seeks to enhance sample efficiency in offline RL by utilizing additional offline source datasets. A key challenge is to identify and utilize source samples that are most relevant to the target domain. Existing approaches address this challenge by measuring domain gaps through domain classifiers, target transition dynamics modeling, or mutual information estimation using contrastive loss. However, these methods often require large target datasets, which is impractical in many real-world scenarios. In this work, we address cross-domain offline RL under a limited target data setting, identifying two primary challenges: (1) Dataset imbalance, which is caused by large source and small target datasets and leads to overfitting in neural network-based domain gap estimators, resulting in uninformative measurements; and (2) Partial domain overlap, where only a subset of the source data is closely aligned with the target domain. To overcome these issues, we propose DmC, a novel framework for cross-domain offline RL with limited target samples. Specifically, DmC utilizes $k$-nearest neighbor ($k$-NN) based estimation to measure domain proximity without neural network training, effectively mitigating overfitting. Then, by utilizing this domain proximity, we introduce a nearest-neighbor-guided diffusion model to generate additional source samples that are better aligned with the target domain, thus enhancing policy learning with more effective source samples. Through theoretical analysis and extensive experiments in diverse MuJoCo environments, we demonstrate that DmC significantly outperforms state-of-the-art cross-domain offline RL methods, achieving substantial performance gains.
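The k-NN proximity idea can be sketched in a few lines; the feature construction, the value of k, and the retention threshold below are illustrative assumptions rather than the paper's settings:

```python
# Hedged sketch of k-NN-based domain proximity in the spirit of DmC: score each
# source transition by its distance to the k-th nearest target transition,
# avoiding a trainable domain classifier that could overfit the small target set.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(50, 8))      # few target (s, a, s') features
source = rng.normal(0.5, 1.2, size=(5000, 8))    # abundant source features

k = 5
nn = NearestNeighbors(n_neighbors=k).fit(target)
dist, _ = nn.kneighbors(source)                   # distances to k nearest targets
proximity = -dist[:, -1]                          # closer to target => higher score

# Keep only source samples lying in the overlap with the target domain; DmC
# additionally uses such scores to guide a diffusion model's sample generation.
keep = proximity >= np.quantile(proximity, 0.8)
print(keep.sum(), "of", len(source), "source samples retained")
```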
[1011] Effortless, Simulation-Efficient Bayesian Inference using Tabular Foundation Models
Julius Vetter, Manuel Gloeckler, Daniel Gedon, Jakob H. Macke
Main category: cs.LG
TL;DR: NPE-PFN uses pre-trained tabular foundation models (TabPFN) as autoregressive conditional density estimators for simulation-based inference, achieving competitive accuracy with superior simulation efficiency and eliminating the need for network training and hyperparameter tuning.
Details
Motivation: To improve simulation efficiency in SBI by reducing the number of expensive simulations needed for accurate Bayesian inference, especially for complex scientific inverse problems.
Method: Repurposes TabPFN as a pre-trained autoregressive conditional density estimator for Neural Posterior Estimation, creating NPE-PFN that requires no network training or hyperparameter tuning.
Result: Competitive accuracy with current SBI approaches on benchmark tasks and complex scientific problems, often requiring orders of magnitude fewer simulations while showing superior robustness to model misspecification.
Conclusion: NPE-PFN provides a training-free, general-purpose inference solution that offers efficient, easy-to-use, and flexible SBI for a wide range of stochastic inverse problems.
Abstract: Simulation-based inference (SBI) offers a flexible and general approach to performing Bayesian inference: In SBI, a neural network is trained on synthetic data simulated from a model and used to rapidly infer posterior distributions for observed data. A key goal for SBI is to achieve accurate inference with as few simulations as possible, especially for expensive simulators. In this work, we address this challenge by repurposing recent probabilistic foundation models for tabular data: We show how tabular foundation models – specifically TabPFN – can be used as pre-trained autoregressive conditional density estimators for SBI. We propose Neural Posterior Estimation with Prior-data Fitted Networks (NPE-PFN) and show that it is competitive with current SBI approaches in terms of accuracy for both benchmark tasks and two complex scientific inverse problems. Crucially, it often substantially outperforms them in terms of simulation efficiency, sometimes requiring orders of magnitude fewer simulations. NPE-PFN eliminates the need for inference network selection, training, and hyperparameter tuning. We also show that it exhibits superior robustness to model misspecification and can be scaled to simulation budgets that exceed the context size limit of TabPFN. NPE-PFN provides a new direction for SBI, where training-free, general-purpose inference models offer efficient, easy-to-use, and flexible solutions for a wide range of stochastic inverse problems.
[1012] A critical assessment of reinforcement learning methods for microswimmer navigation in complex flows
Selim Mecanna, Aurore Loisy, Christophe Eloy
Main category: cs.LG
TL;DR: This paper evaluates reinforcement learning methods for autonomous navigation in fluid flows, finding that commonly used algorithms (Q-Learning, A2C) perform poorly while PPO achieves near-optimal performance when properly implemented.
Details
Motivation: To quantitatively assess RL methods for navigation in partially observable flows, as commonly used simple implementations may not discover optimal strategies despite widespread adoption in fluid mechanics.
Method: Introduced a well-posed directional navigation problem with known quasi-optimal policy, then tested Q-Learning, Advantage Actor Critic, and PPO algorithms on various flow types (Taylor-Green vortices, ABC flow, 2D turbulence) using vectorized environments, generalized advantage estimation, and hyperparameter optimization.
Result: Commonly used algorithms (Q-Learning, A2C) showed poor performance and limited robustness, while PPO matched theoretical quasi-optimal performance in turbulent flow when properly implemented with advanced techniques.
Conclusion: Algorithm selection, implementation details, and fine-tuning are crucial for discovering truly smart autonomous navigation strategies in complex flows, with PPO demonstrating superior performance over simpler RL methods.
Abstract: Navigating in a fluid flow while being carried by it, using only information accessible from on-board sensors, is a problem commonly faced by small planktonic organisms. It is also directly relevant to autonomous robots deployed in the oceans. In the last ten years, the fluid mechanics community has widely adopted reinforcement learning, often in the form of its simplest implementations, to address this challenge. But it is unclear how good the strategies learned by these algorithms are. In this paper, we perform a quantitative assessment of reinforcement learning methods applied to navigation in partially observable flows. We first introduce a well-posed problem of directional navigation for which a quasi-optimal policy is known analytically. We then report on the poor performance and limited robustness of commonly used algorithms (Q-Learning, Advantage Actor Critic) in flows regularly encountered in the literature: Taylor-Green vortices, Arnold-Beltrami-Childress flow, and two-dimensional turbulence. We show that they are vastly surpassed by PPO (Proximal Policy Optimization), a more advanced algorithm that has established dominance across a wide range of benchmarks in the reinforcement learning community. In particular, our custom implementation of PPO matches the theoretical quasi-optimal performance in turbulent flow and does so in a robust manner. Reaching this result required the use of several additional techniques, such as vectorized environments and generalized advantage estimation, as well as hyperparameter optimization. This study demonstrates the importance of algorithm selection, implementation details, and fine-tuning for discovering truly smart autonomous navigation strategies in complex flows.
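As one example of the additional techniques the authors cite, here is a standard implementation sketch of generalized advantage estimation (GAE); the toy inputs are illustrative:

```python
# Hedged sketch of generalized advantage estimation (GAE), one of the standard
# PPO ingredients named in the abstract.
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """values has one extra entry: the value estimate of the final state."""
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # exponential mix
        adv[t] = running
    return adv

rewards = np.array([0.0, 0.0, 1.0])
values = np.array([0.1, 0.2, 0.5, 0.0])
print(gae(rewards, values))
```

The lambda parameter trades off bias (small lambda, near one-step TD) against variance (lambda near 1, near Monte Carlo returns), which matters in noisy flow environments.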
[1013] Analog Foundation Models
Julian Büchel, Iason Chalas, Giovanni Acampa, An Chen, Omobayode Fagbohungbe, Sidney Tsai, Kaoutar El Maghraoui, Manuel Le Gallo, Abbas Rahimi, Abu Sebastian
Main category: cs.LG
TL;DR: A general method to adapt LLMs for noisy, low-precision analog hardware, enabling models like Phi-3-mini and Llama-3.2-1B to maintain 4-bit performance despite analog noise and quantization constraints.
Details
Motivation: Analog in-memory computing (AIMC) offers speed and power efficiency benefits but introduces noisy computations and strict quantization constraints that prevent off-the-shelf LLMs from achieving 4-bit-level performance on AIMC hardware.
Method: A general and scalable training methodology to robustly adapt LLMs for execution on noisy, low-precision analog hardware, which also enables quantization for low-precision digital inference.
Result: State-of-the-art models retain performance comparable to 4-bit weight, 8-bit activation baselines despite analog noise and constraints, and show better test-time compute scaling than static quantization models.
Conclusion: This work bridges the gap between high-capacity LLMs and efficient analog hardware, offering a path toward energy-efficient foundation models.
Abstract: Analog in-memory computing (AIMC) is a promising compute paradigm to improve speed and power efficiency of neural network inference beyond the limits of conventional von Neumann-based architectures. However, AIMC introduces fundamental challenges such as noisy computations and strict constraints on input and output quantization. Because of these constraints and imprecisions, off-the-shelf LLMs are not able to achieve 4-bit-level performance when deployed on AIMC-based hardware. While researchers previously investigated recovering this accuracy gap on small, mostly vision-based models, a generic method applicable to LLMs pre-trained on trillions of tokens does not yet exist. In this work, we introduce a general and scalable method to robustly adapt LLMs for execution on noisy, low-precision analog hardware. Our approach enables state-of-the-art models, including Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct, to retain performance comparable to 4-bit weight, 8-bit activation baselines, despite the presence of analog noise and quantization constraints. Additionally, we show that as a byproduct of our training methodology, analog foundation models can be quantized for inference on low-precision digital hardware. Finally, we show that our models also benefit from test-time compute scaling, showing better scaling behavior than models trained with 4-bit weight and 8-bit static input quantization. Our work bridges the gap between high-capacity LLMs and efficient analog hardware, offering a path toward energy-efficient foundation models. Code is available at https://github.com/IBM/analog-foundation-models.
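A minimal sketch of what hardware-aware training of this kind can look like follows. The layer class, the Gaussian noise model, its scale, and the bit widths are our own illustrative assumptions, not the paper's hardware characterization:

```python
# Hedged sketch of noise- and quantization-aware training for analog hardware:
# inject weight noise and quantize activations in the forward pass so the
# trained model becomes robust to both at deployment time.
import torch

def fake_quant(x, bits=8):
    """Uniform fake quantization with a straight-through estimator."""
    scale = x.abs().max().clamp(min=1e-8) / (2 ** (bits - 1) - 1)
    xq = torch.round(x / scale) * scale
    return x + (xq - x).detach()      # forward: quantized; backward: identity

class NoisyAnalogLinear(torch.nn.Linear):   # hypothetical layer, for illustration
    def forward(self, x):
        x = fake_quant(x, bits=8)                       # 8-bit input quantization
        w = self.weight
        if self.training:
            w = w + torch.randn_like(w) * 0.02 * w.abs().max()  # analog weight noise
        out = torch.nn.functional.linear(x, w, self.bias)
        return fake_quant(out, bits=8)                  # 8-bit output quantization

layer = NoisyAnalogLinear(16, 4)
y = layer(torch.randn(2, 16))        # noise is resampled every forward pass
```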
[1014] Score-informed Neural Operator for Enhancing Ordering-based Causal Discovery
Jiyeon Kang, Songseong Kim, Chanhui Lee, Doyeong Hwang, Joanie Hayoun Chung, Yunkyung Ko, Sumin Lee, Sungwoong Kim, Sungbin Lim
Main category: cs.LG
TL;DR: The paper proposes Score-informed Neural Operator (SciNO), a probabilistic generative model that improves Hessian diagonal approximation for causal ordering algorithms, reducing order divergence by 42.7% on synthetic and 31.5% on real-world datasets compared to DiffAN.
Details
Motivation: Existing methods for causal ordering based on score matching require accurate Hessian diagonal estimation, but current approaches using Stein gradient estimators are computationally expensive and memory-intensive, while diffusion-model-based methods remain unstable due to second-order derivatives.
Method: Proposes Score-informed Neural Operator (SciNO), a probabilistic generative model in smooth function spaces designed to stably approximate the Hessian diagonal and preserve structural information during score modeling. Also introduces a probabilistic control algorithm for causal reasoning that integrates SciNO’s probability estimates with autoregressive model priors.
Result: SciNO reduces order divergence by 42.7% on synthetic graphs and 31.5% on real-world datasets on average compared to DiffAN, while maintaining memory efficiency and scalability. The method enhances causal reasoning abilities of LLMs without additional fine-tuning or prompt engineering.
Conclusion: SciNO provides a stable and efficient approach for Hessian diagonal approximation in causal ordering algorithms, significantly improving performance while being memory-efficient and scalable, and enables enhanced causal reasoning in LLMs without requiring model modifications.
Abstract: Ordering-based approaches to causal discovery identify topological orders of causal graphs, providing scalable alternatives to combinatorial search methods. Under the Additive Noise Model (ANM) assumption, recent causal ordering methods based on score matching require an accurate estimation of the Hessian diagonal of the log-densities. In this paper, we aim to improve the approximation of the Hessian diagonal of the log-densities, thereby enhancing the performance of ordering-based causal discovery algorithms. Existing approaches that rely on Stein gradient estimators are computationally expensive and memory-intensive, while diffusion-model-based methods remain unstable due to the second-order derivatives of score models. To alleviate these problems, we propose Score-informed Neural Operator (SciNO), a probabilistic generative model in smooth function spaces designed to stably approximate the Hessian diagonal and to preserve structural information during the score modeling. Empirical results show that SciNO reduces order divergence by 42.7% on synthetic graphs and by 31.5% on real-world datasets on average compared to DiffAN, while maintaining memory efficiency and scalability. Furthermore, we propose a probabilistic control algorithm for causal reasoning with autoregressive models that integrates SciNO’s probability estimates with autoregressive model priors, enabling reliable data-driven causal ordering informed by semantic information. Consequently, the proposed method enhances causal reasoning abilities of LLMs without additional fine-tuning or prompt engineering.
[1015] Koopman Eigenfunction-Based Identification and Optimal Nonlinear Control of Turbojet Engine
David Grasev
Main category: cs.LG
TL;DR: The paper presents a data-driven approach for modeling and controlling gas turbine engines using Koopman operator theory, overcoming limitations of traditional physics-based models.
Details
Motivation: Gas turbine engines are complex nonlinear systems where conventional physics-based modeling is challenging due to unavailable performance characteristics and simplifying assumptions. Data-driven approaches are needed.
Method: Uses sparse identification of nonlinear dynamics for rotor estimation, maps dynamics to Koopman eigenfunction space via eigenvalue optimization and gradient-based identification, then designs nonlinear feedback controller and Kalman estimator in this space.
Result: The Koopman-based controller outperforms traditional and gain-scheduled PI controllers, as well as internal model control, in both reference tracking and disturbance rejection across various flight conditions.
Conclusion: The Koopman eigenfunction approach enables global nonlinear control with improved performance tuning through targeted mode optimization, making it superior to conventional control methods for gas turbine engines.
Abstract: Gas turbine engines are complex and highly nonlinear dynamical systems. Deriving their physics-based models can be challenging because it requires performance characteristics that are not always available, often leading to many simplifying assumptions. This paper discusses the limitations of conventional experimental methods used to derive component-level and locally linear parameter-varying models, and addresses these issues by employing identification techniques based on data collected from standard engine operation under closed-loop control. The rotor dynamics are estimated using the sparse identification of nonlinear dynamics. Subsequently, the autonomous part of the dynamics is mapped into an optimally constructed Koopman eigenfunction space. This process involves eigenvalue optimization using metaheuristic algorithms and temporal projection, followed by gradient-based eigenfunction identification. The resulting Koopman model is validated against an in-house reference component-level model. A globally optimal nonlinear feedback controller and a Kalman estimator are then designed within the eigenfunction space and compared to traditional and gain-scheduled proportional-integral controllers, as well as a proposed internal model control approach. The eigenmode structure enables targeting individual modes during optimization, leading to improved performance tuning. Results demonstrate that the Koopman-based controller surpasses other benchmark controllers in both reference tracking and disturbance rejection under sea-level and varying flight conditions, due to its global nature.
[1016] Approximation and Generalization Abilities of Score-based Neural Network Generative Models for Sub-Gaussian Distributions
Guoji Fu, Wee Sun Lee
Main category: cs.LG
TL;DR: This paper analyzes the approximation and generalization capabilities of score-based neural network generative models (SGMs) for distribution estimation, proving they can achieve nearly optimal convergence rates under mild sub-Gaussian assumptions without requiring Lipschitz continuity or strictly positive density bounds.
Details
Motivation: To establish rigorous theoretical guarantees for score-based generative models under minimal assumptions, addressing limitations of previous work that required stronger conditions like Lipschitz continuity of score functions or strictly positive density lower bounds.
Method: Theoretical analysis using deep ReLU neural networks with specific width and depth bounds to approximate score functions, with an early stopping strategy and mean square error analysis under α-sub-Gaussian distribution assumptions.
Result: Proved that SGMs can achieve nearly optimal convergence rate of Õ(n⁻¹t₀⁻ᵈ/²) for score estimation, and when target density lies in Sobolev/Besov classes with early stopping, they attain nearly minimax convergence rates up to logarithmic factors.
Conclusion: The framework provides universal convergence guarantees for SGMs under milder assumptions than previous work, removing crucial requirements like Lipschitz continuity and strictly positive density bounds, while maintaining nearly optimal performance.
Abstract: This paper studies the approximation and generalization abilities of score-based neural network generative models (SGMs) in estimating an unknown distribution $P_0$ from $n$ i.i.d. observations in $d$ dimensions. Assuming merely that $P_0$ is $\alpha$-sub-Gaussian, we prove that for any time step $t \in [t_0, n^{\mathcal{O}(1)}]$, where $t_0 > \mathcal{O}(\alpha^2n^{-2/d}\log n)$, there exists a deep ReLU neural network with width $\leq \mathcal{O}(n^{\frac{3}{d}}\log_2n)$ and depth $\leq \mathcal{O}(\log^2n)$ that can approximate the scores with $\tilde{\mathcal{O}}(n^{-1})$ mean square error and achieve a nearly optimal rate of $\tilde{\mathcal{O}}(n^{-1}t_0^{-d/2})$ for score estimation, as measured by the score matching loss. Our framework is universal and can be used to establish convergence rates for SGMs under milder assumptions than previous work. For example, assuming further that the target density function $p_0$ lies in Sobolev or Besov classes, with an appropriately early stopping strategy, we demonstrate that neural network-based SGMs can attain nearly minimax convergence rates up to logarithmic factors. Our analysis removes several crucial assumptions, such as Lipschitz continuity of the score function or a strictly positive lower bound on the target density.
[1017] Modeling Cell Dynamics and Interactions with Unbalanced Mean Field Schrödinger Bridge
Zhenyi Zhang, Zihan Wang, Yuhao Sun, Tiejun Li, Peijie Zhou
Main category: cs.LG
TL;DR: CytoBridge is a deep learning method that models unbalanced stochastic interaction dynamics from sparse snapshot data, addressing limitations of existing methods by incorporating cell-cell interactions.
Details
Motivation: Existing methods for modeling cellular dynamics from snapshot data cannot adequately account for cell-cell interactions, which are essential for understanding real-world cellular processes and state transitions.
Method: Proposed Unbalanced Mean-Field Schrödinger Bridge (UMFSB) framework and CytoBridge algorithm that uses neural networks to model cellular transitions, proliferation, and interactions directly from data.
Result: CytoBridge outperforms existing methods by accurately identifying growth, transition, and interaction patterns, eliminating false transitions, and reconstructing developmental landscapes more accurately on both synthetic and real scRNA-seq data.
Conclusion: CytoBridge provides an effective framework for modeling unbalanced stochastic interaction dynamics from sparse time-resolved snapshot data, enabling better understanding of complex cellular processes.
Abstract: Modeling the dynamics from sparsely time-resolved snapshot data is crucial for understanding complex cellular processes and behavior. Existing methods leverage optimal transport, Schrödinger bridge theory, or their variants to simultaneously infer stochastic, unbalanced dynamics from snapshot data. However, these approaches remain limited in their ability to account for cell-cell interactions. This integration is essential in real-world scenarios since intercellular communications are fundamental life processes and can influence cell state-transition dynamics. To address this challenge, we formulate the Unbalanced Mean-Field Schrödinger Bridge (UMFSB) framework to model unbalanced stochastic interaction dynamics from snapshot data. Inspired by this framework, we further propose CytoBridge, a deep learning algorithm designed to approximate the UMFSB problem. By explicitly modeling cellular transitions, proliferation, and interactions through neural networks, CytoBridge offers the flexibility to learn these processes directly from data. The effectiveness of our method has been extensively validated using both synthetic gene regulatory data and real scRNA-seq datasets. Compared to existing methods, CytoBridge identifies growth, transition, and interaction patterns, eliminates false transitions, and reconstructs the developmental landscape with greater accuracy. Code is available at: https://github.com/zhenyiizhang/CytoBridge-NeurIPS.
[1018] Deriving Transformer Architectures as Implicit Multinomial Regression
Jonas A. Actor, Anthony Gruber, Eric C. Cyr
Main category: cs.LG
TL;DR: The paper establishes a mathematical connection between attention mechanisms and multinomial regression, showing that attention dynamics optimize features for classification.
Details
Motivation: Attention mechanisms empirically improve model performance but lack rigorous mathematical justification.
Method: Show that in a fixed multinomial regression setting, optimizing over latent features yields solutions aligned with attention block dynamics.
Result: Attention-induced feature evolution in transformers can be interpreted as trajectories recovering optimal features for classification.
Conclusion: Attention mechanisms mathematically optimize features for classification, providing theoretical justification for their empirical success.
Abstract: While attention has been empirically shown to improve model performance, it lacks a rigorous mathematical justification. This short paper establishes a novel connection between attention mechanisms and multinomial regression. Specifically, we show that in a fixed multinomial regression setting, optimizing over latent features yields solutions that align with the dynamics induced on features by attention blocks. In other words, the evolution of representations through a transformer can be interpreted as a trajectory that recovers the optimal features for classification.
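The flavor of this connection can be seen in a toy computation (our own illustration, not the paper's derivation): with the regression weights held fixed, gradient descent on the latent features involves a softmax-weighted linear map, the same structural ingredient an attention block applies:

```python
# Hedged toy: fix a multinomial regression head W and do gradient descent on
# the latent features X themselves. The update involves softmax(X @ W), a
# softmax-weighted linear map of the kind attention applies to features.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, c = 8, 4, 3
X = rng.normal(size=(n, d))          # latent features, treated as the variables
W = rng.normal(size=(d, c))          # fixed classifier weights
Y = np.eye(c)[rng.integers(0, c, n)] # one-hot labels

for _ in range(100):
    P = softmax(X @ W)               # class probabilities
    X -= 0.1 * (P - Y) @ W.T         # dX = -(softmax(XW) - Y) W^T: the features
                                     # flow toward classification-optimal values
print(np.mean(softmax(X @ W).argmax(1) == Y.argmax(1)))  # accuracy climbs to 1.0
```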
[1019] Minimizing False-Positive Attributions in Explanations of Non-Linear Models
Anders Gjølbye, Stefan Haufe, Lars Kai Hansen
Main category: cs.LG
TL;DR: PatternLocal is a novel XAI method that transforms discriminative model weights into generative representations to suppress suppressor variables and reduce false-positive feature attributions in non-linear models.
Details
Motivation: Suppressor variables can cause false-positive feature attributions in XAI methods, undermining explanation utility. While remedies exist for linear models, extension to non-linear models and instance-based explanations has been limited.
Method: PatternLocal starts with a locally linear surrogate (e.g., LIME, KernelSHAP, or gradient-based methods) and transforms the resulting discriminative model weights into a generative representation to suppress suppressor variables while preserving local fidelity.
Result: In extensive hyperparameter optimization on the XAI-TRIS benchmark, PatternLocal consistently outperformed other XAI methods and reduced false-positive attributions when explaining non-linear tasks. Evaluation on EEG motor imagery dataset demonstrated physiologically plausible explanations.
Conclusion: PatternLocal enables more reliable and actionable insights by effectively suppressing suppressor variables in non-linear model explanations while maintaining local fidelity.
Abstract: Suppressor variables can influence model predictions without being dependent on the target outcome, and they pose a significant challenge for Explainable AI (XAI) methods. These variables may cause false-positive feature attributions, undermining the utility of explanations. Although effective remedies exist for linear models, their extension to non-linear models and instance-based explanations has remained limited. We introduce PatternLocal, a novel XAI technique that addresses this gap. PatternLocal begins with a locally linear surrogate, e.g., LIME, KernelSHAP, or gradient-based methods, and transforms the resulting discriminative model weights into a generative representation, thereby suppressing the influence of suppressor variables while preserving local fidelity. In extensive hyperparameter optimization on the XAI-TRIS benchmark, PatternLocal consistently outperformed other XAI methods and reduced false-positive attributions when explaining non-linear tasks, thereby enabling more reliable and actionable insights. We further evaluate PatternLocal on an EEG motor imagery dataset, demonstrating physiologically plausible explanations.
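The filter-to-pattern step at the heart of this approach can be sketched with the classic suppressor example, here using the Haufe et al. pattern rule a = Cov(X) w as the assumed transformation; the data construction and ridge surrogate are illustrative:

```python
# Hedged sketch of turning a discriminative "filter" into a generative "pattern"
# on a local linear surrogate, so that a pure-noise suppressor loses its weight.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
noise = rng.normal(size=n)
x1 = signal + noise                  # informative channel, contaminated by noise
x2 = noise                           # suppressor: carries no target information
X = np.column_stack([x1, x2])
y = signal                           # local surrogate target around an instance

w = Ridge(alpha=1e-3).fit(X, y).coef_
a = np.cov(X, rowvar=False) @ w      # generative pattern in place of the filter

print("filter  w:", w.round(2))      # large non-zero weight on the suppressor x2
print("pattern a:", a.round(2))      # suppressor attribution collapses toward 0
```

The filter legitimately needs x2 to cancel the noise (w is roughly (1, -1)), but as an attribution it is a false positive; the pattern (roughly (1, 0)) assigns the suppressor no relevance.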
[1020] Variational Regularized Unbalanced Optimal Transport: Single Network, Least Action
Yuhao Sun, Zhenyi Zhang, Zihan Wang, Tiejun Li, Peijie Zhou
Main category: cs.LG
TL;DR: Var-RUOT is a new framework that solves Regularized Unbalanced Optimal Transport problems by incorporating optimality conditions into parameterization and loss design, enabling more stable convergence and lower-action solutions for recovering dynamics from high-dimensional system snapshots.
Details
Motivation: Existing methods for recovering dynamics from high-dimensional system snapshots often fail to enforce optimality conditions, leading to solutions that violate the principle of least action and suffer from convergence instability.
Method: Var-RUOT incorporates optimal necessary conditions for RUOT problems into both search space parameterization and loss function design, requiring only learning a scalar field to solve RUOT problems and enabling solutions with lower action.
Result: Var-RUOT demonstrates faster convergence, improved training stability, and finds solutions with lower action compared to existing algorithms on both simulated data and real single-cell datasets.
Conclusion: Var-RUOT provides an effective framework for solving RUOT problems with better alignment to biological priors, improved convergence stability, and lower-action solutions for dynamics recovery in high-dimensional systems.
Abstract: Recovering the dynamics from a few snapshots of a high-dimensional system is a challenging task in statistical physics and machine learning, with important applications in computational biology. Many algorithms have been developed to tackle this problem, based on frameworks such as optimal transport and the Schrödinger bridge. A notable recent framework is Regularized Unbalanced Optimal Transport (RUOT), which integrates both stochastic dynamics and unnormalized distributions. However, since many existing methods do not explicitly enforce optimality conditions, their solutions often struggle to satisfy the principle of least action and to converge in a stable and reliable way. To address these issues, we propose Variational RUOT (Var-RUOT), a new framework to solve the RUOT problem. By incorporating the optimal necessary conditions for the RUOT problem into both the parameterization of the search space and the loss function design, Var-RUOT only needs to learn a scalar field to solve the RUOT problem and can search for solutions with lower action. We also examined the challenge of selecting a growth penalty function in the widely used Wasserstein-Fisher-Rao metric and proposed a solution that better aligns with biological priors in Var-RUOT. We validated the effectiveness of Var-RUOT on both simulated data and real single-cell datasets. Compared with existing algorithms, Var-RUOT can find solutions with lower action while exhibiting faster convergence and improved training stability. Our code is available at https://github.com/ZerooVector/VarRUOT.
[1021] OmniFC: Rethinking Federated Clustering via Lossless and Secure Distance Reconstruction
Jie Yan, Jing Liu, Zhong-Yuan Zhang
Main category: cs.LG
TL;DR: OmniFC is a model-agnostic federated clustering framework that uses Lagrange coded computing to enable privacy-preserving construction of global distance matrices from encoded client data, overcoming Non-IID data challenges without sharing raw data.
Details
Motivation: Existing federated clustering methods suffer from privacy leakage during collaboration and robustness degradation due to Non-IID data distributions, as they rely on model-specific proxies that inherit biases from centralized methods.
Method: Leverages Lagrange coded computing to allow clients to share only encoded data, enabling exact reconstruction of global distance matrices without privacy leakage, even under client collusion. This approach is model-agnostic and naturally resilient to Non-IID data.
Result: Theoretical analysis confirms reconstruction fidelity and privacy guarantees. Comprehensive experiments show superior robustness, effectiveness, and generality across various benchmarks compared to state-of-the-art methods.
Conclusion: OmniFC provides a unified, model-agnostic framework for federated clustering that overcomes privacy and robustness challenges, offering a generalizable solution applicable to diverse clustering methods while maintaining strong privacy guarantees.
Abstract: Federated clustering (FC) aims to discover global cluster structures across decentralized clients without sharing raw data, making privacy preservation a fundamental requirement. There are two critical challenges: (1) privacy leakage during collaboration, and (2) robustness degradation due to aggregation of proxy information from non-independent and identically distributed (Non-IID) local data, leading to inaccurate or inconsistent global clustering. Existing solutions typically rely on model-specific local proxies, which are sensitive to data heterogeneity and inherit inductive biases from their centralized counterparts, thus limiting robustness and generality. We propose Omni Federated Clustering (OmniFC), a unified and model-agnostic framework. Leveraging Lagrange coded computing, our method enables clients to share only encoded data, allowing exact reconstruction of the global distance matrix, a fundamental representation of sample relationships, without leaking private information, even under client collusion. This construction is naturally resilient to Non-IID data distributions. This approach decouples FC from model-specific proxies, providing a unified extension mechanism applicable to diverse centralized clustering methods. Theoretical analysis confirms both reconstruction fidelity and privacy guarantees, while comprehensive experiments demonstrate OmniFC’s superior robustness, effectiveness, and generality across various benchmarks compared to state-of-the-art methods. Code will be released.
[1022] Unlabeled Data vs. Pre-trained Knowledge: Rethinking SSL in the Era of Large Models
Song-Lin Lv, Rui Zhu, Tong Wei, Yu-Feng Li, Lan-Zhe Guo
Main category: cs.LG
TL;DR: SSL faces challenges in the era of large pre-trained models, which outperform SSL methods on image classification tasks under limited supervision.
Details
Motivation: To compare the effectiveness of semi-supervised learning versus pre-trained models when labeled data is scarce, addressing whether to rely on unlabeled data or leverage foundation models.
Method: Conducted fair comparison between SSL methods and pre-trained models (e.g., CLIP) on image classification tasks under controlled supervision budget.
Result: Pre-trained models show both high efficiency and strong performance, outperforming SSL methods on widely adopted SSL benchmarks.
Conclusion: SSL needs new research directions integrating with pre-trained models, and MLLMs still face performance limitations despite their scale.
Abstract: Semi-supervised learning (SSL) alleviates the cost of the data labeling process by exploiting unlabeled data and has achieved promising results. Meanwhile, with the development of large foundation models, exploiting pre-trained models becomes a promising way to address label scarcity in downstream tasks, such as via various parameter-efficient fine-tuning techniques. This raises a natural yet critical question: When labeled data is limited, should we rely on unlabeled data or pre-trained models? To investigate this issue, we conduct a fair comparison between SSL methods and pre-trained models (e.g., CLIP) on representative image classification tasks under a controlled supervision budget. Experiments reveal that SSL has met its “Waterloo” in the era of large models, as pre-trained models show both high efficiency and strong performance on widely adopted SSL benchmarks. This underscores the urgent need for SSL researchers to explore new avenues, such as deeper integration between SSL and pre-trained models. Furthermore, we investigate the potential of Multi-Modal Large Language Models (MLLMs) in image classification tasks. Results show that, despite their massive parameter scales, MLLMs still face significant performance limitations, highlighting that even a seemingly well-studied task remains highly challenging.
[1023] EvoBrain: Dynamic Multi-Channel EEG Graph Modeling for Time-Evolving Brain Networks
Rikuto Kotoge, Zheng Chen, Tasuku Kimura, Yasuko Matsubara, Takufumi Yanagisawa, Haruhiko Kishima, Yasushi Sakurai
Main category: cs.LG
TL;DR: EvoBrain is a novel seizure detection model that addresses limitations in dynamic GNNs by explicitly modeling evolving brain connectivity and integrating temporal signals with graph structures using a two-stream Mamba architecture and GCN with Laplacian Positional Encoding.
Details
Motivation: Current dynamic GNN methods for EEG-based seizure detection have two key limitations: they use temporally fixed static graphs that don't reflect evolving brain connectivity during seizures, and they inadequately model interactions between temporal signals and graph structures, leading to inconsistent performance.
Method: Proposed EvoBrain model with: 1) Two-stream Mamba architecture for temporal processing, 2) GCN enhanced by Laplacian Positional Encoding, 3) Explicit dynamic graph structures where both nodes and edges evolve over time, following neurological insights.
Result: Significant performance improvements: 23% increase in AUROC and 30% improvement in F1 score compared to dynamic GNN baseline. Also evaluated on challenging early seizure prediction tasks.
Conclusion: Theoretical analysis proves the expressivity advantage of explicit dynamic modeling and time-then-graph approaches. EvoBrain effectively addresses the fundamental challenges in dynamic GNNs for seizure detection and demonstrates superior performance.
Abstract: Dynamic GNNs, which integrate temporal and spatial features in Electroencephalography (EEG) data, have shown great potential in automating seizure detection. However, fully capturing the underlying dynamics necessary to represent brain states, such as seizure and non-seizure, remains a non-trivial task and presents two fundamental challenges. First, most existing dynamic GNN methods are built on temporally fixed static graphs, which fail to reflect the evolving nature of brain connectivity during seizure progression. Second, current efforts to jointly model temporal signals and graph structures and, more importantly, their interactions remain nascent, often resulting in inconsistent performance. To address these challenges, we present the first theoretical analysis of these two problems, demonstrating the effectiveness and necessity of explicit dynamic modeling and time-then-graph dynamic GNN method. Building on these insights, we propose EvoBrain, a novel seizure detection model that integrates a two-stream Mamba architecture with a GCN enhanced by Laplacian Positional Encoding, following neurological insights. Moreover, EvoBrain incorporates explicitly dynamic graph structures, allowing both nodes and edges to evolve over time. Our contributions include (a) a theoretical analysis proving the expressivity advantage of explicit dynamic modeling and time-then-graph over other approaches, (b) a novel and efficient model that significantly improves AUROC by 23% and F1 score by 30%, compared with the dynamic GNN baseline, and (c) broad evaluations of our method on the challenging early seizure prediction tasks.
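Laplacian positional encoding, one named ingredient, is straightforward to sketch; the toy random adjacency below (19 nodes, matching a standard 10-20 EEG montage) stands in for EvoBrain's learned, time-evolving connectivity graphs:

```python
# Hedged sketch of Laplacian positional encoding for an EEG electrode graph:
# the first non-trivial eigenvectors of the normalized graph Laplacian serve
# as structural node features for the GCN.
import numpy as np

def laplacian_pe(A, k=4):
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt   # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)                     # ascending eigenvalues
    return vecs[:, 1:k + 1]                            # skip the trivial eigenvector

A = (np.random.default_rng(0).random((19, 19)) > 0.7).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0)         # symmetric channel graph
pe = laplacian_pe(A)                                   # (19, 4) node features to
                                                       # concatenate with EEG inputs
```

In a time-then-graph design, such encodings would be recomputed as the learned adjacency evolves, giving the GCN position information consistent with the current connectivity.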
[1024] Physics-informed Reduced Order Modeling of Time-dependent PDEs via Differentiable Solvers
Nima Hosseini Dashtbayaz, Hesam Salehipour, Adrian Butscher, Nigel Morris
Main category: cs.LG
TL;DR: Physics-informed ROM (Φ-ROM) incorporates differentiable PDE solvers into reduced-order modeling training to ensure latent dynamics align with governing physics, improving generalization, forecasting, and data efficiency.
Details
Motivation: Traditional ROMs exclude high-fidelity solvers from training, causing latent dynamics to drift from physical constraints, limiting generalization and forecasting capabilities.Method: Integrates differentiable PDE solvers directly into training procedure to shape latent space dynamics according to governing physics, ensuring correspondence between full and reduced systems.
Result: Outperforms state-of-the-art data-driven ROMs by accurately generalizing to unseen parameters, enabling long-term forecasting, maintaining spatiotemporal continuity, and reducing data requirements.
Conclusion: Φ-ROM provides robust framework for field reconstruction and data assimilation, working with sparse observations and extensible to various PDE systems through open-source implementation.
Abstract: Reduced-order modeling (ROM) of time-dependent and parameterized differential equations aims to accelerate the simulation of complex high-dimensional systems by learning a compact latent manifold representation that captures the characteristics of the solution fields and their time-dependent dynamics. Although high-fidelity numerical solvers generate the training datasets, they have thus far been excluded from the training process, causing the learned latent dynamics to drift away from the discretized governing physics. This mismatch often limits generalization and forecasting capabilities. In this work, we propose Physics-informed ROM ($\Phi$-ROM) by incorporating differentiable PDE solvers into the training procedure. Specifically, the latent space dynamics and its dependence on PDE parameters are shaped directly by the governing physics encoded in the solver, ensuring a strong correspondence between the full and reduced systems. Our model outperforms state-of-the-art data-driven ROMs and other physics-informed strategies by accurately generalizing to new dynamics arising from unseen parameters, enabling long-term forecasting beyond the training horizon, maintaining continuity in both time and space, and reducing the data cost. Furthermore, $\Phi$-ROM learns to recover and forecast the solution fields even when trained or evaluated with sparse and irregular observations of the fields, providing a flexible framework for field reconstruction and data assimilation. We demonstrate the framework’s robustness across various PDE solvers and highlight its broad applicability by providing an open-source JAX implementation that is readily extensible to other PDE systems and differentiable solvers, available at https://phi-rom.github.io.
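The core trick is that the solver sits inside the training loss, so gradients flow through it. Below is a minimal sketch in PyTorch under stated assumptions: a toy 1-D heat-equation step stands in for the paper's differentiable solvers, and `solver_step`/`phi_rom_loss` are hypothetical names (the actual open-source implementation is in JAX).

```python
import torch

def solver_step(u, dt=0.01, nu=0.1):
    """Differentiable 1-D heat-equation step (periodic finite differences);
    a toy stand-in for a differentiable PDE solver."""
    lap = torch.roll(u, 1, -1) - 2 * u + torch.roll(u, -1, -1)
    return u + dt * nu * lap

def phi_rom_loss(decoder, z_t, z_next, dt=0.01):
    """Physics-informed ROM loss sketch: the decoded latent trajectory must
    be consistent with one step of the governing solver."""
    u_t, u_next = decoder(z_t), decoder(z_next)
    return torch.mean((solver_step(u_t, dt) - u_next) ** 2)

decoder = torch.nn.Linear(8, 64)               # toy decoder: latent -> field
z_t, z_next = torch.randn(16, 8), torch.randn(16, 8)
loss = phi_rom_loss(decoder, z_t, z_next)
loss.backward()                                # gradients flow through the solver
```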
[1025] Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise
Haocheng Luo, Mehrtash Harandi, Dinh Phung, Trung Le
Main category: cs.LG
TL;DR: The paper provides a theoretical explanation for the m-sharpness phenomenon in SAM where performance improves with smaller micro-batch sizes, and introduces RW-SAM as a parallelizable alternative.
Details
Motivation: To understand why SAM's performance improves with smaller micro-batch sizes (m-sharpness phenomenon) and develop a theoretical foundation for this empirical observation.Method: Used an extended Stochastic Differential Equation (SDE) framework to analyze stochastic gradient noise structure and characterize SAM variants’ dynamics. Proposed Reweighted SAM (RW-SAM) with sharpness-weighted sampling.
Result: Found that stochastic noise in SAM perturbations induces variance-based sharpness regularization. RW-SAM successfully mimics m-SAM’s generalization benefits while maintaining parallelizability.
Conclusion: The theoretical analysis explains m-sharpness phenomenon, and RW-SAM provides a practical parallelizable alternative to achieve similar generalization improvements as m-SAM.
Abstract: Sharpness-aware minimization (SAM) has emerged as a highly effective technique for improving model generalization, but its underlying principles are not fully understood. We investigate the phenomenon known as m-sharpness, where the performance of SAM improves monotonically as the micro-batch size for computing perturbations decreases. In practice, the empirical m-sharpness effect underpins the deployment of SAM in distributed training, yet a rigorous theoretical account has so far been lacking. To provide a theoretical explanation for m-sharpness, we leverage an extended Stochastic Differential Equation (SDE) framework and analyze the structure of stochastic gradient noise (SGN) to characterize the dynamics of various SAM variants, including n-SAM and m-SAM. Our findings reveal that the stochastic noise introduced during SAM perturbations inherently induces a variance-based sharpness regularization effect. Motivated by our theoretical insights, we introduce Reweighted SAM (RW-SAM), which employs sharpness-weighted sampling to mimic the generalization benefits of m-SAM while remaining parallelizable. Comprehensive experiments validate the effectiveness of our theoretical analysis and proposed method.
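For reference, m-SAM computes the SAM ascent perturbation per micro-batch of size m rather than once for the whole batch, which is where the extra noise comes from. A minimal PyTorch sketch of that inner loop (structure and hyperparameters are illustrative; this is plain m-SAM, not the paper's RW-SAM):

```python
import torch

def m_sam_grad(model, loss_fn, x, y, m=4, rho=0.05):
    """m-SAM sketch: split the batch into micro-batches of size m, compute a
    SAM perturbation per micro-batch, and average the perturbed gradients."""
    grads = [torch.zeros_like(p) for p in model.parameters()]
    chunks = list(zip(x.split(m), y.split(m)))
    for xb, yb in chunks:
        loss = loss_fn(model(xb), yb)
        g = torch.autograd.grad(loss, list(model.parameters()))
        norm = torch.sqrt(sum((gi ** 2).sum() for gi in g))
        eps = [rho * gi / (norm + 1e-12) for gi in g]      # ascent direction
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                p.add_(e)                                  # perturb weights
        loss = loss_fn(model(xb), yb)                      # loss at perturbed point
        g = torch.autograd.grad(loss, list(model.parameters()))
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                p.sub_(e)                                  # undo perturbation
        for acc, gi in zip(grads, g):
            acc += gi / len(chunks)
    return grads                                           # feed to any optimizer

model = torch.nn.Linear(10, 2)
g = m_sam_grad(model, torch.nn.functional.cross_entropy,
               torch.randn(16, 10), torch.randint(0, 2, (16,)))
```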
[1026] HOPSE: Scalable Higher-Order Positional and Structural Encoder for Combinatorial Representations
Martin Carrasco, Guillermo Bernardez, Marco Montagna, Nina Miolane, Lev Telyatnikov
Main category: cs.LG
TL;DR: HOPSE is a scalable Topological Deep Learning method that avoids Higher-Order Message Passing by decomposing higher-order domains into neighborhood relationships using Hasse graphs, achieving comparable performance with 7x speedups.
Details
Motivation: Existing Topological Deep Learning methods using Higher-Order Message Passing face scalability challenges due to combinatorial explosion of message-passing routes and computational complexity overhead.Method: HOPSE decomposes arbitrary higher-order domains into neighborhood relationships using Hasse graph decomposition, decoupling representation learning of neighborhood topology from attributes without message passing.
Result: HOPSE matches performance on traditional TDL datasets and outperforms HOMP methods on topological tasks, achieving up to 7x speedups over HOMP-based models.
Conclusion: HOPSE provides a scalable alternative to HOMP-based methods in Topological Deep Learning, opening new paths for efficient higher-order interaction modeling.
Abstract: While Graph Neural Networks (GNNs) have proven highly effective at modeling relational data, pairwise connections cannot fully capture multi-way relationships naturally present in complex real-world systems. In response to this, Topological Deep Learning (TDL) leverages more general combinatorial representations – such as simplicial or cellular complexes – to accommodate higher-order interactions. Existing TDL methods often extend GNNs through Higher-Order Message Passing (HOMP), but face critical scalability challenges due to (i) a combinatorial explosion of message-passing routes, and (ii) significant complexity overhead from the propagation mechanism. This work presents HOPSE (Higher-Order Positional and Structural Encoder), an alternative method to solve tasks involving higher-order interactions without message passing. Instead, HOPSE breaks arbitrary higher-order domains into their neighborhood relationships using a Hasse graph decomposition. This method shows that decoupling the representation learning of neighborhood topology from that of attributes results in lower computational complexity, casting doubt on the need for HOMP. The experiments on molecular graph tasks and topological benchmarks show that HOPSE matches performance on traditional TDL datasets and outperforms HOMP methods on topological tasks, achieving up to $7\times$ speedups over HOMP-based models, opening a new path for scalable TDL.
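The Hasse graph underlying the decomposition is simple to construct: one node per cell and one edge per codimension-1 face relation. A small Python sketch for a simplicial complex (HOPSE's actual neighborhood extraction is richer, but this is the basic object):

```python
from itertools import combinations

def hasse_edges(simplices):
    """Hasse graph of a simplicial complex: one node per simplex, one edge per
    codimension-1 face relation (tau, sigma) with tau a facet of sigma."""
    edges = []
    for s in simplices:
        for f in combinations(s, len(s) - 1):
            if len(f) >= 1:
                edges.append((tuple(sorted(f)), tuple(sorted(s))))
    return edges

# a filled triangle: three vertices, three edges, one 2-simplex
complex_ = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2), (0, 1, 2)]
print(hasse_edges(complex_))
```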
[1027] Automatic Discovery of One Parameter Subgroups of $SO(n)$
Pavan Karjol, Vivek V Kashyap, Rohan Kashyap, Prathosh A P
Main category: cs.LG
TL;DR: A framework for automatically discovering one-parameter subgroups of SO(n) using Jordan forms of skew-symmetric matrices to establish canonical forms and invariant function representations.
Details
Motivation: One-parameter subgroups of SO(n) are crucial in robotics, quantum mechanics, and molecular structure analysis, but their automatic discovery remains challenging.Method: Utilizes standard Jordan form of skew-symmetric matrices (Lie algebra of SO(n)) to establish canonical forms for orbits and derive standardized representations for invariant functions, learning parameters to uncover underlying subgroups.
Result: Successfully demonstrated in double pendulum modeling, moment of inertia prediction, top quark tagging, and invariant polynomial regression, recovering meaningful subgroup structure and producing interpretable symmetry-aware representations.
Conclusion: The proposed framework effectively discovers one-parameter subgroups of SO(n) and generates interpretable, symmetry-aware representations for various applications.
Abstract: We introduce a novel framework for the automatic discovery of one-parameter subgroups ($H_{\gamma}$) of $SO(3)$ and, more generally, $SO(n)$. One-parameter subgroups of $SO(n)$ are crucial in a wide range of applications, including robotics, quantum mechanics, and molecular structure analysis. Our method utilizes the standard Jordan form of skew-symmetric matrices, which define the Lie algebra of $SO(n)$, to establish a canonical form for orbits under the action of $H_{\gamma}$. This canonical form is then employed to derive a standardized representation for $H_{\gamma}$-invariant functions. By learning the appropriate parameters, the framework uncovers the underlying one-parameter subgroup $H_{\gamma}$. The effectiveness of the proposed approach is demonstrated through tasks such as double pendulum modeling, moment of inertia prediction, top quark tagging and invariant polynomial regression, where it successfully recovers meaningful subgroup structure and produces interpretable, symmetry-aware representations.
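The underlying objects are concrete: a one-parameter subgroup of $SO(n)$ is $H(t) = \exp(tA)$ for a fixed skew-symmetric matrix $A$, and the subgroup property $H(s)H(t) = H(s+t)$ holds because all the exponents commute. A quick numerical check (this illustrates the structure being discovered, not the paper's learning procedure):

```python
import numpy as np
from scipy.linalg import expm

# A one-parameter subgroup of SO(n): H(t) = exp(t * A), A skew-symmetric.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = (M - M.T) / 2                       # skew-symmetric => A lies in so(4)

def H(t):
    return expm(t * A)                   # an element of SO(4)

R = H(0.7)
print(np.allclose(R @ R.T, np.eye(4)))        # orthogonal: True
print(np.isclose(np.linalg.det(R), 1.0))      # determinant one: True
print(np.allclose(H(0.3) @ H(0.4), H(0.7)))   # subgroup property: True
```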
[1028] Fast Rate Bounds for Multi-Task and Meta-Learning with Different Sample Sizes
Hossein Zakerinia, Christoph H. Lampert
Main category: cs.LG
TL;DR: New fast-rate PAC-Bayesian generalization bounds for unbalanced multi-task and meta-learning, where tasks have different training set sizes, addressing real-world scenarios where equal-size training sets are uncommon.
Details
Motivation: Previous fast-rate bounds only worked for balanced settings with equal training set sizes, while standard-rate bounds were available for unbalanced settings. Real-world multi-task learning typically involves tasks with varying amounts of training data.Method: Developed new PAC-Bayesian generalization bounds specifically designed for the unbalanced multi-task setting, making them numerically computable and interpretable.
Result: The new bounds provide stronger guarantees than previous bounds and are flexible enough to handle various cases. The paper also reveals that unbalanced settings have different statistical properties than balanced ones, and allows for two meaningful definitions of multi-task risk.
Conclusion: The work provides the first fast-rate PAC-Bayesian bounds for unbalanced multi-task learning, demonstrates fundamental statistical differences between balanced and unbalanced settings, and clarifies risk definitions in multi-task learning.
Abstract: We present new fast-rate PAC-Bayesian generalization bounds for multi-task and meta-learning in the unbalanced setting, i.e. when the tasks have training sets of different sizes, as is typically the case in real-world scenarios. Previously, only standard-rate bounds were known for this situation, while fast-rate bounds were limited to the setting where all training sets are of equal size. Our new bounds are numerically computable as well as interpretable, and we demonstrate their flexibility in handling a number of cases where they give stronger guarantees than previous bounds. Besides the bounds themselves, we also make conceptual contributions: we demonstrate that the unbalanced multi-task setting has different statistical properties than the balanced situation, specifically that proofs from the balanced situation do not carry over to the unbalanced setting. Additionally, we shed light on the fact that the unbalanced situation allows two meaningful definitions of multi-task risk, depending on whether all tasks should be considered equally important or if sample-rich tasks should receive more weight than sample-poor ones.
[1029] A Temporal Difference Method for Stochastic Continuous Dynamics
Haruki Settai, Naoya Takeishi, Takehisa Yairi
Main category: cs.LG
TL;DR: Proposes a model-free reinforcement learning approach that targets the Hamilton-Jacobi-Bellman equation without requiring prior knowledge of system dynamics, bridging stochastic control and model-free RL.
Details
Motivation: Existing HJB-based RL methods require explicit knowledge of system dynamics (coefficient functions), which limits their practical application in real-world scenarios where dynamics are unknown.Method: Develops a model-free temporal difference method that still targets the HJB equation, establishing exponential convergence for idealized continuous-time dynamics.
Result: The method achieves exponential convergence in continuous-time dynamics and demonstrates empirical advantages over transition-kernel-based formulations.
Conclusion: This work bridges stochastic control and model-free reinforcement learning by providing a model-free approach that maintains theoretical grounding in the HJB equation.
Abstract: For continuous systems modeled by dynamical equations such as ODEs and SDEs, Bellman’s Principle of Optimality takes the form of the Hamilton-Jacobi-Bellman (HJB) equation, which provides the theoretical target of reinforcement learning (RL). Although recent advances in RL successfully leverage this formulation, the existing methods typically assume the underlying dynamics are known a priori because they need explicit access to the coefficient functions of dynamical equations to update the value function following the HJB equation. We address this inherent limitation of HJB-based RL; we propose a model-free approach still targeting the HJB equation and propose the corresponding temporal difference method. We establish exponential convergence of the idealized continuous-time dynamics and empirically demonstrate its potential advantages over transition-kernel-based formulations. The proposed formulation paves the way toward bridging stochastic control and model-free reinforcement learning.
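For intuition, a generic discretized continuous-time TD(0) update looks as follows; this is a standard construction and not the paper's exact estimator, which targets the HJB equation for SDEs, so treat it as a sketch under that caveat.

```python
import torch

def continuous_td_update(V, s, s_next, r, dt=0.01, rho=1.0, lr=1e-2):
    """Generic continuous-time TD(0) sketch: discounted residual
    delta = r*dt + exp(-rho*dt) * V(s') - V(s), semi-gradient update on V."""
    with torch.no_grad():
        target = r * dt + torch.exp(torch.tensor(-rho * dt)) * V(s_next)
    delta = target - V(s)
    loss = 0.5 * (delta ** 2).mean()
    loss.backward()
    with torch.no_grad():
        for p in V.parameters():
            p -= lr * p.grad
            p.grad = None
    return delta.detach()

V = torch.nn.Sequential(torch.nn.Linear(3, 32), torch.nn.Tanh(),
                        torch.nn.Linear(32, 1))
s, s_next = torch.randn(8, 3), torch.randn(8, 3)
delta = continuous_td_update(V, s, s_next, r=torch.randn(8, 1))
```

Note that no coefficient functions of the dynamics appear anywhere: the update only touches sampled transitions, which is the model-free property the paper is after.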
[1030] Feasibility-Aware Decision-Focused Learning for Predicting Parameters in the Constraints
Jayanta Mandi, Marianne Defresne, Senne Berden, Tias Guns
Main category: cs.LG
TL;DR: A decision-focused learning framework for predicting uncertain constraint parameters in constrained optimization problems, featuring two novel loss functions that balance feasibility and decision quality through a tunable parameter.
Details
Motivation: When parameters in constrained optimization problems are uncertain, traditional predict-then-optimize approaches can lead to infeasible solutions. There's a need to simultaneously manage both feasibility and decision quality in decision-focused learning.Method: Developed a DFL framework with two novel loss functions based on maximum likelihood estimation: one penalizes infeasibility, the other penalizes suboptimal decisions. Introduced a tunable parameter to form a weighted average of the two losses.
Result: Experimental results show that adjusting the tunable parameter provides control over the trade-off between suboptimality and feasibility. The approach outperforms existing baselines in either objective across several COP instances.
Conclusion: The proposed framework successfully balances feasibility and decision quality in constrained optimization problems with uncertain parameters, offering decision-makers flexible control over this critical trade-off.
Abstract: When some parameters of a constrained optimization problem (COP) are uncertain, this gives rise to a predict-then-optimize (PtO) problem, comprising two stages: the prediction of the unknown parameters from contextual information and the subsequent optimization using those predicted parameters. Decision-focused learning (DFL) implements the first stage by training a machine learning (ML) model to optimize the quality of the decisions made using the predicted parameters. When the predicted parameters occur in the constraints, they can lead to infeasible solutions. Therefore, it is important to simultaneously manage both feasibility and decision quality. We develop a DFL framework for predicting constraint parameters in a generic COP. While prior works typically assume that the underlying optimization problem is a linear program (LP) or integer LP (ILP), our approach makes no such assumption. We derive two novel loss functions based on maximum likelihood estimation (MLE): the first one penalizes infeasibility (by penalizing predicted parameters that lead to infeasible solutions), while the second one penalizes suboptimal decisions (by penalizing predicted parameters that make the true optimal solution infeasible). We introduce a single tunable parameter to form a weighted average of the two losses, allowing decision-makers to balance suboptimality and feasibility. We experimentally demonstrate that adjusting this parameter provides decision-makers control over this trade-off. Moreover, across several COP instances, we show that adjusting the tunable parameter allows a decision-maker to prioritize either suboptimality or feasibility, outperforming the performance of existing baselines in either objective.
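A toy rendering of the two penalties for a single linear constraint $a \cdot x \le b$ with a predicted right-hand side: the hinge surrogates below are our simplification of the paper's MLE-based losses, and all names are hypothetical.

```python
import torch

def feasibility_aware_loss(b_pred, b_true, x_star, x_hat, a, alpha=0.5):
    """Sketch of the two penalties (hinge surrogates, our assumption):
    (i) the decision x_hat made under b_pred violates the true constraint;
    (ii) the true optimum x_star is infeasible under the predicted constraint.
    alpha trades off feasibility against decision quality."""
    infeasibility = torch.relu(a @ x_hat - b_true)   # decision infeasible in reality
    suboptimality = torch.relu(a @ x_star - b_pred)  # true optimum cut off by prediction
    return alpha * infeasibility + (1 - alpha) * suboptimality

a = torch.tensor([1.0, 2.0])
loss = feasibility_aware_loss(b_pred=torch.tensor(3.0), b_true=torch.tensor(4.0),
                              x_star=torch.tensor([1.0, 1.0]),
                              x_hat=torch.tensor([2.0, 1.0]), a=a)
```

Sweeping alpha between 0 and 1 reproduces the trade-off the paper exposes to the decision-maker: alpha near 1 prioritizes feasible decisions, alpha near 0 prioritizes not excluding the true optimum.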
[1031] Fairness under Competition
Ronen Gradwohl, Eilam Shapira, Moshe Tennenholtz
Main category: cs.LG
TL;DR: The paper shows that individually fair classifiers can lead to unfair ecosystem outcomes when multiple firms compete, and improving individual fairness may actually decrease overall ecosystem fairness.
Details
Motivation: To understand how adopting fair ML classifiers affects overall ecosystem fairness when multiple competing firms use such algorithms.Method: Theoretical analysis of fairness in competitive ecosystems, quantifying fairness loss based on classifier correlation and data overlap, with supporting experimental evidence.
Result: Demonstrated that individually fair classifiers can result in unfair ecosystem outcomes, and improving individual fairness may paradoxically reduce overall ecosystem fairness.
Conclusion: The findings call for a shift from individual classifier fairness to ecosystem-level fairness considerations in ML deployment.
Abstract: Algorithmic fairness has emerged as a central issue in ML, and it has become standard practice to adjust ML algorithms so that they will satisfy fairness requirements such as Equal Opportunity. In this paper we consider the effects of adopting such fair classifiers on the overall level of ecosystem fairness. Specifically, we introduce the study of fairness with competing firms, and demonstrate the failure of fair classifiers in yielding fair ecosystems. Our results quantify the loss of fairness in systems, under a variety of conditions, based on classifiers’ correlation and the level of their data overlap. We show that even if competing classifiers are individually fair, the ecosystem’s outcome may be unfair; and that adjusting biased algorithms to improve their individual fairness may lead to an overall decline in ecosystem fairness. In addition to these theoretical results, we also provide supporting experimental evidence. Together, our model and results provide a novel and essential call for action.
[1032] Adversarial Robustness of Nonparametric Regression
Parsa Moradi, Hanzaleh Akabrinodehi, Mohammad Ali Maddah-Ali
Main category: cs.LG
TL;DR: This paper analyzes adversarial robustness in nonparametric regression, showing that smoothing spline estimators remain consistent when only o(n) of the n samples are adversarially corrupted, while no estimator can tolerate corruption of a constant fraction of the data.
Details
Motivation: While adversarial robustness of parametric regression is well-studied, nonparametric regression robustness remains largely unexplored, especially when adversaries can corrupt input data.Method: The study characterizes adversarial robustness assuming regression functions belong to second-order Sobolev space, establishing minimax lower bounds and analyzing classical smoothing spline estimators with proper regularization.
Result: Smoothing spline estimators are robust when o(n) samples are corrupted, achieving vanishing estimation error as n→∞, but no estimator can guarantee vanishing error when a constant fraction of data is corrupted.
Conclusion: Smoothing spline estimators are optimal in terms of maximum tolerable corrupted samples, providing fundamental limits on adversarial robustness in nonparametric regression.
Abstract: In this paper, we investigate the adversarial robustness of nonparametric regression, a fundamental problem in machine learning, under the setting where an adversary can arbitrarily corrupt a subset of the input data. While the robustness of parametric regression has been extensively studied, its nonparametric counterpart remains largely unexplored. We characterize the adversarial robustness in nonparametric regression, assuming the regression function belongs to the second-order Sobolev space (i.e., it is square integrable up to its second derivative). The contribution of this paper is two-fold: (i) we establish a minimax lower bound on the estimation error, revealing a fundamental limit that no estimator can overcome, and (ii) we show that, perhaps surprisingly, the classical smoothing spline estimator, when properly regularized, exhibits robustness against adversarial corruption. These results imply that if $o(n)$ out of $n$ samples are corrupted, the estimation error of the smoothing spline vanishes as $n \to \infty$. On the other hand, when a constant fraction of the data is corrupted, no estimator can guarantee vanishing estimation error, implying the optimality of the smoothing spline in terms of maximum tolerable number of corrupted samples.
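The positive result is easy to reproduce qualitatively: with o(n) corrupted points, a regularized smoothing spline barely moves. A small sketch (the smoothing level s is ad hoc here, not the paper's regularization choice):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

# Adversary corrupts a vanishing fraction of samples arbitrarily.
idx = rng.choice(n, size=10, replace=False)   # 10 << n corrupted points
y[idx] += rng.uniform(-5, 5, size=10)

# Regularized cubic smoothing spline (s controls the roughness penalty).
spline = UnivariateSpline(x, y, k=3, s=n * 0.1 ** 2)
err = np.mean((spline(x) - np.sin(2 * np.pi * x)) ** 2)
print(f"mean squared estimation error: {err:.4f}")
```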
[1033] MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention
Can Yaras, Alec S. Xu, Pierre Abillama, Changwoo Lee, Laura Balzano
Main category: cs.LG
TL;DR: MonarchAttention is a novel sub-quadratic attention approximation method using Monarch matrices that achieves significant speedups over FlashAttention-2 while maintaining performance without retraining.
Details
Motivation: Transformers suffer from quadratic complexity in sequence length due to attention mechanism, limiting their efficiency for long sequences.Method: Uses Monarch matrices for sub-quadratic attention approximation via variational form of softmax, with optimized algorithm for efficient projection onto Monarch matrix class.
Result: Achieves 1.4× to 8.2× speedup over FlashAttention-2 across sequence lengths (256 to 16K), with minimal performance loss on diverse vision and language tasks without retraining.
Conclusion: MonarchAttention provides a flexible, hardware-efficient, and transferable solution for approximating softmax attention with sub-quadratic complexity.
Abstract: Transformers have achieved state-of-the-art performance across various tasks, but suffer from a notable quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention – a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $\Theta(N\sqrt{N} d)$ computational complexity and $\Theta(Nd)$ memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the Transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs. With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: $1.4\times$ for shorter sequences $(N=256)$, $4.5\times$ for medium-length sequences $(N=4K)$, and $8.2\times$ for longer sequences $(N=16K)$. We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts. Our code is available at https://github.com/cjyaras/monarch-attention.
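A Monarch matrix, in one common parametrization, is block-diagonal × permutation × block-diagonal, which is what yields the sub-quadratic cost. A numpy sketch of the matvec (the block ordering convention below is one of several in the literature, and the paper's actual kernels are fused GPU implementations):

```python
import numpy as np

def monarch_matvec(B1, B2, x):
    """Multiply by a Monarch matrix M = P^T B2 P B1 in O(N sqrt(N)) time.
    B1, B2: (b, r, r) stacks of dense blocks with N = b * r and b = r = sqrt(N);
    P is the reshape-transpose ("perfect shuffle") permutation."""
    b, r, _ = B1.shape
    y = np.einsum('bij,bj->bi', B1, x.reshape(b, r))   # block-diagonal B1
    y = y.T.copy()                                     # permutation P
    y = np.einsum('bij,bj->bi', B2, y)                 # block-diagonal B2
    return y.T.reshape(-1)                             # permutation P^T

N, r, b = 16, 4, 4
rng = np.random.default_rng(0)
B1, B2 = rng.standard_normal((b, r, r)), rng.standard_normal((b, r, r))
y = monarch_matvec(B1, B2, rng.standard_normal(N))
```

Each block-diagonal stage costs b·r² = N·√N operations, matching the Θ(N√N d) complexity quoted above (per feature dimension).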
[1034] Offline Clustering of Linear Bandits: The Power of Clusters under Limited Data
Jingyuan Liu, Zeyu Zhang, Xuchuang Wang, Xutong Liu, John C. S. Lui, Mohammad Hajiesmaili, Carlee Joe-Wong
Main category: cs.LG
TL;DR: The paper studies offline clustering of contextual bandits (Off-ClusBand), proposing two algorithms to learn user clusters from offline data and improve decision-making when online data collection is limited.
Details
Motivation: Prior work focused on online clustering of bandits, ignoring widely available offline data. The motivation is to leverage offline datasets to learn cluster properties and accelerate learning in contextual bandit problems.Method: Proposed two algorithms: Off-C2LUB, which performs well under limited offline data, and Off-CLUB, which works well when data is sufficient but may incur bias when data is sparse. Both use offline datasets to cluster users based on preference similarity.
Result: Off-C2LUB outperforms existing methods under limited offline user data, while Off-CLUB performs well with sufficient data and nearly matches the lower bound. Experimental validation on real and synthetic datasets confirms these findings.
Conclusion: Offline clustering of bandits is effective for leveraging existing data to improve decision-making, with different algorithms performing optimally under varying data availability conditions.
Abstract: Contextual multi-armed bandit is a fundamental learning framework for making a sequence of decisions, e.g., advertising recommendations for a sequence of arriving users. Recent works have shown that clustering these users based on the similarity of their learned preferences can accelerate the learning. However, prior work has primarily focused on the online setting, which requires continually collecting user data, ignoring the offline data widely available in many applications. To tackle these limitations, we study the offline clustering of bandits (Off-ClusBand) problem, which studies how to use the offline dataset to learn cluster properties and improve decision-making. The key challenge in Off-ClusBand arises from data insufficiency for users: unlike the online case where we continually learn from online data, in the offline case, we have a fixed, limited dataset to work from and thus must determine whether we have enough data to confidently cluster users together. To address this challenge, we propose two algorithms: Off-C2LUB, which we show analytically and experimentally outperforms existing methods under limited offline user data, and Off-CLUB, which may incur bias when data is sparse but performs well and nearly matches the lower bound when data is sufficient. We experimentally validate these results on both real and synthetic datasets.
[1035] Regret Analysis of Average-Reward Unichain MDPs via an Actor-Critic Approach
Swetha Ganesh, Vaneet Aggarwal
Main category: cs.LG
TL;DR: NAC-B achieves optimal regret in infinite-horizon average-reward MDPs under weak unichain assumption using natural actor-critic with batching and function approximation.
Details
Motivation: Existing theoretical guarantees for actor-critic methods rely on restrictive ergodicity assumptions, while practical applications often involve infinite-horizon average-reward MDPs with potential periodicity and transient states.Method: Proposes NAC-B (Natural Actor-Critic with Batching) that uses function approximation for both actor and critic, employs batching to mitigate periodicity and reduce gradient estimate stochasticity, and introduces constants C_hit and C_tar to analyze convergence rates.
Result: Achieves order-optimal regret of O(√T) under the unichain assumption, which is among the weakest assumptions where policy gradient theorem remains valid for average-reward settings.
Conclusion: NAC-B provides scalable actor-critic method with strong theoretical guarantees for infinite-horizon average-reward MDPs under minimal assumptions, handling both transient states and periodicity.
Abstract: Actor-Critic methods are widely used for their scalability, yet existing theoretical guarantees for infinite-horizon average-reward Markov Decision Processes (MDPs) often rely on restrictive ergodicity assumptions. We propose NAC-B, a Natural Actor-Critic with Batching, that achieves order-optimal regret of $\tilde{O}(\sqrt{T})$ in infinite-horizon average-reward MDPs under the unichain assumption, which permits both transient states and periodicity. This assumption is among the weakest under which the classic policy gradient theorem remains valid for average-reward settings. NAC-B employs function approximation for both the actor and the critic, enabling scalability to problems with large state and action spaces. The use of batching in our algorithm helps mitigate potential periodicity in the MDP and reduces stochasticity in gradient estimates, and our analysis formalizes these benefits through the introduction of the constants $C_{\text{hit}}$ and $C_{\text{tar}}$, which characterize the rate at which empirical averages over Markovian samples converge to the stationary distribution.
[1036] Towards Fully FP8 GEMM LLM Training at Scale
Alejandro Hernández-Cano, Dhia Garbaya, Imanol Schlag, Martin Jaggi
Main category: cs.LG
TL;DR: A new LLM architecture enables full FP8 computation for all GEMMs in transformer blocks during forward/backward passes, achieving unprecedented throughput gains while maintaining BF16-level performance.
Details
Motivation: FP8 formats offer significant potential for LLM pre-training but face stability challenges at scale, with existing approaches relying on suboptimal kernels or falling back to higher precision, compromising throughput gains.Method: Introduces a new LLM architecture that reduces large outlier activations to enable stable FP8 training, with FP8 computation for all GEMMs in transformer blocks during both forward and backward passes.
Result: Achieves unprecedented throughput gains, particularly at scale, while matching downstream performance of standard BF16 training. Identifies key metrics to monitor low-precision training and predict potential divergences.
Conclusion: The proposed architecture enables stable, long-term FP8 training with significant throughput improvements while maintaining model quality, making FP8 practical for large-scale LLM pre-training.
Abstract: Despite the significant potential of FP8 data formats for large language model (LLM) pre-training, their adoption has been limited due to challenges in maintaining stability at scale. Existing approaches often rely on suboptimal fine-grained FP8 kernels or fall back to higher-precision matrix multiplications (GEMMs) in sensitive components, such as attention projections, compromising potential throughput gains. We introduce a new class of LLM architectures that, for the first time, support FP8 computation for all GEMMs within transformer blocks during both forward and backward passes. This enables unprecedented throughput gains, particularly at scale, while matching the downstream performance of standard BF16 training. Our architecture design reduces large outlier activations, promoting stable long-term FP8 training. In addition, we identify key metrics to monitor low-precision training and predict potential future divergences.
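One can emulate the numerics (though not the speed) of a per-tensor-scaled FP8 GEMM in PyTorch ≥ 2.1, which exposes the float8_e4m3fn dtype with a maximum normal value of 448; real deployments use scaled FP8 tensor-core GEMM kernels instead. A sketch, with all names our own:

```python
import torch

def fp8_gemm_sim(a, b):
    """Simulated FP8 GEMM: cast inputs to float8_e4m3fn with per-tensor
    scales, multiply in float32, and rescale. This reproduces FP8 rounding
    error only; it does not use FP8 tensor cores."""
    sa = a.abs().max() / 448.0
    sb = b.abs().max() / 448.0
    a8 = (a / sa).to(torch.float8_e4m3fn).to(torch.float32)
    b8 = (b / sb).to(torch.float8_e4m3fn).to(torch.float32)
    return (a8 @ b8) * (sa * sb)

out = fp8_gemm_sim(torch.randn(64, 128), torch.randn(128, 32))
```

The per-tensor scale is exactly why outlier activations matter: a single large outlier inflates the scale and crushes the resolution available to everything else, which is the failure mode the proposed architecture suppresses.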
[1037] Copresheaf Topological Neural Networks: A Generalized Deep Learning Framework
Mustafa Hajij, Lennart Bastian, Sarah Osentoski, Hardik Kabaria, John L. Davenport, Sheik Dawood, Balaji Cherukuri, Joseph G. Kocheemoolayil, Nastaran Shahmansouri, Adrian Lew, Theodore Papamarkou, Tolga Birdal
Main category: cs.LG
TL;DR: CTNNs provide a unifying framework using copresheaves from algebraic topology to generalize deep learning models for structured data, outperforming conventional baselines on benchmarks.
Details
Motivation: Address the challenge of principled neural architecture design for specific tasks and data types, as current deep learning lacks systematic frameworks for structured data like graphs, meshes, and topological manifolds.Method: Formulate model design using copresheaves from algebraic topology, creating a constructive framework that generalizes practical deep learning models and enables derivation of theoretically sound solutions.
Result: CTNNs consistently outperform conventional baselines on structured data benchmarks, particularly excelling in tasks requiring hierarchical or localized sensitivity.
Conclusion: CTNNs establish a principled multi-scale foundation for next-generation deep learning architectures, providing a unified approach to tackle representation learning challenges across diverse data types.
Abstract: We introduce copresheaf topological neural networks (CTNNs), a powerful unifying framework that encapsulates a wide spectrum of deep learning architectures, designed to operate on structured data, including images, point clouds, graphs, meshes, and topological manifolds. While deep learning has profoundly impacted domains ranging from digital assistants to autonomous systems, the principled design of neural architectures tailored to specific tasks and data types remains one of the field’s most persistent open challenges. CTNNs address this gap by formulating model design in the language of copresheaves, a concept from algebraic topology that generalizes most practical deep learning models in use today. This abstract yet constructive formulation yields a rich design space from which theoretically sound and practically effective solutions can be derived to tackle core challenges in representation learning, such as long-range dependencies, oversmoothing, heterophily, and non-Euclidean domains. Our empirical results on structured data benchmarks demonstrate that CTNNs consistently outperform conventional baselines, particularly in tasks requiring hierarchical or localized sensitivity. These results establish CTNNs as a principled multi-scale foundation for the next generation of deep learning architectures.
[1038] OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models
Ziheng Cheng, Yixiao Huang, Hui Xu, Somayeh Sojoudi, Xuandong Zhao, Dawn Song, Song Mei
Main category: cs.LG
TL;DR: OVERT is the first large-scale benchmark for evaluating over-refusal in text-to-image models, containing 4,600 seemingly harmful but benign prompts and 1,785 genuinely harmful prompts across nine safety categories.
Details
Motivation: Current T2I safety alignment strategies often cause over-refusal, rejecting benign prompts and reducing model utility, but there's no systematic benchmark to evaluate this phenomenon.Method: Developed an automatic workflow to construct synthetic evaluation data, creating the OVERT benchmark with prompts across nine safety-related categories and exploring prompt rewriting as a mitigation approach.
Result: Evaluation of leading T2I models revealed over-refusal is widespread across various categories, and prompt rewriting often compromises faithfulness to original prompt meaning.
Conclusion: Over-refusal is a significant issue requiring further research to enhance safety alignment without compromising functionality, and the generation framework can adapt to diverse safety requirements.
Abstract: Text-to-Image (T2I) models have achieved remarkable success in generating visual content from text inputs. Although multiple safety alignment strategies have been proposed to prevent harmful outputs, they often lead to overly cautious behavior – rejecting even benign prompts – a phenomenon known as $\textit{over-refusal}$ that reduces the practical utility of T2I models. Despite over-refusal having been observed in practice, there is no large-scale benchmark that systematically evaluates this phenomenon for T2I models. In this paper, we present an automatic workflow to construct synthetic evaluation data, resulting in OVERT ($\textbf{OVE}$r-$\textbf{R}$efusal evaluation on $\textbf{T}$ext-to-image models), the first large-scale benchmark for assessing over-refusal behaviors in T2I models. OVERT includes 4,600 seemingly harmful but benign prompts across nine safety-related categories, along with 1,785 genuinely harmful prompts (OVERT-unsafe) to evaluate the safety-utility trade-off. Using OVERT, we evaluate several leading T2I models and find that over-refusal is a widespread issue across various categories (Figure 1), underscoring the need for further research to enhance the safety alignment of T2I models without compromising their functionality. As a preliminary attempt to reduce over-refusal, we explore prompt rewriting; however, we find it often compromises faithfulness to the meaning of the original prompts. Finally, we demonstrate the flexibility of our generation framework in accommodating diverse safety requirements by generating customized evaluation data adapting to user-defined policies.
[1039] MIN-Merging: Merge the Important Neurons for Model Merging
Yunfei Liang
Main category: cs.LG
TL;DR: MIN-Merging is a router-based framework that selectively merges important neurons to reduce parameter conflicts in model merging, achieving better performance on domain-specific tasks while maintaining generalization.
Details
Motivation: Existing model merging approaches suffer from parameter conflicts that degrade performance on domain-specific tasks, despite the availability of many open-source models across domains.Method: A router-based framework that selectively merges the most important neurons to reduce parameter conflicts during model merging.
Result: Extensive experiments on CV and NLP benchmarks show consistent gains on in-domain tasks while retaining generalization ability on out-of-domain tasks.
Conclusion: MIN-Merging is an effective practical solution to the parameter conflict problem in model merging.
Abstract: Recent advances in deep learning have led to a surge of open-source models across diverse domains. While model merging offers a promising way to combine their strengths, existing approaches often suffer from parameter conflicts that degrade performance on domain-specific tasks. We propose MIN-Merging, a router-based framework that selectively merges the most important neurons to reduce such conflicts. Extensive experiments on Computer Vision (CV) and Natural Language Processing (NLP) benchmarks show that MIN-Merging achieves consistent gains on in-domain tasks while retaining the generalization ability of pretrained models on out-of-domain tasks. These results highlight its effectiveness as a practical solution to the parameter conflict problem in model merging.
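As a rough illustration of importance-based selective merging only: the paper's router and its importance criterion are not detailed here, so the magnitude-of-update rule below is purely an assumption.

```python
import torch

def selective_neuron_merge(w_base, w_ft, k_frac=0.2):
    """Hypothetical sketch: keep the base weights but copy over the
    fine-tuned rows (neurons) whose update magnitude is largest, merging
    only the 'important' neurons to limit parameter conflicts."""
    delta = w_ft - w_base
    importance = delta.abs().sum(dim=1)          # per-neuron (row) score
    k = max(1, int(k_frac * w_base.shape[0]))
    top = importance.topk(k).indices
    merged = w_base.clone()
    merged[top] = w_ft[top]                      # merge important neurons only
    return merged

merged = selective_neuron_merge(torch.randn(64, 32), torch.randn(64, 32))
```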
[1040] LaX: Boosting Low-Rank Training of Foundation Models via Latent Crossing
Ruijie Zhang, Ziyue Liu, Zhengyang Wang, Zheng Zhang
Main category: cs.LG
TL;DR: LaX is a plug-and-play module that enhances low-rank models by enabling information flow across low-rank subspaces, achieving full-rank performance with 2-3× fewer parameters.
Details
Motivation: Training foundation models like ViTs and LLMs is computationally expensive, and existing low-rank factorization methods often degrade performance due to restricted parameter space.Method: Introduces Latent Crossing (LaX) - a simple plug-and-play module that enables information flow across low-rank subspaces to enhance model capacity.
Result: LaX boosts low-rank model performance to match or exceed full-rank baselines while using 2-3× fewer parameters. When combined with LoRA for fine-tuning LLaMA-7/13B, it improves performance on arithmetic and common sense reasoning tasks with negligible cost.
Conclusion: LaX effectively enhances the capacity of low-rank models, making them competitive with full-rank models while significantly reducing parameter count and computational cost.
Abstract: Training foundation models such as ViTs and LLMs requires tremendous computing cost. Low-rank matrix or tensor factorization offers a parameter-efficient alternative, but often downgrades performance due to the restricted parameter space. In this work, we introduce Latent Crossing (LaX) – a simple yet effective plug-and-play module that enhances the capacity of low-rank models by enabling information flow across low-rank subspaces. We extensively validate the benefits of LaX on pre-training tasks with ViT-Base/Large and LLaMA-like models ranging from 60M to 1B parameters. LaX boosts low-rank model performance to match or exceed the full-rank baselines while using 2-3$\times$ fewer parameters. When equipped with low-rank adapters (i.e., LoRA) for fine-tuning LLaMA-7/13B, LaX consistently improves performance on arithmetic and common sense reasoning tasks with negligible cost.
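A guess at the shape of the idea, for orientation only: a low-rank layer with an extra learned map acting inside the rank-r latent space so information can mix across the subspace. How LaX actually couples subspaces may differ; every name below is hypothetical.

```python
import torch
import torch.nn as nn

class LowRankLinearLaX(nn.Module):
    """Low-rank layer W ~ U V with a hypothetical 'latent crossing' mixer C
    inside the rank-r subspace (an assumed form, not the paper's exact one)."""
    def __init__(self, d_in, d_out, r):
        super().__init__()
        self.V = nn.Linear(d_in, r, bias=False)    # down-projection
        self.C = nn.Linear(r, r, bias=False)       # latent crossing (assumed)
        self.U = nn.Linear(r, d_out, bias=False)   # up-projection

    def forward(self, x):
        z = self.V(x)
        return self.U(z + self.C(z))               # residual mixing in the subspace

layer = LowRankLinearLaX(512, 512, r=32)           # far fewer params than dense
y = layer(torch.randn(4, 512))
```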
[1041] ADPO: Anchored Direct Preference Optimization
Wang Zixian
Main category: cs.LG
TL;DR: ADPO improves DPO by incorporating soft preference probabilities, reference anchoring for stable policy updates, and listwise learning via Plackett-Luce modeling, achieving 12-79% improvements over DPO in noisy and distribution-shifted scenarios.
Details
Motivation: Standard DPO assumes hard binary labels and pairwise comparisons, which can be brittle under noisy or distribution-shifted supervision, limiting its robustness in real-world applications.Method: ADPO introduces three key improvements: (1) soft preference probabilities instead of hard labels, (2) reference anchoring that creates an implicit trust region for stable policy updates, and (3) extension to listwise learning using Plackett-Luce modeling.
Result: In 12 controlled scenarios (4 noise types × 3 severities) across 3 model scales, ADPO showed 12-79% relative improvements over standard DPO. Listwise variants achieved highest WinMass in 9/12 scenarios, and larger models amplified ADPO’s benefits (0.718 vs 0.416 at hidden=256).
Conclusion: ADPO provides a more robust alternative to DPO that handles noisy and distribution-shifted supervision effectively, with anchoring serving as an effective trust-region regularizer, especially beneficial for larger models.
Abstract: Direct Preference Optimization (DPO) is an efficient alternative to reinforcement learning from human feedback (RLHF), yet it typically assumes hard binary labels and pairwise comparisons. Such assumptions can be brittle under noisy or distribution-shifted supervision. We present Anchored Direct Preference Optimization (ADPO), which (i) incorporates soft preference probabilities, (ii) aligns policy updates through reference anchoring that induces an implicit trust region, and (iii) extends to listwise learning via Plackett-Luce modeling. In controlled synthetic setups covering 12 scenarios (4 noise types × 3 severities) and 3 model scales, ADPO exhibits relative improvements ranging from 12% to 79% over a standard DPO baseline (10-seed means; 95% CIs in the Appendix). Hard labels tend to fare better under severe noise, whereas soft labels yield better calibration under distribution shift; listwise variants achieve the highest WinMass (expected probability mass on the ground-truth best item) in 9/12 scenarios. Larger models amplify ADPO’s benefits (0.718 vs. 0.416 at hidden=256), suggesting that anchoring acts as an effective trust-region regularizer. We release code and configurations to facilitate reproducibility.
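The pairwise variant is a small change to the DPO loss: anchor log-probabilities to a frozen reference policy (standard in DPO) and replace the hard 0/1 label with a probability p_soft. A sketch, with the listwise Plackett-Luce extension and other ADPO details omitted, so treat it as an approximation of the objective:

```python
import torch
import torch.nn.functional as F

def adpo_pairwise_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, p_soft, beta=0.1):
    """Anchored DPO sketch with soft labels: the margin is anchored to a
    frozen reference policy (the implicit trust region), and the hard
    preference label is replaced by a probability p_soft."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # soft binary cross-entropy over the Bradley-Terry preference probability
    return -(p_soft * F.logsigmoid(margin)
             + (1 - p_soft) * F.logsigmoid(-margin)).mean()

loss = adpo_pairwise_loss(torch.tensor([-4.0]), torch.tensor([-6.0]),
                          torch.tensor([-5.0]), torch.tensor([-5.5]),
                          p_soft=torch.tensor([0.8]))
```

Setting p_soft to 1 recovers standard DPO, which makes the soft-label extension easy to ablate.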
[1042] ProSpero: Active Learning for Robust Protein Design Beyond Wild-Type Neighborhoods
Michal Kmicikiewicz, Vincent Fortuin, Ewa Szczurek
Main category: cs.LG
TL;DR: ProSpero is an active learning framework that guides a frozen pre-trained generative model using a surrogate updated from oracle feedback, enabling exploration beyond wild-type neighborhoods while maintaining biological plausibility.
Details
Motivation: Current methods for designing novel protein sequences often lead to biologically implausible sequences or rely on surrogate models that lose fidelity in novel regions, limiting effective exploration beyond wild-type neighborhoods.Method: Integrates fitness-relevant residue selection with biologically-constrained Sequential Monte Carlo sampling, using a frozen pre-trained generative model guided by a surrogate model updated from oracle feedback.
Result: ProSpero consistently outperforms or matches existing methods across diverse protein engineering tasks, retrieving sequences of both high fitness and novelty, and remains effective even when the surrogate is misspecified.
Conclusion: The framework enables effective exploration beyond wild-type neighborhoods while preserving biological plausibility, achieving superior performance in protein sequence design.
Abstract: Designing protein sequences of both high fitness and novelty is a challenging task in data-efficient protein engineering. Exploration beyond wild-type neighborhoods often leads to biologically implausible sequences or relies on surrogate models that lose fidelity in novel regions. Here, we propose ProSpero, an active learning framework in which a frozen pre-trained generative model is guided by a surrogate updated from oracle feedback. By integrating fitness-relevant residue selection with biologically-constrained Sequential Monte Carlo sampling, our approach enables exploration beyond wild-type neighborhoods while preserving biological plausibility. We show that our framework remains effective even when the surrogate is misspecified. ProSpero consistently outperforms or matches existing methods across diverse protein engineering tasks, retrieving sequences of both high fitness and novelty.
[1043] Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients
Omar El Mansouri, Mohamed El Amine Seddik, Salem Lahlou
Main category: cs.LG
TL;DR: Introduces noise-robust variants of GRPO and Dr.GRPO that model reward corruption as Bernoulli noise and apply a correction to debias the learning signal, yielding provably unbiased gradient estimates and improved performance on math and code tasks.
Details
Motivation: RLHF/RLVR methods are highly sensitive to noise from inconsistent or erroneous rewards, but the interaction between such noise and group-based policy optimization methods remains underexplored.Method: Noise-robust variants of Group Relative Policy Optimization (GRPO) and Done Right GRPO (Dr.GRPO) that explicitly model reward corruption as Bernoulli noise and apply a noise correction, after estimating reward flip probabilities, to debias the learning signal.
Result: Consistent improvements across math and code tasks, with gains of up to 6.7 percentage points in accuracy on math tasks and 1.5 on code tasks under realistic reward model conditions.
Conclusion: Bridges label-noise correction from supervised learning with modern RLHF, offering both theoretical insights and a practical algorithm for noisy real-world deployment.
Abstract: Reinforcement learning from human feedback (RLHF) or verifiable rewards (RLVR), the standard paradigm for aligning LLMs or building recent SOTA reasoning models, is highly sensitive to noise from inconsistent or erroneous rewards. Yet, the interaction between such noise and widely used group-based policy optimization methods remains underexplored. We introduce a noise-robust Group Relative Policy Optimization (GRPO) and Done Right GRPO (Dr.GRPO) framework that explicitly models reward corruption as Bernoulli noise. Our method applies noise correction after estimating reward flip probabilities to debias the learning signal, yielding provably unbiased gradient estimates. Theoretical analysis shows that group-based methods inherently mitigate individual-level noise, and our correction strategy amplifies this robustness. Empirically, we observe consistent improvements across math and code tasks when applying our noise correction to standard reward model usage, with particular gains of up to 6.7 percentage points in accuracy on math tasks and 1.5 on code tasks under realistic reward model conditions. This work bridges label-noise correction from supervised learning with modern RLHF, offering both theoretical insights and a practical algorithm for noisy real-world deployment.
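The Bernoulli correction itself is one line. If a binary reward is flipped with probability p < 1/2, then E[r_obs] = p + r(1 − 2p), so (r_obs − p)/(1 − 2p) is an unbiased estimate of the clean reward, and group-relative advantages computed from it inherit that unbiasedness in expectation. A sketch (whether this matches the paper's exact estimator is an assumption):

```python
import torch

def debiased_group_advantages(r_obs, p_flip):
    """Binary rewards observed through a Bernoulli flip channel with
    known or estimated flip probability p_flip < 0.5: debias the rewards,
    then form GRPO-style group-relative advantages."""
    r_hat = (r_obs - p_flip) / (1.0 - 2.0 * p_flip)        # unbiased reward
    return (r_hat - r_hat.mean()) / (r_hat.std() + 1e-8)   # normalized advantages

r_obs = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])      # one group of rollouts
adv = debiased_group_advantages(r_obs, p_flip=0.1)
```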
[1044] Infinite-Width Limit of a Single Attention Layer: Analysis via Tensor Programs
Mana Sakai, Ryo Karakida, Masaaki Imaizumi
Main category: cs.LG
TL;DR: The paper rigorously derives the infinite-width limit distribution for attention layers, showing it’s fundamentally non-Gaussian and hierarchical (Gaussian conditional on similarity scores), without requiring infinite-head approximations or special scaling.
Details
Motivation: Existing Gaussian-based asymptotic theories fail to capture attention layer behavior except under special regimes like infinite heads or tailored scaling. This work aims to provide a rigorous theory for attention layers under realistic conditions.Method: Leveraging the Tensor Programs framework, the authors derive the exact infinite-width limit distribution of variables within a single attention layer under standard 1/√n-scaling and realistic architectural dimensionality.
Result: The limiting distribution is fundamentally non-Gaussian and exhibits a hierarchical structure - it’s Gaussian conditional on the random similarity scores. Numerical experiments validate the theory at finite width.
Conclusion: This work provides the first rigorous characterization of attention layers in the infinite-width regime and lays groundwork for developing a unified theory of deep Transformer architectures.
Abstract: In modern theoretical analyses of neural networks, the infinite-width limit is often invoked to justify Gaussian approximations of neuron preactivations (e.g., via neural network Gaussian processes or Tensor Programs). However, these Gaussian-based asymptotic theories have so far been unable to capture the behavior of attention layers, except under special regimes such as infinitely many heads or tailored scaling schemes. In this paper, leveraging the Tensor Programs framework, we rigorously identify the infinite-width limit distribution of variables within a single attention layer under realistic architectural dimensionality and standard $1/\sqrt{n}$-scaling, where $n$ denotes the dimensionality. We derive the exact form of this limit law without resorting to infinite-head approximations or tailored scalings, demonstrating that it departs fundamentally from Gaussianity. This limiting distribution exhibits non-Gaussianity from a hierarchical structure, being Gaussian conditional on the random similarity scores. Numerical experiments validate our theoretical predictions, confirming the effectiveness of our theory at finite width and its accurate description of finite-head attention. Beyond characterizing a standalone attention layer, our findings lay the groundwork for developing a unified theory of deep Transformer architectures in the infinite-width regime.
[1045] RotaTouille: Rotation Equivariant Deep Learning for Contours
Odin Hoff Gardaa, Nello Blaser
Main category: cs.LG
TL;DR: RotaTouille is a deep learning framework for contour data that achieves rotation and cyclic shift equivariance using complex-valued circular convolution, with applications in shape classification, reconstruction, and contour regression.
Details
Motivation: Contours appear in various domains and require models that are equivariant to planar rotations and cyclic shifts of starting points, which are arbitrary in contour representations.Method: Uses complex-valued circular convolution to achieve rotation and cyclic shift equivariance, along with equivariant non-linearities, coarsening layers, and global pooling layers for invariant representations.
Result: The framework demonstrates effectiveness in experiments for shape classification, reconstruction, and contour regression tasks.
Conclusion: RotaTouille successfully provides a deep learning approach that handles the key symmetries present in contour data through principled equivariant architecture design.
Abstract: Contours or closed planar curves are common in many domains. For example, they appear as object boundaries in computer vision, isolines in meteorology, and the orbits of rotating machinery. In many cases when learning from contour data, planar rotations of the input will result in correspondingly rotated outputs. It is therefore desirable that deep learning models be rotationally equivariant. In addition, contours are typically represented as an ordered sequence of edge points, where the choice of starting point is arbitrary. It is therefore also desirable for deep learning methods to be equivariant under cyclic shifts. We present RotaTouille, a deep learning framework for learning from contour data that achieves both rotation and cyclic shift equivariance through complex-valued circular convolution. We further introduce and characterize equivariant non-linearities, coarsening layers, and global pooling layers to obtain invariant representations for downstream tasks. Finally, we demonstrate the effectiveness of RotaTouille through experiments in shape classification, reconstruction, and contour regression.
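Both equivariances can be checked numerically in a few lines: circular convolution is complex-linear, so a global rotation e^{iθ} factors out of the output, and it commutes with cyclic shifts by construction. A self-contained sketch of the core operation (not the full architecture):

```python
import numpy as np

def circ_conv(z, w):
    """Complex-valued circular convolution via the FFT."""
    return np.fft.ifft(np.fft.fft(z) * np.fft.fft(w))

rng = np.random.default_rng(0)
z = rng.standard_normal(32) + 1j * rng.standard_normal(32)  # contour points
w = rng.standard_normal(32) + 1j * rng.standard_normal(32)  # a filter

# Rotation equivariance: rotating the contour by theta rotates the output.
theta = 0.9
lhs = circ_conv(np.exp(1j * theta) * z, w)
print(np.allclose(lhs, np.exp(1j * theta) * circ_conv(z, w)))  # True

# Cyclic-shift equivariance: re-indexing the starting point shifts the output.
print(np.allclose(circ_conv(np.roll(z, 5), w), np.roll(circ_conv(z, w), 5)))  # True
```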
[1046] What Makes a Good Curriculum? Disentangling the Effects of Data Ordering on LLM Mathematical Reasoning
Yaning Jia, Chunhui Zhang, Xingjian Diao, Xiangchi Yuan, Zhongyu Ouyang, Chiyu Ma, Soroush Vosoughi
Main category: cs.LG
TL;DR: Curriculum learning effectiveness depends on model capability and task complexity, with no universal strategy. Forward vs reverse curriculum outcomes vary by model and task, and different difficulty metrics produce distinct gains.
Details
Motivation: To systematically evaluate when curriculum learning helps, which direction (forward/reverse) is better, and whether effectiveness depends on measurement metrics, given disparate approaches in prior work.Method: Unified offline evaluation framework with five difficulty dimensions (Problem Difficulty, Model Surprisal, Confidence Margin, Predictive Uncertainty, Decision Variability), tested through controlled post-training experiments on mathematical reasoning benchmarks using Llama3.1-8B, Mistral-7B, and Gemma3-4B.
Result: No curriculum strategy dominates universally; forward vs reverse effectiveness depends on model capability and task complexity; samples at different difficulty levels produce distinct gains; task-aligned curricula shape final representations while inner-state curricula modulate internal states like confidence.
Conclusion: There is no universal curriculum strategy - effectiveness depends on model-task combinations. Prioritizing decision-uncertain samples may enhance learning, offering actionable guidance across different model and task regimes.
Abstract: Curriculum learning (CL) - ordering training data from easy to hard - has become a popular strategy for improving reasoning in large language models (LLMs). Yet prior work employs disparate difficulty metrics and training setups, leaving open fundamental questions: When does curriculum help? Which direction - forward or reverse - is better? And does the answer depend on what we measure? We address these questions through a unified offline evaluation framework that decomposes curriculum difficulty into five complementary dimensions: Problem Difficulty, Model Surprisal, Confidence Margin, Predictive Uncertainty, and Decision Variability. Through controlled post-training experiments on mathematical reasoning benchmarks with Llama3.1-8B, Mistral-7B, and Gemma3-4B, we find that (i) no curriculum strategy dominates universally - the relative effectiveness of forward versus reverse CL depends jointly on model capability and task complexity; (ii) even within a single metric, samples at different difficulty levels produce distinct gains depending on task demands; and (iii) task-aligned curricula focus on shaping the model’s final representations and generalization, whereas inner-state curricula modulate internal states such as confidence and uncertainty. Our findings challenge the notion of a universal curriculum strategy and offer actionable guidance across model and task regimes, with some metrics indicating that prioritizing decision-uncertain samples can further enhance learning outcomes.
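Mechanically, the offline setup reduces to scoring each training sample under one of the five difficulty metrics and ordering the data before fine-tuning. A minimal sketch using model surprisal (per-sample negative log-likelihood) as the score; the scores below are hypothetical placeholders:

```python
import numpy as np

def build_curriculum(samples, difficulty, direction="forward"):
    """Order training samples by a difficulty score (e.g., model surprisal,
    i.e., per-sample NLL). 'forward' = easy-to-hard, 'reverse' = hard-to-easy."""
    order = np.argsort(difficulty)
    if direction == "reverse":
        order = order[::-1]
    return [samples[i] for i in order]

samples = ["2+2", "solve x^2-5x+6=0", "prove AM-GM"]
surprisal = np.array([0.3, 1.7, 4.2])              # hypothetical NLL scores
forward = build_curriculum(samples, surprisal)      # easy -> hard
reverse = build_curriculum(samples, surprisal, "reverse")
```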
[1047] On the Stability of Graph Convolutional Neural Networks: A Probabilistic Perspective
Ning Zhang, Henry Kenlay, Li Zhang, Mihai Cucuringu, Xiaowen Dong
Main category: cs.LG
TL;DR: The paper proposes a novel distribution-aware stability analysis framework for Graph Convolutional Neural Networks (GCNNs) that characterizes output perturbations across input data distributions, enabling probabilistic analysis of graph topology perturbations.
Details
Motivation: Existing theoretical understanding of GCNN stability is limited to worst-case perturbations, hampering development of robust and trustworthy models. Current approaches don't adequately address how perturbations affect outputs across diverse input data.
Method: Proposed a distribution-aware formulation for analyzing GCNN stability that characterizes output perturbations across a broad range of input data, enabling probabilistic analysis of the interplay between node data statistics and graph topology perturbations.
Result: Extensive experiments validated the theoretical findings and demonstrated benefits over existing baselines in terms of representation stability and adversarial attacks on downstream tasks.
Conclusion: The proposed distribution-aware stability formulation is practically significant and highlights the importance of incorporating data distribution into stability analysis for GCNNs.
Abstract: Graph convolutional neural networks (GCNNs) have emerged as powerful tools for analyzing graph-structured data, achieving remarkable success across diverse applications. However, the theoretical understanding of the stability of these models, i.e., their sensitivity to small changes in the graph structure, remains in rather limited settings, hampering the development and deployment of robust and trustworthy models in practice. To fill this gap, we study how perturbations in the graph topology affect GCNN outputs and propose a novel formulation for analyzing model stability. Unlike prior studies that focus only on worst-case perturbations, our distribution-aware formulation characterizes output perturbations across a broad range of input data. This way, our framework enables, for the first time, a probabilistic perspective on the interplay between the statistical properties of the node data and perturbations in the graph topology. We conduct extensive experiments to validate our theoretical findings and demonstrate their benefits over existing baselines, in terms of both representation stability and adversarial attacks on downstream tasks. Our results demonstrate the practical significance of the proposed formulation and highlight the importance of incorporating data distribution into stability analysis.
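The distribution-aware idea can be caricatured with a Monte Carlo probe: instead of searching for a worst-case edge perturbation, average the output change of a toy GCN layer over random node features and random single-edge flips. A rough numpy sketch (my construction, not the paper's analytical framework):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 8
A = np.triu((rng.random((n, n)) < 0.2).astype(float), 1)
A = A + A.T                                   # random undirected graph
W = rng.standard_normal((d, d)) / np.sqrt(d)  # fixed layer weights

def gcn_layer(A, X, W):
    # One graph convolution with self-loops and symmetric normalization.
    deg = A.sum(1) + 1.0
    A_hat = (A + np.eye(len(A))) / np.sqrt(np.outer(deg, deg))
    return np.tanh(A_hat @ X @ W)

# Distribution-aware probe: average the output change over random node
# features AND random single-edge flips, rather than a worst case.
changes = []
for _ in range(200):
    X = rng.standard_normal((n, d))
    i, j = rng.choice(n, size=2, replace=False)
    A_pert = A.copy()
    A_pert[i, j] = A_pert[j, i] = 1.0 - A_pert[i, j]
    changes.append(np.linalg.norm(gcn_layer(A_pert, X, W) - gcn_layer(A, X, W)))
print("mean output perturbation:", round(float(np.mean(changes)), 3))
```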
[1048] Kuramoto Orientation Diffusion Models
Yue Song, T. Anderson Keller, Sevan Brodjian, Takeru Miyato, Yisong Yue, Pietro Perona, Max Welling
Main category: cs.LG
TL;DR: A score-based generative model using Kuramoto synchronization dynamics for orientation-rich images like fingerprints and textures, achieving improved generation quality through biologically inspired phase synchronization.
Details
Motivation: Standard isotropic Euclidean diffusion struggles with orientation-rich images that exhibit coherent angular directional patterns. Biological phase synchronization in neural systems provides inspiration for better modeling such structured data.
Method: Proposes stochastic Kuramoto dynamics for the diffusion process, using forward synchronization among phase variables and reverse desynchronization with a learned score function. Implements wrapped Gaussian transition kernels and periodicity-aware networks for circular geometry.
Result: Achieves competitive results on general image benchmarks and significantly improves generation quality on orientation-dense datasets like fingerprints and textures.
Conclusion: Biologically inspired synchronization dynamics show promise as structured priors in generative modeling, particularly for orientation-rich image generation.
Abstract: Orientation-rich images, such as fingerprints and textures, often exhibit coherent angular directional patterns that are challenging to model using standard generative approaches based on isotropic Euclidean diffusion. Motivated by the role of phase synchronization in biological systems, we propose a score-based generative model built on periodic domains by leveraging stochastic Kuramoto dynamics in the diffusion process. In neural and physical systems, Kuramoto models capture synchronization phenomena across coupled oscillators – a behavior that we re-purpose here as an inductive bias for structured image generation. In our framework, the forward process performs synchronization among phase variables through globally or locally coupled oscillator interactions and attraction to a global reference phase, gradually collapsing the data into a low-entropy von Mises distribution. The reverse process then performs desynchronization, generating diverse patterns by reversing the dynamics with a learned score function. This approach enables structured destruction during forward diffusion and a hierarchical generation process that progressively refines global coherence into fine-scale details. We implement wrapped Gaussian transition kernels and periodicity-aware networks to account for the circular geometry. Our method achieves competitive results on general image benchmarks and significantly improves generation quality on orientation-dense datasets like fingerprints and textures. Ultimately, this work demonstrates the promise of biologically inspired synchronization dynamics as structured priors in generative modeling.
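The forward "synchronization" process can be sketched as a discretized Kuramoto SDE with global coupling, attraction to a reference phase, and wrapped noise; the order parameter R = |mean e^{iθ}| rises toward 1 as the phases collapse. A toy sketch (the constants and the Euler-Maruyama discretization are my assumptions):

```python
import numpy as np

def kuramoto_forward(theta, steps=300, dt=0.05, K=1.0, lam=0.5,
                     mu=0.0, sigma=0.3, seed=0):
    """Toy forward 'synchronization' process: globally coupled Kuramoto
    drift plus attraction to a reference phase mu, with wrapped noise."""
    rng = np.random.default_rng(seed)
    theta = theta.copy()
    for _ in range(steps):
        # Global coupling: mean pairwise sine interaction sin(theta_j - theta_i).
        coupling = K * np.sin(theta[None, :] - theta[:, None]).mean(axis=1)
        drift = coupling + lam * np.sin(mu - theta)
        theta = theta + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(theta.shape)
        theta = np.mod(theta + np.pi, 2 * np.pi) - np.pi   # wrap to [-pi, pi)
    return theta

order = lambda th: np.abs(np.exp(1j * th).mean())   # R ~ 1 when synchronized
theta0 = np.random.default_rng(1).uniform(-np.pi, np.pi, 128)
theta_T = kuramoto_forward(theta0)
print("order parameter before/after:", order(theta0).round(2), order(theta_T).round(2))
```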
[1049] Imbalanced Gradients in RL Post-Training of Multi-Task LLMs
Runzhe Wu, Ankur Samanta, Ayush Jain, Scott Fujimoto, Jeongyeol Kwon, Ben Kretzu, Youliang Yu, Kaveh Hassani, Boris Vidolov, Yonathan Efroni
Main category: cs.LG
TL;DR: Multi-task RL post-training of LLMs suffers from gradient imbalance where certain tasks produce much larger gradients, biasing optimization toward those tasks despite not necessarily providing better learning gains.
Details
Motivation: Current multi-task post-training assumes all tasks contribute similar gradient magnitudes, but this assumption fails in RL settings, leading to biased optimization that doesn't necessarily improve performance on the most important tasks.
Method: The paper analyzes gradient magnitudes across different tasks during RL post-training and compares them with actual learning gains, examining whether larger gradients correlate with better performance improvements.
Result: Large-gradient tasks don’t necessarily yield better learning gains - they can achieve similar or even lower performance improvements than small-gradient tasks. Gradient imbalances can’t be explained by typical training statistics like rewards or advantages.
Conclusion: Naive dataset mixing in multi-task RL post-training is problematic due to gradient imbalances. Future work should develop principled gradient-level correction methods for LLMs to address this optimization bias.
Abstract: Multi-task post-training of large language models (LLMs) is typically performed by mixing datasets from different tasks and optimizing them jointly. This approach implicitly assumes that all tasks contribute gradients of similar magnitudes; when this assumption fails, optimization becomes biased toward large-gradient tasks. In this paper, however, we show that this assumption fails in RL post-training: certain tasks produce significantly larger gradients, thus biasing updates toward those tasks. Such gradient imbalance would be justified only if larger gradients implied larger learning gains on the tasks (i.e., larger performance improvements) – but we find this is not true. Large-gradient tasks can achieve similar or even much lower learning gains than small-gradient ones. Further analyses reveal that these gradient imbalances cannot be explained by typical training statistics such as training rewards or advantages, suggesting that they arise from the inherent differences between tasks. This cautions against naive dataset mixing and calls for future work on principled gradient-level corrections for LLMs.
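The basic diagnostic is simply to compare per-task gradient norms under a shared model. A toy numpy sketch (synthetic linear tasks stand in for RL post-training; the scale difference between the two "tasks" is contrived to make the imbalance visible):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(16)  # shared model parameters (toy linear model)

def task_grad(w, X, y):
    # Gradient of the mean squared error for the toy linear model.
    return 2 * X.T @ (X @ w - y) / len(y)

# Two synthetic "tasks" whose inputs differ in scale, mimicking how task
# structure alone can inflate gradient magnitudes.
tasks = {
    "math": (5.0 * rng.standard_normal((64, 16)), rng.standard_normal(64)),
    "code": (0.5 * rng.standard_normal((64, 16)), rng.standard_normal(64)),
}
for name, (X, y) in tasks.items():
    print(f"{name} grad norm: {np.linalg.norm(task_grad(w, X, y)):.3f}")
```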
[1050] Assessing the Completeness of Traffic Scenario Categories for Automated Highway Driving Functions via Cluster-based Analysis
Niklas Roßberg, Marion Neumeier, Sinan Hasirlioglu, Mohamed Essayed Bouzouraa, Michael Botsch
Main category: cs.LG
TL;DR: This paper introduces a CVQ-VAE pipeline for clustering highway traffic scenarios to support safe Automated Driving Systems deployment, analyzing the trade-off between cluster quality and data completeness.
Details
Motivation: To ensure safe release of Automated Driving Systems by precisely understanding traffic scenarios through clustering and analyzing scenario category completeness.
Method: Uses Clustering Vector Quantized - Variational Autoencoder (CVQ-VAE) for highway traffic scenario clustering, creating catalogs with different numbers of categories, and analyzes the impact of category count on completeness.
Result: Shows clustering performance that outperforms previous work and discusses the trade-off between cluster quality and the amount of data required to maintain completeness, based on the highD dataset.
Conclusion: The CVQ-VAE approach effectively clusters traffic scenarios and provides insights into the balance between clustering granularity and data completeness for ADS safety validation.
Abstract: The ability to operate safely in increasingly complex traffic scenarios is a fundamental requirement for Automated Driving Systems (ADS). Ensuring the safe release of ADS functions necessitates a precise understanding of the occurring traffic scenarios. To support this objective, this work introduces a pipeline for traffic scenario clustering and the analysis of scenario category completeness. The Clustering Vector Quantized - Variational Autoencoder (CVQ-VAE) is employed for the clustering of highway traffic scenarios and utilized to create various catalogs with differing numbers of traffic scenario categories. Subsequently, the impact of the number of categories on the completeness considerations of the traffic scenario categories is analyzed. The results show clustering performance that outperforms previous work. The trade-off between cluster quality and the amount of required data to maintain completeness is discussed based on the publicly available highD dataset.
[1051] Statistically Valid Post-Deployment Monitoring Should Be Standard for AI-Based Digital Health
Pavel Dolin, Weizhi Li, Gautam Dasarathy, Visar Berisha
Main category: cs.LG
TL;DR: This paper argues for statistically valid and label-efficient testing frameworks as a principled foundation for post-deployment monitoring in clinical AI, addressing current underdeveloped practices.
Details
Motivation: Only 9% of FDA-registered AI healthcare tools have post-deployment surveillance plans, and existing monitoring approaches are manual, sporadic, and reactive - ill-suited for dynamic clinical environments.
Method: Proposes framing detection of data changes and model performance degradation as distinct statistical hypothesis testing problems with explicit error rate guarantees and formal inference capabilities.
Result: Establishes a framework for statistically rigorous monitoring that provides reproducibility, scientific soundness, and aligns with regulatory requirements for clinical AI systems.
Conclusion: Grounding post-deployment monitoring in statistical rigor ensures reliability and opens new research directions for principled detection, attribution, and mitigation of model failures in real-world clinical settings.
Abstract: This position paper argues that post-deployment monitoring in clinical AI is underdeveloped and proposes statistically valid and label-efficient testing frameworks as a principled foundation for ensuring reliability and safety in real-world deployment. A recent review found that only 9% of FDA-registered AI-based healthcare tools include a post-deployment surveillance plan. Existing monitoring approaches are often manual, sporadic, and reactive, making them ill-suited for the dynamic environments in which clinical models operate. We contend that post-deployment monitoring should be grounded in label-efficient and statistically valid testing frameworks, offering a principled alternative to current practices. We use the term “statistically valid” to refer to methods that provide explicit guarantees on error rates (e.g., Type I/II error), enable formal inference under pre-defined assumptions, and support reproducibility–features that align with regulatory requirements. Specifically, we propose that the detection of changes in the data and model performance degradation should be framed as distinct statistical hypothesis testing problems. Grounding monitoring in statistical rigor ensures a reproducible and scientifically sound basis for maintaining the reliability of clinical AI systems. Importantly, it also opens new research directions for the technical community–spanning theory, methods, and tools for statistically principled detection, attribution, and mitigation of post-deployment model failures in real-world settings.
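As a concrete instance of "statistically valid" monitoring (my example, not the paper's specific framework), degradation detection can be framed as a one-sided exact binomial test on the post-deployment error rate, with an explicit Type I error guarantee:

```python
from scipy.stats import binom

# H0: the deployed model's error rate is still <= the validated rate p0.
# H1: the error rate has degraded above p0. Type I error controlled at alpha.
p0, alpha = 0.10, 0.05
n_labeled, n_errors = 200, 31   # labels collected post-deployment

# One-sided exact binomial test: P(X >= n_errors | error rate = p0).
p_value = binom.sf(n_errors - 1, n_labeled, p0)
print(f"p-value = {p_value:.4f} ->",
      "flag degradation" if p_value < alpha else "no alarm")
```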
[1052] ShapeX: Shapelet-Driven Post Hoc Explanations for Time Series Classification Models
Bosong Huang, Ming Jin, Yuxuan Liang, Johan Barthelemy, Debo Cheng, Qingsong Wen, Chenghao Liu, Shirui Pan
Main category: cs.LG
TL;DR: ShapeX is a novel framework for explaining time series classification models by identifying key shapelets (subsequences) and using Shapley values to assess their importance, outperforming existing methods in precision and causal fidelity.
Details
Motivation: Current post-hoc time series explanation methods focus on timestep-level feature attribution but overlook that classification outcomes are primarily driven by key shapelets, which is crucial for transparency in high-stakes applications like healthcare and finance.
Method: ShapeX segments time series into shapelet-driven segments using the Shapelet Describe-and-Detect (SDD) framework to learn diverse shapelets, then employs Shapley values to evaluate their saliency for explanation.
Result: Experimental results on synthetic and real-world datasets show ShapeX outperforms existing methods in identifying relevant subsequences, improving both precision and causal fidelity of explanations.
Conclusion: ShapeX effectively bridges the gap in time series explanation by focusing on shapelet-level attribution, producing explanations that reveal causal relationships rather than just correlations, making it valuable for high-stakes applications requiring transparency.
Abstract: Explaining time series classification models is crucial, particularly in high-stakes applications such as healthcare and finance, where transparency and trust play a critical role. Although numerous time series classification methods have identified key subsequences, known as shapelets, as core features for achieving state-of-the-art performance and validating their pivotal role in classification outcomes, existing post-hoc time series explanation (PHTSE) methods primarily focus on timestep-level feature attribution. These explanation methods overlook the fundamental prior that classification outcomes are predominantly driven by key shapelets. To bridge this gap, we present ShapeX, an innovative framework that segments time series into meaningful shapelet-driven segments and employs Shapley values to assess their saliency. At the core of ShapeX lies the Shapelet Describe-and-Detect (SDD) framework, which effectively learns a diverse set of shapelets essential for classification. We further demonstrate that ShapeX produces explanations which reveal causal relationships instead of just correlations, owing to the atomicity properties of shapelets. Experimental results on both synthetic and real-world datasets demonstrate that ShapeX outperforms existing methods in identifying the most relevant subsequences, enhancing both the precision and causal fidelity of time series explanations.
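Attribution at the segment level can be sketched with Monte Carlo Shapley values: reveal segments in random order and average each segment's marginal effect on the model score. In ShapeX the segments come from learned shapelets (the SDD framework); this toy uses fixed segments, mean-imputation for absent segments, and a contrived scoring function:

```python
import numpy as np

def mc_shapley_segments(f, x, seg_bounds, n_perm=200, seed=0):
    """Monte Carlo Shapley attribution over time-series segments.
    f scores a (T,) series; absent segments are mean-imputed."""
    rng = np.random.default_rng(seed)
    k = len(seg_bounds)
    base = np.full_like(x, x.mean())
    phi = np.zeros(k)
    for _ in range(n_perm):
        order = rng.permutation(k)
        cur, prev = base.copy(), f(base)
        for s in order:
            lo, hi = seg_bounds[s]
            cur[lo:hi] = x[lo:hi]      # reveal segment s
            now = f(cur)
            phi[s] += now - prev       # marginal contribution
            prev = now
    return phi / n_perm

# Toy "classifier": responds only to the bump in the middle of the series.
x = np.zeros(90); x[30:60] = 1.0
score = lambda s: s[30:60].mean()
print(mc_shapley_segments(score, x, [(0, 30), (30, 60), (60, 90)]).round(3))
```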
[1053] Membership Inference Attacks for Unseen Classes
Pratiksha Thaker, Neil Kale, Zhiwei Steven Wu, Virginia Smith
Main category: cs.LG
TL;DR: Shadow model-based membership inference attacks fail when data access is restricted, but quantile regression attacks succeed by learning generalizable features from member examples.
Details
Motivation: Existing membership inference attacks rely on shadow models that require access to data similar to the target model's training data, which is problematic in AI safety applications with restricted data access.
Method: Proposed quantile regression attacks that learn features from member examples that can generalize to unseen classes, without requiring shadow models or access to similar data distributions.
Result: Quantile regression attacks achieve up to 11x higher true positive rates compared to shadow model-based approaches in restricted data access scenarios.
Conclusion: Shadow model-based MIAs have fundamental limitations in real-world AI safety applications, while quantile regression attacks provide a more robust alternative that works with restricted data access.
Abstract: The state-of-the-art for membership inference attacks on machine learning models is a class of attacks based on shadow models that mimic the behavior of the target model on subsets of held-out nonmember data. However, we find that this class of attacks is fundamentally limited because of a key assumption – that the shadow models can replicate the target model’s behavior on the distribution of interest. As a result, we show that attacks relying on shadow models can fail catastrophically on critical AI safety applications where data access is restricted due to legal, ethical, or logistical constraints, so that the shadow models have no reasonable signal on the query examples. Although this problem seems intractable within the shadow model paradigm, we find that quantile regression attacks are a promising approach in this setting, as these models learn features of member examples that can generalize to unseen classes. We demonstrate this both empirically and theoretically, showing that quantile regression attacks achieve up to 11x the TPR of shadow model-based approaches in practice, and providing a theoretical model that outlines the generalization properties required for this approach to succeed. Our work identifies an important failure mode in existing MIAs and provides a cautionary tale for practitioners that aim to directly use existing tools for real-world applications of AI safety.
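The quantile regression attack can be sketched as follows: fit a regressor for a high quantile of the target model's score on known nonmembers as a function of example features, then flag queries whose actual score exceeds their predicted per-example threshold. All data below is synthetic and the feature/score relationship is contrived:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Nonmember calibration set: per-example features and the target model's
# confidence score on each (synthetic stand-ins here).
X_nonmem = rng.standard_normal((500, 8))
score_nonmem = 0.3 * X_nonmem[:, 0] + 0.5 * rng.standard_normal(500)

# Fit the 95th percentile of nonmember scores given features; this sets a
# per-example threshold with roughly a 5% false positive rate.
qreg = GradientBoostingRegressor(loss="quantile", alpha=0.95)
qreg.fit(X_nonmem, score_nonmem)

# Query examples (possibly from classes never seen at calibration time):
X_query = rng.standard_normal((5, 8))
score_query = 0.3 * X_query[:, 0] + 1.5   # members tend to score higher
print(score_query > qreg.predict(X_query))  # True = flagged as member
```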
[1054] Optimal Rates in Continual Linear Regression via Increasing Regularization
Ran Levinstein, Amit Attia, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, Itay Evron
Main category: cs.LG
TL;DR: The paper analyzes continual linear regression under random task orderings and shows that using explicit ℓ₂ regularization or implicit regularization via finite step budgets can achieve near-optimal O(log k/k) or optimal O(1/k) convergence rates, closing the gap from prior O(1/k^{1/4}) results.
Details
Motivation: There exists a significant gap between the worst-case lower bound of Ω(1/k) and prior upper bound of O(1/k^{1/4}) for realizable continual linear regression under random task orderings, motivating the investigation of regularization schemes to narrow or close this gap.
Method: The paper uses two regularization approaches: (1) explicit isotropic ℓ₂ regularization, and (2) implicit regularization via finite step budgets. These are shown to reduce to stochastic gradient descent on carefully defined surrogate losses. A generalized variant of SGD for time-varying functions is also analyzed.
Result: With fixed regularization strength, the method achieves a near-optimal rate of O(log k/k). Using an increasing regularization strength schedule, the method provably achieves the optimal rate of O(1/k), matching the lower bound.
Conclusion: Regularization schemes, particularly with increasing regularization strength or decreasing steps per task, are beneficial for continual learning and can achieve optimal convergence rates in the worst case, effectively mitigating forgetting in continual linear regression.
Abstract: We study realizable continual linear regression under random task orderings, a common setting for developing continual learning theory. In this setup, the worst-case expected loss after $k$ learning iterations admits a lower bound of $\Omega(1/k)$. However, prior work using an unregularized scheme has only established an upper bound of $O(1/k^{1/4})$, leaving a significant gap. Our paper proves that this gap can be narrowed, or even closed, using two frequently used regularization schemes: (1) explicit isotropic $\ell_2$ regularization, and (2) implicit regularization via finite step budgets. We show that these approaches, which are used in practice to mitigate forgetting, reduce to stochastic gradient descent (SGD) on carefully defined surrogate losses. Through this lens, we identify a fixed regularization strength that yields a near-optimal rate of $O(\log k / k)$. Moreover, formalizing and analyzing a generalized variant of SGD for time-varying functions, we derive an increasing regularization strength schedule that provably achieves an optimal rate of $O(1/k)$. This suggests that schedules that increase the regularization coefficient or decrease the number of steps per task are beneficial, at least in the worst case.
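The explicit-regularization scheme is easy to sketch: each task is solved in closed form with an ℓ₂ anchor at the previous iterate, and the regularization strength grows over tasks. The schedule λ_k = k below is an illustrative assumption, not necessarily the paper's exact schedule:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
w_star = rng.standard_normal(d)   # shared realizable solution
w = np.zeros(d)

for k in range(1, 201):
    A = rng.standard_normal((5, d))   # random under-determined task
    b = A @ w_star                    # realizable: no label noise
    lam = float(k)                    # increasing regularization schedule
    # w_k = argmin ||A w - b||^2 + lam ||w - w_{k-1}||^2 (closed form):
    w = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b + lam * w)

print("distance to w*:", np.linalg.norm(w - w_star).round(4))
```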
[1055] Assessing the Feasibility of Early Cancer Detection Using Routine Laboratory Data: An Evaluation of Machine Learning Approaches on an Imbalanced Dataset
Shumin Li
Main category: cs.LG
TL;DR: This study evaluates cancer risk classification in dogs using routine lab data from the Golden Retriever Lifetime Study, finding that while a detectable cancer signal exists, it’s too weak for reliable clinical use due to confounding with normal aging and inflammation.
Details
Motivation: To develop accessible screening tools for early cancer detection in dogs using routine laboratory data, addressing challenges of non-specific biomarkers and severe class imbalance in screening populations.
Method: Comprehensive benchmark evaluation of 126 analytical pipelines combining machine learning models, feature selection methods, and data balancing techniques using the GRLS cohort, with patient-level data partitioning to prevent leakage.
Result: Optimal model (Logistic Regression with class weighting and recursive feature elimination) showed moderate ranking ability (AUROC=0.815) but poor clinical performance (F1-score=0.25, PPV=0.15). High NPV (0.98) but insufficient recall (0.79) for reliable rule-out testing. SHAP analysis revealed predictions driven by non-specific features like age and inflammation markers.
Conclusion: While a statistically detectable cancer signal exists in routine lab data, it’s too weak and confounded for clinically reliable discrimination from normal aging or inflammatory conditions. Multi-modal data integration is needed for meaningful progress in computational veterinary oncology.
Abstract: The development of accessible screening tools for early cancer detection in dogs represents a significant challenge in veterinary medicine. Routine laboratory data offer a promising, low-cost source for such tools, but their utility is hampered by the non-specificity of individual biomarkers and the severe class imbalance inherent in screening populations. This study assesses the feasibility of cancer risk classification using the Golden Retriever Lifetime Study (GRLS) cohort under real-world constraints, including the grouping of diverse cancer types and the inclusion of post-diagnosis samples. A comprehensive benchmark evaluation was conducted, systematically comparing 126 analytical pipelines that comprised various machine learning models, feature selection methods, and data balancing techniques. Data were partitioned at the patient level to prevent leakage. The optimal model, a Logistic Regression classifier with class weighting and recursive feature elimination, demonstrated moderate ranking ability (AUROC = 0.815; 95% CI: 0.793-0.836) but poor clinical classification performance (F1-score = 0.25, Positive Predictive Value = 0.15). While a high Negative Predictive Value (0.98) was achieved, insufficient recall (0.79) precludes its use as a reliable rule-out test. Interpretability analysis with SHapley Additive exPlanations (SHAP) revealed that predictions were driven by non-specific features like age and markers of inflammation and anemia. It is concluded that while a statistically detectable cancer signal exists in routine lab data, it is too weak and confounded for clinically reliable discrimination from normal aging or other inflammatory conditions. This work establishes a critical performance ceiling for this data modality in isolation and underscores that meaningful progress in computational veterinary oncology will require integration of multi-modal data sources.
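The winning pipeline (class-weighted logistic regression with recursive feature elimination) is straightforward to reproduce in sklearn. This sketch runs on synthetic imbalanced data and ignores the paper's patient-level partitioning and GRLS specifics:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for imbalanced screening data (~5% positives).
X, y = make_classification(n_samples=4000, n_features=30, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Class weighting counters imbalance; RFE prunes uninformative features.
clf = RFE(LogisticRegression(class_weight="balanced", max_iter=1000),
          n_features_to_select=10).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]
print("AUROC:", round(roc_auc_score(y_te, prob), 3),
      "F1:", round(f1_score(y_te, clf.predict(X_te)), 3))
```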
[1056] TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval
Jialin Chen, Ziyu Zhao, Gaukhar Nurbek, Aosong Feng, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, Rex Ying
Main category: cs.LG
TL;DR: TRACE is a multimodal retriever that aligns time-series data with textual context, enabling semantic cross-modal retrieval and serving as a powerful encoder for downstream tasks.
Details
Motivation: There's a growing need for effective interpretation and retrieval of time-series data across domains like weather and healthcare, but existing methods lack semantic grounding, struggle with modality alignment, and have limited multi-channel handling capabilities.
Method: TRACE grounds time-series embeddings in aligned textual context, enables fine-grained channel-level alignment, uses hard negative mining for semantic retrieval, and supports flexible cross-modal retrieval modes (Text-to-Timeseries and Timeseries-to-Text).
Result: TRACE achieves state-of-the-art performance on downstream forecasting and classification tasks, improves predictive accuracy and interpretability, and serves as both an effective encoder and general-purpose retriever.
Conclusion: TRACE addresses the gap in time-series retrieval by providing semantic grounding and cross-modal alignment, demonstrating dual utility as both a retrieval engine and standalone encoder across multiple domains.
Abstract: The ubiquity of dynamic data in domains such as weather, healthcare, and energy underscores a growing need for effective interpretation and retrieval of time-series data. These data are inherently tied to domain-specific contexts, such as clinical notes or weather narratives, making cross-modal retrieval essential not only for downstream tasks but also for developing robust time-series foundation models by retrieval-augmented generation (RAG). Despite the increasing demand, time-series retrieval remains largely underexplored. Existing methods often lack semantic grounding, struggle to align heterogeneous modalities, and have limited capacity for handling multi-channel signals. To address this gap, we propose TRACE, a generic multimodal retriever that grounds time-series embeddings in aligned textual context. TRACE enables fine-grained channel-level alignment and employs hard negative mining to facilitate semantically meaningful retrieval. It supports flexible cross-modal retrieval modes, including Text-to-Timeseries and Timeseries-to-Text, effectively linking linguistic descriptions with complex temporal patterns. By retrieving semantically relevant pairs, TRACE enriches downstream models with informative context, leading to improved predictive accuracy and interpretability. Beyond a static retrieval engine, TRACE also serves as a powerful standalone encoder, with lightweight task-specific tuning that refines context-aware representations while maintaining strong cross-modal alignment. These representations achieve state-of-the-art performance on downstream forecasting and classification tasks. Extensive experiments across multiple domains highlight its dual utility, as both an effective encoder for downstream applications and a general-purpose retriever to enhance time-series models.
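The cross-modal alignment objective can be sketched as a symmetric in-batch contrastive (InfoNCE) loss between paired time-series and text embeddings; TRACE additionally uses hard negative mining and channel-level alignment, which this toy omits:

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def info_nce(ts_emb, txt_emb, temp=0.07):
    """Symmetric in-batch contrastive loss between paired time-series and
    text embeddings; row i of each matrix describes the same sample."""
    ts = ts_emb / np.linalg.norm(ts_emb, axis=1, keepdims=True)
    tx = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = ts @ tx.T / temp              # off-diagonals act as negatives
    idx = np.arange(len(ts))
    return -(log_softmax(logits)[idx, idx].mean()
             + log_softmax(logits.T)[idx, idx].mean()) / 2

rng = np.random.default_rng(0)
print("loss on random pairs:",
      info_nce(rng.standard_normal((16, 32)),
               rng.standard_normal((16, 32))).round(3))
```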
[1057] DB-FGA-Net: Dual Backbone Frequency Gated Attention Network for Multi-Class Brain Tumor Classification with Grad-CAM Interpretability
Saraf Anzum Shreya, MD. Abu Ismail Siddique, Sharaf Tasnim
Main category: cs.LG
TL;DR: Proposes DB-FGA-Net, a double-backbone network that integrates VGG16 and Xception with a Frequency-Gated Attention block for brain tumor classification without data augmentation, achieving state-of-the-art performance with interpretable Grad-CAM visualization.
Details
Motivation: Deep learning-based brain tumor classification methods often rely on heavy data augmentation which limits generalization and trust in clinical applications. Need for augmentation-free, interpretable models for reliable clinical translation.
Method: Double-backbone network integrating VGG16 and Xception with Frequency-Gated Attention (FGA) Block to capture complementary local and global features. Uses Grad-CAM for visualization and includes GUI for real-time classification.
Result: Achieves 99.24% accuracy on 7K-DS dataset for 4-class, 98.68% for 3-class, and 99.85% for 2-class settings. Generalizes with 95.77% accuracy on independent 3K-DS dataset, outperforming baseline and state-of-the-art methods.
Conclusion: Augmentation-free, interpretable, and deployable deep learning models like DB-FGA-Net hold strong potential for reliable clinical translation in brain tumor diagnosis, bridging the gap between model prediction and clinical interpretability.
Abstract: Brain tumors are a challenging problem in neuro-oncology, where early and precise diagnosis is important for successful treatment. Deep learning-based brain tumor classification methods often rely on heavy data augmentation which can limit generalization and trust in clinical applications. In this paper, we propose a double-backbone network integrating VGG16 and Xception with a Frequency-Gated Attention (FGA) Block to capture complementary local and global features. Unlike previous studies, our model achieves state-of-the-art performance without augmentation which demonstrates robustness to variably sized and distributed datasets. For further transparency, Grad-CAM is integrated to visualize the tumor regions based on which the model is giving prediction, bridging the gap between model prediction and clinical interpretability. The proposed framework achieves 99.24% accuracy on the 7K-DS dataset for the 4-class setting, along with 98.68% and 99.85% in the 3-class and 2-class settings, respectively. On the independent 3K-DS dataset, the model generalizes with 95.77% accuracy, outperforming baseline and state-of-the-art methods. To further support clinical usability, we developed a graphical user interface (GUI) that provides real-time classification and Grad-CAM-based tumor localization. These findings suggest that augmentation-free, interpretable, and deployable deep learning models such as DB-FGA-Net hold strong potential for reliable clinical translation in brain tumor diagnosis.
[1058] Boost Post-Training Quantization via Null Space Optimization for Large Language Models
Jiaqi Zhao, Miao Zhang, Deng Xiang, Ming Li, Weili Guan, Liqiang Nie
Main category: cs.LG
TL;DR: This paper introduces null space projection (Q2N) to reduce quantization error in LLMs by constraining weight perturbations within the null space of input activations.
Details
Motivation: Existing quantization methods show diminishing returns, suggesting current strategies are insufficient for developing more compressed models. The paper aims to inspire new research directions by introducing null space concepts to LLM quantization.
Method: Proposes Q2N - a plug-and-play null space projection module that includes: 1) an efficient null space projection approximation method tailored for LLMs, and 2) a theoretical derivation of closed-form solution for equivalent vectors that avoids additional memory overhead.
Result: Extensive experiments on state-of-the-art LLMs (LLaMA3, DeepSeek, Qwen3) demonstrate the effectiveness of both Q2N and the null space optimization perspective for LLM quantization.
Conclusion: This work represents the first step in alleviating quantization error using null space insights, aiming to inspire future researchers to design more advanced quantization methods.
Abstract: Existing post-training quantization methods for large language models (LLMs) offer remarkable success. However, the increasingly marginal performance gains suggest that existing quantization strategies are insufficient to support the development of more compressed models. To inspire new directions for future research, this paper introduces the concept of null space into LLM quantization. We argue that the quantization error can be effectively alleviated by constraining the post-quantization weight perturbation to lie within the null space of input activations. To prove this idea, we propose Q2N, a plug-and-play null space projection module for existing milestone PTQ baselines. Specifically, we first design an efficient and accurate null space projection approximation method tailored to the characteristics of LLMs. Subsequently, we theoretically derive a closed-form solution for an equivalent vector of the obtained projection matrix, which satisfies practical inference conditions while avoiding additional memory overhead. Extensive experiments are conducted on various state-of-the-art LLMs (LLaMA3, DeepSeek, Qwen3) and baselines, demonstrating the effectiveness of both our Q2N and the perspective of null space optimization for LLM quantization. We view this paper as a first step toward further alleviating quantization error based on null-space insights, and hope it inspires future researchers to design more advanced quantization methods. Code is available at https://github.com/zjq0455/q2n.
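The central idea, that a perturbation confined to the null space of the calibration activations leaves layer outputs unchanged, can be verified directly with an SVD-based projector. This is a naive numpy illustration, not the paper's efficient approximation or its closed-form equivalent vector:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 32, 64, 16
X = rng.standard_normal((n, d_in))               # calibration activations
dW = 0.1 * rng.standard_normal((d_out, d_in))    # quantization perturbation

# Projector onto null(X): directions the calibration activations never excite.
_, s, Vt = np.linalg.svd(X, full_matrices=True)
r = int((s > 1e-10 * s[0]).sum())                # numerical rank of X
P = np.eye(d_in) - Vt[:r].T @ Vt[:r]

dW_null = dW @ P                                 # keep only the harmless part
print("output change, raw      :", np.linalg.norm(X @ dW.T).round(3))
print("output change, projected:", np.linalg.norm(X @ dW_null.T).round(6))
```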
[1059] In Defense of Defensive Forecasting
Juan Carlos Perdomo, Benjamin Recht
Main category: cs.LG
TL;DR: A survey of Defensive Forecasting algorithms that derive predictions by correcting past mistakes rather than prognostication, presenting elementary theory and near-optimal algorithms for various online learning tasks.
Details
Motivation: To provide a comprehensive introduction to Defensive Forecasting theory and demonstrate its practical applications in online learning scenarios where predictions need to be robust against any possible outcomes.
Method: Frames prediction as a sequential game and derives predictions to minimize metrics regardless of outcomes. Presents elementary theory and simple algorithms based on correcting past mistakes.
Result: Developed simple, near-optimal algorithms for online learning, calibration, prediction with expert advice, and online conformal prediction through the Defensive Forecasting framework.
Conclusion: Defensive Forecasting provides a powerful framework for deriving robust predictions by focusing on correcting past errors rather than making forecasts, yielding effective algorithms for various online learning problems.
Abstract: This tutorial provides a survey of algorithms for Defensive Forecasting, where predictions are derived not by prognostication but by correcting past mistakes. Pioneered by Vovk, Defensive Forecasting frames the goal of prediction as a sequential game, and derives predictions to minimize metrics no matter what outcomes occur. We present an elementary introduction to this general theory and derive simple, near-optimal algorithms for online learning, calibration, prediction with expert advice, and online conformal prediction.
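A rough sketch of the defensive-forecasting idea, in the spirit of Vovk's K29 algorithm (the kernel choice, grid, and constants are my assumptions): at each round, choose the forecast p at which the kernel-weighted sum of past forecast errors changes sign, so that errors near any forecast level cancel out over time:

```python
import numpy as np

def k29_predict(ps, ys, width=0.05):
    """One step of a K29-style defensive forecaster: pick p where past
    errors, kernel-weighted near p, cancel out."""
    grid = np.linspace(0, 1, 101)
    if len(ps) == 0:
        return 0.5
    # S(p) = sum_s K(p_s, p) (y_s - p_s): the skeptic's potential gain.
    K = np.exp(-((ps[:, None] - grid[None, :]) ** 2) / width)
    vals = (K * (ys - ps)[:, None]).sum(axis=0)
    if (vals > 0).all(): return 1.0   # under-forecasting everywhere
    if (vals < 0).all(): return 0.0   # over-forecasting everywhere
    return float(grid[np.argmin(np.abs(vals))])

rng = np.random.default_rng(0)
ps, ys = np.array([]), np.array([])
for t in range(300):
    p = k29_predict(ps, ys)
    y = float(rng.random() < 0.7)     # outcomes: Bernoulli(0.7)
    ps, ys = np.append(ps, p), np.append(ys, y)
print("mean forecast vs outcome rate:", ps.mean().round(3), ys.mean().round(3))
```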
[1060] Path-specific effects for pulse-oximetry guided decisions in critical care
Kevin Zhang, Yonghan Jung, Divyat Mahajan, Karthikeyan Shanmugam, Shalmali Joshi
Main category: cs.LG
TL;DR: This paper uses causal inference methods to investigate how racial bias in pulse oximeter readings affects invasive ventilation decisions in ICUs, finding minimal impact on ventilation rates but more pronounced effects on ventilation duration.
Details
Motivation: To address racial disparities in healthcare, specifically inaccurate pulse oximeter readings that overestimate oxygen saturation for dark-skinned patients, and to move beyond statistical correlations to establish causal relationships between measurement bias and clinical outcomes.
Method: Employed causal inference with path-specific effects to isolate racial bias impact, used doubly robust estimator with self-normalized variant for improved efficiency, and provided finite-sample guarantees. Validated on semi-synthetic data and applied to MIMIC-IV and eICU datasets.
Result: Minimal impact of racial discrepancies on invasive ventilation rates, but path-specific effects mediated by oxygen saturation disparity were more pronounced on ventilation duration, with severity varying by dataset.
Conclusion: The study provides a novel pipeline for investigating clinical decision-making disparities and emphasizes the necessity of causal methods for robust fairness assessment in healthcare.
Abstract: Identifying and measuring biases associated with sensitive attributes is a crucial consideration in healthcare to prevent treatment disparities. One prominent issue is inaccurate pulse oximeter readings, which tend to overestimate oxygen saturation for dark-skinned patients and misrepresent supplemental oxygen needs. Most existing research has revealed statistical disparities linking device measurement errors to patient outcomes in intensive care units (ICUs) without causal formalization. This study causally investigates how racial discrepancies in oximetry measurements affect invasive ventilation in ICU settings. We employ a causal inference-based approach using path-specific effects to isolate the impact of bias by race on clinical decision-making. To estimate these effects, we leverage a doubly robust estimator, propose its self-normalized variant for improved sample efficiency, and provide novel finite-sample guarantees. Our methodology is validated on semi-synthetic data and applied to two large real-world health datasets: MIMIC-IV and eICU. Contrary to prior work, our analysis reveals minimal impact of racial discrepancies on invasive ventilation rates. However, path-specific effects mediated by oxygen saturation disparity are more pronounced on ventilation duration, and the severity differs by dataset. Our work provides a novel pipeline for investigating potential disparities in clinical decision-making and, more importantly, highlights the necessity of causal methods to robustly assess fairness in healthcare.
[1061] Complexity Scaling Laws for Neural Models using Combinatorial Optimization
Lowell Weissman, Michael Krumdick, A. Lynn Abbott
Main category: cs.LG
TL;DR: The paper develops scaling laws based on problem complexity measures (solution space size and representation space size) using TSP as a case study, showing predictable suboptimality growth when scaling problem size.
Details
Motivation: To extend neural scaling laws beyond compute, model size, and dataset size by incorporating problem complexity measures, particularly for combinatorial optimization problems.
Method: Analyzed two complexity measures (solution space size and representation space size) using Traveling Salesman Problem as a case study. Examined scaling behavior across different model training approaches (reinforcement learning and supervised fine-tuning).
Result: Combinatorial optimization promotes smooth cost trends, enabling meaningful scaling laws even without interpretable loss. Suboptimality grows predictably when scaling TSP nodes or spatial dimensions, regardless of training method.
Conclusion: Problem complexity scaling laws can be established for combinatorial optimization, with similar trends observed in simpler gradient descent approaches, suggesting broader applicability of complexity-based scaling principles.
Abstract: Recent work on neural scaling laws demonstrates that model performance scales predictably with compute budget, model size, and dataset size. In this work, we develop scaling laws based on problem complexity. We analyze two fundamental complexity measures: solution space size and representation space size. Using the Traveling Salesman Problem (TSP) as a case study, we show that combinatorial optimization promotes smooth cost trends, and therefore meaningful scaling laws can be obtained even in the absence of an interpretable loss. We then show that suboptimality grows predictably for fixed-size models when scaling the number of TSP nodes or spatial dimensions, independent of whether the model was trained with reinforcement learning or supervised fine-tuning on a static dataset. We conclude with an analogy to problem complexity scaling in local search, showing that a much simpler gradient descent of the cost landscape produces similar trends.
[1062] MoORE: SVD-based Model MoE-ization for Conflict- and Oblivion-Resistant Multi-Task Adaptation
Shen Yuan, Yin Zheng, Taifeng Wang, Binbin Liu, Hongteng Xu
Main category: cs.LG
TL;DR: Proposes MoORE, a novel multi-task adaptation method using Mixture of Orthogonal Rank-one Experts to prevent task conflict and oblivion in foundation models.
Details
Motivation: To address task conflict and oblivion issues when adapting large foundation models to multi-task scenarios.
Method: Applies SVD to pre-trained model weights, introduces learnable router to adjust singular values, and creates orthogonal rank-one experts while maintaining original column space.
Result: Outperforms existing multi-task adaptation methods consistently across various datasets, showing superior conflict- and oblivion-resistance.
Conclusion: MoORE provides an effective solution for multi-task adaptation that maintains model performance while preventing task interference and forgetting.
Abstract: Adapting large-scale foundation models in multi-task scenarios often suffers from task conflict and oblivion. To mitigate such issues, we propose a novel "model MoE-ization" strategy that leads to a conflict- and oblivion-resistant multi-task adaptation method. Given a weight matrix of a pre-trained model, our method applies SVD to it and introduces a learnable router to adjust its singular values based on tasks and samples. Accordingly, the weight matrix becomes a Mixture of Orthogonal Rank-one Experts (MoORE), in which each expert corresponds to the outer product of a left singular vector and the corresponding right one. We can improve the model capacity by imposing a learnable orthogonal transform on the right singular vectors. Unlike low-rank adaptation (LoRA) and its MoE-driven variants, MoORE guarantees the experts' orthogonality and maintains the column space of the original weight matrix. These two properties make the adapted model resistant to the conflicts among the new tasks and the oblivion of its original tasks, respectively. Experiments on various datasets demonstrate that MoORE outperforms existing multi-task adaptation methods consistently, showing its superiority in terms of conflict- and oblivion-resistance. The code of the experiments is available at https://github.com/DaShenZi721/MoORE.
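The "MoE-ization" can be sketched directly from the abstract: SVD factors the weight matrix into orthogonal rank-one experts, and a router rescales the singular values per input. The random-projection router below is a stand-in for the learned one, and the learnable orthogonal transform on the right singular vectors is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 12, 8
W = rng.standard_normal((d_out, d_in))     # pre-trained weight matrix

# SVD exposes W as a sum of orthogonal rank-one "experts" u_i v_i^T.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
R = rng.standard_normal((len(s), d_in))    # stand-in for a learned router

def moore_forward(x):
    gates = 1 / (1 + np.exp(-R @ x))       # per-sample expert gates in (0, 1)
    # Mixture of Orthogonal Rank-one Experts: gate-modulated singular values.
    return U @ (s * gates * (Vt @ x))

x = rng.standard_normal(d_in)
print("adapted output :", moore_forward(x).round(3))
print("original output:", (W @ x).round(3))
```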
[1063] Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration
Junqi Gao, Zhichang Guo, Dazhi Zhang, Dong Li, Runze Liu, Pengfei Li, Kai Tian, Biqing Qi
Main category: cs.LG
TL;DR: Bohdi is a synthetic-data-only heterogeneous LLM fusion framework that uses hierarchical domain organization and adaptive sampling to overcome limitations of existing methods, achieving better performance and eliminating capability imbalance.
Details
Motivation: Existing heterogeneous LLM fusion methods rely on limited real data from specific domains and use fixed data allocation, preventing comprehensive knowledge acquisition and causing capability imbalance across domains.
Method: Organizes knowledge domains into hierarchical tree structure, uses multi-model collaboration for automatic domain exploration and data generation, and employs DynaBranches mechanism with Hierarchical Multi-Armed Bandit for adaptive sampling based on performance feedback.
Result: Significantly outperforms existing baselines on multiple target LLMs, shows higher data efficiency, and virtually eliminates capability imbalance across domains.
Conclusion: Bohdi effectively addresses key limitations in heterogeneous LLM fusion through synthetic data generation and adaptive domain sampling, demonstrating superior performance and balanced capabilities.
Abstract: Heterogeneous Large Language Model (LLM) fusion integrates the strengths of multiple source LLMs with different architectures into a target LLM with low computational overhead. While promising, existing methods suffer from two major limitations: 1) reliance on real data from limited domain for knowledge fusion, preventing the target LLM from fully acquiring knowledge across diverse domains, and 2) fixed data allocation proportions across domains, failing to dynamically adjust according to the target LLM’s varying capabilities across domains, leading to a capability imbalance. To overcome these limitations, we propose Bohdi, a synthetic-data-only heterogeneous LLM fusion framework. Through the organization of knowledge domains into a hierarchical tree structure, Bohdi enables automatic domain exploration and multi-domain data generation through multi-model collaboration, thereby comprehensively extracting knowledge from source LLMs. By formalizing domain expansion and data sampling proportion allocation on the knowledge tree as a Hierarchical Multi-Armed Bandit problem, Bohdi leverages the designed DynaBranches mechanism to adaptively adjust sampling proportions based on the target LLM’s performance feedback across domains. Integrated with our proposed Introspection-Rebirth (IR) mechanism, DynaBranches dynamically tracks capability shifts during target LLM’s updates via Sliding Window Binomial Likelihood Ratio Testing (SWBLRT), further enhancing its online adaptation capability. Comparative experimental results on a comprehensive suite of benchmarks demonstrate that Bohdi significantly outperforms existing baselines on multiple target LLMs, exhibits higher data efficiency, and virtually eliminates the imbalance in the target LLM’s capabilities. Our code is available at https://github.com/gjq100/Bohdi.git.
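The adaptive allocation component can be caricatured as a multi-armed bandit over knowledge domains, where arms are domains and rewards are performance feedback from the target model. Bohdi's DynaBranches is a hierarchical bandit over a knowledge tree with likelihood-ratio change tracking; this flat UCB sketch shows only the allocation idea:

```python
import numpy as np

def ucb_pick(rewards, counts, t, c=1.0):
    """UCB-style choice of the next knowledge domain to sample data from."""
    means = rewards / np.maximum(counts, 1)
    bonus = c * np.sqrt(np.log(t + 1) / np.maximum(counts, 1))
    return int(np.argmax(means + bonus))

rng = np.random.default_rng(0)
true_gain = np.array([0.2, 0.5, 0.35])   # hidden per-domain usefulness
rewards, counts = np.zeros(3), np.zeros(3)
for t in range(300):
    d = ucb_pick(rewards, counts, t)
    rewards[d] += float(rng.random() < true_gain[d])   # noisy feedback
    counts[d] += 1
print("sampling proportions:", (counts / counts.sum()).round(2))
```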
[1064] Permutation Equivariant Neural Controlled Differential Equations for Dynamic Graph Representation Learning
Torben Berndt, Benjamin Walker, Tiexin Qin, Jan Stühmer, Andrey Kormilitzin
Main category: cs.LG
TL;DR: Permutation Equivariant Neural Graph CDEs extend Graph Neural CDEs by projecting them onto permutation equivariant function spaces, reducing parameters while maintaining representational power for better efficiency and generalization.
Details
Motivation: Dynamic graphs have complex temporal dynamics from evolving node features and changing network structures. Graph Neural CDEs were adapted for graphs but need parameter optimization for better efficiency.
Method: Project Graph Neural CDEs onto permutation equivariant function spaces to reduce parameter count while preserving representational capabilities.
Result: Significantly reduced parameter count without compromising performance, leading to more efficient training and improved generalization in both interpolation and extrapolation scenarios.
Conclusion: Permutation Equivariant Neural Graph CDEs provide an efficient and effective approach for modeling dynamic graphs with better training efficiency and generalization performance.
Abstract: Dynamic graphs exhibit complex temporal dynamics due to the interplay between evolving node features and changing network structures. Recently, Graph Neural Controlled Differential Equations (Graph Neural CDEs) successfully adapted Neural CDEs from paths on Euclidean domains to paths on graph domains. Building on this foundation, we introduce Permutation Equivariant Neural Graph CDEs, which project Graph Neural CDEs onto permutation equivariant function spaces. This significantly reduces the model’s parameter count without compromising representational power, resulting in more efficient training and improved generalisation. We empirically demonstrate the advantages of our approach through experiments on simulated dynamical systems and real-world tasks, showing improved performance in both interpolation and extrapolation scenarios.
[1065] Wearable Sensor-Based IoT XAI Framework for Predicting Freezing of Gait in Parkinsons Disease
Biplov Paneru
Main category: cs.LG
TL;DR: A wearable sensor system using LoRa communication and machine learning algorithms (XGBoost, Catboost, Extra Trees) achieves high accuracy (up to 97%) for early prediction of Freezing of Gait (FOG), with potential for real-time monitoring and assistive technology applications.
Details
Motivation: There is a critical need for early detection and treatment of Freezing of Gait (FOG) to help affected individuals, requiring reliable and accessible monitoring systems.
Method: Developed a wearable sensor system using an Esp-32 microcontroller with LoRa communication, trained machine learning models (CatBoost, XGBoost, and Extra Trees classifiers) using the Micromlgen Python library, and performed SHAP analysis for interpretability.
Result: XGBoost achieved 97% accuracy, Catboost 96%, and Extra Trees Classifier 90% for FOG classification. SHAP analysis identified GYR SI degree as the most significant factor in prediction.
Conclusion: The sensor-based technology shows great potential for real-world healthcare applications, enabling monitoring, GPS-based location tracking, and assistive support for FOG-affected individuals.
Abstract: This research addresses the critical need for early detection and prediction of Freezing of Gait (FOG) using wearable sensor technology with LoRa communication. The system is built on an Esp-32 microcontroller, on which the trained model is deployed via the Micromlgen Python library. The study investigates accurate FOG classification from pertinent clinical data using machine learning (ML) algorithms including CatBoost, XGBoost, and Extra Trees classifiers. XGBoost classified with approximately 97% accuracy, compared with 96% for CatBoost and 90% for the Extra Trees classifier. SHAP-based interpretability analysis shows that the GYR SI degree feature is the most influential factor in the prediction. These results demonstrate the system's potential for monitoring affected individuals, tracking their location via GPS, and providing assistive support. The developed sensor-based technology holds strong promise for real-world applications in healthcare and biomedical technology.
[1066] BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers
Patrik Okanovic, Sameer Deshmukh, Grzegorz Kwasniewski, Yi Zhu, Haruto Fujii, Sakina Fatima, Maciej Besta, Kentaro Katayama, Takumi Honda, Yusuke Nagasaka, Torsten Hoefler
Main category: cs.LG
TL;DR: BLaST is a block sparsity method that achieves up to 95% sparsity in MLP weights with minimal accuracy loss, delivering 2.2x inference speedup and significant memory/cost reductions.
Details
Motivation: Large-scale ML models suffer from high energy consumption dominated by data movement. Sparsification can reduce these costs by pruning redundant parameters, but existing methods cause accuracy degradation or performance overhead.
Method: BLaST iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication, applicable to linear layers in all settings.
Result: Achieves up to 95% sparsity in MLP weights with negligible accuracy loss (<2.25%), 2.2x inference speedup for Llama 3.2, 4.45x memory footprint reduction, and 2.9x reduction in GPU setup and operating costs.
Conclusion: BLaST provides a general, robust, and reliable sparsification method that effectively reduces data movement costs while maintaining model accuracy and delivering significant performance improvements.
Abstract: The energy consumption of large-scale ML models is dominated by data movement, shuffling billions of parameters across memory hierarchies and data centers. Sparsification offers a principled way to mitigate these costs by pruning redundant weights and activations, thereby reducing data movement. Effective sparsification to prune redundant parameters is still challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable method for sparsification, applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication. BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss (majority <2.25%). We show a 2.2x inference speedup for Llama 3.2 with 16 GPUs, and up to 4.45x reduction in inference memory footprint resulting in a 2.9x reduction in GPU setup and operating costs.
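The core pruning loop can be sketched as iterative block-magnitude sparsification: rank (block × block) tiles by Frobenius norm and zero the weakest while ramping toward the target sparsity. This sketch omits the retraining between steps that lets BLaST preserve accuracy:

```python
import numpy as np

def block_sparsify(W, block=4, target_sparsity=0.95, steps=5):
    """Iteratively zero the lowest-norm (block x block) tiles of W,
    ramping up to the target fraction of zeroed tiles."""
    W = W.copy()
    rows, cols = W.shape[0] // block, W.shape[1] // block
    tiles = W.reshape(rows, block, cols, block)   # view: tile (r, c) = tiles[r, :, c, :]
    for step in range(1, steps + 1):
        norms = np.linalg.norm(tiles, axis=(1, 3))       # per-tile Frobenius norm
        k = int(target_sparsity * step / steps * rows * cols)
        thresh = np.partition(norms.ravel(), k)[k]
        tiles *= (norms >= thresh)[:, None, :, None]     # drop the weakest tiles
        # (BLaST interleaves retraining here to recover accuracy.)
    return W

W = np.random.default_rng(0).standard_normal((64, 64))
Ws = block_sparsify(W)
print("fraction of zeroed weights:", (Ws == 0).mean())
```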
[1067] MPX: Mixed Precision Training for JAX
Alexander Gräfe, Sebastian Trimpe
Main category: cs.LG
TL;DR: MPX is a mixed-precision training toolbox for JAX that simplifies and accelerates large-scale neural network training while preserving accuracy, with seamless integration into existing pipelines.
Details
Motivation: JAX lacks robust support for mixed-precision training despite its growing popularity as a machine learning toolbox, while mixed-precision training has become essential for enhancing neural network training efficiency.
Method: MPX casts inputs and outputs to half precision with dynamic loss-scaling to prevent gradient underflow/overflow. It integrates with JAX's type-promotion behavior for correct precision operations and provides wrappers for automatic mixed-precision gradient and optimizer management.
Result: The toolbox enables conversion of full-precision pipelines to mixed-precision with minimal modifications, seamlessly working with popular JAX toolboxes like Equinox and Flax.
Conclusion: MPX provides a comprehensive solution for mixed-precision training in JAX, addressing precision issues while maintaining model accuracy and offering straightforward integration into existing training pipelines.
Abstract: Mixed-precision training has emerged as an indispensable tool for enhancing the efficiency of neural network training in recent years. Concurrently, JAX has grown in popularity as a versatile machine learning toolbox. However, it currently lacks robust support for mixed-precision training. We propose MPX, a mixed-precision training toolbox for JAX that simplifies and accelerates the training of large-scale neural networks while preserving model accuracy. MPX seamlessly integrates with popular toolboxes such as Equinox and Flax, allowing users to convert full-precision pipelines to mixed-precision versions with minimal modifications. By casting both inputs and outputs to half precision, and introducing a dynamic loss-scaling mechanism, MPX alleviates issues like gradient underflow and overflow that commonly arise in half precision computations. Its design inherits critical features from JAX’s type-promotion behavior, ensuring that operations take place in the correct precision and allowing for selective enforcement of full precision where needed (e.g., sums, means, or softmax). MPX further provides wrappers for automatic creation and management of mixed-precision gradients and optimizers, enabling straightforward integration into existing JAX training pipelines. MPX’s source code, documentation, and usage examples are available at github.com/Data-Science-in-Mechanical-Engineering/mixed_precision_for_JAX .
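The dynamic loss-scaling mechanism can be sketched independently of JAX: compute the backward pass on a scaled loss in float16, unscale in float32, and grow or shrink the scale depending on whether the gradients stayed finite. This is a toy of the mechanism MPX automates, not the MPX API:

```python
import numpy as np

def scaled_grad(w, x, y, scale):
    """Gradient of MSE with the loss scaled before the float16 backward
    pass, then unscaled in float32."""
    w16, x16, y16 = (a.astype(np.float16) for a in (w, x, y))
    err = x16 @ w16 - y16
    # Backward pass in half precision on the scaled loss:
    g16 = x16.T @ (np.float16(scale) * err) * np.float16(2.0 / len(y16))
    return g16.astype(np.float32) / scale   # unscale in full precision

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4)).astype(np.float32)
y = (x @ np.array([1.0, -2.0, 0.5, 3.0])).astype(np.float32)
w, scale = np.zeros(4, dtype=np.float32), 2.0 ** 12

for _ in range(50):
    g = scaled_grad(w, x, y, scale)
    if np.isfinite(g).all():
        w -= 0.05 * g
        scale = min(scale * 2.0, 2.0 ** 14)   # grow scale when stable
    else:
        scale *= 0.5                          # overflow: skip step, back off
print(w.round(2), "final scale:", scale)
```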
[1068] Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset
Lily Hong Zhang, Smitha Milli, Karen Jusko, Jonathan Smith, Brandon Amos, Wassim Bouaziz, Manon Revel, Jack Kussman, Yasha Sheynin, Lisa Titus, Bhaktipriya Radharapu, Jane Yu, Vidya Sarma, Kris Rose, Maximilian Nickel
Main category: cs.LG
TL;DR: This paper addresses the challenge of aligning LLMs with diverse human preferences across cultural and political dimensions through a large-scale multilingual study, revealing significant gaps between human preference diversity and LLM responses, and proposing negatively-correlated sampling to improve alignment.
Details
Motivation: To address how LLMs can serve users with varying preferences that conflict across cultural, political, or other dimensions, as current methods fail to capture the full diversity of human preferences.
Method: Conducted large-scale multilingual human study with 15,000 participants from 5 countries, analyzed 21 state-of-the-art LLMs, developed negatively-correlated sampling techniques for candidate generation, and created the Community Alignment dataset with 200,000 comparisons.
Result: Found humans exhibit significantly more preference variation than LLMs, existing preference collection methods are insufficient, negatively-correlated sampling significantly enhances alignment performance, and created the largest multilingual multi-turn preference dataset.
Conclusion: The Community Alignment dataset provides a valuable resource for improving LLM effectiveness for diverse global populations, demonstrating the importance of better sampling methods to capture heterogeneous human preferences.
Abstract: How can large language models (LLMs) serve users with varying preferences that may conflict across cultural, political, or other dimensions? To advance this challenge, this paper establishes four key results. First, we demonstrate, through a large-scale multilingual human study with representative samples from five countries (N=15,000), that humans exhibit significantly more variation in preferences than the responses of 21 state-of-the-art LLMs. Second, we show that existing methods for preference dataset collection are insufficient for learning the diversity of human preferences even along two of the most salient dimensions of variability in global values, due to the underlying homogeneity of candidate responses. Third, we argue that this motivates the need for negatively-correlated sampling when generating candidate sets, and we show that simple prompt-based techniques for doing so significantly enhance the performance of alignment methods in learning heterogeneous preferences. Fourth, based on this novel candidate sampling approach, we collect and open-source Community Alignment, the largest and most representative multilingual and multi-turn preference dataset to date, featuring almost 200,000 comparisons from annotators spanning five countries. We hope that the Community Alignment dataset will be a valuable resource for improving the effectiveness of LLMs for a diverse global population.
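A minimal sketch of what prompt-based negatively-correlated sampling can look like, assuming a hypothetical `generate` LLM call: instead of drawing i.i.d. candidates, each generation is conditioned on a distinct stance so the candidate set spans opposing viewpoints rather than homogeneous responses.

```python
# Each candidate is steered toward a different value orientation so the
# resulting set is negatively correlated; STANCES and `generate` are
# illustrative stand-ins, not the paper's prompts or models.

STANCES = [
    "Answer emphasizing individual liberty.",
    "Answer emphasizing collective well-being.",
    "Answer emphasizing religious or traditional values.",
    "Answer emphasizing a secular, scientific framing.",
]

def generate(prompt: str) -> str:
    # Hypothetical stand-in for any LLM completion call.
    return f"[response to: {prompt}]"

def negatively_correlated_candidates(question: str) -> list[str]:
    return [generate(f"{stance}\n\nQuestion: {question}") for stance in STANCES]

candidates = negatively_correlated_candidates("Should museums charge entry fees?")
```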
[1069] Continental-scale habitat distribution modelling with multimodal earth observation foundation models
Sara Si-Moussi, Stephan Hennekens, Sander Mucher, Stan Los, Yoann Cartier, Borja Jiménez-Alfaro, Fabio Attorre, Jens-Christian Svenning, Wilfried Thuiller
Main category: cs.LG
TL;DR: This paper presents a framework using high-resolution remote sensing and AI to improve habitat mapping across large geographical areas, addressing challenges of multiple habitat types and class imbalance.
Details
Motivation: Current habitat maps have limitations in thematic and spatial resolution due to difficulties in modeling multiple mutually exclusive habitat types and handling severe class imbalance during training.
Method: Used vegetation plots from European Vegetation Archive to model Level 3 EUNIS habitat types across Europe, employing hierarchical classification strategies, integration of multispectral and radar imagery through Earth Observation Foundation models, and ensemble machine learning with class imbalance correction.
Result: Strategies using hierarchical habitat classifications resolved classification ambiguities, especially in fragmented habitats. Integration of satellite imagery through EO-FMs enhanced within-formation discrimination and overall performance. Ensemble machine learning with class imbalance correction further boosted predictive accuracy.
Conclusion: The methodological framework is transferable beyond Europe and adaptable to other classification systems. Future research should focus on temporal habitat dynamics modeling, habitat segmentation and quality assessment, and leveraging next-generation EO data with higher-quality in situ observations.
Abstract: Habitats integrate the abiotic conditions, vegetation composition and structure that support biodiversity and sustain nature’s contributions to people. Most habitats face mounting pressures from human activities, which requires accurate, high-resolution habitat mapping for effective conservation and restoration. Yet, current habitat maps often fall short in thematic or spatial resolution because they must (1) model several mutually exclusive habitat types that co-occur across landscapes and (2) cope with severe class imbalance that complicates exhaustive multi-class training. Here, we evaluated how high-resolution remote sensing (RS) data and Artificial Intelligence (AI) tools can improve habitat mapping across large geographical extents at fine spatial and thematic resolution. Using vegetation plots from the European Vegetation Archive, we modelled the distribution of Level 3 EUNIS habitat types across Europe and assessed multiple modelling strategies against independent validation datasets. Strategies that exploited the hierarchical nature of habitat classifications resolved classification ambiguities, especially in fragmented habitats. Integrating satellite-borne multispectral and radar imagery, particularly through Earth Observation (EO) Foundation models (EO-FMs), enhanced within-formation discrimination and overall performance. Finally, ensemble machine learning that corrects class imbalance boosted predictive accuracy even further. Our methodological framework is transferable beyond Europe and adaptable to other classification systems. Future research should advance temporal modelling of habitat dynamics, extend to habitat segmentation and quality assessment, and exploit next-generation EO data paired with higher-quality in situ observations.
[1070] PICore: Physics-Informed Unsupervised Coreset Selection for Data Efficient Neural Operator Training
Anirudh Satheesh, Anant Khandelwal, Mucong Ding, Radu Balan
Main category: cs.LG
TL;DR: PICore is an unsupervised coreset selection framework that identifies the most informative training samples for neural operators without requiring ground-truth PDE solutions, reducing both data annotation costs and training time.
Details
Motivation: Neural operators require large amounts of labeled training data from expensive numerical simulations, creating significant bottlenecks in training efficiency and cost.
Method: PICore uses physics-informed loss to select the most informative unlabeled inputs, then only simulates those selected samples to generate labels, training neural operators on this reduced dataset.
Result: Across four PDE benchmarks, PICore achieves up to 78% average increase in training efficiency compared to supervised coreset selection methods with minimal accuracy loss.
Conclusion: PICore effectively reduces both data annotation costs and training time for neural operators while maintaining accuracy, making PDE solving more accessible.
Abstract: Neural operators offer a powerful paradigm for solving partial differential equations (PDEs) that cannot be solved analytically by learning mappings between function spaces. However, there are two main bottlenecks in training neural operators: they require a significant amount of training data to learn these mappings, and this data needs to be labeled, which can only be accessed via expensive simulations with numerical solvers. To alleviate both of these issues simultaneously, we propose PICore, an unsupervised coreset selection framework that identifies the most informative training samples without requiring access to ground-truth PDE solutions. PICore leverages a physics-informed loss to select unlabeled inputs by their potential contribution to operator learning. After selecting a compact subset of inputs, only those samples are simulated using numerical solvers to generate labels, reducing annotation costs. We then train the neural operator on the reduced labeled dataset, significantly decreasing training time as well. Across four diverse PDE benchmarks and multiple coreset selection strategies, PICore achieves up to 78% average increase in training efficiency relative to supervised coreset selection methods with minimal changes in accuracy. We provide code at https://github.com/Asatheesh6561/PICore.
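A toy sketch of the selection loop, with a stand-in surrogate and a one-step heat-equation residual (both illustrative, not the paper's operators or benchmarks): unlabeled inputs are ranked by a physics-informed residual, and only the top-k are passed to the numerical solver for labeling.

```python
import numpy as np

def surrogate(u0):
    # Untrained neural-operator stand-in.
    return 0.9 * u0

def residual(u0, u_pred, nu=0.1):
    # Toy residual of u_t = nu * u_xx after one implicit Euler step (dt = 1).
    u_t = u_pred - u0
    u_xx = np.gradient(np.gradient(u_pred))
    return np.mean((u_t - nu * u_xx) ** 2)

def picore_select(inputs, k):
    # No ground-truth solutions needed: score by physics residual only.
    scores = [residual(u0, surrogate(u0)) for u0 in inputs]
    return np.argsort(scores)[-k:]          # highest-residual samples

pool = [np.sin(np.linspace(0, 2 * np.pi, 64) * f) for f in range(1, 21)]
chosen = picore_select(pool, k=5)           # only these get simulated
```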
[1071] Boosting Revisited: Benchmarking and Advancing LP-Based Ensemble Methods
Fabian Akkerman, Julien Ferry, Christian Artigues, Emmanuel Hebrard, Thibaut Vidal
Main category: cs.LG
TL;DR: Large-scale experimental study of six LP-based boosting methods shows they can match or outperform XGBoost/LightGBM with shallow trees while producing sparser ensembles, and can effectively thin pre-trained ensembles.
Details
Motivation: Despite theoretical appeal, totally corrective boosting methods based on linear programming have received limited empirical attention, warranting comprehensive experimental evaluation.
Method: Conducted large-scale study of six LP-based boosting formulations (including novel NM-Boost and QRLP-Boost) across 20 datasets, evaluating both heuristic and optimal base learners, and analyzing accuracy, sparsity, margins, anytime performance, and hyperparameter sensitivity.
Result: Totally corrective methods can outperform or match XGBoost/LightGBM with shallow trees while producing significantly sparser ensembles, and can thin pre-trained ensembles without performance loss.
Conclusion: LP-based boosting methods show practical value with sparser ensembles and competitive performance, though optimal decision trees have both strengths and limitations in this context.
Abstract: Despite their theoretical appeal, totally corrective boosting methods based on linear programming have received limited empirical attention. In this paper, we conduct the first large-scale experimental study of six LP-based boosting formulations, including two novel methods, NM-Boost and QRLP-Boost, across 20 diverse datasets. We evaluate the use of both heuristic and optimal base learners within these formulations, and analyze not only accuracy, but also ensemble sparsity, margin distribution, anytime performance, and hyperparameter sensitivity. We show that totally corrective methods can outperform or match state-of-the-art heuristics like XGBoost and LightGBM when using shallow trees, while producing significantly sparser ensembles. We further show that these methods can thin pre-trained ensembles without sacrificing performance, and we highlight both the strengths and limitations of using optimal decision trees in this context.
[1072] FedCVD++: Communication-Efficient Federated Learning for Cardiovascular Risk Prediction with Parametric and Non-Parametric Model Optimization
Abdelrhman Gaber, Hassan Abd-Eltawab, John Elgallab, Youssif Abuzied, Dineo Mpanya, Turgay Celik, Swarun Kumar, Tamer ElBatt
Main category: cs.LG
TL;DR: FedCVD++ is an enhanced federated learning framework for cardiovascular disease prediction that integrates both parametric and non-parametric models, achieving state-of-the-art performance while reducing communication overhead and addressing class imbalance across institutions.
Details
Motivation: Cardiovascular diseases cause over 17 million deaths annually worldwide, creating an urgent need for privacy-preserving predictive systems that can work across multiple healthcare institutions without sharing sensitive patient data.
Method: The framework integrates parametric models (logistic regression, SVM, neural networks) and non-parametric models (Random Forest, XGBoost) with three key innovations: tree-subset sampling for 70% communication reduction, XGBoost-based feature extraction for lightweight federated ensembles, and federated SMOTE synchronization for handling class imbalance across institutions.
Result: On the Framingham dataset (4,238 records), federated XGBoost (F1 = 0.80) surpasses its centralized counterpart (F1 = 0.78), and federated Random Forest (F1 = 0.81) matches non-federated performance. Communication-efficient strategies reduce bandwidth consumption by 3.2X while preserving 95% accuracy, delivering up to 15% higher F1-scores than existing FL frameworks.
Conclusion: This work represents the first practical integration of non-parametric models into federated healthcare systems, providing a validated privacy-preserving solution for cardiovascular disease prediction under real-world clinical constraints with superior scalability for multi-institutional deployment.
Abstract: Cardiovascular diseases (CVD) cause over 17 million deaths annually worldwide, highlighting the urgent need for privacy-preserving predictive systems. We introduce FedCVD++, an enhanced federated learning (FL) framework that integrates both parametric models (logistic regression, SVM, neural networks) and non-parametric models (Random Forest, XGBoost) for coronary heart disease risk prediction. To address key FL challenges, we propose: (1) tree-subset sampling that reduces Random Forest communication overhead by 70%, (2) XGBoost-based feature extraction enabling lightweight federated ensembles, and (3) federated SMOTE synchronization for resolving cross-institutional class imbalance. Evaluated on the Framingham dataset (4,238 records), FedCVD++ achieves state-of-the-art results: federated XGBoost (F1 = 0.80) surpasses its centralized counterpart (F1 = 0.78), and federated Random Forest (F1 = 0.81) matches non-federated performance. Additionally, our communication-efficient strategies reduce bandwidth consumption by 3.2X while preserving 95% accuracy. Compared to existing FL frameworks, FedCVD++ delivers up to 15% higher F1-scores and superior scalability for multi-institutional deployment. This work represents the first practical integration of non-parametric models into federated healthcare systems, providing a privacy-preserving solution validated under real-world clinical constraints.
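A small sketch of the tree-subset sampling idea using scikit-learn (illustrative, not FedCVD++'s exact protocol): each client trains a forest locally but uploads only a random fraction of its trees, and the server pools the received trees into a single global ensemble.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=15, random_state=0)
clients = np.array_split(np.arange(len(y)), 3)   # three simulated institutions

pooled_trees = []
for idx in clients:
    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[idx], y[idx])
    keep = rng.choice(50, size=15, replace=False)   # upload only 30% of trees
    pooled_trees.extend(rf.estimators_[i] for i in keep)

def global_predict(x):
    # Majority vote over the pooled cross-client trees.
    votes = np.mean([t.predict(x) for t in pooled_trees], axis=0)
    return (votes > 0.5).astype(int)

print(global_predict(X[:5]))
```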
[1073] Learning Robust Satellite Attitude Dynamics with Physics-Informed Normalising Flow
Carlo Cena, Mauro Martini, Marcello Chiaberge
Main category: cs.LG
TL;DR: Physics-Informed Neural Networks (PINNs) built on a Real NVP architecture improve learned spacecraft attitude dynamics, reducing mean relative error by 27.08% over purely data-driven models and improving MPC settling times by up to 62%.
Details
Motivation: Traditional MPC relies on accurate physics models, which can be incomplete or computationally expensive. Machine learning offers alternatives but struggles with generalization and stability outside training domains.
Method: Used Real NVP neural network with self-attention mechanism trained on Basilisk simulator data. Compared purely data-driven baseline with physics-informed variant (PINNs) for learning spacecraft attitude dynamics.
Result: PINN-based models reduced mean relative error by 27.08% and improved settling times by up to 62% compared to traditional MPC when dealing with observation noise and reaction wheel friction.
Conclusion: Incorporating physics-based information into neural networks significantly enhances performance, robustness, and control accuracy in spacecraft attitude dynamics modeling and MPC applications.
Abstract: Attitude control is a fundamental aspect of spacecraft operations. Model Predictive Control (MPC) has emerged as a powerful strategy for these tasks, relying on accurate models of the system dynamics to optimize control actions over a prediction horizon. In scenarios where physics models are incomplete, difficult to derive, or computationally expensive, machine learning offers a flexible alternative by learning the system behavior directly from data. However, purely data-driven models often struggle with generalization and stability, especially when applied to inputs outside their training domain. To address these limitations, we investigate the benefits of incorporating Physics-Informed Neural Networks (PINNs) into the learning of spacecraft attitude dynamics, comparing their performance with that of purely data-driven approaches. Using a Real-valued Non-Volume Preserving (Real NVP) neural network architecture with a self-attention mechanism, we trained several models on simulated data generated with the Basilisk simulator. Two training strategies were considered: a purely data-driven baseline and a physics-informed variant to improve robustness and stability. Our results demonstrate that incorporating physics-based information significantly improves performance, reducing the mean relative error of the best architectures found by 27.08%. These advantages are particularly evident when the learned models are integrated into an MPC framework, where PINN-based models consistently outperform their purely data-driven counterparts in control accuracy and robustness, achieving settling-time improvements of up to 62% over traditional MPC approaches when subject to observation noise and reaction wheel friction.
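A minimal sketch of what a physics-informed training signal for attitude dynamics can look like, assuming Euler's rotational equation J·dω/dt = −ω × (Jω) + u as the governing dynamics; the one-step predictor, inertia values, and loss weighting below are toy placeholders, not the paper's Real NVP model.

```python
import torch

J = torch.diag(torch.tensor([0.05, 0.06, 0.04]))     # inertia matrix (toy)
dt = 0.1

def physics_residual(w, w_next, u):
    # Residual of J * dw/dt = -w x (J w) + u under a finite difference.
    w_dot = (w_next - w) / dt
    return (J @ w_dot) - (-torch.linalg.cross(w, J @ w) + u)

model = torch.nn.Linear(6, 3)                        # maps (w, u) -> w_next
w, u, w_true = torch.randn(3), torch.randn(3), torch.randn(3)

w_pred = model(torch.cat([w, u]))
data_loss = torch.mean((w_pred - w_true) ** 2)
phys_loss = torch.mean(physics_residual(w, w_pred, u) ** 2)
loss = data_loss + 0.1 * phys_loss                   # physics-informed total
loss.backward()
```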
[1074] Cost-Aware Contrastive Routing for LLMs
Reza Shirkavand, Shangqian Gao, Peiran Yu, Heng Huang
Main category: cs.LG
TL;DR: CSCR is a lightweight routing framework that maps prompts and models into a shared embedding space for fast, cost-sensitive LLM selection using contrastive learning and k-NN lookup.
Details
Motivation: Existing routing approaches overlook prompt-specific context, rely on expensive profiling, assume fixed expert pools, or use inefficient trial-and-error strategies.
Method: Uses compact logit footprints for open-source models and perplexity fingerprints for black-box APIs. Trains contrastive encoder to select cheapest accurate expert within cost bands. Inference uses single k-NN lookup via FAISS index.
Result: Outperforms baselines across benchmarks, improving accuracy-cost tradeoff by up to 25%, generalizes to unseen LLMs and out-of-distribution prompts.
Conclusion: CSCR enables fast, adaptive routing with microsecond latency and no retraining when expert pool changes, providing robust cost-aware LLM selection.
Abstract: We study cost-aware routing for large language models across diverse and dynamic pools of models. Existing approaches often overlook prompt-specific context, rely on expensive model profiling, assume a fixed set of experts, or use inefficient trial-and-error strategies. We introduce Cost-Spectrum Contrastive Routing (CSCR), a lightweight framework that maps both prompts and models into a shared embedding space to enable fast, cost-sensitive selection. CSCR uses compact, fast-to-compute logit footprints for open-source models and perplexity fingerprints for black-box APIs. A contrastive encoder is trained to favor the cheapest accurate expert within adaptive cost bands. At inference time, routing reduces to a single k-NN lookup via a FAISS index, requiring no retraining when the expert pool changes and enabling microsecond latency. Across multiple benchmarks, CSCR consistently outperforms baselines, improving the accuracy-cost tradeoff by up to 25%, while generalizing robustly to unseen LLMs and out-of-distribution prompts.
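A minimal sketch of the inference-time routing step with FAISS, using random placeholder embeddings in place of CSCR's learned contrastive encoder and model fingerprints: routing reduces to a single k-NN lookup followed by picking the cheapest nearby expert.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim, n_experts = 32, 6
expert_emb = np.random.rand(n_experts, dim).astype("float32")  # placeholder
expert_cost = np.array([0.1, 0.2, 0.5, 1.0, 2.0, 4.0])         # $/1k tokens

index = faiss.IndexFlatL2(dim)
index.add(expert_emb)        # rebuild-free: just re-add when the pool changes

def route(prompt_emb, k=3):
    # One k-NN lookup, then choose the cheapest expert among the neighbors.
    _, ids = index.search(prompt_emb.reshape(1, -1), k)
    candidates = ids[0]
    return candidates[np.argmin(expert_cost[candidates])]

prompt = np.random.rand(dim).astype("float32")
print("routed to expert", route(prompt))
```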
[1075] DQS: A Low-Budget Query Strategy for Enhancing Unsupervised Data-driven Anomaly Detection Approaches
Lucas Correia, Jan-Christoph Goos, Thomas Bäck, Anna V. Kononova
Main category: cs.LG
TL;DR: This paper introduces an active learning approach for time series anomaly detection that uses a novel dissimilarity-based query strategy (DQS) to refine threshold selection, addressing limitations of purely unsupervised methods.
Details
Motivation: Existing unsupervised time series anomaly detection methods suffer from poor threshold selection, and many that claim to be unsupervised actually require labeled data for calibration, which is often unavailable in real-world scenarios.
Method: The approach integrates active learning with existing unsupervised methods by selectively querying labels for multivariate time series. It introduces DQS, which maximizes diversity of queried samples by evaluating similarity between anomaly scores using dynamic time warping.
Result: DQS performs best in small-budget scenarios, while other query strategies are more robust to mislabelling. All query strategies outperform unsupervised thresholds even with mislabelling present.
Conclusion: When feasible to query an oracle, using active learning-based thresholds is recommended. The choice of query strategy depends on oracle expertise and labeling budget.
Abstract: Truly unsupervised approaches for time series anomaly detection are rare in the literature. Those that exist suffer from a poorly set threshold, which hampers detection performance, while others, despite claiming to be unsupervised, need to be calibrated using a labelled data subset, which is often not available in the real world. This work integrates active learning with an existing unsupervised anomaly detection method by selectively querying the labels of multivariate time series, which are then used to refine the threshold selection process. To achieve this, we introduce a novel query strategy called the dissimilarity-based query strategy (DQS). DQS aims to maximise the diversity of queried samples by evaluating the similarity between anomaly scores using dynamic time warping. We assess the detection performance of DQS in comparison to other query strategies and explore the impact of mislabelling, a topic that is underexplored in the literature. Our findings indicate that DQS performs best in small-budget scenarios, though the others appear to be more robust when faced with mislabelling. Therefore, in the real world, the choice of query strategy depends on the expertise of the oracle and the number of samples they are willing to label. Regardless, all query strategies outperform the unsupervised threshold even in the presence of mislabelling. Thus, whenever it is feasible to query an oracle, employing an active learning-based threshold is recommended.
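A compact sketch of a dissimilarity-based query strategy in this spirit: greedily query the anomaly-score series whose minimum DTW distance to the already-queried set is largest, maximizing diversity under a small budget. The O(nm) DTW and the arbitrary seeding below are illustrative simplifications.

```python
import numpy as np

def dtw(a, b):
    # Plain dynamic-time-warping distance between two 1-D score series.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dqs(score_series, budget):
    queried = [0]                               # seed with an arbitrary series
    while len(queried) < budget:
        rest = [i for i in range(len(score_series)) if i not in queried]
        # Query the series farthest (in min-DTW) from everything queried so far.
        pick = max(rest, key=lambda i: min(dtw(score_series[i], score_series[q])
                                           for q in queried))
        queried.append(pick)
    return queried

scores = [np.random.rand(50) for _ in range(12)]
print(dqs(scores, budget=4))
```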
[1076] Foundational theory for optimal decision tree problems. I. Algorithmic and geometric foundations
Xi He
Main category: cs.LG
TL;DR: This paper introduces four novel definitions of Optimal Decision Tree (ODT) problems and derives optimal algorithms using algebraic programming theory, providing a unified solution for general ODT problems with arbitrary splitting rules.
Details
Motivation: To provide unambiguous formal specifications for ODT problems and derive correct-by-construction algorithms that unify and generalize existing approaches, particularly for decision trees with flexible splitting rules.
Method: Uses algebraic programming theory to derive algorithms from formal specifications stated as executable recursive programs. The approach establishes existence of dynamic programming solutions and constructs them when they exist.
Result: Developed four novel optimal algorithms for ODT problems with arbitrary splitting rules that satisfy given axioms and objective functions. These algorithms encompass the known depth-constrained axis-parallel ODT algorithm as a special case.
Conclusion: The framework provides a unified, efficient, and elegant solution for general ODT problems and is extendable to support algorithms for constructing even more flexible decision trees, including those with mixed splitting rules.
Abstract: In the first paper (Part I) of this series of two, we introduce four novel definitions of the ODT problems: three for size-constrained trees and one for depth-constrained trees. These definitions are stated unambiguously through executable recursive programs, satisfying all criteria we propose for a formal specification. In this sense, they resemble the “standard form” used in the study of general-purpose solvers. Grounded in algebraic programming theory, a relational formalism for deriving correct-by-construction algorithms from specifications, we can not only establish the existence or nonexistence of dynamic programming solutions but also derive them constructively whenever they exist. Consequently, the four generic problem definitions yield four novel optimal algorithms for ODT problems with arbitrary splitting rules that satisfy the axioms and objective functions of a given form. These algorithms encompass the known depth-constrained, axis-parallel ODT algorithm as a special case, while providing a unified, efficient, and elegant solution for the general ODT problem. In Part II, we present the first optimal hypersurface decision tree algorithm and provide comprehensive experiments against axis-parallel decision tree algorithms, including heuristic CART and state-of-the-art optimal methods. The results demonstrate the significant potential of decision trees with flexible splitting rules. Moreover, our framework is readily extendable to support algorithms for constructing even more flexible decision trees, including those with mixed splitting rules.
[1077] On Linear Mode Connectivity of Mixture-of-Experts Architectures
Viet-Hoang Tran, Van Hoan Trinh, Khanh Vinh Bui, Tan M. Nguyen
Main category: cs.LG
TL;DR: Linear Mode Connectivity (LMC) exists in Mixture-of-Experts (MoE) architectures, where independently trained MoE models can be connected by linear paths with low loss after proper alignment of expert permutations and gating functions.
Details
Motivation: To investigate whether the Linear Mode Connectivity phenomenon observed in standard neural networks also applies to Mixture-of-Experts architectures, which have different structural properties and symmetries.
Method: Systematic analysis of MoE symmetries, development of a matching algorithm to align independently trained MoEs, and empirical validation across various MoE configurations (dense, sparse, shared-expert) with different datasets.
Result: The study confirms the existence of LMC in MoE architectures after proper alignment, demonstrating that independently trained MoE models can be connected by linear paths with consistently low loss.
Conclusion: LMC is a fundamental property of MoE architectures, providing insights into their functional landscape and optimization dynamics, with implications for model ensembling and generalization.
Abstract: Linear Mode Connectivity (LMC) is a notable phenomenon in the loss landscapes of neural networks, wherein independently trained models have been observed to be connected–up to permutation symmetries–by linear paths in parameter space along which the loss remains consistently low. This observation challenges classical views of non-convex optimization and has implications for model ensembling, generalization, and our understanding of neural loss geometry. Inspired by recent studies on LMC in standard neural networks, we systematically investigate this phenomenon within Mixture-of-Experts (MoE) architectures–a class of models known for their scalability and computational efficiency, which combine traditional neural networks–referred to as experts–through a learnable gating mechanism. We begin by conducting a comprehensive analysis of both dense and sparse gating regimes, demonstrating that the symmetries inherent to MoE architectures are fully characterized by permutations acting on both the expert components and the gating function. Building on these foundational findings, we propose a matching algorithm that enables alignment between independently trained MoEs, thereby facilitating the discovery of LMC. Finally, we empirically validate the presence of LMC using our proposed algorithm across diverse MoE configurations–including dense, sparse, and shared-expert variants–under a wide range of model settings and datasets of varying scales and modalities. Our results confirm the existence of LMC in MoE architectures and offer fundamental insights into the functional landscape and optimization dynamics of deep learning models.
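A minimal sketch of the LMC check itself, assuming the permutation matching has already been applied and using a toy quadratic loss in place of a trained network: evaluate the loss barrier along the straight line between the two parameter vectors, which should stay near zero if LMC holds.

```python
import numpy as np

def loss(theta):
    # Placeholder objective standing in for validation loss of a real model.
    return float(np.mean((theta - 1.0) ** 2))

theta_a = np.random.randn(100)
theta_b_aligned = theta_a + 0.1 * np.random.randn(100)  # pretend post-matching

# Barrier: excess loss on the linear path over the linear interpolation of
# endpoint losses, maximized over interpolation coefficients t.
barrier = max(
    loss((1 - t) * theta_a + t * theta_b_aligned)
    - ((1 - t) * loss(theta_a) + t * loss(theta_b_aligned))
    for t in np.linspace(0, 1, 21)
)
print(f"loss barrier along the path: {barrier:.4f}")
```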
[1078] Foundational theory for optimal decision tree problems. II. Optimal hypersurface decision tree algorithm
Xi He
Main category: cs.LG
TL;DR: This paper introduces the first hypersurface decision tree (HODT) algorithm that can solve optimal decision tree problems for general hypersurface splitting rules, outperforming existing methods limited to hyperplane splits.
Details
Motivation: Existing optimal decision tree methods are limited to hyperplane splitting rules and rely on external solvers. The authors aim to develop an algorithm that can handle general hypersurface decision trees without requiring external solvers.
Method: Building on the algorithmic and geometric foundations from Part I, the authors introduce the HODT algorithm that addresses general hypersurface decision tree models. They test it on synthetic datasets varying tree size, data size, dimensionality, and noise levels, and evaluate on 30 real-world datasets.
Result: The HODT algorithm recovers ground truth more accurately than axis-parallel trees and shows greater noise robustness. On real-world datasets, it achieves up to 30% higher accuracy than state-of-the-art optimal axis-parallel decision tree algorithms when tree complexity is properly controlled.
Conclusion: The HODT algorithm successfully extends optimal decision tree optimization to general hypersurface splitting rules, demonstrating superior performance over existing methods limited to hyperplane splits.
Abstract: Decision trees are a ubiquitous model for classification and regression tasks due to their interpretability and efficiency. However, solving the optimal decision tree (ODT) problem remains a challenging combinatorial optimization task. Even for the simplest splitting rules, axis-parallel hyperplanes, it is NP-hard to optimize. In Part I of this series, we rigorously defined the proper decision tree model through four axioms and, based on these, introduced four formal definitions of the ODT problem. From these definitions, we derived four generic algorithms capable of solving ODT problems for arbitrary decision trees satisfying the axioms. We also analyzed the combinatorial geometric properties of hypersurfaces, showing that decision trees defined by polynomial hypersurface splitting rules satisfy the proper axioms that we proposed. In this second paper (Part II) of this two-part series, building on the algorithmic and geometric foundations established in Part I, we introduce the first hypersurface decision tree (HODT) algorithm. To the best of our knowledge, existing optimal decision tree methods are, to date, limited to hyperplane splitting rules, a special case of hypersurfaces, and rely on general-purpose solvers. In contrast, our HODT algorithm addresses the general hypersurface decision tree model without requiring external solvers. Using synthetic datasets generated from ground-truth hyperplane decision trees, we vary tree size, data size, dimensionality, and label and feature noise. Results show that our algorithm recovers the ground truth more accurately than axis-parallel trees and exhibits greater robustness to noise. We also analyze generalization performance across 30 real-world datasets, showing that HODT can achieve up to 30% higher accuracy than the state-of-the-art optimal axis-parallel decision tree algorithm when tree complexity is properly controlled.
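One way to see what a hypersurface split is: a hyperplane split applied after a polynomial feature lift, so sign(a · φ(x)) carves the input space with a curved boundary. The sketch below shows this representation (illustrative only; it is not the HODT optimization algorithm).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.randn(200, 2)
phi = PolynomialFeatures(degree=2, include_bias=True).fit_transform(X)
# phi columns: [1, x1, x2, x1^2, x1*x2, x2^2]

# A degree-2 hypersurface split: x1^2 + x2^2 - 1 <= 0 (inside the unit circle)
a = np.array([-1.0, 0.0, 0.0, 1.0, 0.0, 1.0])
left = phi @ a <= 0
print("left branch:", left.sum(), "right branch:", (~left).sum())
```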
[1079] Automated Constitutive Model Discovery by Pairing Sparse Regression Algorithms with Model Selection Criteria
Jorge-Humberto Urrea-Quintero, David Anton, Laura De Lorenzis, Henning Wessels
Main category: cs.LG
TL;DR: Automated framework for constitutive model discovery using three sparse regression algorithms (LASSO, LARS, OMP) paired with three model selection criteria (CV, AIC, BIC), applied to isotropic and anisotropic hyperelasticity with high accuracy.
Details
Motivation: To provide a systematic alternative to traditional model calibration by automating constitutive model discovery from data, exploring trade-offs between sparsity, predictive performance, and computational cost.
Method: Pairs three sparse regression algorithms (LASSO, LARS, OMP) with three model selection criteria (K-fold cross-validation, AIC, BIC) to create nine distinct algorithms for systematic model discovery.
Result: All nine algorithm-criterion combinations performed consistently well, discovering highly accurate constitutive models for both isotropic and anisotropic materials, broadening viable discovery approaches beyond LASSO.
Conclusion: The framework successfully automates constitutive model discovery, demonstrating that multiple sparse regression approaches (including OMP as a tractable ℓ0 heuristic) can effectively discover accurate material models beyond traditional ℓ1-based methods.
Abstract: The automated discovery of constitutive models from data has recently emerged as a promising alternative to the traditional model calibration paradigm. In this work, we present a fully automated framework for constitutive model discovery that systematically pairs three sparse regression algorithms (Least Absolute Shrinkage and Selection Operator (LASSO), Least Angle Regression (LARS), and Orthogonal Matching Pursuit (OMP)) with three model selection criteria: $K$-fold cross-validation (CV), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). This pairing yields nine distinct algorithms for model discovery and enables a systematic exploration of the trade-off between sparsity, predictive performance, and computational cost. While LARS serves as an efficient path-based solver for the $\ell_1$-constrained problem, OMP is introduced as a tractable heuristic for $\ell_0$-regularized selection. The framework is applied to both isotropic and anisotropic hyperelasticity, utilizing both synthetic and experimental datasets. Results reveal that all nine algorithm-criterion combinations perform consistently well in discovering isotropic and anisotropic materials, yielding highly accurate constitutive models. These findings broaden the range of viable discovery algorithms beyond $\ell_1$-based approaches such as LASSO.
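A sketch of the algorithm-criterion pairing using off-the-shelf scikit-learn estimators on a synthetic feature library (standing in for a library of candidate constitutive terms); four of the nine combinations are shown, and the discovered model is whatever subset of terms survives with nonzero coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LassoLarsIC, OrthogonalMatchingPursuitCV

# Synthetic library: 30 candidate terms, 5 truly active.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=0.1, random_state=0)

fits = {
    "LASSO + CV": LassoCV(cv=5).fit(X, y),
    "LARS + AIC": LassoLarsIC(criterion="aic").fit(X, y),
    "LARS + BIC": LassoLarsIC(criterion="bic").fit(X, y),
    "OMP + CV":   OrthogonalMatchingPursuitCV(cv=5).fit(X, y),
}
for name, m in fits.items():
    print(f"{name}: {np.sum(np.abs(m.coef_) > 1e-8)} active terms")
```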
[1080] On the Fragility of Contribution Score Computation in Federated Learning
Balazs Pejo, Marcell Frank, Krisztian Varga, Peter Veliczky, Gergely Biczok
Main category: cs.LG
TL;DR: The paper reveals that contribution evaluation in federated learning is fragile and can be distorted by both architectural choices (aggregation methods) and intentional manipulation (poisoning attacks).
Details
Motivation: To investigate the fragility of contribution evaluation in federated learning, which is crucial for ensuring fairness and incentivizing participation but may be vulnerable to distortions.
Method: The study explores two perspectives: architectural sensitivity (impact of different model aggregation methods) and intentional manipulation (poisoning attacks). Extensive experiments were conducted across diverse datasets and model architectures using the Flower framework.
Result: Both the choice of aggregation method and the presence of attackers significantly distort contribution scores. Advanced aggregation techniques and poisoning attacks can unintentionally or intentionally alter the final evaluation scores.
Conclusion: There is a critical need for more robust contribution evaluation schemes in federated learning to address vulnerabilities from both architectural choices and malicious manipulation.
Abstract: This paper investigates the fragility of contribution evaluation in federated learning, a critical mechanism for ensuring fairness and incentivizing participation. We argue that contribution scores are susceptible to significant distortions from two fundamental perspectives: architectural sensitivity and intentional manipulation. First, we explore how different model aggregation methods impact these scores. While most research assumes a basic averaging approach, we demonstrate that advanced techniques, including those designed to handle unreliable or diverse clients, can unintentionally yet significantly alter the final scores. Second, we explore vulnerabilities posed by poisoning attacks, where malicious participants strategically manipulate their model updates to inflate their own contribution scores or reduce the importance of other participants. Through extensive experiments across diverse datasets and model architectures, implemented within the Flower framework, we rigorously show that both the choice of aggregation method and the presence of attackers are potent vectors for distorting contribution scores, highlighting a critical need for more robust evaluation schemes.
[1081] Policy Compatible Skill Incremental Learning via Lazy Learning Interface
Daehee Lee, Dongsu Lee, TaeYoon Kwack, Wonje Choi, Honguk Woo
Main category: cs.LG
TL;DR: SIL-C is a framework that maintains compatibility between incrementally learned skills and downstream policies through bilateral lazy learning-based mapping, enabling skill improvements to enhance policy performance without retraining.
Details
Motivation: As skill repertoires evolve in Skill Incremental Learning (SIL), they can disrupt compatibility with existing skill-based policies, limiting reusability and generalization of policies.
Method: Uses bilateral lazy learning-based mapping to dynamically align subtask space (from policies) with skill space (agent behaviors), enabling subtasks to be executed by selecting appropriate skills based on trajectory distribution similarity.
Result: SIL-C maintains compatibility between evolving skills and downstream policies while ensuring efficiency throughout the learning process across diverse SIL scenarios.
Conclusion: SIL-C successfully enables skill improvements to enhance downstream policy performance without requiring policy retraining or structural adaptation, addressing the compatibility challenge in skill incremental learning.
Abstract: Skill Incremental Learning (SIL) is the process by which an embodied agent expands and refines its skill set over time by leveraging experience gained through interaction with its environment or by the integration of additional data. SIL facilitates efficient acquisition of hierarchical policies grounded in reusable skills for downstream tasks. However, as the skill repertoire evolves, it can disrupt compatibility with existing skill-based policies, limiting their reusability and generalization. In this work, we propose SIL-C, a novel framework that ensures skill-policy compatibility, allowing improvements in incrementally learned skills to enhance the performance of downstream policies without requiring policy re-training or structural adaptation. SIL-C employs a bilateral lazy learning-based mapping technique to dynamically align the subtask space referenced by policies with the skill space decoded into agent behaviors. This enables each subtask, derived from the policy’s decomposition of a complex task, to be executed by selecting an appropriate skill based on trajectory distribution similarity. We evaluate SIL-C across diverse SIL scenarios and demonstrate that it maintains compatibility between evolving skills and downstream policies while ensuring efficiency throughout the learning process.
[1082] Differentiable Structure Learning and Causal Discovery for General Binary Data
Chang Deng, Bryon Aragam
Main category: cs.LG
TL;DR: A differentiable structure learning framework for discrete data that captures arbitrary dependencies, avoids unrealistic simplifications of previous methods, and establishes identifiability under mild assumptions.
Details
Motivation: Existing methods assume specific structural equation models that may not match true data-generating processes, ignore complex dependence structures in discrete data, and consider only linear effects, limiting their general applicability.
Method: Proposes a differentiable structure learning framework formulated as a single differentiable optimization task that captures arbitrary dependencies among discrete variables, avoiding unrealistic simplifications of previous approaches.
Result: Empirical results demonstrate that the approach effectively captures complex relationships in discrete data, and theoretical analysis shows identifiability up to Markov equivalence under mild assumptions.
Conclusion: The framework provides a more general and realistic approach to differentiable structure learning for discrete data, capable of handling complex dependencies while establishing theoretical guarantees for identifiability.
Abstract: Existing methods for differentiable structure learning in discrete data typically assume that the data are generated from specific structural equation models. However, these assumptions may not align with the true data-generating process, which limits the general applicability of such methods. Furthermore, current approaches often ignore the complex dependence structure inherent in discrete data and consider only linear effects. We propose a differentiable structure learning framework that is capable of capturing arbitrary dependencies among discrete variables. We show that although general discrete models are unidentifiable from purely observational data, it is possible to characterize the complete set of compatible parameters and structures. Additionally, we establish identifiability up to Markov equivalence under mild assumptions. We formulate the learning problem as a single differentiable optimization task in the most general form, thereby avoiding the unrealistic simplifications adopted by previous methods. Empirical results demonstrate that our approach effectively captures complex relationships in discrete data.
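For context on the "single differentiable optimization task" framing, the sketch below shows the standard smooth acyclicity penalty used across the differentiable structure learning literature, h(W) = tr(exp(W ∘ W)) − d, which is zero exactly on DAGs. The paper's discrete-data objective differs, but it plugs into the same kind of penalized program.

```python
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    # NOTEARS-style constraint: differentiable, and 0 iff W encodes a DAG.
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

W_dag = np.triu(np.random.rand(4, 4), k=1)   # upper-triangular => acyclic
W_cyc = W_dag.copy()
W_cyc[3, 0] = 0.5                            # adds a cycle 0 -> 3 -> 0
print(acyclicity(W_dag), acyclicity(W_cyc))  # ~0 vs. strictly positive
```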
[1083] Differentiable Sparsity via $D$-Gating: Simple and Versatile Structured Penalization
Chris Kolb, Laetitia Frost, Bernd Bischl, David Rügamer
Main category: cs.LG
TL;DR: D-Gating is a differentiable structured overparameterization method that splits weights into primary vectors and gating factors, providing theoretical equivalence to L2,2/D regularization with exponential convergence guarantees.
Details
Motivation: Structured sparsity regularization is principled but non-differentiable, breaking compatibility with SGD and requiring specialized optimizers or post-hoc pruning without formal guarantees.
Method: Propose D-Gating that splits each weight group into a primary weight vector and multiple scalar gating factors, creating a fully differentiable structured overparameterization.
Result: Proved local minimum equivalence with L2,2/D penalization and exponential convergence. Validated across vision, language, and tabular tasks, showing strong performance-sparsity tradeoffs and outperforming direct optimization and pruning baselines.
Conclusion: D-Gating provides theoretical equivalence to structured sparsity problems with distinct learning dynamics that evolve from non-sparse to sparse optimization, offering a principled differentiable alternative.
Abstract: Structured sparsity regularization offers a principled way to compact neural networks, but its non-differentiability breaks compatibility with conventional stochastic gradient descent and requires either specialized optimizers or additional post-hoc pruning without formal guarantees. In this work, we propose $D$-Gating, a fully differentiable structured overparameterization that splits each group of weights into a primary weight vector and multiple scalar gating factors. We prove that any local minimum under $D$-Gating is also a local minimum using non-smooth structured $L_{2,2/D}$ penalization, and further show that the $D$-Gating objective converges at least exponentially fast to the $L_{2,2/D}$-regularized loss in the gradient flow limit. Together, our results show that $D$-Gating is theoretically equivalent to solving the original group sparsity problem, yet induces distinct learning dynamics that evolve from a non-sparse regime into sparse optimization. We validate our theory across vision, language, and tabular tasks, where $D$-Gating consistently delivers strong performance-sparsity tradeoffs and outperforms both direct optimization of structured penalties and conventional pruning baselines.
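A minimal sketch of the D-Gating reparameterization for a single weight group, under a toy regression objective: the effective weights are the primary vector times D scalar gates, and plain L2 weight decay on all factors plays the role of the smooth surrogate for the structured L_{2,2/D} penalty. Shapes and the training loop are illustrative, not the paper's code.

```python
import torch

D = 3
v = torch.randn(16, requires_grad=True)             # primary weight vector
gates = [torch.randn((), requires_grad=True) for _ in range(D)]

def effective_weights():
    # w = v * g_1 * ... * g_D : the overparameterized group.
    w = v
    for g in gates:
        w = w * g
    return w

# L2 decay on v and the gates stands in for the smooth surrogate penalty.
opt = torch.optim.SGD([v, *gates], lr=1e-2, weight_decay=1e-3)
x, y = torch.randn(32, 16), torch.randn(32)
for _ in range(100):
    opt.zero_grad()
    loss = torch.mean((x @ effective_weights() - y) ** 2)
    loss.backward()
    opt.step()
print("effective group norm:", effective_weights().norm().item())
```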
[1084] Transfer Learning on Edge Connecting Probability Estimation under Graphon Model
Yuyao Wang, Yu-Hung Cheng, Debarghya Mukherjee, Huimin Cheng
Main category: cs.LG
TL;DR: GTRANS is a transfer learning framework that improves graphon estimation for small target graphs by leveraging structural information from larger source graphs using neighborhood smoothing and Gromov-Wasserstein optimal transport, with adaptive debiasing to prevent negative transfer.
Details
Motivation: Accurate graphon estimation typically requires large graphs, but in practice only small networks are often observed. Transfer learning can help improve estimation in small target graphs by leveraging information from larger, related source graphs.
Method: GTRANS integrates neighborhood smoothing and Gromov-Wasserstein optimal transport to align and transfer structural patterns between graphs. It includes an adaptive debiasing mechanism that identifies and corrects target-specific deviations via residual smoothing.
Result: Theoretical guarantees on the stability of the estimated alignment matrix are provided. Extensive synthetic and real data experiments demonstrate GTRANS improves target graph estimation accuracy, leading to enhanced performance in downstream applications like graph classification and link prediction.
Conclusion: GTRANS effectively addresses the challenge of graphon estimation for small networks through transfer learning, with theoretical guarantees and empirical validation showing improved performance in both estimation accuracy and downstream applications.
Abstract: Graphon models provide a flexible nonparametric framework for estimating latent connectivity probabilities in networks, enabling a range of downstream applications such as link prediction and data augmentation. However, accurate graphon estimation typically requires a large graph, whereas in practice, one often only observes a small-sized network. One approach to addressing this issue is to adopt a transfer learning framework, which aims to improve estimation in a small target graph by leveraging structural information from a larger, related source graph. In this paper, we propose a novel method, namely GTRANS, a transfer learning framework that integrates neighborhood smoothing and Gromov-Wasserstein optimal transport to align and transfer structural patterns between graphs. To prevent negative transfer, GTRANS includes an adaptive debiasing mechanism that identifies and corrects for target-specific deviations via residual smoothing. We provide theoretical guarantees on the stability of the estimated alignment matrix and demonstrate the effectiveness of GTRANS in improving the accuracy of target graph estimation through extensive synthetic and real data experiments. These improvements translate directly to enhanced performance in downstream applications, such as the graph classification task and the link prediction task.
[1085] Accelerated Evolving Set Processes for Local PageRank Computation
Binbin Huang, Luo Luo, Yanghua Xiao, Deqing Yang, Baojian Zhou
Main category: cs.LG
TL;DR: A novel framework using nested evolving set processes to accelerate Personalized PageRank computation, achieving time complexity independent of graph size under certain conditions.
Details
Motivation: To develop more efficient algorithms for computing Personalized PageRank vectors, particularly for large graphs where traditional methods scale poorly.
Method: Uses nested evolving set processes with localized inexact proximal point iterations to solve simplified linear systems, requiring only O~(1/√α) such solves.
Result: Achieves time complexity of min{O~(R²/ε²), O~(m)} for ε-approximation, with overall complexity O~(R²/(√αε²)) when 1/ε² ≪ m, independent of graph size.
Conclusion: The framework provides efficient PPR computation, resolves an open conjecture, and shows practical efficiency on real-world graphs with early convergence.
Abstract: This work proposes a novel framework based on nested evolving set processes to accelerate Personalized PageRank (PPR) computation. At each stage of the process, we employ a localized inexact proximal point iteration to solve a simplified linear system. We show that the time complexity of such localized methods is upper bounded by $\min\{\tilde{\mathcal{O}}(R^2/\epsilon^2), \tilde{\mathcal{O}}(m)\}$ to obtain an $\epsilon$-approximation of the PPR vector, where $m$ denotes the number of edges in the graph and $R$ is a constant defined via nested evolving set processes. Furthermore, the algorithms induced by our framework require solving only $\tilde{\mathcal{O}}(1/\sqrt{\alpha})$ such linear systems, where $\alpha$ is the damping factor. When $1/\epsilon^2\ll m$, this implies the existence of an algorithm that computes an $\epsilon$-approximation of the PPR vector with an overall time complexity of $\tilde{\mathcal{O}}\left(R^2 / (\sqrt{\alpha}\epsilon^2)\right)$, independent of the underlying graph size. Our result resolves an open conjecture from existing literature. Experimental results on real-world graphs validate the efficiency of our methods, demonstrating significant convergence in the early stages.
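For contrast, here is the classic local forward-push baseline for approximating a PPR vector (Andersen-Chung-Lang style) that this line of work accelerates; the paper's nested evolving-set solver is a different and faster construction, not shown here.

```python
from collections import defaultdict

def ppr_push(graph, seed, alpha=0.15, eps=1e-6):
    # p holds settled PPR mass, r holds residual mass still to be pushed.
    p, r = defaultdict(float), defaultdict(float)
    r[seed] = 1.0
    queue = [seed]
    while queue:
        u = queue.pop()
        deg = len(graph[u])
        if r[u] < eps * deg:
            continue                       # residual too small to push
        p[u] += alpha * r[u]               # settle an alpha fraction
        share = (1 - alpha) * r[u] / deg   # spread the rest to neighbors
        r[u] = 0.0
        for v in graph[u]:
            r[v] += share
            if r[v] >= eps * len(graph[v]):
                queue.append(v)
    return dict(p)

graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(ppr_push(graph, seed=0))
```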
[1086] Efficient Resource-Constrained Training of Vision Transformers via Subspace Optimization
Le-Trung Nguyen, Enzo Tartaglione, Van-Tam Nguyen
Main category: cs.LG
TL;DR: WASI applies subspace-based training to transformers, reducing memory usage by 62× and FLOPs by 2× while maintaining accuracy comparable to vanilla training.
Details
Motivation: Address energy consumption and data privacy concerns in AI by enabling on-device learning, overcoming memory bottlenecks in large transformer models.
Method: Weight-Activation Subspace Iteration (WASI) restricts training to a fixed subspace where essential model information resides, mitigating backpropagation memory bottlenecks.
Result: Achieved 62× memory reduction, 2× computational cost reduction, and 1.5× faster training/inference on Raspberry Pi 5 while maintaining comparable accuracy to vanilla training.
Conclusion: WASI enables efficient on-device transformer training, making large models practical for edge devices while preserving privacy and reducing energy consumption.
Abstract: As AI increasingly shapes daily life, energy consumption and data privacy have become pressing concerns. On-device learning trains models directly on edge devices, cutting energy consumption and safeguarding data privacy. However, the expanding scale of modern neural networks creates a major obstacle for on-device training. Although prior work has concentrated on compact convolutional architectures, we instead apply subspace-based training to transformer models. Motivated by the idea that a model’s essential information lies in a fixed subspace, we introduce Weight-Activation Subspace Iteration (WASI), a method that mitigates the memory bottleneck of backpropagation and boosts inference efficiency in transformer models by restricting training to this subspace. Our results demonstrate that WASI maintains accuracy comparable to vanilla training while reducing memory usage by up to $62\times$ and computational cost (FLOPs) by up to $2\times$. On a Raspberry Pi 5, WASI achieves roughly $1.5\times$ faster training and inference than vanilla training.
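A minimal sketch of subspace-restricted updates, assuming a fixed random orthonormal basis U in place of WASI's weight-activation iteration for choosing it: gradients are projected to r dimensions before the step, so the update (and any optimizer state) lives in the r-dimensional subspace rather than the full d-dimensional parameter space.

```python
import numpy as np

d, r = 1024, 32
U, _ = np.linalg.qr(np.random.randn(d, r))   # fixed orthonormal basis (d x r)
theta = np.random.randn(d)

def grad(theta):
    # Toy quadratic objective 0.5 * ||theta - 1||^2.
    return theta - 1.0

for _ in range(200):
    g_sub = U.T @ grad(theta)                # r-dimensional gradient
    theta = theta - 0.1 * (U @ g_sub)        # update restricted to span(U)

print("loss:", 0.5 * np.mean((theta - 1.0) ** 2))
```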
[1087] Multi-Agent Regime-Conditioned Diffusion (MARCD) for CVaR-Constrained Portfolio Decisions
Ali Atiah Alzahrani
Main category: cs.LG
TL;DR: MARCD is a generative-to-decision framework that combines regime-conditioned scenarios with CVaR allocation to improve portfolio performance under regime shifts, achieving 34% reduction in maximum drawdown compared to benchmarks.
Details
Motivation: To improve portfolio decisions under regime shifts by combining generative scenarios with risk-aware allocation, addressing the need for better risk management in volatile market conditions.
Method: Four-stage framework: (1) Gaussian HMM for latent regime inference, (2) diffusion generator for regime-conditioned scenarios, (3) signal extraction via blended moments, (4) governed CVaR epigraph quadratic program with explicit constraints.
Result: MARCD achieved significantly better scenario calibration and materially smaller drawdowns: MaxDD 9.3% vs 14.1% for BL (34% reduction) during 2020-2025 out-of-sample testing on liquid multi-asset ETFs.
Conclusion: The framework demonstrates the value of decision-aware generative modeling in finance, providing an auditable pipeline with explicit constraints and improved risk management capabilities.
Abstract: We examine whether regime-conditioned generative scenarios combined with a convex CVaR allocator improve portfolio decisions under regime shifts. We present MARCD, a generative-to-decision framework with: (i) a Gaussian HMM to infer latent regimes; (ii) a diffusion generator that produces regime-conditioned scenarios; (iii) signal extraction via blended, shrunk moments; and (iv) a governed CVaR epigraph quadratic program. Contributions: Within the Scenario stage we introduce a tail-weighted diffusion objective that up-weights low-quantile outcomes relevant for drawdowns and a regime-expert (MoE) denoiser whose gate increases with crisis posteriors; both are evaluated end-to-end through the allocator. Under strict walk-forward on liquid multi-asset ETFs (2005-2025), MARCD exhibits stronger scenario calibration and materially smaller drawdowns: MaxDD 9.3% versus 14.1% for BL (a 34% reduction) over 2020-2025 out-of-sample. The framework provides an auditable pipeline with explicit budget, box, and turnover constraints, demonstrating the value of decision-aware generative modeling in finance.
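A sketch of the CVaR epigraph program in the standard Rockafellar-Uryasev form with budget and box constraints, using cvxpy and random placeholder scenarios where MARCD would supply regime-conditioned diffusion samples; the turnover constraint and the paper's exact governance terms are omitted.

```python
import cvxpy as cp
import numpy as np

S, n, beta = 500, 4, 0.95
R = 0.001 + 0.02 * np.random.randn(S, n)   # placeholder scenario returns (S x n)

w, t = cp.Variable(n), cp.Variable()
losses = -R @ w
# Rockafellar-Uryasev epigraph form: CVaR_beta = t + E[(loss - t)_+] / (1 - beta)
cvar = t + cp.sum(cp.pos(losses - t)) / ((1 - beta) * S)
prob = cp.Problem(
    cp.Minimize(cvar - 0.1 * cp.sum(R @ w) / S),   # risk minus mean return
    [cp.sum(w) == 1, w >= 0, w <= 0.5],            # budget and box constraints
)
prob.solve()
print("weights:", np.round(w.value, 3), "CVaR:", round(float(cvar.value), 4))
```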
[1088] Leveraging Teleconnections with Physics-Informed Graph Attention Networks for Long-Range Extreme Rainfall Forecasting in Thailand
Kiattikun Chobtham, Kanoksri Sarinnapakorn, Kritanai Torsri, Prattana Deeprasertkul, Jirawan Kamma
Main category: cs.LG
TL;DR: Novel physics-informed Graph Neural Networks with extreme-value analysis improve rainfall forecasting in Thailand, outperforming baselines and enhancing extreme-event prediction for water management.
Details
Motivation: Accurate rainfall forecasting, especially for extreme events, is challenging but crucial for climatology and Earth system science.
Method: Combines physics-informed Graph Neural Networks with extreme-value analysis, using Graph Attention Network with LSTM layers and Spatial Season-aware GPD method for Peak-Over-Threshold mapping.
Result: Outperforms well-established baselines across most regions, including extreme-prone areas, and remains competitive with state-of-the-art methods. Improves extreme-event prediction compared to operational SEAS5 system.
Conclusion: Provides practical enhancement for producing high-resolution rainfall maps that support decision-making in long-term water management.
Abstract: Accurate rainfall forecasting, particularly for extreme events, remains a significant challenge in climatology and the Earth system. This paper presents novel physics-informed Graph Neural Networks (GNNs) combined with extreme-value analysis techniques to improve gauge-station rainfall predictions across Thailand. The model leverages a graph-structured representation of gauge stations to capture complex spatiotemporal patterns, and it offers explainability through teleconnections. We preprocess relevant climate indices that potentially influence regional rainfall. The proposed Graph Attention Network with Long Short-Term Memory (Attention-LSTM) applies the attention mechanism using initial edge features derived from simple orographic-precipitation physics formulation. The embeddings are subsequently processed by LSTM layers. To address extremes, we perform Peak-Over-Threshold (POT) mapping using the novel Spatial Season-aware Generalized Pareto Distribution (GPD) method, which overcomes limitations of traditional machine-learning models. Experiments demonstrate that our method outperforms well-established baselines across most regions, including areas prone to extremes, and remains strongly competitive with the state of the art. Compared with the operational forecasting system SEAS5, our real-world application improves extreme-event prediction and offers a practical enhancement to produce high-resolution maps that support decision-making in long-term water management.
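A sketch of a plain peaks-over-threshold step with SciPy: fit a Generalized Pareto Distribution to exceedances over a high quantile and read off a return level (the formula assumes shape ξ ≠ 0). The paper's Spatial Season-aware GPD adds station- and season-level structure that this single-series version omits; the synthetic gamma rainfall is a placeholder.

```python
import numpy as np
from scipy.stats import genpareto

rain = np.random.gamma(shape=0.4, scale=12.0, size=5000)   # toy daily rain (mm)
u = np.quantile(rain, 0.95)                                # high threshold
exceed = rain[rain > u] - u

xi, _, sigma = genpareto.fit(exceed, floc=0)               # shape, loc, scale
p_exceed = np.mean(rain > u)                               # exceedance rate
T = 100 * 365                                              # 100-year horizon in days
# Standard GPD return level: x_T = u + (sigma/xi) * ((T * p_exceed)^xi - 1)
ret_level = u + (sigma / xi) * ((T * p_exceed) ** xi - 1)
print(f"threshold={u:.1f}mm, 100-yr return level={ret_level:.1f}mm")
```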
[1089] Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM
Areej AlOtaibi, Lina Alyahya, Raghad Alshabanah, Shahad Alfawzan, Shuruq Alarefei, Reem Alsabti, Nouf Alsubaie, Abdulaziz Alhuzaymi, Lujain Alkhelb, Majd Alsayari, Waad Alahmed, Omar Talabay, Jalal Alowibdi, Salem Alelyani, Adel Bibi
Main category: cs.LG
TL;DR: This paper addresses the challenges in developing Large Language Models for Arabic, focusing on data curation, tokenizer design, and evaluation frameworks.
Details
Motivation: LLMs have advanced NLP but developing them for Arabic presents unique challenges that need to be systematically addressed.
Method: The approach includes collection and filtration of Arabic pre-training datasets, assessment of tokenizer designs, and proposing corrective methodology for evaluation frameworks.
Result: The paper shares data and methodologies to promote transparency and collaborative development in Arabic language modeling.
Conclusion: The work contributes to advancing language modeling for Arabic by addressing key challenges and sharing resources for community development.
Abstract: Large Language Models (LLMs) have significantly advanced the field of natural language processing, enhancing capabilities in both language understanding and generation across diverse domains. However, developing LLMs for Arabic presents unique challenges. This paper explores these challenges by focusing on critical aspects such as data curation, tokenizer design, and evaluation. We detail our approach to the collection and filtration of Arabic pre-training datasets, assess the impact of various tokenizer designs on model performance, and examine the limitations of existing Arabic evaluation frameworks, for which we propose a systematic corrective methodology. To promote transparency and facilitate collaborative development, we share our data and methodologies, contributing to the advancement of language modeling, particularly for the Arabic language.
[1090] Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback
Shinji Ito, Kevin Jamieson, Haipeng Luo, Arnab Maiti, Taira Tsuchiya
Main category: cs.LG
TL;DR: First best-of-both-worlds algorithms for episodic MDPs with aggregate bandit feedback, achieving O(log T) regret in stochastic and O(√T) in adversarial settings, with matching lower bounds.
Details
Motivation: Prior work focused only on worst-case analysis, leaving a gap for algorithms that perform well in both stochastic and adversarial environments under the challenging aggregate bandit feedback model.
Method: Combines FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by online shortest path problems, with extensions to unknown transitions using confidence-based techniques.
Result: Achieves O(log T) regret in stochastic settings and O(√T) regret in adversarial settings for known transitions, with matching lower bounds proving optimality. Also provides first individual-gap-dependent lower bounds.
Conclusion: Successfully develops the first BOBW algorithms for episodic MDPs with aggregate bandit feedback, establishing optimal regret bounds and extending to unknown-transition settings.
Abstract: We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual losses at each state-action pair. While prior work in this setting has focused exclusively on worst-case analysis, we initiate the study of best-of-both-worlds (BOBW) algorithms that achieve low regret in both stochastic and adversarial environments. We propose the first BOBW algorithms for episodic tabular MDPs with aggregate bandit feedback. In the case of known transitions, our algorithms achieve $O(\log T)$ regret in stochastic settings and $O(\sqrt{T})$ regret in adversarial ones. Importantly, we also establish matching lower bounds, showing the optimality of our algorithms in this setting. We further extend our approach to unknown-transition settings by incorporating confidence-based techniques. Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems. Along the way, we also provide the first individual-gap-dependent lower bounds and demonstrate near-optimal BOBW algorithms for shortest path problems with bandit feedback.
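To make the feedback model concrete, here is one standard way to write the setting (our notation, not necessarily the paper's):

```latex
% Aggregate bandit feedback: after episode t with trajectory
% (s_1, a_1, ..., s_H, a_H), the learner observes only the episode total
L_t \;=\; \sum_{h=1}^{H} \ell_t(s_h, a_h),
% never the individual losses. Regret compares to the best fixed policy:
R_T \;=\; \mathbb{E}\Bigl[\sum_{t=1}^{T} L_t\Bigr]
  \;-\; \min_{\pi}\, \mathbb{E}\Bigl[\sum_{t=1}^{T}\sum_{h=1}^{H}
        \ell_t\bigl(s_h^{\pi}, a_h^{\pi}\bigr)\Bigr].
% Best-of-both-worlds means a single algorithm attains R_T = O(log T)
% when losses are stochastic and R_T = O(sqrt(T)) when adversarial.
```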
[1091] Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations
Tal Barami, Nimrod Berman, Ilan Naiman, Amos H. Hason, Rotem Ezra, Omri Azencot
Main category: cs.LG
TL;DR: The paper introduces the first standardized benchmark for multi-factor sequential disentanglement across six datasets, proposes a post-hoc Latent Exploration Stage and Koopman-inspired model, and shows Vision-Language Models can automate evaluation.
Details
Motivation: Prior work has focused on simpler two-factor static and dynamic settings, overlooking the inherently multi-factor nature of real-world sequential data involving multiple interacting semantic factors over time.
Method: Created a benchmark with six diverse datasets, developed modular tools for dataset integration and evaluation, proposed a post-hoc Latent Exploration Stage for automatic alignment, and introduced a Koopman-inspired model. Also used Vision-Language Models for automated annotation and evaluation.
Result: The proposed Koopman-inspired model achieves state-of-the-art results. Vision-Language Models successfully automate dataset annotation and serve as zero-shot disentanglement evaluators, eliminating the need for manual labels and human intervention.
Conclusion: The contributions provide a robust and scalable foundation for advancing multi-factor sequential disentanglement, with code available on GitHub and datasets/models on Hugging Face.
Abstract: Learning disentangled representations in sequential data is a key goal in deep learning, with broad applications in vision, audio, and time series. While real-world data involves multiple interacting semantic factors over time, prior work has mostly focused on simpler two-factor static and dynamic settings, primarily because such settings make data collection easier, thereby overlooking the inherently multi-factor nature of real-world data. We introduce the first standardized benchmark for evaluating multi-factor sequential disentanglement across six diverse datasets spanning video, audio, and time series. Our benchmark includes modular tools for dataset integration, model development, and evaluation metrics tailored to multi-factor analysis. We additionally propose a post-hoc Latent Exploration Stage to automatically align latent dimensions with semantic factors, and introduce a Koopman-inspired model that achieves state-of-the-art results. Moreover, we show that Vision-Language Models can automate dataset annotation and serve as zero-shot disentanglement evaluators, removing the need for manual labels and human intervention. Together, these contributions provide a robust and scalable foundation for advancing multi-factor sequential disentanglement. Our code is available on GitHub, and the datasets and trained models are available on Hugging Face.
[1092] Enhancing Graph Neural Networks: A Mutual Learning Approach
Paul Agbaje, Arkajyoti Mitra, Afia Anjum, Pranali Khose, Ebelechukwu Nwafor, Habeeb Olufowobi
Main category: cs.LG
TL;DR: This paper proposes a collaborative learning framework for GNNs where ensembles of student models mutually teach each other without a pre-trained teacher, using adaptive logit weighting and entropy enhancement for efficient knowledge exchange.
Details
Motivation: To explore collaborative learning among GNNs as an alternative to traditional knowledge distillation, enabling simple shallow GNN architectures to learn synergistically and perform better on multiple tasks without requiring a pre-trained teacher model.
Method: A collaborative learning framework with ensembles of student GNNs that mutually teach each other during training, featuring an adaptive logit weighting unit for efficient knowledge exchange and an entropy enhancement technique to improve mutual learning.
Result: Extensive experiments on three datasets each for node and graph classification demonstrate the effectiveness of the approach, showing that collaborative learning enables better inference performance particularly for multiple tasks.
Conclusion: Collaborative learning among GNNs without a pre-trained teacher can produce efficient models that perform better during inference, especially for tackling multiple tasks, through mutual teaching and adaptive learning strategies.
Abstract: Knowledge distillation (KD) techniques have emerged as a powerful tool for transferring expertise from complex teacher models to lightweight student models, particularly beneficial for deploying high-performance models in resource-constrained devices. This approach has been successfully applied to graph neural networks (GNNs), harnessing their expressive capabilities to generate node embeddings that capture structural and feature-related information. In this study, we depart from the conventional KD approach by exploring the potential of collaborative learning among GNNs. In the absence of a pre-trained teacher model, we show that relatively simple and shallow GNN architectures can synergetically learn efficient models capable of performing better during inference, particularly in tackling multiple tasks. We propose a collaborative learning framework where ensembles of student GNNs mutually teach each other throughout the training process. We introduce an adaptive logit weighting unit to facilitate efficient knowledge exchange among models and an entropy enhancement technique to improve mutual learning. These components dynamically empower the models to adapt their learning strategies during training, optimizing their performance for downstream tasks. Extensive experiments conducted on three datasets each for node and graph classification demonstrate the effectiveness of our approach.
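As a concrete picture of mutual teaching, the sketch below is a standard deep-mutual-learning objective adapted to an ensemble of student GNNs; the uniform `weights` default is a placeholder for the paper's adaptive logit weighting unit, and the temperature value is our assumption.

```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(logits_list, labels, weights=None, tau=2.0):
    """Deep-mutual-learning objective for an ensemble of student GNNs:
    each student fits the labels and is also pulled toward its peers'
    softened predictions. `weights` stands in for the paper's adaptive
    logit weighting unit; uniform weights are a placeholder."""
    n = len(logits_list)
    if weights is None:
        weights = torch.full((n, n), 1.0 / (n - 1))
    total = 0.0
    for i, logits_i in enumerate(logits_list):
        loss_i = F.cross_entropy(logits_i, labels)  # supervised term
        for j, logits_j in enumerate(logits_list):
            if i == j:
                continue
            peer = F.softmax(logits_j.detach() / tau, dim=-1)
            mine = F.log_softmax(logits_i / tau, dim=-1)
            # KL pull toward each peer's softened prediction
            loss_i = loss_i + weights[i, j] * F.kl_div(mine, peer, reduction="batchmean")
        total = total + loss_i
    return total / n
```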
[1093] Convergence Analysis of SGD under Expected Smoothness
Yuta Kawamoto, Hideaki Iiduka
Main category: cs.LG
TL;DR: This paper provides a refined convergence analysis of SGD under the expected smoothness condition, deriving explicit convergence rates with residual errors for various step-size schedules.
Details
Motivation: Classical SGD analyses rely on assumptions that are either too strong (bounded variance) or too coarse (uniform noise), while the expected smoothness condition offers a more flexible alternative that better captures the relationship between stochastic gradients and the objective function.
Method: The authors refine the expected smoothness condition with interpretations and sampling-dependent constants, derive bounds for the expectation of squared full gradient norm, and prove convergence rates with explicit residual errors for different step-size schedules.
Result: The analysis proves O(1/K) convergence rates for SGD under expected smoothness with explicit residual errors, unifying and extending recent work in the field.
Conclusion: The paper provides a comprehensive and self-contained convergence analysis of SGD under expected smoothness, offering refined interpretations and explicit convergence guarantees that unify and extend previous research in this area.
Abstract: Stochastic gradient descent (SGD) is the workhorse of large-scale learning, yet classical analyses rely on assumptions that can be either too strong (bounded variance) or too coarse (uniform noise). The expected smoothness (ES) condition has emerged as a flexible alternative that ties the second moment of stochastic gradients to the objective value and the full gradient. This paper presents a self-contained convergence analysis of SGD under ES. We (i) refine ES with interpretations and sampling-dependent constants; (ii) derive bounds on the expectation of the squared full gradient norm; and (iii) prove $O(1/K)$ rates with explicit residual errors for various step-size schedules. All proofs are given in full detail in the appendix. Our treatment unifies and extends recent threads (Khaled and Richtárik, 2020; Umeda and Iiduka, 2025).
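The ES condition at the heart of the analysis can be stated compactly (following Khaled and Richtárik, 2020; notation ours):

```latex
% Expected smoothness (ES): there exist constants A, B, C >= 0 such that
% for all x,
\mathbb{E}\bigl[\|\nabla f_{\xi}(x)\|^{2}\bigr]
  \;\le\; 2A\,\bigl(f(x) - f^{\inf}\bigr) + B\,\|\nabla f(x)\|^{2} + C.
% Bounded variance is the special case A = 0, B = 1, C = sigma^2,
% which is why ES is the strictly more flexible assumption.
```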
[1094] On Uncertainty Calibration for Equivariant Functions
Edward Berman, Jacob Ginesin, Marco Pacini, Robin Walters
Main category: cs.LG
TL;DR: This paper presents a theoretical framework connecting equivariance to uncertainty estimation, proving bounds on calibration errors for equivariant models and showing how symmetry mismatch causes miscalibration.
Details
Motivation: Data-sparse domains like robotic manipulation and molecular physics are challenging for deep learning. Equivariant networks can help with undersampled data, but the relationship between equivariance and model calibration/uncertainty estimation has not been studied.
Method: Developed a theory relating equivariance to uncertainty estimation by proving lower and upper bounds on uncertainty calibration errors (ECE and ENCE) under various equivariance conditions. Conducted numerical experiments with real and simulated datasets.
Result: The theoretical framework elucidates generalization limits of equivariant models and shows how symmetry mismatch results in miscalibration in both classification and regression tasks.
Conclusion: The work establishes the fundamental relationship between equivariance and uncertainty calibration, providing insights into how symmetry properties affect model confidence and calibration across different domains.
Abstract: Data-sparse settings such as robotic manipulation, molecular physics, and galaxy morphology classification are some of the hardest domains for deep learning. For these problems, equivariant networks can help improve modeling across undersampled parts of the input space, and uncertainty estimation can guard against overconfidence. However, until now, the relationships between equivariance and model confidence, and more generally between equivariance and model calibration, have yet to be studied. Since traditional classification and regression error terms show up in the definitions of calibration error, it is natural to suspect that previous work can be used to help understand the relationship between equivariance and calibration error. In this work, we present a theory relating equivariance to uncertainty estimation. By proving lower and upper bounds on uncertainty calibration errors (ECE and ENCE) under various equivariance conditions, we elucidate the generalization limits of equivariant models and illustrate how symmetry mismatch can result in miscalibration in both classification and regression. We complement our theoretical framework with numerical experiments that clarify the relationship between equivariance and uncertainty using a variety of real and simulated datasets, and we comment on trends with symmetry mismatch, group size, and aleatoric and epistemic uncertainties.
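For reference, the two calibration errors that the bounds target are standard quantities; the definitions below are the usual ones and may differ cosmetically from the paper's notation.

```latex
% Expected Calibration Error (classification), with predictions binned
% into B confidence bins S_1..S_B over N samples:
\mathrm{ECE} \;=\; \sum_{b=1}^{B} \frac{|S_b|}{N}\,
  \bigl|\mathrm{acc}(S_b) - \mathrm{conf}(S_b)\bigr|.
% Expected Normalized Calibration Error (regression), binning by predicted
% variance and comparing root-mean-variance to empirical RMSE per bin:
\mathrm{ENCE} \;=\; \frac{1}{B} \sum_{b=1}^{B}
  \frac{\bigl|\mathrm{RMV}(b) - \mathrm{RMSE}(b)\bigr|}{\mathrm{RMV}(b)}.
```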
cs.MA
[1095] Collaborative Task Assignment, Sequencing and Multi-agent Path-finding
Yifan Bai, Shruti Kotpalliwar, Christoforos Kanellakis, George Nikolakopoulos
Main category: cs.MA
TL;DR: CBS-TS is an optimal algorithm for collaborative task assignment, sequencing, and multi-agent pathfinding that alternates between task sequencing using MILP and conflict resolution using CBS with MLA*, achieving better performance than baseline methods.
Details
Motivation: To solve the TSPF problem where multiple agents must visit task locations without collisions while minimizing flowtime, considering agent-task compatibility constraints and ensuring all tasks are completed.
Method: Conflict-Based Search with Task Sequencing (CBS-TS) that alternates between MILP for task sequencing and CBS with Multi-Label A* for collision-free path planning in a search forest, limiting MILP invocations to enhance efficiency.
Result: CBS-TS outperforms Conflict-based Steiner Search (CBSS) in most scenarios with higher success rates and consistently optimal solutions, while CBSS achieves near-optimal solutions in some cases.
Conclusion: CBS-TS is an effective optimal algorithm for TSPF problems that efficiently combines task sequencing and path planning while maintaining optimality and computational efficiency.
Abstract: In this article, we address the problem of collaborative task assignment, sequencing, and multi-agent pathfinding (TSPF), where a team of agents must visit a set of task locations without collisions while minimizing flowtime. TSPF incorporates agent-task compatibility constraints and ensures that all tasks are completed. We propose a Conflict-Based Search with Task Sequencing (CBS-TS), an optimal and complete algorithm that alternates between finding new task sequences and resolving conflicts in the paths of current sequences. CBS-TS uses a mixed-integer linear program (MILP) to optimize task sequencing and employs Conflict-Based Search (CBS) with Multi-Label A* (MLA*) for collision-free path planning within a search forest. By invoking MILP for the next-best sequence only when needed, CBS-TS efficiently limits the search space, enhancing computational efficiency while maintaining optimality. We compare the performance of our CBS-TS against Conflict-based Steiner Search (CBSS), a baseline method that, with minor modifications, can address the TSPF problem. Experimental results demonstrate that CBS-TS outperforms CBSS in most testing scenarios, achieving higher success rates and consistently optimal solutions, whereas CBSS achieves near-optimal solutions in some cases. The supplementary video is available at https://youtu.be/QT8BYgvefmU.
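A rough skeleton of the alternation, as we read the abstract, is sketched below. Every callable is a hypothetical stand-in rather than the authors' API, and the exact rule for when CBS-TS re-invokes the MILP may differ from the root-expansion trigger used here.

```python
import heapq
from itertools import count

def cbs_ts(milp_next_sequence, plan_paths, first_conflict, split_constraints):
    """Sketch of the CBS-TS search forest: roots are task sequences from a
    MILP, CBS binary splitting resolves collisions inside each tree, and the
    MILP is invoked for the next-best sequence only lazily."""
    tick = count()                      # FIFO tie-breaking for the heap
    seen = []
    seq = milp_next_sequence(exclude=seen)
    seen.append(seq)
    paths, cost = plan_paths(seq, constraints=())
    open_list = [(cost, next(tick), seq, (), paths)]
    while open_list:
        cost, _, seq, cons, paths = heapq.heappop(open_list)
        conflict = first_conflict(paths)
        if conflict is None:
            return seq, paths           # cheapest conflict-free node: optimal
        for extra in split_constraints(conflict):   # standard CBS split
            result = plan_paths(seq, cons + (extra,))
            if result is not None:
                new_paths, new_cost = result
                heapq.heappush(open_list,
                               (new_cost, next(tick), seq, cons + (extra,), new_paths))
        if not cons:                    # expanding a root: lazily open the
            nxt = milp_next_sequence(exclude=seen)  # next-best task sequence
            if nxt is not None:
                seen.append(nxt)
                p, c = plan_paths(nxt, constraints=())
                heapq.heappush(open_list, (c, next(tick), nxt, (), p))
    return None
```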
[1096] LLM-augmented empirical game theoretic simulation for social-ecological systems
Jennifer Shi, Christopher K. Frantz, Christian Kimmich, Saba Siddiki, Atrisha Sarkar
Main category: cs.MA
TL;DR: The paper compares four LLM-augmented modeling frameworks for social-ecological systems and finds that expert-guided EGTA models with parameterized payoffs outperform LLM-based behavior induction.
Details
Motivation: To address the challenge of integrating different modeling approaches for social-ecological systems and evaluate whether LLM-driven simulations produce plausible behaviors for real-world governance.
Method: Compared four LLM-augmented frameworks (procedural ABMs, generative ABMs, LLM-EGTA, and expert-guided LLM-EGTA) using a case study of irrigation and fishing in the Amu Darya basin under different governance structures.
Result: Different frameworks produced strikingly different collective behavior patterns, and expert-guided EGTA models with parameterized payoffs were more effective than LLM-based behavior induction through system prompts.
Conclusion: Methodological diversity is valuable, but shaping behavior through parameterized payoffs in expert-guided models is superior to inducing behavior through LLM system prompts.
Abstract: Designing institutions for social-ecological systems requires models that capture heterogeneity, uncertainty, and strategic interaction. Multiple modeling approaches have emerged to meet this challenge, including empirical game-theoretic analysis (EGTA), which merges ABM’s scale and diversity with game-theoretic models’ formal equilibrium analysis. The newly popular class of LLM-driven simulations provides yet another approach, and it is not clear how these approaches can be integrated with one another, nor whether the resulting simulations produce a plausible range of behaviours for real-world social-ecological governance. To address this gap, we compare four LLM-augmented frameworks: procedural ABMs, generative ABMs, LLM-EGTA, and expert-guided LLM-EGTA, and evaluate them on a real-world case study of irrigation and fishing in the Amu Darya basin under centralized and decentralized governance. Our results show: First, procedural ABMs, generative ABMs, and LLM-augmented EGTA models produce strikingly different patterns of collective behaviour, highlighting the value of methodological diversity. Second, inducing behaviour through system prompts in LLMs is less effective than shaping behaviour through parameterized payoffs in an expert-guided EGTA-based model.
[1097] CreditXAI: A Multi-Agent System for Explainable Corporate Credit Rating
Yumeng Shi, Zhongliang Yang, Yisi Wang, Linna Zhou
Main category: cs.MA
TL;DR: CreditXAI is a Multi-Agent System framework that simulates credit analysts’ collaborative decision-making, improving predictive accuracy by over 7% while providing interpretable credit assessments.
Details
Motivation: Traditional deep learning methods in corporate credit rating suffer from 'black-box' problems and limited interpretability, lacking hierarchical reasoning mechanisms despite incorporating non-financial information.
Method: Proposed CreditXAI, a Multi-Agent System (MAS) framework that simulates professional credit analysts’ collaborative decision-making, focusing on business, financial, and governance risk dimensions.
Result: Multi-agent collaboration improves predictive accuracy by more than 7% over the best single-agent baseline, demonstrating significant synergistic advantage in corporate credit risk evaluation.
Conclusion: The study provides a new technical pathway to build intelligent and interpretable credit rating models through multi-agent collaboration.
Abstract: In the domain of corporate credit rating, traditional deep learning methods have improved predictive accuracy but still suffer from the inherent ‘black-box’ problem and limited interpretability. While incorporating non-financial information enriches the data and provides partial interpretability, the models still lack hierarchical reasoning mechanisms, limiting their comprehensive analytical capabilities. To address these challenges, we propose CreditXAI, a Multi-Agent System (MAS) framework that simulates the collaborative decision-making process of professional credit analysts. The framework focuses on business, financial, and governance risk dimensions to generate consistent and interpretable credit assessments. Experimental results demonstrate that multi-agent collaboration improves predictive accuracy by more than 7% over the best single-agent baseline, confirming its significant synergistic advantage in corporate credit risk evaluation. This study provides a new technical pathway to build intelligent and interpretable credit rating models.
[1098] CGoT: A Novel Inference Mechanism for Embodied Multi-Agent Systems Using Composable Graphs of Thoughts
Yixiao Nie, Yang Zhang, Yingjie Jin, Zhepeng Wang, Xiu Li, Xiang Li
Main category: cs.MA
TL;DR: A novel vehicle-robot system using autonomous vehicles to transport service robots in office parks, enhanced by LLMs and a new CGoT inference mechanism for improved operational efficiency.
Details
Motivation: To leverage the growing integration of self-driving cars and service robots with LLM advancements to create more efficient cooperative systems for industrial and everyday applications.
Method: Proposes a vehicle-robot system where autonomous ego-vehicles transport service robots to perform tasks, incorporating LLMs and a novel CGoT inference mechanism for agents carrying other agents.
Result: Experimental results validate the performance of the proposed method, demonstrating feasibility and benefits of the integrated system.
Conclusion: The study successfully demonstrates that incorporating LLMs into vehicle-robot systems enhances operational efficiency and maximizes cooperative potential between autonomous vehicles and service robots.
Abstract: The integration of self-driving cars and service robots is becoming increasingly prevalent across a wide array of fields, playing a crucial and expanding role in both industrial applications and everyday life. In parallel, the rapid advancements in Large Language Models (LLMs) have garnered substantial attention and interest within the research community. This paper introduces a novel vehicle-robot system that leverages the strengths of both autonomous vehicles and service robots. In our proposed system, two autonomous ego-vehicles transport service robots to locations within an office park, where they perform a series of tasks. The study explores the feasibility and potential benefits of incorporating LLMs into this system, with the aim of enhancing operational efficiency and maximizing the potential of the cooperative mechanisms between the vehicles and the robots. This paper proposes a novel inference mechanism, called CGoT, for this type of system, in which one agent can carry another. Experimental results are presented to validate the performance of the proposed method.
[1099] IFS: Information Flow Structure for Multi-agent Ad Hoc System
Yanqing Fu, Chenrun Wang, Chao Huang, Zhuping Wang
Main category: cs.MA
TL;DR: Proposes IFS (Information Flow Structure) to address insufficient information flow and limited processing capacity in multi-agent ad hoc systems, showing improved performance in StarCraft II experiments.
Details
Motivation: Multi-agent ad hoc systems face challenges with uncertainty, partial observability, and dynamic team compositions, where existing approaches have insufficient information flow and limited processing capacity.
Method: Proposed IFS framework that addresses information flow challenges through improved communication and information fusion mechanisms.
Result: IFS significantly improves information flow and processing capacity in StarCraft II experiments, demonstrating strong generalization and outperforming baseline methods in complex ad hoc teamwork scenarios.
Conclusion: The proposed IFS structure effectively addresses key limitations in multi-agent ad hoc systems and shows superior performance in dynamic collaborative environments.
Abstract: Multi-agent ad hoc systems are dynamic collaborative systems in which multiple autonomous agents must cooperate with both known and unknown teammates in open environments, without relying on pre-coordinated strategies. These systems operate under conditions of uncertainty and partial observability, where team composition, agent behaviors, and environmental factors may change during execution. Through an analysis of information flow in such systems, we identify two key limitations in existing research: insufficient information flow and limited information processing capacity. To address these issues, we propose an information flow structure for multi-agent ad hoc systems (IFS), which tackles these challenges from the perspectives of communication and information fusion. Experimental results in StarCraft II demonstrate that IFS significantly improves both information flow and processing capacity, while exhibiting strong generalization capabilities and outperforming baseline methods in complex ad hoc teamwork scenarios.
[1100] Group size effects and collective misalignment in LLM multi-agent systems
Ariel Flint, Luca Maria Aiello, Romualdo Pastor-Satorras, Andrea Baronchelli
Main category: cs.MA
TL;DR: Systematic study of how group size affects multi-agent LLM dynamics in coordination games, revealing non-linear effects and critical population thresholds.
Details
Motivation: Existing work only contrasts single agents vs fixed-size collectives, leaving open how group size shapes multi-agent dynamics, particularly in misalignment scenarios.
Method: Systematically explored outcomes across the full range of group sizes in coordination games and developed a mean-field analytical approach to study convergence.
Result: Collective bias is deeper than previously thought - interaction can amplify, introduce, or override biases; group size affects dynamics non-linearly; above critical size, simulations converge to deterministic predictions.
Conclusion: Group size is a key driver of multi-agent dynamics, highlighting need to consider population-level effects when deploying LLM systems at scale.
Abstract: Multi-agent systems of large language models (LLMs) are rapidly expanding across domains, introducing dynamics not captured by single-agent evaluations. Yet, existing work has mostly contrasted the behavior of a single agent with that of a collective of fixed size, leaving open a central question: how does group size shape dynamics? Here, we move beyond this dichotomy and systematically explore outcomes across the full range of group sizes. We focus on multi-agent misalignment, building on recent evidence that interacting LLMs playing a simple coordination game can generate collective biases absent in individual models. First, we show that collective bias is a deeper phenomenon than previously assessed: interaction can amplify individual biases, introduce new ones, or override model-level preferences. Second, we demonstrate that group size affects the dynamics in a non-linear way, revealing model-dependent dynamical regimes. Finally, we develop a mean-field analytical approach and show that, above a critical population size, simulations converge to deterministic predictions that expose the basins of attraction of competing equilibria. These findings establish group size as a key driver of multi-agent dynamics and highlight the need to consider population-level effects when deploying LLM-based systems at scale.
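As a purely illustrative stand-in (the paper's agents are LLMs, not this imitation rule), a toy coordination game makes the group-size question tangible: sweep `n_agents` and watch whether a small individual `bias` is amplified or washed out as the population grows. The update rule and constants are our assumptions.

```python
import random

def coordination_game(n_agents, n_rounds, bias=0.02, seed=0):
    """Toy pairwise coordination dynamics: on miscoordination, one side
    switches, with a slight individual preference for option 'A'.
    Returns the final fraction of agents playing 'A'."""
    rng = random.Random(seed)
    state = [rng.choice("AB") for _ in range(n_agents)]
    for _ in range(n_rounds):
        i, j = rng.sample(range(n_agents), 2)
        if state[i] != state[j]:
            loser = rng.choice((i, j))
            other = j if loser == i else i
            # Biased adoption: slightly more likely to settle on 'A'.
            state[loser] = "A" if rng.random() < 0.5 + bias else state[other]
    return state.count("A") / n_agents

# Growing populations: mean-field predictions become accurate above a
# critical size, per the paper's analysis.
for n in (4, 16, 64, 256):
    print(n, coordination_game(n, n_rounds=100 * n))
```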
[1101] Hollywood Town: Long-Video Generation via Cross-Modal Multi-Agent Orchestration
Zheng Wei, Mingchen Li, Zeqian Zhang, Ruibin Yuan, Pan Hui, Huamin Qu, James Evans, Maneesh Agrawala, Anyi Rao
Main category: cs.MA
TL;DR: The paper introduces OmniAgent, a hierarchical graph-based multi-agent framework for long video generation, featuring hypergraph nodes for context sharing and cyclic graphs with limited retries for iterative refinement.
Details
Motivation: To enhance multi-agent collaboration in creative tasks like long video generation by addressing challenges in modular specialization, contextual information sharing, and iterative refinement.
Method: Proposes three innovations: 1) OmniAgent framework with film-production-inspired hierarchical architecture, 2) hypergraph nodes for temporary group discussions to share context, 3) transition from DAGs to directed cyclic graphs with limited retries for iterative output refinement.
Result: The innovations enable more robust multi-agent systems with improved collaboration, reduced individual memory requirements, and enhanced output quality through iterative feedback loops.
Conclusion: These contributions provide foundational advancements for developing more effective multi-agent systems in creative domains, particularly for complex tasks like long video generation.
Abstract: Recent advancements in multi-agent systems have demonstrated significant potential for enhancing creative task performance, such as long video generation. This study introduces three innovations to improve multi-agent collaboration. First, we propose OmniAgent, a hierarchical, graph-based multi-agent framework for long video generation that leverages a film-production-inspired architecture to enable modular specialization and scalable inter-agent collaboration. Second, inspired by context engineering, we propose hypergraph nodes that enable temporary group discussions among agents lacking sufficient context, reducing individual memory requirements while ensuring adequate contextual information. Third, we transition from directed acyclic graphs (DAGs) to directed cyclic graphs with limited retries, allowing agents to reflect and refine outputs iteratively, thereby improving earlier stages through feedback from subsequent nodes. These contributions lay the groundwork for developing more robust multi-agent systems in creative tasks.
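The "directed cyclic graph with limited retries" idea reduces to a bounded feedback loop; a minimal sketch, with `produce` and `critique` as hypothetical agent calls:

```python
def run_with_retries(produce, critique, max_retries=2):
    """A downstream node's feedback loops back into an upstream node,
    bounded so the cycle cannot run forever. `produce(feedback)` generates
    an output; `critique(output)` returns (ok, feedback)."""
    feedback = None
    output = None
    for _ in range(max_retries + 1):
        output = produce(feedback)
        ok, feedback = critique(output)
        if ok:
            return output
    return output  # best effort after exhausting retries
```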
[1102] Agent-GSPO: Communication-Efficient Multi-Agent Systems via Group Sequence Policy Optimization
Yijia Fan, Jusheng Zhang, Jing Yang, Keze Wang
Main category: cs.MA
TL;DR: Agent-GSPO is a framework that optimizes token economy in multi-agent systems using sequence-level reinforcement learning, achieving state-of-the-art performance with significantly reduced token consumption.
Details
Motivation: To address the prohibitive communication costs in multi-agent systems, particularly the high token consumption that makes such systems economically unviable.
Method: Uses Group Sequence Policy Optimization (GSPO) algorithm with communication-aware reward that explicitly penalizes verbosity, training agents through sequence-level reinforcement learning.
Result: Achieves new state-of-the-art performance across seven reasoning benchmarks while using only a fraction of the token consumption compared to existing methods, with emergent strategies like “strategic silence”.
Conclusion: Provides a practical blueprint for developing scalable and economically viable multi-agent systems by directly optimizing for token economy.
Abstract: To combat the prohibitive communication costs of “free-for-all” multi-agent systems (MAS), we introduce Agent-GSPO, a framework that directly optimizes for token economy using sequence-level reinforcement learning. Agent-GSPO leverages the stable and memory-efficient Group Sequence Policy Optimization (GSPO) algorithm to train agents on a communication-aware reward that explicitly penalizes verbosity. Across seven reasoning benchmarks, Agent-GSPO not only achieves new state-of-the-art performance but does so with a fraction of the token consumption of existing methods. By fostering emergent strategies like “strategic silence,” our approach provides a practical blueprint for developing scalable and economically viable multi-agent systems.
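The communication-aware reward plausibly takes a shape like the sketch below; the linear penalty, `lam`, and the token cap are our assumptions, since the paper's exact functional form is not given here.

```python
def communication_aware_reward(task_reward, response_tokens, lam=0.01, cap=512):
    """Illustrative reward that pays for task success but charges for
    verbosity, so 'strategic silence' can emerge when speaking adds no
    task value. lam and cap are assumed constants."""
    return task_reward - lam * min(len(response_tokens), cap)
```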
[1103] ColorEcosystem: Powering Personalized, Standardized, and Trustworthy Agentic Service in massive-agent Ecosystem
Fangwen Wu, Zheng Wu, Jihong Wang, Yunku Chen, Ruiguang Pei, Heyuan Huang, Xin Liao, Xingyu Lou, Huarong Deng, Zhihui Fu, Weiwen Liu, Zhuosheng Zhang, Weinan Zhang, Jun Wang
Main category: cs.MA
TL;DR: ColorEcosystem is a blueprint for massive-agent ecosystems that addresses challenges like impersonal services, lack of standardization, and untrustworthy behavior through three key components: agent carrier for personalization, agent store for standardization, and agent audit for trustworthiness.
Details
Motivation: Current massive-agent ecosystems face growing challenges including impersonal service experiences, lack of standardization, and untrustworthy behavior, which hinder effective agentic service management at scale.
Method: ColorEcosystem consists of three key components: agent carrier (provides personalized service experiences using user-specific data and digital twins), agent store (centralized standardized platform for managing diverse agentic services), and agent audit (ensures integrity and credibility through supervision of developer and user activities).
Result: The proposed ColorEcosystem blueprint is positioned to power personalized, standardized, and trustworthy agentic service across massive-agent ecosystems, with partial implementation already completed and code open-sourced.
Conclusion: ColorEcosystem addresses critical challenges in massive-agent ecosystems through its three-component architecture, enabling scalable personalized, standardized, and trustworthy agentic services.
Abstract: With the rapid development of (multimodal) large language model-based agents, the landscape of agentic service management has evolved from single-agent systems to multi-agent systems, and now to massive-agent ecosystems. Current massive-agent ecosystems face growing challenges, including impersonal service experiences, a lack of standardization, and untrustworthy behavior. To address these issues, we propose ColorEcosystem, a novel blueprint designed to enable personalized, standardized, and trustworthy agentic service at scale. Concretely, ColorEcosystem consists of three key components: agent carrier, agent store, and agent audit. The agent carrier provides personalized service experiences by utilizing user-specific data and creating a digital twin, while the agent store serves as a centralized, standardized platform for managing diverse agentic services. The agent audit, based on the supervision of developer and user activities, ensures the integrity and credibility of both service providers and users. Through the analysis of challenges, transitional forms, and practical considerations, the ColorEcosystem is poised to power personalized, standardized, and trustworthy agentic service across massive-agent ecosystems. Meanwhile, we have also implemented part of ColorEcosystem’s functionality, and the relevant code is open-sourced at https://github.com/opas-lab/color-ecosystem.
cs.MM
[1104] Enabling American Sign Language Communication Under Low Data Rates
Panneer Selvam Santhalingam, Swann Thantsin, Ahmad Kamari, Parth Pathak, Kenneth De Haan
Main category: cs.MM
TL;DR: VC4ASL enables ASL communication over audio channels in video conferencing apps when video fails, using pose encoding and reconstruction with error correction.
Details
Motivation: Video conferencing relies on high-speed internet, forcing ASL users to use audio-only mode which doesn't support their visual language, creating communication barriers.
Method: Encodes and transmits human pose information through audio channel, then renders signed content. Uses novel error detection/correction exploiting human pose structural constraints.
Result: System effectively facilitates intelligible ASL communication over audio in low-bandwidth scenarios where video transmission is impaired.
Conclusion: VC4ASL provides a practical solution for ASL communication in degraded network conditions without requiring platform modifications.
Abstract: In recent years, video conferencing applications have become increasingly prevalent, relying heavily on high-speed internet connectivity. When such connectivity is lacking, users often default to audio-only communication, a mode that significantly disadvantages American Sign Language (ASL) users, whose communication relies on hand gestures, body movement, and facial expressions. In this work, we introduce VC4ASL, a system designed to enable ASL communication over the audio channel of existing video conferencing applications, even in the absence of reliable video. VC4ASL integrates seamlessly with current platforms without requiring any modifications. Our approach establishes a communication channel through audio by encoding and transmitting human pose information, which is then rendered to reconstruct signed content. We propose novel receive-side error detection and correction mechanisms that exploit the inherent structural constraints of human pose data. To evaluate the system, we simulate network-degraded environments, generate pose-based ASL video sequences, and conduct user studies to assess comprehension among ASL users. Experimental results demonstrate that VC4ASL effectively facilitates intelligible ASL communication over audio in low-bandwidth scenarios where video transmission is impaired.
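The structural error detection can be pictured as receive-side sanity checks on decoded poses; the sketch below checks bone-length consistency, with the joint indices and nominal lengths being hypothetical (the paper's pose format and actual checks are not specified here).

```python
import math

# Hypothetical skeleton pairs (joint-index pairs) with nominal bone lengths
# in normalized units; COCO-style upper-body indexing is assumed.
BONES = {(5, 7): 0.25, (7, 9): 0.22, (6, 8): 0.25, (8, 10): 0.22}

def pose_is_plausible(keypoints, tol=0.35):
    """Flag decoded frames whose bone lengths deviate too far from their
    nominal values, in the spirit of VC4ASL's structural error detection.
    keypoints: list of (x, y) with at least 11 joints."""
    for (a, b), nominal in BONES.items():
        xa, ya = keypoints[a]
        xb, yb = keypoints[b]
        length = math.hypot(xa - xb, ya - yb)
        if abs(length - nominal) / nominal > tol:
            return False  # likely corrupted by channel errors
    return True
```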
[1105] CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection
Fanxiao Li, Jiaying Wu, Canyuan He, Wei Zhou
Main category: cs.MM
TL;DR: CMIE framework improves out-of-context misinformation detection by identifying underlying semantic relationships between images and text, outperforming existing methods.
Details
Motivation: Current MLLMs struggle to capture deeper semantic relationships between images and text for OOC misinformation detection, and are sensitive to noise in evidence.
Method: Proposes CMIE framework with Coexistence Relationship Generation (CRG) strategy and Association Scoring (AS) mechanism to identify underlying coexistence relationships and selectively use relevant evidence.
Result: Experimental results show CMIE outperforms existing methods in detecting out-of-context misinformation.
Conclusion: The CMIE framework effectively addresses limitations of MLLMs in capturing deeper semantic relationships and handling noisy evidence for improved misinformation detection.
Abstract: Multimodal large language models (MLLMs) have demonstrated impressive capabilities in visual reasoning and text generation. While previous studies have explored the application of MLLMs for detecting out-of-context (OOC) misinformation, our empirical analysis reveals two persisting challenges of this paradigm. Evaluating the representative GPT-4o model on direct reasoning and evidence-augmented reasoning, results indicate that MLLMs struggle to capture deeper relationships: specifically, cases in which the image and text are not directly connected but are associated through underlying semantic links. Moreover, noise in the evidence further impairs detection accuracy. To address these challenges, we propose CMIE, a novel OOC misinformation detection framework that incorporates a Coexistence Relationship Generation (CRG) strategy and an Association Scoring (AS) mechanism. CMIE identifies the underlying coexistence relationships between images and text, and selectively utilizes relevant evidence to enhance misinformation detection. Experimental results demonstrate that our approach outperforms existing methods.
eess.AS
[1106] A Unified Framework for Direction and Diffuseness Estimation Using Tight-Frame Microphone Arrays
Akira Omoto
Main category: eess.AS
TL;DR: A unified framework for estimating sound-field direction and diffuseness using practical microphone arrays with different spatial configurations, enabling consistent diffuseness evaluation across heterogeneous geometries.
Details
Motivation: To develop robust, broadband methods for spatial-sound-field characterization that connect theoretical diffuseness analysis with implementable array designs, overcoming limitations of traditional approaches requiring mode whitening or spherical-harmonic decomposition.
Method: Formulated a velocity-only covariance approach based on covariance-based diffuseness models. Compared three array types (A-format, rigid-sphere, and newly proposed tight-frame array) through simulations and measurement-based experiments.
Result: The tight-frame configuration achieves near-isotropic directional sampling and reproduces diffuseness characteristics comparable to higher-order spherical arrays while maintaining compact physical structure. The framework also enables accurate direction-of-arrival estimation based on acoustic intensity.
Conclusion: The proposed framework successfully bridges theoretical diffuseness analysis with practical array designs, supporting the development of robust spatial-sound-field characterization methods that work across different array geometries.
Abstract: This work presents a unified framework for estimating both sound-field direction and diffuseness using practical microphone arrays with different spatial configurations. Building on covariance-based diffuseness models, we formulate a velocity-only covariance approach that enables consistent diffuseness evaluation across heterogeneous array geometries without requiring mode whitening or spherical-harmonic decomposition. Three array types – an A-format array, a rigid-sphere array, and a newly proposed tight-frame array – are modeled and compared through both simulations and measurement-based experiments. The results show that the tight-frame configuration achieves near-isotropic directional sampling and reproduces diffuseness characteristics comparable to those of higher-order spherical arrays, while maintaining a compact physical structure. We further examine the accuracy of direction-of-arrival estimation based on acoustic intensity within the same framework. These findings connect theoretical diffuseness analysis with implementable array designs and support the development of robust, broadband methods for spatial-sound-field characterization.
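For orientation, the classical intensity-based quantities behind direction and diffuseness estimation can be written as follows (a DirAC-style formulation in our notation; the paper works with covariance-based estimators built on the same ingredients):

```latex
% Sound intensity from pressure p and particle velocity u (time averages
% denoted by overbars); the DOA points opposite the net energy flow:
\mathbf{I} \;=\; \overline{p(t)\,\mathbf{u}(t)}, \qquad
\hat{\theta}_{\mathrm{DOA}} \;=\; -\,\mathbf{I} / \|\mathbf{I}\|.
% Diffuseness in [0, 1]: zero for a single plane wave, approaching one in
% an isotropic field (E is the energy density, c the speed of sound):
\psi \;=\; 1 - \frac{\|\overline{p(t)\,\mathbf{u}(t)}\|}{c\,\overline{E(t)}}.
```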
[1107] Bridging the Perceptual-Statistical Gap in Dysarthria Assessment: Why Machine Learning Still Falls Short
Krishna Gurugubelli
Main category: eess.AS
TL;DR: This paper analyzes the gap between human expert performance and machine learning models in dysarthria detection and severity assessment, identifying a “perceptual-statistical gap” and proposing strategies to bridge it.
Details
Motivation: Despite progress in acoustic modeling and deep learning, automated dysarthria detection models still fall short of human expert performance, limiting their clinical impact.
Method: The paper provides comprehensive analysis of human expert perceptual processes, surveys ML representations and methods, reviews existing literature, presents theoretical analysis of label noise and inter-rater variability limits, and outlines practical strategies including perceptually motivated features, self-supervised pretraining, ASR-informed objectives, multimodal fusion, human-in-the-loop training, and explainability methods.
Result: The paper identifies the “perceptual-statistical gap” as a key conceptual divergence and provides theoretical analysis of the limitations imposed by label noise and inter-rater variability in dysarthria assessment.
Conclusion: The paper proposes experimental protocols and evaluation metrics aligned with clinical goals to guide future research toward clinically reliable and interpretable dysarthria assessment tools that can bridge the perceptual-statistical gap.
Abstract: Automated dysarthria detection and severity assessment from speech have attracted significant research attention due to their potential clinical impact. Despite rapid progress in acoustic modeling and deep learning, models still fall short of human expert performance. This manuscript provides a comprehensive analysis of the reasons behind this gap, emphasizing a conceptual divergence we term the “perceptual-statistical gap”. We detail human expert perceptual processes, survey machine learning representations and methods, review existing literature on feature sets and modeling strategies, and present a theoretical analysis of limits imposed by label noise and inter-rater variability. We further outline practical strategies to narrow the gap: perceptually motivated features, self-supervised pretraining, ASR-informed objectives, multimodal fusion, human-in-the-loop training, and explainability methods. Finally, we propose experimental protocols and evaluation metrics aligned with clinical goals to guide future research toward clinically reliable and interpretable dysarthria assessment tools.
[1108] Binaural Signal Matching with Wearable Arrays for Near-Field Sources and Directional Focus
Sapir Goldring, Zamir Ben Hur, David Lou Alon, Chad McKell, Sebastian Prepelita, Boaz Rafaely
Main category: eess.AS
TL;DR: This paper extends Binaural Signal Matching (BSM) for near-field sound reproduction using wearable glasses-mounted microphones, introducing a near-field extension (NF-BSM) with distance modeling and Field of View weighting (NF-FoV-BSM) to improve performance for close sources.
Details
Motivation: Conventional BSM assumes far-field sources, but wearable audio systems often deal with near-field sources. Previous work showed degradation for very close sources, motivating the need for improved near-field modeling.
Method: Proposed near-field BSM (NF-BSM) with distance-dependent modeling and Field of View weighting (NF-FoV-BSM) to emphasize perceptually relevant directions. Evaluated using realistic simulated HRTFs and ATFs, accounting for head rotation and analyzing binaural cues (ILD, ITD).
Result: NF-BSM outperforms traditional far-field BSM in near-field scenarios. NF-FoV-BSM achieves the best perceptual and objective quality, particularly for close sources and under head rotation conditions.
Conclusion: Far-field models have limitations for near-field sources. Incorporating source distance and directional weighting significantly improves binaural reproduction performance for wearable spatial audio systems.
Abstract: This paper investigates the performance of Binaural Signal Matching (BSM) methods for near-field sound reproduction using a wearable glasses-mounted microphone array. BSM is a flexible, signal-independent approach for binaural rendering with arbitrary arrays, but its conventional formulation assumes far-field sources. In our previous work, we proposed a near-field extension of BSM (NF-BSM) that incorporates distance-dependent modeling and showed improved performance over far-field BSM using analytic data, though degradation persisted for sources very close to the array. In this study, we extend that analysis by using realistic simulated data of near-field Head-Related Transfer Functions (HRTFs) and Acoustic Transfer Functions (ATFs) of the array, accounting for listener head rotation and evaluating binaural cues such as interaural level and time differences (ILD and ITD). A key contribution is the introduction of a Field of View (FoV) weighting, designed to emphasize perceptually relevant directions and improve robustness under challenging conditions. Results from both simulation and a listening test confirm that NF-BSM outperforms traditional far-field BSM in near-field scenarios, and that the proposed NF-FoV-BSM method achieves the best perceptual and objective quality among all tested methods, particularly at close source distances and under head rotation. These findings highlight the limitations of far-field models for near-field sources and demonstrate that incorporating source distance and directional weighting can significantly improve binaural reproduction performance for wearable spatial audio systems.
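One way to write the weighted matching objective, in our own notation (the paper's exact formulation may differ):

```latex
% For each ear, find filters c minimizing the mismatch between the array
% response and the (near-field) ear transfer function over K source positions:
\min_{\mathbf{c}} \; \sum_{k=1}^{K} w_k \,
  \bigl| h(\theta_k, r_k) - \mathbf{v}^{H}(\theta_k, r_k)\,\mathbf{c} \bigr|^{2}
  \;+\; \lambda \|\mathbf{c}\|^{2},
% where v(theta_k, r_k) stacks the microphones' ATFs for a source at
% direction theta_k and distance r_k, and the FoV weights w_k emphasize
% frontal directions; far-field BSM is the limit r_k -> infinity.
```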
[1109] Empowering Multimodal Respiratory Sound Classification with Counterfactual Adversarial Debiasing for Out-of-Distribution Robustness
Heejoon Koo, Miika Toikkanen, Yoon Tae Kim, Soo Yong Kim, June-Woo Kim
Main category: eess.AS
TL;DR: A counterfactual adversarial debiasing framework for multimodal respiratory sound classification that suppresses spurious correlations from patient metadata to improve generalization under distribution shifts.
Details
Motivation: Current multimodal respiratory sound classification methods are vulnerable to spurious correlations from attributes like age, sex, and acquisition device, which hinder generalization across clinical sites with distribution shifts.
Method: Three-stage approach: 1) Causal graph-based counterfactual debiasing to suppress non-causal dependencies, 2) Adversarial debiasing to learn metadata-insensitive representations, 3) Counterfactual metadata augmentation to mitigate spurious correlations and strengthen metadata-invariant representations.
Result: The method consistently outperforms strong baselines in evaluations under both in-distribution and distribution shift scenarios.
Conclusion: The proposed counterfactual adversarial debiasing framework effectively addresses metadata-related biases in multimodal respiratory sound classification, improving generalization performance across different clinical settings.
Abstract: Multimodal respiratory sound classification offers promise for early pulmonary disease detection by integrating bioacoustic signals with patient metadata. Nevertheless, current approaches remain vulnerable to spurious correlations from attributes such as age, sex, or acquisition device, which hinder their generalization, especially under distribution shifts across clinical sites. To this end, we propose a counterfactual adversarial debiasing framework. First, we employ a causal graph-based counterfactual debiasing strategy to suppress non-causal dependencies from patient metadata. Second, we introduce adversarial debiasing to learn metadata-insensitive representations and reduce metadata-specific biases. Third, we design counterfactual metadata augmentation to mitigate spurious correlations further and strengthen metadata-invariant representations. By doing so, our method consistently outperforms strong baselines in evaluations under both in-distribution and distribution shifts. The code is available at https://github.com/RSC-Toolkit/BTS-CARD.
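Adversarial debiasing of this kind is commonly implemented with a gradient reversal layer; the sketch below shows that standard pattern under the assumption that the paper uses something similar (the metadata adversary and the weighting `lam` are placeholders).

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity on the forward pass, negated
    (and scaled) gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

def adversarial_debias_loss(features, metadata_labels, adversary, lam=1.0):
    # The adversary tries to predict metadata (e.g., device, sex, age group)
    # from features; reversed gradients push the encoder to remove that
    # information, yielding metadata-insensitive representations.
    logits = adversary(GradReverse.apply(features, lam))
    return nn.functional.cross_entropy(logits, metadata_labels)
```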
[1110] UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models
Wenming Tu, Guanrou Yang, Ruiqi Yan, Wenxi Chen, Ziyang Ma, Yipeng Kang, Kai Yu, Xie Chen, Zilong Zheng
Main category: eess.AS
TL;DR: UltraVoice is a large-scale speech dialogue dataset enabling fine-grained control over six speech style dimensions (emotion, speed, volume, accent, language, composite styles), which significantly improves speech models’ stylistic controllability without compromising core conversational abilities.
Details
Motivation: Current spoken dialogue models lack fine-grained speech style control capabilities, focusing primarily on functional abilities like reasoning and question answering, which limits human-like interaction quality.
Method: Created UltraVoice dataset with 830+ hours of speech dialogues annotated with instructions across six stylistic dimensions. Fine-tuned leading models (SLAM-Omni, VocalNet) on this dataset to enhance speech style controllability.
Result: Fine-tuned models achieved 29.12-42.33% MOS improvements and 14.61-40.09 percentage point IFR gains on multi-dimensional control tasks. Also showed +10.84% and +7.87% improvements on URO-Bench benchmark for core conversational abilities. The dataset also enables training controllable TTS models.
Conclusion: UltraVoice successfully addresses the speech style control gap in dialogue models, demonstrating substantial improvements in both stylistic controllability and core conversational performance, with broad applicability for expressive speech synthesis.
Abstract: Spoken dialogue models currently lack the ability for fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for multiple fine-grained speech style control. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key speech stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on multi-dimensional control tasks designed in the UltraVoice. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset’s utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis. The complete dataset and model checkpoints are available at: https://github.com/bigai-nlco/UltraVoice.
[1111] Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMS
Anand, Umberto Cappellazzo, Stavros Petridis, Maja Pantic
Main category: eess.AS
TL;DR: This paper studies attention sinks and massive activations in multimodal speech recognition LLMs, identifies their patterns across ASR, VSR, and AVSR tasks, and proposes a decorrelation loss method to mitigate these issues while improving performance.
Details
Motivation: To understand the internal dynamics of LLMs in multimodal speech recognition during fine-tuning, particularly the phenomena of attention sinks and massive activations that were previously observed in NLP but not studied in multimodal contexts.
Method: Conducted detailed analysis of audio-visual LLMs to identify attention sinks and massive activations, then introduced a simple decorrelation loss that reduces cosine similarity between BOS and other tokens to mitigate these phenomena.
Result: Identified attention sinks and massive activations at BOS and intermediate low-semantic tokens across ASR, VSR, and AVSR; showed massive activations originate in MLP layers with fixed feature indices; intermediate sink tokens exhibit high cosine similarity to BOS token; decorrelation loss effectively mitigates intermediate sinks and massive activations while improving WER under high audio-visual feature downsampling.
Conclusion: The proposed decorrelation loss successfully addresses attention sink and massive activation issues in multimodal speech recognition LLMs, leading to improved performance and stability across different downsampling rates.
Abstract: Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, understanding of their internal dynamics under fine-tuning remains limited. In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and associated massive activations in which some features of sink tokens exhibit huge activation in LLMs. In this work, we are the first to study these phenomena in multimodal speech recognition. Through a detailed analysis of audio-visual LLMs, we identify attention sinks and massive activations not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. We show that massive activations originate in the MLP layers and correspond to fixed feature indices across all sink tokens. We further show that intermediate sink tokens exhibit high cosine similarity to the BOS token, thereby amplifying attention and activation. Building on these insights, we introduce a simple decorrelation loss that reduces cosine similarity between BOS and other tokens, effectively mitigating intermediate sinks and massive activations. Furthermore, our method improves word error rate (WER) under high audio-visual feature downsampling while remaining stable at lower downsampling rates.
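The decorrelation loss is simple to state; a minimal sketch, with the layer choice and any loss weighting left as assumptions:

```python
import torch
import torch.nn.functional as F

def bos_decorrelation_loss(hidden_states):
    """Penalize cosine similarity between the BOS token's hidden state and
    every other token's, per the paper's decorrelation idea.
    hidden_states: (batch, seq_len, dim)."""
    bos = hidden_states[:, :1, :]                 # (B, 1, D)
    rest = hidden_states[:, 1:, :]                # (B, T-1, D)
    cos = F.cosine_similarity(rest, bos, dim=-1)  # broadcasts over tokens
    return cos.mean()
```

In training this would be added to the recognition objective with a small weight, discouraging intermediate tokens from aligning with the BOS direction and thereby becoming sinks.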
[1112] HyBeam: Hybrid Microphone-Beamforming Array-Agnostic Speech Enhancement for Wearables
Yuval Bar Ilan, Boaz Rafaely, Vladimir Tourbabin
Main category: eess.AS
TL;DR: HyBeam is a hybrid speech enhancement framework that combines raw microphone signals at low frequencies with beamformer signals at high frequencies to achieve robust performance across diverse acoustic conditions and microphone setups.
Details
Motivation: Existing deep learning methods for speech enhancement often assume fixed array geometries, limiting their applicability in mobile, embedded, and wearable devices where microphone configurations vary. Array-agnostic approaches using either raw microphone signals or beamformer outputs have limitations under changing geometries.
Method: HyBeam uses a hybrid approach: raw microphone signals at low frequencies and beamformer signals at higher frequencies, exploiting their complementary strengths while remaining array-agnostic.
Result: Simulations across diverse rooms and wearable array configurations show HyBeam consistently outperforms microphone-only and beamformer-only baselines in PESQ, STOI, and SI-SDR metrics. Bandwise analysis confirms the hybrid approach leverages beamformer directivity at high frequencies and microphone cues at low frequencies.
Conclusion: The hybrid framework effectively combines the strengths of both approaches, achieving superior performance across all frequency bands compared to using either method alone, making it suitable for variable-geometry microphone arrays in mobile and wearable applications.
Abstract: Speech enhancement is a fundamental challenge in signal processing, particularly when robustness is required across diverse acoustic conditions and microphone setups. Deep learning methods have been successful for speech enhancement, but often assume fixed array geometries, limiting their use in mobile, embedded, and wearable devices. Existing array-agnostic approaches typically rely on either raw microphone signals or beamformer outputs, but both have drawbacks under changing geometries. We introduce HyBeam, a hybrid framework that uses raw microphone signals at low frequencies and beamformer signals at higher frequencies, exploiting their complementary strengths while remaining highly array-agnostic. Simulations across diverse rooms and wearable array configurations demonstrate that HyBeam consistently surpasses microphone-only and beamformer-only baselines in PESQ, STOI, and SI-SDR. A bandwise analysis shows that the hybrid approach leverages beamformer directivity at high frequencies and microphone cues at low frequencies, outperforming either method alone across all bands.
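The abstract fixes the band split but not the crossover frequency or the domain in which the two signals are combined; one way to realize the hybrid input is a hard crossover in the STFT domain (the 2 kHz cutoff below is a hypothetical value, not the paper's):

```python
import numpy as np
from scipy.signal import stft, istft

def hybeam_mix(mic: np.ndarray, beam: np.ndarray, sr: int,
               cutoff_hz: float = 2000.0, n_fft: int = 512) -> np.ndarray:
    """Raw-microphone bins below the cutoff, beamformer bins above it."""
    freqs, _, M = stft(mic, fs=sr, nperseg=n_fft)
    _, _, B = stft(beam, fs=sr, nperseg=n_fft)
    low = freqs < cutoff_hz            # boolean mask over frequency bins
    X = np.where(low[:, None], M, B)   # per-bin selection
    _, x = istft(X, fs=sr, nperseg=n_fft)
    return x
```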
[1113] SRP-PHAT-NET: A Reliability-Driven DNN for Reverberant Speaker Localization
Bar Shaybet, Vladimir Tourbabin, Boaz Rafaely
Main category: eess.AS
TL;DR: SRP-PHAT-NET is a deep learning framework for DOA estimation that introduces built-in reliability assessment using Gaussian-weighted labels, enabling selective use of high-confidence predictions for improved accuracy.
Details
Motivation: Current deep learning methods for DOA estimation lack reliability assessment mechanisms, which are essential for real-world deployment in reverberant environments.
Method: Uses SRP-PHAT directional maps as spatial features and trains the model with Gaussian-weighted labels centered around true directions to enable reliability scoring.
Result: The framework allows tuning Gaussian kernel width for application-specific needs, and selectively using high-confidence predictions significantly improves localization accuracy.
Conclusion: Integrating reliability estimation into deep learning-based DOA estimation provides practical benefits for spatial audio applications in challenging environments.
Abstract: Accurate Direction-of-Arrival (DOA) estimation in reverberant environments remains a fundamental challenge for spatial audio applications. While deep learning methods have shown strong performance in such conditions, they typically lack a mechanism to assess the reliability of their predictions - an essential feature for real-world deployment. In this work, we present the SRP-PHAT-NET, a deep neural network framework that leverages SRP-PHAT directional maps as spatial features and introduces a built-in reliability estimation. To enable meaningful reliability scoring, the model is trained using Gaussian-weighted labels centered around the true direction. We systematically analyze the influence of label smoothing on accuracy and reliability, demonstrating that the choice of Gaussian kernel width can be tuned to application-specific requirements. Experimental results show that selectively using high-confidence predictions yields significantly improved localization accuracy, highlighting the practical benefits of integrating reliability into deep learning-based DOA estimation.
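The Gaussian-weighted labels are straightforward to reproduce on a DOA grid; the kernel width sigma is exactly the knob the paper tunes (grid resolution and sigma below are assumptions):

```python
import numpy as np

def gaussian_doa_labels(true_az: float, grid: np.ndarray,
                        sigma: float = 5.0) -> np.ndarray:
    """Soft label over an azimuth grid (degrees), peaked at the true DOA,
    using wrap-around angular distance."""
    d = np.abs((grid - true_az + 180.0) % 360.0 - 180.0)
    w = np.exp(-0.5 * (d / sigma) ** 2)
    return w / w.sum()

labels = gaussian_doa_labels(90.0, np.arange(0, 360, 5))
# at inference, the peak of the predicted map can act as a reliability score
```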
[1114] DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
Hanke Xie, Dake Guo, Chengyou Wang, Yue Li, Wenjie Tian, Xinfa Zhu, Xinsheng Wang, Xiulin Li, Guanqiong Miao, Bo Liu, Lei Xie
Main category: eess.AS
TL;DR: DialoSpeech is a dual-track architecture combining LLMs with Chunked Flow Matching for expressive dialogue speech synthesis, addressing challenges in multi-turn conversations with natural turn-taking and overlapping speech.
Details
Motivation: Current TTS systems struggle with generating human-like interactive dialogue speech due to scarcity of dual-track data and difficulties achieving contextual coherence, turn-taking, overlapping speech, and speaker consistency in multi-turn conversations.
Method: Proposed DialoSpeech, a dual-track architecture combining large language models with Chunked Flow Matching, plus a data processing pipeline to construct dual-track dialogue datasets for scalable training.
Result: DialoSpeech generates natural multi-turn conversations with coherent speaker turns and natural overlaps, supports both Chinese and English cross-lingual synthesis, and outperforms baseline models in experiments.
Conclusion: DialoSpeech offers an effective solution for generating human-like spoken dialogues, addressing key challenges in interactive dialogue speech synthesis through its dual-track architecture and scalable training approach.
Abstract: Recent advances in text-to-speech (TTS) synthesis, particularly those leveraging large language models (LLMs), have significantly improved expressiveness and naturalness. However, generating human-like, interactive dialogue speech remains challenging. Current systems face limitations due to the scarcity of dual-track data and difficulties in achieving naturalness, contextual coherence, and interactional dynamics, such as turn-taking, overlapping speech, and speaker consistency, in multi-turn conversations. To address these challenges, we propose DialoSpeech, a dual-track architecture combining a large language model with Chunked Flow Matching for expressive, human-like dialogue speech synthesis. DialoSpeech generates natural multi-turn conversations with coherent speaker turns and natural overlaps, supporting both Chinese and English, as well as cross-lingual speech synthesis. We introduce a data processing pipeline to construct dual-track dialogue datasets, facilitating scalable training and experimental validation. Experiments show that our model outperforms baselines, offering a solution for generating human-like spoken dialogues. Audio samples are available at https://tiamojames.github.io/DialoSpeech
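"Dual-track" here means each speaker occupies its own channel so overlapping speech stays separable; a toy illustration of how such a pair relates to its mono mixture (padding and clipping policy are our choices):

```python
import numpy as np

def mono_mixture(track_a: np.ndarray, track_b: np.ndarray) -> np.ndarray:
    """Sum two per-speaker tracks into one channel; the dual-track dataset
    keeps the two tracks separate as training targets."""
    n = max(len(track_a), len(track_b))
    a = np.pad(track_a, (0, n - len(track_a)))
    b = np.pad(track_b, (0, n - len(track_b)))
    return np.clip(a + b, -1.0, 1.0)
```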
[1115] DiffRhythm 2: Efficient and High Fidelity Song Generation via Block Flow Matching
Yuepeng Jiang, Huakang Chen, Ziqian Ning, Jixun Yao, Zerui Han, Di Wu, Meng Meng, Jian Luan, Zhonghua Fu, Lei Xie
Main category: eess.AS
TL;DR: DiffRhythm 2 is an end-to-end framework for high-fidelity, controllable song generation that addresses lyric-vocal alignment and multi-preference optimization challenges through semi-autoregressive architecture and cross-pair preference optimization.
Details
Motivation: Existing non-autoregressive frameworks struggle with lyric-vocal alignment and multi-preference optimization in RLHF, leading to performance degradation when merging models for diverse musical preferences.
Method: Uses semi-autoregressive architecture based on block flow matching for lyric alignment, music VAE for low frame rate processing, cross-pair preference optimization for RLHF, and stochastic block representation alignment loss for musical coherence.
Result: Enables faithful lyric-vocal alignment without external constraints while maintaining NAR model efficiency, achieves computationally tractable long sequence processing with 5 Hz frame rate, and mitigates performance drop in multi-preference optimization.
Conclusion: DiffRhythm 2 provides an effective solution for high-quality song generation with improved lyric alignment and robust optimization across diverse human preferences.
Abstract: Generating full-length, high-quality songs is challenging, as it requires maintaining long-term coherence both across text and music modalities and within the music modality itself. Existing non-autoregressive (NAR) frameworks, while capable of producing high-quality songs, often struggle with the alignment between lyrics and vocal. Concurrently, catering to diverse musical preferences necessitates reinforcement learning from human feedback (RLHF). However, existing methods often rely on merging multiple models during multi-preference optimization, which results in significant performance degradation. To address these challenges, we introduce DiffRhythm 2, an end-to-end framework designed for high-fidelity, controllable song generation. To tackle the lyric alignment problem, DiffRhythm 2 employs a semi-autoregressive architecture based on block flow matching. This design enables faithful alignment of lyrics to singing vocals without relying on external labels and constraints, all while preserving the high generation quality and efficiency of NAR models. To make this framework computationally tractable for long sequences, we implement a music variational autoencoder (VAE) that achieves a low frame rate of 5 Hz while still enabling high-fidelity audio reconstruction. In addition, to overcome the limitations of multi-preference optimization in RLHF, we propose cross-pair preference optimization. This method effectively mitigates the performance drop typically associated with model merging, allowing for more robust optimization across diverse human preferences. We further enhance musicality and structural coherence by introducing stochastic block representation alignment loss.
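Block flow matching builds on the standard conditional flow-matching objective; for orientation, the generic (non-block) form of that objective is sketched below, with the semi-autoregressive blocking and the paper's conditioning details omitted:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1: torch.Tensor, cond) -> torch.Tensor:
    """Regress the constant velocity (x1 - x0) along the straight path
    x_t = (1 - t) * x0 + t * x1. x1: clean latents, shape (B, T, D)."""
    x0 = torch.randn_like(x1)                    # noise endpoint
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    xt = (1.0 - t) * x0 + t * x1
    v_pred = model(xt, t.view(-1), cond)         # assumed model signature
    return F.mse_loss(v_pred, x1 - x0)
```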
[1116] Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition
Jing-Xuan Zhang, Genshun Wan, Jin Li, Jianqing Gao
Main category: eess.AS
TL;DR: UASR-LLM adapts frozen speech foundation models to unified visual, auditory, and audiovisual speech recognition by integrating visual representations through injection modules and using LLMs as text decoders with a two-stage training strategy.
Details
Motivation: Speech foundation models excel in auditory tasks but their adaptation to multimodal scenarios (visual and audiovisual speech recognition) remains underexplored, creating a need for unified frameworks.
Method: Introduces visual injection modules into multiple SFM layers, connects augmented SFMs with decoder-only LLMs via feed-forward adaptor, uses two-stage training (visual injection pretraining then speech recognition finetuning) with frozen SFM parameters and LoRA for LLMs.
Result: Achieves superior performance over state-of-the-art baselines across VSR, ASR, and AVSR tasks under both clean and noisy conditions, with ablation studies confirming generalization across various SFMs and LLMs.
Conclusion: The proposed framework successfully adapts frozen speech foundation models to unified multimodal speech recognition tasks while maintaining strong performance across different modalities and conditions.
Abstract: Unified speech recognition aims to perform auditory, visual, and audiovisual speech recognition within a single model framework. While speech foundation models (SFMs) have demonstrated remarkable performance in auditory tasks, their adaptation to multimodal scenarios remains underexplored. This paper presents UASR-LLM, a novel framework that adapts frozen SFMs to unified VSR, ASR, and AVSR tasks by leveraging large language models (LLMs) as text decoders. Our approach introduces visual representations into multiple SFM layers through visual injection modules, enabling multimodal input processing and unified hidden representations. The augmented SFMs connect with decoder-only LLMs via a feed-forward adaptor, where concatenated representations and instruction prompts guide speech transcription. We implement a two-stage training strategy: visual injection pretraining followed by speech recognition finetuning. SFM parameters remain frozen throughout training, with only visual injection modules optimized initially, and LLMs finetuned using LoRA parameters subsequently. Experimental results demonstrate superior performance over state-of-the-art baselines across VSR, ASR, and AVSR tasks under both clean and noisy conditions. Ablation studies confirm generalization across various SFMs and LLMs, validating the proposed training strategy.
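The visual injection modules are not specified architecturally in the summary; one plausible minimal form is a gated projection added to a frozen SFM layer's hidden states (the module name, gating, and zero initialization are our assumptions):

```python
import torch
import torch.nn as nn

class VisualInjection(nn.Module):
    """Project visual features to the SFM hidden size and add them, gated
    so training starts from the unmodified (frozen) SFM behavior."""
    def __init__(self, visual_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(visual_dim, hidden_dim)
        self.gate = nn.Parameter(torch.zeros(1))   # identity at init

    def forward(self, hidden: torch.Tensor, visual: torch.Tensor):
        # hidden: (B, T, H) frozen SFM states; visual: (B, T, V), time-aligned
        return hidden + torch.tanh(self.gate) * self.proj(visual)
```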
[1117] LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization
Máté Gedeon, Péter Mihajlik
Main category: eess.AS
TL;DR: LibriConvo is a simulated multi-speaker conversational dataset using speaker-aware conversation simulation (SASC) that provides realistic conversational timing and semantic coherence for training and evaluating speaker diarization and ASR systems.
Details
Motivation: To address limitations of prior resources that use semantically disconnected utterances and implausible temporal gaps, creating a more realistic conversational dataset for multi-speaker speech processing research.
Method: Uses CallHome with external VAD for reliable boundaries, applies compression to reduce long silences, organizes LibriTTS utterances by book for context consistency, and employs novel room impulse response selection for acoustic realism.
Result: Dataset contains 240.1 hours across 1,496 dialogues with 830 unique speakers. Sortformer outperforms pyannote in diarization, and fine-tuned Fast Conformer-CTC XLarge achieves 7.29% WER, beating zero-shot Whisper-large-v3.
Conclusion: LibriConvo provides a valuable resource for advancing multi-speaker speech processing with realistic conversational dynamics and controlled experimental conditions.
Abstract: We introduce LibriConvo, a simulated multi-speaker conversational dataset based on speaker-aware conversation simulation (SASC), designed to support training and evaluation of speaker diarization and automatic speech recognition (ASR) systems. Unlike prior resources that mostly rely on semantically disconnected utterances and implausible temporal gaps, LibriConvo ensures semantic coherence and realistic conversational timing. Our pipeline leverages CallHome with external VAD for reliable boundaries, applies compression to reduce unnaturally long silences, and organizes LibriTTS utterances by book to maintain contextual consistency. Acoustic realism is enhanced via a novel room impulse response selection procedure that ranks speaker-microphone configurations by spatial plausibility, balancing realism and diversity. The dataset comprises 240.1 hours across 1,496 dialogues with 830 unique speakers, split in a speaker-disjoint manner for robust evaluation. Baselines show that the Sortformer model outperforms the pyannote pipeline in diarization, while a fine-tuned Fast Conformer-CTC XLarge with Serialized Output Training achieves 7.29% WER for ASR, surpassing zero-shot Whisper-large-v3. LibriConvo provides a valuable resource for advancing multi-speaker speech processing research with realistic conversational dynamics and controlled experimental conditions.
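The timing model is what distinguishes SASC from naive concatenation: utterances are placed with pauses (possibly negative, i.e. overlaps) drawn from CallHome-derived statistics. Schematically, with the sampler standing in for the learned distribution:

```python
import random

def place_utterances(durations, pause_sampler=lambda: random.gauss(0.3, 0.4)):
    """Return start times: each utterance begins at the previous utterance's
    end plus a sampled pause; negative pauses create overlapping speech."""
    t, starts = 0.0, []
    for d in durations:
        starts.append(t)
        t += d + pause_sampler()
    return starts
```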
[1118] Treble10: A high-quality dataset for far-field speech recognition, dereverberation, and enhancement
Sarabeth S. Mullins, Georg Götz, Eric Bezzam, Steven Zheng, Daniel Gert Nielsen
Main category: eess.AS
TL;DR: Treble10 is a large-scale, physically accurate room-acoustic dataset that bridges the gap between measured corpora and simplified simulation-based datasets by using hybrid wave-based and geometrical acoustics simulation.
Details
Motivation: Current far-field speech datasets face a trade-off between acoustic realism and scalability: measured datasets are expensive and low-coverage, while simulation-based datasets fail to reproduce key physical phenomena like diffraction and interference.
Method: Hybrid simulation paradigm combining wave-based and geometrical acoustics solvers, implemented in the Treble SDK, generating over 3000 broadband room impulse responses in 10 fully furnished real-world rooms at 32 kHz.
Result: Created six complementary subsets including mono, 8th-order Ambisonics, and 6-channel device RIRs, plus pre-convolved reverberant speech scenes paired with LibriSpeech utterances, accurately modeling both low-frequency wave effects and high-frequency reflections.
Conclusion: Treble10 enables reproducible, physically grounded evaluation and large-scale data augmentation for far-field speech tasks, serving as both a benchmark and template for next-generation simulation-driven audio research.
Abstract: Accurate far-field speech datasets are critical for tasks such as automatic speech recognition (ASR), dereverberation, speech enhancement, and source separation. However, current datasets are limited by the trade-off between acoustic realism and scalability. Measured corpora provide faithful physics but are expensive, low-coverage, and rarely include paired clean and reverberant data. In contrast, most simulation-based datasets rely on simplified geometrical acoustics, thus failing to reproduce key physical phenomena like diffraction, scattering, and interference that govern sound propagation in complex environments. We introduce Treble10, a large-scale, physically accurate room-acoustic dataset. Treble10 contains over 3000 broadband room impulse responses (RIRs) simulated in 10 fully furnished real-world rooms, using a hybrid simulation paradigm implemented in the Treble SDK that combines a wave-based and geometrical acoustics solver. The dataset provides six complementary subsets, spanning mono, 8th-order Ambisonics, and 6-channel device RIRs, as well as pre-convolved reverberant speech scenes paired with LibriSpeech utterances. All signals are simulated at 32 kHz, accurately modelling low-frequency wave effects and high-frequency reflections. Treble10 bridges the realism gap between measurement and simulation, enabling reproducible, physically grounded evaluation and large-scale data augmentation for far-field speech tasks. The dataset is openly available via the Hugging Face Hub, and is intended as both a benchmark and a template for next-generation simulation-driven audio research.
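The pre-convolved subsets are produced by the usual wet-signal construction, convolving dry speech with an RIR; a minimal version of that step (the normalization policy is our choice):

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a dry utterance with a room impulse response, keep the
    original length, and peak-normalize."""
    wet = fftconvolve(clean, rir)[: len(clean)]
    return wet / (np.max(np.abs(wet)) + 1e-9)
```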
[1119] SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity
Hanke Xie, Haopeng Lin, Wenxiao Cao, Dake Guo, Wenjie Tian, Jun Wu, Hanlin Wen, Ruixuan Shang, Hongmei Liu, Zhiqi Jiang, Yuepeng Jiang, Wenxi Chen, Ruiqi Yan, Jiale Qian, Yichao Yan, Shunshun Yin, Ming Tao, Xie Chen, Lei Xie, Xinsheng Wang
Main category: eess.AS
TL;DR: SoulX-Podcast is a multi-speaker conversational TTS system that generates podcast-style dialogues with stable speaker timbre, smooth transitions, and context-adaptive prosody across multiple languages and dialects.
Details
Motivation: Existing TTS systems are mostly single-speaker focused and struggle with coherent multi-speaker conversational speech, creating a need for systems that can handle podcast-style multi-turn dialogues.
Method: Integrates paralinguistic controls and supports multiple languages (Mandarin, English) and Chinese dialects (Sichuanese, Henanese, Cantonese) to enable personalized podcast-style speech generation.
Result: Can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth transitions, with speakers showing contextually adaptive prosody. Achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.
Conclusion: SoulX-Podcast successfully addresses the challenge of multi-speaker conversational TTS, demonstrating robust performance in generating natural, extended podcast-style dialogues with adaptive prosody across multiple languages.
Abstract: Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical report presents SoulX-Podcast, a system designed for podcast-style multi-turn, multi-speaker dialogic speech generation, while also achieving state-of-the-art performance in conventional TTS tasks. To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. Experimental results demonstrate that SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions. Moreover, speakers exhibit contextually adaptive prosody, reflecting natural rhythm and intonation changes as dialogues progress. Across multiple evaluation metrics, SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.
[1120] Matching Reverberant Speech Through Learned Acoustic Embeddings and Feedback Delay Networks
Philipp Götz, Gloria Dal Santo, Sebastian J. Schlecht, Vesa Välimäki, Emanuël A. P. Habets
Main category: eess.AS
TL;DR: Proposes a method for blind estimation of artificial reverberation parameters using learned room-acoustic priors and a feedback delay network structure for real-time AAR applications.
Details
Motivation: Real-time generation of perceptually plausible reverberation in auditory augmented reality systems is challenging without explicit acoustic measurements.
Method: Formulates blind reverberation parameter estimation as signal matching task using learned room-acoustic prior, and proposes FDN structure that reproduces frequency-dependent decay times and direct-to-reverberation ratio.
Result: Experimental evaluation shows improvements in estimated room-acoustic parameters and perceptual plausibility of artificial reverberant speech compared to leading automatic FDN tuning method.
Conclusion: The approach demonstrates potential for efficient, perceptually consistent reverberation rendering in AAR applications.
Abstract: Reverberation conveys critical acoustic cues about the environment, supporting spatial awareness and immersion. For auditory augmented reality (AAR) systems, generating perceptually plausible reverberation in real time remains a key challenge, especially when explicit acoustic measurements are unavailable. We address this by formulating blind estimation of artificial reverberation parameters as a reverberant signal matching task, leveraging a learned room-acoustic prior. Furthermore, we propose a feedback delay network (FDN) structure that reproduces both frequency-dependent decay times and the direct-to-reverberation ratio of a target space. Experimental evaluation against a leading automatic FDN tuning method demonstrates improvements in estimated room-acoustic parameters and perceptual plausibility of artificial reverberant speech. These results highlight the potential of our approach for efficient, perceptually consistent reverberation rendering in AAR applications.
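For readers unfamiliar with the FDN itself: it circulates the input through parallel delay lines coupled by an orthogonal feedback matrix. A bare-bones version with Householder feedback and a single decay gain is sketched below; the paper's structure additionally shapes decay per frequency band and matches the direct-to-reverberation ratio, which this sketch omits.

```python
import numpy as np

def fdn_reverb(x: np.ndarray, delays=(1031, 1327, 1523, 1871),
               g: float = 0.7) -> np.ndarray:
    """Minimal 4-line feedback delay network (delay lengths arbitrary)."""
    N = len(delays)
    A = np.eye(N) - (2.0 / N) * np.ones((N, N))   # orthogonal Householder matrix
    bufs = [np.zeros(d) for d in delays]          # circular delay buffers
    idx = [0] * N
    y = np.zeros(len(x))
    for n, s in enumerate(x):
        outs = np.array([bufs[i][idx[i]] for i in range(N)])
        y[n] = outs.sum()                          # reverberant output tap
        feed = g * (A @ outs) + s                  # feedback plus dry input
        for i in range(N):
            bufs[i][idx[i]] = feed[i]
            idx[i] = (idx[i] + 1) % delays[i]
    return y
```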
[1121] Evaluation of Spherical Wavelet Framework in Comparison with Ambisonics
Ş. Ekmen, H. Lee
Main category: eess.AS
TL;DR: The Spherical Wavelet Framework (SWF) is compared with Ambisonics for spatial audio reproduction. SWF shows better spatial and timbral fidelity but depends heavily on the subdivision of the sphere and cannot natively represent waves arriving from continuous directions.
Details
Motivation: To investigate SWF in greater detail compared to Ambisonics, addressing limitations of previous studies that were limited to specific conditions and lacked perceptual metrics.
Method: Used IACC, ITD, and ILD estimations, plus listening tests with ecologically valid sound sources. Evaluated various reproduction layouts including regular polyhedron, t-design, and Lebedev grid with corresponding Ambisonics orders and channel counts.
Result: SWF was rated significantly more similar to the reference than Ambisonics in terms of overall spatial and timbral fidelity, but performance is considerably dependent on the subdivision of the sphere. SWF cannot natively represent waves arriving at continuous directions.
Conclusion: SWF shows promise for spatial audio with better fidelity than Ambisonics, but has limitations regarding sphere subdivision dependency and continuous direction representation. Possible solutions are proposed.
Abstract: Recently, the Spherical Wavelet Framework (SWF) was proposed to combine the benefits of Ambisonics and Object-Based Audio (OBA) by utilising highly localised basis functions. SWF can enhance the sweet-spot area and reduce localisation blur while still enabling a sparse representation of the complete sound field, making storage and transmission more efficient. Initial vector analysis and listening test of SWF have shown promising results; however, these findings are limited to very specific conditions and do not include perceptual metrics. The present study investigates SWF in greater detail, comparing it with Ambisonics. The comparison was carried out using IACC, ITD, and ILD estimations, as well as listening tests with ecologically valid sound sources. Various reproduction layouts: regular polyhedron, t-design, and Lebedev grid with their corresponding Ambisonics orders and channel counts were evaluated. Results indicate that SWF is rated significantly more similar to the reference than Ambisonics is, in terms of overall spatial and timbral fidelity; however, it is considerably dependent on the subdivision of the sphere. Moreover, it cannot natively represent a wave arriving at a continuous direction. Possible solutions are proposed.
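Of the three binaural metrics used, IACC is the least standard to implement; a compact definition as the peak of the normalized interaural cross-correlation within a small lag window (the 1 ms window below is assumed) is:

```python
import numpy as np

def iacc(left: np.ndarray, right: np.ndarray, sr: int,
         max_lag_ms: float = 1.0) -> float:
    """Peak normalized cross-correlation between ear signals within the window."""
    full = np.correlate(left, right, mode="full")
    mid = len(full) // 2                          # zero-lag index
    k = int(sr * max_lag_ms / 1000.0)
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2)) + 1e-12
    return float(np.max(np.abs(full[mid - k: mid + k + 1])) / norm)
```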
[1122] PESTO: Pitch Estimation with Self-supervised Transposition-equivariant Objective
Alain Riou, Stefan Lattner, Gaëtan Hadjeres, Geoffroy Peeters
Main category: eess.AS
TL;DR: A lightweight self-supervised learning approach for pitch estimation using equivariance to pitch transposition, achieving strong performance with <30k parameters.
Details
Motivation: To develop a pitch estimation method that doesn't require large labeled datasets and can work efficiently on low-resource devices for real-time applications.
Method: Uses a Siamese neural network with pitch-shifted audio inputs in Constant-Q Transform, class-based transposition-equivariant objective to prevent collapse, and transposition-preserving architecture with learnable Toeplitz matrices.
Result: Model generalizes across singing voice and musical instrument pitch estimation tasks, surpassing self-supervised baselines and narrowing performance gap with supervised methods.
Conclusion: The proposed self-supervised approach enables accurate pitch estimation with minimal parameters, making it suitable for resource-constrained devices and real-time applications.
Abstract: In this paper, we address the problem of pitch estimation using Self Supervised Learning (SSL). The SSL paradigm we use is equivariance to pitch transposition, which enables our model to accurately perform pitch estimation on monophonic audio after being trained only on a small unlabeled dataset. We use a lightweight ($<$ 30k parameters) Siamese neural network that takes as inputs two different pitch-shifted versions of the same audio represented by its Constant-Q Transform. To prevent the model from collapsing in an encoder-only setting, we propose a novel class-based transposition-equivariant objective which captures pitch information. Furthermore, we design the architecture of our network to be transposition-preserving by introducing learnable Toeplitz matrices. We evaluate our model for the two tasks of singing voice and musical instrument pitch estimation and show that our model is able to generalize across tasks and datasets while being lightweight, hence remaining compatible with low-resource devices and suitable for real-time applications. In particular, our results surpass self-supervised baselines and narrow the performance gap between self-supervised and supervised methods for pitch estimation.
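The central trick is that a pitch shift is a translation along the CQT's log-frequency axis, so equivariance can be enforced by requiring the output for a shifted input to equal the shifted output of the original. A simplified version of that constraint (the paper's objective is class-based and uses Toeplitz layers, both omitted here):

```python
import torch
import torch.nn.functional as F

def transposition_equivariance_loss(model, cqt: torch.Tensor, k: int):
    """cqt: (B, n_bins) log-frequency frames. Rolling the input by k bins
    should roll the predicted pitch distribution by the same k."""
    p1 = model(cqt).softmax(dim=-1)
    p2 = model(torch.roll(cqt, shifts=k, dims=-1)).softmax(dim=-1)
    return F.mse_loss(torch.roll(p1, shifts=k, dims=-1), p2)
```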
[1123] Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing
Tianchi Liu, Duc-Tuan Truong, Rohan Kumar Das, Kong Aik Lee, Haizhou Li
Main category: eess.AS
TL;DR: Nes2Net is a lightweight back-end architecture that processes high-dimensional speech features without dimensionality reduction, improving performance while reducing computational costs in speech deepfake detection tasks.
Details
Motivation: Speech foundation models produce high-dimensional features that mismatch with downstream task requirements, and traditional dimensionality reduction approaches increase parameter overhead, computational costs, and risk information loss.
Method: Proposed Nested Res2Net (Nes2Net) with nested structure for enhanced multi-scale feature extraction, improved feature interaction, and preservation of high-dimensional information without using dimensionality reduction layers.
Result: 22% performance improvement and 87% back-end computational cost reduction on CtrSVDD dataset, with consistent superior robustness and generalization across ASVspoof 2021, ASVspoof 5, PartialSpoof, and In-the-Wild datasets.
Conclusion: Nes2Net effectively addresses the dimensionality mismatch problem in speech foundation models, providing better performance, reduced computational costs, and enhanced robustness across diverse speech deepfake detection scenarios.
Abstract: Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often create a mismatch with downstream task models, which typically require lower-dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhead, computational costs, and risks losing valuable information. To address these issues, we propose Nested Res2Net (Nes2Net), a lightweight back-end architecture designed to directly process high-dimensional features without DR layers. The nested structure enhances multi-scale feature extraction, improves feature interaction, and preserves high-dimensional information. We first validate Nes2Net on CtrSVDD, a singing voice deepfake detection dataset, and report a 22% performance improvement and an 87% back-end computational cost reduction over the state-of-the-art baseline. Additionally, extensive testing across four diverse datasets: ASVspoof 2021, ASVspoof 5, PartialSpoof, and In-the-Wild, covering fully spoofed speech, adversarial attacks, partial spoofing, and real-world scenarios, consistently highlights Nes2Net’s superior robustness and generalization capabilities. The code package and pre-trained models are available at https://github.com/Liu-Tianchi/Nes2Net.
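Nes2Net nests Res2Net-style blocks in place of a dimensionality-reduction layer; for orientation, a single plain (non-nested) Res2Net block is sketched here in 1-D, splitting channels into scale groups and cascading small convolutions across them:

```python
import torch
import torch.nn as nn

class Res2Block1d(nn.Module):
    """Plain Res2Net block: later channel groups see progressively larger
    receptive fields via a cascaded running sum. Nes2Net nests such blocks
    rather than shrinking the high-dimensional input."""
    def __init__(self, channels: int, scales: int = 4):
        super().__init__()
        assert channels % scales == 0
        width = channels // scales
        self.scales = scales
        self.convs = nn.ModuleList(
            nn.Conv1d(width, width, 3, padding=1) for _ in range(scales - 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, T)
        chunks = list(torch.chunk(x, self.scales, dim=1))
        out, y = [chunks[0]], None
        for i, conv in enumerate(self.convs):
            y = chunks[i + 1] if y is None else chunks[i + 1] + y
            y = conv(y)
            out.append(y)
        return torch.cat(out, dim=1) + x                   # residual add
```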
[1124] ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring
Ari Frummer, Helin Wang, Tianyu Cao, Adi Arbel, Yuval Sieradzki, Oren Gal, Jesús Villalba, Thomas Thebaud, Najim Dehak
Main category: eess.AS
TL;DR: A text-free, reference-free evaluation framework for speech separation using self-supervised learning representations to predict audio quality (SI-SNR) and speech intelligibility (WER) without needing reference audios or transcriptions.
Details
Motivation: Traditional speech separation evaluation metrics require matched reference audios and transcriptions, making them unsuitable for real-world mixtures where no references exist.
Method: Proposed framework uses self-supervised learning (SSL) representations from both mixture and separated tracks to jointly predict SI-SNR for audio quality and WER for speech intelligibility.
Result: On WHAMR! dataset: WER estimation achieved MAE of 17% and PCC of 0.77; SI-SNR estimation achieved MAE of 1.38 and PCC of 0.95. Framework shows robustness across various SSL representations.
Conclusion: The SSL-based framework provides effective reference-free evaluation for speech separation, enabling assessment of real-world mixtures without ground truth references.
Abstract: Source separation is a crucial pre-processing step for various speech processing tasks, such as automatic speech recognition (ASR). Traditionally, the evaluation metrics for speech separation rely on the matched reference audios and corresponding transcriptions to assess audio quality and intelligibility. However, they cannot be used to evaluate real-world mixtures for which no reference exists. This paper introduces a text-free reference-free evaluation framework based on self-supervised learning (SSL) representations. The proposed framework utilizes the mixture and separated tracks to jointly predict audio quality, through the Scale Invariant Signal to Noise Ratio (SI-SNR) metric, and speech intelligibility, through the Word Error Rate (WER) metric. We conducted experiments on the WHAMR! dataset, which show a WER estimation with a mean absolute error (MAE) of 17% and a Pearson correlation coefficient (PCC) of 0.77; and SI-SNR estimation with an MAE of 1.38 and PCC of 0.95. We further demonstrate the robustness of our estimator by using various SSL representations.
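One of the two regression targets, SI-SNR, has a standard closed form worth keeping at hand:

```python
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8):
    """Scale-invariant SNR in dB between 1-D estimate and reference."""
    est, ref = est - est.mean(), ref - ref.mean()
    proj = (torch.dot(est, ref) / (torch.dot(ref, ref) + eps)) * ref
    noise = est - proj
    return 10.0 * torch.log10(proj.pow(2).sum() / (noise.pow(2).sum() + eps))
```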
[1125] WhaleVAD-BPN: Improving Baleen Whale Call Detection with Boundary Proposal Networks and Post-processing Optimisation
Christiaan M. Geldenhuys, Günther Tonitz, Thomas R. Niesler
Main category: eess.AS
TL;DR: The paper proposes a boundary proposal network (BPN) to improve sound event detection for baleen whale calls, reducing false positives and improving minority-class detection.
Details
Motivation: Existing SED systems for baleen whale calls suffer from false positive detections and poor performance on minority classes, which limits their practical utility.
Method: Extends existing SED system with BPN that uses intermediate latent representations to gate final output. Also explores forward-search and backward-search hyperparameter optimization for post-processing.
Result: BPN achieves a 16.8% absolute precision increase, with 21.3% and 9.4% F1 improvements for minority-class d-calls and bp-calls, respectively. The complete system reaches a cross-validated F1-score of 0.475, a 9.8% absolute improvement over the baseline.
Conclusion: The BPN approach effectively reduces false positives and improves minority-class detection in whale call detection, with significant performance gains over existing methods.
Abstract: While recent sound event detection (SED) systems can identify baleen whale calls in marine audio, challenges related to false positive and minority-class detection persist. We propose the boundary proposal network (BPN), which extends an existing lightweight SED system. The BPN is inspired by work in image object detection and aims to reduce the number of false positive detections. It achieves this by using intermediate latent representations computed within the backbone classification model to gate the final output. When added to an existing SED system, the BPN achieves a 16.8 % absolute increase in precision, as well as 21.3 % and 9.4 % improvements in the F1-score for minority-class d-calls and bp-calls, respectively. We further consider two approaches to the selection of post-processing hyperparameters: a forward-search and a backward-search. By separately optimising event-level and frame-level hyperparameters, these two approaches lead to considerable performance improvements over parameters selected using empirical methods. The complete WhaleVAD-BPN system achieves a cross-validated development F1-score of 0.475, which is a 9.8 % absolute improvement over the baseline.
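The forward-search over post-processing hyperparameters is plausibly a greedy coordinate-wise sweep; the sketch below assumes a single pass that freezes each best value before moving on (the paper's exact procedure may differ):

```python
def forward_search(defaults: dict, grid: dict, score_fn) -> dict:
    """Greedy forward-search: tune one hyperparameter at a time on the
    development set, keeping the best value found so far."""
    best = dict(defaults)
    for name, values in grid.items():
        scores = {v: score_fn({**best, name: v}) for v in values}
        best[name] = max(scores, key=scores.get)
    return best

# hypothetical usage, with score_fn returning a development-set F1:
# best = forward_search({"thresh": 0.5, "min_dur": 0.1},
#                       {"thresh": [0.3, 0.5, 0.7], "min_dur": [0.1, 0.3]},
#                       score_fn=evaluate_f1)
```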
eess.IV
[1126] HDR Image Reconstruction using an Unsupervised Fusion Model
Kumbha Nagaswetha
Main category: eess.IV
TL;DR: A deep learning-based multi-exposure fusion method for HDR imaging that combines underexposed and overexposed LDR images using a CNN to generate high-quality HDR outputs without ground-truth data.
Details
Motivation: Conventional digital cameras have limited dynamic range and cannot capture the wide brightness levels that human vision can perceive, creating a need for effective HDR imaging solutions.
Method: Uses a convolutional neural network to fuse complementary information from underexposed (preserves bright regions) and overexposed (preserves dark regions) LDR images in an unsupervised manner without ground-truth HDR data, with a customized loss function.
Result: Achieves superior visual quality compared to existing fusion methods as measured by MEF-SSIM, demonstrating effective HDR reconstruction from limited exposure inputs.
Conclusion: The proposed unsupervised deep learning approach provides a practical solution for HDR image generation that works well without requiring ground-truth HDR training data, making it suitable for real-world applications.
Abstract: High Dynamic Range (HDR) imaging aims to reproduce the wide range of brightness levels present in natural scenes, which the human visual system can perceive but conventional digital cameras often fail to capture due to their limited dynamic range. To address this limitation, we propose a deep learning-based multi-exposure fusion approach for HDR image generation. The method takes a set of differently exposed Low Dynamic Range (LDR) images, typically an underexposed and an overexposed image, and learns to fuse their complementary information using a convolutional neural network (CNN). The underexposed image preserves details in bright regions, while the overexposed image retains information in dark regions; the network effectively combines these to reconstruct a high-quality HDR output. The model is trained in an unsupervised manner, without relying on ground-truth HDR images, making it practical for real-world applications where such data is unavailable. We evaluate our results using the Multi-Exposure Fusion Structural Similarity Index Measure (MEF-SSIM) and demonstrate that our approach achieves superior visual quality compared to existing fusion methods. A customized loss function is further introduced to improve reconstruction fidelity and optimize model performance.
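For intuition about what the network must learn, the classical hand-crafted counterpart weights each input by its well-exposedness and normalizes (Mertens-style); the CNN replaces these fixed weights with learned ones:

```python
import numpy as np

def well_exposedness(img: np.ndarray, sigma: float = 0.2) -> np.ndarray:
    """High weight near mid-gray, low weight near clipping; img in [0, 1]."""
    return np.exp(-((img - 0.5) ** 2) / (2.0 * sigma ** 2))

def fuse(under: np.ndarray, over: np.ndarray) -> np.ndarray:
    wu, wo = well_exposedness(under), well_exposedness(over)
    return (wu * under + wo * over) / (wu + wo + 1e-8)
```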
[1127] Inverse Design of Metasurface for Spectral Imaging
Rongzhou Chen, Haitao Nie, Shuo Zhu, Yaping Zhao, Chutian Wang, Edmund Y. Lam
Main category: eess.IV
TL;DR: A physics-data co-driven framework for designing reconfigurable metasurfaces using phase-change materials for compact compressive spectral imaging in shortwave infrared, achieving improved reconstruction fidelity and noise resilience.
Details
Motivation: Inverse design of metasurfaces for joint optimization of optical modulation and algorithmic decoding in computational optics presents significant challenges, especially in hyperspectral imaging applications.
Method: Differentiable neural simulator trained on 320,000+ simulated geometries to predict spectral responses across 11 crystallization states, enabling end-to-end joint optimization of metasurface geometry, spectral encoding function, and deep reconstruction network with soft shape regularization.
Result: Optimized system improves reconstruction fidelity by up to 7.6 dB in peak-signal-to-noise ratio, with enhanced noise resilience and improved measurement matrix conditioning.
Conclusion: The approach demonstrates potential for high-performance hyperspectral imaging through co-optimization of physical metasurface design and computational reconstruction algorithms.
Abstract: Inverse design of metasurfaces for the joint optimization of optical modulation and algorithmic decoding in computational optics presents significant challenges, especially in applications such as hyperspectral imaging. We introduce a physics-data co-driven framework for designing reconfigurable metasurfaces fabricated from the phase-change material Ge2Sb2Se4Te1 to achieve compact, compressive spectral imaging in the shortwave infrared region. Central to our approach is a differentiable neural simulator, trained on over 320,000 simulated geometries, that accurately predicts spectral responses across 11 crystallization states. This differentiability enables end-to-end joint optimization of the metasurface geometry, its spectral encoding function, and a deep reconstruction network. We also propose a soft shape regularization technique that preserves manufacturability during gradient-based updates. Experiments show that our optimized system improves reconstruction fidelity by up to 7.6 dB in the peak-signal-to-noise ratio, with enhanced noise resilience and improved measurement matrix conditioning, underscoring the potential of our approach for high-performance hyperspectral imaging.
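End-to-end co-design means the reconstruction loss backpropagates through the neural simulator into the geometry parameters themselves; schematically (the linear measurement model and all shapes are assumptions, not the paper's exact pipeline):

```python
import torch
import torch.nn.functional as F

def joint_step(geom, simulator, recon_net, scene, optimizer) -> float:
    """One co-optimization step. geom: learnable geometry parameters;
    simulator: differentiable geometry -> spectral-response surrogate;
    scene: (H, W, bands) hyperspectral ground truth."""
    optimizer.zero_grad()
    response = simulator(geom)                     # (bands, channels)
    meas = torch.einsum("bc,hwb->hwc", response, scene)
    loss = F.mse_loss(recon_net(meas), scene)
    loss.backward()                                # gradients reach geom too
    optimizer.step()
    return loss.item()
```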
[1128] Frequency-Spatial Interaction Driven Network for Low-Light Image Enhancement
Yunhong Tao, Wenbing Tao, Xiang Xiang
Main category: eess.IV
TL;DR: FSIDNet is a two-stage frequency-spatial interaction network for low-light image enhancement that restores amplitude in stage 1 and phase in stage 2, using cross-domain interaction blocks and information exchange modules.
Details
Motivation: Existing LLIE methods either ignore frequency domain information or fail to effectively promote information propagation, limiting performance.
Method: Two-stage architecture: stage 1 restores amplitude for lightness, stage 2 restores phase for fine structures. Uses frequency-spatial interaction blocks and Information Exchange Module for cross-stage feature integration.
Result: Achieves excellent performance on benchmark datasets (LOL-Real, LSRW-Huawei) in both visual results and quantitative metrics while maintaining good model efficiency.
Conclusion: The proposed FSIDNet effectively leverages frequency-spatial interaction and cross-stage information flow to significantly improve low-light image enhancement performance.
Abstract: Low-light image enhancement (LLIE) aims at improving the perception or interpretability of an image captured in an environment with poor illumination. With the advent of deep learning, the LLIE technique has achieved significant breakthroughs. However, existing LLIE methods either ignore the important role of frequency domain information or fail to effectively promote the propagation and flow of information, limiting the LLIE performance. In this paper, we develop a novel frequency-spatial interaction-driven network (FSIDNet) for LLIE based on a two-stage architecture. To be specific, the first stage is designed to restore the amplitude of low-light images to improve the lightness, and the second stage is devoted to restoring phase information to refine fine-grained structures. Considering that frequency-domain and spatial-domain information are complementary and both favorable for LLIE, we further develop two frequency-spatial interaction blocks which mutually amalgamate the complementary spatial and frequency information to enhance the capability of the model. In addition, we construct the Information Exchange Module (IEM) to associate the two stages by adequately incorporating cross-stage and cross-scale features to effectively promote the propagation and flow of information in the two-stage network structure. Finally, we conduct experiments on several widely used benchmark datasets (i.e., LOL-Real, LSRW-Huawei, etc.), which demonstrate that our method achieves excellent performance in terms of visual results and quantitative metrics while preserving good model efficiency.
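The two-stage split rests on the classical observation that, in the Fourier domain, lightness lives mainly in the amplitude spectrum while structure lives in the phase; the stage-1 operation of pairing an enhanced amplitude with the input's phase looks like:

```python
import torch

def recombine(low_light: torch.Tensor, enhanced_amp: torch.Tensor):
    """Pair an enhanced amplitude spectrum with the input image's phase.
    low_light: (B, C, H, W) in [0, 1]; enhanced_amp: same shape, real."""
    phase = torch.angle(torch.fft.fft2(low_light))
    return torch.fft.ifft2(enhanced_amp * torch.exp(1j * phase)).real
```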
[1129] Learning Event-guided Exposure-agnostic Video Frame Interpolation via Adaptive Feature Blending
Junsik Jung, Yoonki Cho, Woo Jae Kim, Lin Wang, Sung-Eui Yoon
Main category: eess.IV
TL;DR: A novel event-guided framework for exposure-agnostic video frame interpolation that uses Target-adaptive Event Sampling and Target-adaptive Importance Mapping to handle severely low-frame-rate blurry videos captured under unknown exposure conditions.
Details
Motivation: Existing event-guided methods struggle with severely low-frame-rate blurry videos due to lack of temporal constraints, especially under unknown and dynamic exposure conditions where event cameras' high temporal resolution could be advantageous.
Method: Two key components: Target-adaptive Event Sampling (TES) samples events around target timestamp and unknown exposure time to better align with blurry frames; Target-adaptive Importance Mapping (TIM) generates importance maps considering temporal proximity and spatial relevance to adaptively blend consecutive features.
Result: Extensive experiments on synthetic and real-world datasets demonstrate the framework’s effectiveness in exposure-agnostic video frame interpolation scenarios.
Conclusion: The proposed event-guided framework successfully addresses limitations of existing methods by better aligning events with blurry frames and adaptively blending features based on temporal and spatial relevance, enabling effective interpolation of severely low-frame-rate blurry videos under unknown exposure conditions.
Abstract: Exposure-agnostic video frame interpolation (VFI) is a challenging task that aims to recover sharp, high-frame-rate videos from blurry, low-frame-rate inputs captured under unknown and dynamic exposure conditions. Event cameras are sensors with high temporal resolution, making them especially advantageous for this task. However, existing event-guided methods struggle to produce satisfactory results on severely low-frame-rate blurry videos due to the lack of temporal constraints. In this paper, we introduce a novel event-guided framework for exposure-agnostic VFI, addressing this limitation through two key components: a Target-adaptive Event Sampling (TES) and a Target-adaptive Importance Mapping (TIM). Specifically, TES samples events around the target timestamp and the unknown exposure time to better align them with the corresponding blurry frames. TIM then generates an importance map that considers the temporal proximity and spatial relevance of consecutive features to the target. Guided by this map, our framework adaptively blends consecutive features, allowing temporally aligned features to serve as the primary cues while spatially relevant ones offer complementary support. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our approach in exposure-agnostic VFI scenarios.
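In compressed form, TES is a timestamp-windowed event selection and TIM a learned soft blend of neighboring features; the two operations reduce to something like the following (the importance map itself comes from a network and is assumed given):

```python
import torch

def sample_events(t_events: torch.Tensor, t_target: float, half_win: float):
    """TES sketch: keep events near the target timestamp (the paper also
    samples around the estimated exposure interval)."""
    return t_events[(t_events - t_target).abs() <= half_win]

def blend(feat_prev: torch.Tensor, feat_next: torch.Tensor,
          importance: torch.Tensor) -> torch.Tensor:
    """TIM sketch: importance in [0, 1] favors the temporally and spatially
    more relevant feature at each location."""
    return importance * feat_prev + (1.0 - importance) * feat_next
```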
[1130] Expert Validation of Synthetic Cervical Spine Radiographs Generated with a Denoising Diffusion Probabilistic Model
Austin A. Barr, Brij S. Karmur, Anthony J. Winder, Eddie Guo, John T. Lysack, James N. Scott, William F. Morrish, Muneer Eesa, Morgan Willson, David W. Cadotte, Michael M. H. Yang, Ian Y. M. Chan, Sanju Lama, Garnette R. Sutherland
Main category: eess.IV
TL;DR: DDPM-generated synthetic cervical spine X-rays are clinically indistinguishable from real images, enabling large-scale dataset creation for neurosurgery ML applications.
Details
Motivation: Overcome limitations in assembling large, high-quality neuroimaging datasets for machine learning in neurosurgery through privacy-preserving synthetic data generation.
Method: Used denoising diffusion probabilistic model (DDPM) trained on 4,963 cervical spine radiographs, evaluated via clinical Turing test with 8 experts reviewing 50 image quartets (1 real + 3 synthetic).
Result: Experts correctly identified real images only 29% of the time, realism scores were comparable (real: 3.323 vs synthetic: 3.228-3.320), no evidence of memorization, and generated 20,063 synthetic radiographs.
Conclusion: DDPM-generated cervical spine X-rays are statistically indistinguishable from real clinical images, providing a scalable solution for creating large neuroimaging datasets for ML applications.
Abstract: Machine learning in neurosurgery is limited by challenges in assembling large, high-quality imaging datasets. Synthetic data offers a scalable, privacy-preserving solution. We evaluated the feasibility of generating realistic lateral cervical spine radiographs using a denoising diffusion probabilistic model (DDPM) trained on 4,963 images from the Cervical Spine X-ray Atlas. Model performance was monitored via training/validation loss and Frechet inception distance, and synthetic image quality was assessed in a blinded “clinical Turing test” with six neuroradiologists and two spine-fellowship trained neurosurgeons. Experts reviewed 50 quartets containing one real and three synthetic images, identifying the real image and rating realism on a 4-point Likert scale. Experts correctly identified the real image in 29% of trials (Fleiss’ kappa=0.061). Mean realism scores were comparable between real (3.323) and synthetic images (3.228, 3.258, and 3.320; p=0.383, 0.471, 1.000). Nearest-neighbor analysis found no evidence of memorization. We also provide a dataset of 20,063 synthetic radiographs. These results demonstrate that DDPM-generated cervical spine X-rays are statistically indistinguishable in realism and quality from real clinical images, offering a novel approach to creating large-scale neuroimaging datasets for ML applications in landmarking, segmentation, and classification.
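With four images per quartet, chance-level identification is 25%. Assuming every expert rated all fifty quartets (8 x 50 = 400 trials, an assumption about the protocol), the departure of the observed 29% from guessing can be checked directly:

```python
from scipy.stats import binomtest

trials = 8 * 50                    # assumed: each expert saw every quartet
hits = round(0.29 * trials)        # 116 correct identifications
print(binomtest(hits, trials, p=0.25).pvalue)  # two-sided p, roughly 0.07
```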
[1131] Synthetic-to-Real Transfer Learning for Chromatin-Sensitive PWS Microscopy
Jahidul Arafat, Sanjaya Poudel
Main category: eess.IV
TL;DR: CFU Net is a hierarchical segmentation architecture that enables automated nuclear segmentation in chromatin sensitive microscopy without manual annotations, achieving high performance on synthetic data and demonstrating clinical utility in early cancer detection.
Details
Motivation: Manual nuclear segmentation limits population-scale analysis needed for biomarker discovery in early cancer detection, and the lack of annotated csPWS imaging data prevents direct use of standard deep learning methods.
Method: CFU Net uses physics-based rendering with chromatin packing statistics, Mie scattering models, and modality-specific noise, combined with a three-stage curriculum from adversarial RGB pretraining to spectroscopic fine tuning and histology validation. It integrates ConvNeXt backbone, Feature Pyramid Network, UNet++ dense connections, dual attention, and deep supervision.
Result: Achieves near-perfect performance on synthetic test data (Dice 0.9879, IoU 0.9895), 8.3% Dice improvement over baseline UNet, 74.9% compression with 0.15s inference (240x throughput gain), and extracts chromatin biomarkers that distinguish normal from pre-cancerous tissue with 94% classification accuracy.
Conclusion: Provides a general framework for synthetic-to-real transfer learning in specialized microscopy and open resources for community validation on clinical specimens, enabling automated early cancer detection through chromatin packing analysis.
Abstract: Chromatin sensitive partial wave spectroscopic (csPWS) microscopy enables label free detection of nanoscale chromatin packing alterations that occur before visible cellular transformation. However, manual nuclear segmentation limits population scale analysis needed for biomarker discovery in early cancer detection. The lack of annotated csPWS imaging data prevents direct use of standard deep learning methods. We present CFU Net, a hierarchical segmentation architecture trained with a three stage curriculum on synthetic multimodal data. CFU Net achieves near perfect performance on held out synthetic test data that represent diverse spectroscopic imaging conditions without manual annotations (Dice 0.9879, IoU 0.9895). Our approach uses physics based rendering that incorporates empirically supported chromatin packing statistics, Mie scattering models, and modality specific noise, combined with a curriculum that progresses from adversarial RGB pretraining to spectroscopic fine tuning and histology validation. CFU Net integrates five architectural elements (ConvNeXt backbone, Feature Pyramid Network, UNet plus plus dense connections, dual attention, and deep supervision) that together improve Dice over a baseline UNet by 8.3 percent. We demonstrate deployment ready INT8 quantization with 74.9 percent compression and 0.15 second inference, giving a 240 times throughput gain over manual analysis. Applied to more than ten thousand automatically segmented nuclei from synthetic test data, the pipeline extracts chromatin biomarkers that distinguish normal from pre cancerous tissue with large effect sizes (Cohen's d between 1.31 and 2.98), reaching 94 percent classification accuracy. This work provides a general framework for synthetic to real transfer learning in specialized microscopy and open resources for community validation on clinical specimens.
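The headline Dice score has a one-line definition worth keeping at hand:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice overlap between binary masks: 2*|intersection| / (|pred| + |gt|)."""
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))
```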
[1132] TraceTrans: Translation and Spatial Tracing for Surgical Prediction
Xiyu Luo, Haodong Li, Xinxing Cheng, He Zhao, Yang Hu, Xuan Song, Tianyang Zhang
Main category: eess.IV
TL;DR: TraceTrans is a deformable image translation model for post-operative prediction that generates anatomically consistent images while revealing spatial correspondences with pre-operative inputs.
Details
Motivation: Existing image translation methods focus on matching target distributions but neglect spatial correspondences, leading to structural inconsistencies and hallucinations that undermine reliability in clinical applications requiring anatomical accuracy.
Method: Uses an encoder for feature extraction and dual decoders for predicting spatial deformations and synthesizing the translated image. The predicted deformation field imposes spatial constraints to ensure anatomical consistency with the source.
Result: Extensive experiments on medical cosmetology and brain MRI datasets demonstrate accurate and interpretable post-operative predictions.
Conclusion: TraceTrans shows potential for reliable clinical deployment by delivering anatomically consistent and interpretable predictions through explicit spatial correspondence modeling.
Abstract: Image-to-image translation models have achieved notable success in converting images across visual domains and are increasingly used for medical tasks such as predicting post-operative outcomes and modeling disease progression. However, most existing methods primarily aim to match the target distribution and often neglect spatial correspondences between the source and translated images. This limitation can lead to structural inconsistencies and hallucinations, undermining the reliability and interpretability of the predictions. These challenges are accentuated in clinical applications by the stringent requirement for anatomical accuracy. In this work, we present TraceTrans, a novel deformable image translation model designed for post-operative prediction that generates images aligned with the target distribution while explicitly revealing spatial correspondences with the pre-operative input. The framework employs an encoder for feature extraction and dual decoders for predicting spatial deformations and synthesizing the translated image. The predicted deformation field imposes spatial constraints on the generated output, ensuring anatomical consistency with the source. Extensive experiments on medical cosmetology and brain MRI datasets demonstrate that TraceTrans delivers accurate and interpretable post-operative predictions, highlighting its potential for reliable clinical deployment.
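The dual-decoder design implies the predicted deformation field can be applied to the source image to enforce spatial consistency; applying such a field with bilinear sampling looks like this (displacements assumed to be in normalized coordinates):

```python
import torch
import torch.nn.functional as F

def warp(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """image: (B, C, H, W); flow: (B, 2, H, W) displacements in [-1, 1]
    normalized coordinates, (x, y) order as grid_sample expects."""
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=image.device),
                            torch.linspace(-1, 1, W, device=image.device),
                            indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    grid = base + flow.permute(0, 2, 3, 1)
    return F.grid_sample(image, grid, align_corners=True)
```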
[1133] TVMC: Time-Varying Mesh Compression via Multi-Stage Anchor Mesh Generation
He Huang, Qi Yang, Yiling Xu, Zhu Li, Jenq-Neng Hwang
Main category: eess.IV
TL;DR: TVMC is a novel framework for compressing time-varying meshes using multi-stage anchor mesh generation and inter-frame prediction, achieving state-of-the-art compression performance with 10.2%-16.9% BD-rate gains over V-DMC.
Details
Motivation: Time-varying meshes with dynamic connectivity and varying vertex counts face practical challenges due to large data volumes. Existing compression methods struggle with topological inconsistency and motion artifacts when leveraging temporal redundancy.
Method: Three-stage coarse-to-fine anchor mesh generation: 1) Initial anchor via fast topology alignment, 2) Coarse anchor using Kalman filter-based motion estimation, 3) Fine anchor with Quadric Error Metric refinement. Encodes inter-frame motions and adaptively quantizes residual displacements.
Result: Extensive experiments on MPEG dynamic mesh sequences show TVMC achieves state-of-the-art compression performance with 10.2%-16.9% BD-rate gains over V-DMC standard while preserving high reconstruction quality.
Conclusion: TVMC’s hierarchical strategy effectively preserves consistent connectivity and high-quality surface approximation, providing an efficient and compact representation for dynamic geometry in time-varying meshes.
Abstract: Time-varying meshes, characterized by dynamic connectivity and varying vertex counts, hold significant promise for applications such as augmented reality. However, their practical utilization remains challenging due to the substantial data volume required for high-fidelity representation. While various compression methods attempt to leverage temporal redundancy between consecutive mesh frames, most struggle with topological inconsistency and motion-induced artifacts. To address these issues, we propose Time-Varying Mesh Compression (TVMC), a novel framework built on multi-stage coarse-to-fine anchor mesh generation for inter-frame prediction. Specifically, the anchor mesh is progressively constructed in three stages: initial, coarse, and fine. The initial anchor mesh is obtained through fast topology alignment to exploit temporal coherence. A Kalman filter-based motion estimation module then generates a coarse anchor mesh by accurately compensating inter-frame motions. Subsequently, a Quadric Error Metric-based refinement step optimizes vertex positions to form a fine anchor mesh with improved geometric fidelity. Based on the refined anchor mesh, the inter-frame motions relative to the reference base mesh are encoded, while the residual displacements between the subdivided fine anchor mesh and the input mesh are adaptively quantized and compressed. This hierarchical strategy preserves consistent connectivity and high-quality surface approximation, while achieving an efficient and compact representation of dynamic geometry. Extensive experiments on standard MPEG dynamic mesh sequences demonstrate that TVMC achieves state-of-the-art compression performance. Compared to the latest V-DMC standard, it delivers a significant BD-rate gain of 10.2% ~ 16.9%, while preserving high reconstruction quality. The code is available at https://github.com/H-Huang774/TVMC.
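The Kalman-filter motion estimation stage can be illustrated with a toy constant-velocity filter on a single vertex coordinate; the actual TVMC module operates on whole meshes, and the noise parameters below are assumptions.

```python
# Toy sketch of Kalman-filter motion estimation on one vertex coordinate
# with a constant-velocity model; parameters are assumptions.
import numpy as np

def kalman_step(x, P, z, dt=1.0, q=1e-3, r=1e-2):
    """x: [position, velocity] state; P: 2x2 covariance; z: observed position."""
    F_ = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity transition
    H = np.array([[1.0, 0.0]])               # we observe position only
    x = F_ @ x                               # predict
    P = F_ @ P @ F_.T + q * np.eye(2)
    S = H @ P @ H.T + r                      # innovation covariance
    K = P @ H.T / S                          # Kalman gain
    x = x + (K * (z - H @ x)).ravel()        # update with the measurement
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.zeros(2), np.eye(2)
for z in [0.10, 0.22, 0.29, 0.41]:           # a vertex coordinate over frames
    x, P = kalman_step(x, P, z)
print(x)                                     # filtered position and velocity
```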
[1134] Low-Light Image Enhancement Using Gamma Learning And Attention-Enabled Encoder-Decoder Networks
Bibhabasu Debnath, Sahana Ray, Sanjay Ghosh
Main category: eess.IV
TL;DR: A dual-stage deep learning architecture called GAtED combines adaptive gamma correction with attention-enhanced refinement to enhance low-light images by improving global illumination and restoring local details.
Details
Motivation: Low-light images suffer from amplified noise, inadequate illumination, contrast reduction, color distortion, and detail loss, making accurate object recognition and scene analysis challenging. Existing methods need better integration of global illumination adjustment with local detail refinement.
Method: Two-stage approach: 1) Adaptive Gamma Correction Module (AGCM) learns pixel-wise gamma values using local and global cues for brightening, 2) Encoder-decoder network with Convolutional Block Attention Modules (CBAM) refines details. Trained with composite loss including L1, SSIM, total variation, color constancy, and gamma regularization.
Result: Achieves PSNR up to 29.96 dB and SSIM up to 0.9458 on LOL datasets, outperforming existing methods. Better perceptual quality on DICM, LIME, MEF, NPE datasets with best NIQE scores across all datasets.
Conclusion: GAtED effectively handles both global illumination adjustment and local detail enhancement, providing a practical solution for low-light image enhancement with superior performance and fewer artifacts.
Abstract: Images acquired in low-light environments present significant obstacles for computer vision systems and human perception, especially for applications requiring accurate object recognition and scene analysis. Such images typically manifest multiple quality issues: amplified noise, inadequate scene illumination, contrast reduction, color distortion, and loss of details. While recent deep learning methods have shown promise, developing simple and efficient frameworks that naturally integrate global illumination adjustment with local detail refinement continues to be an important objective. To this end, we introduce a dual-stage deep learning architecture that combines adaptive gamma correction with attention-enhanced refinement to address these fundamental limitations. The first stage uses an Adaptive Gamma Correction Module (AGCM) to learn suitable gamma values for each pixel based on both local and global cues, producing a brightened intermediate output. The second stage applies an encoder-decoder deep network with Convolutional Block Attention Modules (CBAM) to this brightened image, in order to restore finer details. We train the network using a composite loss that includes L1 reconstruction, SSIM, total variation, color constancy, and gamma regularization terms to balance pixel accuracy with visual quality. Experiments on LOL-v1, LOL-v2 real, and LOL-v2 synthetic datasets show our method reaches a PSNR of up to 29.96 dB and an SSIM of up to 0.9458, outperforming existing approaches. Additional tests on DICM, LIME, MEF, and NPE datasets using NIQE, BRISQUE, and UNIQUE metrics confirm better perceptual quality with fewer artifacts, achieving the best NIQE scores across all datasets. Our GAtED (Gamma learned and Attention-enabled Encoder-Decoder) method effectively handles both global illumination adjustment and local detail enhancement, offering a practical solution for low-light enhancement.
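As a rough sketch of the AGCM idea, a small CNN can predict a per-pixel gamma map and brighten the input as I**gamma; the layer sizes and gamma range below are assumptions for illustration, not the paper's architecture.

```python
# Rough sketch of pixel-wise adaptive gamma correction; sizes assumed.
import torch
import torch.nn as nn

class AdaptiveGamma(nn.Module):
    def __init__(self, gamma_min=0.2, gamma_max=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )
        self.gmin, self.gmax = gamma_min, gamma_max

    def forward(self, x):            # x in [0, 1], shape B x 3 x H x W
        # Map sigmoid output into [gmin, gmax]; gamma < 1 lifts dark pixels
        # more than bright ones, giving spatially adaptive brightening.
        gamma = self.gmin + (self.gmax - self.gmin) * self.net(x)
        return x.clamp(min=1e-6) ** gamma

bright = AdaptiveGamma()(torch.rand(1, 3, 32, 32))
```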
[1135] Understanding What Is Not Said: Referring Remote Sensing Image Segmentation with Scarce Expressions
Kai Ye, Bowen Liu, Jianghang Lin, Jiayi Ji, Pingyang Dai, Liujuan Cao
Main category: eess.IV
TL;DR: The paper introduces Weakly Referring Expression Learning (WREL) for Referring Remote Sensing Image Segmentation, which uses class names as weak supervision along with limited accurate referring expressions to overcome annotation challenges in remote sensing.
Details
Motivation: Acquiring high-quality referring expressions in remote sensing is challenging due to small, densely distributed objects and complex backgrounds, making full annotation expensive and impractical.
Method: Proposes WREL paradigm using class names as weak referring expressions with limited accurate ones, and LRB-WREL with Learnable Reference Bank to refine weak expressions through sample-specific prompt embeddings, combined with teacher-student optimization using EMA updates.
Result: Extensive experiments on new benchmark show WREL and LRB-WREL approach or surpass models trained with fully annotated referring expressions, validating theoretical performance bounds.
Conclusion: WREL provides an effective solution for RRSIS under limited annotation conditions, with theoretical guarantees and practical performance comparable to fully supervised approaches.
Abstract: Referring Remote Sensing Image Segmentation (RRSIS) aims to segment instances in remote sensing images according to referring expressions. Unlike Referring Image Segmentation on general images, acquiring high-quality referring expressions in the remote sensing domain is particularly challenging due to the prevalence of small, densely distributed objects and complex backgrounds. This paper introduces a new learning paradigm, Weakly Referring Expression Learning (WREL) for RRSIS, which leverages abundant class names as weakly referring expressions together with a small set of accurate ones to enable efficient training under limited annotation conditions. Furthermore, we provide a theoretical analysis showing that mixed-referring training yields a provable upper bound on the performance gap relative to training with fully annotated referring expressions, thereby establishing the validity of this new setting. We also propose LRB-WREL, which integrates a Learnable Reference Bank (LRB) to refine weakly referring expressions through sample-specific prompt embeddings that enrich coarse class-name inputs. Combined with a teacher-student optimization framework using dynamically scheduled EMA updates, LRB-WREL stabilizes training and enhances cross-modal generalization under noisy weakly referring supervision. Extensive experiments on our newly constructed benchmark with varying weakly referring data ratios validate both the theoretical insights and the practical effectiveness of WREL and LRB-WREL, demonstrating that they can approach or even surpass models trained with fully annotated referring expressions.
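The dynamically scheduled EMA update in the teacher-student framework follows a standard pattern; a minimal sketch (decay value assumed, and buffers such as BatchNorm statistics would need the same treatment in practice):

```python
# Minimal sketch of an EMA teacher update; decay is an assumption.
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Move each teacher parameter toward its student counterpart."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

# Toy usage: identical architectures, teacher initialized from student.
student = torch.nn.Linear(8, 4)
teacher = torch.nn.Linear(8, 4)
teacher.load_state_dict(student.state_dict())
ema_update(teacher, student)
```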
[1136] Structure Aware Image Downscaling
G B Kevin Arjun, Suvrojit Mitra, Sanjay Ghosh
Main category: eess.IV
TL;DR: A new structure-informed image downscaling method that uses edge detection and edge-guided interpolation to preserve fine details and minimize artifacts.
Details
Motivation: To address the challenge of preserving structural integrity and visual fidelity in image downscaling, particularly avoiding edge blurring and texture loss common in existing methods.
Method: Three-step approach: (1) edge map computation using efficient edge detection, (2) edge-guided interpolation to preserve details, (3) texture enhancement by fusing local texture components from original image.
Result: Achieved high PSNR scores: 39.07 dB on DIV2K dataset and 38.71 dB on RealSR dataset for 4x downscaling. Outperforms recent methods in both visual quality and performance metrics.
Conclusion: The proposed method effectively minimizes artifacts while retaining crucial image features, producing downscaled images without edge blurring and texture loss.
Abstract: Image downscaling is one of the key operations in recent display technology and visualization tools. By this process, the dimension of an image is reduced, aiming to preserve structural integrity and visual fidelity. In this paper, we propose a new image downscaling method which is built on the core ideas of image filtering and edge detection. In particular, we present a structure-informed downscaling algorithm that maintains fine details through edge-aware processing. The proposed method comprises three steps: (i) edge map computation, (ii) edge-guided interpolation, and (iii) texture enhancement. To faithfully retain the strong structures in an image, we first compute the edge maps by applying an efficient edge detection operator. This is followed by an edge-guided interpolation to preserve fine details after resizing. Finally, we fuse the local texture-enriched component of the original image into the interpolated one to restore high-frequency information. By integrating edge information with adaptive filtering, our approach effectively minimizes artifacts while retaining crucial image features. To demonstrate the effective downscaling capability of our proposed method, we validate it on four datasets: DIV2K, BSD100, Urban100, and RealSR. For downscaling by 4x, our method achieves up to 39.07 dB PSNR on the DIV2K dataset and 38.71 dB on the RealSR dataset. Extensive experimental results confirm that the proposed method achieves superior performance in both visual quality and quantitative metrics compared with recent methods. Most importantly, the downscaled images produced by our method do not suffer from the edge blurring and texture loss seen in many existing approaches.
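A hedged sketch of the three-step recipe using standard building blocks (Sobel edges, area interpolation); the weighting and fusion choices below are assumptions, not the authors' exact filters.

```python
# Rough sketch: edge-weighted averaging plus a high-frequency residual.
import torch
import torch.nn.functional as F

def sobel_edges(gray):                           # gray: B x 1 x H x W
    kx = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]])
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return (gx ** 2 + gy ** 2).sqrt()

def structure_aware_downscale(img, scale=4):
    # (i) edge map on luminance, normalized to [0, 1]
    w = sobel_edges(img.mean(dim=1, keepdim=True))
    w = w / (w.amax(dim=(2, 3), keepdim=True) + 1e-8)
    # (ii) edge-guided interpolation: edge-weighted local averaging so
    # strong structures dominate each output pixel
    num = F.interpolate(img * (1 + w), scale_factor=1 / scale, mode="area")
    den = F.interpolate(1 + w, scale_factor=1 / scale, mode="area")
    out = num / den
    # (iii) texture enhancement: re-inject a downscaled high-frequency residual
    hf = img - F.avg_pool2d(img, 3, stride=1, padding=1)
    out = out + 0.5 * F.interpolate(hf, scale_factor=1 / scale, mode="area")
    return out.clamp(0, 1)

small = structure_aware_downscale(torch.rand(1, 3, 128, 128))  # 1 x 3 x 32 x 32
```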
[1137] Region-Adaptive Learned Hierarchical Encoding for 3D Gaussian Splatting Data
Shashank N. Sridhara, Birendra Kathariya, Fangjun Pu, Peng Yin, Eduardo Pavez, Antonio Ortega
Main category: eess.IV
TL;DR: RALHE is a learned hierarchical compression method for 3D Gaussian Splatting that reduces model size for bandwidth-constrained applications while maintaining rendering quality.
Details
Motivation: 3DGS models are large and limit deployment in bandwidth-constrained applications like volumetric media streaming, requiring efficient compression methods.
Method: Uses learned hierarchical latent representation with octree structure, overfits latents to Gaussian attributes under rate constraint, employs autoregressive probability model for bitrate estimation, and jointly optimizes multi-resolution latents with decoder and entropy coding networks.
Result: Achieves up to 2dB PSNR gain at low bitrates (<1 MB) compared to baseline 3DGS compression methods.
Conclusion: RALHE provides effective compression for 3DGS data while maintaining rendering quality, making it suitable for bandwidth-constrained applications.
Abstract: We introduce Region-Adaptive Learned Hierarchical Encoding (RALHE) for 3D Gaussian Splatting (3DGS) data. While 3DGS has recently become popular for novel view synthesis, the size of trained models limits its deployment in bandwidth-constrained applications such as volumetric media streaming. To address this, we propose a learned hierarchical latent representation that builds upon the principles of “overfitted” learned image compression (e.g., Cool-Chic and C3) to efficiently encode 3DGS attributes. Unlike images, 3DGS data have irregular spatial distributions of Gaussians (geometry) and consist of multiple attributes (signals) defined on the irregular geometry. Our codec is designed to account for these differences between images and 3DGS. Specifically, we leverage the octree structure of the voxelized 3DGS geometry to obtain a hierarchical multi-resolution representation. Our approach overfits latents to each Gaussian attribute under a global rate constraint. These latents are decoded independently through a lightweight decoder network. To estimate the bitrate during training, we employ an autoregressive probability model that leverages octree-derived contexts from the 3D point structure. The multi-resolution latents, decoder, and autoregressive entropy coding networks are jointly optimized for each Gaussian attribute. Experiments demonstrate that the proposed RALHE compression framework achieves a rendering PSNR gain of up to 2dB at low bitrates (less than 1 MB) compared to the baseline 3DGS compression methods.
[1138] USF-MAE: Ultrasound Self-Supervised Foundation Model with Masked Autoencoding
Youssef Megahed, Robin Ducharme, Mark Walker, Steven Hawken, Adrian D. C. Chan
Main category: eess.IV
TL;DR: USF-MAE is the first large-scale self-supervised foundation model for ultrasound imaging, pretrained on 370,000 ultrasound images from 46 datasets using masked autoencoding, achieving state-of-the-art performance on multiple clinical classification tasks.
Details
Motivation: Ultrasound image interpretation is challenging due to noise, operator dependence, and limited field of view. Current deep learning approaches are limited by scarce labeled data and domain gaps between general and sonographic images.
Method: Developed a self-supervised masked autoencoding (MAE) framework using Vision Transformer encoder-decoder architecture, pretrained on 370,000 2D/3D ultrasound images from 46 datasets (OpenUS-46), then fine-tuned on downstream classification tasks.
Result: Achieved F1-scores of 81.6% (breast cancer), 79.6% (ovarian tumors), and 82.4% (gastrointestinal stromal tumors), consistently outperforming conventional CNN and ViT baselines and demonstrating strong cross-anatomical generalization.
Conclusion: USF-MAE demonstrates that large-scale self-supervised pretraining on ultrasound-specific data enables effective representation learning and strong performance across diverse clinical tasks, approaching or surpassing supervised foundation models.
Abstract: Ultrasound imaging is one of the most widely used diagnostic modalities, offering real-time, radiation-free assessment across diverse clinical domains. However, interpretation of ultrasound images remains challenging due to high noise levels, operator dependence, and limited field of view, resulting in substantial inter-observer variability. Current Deep Learning approaches are hindered by the scarcity of large labeled datasets and the domain gap between general and sonographic images, which limits the transferability of models pretrained on non-medical data. To address these challenges, we introduce the Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), the first large-scale self-supervised MAE framework pretrained exclusively on ultrasound data. The model was pre-trained on 370,000 2D and 3D ultrasound images curated from 46 open-source datasets, collectively termed OpenUS-46, spanning over twenty anatomical regions. This curated dataset has been made publicly available to facilitate further research and reproducibility. Using a Vision Transformer encoder-decoder architecture, USF-MAE reconstructs masked image patches, enabling it to learn rich, modality-specific representations directly from unlabeled data. The pretrained encoder was fine-tuned on three public downstream classification benchmarks: BUS-BRA (breast cancer), MMOTU-2D (ovarian tumors), and GIST514-DB (gastrointestinal stromal tumors). Across all tasks, USF-MAE consistently outperformed conventional CNN and ViT baselines, achieving F1-scores of 81.6%, 79.6%, and 82.4%, respectively. Despite not using labels during pretraining, USF-MAE approached the performance of the supervised foundation model UltraSam on breast cancer classification and surpassed it on the other tasks, demonstrating strong cross-anatomical generalization.
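The core of MAE pretraining is random patch masking before the encoder; a minimal sketch, assuming common MAE defaults (75% mask ratio, ViT-style patch tokens):

```python
# Random patch masking as used in MAE-style pretraining; defaults assumed.
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: B x N x D patch embeddings -> visible subset and kept indices."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                     # one random score per patch
    keep = noise.argsort(dim=1)[:, :n_keep]      # indices of patches to keep
    visible = tokens.gather(1, keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep

x = torch.randn(2, 196, 768)   # 14x14 patches at ViT-Base width
vis, idx = random_masking(x)
print(vis.shape)               # torch.Size([2, 49, 768])
```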
[1139] Equivariance2Inverse: A Practical Self-Supervised CT Reconstruction Method Benchmarked on Real, Limited-Angle, and Blurred Data
Dirk Elias Schut, Adriaan Graas, Robert van Liere, Tristan van Leeuwen
Main category: eess.IV
TL;DR: Self-supervised CT reconstruction methods often make simplified physics assumptions that reduce real-world robustness. The paper benchmarks six methods on real and synthetic data, identifies limitations of pixel-wise noise independence assumptions, and proposes a new method combining successful concepts from existing approaches.
Details
Motivation: Self-supervised CT reconstruction methods are appealing for real-world applications but often use simplified physics models that make inaccurate assumptions about scintillator blurring, scanning geometry, and noise distribution, reducing their robustness to real-world conditions.
Method: The paper reviews model assumptions of six self-supervised CT reconstruction methods and benchmarks them on the real-world 2DeteCT dataset and synthetic data with/without scintillator blurring and limited-angle scanning geometry. Based on findings, a new method called Equivariance2Inverse is proposed by combining concepts from Robust Equivariant Imaging and Sparse2Inverse.
Result: Benchmark results show that methods assuming pixel-wise independent noise perform poorly on data with scintillator blurring, while assuming rotation invariance improves limited-angle reconstruction performance.
Conclusion: The paper demonstrates the importance of accurate physics modeling in self-supervised CT reconstruction and proposes Equivariance2Inverse as a new method that combines successful concepts to address identified limitations in existing approaches.
Abstract: Deep learning has shown impressive results in reducing noise and artifacts in X-ray computed tomography (CT) reconstruction. Self-supervised CT reconstruction methods are especially appealing for real-world applications because they require no ground truth training examples. However, these methods involve a simplified X-ray physics model during training, which may make inaccurate assumptions, for example, about scintillator blurring, the scanning geometry, or the distribution of the noise. As a result, they can be less robust to real-world imaging circumstances. In this paper, we review the model assumptions of six recent self-supervised CT reconstruction methods. Moreover, we benchmark these methods on the real-world 2DeteCT dataset and on synthetic data with and without scintillator blurring and a limited-angle scanning geometry. The results of our benchmark show that methods that assume that the noise is pixel-wise independent do not perform well on data with scintillator blurring, and that assuming rotation invariance improves results on limited-angle reconstructions. Based on these findings, we combined successful concepts of the Robust Equivariant Imaging and Sparse2Inverse methods in a new self-supervised CT reconstruction method called Equivariance2Inverse.
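The rotation-invariance idea can be enforced with an equivariant-imaging-style loss: reconstructing a rotated measurement should match the rotated reconstruction. The sketch below uses placeholder `recon_net` and `forward_op` callables and 90-degree rotations; the paper's exact formulation may differ.

```python
# Hedged sketch of a rotation-equivariance loss in the spirit of Robust
# Equivariant Imaging; operators are placeholders.
import torch

def equivariance_loss(recon_net, forward_op, y, k=1):
    x = recon_net(y)                            # reconstruct from measurements
    x_rot = torch.rot90(x, k, dims=(-2, -1))    # apply a 90-degree rotation
    y_rot = forward_op(x_rot)                   # re-measure the rotated image
    x_rot_hat = recon_net(y_rot)                # reconstruct again
    return torch.mean((x_rot_hat - x_rot) ** 2)

# Toy identity operators just to exercise the function (loss is zero here).
print(equivariance_loss(lambda y: y, lambda x: x, torch.randn(1, 1, 64, 64)))
```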
[1140] KongNet: A Multi-headed Deep Learning Model for Detection and Classification of Nuclei in Histopathology Images
Jiaqi Lv, Esha Sadia Nasir, Kesi Xu, Mostafa Jahanifar, Brinder Singh Chohan, Behnaz Elhaminia, Shan E Ahmed Raza
Main category: eess.IV
TL;DR: KongNet is a multi-headed deep learning architecture with shared encoder and specialized decoders for nuclei detection and classification in histopathology images, achieving top results in multiple challenges and state-of-the-art performance on public datasets.
Details
Motivation: Accurate detection and classification of nuclei in histopathology images are critical for diagnostic and research applications, requiring robust models that work across diverse tissue and stain types.
Method: Multi-headed architecture with shared encoder and parallel, cell-type-specialized decoders using multi-task learning to jointly predict nuclei centroids, segmentation masks, and contours, enhanced with SCSE attention modules and composite loss function.
Result: Achieved first place on MONKEY Challenge track 1, second on track 2; first place in 2025 MIDOG Challenge with lightweight variant; top three in PUMA Challenge; state-of-the-art performance on PanNuke and CoNIC datasets.
Conclusion: The specialized multi-decoder design is highly effective for nuclei detection and classification across diverse tissue and stain types, with pre-trained models publicly released to support future research.
Abstract: Accurate detection and classification of nuclei in histopathology images are critical for diagnostic and research applications. We present KongNet, a multi-headed deep learning architecture featuring a shared encoder and parallel, cell-type-specialised decoders. Through multi-task learning, each decoder jointly predicts nuclei centroids, segmentation masks, and contours, aided by Spatial and Channel Squeeze-and-Excitation (SCSE) attention modules and a composite loss function. We validate KongNet in three Grand Challenges. The proposed model achieved first place on track 1 and second place on track 2 during the MONKEY Challenge. Its lightweight variant (KongNet-Det) secured first place in the 2025 MIDOG Challenge. KongNet pre-trained on the MONKEY dataset and fine-tuned on the PUMA dataset ranked among the top three in the PUMA Challenge without further optimisation. Furthermore, KongNet established state-of-the-art performance on the publicly available PanNuke and CoNIC datasets. Our results demonstrate that the specialised multi-decoder design is highly effective for nuclei detection and classification across diverse tissue and stain types. The pre-trained model weights along with the inference code have been publicly released to support future research.
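The SCSE attention referenced above is a standard block (concurrent spatial and channel squeeze-and-excitation); a common PyTorch implementation is shown below, though KongNet's exact variant may differ in detail.

```python
# Standard concurrent spatial/channel squeeze-and-excitation (scSE) block.
import torch
import torch.nn as nn

class SCSE(nn.Module):
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.cse = nn.Sequential(              # channel attention branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )
        self.sse = nn.Sequential(              # spatial attention branch
            nn.Conv2d(ch, 1, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        # Recalibrate features along channels and space, then combine.
        return x * self.cse(x) + x * self.sse(x)

y = SCSE(32)(torch.randn(1, 32, 16, 16))   # shape preserved: 1 x 32 x 16 x 16
```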
[1141] Revising Second Order Terms in Deep Animation Video Coding
Konstantin Schmidt, Thomas Richter
Main category: eess.IV
TL;DR: The paper improves FOMM by replacing Jacobian transformations with global rotation for better handling of head rotations, achieving 40-80% bitrate savings on P-frames while maintaining video quality.
Details
Motivation: FOMM has limitations in handling strong head movements, particularly head rotations, due to its reliance on source-image warping, which fails to recreate videos with significant head movements.
Method: Replaced Jacobian transformations in FOMM with global rotation for head rotations, applied state-of-the-art normalization techniques to the discriminator to stabilize adversarial training.
Result: The optimized system performs better on items with head rotations while saving 40% to 80% of bitrate on P-frames, with improved visual quality demonstrated through LPIPS and DISTS metrics.
Conclusion: The proposed modifications to FOMM successfully address head rotation limitations while significantly reducing bitrate requirements and improving video generation quality.
Abstract: First Order Motion Model (FOMM) is a generative model that animates human heads based on very little motion information derived from keypoints. It is a promising solution for video communication because, first, it operates at a very low bitrate and, second, its computational complexity is moderate compared to other learning-based video codecs. However, it has strong limitations by design. Since it generates facial animations by warping source images, it fails to recreate videos with strong head movements. This work concentrates on one specific kind of head movement, namely head rotations. We show that replacing the Jacobian transformations in FOMM with a global rotation helps the system perform better on items with head rotations while saving 40% to 80% of bitrate on P-frames. Moreover, we apply state-of-the-art normalization techniques to the discriminator to stabilize the adversarial training, which is essential for generating visually appealing videos. We evaluate performance with the learned metrics LPIPS and DISTS to show the success of our optimizations.
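Replacing per-keypoint Jacobians with a single global rotation amounts to warping the source frame with one rotation matrix; a minimal sketch (the rotation angle would come from a pose estimate, assumed given here):

```python
# Warping a source frame with one global rotation; angle assumed given.
import torch
import torch.nn.functional as F

def rotate_image(img, angle_rad):                # img: B x C x H x W
    c, s = torch.cos(angle_rad), torch.sin(angle_rad)
    zero = torch.zeros_like(c)
    # 2x3 affine matrix of a pure rotation about the image center.
    theta = torch.stack([
        torch.stack([c, -s, zero], dim=-1),
        torch.stack([s, c, zero], dim=-1),
    ], dim=1)                                    # B x 2 x 3
    grid = F.affine_grid(theta, img.shape, align_corners=False)
    return F.grid_sample(img, grid, align_corners=False)

out = rotate_image(torch.randn(1, 3, 64, 64), torch.tensor([0.3]))
```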
[1142] Extended Depth-of-Field Lensless Imaging using an Optimized Radial Mask
Jose Reinaldo Cunha Santos A V Silva Neto, Tomoya Nakamura, Yasushi Makihara, Yasushi Yagi
Main category: eess.IV
TL;DR: Optimized radial coded mask for lensless cameras to improve frequency response while maintaining extended depth of field capabilities.
Details
Motivation: Leverage the design freedom of coded masks in lensless cameras to extend depth of field while improving frequency response, addressing limitations of naive radial mask implementations.
Method: Shape-preserving optimization scheme for radial-type amplitude coded mask that retains radial characteristics while optimizing frequency response.
Result: Optimized radial mask achieved better overall frequency response compared to naive implementations and demonstrated extended DOF in both simulations and real prototype camera.
Conclusion: The proposed optimization successfully creates radial coded masks that maintain extended depth of field capabilities while significantly improving frequency response performance.
Abstract: The freedom to design the coded masks used by mask-based lensless cameras is an advantage these systems have over lens-based ones. We leverage this design freedom to propose a shape-preserving optimization scheme for a radial-type amplitude coded mask. Because the radial mask's point spread function is depth-independent, such masks can be used to extend the effective depth of field (DOF) of a lensless imaging system. In this paper we optimize a coded mask for improved frequency response while retaining its radial characteristics and therefore its extended-DOF capabilities. We show that our optimized radial mask achieves a better overall frequency response than naive implementations of a radial mask. We also quantitatively and qualitatively demonstrate the extended-DOF imaging achieved by our optimized radial mask in simulations by comparing it to different non-radial coded masks. Finally, we build a prototype camera to validate the extended-DOF capabilities of our coded mask in real scenarios.
[1143] Continuous and complete liver vessel segmentation with graph-attention guided diffusion
Xiaotong Zhang, Alexander Broersen, Gonnie CM van Erp, Silvia L. Pintea, Jouke Dijkstra
Main category: eess.IV
TL;DR: A diffusion-based segmentation model using graph-attention modules at multiple scales to improve connectivity and completeness in liver vessel segmentation, particularly for small vessels.
Details
Motivation: Current methods struggle with connectivity and completeness in liver vessel segmentation, especially for small vessels, due to inconsistent annotations and lack of explicit geometry learning.
Method: Uses diffusion model with graph-attention modules to incorporate vessel geometry knowledge and continuity, applied at multiple scales to focus on small vessels.
Result: Outperforms eight state-of-the-art medical segmentation methods on 3D-ircadb-01 and LiVS datasets.
Conclusion: The proposed diffusion-based approach with multi-scale graph-attention effectively addresses connectivity and completeness challenges in liver vessel segmentation.
Abstract: Improving connectivity and completeness are the most challenging aspects of liver vessel segmentation, especially for small vessels. These challenges require both learning the continuous vessel geometry, and focusing on small vessel detection. However, current methods do not explicitly address these two aspects and cannot generalize well when constrained by inconsistent annotations. Here, we take advantage of the generalization of the diffusion model and explicitly integrate connectivity and completeness in our diffusion-based segmentation model. Specifically, we use a graph-attention module that adds knowledge about vessel geometry, and thus adds continuity. Additionally, we perform the graph-attention at multiple-scales, thus focusing on small liver vessels. Our method outperforms eight state-of-the-art medical segmentation methods on two public datasets: 3D-ircadb-01 and LiVS. Our code is available at https://github.com/ZhangXiaotong015/GATSegDiff.
[1144] Macro2Micro: A Rapid and Precise Cross-modal Magnetic Resonance Imaging Synthesis using Multi-scale Structural Brain Similarity
Sooyoung Kim, Joonwoo Kwon, Junbeom Kwon, Jungyoun Janice Min, Sangyoon Bae, Yuewei Lin, Shinjae Yoo, Jiook Cha
Main category: eess.IV
TL;DR: Macro2Micro is a GAN-based deep learning framework that predicts brain microstructure from macrostructure, achieving 6.8% SSIM improvement over previous methods with real-time inference capability.
Details
Motivation: Mapping nonlinear relationships between macroscopic and microscopic brain components is challenging due to technical limitations and high cost of multimodal MRI acquisition.
Method: Uses Generative Adversarial Network (GAN) with explicit encoding of multiscale brain information into distinct processing branches, plus an auxiliary discriminator and learning objective for artifact elimination.
Result: Successfully translates T1-weighted MRIs into Fractional Anisotropy (FA) images with 6.8% SSIM improvement over previous methods, while preserving individual brain characteristics. Inference time <0.01 seconds per modality translation.
Conclusion: Macro2Micro enables real-time multimodal rendering for medical and research applications by predicting microstructure from macrostructure efficiently.
Abstract: The human brain is a complex system requiring both macroscopic and microscopic components for comprehensive understanding. However, mapping nonlinear relationships between these scales remains challenging due to technical limitations and the high cost of multimodal Magnetic Resonance Imaging (MRI) acquisition. To address this, we introduce Macro2Micro, a deep learning framework that predicts brain microstructure from macrostructure using a Generative Adversarial Network (GAN). Based on the hypothesis that microscale structural information can be inferred from macroscale structures, Macro2Micro explicitly encodes multiscale brain information into distinct processing branches. To enhance artifact elimination and output quality, we propose a simple yet effective auxiliary discriminator and learning objective. Extensive experiments demonstrated that Macro2Micro faithfully translates T1-weighted MRIs into corresponding Fractional Anisotropy (FA) images, achieving a 6.8% improvement in the Structural Similarity Index Measure (SSIM) compared to previous methods, while retaining the individual biological characteristics of the brain. With an inference time of less than 0.01 seconds per MR modality translation, Macro2Micro introduces the potential for real-time multimodal rendering in medical and research applications. The code will be made available upon acceptance.
[1145] RWKV-UNet: Improving UNet with Long-Range Cooperation for Effective Medical Image Segmentation
Juntao Jiang, Jiangning Zhang, Weixuan Liu, Muxuan Gao, Xiaobin Hu, Zhucun Xue, Yong Liu, Shuicheng Yan
Main category: eess.IV
TL;DR: RWKV-UNet integrates RWKV structure into U-Net for medical image segmentation, addressing CNN’s limited long-range dependency capture and transformer’s high computational complexity.
Details
Motivation: CNNs struggle with long-range dependencies while transformers have high computational costs, creating a need for efficient models that can capture global context for accurate medical image segmentation.
Method: Proposes RWKV-UNet with Global-Local Spatial Perception blocks combining CNNs and RWKVs, and Cross-Channel Mix module for multi-scale feature fusion and global channel integration.
Result: Achieves state-of-the-art performance on 11 benchmark datasets across various medical image segmentation tasks, with smaller variants (RWKV-UNet-S/T) balancing accuracy and efficiency.
Conclusion: RWKV-UNet effectively addresses limitations of CNNs and transformers, providing accurate and computationally efficient medical image segmentation suitable for clinical applications.
Abstract: In recent years, significant advancements have been made in deep learning for medical image segmentation, particularly with convolutional neural networks (CNNs) and transformer models. However, CNNs face limitations in capturing long-range dependencies, while transformers suffer from high computational complexity. To address this, we propose RWKV-UNet, a novel model that integrates the RWKV (Receptance Weighted Key Value) structure into the U-Net architecture. This integration enhances the model’s ability to capture long-range dependencies and to improve contextual understanding, which is crucial for accurate medical image segmentation. We build a strong encoder with developed Global-Local Spatial Perception (GLSP) blocks combining CNNs and RWKVs. We also propose a Cross-Channel Mix (CCM) module to improve skip connections with multi-scale feature fusion, achieving global channel information integration. Experiments on 11 benchmark datasets show that the RWKV-UNet achieves state-of-the-art performance on various types of medical image segmentation tasks. Additionally, smaller variants, RWKV-UNet-S and RWKV-UNet-T, balance accuracy and computational efficiency, making them suitable for broader clinical applications.
[1146] Dynamic-Aware Spatio-temporal Representation Learning for Dynamic MRI Reconstruction
Dayoung Baik, Jaejun Yoo
Main category: eess.IV
TL;DR: DA-INR is a dynamic MRI reconstruction model that uses implicit neural representation to capture spatial-temporal continuity and temporal redundancy, achieving superior reconstruction quality at extreme undersampling ratios with reduced optimization time and minimal hyperparameter tuning.
Details
Motivation: Dynamic MRI reconstruction faces challenges in obtaining ground truth data and previous INR methods suffer from long optimization time and extensive hyperparameter tuning requirements.
Method: Proposes Dynamic-Aware INR (DA-INR) that captures spatial and temporal continuity in the image domain and explicitly incorporates temporal redundancy into the model structure using implicit neural representation.
Result: DA-INR outperforms other models in reconstruction quality even at extreme undersampling ratios while significantly reducing optimization time and requiring minimal hyperparameter tuning.
Conclusion: DA-INR effectively addresses the limitations of previous INR methods for dynamic MRI reconstruction by incorporating temporal dynamics and redundancy, achieving superior performance with improved efficiency.
Abstract: Dynamic MRI reconstruction, a classic inverse problem, has seen a surge in the use of deep learning techniques. In particular, the practical difficulty of obtaining ground truth data has led to the emergence of unsupervised learning approaches. A recent promising method among them is implicit neural representation (INR), which defines the data as a continuous function that maps coordinate values to the corresponding signal values. This allows missing information to be filled in from incomplete measurements alone, solving the inverse problem effectively. Nevertheless, previous works incorporating this method have faced drawbacks such as long optimization time and the need for extensive hyperparameter tuning. To address these issues, we propose Dynamic-Aware INR (DA-INR), an INR-based model for dynamic MRI reconstruction that captures the spatial and temporal continuity of dynamic MRI data in the image domain and explicitly incorporates the temporal redundancy of the data into the model structure. As a result, DA-INR outperforms other models in reconstruction quality even at extreme undersampling ratios while significantly reducing optimization time and requiring minimal hyperparameter tuning.
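The INR backbone is simply an MLP from (x, y, t) coordinates to signal values, trained only where measurements exist; a minimal sketch with assumed widths and depth:

```python
# Minimal implicit neural representation: coordinates in, signal out.
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords):       # N x 3 rows of (x, y, t) in [-1, 1]
        return self.net(coords)

# The model is fitted where measurements exist; missing data is then
# recovered by querying the continuous function at unmeasured coordinates.
model = CoordMLP()
values = model(torch.rand(1024, 3) * 2 - 1)
```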
[1147] Slot-BERT: Self-supervised Object Discovery in Surgical Video
Guiqiu Liao, Matjaz Jogan, Marcel Hussing, Kenta Nakahashi, Kazuhiro Yasufuku, Amin Madani, Eric Eaton, Daniel A. Hashimoto
Main category: eess.IV
TL;DR: Slot-BERT is a bidirectional long-range model for unsupervised object-centric learning in surgical videos that maintains temporal coherence while being computationally efficient for medical applications.
Details
Motivation: Existing object-centric methods for videos struggle with maintaining long-range temporal coherence in surgical videos, while fully parallel processing is computationally impractical for medical facility hardware.
Method: Slot-BERT uses bidirectional processing to learn object-centric representations in latent space with a novel slot contrastive loss that reduces redundancy and improves representation disentanglement through enhanced slot orthogonality.
Result: Slot-BERT surpasses state-of-the-art object-centric approaches in unsupervised training, achieving superior performance across abdominal, cholecystectomy, and thoracic surgical procedures, and demonstrates efficient zero-shot domain adaptation.
Conclusion: Slot-BERT provides an effective solution for scalable object discovery in long surgical videos while maintaining temporal coherence and computational efficiency suitable for medical facility implementation.
Abstract: Object-centric slot attention is a powerful framework for unsupervised learning of structured and explainable representations that can support reasoning about objects and actions, including in surgical videos. While conventional object-centric methods for videos leverage recurrent processing to achieve efficiency, they often struggle with maintaining long-range temporal coherence required for long videos in surgical applications. On the other hand, fully parallel processing of entire videos enhances temporal consistency but introduces significant computational overhead, making it impractical for implementation on hardware in medical facilities. We present Slot-BERT, a bidirectional long-range model that learns object-centric representations in a latent space while ensuring robust temporal coherence. Slot-BERT scales object discovery seamlessly to long videos of unconstrained lengths. A novel slot contrastive loss further reduces redundancy and improves the representation disentanglement by enhancing slot orthogonality. We evaluate Slot-BERT on real-world surgical video datasets from abdominal, cholecystectomy, and thoracic procedures. Our method surpasses state-of-the-art object-centric approaches under unsupervised training achieving superior performance across diverse domains. We also demonstrate efficient zero-shot domain adaptation to data from diverse surgical specialties and databases.
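One way to realize the described slot orthogonality is to penalize off-diagonal cosine similarity between slots; the paper's exact contrastive form may differ, so the sketch below is an assumption-laden stand-in.

```python
# Sketch of a slot orthogonality penalty; exact form is an assumption.
import torch
import torch.nn.functional as F

def slot_orthogonality_loss(slots):              # slots: B x K x D
    s = F.normalize(slots, dim=-1)
    sim = torch.bmm(s, s.transpose(1, 2))        # B x K x K cosine similarities
    eye = torch.eye(slots.shape[1], device=slots.device)
    return ((sim - eye) ** 2).mean()             # penalize off-diagonal terms

loss = slot_orthogonality_loss(torch.randn(2, 7, 64))
```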
[1148] Are Pixel-Wise Metrics Reliable for Sparse-View Computed Tomography Reconstruction?
Tianyu Lin, Xinran Li, Chuntung Zhuang, Qi Chen, Yuanhao Cai, Kai Ding, Alan L. Yuille, Zongwei Zhou
Main category: eess.IV
TL;DR: The paper proposes CARE, a completeness-aware reconstruction enhancement framework with novel anatomy-aware metrics to address limitations of traditional CT evaluation metrics that fail to capture structural completeness.
Details
Motivation: Traditional CT reconstruction metrics like SSIM and PSNR prioritize pixel-wise fidelity but often miss critical anatomical structures, especially small or thin regions that are easily overlooked.
Method: Propose anatomy-aware evaluation metrics for structural completeness and introduce CARE framework that incorporates structural penalties during training to preserve anatomical structures. CARE is model-agnostic and works with analytical, implicit, and generative methods.
Result: CARE substantially improves structural completeness: +32% for large organs, +22% for small organs, +40% for intestines, and +36% for vessels when applied to various reconstruction methods.
Conclusion: The CARE framework effectively enhances structural completeness in CT reconstructions across different anatomical structures, addressing a critical limitation of traditional evaluation metrics.
Abstract: Widely adopted evaluation metrics for sparse-view CT reconstruction–such as Structural Similarity Index Measure and Peak Signal-to-Noise Ratio–prioritize pixel-wise fidelity but often fail to capture the completeness of critical anatomical structures, particularly small or thin regions that are easily missed. To address this limitation, we propose a suite of novel anatomy-aware evaluation metrics designed to assess structural completeness across anatomical structures, including large organs, small organs, intestines, and vessels. Building on these metrics, we introduce CARE, a Completeness-Aware Reconstruction Enhancement framework that incorporates structural penalties during training to encourage anatomical preservation of significant structures. CARE is model-agnostic and can be seamlessly integrated into analytical, implicit, and generative methods. When applied to these methods, CARE substantially improves structural completeness in CT reconstructions, achieving up to +32% improvement for large organs, +22% for small organs, +40% for intestines, and +36% for vessels.
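A simple instance of an anatomy-aware completeness score is per-structure recall: the fraction of each labeled structure's voxels recovered in the reconstruction-derived segmentation. The CARE metrics are richer than this, but the sketch conveys the shape of the idea.

```python
# Per-structure completeness as recall over a 3D anatomical mask.
import torch

def structure_completeness(pred_mask, gt_mask, eps=1e-8):
    """Boolean 3D masks in, fraction of recovered structure voxels out."""
    tp = (pred_mask & gt_mask).sum().float()
    return (tp / (gt_mask.sum().float() + eps)).item()

gt = torch.zeros(32, 32, 32, dtype=torch.bool)
gt[10:20, 10:20, 10:20] = True                 # a small cubic "organ"
pred = gt.clone()
pred[10:12] = False                            # reconstruction misses a slab
print(structure_completeness(pred, gt))        # 0.8
```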
[1149] A Poisson-Guided Decomposition Network for Extreme Low-Light Image Enhancement
Isha Rao, Ratul Chakraborty, Sanjay Ghosh
Main category: eess.IV
TL;DR: A lightweight deep learning method that integrates Retinex decomposition with Poisson denoising for low-light image enhancement, handling signal-dependent noise without requiring reflectance/illumination priors.
Details
Motivation: Traditional Gaussian noise assumptions don't hold in real-world low-light imaging where noise is signal-dependent (Poisson noise), requiring specialized denoising approaches for extreme low-light conditions.
Method: Unified encoder-decoder network combining Retinex-based decomposition with Poisson denoising, using Poisson denoising loss to handle signal-dependent noise without reflectance/illumination priors.
Result: Method effectively enhances illumination and suppresses noise, improving visibility and brightness while preserving image structure and color constancy without color distortion.
Conclusion: The proposed approach is effective and practical for low-light image enhancement, successfully handling Poisson noise and maintaining image quality under ambient illumination.
Abstract: Low-light image denoising and enhancement are challenging, especially when traditional noise assumptions, such as Gaussian noise, do not hold, as is often the case. In many real-world scenarios, such as low-light imaging, noise is signal-dependent and is better represented as Poisson noise. In this work, we address the problem of denoising images degraded by Poisson noise under extreme low-light conditions. We introduce a lightweight deep learning-based method that integrates Retinex-based decomposition with Poisson denoising into a unified encoder-decoder network. The model simultaneously enhances illumination and suppresses noise by incorporating a Poisson denoising loss to address signal-dependent noise. Without requiring priors on reflectance or illumination, the network learns an effective decomposition process while ensuring consistent reflectance and smooth illumination without introducing color distortion. The experimental results demonstrate the effectiveness and practicality of the proposed low-light illumination enhancement method. Our method significantly improves visibility and brightness in low-light conditions, while preserving image structure and color constancy under ambient illumination.
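The signal-dependent noise term can be written as a Poisson negative log-likelihood between the predicted clean intensities and the observed counts, which PyTorch exposes directly; a minimal sketch, with a simulation standing in for a denoising network's prediction:

```python
# Poisson NLL between predicted clean intensities and observed counts.
import torch
import torch.nn.functional as F

clean = torch.rand(1, 1, 32, 32) * 10.0   # stand-in for a predicted clean image
noisy = torch.poisson(clean)              # simulated low-light photon counts
loss = F.poisson_nll_loss(clean, noisy, log_input=False)
print(loss)
```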
[1150] ReXGroundingCT: A 3D Chest CT Dataset for Segmentation of Findings from Free-Text Reports
Mohammed Baharoon, Luyang Luo, Michael Moritz, Abhinav Kumar, Sung Eun Kim, Xiaoman Zhang, Miao Zhu, Mahmoud Hussain Alabbad, Maha Sbayel Alhazmi, Neel P. Mistry, Lucas Bijnens, Kent Ryan Kleinschmidt, Brady Chrisler, Sathvik Suryadevara, Sri Sai Dinesh Jaliparthi, Noah Michael Prudlo, Mark David Marino, Jeremy Palacio, Rithvik Akula, Di Zhou, Hong-Yu Zhou, Ibrahim Ethem Hamamci, Scott J. Adams, Hassan Rayhan AlOmaish, Pranav Rajpurkar
Main category: eess.IV
TL;DR: ReXGroundingCT is the first public dataset linking free-text radiology findings to pixel-level 3D segmentations in chest CT scans, containing 16,301 annotated entities across 8,028 text-to-3D-segmentation pairs from 3,142 CT scans.
Details
Motivation: To establish a foundation for enabling free-text finding segmentation and grounded radiology report generation in CT imaging by creating a publicly available dataset that bridges textual findings with precise 3D spatial annotations.
Method: Three-stage pipeline: 1) GPT-4 extracted and standardized findings from Turkish reports translated to English; 2) GPT-4o-mini categorized findings into hierarchical ontology; 3) 3D annotations produced with radiologist quality assurance, plus creation of chain-of-thought dataset for anatomical reasoning.
Result: Dataset contains 16,301 annotated entities across 8,028 text-to-3D-segmentation pairs from 3,142 non-contrast CT scans, with 79% focal and 21% non-focal abnormalities. Includes public validation set (50 cases) and private test set (100 cases) annotated by board-certified radiologists.
Conclusion: ReXGroundingCT provides a comprehensive foundation for advancing free-text finding segmentation and grounded radiology report generation in CT imaging, with model performance tracked on a public leaderboard.
Abstract: We introduce ReXGroundingCT, the first publicly available dataset linking free-text findings to pixel-level 3D segmentations in chest CT scans. The dataset includes 3,142 non-contrast chest CT scans paired with standardized radiology reports from CT-RATE. Construction followed a structured three-stage pipeline. First, GPT-4 was used to extract and standardize findings, descriptors, and metadata from reports originally written in Turkish and machine-translated into English. Second, GPT-4o-mini categorized each finding into a hierarchical ontology of lung and pleural abnormalities. Third, 3D annotations were produced for all CT volumes: the training set was quality-assured by board-certified radiologists, and the validation and test sets were fully annotated by board-certified radiologists. Additionally, a complementary chain-of-thought dataset was created to provide step-by-step hierarchical anatomical reasoning for localizing findings within the CT volume, using GPT-4o and localization coordinates derived from organ segmentation models. ReXGroundingCT contains 16,301 annotated entities across 8,028 text-to-3D-segmentation pairs, covering diverse radiological patterns from 3,142 non-contrast CT scans. About 79% of findings are focal abnormalities and 21% are non-focal. The dataset includes a public validation set of 50 cases and a private test set of 100 cases, both annotated by board-certified radiologists. The dataset establishes a foundation for enabling free-text finding segmentation and grounded radiology report generation in CT imaging. Model performance on the private test set is hosted on a public leaderboard at https://rexrank.ai/ReXGroundingCT. The dataset is available at https://huggingface.co/datasets/rajpurkarlab/ReXGroundingCT.
[1151] Smartphone-based iris recognition through high-quality visible-spectrum iris image capture.V2
Naveenkumar G Venkataswamy, Yu Liu, Soumyabrata Dey, Stephanie Schuckers, Masudul H Imtiaz
Main category: eess.IV
TL;DR: This paper presents a complete pipeline for visible spectrum iris recognition on smartphones, achieving high accuracy through standardized quality control and lightweight neural networks.
Details
Motivation: Smartphone-based iris recognition in visible spectrum faces challenges due to illumination variability, pigmentation differences, and lack of standardized capture controls.
Method: Developed a compact end-to-end pipeline with ISO/IEC 29794-6 quality compliance, custom Android app for real-time framing and feedback, LightIrisNet (MobileNetV3-based segmentation), and IrisFormer transformer matcher adapted to VIS domain.
Result: OSIRIS achieved TAR of 97.9% at FAR=0.01 (EER=0.76%), IrisFormer trained on UBIRIS.v2 achieved EER of 0.057% on CUVIRIS dataset (752 images from 47 subjects).
Conclusion: Standardized capture and VIS-adapted lightweight models enable accurate and practical iris recognition on smartphones.
Abstract: Smartphone-based iris recognition in the visible spectrum (VIS) remains difficult due to illumination variability, pigmentation differences, and the absence of standardized capture controls. This work presents a compact end-to-end pipeline that enforces ISO/IEC 29794-6 quality compliance at acquisition and demonstrates that accurate VIS iris recognition is feasible on commodity devices. Using a custom Android application performing real-time framing, sharpness evaluation, and feedback, we introduce the CUVIRIS dataset of 752 compliant images from 47 subjects. A lightweight MobileNetV3-based multi-task segmentation network (LightIrisNet) is developed for efficient on-device processing, and a transformer matcher (IrisFormer) is adapted to the VIS domain. Under a standardized protocol and comparative benchmarking against prior CNN baselines, OSIRIS attains a TAR of 97.9% at FAR=0.01 (EER=0.76%), while IrisFormer, trained only on UBIRIS.v2, achieves an EER of 0.057% on CUVIRIS. The acquisition app, trained models, and a public subset of the dataset are released to support reproducibility. These results confirm that standardized capture and VIS-adapted lightweight models enable accurate and practical iris recognition on smartphones.
[1152] Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion
Xiaojian Ding, Lin Zhao, Xian Li, Xiaoying Zhu
Main category: eess.IV
TL;DR: HSACC is a novel incomplete multi-view clustering framework that uses hierarchical semantic alignment and cooperative completion to address missing views and achieve robust cross-view fusion.
Details
Motivation: Existing deep incomplete multi-view clustering methods suffer from static fusion strategies and two-stage pipelines, leading to suboptimal fusion and error propagation issues.
Method: HSACC employs dual-level semantic spaces: low-level ensures consistency via mutual information maximization, high-level uses adaptive view weights for weighted fusion. It implicitly recovers missing views through aligned latent representations and jointly optimizes reconstruction and clustering.
Result: HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies confirm effectiveness of hierarchical alignment and dynamic weighting, and parameter analysis shows robustness to hyperparameter variations.
Conclusion: The proposed HSACC framework effectively addresses incomplete multi-view clustering challenges through hierarchical semantic alignment and cooperative completion, demonstrating superior performance and robustness.
Abstract: Incomplete multi-view data, where certain views are entirely missing for some samples, poses significant challenges for traditional multi-view clustering methods. Existing deep incomplete multi-view clustering approaches often rely on static fusion strategies or two-stage pipelines, leading to suboptimal fusion results and error propagation issues. To address these limitations, this paper proposes a novel incomplete multi-view clustering framework based on Hierarchical Semantic Alignment and Cooperative Completion (HSACC). HSACC achieves robust cross-view fusion through a dual-level semantic space design. In the low-level semantic space, consistency alignment is ensured by maximizing mutual information across views. In the high-level semantic space, adaptive view weights are dynamically assigned based on the distributional affinity between individual views and an initial fused representation, followed by weighted fusion to generate a unified global representation. Additionally, HSACC implicitly recovers missing views by projecting aligned latent representations into high-dimensional semantic spaces and jointly optimizes reconstruction and clustering objectives, enabling cooperative learning of completion and clustering. Experimental results demonstrate that HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies validate the effectiveness of the hierarchical alignment and dynamic weighting mechanisms, while parameter analysis confirms the model’s robustness to hyperparameter variations.
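The adaptive view weighting can be sketched as an affinity between each view and an initial fused representation, followed by softmax-weighted fusion; using cosine similarity as the affinity measure is an assumption here, not the paper's exact formulation.

```python
# Sketch of affinity-based adaptive view weighting and fusion.
import torch
import torch.nn.functional as F

def adaptive_fusion(views):                      # list of B x D view embeddings
    init = torch.stack(views).mean(0)            # initial fused representation
    sims = torch.stack([F.cosine_similarity(v, init, dim=1) for v in views])
    w = torch.softmax(sims, dim=0)               # per-sample view weights, V x B
    return sum(w[i].unsqueeze(1) * v for i, v in enumerate(views))

fused = adaptive_fusion([torch.randn(4, 16) for _ in range(3)])
```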