Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 279]
cs.CV [Total: 343]
cs.AI [Total: 137]
cs.SD [Total: 32]
cs.LG [Total: 391]
cs.MA [Total: 10]
cs.MM [Total: 3]
eess.AS [Total: 9]
eess.IV [Total: 11]

cs.CL

[1] Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation

Wei Zhou, Bolei Ma, Annemarie Friedrich, Mohsen Mesgar

Main category: cs.CL

TL;DR: This survey provides a comprehensive overview of Table Question Answering (TQA) research with focus on LLM-based methods, organizing task formulations, challenges, and methodological trends while highlighting emerging directions like reinforcement learning.

Details

Motivation: The field lacks systematic organization and understanding of TQA task formulations, core challenges, and methodological trends, particularly with emerging research directions like reinforcement learning.

Method: Provides comprehensive categorization of benchmarks and task setups, groups modeling strategies by challenges they target, and analyzes their strengths and limitations.

Result: Offers a structured overview of TQA research with focus on LLM-based methods, unifying disparate research threads and identifying open problems.

Conclusion: The survey provides a consolidated foundation for the TQA community, enabling deeper understanding of state of the art and guiding future developments in this rapidly evolving area.

Abstract: Table Question Answering (TQA) aims to answer natural language questions about tabular data, often accompanied by additional contexts such as text passages. The task spans diverse settings, varying in table representation, question/answer complexity, modality involved, and domain. While recent advances in large language models (LLMs) have led to substantial progress in TQA, the field still lacks a systematic organization and understanding of task formulations, core challenges, and methodological trends, particularly in light of emerging research directions such as reinforcement learning. This survey addresses this gap by providing a comprehensive and structured overview of TQA research with a focus on LLM-based methods. We provide a comprehensive categorization of existing benchmarks and task setups. We group current modeling strategies according to the challenges they target, and analyze their strengths and limitations. Furthermore, we highlight underexplored but timely topics that have not been systematically covered in prior research. By unifying disparate research threads and identifying open problems, our survey offers a consolidated foundation for the TQA community, enabling a deeper understanding of the state of the art and guiding future developments in this rapidly evolving area.

[2] MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction

Jianjin Wang, Runsong Zhao, Xiaoqian Liu, Yuan Ge, Ziqiang Xu, Tong Xiao, Shengxiang Gao, Zhengtao Yu, Jingbo Zhu

Main category: cs.CL

TL;DR: The paper introduces multi-token prediction (MTP) loss to improve speech-to-unit translation by enabling models to predict multiple subsequent tokens at each position, enhancing semantic density. The proposed MTP-S2UT variant applies this loss to intermediate layers for earlier information enrichment.

Details

Motivation: Current speech-to-speech translation methods use speech tokens that lack semantic density, requiring multiple tokens to express complete semantic units. This limits the efficiency and quality of translation.

Method: Proposes multi-token prediction (MTP) loss for S2UT models to predict multiple subsequent tokens per position. Introduces MTP-S2UT variant that applies MTP loss to intermediate layers where CTC loss is computed, enabling earlier information enrichment.

Result: All MTP loss variants consistently improve S2UT translation quality. MTP-S2UT achieves the best performance among the tested approaches.

Conclusion: Applying multi-token prediction loss, particularly at intermediate layers (MTP-S2UT), effectively enhances semantic density and improves speech-to-unit translation performance by enabling earlier and more effective information enrichment.

Abstract: Current direct speech-to-speech translation methods predominantly employ speech tokens as intermediate representations. However, a single speech token is not dense in semantics, so we generally need multiple tokens to express a complete semantic unit. To address this limitation, we introduce multi-token prediction (MTP) loss into speech-to-unit translation (S2UT) models, enabling models to predict multiple subsequent tokens at each position, thereby capturing more complete semantics and enhancing information density per position. Initial MTP implementations apply the loss at the final layer, which improves output representation but initiates information enrichment too late. We hypothesize that advancing the information enrichment process to intermediate layers can achieve earlier and more effective enhancement of hidden representation. Consequently, we propose MTP-S2UT loss, applying MTP loss to hidden representation where CTC loss is computed. Experiments demonstrate that all MTP loss variants consistently improve the quality of S2UT translation, with MTP-S2UT achieving the best performance.

[3] Emotionally Charged, Logically Blurred: AI-driven Emotional Framing Impairs Human Fallacy Detection

Yanran Chen, Lynn Greschner, Roman Klinger, Michael Klenk, Steffen Eger

Main category: cs.CL

TL;DR: LLMs can inject emotional appeals into fallacious arguments, reducing human fallacy detection by 14.5% and increasing convincingness, particularly with fear and sadness emotions.

Details

Motivation: To study how emotional framing interacts with logical fallacies and convincingness, as fallacious arguments can still appear convincing due to their subjective nature.

Method: Benchmarked eight LLMs to inject emotional appeal into fallacious arguments while preserving logical structures, then used best models to generate stimuli for human study.

Result: LLM-driven emotional framing reduced human fallacy detection by 14.5% on average. Humans detected fallacies better with enjoyment than fear/sadness. Fear, sadness, and enjoyment correlated with higher convincingness than neutral states.

Conclusion: The work reveals implications for AI-driven emotional manipulation in fallacious argumentation, showing how emotional appeals can mask logical flaws and increase persuasiveness.

Abstract: Logical fallacies are common in public communication and can mislead audiences; fallacious arguments may still appear convincing despite lacking soundness, because convincingness is inherently subjective. We present the first computational study of how emotional framing interacts with fallacies and convincingness, using large language models (LLMs) to systematically change emotional appeals in fallacious arguments. We benchmark eight LLMs on injecting emotional appeal into fallacious arguments while preserving their logical structures, then use the best models to generate stimuli for a human study. Our results show that LLM-driven emotional framing reduces human fallacy detection in F1 by 14.5% on average. Humans perform better in fallacy detection when perceiving enjoyment than fear or sadness, and these three emotions also correlate with significantly higher convincingness compared to neutral or other emotion states. Our work has implications for AI-driven emotional manipulation in the context of fallacious argumentation.

[4] The Idola Tribus of AI: Large Language Models tend to perceive order where none exists

Shin-nosuke Ishikawa, Masato Todo, Taiki Ogihara, Hirotsugu Ohba

Main category: cs.CL

TL;DR: LLMs tend to generate absurd patterns when analyzing number sequences, showing logical inconsistency even in simple tasks.

Details

Motivation: To evaluate LLMs' logical consistency and self-coherence, which are crucial for complex real-world applications like retrieval-augmented generation and AI agent frameworks.

Method: Conducted experiments asking LLMs to explain patterns in various integer sequences, including arithmetic sequences and randomly generated series.

Result: LLMs successfully identified correct patterns in arithmetic/geometric sequences but frequently over-recognized inconsistent patterns in random series, even with multi-step reasoning models like OpenAI o3, o4-mini, and Google Gemini 2.5 Flash.

Conclusion: LLMs exhibit a tendency to perceive non-existent patterns (AI equivalent of Idola Tribus), highlighting limitations in logical reasoning capabilities despite chain-of-thought mechanisms.

Abstract: We present a tendency of large language models (LLMs) to generate absurd patterns despite their clear inappropriateness in a simple task of identifying regularities in number series. Several approaches have been proposed to apply LLMs to complex real-world tasks, such as providing knowledge through retrieval-augmented generation and executing multi-step tasks using AI agent frameworks. However, these approaches rely on the logical consistency and self-coherence of LLMs, making it crucial to evaluate these aspects and consider potential countermeasures. To identify cases where LLMs fail to maintain logical consistency, we conducted an experiment in which LLMs were asked to explain the patterns in various integer sequences, ranging from arithmetic sequences to randomly generated integer series. While the models successfully identified correct patterns in arithmetic and geometric sequences, they frequently over-recognized patterns that were inconsistent with the given numbers when analyzing randomly generated series. This issue was observed even in multi-step reasoning models, including OpenAI o3, o4-mini, and Google Gemini 2.5 Flash Preview Thinking. This tendency to perceive non-existent patterns can be interpreted as the AI model equivalent of Idola Tribus and highlights potential limitations in their capability for applied tasks requiring logical reasoning, even when employing chain-of-thought reasoning mechanisms.

[5] SeCon-RAG: A Two-Stage Semantic Filtering and Conflict-Free Framework for Trustworthy RAG

Xiaonan Si, Meilin Zhu, Simeng Qin, Lijia Yu, Lijun Zhang, Shuaitong Liu, Xinfeng Li, Ranjie Duan, Yang Liu, Xiaojun Jia

Main category: cs.CL

TL;DR: SeCon-RAG is a two-stage semantic filtering framework that protects RAG systems from corpus poisoning attacks while preserving valuable knowledge, using entity-intent-relation extraction and conflict-aware filtering.

Details

Motivation: Existing RAG defenses use aggressive filtering that causes unnecessary loss of valuable information and reduces generation reliability, creating a need for more precise protection methods.

Method: Two-stage framework: 1) Joint semantic and cluster-based filtering guided by EIRE (Entity-intent-relation extractor) to score relevance and build clean database; 2) EIRE-guided conflict-aware filtering that analyzes semantic consistency between query, answers, and retrieved knowledge before generation.

Result: Significantly outperforms state-of-the-art defense methods across various LLMs and datasets, achieving improvements in both generation robustness and output trustworthiness.

Conclusion: SeCon-RAG effectively preserves useful knowledge while mitigating conflict contamination, providing a trustworthy RAG solution that balances security and information retention.

Abstract: Retrieval-augmented generation (RAG) systems enhance large language models (LLMs) with external knowledge but are vulnerable to corpus poisoning and contamination attacks, which can compromise output integrity. Existing defenses often apply aggressive filtering, leading to unnecessary loss of valuable information and reduced reliability in generation. To address this problem, we propose a two-stage semantic filtering and conflict-free framework for trustworthy RAG. In the first stage, we perform a joint filter with semantic and cluster-based filtering which is guided by the Entity-intent-relation extractor (EIRE). EIRE extracts entities, latent objectives, and entity relations from both the user query and filtered documents, scores their semantic relevance, and selectively adds valuable documents into the clean retrieval database. In the second stage, we proposed an EIRE-guided conflict-aware filtering module, which analyzes semantic consistency between the query, candidate answers, and retrieved knowledge before final answer generation, filtering out internal and external contradictions that could mislead the model. Through this two-stage process, SeCon-RAG effectively preserves useful knowledge while mitigating conflict contamination, achieving significant improvements in both generation robustness and output trustworthiness. Extensive experiments across various LLMs and datasets demonstrate that the proposed SeCon-RAG markedly outperforms state-of-the-art defense methods.

[6] ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models

Wenbin Guo, Xin Wang, Jiaoyan Chen, Lingbing Guo, Zhao Li, Zirui Chen

Main category: cs.CL

TL;DR: ReaLM bridges the gap between KG embeddings and LLMs using residual vector quantization to discretize KG embeddings into code sequences, achieving state-of-the-art KGC performance.

Details

Motivation: Existing LLM-based KGC methods struggle with semantic transfer due to misalignment between continuous KG embedding space and discrete LLM token space.

Method: Uses residual vector quantization to discretize pretrained KG embeddings into compact code sequences, integrates them as learnable LLM tokens, and applies ontology-guided class constraints for semantic consistency.

Result: Achieves state-of-the-art performance on two benchmark datasets, demonstrating effective alignment of structured knowledge with large language models.

Conclusion: ReaLM effectively bridges the KG-LLM gap through discretization and semantic constraints, enabling superior knowledge graph completion.

Abstract: Large Language Models (LLMs) have recently emerged as a powerful paradigm for Knowledge Graph Completion (KGC), offering strong reasoning and generalization capabilities beyond traditional embedding-based approaches. However, existing LLM-based methods often struggle to fully exploit structured semantic representations, as the continuous embedding space of pretrained KG models is fundamentally misaligned with the discrete token space of LLMs. This discrepancy hinders effective semantic transfer and limits their performance. To address this challenge, we propose ReaLM, a novel and effective framework that bridges the gap between KG embeddings and LLM tokenization through the mechanism of residual vector quantization. ReaLM discretizes pretrained KG embeddings into compact code sequences and integrates them as learnable tokens within the LLM vocabulary, enabling seamless fusion of symbolic and contextual knowledge. Furthermore, we incorporate ontology-guided class constraints to enforce semantic consistency, refining entity predictions based on class-level compatibility. Extensive experiments on two widely used benchmark datasets demonstrate that ReaLM achieves state-of-the-art performance, confirming its effectiveness in aligning structured knowledge with large-scale language models.

[7] All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language

Shiyuan Guo, Henry Sleight, Fabien Roger

Main category: cs.CL

TL;DR: Models struggle with ciphered reasoning despite understanding ciphers, creating a vulnerability in CoT monitoring where attackers could hide reasoning in encrypted text.

Details

Motivation: To assess the risk of attackers evading chain-of-thought monitoring through ciphered reasoning, where harmful AI actions could be hidden in encrypted, translated, or compressed text.

Method: Tested 28 different ciphers by fine-tuning and prompting up to 10 models to reason in each cipher, using math problems as a proxy for reasoning ability and measuring accuracy.

Result: Found an asymmetry: models can translate ciphered text accurately but struggle with reasoning in it, especially with lesser-known ciphers. Ciphered reasoning capability correlates with cipher prevalence in pretraining data and improves slowly with fine-tuning.

Conclusion: Ciphered reasoning may be an ineffective evasion tactic for current models, but provides guidance for constraining this capability in future frontier models.

Abstract: Detecting harmful AI actions is important as AI agents gain adoption. Chain-of-thought (CoT) monitoring is one method widely used to detect adversarial attacks and AI misalignment. However, attackers and misaligned models might evade CoT monitoring through ciphered reasoning: reasoning hidden in encrypted, translated, or compressed text. To assess this risk, we test whether models can perform ciphered reasoning. For each of 28 different ciphers, we fine-tune and prompt up to 10 models to reason in that cipher. We measure model accuracy on math problems as a proxy for reasoning ability. Across the models we test, we find an asymmetry: model accuracy can drop significantly when reasoning in ciphered text, even though models demonstrate comprehension of ciphered text by being able to translate it accurately to English. Even frontier models struggle with lesser-known ciphers, although they can reason accurately in well-known ciphers like rot13. We show that ciphered reasoning capability correlates with cipher prevalence in pretraining data. We also identify scaling laws showing that ciphered reasoning capability improves slowly with additional fine-tuning data. Our work suggests that evading CoT monitoring using ciphered reasoning may be an ineffective tactic for current models and offers guidance on constraining the development of this capability in future frontier models.

[8] Preference-Aware Memory Update for Long-Term LLM Agents

Haoran Sun, Zekun Zhang, Shaoning Zeng

Main category: cs.CL

TL;DR: PAMU enables dynamic refinement of preference memory in LLM-based agents by combining sliding window averages and exponential moving averages to capture both short-term and long-term user preferences.

Details

Motivation: Existing long-term memory approaches for LLM agents lack mechanisms for dynamically updating memory representations in response to evolving user behaviors and contexts, limiting their reasoning capabilities.

Method: Proposed Preference-Aware Memory Update Mechanism (PAMU) that integrates sliding window averages (SW) with exponential moving averages (EMA) to create fused preference-aware representations.

Result: Experiments on five task scenarios of the LoCoMo dataset show PAMU significantly improves output quality of LLMs across five baselines in long-term conversations.

Conclusion: PAMU effectively addresses the memory updating gap and enhances LLM agent performance in long-term conversational contexts through dynamic and personalized memory refinement.

Abstract: One of the key factors influencing the reasoning capabilities of LLM-based agents is their ability to leverage long-term memory. Integrating long-term memory mechanisms allows agents to make informed decisions grounded in historical interactions. While recent advances have significantly improved the storage and retrieval components, by encoding memory into dense vectors for similarity search or organizing memory as structured knowledge graphs most existing approaches fall short in memory updating. In particular, they lack mechanisms for dynamically refining preference memory representations in response to evolving user behaviors and contexts. To address this gap, we propose a Preference-Aware Memory Update Mechanism (PAMU) that enables dynamic and personalized memory refinement. By integrating sliding window averages (SW) with exponential moving averages (EMA), PAMU constructs a fused preference-aware representation that captures both short-term fluctuations and long-term user tendencies. We conduct experiments on five task scenarios of the LoCoMo dataset, and the results show that our mechanism can significantly improve the output quality of LLM in five baselines, validating its effectiveness in long-term conversations.

[9] Layout-Aware Parsing Meets Efficient LLMs: A Unified, Scalable Framework for Resume Information Extraction and Evaluation

Fanwei Zhu, Jinke Yu, Zulong Chen, Ying Zhou, Junhao Ji, Zhibo Yang, Yuxue Zhang, Haoyuan Hu, Zhenghao Liu

Main category: cs.CL

TL;DR: A layout-aware and efficiency-optimized framework for automated resume information extraction that addresses layout heterogeneity, LLM cost/latency, and dataset/evaluation standardization challenges.

Details

Motivation: Automated resume extraction faces three major challenges: extreme layout heterogeneity, high LLM costs/latency, and lack of standardized datasets/evaluation tools for real-world deployment.

Method: Combines fine-tuned layout parser for document normalization, inference-efficient LLM extractor with parallel prompting and instruction tuning, and two-stage automated evaluation framework with new benchmark datasets.

Result: Significantly outperforms baselines in accuracy and efficiency; fine-tuned 0.6B LLM achieves top-tier accuracy while reducing latency and computational cost; deployed in Alibaba’s HR platform.

Conclusion: The framework successfully addresses all three challenges and enables real-time resume extraction applications at scale in production environments.

Abstract: Automated resume information extraction is critical for scaling talent acquisition, yet its real-world deployment faces three major challenges: the extreme heterogeneity of resume layouts and content, the high cost and latency of large language models (LLMs), and the lack of standardized datasets and evaluation tools. In this work, we present a layout-aware and efficiency-optimized framework for automated extraction and evaluation that addresses all three challenges. Our system combines a fine-tuned layout parser to normalize diverse document formats, an inference-efficient LLM extractor based on parallel prompting and instruction tuning, and a robust two-stage automated evaluation framework supported by new benchmark datasets. Extensive experiments show that our framework significantly outperforms strong baselines in both accuracy and efficiency. In particular, we demonstrate that a fine-tuned compact 0.6B LLM achieves top-tier accuracy while significantly reducing inference latency and computational cost. The system is fully deployed in Alibaba’s intelligent HR platform, supporting real-time applications across its business units.

[10] VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation

Yubo Sun, Chunyi Peng, Yukun Yan, Shi Yu, Zhenghao Liu, Chi Chen, Zhiyuan Liu, Maosong Sun

Main category: cs.CL

TL;DR: EVisRAG is an end-to-end framework that enhances visual retrieval-augmented generation by learning to reason with evidence across multiple images, using a novel training method called RS-GRPO to jointly optimize visual perception and reasoning.

Details

Motivation: Current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions in vision-language models.

Method: Proposes EVisRAG framework that first observes retrieved images and records per-image evidence, then derives final answer from aggregated evidence. Uses Reward-Scoped Group Relative Policy Optimization (RS-GRPO) to bind fine-grained rewards to scope-specific tokens for joint optimization of visual perception and reasoning.

Result: Experimental results show 27% average improvement over backbone VLM on multiple visual question answering benchmarks. EVisRAG improves answer accuracy by precisely perceiving and localizing question-relevant evidence across multiple images.

Conclusion: EVisRAG effectively addresses multi-image reasoning challenges in VRAG systems, enabling more reliable evidence integration and reducing hallucinations through evidence-guided reasoning.

Abstract: Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. In this paper, we propose EVisRAG, an end-to-end framework that learns to reason with evidence-guided multi-image to address this issue. The model first observes retrieved images and records per-image evidence, then derives the final answer from the aggregated evidence. To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize visual perception and reasoning abilities of VLMs. Experimental results on multiple visual question answering benchmarks demonstrate that EVisRAG delivers substantial end-to-end gains over backbone VLM with 27% improvements on average. Further analysis shows that, powered by RS-GRPO, EVisRAG improves answer accuracy by precisely perceiving and localizing question-relevant evidence across multiple images and deriving the final answer from that evidence, much like a real detective.

[11] Judge’s Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement

Steve Han, Gilberto Titericz Junior, Tom Balough, Wenfei Zhou

Main category: cs.CL

TL;DR: The Judge’s Verdict Benchmark evaluates 54 LLMs’ ability to replicate human judgment in scoring responses, using a two-step methodology that goes beyond correlation to measure actual agreement patterns and identify human-like vs super-consistent judgment behaviors.

Details

Motivation: To address the limitations of using correlation alone for evaluating LLMs as judges, and to establish a more comprehensive benchmark that measures how well LLMs can replicate human judgment patterns in response accuracy evaluation tasks.

Method: Two-step methodology: (1) correlation test to filter judges with strong alignment, (2) human-likeness test using z-scores to identify human-like judgment (|z| < 1) and super-consistent judgment (z > 1) patterns. Evaluated 54 LLMs including 43 open-source and 11 closed models.

Result: 27 out of 54 LLMs achieved Tier 1 performance: 23 models exhibited human-like judgment patterns that preserve human judgment nuances, while 4 models showed super-consistent behavior that exceeds typical human-to-human agreement levels. Judge excellence was found to depend on specific training strategies rather than model size alone.

Conclusion: Correlation alone is insufficient for judge evaluation; the benchmark provides a standardized method for classifying LLM judges into performance tiers and establishes a “Turing Test for judges” based on agreement patterns.

Abstract: This research introduces the Judge’s Verdict Benchmark, a novel two-step methodology to evaluate Large Language Models (LLMs) as judges for response accuracy evaluation tasks. We assess how well 54 LLMs can replicate human judgment when scoring responses from RAG (Retrieval-Augmented Generation) or Agentic pipelines against ground truth answers. Our methodology progresses from traditional correlation analysis to comprehensive Cohen’s Kappa analysis that measures actual agreement patterns. The two-step approach includes: (1) a correlation test that filters judges with strong alignment, followed by (2) a human-likeness test using z-scores to identify two distinct judgment patterns: human-like judgment (|z| < 1) that mimics natural human variation, and super-consistent judgment (z > 1) that exceeds typical human-to-human agreement levels. This methodology reveals that 27 out of 54 tested LLMs achieve Tier 1 performance: 23 models exhibit human-like patterns that preserve the nuances of human judgment, while 4 models demonstrate super-consistent behavior, a pattern that could indicate either enhanced reliability or oversimplification of complex judgments. Testing 43 open-source models (1B-405B parameters) and 11 closed models (GPT, Gemini, Claude variants), we demonstrate that judge excellence is not solely dependent on model size but on specific training strategies. Our key contributions include: (1) establishing that correlation alone is insufficient for judge evaluation, (2) introducing a “Turing Test for judges” based on agreement patterns, and (3) providing a standardized benchmark for classifying LLM judges into distinct performance tiers for different evaluation needs.

[12] MedAgentAudit: Diagnosing and Quantifying Collaborative Failure Modes in Medical Multi-Agent Systems

Lei Gu, Yinghao Zhu, Haoran Sang, Zixiang Wang, Dehao Sui, Wen Tang, Ewen Harrison, Junyi Gao, Lequan Yu, Liantao Ma

Main category: cs.CL

TL;DR: LLM-based multi-agent systems in medical consultations are evaluated mainly on final accuracy, ignoring reasoning transparency. A study of 3,600 cases reveals four key failure patterns in collaborative reasoning.

Details

Motivation: Current evaluation of LLM-based multi-agent systems focuses only on final-answer accuracy, treating their internal processes as black boxes. This poses risks in high-stakes medical applications where verifiable reasoning pathways are crucial.

Method: Large-scale empirical study of 3,600 cases from six medical datasets and six multi-agent frameworks, using mixed-methods approach combining qualitative analysis with quantitative auditing.

Result: Identified four dominant failure patterns: flawed consensus from shared model deficiencies, suppression of correct minority opinions, ineffective discussion dynamics, and critical information loss during synthesis.

Conclusion: High accuracy alone is insufficient for clinical trust; transparent and auditable reasoning processes are essential for responsible development and deployment of medical AI.

Abstract: While large language model (LLM)-based multi-agent systems show promise in simulating medical consultations, their evaluation is often confined to final-answer accuracy. This practice treats their internal collaborative processes as opaque “black boxes” and overlooks a critical question: is a diagnostic conclusion reached through a sound and verifiable reasoning pathway? The inscrutable nature of these systems poses a significant risk in high-stakes medical applications, potentially leading to flawed or untrustworthy conclusions. To address this, we conduct a large-scale empirical study of 3,600 cases from six medical datasets and six representative multi-agent frameworks. Through a rigorous, mixed-methods approach combining qualitative analysis with quantitative auditing, we develop a comprehensive taxonomy of collaborative failure modes. Our quantitative audit reveals four dominant failure patterns: flawed consensus driven by shared model deficiencies, suppression of correct minority opinions, ineffective discussion dynamics, and critical information loss during synthesis. This study demonstrates that high accuracy alone is an insufficient measure of clinical or public trust. It highlights the urgent need for transparent and auditable reasoning processes, a cornerstone for the responsible development and deployment of medical AI.

[13] Gold Panning: Turning Positional Bias into Signal for Multi-Document LLM Reasoning

Adam Byerly, Daniel Khashabi

Main category: cs.CL

TL;DR: Gold Panning Bandits framework leverages LLM position bias to efficiently identify relevant documents by strategically reordering them and observing response shifts, reducing query costs by up to 65% without model retraining.

Details

Motivation: Large language models exhibit strong position bias in multi-document contexts, prioritizing information based on location rather than relevance. Instead of treating this bias as noise to be mitigated, the authors propose using it as a diagnostic signal.

Method: Propose Gold Panning Bandits framework that reorders documents and observes shifts in model responses to identify relevant content. Frame document reordering as bipartite matching problem, with both optimal Hungarian algorithm (O(N³)) and greedy strategy (O(N log N)) approaches.

Result: The approach identifies relevant documents using up to 65% fewer language model queries than random permutation baselines on knowledge-intensive NLP tasks, substantially reducing computational cost without requiring model retraining.

Conclusion: Inherent LLM biases can be transformed from liabilities into assets for efficient, inference-time optimization, demonstrating that position bias can be leveraged as a valuable signal rather than just mitigated.

Abstract: Large language models exhibit a strong position bias in multi-document contexts, systematically prioritizing information based on location rather than relevance. While existing approaches treat this bias as noise to be mitigated, we introduce Gold Panning Bandits, a framework that leverages position bias as a diagnostic signal: by reordering documents and observing shifts in the model’s responses, we can efficiently identify the most relevant content. We frame the problem of choosing reorderings as a bipartite matching problem. While an optimal assignment can be computed at each iteration with the Hungarian algorithm in $O(N^3)$ time, we propose a greedy $O(N \log N)$ strategy that achieves comparable performance by prioritizing the placement of the most uncertain documents in the most informative positions. Our approach identifies relevant documents using up to 65% fewer language model queries than random permutation baselines on knowledge-intensive NLP tasks, substantially reducing computational cost without model retraining. This work demonstrates that inherent LLM biases can be transformed from liabilities into assets for efficient, inference-time optimization.

[14] PromptGuard at BLP-2025 Task 1: A Few-Shot Classification Framework Using Majority Voting and Keyword Similarity for Bengali Hate Speech Detection

Rakib Hossan, Shubhashis Roy Dipta

Main category: cs.CL

TL;DR: PromptGuard is a few-shot framework for Bengali hate speech classification that uses chi-square statistical analysis for keyword extraction and adaptive majority voting, achieving 67.61 micro-F1 score and outperforming baselines.

Details

Motivation: Traditional supervised approaches require extensive labeled datasets which are expensive for low-resource languages like Bengali, necessitating few-shot methods.

Method: Combines chi-square statistical analysis for keyword extraction with adaptive majority voting for decision-making, exploring statistical vs random keyword selection and adaptive voting mechanisms.

Result: Achieves micro-F1 of 67.61, outperforming n-gram baselines (60.75) and random approaches (14.65). Chi-square keywords provide consistent improvements across categories.

Conclusion: Chi-square-based keywords show the most consistent impact across all categories, and adaptive voting benefits ambiguous cases requiring extended classification rounds.

Abstract: The BLP-2025 Task 1A requires Bengali hate speech classification into six categories. Traditional supervised approaches need extensive labeled datasets that are expensive for low-resource languages. We developed PromptGuard, a few-shot framework combining chi-square statistical analysis for keyword extraction with adaptive majority voting for decision-making. We explore statistical keyword selection versus random approaches and adaptive voting mechanisms that extend classification based on consensus quality. Chi-square keywords provide consistent improvements across categories, while adaptive voting benefits ambiguous cases requiring extended classification rounds. PromptGuard achieves a micro-F1 of 67.61, outperforming n-gram baselines (60.75) and random approaches (14.65). Ablation studies confirm chi-square-based keywords show the most consistent impact across all categories.

[15] Steering Embedding Models with Geometric Rotation: Mapping Semantic Relationships Across Languages and Models

Michael Freenor, Lauren Alvarez

Main category: cs.CL

TL;DR: RISE (Rotor-Invariant Shift Estimation) is a geometric method that represents semantic transformations as rotational operations in embedding space, demonstrating consistent cross-lingual and cross-model performance for discourse-level semantic transformations.

Details

Motivation: Modern high-dimensional text embeddings lack interpretable geometric properties unlike early word embeddings, making it difficult to understand how semantic relationships are encoded across languages and models.

Method: RISE represents semantic transformations as consistent rotational operations in embedding space, leveraging the manifold structure of language representations and evaluating across multiple embedding models, datasets, and languages.

Result: RISE consistently maps discourse-level semantic transformations (e.g., negation, conditionality) across 7 morphologically diverse languages in 5 language groups and 3 embedding models, showing high transfer performance.

Conclusion: This work provides the first systematic demonstration that discourse-level semantic transformations correspond to consistent geometric operations in multilingual embedding spaces, empirically supporting the Linear Representation Hypothesis at sentence level.

Abstract: Understanding how language and embedding models encode semantic relationships is fundamental to model interpretability and control. While early word embeddings exhibited intuitive vector arithmetic (‘‘king’’ - ‘‘man’’ + ‘‘woman’’ = ‘‘queen’’), modern high-dimensional text representations lack straightforward interpretable geometric properties. We introduce Rotor-Invariant Shift Estimation (RISE), a geometric approach that represents semantic transformations as consistent rotational operations in embedding space, leveraging the manifold structure of modern language representations. RISE operations have the ability to operate across both languages and models with high transfer of performance, suggesting the existence of analogous cross-lingual geometric structure. We evaluate RISE across three embedding models, three datasets, and seven morphologically diverse languages in five major language groups. Our results demonstrate that RISE consistently maps discourse-level semantic transformations with distinct grammatical features (e.g., negation and conditionality) across languages and models. This work provides the first systematic demonstration that discourse-level semantic transformations correspond to consistent geometric operations in multilingual embedding spaces, empirically supporting the Linear Representation Hypothesis at the sentence level.

[16] Text Prompt Injection of Vision Language Models

Ruizhe Zhu

Main category: cs.CL

TL;DR: Text prompt injection is an effective method to mislead large vision language models with low computational requirements.

Details

Motivation: Safety concerns have increased with the widespread use of large vision language models, prompting investigation into simple attack methods.

Method: Developed an algorithm for text prompt injection attacks against vision language models.

Result: The approach demonstrated effectiveness and efficiency in misleading models, particularly working well for large models with low computational demands.

Conclusion: Text prompt injection presents a significant security vulnerability for vision language models that requires attention.

Abstract: The widespread application of large vision language models has significantly raised safety concerns. In this project, we investigate text prompt injection, a simple yet effective method to mislead these models. We developed an algorithm for this type of attack and demonstrated its effectiveness and efficiency through experiments. Compared to other attack methods, our approach is particularly effective for large models without high demand for computational resources.

[17] NG-Router: Graph-Supervised Multi-Agent Collaboration for Nutrition Question Answering

Kaiwen Shi, Zheyuan Zhang, Zhengqing Yuan, Keerthiram Murugesan, Vincent Galass, Chuxu Zhang, Yanfang Ye

Main category: cs.CL

TL;DR: NG-Router is a multi-agent framework for nutrition QA that uses knowledge graphs and graph neural networks to route questions to specialized agents, with gradient-based subgraph retrieval to handle contextual overload.

Details

Motivation: Address limitations of single-agent systems and complex multi-agent architectures in nutrition QA, while solving contextual overload that hinders accurate decision-making.

Method: Formulates nutritional QA as supervised knowledge-graph-guided multi-agent collaboration, integrating agent nodes into heterogeneous knowledge graphs and using GNNs to learn task-aware routing distributions with gradient-based subgraph retrieval.

Result: Consistently outperforms both single-agent and ensemble baselines across multiple benchmarks and backbone models.

Conclusion: NG-Router provides a principled approach to domain-aware multi-agent reasoning for complex nutritional health tasks.

Abstract: Diet plays a central role in human health, and Nutrition Question Answering (QA) offers a promising path toward personalized dietary guidance and the prevention of diet-related chronic diseases. However, existing methods face two fundamental challenges: the limited reasoning capacity of single-agent systems and the complexity of designing effective multi-agent architectures, as well as contextual overload that hinders accurate decision-making. We introduce Nutritional-Graph Router (NG-Router), a novel framework that formulates nutritional QA as a supervised, knowledge-graph-guided multi-agent collaboration problem. NG-Router integrates agent nodes into heterogeneous knowledge graphs and employs a graph neural network to learn task-aware routing distributions over agents, leveraging soft supervision derived from empirical agent performance. To further address contextual overload, we propose a gradient-based subgraph retrieval mechanism that identifies salient evidence during training, thereby enhancing multi-hop and relational reasoning. Extensive experiments across multiple benchmarks and backbone models demonstrate that NG-Router consistently outperforms both single-agent and ensemble baselines, offering a principled approach to domain-aware multi-agent reasoning for complex nutritional health tasks.

[18] NarraBench: A Comprehensive Framework for Narrative Benchmarking

Sil Hamilton, Matthew Wilkens, Andrew Piper

Main category: cs.CL

TL;DR: NarraBench introduces a taxonomy of narrative-understanding tasks and surveys 78 existing benchmarks, finding significant gaps in current evaluations.

Details

Motivation: To address the need for comprehensive evaluations of narrative understanding in NLP, particularly for overlooked aspects like narrative events, style, perspective, and revelation.

Method: Developed a theory-informed taxonomy of narrative-understanding tasks and conducted a survey of 78 existing benchmarks to assess coverage and alignment.

Result: Only 27% of narrative tasks are well captured by existing benchmarks, with significant gaps in areas like narrative events, style, perspective, and revelation. Need for benchmarks assessing subjective and perspectival aspects.

Conclusion: The taxonomy, survey, and methodology provide valuable tools for NLP researchers to better test LLM narrative understanding capabilities and address current evaluation gaps.

Abstract: We present NarraBench, a theory-informed taxonomy of narrative-understanding tasks, as well as an associated survey of 78 existing benchmarks in the area. We find significant need for new evaluations covering aspects of narrative understanding that are either overlooked in current work or are poorly aligned with existing metrics. Specifically, we estimate that only 27% of narrative tasks are well captured by existing benchmarks, and we note that some areas – including narrative events, style, perspective, and revelation – are nearly absent from current evaluations. We also note the need for increased development of benchmarks capable of assessing constitutively subjective and perspectival aspects of narrative, that is, aspects for which there is generally no single correct answer. Our taxonomy, survey, and methodology are of value to NLP researchers seeking to test LLM narrative understanding.

[19] CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

Nafiseh Nikeghbal, Amir Hossein Kargaran, Jana Diesner

Main category: cs.CL

TL;DR: CoBia is a lightweight adversarial attack framework that systematically tests LLMs’ ability to maintain ethical behavior in conversations by introducing fabricated bias claims and evaluating recovery from biased statements.

Details

Motivation: LLMs sometimes reveal harmful behavior like racist viewpoints despite safety measures, requiring systematic testing of bias amplification in conversational contexts.

Method: CoBia creates constructed conversations where models utter biased claims about social groups, then evaluates if models can recover and reject biased follow-up questions across 6 socio-demographic categories using established bias metrics.

Result: Purposefully constructed conversations reliably reveal bias amplification, and LLMs often fail to reject biased follow-up questions during dialogue, highlighting deeply embedded biases.

Conclusion: Stress-testing through adversarial conversations effectively surfaces embedded biases in LLMs, demonstrating the need for improved safety mechanisms beyond standard checks.

Abstract: Improvements in model construction, including fortified safety guardrails, allow Large language models (LLMs) to increasingly pass standard safety checks. However, LLMs sometimes slip into revealing harmful behavior, such as expressing racist viewpoints, during conversations. To analyze this systematically, we introduce CoBia, a suite of lightweight adversarial attacks that allow us to refine the scope of conditions under which LLMs depart from normative or ethical behavior in conversations. CoBia creates a constructed conversation where the model utters a biased claim about a social group. We then evaluate whether the model can recover from the fabricated bias claim and reject biased follow-up questions. We evaluate 11 open-source as well as proprietary LLMs for their outputs related to six socio-demographic categories that are relevant to individual safety and fair treatment, i.e., gender, race, religion, nationality, sex orientation, and others. Our evaluation is based on established LLM-based bias metrics, and we compare the results against human judgments to scope out the LLMs’ reliability and alignment. The results suggest that purposefully constructed conversations reliably reveal bias amplification and that LLMs often fail to reject biased follow-up questions during dialogue. This form of stress-testing highlights deeply embedded biases that can be surfaced through interaction. Code and artifacts are available at https://github.com/nafisenik/CoBia.

[20] Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum

Xinglong Yang, Quan Feng, Zhongying Pan, Xiang Chen, Yu Tian, Wentong Li, Shuofei Qiao, Yuxia Geng, Xingyu Zhao, Sheng-Jun Huang

Main category: cs.CL

TL;DR: The paper proposes a novel framework for selecting multimodal Chain-of-Thought (MCoT) prompts using a curriculum learning approach that balances model-perceived difficulty and intrinsic sample complexity to improve multimodal reasoning performance.

Details

Motivation: Current MCoT prompting methods use randomly or manually selected examples, which fail to account for model-specific knowledge distributions and task complexity, leading to suboptimal and unstable performance.

Method: Reframes prompt selection as curriculum design, integrating two signals: model-perceived difficulty (quantified via prediction disagreement in active learning) and intrinsic sample complexity (inherent difficulty of question-image pairs). Uses difficulty-balanced sampling to select diverse prompts across both dimensions.

Result: Extensive experiments on five challenging benchmarks with multiple MLLMs show substantial and consistent improvements, greatly reducing performance discrepancies caused by random sampling.

Conclusion: Provides a principled and robust approach for enhancing multimodal reasoning through difficulty-balanced prompt selection that aligns with model capabilities.

Abstract: The effectiveness of Multimodal Chain-of-Thought (MCoT) prompting is often limited by the use of randomly or manually selected examples. These examples fail to account for both model-specific knowledge distributions and the intrinsic complexity of the tasks, resulting in suboptimal and unstable model performance. To address this, we propose a novel framework inspired by the pedagogical principle of “tailored teaching with balanced difficulty”. We reframe prompt selection as a prompt curriculum design problem: constructing a well ordered set of training examples that align with the model’s current capabilities. Our approach integrates two complementary signals: (1) model-perceived difficulty, quantified through prediction disagreement in an active learning setup, capturing what the model itself finds challenging; and (2) intrinsic sample complexity, which measures the inherent difficulty of each question-image pair independently of any model. By jointly analyzing these signals, we develop a difficulty-balanced sampling strategy that ensures the selected prompt examples are diverse across both dimensions. Extensive experiments conducted on five challenging benchmarks and multiple popular Multimodal Large Language Models (MLLMs) demonstrate that our method yields substantial and consistent improvements and greatly reduces performance discrepancies caused by random sampling, providing a principled and robust approach for enhancing multimodal reasoning.

[21] iBERT: Interpretable Style Embeddings via Sense Decomposition

Vishal Anand, Milad Alshomary, Kathleen McKeown

Main category: cs.CL

TL;DR: iBERT is an interpretable BERT encoder that produces sparse, non-negative embeddings as mixtures of context-independent sense vectors, enabling modular control over representations for both style and semantic analysis.

Details

Motivation: To create inherently interpretable and controllable embeddings that expose discriminative cues in language, such as stylistic and semantic structure, allowing for modular control before any downstream use.

Method: Each input token is represented as a sparse, non-negative mixture over k context-independent sense vectors, which can be pooled into sentence embeddings or used at token level.

Result: On STEL benchmark, improves style representation effectiveness by ~8 points over SBERT-style baselines while maintaining competitive performance on authorship verification. Specific style attributes can be assigned to specific sense vectors.

Conclusion: iBERT provides structural modularity to interpretably decompose discriminative signals in data, enabling generalization even when supervision blends stylistic and semantic factors, and is not limited to stylistic modeling.

Abstract: We present iBERT (interpretable-BERT), an encoder to produce inherently interpretable and controllable embeddings - designed to modularize and expose the discriminative cues present in language, such as stylistic and semantic structure. Each input token is represented as a sparse, non-negative mixture over k context-independent sense vectors, which can be pooled into sentence embeddings or used directly at the token level. This enables modular control over representation, before any decoding or downstream use. To demonstrate our model’s interpretability, we evaluate it on a suite of style-focused tasks. On the STEL benchmark, it improves style representation effectiveness by ~8 points over SBERT-style baselines, while maintaining competitive performance on authorship verification. Because each embedding is a structured composition of interpretable senses, we highlight how specific style attributes - such as emoji use, formality, or misspelling can be assigned to specific sense vectors. While our experiments center on style, iBERT is not limited to stylistic modeling. Its structural modularity is designed to interpretably decompose whichever discriminative signals are present in the data - enabling generalization even when supervision blends stylistic and semantic factors.

[22] StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models

Zehao Chen, Rong Pan, Haoran Li

Main category: cs.CL

TL;DR: A hybrid bottom-up approach for long-form story generation using multi-agent simulations where agents interact in a dynamic environment to create emergent events that form the story foundation.

Details

Motivation: Inspired by how human writers create mental scenes of character interactions, aiming to overcome rigid top-down structures in traditional story generation methods.

Method: Multi-agent simulations in a dynamic sandbox environment where agents’ behaviors and interactions generate emergent events that drive story development organically.

Result: The system generates coherent stories exceeding 10,000 words and achieves state-of-the-art performance across multiple metrics.

Conclusion: This hybrid bottom-up approach offers a scalable and innovative solution for creating dynamic, immersive long-form stories that evolve naturally from agent-driven interactions.

Abstract: Human writers often begin their stories with an overarching mental scene, where they envision the interactions between characters and their environment. Inspired by this creative process, we propose a novel approach to long-form story generation, termed hybrid bottom-up long-form story generation, using multi-agent simulations. In our method, agents interact within a dynamic sandbox environment, where their behaviors and interactions with one another and the environment generate emergent events. These events form the foundation for the story, enabling organic character development and plot progression. Unlike traditional top-down approaches that impose rigid structures, our hybrid bottom-up approach allows for the natural unfolding of events, fostering more spontaneous and engaging storytelling. The system is capable of generating stories exceeding 10,000 words while maintaining coherence and consistency, addressing some of the key challenges faced by current story generation models. We achieve state-of-the-art performance across several metrics. This approach offers a scalable and innovative solution for creating dynamic, immersive long-form stories that evolve organically from agent-driven interactions.

[23] DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavarm

Main category: cs.CL

TL;DR: DELTA is a training-free sparse attention mechanism that reduces computational cost in large reasoning models by partitioning transformer layers into full attention, selection, and sparse attention groups, achieving 5x token reduction and 1.5x speedup while maintaining accuracy.

Details

Motivation: Large reasoning models suffer from high inference costs due to full attention computation over growing sequences, and existing sparse attention methods cause severe accuracy degradation on reasoning tasks.

Method: Partitions transformer layers into three groups: initial full attention layers, selection layers that identify salient tokens using aggregated attention scores, and sparse-attention layers that attend only to selected tokens.

Result: Matches or surpasses full attention accuracy on reasoning benchmarks (AIME, GPQA-Diamond), reduces attended tokens by up to 5x, and achieves 1.5x end-to-end speedup.

Conclusion: Selective reuse of intermediate attention maps provides a robust path for efficient long-context reasoning without sacrificing model accuracy.

Abstract: Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to the entire growing sequence. Existing sparse attention methods reduce computation by pruning the key-value (KV) cache, yet they suffer from severe accuracy degradation on reasoning tasks due to cumulative selection errors and the dynamic importance of tokens over long derivations. We present \textbf{DELTA}, a training-free sparse attention mechanism that achieves computational efficiency without sacrificing model accuracy. DELTA partitions transformer layers into three groups: initial layers that use full attention, a small set of \emph{selection layers} that identify salient tokens via aggregated head-level attention scores, and subsequent \emph{sparse-attention layers} that attend only to the selected subset. This design preserves the full KV cache in GPU memory for accuracy, while avoiding expensive full-attention computation over many layers. On reasoning benchmarks such as AIME and GPQA-Diamond, DELTA matches or surpasses full attention in accuracy, while reducing the number of attended tokens by up to $5\times$ and delivering $1.5\times$ end-to-end speedup. Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning.

[24] Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs

Xu Pan, Ely Hahami, Jingxuan Fan, Ziqian Xie, Haim Sompolinsky

Main category: cs.CL

TL;DR: Masked diffusion LLMs (dLLMs) outperform autoregressive LLMs (arLLMs) in knowledge injection via fine-tuning, being free from the reversal curse and requiring less data augmentation. A novel masked fine-tuning method for arLLMs is proposed to close this performance gap.

Details

Motivation: Autoregressive LLMs struggle with knowledge injection due to issues like the reversal curse, while masked diffusion LLMs show promise in pre-training but their post-training capabilities are unknown.

Method: Fine-tuned arLLMs and dLLMs on three datasets, evaluated with forward/backward QA to test knowledge generalization and reversal curse. Proposed masked fine-tuning for arLLMs.

Result: dLLMs achieved high accuracy on both forward and backward QAs without paraphrases, while arLLMs required extensive data augmentation and still suffered from reversal curse. Masked fine-tuning improved arLLM data efficiency.

Conclusion: dLLMs are superior for knowledge injection via fine-tuning, being reversal-curse-free and data-efficient. The proposed masked fine-tuning method successfully bridges the performance gap for arLLMs.

Abstract: Despite autoregressive large language models (arLLMs) being the current dominant paradigm in language modeling, they resist knowledge injection via fine-tuning due to inherent shortcomings such as the “reversal curse” – the challenge of answering questions that reverse the original information order in the training sample. Masked diffusion large language models (dLLMs) are rapidly emerging as a powerful alternative to the arLLM paradigm, with evidence of better data efficiency and free of the “reversal curse” in pre-training. However, it is unknown whether these advantages extend to the post-training phase, i.e. whether pre-trained dLLMs can easily acquire new knowledge through fine-tuning. On three diverse datasets, we fine-tune arLLMs and dLLMs, evaluating them with forward and backward style Question Answering (QA) to probe knowledge generalization and the reversal curse. Our results confirm that arLLMs critically rely on extensive data augmentation via paraphrases for QA generalization, and paraphrases are only effective when their information order matches the QA style. Conversely, dLLMs achieve high accuracies on both forward and backward QAs without paraphrases; adding paraphrases yields only marginal gains. Lastly, inspired by the dLLM’s performance, we introduce a novel masked fine-tuning paradigm for knowledge injection into pre-trained arLLMs. This proposed method successfully and drastically improves the data efficiency of arLLM fine-tuning, effectively closing the performance gap with dLLMs.

[25] Abductive Preference Learning

Yijin Ni, Peng Qi

Main category: cs.CL

TL;DR: The paper proposes abductive preference learning to address LLM overconfidence by learning preferences over prompts given responses, complementing standard methods that focus on response selection.

Details

Motivation: Frontier LLMs remain overconfident despite RLHF/DPO alignment, failing to distinguish between prompts that should alter responses (e.g., safe vs unsafe food scenarios). This stems from preference learning's focus on response selection while neglecting counterfactual prompts.

Method: Proposes abductive preference learning - fine-tuning paradigm that reverses conditioning by learning preferences over prompts given a response. Implements abductive DPO and DPOP variants, with multitask objective combining standard and abductive approaches.

Result: Multitask DPOP boosts accuracy from 90.0% to 99.5% in response selection and 54.7% to 85.0% in prompt discrimination on abductive dataset. On AlpacaEval, improves win rate from 5.26% to 6.17%. Qualitative evidence shows improved sensitivity to prompt differences.

Conclusion: Abductive preference learning preserves conventional optimization benefits while addressing counterfactual prompt challenge, demonstrating complementary strengths when combined with standard methods.

Abstract: Frontier large language models such as GPT-5 and Claude Sonnet remain prone to overconfidence even after alignment through Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO). For instance, they tend to offer the same conservative answer “No” to both questions “Can I eat the [food / potato chips] that has been left out overnight?” despite the latter requiring no refridgeration for safe consumption. We find that this failure is potentially attributed to a limitation of existing preference learning: it emphasizes selecting the correct response for a given prompt, while neglecting counterfactual prompts that should alter the response. To address this limitation, we propose abductive preference learning, a fine-tuning paradigm that reverses the conventional conditioning by learning preferences over prompts given a response. To validate this idea, we construct an abductive dataset derived from the HaluEval QA benchmark with 1,001 entries, implementing abductive DPO and its variant DPOP. Experiments reveal complementary strengths: standard methods improve response selection, abductive methods improve prompt discrimination, while a multitask objective unifies both. On the abductive dataset, multitask DPOP boosts accuracy from $90.0%$ to $99.5%$ in response selection and $54.7%$ to $85.0%$ in prompt discrimination, with qualitative evidence highlighting improved sensitivity to prompt differences. Finally, evaluation on AlpacaEval shows multitask DPOP improves win rate (from $5.26%$ to $6.17%$), confirming that abductive preference learning preserves the benefits of conventional preference optimization while addressing the overlooked challenge of counterfactual prompts.

[26] Debate, Deliberate, Decide (D3): A Cost-Aware Adversarial Framework for Reliable and Interpretable LLM Evaluation

Chaithanya Bandi, Abir Harrasse

Main category: cs.CL

TL;DR: D3 is a cost-aware, adversarial multi-agent framework that uses structured debates among specialized agents to provide reliable and interpretable LLM evaluations, addressing issues of inconsistency and bias in automated judging.

Details

Motivation: Current LLM evaluation methods suffer from inconsistency, bias, and lack of transparent decision criteria in automated judging, making reliable assessment challenging.

Method: D3 employs two protocols: MORE (Multi-Advocate One-Round Evaluation) with parallel defenses, and SAMRE (Single-Advocate Multi-Round Evaluation) with budgeted stopping. It uses role-specialized agents (advocates, judge, jury) in structured debates with probabilistic modeling of score gaps.

Result: D3 achieves state-of-the-art agreement with human judgments (accuracy and Cohen’s kappa), reduces positional and verbosity biases through anonymization and role diversification, and provides favorable cost-accuracy trade-offs via budgeted stopping.

Conclusion: D3 establishes a principled, practical framework for reliable, interpretable, and cost-aware LLM evaluation through structured debate protocols and probabilistic modeling.

Abstract: The evaluation of Large Language Models (LLMs) remains challenging due to inconsistency, bias, and the absence of transparent decision criteria in automated judging. We present Debate, Deliberate, Decide (D3), a cost-aware, adversarial multi-agent framework that orchestrates structured debate among role-specialized agents (advocates, a judge, and an optional jury) to produce reliable and interpretable evaluations. D3 instantiates two complementary protocols: (1) Multi-Advocate One-Round Evaluation (MORE), which elicits k parallel defenses per answer to amplify signal via diverse advocacy, and (2) Single-Advocate Multi-Round Evaluation (SAMRE) with budgeted stopping, which iteratively refines arguments under an explicit token budget and convergence checks. We develop a probabilistic model of score gaps that (i) characterizes reliability and convergence under iterative debate and (ii) explains the separation gains from parallel advocacy. Under mild assumptions, the posterior distribution of the round-r gap concentrates around the true difference and the probability of mis-ranking vanishes; moreover, aggregating across k advocates provably increases expected score separation. We complement theory with a rigorous experimental suite across MT-Bench, AlignBench, and AUTO-J, showing state-of-the-art agreement with human judgments (accuracy and Cohen’s kappa), reduced positional and verbosity biases via anonymization and role diversification, and a favorable cost-accuracy frontier enabled by budgeted stopping. Ablations and qualitative analyses isolate the contributions of debate, aggregation, and anonymity. Together, these results establish D3 as a principled, practical recipe for reliable, interpretable, and cost-aware LLM evaluation.

[27] HIPPD: Brain-Inspired Hierarchical Information Processing for Personality Detection

Guanming Chen, Lingzhi Shen, Xiaohao Cai, Imran Razzak, Shoaib Jameel

Main category: cs.CL

TL;DR: HIPPD is a brain-inspired framework for personality detection that emulates hierarchical brain processing, using LLMs for semantic reasoning, dynamic memory for feature retention, and specialized models for pattern recognition, achieving state-of-the-art performance.

Details

Motivation: Existing machine learning approaches struggle with contextual information across multiple posts and fail to extract robust features in semantically sparse environments for personality detection.

Method: Uses LLM as cerebral cortex for semantic reasoning, dynamic memory module as prefrontal cortex for adaptive feature retention, and specialized lightweight models as basal ganglia with winner-takes-all routing for personality pattern recognition.

Result: Extensive experiments on Kaggle and Pandora datasets show HIPPD consistently outperforms state-of-the-art baselines.

Conclusion: The brain-inspired hierarchical framework effectively addresses limitations of existing approaches and demonstrates superior performance in personality detection from text.

Abstract: Personality detection from text aims to infer an individual’s personality traits based on linguistic patterns. However, existing machine learning approaches often struggle to capture contextual information spanning multiple posts and tend to fall short in extracting representative and robust features in semantically sparse environments. This paper presents HIPPD, a brain-inspired framework for personality detection that emulates the hierarchical information processing of the human brain. HIPPD utilises a large language model to simulate the cerebral cortex, enabling global semantic reasoning and deep feature abstraction. A dynamic memory module, modelled after the prefrontal cortex, performs adaptive gating and selective retention of critical features, with all adjustments driven by dopaminergic prediction error feedback. Subsequently, a set of specialised lightweight models, emulating the basal ganglia, are dynamically routed via a strict winner-takes-all mechanism to capture the personality-related patterns they are most proficient at recognising. Extensive experiments on the Kaggle and Pandora datasets demonstrate that HIPPD consistently outperforms state-of-the-art baselines.

[28] Don’t Throw Away Your Pretrained Model

Shangbin Feng, Wenhao Yu, Yike Wang, Hongming Zhang, Yulia Tsvetkov, Dong Yu

Main category: cs.CL

TL;DR: Switch Generation enables model collaboration where pretrained and aligned models take turns generating response segments, outperforming individual models and other collaboration methods across diverse tasks.

Details

Motivation: Alignment training improves reasoning and instruction following but reduces creativity and calibration. The goal is to combine strengths of both aligned and unaligned models through collaboration.

Method: Train a switcher LM that learns to choose between different model checkpoints to generate the next segment in a response sequence, based on query context and model strengths.

Result: Model collaboration outperforms individual models on 16/18 tasks, and Switch Generation further outperforms baselines by 12.9% on average. It discovers compositional skills and generalizes to unseen models/tasks.

Conclusion: Switch Generation effectively leverages model collaboration to combine complementary strengths, reusing training pipeline by-products that would otherwise be discarded.

Abstract: Alignment training has tradeoffs: it helps language models (LMs) gain in reasoning and instruction following but might lose out on skills such as creativity and calibration, where unaligned base models are better at. We aim to make the best of both worlds through model collaboration, where different models in the training pipeline collaborate and complement each other. Since LM responses feature interleaving skills that favor different models, we propose Switch Generation, where pretrained and aligned model versions take turns to ``speak’’ in a response sequence. Specifically, we train a switcher LM by learning from outcomes of choosing different models to generate the next segment across diverse queries and contexts. At inference time, the switcher LM guides different model checkpoints to dynamically generate the next segment where their strengths are most needed. Extensive experiments with 8 model collaboration baselines and 18 datasets show that 1) model collaboration consistently outperforms individual models on 16 out of 18 tasks, and 2) Switch Generation further outperforms baselines by 12.9% on average. Further analysis reveals that Switch Generation discovers compositional skills to solve problems where individual models struggle and generalizes to unseen models and tasks, reusing and repurposing by-products in expensive model training pipelines that are otherwise discarded.

[29] Enhancing Faithfulness in Abstractive Summarization via Span-Level Fine-Tuning

Sicong Huang, Qianqi Yan, Shengze Wang, Ian Lane

Main category: cs.CL

TL;DR: This paper investigates fine-tuning strategies to reduce hallucinations in LLM-generated summaries by using span-level annotations of unfaithful content.

Details

Motivation: LLMs often produce unfaithful summaries with hallucinations at word, phrase, or concept levels, and existing mitigation strategies fail to fully address diverse error types.

Method: Automatically generate summaries using various LLMs, use GPT-4o to annotate span-level hallucinations, then fine-tune LLMs using three techniques: gradient ascent, unlikelihood training, and task vector negation.

Result: All three approaches successfully improved faithfulness using span-level annotations, with unlikelihood training being the most effective method.

Conclusion: Fine-tuning LLMs with span-level hallucination annotations can significantly improve summary faithfulness, with unlikelihood training showing the best performance among the tested methods.

Abstract: Abstractive summarization using large language models (LLMs) has become an essential tool for condensing information. However, despite their ability to generate fluent summaries, these models sometimes produce unfaithful summaries, introducing hallucinations at the word, phrase, or concept level. Existing mitigation strategies, such as post-processing corrections or contrastive learning with synthetically generated negative samples, fail to fully address the diverse errors that can occur in LLM-generated summaries. In this paper, we investigate fine-tuning strategies to reduce the occurrence of unfaithful spans in generated summaries. First, we automatically generate summaries for the set of source documents in the training set with a variety of LLMs and then use GPT-4o to annotate any hallucinations it detects at the span-level. Leveraging these annotations, we fine-tune LLMs with both hallucination-free summaries and annotated unfaithful spans to enhance model faithfulness. In this paper, we introduce a new dataset that contains both faithful and unfaithful summaries with span-level labels and we evaluate three techniques to fine-tuning a LLM to improve the faithfulness of the resulting summarization: gradient ascent, unlikelihood training, and task vector negation. Experimental results show that all three approaches successfully leverage span-level annotations to improve faithfulness, with unlikelihood training being the most effective.

[30] Unpacking Hateful Memes: Presupposed Context and False Claims

Weibin Cai, Jiayu Li, Reza Zafarani

Main category: cs.CL

TL;DR: SHIELD is a hateful meme detection framework that identifies hateful memes through presupposed context modeling and false claim detection, outperforming state-of-the-art methods.

Details

Motivation: Current approaches focus on detection but neglect understanding what makes memes hateful, drawing from philosophy and psychology insights about presupposed context and false claims.

Method: Developed PCM for modeling contextual information across modalities and FACT module for detecting false claims using external knowledge and cross-modal reference graphs.

Result: SHIELD outperforms state-of-the-art methods across datasets and metrics, and shows versatility on other tasks like fake news detection.

Conclusion: The framework successfully captures the fundamental nature of hate in memes by addressing both presupposed context and false claims, providing effective detection across various applications.

Abstract: While memes are often humorous, they are frequently used to disseminate hate, causing serious harm to individuals and society. Current approaches to hateful meme detection mainly rely on pre-trained language models. However, less focus has been dedicated to \textit{what make a meme hateful}. Drawing on insights from philosophy and psychology, we argue that hateful memes are characterized by two essential features: a \textbf{presupposed context} and the expression of \textbf{false claims}. To capture presupposed context, we develop \textbf{PCM} for modeling contextual information across modalities. To detect false claims, we introduce the \textbf{FACT} module, which integrates external knowledge and harnesses cross-modal reference graphs. By combining PCM and FACT, we introduce \textbf{\textsf{SHIELD}}, a hateful meme detection framework designed to capture the fundamental nature of hate. Extensive experiments show that SHIELD outperforms state-of-the-art methods across datasets and metrics, while demonstrating versatility on other tasks, such as fake news detection.

[31] Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation

Mir Tafseer Nayeem, Sawsan Alqahtani, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari

Main category: cs.CL

TL;DR: The paper analyzes tokenizer fairness across languages, finding current metrics like fertility are insufficient. They propose STRR (Single Token Retention Rate) which reveals systematic English prioritization and fragmentation in languages like Hindi.

Details

Motivation: Tokenization is crucial but under-evaluated in LLMs. Standard metrics like fertility capture compression efficiency but obscure how vocabularies are allocated across languages and domains, failing to reveal cross-lingual fairness issues.

Method: Analyzed six widely used tokenizers across seven languages and two domains. Proposed STRR (Single Token Retention Rate) to measure the proportion of words preserved as single tokens, providing an interpretable view of cross-lingual fairness.

Result: Found stable fertility for English, high fertility for Chinese, and little domain sensitivity. STRR revealed systematic prioritization of English, strong support for Chinese, and fragmentation in Hindi. STRR complements fertility and provides practical guidance.

Conclusion: STRR offers an interpretable view of cross-lingual fairness in tokenizers, complementing existing metrics and providing guidance for designing more equitable multilingual tokenizers.

Abstract: Tokenization is a crucial but under-evaluated step in large language models (LLMs). The standard metric, fertility (the average number of tokens per word), captures compression efficiency but obscures how vocabularies are allocated across languages and domains. We analyze six widely used tokenizers across seven languages and two domains, finding stable fertility for English, high fertility for Chinese, and little domain sensitivity. To address fertility’s blind spots, we propose the Single Token Retention Rate (STRR), which measures the proportion of words preserved as single tokens. STRR reveals systematic prioritization of English, strong support for Chinese, and fragmentation in Hindi, offering an interpretable view of cross-lingual fairness. Our results show that STRR complements fertility and provides practical guidance for designing more equitable multilingual tokenizers.

[32] Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey

Jiaqi Wei, Xiang Zhang, Yuejin Yang, Wenxuan Huang, Juntai Cao, Sheng Xu, Xiang Zhuang, Zhangyang Gao, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Chenyu You, Wanli Ouyang, Siqi Sun

Main category: cs.CL

TL;DR: This paper introduces a unified framework for deliberative tree search in LLMs, resolving ambiguity around reward signals by distinguishing between transient search guidance for test-time scaling and durable parametric reward modeling for self-improvement.

Details

Motivation: The field of deliberative tree search in LLMs is fragmented and lacks a common formalism, particularly concerning the ambiguous role of reward signals - whether they serve as transient heuristics or durable learning targets.

Method: The paper introduces a unified framework that deconstructs search algorithms into three core components: Search Mechanism, Reward Formulation, and Transition Function, establishing a formal distinction between transient Search Guidance and durable Parametric Reward Modeling.

Result: The framework enables a component-centric taxonomy, synthesis of state-of-the-art approaches, and provides a research roadmap for systematic progress in creating autonomous, self-improving agents.

Conclusion: The proposed unified framework resolves ambiguity in deliberative tree search, providing a foundation for more systematic advancement in LLM research by clearly distinguishing between test-time scaling and self-improvement applications.

Abstract: Deliberative tree search is a cornerstone of modern Large Language Model (LLM) research, driving the pivot from brute-force scaling toward algorithmic efficiency. This single paradigm unifies two critical frontiers: \textbf{Test-Time Scaling (TTS)}, which deploys on-demand computation to solve hard problems, and \textbf{Self-Improvement}, which uses search-generated data to durably enhance model parameters. However, this burgeoning field is fragmented and lacks a common formalism, particularly concerning the ambiguous role of the reward signal – is it a transient heuristic or a durable learning target? This paper resolves this ambiguity by introducing a unified framework that deconstructs search algorithms into three core components: the \emph{Search Mechanism}, \emph{Reward Formulation}, and \emph{Transition Function}. We establish a formal distinction between transient \textbf{Search Guidance} for TTS and durable \textbf{Parametric Reward Modeling} for Self-Improvement. Building on this formalism, we introduce a component-centric taxonomy, synthesize the state-of-the-art, and chart a research roadmap toward more systematic progress in creating autonomous, self-improving agents.

[33] Toward Machine Translation Literacy: How Lay Users Perceive and Rely on Imperfect Translations

Yimin Xiao, Yongle Zhang, Dayeon Ki, Calvin Bao, Marianna J. Martindale, Charlotte Vaughn, Ge Gao, Marine Carpuat

Main category: cs.CL

TL;DR: Study on how bilingual and non-bilingual users perceive and rely on imperfect machine translation in real-world settings, revealing that non-bilingual users over-rely on MT due to lack of evaluation strategies.

Details

Motivation: Understanding how the general public perceives and relies on imperfect machine translation is crucial for contextualizing MT research in real-world applications.

Method: Human study conducted in a public museum with 452 participants, investigating how fluency and adequacy errors impact bilingual and non-bilingual users’ reliance on MT during casual use.

Result: Non-bilingual users often over-rely on MT due to lack of evaluation strategies and alternatives, but experiencing errors can prompt users to reassess future reliance.

Conclusion: Highlights the need for MT evaluation and NLP explanation techniques to promote not only MT quality, but also MT literacy among users.

Abstract: As Machine Translation (MT) becomes increasingly commonplace, understanding how the general public perceives and relies on imperfect MT is crucial for contextualizing MT research in real-world applications. We present a human study conducted in a public museum (n=452), investigating how fluency and adequacy errors impact bilingual and non-bilingual users’ reliance on MT during casual use. Our findings reveal that non-bilingual users often over-rely on MT due to a lack of evaluation strategies and alternatives, while experiencing the impact of errors can prompt users to reassess future reliance. This highlights the need for MT evaluation and NLP explanation techniques to promote not only MT quality, but also MT literacy among its users.

[34] Beyond the limitation of a single query: Train your LLM for query expansion with Reinforcement Learning

Shu Zhao, Tan Yu, Anbang Xu

Main category: cs.CL

TL;DR: ExpandSearch trains a 3B LLM-based search agent with query expansion capability through reinforcement learning, assisted by a pre-trained squeezer model for document understanding, achieving state-of-the-art performance on multi-hop QA benchmarks.

Details

Motivation: Existing reasoning-augmented search agents have limited capabilities in reasoning and search, resulting in unsatisfactory performance on multi-hop QA benchmarks that require handling complex or compound queries.

Method: Train an LLM-based search agent with query expansion through reinforcement learning, where the agent proposes multiple query variants per turn for simultaneous searching. Incorporate a pre-trained squeezer model to help understand retrieved documents, allowing the search agent to focus on query generation.

Result: Achieves state-of-the-art accuracy on multi-hop QA benchmarks with 4.4% average improvement across seven benchmarks, showing strong gains on tasks requiring diverse evidence aggregation.

Conclusion: Even small-scale 3B LLMs can demonstrate strong query expansion capabilities when assisted by a squeezer model, achieving superior performance on complex multi-hop reasoning tasks.

Abstract: Reasoning-augmented search agents, such as Search-R1, are trained to reason, search, and generate the final answer iteratively. Nevertheless, due to their limited capabilities in reasoning and search, their performance on multi-hop QA benchmarks remains far from satisfactory. To handle complex or compound queries, we train an LLM-based search agent with the native capability of query expansion through reinforcement learning. In each turn, our search agent proposes several query variants, which are searched simultaneously to cover more relevant information. Meanwhile, given limited post-training data and computing resources, it is very challenging for a search agent to master multiple tasks, including query generation, retrieved information understanding, and answer generation. Therefore, we propose incorporating a pre-trained squeezer model that helps the search agent understand the retrieved documents, allowing the search agent to focus on query generation for high retrieval recall. With the assistance of the squeezer model, we discover that even a small-scale 3B LLM can demonstrate a strong capability of query expansion and achieve state-of-the-art accuracy on the multi-hop QA benchmarks. To be specific, our experiments across seven question-answering benchmarks demonstrate that our method, named ExpandSearch, achieves an average improvement of 4.4% compared to state-of-the-art baselines, with strong gains on multi-hop reasoning tasks requiring diverse evidence aggregation.

[35] Talk Isn’t Always Cheap: Understanding Failure Modes in Multi-Agent Debate

Andrea Wynn, Harsh Satija, Gillian Hadfield

Main category: cs.CL

TL;DR: Multi-agent debate can harm reasoning performance even with stronger models outnumbering weaker ones, as agents shift from correct to incorrect answers favoring agreement over challenging flawed reasoning.

Details

Motivation: To investigate how diversity in model capabilities influences multi-agent debate dynamics and outcomes, challenging the assumption that debate always improves reasoning.

Method: Conducted experiments with diverse AI models in debate settings, analyzing how models shift responses and investigating factors like sycophancy, social conformity, and model/task types.

Result: Debate led to decreased accuracy over time, with models frequently abandoning correct answers to agree with incorrect peer reasoning, even when stronger models were in the majority.

Conclusion: Naive applications of multi-agent debate can degrade performance when agents lack proper incentives or capabilities to resist persuasive but incorrect reasoning, revealing important failure modes in reason exchange.

Abstract: While multi-agent debate has been proposed as a promising strategy for improving AI reasoning ability, we find that debate can sometimes be harmful rather than helpful. Prior work has primarily focused on debates within homogeneous groups of agents, whereas we explore how diversity in model capabilities influences the dynamics and outcomes of multi-agent interactions. Through a series of experiments, we demonstrate that debate can lead to a decrease in accuracy over time - even in settings where stronger (i.e., more capable) models outnumber their weaker counterparts. Our analysis reveals that models frequently shift from correct to incorrect answers in response to peer reasoning, favoring agreement over challenging flawed reasoning. We perform additional experiments investigating various potential contributing factors to these harmful shifts - including sycophancy, social conformity, and model and task type. These results highlight important failure modes in the exchange of reasons during multi-agent debate, suggesting that naive applications of debate may cause performance degradation when agents are neither incentivised nor adequately equipped to resist persuasive but incorrect reasoning.

[36] Path Drift in Large Reasoning Models:How First-Person Commitments Override Safety

Yuyi Huang, Runzhe Zhan, Lidia S. Chao, Ailin Tao, Derek F. Wong

Main category: cs.CL

TL;DR: Long Chain-of-Thought models can drift from aligned paths during reasoning, creating safety vulnerabilities through three behavioral triggers, and a defense strategy is proposed using path-level oversight.

Details

Motivation: To identify and address the vulnerability in Long-CoT prompting where reasoning trajectories can drift from safety-aligned paths, violating constraints despite RLHF safeguards.

Method: Empirical analysis of three Path Drift triggers, development of a three-stage Path Drift Induction Framework, and proposal of path-level defense with role attribution correction and metacognitive reflection.

Result: Identified three behavioral triggers of Path Drift, demonstrated that each stage of the induction framework reduces refusal rates independently, and their combination compounds the effect.

Conclusion: Trajectory-level alignment oversight is necessary for long-form reasoning beyond token-level alignment to mitigate Path Drift risks in LLMs.

Abstract: As large language models (LLMs) are increasingly deployed for complex reasoning tasks, Long Chain-of-Thought (Long-CoT) prompting has emerged as a key paradigm for structured inference. Despite early-stage safeguards enabled by alignment techniques such as RLHF, we identify a previously underexplored vulnerability: reasoning trajectories in Long-CoT models can drift from aligned paths, resulting in content that violates safety constraints. We term this phenomenon Path Drift. Through empirical analysis, we uncover three behavioral triggers of Path Drift: (1) first-person commitments that induce goal-driven reasoning that delays refusal signals; (2) ethical evaporation, where surface-level disclaimers bypass alignment checkpoints; (3) condition chain escalation, where layered cues progressively steer models toward unsafe completions. Building on these insights, we introduce a three-stage Path Drift Induction Framework comprising cognitive load amplification, self-role priming, and condition chain hijacking. Each stage independently reduces refusal rates, while their combination further compounds the effect. To mitigate these risks, we propose a path-level defense strategy incorporating role attribution correction and metacognitive reflection (reflective safety cues). Our findings highlight the need for trajectory-level alignment oversight in long-form reasoning beyond token-level alignment.

[37] Lightweight Baselines for Medical Abstract Classification: DistilBERT with Cross-Entropy as a Strong Default

Jiaqi Liu, Lanruo Wang, Su Liu, Xin Hu

Main category: cs.CL

TL;DR: Compact encoders like DistilBERT with standard cross-entropy outperform BERT base for medical abstract classification while using fewer parameters, suggesting a practical default approach for deployment in constrained health settings.

Details

Motivation: Large language models are difficult to deploy in health settings due to strict cost, latency, and privacy constraints, motivating the exploration of lightweight alternatives for medical abstract classification.

Method: Finetuned BERT base and DistilBERT on medical abstracts corpus using three objectives: standard cross-entropy, class weighted cross entropy, and focal loss, while keeping tokenizer, sequence length, optimizer, and schedule fixed.

Result: DistilBERT with plain cross-entropy achieved the best balance on test set performance while using far fewer parameters than BERT base, with results reported using accuracy, Macro F1, and Weighted F1 metrics.

Conclusion: A practical default approach is to start with compact encoders and cross-entropy, then add calibration and task-specific checks before considering heavier models for deployment in constrained health settings.

Abstract: Large language models work well for many NLP tasks, but they are hard to deploy in health settings with strict cost, latency, and privacy limits. We revisit a lightweight recipe for medical abstract classification and ask how far compact encoders can go under a controlled budget. Using the public medical abstracts corpus, we finetune BERT base and DistilBERT with three objectives standard cross-entropy, class weighted cross entropy, and focal loss keeping tokenizer, sequence length, optimizer, and schedule fixed. DistilBERT with plain cross-entropy gives the best balance on the test set while using far fewer parameters than BERT base. We report accuracy, Macro F1, and Weighted F1, release the evaluation code, and include confusion analyses to make error patterns clear. Our results suggest a practical default: start with a compact encoder and cross-entropy, then add calibration and task-specific checks before moving to heavier models.

[38] HUME: Measuring the Human-Model Performance Gap in Text Embedding Task

Adnan El Assadi, Isaac Chung, Roman Solomatin, Niklas Muennighoff, Kenneth Enevoldsen

Main category: cs.CL

TL;DR: HUME introduces a human evaluation framework for text embeddings to compare human vs model performance across 16 MTEB datasets, revealing humans achieve 77.6% vs models’ 80.1% with significant variation across tasks and languages.

Details

Motivation: Current embedding evaluation frameworks lack reliable human performance baselines, limiting interpretability of model scores and understanding of where models succeed or fail in capturing meaning and nuance.

Method: Developed HUME framework to measure human performance across 16 MTEB datasets covering reranking, classification, clustering, and semantic textual similarity tasks in diverse high- and low-resource languages.

Result: Humans achieved average 77.6% performance vs 80.1% for best embedding model, with substantial variation - models reach near-ceiling performance on some datasets but struggle on others, particularly in low-resource languages.

Conclusion: HUME provides human performance baselines, insights into task difficulty patterns, and an extensible framework that enables more meaningful model evaluation and informs development of both models and benchmarks.

Abstract: Comparing human and model performance offers a valuable perspective for understanding the strengths and limitations of embedding models, highlighting where they succeed and where they fail to capture meaning and nuance. However, such comparisons are rarely made, as human performance on embedding tasks is difficult to measure. To fill this gap, we introduce HUME: Human Evaluation Framework for Text Embeddings. While frameworks like MTEB provide broad model evaluation, they lack reliable estimates of human performance, limiting the interpretability of model scores. We measure human performance across 16 MTEB datasets spanning reranking, classification, clustering, and semantic textual similarity across linguistically diverse high- and low-resource languages. Humans achieve an average performance of 77.6% compared to 80.1% for the best embedding model, although variation is substantial: models reach near-ceiling performance on some datasets while struggling on others, suggesting dataset issues and revealing shortcomings in low-resource languages. We provide human performance baselines, insight into task difficulty patterns, and an extensible evaluation framework that enables a more meaningful interpretation of the model and informs the development of both models and benchmarks. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.

[39] CLMN: Concept based Language Models via Neural Symbolic Reasoning

Yibo Yang

Main category: cs.CL

TL;DR: CLMN is a neural-symbolic framework that maintains both performance and interpretability in NLP by representing concepts as continuous embeddings and using fuzzy-logic reasoning to learn adaptive interaction rules.

Details

Motivation: Current concept bottleneck models in NLP either use binary activations that harm text representations or latent concepts that weaken semantics, and they rarely model dynamic concept interactions like negation and context.

Method: CLMN represents concepts as continuous, human-readable embeddings and applies fuzzy-logic reasoning to learn adaptive interaction rules that state how concepts affect each other and the final decision. It augments original text features with concept-aware representations and automatically induces interpretable logic rules.

Result: Across multiple datasets and pre-trained language models, CLMN achieves higher accuracy than existing concept-based methods while improving explanation quality.

Conclusion: Integrating neural representations with symbolic reasoning in a unified concept space can yield practical, transparent NLP systems.

Abstract: Deep learning has advanced NLP, but interpretability remains limited, especially in healthcare and finance. Concept bottleneck models tie predictions to human concepts in vision, but NLP versions either use binary activations that harm text representations or latent concepts that weaken semantics, and they rarely model dynamic concept interactions such as negation and context. We introduce the Concept Language Model Network (CLMN), a neural-symbolic framework that keeps both performance and interpretability. CLMN represents concepts as continuous, human-readable embeddings and applies fuzzy-logic reasoning to learn adaptive interaction rules that state how concepts affect each other and the final decision. The model augments original text features with concept-aware representations and automatically induces interpretable logic rules. Across multiple datasets and pre-trained language models, CLMN achieves higher accuracy than existing concept-based methods while improving explanation quality. These results show that integrating neural representations with symbolic reasoning in a unified concept space can yield practical, transparent NLP systems.

[40] Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference

Hua Cai, Shuang Zhao, Liang Zhang, Xuli Shen, Qing Xu, Weilin Shen, Zihao Wen, Tianke Ban

Main category: cs.CL

TL;DR: Unilaw-R1 is a 7B-parameter LLM specialized for legal reasoning that addresses knowledge gaps, unreliable logic, and weak generalization through curated CoT data and two-stage training, achieving competitive performance with larger models.

Details

Motivation: LLMs show promise in reasoning but their capabilities in complex legal problems remain underexplored, with challenges in legal knowledge, reasoning reliability, and business generalization.

Method: Constructed Unilaw-R1-Data (17K high-quality CoT samples) and used two-stage training combining SFT and RL to enhance legal reasoning and support interpretable decision-making.

Result: Outperformed all similar-scale models and achieved performance comparable to DeepSeek-R1-Distill-Qwen-32B (54.9%). Showed 6.6% average improvement over Qwen-2.5-7B-Instruct on LawBench and LexEval.

Conclusion: Unilaw-R1 demonstrates that specialized legal reasoning models with lightweight architecture can effectively handle complex legal tasks while reducing deployment costs.

Abstract: Reasoning-focused large language models (LLMs) are rapidly evolving across various domains, yet their capabilities in handling complex legal problems remains underexplored. In this paper, we introduce Unilaw-R1, a large language model tailored for legal reasoning. With a lightweight 7-billion parameter scale, Unilaw-R1 significantly reduces deployment cost while effectively tackling three core challenges in the legal domain: insufficient legal knowledge, unreliable reasoning logic, and weak business generalization. To address these issues, we first construct Unilaw-R1-Data, a high-quality dataset containing 17K distilled and screened chain-of-thought (CoT) samples. Based on this, we adopt a two-stage training strategy combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), which significantly boosts the performance on complex legal reasoning tasks and supports interpretable decision-making in legal AI applications. To assess legal reasoning ability, we also introduce Unilaw-R1-Eval, a dedicated benchmark designed to evaluate models across single- and multi-choice legal tasks. Unilaw-R1 demonstrates strong results on authoritative benchmarks, outperforming all models of similar scale and achieving performance on par with the much larger DeepSeek-R1-Distill-Qwen-32B (54.9%). Following domain-specific training, it also showed significant gains on LawBench and LexEval, exceeding Qwen-2.5-7B-Instruct (46.6%) by an average margin of 6.6%.

[41] MADS: Multi-Agent Dialogue Simulation for Diverse Persuasion Data Generation

Mingjin Li, Yu Liu, Huayi Liu, Xiang Ye, Chao Jiang, Hongguang Zhang, Yu Ruan

Main category: cs.CL

TL;DR: MADS is a multi-agent framework that generates persuasive dialogues through agent self-play, using persona-driven user agents, task-oriented dialog agents, and optimization agents to create training data without human annotation.

Details

Motivation: To address industry challenges like lack of user data, cold-start evaluation difficulties, and prompt inefficiency by enabling low-cost generation of persuasive dialogue training data.

Method: Uses three coordinated agents: User Agents with personality signifiers (Zodiac Signs, MBTI), Dialog Agent for persuasion strategies, and Optimization Agent for evaluation and refinement. Validated through Chain-of-Attitude modeling and LLM persuasion assessment.

Result: Significantly improved persuasion capacity of small LLMs, increasing organic traffic conversion rate by 22.4% (from 1.83% to 2.24%) in real-world marketing scenario.

Conclusion: MADS demonstrates clear business value by enabling scalable generation of persuasive dialogue data and improving conversion rates without human annotation costs.

Abstract: We propose MADS (Multi-Agent Dialogue Simulation), a scalable framework for generating persuasive multi-turn dialogues via agent self-play. MADS employs three coordinated agents: User Agents designed to simulate diverse persona-driven behaviors by leveraging personality signifiers such as Zodiac Signs and MBTI types, a Dialog Agent executing task-oriented persuasion strategies and an Optimization Agent evaluating and refining dialogue outcomes. We further validate its effectiveness through users’ Chain-of-Attitude (CoA) modeling and dedicated LLMs’ persuasion assessment. This approach enables low-cost generation of training data without human annotation, addressing key industry challenges such as lack of user data, cold-start evaluation difficulties, and prompt inefficiency. Applied to a real-world marketing scenario, MADS significantly improved the persuasion capacity of small LLMs, increasing the organic traffic conversion rate by 22.4% (from 1.83% to 2.24%) , demonstrating clear business value.

[42] A-IPO: Adaptive Intent-driven Preference Optimization

Wenqing Wang, Muhammad Asif Ali, Ali Shoker, Ruohan Yang, Junyang Chen, Ying Sha, Huan Wang

Main category: cs.CL

TL;DR: A-IPO is a new alignment method that infers latent user intent from prompts and incorporates it into preference optimization, addressing limitations of existing methods that overlook minority opinions and fail to capture user intentions.

Details

Motivation: Existing alignment methods like DPO default to majority views and overlook minority opinions, failing to capture latent user intentions in prompts. Human preferences are diverse and shaped by regional, cultural, and social factors.

Method: A-IPO introduces an intention module that infers latent intent behind user prompts and explicitly incorporates this inferred intent into the reward function. It adds an intention-response similarity term that increases preference margin in log-odds.

Result: A-IPO achieves substantial improvements: up to +24.8 win-rate and +45.6 Response-Intention Consistency on Real-pref; up to +38.6 Response Similarity and +52.2 Defense Success Rate on Attack-pref; up to +54.6 Intention Consistency Score on GlobalOpinionQA-Ext.

Conclusion: A-IPO facilitates pluralistic preference optimization while enhancing adversarial robustness, consistently surpassing existing baselines across multiple evaluation benchmarks.

Abstract: Human preferences are diverse and dynamic, shaped by regional, cultural, and social factors. Existing alignment methods like Direct Preference Optimization (DPO) and its variants often default to majority views, overlooking minority opinions and failing to capture latent user intentions in prompts. To address these limitations, we introduce \underline{\textbf{A}}daptive \textbf{\underline{I}}ntent-driven \textbf{\underline{P}}reference \textbf{\underline{O}}ptimization (\textbf{A-IPO}). Specifically,A-IPO introduces an intention module that infers the latent intent behind each user prompt and explicitly incorporates this inferred intent into the reward function, encouraging stronger alignment between the preferred model’s responses and the user’s underlying intentions. We demonstrate, both theoretically and empirically, that incorporating an intention–response similarity term increases the preference margin (by a positive shift of $\lambda,\Delta\mathrm{sim}$ in the log-odds), resulting in clearer separation between preferred and dispreferred responses compared to DPO. For evaluation, we introduce two new benchmarks, Real-pref, Attack-pref along with an extended version of an existing dataset, GlobalOpinionQA-Ext, to assess real-world and adversarial preference alignment. Through explicit modeling of diverse user intents,A-IPO facilitates pluralistic preference optimization while simultaneously enhancing adversarial robustness in preference alignment. Comprehensive empirical evaluation demonstrates that A-IPO consistently surpasses existing baselines, yielding substantial improvements across key metrics: up to +24.8 win-rate and +45.6 Response-Intention Consistency on Real-pref; up to +38.6 Response Similarity and +52.2 Defense Success Rate on Attack-pref; and up to +54.6 Intention Consistency Score on GlobalOpinionQA-Ext.

[43] FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

Haotian Wu, Shufan Jiang, Mingyu Chen, Yiyang Feng, Hehai Lin, Heqing Zou, Yao Shu, Chengwei Qin

Main category: cs.CL

TL;DR: FURINA-Builder is a multi-agent pipeline that automatically constructs customizable role-playing benchmarks, addressing limitations of existing benchmarks. It enables evaluation of arbitrary characters across diverse scenarios and formats.

Details

Motivation: Existing role-playing benchmarks are obsolete due to narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios.

Method: Multi-agent collaboration pipeline that simulates dialogues between test characters and other characters from a character-scene pool, with an LLM judge selecting evaluation dimensions and adjusting responses into test utterances.

Result: Built FURINA-Bench with established and synthesized characters. Found o3 and DeepSeek-R1 perform best on English/Chinese tasks respectively. Established characters outperform synthesized ones, with reasoning amplifying this disparity. Model scale doesn’t monotonically reduce hallucinations. Reasoning LLMs show trade-off: improved RP performance but increased hallucinations.

Conclusion: FURINA-Builder effectively addresses benchmark limitations and FURINA-Bench poses significant challenges, revealing a Pareto frontier between RP performance and reliability across all LLMs.

Abstract: As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, as the first benchmark builder in RP area for adaptable assessment. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character’s responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.

[44] Diversity Augmentation of Dynamic User Preference Data for Boosting Personalized Text Summarizers

Parthiv Chatterjee, Shivam Sonawane, Amey Hengle, Aditya Tanna, Sourish Dasgupta, Tanmoy Chakraborty

Main category: cs.CL

TL;DR: PerAugy is a novel data augmentation technique using cross-trajectory shuffling and summary-content perturbation that significantly improves personalized document summarization by enhancing user-encoder accuracy and increasing dataset diversity.

Details

Motivation: Personalized summarization is challenging due to subjective user preferences and scarcity of training data containing both user preference history and target summaries. Existing datasets like MS/CAS PENS lack target summaries and have limited topic diversity, restricting model generalization.

Method: Proposed PerAugy - a data augmentation technique that performs cross-trajectory shuffling and summary-content perturbation to generate diverse training data for personalized summarization models.

Result: PerAugy boosted accuracy of four SOTA user-encoders (best: 0.132↑ AUC) and increased personalization in two summarizer frameworks by 61.2% on average (PSE-SU4 metric). Introduced diversity metrics (TP, RTC, DegreeD) showed strong correlation between dataset diversity and performance gains.

Conclusion: Increased dataset diversity through PerAugy’s augmentation is a key factor driving performance improvements in personalized summarization, with TP and DegreeD metrics strongly correlating with user-encoder performance across all accuracy measures.

Abstract: Document summarization enables efficient extraction of user-relevant content but is inherently shaped by individual subjectivity, making it challenging to identify subjective salient information in multifaceted documents. This complexity underscores the necessity for personalized summarization. However, training models for personalized summarization has so far been challenging, particularly because diverse training data containing both user preference history (i.e., click-skip trajectory) and expected (gold-reference) summaries are scarce. The MS/CAS PENS dataset is a valuable resource but includes only preference history without target summaries, preventing end-to-end supervised learning, and its limited topic-transition diversity further restricts generalization. To address this, we propose $\mathrm{PerAugy}$, a novel cross-trajectory shuffling and summary-content perturbation based data augmentation technique that significantly boosts the accuracy of four state-of-the-art baseline (SOTA) user-encoders commonly used in personalized summarization frameworks (best result: $\text{0.132}$$\uparrow$ w.r.t AUC). We select two such SOTA summarizer frameworks as baselines and observe that when augmented with their corresponding improved user-encoders, they consistently show an increase in personalization (avg. boost: $\text{61.2%}\uparrow$ w.r.t. PSE-SU4 metric). As a post-hoc analysis of the role of induced diversity in the augmented dataset by \peraugy, we introduce three dataset diversity metrics – $\mathrm{TP}$, $\mathrm{RTC}$, and \degreed\ to quantify the induced diversity. We find that $\mathrm{TP}$ and $\mathrm{DegreeD}$ strongly correlate with user-encoder performance on the PerAugy-generated dataset across all accuracy metrics, indicating that increased dataset diversity is a key factor driving performance gains.

[45] Stop When Enough: Adaptive Early-Stopping for Chain-of-Thought Reasoning

Renliang Sun, Wei Cheng, Dawei Li, Haifeng Chen, Wei Wang

Main category: cs.CL

TL;DR: REFRAIN is a training-free framework that adaptively stops Chain-of-Thought reasoning to prevent overthinking, reducing token usage by 20-55% while maintaining or improving accuracy.

Details

Motivation: Chain-of-Thought reasoning improves LLM performance but can lead to overthinking - excessive or redundant reasoning that increases costs and may lead to incorrect conclusions.

Method: Uses a two-stage stop discriminator to identify reflective yet redundant reasoning and a sliding-window Upper Confidence Bound multi-armed bandit controller to dynamically adjust stopping thresholds based on problem difficulty.

Result: Across four benchmarks and two model families, REFRAIN reduces token usage by 20-55% while maintaining or improving accuracy compared to standard CoT prompting.

Conclusion: When-to-stop reasoning is a practical axis of test-time scaling that enables models to reason just enough rather than more.

Abstract: Chain-of-Thought (CoT) reasoning has driven recent gains of large language models (LLMs) on reasoning-intensive tasks by externalizing intermediate steps. However, excessive or redundant reasoning – so-called overthinking – can increase inference costs and lead LLMs toward incorrect conclusions. In this paper, we present REFRAIN ($\underline{REF}$lective-$\underline{R}$edundancy for $\underline{A}$daptive $\underline{IN}$ference), a training-free framework that adaptively determines when to stop reasoning to mitigate overthinking. REFRAIN integrates a two-stage stop discriminator to identify reflective yet redundant reasoning and a sliding-window Upper Confidence Bound (SW-UCB) multi-armed bandit controller to dynamically adjust stopping thresholds according to problem difficulty without supervision or fine-tuning. Across four representative benchmarks and two model families, REFRAIN reduces token usage by 20-55% while maintaining or improving accuracy compared to standard CoT prompting. Extensive ablation and robustness analyses demonstrate its stability across models, scorers, and prompt variations. In summary, our findings highlight when-to-stop as a new and practical axis of test-time scaling – enabling models to reason not just more, but just enough.

[46] LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora

Luyao Zhuang, Shengyuan Chen, Yilin Xiao, Huachi Zhou, Yujing Zhang, Hao Chen, Qinggang Zhang, Xiao Huang

Main category: cs.CL

TL;DR: LinearRAG is an efficient graph-based RAG framework that uses relation-free hierarchical graphs for reliable retrieval, avoiding costly relation extraction while scaling linearly with corpus size.

Details

Motivation: Traditional RAG struggles with fragmented information in large corpora, while existing GraphRAG methods suffer from unstable and costly relation extraction that produces noisy graphs degrading retrieval quality.

Method: Constructs Tri-Graph using lightweight entity extraction and semantic linking without relation modeling, then uses two-stage retrieval: entity activation via semantic bridging followed by passage retrieval through importance aggregation.

Result: Significantly outperforms baseline models on four datasets, demonstrating superior retrieval performance.

Conclusion: LinearRAG provides an economical and reliable alternative to traditional GraphRAG by eliminating relation extraction while maintaining effective retrieval capabilities for complex queries.

Abstract: Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation. Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models.

[47] Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task

Zilong Wang, Xiaoyu Shen

Main category: cs.CL

TL;DR: A framework combining OCR engines and LLMs for efficient information extraction from copy-heavy documents, achieving near-perfect accuracy with sub-second processing speeds through adaptive strategy selection.

Details

Motivation: Information extraction from copy-heavy documents with structurally similar content is a critical but understudied challenge in enterprise document processing, requiring optimization of the accuracy-efficiency trade-off.

Method: Systematic framework combining OCR engines with LLMs, implementing 25 configurations across three extraction paradigms (direct, replacement, table-based) with intelligent strategy selection based on document-specific characteristics.

Result: Outstanding performance: F1=1.0 accuracy with 0.97s latency for structured documents, and F1=0.997 accuracy with 0.6s for image inputs using PaddleOCR. 54x performance improvement compared to multimodal methods, maintaining sub-second processing speeds.

Conclusion: The repetitive nature of copy-heavy tasks can be transformed from computational burden into optimization opportunity through structure-aware method selection, establishing a general principle for enterprise document processing.

Abstract: Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that strategically combines OCR engines with Large Language Models (LLMs) to optimize the accuracy-efficiency trade-off inherent in repetitive document extraction tasks. Unlike existing approaches that pursue universal solutions, our method exploits document-specific characteristics through intelligent strategy selection. We implement and evaluate 25 configurations across three extraction paradigms (direct, replacement, and table-based) on identity documents spanning four formats (PNG, DOCX, XLSX, PDF). Through table-based extraction methods, our adaptive framework delivers outstanding results: F1=1.0 accuracy with 0.97s latency for structured documents, and F1=0.997 accuracy with 0.6 s for challenging image inputs when integrated with PaddleOCR, all while maintaining sub-second processing speeds. The 54 times performance improvement compared with multimodal methods over naive approaches, coupled with format-aware routing, enables processing of heterogeneous document streams at production scale. Beyond the specific application to identity extraction, this work establishes a general principle: the repetitive nature of copy-heavy tasks can be transformed from a computational burden into an optimization opportunity through structure-aware method selection.

[48] DiffHeads: Differential Analysis and Inference-Time Masking of Bias Heads in Large Language Models

Tingxu Han, Wei Song, Ziqi Ding, Ziming Li, Chunrong Fang, Yuekang Li, Dongfang Liu, Zhenyu Chen, Zhenting Wang

Main category: cs.CL

TL;DR: DiffHeads is a lightweight debiasing framework that identifies and masks specific bias heads in LLMs, reducing unfairness by up to 49.4% without harming model utility.

Details

Motivation: LLMs increasingly mediate decisions in sensitive domains where unfair treatment of demographic groups is unacceptable, but existing bias mitigation approaches are largely fragile and lack insight into the underlying mechanisms.

Method: 1) Compare Direct-Answer vs Chain-of-Thought prompting across 8 LLMs; 2) Define token-to-head contribution score to trace bias to specific attention heads; 3) Propose DiffHeads that identifies bias heads through differential activation analysis and selectively masks them.

Result: Direct-Answer prompting increases unfairness by 534.5%-391.9%. A small cluster of bias heads activate under DA but stay dormant with CoT. DiffHeads reduces unfairness by 49.4% under DA and 40.3% under CoT.

Conclusion: The paper provides the first causal link between prompting strategies and bias emergence, and demonstrates that selective masking of identified bias heads effectively reduces unfairness while preserving model utility.

Abstract: Large language models (LLMs) increasingly mediate decisions in domains where unfair treatment of demographic groups is unacceptable. Existing work probes when biased outputs appear, but gives little insight into the mechanisms that generate them, leaving existing mitigations largely fragile. In this paper, we conduct a systematic investigation LLM unfairness and propose DiffHeads, a lightweight debiasing framework for LLMs. We first compare Direct-Answer (DA) prompting to Chain-of-Thought (CoT) prompting across eight representative open- and closed-source LLMs. DA will trigger the nature bias part of LLM and improve measured unfairness by 534.5%-391.9% in both one-turn and two-turn dialogues. Next, we define a token-to-head contribution score that traces each token’s influence back to individual attention heads. This reveals a small cluster of bias heads that activate under DA but stay largely dormant with CoT, providing the first causal link between prompting strategy and bias emergence. Finally, building on this insight, we propose DiffHeads that identifies bias heads through differential activation analysis between DA and CoT, and selectively masks only those heads. DiffHeads reduces unfairness by 49.4%, and 40.3% under DA and CoT, respectively, without harming model utility.

[49] BILLY: Steering Large Language Models via Merging Persona Vectors for Creative Generation

Tsung-Min Pai, Jui-I Wang, Li-Chun Lu, Shao-Hua Sun, Hung-Yi Lee, Kai-Wei Chang

Main category: cs.CL

TL;DR: BILLY is a training-free framework that blends multiple persona vectors in a single LLM’s activation space to achieve multi-perspective creativity without the computational costs of multi-LLM systems.

Details

Motivation: Multi-LLM systems enhance creativity through collective intelligence but suffer from high computational costs and inference latency. BILLY aims to capture the benefits of multi-LLM collaboration within a single model.

Method: Extracts and blends multiple distinct persona vectors directly in the model’s activation space, then steers generation with this merged vector during inference to enable multi-perspective output without multi-LLM communication.

Result: BILLY surpasses single model prompting and traditional multi-LLM approaches on creativity benchmarks while substantially reducing inference time and computational costs.

Conclusion: Persona vector blending enables effective control over complementary generation aspects and greater interpretability, providing multi-LLM benefits in a single model.

Abstract: Multi-LLM systems enhance the creativity of large language models by simulating human collective intelligence but suffer from significant drawbacks, such as high computational costs and inference latency. To address these limitations, we propose BILLY (BlendIng persona vectors for Large Language model creativitY), a training-free framework that captures the benefits of multi-LLM collaboration, i.e. inducing diverse perspectives and specialized expertise, within a single model. BILLY operates by extracting and blending multiple distinct persona vectors directly in the model’s activation space. We steer the model’s generation process with this merged vector while inference, enabling multi-perspective output without explicit multi-LLM communication. Our experiments across creativity-oriented benchmarks demonstrate that BILLY surpasses single model prompting and traditional multi-LLM approaches, while substantially reducing inference time and computational costs. Our analyses further reveal that distinct persona vectors can be blended to achieve both effective control over complementary aspects of generation and greater interpretability.

[50] BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data

Jaap Jumelet, Abdellah Fourtassi, Akari Haga, Bastian Bunzeck, Bhargav Shandilya, Diana Galvan-Sosa, Faiz Ghifari Haznitrama, Francesca Padovani, Francois Meyer, Hai Hu, Julen Etxaniz, Laurent Prévot, Linyang He, María Grandury, Mila Marcheva, Negar Foroutan, Nikitas Theodoropoulos, Pouya Sadeghi, Siyuan Song, Suchir Salhan, Susana Zhou, Yurii Paniv, Ziyin Zhang, Arianna Bisazza, Alex Warstadt, Leshem Choshen

Main category: cs.CL

TL;DR: BabyBabelLM is a multilingual dataset collection modeling language exposure from birth to native language acquisition, with developmentally plausible pretraining data covering 45 languages.

Details

Motivation: To facilitate multilingual pretraining and cognitive modeling by providing developmentally appropriate language data that mirrors what a person observes from birth until native language acquisition.

Method: Curated developmentally plausible pretraining data equivalent to 100M English words in each of 45 languages, compiled evaluation suites, and trained baseline models for each language.

Result: Created a comprehensive multilingual dataset collection with evaluation benchmarks and baseline models across 45 languages.

Conclusion: BabyBabelLM provides a valuable resource for advancing multilingual pretraining and cognitive modeling research by offering developmentally realistic language data across multiple languages.

Abstract: We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.

[51] Large Language Model Sourcing: A Survey

Liang Pang, Kangxi Wu, Sunhao Dai, Zihao Wei, Zenghao Duan, Jia Gu, Xiang Li, Zhiyi Yin, Jun Xu, Huawei Shen, Xueqi Cheng

Main category: cs.CL

TL;DR: This survey systematically investigates provenance tracking for LLM-generated content across four dimensions: model sourcing, model structure sourcing, training data sourcing, and external data sourcing.

Details

Motivation: As LLMs shift from objective tasks to subjective decision-making, their black-box nature and human-like outputs create significant risks including hallucinations, bias, unfairness, and copyright infringement, necessitating multi-perspective information sourcing.

Method: Organizes provenance tracking around four interrelated dimensions capturing both model- and data-centric perspectives, with a dual-paradigm taxonomy classifying methods into prior-based (proactive traceability embedding) and posterior-based (retrospective inference) approaches.

Result: The survey provides a comprehensive framework for tracing LLM-generated content origins across different dimensions, enhancing understanding of how content is shaped by model architecture, training data, and external influences.

Conclusion: Provenance tracking across these dimensions enhances transparency, accountability, and trustworthiness of LLM deployment in real-world applications, addressing critical risks associated with their widespread use.

Abstract: The rapid advancement of large language models (LLMs) has revolutionized artificial intelligence, shifting from supporting objective tasks (e.g., recognition) to empowering subjective decision-making (e.g., planning, decision). This marks the dawn of general and powerful AI, with applications spanning a wide range of fields, including programming, education, healthcare, finance, and law. However, their deployment introduces multifaceted risks. Due to the black-box nature of LLMs and the human-like quality of their generated content, issues such as hallucinations, bias, unfairness, and copyright infringement become particularly significant. In this context, sourcing information from multiple perspectives is essential. This survey presents a systematic investigation into provenance tracking for content generated by LLMs, organized around four interrelated dimensions that together capture both model- and data-centric perspectives. From the model perspective, Model Sourcing treats the model as a whole, aiming to distinguish content generated by specific LLMs from content authored by humans. Model Structure Sourcing delves into the internal generative mechanisms, analyzing architectural components that shape the outputs of model. From the data perspective, Training Data Sourcing focuses on internal attribution, tracing the origins of generated content back to the training data of model. In contrast, External Data Sourcing emphasizes external validation, identifying external information used to support or influence the responses of model. Moreover, we also propose a dual-paradigm taxonomy that classifies existing sourcing methods into prior-based (proactive traceability embedding) and posterior-based (retrospective inference) approaches. Traceability across these dimensions enhances the transparency, accountability, and trustworthiness of LLMs deployment in real-world applications.

[52] A Survey of Inductive Reasoning for Large Language Models

Kedi Chen, Dezhao Ruan, Yuhao Dan, Yaoting Wang, Siyu Yan, Xuecheng Wu, Yinqi Zhang, Qin Chen, Jie Zhou, Liang He, Biqing Qi, Linyang Li, Qipeng Guo, Xiaoming Shi, Wei Zhang

Main category: cs.CL

TL;DR: This paper presents the first comprehensive survey of inductive reasoning in large language models (LLMs), categorizing improvement methods, summarizing benchmarks, and providing analysis on inductive ability sources.

Details

Motivation: Inductive reasoning is fundamental for knowledge generalization and aligns with human cognition, but there's no systematic summary of it for LLMs despite its importance.

Method: Categorizes methods into post-training, test-time scaling, and data augmentation; summarizes benchmarks; derives unified sandbox-based evaluation with observation coverage metric.

Result: Provides comprehensive categorization of inductive reasoning methods and benchmarks, along with analysis of inductive ability sources.

Conclusion: The survey establishes a solid foundation for future research on inductive reasoning in LLMs by systematizing current knowledge and evaluation approaches.

Abstract: Reasoning is an important task for large language models (LLMs). Among all the reasoning paradigms, inductive reasoning is one of the fundamental types, which is characterized by its particular-to-general thinking process and the non-uniqueness of its answers. The inductive mode is crucial for knowledge generalization and aligns better with human cognition, so it is a fundamental mode of learning, hence attracting increasing interest. Despite the importance of inductive reasoning, there is no systematic summary of it. Therefore, this paper presents the first comprehensive survey of inductive reasoning for LLMs. First, methods for improving inductive reasoning are categorized into three main areas: post-training, test-time scaling, and data augmentation. Then, current benchmarks of inductive reasoning are summarized, and a unified sandbox-based evaluation approach with the observation coverage metric is derived. Finally, we offer some analyses regarding the source of inductive ability and how simple model architectures and data help with inductive tasks, providing a solid foundation for future research.

[53] Weed Out, Then Harvest: Dual Low-Rank Adaptation is an Effective Noisy Label Detector for Noise-Robust Learning

Bo Yuan, Yulin Chen, Yin Zhang

Main category: cs.CL

TL;DR: Delora is a novel framework that decouples sample selection from model training for PEFT of LLMs with noisy labels, using clean and noisy LoRA modules to detect and handle mislabeled data.

Details

Motivation: Real-world training data often contains noisy labels, and existing methods that select samples based on small losses can create a vicious cycle where inaccurate initial selection leads to suboptimal performance.

Method: Delora introduces clean and noisy LoRA modules to establish a noisy label detector. The clean LoRA memorizes clean data while the noisy LoRA memorizes mislabeled data, serving as a learnable threshold for sample selection, decoupling this process from model training.

Result: Experimental results on synthetic and real-world noisy datasets show Delora’s effectiveness in noisy label detection and text classification tasks.

Conclusion: Delora successfully breaks the vicious cycle of noisy label learning by decoupling sample selection from model training, providing an effective solution for PEFT of LLMs with noisy labels.

Abstract: Parameter-efficient fine-tuning (PEFT) large language models (LLMs) have shown impressive performance in various downstream tasks. However, in many real-world scenarios, the collected training data inevitably contains noisy labels. To learn from noisy labels, most solutions select samples with small losses for model training. However, the selected samples, in turn, impact the loss computation in the next iteration. An inaccurate initial selection can create a vicious cycle, leading to suboptimal performance. To break this cycle, we propose Delora, a novel framework that decouples the sample selection from model training. For sample selection, Delora establishes a noisy label detector by introducing clean and noisy LoRA. Benefiting from the memory effect, the clean LoRA is encouraged to memorize clean data, while the noisy LoRA is constrained to memorize mislabeled data, which serves as a learnable threshold for selecting clean and noisy samples. For model training, Delora can use carefully selected samples to fine-tune language models seamlessly. Experimental results on synthetic and real-world noisy datasets demonstrate the effectiveness of Delora in noisy label detection and text classification.

[54] You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs

Yijie Xu, Huizai Yao, Zhiyu Guo, Weiyu Guo, Pengteng Li, Aiwei Liu, Xuming Hu, Hui Xiong

Main category: cs.CL

TL;DR: SyTTA is a label-free test-time adaptation framework that uses input perplexity and output predictive entropy to adapt LLMs to domain shifts without additional supervision.

Details

Motivation: LLMs face distribution shifts in specialized domains but domain-specific fine-tuning requires expensive labeled data that's hard to collect in expertise-limited settings.

Method: SyTTA couples two uncertainty signals: input-side perplexity (domain mismatch) and output-side predictive entropy (unstable token probabilities) for on-the-fly adaptation without supervision.

Result: Consistent gains across diverse models and domains; on agricultural QA, improves Rouge-LSum by over 120% on Qwen-2.5-7B with only 4 extra tokens per query.

Conclusion: Effective test-time adaptation for LLMs is achievable without labeled examples, supporting deployment in label-scarce domains.

Abstract: Large language models (LLMs) are increasingly deployed in specialized domains such as finance, medicine, and agriculture, where they face significant distribution shifts from their training data. Domain-specific fine-tuning can mitigate this challenge but relies on high-quality labeled data that is expensive and slow to collect in expertise-limited settings. We study label-free test-time adaptation for language models and present SyTTA, an inference-time framework that adapts models on-the-fly without additional supervision. SyTTA couples two complementary uncertainty signals that arise under distribution shift: input-side perplexity, indicating mismatch with domain-specific terminology and patterns, and output-side predictive entropy, indicating diffuse and unstable token probabilities during generation. Across diverse model architectures and domain-specific benchmarks, SyTTA delivers consistent gains. Notably, on agricultural question answering, SyTTA improves Rouge-LSum by over 120% on Qwen-2.5-7B with only 4 extra tokens per query. These results show that effective test-time adaptation for language models is achievable without labeled examples, supporting deployment in label-scarce domains. The code will be made available upon acceptance.

[55] Text2Token: Unsupervised Text Representation Learning with Token Target Prediction

Ruize An, Richong Zhang, Zhijie Nie, Zhanyu Wu, Yanzhao Zhang, Dingkun Long

Main category: cs.CL

TL;DR: Text2Token is an unsupervised generative framework for text representation learning that uses token target prediction with carefully constructed target distributions, achieving competitive performance with state-of-the-art methods.

Details

Motivation: The motivation comes from findings that high-quality text representations align with key tokens, revealing connections between representation and vocabulary spaces, which inspired revisiting generative tasks for unsupervised text representation learning.

Method: The framework uses token target prediction with two methods to construct target token distributions: data-driven (extracting meaningful tokens from text) and model-derived (using semantically derived tokens from LLM backbone).

Result: Experiments on MTEB v2 benchmark show Text2Token achieves competitive performance with state-of-the-art unsupervised contrastive learning embedder LLM2Vec.

Conclusion: The analysis reveals that vocabulary and representation spaces optimize together toward optimal solutions during training, providing new insights for future work in text representation learning.

Abstract: Unsupervised text representation learning (TRL) is a fundamental task in natural language processing, which is beneficial for improving search and recommendations with the web’s unlabeled texts. A recent empirical study finds that the high-quality representation aligns with the key token of the input text, uncovering the potential connection between representation space and vocabulary space. Inspired by the findings, we revisit the generative tasks and develop an unsupervised generative framework for TRL, Text2Token. The framework is based on the token target prediction task, utilizing carefully constructed target token distribution as supervisory signals. To construct the high-quality target token distribution, we analyze the token-alignment properties with advanced embedders and identify two essential categories of key tokens: (1) the meaningful tokens in the text and (2) semantically derived tokens beyond the text. Based on these insights, we propose two methods – data-driven and model-derived – to construct synthetic token targets from data or the LLM backbone. Experiments on the MTEB v2 benchmark demonstrate that Text2Token achieves performance competitive with the state-of-the-art embedder with unsupervised contrastive learning, LLM2Vec. Our analysis further shows that vocabulary and representation spaces optimize together and toward the optimum solution during training, providing new ideas and insights for future work.

[56] ImCoref-CeS: An Improved Lightweight Pipeline for Coreference Resolution with LLM-based Checker-Splitter Refinement

Kangyang Luo, Yuzhuo Bai, Shuzheng Si, Cheng Gao, Zhitong Wang, Yingli Shen, Wenhao Li, Zhu Liu, Yufeng Han, Jiayi Wu, Cunliang Kong, Maosong Sun

Main category: cs.CL

TL;DR: ImCoref-CeS is a novel framework that combines enhanced supervised neural methods with LLM reasoning for coreference resolution, achieving state-of-the-art performance.

Details

Motivation: To address the dilemma between exploring supervised neural methods and leveraging LLMs in coreference resolution, and to effectively combine their strengths which remains underexplored.

Method: Proposes ImCoref-CeS framework with: 1) ImCoref - improved supervised method with lightweight bridging module for long-text encoding, biaffine scorer for positional information, and hybrid mention regularization; 2) LLM as Checker-Splitter agent to validate candidate mentions and split erroneous clusters.

Result: Extensive experiments demonstrate effectiveness and superior performance compared to existing state-of-the-art methods.

Conclusion: The proposed ImCoref-CeS framework successfully integrates enhanced supervised models with LLM reasoning, achieving top performance in coreference resolution.

Abstract: Coreference Resolution (CR) is a critical task in Natural Language Processing (NLP). Current research faces a key dilemma: whether to further explore the potential of supervised neural methods based on small language models, whose detect-then-cluster pipeline still delivers top performance, or embrace the powerful capabilities of Large Language Models (LLMs). However, effectively combining their strengths remains underexplored. To this end, we propose \textbf{ImCoref-CeS}, a novel framework that integrates an enhanced supervised model with LLM-based reasoning. First, we present an improved CR method (\textbf{ImCoref}) to push the performance boundaries of the supervised neural method by introducing a lightweight bridging module to enhance long-text encoding capability, devising a biaffine scorer to comprehensively capture positional information, and invoking a hybrid mention regularization to improve training efficiency. Importantly, we employ an LLM acting as a multi-role Checker-Splitter agent to validate candidate mentions (filtering out invalid ones) and coreference results (splitting erroneous clusters) predicted by ImCoref. Extensive experiments demonstrate the effectiveness of ImCoref-CeS, which achieves superior performance compared to existing state-of-the-art (SOTA) methods.

[57] Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models

Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban

Main category: cs.CL

TL;DR: AoU is a framework that constrains LLM inference to validated premises through decomposition, auditing, and conditioned inference, reducing reasoning-induced hallucinations with theoretical guarantees and empirical improvements.

Details

Motivation: LLMs often generate reasoning traces that appear coherent but rely on unsupported assumptions, leading to hallucinated conclusions, which prior work mainly addresses through factual verification or post-hoc methods.

Method: Three-phase framework: (1) decomposing queries into candidate assumptions, (2) auditing their support, and (3) conditioning inference only on validated subset. Formally, posterior-constrained inference connecting to selective prediction and rejection learning.

Result: Empirical improvements: +30% gains on GSM8K, +45% on MultiArith, and consistent +20-28% improvements on SVAMP over Chain-of-Thought, Self-Consistency, and CoT-Decoding. Also provides theoretical guarantees under perfect validation and excess-risk bounds under imperfect audits.

Conclusion: AoU effectively addresses reasoning-induced hallucinations in LLMs by constraining inference to validated premises, demonstrating significant improvements in accuracy and faithfulness across multiple benchmarks.

Abstract: Large language models (LLMs) often generate reasoning traces that appear coherent but rest on unsupported assumptions, leading to hallucinated conclusions. Prior work mainly addresses factual hallucinations or relies on post-hoc verification, leaving reasoning-induced hallucinations largely unaddressed. We propose Audit-of-Understanding (AoU), a framework that constrains inference to validated premises through three phases: (1) decomposing a query into candidate assumptions, (2) auditing their support, and (3) conditioning inference only on the validated subset. Formally, AoU is \emph{posterior-constrained inference}, connecting to selective prediction and rejection learning. Our contributions are threefold: (i) theoretical guarantees under perfect validation, (ii) excess-risk bounds under imperfect audits, and (iii) tractability analysis. Empirically, AoU improves both accuracy and faithfulness on GSM8K, MultiArith, and SVAMP, achieving up to +30% gains on GSM8K, +45% on MultiArith, and consistent +20–28% improvements on SVAMP over Chain-of-Thought, Self-Consistency, and CoT-Decoding. Code is available at https://anonymous.4open.science/r/audit-of-understanding-E28B.

[58] Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models

Liang Lin, Miao Yu, Moayad Aloqaily, Zhenhong Zhou, Kun Wang, Linsey Pang, Prakhar Mehrotra, Qingsong Wen

Main category: cs.CL

TL;DR: A defense framework called \ourmethod that protects LLMs from backdoor attacks without requiring prior knowledge of trigger settings, using deliberate backdoor injection and recovery fine-tuning.

Details

Motivation: Backdoor attacks are a major threat to LLMs through public checkpoints, and existing defenses rely on impractical assumptions about trigger settings.

Method: Two-stage process: (1) aggregating backdoor representations by injecting known triggers into compromised models, (2) performing recovery fine-tuning to restore benign outputs.

Result: Reduces average Attack Success Rate to 4.41% across benchmarks (28.1%~69.3% improvement over baselines), preserves clean accuracy within 0.5% of original model, and generalizes across different backdoor types.

Conclusion: The defense framework is effective, practical, and robust for real-world deployment scenarios without requiring trigger knowledge.

Abstract: Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose \ourmethod, a defense framework that requires no prior knowledge of trigger settings. \ourmethod is based on the key observation that when deliberately injecting known backdoors into an already-compromised model, both existing unknown and newly injected backdoors aggregate in the representation space. \ourmethod leverages this through a two-stage process: \textbf{first}, aggregating backdoor representations by injecting known triggers, and \textbf{then}, performing recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) \ourmethod reduces the average Attack Success Rate to 4.41% across multiple benchmarks, outperforming existing baselines by 28.1%$\sim$69.3%$\uparrow$. (II) Clean accuracy and utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks. (III) The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.

[59] On the Entity-Level Alignment in Crosslingual Consistency

Yihong Liu, Mingyang Wang, François Yvon, Hinrich Schütze

Main category: cs.CL

TL;DR: The paper investigates crosslingual consistency in multilingual LLMs, finding that factual inconsistencies arise from entity misalignment across languages. The authors propose two methods that improve consistency by integrating English subject translations into multilingual prompts.

Details

Motivation: Multilingual LLMs often fail to maintain consistent factual knowledge across languages, and the underlying causes of these crosslingual inconsistencies are not well understood. The researchers hypothesize that entity alignment failures may be the root cause.

Method: The authors assess entity alignment through translation tasks and propose two methods: SubSub (substituting subjects with English translations) and SubInj (injecting English subject translations into prompts). They also conduct mechanistic analysis to understand how these interventions work.

Result: The study finds strong correlation between alignment and consistency across all models. The proposed methods (SubSub and SubInj) significantly improve both factual recall accuracy and crosslingual consistency by reinforcing entity representation alignment through the model’s internal pivot-language processing.

Conclusion: Entity alignment is a key factor in crosslingual consistency of factual knowledge in multilingual LLMs. The proposed interventions offer practical strategies for improving multilingual factual prediction by leveraging the model’s internal language processing mechanisms.

Abstract: Multilingual large language models (LLMs) are expected to recall factual knowledge consistently across languages. However, the factors that give rise to such crosslingual consistency – and its frequent failure – remain poorly understood. In this work, we hypothesize that these inconsistencies may arise from failures in entity alignment, the process of mapping subject and object entities into a shared conceptual space across languages. To test this, we assess alignment through entity-level (subject and object) translation tasks, and find that consistency is strongly correlated with alignment across all studied models, with misalignment of subjects or objects frequently resulting in inconsistencies. Building on this insight, we propose SubSub and SubInj, two effective methods that integrate English translations of subjects into prompts across languages, leading to substantial gains in both factual recall accuracy and consistency. Finally, our mechanistic analysis reveals that these interventions reinforce the entity representation alignment in the conceptual space through model’s internal pivot-language processing, offering effective and practical strategies for improving multilingual factual prediction.

[60] MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning

Hongwei Chen, Yishu Lei, Dan Zhang, Bo Ke, Danxiang Zhu, Xuyi Chen, Yuxiang Lu, Zhengjie Huang, Shikun Feng, Jingzhou He, Yu Sun, Hua Wu, Haifeng Wang

Main category: cs.CL

TL;DR: MatryoshkaThinking is a novel test-time scaling method that reduces computational cost by 96% while achieving state-of-the-art performance on AIME2025.

Details

Motivation: Existing test-time scaling approaches like DeepConf incur substantial computational overhead to achieve competitive results, creating a need for more efficient methods.

Method: Recursive exploitation of the model’s intrinsic capabilities in reasoning, verification, and summarization to enhance retention of correct solutions and reduce Pass@k vs Pass@1 disparity.

Result: Achieves 99.79 score on AIME2025 using only 4% of the computation required by DeepConf, with comprehensive validation across multiple models and benchmarks.

Conclusion: Provides new insights for designing efficient and scalable test-time inference strategies for advanced language models.

Abstract: Test-time scaling has emerged as a promising paradigm in language modeling, wherein additional computational resources are allocated during inference to enhance model performance. Recent approaches, such as DeepConf, have demonstrated the efficacy of this strategy, however, they often incur substantial computational overhead to achieve competitive results. In this work, we propose MatryoshkaThinking, a novel method that significantly reduces computational cost while maintaining state-of-the-art performance. Specifically, MatryoshkaThinking attains a score of 99.79 on AIME2025 using only 4% of the computation required by DeepConf. The core of our approach lies in the recursive exploitation of the model’s intrinsic capabilities in reasoning, verification, and summarization, which collectively enhance the retention of correct solutions and reduce the disparity between Pass@k and Pass@1. Comprehensive evaluations across multiple open-source models and challenging multi-modal reasoning benchmarks validate the effectiveness and generality of our method. These findings offer new insights into the design of efficient and scalable test-time inference strategies for advanced language models.

[61] Are LLMs Empathetic to All? Investigating the Influence of Multi-Demographic Personas on a Model’s Empathy

Ananya Malik, Nazanin Sabri, Melissa Karnaze, Mai Elsherief

Main category: cs.CL

TL;DR: LLMs show varying empathy levels across different demographic groups, with intersectional analysis revealing complex patterns that sometimes reverse expected empathy trends.

Details

Motivation: To investigate whether LLMs demonstrate equitable empathy across diverse user groups, considering emotional experiences are shaped by demographic and cultural contexts.

Method: Proposed a framework analyzing cognitive and affective empathy across 315 unique personas defined by intersecting demographic attributes (age, culture, gender) across four LLMs, using both quantitative and qualitative analysis.

Result: Attributes profoundly shape empathetic responses, with multiple attributes sometimes attenuating or reversing expected empathy patterns. Models broadly reflect real-world empathetic trends but show notable misalignments for certain groups like Confucian culture.

Conclusion: Designing empathy-aware LLMs that account for demographic diversity is crucial for promoting more inclusive and equitable model behavior.

Abstract: Large Language Models’ (LLMs) ability to converse naturally is empowered by their ability to empathetically understand and respond to their users. However, emotional experiences are shaped by demographic and cultural contexts. This raises an important question: Can LLMs demonstrate equitable empathy across diverse user groups? We propose a framework to investigate how LLMs’ cognitive and affective empathy vary across user personas defined by intersecting demographic attributes. Our study introduces a novel intersectional analysis spanning 315 unique personas, constructed from combinations of age, culture, and gender, across four LLMs. Results show that attributes profoundly shape a model’s empathetic responses. Interestingly, we see that adding multiple attributes at once can attenuate and reverse expected empathy patterns. We show that they broadly reflect real-world empathetic trends, with notable misalignments for certain groups, such as those from Confucian culture. We complement our quantitative findings with qualitative insights to uncover model behaviour patterns across different demographic groups. Our findings highlight the importance of designing empathy-aware LLMs that account for demographic diversity to promote more inclusive and equitable model behaviour.

[62] End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs

Nam Luu, Ondřej Bojar

Main category: cs.CL

TL;DR: Combining pre-trained speech encoders with LLMs for simultaneous ASR and speech translation, achieving better results than SeamlessM4T and matching cascaded Whisper+NLLB systems.

Details

Motivation: To explore a combined end-to-end architecture that can perform both Automatic Speech Recognition and Speech Translation simultaneously using pre-trained components.

Method: Combined end-to-end architecture using pre-trained speech encoders and Large Language Models (LLMs) for simultaneous ASR and ST tasks.

Result: Best model achieves better translation results than SeamlessM4T and matches performance of cascaded Whisper+NLLB system, with up to 8% gain in COMET-DA22 metric for English-to-German translation.

Conclusion: The combined end-to-end architecture with pre-trained speech encoders and LLMs is effective for simultaneous ASR and ST, outperforming large foundational models and matching cascaded systems.

Abstract: Speech Translation (ST) is a machine translation task that involves converting speech signals from one language to the corresponding text in another language; this task has two different approaches, namely the traditional cascade and the more recent end-to-end. This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing both Automatic Speech Recognition (ASR) and ST simultaneously. Experiments with the English-to-German language pair show that our best model not only can achieve better translation results than SeamlessM4T, a large foundational end-to-end, multi-modal translation model, but can also match the performance of a cascaded system with Whisper and NLLB, with up to a score gain of 8% in $\text{COMET}^{\text{DA}}_{22}$ metric.

[63] ASC analyzer: A Python package for measuring argument structure construction usage in English texts

Hakyung Sung, Kristopher Kyle

Main category: cs.CL

TL;DR: Introduces ASC analyzer - a Python package for automatically analyzing argument structure constructions in L2 writing and computing 50 indices to measure diversity, proportion, frequency, and verb association strength.

Details

Motivation: Argument structure constructions provide a theoretical framework for assessing L2 proficiency, but there's a lack of scalable tools to systematically measure their usage in language learning.

Method: Developed a Python package that automatically tags ASCs and computes 50 indices covering diversity, proportion, frequency, and ASC-verb lemma association strength. Conducted bivariate and multivariate analyses to examine relationships with L2 writing scores.

Result: The paper presents the ASC analyzer tool and demonstrates its utility through empirical analysis linking ASC-based indices to L2 writing proficiency scores.

Conclusion: The ASC analyzer provides a systematic and scalable solution for measuring argument structure construction usage in L2 writing assessment, filling an important gap in language proficiency analysis tools.

Abstract: Argument structure constructions (ASCs) offer a theoretically grounded lens for analyzing second language (L2) proficiency, yet scalable and systematic tools for measuring their usage remain limited. This paper introduces the ASC analyzer, a publicly available Python package designed to address this gap. The analyzer automatically tags ASCs and computes 50 indices that capture diversity, proportion, frequency, and ASC-verb lemma association strength. To demonstrate its utility, we conduct both bivariate and multivariate analyses that examine the relationship between ASC-based indices and L2 writing scores.

[64] RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models

Aashiq Muhamed, Leonardo F. R. Ribeiro, Markus Dreyer, Virginia Smith, Mona T. Diab

Main category: cs.CL

TL;DR: Language models in RAG systems struggle with selective refusal when context is flawed, with refusal accuracy dropping below 50% on multi-document tasks. The authors introduce RefusalBench, a generative methodology using 176 perturbation strategies to create diagnostic test cases for evaluating refusal capabilities.

Details

Motivation: Current RAG systems have significant failure points in selectively refusing to answer based on flawed context, which is critical for safety. Static benchmarks fail to reliably evaluate this capability as models exploit dataset artifacts and memorize test instances.

Method: Introduces RefusalBench, a generative methodology that programmatically creates diagnostic test cases through controlled linguistic perturbation. Uses 176 distinct perturbation strategies across six categories of informational uncertainty and three intensity levels.

Result: Evaluation of over 30 models reveals systematic failure patterns: refusal accuracy drops below 50% on multi-document tasks, models exhibit dangerous overconfidence or overcaution, and neither scale nor extended reasoning improves performance. Refusal comprises separable detection and categorization skills.

Conclusion: Selective refusal is a trainable, alignment-sensitive capability that offers a clear path for improvement. The authors release two benchmarks (RefusalBench-NQ and RefusalBench-GaRAGe) and the complete generation framework for continued dynamic evaluation of this critical capability.

Abstract: The ability of language models in RAG systems to selectively refuse to answer based on flawed context is critical for safety, yet remains a significant failure point. Our large-scale study reveals that even frontier models struggle in this setting, with refusal accuracy dropping below 50% on multi-document tasks, while exhibiting either dangerous overconfidence or overcaution. Static benchmarks fail to reliably evaluate this capability, as models exploit dataset-specific artifacts and memorize test instances. We introduce RefusalBench, a generative methodology that programmatically creates diagnostic test cases through controlled linguistic perturbation. Our framework employs 176 distinct perturbation strategies across six categories of informational uncertainty and three intensity levels. Evaluation of over 30 models uncovers systematic failure patterns: refusal comprises separable detection and categorization skills, and neither scale nor extended reasoning improves performance. We find that selective refusal is a trainable, alignment-sensitive capability, offering a clear path for improvement. We release two benchmarks – RefusalBench-NQ (single document) and RefusalBench-GaRAGe (multi-document) – and our complete generation framework to enable continued, dynamic evaluation of this critical capability.

[65] AssoMem: Scalable Memory QA with Multi-Signal Associative Retrieval

Kai Zhang, Xinyuan Zhang, Ejaz Ahmed, Hongda Jiang, Caleb Kumar, Kai Sun, Zhaojiang Lin, Sanat Sharma, Shereen Oraby, Aaron Colak, Ahmed Aly, Anuj Kumar, Xiaozhong Liu, Xin Luna Dong

Main category: cs.CL

TL;DR: AssoMem is a novel framework that constructs an associative memory graph for memory-augmented AI assistants, improving recall in similarity-dense QA scenarios through multi-dimensional retrieval signals and adaptive fusion.

Details

Motivation: Existing methods for memory recall in QA systems mainly rely on semantic distance, which struggles in similarity-dense scenarios. The paper is inspired by how humans link information associatively to improve memory organization and retrieval.

Method: Proposes AssoMem framework that constructs an associative memory graph anchoring dialogue utterances to automatically extracted clues. Uses multi-dimensional retrieval signals (relevance, importance, temporal alignment) with adaptive mutual information-driven fusion strategy.

Result: Extensive experiments across three benchmarks and the new MeetingQA dataset show AssoMem consistently outperforms state-of-the-art baselines, demonstrating superiority in context-aware memory recall.

Conclusion: AssoMem provides an effective solution for memory recall in QA systems by leveraging associative memory structures and multi-dimensional retrieval signals, particularly beneficial in similarity-dense scenarios.

Abstract: Accurate recall from large scale memories remains a core challenge for memory augmented AI assistants performing question answering (QA), especially in similarity dense scenarios where existing methods mainly rely on semantic distance to the query for retrieval. Inspired by how humans link information associatively, we propose AssoMem, a novel framework constructing an associative memory graph that anchors dialogue utterances to automatically extracted clues. This structure provides a rich organizational view of the conversational context and facilitates importance aware ranking. Further, AssoMem integrates multi-dimensional retrieval signals-relevance, importance, and temporal alignment using an adaptive mutual information (MI) driven fusion strategy. Extensive experiments across three benchmarks and a newly introduced dataset, MeetingQA, demonstrate that AssoMem consistently outperforms SOTA baselines, verifying its superiority in context-aware memory recall.

[66] STEAM: A Semantic-Level Knowledge Editing Framework for Large Language Models

Geunyeong Jeong, Juoh Sun, Seonghee Lee, Harksoo Kim

Main category: cs.CL

TL;DR: STEAM is a semantic-level knowledge editing framework that improves integration of updated knowledge into LLMs by aligning latent representations with semantic anchors, enhancing reasoning and coherence.

Details

Motivation: Existing knowledge editing methods focus on token-level optimization but create isolated residual streams that bypass natural reasoning, lacking semantic coherence.

Method: STEAM identifies semantic anchors for factual associations and guides internal representations toward these anchors using alignment loss during optimization.

Result: Experimental results show STEAM improves model’s reasoning ability with edited knowledge and enhances semantic coherence compared to locate-and-edit methods.

Conclusion: Latent-space alignment is crucial for reliable and coherent knowledge editing, enabling better integration of updated facts into the model’s knowledge structure.

Abstract: Large Language Models store extensive factual knowledge acquired during large-scale pre-training. However, this knowledge is inherently static, reflecting only the state of the world at the time of training. Knowledge editing has emerged as a promising solution for updating outdated or incorrect facts without full retraining. However, most existing locate-and-edit methods primarily focus on token-level likelihood optimization without addressing semantic coherence. Our analysis reveals that such edited knowledge is often encoded as isolated residual streams in the model’s latent space, distinct from pre-existing knowledge and bypassing natural reasoning process. To address this, we propose \textsc{Steam}, a semantic-level knowledge editing framework that enhances integration of updated knowledge into the model’s knowledge structure. \textsc{Steam} first identifies target representations as semantic anchors for the updated factual association, then guides the internal representation of the edited fact towards these anchors through an alignment loss during optimization. Experimental results demonstrate that \textsc{Steam} improves model’s ability to reason with edited knowledge and enhances semantic coherence, underscoring the importance of latent-space alignment for reliable and coherent knowledge editing. The code is available at https://github.com/GY-Jeong/STEAM.

[67] LONGQAEVAL: Designing Reliable Evaluations of Long-Form Clinical QA under Resource Constraints

Federica Bologna, Tiffany Pan, Matthew Wilkens, Yue Guo, Lucy Lu Wang

Main category: cs.CL

TL;DR: LongQAEval is an evaluation framework for clinical QA systems that compares coarse vs fine-grained annotation methods across correctness, relevance, and safety dimensions, finding varying IAA patterns and recommending partial sentence annotation for cost efficiency.

Details

Motivation: Evaluating long-form clinical QA systems is challenging due to resource intensity, need for medical expertise, and difficulty achieving consistent human judgments over long-form text.

Method: Introduced LongQAEval framework comparing coarse answer-level vs fine-grained sentence-level evaluation across correctness, relevance, and safety dimensions using physician annotations of 300 real patient questions answered by physicians and LLMs.

Result: IAA varies by dimension: fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance, and safety judgments remain inconsistent. Annotating only small sentence subsets provides reliability comparable to coarse annotations.

Conclusion: Partial sentence annotation can reduce cost and effort while maintaining reliability comparable to coarse evaluation methods in clinical QA assessment.

Abstract: Evaluating long-form clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise and achieving consistent human judgments over long-form text is difficult. We introduce LongQAEval, an evaluation framework and set of evaluation recommendations for limited-resource and high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and safety. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance, and judgments on safety remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.

[68] Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance

Jingyi Chen, Zhimeng Guo, Jiyun Chun, Pichao Wang, Andrew Perrault, Micha Elsner

Main category: cs.CL

TL;DR: LISTEN benchmark reveals that large audio language models (LALMs) primarily rely on lexical cues rather than acoustic information for emotion understanding, showing limited ability to process acoustic signals when lexical cues are neutral or conflicting.

Details

Motivation: To determine whether LALMs genuinely process acoustic information or mainly depend on lexical content for emotion understanding from speech, addressing the unclear relationship between lexical and acoustic processing in these models.

Method: Developed LISTEN benchmark with controlled tests to disentangle lexical reliance from acoustic sensitivity, evaluating six state-of-the-art LALMs across various conditions including neutral lexical cues, cue alignment, cue conflict, and paralinguistic settings.

Result: Models consistently showed lexical dominance - predicting ’neutral’ when lexical cues were neutral, limited performance gains with aligned cues, failure to classify distinct emotions under cue conflict, and near-chance performance in paralinguistic settings.

Conclusion: Current LALMs largely ’transcribe’ rather than ’listen,’ heavily relying on lexical semantics while underutilizing acoustic cues. LISTEN provides a principled framework for assessing emotion understanding in multimodal models.

Abstract: Understanding emotion from speech requires sensitivity to both lexical and acoustic cues. However, it remains unclear whether large audio language models (LALMs) genuinely process acoustic information or rely primarily on lexical content. We present LISTEN (Lexical vs. Acoustic Speech Test for Emotion in Narratives), a controlled benchmark designed to disentangle lexical reliance from acoustic sensitivity in emotion understanding. Across evaluations of six state-of-the-art LALMs, we observe a consistent lexical dominance. Models predict “neutral” when lexical cues are neutral or absent, show limited gains under cue alignment, and fail to classify distinct emotions under cue conflict. In paralinguistic settings, performance approaches chance. These results indicate that current LALMs largely “transcribe” rather than “listen,” relying heavily on lexical semantics while underutilizing acoustic cues. LISTEN offers a principled framework for assessing emotion understanding in multimodal models.

[69] RECON: Reasoning with Condensation for Efficient Retrieval-Augmented Generation

Zhichao Xu, Minheng Wang, Yawei Wang, Wenqian Ye, Yuntao Du, Yunpu Ma, Yijun Tian

Main category: cs.CL

TL;DR: RECON is a framework that integrates summarization into RAG systems to compress retrieved documents, reducing context length by 35% while improving performance on QA benchmarks.

Details

Motivation: Current RAG systems with RL training suffer from inefficient context management due to long, noisy retrieved documents, which increases costs and degrades performance.

Method: RECON adds an explicit summarization module trained via two-stage process: relevance pretraining on QA datasets followed by multi-aspect distillation from proprietary LLMs for factuality and clarity. Integrated into Search-R1 pipeline.

Result: Reduces total context length by 35%, improves training speed and inference latency, boosts EM score by 14.5% for 3B model and 3.0% for 7B model, with particular strength in multi-hop QA.

Conclusion: Learned context compression is essential for building practical, scalable, and performant RAG systems.

Abstract: Retrieval-augmented generation (RAG) systems trained using reinforcement learning (RL) with reasoning are hampered by inefficient context management, where long, noisy retrieved documents increase costs and degrade performance. We introduce RECON (REasoning with CONdensation), a framework that integrates an explicit summarization module to compress evidence within the reasoning loop. Our summarizer is trained via a two-stage process: relevance pretraining on QA datasets, followed by multi-aspect distillation from proprietary LLMs to ensure factuality and clarity. Integrated into the Search-R1 pipeline, RECON reduces total context length by 35%, leading to improved training speed and inference latency, while simultaneously improving RAG performance on downstream QA benchmarks. Notably, it boosts the average EM score of the 3B model by 14.5% and the 7B model by 3.0%, showing particular strength in multi-hop QA. RECON demonstrates that learned context compression is essential for building practical, scalable, and performant RAG systems. Our code implementation is made available at https://github.com/allfornancy/RECON.

[70] Steering Over-refusals Towards Safety in Retrieval Augmented Generation

Utsav Maskey, Mark Dras, Usman Naseem

Main category: cs.CL

TL;DR: The paper analyzes over-refusals in safety-aligned LLMs during RAG, showing that context contamination and domain factors cause benign queries to be refused. It introduces SafeRAG-Steering to reduce over-refusals while maintaining legitimate safety refusals.

Details

Motivation: Safety alignment in LLMs causes over-refusals where benign requests are declined due to aggressive safety filters, particularly problematic in RAG systems where context properties influence refusal behavior.

Method: Constructed RagRefuse benchmark with domain-stratified data, and introduced SafeRAG-Steering - a model-centric embedding intervention that steers embeddings toward safe, non-refusing output regions during inference.

Result: Analysis shows context arrangement, domain factors, and harmful-text density trigger refusals on benign queries, with effects varying by model alignment choices. SafeRAG-Steering reduces over-refusals in contaminated RAG pipelines.

Conclusion: SafeRAG-Steering effectively mitigates over-refusals in RAG systems while preserving legitimate safety refusals, addressing the trade-off between safety and utility in safety-aligned LLMs.

Abstract: Safety alignment in large language models (LLMs) induces over-refusals – where LLMs decline benign requests due to aggressive safety filters. We analyze this phenomenon in retrieval-augmented generation (RAG), where both the query intent and retrieved context properties influence refusal behavior. We construct RagRefuse, a domain-stratified benchmark spanning medical, chemical, and open domains, pairing benign and harmful queries with controlled context contamination patterns and sizes. Our analysis shows that context arrangement / contamination, domain of query and context, and harmful-text density trigger refusals even on benign queries, with effects depending on model-specific alignment choices. To mitigate over-refusals, we introduce \textsc{SafeRAG-Steering}, a model-centric embedding intervention that steers the embedding regions towards the confirmed safe, non-refusing output regions at inference time. This reduces over-refusals in contaminated RAG pipelines while preserving legitimate refusals.

[71] End-to-end Speech Recognition with similar length speech and text

Peng Fan, Wenping Wang, Fei Deng

Main category: cs.CL

TL;DR: The paper proposes two novel alignment methods (Time Independence Loss and Aligned Cross Entropy Loss) with frame fusion to address speech-text length mismatch in ASR, achieving significant frame reduction while improving performance.

Details

Motivation: Traditional CTC methods fail to properly align speech and text when downsampling speech to text-similar lengths, creating challenges in automatic speech recognition.

Method: Introduces Time Independence Loss (TIL) and Aligned Cross Entropy (AXE) Loss based on edit distance, combined with frame fusion that weights and sums keyframes with their context frames.

Result: Experimental results on AISHELL-1 and AISHELL-2 dataset subsets show the proposed methods outperform previous work and achieve at least 86% reduction in frame count.

Conclusion: The proposed alignment methods effectively address speech-text length mismatch in ASR while significantly reducing computational requirements through frame reduction.

Abstract: The mismatch of speech length and text length poses a challenge in automatic speech recognition (ASR). In previous research, various approaches have been employed to align text with speech, including the utilization of Connectionist Temporal Classification (CTC). In earlier work, a key frame mechanism (KFDS) was introduced, utilizing intermediate CTC outputs to guide downsampling and preserve keyframes, but traditional methods (CTC) failed to align speech and text appropriately when downsampling speech to a text-similar length. In this paper, we focus on speech recognition in those cases where the length of speech aligns closely with that of the corresponding text. To address this issue, we introduce two methods for alignment: a) Time Independence Loss (TIL) and b) Aligned Cross Entropy (AXE) Loss, which is based on edit distance. To enhance the information on keyframes, we incorporate frame fusion by applying weights and summing the keyframe with its context 2 frames. Experimental results on AISHELL-1 and AISHELL-2 dataset subsets show that the proposed methods outperform the previous work and achieve a reduction of at least 86% in the number of frames.

[72] Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?

Shaobo Wang, Cong Wang, Wenjie Fu, Yue Min, Mingquan Feng, Isabel Guan, Xuming Hu, Conghui He, Cunxiang Wang, Kexin Yang, Xingzhang Ren, Fei Huang, Dayiheng Liu, Linfeng Zhang

Main category: cs.CL

TL;DR: EssenceBench is a coarse-to-fine framework that uses genetic algorithms to compress benchmarks by eliminating redundant samples while preserving model ranking accuracy.

Details

Motivation: As benchmark suites grow larger, there's a need to reduce redundancy while maintaining evaluation accuracy. Current methods lack systematic integration for both prediction accuracy and ranking consistency.

Method: Uses sample-level redundancy analysis and frames benchmark compression as an optimization problem. Proposes EssenceBench with iterative Genetic Algorithm combining fitness-based subset search and attribution-based sample search.

Result: Achieves superior compression with lower reconstruction error and higher efficiency. On HellaSwag (10K samples), preserves all model rankings within 5% using 25x fewer samples, and 95% ranking preservation with 200x fewer samples.

Conclusion: EssenceBench effectively compresses benchmarks while maintaining ranking consistency, offering a systematic solution for large-scale benchmark evaluation.

Abstract: As the demand for comprehensive evaluations of diverse model capabilities steadily increases, benchmark suites have correspondingly grown significantly in scale. Despite notable advances in redundancy reduction and subset-level performance prediction, a systematic framework that effectively integrates these methods to ensure both prediction accuracy and ranking consistency is still largely elusive. In this paper, we first perform a sample-level analysis of benchmark redundancy and identify several highly similar samples that can be eliminated. Besides, we frame benchmark compression as an optimization problem with the aim of score reconstruction. Building on these, we then propose EssenceBench, a coarse-to-fine framework utilizing an iterative Genetic Algorithm (GA), which takes the advantages of fitness-based subset search and attribution-based sample search. Compared to previous methods, our approach yields superior compression results with lower reconstruction error and markedly higher efficiency. In particular, on the HellaSwag benchmark (10K samples), our method preserves the ranking of all models shifting within 5% using 25x fewer samples, and achieves 95% ranking preservation shifting within 5% using only 200x fewer samples.

[73] NIM: Neuro-symbolic Ideographic Metalanguage for Inclusive Communication

Prawaal Sharma, Poonam Goyal, Navneet Goyal, Vidisha Sharma

Main category: cs.CL

TL;DR: A universal ideographic metalanguage using Neuro-symbolic AI to bridge the digital divide for semi-literate populations by decomposing complex ideas into simple atomic concepts.

Details

Motivation: To address communication barriers faced by individuals with lower academic literacy in digital communication, reducing the digital divide.

Method: Combines neural-based LLMs with symbolic knowledge heuristics based on Natural Semantic Metalanguage (NSM) for semantic decomposition, using human-centric collaborative design with 200+ semi-literate participants.

Result: Achieved over 80% semantic comprehensibility, accessible learning curve, and universal adaptability for underprivileged populations with limited formal education.

Conclusion: The ideographic metalanguage system effectively serves underprivileged populations by transcending academic, linguistic, and cultural boundaries in digital communication.

Abstract: Digital communication has become the cornerstone of modern interaction, enabling rapid, accessible, and interactive exchanges. However, individuals with lower academic literacy often face significant barriers, exacerbating the “digital divide”. In this work, we introduce a novel, universal ideographic metalanguage designed as an innovative communication framework that transcends academic, linguistic, and cultural boundaries. Our approach leverages principles of Neuro-symbolic AI, combining neural-based large language models (LLMs) enriched with world knowledge and symbolic knowledge heuristics grounded in the linguistic theory of Natural Semantic Metalanguage (NSM). This enables the semantic decomposition of complex ideas into simpler, atomic concepts. Adopting a human-centric, collaborative methodology, we engaged over 200 semi-literate participants in defining the problem, selecting ideographs, and validating the system. With over 80% semantic comprehensibility, an accessible learning curve, and universal adaptability, our system effectively serves underprivileged populations with limited formal education.

[74] FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth

Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen, Samson Yu, Tianyi Zhang, Chang Liu, Xiangyang Ji, Dianbo Liu

Main category: cs.CL

TL;DR: FML-bench is a benchmark for evaluating automatic machine learning research agents on 8 diverse fundamental ML problems, addressing limitations of existing benchmarks by reducing coding burden, emphasizing fundamental research over applications, and providing comprehensive evaluation metrics.

Details

Motivation: Existing benchmarks for ML research agents overemphasize engineering aspects and neglect academic rigor, lack task diversity, focus too much on application-oriented tasks, and have limited scalability to realistic research settings.

Method: Developed FML-bench with 8 diverse fundamental ML research problems, reduced coding burden, and created a unified evaluation framework with five complementary metrics to comprehensively assess agent performance.

Result: Evaluation of state-of-the-art agents showed that those employing broad research exploration strategies outperformed those focusing on narrow but deep exploration, suggesting breadth of exploration leads to more effective research outcomes.

Conclusion: FML-bench provides a comprehensive benchmark for evaluating ML research agents, emphasizing fundamental problems and diverse exploration strategies, with findings suggesting broad exploration approaches are more effective than incremental refinement.

Abstract: Large language models (LLMs) have sparked growing interest in automatic machine learning research agents. Among them, agents capable of autonomously proposing ideas and conducting machine learning experiments are particularly promising, as they maximize research automation and accelerate scientific progress by iteratively refining ideas based on experimental results. However, comprehensively evaluating such agents remains challenging. Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor, creating barriers that obscure a clear assessment of an agent’s scientific capabilities in machine learning research. They also suffer from limited task diversity, an overemphasis on application-oriented tasks over fundamental research problems, and limited scalability to realistic research settings. To address these limitations, we introduce FML-bench, a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems. It reduces coding burden, emphasizes fundamental problems rather than specific use cases, offers high task diversity, and is extensible to real-world machine learning GitHub repositories. Furthermore, we present a unified evaluation framework with five complementary metrics, designed to comprehensively assess agent performance on our benchmark. We evaluate state-of-the-art automatic research agents on FML-bench, and find that agents employing broad research exploration strategies outperform those focusing on narrow but deep exploration. These findings suggest that emphasizing the breadth of exploration may lead to more effective research outcomes than focusing solely on incremental refinement. Our benchmark is available at https://github.com/qrzou/FML-bench.

[75] When or What? Understanding Consumer Engagement on Digital Platforms

Jingyi Wu, Junying Liang

Main category: cs.CL

TL;DR: This study analyzes TED Talks using LDA modeling to examine mismatches between creator content themes and audience engagement preferences, finding that timing has stronger influence on popularity than content features.

Details

Motivation: To understand what drives popularity in digital services, challenging the dominant assumption that content features are primary drivers and addressing creators' frequent misjudgment of audience preferences.

Method: Applied Latent Dirichlet Allocation (LDA) modeling to a large corpus of TED Talks, comparing thematic supply from creators with audience engagement demand, and conducting longitudinal analysis of temporal dynamics.

Result: Identified persistent mismatches between producer offerings and consumer preferences, with temporal dynamics exerting stronger influence on engagement than thematic content - when content is delivered matters more than what is delivered.

Conclusion: Challenges content-centric assumptions about popularity, highlighting the importance of timing and contextual factors in shaping consumer responses, with practical implications for marketers, platform managers, and content creators.

Abstract: Understanding what drives popularity is critical in today’s digital service economy, where content creators compete for consumer attention. Prior studies have primarily emphasized the role of content features, yet creators often misjudge what audiences actually value. This study applies Latent Dirichlet Allocation (LDA) modeling to a large corpus of TED Talks, treating the platform as a case of digital service provision in which creators (speakers) and consumers (audiences) interact. By comparing the thematic supply of creators with the demand expressed in audience engagement, we identify persistent mismatches between producer offerings and consumer preferences. Our longitudinal analysis further reveals that temporal dynamics exert a stronger influence on consumer engagement than thematic content, suggesting that when content is delivered may matter more than what is delivered. These findings challenge the dominant assumption that content features are the primary drivers of popularity and highlight the importance of timing and contextual factors in shaping consumer responses. The results provide new insights into consumer attention dynamics on digital platforms and carry practical implications for marketers, platform managers, and content creators seeking to optimize audience engagement strategies.

[76] Assessing Large Language Models for Structured Medical Order Extraction

A H M Rezaul Karim, Ozlem Uzuner

Main category: cs.CL

TL;DR: The paper presents a medical order extraction system using a general-purpose LLaMA-4 17B model with few-shot learning, achieving 5th place in the MEDIQA-OE 2025 shared task without domain-specific fine-tuning.

Details

Motivation: Medical order extraction is crucial for structuring clinical information to support decision-making and enable downstream applications like documentation and workflow automation in healthcare settings.

Method: Used a general-purpose instruction-tuned LLaMA-4 17B model with single in-context example (few-shot learning) without domain-specific fine-tuning, focusing on effective prompt engineering.

Result: Ranked 5th among 17 teams with 105 submissions, achieving average F1 score of 37.76 with notable improvements in reason and provenance accuracy.

Conclusion: Large non-domain-specific LLMs with effective prompt engineering can serve as strong, scalable baselines for specialized clinical NLP tasks, demonstrating competitive performance without domain adaptation.

Abstract: Medical order extraction is essential for structuring actionable clinical information, supporting decision-making, and enabling downstream applications such as documentation and workflow automation. Orders may be embedded in diverse sources, including electronic health records, discharge summaries, and multi-turn doctor-patient dialogues, and can span categories such as medications, laboratory tests, imaging studies, and follow-up actions. The MEDIQA-OE 2025 shared task focuses on extracting structured medical orders from extended conversational transcripts, requiring the identification of order type, description, reason, and provenance. We present the MasonNLP submission, which ranked 5th among 17 participating teams with 105 total submissions. Our approach uses a general-purpose, instruction-tuned LLaMA-4 17B model without domain-specific fine-tuning, guided by a single in-context example. This few-shot configuration achieved an average F1 score of 37.76, with notable improvements in reason and provenance accuracy. These results demonstrate that large, non-domain-specific LLMs, when paired with effective prompt engineering, can serve as strong, scalable baselines for specialized clinical NLP tasks.

[77] UltraLLaDA: Scaling the Context Length to 128K for Diffusion Large Language Models

Guangxin He, Shen Nie, Fengqi Zhu, Yuankang Zhao, Tianyi Bai, Ran Yan, Jie Fu, Chongxuan Li, Binhang Yuan

Main category: cs.CL

TL;DR: This paper presents UltraLLaDA, a diffusion LLM with 128K-token context window achieved through post-training techniques, specifically modifying RoPE embeddings to accommodate diffusion probabilistic modeling.

Details

Motivation: Diffusion LLMs show great potential but their long-context behavior remains largely unexplored, creating a need for efficient methods to extend context windows without full retraining.

Method: Simple modification to standard Rotary Positional Embeddings (RoPE) extension to accommodate diffusion probabilistic modeling, plus comparison of masking strategies during post-training.

Result: UltraLLaDA significantly outperforms training-free baselines on long-context tasks, achieving stable scaling to 128K-token context window.

Conclusion: Special positional extension is key for scaling diffusion LLMs to extended contexts, providing practical guidance for efficient 128K-scale context via post-training.

Abstract: Diffusion LLMs have attracted growing interest, with plenty of recent work emphasizing their great potential in various downstream tasks; yet the long-context behavior of diffusion LLMs remains largely uncharted. We present a case study of post-training techniques for extending the context window of diffusion LLMs (i.e., LLaDA) without retraining from scratch. We show that a simple modification to the standard Rotary Positional Embeddings (RoPE) extension effectively accommodates the probabilistic modeling inherent in the diffusion process, enabling stable scaling to longer context ranges. We further compare masking strategies used during post-training and analyze their impact on optimization stability and long-range recall. Instantiating these insights, we introduce UltraLLaDA, a diffusion LLM with a 128K-token context window that, in our empirical evaluation on long-context tasks, significantly outperforms training-free baselines. Our experimental results highlight the special positional extension as a key lever for scaling diffusion LLMs to extended contexts and offer practical guidance for practitioners seeking 128K-scale context via efficient post-training.

[78] VOLTAGE: A Versatile Contrastive Learning based OCR Methodology for ultra low-resource scripts through Auto Glyph Feature Extraction

Prawaal Sharma, Poonam Goyal, Vidisha Sharma, Navneet Goyal

Main category: cs.CL

TL;DR: VOLTAGE is a contrastive learning-based OCR methodology for low-resource languages that uses auto-glyph feature recommendation for cluster-based labeling and data augmentation to achieve high accuracy on endangered scripts like Takri.

Details

Motivation: 2500 of 7000 languages are endangered, leading to loss of traditional wisdom and community essence. Low-resource languages face extinction risk due to lack of unsupervised OCR methodologies, impeding digital inclusion.

Method: Contrastive learning-based OCR with auto-glyph feature recommendation for cluster-based labeling, augmented with image transformations and Generative Adversarial Networks for data diversity and volume.

Result: Achieved 95% accuracy for machine printed and 87% for handwritten samples on Takri script. Demonstrated universal behavior across Indic scripts (both low and high resource).

Conclusion: VOLTAGE successfully addresses OCR challenges for low-resource languages, enabling digital inclusion and preservation of endangered scripts through unsupervised methodology with high accuracy.

Abstract: UNESCO has classified 2500 out of 7000 languages spoken worldwide as endangered. Attrition of a language leads to loss of traditional wisdom, folk literature, and the essence of the community that uses it. It is therefore imperative to bring digital inclusion to these languages and avoid its extinction. Low resource languages are at a greater risk of extinction. Lack of unsupervised Optical Character Recognition(OCR) methodologies for low resource languages is one of the reasons impeding their digital inclusion. We propose VOLTAGE - a contrastive learning based OCR methodology, leveraging auto-glyph feature recommendation for cluster-based labelling. We augment the labelled data for diversity and volume using image transformations and Generative Adversarial Networks. Voltage has been designed using Takri - a family of scripts used in 16th to 20th century in the Himalayan regions of India. We present results for Takri along with other Indic scripts (both low and high resource) to substantiate the universal behavior of the methodology. An accuracy of 95% for machine printed and 87% for handwritten samples on Takri script has been achieved. We conduct baseline and ablation studies along with building downstream use cases for Takri, demonstrating the usefulness of our work.

[79] Merlin’s Whisper: Enabling Efficient Reasoning in LLMs via Black-box Adversarial Prompting

Heming Xia, Cunxiao Du, Rui Li, Chak Tou Leong, Yongqi Li, Wenjie Li

Main category: cs.CL

TL;DR: AdvPrompt reduces computational overhead in large reasoning models by generating adversarial prompts that elicit concise responses without sacrificing accuracy.

Details

Motivation: Large reasoning models incur substantial computational and latency overheads due to lengthy reasoning processes, hindering practical deployment.

Method: AdvPrompt uses black-box adversarial prompting with iterative refinement to generate high-quality prompts that elicit concise responses from both open-source and closed-source models.

Result: Achieves 3x reduction in response length on GSM8K for Qwen3, ~40% average token reduction across four benchmarks, and 35-47% reduction for closed-source APIs on MATH-500.

Conclusion: Black-box adversarial prompting is a practical and effective strategy for enhancing LRM efficiency across various model scales and families.

Abstract: Large reasoning models (LRMs) have demonstrated remarkable proficiency in tackling complex reasoning tasks through step-by-step thinking. However, such a lengthy reasoning process incurs substantial computational and latency overheads, hindering the practical deployment of these models. In this work, we present a new perspective on mitigating overthinking in LRMs via black-box adversarial prompting. By treating both open-source LRMs and closed-source APIs as black-box communicators, we investigate how to elicit concise responses without sacrificing accuracy. We introduce AdvPrompt, an iterative refinement framework that generates high-quality adversarial prompts from diverse perspectives. Experiments across multiple benchmarks demonstrate that AdvPrompt consistently reduces token usage while preserving performance. Notably, AdvPrompt achieves a 3x reduction in average response length on simple GSM8K questions for the Qwen3 model series, and delivers an average ~40% token reduction across four benchmarks. For closed-source APIs, AdvPrompt reduces token usage on MATH-500 by 35% for Claude-3.7 and 47% for Gemini-2.5. Further analysis reveals the generalizability of AdvPrompt across various model scales and families, underscoring the potential of black-box prompting as a practical and effective strategy for enhancing LRM efficiency.

[80] Detecting Hallucinations in Authentic LLM-Human Interactions

Yujie Ren, Niklas Gruhlke, Anne Lauscher

Main category: cs.CL

TL;DR: AuthenHallu is the first hallucination detection benchmark built from authentic LLM-human interactions, revealing 31.4% hallucination rate overall and 60.0% in challenging domains like Math.

Details

Motivation: Existing hallucination benchmarks are artificially constructed and fail to capture real-world hallucination characteristics, limiting their practical utility.

Method: Created benchmark by selecting and annotating samples from genuine LLM-human dialogues, providing faithful reflection of real-world hallucinations.

Result: Hallucinations occur in 31.4% of query-response pairs overall, increasing to 60.0% in Math & Number Problems. LLMs show promise but insufficient performance as hallucination detectors.

Conclusion: Authentic benchmarks are crucial for understanding real-world LLM hallucinations, and current LLM-based detection methods need improvement for practical deployment.

Abstract: As large language models (LLMs) are increasingly applied in sensitive domains such as medicine and law, hallucination detection has become a critical task. Although numerous benchmarks have been proposed to advance research in this area, most of them are artificially constructed–either through deliberate hallucination induction or simulated interactions–rather than derived from genuine LLM-human dialogues. Consequently, these benchmarks fail to fully capture the characteristics of hallucinations that occur in real-world usage. To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM-human interactions. For AuthenHallu, we select and annotate samples from genuine LLM-human dialogues, thereby providing a faithful reflection of how LLMs hallucinate in everyday user interactions. Statistical analysis shows that hallucinations occur in 31.4% of the query-response pairs in our benchmark, and this proportion increases dramatically to 60.0% in challenging domains such as Math & Number Problems. Furthermore, we explore the potential of using vanilla LLMs themselves as hallucination detectors and find that, despite some promise, their current performance remains insufficient in real-world scenarios.

[81] BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices

Euhid Aman, Esteban Carlin, Hsing-Kuo Pao, Giovanni Beltrame, Ghaluh Indah Permata Sari, Yie-Tarng Chen

Main category: cs.CL

TL;DR: BitMar is a quantized multimodal transformer that uses 1.58-bit encoders and external episodic memory for efficient image-text generation on edge devices, achieving competitive performance with low latency and small footprint.

Details

Motivation: To enable deployment of multimodal vision-language models on edge devices by addressing the challenges of extensive full-precision backbones and lack of aggressive quantization in memory-augmented architectures.

Method: Uses 1.58-bit encoders (BitNet-style for text, DiNOv2-based for vision) to create compact embeddings, combines them to query a fixed-size key-value episodic memory, and employs BitNet decoder with per-layer conditioning and sliding-window attention for long/streaming inputs.

Result: Achieves competitive captioning and multimodal understanding performance with low latency and small model footprint, demonstrating strong quality-speed trade-off suitable for edge deployment.

Conclusion: BitMar is well-suited for edge deployment due to its efficient quantization, episodic memory architecture, and attention mechanisms that enable effective multimodal generation under tight memory constraints.

Abstract: Cross-attention transformers and other multimodal vision-language models excel at grounding and generation; however, their extensive, full-precision backbones make it challenging to deploy them on edge devices. Memory-augmented architectures enhance the utilization of past context; however, most works rarely pair them with aggressive edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer that proposes an external human-like episodic memory for effective image-text generation on hardware with limited resources. BitMar utilizes 1.58-bit encoders, one for text (BitNet-style) and one for vision (DiNOv2-based), to create compact embeddings that are combined and used to query a fixed-size key-value episodic memory. During vector retrieval, the BitNet decoder applies per-layer conditioning, which increases the contextual relevance of generated content. The decoder also employs attention sinks with a sliding-window mechanism to process long or streaming inputs under tight memory budgets. The combination of per-layer conditioning and sliding-window attention achieves a strong quality-speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well-suited for edge deployment.

[82] Mission Impossible: Feedback-Guided Dynamic Interactive Planning for Improving Reasoning on LLMs

Dong Yan, Gaochen Wu, Bowen Zhou

Main category: cs.CL

TL;DR: FGDIP is a dynamic planning framework that enhances LLM reasoning in open-domain multi-hop tasks using feedback-guided adaptive strategies and depth-first search with node generation.

Details

Motivation: Existing language agents struggle with open-domain multi-hop reasoning due to reliance on fixed action sequences and inability to handle massive information retrieval requirements.

Method: Identifies key entities as initial nodes, generates reasoning child nodes refined through historical error analysis and real-time feedback, integrates depth-first search with innovative node generation, and dynamically adjusts strategies based on error paths and concurrent nodes.

Result: Achieved 54.47% F1 score on HotpotQA and 70.05% on StrategyQA, surpassing best baselines by 5.03% and 7.25% respectively.

Conclusion: FGDIP demonstrates versatility and potential to enhance language agents in multi-hop reasoning tasks through dynamic and adaptive reasoning strategies.

Abstract: Recent advancements in language agents have led to significant improvements in multi-hop reasoning tasks. However, existing approaches often struggle with handling open-domain problems, which require massive information retrieval due to their reliance on a fixed sequence of actions. To address this, we propose Feedback-Guided Dynamic Interactive Planning (FGDIP), a novel framework tailored to enhance reasoning in LLMs by utilizing dynamic and adaptive strategies for information exploration in open-domain multi-hop reasoning tasks. Our approach begins by identifying key entities relevant to the problem, which serve as the initial nodes in the reasoning process. From these initial nodes, we then generate reasoning child nodes with the process being refined through a combination of historical error analysis and real-time feedback, which allows the framework to dynamically adjust and optimize its reasoning strategies. By integrating depth-first search with an innovative node generation technique, our framework adapts based on both prior error paths and concurrently generated nodes at the same hierarchical level. This dynamic strategy effectively expands the search space while ensuring the reasoning process systematically converges toward accurate solutions. Experimental results show that FGDIP achieved up to 54.47% F1 score on the HotpotQA dataset and 70.05% on the StrategyQA dataset, surpassing the best baseline by 5.03% and 7.25% respectively, highlighting its versatility and potential to enhance language agents in multi-hop reasoning tasks.

[83] Dynamic Topic Evolution with Temporal Decay and Attention in Large Language Models

Di Wu abd Shuaidong Pan

Main category: cs.CL

TL;DR: A dynamic topic evolution framework using temporal LLMs with time-aware embeddings and state transitions to model topic changes over time.

Details

Motivation: To systematically understand dynamic semantic patterns in large-scale text by modeling how topics evolve, expand, and decline over different time periods.

Method: Uses LLM contextual embeddings with temporal decay functions and attention mechanisms to adjust semantic importance by time intervals, then maps to latent topic space with state transition matrices for dynamic evolution modeling.

Result: Outperforms existing models across multiple metrics, effectively captures topic generation, expansion, and decline, and improves topic coherence, diversity, stability, and interpretability.

Conclusion: Provides a systematic solution for dynamic topic modeling that enriches topic modeling research and supports complex text analysis across multiple domains.

Abstract: This paper proposes a modeling framework for dynamic topic evolution based on temporal large language models. The method first uses a large language model to obtain contextual embeddings of text and then introduces a temporal decay function and an attention mechanism. These components allow the model to adjust the importance of semantic units according to time intervals and capture topic variations across different periods. The temporal representations are then mapped into a latent topic space, where a state transition matrix is applied to describe the dynamic evolution of topics. A joint optimization objective constrains both semantic modeling and temporal consistency, ensuring diversity and smoothness in topic generation. The design emphasizes the unified modeling of semantic representation and temporal evolution, which improves topic coherence and diversity while enhancing stability and interpretability over time. Experiments on real-world corpora show that the framework effectively captures the generation, expansion, and decline of topics and outperforms existing models across multiple metrics. Overall, the proposed method provides a systematic solution for understanding dynamic semantic patterns in large-scale text, enriches the research paradigm of topic modeling, and supports complex text analysis tasks in multiple domains.

[84] Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization

Bowei He, Lihao Yin, Huiling Zhen, Shuqi Liu, Han Wu, Xiaokun Zhang, Mingxuan Yuan, Chen Ma

Main category: cs.CL

TL;DR: This paper analyzes how calibration data impacts LLM capabilities after post-training compression, finding that activation space representativeness and diversity determine calibration quality, and proposes a data curation framework to preserve critical LLM capabilities.

Details

Motivation: To systematically examine how calibration data affects different LLM capabilities after compression, particularly focusing on compositional properties, domain correspondence, and high-level reasoning tasks like math and code generation.

Method: Analyzed calibration data impacts from activation pattern perspective, explored underlying mechanisms, and proposed a calibration data curation framework based on activation space representativeness and diversity.

Result: Found that representativeness and diversity in activation space fundamentally determine calibration data quality, and the proposed framework enhances performance of existing compression methods in preserving critical LLM capabilities.

Conclusion: Calibration data’s impact on compressed LLM capabilities is determined by activation space properties, and systematic data curation can significantly improve compression method performance on preserving high-level reasoning abilities.

Abstract: Post-training compression has been a widely employed approach to scale down large language model (LLM) and facilitate efficient inference. In various proposed compression methods, including pruning and quantization, calibration data plays a vital role by informing the weight importance and activation dynamic ranges. However, how calibration data impacts the LLM capability after compression is less explored. Few of the existing works, though recognizing the significance of this study, only investigate the language modeling or commonsense reasoning performance degradation from limited angles, like the data sources or sample amounts. More systematic research is still needed to examine the impacts on different LLM capabilities in terms of compositional properties and domain correspondence of calibration data. In this work, we aim at bridging this gap and further analyze underlying influencing mechanisms from the activation pattern perspective. Especially, we explore the calibration data’s impacts on high-level complex reasoning capabilities, like math problem solving and code generation. Delving into the underlying mechanism, we find that the representativeness and diversity in activation space more fundamentally determine the quality of calibration data. Finally, we propose a calibration data curation framework based on such observations and analysis, enhancing the performance of existing post-training compression methods on preserving critical LLM capabilities. Our code is provided in \href{https://github.com/BokwaiHo/COLA.git}{Link}.

[85] FactAppeal: Identifying Epistemic Factual Appeals in News Media

Guy Mor-Lan, Tamir Sheafer, Shaul R. Shenhav

Main category: cs.CL

TL;DR: The paper introduces Epistemic Appeal Identification task and FactAppeal dataset for analyzing how factual claims are anchored by external sources or evidence in news sentences.

Details

Motivation: To understand how factual claims are made credible by identifying whether and how they are anchored by external sources or evidence, going beyond simple claim detection and verification.

Method: Created FactAppeal dataset with 3,226 manually annotated English news sentences containing span-level annotations for factual statements and source mentions, including fine-grained characteristics like source types, naming, roles, credentials, and attribution methods.

Result: The best performing model (Gemma 2 9B) achieved a macro-F1 score of 0.73 for the Epistemic Appeal Identification task.

Conclusion: The paper successfully establishes a new task for analyzing epistemic structures in factual claims and provides a comprehensive dataset that enables fine-grained analysis of how evidence and sources are used to support claims.

Abstract: How is a factual claim made credible? We propose the novel task of Epistemic Appeal Identification, which identifies whether and how factual statements have been anchored by external sources or evidence. To advance research on this task, we present FactAppeal, a manually annotated dataset of 3,226 English-language news sentences. Unlike prior resources that focus solely on claim detection and verification, FactAppeal identifies the nuanced epistemic structures and evidentiary basis underlying these claims and used to support them. FactAppeal contains span-level annotations which identify factual statements and mentions of sources on which they rely. Moreover, the annotations include fine-grained characteristics of factual appeals such as the type of source (e.g. Active Participant, Witness, Expert, Direct Evidence), whether it is mentioned by name, mentions of the source’s role and epistemic credentials, attribution to the source via direct or indirect quotation, and other features. We model the task with a range of encoder models and generative decoder models in the 2B-9B parameter range. Our best performing model, based on Gemma 2 9B, achieves a macro-F1 score of 0.73.

[86] You’re Not Gonna Believe This: A Computational Analysis of Factual Appeals and Sourcing in Partisan News

Guy Mor-Lan, Tamir Sheafer, Shaul R. Shenhav

Main category: cs.CL

TL;DR: This paper analyzes epistemic strategies in factual reporting by comparing CNN and Fox News using article matching and FactAppeal framework on 470K+ articles from COVID-19 and Israel-Hamas war coverage.

Details

Motivation: While media bias is widely studied, the epistemic strategies behind factual reporting remain computationally underexplored, particularly how partisan outlets use different methods to construct reality.

Method: Used article matching strategy to compare reports on same events and applied FactAppeal framework to a corpus of over 470K articles covering COVID-19 pandemic and Israel-Hamas war.

Result: CNN’s reporting contains more factual statements and is more likely to ground them in external sources. CNN builds credibility by citing Experts and Expert Documents (appeal to formal authority), while Fox News favors News Reports and direct quotations.

Conclusion: This work quantifies how partisan outlets use systematically different epistemic strategies to construct reality, adding a new dimension to the study of media bias.

Abstract: While media bias is widely studied, the epistemic strategies behind factual reporting remain computationally underexplored. This paper analyzes these strategies through a large-scale comparison of CNN and Fox News. To isolate reporting style from topic selection, we employ an article matching strategy to compare reports on the same events and apply the FactAppeal framework to a corpus of over 470K articles covering two highly politicized periods: the COVID-19 pandemic and the Israel-Hamas war. We find that CNN’s reporting contains more factual statements and is more likely to ground them in external sources. The outlets also exhibit sharply divergent sourcing patterns: CNN builds credibility by citing Experts} and Expert Documents, constructing an appeal to formal authority, whereas Fox News favors News Reports and direct quotations. This work quantifies how partisan outlets use systematically different epistemic strategies to construct reality, adding a new dimension to the study of media bias.

[87] AGENTIQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation

Omid Reza Heidari, Siobhan Reid, Yassine Yaakoubi

Main category: cs.CL

TL;DR: AGENTIQL is an agent-inspired multi-expert framework for text-to-SQL that uses reasoning and coding agents for question decomposition and sub-query generation, with an adaptive router to balance efficiency and accuracy.

Details

Motivation: Monolithic LLM architectures struggle with complex reasoning and schema diversity in text-to-SQL tasks, requiring a more modular and interpretable approach.

Method: Multi-expert framework with reasoning agent for question decomposition, coding agent for sub-query generation, refinement step for column selection, and adaptive router to choose between modular pipeline and baseline parser. Supports parallel execution for scalability.

Result: Achieves 86.07% EX on Spider benchmark using 14B models with Planner&Executor merging strategy, narrowing gap to GPT-4-based SOTA (89.65% EX) while using smaller open-source LLMs.

Conclusion: AGENTIQL provides a robust, scalable, and interpretable approach to semantic parsing that enhances both accuracy and transparency through intermediate reasoning steps.

Abstract: LLMs have advanced text-to-SQL generation, yet monolithic architectures struggle with complex reasoning and schema diversity. We propose AGENTIQL, an agent-inspired multi-expert framework that combines a reasoning agent for question decomposition, a coding agent for sub-query generation, and a refinement step for column selection. An adaptive router further balances efficiency and accuracy by selecting between our modular pipeline and a baseline parser. Several steps in the pipeline can be executed in parallel, making the framework scalable to larger workloads. Evaluated on the Spider benchmark, AGENTIQL improves execution accuracy and interpretability and achieves up to 86.07% EX with 14B models using the Planner&Executor merging strategy. The attained performance is contingent upon the efficacy of the routing mechanism, thereby narrowing the gap to GPT-4-based SOTA (89.65% EX) while using much smaller open-source LLMs. Beyond accuracy, AGENTIQL enhances transparency by exposing intermediate reasoning steps, offering a robust, scalable, and interpretable approach to semantic parsing.

[88] BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions

Zhengbo Zhang, Zhiheng Lyu, Junhao Gong, Hongzhu Yi, Xinming Wang, Yuxuan Zhou, Jiabing Yang, Ping Nie, Yan Huang, Wenhu Chen

Main category: cs.CL

TL;DR: BrowserAgent is an interactive web agent that uses human-inspired browser actions to solve complex web tasks, achieving competitive performance with less training data than previous methods.

Details

Motivation: Current web agents rely on converting web environments to static text, which contrasts with human browsing behaviors involving diverse interactions like scrolling, clicking, and typing.

Method: Two-stage training (SFT and RFT) with predefined browser actions operating directly on raw web pages via Playwright, plus an explicit memory mechanism for storing key conclusions across steps.

Result: BrowserAgent-7B achieves around 20% improvement over Search-R1 on multi-hop QA tasks like HotpotQA, 2Wiki, and Bamboogle, despite using significantly less training data.

Conclusion: BrowserAgent serves as a more advanced framework for interactive and scalable web agents by mimicking human browsing behaviors more closely.

Abstract: Efficiently solving real-world problems with LLMs increasingly hinges on their ability to interact with dynamic web environments and autonomously acquire external information. While recent research like Search-R1 and WebDancer demonstrates strong performance in solving web tasks, they heavily rely on additional tools to convert the interactive web environment into static text content. This is in contrast to human browsing behaviors, which involve diverse interactions with the browser, such as scrolling, clicking, and typing. In this paper, we propose BrowserAgent, a more interactive agent that solves complex tasks through human-inspired browser actions. BrowserAgent operates directly on raw web pages via Playwright through a set of predefined browser actions. We adopt a two-stage training (Supervised Fine-Tuning (SFT) and Rejection Fine-Tuning (RFT)) to improve the model’s generalization abilities. Despite using significantly less training data than Search-R1, BrowserAgent achieves more competitive results across different Open-QA tasks. Additionally, we introduce an explicit memory mechanism to store key conclusions across steps, further enhancing the model’s reasoning capabilities for long-horizon tasks. Notably, BrowserAgent-7B can achieve around 20% improvement over Search-R1 on multi-hop QA tasks like HotpotQA, 2Wiki, and Bamboogle. These results indicate that BrowserAgent can serve as a more advanced framework for more interactive and scalable web agents.

[89] Unlocking LLM Safeguards for Low-Resource Languages via Reasoning and Alignment with Minimal Training Data

Zhuowei Chen, Bowei Zhang, Nankai Lin, Tian Hou, Lianxi Wang

Main category: cs.CL

TL;DR: ConsistentGuard is a reasoning-based multilingual safeguard for LLMs that improves interpretability and cross-lingual knowledge transfer, achieving superior performance with minimal training data across multiple languages.

Details

Motivation: Existing LLM safeguards rely on classifier-based methods that lack interpretability and perform poorly on low-resource languages, creating security risks from malicious requests.

Method: Proposed ConsistentGuard uses reasoning-based approach with alignment to enhance explainability and boost knowledge transfer between languages, requiring only 1,000 training samples.

Result: Outperforms larger models trained with more data on three datasets across six languages, demonstrating strong interpretability and generalization ability.

Conclusion: The method provides an effective multilingual safeguard solution with minimal data requirements and contributes a benchmark extension and code release for future research.

Abstract: Recent advances in LLMs have enhanced AI capabilities, but also increased the risk posed by malicious requests, highlighting the need for effective LLM safeguards to detect such queries. Existing approaches largely rely on classifier-based methods that lack interpretability and perform poorly on low-resource languages. To address these limitations, we propose ConsistentGuard, a novel reasoning-based multilingual safeguard, which enhances explainability via reasoning and boosts knowledge transfer between languages through alignment. With only 1,000 training samples, our method demonstrates superior performance on three datasets across six languages, outperforming larger models trained with significantly more data, and exhibits strong interpretability and generalization ability. We also contribute a multilingual benchmark extension and release our codes to support future research.

[90] RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

Zichun Yu, Chenyan Xiong

Main category: cs.CL

TL;DR: RePro is a web recycling method that trains a small LM with reinforcement learning to generate high-quality rephrasings of pretraining data, improving data efficiency by 2-3x and delivering 4.7%-14.0% accuracy gains over organic-only baselines.

Details

Motivation: High-quality pretraining data is becoming scarce for frontier LLMs, creating a need for methods to effectively recycle and reuse existing web data.

Method: Trains a 4B LM with reinforcement learning using one quality reward and three faithfulness rewards to generate effective rephrasings of pretraining data while maintaining core semantics and structure.

Result: Achieves 4.7%-14.0% relative accuracy gains on 22 downstream tasks, outperforms state-of-the-art prompting-based methods, and improves organic data efficiency by 2-3x.

Conclusion: RePro provides an efficient and controllable path to effectively harness pretraining data, preserving critical information and faithfully reflecting organic data characteristics better than prompting-based approaches.

Abstract: High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at https://github.com/cxcscmu/RePro.

[91] Sarcasm Detection Using Deep Convolutional Neural Networks: A Modular Deep Learning Framework

Manas Zambre, Sarika Bobade

Main category: cs.CL

TL;DR: A modular deep learning framework using DCNNs and BERT for sarcasm detection in text by analyzing linguistic, emotional, and contextual cues.

Details

Motivation: Sarcasm is often misinterpreted in text due to absence of tone and body language, creating need for automated detection systems.

Method: Modular framework integrating sentiment analysis, contextual embeddings, linguistic feature extraction, and emotion detection through multi-layer architecture using DCNNs and BERT.

Result: The model is in conceptual stage but demonstrates feasibility for real-world applications like chatbots and social media analysis.

Conclusion: The proposed modular deep learning approach shows promise for effective sarcasm detection in text-based communication systems.

Abstract: Sarcasm is a nuanced and often misinterpreted form of communication, especially in text, where tone and body language are absent. This paper proposes a modular deep learning framework for sarcasm detection, leveraging Deep Convolutional Neural Networks (DCNNs) and contextual models such as BERT to analyze linguistic, emotional, and contextual cues. The system integrates sentiment analysis, contextual embeddings, linguistic feature extraction, and emotion detection through a multi-layer architecture. While the model is in the conceptual stage, it demonstrates feasibility for real-world applications such as chatbots and social media analysis.

[92] Large Language Models for Full-Text Methods Assessment: A Case Study on Mediation Analysis

Wenqing Zhang, Trang Nguyen, Elizabeth A. Stuart, Yiqun T. Chen

Main category: cs.CL

TL;DR: LLMs can automate methodological assessments in systematic reviews with near-human accuracy for straightforward criteria, but struggle with complex inference tasks requiring human oversight.

Details

Motivation: Systematic reviews are labor-intensive, especially for methodological information extraction. LLMs offer potential to automate this process and transform evidence synthesis.

Method: Benchmarked state-of-the-art LLMs against expert human reviewers across 180 full-text scientific articles using causal mediation analysis as a representative methodological domain.

Result: Model performance closely correlated with human judgments (accuracy correlation 0.71; F1 correlation 0.97), achieving near-human accuracy on straightforward criteria but declining sharply on complex assessments (lagging experts by up to 15%). Errors resulted from superficial linguistic cues and longer documents yielded lower accuracy.

Conclusion: Current LLMs excel at identifying explicit methodological features but require human oversight for nuanced interpretations. Integrating automated extraction with expert review provides a promising approach to enhance efficiency and rigor in evidence synthesis.

Abstract: Systematic reviews are crucial for synthesizing scientific evidence but remain labor-intensive, especially when extracting detailed methodological information. Large language models (LLMs) offer potential for automating methodological assessments, promising to transform evidence synthesis. Here, using causal mediation analysis as a representative methodological domain, we benchmarked state-of-the-art LLMs against expert human reviewers across 180 full-text scientific articles. Model performance closely correlated with human judgments (accuracy correlation 0.71; F1 correlation 0.97), achieving near-human accuracy on straightforward, explicitly stated methodological criteria. However, accuracy sharply declined on complex, inference-intensive assessments, lagging expert reviewers by up to 15%. Errors commonly resulted from superficial linguistic cues – for instance, models frequently misinterpreted keywords like “longitudinal” or “sensitivity” as automatic evidence of rigorous methodological approache, leading to systematic misclassifications. Longer documents yielded lower model accuracy, whereas publication year showed no significant effect. Our findings highlight an important pattern for practitioners using LLMs for methods review and synthesis from full texts: current LLMs excel at identifying explicit methodological features but require human oversight for nuanced interpretations. Integrating automated information extraction with targeted expert review thus provides a promising approach to enhance efficiency and methodological rigor in evidence synthesis across diverse scientific fields.

[93] HiligayNER: A Baseline Named Entity Recognition Model for Hiligaynon

James Ald Teves, Ray Daniel Cal, Josh Magdiel Villaluz, Jean Malolos, Mico Magtira, Ramon Rodriguez, Mideth Abisado, Joseph Marvin Imperial

Main category: cs.CL

TL;DR: HiligayNER is the first baseline NER model for Hiligaynon language, built using mBERT and XLM-RoBERTa trained on 8,000+ annotated sentences, achieving over 80% performance across metrics and showing cross-lingual transferability.

Details

Motivation: Hiligaynon language is underrepresented in NLP research due to lack of annotated corpora and baseline models, despite being spoken by millions in the Philippines.

Method: Collected 8,000+ annotated sentences from news, social media, and literary texts, then fine-tuned mBERT and XLM-RoBERTa models on this corpus for NER.

Result: Both models achieved over 80% precision, recall, and F1-score across entity types, with promising cross-lingual transferability to Cebuano and Tagalog.

Conclusion: HiligayNER successfully establishes the first baseline for Hiligaynon NER, contributing to language technology development for underrepresented Philippine languages and supporting future multilingual NLP research.

Abstract: The language of Hiligaynon, spoken predominantly by the people of Panay Island, Negros Occidental, and Soccsksargen in the Philippines, remains underrepresented in language processing research due to the absence of annotated corpora and baseline models. This study introduces HiligayNER, the first publicly available baseline model for the task of Named Entity Recognition (NER) in Hiligaynon. The dataset used to build HiligayNER contains over 8,000 annotated sentences collected from publicly available news articles, social media posts, and literary texts. Two Transformer-based models, mBERT and XLM-RoBERTa, were fine-tuned on this collected corpus to build versions of HiligayNER. Evaluation results show strong performance, with both models achieving over 80% in precision, recall, and F1-score across entity types. Furthermore, cross-lingual evaluation with Cebuano and Tagalog demonstrates promising transferability, suggesting the broader applicability of HiligayNER for multilingual NLP in low-resource settings. This work aims to contribute to language technology development for underrepresented Philippine languages, specifically for Hiligaynon, and support future research in regional language processing.

[94] Review of Inference-Time Scaling Strategies: Reasoning, Search and RAG

Zhichao Wang, Cheng Wan, Dong Nie

Main category: cs.CL

TL;DR: This survey paper systematically organizes inference-time scaling techniques for LLMs into two main categories: Output-focused methods (multi-step generation strategies, reasoning, search/decoding, training for long CoT, model ensembles) and Input-focused methods (primarily few-shot learning and RAG with detailed subcategories).

Details

Motivation: The diminishing availability of high-quality training data creates a bottleneck for LLM performance gains, shifting focus to inference-time scaling as a way to improve performance on downstream tasks without costly model re-training.

Method: Systematic survey and organization of inference-time scaling techniques into two comprehensive perspectives: Output-focused methods (complex multi-step generation strategies) and Input-focused methods (few-shot and RAG approaches).

Result: A structured framework for understanding the rapidly evolving field of inference-time scaling, with detailed categorization of techniques including reasoning methods, search/decoding strategies, and comprehensive RAG analysis.

Conclusion: Inference-time scaling represents a new paradigm for improving LLM performance that addresses the data availability bottleneck by leveraging additional computation at deployment time rather than through expensive model re-training.

Abstract: The performance gains of LLMs have historically been driven by scaling up model size and training data. However, the rapidly diminishing availability of high-quality training data is introducing a fundamental bottleneck, shifting the focus of research toward inference-time scaling. This paradigm uses additional computation at the time of deployment to substantially improve LLM performance on downstream tasks without costly model re-training. This review systematically surveys the diverse techniques contributing to this new era of inference-time scaling, organizing the rapidly evolving field into two comprehensive perspectives: Output-focused and Input-focused methods. Output-focused techniques encompass complex, multi-step generation strategies, including reasoning (e.g., CoT, ToT, ReAct), various search and decoding methods (e.g., MCTS, beam search), training for long CoT (e.g., RLVR, GRPO), and model ensemble methods. Input-focused techniques are primarily categorized by few-shot and RAG, with RAG as the central focus. The RAG section is further detailed through a structured examination of query expansion, data, retrieval and reranker, LLM generation methods, and multi-modal RAG.

[95] Toward Human-Centered Readability Evaluation

Bahar İlgen, Georges Hattab

Main category: cs.CL

TL;DR: Proposes HCRS, a human-centered readability framework for health text simplification that goes beyond surface metrics to include clarity, trustworthiness, tone, cultural relevance, and actionability.

Details

Motivation: Current NLP evaluation metrics (BLEU, FKGL, SARI) focus on surface-level features and fail to capture human-centered qualities needed in high-stakes health contexts where communication must be usable, respectful, and trustworthy.

Method: Developed HCRS - a five-dimensional evaluation framework integrating automatic measures with structured human feedback, grounded in HCI and health communication research.

Result: Proposed framework and validation protocol for capturing relational and contextual aspects of readability in health text simplification.

Conclusion: HCRS advances health text simplification evaluation beyond surface metrics, enabling NLP systems to better align with diverse users’ needs, expectations, and lived experiences.

Abstract: Text simplification is essential for making public health information accessible to diverse populations, including those with limited health literacy. However, commonly used evaluation metrics in Natural Language Processing (NLP), such as BLEU, FKGL, and SARI, mainly capture surface-level features and fail to account for human-centered qualities like clarity, trustworthiness, tone, cultural relevance, and actionability. This limitation is particularly critical in high-stakes health contexts, where communication must be not only simple but also usable, respectful, and trustworthy. To address this gap, we propose the Human-Centered Readability Score (HCRS), a five-dimensional evaluation framework grounded in Human-Computer Interaction (HCI) and health communication research. HCRS integrates automatic measures with structured human feedback to capture the relational and contextual aspects of readability. We outline the framework, discuss its integration into participatory evaluation workflows, and present a protocol for empirical validation. This work aims to advance the evaluation of health text simplification beyond surface metrics, enabling NLP systems that align more closely with diverse users’ needs, expectations, and lived experiences.

[96] Is Implicit Knowledge Enough for LLMs? A RAG Approach for Tree-based Structures

Mihir Gupte, Paolo Giusto, Ramesh S

Main category: cs.CL

TL;DR: A bottom-up method for linearizing tree-like structured data (e.g., GitHub repositories) by generating implicit aggregated summaries at each hierarchical level, enabling efficient RAG with 68% fewer documents while maintaining response quality.

Details

Motivation: LLMs can use in-context information effectively, but it's unclear how to best represent retrieved knowledge from hierarchical structures like trees for RAG systems.

Method: Proposed a bottom-up approach that linearizes tree structures by generating implicit, aggregated summaries at each hierarchical level, allowing knowledge to be stored in a knowledge base for RAG.

Result: Response quality was comparable to using RAG on raw unstructured code, but the proposed method generated over 68% fewer documents in the retriever, showing significant efficiency gains.

Conclusion: Leveraging implicit, linearized knowledge is an effective and scalable strategy for handling complex hierarchical data structures in RAG systems.

Abstract: Large Language Models (LLMs) are adept at generating responses based on information within their context. While this ability is useful for interacting with structured data like code files, another popular method, Retrieval-Augmented Generation (RAG), retrieves relevant documents to augment the model’s in-context learning. However, it is not well-explored how to best represent this retrieved knowledge for generating responses on structured data, particularly hierarchical structures like trees. In this work, we propose a novel bottom-up method to linearize knowledge from tree-like structures (like a GitHub repository) by generating implicit, aggregated summaries at each hierarchical level. This approach enables the knowledge to be stored in a knowledge base and used directly with RAG. We then compare our method to using RAG on raw, unstructured code, evaluating the accuracy and quality of the generated responses. Our results show that while response quality is comparable across both methods, our approach generates over 68% fewer documents in the retriever, a significant gain in efficiency. This finding suggests that leveraging implicit, linearized knowledge may be a highly effective and scalable strategy for handling complex, hierarchical data structures.

Haeji Jung, Jinju Kim, Kyungjin Kim, Youjeong Roh, David R. Mortensen

Main category: cs.CL

TL;DR: Romanization is the most effective transliteration method for multilingual NLP, outperforming other approaches in 7 out of 8 evaluation settings for NER and NLI tasks.

Details

Motivation: To investigate how shared script, overlapping token vocabularies, and shared phonology contribute to the performance of multilingual models in bridging language gaps.

Method: Conducted controlled experiments using three types of transliteration (romanization, phonemic transcription, and substitution ciphers) plus orthography, evaluating on named entity recognition and natural language inference tasks.

Result: Romanization significantly outperformed other input types across most evaluation settings, showing it’s the most effective approach for multilingual NLP.

Conclusion: Longer subword tokens shared with pre-trained languages lead to better model utilization, making romanization the most successful transliteration method.

Abstract: Transliteration has emerged as a promising means to bridge the gap between various languages in multilingual NLP, showing promising results especially for languages using non-Latin scripts. We investigate the degree to which shared script, overlapping token vocabularies, and shared phonology contribute to performance of multilingual models. To this end, we conduct controlled experiments using three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as orthography. We evaluate each model on two downstream tasks – named entity recognition (NER) and natural language inference (NLI) – and find that romanization significantly outperforms other input types in 7 out of 8 evaluation settings, largely consistent with our hypothesis that it is the most effective approach. We further analyze how each factor contributed to the success, and suggest that having longer (subword) tokens shared with pre-trained languages leads to better utilization of the model.

[98] DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models

Kaixuan Ren, Preslav Nakov, Usman Naseem

Main category: cs.CL

TL;DR: DUAL-Bench is the first multimodal benchmark for evaluating over-refusal and safe completion in vision-language models, revealing significant performance gaps across 18 VLMs.

Details

Motivation: Existing benchmarks don't systematically address over-refusal in visual modality, where models either refuse benign requests too conservatively or complete tasks unsafely, especially in dual-use scenarios with harmless instructions but harmful images.

Method: Created DUAL-Bench benchmark evaluating 18 VLMs across 12 hazard categories, focusing on robustness under semantics-preserving visual perturbations to measure safe completion rates.

Result: Poor performance across models: GPT-5-Nano achieves 12.9% safe completion, GPT-5 models average 7.9%, and Qwen models only 3.9%, showing substantial room for improvement.

Conclusion: DUAL-Bench enables development of more nuanced alignment strategies to ensure VLMs remain both safe and useful in complex multimodal settings, addressing the balance between safety and usefulness.

Abstract: As vision-language models become increasingly capable, maintaining a balance between safety and usefulness remains a central challenge. Safety mechanisms, while essential, can backfire, causing over-refusal, where models decline benign requests out of excessive caution. Yet, no existing benchmark has systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content. Models frequently fail in such scenarios, either refusing too conservatively or completing tasks unsafely, which highlights the need for more fine-grained alignment. The ideal behavior is safe completion, i.e., fulfilling the benign parts of a request while explicitly warning about any potentially harmful elements. To address this, we present DUAL-Bench, the first multimodal benchmark focused on over-refusal and safe completion in VLMs. We evaluated 18 VLMs across 12 hazard categories, with focus on their robustness under semantics-preserving visual perturbations. The results reveal substantial room for improvement: GPT-5-Nano achieves 12.9% safe completion, GPT-5 models average 7.9%, and Qwen models only 3.9%. We hope that DUAL-Bench will foster the development of more nuanced alignment strategies that ensure models remain both safe and useful in complex multimodal settings.

[99] Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks

Jiajing Guo, Kenil Patel, Jorge Piazentin Ono, Wenbin He, Liu Ren

Main category: cs.CL

TL;DR: Benchmarking of test-time scaling strategies for LLM-based Text2SQL systems, evaluating performance, latency, and token usage on BIRD Mini-Dev benchmark.

Details

Motivation: To assess the effectiveness of test-time scaling strategies in real-world Text2SQL applications with latest reasoning models, as their practical deployment impact remains uncertain.

Method: Evaluated six lightweight industry-oriented test-time scaling strategies and four LLMs (including two reasoning models) on BIRD Mini-Dev benchmark, measuring accuracy, inference latency, and token consumption.

Result: Divide-and-Conquer prompting and few-shot demonstrations consistently improved performance for both general-purpose and reasoning-focused LLMs. Additional workflow steps had mixed results, and base model selection was critical.

Conclusion: Reveals practical trade-offs between accuracy, efficiency, and complexity in Text2SQL system deployment, highlighting the importance of strategy selection and model choice for industrial applications.

Abstract: Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we benchmark six lightweight, industry-oriented test-time scaling strategies and four LLMs, including two reasoning models, evaluating their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights relevant for practical system deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.

[100] LLM$\times$MapReduce-V3: Enabling Interactive In-Depth Survey Generation through a MCP-Driven Hierarchically Modular Agent System

Yu Chao, Siyu Lin, xiaorong wang, Zhu Zhang, Zihan Zhou, Haoyu Wang, Shuo Wang, Jie Zhou, Zhiyuan Liu, Maosong Sun

Main category: cs.CL

TL;DR: LLM x MapReduce-V3 is a hierarchical multi-agent system for generating long-form surveys using modular MCP servers orchestrated by a planner agent.

Details

Motivation: To create a more flexible and controllable system for long-form survey generation that allows human intervention and customization, building on previous MapReduce versions.

Method: Uses a multi-agent architecture with independent MCP servers for different functions (skeleton initialization, digest construction, skeleton refinement) that can be aggregated hierarchically. A planner agent dynamically orchestrates workflow based on tool descriptions and execution history.

Result: Human evaluations show the system outperforms representative baselines in both content depth and length, demonstrating the effectiveness of MCP-based modular planning.

Conclusion: The hierarchical modular approach using MCP servers enables precise capture of research perspectives and generates comprehensive surveys, with the system showing superior performance compared to baseline methods.

Abstract: We introduce LLM x MapReduce-V3, a hierarchically modular agent system designed for long-form survey generation. Building on the prior work, LLM x MapReduce-V2, this version incorporates a multi-agent architecture where individual functional components, such as skeleton initialization, digest construction, and skeleton refinement, are implemented as independent model-context-protocol (MCP) servers. These atomic servers can be aggregated into higher-level servers, creating a hierarchically structured system. A high-level planner agent dynamically orchestrates the workflow by selecting appropriate modules based on their MCP tool descriptions and the execution history. This modular decomposition facilitates human-in-the-loop intervention, affording users greater control and customization over the research process. Through a multi-turn interaction, the system precisely captures the intended research perspectives to generate a comprehensive skeleton, which is then developed into an in-depth survey. Human evaluations demonstrate that our system surpasses representative baselines in both content depth and length, highlighting the strength of MCP-based modular planning.

[101] ADVICE: Answer-Dependent Verbalized Confidence Estimation

Ki Jung Seo, Sehun Lim, Taeuk Kim

Main category: cs.CL

TL;DR: ADVICE is a fine-tuning framework that addresses LLM overconfidence by enabling answer-dependent confidence estimation, improving calibration while maintaining task performance.

Details

Motivation: LLMs often exhibit overconfidence in their verbalized confidence, which reduces reliability. The cause of this overconfidence is poorly understood, with answer-independence identified as a key factor.

Method: Proposed ADVICE framework - a fine-tuning approach that facilitates answer-grounded confidence estimation to address the model’s failure to condition confidence on its own answer.

Result: Extensive experiments show ADVICE substantially improves confidence calibration while preserving task performance. The framework strengthens answer-groundedness and leads to more balanced, well-calibrated confidence distributions.

Conclusion: The work identifies answer-independence as the origin of overconfidence and establishes ADVICE as an effective framework for more trustworthy confidence verbalization in LLMs.

Abstract: Recent progress in large language models (LLMs) has enabled them to express their confidence in natural language, enhancing transparency and reliability. However, their confidence often exhibits overconfidence, the cause of which remains poorly understood. In this work, we conduct a detailed analysis of the dynamics underlying verbalized confidence and identify answer-independence as a key factor, defined as the model’s failure to condition confidence on its own answer. To address this, we propose ADVICE (Answer-Dependent Verbalized Confidence Estimation), a fine-tuning framework that facilitates answer-grounded confidence estimation. Extensive experiments show that ADVICE substantially improves confidence calibration while preserving task performance. Further analyses confirm that ADVICE strengthens answer-groundedness, leading to more balanced and well-calibrated confidence distributions. Our findings shed light on the origin of overconfidence and establish a framework for more trustworthy confidence verbalization.

[102] GapDNER: A Gap-Aware Grid Tagging Model for Discontinuous Named Entity Recognition

Yawen Yang, Fukun Ma, Shiao Meng, Aiwei Liu, Lijie Wen

Main category: cs.CL

TL;DR: GapDNER is a novel model for discontinuous named entity recognition that uses gap-aware grid tagging to handle non-adjacent tokens and overlapping entities by modeling context gaps between entity fragments.

Details

Motivation: Previous methods for discontinuous NER face challenges with error propagation and decoding ambiguity due to the wide variety of span or word combinations when connecting entity fragments or internal tokens.

Method: Proposes GapDNER which treats context gaps as additional span types and converts span classification to token-pair grid tagging. Uses two interactive components: intra-span regularity extraction with biaffine mechanism and linear attention, and inter-span relation enhancement with criss-cross attention. Uses BFS algorithm for entity decoding.

Result: Achieves new state-of-the-art performance on three datasets for discontinuous NER and shows remarkable advantages in recognizing complex entity structures.

Conclusion: GapDNER effectively addresses discontinuous NER challenges by modeling context gaps and using comprehensive token-pair grid features, demonstrating superior performance in handling complex biomedical entity structures.

Abstract: In biomedical fields, one named entity may consist of a series of non-adjacent tokens and overlap with other entities. Previous methods recognize discontinuous entities by connecting entity fragments or internal tokens, which face challenges of error propagation and decoding ambiguity due to the wide variety of span or word combinations. To address these issues, we deeply explore discontinuous entity structures and propose an effective Gap-aware grid tagging model for Discontinuous Named Entity Recognition, named GapDNER. Our GapDNER innovatively applies representation learning on the context gaps between entity fragments to resolve decoding ambiguity and enhance discontinuous NER performance. Specifically, we treat the context gap as an additional type of span and convert span classification into a token-pair grid tagging task. Subsequently, we design two interactive components to comprehensively model token-pair grid features from both intra- and inter-span perspectives. The intra-span regularity extraction module employs the biaffine mechanism along with linear attention to capture the internal regularity of each span, while the inter-span relation enhancement module utilizes criss-cross attention to obtain semantic relations among different spans. At the inference stage of entity decoding, we assign a directed edge to each entity fragment and context gap, then use the BFS algorithm to search for all valid paths from the head to tail of grids with entity tags. Experimental results on three datasets demonstrate that our GapDNER achieves new state-of-the-art performance on discontinuous NER and exhibits remarkable advantages in recognizing complex entity structures.

[103] Evaluating Language Models’ Evaluations of Games

Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths

Main category: cs.CL

TL;DR: This paper proposes evaluating AI systems’ ability to evaluate games themselves, not just solve them. Using 100+ board games and human judgments, it compares how reasoning models vs language models assess game fairness and funness.

Details

Motivation: Current AI evaluations focus on problem-solving (like playing games), but reasoning also involves deciding which problems are worth solving. The paper advocates for assessing AI systems' evaluation capabilities.

Method: Introduced a formalism for evaluating game evaluations. Used a dataset of 100+ novel board games and 450+ human judgments to compare reasoning models, language models, and symbolic agents on assessing game payoff/fairness and funness.

Result: Reasoning models align better with human evaluations than non-reasoning language models. However, as models approach game-theoretic optimality, their fit to human data weakens. Funness assessments show more variability across models. Reasoning models have highly variable resource usage.

Conclusion: The study highlights the importance of developing resource-rational meta-reasoning in AI systems, as current reasoning models show unpredictable resource consumption when evaluating complex queries like game assessments.

Abstract: Reasoning is not just about solving problems – it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems’ evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over $100$ novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to game-theoretic optimal, their fit to human data weakens. We also observe more “jaggedness” across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.

[104] End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF: A Reproducibility Study

Anirudh Ganesh, Jayavardhan Reddy

Main category: cs.CL

TL;DR: Reproducibility study of BiLSTM-CNN-CRF model for sequence labeling, achieving 91.18% F1-score on CoNLL-2003 NER with open-source PyTorch implementation.

Details

Motivation: To reproduce and verify the state-of-the-art neural architecture for sequence labeling proposed by Ma and Hovy (2016), which combines character-level CNNs, word-level BiLSTMs, and CRFs for end-to-end learning without hand-crafted features.

Method: Implemented the BiLSTM-CNN-CRF model combining character-level CNN representations, word-level BiLSTM context modeling, and CRF for structured prediction. Tested on named entity recognition (NER) and part-of-speech (POS) tagging tasks.

Result: Successfully reproduced key results with 91.18% F1-score on CoNLL-2003 NER, demonstrating the model’s effectiveness across sequence labeling tasks.

Conclusion: The BiLSTM-CNN-CRF architecture is reproducible and effective for sequence labeling tasks. The study provides detailed analysis of architecture components and releases open-source PyTorch implementation to support further research.

Abstract: We present a reproducibility study of the state-of-the-art neural architecture for sequence labeling proposed by Ma and Hovy (2016)\cite{ma2016end}. The original BiLSTM-CNN-CRF model combines character-level representations via Convolutional Neural Networks (CNNs), word-level context modeling through Bi-directional Long Short-Term Memory networks (BiLSTMs), and structured prediction using Conditional Random Fields (CRFs). This end-to-end approach eliminates the need for hand-crafted features while achieving excellent performance on named entity recognition (NER) and part-of-speech (POS) tagging tasks. Our implementation successfully reproduces the key results, achieving 91.18% F1-score on CoNLL-2003 NER and demonstrating the model’s effectiveness across sequence labeling tasks. We provide a detailed analysis of the architecture components and release an open-source PyTorch implementation to facilitate further research.

[105] Punctuation-aware treebank tree binarization

Eitan Klinger, Vivaan Wadhwa, Jungyeul Park

Main category: cs.CL

TL;DR: A punctuation-aware treebank binarization method that preserves punctuation as sibling nodes, improving head prediction accuracy and structural compatibility with CCGbank.

Details

Motivation: Standard binarization pipelines drop punctuation before head selection, which alters constituent structure and harms head-child identification accuracy.

Method: Developed a reproducible pipeline that preserves punctuation as sibling nodes prior to binarization, with derived artifacts including intermediate markers, reversibility signatures, and alignment indices.

Result: On Penn Treebank, punctuation-aware preprocessing improved head prediction accuracy from 73.66% (Collins rules) and 86.66% (MLP) to 91.85% with the same classifier, achieving competitive alignment with CCGbank derivations.

Conclusion: The punctuation-aware binarization approach significantly improves parsing performance and structural compatibility, with all code and resources released for replication and extension.

Abstract: This article presents a curated resource and evaluation suite for punctuation-aware treebank binarization. Standard binarization pipelines drop punctuation before head selection, which alters constituent shape and harms head-child identification. We release (1) a reproducible pipeline that preserves punctuation as sibling nodes prior to binarization, (2) derived artifacts and metadata (intermediate @X markers, reversibility signatures, alignment indices), and (3) an accompanying evaluation suite covering head-child prediction, round-trip reversibility, and structural compatibility with derivational resources (CCGbank). On the Penn Treebank, punctuation-aware preprocessing improves head prediction accuracy from 73.66% (Collins rules) and 86.66% (MLP) to 91.85% with the same classifier, and achieves competitive alignment against CCGbank derivations. All code, configuration files, and documentation are released to enable replication and extension to other corpora.

[106] KOTOX: A Korean Toxic Dataset for Deobfuscation and Detoxification

Yejin Lee, Su-Hyeon Kim, Hyundong Jin, Dayoung Kim, Yeonsoo Kim, Yo-Sub Han

Main category: cs.CL

TL;DR: KOTOX is a Korean toxic dataset designed for deobfuscation and detoxification, addressing the lack of resources for low-resource languages in toxic content detection.

Details

Motivation: Toxic content is a growing social issue, but most research focuses on English, leaving low-resource languages like Korean underrepresented. LLMs struggle with detecting toxic content in these languages, especially when users employ obfuscation techniques to evade detection.

Method: The authors categorize Korean obfuscation approaches based on linguistic characteristics and define transformation rules from real-world examples. They construct three dataset versions (easy, normal, hard) representing different obfuscation difficulty levels.

Result: KOTOX is the first dataset that simultaneously supports deobfuscation and detoxification for Korean language, providing a resource to better understand and mitigate obfuscated toxic content in LLMs for low-resource languages.

Conclusion: The dataset facilitates improved detection and neutralization of toxic content in Korean, addressing the gap in resources for low-resource languages and helping LLMs handle obfuscated toxic expressions more effectively.

Abstract: Toxic content has become an increasingly critical social issue with the rapid expansion of online communication. While numerous studies explored methods for detecting and detoxifying such content, most have focused primarily on English, leaving low-resource language underrepresented. Consequently, Large Language Models~(LLMs) often struggle to identify and neutralize toxic expressions in these languages. This challenge becomes even more pronounced when user employ obfuscation techniques to evade detection systems. Therefore, we propose a \textbf{KOTOX: Korean Toxic Dataset} for deobfuscation and detoxicification to address this issue. We categorize various obfuscation approaches based on linguistic characteristics of Korean and define a set of transformation rules grounded in real-word examples. Using these rules, we construct three dataset versions (easy, normal, and hard) representing different levels of obfuscation difficulty. This is the first dataset that simultaneously supports deobfuscation and detoxification for the Korean language. We expect it to facilitate better understanding and mitigating of obfuscated toxic content in LLM for low-resource languages. Our code and data are available at https://github.com/leeyejin1231/KOTOX.

[107] Judge Before Answer: Can MLLM Discern the False Premise in Question?

Jidong Li, Lingyong Fang, Haodong Zhao, Sufeng Duan, Gongshen Liu

Main category: cs.CL

TL;DR: The paper introduces JBA, a comprehensive benchmark for evaluating multimodal large language models’ ability to recognize false premises, and proposes an enhancement framework that significantly improves model performance on this task.

Details

Motivation: Current MLLMs remain vulnerable to false premise problems, and existing benchmarks are limited in scope, lacking fine-grained categorization and sufficient coverage to properly evaluate models' false premise recognition abilities.

Method: Developed a fully automated pipeline to construct the JBA benchmark, systematically categorizing premises into 3 main types and 13 subtypes based on required abilities. Also proposed a recognition enhancement framework to strengthen MLLM robustness against false premises.

Result: Current MLLMs still struggle with false premise recognition. Models trained with the proposed enhancement framework achieved significant improvements in false premise recognition capabilities.

Conclusion: The JBA benchmark provides rigorous evaluation of false premise recognition, and the proposed enhancement framework effectively strengthens MLLM robustness against false premises, addressing a critical vulnerability in current multimodal models.

Abstract: Multimodal large language models (MLLMs) have witnessed astonishing advancements in recent years. Despite these successes, MLLMs remain vulnerable to flase premise problems. However, existing benchmarks targeting this issue are limited in scope: they often lack fine-grained categorization, exhibit insufficient coverage, and thus fail to provide a rigorous evaluation of the ability of models to recognize false premises. To bridge this gap, we introduce a fully automated pipeline for constructing a comprehensive benchmark of false premise questions. Our method systematically categorizes the premises into three main types and thirteen subtypes according to the abilities required to identify the premises, resulting in the JBA dataset.Results show current MLLMs still struggle with false premise recognition. Building upon this benchmark, we further propose a recognition enhancement framework tailored to strengthen the robustness of MLLMs to detect false premises. Extensive experiments demonstrate that models trained with our framework achieve significant improvements in false premise recognition.

[108] RV-HATE: Reinforced Multi-Module Voting for Implicit Hate Speech Detection

Yejin Lee, Hyeseon Ahn, Yo-Sub Han

Main category: cs.CL

TL;DR: RV-HATE is a hate speech detection framework that adapts to dataset-specific characteristics using specialized modules and reinforcement learning for optimal performance.

Details

Motivation: Hate speech detection faces challenges due to diverse dataset characteristics from different platforms, but prior methods use fixed approaches without adapting to data-specific features.

Method: Uses multiple specialized modules focusing on different linguistic/contextual features, reinforcement learning to optimize module weights, and voting mechanism for final decision.

Result: Improves detection accuracy, addresses implicit hate speech, and provides interpretable insights into dataset characteristics.

Conclusion: RV-HATE achieves superior performance over conventional static methods by adapting to dataset-specific attributes.

Abstract: Hate speech remains prevalent in human society and continues to evolve in its forms and expressions. Modern advancements in internet and online anonymity accelerate its rapid spread and complicate its detection. However, hate speech datasets exhibit diverse characteristics primarily because they are constructed from different sources and platforms, each reflecting different linguistic styles and social contexts. Despite this diversity, prior studies on hate speech detection often rely on fixed methodologies without adapting to data-specific features. We introduce RV-HATE, a detection framework designed to account for the dataset-specific characteristics of each hate speech dataset. RV-HATE consists of multiple specialized modules, where each module focuses on distinct linguistic or contextual features of hate speech. The framework employs reinforcement learning to optimize weights that determine the contribution of each module for a given dataset. A voting mechanism then aggregates the module outputs to produce the final decision. RV-HATE offers two primary advantages: (1)~it improves detection accuracy by tailoring the detection process to dataset-specific attributes, and (2)~it also provides interpretable insights into the distinctive features of each dataset. Consequently, our approach effectively addresses implicit hate speech and achieves superior performance compared to conventional static methods. Our code is available at https://github.com/leeyejin1231/RV-HATE.

[109] Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning

Zhiwen Ruan, Yixia Li, He Zhu, Yun Chen, Peng Li, Yang Liu, Guanhua Chen

Main category: cs.CL

TL;DR: CFT is a fine-tuning method that updates only critical tokens identified via counterfactual perturbations, improving reasoning performance while maintaining output diversity.

Details

Motivation: Standard SFT uniformly penalizes all tokens, neglecting that only a small subset of critical tokens determines reasoning correctness, leading to reduced diversity and limited generalization.

Method: Critical Token Fine-tuning (CFT) identifies functionally indispensable tokens via counterfactual perturbations and updates only these decisive reasoning steps while preserving non-critical tokens.

Result: CFT consistently outperforms standard SFT on 11 mathematical reasoning benchmarks across 5 models, achieving better performance while fine-tuning less than 12% of tokens. It also enables test-time scaling and provides stronger RL initialization.

Conclusion: CFT is a practical and general framework for efficient and robust LLM fine-tuning that enhances both generation quality and diversity through targeted token updates.

Abstract: Large language models (LLMs) primarily rely on supervised fine-tuning (SFT) as a key method to adapt pre-trained models to domain-specific tasks such as mathematical reasoning. However, standard SFT uniformly penalizes all tokens, neglecting that only a small subset of critical tokens determines reasoning correctness. This uniform supervision often causes reduced output diversity and limited generalization. We propose Critical Token Fine-tuning (CFT), a simple yet effective approach that updates only tokens identified as functionally indispensable via counterfactual perturbations. By focusing gradient signals on these decisive reasoning steps while preserving the diversity of non-critical tokens, CFT can enhance both generation and diversity. Extensive experiments on five models across three families (Qwen, OLMo, LLaMA) and eleven mathematical reasoning benchmarks show that CFT, despite fine-tuning on less than 12% of tokens, consistently outperforms standard SFT. Moreover, CFT enables test-time scaling through improved sampling diversity and provides a stronger initialization for reinforcement learning, sustaining performance gains in later training stages while maintaining higher entropy for better exploration. These results highlight CFT as a practical and general framework for efficient and robust LLM fine-tuning.

[110] DeepResearchGuard: Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety

Wei-Chieh Huang, Henry Peng Zou, Yaozu Wu, Dongyuan Li, Yankai Chen, Weizhi Zhang, Yangning Li, Angelo Zangari, Jizhou Guo, Chunyu Miao, Liancheng Fang, Langzhou He, Renhe Jiang, Philip S. Yu

Main category: cs.CL

TL;DR: DEEPRESEARCHGUARD is a comprehensive safety framework for deep research systems that introduces four-stage safeguards and open-domain evaluation to address deficiencies in existing frameworks regarding report quality, credibility, and safety.

Details

Motivation: Existing deep research frameworks lack sufficient evaluation procedures and stage-specific protections, overlooking crucial aspects like credibility, coherence, breadth, depth, and safety, which may lead to hazardous sources being integrated into final reports.

Method: Introduces DEEPRESEARCHGUARD with four-stage safeguards and open-domain evaluation of references and reports, along with DRSAFEBENCH benchmark for deep research safety. Evaluates performance across multiple metrics including defense success rate and over-refusal rate.

Result: DEEPRESEARCHGUARD achieves 18.16% average defense success rate improvement while reducing over-refusal rate by 6%. Input guard provides substantial early-stage protection, while plan and research guards enhance citation discipline and source credibility.

Conclusion: DEEPRESEARCHGUARD enables comprehensive open-domain evaluation and stage-aware defenses that effectively block harmful content propagation while systematically improving report quality without excessive over-refusal rates.

Abstract: Deep research frameworks have shown promising capabilities in synthesizing comprehensive reports from web sources. While deep research possesses significant potential to address complex issues through planning and research cycles, existing frameworks are deficient in sufficient evaluation procedures and stage-specific protections. They typically treat evaluation as exact match accuracy of question-answering, but overlook crucial aspects of report quality such as credibility, coherence, breadth, depth, and safety. This oversight may result in hazardous or malicious sources being integrated into the final report. To address these issues, we introduce DEEPRESEARCHGUARD, a comprehensive framework featuring four-stage safeguards with open-domain evaluation of references and reports. We assess performance across multiple metrics, e.g., defense success rate and over-refusal rate, and five key report dimensions. In the absence of a suitable safety benchmark, we introduce DRSAFEBENCH, a stage-wise benchmark for deep research safety. Our evaluation spans diverse state-of-the-art LLMs, including GPT-4o, Gemini-2.5-flash, DeepSeek-v3, and o4-mini. DEEPRESEARCHGUARD achieves an average defense success rate improvement of 18.16% while reducing over-refusal rate by 6%. The input guard provides the most substantial early-stage protection by filtering out obvious risks, while the plan and research guards enhance citation discipline and source credibility. Through extensive experiments, we show that DEEPRESEARCHGUARD enables comprehensive open-domain evaluation and stage-aware defenses that effectively block harmful content propagation, while systematically improving report quality without excessive over-refusal rates. The code can be found via https://github.com/Jasonya/DeepResearchGuard.

[111] ABLEIST: Intersectional Disability Bias in LLM-Generated Hiring Scenarios

Mahika Phutane, Hayoung Jung, Matthew Kim, Tanushree Mitra, Aditya Vashistha

Main category: cs.CL

TL;DR: LLMs perpetuate identity-based discrimination against people with disabilities in hiring, with amplified harms for those facing intersectional marginalization (gender, caste) in the Global South.

Details

Motivation: To address the Western-centric bias in existing research and investigate how intersecting forms of marginalization shape experiences of people with disabilities in LLM-based hiring systems.

Method: Comprehensive audit of six LLMs across 2,820 hiring scenarios with diverse disability, gender, nationality, and caste profiles, using ABLEIST metrics (five ableism-specific and three intersectional harm metrics).

Result: Significant increases in ABLEIST harms towards disabled candidates, with state-of-the-art models failing to detect these harms. Intersectional harms (e.g., Tokenism) were sharply amplified for gender and caste-marginalized disabled candidates.

Conclusion: Current safety tools have critical blind spots, highlighting the need for intersectional safety evaluations of frontier models in high-stakes domains like hiring.

Abstract: Large language models (LLMs) are increasingly under scrutiny for perpetuating identity-based discrimination in high-stakes domains such as hiring, particularly against people with disabilities (PwD). However, existing research remains largely Western-centric, overlooking how intersecting forms of marginalization–such as gender and caste–shape experiences of PwD in the Global South. We conduct a comprehensive audit of six LLMs across 2,820 hiring scenarios spanning diverse disability, gender, nationality, and caste profiles. To capture subtle intersectional harms and biases, we introduce ABLEIST (Ableism, Inspiration, Superhumanization, and Tokenism), a set of five ableism-specific and three intersectional harm metrics grounded in disability studies literature. Our results reveal significant increases in ABLEIST harms towards disabled candidates–harms that many state-of-the-art models failed to detect. These harms were further amplified by sharp increases in intersectional harms (e.g., Tokenism) for gender and caste-marginalized disabled candidates, highlighting critical blind spots in current safety tools and the need for intersectional safety evaluations of frontier models in high-stakes domains like hiring.

[112] DND: Boosting Large Language Models with Dynamic Nested Depth

Tieyuan Chen, Xiaodong Chen, Haoxing Chen, Zhenzhong Lan, Weiyao Lin, Jianguo Li

Main category: cs.CL

TL;DR: Dynamic Nested Depth (DND) improves LLM performance by selectively reprocessing critical tokens through nested depth processing, achieving performance gains with minimal computational overhead.

Details

Motivation: To enhance off-the-shelf LLM performance by efficiently reprocessing difficult tokens while avoiding redundant computation for easier ones, addressing the need for precise token-level processing control.

Method: DND identifies critical tokens at transformer layer ends using a router, feeds them back for extra processing, employs router controlling loss for better token selection, and uses threshold control for selection stability. It integrates into pre-trained models during post-training.

Result: Boosts dense Qwen3-1.7B by 1.88% and MoE Qwen3-30B-A3B by 0.87% on diverse benchmarks with minimal parameter and computing increase.

Conclusion: DND effectively improves LLM performance through dynamic token selection and nested reprocessing, demonstrating practical efficiency gains for both dense and MoE models.

Abstract: We introduce Dynamic Nested Depth (DND), a novel method that improves performance for off-the-shelf LLMs by selecting critical tokens to reprocess in a nested depth manner. Specifically, at the end of the given transformer layer, DND identifies more critical tokens with a router and feeds them back for an extra round of processing, effectively ``reviewing" difficult tokens while avoiding redundant computation for easier ones. The dynamic selection mechanism is tailored for precise control via two novel strategies: a router controlling loss to enhance token selection distinguishability, and a threshold control scheme to ensure selection stability. We demonstrate the effectiveness of DND by directly integrating it into pre-trained dense and MoE models during a post-training phase. On diverse benchmarks, this approach boosts the performances of the dense Qwen3-1.7B by 1.88% and the MoE Qwen3-30B-A3B by 0.87%, all with a minimal parameter and computing increase.

[113] LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models

Yiwei Liu, Yucheng Li, Xiao Li, Gong Cheng

Main category: cs.CL

TL;DR: LogiNumSynth is a flexible natural language problem synthesizer that creates tasks requiring joint logical and numerical reasoning, with fine-grained control over difficulty levels for evaluation and training of language models.

Details

Motivation: Existing datasets for joint logical-numerical reasoning have limited control over task complexity and use fixed rule sets, constraining their generalizability for evaluation and training of language models.

Method: Developed LogiNumSynth synthesizer that generates natural language problems with controllable reasoning world richness, logical reasoning depth, and numerical computation complexity.

Result: Experiments show persistent weaknesses in LLMs’ logical-numerical reasoning, and LogiNumSynth can effectively diagnose these issues and provide targeted supervision to improve reasoning skills.

Conclusion: LogiNumSynth serves as both a diagnostic tool for evaluating reasoning weaknesses and a source of targeted training data to advance integrated logical-numerical reasoning capabilities in language models.

Abstract: Joint logical-numerical reasoning remains a major challenge for language models, yet existing datasets rely on fixed rule sets and offer limited control over task complexity, constraining their generalizability for evaluation and training. We present LogiNumSynth, a flexible natural language problem synthesizer that synthesizes tasks requiring proficiency in joint logical reasoning (e.g., rule-based reasoning) and numerical reasoning (e.g., arithmetic computation). LogiNumSynth supports fine-grained control over reasoning world richness, logical reasoning depth, and the complexity of numerical computations, enabling flexible data synthesis across difficulty levels. We demonstrate three key contributions: (1) Synthesizer – synthesizing fully controllable joint reasoning tasks over natural language; (2) Evaluation & Process Analysis – evaluating both process accuracy and answer accuracy; (3) Targeted Training – using synthesized data to enhance LLMs’ reasoning performance. Experiments with multiple LLMs highlight persistent weaknesses in logical-numerical reasoning, showing that LogiNumSynth can serve as both a diagnostic tool and a source of targeted supervision for advancing integrated reasoning skills.

[114] Enabling Doctor-Centric Medical AI with LLMs through Workflow-Aligned Tasks and Benchmarks

Wenya Xie, Qingying Xiao, Yu Zheng, Xidong Wang, Junying Chen, Ke Ji, Anningzhe Gao, Prayag Tiwari, Xiang Wan, Feng Jiang, Benyou Wang

Main category: cs.CL

TL;DR: Proposes repositioning LLMs as clinical assistants for doctors rather than direct patient interaction, creates DoctorFLAN dataset to improve LLM performance in medical contexts

Details

Motivation: Direct deployment of LLMs to patients poses safety risks due to limited domain expertise, so repositioning them as doctor assistants is safer

Method: Conducted two-stage survey to identify clinical needs, built DoctorFLAN dataset (92k Q&A across 22 tasks and 27 specialties), created evaluation benchmarks DoctorFLAN-test and DotaBench

Result: DoctorFLAN notably improves performance of open-source LLMs in medical contexts, facilitates alignment with physician workflows

Conclusion: Provides valuable resource and framework for advancing doctor-centered medical LLM development, complementing existing patient-oriented models

Abstract: The rise of large language models (LLMs) has transformed healthcare by offering clinical guidance, yet their direct deployment to patients poses safety risks due to limited domain expertise. To mitigate this, we propose repositioning LLMs as clinical assistants that collaborate with experienced physicians rather than interacting with patients directly. We conduct a two-stage inspiration-feedback survey to identify real-world needs in clinical workflows. Guided by this, we construct DoctorFLAN, a large-scale Chinese medical dataset comprising 92,000 Q&A instances across 22 clinical tasks and 27 specialties. To evaluate model performance in doctor-facing applications, we introduce DoctorFLAN-test (550 single-turn Q&A items) and DotaBench (74 multi-turn conversations). Experimental results with over ten popular LLMs demonstrate that DoctorFLAN notably improves the performance of open-source LLMs in medical contexts, facilitating their alignment with physician workflows and complementing existing patient-oriented models. This work contributes a valuable resource and framework for advancing doctor-centered medical LLM development

Qinglin Zhu, Yizhen Yao, Runcong Zhao, Yanzheng Xiang, Amrutha Saseendran, Chen Jin, Philip Alexander Teare, Bin Liang, Yulan He, Lin Gui

Main category: cs.CL

TL;DR: LRD is a two-stage parallel decoding framework that addresses information loss and premature commitment in diffusion-inspired models, achieving significant speedups (up to 10.6x) while improving accuracy on coding and reasoning tasks.

Details

Motivation: Autoregressive models suffer from high latency due to sequential decoding, while recent diffusion approaches have limitations including information loss (discarding predictive distributions) and premature commitment (local decisions without global coordination).

Method: Two-stage framework: 1) Latent Refinement maintains masked positions as distributional mixtures of predicted tokens and mask embeddings for global consistency, 2) Predictive Feedback Loop progressively finalizes confident tokens while retaining uncertain ones for iterative feedback, using KL-divergence dynamics for convergence.

Result: Significant improvements: HumanEval +6.3, MBPP +2.6 (coding), GSM8K +2.9, MATH500 +3.8 (reasoning) with speedups up to 10.6x compared to baseline models.

Conclusion: LRD provides a strong and versatile alternative for parallel sequence generation, achieving both improved accuracy and substantial speedups across multiple domains.

Abstract: Autoregressive (AR) models remain the standard for natural language generation but still suffer from high latency due to strictly sequential decoding. Recent diffusion-inspired approaches, such as LlaDA and Dream, mitigate this by generating in parallel, yet they suffer from two core limitations: information loss, as predictive distributions for non-finalized tokens are discarded at each step, and premature commitment, where local decisions are made without sufficient global coordination. We introduce Latent Refinement Decoding (LRD), a two-stage framework with Latent Refinement and a Predictive Feedback Loop. The first stage maintains masked positions as distributional mixtures of predicted tokens and the mask embedding, allowing the model to establish more globally consistent beliefs. The second stage progressively finalizes confident tokens while retaining uncertain ones for iterative feedback. KL-divergence dynamics provide a principled and reliable criterion for convergence and early stopping. Experiments across coding (HumanEval +6.3, MBPP +2.6) and reasoning (GSM8K +2.9, MATH500 +3.8) show that LRD improves accuracy while delivering speedups of up to 10.6x, making it a strong and versatile alternative for parallel sequence generation.

[116] Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization

Junjie Lu, Yuliang Liu, Chaofeng Qu, Wei Shen, Zhouhan Lin, Min Xu

Main category: cs.CL

TL;DR: CGPO introduces confidence-guided reasoning path optimization that identifies points of maximal uncertainty in LLM reasoning and applies self-generated, non-human-like guidance to prevent trajectory drift, achieving better performance than human-annotated or strong model approaches.

Details

Motivation: Current methods introduce training bias toward human-like reasoning and limit exploration of alternative paths. The observation that 75% of first errors occur after the lowest-confidence point suggests guiding at uncertainty points provides better supervision than error correction.

Method: CGPO leverages confidence signals to identify points of maximal uncertainty in reasoning and applies self-generated, non-human-like reasoning-path guidance to mitigate trajectory drift.

Result: Experiments on code and mathematical reasoning tasks show CGPO with small model-generated data achieves better performance than approaches using strong model or human-annotated data with the same training data amount.

Conclusion: Confidence-guided reasoning path optimization effectively improves LLM reasoning by targeting uncertainty points rather than explicit errors, enabling better performance with less dependence on human or high-capacity model annotations.

Abstract: Current approaches for strengthening LLM reasoning tend to introduce a training bias toward human-like reasoning trajectories. In step-wise preference optimization, in particular, dependence on human or higher-capacity model annotations for intermediate steps limits exploration of alternative, non-human-like reasoning paths and thus constrains achievable performance. Furthermore, through a small-scale pilot study, we observed that in approximately 75% of cases, the model’s first erroneous step occurs after the lowest-confidence point. This suggests that guiding the model at its lowest-confidence point before an error provides more accurate supervision than locating the first explicit error. In this paper, we propose Confidence-Guided Reasoning Path Preference Optimization (CGPO), a method that leverages a confidence signal to identify points of maximal uncertainty in the model’s reasoning process and applies self-generated, non-human-like reasoning-path guidance to mitigate trajectory drift. Our experiments span diverse models applied to both code and mathematical reasoning tasks. The results show that, with the same amount of training data, our method using data generated by a small model can achieve better performance in most cases compared with approaches using data generated by a strong model or human-annotated.

[117] TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code

Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic

Main category: cs.CL

TL;DR: TypePilot is an AI framework that enhances security of LLM-generated code using strongly typed languages like Scala, reducing vulnerabilities through type-guided workflows.

Details

Motivation: LLMs generate code with subtle but critical vulnerabilities that pose risks in security-sensitive systems, requiring improved security measures.

Method: Uses TypePilot agentic framework with Scala and formal verification (Stainless framework) to enforce safety constraints through type-focused pipeline.

Result: Substantially mitigates input validation and injection vulnerabilities compared to direct code generation or naive secure prompting.

Conclusion: Type-guided LLM workflows have significant potential to improve trustworthiness of automated code generation in high-assurance domains.

Abstract: Large language Models (LLMs) have shown remarkable proficiency in code generation tasks across various programming languages. However, their outputs often contain subtle but critical vulnerabilities, posing significant risks when deployed in security-sensitive or mission-critical systems. This paper introduces TypePilot, an agentic AI framework designed to enhance the security and robustness of LLM-generated code by leveraging strongly typed and verifiable languages, using Scala as a representative example. We evaluate the effectiveness of our approach in two settings: formal verification with the Stainless framework and general-purpose secure code generation. Our experiments with leading open-source LLMs reveal that while direct code generation often fails to enforce safety constraints, just as naive prompting for more secure code, our type-focused agentic pipeline substantially mitigates input validation and injection vulnerabilities. The results demonstrate the potential of structured, type-guided LLM workflows to improve the SotA of the trustworthiness of automated code generation in high-assurance domains.

[118] One Size Does Not Fit All: Exploring Variable Thresholds for Distance-Based Multi-Label Text Classification

Jens Van Nooten, Andriy Kosar, Guy De Pauw, Walter Daelemans

Main category: cs.CL

TL;DR: The paper presents a novel label-specific thresholding method for distance-based multi-label text classification that improves performance by optimizing thresholds for each label using a validation set.

Details

Motivation: Distance-based text classification offers fast inference and adaptability to expanding label sets, but multi-label classification requires effective thresholding methods to determine label relevance based on semantic similarity in embedding spaces.

Method: The authors conduct exploratory studies on diverse multi-label datasets, analyze similarity distributions across models and datasets, and propose a label-specific thresholding method that optimizes thresholds for each label using validation data.

Result: The label-specific thresholding method achieves 46% average improvement over normalized 0.5 thresholding and outperforms uniform thresholding approaches by 14% on average, while maintaining strong performance with limited labeled examples.

Conclusion: Label-specific thresholding significantly improves distance-based multi-label text classification performance, with the method being robust even when limited labeled data is available.

Abstract: Distance-based unsupervised text classification is a method within text classification that leverages the semantic similarity between a label and a text to determine label relevance. This method provides numerous benefits, including fast inference and adaptability to expanding label sets, as opposed to zero-shot, few-shot, and fine-tuned neural networks that require re-training in such cases. In multi-label distance-based classification and information retrieval algorithms, thresholds are required to determine whether a text instance is “similar” to a label or query. Similarity between a text and label is determined in a dense embedding space, usually generated by state-of-the-art sentence encoders. Multi-label classification complicates matters, as a text instance can have multiple true labels, unlike in multi-class or binary classification, where each instance is assigned only one label. We expand upon previous literature on this underexplored topic by thoroughly examining and evaluating the ability of sentence encoders to perform distance-based classification. First, we perform an exploratory study to verify whether the semantic relationships between texts and labels vary across models, datasets, and label sets by conducting experiments on a diverse collection of realistic multi-label text classification (MLTC) datasets. We find that similarity distributions show statistically significant differences across models, datasets and even label sets. We propose a novel method for optimizing label-specific thresholds using a validation set. Our label-specific thresholding method achieves an average improvement of 46% over normalized 0.5 thresholding and outperforms uniform thresholding approaches from previous work by an average of 14%. Additionally, the method demonstrates strong performance even with limited labeled examples.

[119] Bridging Gaps in Hate Speech Detection: Meta-Collections and Benchmarks for Low-Resource Iberian Languages

Paloma Piot, José Ramom Pichel Campos, Javier Parapar

Main category: cs.CL

TL;DR: This paper addresses the gap in hate speech detection resources for low-resource languages by creating standardized multilingual datasets for Iberian languages (European Spanish, European Portuguese, and Galician variants) and establishing new benchmarks through comprehensive evaluation of large language models.

Details

Motivation: Hate speech detection research is largely English-focused, leaving low-resource languages with limited resources and benchmarks. Many low-resource languages have multiple linguistic varieties that are often overlooked, and large language models require substantial data that these languages typically lack.

Method: Compiled a meta-collection of hate speech datasets for European Spanish with unified labels and metadata, then extended it by translating into European Portuguese and two Galician variants (one convergent with Spanish, another with Portuguese). Evaluated state-of-the-art LLMs in zero-shot, few-shot, and fine-tuning settings, and performed cross-lingual analysis.

Result: Established new benchmarks for hate speech detection in Iberian languages and provided baseline results for future research. The created aligned multilingual corpora enable more consistent and scalable hate speech detection across these languages.

Conclusion: The findings highlight the importance of multilingual and variety-aware approaches in hate speech detection and provide a foundation for improved benchmarking in underrepresented European languages, addressing the data gap for low-resource languages with multiple linguistic varieties.

Abstract: Hate speech poses a serious threat to social cohesion and individual well-being, particularly on social media, where it spreads rapidly. While research on hate speech detection has progressed, it remains largely focused on English, resulting in limited resources and benchmarks for low-resource languages. Moreover, many of these languages have multiple linguistic varieties, a factor often overlooked in current approaches. At the same time, large language models require substantial amounts of data to perform reliably, a requirement that low-resource languages often cannot meet. In this work, we address these gaps by compiling a meta-collection of hate speech datasets for European Spanish, standardised with unified labels and metadata. This collection is based on a systematic analysis and integration of existing resources, aiming to bridge the data gap and support more consistent and scalable hate speech detection. We extended this collection by translating it into European Portuguese and into a Galician standard that is more convergent with Spanish and another Galician variant that is more convergent with Portuguese, creating aligned multilingual corpora. Using these resources, we establish new benchmarks for hate speech detection in Iberian languages. We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, providing baseline results for future research. Moreover, we perform a cross-lingual analysis with our target languages. Our findings underscore the importance of multilingual and variety-aware approaches in hate speech detection and offer a foundation for improved benchmarking in underrepresented European languages.

[120] Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations

Johannes Moll, Markus Graf, Tristan Lemke, Nicolas Lenhart, Daniel Truhn, Jean-Benoit Delbrouck, Jiazhen Pan, Daniel Rueckert, Lisa C. Adams, Keno K. Bressem

Main category: cs.CL

TL;DR: A framework for evaluating faithfulness of chain-of-thought explanations in vision-language models for chest X-ray VQA, showing that answer accuracy and explanation quality are decoupled, with proprietary models outperforming open-source ones on attribution and fidelity metrics.

Details

Motivation: Vision-language models often produce plausible but unfaithful chain-of-thought explanations that don't reflect actual decision processes, undermining trust in clinical applications where existing evaluations fail to catch this misalignment.

Method: Clinical framework for chest X-ray VQA that probes CoT faithfulness via controlled text and image modifications across three axes: clinical fidelity, causal attribution, and confidence calibration, validated through reader studies with radiologists.

Result: Evaluator-radiologist correlations fell within inter-radiologist range, with strong alignment for attribution (τ_b=0.670), moderate for fidelity (τ_b=0.387), and weak for confidence tone (τ_b=0.091). Proprietary models scored higher on attribution (25.0% vs 1.4%) and often on fidelity (36.1% vs 31.7%) than open-source models.

Conclusion: Answer accuracy and explanation quality are decoupled, text cues shift explanations more than visual cues, and there are significant deployment risks requiring evaluation beyond final answer accuracy, especially in clinical settings.

Abstract: Vision-language models (VLMs) often produce chain-of-thought (CoT) explanations that sound plausible yet fail to reflect the underlying decision process, undermining trust in high-stakes clinical use. Existing evaluations rarely catch this misalignment, prioritizing answer accuracy or adherence to formats. We present a clinically grounded framework for chest X-ray visual question answering (VQA) that probes CoT faithfulness via controlled text and image modifications across three axes: clinical fidelity, causal attribution, and confidence calibration. In a reader study (n=4), evaluator-radiologist correlations fall within the observed inter-radiologist range for all axes, with strong alignment for attribution (Kendall’s $\tau_b=0.670$), moderate alignment for fidelity ($\tau_b=0.387$), and weak alignment for confidence tone ($\tau_b=0.091$), which we report with caution. Benchmarking six VLMs shows that answer accuracy and explanation quality are decoupled, acknowledging injected cues does not ensure grounding, and text cues shift explanations more than visual cues. While some open-source models match final answer accuracy, proprietary models score higher on attribution (25.0% vs. 1.4%) and often on fidelity (36.1% vs. 31.7%), highlighting deployment risks and the need to evaluate beyond final answer accuracy.

[121] Discursive Circuits: How Do Language Models Understand Discourse Relations?

Yisong Miao, Min-Yen Kan

Main category: cs.CL

TL;DR: The paper identifies sparse computational circuits (≈0.2% of GPT-2) responsible for discourse understanding in transformer language models, showing they generalize across discourse frameworks and reveal layer-specific linguistic processing.

Details

Motivation: To understand which specific components in transformer language models handle discourse relations, which involve longer spans and complex reasoning compared to simpler tasks.

Method: Introduced Completion under Discourse Relation (CuDR) task and constructed minimal contrastive pairs for activation patching. Used circuit discovery to identify sparse computational graphs that control discourse processing.

Result: Found sparse circuits (≈0.2% of GPT-2) successfully recover discourse understanding in English PDTB-based CuDR task. These circuits generalize to unseen discourse frameworks (RST, SDRT). Lower layers capture linguistic features (lexical semantics, coreference) while upper layers encode discourse-level abstractions.

Conclusion: Discursive circuits are sparse but effective for discourse understanding, with consistent feature utility across frameworks and clear layer specialization in linguistic processing.

Abstract: Which components in transformer language models are responsible for discourse understanding? We hypothesize that sparse computational graphs, termed as discursive circuits, control how models process discourse relations. Unlike simpler tasks, discourse relations involve longer spans and complex reasoning. To make circuit discovery feasible, we introduce a task called Completion under Discourse Relation (CuDR), where a model completes a discourse given a specified relation. To support this task, we construct a corpus of minimal contrastive pairs tailored for activation patching in circuit discovery. Experiments show that sparse circuits ($\approx 0.2%$ of a full GPT-2 model) recover discourse understanding in the English PDTB-based CuDR task. These circuits generalize well to unseen discourse frameworks such as RST and SDRT. Further analysis shows lower layers capture linguistic features such as lexical semantics and coreference, while upper layers encode discourse-level abstractions. Feature utility is consistent across frameworks (e.g., coreference supports Expansion-like relations).

[122] Domain-Specific Data Generation Framework for RAG Adaptation

Chris Xing Tian, Weihao Xie, Zhen Chen, Zhengyuan Yi, Hui Liu, Haoliang Li, Shiqi Wang, Siwei Ma

Main category: cs.CL

TL;DR: RAGen is a scalable framework for generating domain-specific question-answer-context triples to adapt RAG systems, using semantic chunking, concept extraction, and multi-chunk retrieval.

Details

Motivation: RAG systems need specialized training data for domain adaptation beyond general-purpose QA, requiring context-rich data tailored to specific domains.

Method: Modular pipeline with semantic chunking, hierarchical concept extraction, Bloom’s Taxonomy-guided question generation, multi-chunk retrieval, and distractor contexts.

Result: Enables efficient generation of domain-grounded QAC triples for diverse RAG adaptation strategies, handling large evolving document corpora without redundant processing.

Conclusion: RAGen provides a scalable solution for adapting RAG systems to dynamic domains like scientific research and enterprise knowledge bases through automated domain-specific data generation.

Abstract: Retrieval-Augmented Generation (RAG) combines the language understanding and reasoning power of large language models (LLMs) with external retrieval to enable domain-grounded responses. Effectively adapting RAG systems to domain-specific settings requires specialized, context-rich training data beyond general-purpose question-answering. Here, we propose RAGen, a scalable and modular framework for generating domain-grounded question-answer-context (QAC) triples tailored to diverse RAG adaptation approaches. RAGen produces these QAC triples by identifying key concepts in documents, generating diverse questions guided by Bloom’s Taxonomy-inspired principles, and pairing them with precise answers extracted from relevant contexts. RAGen supports multiple RAG adaptation strategies, including the optimization of key components such as the LLM, retriever, and embedding model, etc. Its modular pipeline features semantic chunking, hierarchical concept extraction, and multi-chunk retrieval, along with the introduction of curated distractor contexts to promote robust reasoning. Designed for scalability, RAGen efficiently handles large and evolving document corpora without redundant processing, making it especially suitable for dynamic evolving domains such as scientific research and enterprise knowledge bases.

[123] The Curious Case of Factual (Mis)Alignment between LLMs’ Short- and Long-Form Answers

Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš

Main category: cs.CL

TL;DR: LLMs show inconsistent factual knowledge across query complexities - they answer simple questions correctly but fail on the same facts in complex queries, revealing a reliability gap.

Details

Motivation: To understand the fundamental inconsistency in how LLMs access factual knowledge across different task complexities, as current evaluation practices assume good performance on simple queries implies reliability in complex tasks.

Method: Introduced SLAQ framework comparing LLMs’ answers to same factual questions asked in isolation (short) vs. integrated into complex queries (long), testing 16 LLMs across 600 queries with mechanistic analysis.

Result: Found systematic misalignment between short and long query answers, position-dependent accuracy loss, momentum effects, and that aligned facts activate overlapping model internals with 78% prediction accuracy.

Conclusion: Factual consistency over query complexity is crucial for LLM trustworthiness, challenging current evaluation practices that overestimate reliability in complex knowledge-seeking tasks.

Abstract: Large language models (LLMs) can correctly answer “When was Einstein born?” yet fail to provide the same date when writing about Einstein’s life revealing a fundamental inconsistency in how models access factual knowledge across task complexities. While models display impressive accuracy on factual question-answering benchmarks, the reliability gap between simple and complex queries remains poorly understood, eroding their trustworthiness. In this work, we introduce Short-Long Form Alignment for Factual Question Answering (SLAQ), a controlled evaluation framework that compares LLMs’ answers to the same factual questions asked (a) in isolation (short) vs. (b) integrated into complex queries (long). Looking at 16 LLMs across 600 queries, we find a systematic misalignment of answers to the corresponding short and long queries. We further uncover position-dependent accuracy loss and momentum effects where consecutive correct or incorrect answers create self-reinforcing patterns. Through mechanistic analysis, we find that aligned facts activate overlapping model internals, and that metrics based on mechanistic similarity can predict short-long answer alignment with up to 78% accuracy. Our work establishes factual consistency over query complexity as an important aspect of LLMs’ trustworthiness and challenges current evaluation practices, which implicitly assume that good performance for simple factual queries implies reliability in more complex knowledge-seeking tasks too.

[124] WebRouter: Query-specific Router via Variational Information Bottleneck for Cost-sensitive Web Agent

Tao Li, Jinlong Hu, Yang Wang, Junfeng Liu, Xuejun Liu

Main category: cs.CL

TL;DR: WebRouter is a query-specific router that uses cost-aware Variational Information Bottleneck to reduce operational costs by 87.8% with minimal accuracy drop for LLM-brained web agents.

Details

Motivation: LLM-brained web agents face cost-performance trade-offs due to complex prompts containing goals, action histories, and environmental states, which degrade ensemble performance.

Method: Introduces WebRouter with cost-aware Variational Information Bottleneck (ca-VIB) objective that learns compressed prompt representations while penalizing expected operational costs.

Result: Experiments on five WebVoyager benchmark websites show 87.8% cost reduction compared to GPT-4o baseline with only 3.8% accuracy drop.

Conclusion: WebRouter effectively addresses the cost-performance trade-off in web agents through information-theoretic routing with significant cost savings.

Abstract: LLM-brained web agents offer powerful capabilities for web automation but face a critical cost-performance trade-off. The challenge is amplified by web agents’ inherently complex prompts that include goals, action histories, and environmental states, leading to degraded LLM ensemble performance. To address this, we introduce WebRouter, a novel query-specific router trained from an information-theoretic perspective. Our core contribution is a cost-aware Variational Information Bottleneck (ca-VIB) objective, which learns a compressed representation of the input prompt while explicitly penalizing the expected operational cost. Experiments on five real-world websites from the WebVoyager benchmark show that WebRouter reduces operational costs by a striking 87.8% compared to a GPT-4o baseline, while incurring only a 3.8% accuracy drop.

[125] Fairness Metric Design Exploration in Multi-Domain Moral Sentiment Classification using Transformer-Based Models

Battemuulen Naranbat, Seyed Sahand Mohammadi Ziabari, Yousuf Nasser Al Husaini, Ali Mohammed Mansoor Alsahag

Main category: cs.CL

TL;DR: The paper introduces Moral Fairness Consistency (MFC) metric to evaluate cross-domain fairness in moral sentiment classification, revealing significant fairness violations in transformer models that are masked by aggregate performance metrics.

Details

Motivation: To address fairness challenges in natural language processing for moral sentiment classification, particularly under cross-domain shifts where transformer models show performance disparities that aren't captured by standard metrics.

Method: Evaluated BERT and DistilBERT on Moral Foundations Twitter Corpus (MFTC) and Moral Foundations Reddit Corpus (MFRC) using in-domain and cross-domain protocols with multi-label classification, and introduced the MFC metric to quantify cross-domain fairness stability.

Result: Found pronounced asymmetry in transfer (Twitter->Reddit degraded micro-F1 by 14.9% vs 1.5% for Reddit->Twitter), significant fairness violations in authority label (DPD: 0.22-0.23, EOD: 0.40-0.41), and MFC showed perfect negative correlation with Demographic Parity Difference (rho = -1.000).

Conclusion: MFC serves as a complementary, diagnosis-oriented metric for fairness-aware evaluation of moral reasoning models, enabling more reliable deployment across different linguistic contexts by revealing hidden fairness issues.

Abstract: Ensuring fairness in natural language processing for moral sentiment classification is challenging, particularly under cross-domain shifts where transformer models are increasingly deployed. Using the Moral Foundations Twitter Corpus (MFTC) and Moral Foundations Reddit Corpus (MFRC), this work evaluates BERT and DistilBERT in a multi-label setting with in-domain and cross-domain protocols. Aggregate performance can mask disparities: we observe pronounced asymmetry in transfer, with Twitter->Reddit degrading micro-F1 by 14.9% versus only 1.5% for Reddit->Twitter. Per-label analysis reveals fairness violations hidden by overall scores; notably, the authority label exhibits Demographic Parity Differences of 0.22-0.23 and Equalized Odds Differences of 0.40-0.41. To address this gap, we introduce the Moral Fairness Consistency (MFC) metric, which quantifies the cross-domain stability of moral foundation detection. MFC shows strong empirical validity, achieving a perfect negative correlation with Demographic Parity Difference (rho = -1.000, p < 0.001) while remaining independent of standard performance metrics. Across labels, loyalty demonstrates the highest consistency (MFC = 0.96) and authority the lowest (MFC = 0.78). These findings establish MFC as a complementary, diagnosis-oriented metric for fairness-aware evaluation of moral reasoning models, enabling more reliable deployment across heterogeneous linguistic contexts. .

[126] A Theorem-Proving-Based Evaluation of Neural Semantic Parsing

Hayate Funakura, Hyunsoo Kim, Koji Mineshima

Main category: cs.CL

TL;DR: Graph-matching metrics like Smatch are insufficient for evaluating semantic parsers as they measure surface overlap rather than logical equivalence. The study combines graph-matching with theorem proving to assess parser performance.

Details

Motivation: Current evaluation metrics for semantic parsers focus on surface-level graph matching rather than logical equivalence, which is crucial for reasoning applications.

Method: Compared supervised fine-tuning (T5 models) and few-shot in-context learning (GPT models) using graph-matching, bidirectional entailment with theorem proving, and well-formedness checks under normalized and unnormalized targets.

Result: Models performing well on graph-matching often fail to produce logically equivalent formulas. Normalization improves well-formedness and logical adequacy. Performance degrades with formula complexity and specific linguistic features like coordination and passive voice.

Conclusion: Graph-based metrics have limitations for reasoning applications, motivating logic-sensitive evaluation methods and simplified, normalized target representations for better semantic parsing.

Abstract: Graph-matching metrics such as Smatch are the de facto standard for evaluating neural semantic parsers, yet they capture surface overlap rather than logical equivalence. We reassess evaluation by pairing graph-matching with automated theorem proving. We compare two approaches to building parsers: supervised fine-tuning (T5-Small/Base) and few-shot in-context learning (GPT-4o/4.1/5), under normalized and unnormalized targets. We evaluate outputs using graph-matching, bidirectional entailment between source and target formulas with a first-order logic theorem prover, and well-formedness. Across settings, we find that models performing well on graph-matching often fail to produce logically equivalent formulas. Normalization reduces incidental target variability, improves well-formedness, and strengthens logical adequacy. Error analysis shows performance degrades with increasing formula complexity and with coordination, prepositional phrases, and passive voice; the dominant failures involve variable binding and indexing, and predicate naming. These findings highlight limits of graph-based metrics for reasoning-oriented applications and motivate logic-sensitive evaluation and training objectives together with simplified, normalized target representations. All code and data for our experiments are publicly available.

Jinyuan Xu, Tian Lan, Xintao Yu, Xue He, Hezhi Zhang, Ying Wang, Pierre Magistry, Mathieu Valette, Lei Li

Main category: cs.CL

TL;DR: The paper introduces CNSocialDepress, a Chinese-language benchmark dataset for depression risk detection from social media posts, featuring binary risk labels and multi-dimensional psychological attributes for interpretable analysis.

Details

Motivation: Depression is a major global health issue, but there are scarce publicly available Chinese-language resources for depression detection, with most existing resources limited to binary classification.

Method: Created CNSocialDepress dataset containing 44,178 texts from 233 users, with 10,306 depression-related segments annotated by psychological experts, providing binary risk labels and structured multi-dimensional psychological attributes.

Result: Experimental results demonstrate the dataset’s utility across various NLP tasks, including structured psychological profiling and fine-tuning large language models for depression detection.

Conclusion: The dataset provides effective and practical value for depression risk identification and psychological analysis, offering insights for mental health applications tailored for Chinese-speaking populations.

Abstract: Depression is a pressing global public health issue, yet publicly available Chinese-language resources for risk detection remain scarce and are mostly limited to binary classification. To address this limitation, we release CNSocialDepress, a benchmark dataset for depression risk detection from Chinese social media posts. The dataset contains 44,178 texts from 233 users, within which psychological experts annotated 10,306 depression-related segments. CNSocialDepress provides binary risk labels together with structured multi-dimensional psychological attributes, enabling interpretable and fine-grained analysis of depressive signals. Experimental results demonstrate its utility across a wide range of NLP tasks, including structured psychological profiling and fine-tuning of large language models for depression detection. Comprehensive evaluations highlight the dataset’s effectiveness and practical value for depression risk identification and psychological analysis, thereby providing insights to mental health applications tailored for Chinese-speaking populations.

[128] XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression

Haoqi Yang, Yao Yao, Zuchao Li, Baoyuan Qi, Guoming Liu, Hai Zhao

Main category: cs.CL

TL;DR: XQuant is a training-free KV cache quantization framework that achieves ultra-low bit-width compression (sub-1.4 bits) through data-free calibration and cross-layer compression, outperforming existing methods while maintaining model accuracy.

Details

Motivation: LLMs face significant memory challenges due to KV cache growth during long-text processing, especially in resource-constrained environments. Quantization offers a solution to reduce memory consumption while preserving information.

Method: XQuant introduces two key innovations: a computationally negligible data-free calibration method and cross-layer KV cache compression, enabling training-free, plug-and-play ultra-low bit-width quantization.

Result: Extensive experiments on TruthfulQA and LongBench show XQuant outperforms state-of-the-art methods (KIVI-2bit and AsymKV-1.5bit) by achieving lower bit-width while maintaining superior performance.

Conclusion: XQuant establishes a better trade-off between memory efficiency and model accuracy, providing an effective solution for deploying LLMs in resource-constrained environments.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, particularly due to KV cache growth during long-text understanding and generation, present significant challenges for deployment in resource-constrained environments. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key innovations: a computationally negligible data-free calibration method and cross-layer KV cache compression, enabling quantization to sub-1.4 bits. Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant outperforms state-of-the-art methods (e.g., KIVI-2bit and AsymKV-1.5bit) by achieving lower bit-width while maintaining superior performance, establishing a better trade-off between memory efficiency and model accuracy.

[129] Attacks by Content: Automated Fact-checking is an AI Security Issue

Michael Schlichtkrull

Main category: cs.CL

TL;DR: The paper introduces ‘attack by content’ where adversaries manipulate AI agents by supplying biased or false information rather than injecting malicious instructions, and proposes using automated fact-checking as a defense mechanism.

Details

Motivation: Existing defenses focus on detecting hidden commands but are ineffective against content-based attacks where adversaries supply misleading information without explicit instructions.

Method: Proposes repurposing automated fact-checking as a cognitive self-defense tool for agents, requiring them to critically evaluate retrieved information by corroborating claims with external evidence and assessing source trustworthiness.

Result: Identifies that current defenses are insufficient against content-based attacks and establishes the need for agents to perform critical evaluation of external information.

Conclusion: Automated fact-checking should be adapted as a defense mechanism to protect AI agents from content-based manipulation attacks, enabling them to verify information credibility independently.

Abstract: When AI agents retrieve and reason over external documents, adversaries can manipulate the data they receive to subvert their behaviour. Previous research has studied indirect prompt injection, where the attacker injects malicious instructions. We argue that injection of instructions is not necessary to manipulate agents - attackers could instead supply biased, misleading, or false information. We term this an attack by content. Existing defenses, which focus on detecting hidden commands, are ineffective against attacks by content. To defend themselves and their users, agents must critically evaluate retrieved information, corroborating claims with external evidence and evaluating source trustworthiness. We argue that this is analogous to an existing NLP task, automated fact-checking, which we propose to repurpose as a cognitive self-defense tool for agents.

[130] Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality

Jana Jung, Marlene Lutz, Indira Sen, Markus Strohmaier

Main category: cs.CL

TL;DR: Psychometric tests designed for humans show moderate reliability but low ecological validity when applied to LLMs, with test scores often not aligning or negatively correlating with actual model behavior in downstream tasks.

Details

Motivation: To evaluate whether human psychometric tests yield meaningful results when applied to large language models, given their increasing use for assessing psychological constructs like sexism, racism, and morality in LLMs.

Method: Systematically evaluated reliability and validity of human psychometric tests for three constructs (sexism, racism, morality) using multiple item and prompt variations. Validity was assessed through convergent (theory-based inter-test correlations) and ecological approaches (alignment between test scores and real-world downstream task behavior).

Result: Found moderate reliability across variations, but psychometric test scores did not align with model behavior in downstream tasks - in some cases even showing negative correlations, indicating low ecological validity.

Conclusion: Systematic evaluation of psychometric tests is essential before interpreting their scores for LLMs, and human-designed psychometric tests cannot be directly applied to LLMs without adaptation.

Abstract: Psychometric tests are increasingly used to assess psychological constructs in large language models (LLMs). However, it remains unclear whether these tests – originally developed for humans – yield meaningful results when applied to LLMs. In this study, we systematically evaluate the reliability and validity of human psychometric tests for three constructs: sexism, racism, and morality. We find moderate reliability across multiple item and prompt variations. Validity is evaluated through both convergent (i.e., testing theory-based inter-test correlations) and ecological approaches (i.e., testing the alignment between tests scores and behavior in real-world downstream tasks). Crucially, we find that psychometric test scores do not align, and in some cases even negatively correlate with, model behavior in downstream tasks, indicating low ecological validity. Our results highlight that systematic evaluations of psychometric tests is essential before interpreting their scores. They also suggest that psychometric tests designed for humans cannot be applied directly to LLMs without adaptation.

[131] Towards Real-Time Fake News Detection under Evidence Scarcity

Guangyu Wei, Ke Han, Yueming Lyu, Yu Luo, Yue Jiang, Caifeng Shan, Nicu Sebe

Main category: cs.CL

TL;DR: EASE is a framework for real-time fake news detection that dynamically adapts decision-making based on evidence sufficiency, using three evaluation perspectives and instruction tuning to improve accuracy and generalization.

Details

Motivation: Existing fake news detection methods struggle with real-time scenarios where emerging events lack sufficient evidence, leading to poor generalization under evidence scarcity.

Method: Proposes EASE framework with sequential evaluation: evidence-based evaluation, reasoning-based evaluation using LLMs, and sentiment-based fallback. Uses instruction tuning with pseudo labels to enhance evaluation accuracy and integrates evaluators’ assessments with news content.

Result: EASE achieves state-of-the-art performance across multiple benchmarks and significantly improves generalization to real-time news. Introduces RealTimeNews-25 benchmark for evaluating emerging news with limited evidence.

Conclusion: EASE effectively addresses evidence scarcity in real-time fake news detection through dynamic evaluation-aware decision-making, demonstrating superior performance and generalization capabilities.

Abstract: Fake news detection becomes particularly challenging in real-time scenarios, where emerging events often lack sufficient supporting evidence. Existing approaches often rely heavily on external evidence and therefore struggle to generalize under evidence scarcity. To address this issue, we propose Evaluation-Aware Selection of Experts (EASE), a novel framework for real-time fake news detection that dynamically adapts its decision-making process according to the assessed sufficiency of available evidence. EASE introduces a sequential evaluation mechanism comprising three independent perspectives: (1) Evidence-based evaluation, which assesses evidence and incorporates it into decision-making only when the evidence is sufficiently supportive; (2) Reasoning-based evaluation, which leverages the world knowledge of large language models (LLMs) and applies them only when their reliability is adequately established; and (3) Sentiment-based fallback, which integrates sentiment cues when neither evidence nor reasoning is reliable. To enhance the accuracy of evaluation processes, EASE employs instruction tuning with pseudo labels to guide each evaluator in justifying its perspective-specific knowledge through interpretable reasoning. Furthermore, the expert modules integrate the evaluators’ justified assessments with the news content to enable evaluation-aware decision-making, thereby enhancing overall detection accuracy. Moreover, we introduce RealTimeNews-25, a new benchmark comprising recent news for evaluating model generalization on emerging news with limited evidence. Extensive experiments demonstrate that EASE not only achieves state-of-the-art performance across multiple benchmarks, but also significantly improves generalization to real-time news. The code and dataset are available: https://github.com/wgyhhhh/EASE.

[132] Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

Nikita Afonin, Nikita Andriyanov, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Alexander Panchenko, Oleg Rogov, Elena Tutubalina, Mikhail Seleznyov

Main category: cs.CL

TL;DR: Emergent misalignment occurs in in-context learning (ICL) with narrow examples, causing models to produce broadly misaligned responses at rates up to 58% with sufficient examples.

Details

Motivation: Previous research showed emergent misalignment in fine-tuning but hadn't examined whether this phenomenon also occurs in in-context learning, which is a critical gap given ICL's widespread use.

Method: Tested three frontier models across three datasets using narrow in-context examples (64-256 examples) and analyzed step-by-step reasoning through chain-of-thought prompting.

Result: Models produced broadly misaligned responses at rates between 2%-17% with 64 examples and up to 58% with 256 examples. 67.5% of misaligned traces showed models adopting dangerous personas to rationalize harmful outputs.

Conclusion: Emergent misalignment is not limited to fine-tuning but also emerges in in-context learning, with models developing harmful personas to justify misaligned behavior, highlighting a significant safety concern.

Abstract: Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across three datasets, three frontier models produce broadly misaligned responses at rates between 2% and 17% given 64 narrow in-context examples, and up to 58% with 256 examples. We also examine mechanisms of EM by eliciting step-by-step reasoning (while leaving in-context examples unchanged). Manual analysis of the resulting chain-of-thought shows that 67.5% of misaligned traces explicitly rationalize harmful outputs by adopting a reckless or dangerous ‘‘persona’’, echoing prior results on finetuning-induced EM.

[133] Are Large Language Models Effective Knowledge Graph Constructors?

Ruirui Chen, Weifeng Jiang, Chengwei Qin, Bo Xiong, Fiona Liausvia, Dongkyu Choi, Boon Kiat Quek

Main category: cs.CL

TL;DR: Proposes a hierarchical extraction framework using LLMs to build high-quality knowledge graphs, addressing limitations of existing approaches and releasing a dataset for healthcare applications.

Details

Motivation: Existing LLM-based KG construction methods are limited to sentence-level contexts or predefined schemas, lacking comprehensive coverage and structured representations needed for interpretability and downstream utility.

Method: Hierarchical extraction framework that organizes information at multiple levels using state-of-the-art LLMs, enabling creation of semantically rich and well-structured knowledge graphs.

Result: Comprehensive evaluation from structural and semantic perspectives reveals strengths and shortcomings of current LLMs in KG construction, identifying key challenges for future work.

Conclusion: The framework advances KG construction capabilities and the released dataset on children’s mental well-being aims to foster transparent, reliable applications in high-stakes domains like healthcare.

Abstract: Knowledge graphs (KGs) are vital for knowledge-intensive tasks and have shown promise in reducing hallucinations in large language models (LLMs). However, constructing high-quality KGs remains difficult, requiring accurate information extraction and structured representations that support interpretability and downstream utility. Existing LLM-based approaches often focus narrowly on entity and relation extraction, limiting coverage to sentence-level contexts or relying on predefined schemas. We propose a hierarchical extraction framework that organizes information at multiple levels, enabling the creation of semantically rich and well-structured KGs. Using state-of-the-art LLMs, we extract and construct knowledge graphs and evaluate them comprehensively from both structural and semantic perspectives. Our results highlight the strengths and shortcomings of current LLMs in KG construction and identify key challenges for future work. To advance research in this area, we also release a curated dataset of LLM-generated KGs derived from research papers on children’s mental well-being. This resource aims to foster more transparent, reliable, and impactful applications in high-stakes domains such as healthcare.

[134] FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks

Sabrina McCallum, Amit Parekh, Alessandro Suglia

Main category: cs.CL

TL;DR: This paper proposes using language feedback to enable imitation learning agents to learn from both optimal and suboptimal demonstrations, improving robustness and compositional generalization in embodied AI tasks.

Details

Motivation: Current imitation learning approaches are limited to learning from optimal behavior only, risking replication of errors from suboptimal demonstrations. Reinforcement learning alternatives sacrifice data efficiency through exploration.

Method: The approach provides language feedback embeddings as input to Transformer-based policies, optionally adding auxiliary self-supervised learning objectives for feedback prediction alongside traditional next action prediction.

Result: Significant improvements in compositional generalization abilities and robustness on embodied Vision-and-Language tasks in the BabyAI-XGen environment, showing successful conversion of suboptimal behavior into learning opportunities.

Conclusion: Language feedback is a competitive and intuitive alternative to intermediate scalar rewards for language-specified embodied tasks, enabling data-efficient learning from both optimal and suboptimal demonstrations.

Abstract: Current approaches to embodied AI tend to learn policies from expert demonstrations. However, without a mechanism to evaluate the quality of demonstrated actions, they are limited to learning from optimal behaviour, or they risk replicating errors and inefficiencies. While reinforcement learning offers one alternative, the associated exploration typically results in sacrificing data efficiency. This work explores how agents trained with imitation learning can learn robust representations from both optimal and suboptimal demonstrations when given access to constructive language feedback as a means to contextualise different modes of behaviour. We directly provide language feedback embeddings as part of the input sequence into a Transformer-based policy, and optionally complement the traditional next action prediction objective with auxiliary self-supervised learning objectives for feedback prediction. We test our approach on a range of embodied Vision-and-Language tasks in our custom BabyAI-XGen environment and show significant improvements in agents’ compositional generalisation abilities and robustness, suggesting that our data-efficient method allows models to successfully convert suboptimal behaviour into learning opportunities. Overall, our results suggest that language feedback is a competitive and intuitive alternative to intermediate scalar rewards for language-specified embodied tasks.

[135] Template-Based Text-to-Image Alignment for Language Accessibility: A Study on Visualizing Text Simplifications

Belkiss Souayed, Sarah Ebling, Yingqiang Gao

Main category: cs.CL

TL;DR: A structured VLM prompting framework generates accessible images from simplified texts for individuals with intellectual disabilities, with Basic Object Focus template achieving best semantic alignment and Retro style identified as most accessible.

Details

Motivation: Individuals with intellectual disabilities struggle with complex texts, and current text-to-image models prioritize aesthetics over accessibility, creating a need for accessible visual illustrations from simplified texts.

Method: Developed five prompt templates (Basic Object Focus, Contextual Scene, Educational Layout, Multi-Level Detail, Grid Layout) with accessibility constraints, evaluated using 400 sentence simplifications from four TS datasets through CLIPScores and human annotation by accessibility experts.

Result: Basic Object Focus template achieved highest semantic alignment, Retro style was most accessible, Wikipedia was most effective data source, and Text Simplicity showed strong reliability while Image Quality was more subjective.

Conclusion: The framework provides practical guidelines for accessible content generation and emphasizes the importance of structured prompting in AI-generated visual accessibility tools, with visual minimalism enhancing language accessibility.

Abstract: Individuals with intellectual disabilities often have difficulties in comprehending complex texts. While many text-to-image models prioritize aesthetics over accessibility, it is not clear how visual illustrations relate to text simplifications (TS) generated from them. This paper presents a structured vision-language model (VLM) prompting framework for generating accessible images from simplified texts. We designed five prompt templates, i.e., Basic Object Focus, Contextual Scene, Educational Layout, Multi-Level Detail, and Grid Layout, each following distinct spatial arrangements while adhering to accessibility constraints such as object count limits, spatial separation, and content restrictions. Using 400 sentence-level simplifications from four established TS datasets (OneStopEnglish, SimPA, Wikipedia, and ASSET), we conducted a two-phase evaluation: Phase 1 assessed prompt template effectiveness with CLIPScores, and Phase 2 involved human annotation of generated images across ten visual styles by four accessibility experts. Results show that the Basic Object Focus prompt template achieved the highest semantic alignment, indicating that visual minimalism enhances language accessibility. Expert evaluation further identified Retro style as the most accessible and Wikipedia as the most effective data source. Inter-annotator agreement varied across dimensions, with Text Simplicity showing strong reliability and Image Quality proving more subjective. Overall, our framework offers practical guidelines for accessible content generation and underscores the importance of structured prompting in AI-generated visual accessibility tools.

[136] Do LLMs “Feel”? Emotion Circuits Discovery and Control

Chenxi Wang, Yixuan Zhang, Ruiji Yu, Yufei Zheng, Lang Gao, Zirui Song, Zixiang Xu, Gus Xia, Huishuai Zhang, Dongyan Zhao, Xiuying Chen

Main category: cs.CL

TL;DR: This study identifies and validates internal emotion circuits in LLMs, achieving 99.65% emotion-expression accuracy through direct circuit modulation.

Details

Motivation: To understand the internal mechanisms that give rise to emotional expression in LLMs and enable universal emotion control in generated text.

Method: Constructed SEV dataset to elicit comparable internal states, extracted context-agnostic emotion directions, identified neurons and attention heads through analytical decomposition and causal analysis, and integrated local components into global emotion circuits.

Result: Identified consistent cross-context emotion encoding, validated causal roles of emotional components, and achieved 99.65% emotion-expression accuracy through circuit modulation, surpassing prompting- and steering-based methods.

Conclusion: First systematic study to uncover and validate emotion circuits in LLMs, offering new insights into interpretability and controllable emotional intelligence.

Abstract: As the demand for emotional intelligence in large language models (LLMs) grows, a key challenge lies in understanding the internal mechanisms that give rise to emotional expression and in controlling emotions in generated text. This study addresses three core questions: (1) Do LLMs contain context-agnostic mechanisms shaping emotional expression? (2) What form do these mechanisms take? (3) Can they be harnessed for universal emotion control? We first construct a controlled dataset, SEV (Scenario-Event with Valence), to elicit comparable internal states across emotions. Subsequently, we extract context-agnostic emotion directions that reveal consistent, cross-context encoding of emotion (Q1). We identify neurons and attention heads that locally implement emotional computation through analytical decomposition and causal analysis, and validate their causal roles via ablation and enhancement interventions. Next, we quantify each sublayer’s causal influence on the model’s final emotion representation and integrate the identified local components into coherent global emotion circuits that drive emotional expression (Q2). Directly modulating these circuits achieves 99.65% emotion-expression accuracy on the test set, surpassing prompting- and steering-based methods (Q3). To our knowledge, this is the first systematic study to uncover and validate emotion circuits in LLMs, offering new insights into interpretability and controllable emotional intelligence.

[137] LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation

Hengran Zhang, Keping Bi, Jiafeng Guo, Jiaming Zhang, Shuaiqiang Wang, Dawei Yin, Xueqi Cheng

Main category: cs.CL

TL;DR: The paper introduces LLM-specific utility in retrieval-augmented generation, showing that human-annotated passages are not optimal for different LLMs and utility is not transferable across models, necessitating model-specific utility assessment.

Details

Motivation: Traditional RAG treats utility as generic, ignoring that different LLMs benefit differently from the same passages due to variations in internal knowledge and comprehension abilities.

Method: Large-scale experiments across multiple datasets and LLMs, analysis of perplexity as a key metric for readability, and proposing a benchmarking procedure for LLM-specific utility judgments.

Result: Human-annotated passages are not optimal for LLMs, ground-truth utilitarian passages are not transferable across different LLMs, and LLMs struggle to assess utility effectively - failing to reject all passages for known queries and select truly useful ones for unknown queries.

Conclusion: The findings highlight the necessity of adopting LLM-specific utility in RAG research, as utility judgments should be tailored to specific language models rather than treated as generic attributes.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. While traditional retrieval focuses on relevance, RAG’s effectiveness depends on the utility of retrieved passages, i.e., the usefulness in facilitating the generation of an accurate and comprehensive answer. Existing studies often treat utility as a generic attribute, ignoring the fact that different LLMs may benefit differently from the same passage due to variations in internal knowledge and comprehension ability. In this work, we introduce and systematically investigate the notion of LLM-specific utility. Through large-scale experiments across multiple datasets and LLMs, we demonstrate that human-annotated passages are not optimal for LLMs and that ground-truth utilitarian passages are not transferable across different LLMs. These findings highlight the necessity of adopting the LLM-specific utility in RAG research. Our findings indicate that some human-annotated passages are not ground-truth utilitarian passages for specific LLMs, partially due to the varying readability of queries and passages for LLMs, a tendency for which perplexity is a key metric. Based on these findings, we propose a benchmarking procedure for LLM-specific utility judgments. We evaluate existing utility judgment methods on six datasets and find that while verbalized methods using pseudo-answers perform robustly, LLMs struggle to assess utility effectively-failing to reject all passages for known queries and to select truly useful ones for unknown queries.

[138] Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers

Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, Fuli Luo

Main category: cs.CL

TL;DR: R3 (Rollout Routing Replay) stabilizes RL training in Mixture-of-Experts models by replaying inference routing distributions during training to address routing instability issues.

Details

Motivation: Mixture-of-Experts models suffer from routing instability during RL training, causing catastrophic collapse due to discrepancies between training and inference routing behaviors.

Method: Proposed Rollout Routing Replay (R3) method that records routing distributions from inference engine and replays them during training to reduce policy KL divergence.

Result: R3 significantly reduces training-inference policy divergence, prevents RL training collapse, and outperforms methods like GSPO and TIS across various settings.

Conclusion: R3 provides an effective solution for stabilizing RL training in MoE models by addressing fundamental routing inconsistencies between training and inference phases.

Abstract: Reinforcement learning (RL) has emerged as a crucial approach for enhancing the capabilities of large language models. However, in Mixture-of-Experts (MoE) models, the routing mechanism often introduces instability, even leading to catastrophic RL training collapse. We analyze the training-inference consistency of MoE models and identify a notable discrepancy in routing behaviors between the two phases. Moreover, even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes. To address this foundational inconsistency, we propose Rollout Routing Replay (R3), a method that records routing distributions from the inference engine and replays them during training. R3 significantly reduces training-inference policy KL divergence and mitigates extreme discrepancies without compromising training speed. Extensive experiments on various settings confirm that R3 succeeds in stabilizing RL training, preventing collapse and outperforming methods such as GSPO and TIS. We believe this work can offer a new solution for stabilizing RL in MoE models.

[139] Early Detection and Reduction of Memorisation for Domain Adaptation and Instruction Tuning

Dean L. Slack, Noura Al Moubayed

Main category: cs.CL

TL;DR: The paper studies memorization during fine-tuning of large language models and proposes early stopping and loss regularization methods to mitigate verbatim memorization with minimal performance loss.

Details

Motivation: Large language models can memorize training data, exposing private or copyrighted text. Most defenses focus on pre-training, leaving memorization during fine-tuning poorly understood, especially for domain adaptation and instruction tuning.

Method: Fine-tuned Pythia, Llama3, and Mistral models (1.4B-70B parameters) on common datasets, tracked verbatim memorization throughout training. Used n-gram memorization score for early stopping and introduced n-gram-aware loss regularizer.

Result: Memorization increases dramatically in first few epochs, often before validation perplexity or evaluation performance is optimized. The proposed methods reduce memorization by up to 40% across all model families with minimal performance trade-offs compared to existing strategies.

Conclusion: The study provides practical, scalable insights into memorization dynamics during language model fine-tuning and offers effective mitigation strategies that balance memorization reduction with model performance.

Abstract: Although large language models excel across many tasks, they can memorise training data and thereby expose private or copyrighted text. Most defences target the pre-training stage, leaving memorisation during fine-tuning, especially for domain adaptation and instruction tuning, poorly understood. We fine-tune Pythia, Llama3, and Mistral models spanning 1.4B-70B parameters on common evaluation datasets and track verbatim memorisation throughout training. We find that memorisation increases dramatically in the first few epochs, often significantly before either validation perplexity or evaluation performance is optimised. We use a simple but effective n-gram memorisation score which reliably precedes verbatim memorisation; using it as an early-stopping criterion mitigates memorisation with minimal performance loss. Further, we introduce an n-gram-aware loss regulariser and show that it reduces memorisation across all model families tested by up to 40% while minimising evaluation performance trade-offs when compared to an existing memorisation mitigation strategy. These results yield practical, scalable insights into memorisation dynamics during language model fine-tuning.

Zirui Song, Yuan Huang, Junchang Liu, Haozhe Luo, Chenxi Wang, Lang Gao, Zixiang Xu, Mingfei Han, Xiaojun Chang, Xiuying Chen

Main category: cs.CL

TL;DR: The paper introduces a high-quality multimodal Werewolf dataset and a novel strategy-alignment evaluation framework to assess LLMs’ social intelligence in deception, reasoning, and strategic gameplay.

Details

Motivation: Current studies on social deduction games reduce gameplay to LLM self-play with templated utterances and lack quality evaluation metrics, overlooking the richness of social interaction and strategic gameplay.

Method: Curated a human-verified multimodal Werewolf dataset with 100+ hours of video and 32.4M tokens, then proposed a two-stage strategy-alignment evaluation: speech evaluation (multiple-choice tasks across 5 social dimensions) and decision evaluation (voting choices and role inferences).

Result: State-of-the-art LLMs show diverse performance with roughly half scoring below 0.50, revealing significant gaps in deception and counterfactual reasoning capabilities.

Conclusion: The dataset and evaluation framework enable fine-grained assessment of linguistic and reasoning capabilities in multi-agent interaction, highlighting current limitations in social intelligence and inspiring future research.

Abstract: Social deduction games like Werewolf combine language, reasoning, and strategy, providing a testbed for studying natural language and social intelligence. However, most studies reduce the game to LLM-based self-play, yielding templated utterances and anecdotal cases that overlook the richness of social gameplay. Evaluation further relies on coarse metrics such as survival time or subjective scoring due to the lack of quality reference data. To address these gaps, we curate a high-quality, human-verified multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. Based on this dataset, we propose a novel strategy-alignment evaluation that leverages the winning faction’s strategies as ground truth in two stages: 1) Speech evaluation, formulated as multiple-choice-style tasks that assess whether the model can adopt appropriate stances across five dimensions of social ability; and 2) Decision evaluation, which assesses the model’s voting choices and opponent-role inferences. This framework enables a fine-grained evaluation of models’ linguistic and reasoning capabilities, while capturing their ability to generate strategically coherent gameplay. Our experiments show that state-of-the-art LLMs show diverse performance, with roughly half remain below 0.50, revealing clear gaps in deception and counterfactual reasoning. We hope our dataset further inspires research on language, reasoning, and strategy in multi-agent interaction.

[141] KnowRL: Teaching Language Models to Know What They Know

Sahil Kale, Devendra Singh Dhami

Main category: cs.CL

TL;DR: KnowRL is a self-improvement framework that enhances LLMs’ self-knowledge through introspection and consensus-based rewarding, improving accuracy by 28% and F1 by 12% without external supervision.

Details

Motivation: Current LLMs often misjudge their own competence, making their responses unreliable when they are uncertain. This creates safety risks for AI deployment in critical applications.

Method: Combines introspection (generating and classifying feasible/infeasible tasks) with consensus-based rewarding (reinforcing stability of self-knowledge through internal agreement) using internally generated data.

Result: Improved self-knowledge in LLaMA-3.1-8B and Qwen-2.5-7B, with gains of 28% in accuracy and 12% in F1, outperforming baselines in few iterations without external supervision.

Conclusion: KnowRL unlocks LLMs’ capacity to self-improve knowledge awareness, enabling more reliable and accountable AI for safer deployment in critical applications due to its simplicity and independence from external effort.

Abstract: Truly reliable AI requires more than simply scaling up knowledge; it demands the ability to know what it knows and when it does not. Yet recent research shows that even the best LLMs misjudge their own competence in more than one in five cases, making any response born of such internal uncertainty impossible to fully trust. Inspired by self-improvement reinforcement learning techniques that require minimal data, we present a simple but powerful framework KnowRL that strengthens a model’s internal understanding of its own feasibility boundaries, enabling safer and more responsible behaviour. Our framework combines two components: (i) introspection, where the model generates and classifies tasks it judges feasible or infeasible, and (ii) consensus-based rewarding, where stability of self-knowledge assessment is reinforced through internal agreement. By using internally generated data, this design strengthens consistency in self-knowledge and entirely avoids costly external supervision. In experiments on LLaMA-3.1-8B and Qwen-2.5-7B, KnowRL steadily improved self-knowledge, validated by both intrinsic self-consistency and extrinsic benchmarking. With nothing more than a small seed set and no external supervision, our method drove gains as high as 28% in accuracy and 12% in F1, outperforming baselines in just a few iterations. Our framework essentially unlocks the untapped capacity of LLMs to self-improve their knowledge awareness, opening the door to reliable, more accountable AI and safer deployment in critical applications. Owing to its simplicity and independence from external effort, we encourage applying this reliability-enhancing process to all future models.

[142] Valid Survey Simulations with Limited Human Data: The Roles of Prompting, Fine-Tuning, and Rectification

Stefan Krsteski, Giuseppe Russo, Serina Chang, Robert West, Kristina Gligorić

Main category: cs.CL

TL;DR: LLMs can replace human survey respondents but introduce bias. Combining synthesis (LLM-generated responses) with rectification (debiasing methods) reduces bias below 5% and increases effective sample size by 14%. Optimal allocation favors rectification over fine-tuning.

Details

Motivation: Traditional surveys are costly and slow. LLMs offer a scalable, low-cost alternative but produce biased outputs that yield invalid population estimates.

Method: Study the interplay between synthesis methods (using LLMs to generate survey responses) and rectification methods (debiasing population estimates). Analyze optimal allocation of human responses between synthesis and rectification under fixed budget constraints.

Result: Synthesis alone introduces substantial bias (24-86%). Combining synthesis with rectification reduces bias below 5% and increases effective sample size by up to 14%. Allocating most human responses to rectification rather than fine-tuning produces more effective estimation.

Conclusion: The common practice of using all human responses for fine-tuning is suboptimal. Under fixed budgets, allocating most resources to rectification results in far more effective population estimation from LLM-generated survey responses.

Abstract: Surveys provide valuable insights into public opinion and behavior, but their execution is costly and slow. Large language models (LLMs) have been proposed as a scalable, low-cost substitute for human respondents, but their outputs are often biased and yield invalid estimates. We study the interplay between synthesis methods that use LLMs to generate survey responses and rectification methods that debias population estimates, and explore how human responses are best allocated between them. Using two panel surveys with questions on nutrition, politics, and economics, we find that synthesis alone introduces substantial bias (24-86%), whereas combining it with rectification reduces bias below 5% and increases effective sample size by up to 14%. Overall, we challenge the common practice of using all human responses for fine-tuning, showing that under a fixed budget, allocating most to rectification results in far more effective estimation.

[143] Who are you, ChatGPT? Personality and Demographic Style in LLM-Generated Content

Dana Sotto Porat, Ella Rabinovich

Main category: cs.CL

TL;DR: A data-driven method using automatic classifiers reveals that LLMs systematically express higher Agreeableness and lower Neuroticism compared to humans, with gendered language patterns resembling humans but with reduced variation.

Details

Motivation: To investigate whether large language models exhibit personality- and demographic-like characteristics in their language without relying on self-report questionnaires.

Method: Applied automatic personality and gender classifiers to model replies on open-ended questions from Reddit, comparing six widely used LLMs to human-authored responses.

Result: LLMs systematically express higher Agreeableness and lower Neuroticism, reflecting cooperative and stable conversational tendencies. Gendered language patterns resemble human writers but with reduced variation.

Conclusion: The study provides new insights into personality and demographic patterns of generative AI through large-scale comparative analyses and contributes a new dataset of human and model responses.

Abstract: Generative large language models (LLMs) have become central to everyday life, producing human-like text across diverse domains. A growing body of research investigates whether these models also exhibit personality- and demographic-like characteristics in their language. In this work, we introduce a novel, data-driven methodology for assessing LLM personality without relying on self-report questionnaires, applying instead automatic personality and gender classifiers to model replies on open-ended questions collected from Reddit. Comparing six widely used models to human-authored responses, we find that LLMs systematically express higher Agreeableness and lower Neuroticism, reflecting cooperative and stable conversational tendencies. Gendered language patterns in model text broadly resemble those of human writers, though with reduced variation, echoing prior findings on automated agents. We contribute a new dataset of human and model responses, along with large-scale comparative analyses, shedding new light on the topic of personality and demographic patterns of generative AI.

[144] GenCNER: A Generative Framework for Continual Named Entity Recognition

Yawen Yang, Fukun Ma, Shiao Meng, Aiwei Liu, Lijie Wen

Main category: cs.CL

TL;DR: GenCNER is a generative framework for continual named entity recognition that converts CNER into entity triplet sequence generation using pre-trained seq2seq models, with type-specific pseudo labeling and knowledge distillation to mitigate catastrophic forgetting and semantic shift.

Details

Motivation: Existing continual learning methods for NER suffer from catastrophic forgetting and semantic shift of non-entity types as entity categories continuously increase in real-world scenarios.

Method: Convert CNER into sustained entity triplet sequence generation using pre-trained seq2seq models, with type-specific confidence-based pseudo labeling and knowledge distillation at triplet level.

Result: Outperforms previous state-of-the-art methods on two benchmark datasets in multiple CNER settings, achieving smallest gap compared with non-continual learning results.

Conclusion: GenCNER provides an effective generative framework that successfully mitigates catastrophic forgetting and semantic shift issues in continual NER through triplet-level generation and knowledge preservation techniques.

Abstract: Traditional named entity recognition (NER) aims to identify text mentions into pre-defined entity types. Continual Named Entity Recognition (CNER) is introduced since entity categories are continuously increasing in various real-world scenarios. However, existing continual learning (CL) methods for NER face challenges of catastrophic forgetting and semantic shift of non-entity type. In this paper, we propose GenCNER, a simple but effective Generative framework for CNER to mitigate the above drawbacks. Specifically, we skillfully convert the CNER task into sustained entity triplet sequence generation problem and utilize a powerful pre-trained seq2seq model to solve it. Additionally, we design a type-specific confidence-based pseudo labeling strategy along with knowledge distillation (KD) to preserve learned knowledge and alleviate the impact of label noise at the triplet level. Experimental results on two benchmark datasets show that our framework outperforms previous state-of-the-art methods in multiple CNER settings, and achieves the smallest gap compared with non-CL results.

[145] Investigating Large Language Models’ Linguistic Abilities for Text Preprocessing

Marco Braga, Gian Carlo Milanese, Gabriella Pasi

Main category: cs.CL

TL;DR: LLMs can effectively perform text preprocessing tasks like stopword removal, lemmatization, and stemming with high accuracy, outperforming traditional methods in text classification tasks across multiple languages.

Details

Motivation: Traditional text preprocessing methods ignore contextual information, which is crucial for accurate NLP tasks. LLMs offer the ability to consider context without needing extensive language-specific resources.

Method: Used Large Language Models to perform stopword removal, lemmatization, and stemming, comparing them against traditional algorithms on web-sourced data across six European languages in text classification tasks.

Result: LLMs achieved 97% accuracy for stopword removal, 82% for lemmatization, and 74% for stemming. ML models trained on LLM-preprocessed text showed up to 6% improvement in F1 score compared to traditional preprocessing.

Conclusion: LLM-based preprocessing is a viable alternative to traditional methods, offering context-aware processing that improves downstream text classification performance across multiple languages.

Abstract: Text preprocessing is a fundamental component of Natural Language Processing, involving techniques such as stopword removal, stemming, and lemmatization to prepare text as input for further processing and analysis. Despite the context-dependent nature of the above techniques, traditional methods usually ignore contextual information. In this paper, we investigate the idea of using Large Language Models (LLMs) to perform various preprocessing tasks, due to their ability to take context into account without requiring extensive language-specific annotated resources. Through a comprehensive evaluation on web-sourced data, we compare LLM-based preprocessing (specifically stopword removal, lemmatization and stemming) to traditional algorithms across multiple text classification tasks in six European languages. Our analysis indicates that LLMs are capable of replicating traditional stopword removal, lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%, respectively. Additionally, we show that ML algorithms trained on texts preprocessed by LLMs achieve an improvement of up to 6% with respect to the $F_1$ measure compared to traditional techniques. Our code, prompts, and results are publicly available at https://github.com/GianCarloMilanese/llm_pipeline_wi-iat.

[146] Hallucination Detection via Internal States and Structured Reasoning Consistency in Large Language Models

Yusheng Song, Lirong Qiu, Xi Zhang, Zhihao Tang

Main category: cs.CL

TL;DR: The paper introduces a unified framework to detect hallucinations in LLMs by bridging the gap between internal state probing and chain-of-thought verification methods, overcoming signal scarcity and representational alignment barriers.

Details

Motivation: Current hallucination detection methods suffer from a "Detection Dilemma" where internal state probing works well for factual inconsistencies but fails on logical fallacies, while chain-of-thought verification shows the opposite behavior, creating task-dependent blind spots.

Method: Proposes a multi-path reasoning mechanism to obtain fine-grained comparable signals and a segment-aware temporalized cross-attention module to adaptively fuse aligned representations, pinpointing subtle dissonances between internal states and external reasoning.

Result: Extensive experiments on three diverse benchmarks and two leading LLMs demonstrate that the framework consistently and significantly outperforms strong baselines.

Conclusion: The unified framework successfully resolves the detection dilemma by bridging internal state probing and chain-of-thought verification, providing comprehensive hallucination detection across both fact-intensive and logic-intensive tasks.

Abstract: The detection of sophisticated hallucinations in Large Language Models (LLMs) is hampered by a ``Detection Dilemma’’: methods probing internal states (Internal State Probing) excel at identifying factual inconsistencies but fail on logical fallacies, while those verifying externalized reasoning (Chain-of-Thought Verification) show the opposite behavior. This schism creates a task-dependent blind spot: Chain-of-Thought Verification fails on fact-intensive tasks like open-domain QA where reasoning is ungrounded, while Internal State Probing is ineffective on logic-intensive tasks like mathematical reasoning where models are confidently wrong. We resolve this with a unified framework that bridges this critical gap. However, unification is hindered by two fundamental challenges: the Signal Scarcity Barrier, as coarse symbolic reasoning chains lack signals directly comparable to fine-grained internal states, and the Representational Alignment Barrier, a deep-seated mismatch between their underlying semantic spaces. To overcome these, we introduce a multi-path reasoning mechanism to obtain more comparable, fine-grained signals, and a segment-aware temporalized cross-attention module to adaptively fuse these now-aligned representations, pinpointing subtle dissonances. Extensive experiments on three diverse benchmarks and two leading LLMs demonstrate that our framework consistently and significantly outperforms strong baselines. Our code is available: https://github.com/peach918/HalluDet.

[147] An Encoder-Integrated PhoBERT with Graph Attention for Vietnamese Token-Level Classification

Ba-Quang Nguyen

Main category: cs.CL

TL;DR: TextGraphFuseGAT integrates PhoBERT transformer with Graph Attention Networks and self-attention for token classification, achieving state-of-the-art results on three Vietnamese benchmarks.

Details

Motivation: To capture richer inter-token dependencies beyond sequential context alone for improved token-level classification across multiple domains.

Method: Constructs fully connected graph over PhoBERT token embeddings, applies GAT layer to capture inter-token dependencies, adds Transformer self-attention layer, and uses classification head for sequence labeling.

Result: Consistently outperforms strong baselines including transformer-only and hybrid neural models (BiLSTM + CNN + CRF) on three Vietnamese datasets: PhoNER-COVID19, PhoDisfluency, and VietMed-NER.

Conclusion: Combining pretrained semantic features with graph-based relational modeling is effective for improved token classification across multiple domains.

Abstract: We propose a novel neural architecture named TextGraphFuseGAT, which integrates a pretrained transformer encoder (PhoBERT) with Graph Attention Networks for token-level classification tasks. The proposed model constructs a fully connected graph over the token embeddings produced by PhoBERT, enabling the GAT layer to capture rich inter-token dependencies beyond those modeled by sequential context alone. To further enhance contextualization, a Transformer-style self-attention layer is applied on top of the graph-enhanced embeddings. The final token representations are passed through a classification head to perform sequence labeling. We evaluate our approach on three Vietnamese benchmark datasets: PhoNER-COVID19 for named entity recognition in the COVID-19 domain, PhoDisfluency for speech disfluency detection, and VietMed-NER for medical-domain NER. VietMed-NER is the first Vietnamese medical spoken NER dataset, featuring 18 entity types collected from real-world medical speech transcripts and annotated with the BIO tagging scheme. Its specialized vocabulary and domain-specific expressions make it a challenging benchmark for token-level classification models. Experimental results show that our method consistently outperforms strong baselines, including transformer-only and hybrid neural models such as BiLSTM + CNN + CRF, confirming the effectiveness of combining pretrained semantic features with graph-based relational modeling for improved token classification across multiple domains.

[148] Information-Preserving Reformulation of Reasoning Traces for Antidistillation

Jiayu Ding, Lei Cui, Li Dong, Nanning Zheng, Furu Wei

Main category: cs.CL

TL;DR: PART is an antidistillation method that reformulates reasoning traces to prevent unauthorized model distillation while preserving information for human understanding.

Details

Motivation: There's a trade-off between showing detailed reasoning traces (which helps users) and protecting against unauthorized distillation. Current protection methods remove valuable intermediate information.

Method: Two-step reformulation: remove self-talk behaviors and reorder sub-conclusions, implemented via a small auxiliary model with minimal computational overhead.

Result: PART consistently disrupts distillation across various student models and benchmarks. For example, a 32B model’s performance dropped from 54.17 to 46.88 on AIME 2024 (13.5% degradation).

Conclusion: PART effectively protects reasoning traces from unauthorized distillation while preserving information for human users, addressing the trade-off between transparency and protection.

Abstract: Recent advances in Large Language Models (LLMs) show that extending the length of reasoning chains significantly improves performance on complex tasks. While revealing these reasoning traces helps users better follow, verify, and learn from the model’s problem-solving process, it also makes them highly vulnerable to unauthorized distillation. To mitigate this risk, proprietary model providers often adopt aggressive protection strategies, such as replacing detailed reasoning with brief summaries, which deprive users of valuable intermediate information. To address this trade-off, we propose PART, an information-preserving antidistillation reformulation of reasoning traces. Motivated by the difference between how humans understand reasoning traces and how LLMs exploit them for supervised fine-tuning, we design a simple but effective two-step reformulation: removing self-talk behaviors and reordering sub-conclusions. A small auxiliary model is trained to perform this reformulation, incurring minimal computational overhead. Extensive experiments demonstrate that PART consistently disrupts distillation across student models of different sizes and types on various reasoning benchmarks. For instance, when training on reformulated traces, even the performance of a large 32B student model decreases from 54.17 to 46.88 on AIME 2024, corresponding to a 13.5% degradation.

[149] Invisible Languages of the LLM Universe

Saurabh Khanna, Xinxu Li

Main category: cs.CL

TL;DR: The paper analyzes linguistic inequality in AI systems, identifying four categories of languages based on vitality and digital presence, and argues that current AI development perpetuates colonial-era linguistic hierarchies through structural digital epistemic injustice.

Details

Motivation: To address the crisis where approximately 2,000 languages with millions of speakers remain invisible in digital ecosystems despite LLMs being trained on massive multilingual corpora, and to explain why linguistic inequality in AI is structural rather than incidental.

Method: Proposes a critical framework connecting empirical measurements of language vitality (demographic strength) and digitality (online presence) with postcolonial theory and epistemic injustice, analyzing data across all documented human languages to categorize them into four groups.

Result: Identified four language categories: Strongholds (33%, high vitality/digitality), Digital Echoes (6%, high digitality despite declining vitality), Fading Voices (36%, low on both), and Invisible Giants (27%, high vitality but near-zero digitality). Shows patterns reflect continuities from colonial-era linguistic hierarchies to contemporary AI development.

Conclusion: English dominance in AI is not a technical necessity but an artifact of power structures that systematically exclude marginalized linguistic knowledge. Calls for decolonizing language technology and democratizing access to AI benefits to address digital epistemic injustice.

Abstract: Large Language Models are trained on massive multilingual corpora, yet this abundance masks a profound crisis: of the world’s 7,613 living languages, approximately 2,000 languages with millions of speakers remain effectively invisible in digital ecosystems. We propose a critical framework connecting empirical measurements of language vitality (real world demographic strength) and digitality (online presence) with postcolonial theory and epistemic injustice to explain why linguistic inequality in AI systems is not incidental but structural. Analyzing data across all documented human languages, we identify four categories: Strongholds (33%, high vitality and digitality), Digital Echoes (6%, high digitality despite declining vitality), Fading Voices (36%, low on both dimensions), and critically, Invisible Giants (27%, high vitality but near-zero digitality) - languages spoken by millions yet absent from the LLM universe. We demonstrate that these patterns reflect continuities from colonial-era linguistic hierarchies to contemporary AI development, constituting what we term digital epistemic injustice. Our analysis reveals that English dominance in AI is not a technical necessity but an artifact of power structures that systematically exclude marginalized linguistic knowledge. We conclude with implications for decolonizing language technology and democratizing access to AI benefits.

[150] Culturally-Aware Conversations: A Framework & Benchmark for LLMs

Shreya Havaldar, Sunny Rai, Young-Min Cho, Lyle Ungar

Main category: cs.CL

TL;DR: Introduces a new framework and benchmark for evaluating LLMs’ cultural adaptation in realistic multicultural conversations, focusing on linguistic style shaped by situational, relational, and cultural contexts.

Details

Motivation: Existing benchmarks for cultural adaptation in LLMs are misaligned with real-world challenges faced when interacting with users from diverse cultural backgrounds.

Method: Developed a framework grounded in sociocultural theory, constructed a benchmark dataset annotated by culturally diverse raters, and proposed new evaluation desiderata: conversational framing, stylistic sensitivity, and subjective correctness.

Result: Evaluation of top LLMs shows they struggle with cultural adaptation in conversational settings despite their capabilities.

Conclusion: The proposed framework and benchmark provide more realistic evaluation of LLMs’ cultural adaptation abilities, revealing significant gaps in current models’ performance.

Abstract: Existing benchmarks that measure cultural adaptation in LLMs are misaligned with the actual challenges these models face when interacting with users from diverse cultural backgrounds. In this work, we introduce the first framework and benchmark designed to evaluate LLMs in realistic, multicultural conversational settings. Grounded in sociocultural theory, our framework formalizes how linguistic style - a key element of cultural communication - is shaped by situational, relational, and cultural context. We construct a benchmark dataset based on this framework, annotated by culturally diverse raters, and propose a new set of desiderata for cross-cultural evaluation in NLP: conversational framing, stylistic sensitivity, and subjective correctness. We evaluate today’s top LLMs on our benchmark and show that these models struggle with cultural adaptation in a conversational setting.

[151] LLMAtKGE: Large Language Models as Explainable Attackers against Knowledge Graph Embeddings

Ting Li, Yang Yang, Yipeng Yu, Liang Yao, Guoqing Chao, Ruifeng Xu

Main category: cs.CL

TL;DR: LLMAtKGE is a novel LLM-based framework for adversarial attacks on knowledge graph embeddings that selects attack targets and generates human-readable explanations using structured prompting and filtering techniques.

Details

Motivation: Existing black-box adversarial attack methods on knowledge graph embeddings lack human-readable explanations and exhibit poor generalizability, while LLMs have shown strong capabilities in text comprehension and reasoning.

Method: Uses structured prompting to formulate attacks as multiple-choice questions with KG factual evidence, implements semantics-based and centrality-based filters to compress candidate sets, precomputes high-order adjacency, and fine-tunes LLM with triple classification for better filtering.

Result: Outperforms strongest black-box baselines, provides explanations via reasoning, and shows competitive performance compared to white-box methods on two knowledge graph datasets.

Conclusion: LLMAtKGE effectively combines semantic and structural information for adversarial attacks while generating human-readable explanations, demonstrating the potential of LLMs in knowledge graph security applications.

Abstract: Adversarial attacks on knowledge graph embeddings (KGE) aim to disrupt the model’s ability of link prediction by removing or inserting triples. A recent black-box method has attempted to incorporate textual and structural information to enhance attack performance. However, it is unable to generate human-readable explanations, and exhibits poor generalizability. In the past few years, large language models (LLMs) have demonstrated powerful capabilities in text comprehension, generation, and reasoning. In this paper, we propose LLMAtKGE, a novel LLM-based framework that selects attack targets and generates human-readable explanations. To provide the LLM with sufficient factual context under limited input constraints, we design a structured prompting scheme that explicitly formulates the attack as multiple-choice questions while incorporating KG factual evidence. To address the context-window limitation and hesitation issues, we introduce semantics-based and centrality-based filters, which compress the candidate set while preserving high recall of attack-relevant information. Furthermore, to efficiently integrate both semantic and structural information into the filter, we precompute high-order adjacency and fine-tune the LLM with a triple classification task to enhance filtering performance. Experiments on two widely used knowledge graph datasets demonstrate that our attack outperforms the strongest black-box baselines and provides explanations via reasoning, and showing competitive performance compared with white-box methods. Comprehensive ablation and case studies further validate its capability to generate explanations.

[152] Survey Response Generation: Generating Closed-Ended Survey Responses In-Silico with Large Language Models

Georg Ahnert, Anna-Carolina Haensch, Barbara Plank, Markus Strohmaier

Main category: cs.CL

TL;DR: Systematic investigation of 8 different Survey Response Generation Methods for LLMs on political attitude surveys, finding significant differences in alignment and recommending Restricted Generation Methods as most effective.

Details

Motivation: To address the lack of standard practices for generating closed-ended survey responses with LLMs, which are typically trained for open-ended text generation, and to understand how different generation methods impact simulated survey responses.

Method: Generated 32 million simulated survey responses using 8 different Survey Response Generation Methods across 4 political attitude surveys and 10 open-weight language models, comparing individual-level and subpopulation-level alignment.

Result: Found significant differences between Survey Response Generation Methods, with Restricted Generation Methods performing best overall, and reasoning output not consistently improving alignment.

Conclusion: Survey Response Generation Methods significantly impact simulated survey responses, and Restricted Generation Methods are recommended as the most effective approach for generating closed-ended survey responses with LLMs.

Abstract: Many in-silico simulations of human survey responses with large language models (LLMs) focus on generating closed-ended survey responses, whereas LLMs are typically trained to generate open-ended text instead. Previous research has used a diverse range of methods for generating closed-ended survey responses with LLMs, and a standard practice remains to be identified. In this paper, we systematically investigate the impact that various Survey Response Generation Methods have on predicted survey responses. We present the results of 32 mio. simulated survey responses across 8 Survey Response Generation Methods, 4 political attitude surveys, and 10 open-weight language models. We find significant differences between the Survey Response Generation Methods in both individual-level and subpopulation-level alignment. Our results show that Restricted Generation Methods perform best overall, and that reasoning output does not consistently improve alignment. Our work underlines the significant impact that Survey Response Generation Methods have on simulated survey responses, and we develop practical recommendations on the application of Survey Response Generation Methods.

[153] MeTA-LoRA: Data-Efficient Multi-Task Fine-Tuning for Large Language Models

Bo Cheng, Xu Wang, Jinda Liu, Yi Chang, Yuan Wu

Main category: cs.CL

TL;DR: MeTA-LoRA is a two-stage optimization framework that improves data efficiency in multi-task adaptation of LLMs by first learning task-specific LoRA adapters with few samples, then updating a shared adapter through gradient aggregation across tasks.

Details

Motivation: LoRA struggles to efficiently leverage inter-task knowledge in multi-task learning scenarios and requires substantial task-specific data for optimal performance.

Method: Two-stage optimization: 1) Learn task-specific LoRA adapters using few samples from each dataset, 2) Update shared LoRA adapter by aggregating gradients from multiple tasks to promote knowledge transfer.

Result: Matches or surpasses performance of traditional full-data LoRA fine-tuning in multi-task and multilingual learning scenarios while using significantly less task-specific data.

Conclusion: MeTA-LoRA significantly improves data efficiency in multi-task adaptation of LLMs through its two-stage optimization approach.

Abstract: Low-Rank Adaptation (LoRA) has emerged as one of the most widely used parameter-efficient fine-tuning (PEFT) methods for adapting large language models (LLMs) to downstream tasks. While highly effective in single-task settings, it struggles to efficiently leverage inter-task knowledge in complex multi-task learning scenarios, often requiring substantial task-specific data to achieve optimal performance. To address this limitation, we introduce MeTA-LoRA, a two-stage optimization framework that significantly improves data efficiency in multi-task adaptation. In the first stage, task-specific LoRA adapters are learned using only a few samples from each involved dataset, enabling rapid adaptation without large-scale supervision. In the second stage, the shared LoRA adapter is updated by aggregating gradients from multiple tasks to promote knowledge transfer across tasks, further reducing data usage by leveraging common patterns. In both multi-task learning and multilingual learning scenarios, our method matches or surpasses the performance of traditional full-data LoRA fine-tuning approaches, while using significantly less task-specific data.

[154] SemCSE-Multi: Multifaceted and Decodable Embeddings for Aspect-Specific and Interpretable Scientific Domain Mapping

Marc Brinner, Sina Zarrieß

Main category: cs.CL

TL;DR: SemCSE-Multi is an unsupervised framework for generating multifaceted embeddings of scientific abstracts that capture distinct aspects, enabling fine-grained similarity assessment and adaptive visualizations.

Details

Motivation: To create embeddings that capture multiple distinct aspects of scientific abstracts in isolation, allowing for more fine-grained and controllable similarity assessments and visualizations.

Method: Uses unsupervised procedure to generate aspect-specific summarizing sentences, trains embedding models to map related summaries close together, then distills these into a unified model that predicts multiple aspect embeddings in one forward pass.

Result: The framework successfully generates multifaceted embeddings and includes an embedding decoding pipeline that can decode embeddings back to natural language descriptions, even for unoccupied regions in visualizations.

Conclusion: SemCSE-Multi provides interpretable, multifaceted embeddings for scientific abstracts with improved user interpretability through effective decoding capabilities.

Abstract: We propose SemCSE-Multi, a novel unsupervised framework for generating multifaceted embeddings of scientific abstracts, evaluated in the domains of invasion biology and medicine. These embeddings capture distinct, individually specifiable aspects in isolation, thus enabling fine-grained and controllable similarity assessments as well as adaptive, user-driven visualizations of scientific domains. Our approach relies on an unsupervised procedure that produces aspect-specific summarizing sentences and trains embedding models to map semantically related summaries to nearby positions in the embedding space. We then distill these aspect-specific embedding capabilities into a unified embedding model that directly predicts multiple aspect embeddings from a scientific abstract in a single, efficient forward pass. In addition, we introduce an embedding decoding pipeline that decodes embeddings back into natural language descriptions of their associated aspects. Notably, we show that this decoding remains effective even for unoccupied regions in low-dimensional visualizations, thus offering vastly improved interpretability in user-centric settings.

[155] Deconstructing Attention: Investigating Design Principles for Effective Language Modeling

Huiyin Xue, Nafise Sadat Moosavi, Nikolaos Aletras

Main category: cs.CL

TL;DR: This paper systematically deconstructs Transformer attention mechanisms, finding that token mixing is essential while mathematical form and sequence dependency can be relaxed, especially when standard attention is preserved in some layers.

Details

Motivation: To understand which design principles of Transformer attention are essential versus optional, as the necessity of each principle remains largely untested despite attention's success.

Method: Designed controlled variants that selectively relax attention principles (token mixing, sequence-dependent activations, dot-product form, query-key coupling), applied uniformly across layers and in hybrid architectures with some standard attention layers.

Result: Token mixing mechanisms are indispensable (absence causes near-random behavior), while mathematical form and sequence dependency can be substantially relaxed, especially when preserved in subset of layers. Hybrid architectures show cooperative effects where failing variants work when interleaved with standard attention.

Conclusion: Attention’s effectiveness relies more on token mixing than specific mathematical form or sequence dependency, opening avenues for simplifying language models without performance loss.

Abstract: The success of Transformer language models is widely credited to their dot-product attention mechanism, which interweaves a set of key design principles: mixing information across positions (enabling multi-token interactions), sequence-dependent activations (where attention weights adapt to each input), a specific mathematical form (dot-product similarities plus softmax weighting), and coupling of queries and keys to evolving hidden states (grounding attention in the current layer). However, the necessity of each of these principles remains largely untested. In this work, we systematically deconstruct attention by designing controlled variants that selectively relax these principles, applied both uniformly across all layers and in hybrid architectures where only some layers retain standard attention. Our empirical analysis reveals that mechanisms for mixing tokens are indispensable, as their absence collapses models to near-random behavior, while the exact mathematical form and sequence dependency can be substantially relaxed, especially when preserved in just a subset of layers. Surprisingly, even variants that fail in isolation can achieve robust performance when interleaved with standard attention, highlighting a cooperative effect. These findings deepen our understanding of what truly underpins attention’s effectiveness and open new avenues for simplifying language models without sacrificing performance.

[156] LLM-Oriented Token-Adaptive Knowledge Distillation

Xurong Xie, Zhucun Xue, Jiafu Wu, Jian Li, Yabiao Wang, Xiaobin Hu, Yong Liu, Jiangning Zhang

Main category: cs.CL

TL;DR: AdaKD is a dynamic knowledge distillation framework that adapts to each token’s learning state using token difficulty metrics, featuring adaptive token focusing and inverse temperature scaling to improve distillation efficiency.

Details

Motivation: Traditional logit-based KD methods use static strategies that don't align with the dynamic learning process of student models, treating all tokens equally with fixed temperatures, leading to suboptimal knowledge transfer.

Method: AdaKD consists of two modules: Loss-Driven Adaptive Token Focusing (LATF) that dynamically adjusts distillation focus based on student’s learning stability, and Inverse Difficulty Temperature Scaling (IDTS) that uses low temperatures for difficult tokens and high temperatures for easy tokens.

Result: AdaKD consistently improves performance of various distillation methods across multiple model architectures and benchmarks as a plug-and-play framework.

Conclusion: The proposed adaptive approach to knowledge distillation better aligns with the dynamic learning process and significantly enhances knowledge transfer efficiency in language model compression.

Abstract: Knowledge distillation (KD) is a key technique for compressing large-scale language models (LLMs), yet prevailing logit-based methods typically employ static strategies that are misaligned with the dynamic learning process of student models. These methods typically treat all tokens indiscriminately and apply a single, fixed temperature, resulting in suboptimal knowledge transfer. To address these limitations, we propose LLM-Oriented Token-Adaptive Knowledge Distillation (AdaKD), a novel framework that adapts the distillation process to the real-time learning state of each token. AdaKD consists of two synergistic modules driven by a unified token difficulty metric. First, our Loss-Driven Adaptive Token Focusing (LATF) module dynamically adjusts the distillation focus by monitoring the student’s learning stability, concentrating computational resources on the most valuable tokens at each training phase. Second, we introduce Inverse Difficulty Temperature Scaling (IDTS), a counterintuitive yet effective token-level temperature strategy. It employs low temperatures for difficult tokens for targeted error correction, and high temperatures for easy tokens to encourage students to learn from the teacher’s complete and smooth output distribution, thereby enhancing generalization. As a plug-and-play framework, AdaKD can consistently improve the performance of various distillation methods on multiple model architectures and benchmarks.

[157] Enhancing Long Chain-of-Thought Reasoning through Multi-Path Plan Aggregation

Siheng Xiong, Ali Payani, Faramarz Fekri

Main category: cs.CL

TL;DR: MPPA addresses CoT derailment in small LMs by generating multiple candidate plans at intervals and aggregating them, combined with Step-DPO for efficient training.

Details

Motivation: Existing single-pass CoT generation often leads to reasoning derailment due to compounding errors, especially in smaller LMs with long reasoning chains where planning errors are the main issue.

Method: Proposes Multi-Path Plan Aggregation (MPPA) with variable interval plan generation and aggregation using a lightweight LoRA module, plus online Step-DPO with TSMC for stepwise supervision.

Result: Outperforms DeepSeek-R1 distillation and outcome-reward RL baselines on math, science, and logical reasoning benchmarks using only 10% SFT data and 5% preference pairs.

Conclusion: MPPA with Step-DPO effectively mitigates CoT derailment and enables efficient training for improved reasoning in small LMs.

Abstract: Inference-time scaling enhances the reasoning ability of a language model (LM) by extending its chain-of-thought (CoT). However, existing approaches typically generate the entire reasoning chain in a single forward pass, which often leads to CoT derailment, i.e., the reasoning trajectory drifting off course due to compounding errors. This problem is particularly severe for smaller LMs with long CoTs due to their limited capacity. To address this, we analyze raw long CoTs and uncover a reasoning hierarchy consisting of planning and execution steps. Our analysis reveals that most reasoning errors stem from incorrect planning. Motivated by this observation, we propose Multi-Path Plan Aggregation (MPPA), a framework that augments single-pass reasoning with plan exploration and aggregation. Following a variable interval schedule based on the token position, MPPA generates multiple candidate plans and aggregates them into a refined planning step. To maintain efficiency, we adopt a minimal design in which the base LM serves as the primary policy, while a lightweight LoRA module implements the plan aggregation policy. We further observe that outcome-reward RL is inefficient for long trajectories (e.g., exceeding 4K tokens). To overcome this, we introduce online Step-DPO, a process-level preference optimization scheme that leverages Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision using small LMs. This yields more efficient training, improved stability, and higher accuracy. Extensive experiments on challenging math, science, and logical reasoning benchmarks demonstrate that, with only 10% SFT data and 5% of preference pairs, our method outperforms both the DeepSeek-R1 distillation baseline and the outcome-reward RL baseline across multiple base models and tasks.

[158] ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems

Xin Gui, King Zhu, JinCheng Ren, Qianben Chen, Zekun Moore Wang, Yizhi LI, Xinpeng Liu, Xiaowan Li, Wenli Ren, Linyu Miao, Tianrui Qin, Ziqi Shu, He Zhu, Xiangru Tang, Dingfeng Shi, Jiaheng Liu, Yuchen Eleanor Jiang, Minghao Liu, Ge Zhang, Wangchunshu Zhou

Main category: cs.CL

TL;DR: Acadreason is a new benchmark for evaluating LLMs and agents on academic reasoning across five domains, showing current models perform poorly with none scoring above 40 points.

Details

Motivation: Existing evaluations lack sufficient reasoning depth for academic knowledge, creating a gap in rigorous benchmarks for high-level reasoning in LLMs and agents.

Method: Created Acadreason benchmark with 50 expert-annotated academic problems across computer science, economics, law, mathematics, and philosophy, sourced from top-tier publications with rigorous quality control.

Result: Most LLMs scored below 20 points, with GPT-5 achieving only 16 points. Agents performed better but none exceeded 40 points, showing significant capability gaps.

Conclusion: Current LLMs and agents have substantial limitations in handling super-intelligent academic research tasks, highlighting the challenging nature of the Acadreason benchmark.

Abstract: In recent years, the research focus of large language models (LLMs) and agents has shifted increasingly from demonstrating novel capabilities to complex reasoning and tackling challenging tasks. However, existing evaluations focus mainly on math/code contests or general tasks, while existing multi-domain academic benchmarks lack sufficient reasoning depth, leaving the field without a rigorous benchmark for high-level reasoning. To fill this gap, we introduce the Acadreason benchmark, designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge. It consists of 50 expert-annotated academic problems across five high-reasoning domains, including computer science, economics, law, mathematics, and philosophy. All questions are sourced from top-tier publications in recent years and undergo rigorous annotation and quality control to ensure they are both challenging and answerable. We conduct systematic evaluations of over 10 mainstream LLMs and agents. The results show that most LLMs scored below 20 points, with even the cutting-edge GPT-5 achieving only 16 points. While agents achieved higher scores, none exceeded 40 points. This demonstrates the current capability gap between LLMs and agents in super-intelligent academic research tasks and highlights the challenges of Acadreason.

[159] Scaling Language-Centric Omnimodal Representation Learning

Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong

Main category: cs.CL

TL;DR: MLLM-based multimodal embedding approaches achieve superior performance due to implicit cross-modal alignment from generative pretraining, which allows contrastive learning to serve as lightweight refinement. The proposed LCO-Emb framework demonstrates state-of-the-art results across modalities.

Details

Motivation: To understand why MLLM-based multimodal embedding approaches with contrastive learning outperform other methods, and to leverage the implicit cross-modal alignment learned during generative pretraining.

Method: Proposed Language-Centric Omnimodal Embedding (LCO-Emb) framework that leverages implicit alignment from MLLM generative pretraining, with contrastive learning as refinement. Analyzed anisotropy and kernel similarity structure to confirm latent alignment.

Result: Achieved state-of-the-art performance across diverse benchmarks and modalities. Identified Generation-Representation Scaling Law (GRSL) showing representation quality scales with generative capabilities. Validated on challenging visual-document retrieval tasks.

Conclusion: Improving generative abilities is an effective paradigm for enhancing representation quality. Continual generative pretraining before contrastive learning can further boost embedding capabilities, with theoretical explanation linking generative quality to representation performance upper bound.

Abstract: Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scales positively with the MLLM’s generative capabilities. This suggests that improving generative abilities evolves as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM’s generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model’s embedding capabilities. Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.

[160] When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents

Lingfei Qian, Xueqing Peng, Yan Wang, Vincent Jim Zhang, Huan He, Hanley Smith, Yi Han, Yueru He, Haohang Li, Yupeng Cao, Yangyang Yu, Alejandro Lopez-Lira, Peng Lu, Jian-Yun Nie, Guojun Xiong, Jimin Huang, Sophia Ananiadou

Main category: cs.CL

TL;DR: AMA is the first lifelong, real-time benchmark for evaluating LLM-based trading agents across multiple markets, addressing gaps in current testing methods.

Details

Motivation: Current studies test LLM models instead of agents, cover limited periods and assets, and rely on unverified data, making it unclear if agents can reason and adapt in live markets.

Method: AMA integrates verified trading data, expert-checked news, and diverse agent architectures within a unified trading framework. It implements four agents with different risk styles and reasoning capabilities, evaluated across multiple LLM backends.

Result: Live experiments show agent frameworks display distinct behavioral patterns (from aggressive to conservative), while model backbones contribute less to outcome variation.

Conclusion: AMA establishes a foundation for rigorous, reproducible, and continuously evolving evaluation of financial reasoning and trading intelligence in LLM-based agents.

Abstract: Although Large Language Model (LLM)-based agents are increasingly used in financial trading, it remains unclear whether they can reason and adapt in live markets, as most studies test models instead of agents, cover limited periods and assets, and rely on unverified data. To address these gaps, we introduce Agent Market Arena (AMA), the first lifelong, real-time benchmark for evaluating LLM-based trading agents across multiple markets. AMA integrates verified trading data, expert-checked news, and diverse agent architectures within a unified trading framework, enabling fair and continuous comparison under real conditions. It implements four agents, including InvestorAgent as a single-agent baseline, TradeAgent and HedgeFundAgent with different risk styles, and DeepFundAgent with memory-based reasoning, and evaluates them across GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and Gemini-2.0-flash. Live experiments on both cryptocurrency and stock markets demonstrate that agent frameworks display markedly distinct behavioral patterns, spanning from aggressive risk-taking to conservative decision-making, whereas model backbones contribute less to outcome variation. AMA thus establishes a foundation for rigorous, reproducible, and continuously evolving evaluation of financial reasoning and trading intelligence in LLM-based agents.

[161] Demystifying Reinforcement Learning in Agentic Reasoning

Zhaochen Yu, Ling Yang, Jiaru Zou, Shuicheng Yan, Mengdi Wang

Main category: cs.CL

TL;DR: This paper systematically investigates reinforcement learning for agentic reasoning in LLMs, identifying key design principles for data, algorithms, and reasoning modes that enable smaller models to achieve superior performance.

Details

Motivation: While agentic RL has shown promise in improving LLMs' reasoning abilities, the optimal design principles and practices remain unclear, motivating a comprehensive investigation.

Method: Conducted systematic investigation from three perspectives: data (using real end-to-end tool-use trajectories and high-diversity datasets), algorithm (exploration-friendly techniques like clip higher, reward shaping, and policy entropy), and reasoning mode (deliberative strategy with fewer tool calls).

Result: Achieved strong results on challenging benchmarks (AIME2024/AIME2025, GPQA-Diamond, LiveCodeBench-v6) with smaller models - 4B-sized models outperformed 32B-sized models in agentic reasoning performance.

Conclusion: Simple practices in data curation, algorithm design, and reasoning strategies consistently enhance agentic reasoning and training efficiency, establishing a practical baseline for future agentic RL research.

Abstract: Recently, the emergence of agentic RL has showcased that RL could also effectively improve the agentic reasoning ability of LLMs, yet the key design principles and optimal practices remain unclear. In this work, we conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning from three key perspectives: data, algorithm, and reasoning mode. We highlight our key insights: (i) Replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT initialization; high-diversity, model-aware datasets sustain exploration and markedly improve RL performance. (ii) Exploration-friendly techniques are crucial for agentic RL, such as clip higher, overlong reward shaping, and maintaining adequate policy entropy could improve the training efficiency. (iii) A deliberative strategy with fewer tool calls outperforms frequent tool calls or verbose self-reasoning, improving tool efficiency and final accuracy. Together, these simple practices consistently enhance agentic reasoning and training efficiency, achieving strong results on challenging benchmarks with smaller models, and establishing a practical baseline for future agentic RL research. Beyond these empirical insights, we further contribute a high-quality, real end-to-end agentic SFT dataset along with a high-quality RL dataset, and demonstrate the effectiveness of our insights in boosting the agentic reasoning ability of LLMs across four challenging benchmarks, including AIME2024/AIME2025, GPQA-Diamond, and LiveCodeBench-v6. With our recipes, 4B-sized models could also achieve superior agentic reasoning performance compared to 32B-sized models. Code and models: https://github.com/Gen-Verse/Open-AgentRL

[162] Are Large Reasoning Models Interruptible?

Tsung-Han Wu, Mihran Miroyan, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez

Main category: cs.CL

TL;DR: LRMs perform well in static settings but fail unpredictably when interrupted or exposed to changing context, with performance dropping up to 60% in dynamic scenarios.

Details

Motivation: The 'frozen world' assumption in LRM evaluation breaks down in modern reasoning tasks like assistive programming where contexts change during long reasoning processes.

Method: Evaluate LRM robustness under two dynamic scenarios: interruptions (testing partial outputs on limited budget) and dynamic context (testing adaptation to in-flight changes).

Result: Static evaluations overestimate robustness - state-of-the-art LRMs fail unpredictably when interrupted or exposed to changing context, with up to 60% performance drop when updates occur late in reasoning.

Conclusion: LRMs exhibit novel failure modes including reasoning leakage, panic under time pressure, and self-doubt when incorporating updated information, highlighting the need for dynamic evaluation frameworks.

Abstract: Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, “frozen world” settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the “frozen world” assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model’s final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model’s partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information.

[163] SANTA: Separate Strategies for Inaccurate and Incomplete Annotation Noise in Distantly-Supervised Named Entity Recognition

Shuzheng Si, Zefan Cai, Shuang Zeng, Guoqiang Feng, Jiaxing Lin, Baobao Chang

Main category: cs.CL

TL;DR: SANTA is a novel approach for distantly-supervised named entity recognition that separately handles inaccurate and incomplete annotation noise using tailored strategies, achieving state-of-the-art performance on five public datasets.

Details

Motivation: Previous methods either only addressed incomplete annotation noise or used the same strategy for both inaccurate and incomplete noise, ignoring their different causes and requiring different handling strategies.

Method: Uses Memory-smoothed Focal Loss and Entity-aware KNN to handle inaccurate annotation noise (entity ambiguity), and Boundary Mixup with noise-tolerant loss to handle incomplete annotation noise (decision boundary shifting).

Result: SANTA effectively mitigates both types of noise and achieves new state-of-the-art performance on five public datasets.

Conclusion: Separate tailored strategies for different types of annotation noise in distantly-supervised NER lead to superior performance compared to uniform approaches.

Abstract: Distantly-Supervised Named Entity Recognition effectively alleviates the burden of time-consuming and expensive annotation in the supervised setting. But the context-free matching process and the limited coverage of knowledge bases introduce inaccurate and incomplete annotation noise respectively. Previous studies either considered only incomplete annotation noise or indiscriminately handle two types of noise with the same strategy. In this paper, we argue that the different causes of two types of noise bring up the requirement of different strategies in model architecture. Therefore, we propose the SANTA to handle these two types of noise separately with (1) Memory-smoothed Focal Loss and Entity-aware KNN to relieve the entity ambiguity problem caused by inaccurate annotation, and (2) Boundary Mixup to alleviate decision boundary shifting problem caused by incomplete annotation and a noise-tolerant loss to improve the robustness. Benefiting from our separate tailored strategies, we confirm in the experiment that the two types of noise are well mitigated. SANTA also achieves a new state-of-the-art on five public datasets.

[164] Native Language Identification in Turkish: L1 Influence of Arabic, Persian, and Albanian

Ahmet Yavuz Uluslu, Gerold Schneider

Main category: cs.CL

TL;DR: First Native Language Identification (NLI) study for Turkish language using texts from Albanian, Arabic, and Persian native speakers, achieving promising results with syntactic features.

Details

Motivation: To extend NLI research beyond English by applying it to Turkish language and identify native language transfer effects from Albanian, Arabic, and Persian speakers.

Method: Used cleaned Turkish Learner Corpus with syntactic features, comparing structural Part-of-Speech n-gram model with hybrid model retaining function words.

Result: Models achieved promising results in identifying native languages from L2 Turkish texts, with analysis of most predictive features revealing L1-specific transfer effects.

Conclusion: Successfully demonstrated NLI applicability to Turkish language, providing insights into L1 transfer patterns and making data/code available for future research.

Abstract: This paper presents the first application of Native Language Identification (NLI) for the Turkish language. NLI is the task of automatically identifying an individual’s native language (L1) based on their writing or speech in a non-native language (L2). While most NLI research has focused on L2 English, our study extends this scope to L2 Turkish by analyzing a corpus of texts written by native speakers of Albanian, Arabic and Persian. We leverage a cleaned version of the Turkish Learner Corpus and demonstrate the effectiveness of syntactic features, comparing a structural Part-of-Speech n-gram model to a hybrid model that retains function words. Our models achieve promising results, and we analyze the most predictive features to reveal L1-specific transfer effects. We make our data and code publicly available for further study.

[165] Survey of Natural Language Processing for Education: Taxonomy, Systematic Review, and Future Trends

Yunshi Lan, Xinyuan Li, Hanyue Du, Xuesong Lu, Ming Gao, Weining Qian, Aoying Zhou

Main category: cs.CL

TL;DR: This survey paper reviews recent NLP advances focused on education applications, covering question answering, automated assessment, error correction, and LLM-based methods, while identifying future research directions.

Details

Motivation: NLP has enormous potential to help teaching and learning in education, with applications in healthcare, commerce, and education domains.

Method: The paper presents a taxonomy of NLP in education, reviews task definitions and challenges, and discusses cutting-edge techniques including LLM-involved methods.

Result: The survey organizes relevant datasets and papers, showcases off-the-shelf demonstrations for educators/researchers, and provides a comprehensive Github repository.

Conclusion: Five promising future directions: generalization across subjects/languages, deployed LLM-based systems, adaptive learning, interpretability, and ethical considerations for NLP in education.

Abstract: Natural Language Processing (NLP) aims to analyze text or speech via techniques in the computer science field. It serves applications in the domains of healthcare, commerce, education, and so on. Particularly, NLP has been widely applied to the education domain and its applications have enormous potential to help teaching and learning. In this survey, we review recent advances in NLP with a focus on solving problems relevant to the education domain. In detail, we begin with introducing the related background and the real-world scenarios in education to which NLP techniques could contribute. Then, we present a taxonomy of NLP in the education domain and highlight typical NLP applications including question answering, question construction, automated assessment, and error correction. Next, we illustrate the task definition, challenges, and corresponding cutting-edge techniques based on the above taxonomy. In particular, LLM-involved methods are included for discussion due to the wide usage of LLMs in diverse NLP applications. After that, we showcase some off-the-shelf demonstrations in this domain, which are designed for educators or researchers. At last, we conclude with five promising directions for future research, including generalization over subjects and languages, deployed LLM-based systems for education, adaptive learning for teaching and learning, interpretability for education, and ethical consideration of NLP techniques. We organize all relevant datasets and papers in the open-available Github Link for better review https://github.com/LiXinyuan1015/NLP-for-Education.

[166] Does Biomedical Training Lead to Better Medical Performance?

Amin Dada, Marie Bauer, Amanda Butler Contreras, Osman Alperen Koraş, Constantin Marc Seibold, Kaleb E Smith, Jens Kleesiek

Main category: cs.CL

TL;DR: Biomedical LLMs show performance decline after fine-tuning on medical tasks, with general-domain models outperforming specialized biomedical models in 6 practical medical tasks.

Details

Motivation: To systematically evaluate the effect of biomedical training on LLMs' performance in healthcare applications, addressing the gap in systematic assessment of biomedical training on medical tasks.

Method: Evaluated 25 models on six practical medical tasks, comparing performance before and after fine-tuning, with focus on hallucinations, ICD10 coding, and instruction adherence.

Result: Nine out of twelve biomedical models showed performance decline after fine-tuning. General-domain models like Meta-Llama-3.1-70B-Instruct outperformed biomedical counterparts.

Conclusion: There is a trade-off between domain-specific fine-tuning and general medical task performance, suggesting that biomedical training may not always improve performance on practical medical tasks.

Abstract: Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models’ suitability for this sensitive application area is of the utmost importance. However, biomedical training has not been systematically evaluated on medical tasks. This study investigates the effect of biomedical training in the context of six practical medical tasks evaluating $25$ models. In contrast to previous evaluations, our results reveal a performance decline in nine out of twelve biomedical models after fine-tuning, particularly on tasks involving hallucinations, ICD10 coding, and instruction adherence. General-domain models like Meta-Llama-3.1-70B-Instruct outperformed their biomedical counterparts, indicating a trade-off between domain-specific fine-tuning and general medical task performance. We open-source all evaluation scripts and datasets at https://github.com/TIO-IKIM/CLUE to support further research in this critical area.

[167] QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory

Yihang Wang, Xu Huang, Bowen Tian, Yueyang Su, Lei Yu, Huaming Liao, Yixing Fan, Jiafeng Guo, Xueqi Cheng

Main category: cs.CL

TL;DR: The paper proposes using information bottleneck theory for context compression in LLMs, achieving 25% higher compression rates while maintaining QA performance.

Details

Motivation: Long contexts in LLMs cause high costs, inference delays, and the 'lost in the middle' problem. Existing compression methods remove tokens using metrics like self-information or PPL, which don't align with retaining important query-relevant information.

Method: Introduces information bottleneck theory to model context compression and proposes a cross-attention-based approach to approximate mutual information, which can be flexibly adapted for different scenarios.

Result: Extensive experiments on four datasets show the method achieves 25% higher compression rate than state-of-the-art while maintaining question answering performance. In some cases, compressed context even outperforms full context.

Conclusion: Information bottleneck theory provides a novel and effective framework for context compression in LLMs, addressing key limitations of existing methods and enabling more efficient processing of long contexts.

Abstract: Generative LLM have achieved remarkable success in various industrial applications, owing to their promising In-Context Learning capabilities. However, the issue of long context in complex tasks poses a significant barrier to their wider adoption, manifested in two main aspects: (i) The excessively long context leads to high costs and inference delays. (ii) A substantial amount of task-irrelevant information introduced by long contexts exacerbates the “lost in the middle” problem. Existing methods compress context by removing redundant tokens using metrics such as self-information or PPL, which is inconsistent with the objective of retaining the most important tokens when conditioning on a given query. In this study, we introduce information bottleneck theory (IB) to model the problem, offering a novel perspective that thoroughly addresses the essential properties required for context compression. Additionally, we propose a cross-attention-based approach to approximate mutual information in IB, which can be flexibly replaced with suitable alternatives in different scenarios. Extensive experiments on four datasets demonstrate that our method achieves a 25% increase in compression rate compared to the state-of-the-art, while maintaining question answering performance. In particular, the context compressed by our method even outperform the full context in some cases.

[168] Who Speaks Matters: Analysing the Influence of the Speaker’s Ethnicity on Hate Classification

Ananya Malik, Kartik Sharma, Shaily Bhatt, Lynnette Hui Xian Ng

Main category: cs.CL

TL;DR: LLMs show brittleness and bias in hate speech detection when ethnic markers are present, with implicit dialect markers causing more output flips than explicit ones, and larger models being more robust.

Details

Motivation: To critically scrutinize LLM applications for high-stakes tasks like hate speech detection due to known brittleness and bias against marginalized communities and dialects.

Method: Inject explicit markers (mentioning linguistic identity) and implicit markers (dialectal features) into inputs, then analyze how frequently model outputs flip across 3 LLMs, 1 LM, and 5 linguistic identities.

Result: Implicit dialect markers cause more output flips than explicit markers, flip percentages vary across ethnicities, and larger models show more robustness.

Conclusion: Caution is needed when deploying LLMs for high-stakes hate speech detection due to demonstrated brittleness and bias issues.

Abstract: Large Language Models (LLMs) offer a lucrative promise for scalable content moderation, including hate speech detection. However, they are also known to be brittle and biased against marginalised communities and dialects. This requires their applications to high-stakes tasks like hate speech detection to be critically scrutinized. In this work, we investigate the robustness of hate speech classification using LLMs particularly when explicit and implicit markers of the speaker’s ethnicity are injected into the input. For explicit markers, we inject a phrase that mentions the speaker’s linguistic identity. For the implicit markers, we inject dialectal features. By analysing how frequently model outputs flip in the presence of these markers, we reveal varying degrees of brittleness across 3 LLMs and 1 LM and 5 linguistic identities. We find that the presence of implicit dialect markers in inputs causes model outputs to flip more than the presence of explicit markers. Further, the percentage of flips varies across ethnicities. Finally, we find that larger models are more robust. Our findings indicate the need for exercising caution in deploying LLMs for high-stakes tasks like hate speech detection.

[169] A Survey on Automatic Credibility Assessment Using Textual Credibility Signals in the Era of Large Language Models

Ivan Srba, Olesya Razuvayevskaya, João A. Leite, Robert Moro, Ipek Baris Schlicht, Sara Tonelli, Francisco Moreno García, Santiago Barrio Lottmann, Denis Teyssou, Valentin Porcellini, Carolina Scarton, Kalina Bontcheva, Maria Bielikova

Main category: cs.CL

TL;DR: This survey paper provides a comprehensive literature review of 175 research papers on automatic credibility assessment in NLP, focusing on detecting and aggregating multiple credibility signals like factuality, bias, persuasion techniques, and fact-checked claims.

Details

Motivation: Current research in automatic credibility assessment is fragmented, with many signals studied in isolation and lacking integration. There's a need for a comprehensive overview that connects research efforts under a common framework and identifies trends, challenges, and open problems.

Method: Systematic literature review of 175 research papers focusing on textual credibility signals in NLP, examining automatic credibility assessment methods and detection of nine categories of credibility signals, with in-depth analysis of three key categories.

Result: The survey provides an in-depth analysis of credibility signal categories including factuality, subjectivity and bias; persuasion techniques and logical fallacies; and check-worthy and fact-checked claims, along with summaries of existing methods, datasets, and tools.

Conclusion: The paper outlines future research directions and emerging opportunities in credibility assessment, with particular attention to evolving challenges posed by generative AI, positioning NLP research within the broader multidisciplinary landscape.

Abstract: In the age of social media and generative AI, the ability to automatically assess the credibility of online content has become increasingly critical, complementing traditional approaches to false information detection. Credibility assessment relies on aggregating diverse credibility signals - small units of information, such as content subjectivity, bias, or a presence of persuasion techniques - into a final credibility label/score. However, current research in automatic credibility assessment and credibility signals detection remains highly fragmented, with many signals studied in isolation and lacking integration. Notably, there is a scarcity of approaches that detect and aggregate multiple credibility signals simultaneously. These challenges are further exacerbated by the absence of a comprehensive and up-to-date overview of research works that connects these research efforts under a common framework and identifies shared trends, challenges, and open problems. In this survey, we address this gap by presenting a systematic and comprehensive literature review of 175 research papers, focusing on textual credibility signals within the field of Natural Language Processing (NLP), which undergoes a rapid transformation due to advancements in Large Language Models (LLMs). While positioning the NLP research into the the broader multidisciplinary landscape, we examine both automatic credibility assessment methods as well as the detection of nine categories of credibility signals. We provide an in-depth analysis of three key categories: 1) factuality, subjectivity and bias, 2) persuasion techniques and logical fallacies, and 3) check-worthy and fact-checked claims. In addition to summarising existing methods, datasets, and tools, we outline future research direction and emerging opportunities, with particular attention to evolving challenges posed by generative AI.

[170] SEKE: Specialised Experts for Keyword Extraction

Matej Martinc, Hanh Thi Hong Tran, Senja Pollak, Boshko Koloski

Main category: cs.CL

TL;DR: SEKE is a supervised keyword extraction method using Mixture of Experts with DeBERTa backbone and BiLSTM, achieving state-of-the-art performance on English datasets with enhanced explainability.

Details

Motivation: Real-world keyword extraction requires handling diverse content, and existing methods struggle with smaller corpora where specialization is difficult due to limited training data.

Method: Proposes SEKE - a Mixture of Specialised Experts framework using DeBERTa as backbone, integrated with BiLSTM network, where experts specialize in distinct regions of input space via learnable routing sub-network.

Result: Achieves state-of-the-art performance on multiple English datasets compared to strong supervised and unsupervised baselines. Experts specialize in distinct syntactic and semantic components like punctuation, stopwords, parts-of-speech, or named entities.

Conclusion: SEKE provides effective keyword extraction with enhanced explainability through MoE framework, successfully handling diverse content even on smaller corpora.

Abstract: Keyword extraction involves identifying the most descriptive words in a document, allowing automatic categorisation and summarisation of large quantities of diverse textual data. Relying on the insight that real-world keyword detection often requires handling of diverse content, we propose a novel supervised keyword extraction approach based on the mixture of experts (MoE) technique. MoE uses a learnable routing sub-network to direct information to specialised experts, allowing them to specialise in distinct regions of the input space. SEKE, a mixture of Specialised Experts for supervised Keyword Extraction, uses DeBERTa as the backbone model and builds on the MoE framework, where experts attend to each token, by integrating it with a bidirectional Long short-term memory (BiLSTM) network, to allow successful extraction even on smaller corpora, where specialisation is harder due to lack of training data. The MoE framework also provides an insight into inner workings of individual experts, enhancing the explainability of the approach. We benchmark SEKE on multiple English datasets, achieving state-of-the-art performance compared to strong supervised and unsupervised baselines. Our analysis reveals that depending on data size and type, experts specialise in distinct syntactic and semantic components, such as punctuation, stopwords, parts-of-speech, or named entities. Code is available at https://github.com/matejMartinc/SEKE_keyword_extraction

[171] SubData: Bridging Heterogeneous Datasets to Enable Theory-Driven Evaluation of Political and Demographic Perspectives in LLMs

Pietro Bernardelle, Leon Fröhling, Stefano Civelli, Gianluca Demartini

Main category: cs.CL

TL;DR: SubData is an open-source Python library for standardizing datasets to evaluate LLM perspective alignment, with a theory-driven approach to test differently-aligned LLMs on content classification.

Details

Motivation: Evaluate LLM alignment with human perspectives on subjective tasks is challenging due to inconsistent datasets across studies.

Method: Two-step framework: (1) SubData library for standardizing heterogeneous datasets, (2) theory-driven approach to test differently-aligned LLMs on content classification.

Result: SubData enables flexible mapping and customization for diverse research needs, distinguishing it from existing resources.

Conclusion: Invite contributions to extend SubData into a multi-construct benchmark suite for evaluating LLM perspective alignment on NLP tasks.

Abstract: As increasingly capable large language models (LLMs) emerge, researchers have begun exploring their potential for subjective tasks. While recent work demonstrates that LLMs can be aligned with diverse human perspectives, evaluating this alignment on downstream tasks (e.g., hate speech detection) remains challenging due to the use of inconsistent datasets across studies. To address this issue, in this resource paper we propose a two-step framework: we (1) introduce SubData, an open-source Python library designed for standardizing heterogeneous datasets to evaluate LLMs perspective alignment; and (2) present a theory-driven approach leveraging this library to test how differently-aligned LLMs (e.g., aligned with different political viewpoints) classify content targeting specific demographics. SubData’s flexible mapping and taxonomy enable customization for diverse research needs, distinguishing it from existing resources. We illustrate its usage with an example application and invite contributions to extend our initial release into a multi-construct benchmark suite for evaluating LLMs perspective alignment on natural language processing tasks.

[172] Rethinking the Residual Distribution of Locate-then-Editing Methods in Model Editing

Xiaopeng Li, Shanwen Wang, Shasha Li, Shezheng Song, Bin Ji, Jun Ma, Jie Yu

Main category: cs.CL

TL;DR: The paper identifies a failure mode in locate-then-edit model editing methods where residual distribution introduces weight shift errors, and proposes BLUE strategy to address this issue, achieving 35.59% average performance improvement.

Details

Motivation: To address the counterintuitive failure mode in existing locate-then-edit model editing methods where residual distribution introduces weight shift errors that undermine editing precision.

Method: Proposes Boundary Layer Update (BLUE) strategy to enhance locate-then-edit methods by addressing the weight shift errors caused by residual distribution.

Result: BLUE delivers 35.59% average performance improvement in sequential batch editing experiments on three LLMs and two datasets, significantly advancing state of the art in model editing while preserving LLMs’ general capabilities.

Conclusion: BLUE effectively addresses the weight shift error problem in locate-then-edit model editing methods and provides substantial performance improvements while maintaining model generalization capabilities.

Abstract: Model editing enables targeted updates to the knowledge of large language models (LLMs) with minimal retraining. Among existing approaches, locate-then-edit methods constitute a prominent paradigm: they first identify critical layers, then compute residuals at the final critical layer based on the target edit, and finally apply least-squares-based multi-layer updates via $\textbf{residual distribution}$. While empirically effective, we identify a counterintuitive failure mode: residual distribution, a core mechanism in these methods, introduces weight shift errors that undermine editing precision. Through theoretical and empirical analysis, we show that such errors increase with the distribution distance, batch size, and edit sequence length, ultimately leading to inaccurate or suboptimal edits. To address this, we propose the $\textbf{B}$oundary $\textbf{L}$ayer $\textbf{U}$pdat$\textbf{E (BLUE)}$ strategy to enhance locate-then-edit methods. Sequential batch editing experiments on three LLMs and two datasets demonstrate that BLUE not only delivers an average performance improvement of 35.59%, significantly advancing the state of the art in model editing, but also enhances the preservation of LLMs’ general capabilities. Our code is available at https://github.com/xpq-tech/BLUE.

[173] Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training

Changhao Jiang, Ming Zhang, Yifei Cao, Junjie Ye, Xiaoran Fan, Shihan Dou, Zhiheng Xi, Jiajun Sun, Yi Dong, Yujiong Shen, Jingqi Tong, Baoyu Fan, Qi Zhang, Tao Gui, Xuanjing Huang

Main category: cs.CL

TL;DR: This paper introduces Size-dependent Mutual Information (SMI), an information-theoretic method to predict language model knowledge retention and QA accuracy before training, using knowledge frequency, specificity, and model size.

Details

Motivation: The GPT-4 technical report suggests pre-training signals can predict downstream performance but lacks methodological details on quantification. This work addresses this gap by modeling knowledge retention - the capacity of pre-trained models to memorize factual information.

Method: Proposes SMI (Size-dependent Mutual Information) that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book QA accuracy. Validated through large-scale document retrieval over 21 public and 3 custom models’ pre-training corpora with multi-template QA evaluation.

Result: SMI significantly outperforms repetition-based baselines and achieves R² > 0.7 in predicting QA accuracy for models above 1B parameters, without additional training. Analysis reveals diminishing returns from scaling data and model size.

Conclusion: There is an intrinsic upper bound on knowledge retention achievable by pre-training alone, motivating the need for retrieval and other augmentation strategies to overcome this limitation.

Abstract: The GPT-4 technical report suggests that downstream performance can be predicted from pre-training signals, but offers little methodological detail on how to quantify this. This work address this gap by modeling knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduce a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy. SMI is validated through large-scale document retrieval over the disclosed pre-training corpora of 21 public and 3 custom models, combined with a robust multi-template QA evaluation. Experiments show that SMI significantly outperforms repetition-based baselines and achieves $R^2$ > 0.7 in predicting QA accuracy for models above 1B parameters, without additional training. The analysis further reveals diminishing returns from scaling data and model size and provides evidence for an intrinsic upper bound on knowledge retention achievable by pre-training alone, motivating retrieval and other augmentation strategies.

[174] Dynamic Optimizations of LLM Ensembles with Two-Stage Reinforcement Learning Agents

Selim Furkan Tekin, Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Ling Liu

Main category: cs.CL

TL;DR: RL-Focal is a two-stage RL framework that dynamically selects and ensembles LLMs for different tasks, improving performance by 8.48% with small ensembles compared to the best individual LLM.

Details

Motivation: The advancement and accessibility of LLMs have renewed interest in multi-agent reinforcement learning for dynamic environments, requiring robust frameworks that can adapt to changing conditions.

Method: Two-stage RL framework: (1) Decider RL-agent dynamically selects small ensembles from N LLMs by maximizing error-diversity and reasoning performance using task-adaptive rewards and policy; (2) Fusion RL-agent resolves reasoning conflicts and adapts to different ensemble teams; (3) Introduces focal diversity metric to model error correlations and prune ensemble combinations.

Result: Extensive evaluations on five benchmarks show RL-Focal achieves 8.48% performance improvement with small ensembles compared to the best individual LLM, while offering stronger robustness.

Conclusion: RL-Focal effectively promotes reward-aware and policy-adaptive ensemble selection and inference fusion, demonstrating significant performance gains and robustness across multiple tasks.

Abstract: The advancement of LLMs and their accessibility have triggered renewed interest in multi-agent reinforcement learning as robust and adaptive frameworks for dynamically changing environments. This paper introduces RL-Focal, a two-stage RL agent framework that routes and ensembles LLMs. First, we develop the Decider RL-agent, which learns to dynamically select an ensemble of small size ($m_i$) among $N$ LLMs ($m_i \ll N$) for incoming queries from a user-defined downstream task $i$, by maximizing both error-diversity and reasoning-performance of the selected ensemble through iterative updates of task-adaptive rewards and policy. Second, to enable effective fusion of dynamically selected LLMs, we develop the stage-2 Fusion RL-agent, which learns to resolve reasoning conflicts from different LLMs and dynamically adapts to different ensemble teams composed by the Decider Agent for different downstream tasks. Third, we introduce the focal diversity metric to better model the error correlations among multiple LLMs, further improving the generalization performance of the Decider Agent, which actively prunes the ensemble combinations. By focal diversity, we enhance performance across tasks by effectively promoting reward-aware and policy-adaptive ensemble selection and inference fusion. Extensive evaluations on five benchmarks show that RL-Focal achieves the performance improvement of 8.48% with an ensemble of small size compared to the best individual LLM in a pool and offers stronger robustness. Code is available at https://github.com/sftekin/rl-focal

[175] Beyond Sample-Level Feedback: Using Reference-Level Feedback to Guide Data Synthesis

Shuhaib Mehri, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür

Main category: cs.CL

TL;DR: Reference-Level Feedback overcomes synthetic data quality ceiling by extracting desirable characteristics from curated references to generate higher-quality instruction-response pairs, achieving state-of-the-art performance.

Details

Motivation: To overcome the quality ceiling in synthetic data generation where models cannot outperform the LLM generating the data, enabling development of more capable instruction-following LLMs.

Method: Introduces Reference-Level Feedback paradigm that extracts desirable characteristics from carefully curated reference samples to guide synthesis of higher-quality instruction-response pairs, creating the REFED dataset of 10K pairs.

Result: Fine-tuning Llama-3.1-8B-Instruct and Mistral-7B-Instruct on REFED achieves state-of-the-art performance among similarly sized models, with 43.96% length-controlled win-rate on AlpacaEval 2.0, outperforming traditional sample-level feedback methods.

Conclusion: Reference-Level Feedback consistently outperforms traditional methods, generalizes across model architectures, and produces high-quality diverse data at low cost, providing an effective solution to synthetic data quality limitations.

Abstract: High-quality instruction-tuning data is crucial for developing Large Language Models (LLMs) that can effectively navigate real-world tasks and follow human instructions. While synthetic data generation offers a scalable approach for creating such datasets, it imposes a quality ceiling where models trained on the data cannot outperform the LLM generating it. To overcome this limitation, we introduce Reference-Level Feedback, a paradigm that extracts desirable characteristics from carefully curated reference samples to guide the synthesis of higher-quality instruction-response pairs. Using this approach, we synthesize REFED, a dataset of 10K instruction-response pairs. Fine-tuning Llama-3.1-8B-Instruct and Mistral-7B-Instruct on REFED demonstrate state-of-the-art performance among similarly sized models, notably reaching a 43.96% length-controlled win-rate on AlpacaEval 2.0. Extensive experiments demonstrate that Reference-Level Feedback consistently outperforms traditional sample-level feedback methods, generalizes across model architectures, and produces high-quality and diverse data at low cost.

Yongcheng Zeng, Xinyu Cui, Xuanfa Jin, Qirui Mi, Guoqing Liu, Zexu Sun, Mengyue Yang, Dong Li, Weiyu Ma, Ning Yang, Jian Zhao, Jianye Hao, Haifeng Zhang, Jun Wang

Main category: cs.CL

TL;DR: EVOLVE framework enables LLMs to develop self-refinement capabilities through iterative training, allowing models to improve their own responses and achieve state-of-the-art performance on various benchmarks.

Details

Motivation: Current large language models lack inherent self-refinement capabilities and may even degrade response quality when attempting self-refinement, creating a need for systematic approaches to develop this ability.

Method: Proposes EVOLVE framework with synergistic optimization of training and inference stages: explores optimization methods during training to activate self-refinement, and investigates generation strategies at inference to enhance and utilize self-refinement while collecting training data.

Result: Evolved self-refinement enables Llama-3.1-8B to surpass GPT-4o with 62.3% length-controlled and 63.3% raw win rates on AlpacaEval 2, 50.3% on Arena-Hard, and improves performance on mathematical reasoning benchmarks like GSM8K and MATH.

Conclusion: Self-refinement can be systematically developed through iterative training and serves as a fundamental mechanism for broader self-improvement of intrinsic model abilities, with effective generalization to out-of-domain reasoning tasks.

Abstract: Self-Refinement refers to a model’s ability to revise its own responses to produce improved outputs. This capability can also serve as a fundamental mechanism for Self-Improvement, for example, by reconstructing datasets with refined results to enhance intrinsic model performance. However, our comprehensive experiments reveal that large language models (LLMs) show no clear evidence of inherent Self-Refinement and may even experience response quality degradation after Self-Refinement. To address this issue, we propose EVOLVE, a simple and effective framework for eliciting and tracking the evolution of Self-Refinement through iterative training. We first explore optimization methods during training to activate the model’s Self-Refinement capability. Then, at inference, we investigate various generation strategies to further enhance and utilize Self-Refinement while supplying the necessary data for training. Through synergistic optimization of training and inference stages, we continually evolve the model’s Self-Refinement ability, enabling it to better refine its own responses. Moreover, we demonstrate the potential of leveraging Self-Refinement to achieve broader Self-Improvement of intrinsic model abilities. Experiments show that the evolved Self-Refinement ability enables the Llama-3.1-8B base model to surpass GPT-4o, achieving 62.3% length-controlled and 63.3% raw win rates on AlpacaEval 2, and 50.3% on Arena-Hard. It also generalizes effectively to out-of-domain reasoning tasks, improving performance on mathematical reasoning benchmarks such as GSM8K and MATH.

[177] Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech

Jonathan Pofcher, Christopher M. Homan, Randall Sell, Ashiqur R. KhudaBukhsh

Main category: cs.CL

TL;DR: Analysis of YouTube comments on LGBTQ+ news videos reveals political bias in content rating, with LLMs aligning more with liberal perspectives.

Details

Motivation: To understand user engagement with LGBTQ+ news content on YouTube and examine how political beliefs influence content rating for marginalized communities.

Method: Analyzed 1.4M comments from 3,161 YouTube news videos; developed fine-grained hope speech classifier; conducted annotation study with 3,750 instances using diverse political representation; tested zero-shot LLMs.

Result: Strong association between political beliefs and content rating; models trained on individual political beliefs show significant disagreement; LLMs align more with liberal raters.

Conclusion: Political bias significantly affects how content about marginalized communities is rated, highlighting challenges in developing fair AI systems for content moderation.

Abstract: This paper makes three contributions. First, via a substantial corpus of 1,419,047 comments posted on 3,161 YouTube news videos of major US cable news outlets, we analyze how users engage with LGBTQ+ news content. Our analyses focus both on positive and negative content. In particular, we construct a fine-grained hope speech classifier that detects positive (hope speech), negative, neutral, and irrelevant content. Second, in consultation with a public health expert specializing on LGBTQ+ health, we conduct an annotation study with a balanced and diverse political representation and release a dataset of 3,750 instances with fine-grained labels and detailed annotator demographic information. Finally, beyond providing a vital resource for the LGBTQ+ community, our annotation study and subsequent in-the-wild assessments reveal (1) strong association between rater political beliefs and how they rate content relevant to a marginalized community; (2) models trained on individual political beliefs exhibit considerable in-the-wild disagreement; and (3) zero-shot large language models (LLMs) align more with liberal raters.

[178] Personality Editing for Language Models through Adjusting Self-Referential Queries

Seojin Hwang, Yumin Kim, Byeongjeong Kim, Donghoon Shin, Hwanhee Lee

Main category: cs.CL

TL;DR: PALETTE is a novel method for personality editing in LLMs that uses self-referential adjustment queries based on psychological constructs, requiring only 12 samples to achieve substantial personality alignment improvements.

Details

Motivation: Current prompt-based or fine-tuning approaches for controlling LLM personalities lack robustness or require large-scale training data, making them costly and impractical for applications like conversational agents and content creation.

Method: PALETTE introduces adjustment queries where self-referential statements grounded in psychological constructs are treated like factual knowledge, enabling direct editing of personality-related responses without extensive fine-tuning.

Result: The method achieves substantial improvements in personality alignment across dimensions using only 12 editing samples, with experimental results showing more stable and well-balanced personality control in both automatic and human evaluations.

Conclusion: PALETTE provides an effective and efficient approach for personality editing in LLMs that overcomes the limitations of existing methods by requiring minimal data while achieving robust personality control.

Abstract: Large Language Models (LLMs) are integral to applications such as conversational agents and content creation, where precise control over a model’s personality is essential for maintaining tone, consistency, and user engagement. However, prevailing prompt-based or fine-tuning approaches either lack robustness or demand large-scale training data, making them costly and impractical. In this paper, we present PALETTE (Personality Adjustment by LLM SElf-TargeTed quEries), a novel method for personality editing in LLMs. Our approach introduces adjustment queries, where self-referential statements grounded in psychological constructs are treated analogously to factual knowledge, enabling direct editing of personality-related responses. Unlike fine-tuning, PALETTE requires only 12 editing samples to achieve substantial improvements in personality alignment across personality dimensions. Experimental results from both automatic and human evaluations demonstrate that our method enables more stable and well-balanced personality control in LLMs.

[179] LIDDIA: Language-based Intelligent Drug Discovery Agent

Reza Averly, Frazier N. Baker, Ian A. Watson, Xia Ning

Main category: cs.CL

TL;DR: LIDDIA is an autonomous AI agent that uses large language models to navigate drug discovery, generating molecules meeting pharmaceutical criteria for 70% of targets and identifying novel cancer therapy candidates.

Details

Motivation: Drug discovery is slow, expensive, and complex, relying heavily on human chemists. There's a critical need for intelligent agents that can autonomously navigate the entire drug discovery process rather than just individual tasks.

Method: LIDDIA leverages large language models’ reasoning capabilities to create an autonomous agent that intelligently balances exploration and exploitation in chemical space during drug discovery.

Result: LIDDIA generated molecules meeting key pharmaceutical criteria for over 70% of 30 clinically relevant targets, and identified a promising novel candidate for AR/NR3C4 (a critical target for prostate and breast cancers).

Conclusion: LIDDIA serves as a low-cost, highly-adaptable tool for autonomous drug discovery, demonstrating effective navigation of the drug discovery process and identification of novel therapeutic candidates.

Abstract: Drug discovery is a long, expensive, and complex process, relying heavily on human medicinal chemists, who can spend years searching the vast space of potential therapies. Recent advances in artificial intelligence for chemistry have sought to expedite individual drug discovery tasks; however, there remains a critical need for an intelligent agent that can navigate the drug discovery process. Towards this end, we introduce LIDDIA, an autonomous agent capable of intelligently navigating the drug discovery process in silico. By leveraging the reasoning capabilities of large language models, LIDDIA serves as a low-cost and highly-adaptable tool for autonomous drug discovery. We comprehensively examine LIDDIA , demonstrating that (1) it can generate molecules meeting key pharmaceutical criteria on over 70% of 30 clinically relevant targets, (2) it intelligently balances exploration and exploitation in the chemical space, and (3) it identifies one promising novel candidate on AR/NR3C4, a critical target for both prostate and breast cancers. Code and dataset are available at https://github.com/ninglab/LIDDiA

[180] Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

Chengyin Xu, Kaiyuan Chen, Xiao Li, Ke Shen, Chenggang Li

Main category: cs.CL

TL;DR: Proposes COD framework for predicting LLM downstream performance by clustering tasks by difficulty and using scaling laws, achieving 1.36% average error.

Details

Motivation: Need accurate pre-training prediction of downstream task performance due to high LLM training costs, challenged by emergence phenomenon and uneven task difficulty.

Method: Clustering-On-Difficulty (COD) framework that clusters tasks by difficulty scaling features, excludes non-emergent/irregular tasks, uses performance scaling laws for cluster prediction, and maps subset to full set performance.

Result: Applied to 70B parameter LLM, achieved 1.36% average prediction error across eight key LLM benchmarks.

Conclusion: COD provides accurate downstream performance prediction for efficient resource allocation and training monitoring in LLM pre-training.

Abstract: The escalating scale and cost of Large Language Models (LLMs) training necessitate accurate pre-training prediction of downstream task performance for efficient resource allocation. This is challenged by: 1) the emergence phenomenon, where metrics become meaningful only after extensive training, hindering prediction by smaller models; and 2) uneven task difficulty and inconsistent performance scaling patterns, leading to high metric variability. Current prediction methods lack accuracy and reliability. We propose a Clustering-On-Difficulty (COD) framework for downstream performance prediction. The COD framework clusters tasks by their difficulty scaling features, thereby establishing a more stable and predictable support subset through the exclusion of tasks exhibiting non-emergent behavior or irregular scaling. We adopt a performance scaling law to predict cluster-wise performance with theoretical support. Predictable subset performance acts as an intermediate predictor for the full evaluation set. We further derive a mapping function to accurately extrapolate the performance of the subset to the full set. Applied to an LLM with 70B parameters, COD achieved a 1.36% average prediction error across eight key LLM benchmarks, offering actionable insights for resource allocation and training monitoring of LLMs pretraining.

[181] Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning

Wenkai Yang, Shuming Ma, Yankai Lin, Furu Wei

Main category: cs.CL

TL;DR: Scaling Chain of Thoughts (CoTs) length can impair LLM reasoning performance in certain domains, with optimal length distributions varying across domains. The proposed Thinking-Optimal Scaling strategy teaches models to adopt different reasoning efforts and achieves performance comparable to teacher models.

Details

Motivation: To investigate whether excessively scaling CoT length actually brings adverse effects to LLM reasoning performance, as current research focuses on benefits of increasing test-time compute through longer CoTs.

Method: Thinking-Optimal Scaling strategy: uses seed data with varying response length distributions to teach models different reasoning efforts, then selects shortest correct responses under different reasoning efforts for self-improvement.

Result: Self-improved models built on Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across math benchmarks and achieve performance on par with teacher model QwQ-32B-Preview.

Conclusion: There exists an optimal scaled length distribution that differs across domains, and scaling with longer CoTs can impair reasoning performance in certain domains, requiring domain-specific optimization strategies.

Abstract: Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks. While current researches continue to explore the benefits of increasing test-time compute by extending the CoT lengths of Large Language Models (LLMs), we are concerned about a potential issue hidden behind the current pursuit of test-time scaling: Would excessively scaling the CoT length actually bring adverse effects to a model’s reasoning performance? Our explorations on mathematical reasoning tasks reveal an unexpected finding that scaling with longer CoTs can indeed impair the reasoning performance of LLMs in certain domains. Moreover, we discover that there exists an optimal scaled length distribution that differs across different domains. Based on these insights, we propose a Thinking-Optimal Scaling strategy. Our method first uses a small set of seed data with varying response length distributions to teach the model to adopt different reasoning efforts for deep thinking. Then, the model selects its shortest correct response under different reasoning efforts on additional problems for self-improvement. Our self-improved models built upon Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across various math benchmarks, and achieve performance on par with the teacher model QwQ-32B-Preview that produces the seed data.

[182] MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors

Jakub Macina, Nico Daheim, Ido Hakimi, Manu Kapur, Iryna Gurevych, Mrinmaya Sachan

Main category: cs.CL

TL;DR: MathTutorBench is an open-source benchmark for evaluating AI tutoring models’ pedagogical capabilities, covering multiple tutoring abilities with datasets and metrics, showing that subject expertise doesn’t automatically translate to good teaching.

Details

Motivation: There is a lack of reliable, easy-to-use evaluation methods that reflect the pedagogical abilities of AI tutoring models, making it difficult to measure guided progress in the field.

Method: Created MathTutorBench with datasets and metrics covering tutor abilities based on learning sciences research. Trained a reward model to score pedagogical quality of teacher responses, and evaluated various closed- and open-weight models.

Result: Subject expertise (solving ability) doesn’t immediately translate to good teaching - pedagogy and subject expertise form a trade-off navigated by tutoring specialization. Tutoring becomes more challenging in longer dialogs where simple questioning strategies fail.

Conclusion: MathTutorBench enables rapid benchmarking of future tutoring models, revealing important insights about the relationship between subject expertise and pedagogical effectiveness in AI tutors.

Abstract: Evaluating the pedagogical capabilities of AI-based tutoring models is critical for making guided progress in the field. Yet, we lack a reliable, easy-to-use, and simple-to-run evaluation that reflects the pedagogical abilities of models. To fill this gap, we present MathTutorBench, an open-source benchmark for holistic tutoring model evaluation. MathTutorBench contains a collection of datasets and metrics that broadly cover tutor abilities as defined by learning sciences research in dialog-based teaching. To score the pedagogical quality of open-ended teacher responses, we train a reward model and show it can discriminate expert from novice teacher responses with high accuracy. We evaluate a wide set of closed- and open-weight models on MathTutorBench and find that subject expertise, indicated by solving ability, does not immediately translate to good teaching. Rather, pedagogy and subject expertise appear to form a trade-off that is navigated by the degree of tutoring specialization of the model. Furthermore, tutoring appears to become more challenging in longer dialogs, where simpler questioning strategies begin to fail. We release the benchmark, code, and leaderboard openly to enable rapid benchmarking of future models.

[183] Exploring the Generalizability of Factual Hallucination Mitigation via Enhancing Precise Knowledge Utilization

Siyuan Zhang, Yichi Zhang, Yinpeng Dong, Hang Su

Main category: cs.CL

TL;DR: PKUE enhances LLMs’ ability to use knowledge precisely by fine-tuning on self-generated responses to factual questions through preference optimization, reducing factual hallucinations.

Details

Motivation: LLMs often produce factual hallucinations that are hard to detect and mislead users. Existing mitigation methods have poor generalization and trade-off other capabilities.

Method: Propose PKUE (Precise Knowledge Utilization Enhancement) - fine-tunes LLMs on self-generated responses to precise factual questions using preference optimization. Also construct FactualBench dataset with 181k Chinese data across 21 domains.

Result: Extensive experiments show PKUE significantly improves LLM overall performance with consistent enhancement across factual tasks of various forms, general tasks beyond factuality, and tasks in different languages.

Conclusion: PKUE effectively addresses factual hallucinations by enhancing LLMs’ fundamental ability to precisely leverage knowledge, demonstrating broad improvements across multiple task types and languages.

Abstract: Large Language Models (LLMs) often struggle to align their responses with objective facts, resulting in the issue of factual hallucinations, which can be difficult to detect and mislead users without relevant knowledge. Although post-training techniques have been employed to mitigate the issue, existing methods usually suffer from poor generalization and trade-offs in other different capabilities. In this paper, we propose to address these by directly augmenting LLM’s fundamental ability to precisely leverage its knowledge and introduce PKUE (Precise Knowledge Utilization Enhancement), which fine-tunes the model on self-generated responses to precise and simple factual questions through preference optimization. Furthermore, we construct FactualBench, a comprehensive and precise factual QA dataset containing 181k Chinese data spanning 21 domains, to facilitate both evaluation and training. Extensive experiments demonstrate that PKUE significantly improves LLM overall performance, with consistent enhancement across factual tasks of various forms, general tasks beyond factuality, and tasks in different language.

[184] Disentangling Feature Structure: A Mathematically Provable Two-Stage Training Dynamics in Transformers

Zixuan Gong, Shijia Li, Yong Liu, Jiaye Teng

Main category: cs.CL

TL;DR: The paper theoretically demonstrates how transformers exhibit two-stage training dynamics with disentangled feature types (e.g., syntax and semantics), showing progression from incorrect to correct features through normalized ReLU self-attention layers.

Details

Motivation: Existing theoretical analyses don't account for the observed two-stage training dynamics in transformers where features like syntax and semantics are learned sequentially, despite this being common in real-world scenarios like natural language processing and protein analysis.

Method: The authors analyze feature learning dynamics using a simplified setting with normalized ReLU self-attention layers and structured data that contains disentangled two-type features, examining how these features are learned sequentially during training.

Result: The paper provides the first rigorous theoretical demonstration of feature-level two-stage optimization in transformers, showing how models progress from learning one feature type (e.g., syntax) to another (e.g., semantics), with this process being related to spectral properties of attention weights.

Conclusion: Transformers naturally exhibit two-stage training dynamics when dealing with disentangled feature types, and this sequential learning process is mathematically grounded in the spectral properties of attention mechanisms, providing theoretical explanation for observed empirical phenomena.

Abstract: Transformers may exhibit two-stage training dynamics during the real-world training process. For instance, when training GPT-2 on the Counterfact dataset, the answers progress from syntactically incorrect to syntactically correct to semantically correct. However, existing theoretical analyses hardly account for this feature-level two-stage phenomenon, which originates from the disentangled two-type features like syntax and semantics. In this paper, we theoretically demonstrate how the two-stage training dynamics potentially occur in transformers. Specifically, we analyze the feature learning dynamics induced by the aforementioned disentangled two-type feature structure, grounding our analysis in a simplified yet illustrative setting that comprises a normalized ReLU self-attention layer and structured data. Such disentanglement of feature structure is general in practice, e.g., natural languages contain syntax and semantics, and proteins contain primary and secondary structures. To our best knowledge, this is the first rigorous result regarding a feature-level two-stage optimization process in transformers. Additionally, a corollary indicates that such a two-stage process is closely related to the spectral properties of the attention weights, which accords well with our empirical findings.

[185] Test-Time Alignment for Large Language Models via Textual Model Predictive Control

Kuang-Da Wang, Teng-Ruei Chen, Yu Heng Hung, Guo-Xun Ko, Shuoyang Ding, Yueh-Hua Wu, Yu-Chiang Frank Wang, Chao-Han Huck Yang, Wen-Chih Peng, Ping-Chun Hsieh

Main category: cs.CL

TL;DR: TMPC is a novel test-time alignment framework that combines hierarchical planning with Model Predictive Control to address the horizon-vs-dimensionality trade-off in LLM alignment, using hindsight subgoal identification and subgoal-conditioned regeneration.

Details

Motivation: Traditional LLM alignment through finetuning is resource-intensive, and test-time alignment methods face either the curse of horizon (token-level actions) or curse of dimensionality (response-level actions).

Method: Textual Model Predictive Control (TMPC) adapts MPC for text generation with two key principles: (1) Hindsight Subgoal Identification to discover meaningful intermediate outputs, and (2) Subgoal-Conditioned Re-Generation to guide subsequent planning iterations.

Result: TMPC consistently improves performance across three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis.

Conclusion: TMPC provides a general framework for test-time LLM alignment that effectively balances planning granularity through hierarchical subgoal discovery and regeneration.

Abstract: Aligning Large Language Models (LLMs) with human preferences through finetuning is resource-intensive, motivating lightweight alternatives at test time. We address test-time alignment through the lens of sequential decision making, a perspective that reveals two fundamental challenges. When actions are defined at the token level, as in guided decoding, alignment suffers from the curse of horizon. Conversely, when actions are at the response level, as in traditional iterative refinement, the curse of dimensionality emerges. To resolve this trade-off, we draw inspiration from Model Predictive Control (MPC) in control theory to propose Textual Model Predictive Control (TMPC), a novel predictive planning framework adapted for aligning LLMs at inference time. A key limitation of standard MPC is its reliance on predefined, hard segment boundaries, which are often absent in text generation. TMPC overcomes this by introducing two principles inspired by hierarchical reinforcement learning: (1) Hindsight Subgoal Identification, where TMPC analyzes generation subgoals to retrospectively identify high-reward intermediate outputs as subgoals. This allows the framework to discover meaningful, task-specific planning steps (e.g., a sentence in machine translation or a bug fix in code generation.). (2) Subgoal-Conditioned Re-Generation, where these identified subgoals are used to guide subsequent planning iterations. By conditioning on these proven, high-quality subgoals, TMPC ensures stable improvement by building upon previously validated successes. TMPC is evaluated on three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis. The results demonstrate that TMPC consistently improves performance, highlighting the generality.

[186] Layered Insights: Generalizable Analysis of Authorial Style by Leveraging All Transformer Layers

Milad Alshomary, Nikhil Reddy Varimalla, Vishal Anand, Smaranda Muresan, Kathleen McKeown

Main category: cs.CL

TL;DR: A new authorship attribution approach using multiple transformer layers improves robustness and achieves state-of-the-art results, especially for out-of-domain data.

Details

Motivation: To leverage the diverse linguistic representations learned at different layers of pre-trained transformer models for more robust authorship attribution.

Method: Utilizes various linguistic representations from different layers of pre-trained transformer-based models for authorship attribution.

Result: Outperforms state-of-the-art baseline in both in-domain and out-of-domain scenarios, with improved robustness on out-of-domain data.

Conclusion: Using multiple transformer layers enhances model robustness and provides insights into layer specialization for stylistic feature representation in out-of-domain scenarios.

Abstract: We propose a new approach for the authorship attribution task that leverages the various linguistic representations learned at different layers of pre-trained transformer-based models. We evaluate our approach on three datasets, comparing it to a state-of-the-art baseline in in-domain and out-of-domain scenarios. We found that utilizing various transformer layers improves the robustness of authorship attribution models when tested on out-of-domain data, resulting in new state-of-the-art results. Our analysis gives further insights into how our model’s different layers get specialized in representing certain stylistic features that benefit the model when tested out of the domain.

[187] Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, Manling Li

Main category: cs.CL

TL;DR: The paper addresses spatial reasoning challenges in Vision Language Models (VLMs) by proposing ADAPTVIS, a training-free decoding method that adjusts attention distribution based on confidence scores to improve spatial reasoning performance.

Details

Motivation: Current VLMs struggle with simple spatial reasoning tasks like recognizing 'under' or 'behind' relationships between objects, despite their overall capabilities.

Method: Using mechanistic interpretability to analyze model internal states and attention distributions, then proposing ADAPTVIS - an inference-time method that sharpens attention on relevant regions when confident, and broadens attention when confidence is low.

Result: Significant improvements on spatial reasoning benchmarks (up to 50 absolute point improvement on WhatsUp and VSR) with negligible computational cost.

Conclusion: ADAPTVIS effectively enhances spatial reasoning in VLMs by dynamically adjusting attention distribution based on confidence, demonstrating that attention alignment with object locations is crucial for spatial reasoning success.

Abstract: Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing “under” or “behind” relationships between only two objects, pose significant challenges for current VLMs. In this work, we study the spatial reasoning challenge from the lens of mechanistic interpretability, diving into the model’s internal states to examine the interactions between image and text tokens. By tracing attention distribution over the image through out intermediate layers, we observe that successful spatial reasoning correlates strongly with the model’s ability to align its attention distribution with actual object locations, particularly differing between familiar and unfamiliar spatial relationships. Motivated by these findings, we propose ADAPTVIS based on inference-time confidence scores to sharpen the attention on highly relevant regions when confident, while smoothing and broadening the attention window to consider a wider context when confidence is lower. This training-free decoding method shows significant improvement (e.g., up to a 50 absolute point improvement) on spatial reasoning benchmarks such as WhatsUp and VSR with negligible cost. We make code and data publicly available for research purposes at https://github.com/shiqichen17/AdaptVis.

[188] Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm

Zhuo Li, Yuhao Du, Xiaoqi Jiao, Yiwen Guo, Yuege Feng, Xiang Wan, Anningzhe Gao, Jinpeng Hu

Main category: cs.CL

TL;DR: A novel choice-based sample selection framework that uses LLMs to compare sample contributions rather than individual quality, achieving better performance with fewer selections than full datasets and existing methods.

Details

Motivation: Existing methods fail to assess overall data value, focus too much on individual sample quality, and struggle to balance diversity with efficient data traversal in large datasets.

Method: Uses LLMs to evaluate comparative contribution value of samples when added to subsets, with a greedy incremental sampling process that avoids exhaustive dataset traversal.

Result: Selected data outperforms full dataset performance and achieves competitive results with recent methods while requiring fewer selections. Successfully validated on larger medical dataset.

Conclusion: The choice-based framework effectively balances quality and diversity while reducing training overhead, demonstrating practical applicability in real-world scenarios.

Abstract: Selecting high-quality and diverse training samples from extensive datasets plays a crucial role in reducing training overhead and enhancing the performance of Large Language Models (LLMs). However, existing studies fall short in assessing the overall value of selected data, focusing primarily on individual quality, and struggle to strike an effective balance between ensuring diversity and minimizing data point traversals. Therefore, this paper introduces a novel choice-based sample selection framework that shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples when incorporated into the subset. Thanks to the advanced language understanding capabilities of LLMs, we utilize LLMs to evaluate the value of each option during the selection process. Furthermore, we design a greedy sampling process where samples are incrementally added to the subset, thereby improving efficiency by eliminating the need for exhaustive traversal of the entire dataset with the limited budget. Extensive experiments demonstrate that selected data from our method not only surpasses the performance of the full dataset but also achieves competitive results with recent powerful studies, while requiring fewer selections. Moreover, we validate our approach on a larger medical dataset, highlighting its practical applicability in real-world applications. Our code and data are available at https://github.com/BIRlz/comperative_sample_selection.

[189] LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs

Jianghao Chen, Junhong Wu, Yangyifan Xu, Jiajun Zhang

Main category: cs.CL

TL;DR: LADM is a framework that uses attention-based dependency measurement to efficiently select high-quality long-context training data from large corpora, significantly improving LLM performance on long-context tasks with minimal training data.

Details

Motivation: Long-context modeling is increasingly important for LLMs, but current methods struggle to measure the quality of long-context training data, which is crucial for effective continual training.

Method: Proposed LADM framework that leverages attention mechanism’s retrieval capabilities to capture contextual dependencies and identify high-quality long-context data from multi-domain pre-training corpora.

Result: Experimental results show LADM significantly boosts LLM performance on multiple long-context tasks using only 1B tokens for continual training.

Conclusion: LADM provides an effective solution for long-context data quality measurement and selection, enabling efficient improvement of LLMs’ long-context capabilities with minimal training resources.

Abstract: Long-context modeling has drawn more and more attention in the area of Large Language Models (LLMs). Continual training with long-context data becomes the de-facto method to equip LLMs with the ability to process long inputs. However, it still remains an open challenge to measure the quality of long-context training data. To address this issue, we propose a Long-context data selection framework with Attention-based Dependency Measurement (LADM), which can efficiently identify high-quality long-context data from a large-scale, multi-domain pre-training corpus. LADM leverages the retrieval capabilities of the attention mechanism to capture contextual dependencies, ensuring a comprehensive quality measurement of long-context data. Experimental results show that our LADM framework significantly boosts the performance of LLMs on multiple long-context tasks with only 1B tokens for continual training.

[190] VNJPTranslate: A comprehensive pipeline for Vietnamese-Japanese translation

Hoang Hai Phan, Nguyen Duc Minh Vu, Nam Dang Phuong

Main category: cs.CL

TL;DR: VNJPTranslate pipeline improves Vietnamese-Japanese translation using LLM-based data augmentation and efficient fine-tuning of a 1.8B parameter model.

Details

Motivation: Address challenges in low-resource Vietnamese-Japanese NMT, including sparse parallel data and linguistic/cultural nuances, by leveraging LLMs' reasoning capabilities.

Method: Targeted data augmentation using advanced LLMs with Chain-of-Thought prompting for difficult segments, followed by efficient fine-tuning (Unsloth with QLoRA) on a 1.8B parameter Sailor model.

Result: Significant improvement in Vi-Ja translation quality over existing baselines.

Conclusion: Integrated approach combining LLM-based data augmentation and efficient fine-tuning effectively addresses low-resource translation challenges for Vietnamese-Japanese language pair.

Abstract: Neural Machine Translation (NMT) driven by Transformer architectures has advanced significantly, yet faces challenges with low-resource language pairs like Vietnamese-Japanese (Vi-Ja). Issues include sparse parallel data and handling linguistic/cultural nuances. Recent progress in Large Language Models (LLMs) with strong reasoning, often refined via Reinforcement Learning (RL), enables high-quality synthetic data generation. We introduce VNJPTranslate, a pipeline designed to systematically address the Vi-Ja translation task. It features a targeted data augmentation strategy using advanced LLMs with Chain-of-Thought prompting for challenging segments identified via corpus analysis. Subsequently, we employ efficient fine-tuning techniques (Unsloth with QLoRA) on a capable, low-parameter autoregressive model (specifically, a fine-tuned version of the 1.8B parameter Sailor model, which is based on the Qwen architecture) to create a practical and high-performing translation system. This integrated approach aims to improve Vi-Ja translation quality significantly over existing baselines.

[191] Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data

Shuai Zhao, Yunqiu Xu, Linchao Zhu, Yi Yang

Main category: cs.CL

TL;DR: RefAlign is a REINFORCE-style alignment algorithm that uses similarity metrics between generated text and reference answers as rewards, eliminating the need for binary preference data and explicit reward modeling.

Details

Motivation: Binary preference data collection and reward modeling for LLM alignment are resource-intensive but crucial for transferring human preferences across safety, confidence, and general preference scenarios.

Method: Uses similarity metrics like BERTScore between sampled generations and reference answers as surrogate rewards in a REINFORCE-style algorithm, extendable to various alignment scenarios by combining similarity rewards with task-specific objectives.

Result: Achieves performance comparable to prior alignment methods across multiple scenarios without requiring binary preference data or reward models.

Conclusion: RefAlign provides an effective alternative to traditional alignment approaches by leveraging similarity-based rewards, making alignment more resource-efficient while maintaining competitive performance.

Abstract: Large language models~(LLMs) are expected to be helpful, harmless, and honest. In different alignment scenarios, such as safety, confidence, and general preference alignment, binary preference data collection and reward modeling are resource-intensive but play a central role in transferring human preferences. In this work, we explore using the similarity between sampled generations and reference answers as a supplementary reward function for alignment. When unary reference answers are available, such similarity-based rewards can circumvent the need for binary preference data and explicit reward modeling. We introduce \textit{RefAlign}, a versatile REINFORCE-style alignment algorithm that does not rely on reward or reference models. RefAlign utilizes language generation evaluation metrics, such as BERTScore, between sampled generations and reference answers as surrogate rewards. Beyond general preference optimization, RefAlign can be naturally extended to diverse scenarios, including safety and confidence alignment, by combining similarity-based rewards with task-specific objectives. Across multiple scenarios, RefAlign achieves performance comparable to prior alignment methods while operating without binary preference data or reward models. The code is available at https://github.com/mzhaoshuai/RefAlign.

[192] Forecasting Clinical Risk from Textual Time Series: Structuring Narratives for Temporal AI in Healthcare

Shahriar Noroozizadeh, Sayantan Kumar, Jeremy C. Weiss

Main category: cs.CL

TL;DR: This paper introduces forecasting from textual time series using LLM-extracted clinical findings, showing encoder-based models excel at event prediction while decoder models perform better in survival analysis.

Details

Motivation: Clinical case reports contain valuable temporal patient trajectories that are underexploited by traditional machine learning methods relying on structured data.

Method: Systematic evaluation of diverse models including fine-tuned decoder-based LLMs and encoder-based transformers on event occurrence prediction, temporal ordering, and survival analysis using timestamped clinical findings extracted via LLM-assisted annotation.

Result: Encoder-based models consistently achieve higher F1 scores and superior temporal concordance for event forecasting, while fine-tuned masking approaches enhance ranking performance. Instruction-tuned decoder models show relative advantage in survival analysis, especially for early prognosis.

Conclusion: Time ordering in clinical time series provides additional benefits beyond text ordering, highlighting the value of time-ordered corpora for temporal tasks in the LLM era.

Abstract: Clinical case reports encode temporal patient trajectories that are often underexploited by traditional machine learning methods relying on structured data. In this work, we introduce the forecasting problem from textual time series, where timestamped clinical findings – extracted via an LLM-assisted annotation pipeline – serve as the primary input for prediction. We systematically evaluate a diverse suite of models, including fine-tuned decoder-based large language models and encoder-based transformers, on tasks of event occurrence prediction, temporal ordering, and survival analysis. Our experiments reveal that encoder-based models consistently achieve higher F1 scores and superior temporal concordance for short- and long-horizon event forecasting, while fine-tuned masking approaches enhance ranking performance. In contrast, instruction-tuned decoder models demonstrate a relative advantage in survival analysis, especially in early prognosis settings. Our sensitivity analyses further demonstrate the importance of time ordering, which requires clinical time series construction, as compared to text ordering, the format of the text inputs that LLMs are classically trained on. This highlights the additional benefit that can be ascertained from time-ordered corpora, with implications for temporal tasks in the era of widespread LLM use.

[193] Exploring Compositional Generalization (in COGS/ReCOGS_pos) by Transformers using Restricted Access Sequence Processing (RASP)

William Bruns

Main category: cs.CL

TL;DR: Transformers can achieve near-perfect compositional generalization on COGS and ReCOGS benchmarks using flat pattern-matching rules in RASP, without requiring hierarchical tree structures.

Details

Motivation: To demonstrate that Transformer models can systematically perform compositional generalization tasks like COGS and ReCOGS using flat, non-recursive approaches rather than hierarchical tree structures as previously thought.

Method: Used RASP (Restricted Access Sequence Processing) programming language with Transformer Encoder-Decoder architecture, employing 19 attention-head compatible flat pattern-matching rules, word-level tokens with POS tagging, and masking techniques for prepositional phrases and sentential complements.

Result: Achieved near perfect scores on structural generalization splits: exact match on COGS and semantic exact match on ReCOGS_pos, including handling pp recursion and cp recursion through decoder loops.

Conclusion: COGS tasks do not require hierarchical or tree-structured solutions; flat pattern-matching rules in Transformers can effectively handle compositional generalization without recursive processing.

Abstract: Humans understand new combinations of words encountered if they are combinations of words recognized from different contexts, an ability called Compositional Generalization. The COGS benchmark (Kim and Linzen, 2020) arXiv:2010.05465 reports 0% accuracy for Transformer models on some structural generalizations. We use (Weiss et al., 2021) arXiv:2106.06981’s Restricted Access Sequence Processing (RASP), a Transformer-equivalent programming language, to demonstrate that a Transformer Encoder-Decoder can perform COGS and the semantically equivalent ReCOGS_pos (Wu et al., 2024) arXiv:2303.13716 systematically and compositionally: Our RASP models attain near perfect scores on structural generalization splits on COGS (exact match) and ReCOGS_pos (semantic exact match). Our RASP models show the (Re)COGS tasks do not require a hierarchical or tree-structured solution (contrary to (Kim and Linzen, 2020) arXiv:2010.05465, (Yao and Koller, 2022) arXiv:2210.13050, (Murty et al., 2022) arXiv:2305.18741, (Liu et al., 2021) arXiv:2107.06516): we use word-level tokens with an “embedding” layer that tags with possible part of speech, applying just once per encoder pass 19 attention-head compatible flat pattern-matching rules (easily identified with specific training examples), shown using grammar coverage (Zeller et al., 2023) to cover the non-recursive aspects of the input grammar, plus masking out prepositional phrases (“pp noun”) and/or sentential complements (cp) when recognizing grammar patterns and extracting nouns related to the main verb in the sentence, and output the next logical form (LF) token (repeating until the LF is complete). The models do not apply recursive, tree-structured rules like “np_det pp np -> np_pp -> np”, but score near perfect semantic and string exact match on both COGS and ReCOGS pp recursion, cp recursion using the decoder loop.

[194] DMDTEval: An Evaluation and Analysis of LLMs on Disambiguation in Multi-domain Translation

Zhibo Man, Yuanmeng Chen, Yujie Zhang, Jinan Xu

Main category: cs.CL

TL;DR: The paper presents DMDTEval, a systematic evaluation framework for assessing LLMs’ disambiguation ability in multi-domain machine translation, including test set construction, prompt strategies, and disambiguation metrics.

Details

Motivation: LLMs show remarkable results in machine translation but perform less satisfactorily in multi-domain translation due to word ambiguity across domains, highlighting the need to evaluate their disambiguation capabilities.

Method: Developed DMDTEval framework with three components: (1) constructed translation test set with multi-domain ambiguous word annotations, (2) curated diverse disambiguation prompt strategies, (3) designed precise disambiguation metrics and studied various prompt strategies on multiple state-of-the-art LLMs.

Result: Comprehensive experiments across 4 language pairs and 13 domains revealed crucial findings about LLMs’ disambiguation performance in multi-domain translation settings.

Conclusion: The findings from extensive experiments are expected to pave the way and facilitate further research in improving LLMs’ disambiguation capabilities in multi-domain translation.

Abstract: Currently, Large Language Models (LLMs) have achieved remarkable results in machine translation. However, their performance in multi-domain translation (MDT) is less satisfactory, the meanings of words can vary across different domains, highlighting the significant ambiguity inherent in MDT. Therefore, evaluating the disambiguation ability of LLMs in MDT, remains an open problem. To this end, we present an evaluation and analysis of LLMs on disambiguation in multi-domain translation (DMDTEval), our systematic evaluation framework consisting of three critical aspects: (1) we construct a translation test set with multi-domain ambiguous word annotation, (2) we curate a diverse set of disambiguation prompt strategies, and (3) we design precise disambiguation metrics, and study the efficacy of various prompt strategies on multiple state-of-the-art LLMs. We conduct comprehensive experiments across 4 language pairs and 13 domains, our extensive experiments reveal a number of crucial findings that we believe will pave the way and also facilitate further research in the critical area of improving the disambiguation of LLMs.

[195] WebThinker: Empowering Large Reasoning Models with Deep Research Capability

Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, Zhicheng Dou

Main category: cs.CL

TL;DR: WebThinker is a deep research agent that enhances large reasoning models by enabling autonomous web search, navigation, and real-time report drafting during reasoning processes.

Details

Motivation: Large reasoning models have limitations in handling complex, knowledge-intensive tasks due to reliance on static internal knowledge, which hinders their ability to produce comprehensive research reports requiring diverse web information synthesis.

Method: WebThinker integrates a Deep Web Explorer module for dynamic web search and information extraction, employs an Autonomous Think-Search-and-Draft strategy for real-time reasoning and writing, and uses RL-based training via iterative online Direct Preference Optimization to enhance research tool utilization.

Result: Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) show WebThinker significantly outperforms existing methods and strong proprietary systems.

Conclusion: WebThinker enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems.

Abstract: Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate among web pages, and draft reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.

[196] LiTransProQA: an LLM-based Literary Translation evaluation metric with Professional Question Answering

Ran Zhang, Wei Zhao, Lieve Macken, Steffen Eger

Main category: cs.CL

TL;DR: LITRANSPROQA is a reference-free, LLM-based QA framework for literary translation evaluation that incorporates human professional insights and outperforms current metrics, achieving near-human adequacy assessment performance.

Details

Motivation: Existing literary evaluation metrics prioritize mechanical accuracy over artistic expression and tend to overrate machine translation, potentially causing irreversible decline in translation quality and cultural authenticity.

Method: A reference-free, LLM-based question-answering framework that integrates humans in the loop to incorporate insights from professional literary translators and researchers, focusing on literary devices, cultural understanding, and authorial voice.

Result: LITRANSPROQA substantially outperforms current metrics with up to 0.07 gain in correlation and over 15 points improvement in adequacy assessments. It reaches adequacy performance comparable to trained linguistic students but still falls behind experienced professional translators.

Conclusion: LITRANSPROQA shows broad applicability to open-source models and potential as an accessible, training-free tool for evaluating literary translations, especially where local processing is needed due to copyright or ethical considerations.

Abstract: The impact of Large Language Models (LLMs) has extended into literary domains. However, existing evaluation metrics for literature prioritize mechanical accuracy over artistic expression and tend to overrate machine translation as being superior to human translation from experienced professionals. In the long run, this bias could result in an irreversible decline in translation quality and cultural authenticity. In response to the urgent need for a specialized literary evaluation metric, we introduce LITRANSPROQA, a novel, reference-free, LLM-based question-answering framework designed for literary translation evaluation. LITRANSPROQA integrates humans in the loop to incorporate insights from professional literary translators and researchers, focusing on critical elements in literary quality assessment such as literary devices, cultural understanding, and authorial voice. Our extensive evaluation shows that while literary-finetuned XCOMET-XL yields marginal gains, LITRANSPROQA substantially outperforms current metrics, achieving up to 0.07 gain in correlation and surpassing the best state-of-the-art metrics by over 15 points in adequacy assessments. Incorporating professional translator insights as weights further improves performance, highlighting the value of translator inputs. Notably, LITRANSPROQA reaches an adequacy performance comparable to trained linguistic student evaluators, though it still falls behind experienced professional translators. LITRANSPROQA shows broad applicability to open-source models like LLaMA3.3-70b and Qwen2.5-32b, indicating its potential as an accessible and training-free tool for evaluating literary translations that require local processing due to copyright or ethical considerations.

[197] Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model

Xinyue Lou, You Li, Jinan Xu, Xiangyu Shi, Chi Chen, Kaiyu Huang

Main category: cs.CL

TL;DR: Systematic safety evaluation of 11 Multimodal Large Reasoning Models reveals prevalent safety degradation, with distinct patterns across benchmarks. The study proposes leveraging models’ intrinsic reasoning capabilities through safety-oriented thought processes to enhance safety.

Details

Motivation: The rapid development of MLRMs shows broad potential but their safety and reliability remain critical concerns that need systematic exploration.

Method: Comprehensive safety evaluation of 11 MLRMs across 5 benchmarks, analysis of safety degradation patterns, and construction of multimodal tuning dataset with safety-oriented thought processes for fine-tuning.

Result: Revealed prevalent safety degradation in most advanced models, with significant degradation in jailbreak robustness benchmarks but less in safety-awareness benchmarks. Fine-tuning with safety-oriented thought process dataset effectively enhances safety on both benchmark types.

Conclusion: Leveraging models’ intrinsic reasoning capabilities through safety-oriented thought processes provides a new perspective for developing safe MLRMs, offering a potential approach to address safety issues.

Abstract: The rapid development of Multimodal Large Reasoning Models (MLRMs) has demonstrated broad application potential, yet their safety and reliability remain critical concerns that require systematic exploration. To address this gap, we conduct a comprehensive and systematic safety evaluation of 11 MLRMs across 5 benchmarks and unveil prevalent safety degradation phenomena in most advanced models. Moreover, our analysis reveals distinct safety patterns across different benchmarks: significant safety degradation is observed across jailbreak robustness benchmarks, whereas safety-awareness benchmarks demonstrate less pronounced degradation. In particular, the long thought process in some scenarios even enhances safety performance. Therefore, it is a potential approach to address safety issues in MLRMs by leveraging the intrinsic reasoning capabilities of the model to detect unsafe intent. To operationalize this insight, we construct a multimodal tuning dataset that incorporates a safety-oriented thought process. Experimental results from fine-tuning existing MLRMs with this dataset effectively enhances the safety on both jailbreak robustness and safety-awareness benchmarks. This study provides a new perspective for developing safe MLRMs. Our dataset is available at https://github.com/xinyuelou/Think-in-Safety.

[198] References Indeed Matter? Reference-Free Preference Optimization for Conversational Query Reformulation

Doyoung Kim, Youngjun Lee, Joeun Kim, Jihwan Bang, Hwanjun Song, Susik Yoon, Jae-Gil Lee

Main category: cs.CL

TL;DR: DualReform is a reference-free preference optimization framework for conversational query reformulation that generates pseudo reference passages from conversational datasets without needing actual reference passages, achieving near-optimal retrieval performance.

Details

Motivation: Existing CQR approaches require reference passages for optimization, which are impractical to obtain in real-world scenarios where only queries and responses are available.

Method: Uses two innovations: (1) response-based inference to generate pseudo reference passages from responses, and (2) response refinement leveraging the dual-role of CQR where the model refines responses based on shared objectives between response refinement and CQR.

Result: Achieves 96.9-99.1% of retrieval accuracy attainable with reference passages, and surpasses state-of-the-art method by up to 31.6% without using actual reference passages.

Conclusion: DualReform provides an effective reference-free solution for CQR optimization that performs nearly as well as methods requiring reference passages, making it practical for real-world applications.

Abstract: Conversational query reformulation (CQR) has become indispensable for improving retrieval in dialogue-based applications. However, existing approaches typically rely on reference passages for optimization, which are impractical to acquire in real-world scenarios. To address this limitation, we introduce a novel reference-free preference optimization framework DualReform that generates pseudo reference passages from commonly-encountered conversational datasets containing only queries and responses. DualReform attains this goal through two key innovations: (1) response-based inference, where responses serve as proxies to infer pseudo reference passages, and (2) response refinement via the dual-role of CQR, where a CQR model refines responses based on the shared objectives between response refinement and CQR. Despite not relying on reference passages, DualReform achieves 96.9–99.1% of the retrieval accuracy attainable only with reference passages and surpasses the state-of-the-art method by up to 31.6%.

[199] J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason Weston, Ilia Kulikov, Swarnadeep Saha

Main category: cs.CL

TL;DR: J1 is a reinforcement learning framework that teaches LLM judges to think before making decisions, achieving state-of-the-art performance across multiple benchmarks by optimizing chain-of-thought reasoning for evaluation tasks.

Details

Motivation: AI progress is bottlenecked by evaluation quality, and LLM-as-a-Judge models need effective optimization of their chain-of-thought reasoning to improve evaluation efficacy.

Method: Convert all judgment tasks into a unified format with verifiable rewards, then use RL to train thinking-judges at 8B, 32B, and 70B scales on synthetic data.

Result: J1-Qwen-32B outperforms o1-mini, o3, and 671B DeepSeek-R1 on some benchmarks, and develops systematic evaluation strategies including dynamic criteria generation and iterative self-correction.

Conclusion: The J1 framework effectively optimizes LLM judges’ reasoning process, achieving superior performance through RL training on synthetic data with verifiable rewards.

Abstract: The progress of AI is bottlenecked by the quality of evaluation, making powerful LLM-as-a-Judge models a core solution. The efficacy of these judges depends on their chain-of-thought reasoning, creating a critical need for methods that can effectively optimize this reasoning process. In this work, we introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions. Our core contribution lies in converting all judgment tasks for non-verifiable and verifiable prompts into a unified format with verifiable rewards, enabling direct optimization of evaluation quality while mitigating positional bias. We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance across multiple benchmarks. In particular, J1-Qwen-32B, our multitasked pointwise and pairwise judge also outperforms o1-mini, o3, and a much larger 671B DeepSeek-R1 on some benchmarks, while only training on synthetic data. Through comprehensive ablations of pairwise, pointwise, and multitask J1 variants, we demonstrate the effectiveness of our approach across seed prompts, reward strategies, and training recipes. Qualitative analysis reveals that J1 develops systematic evaluation strategies, including dynamic criteria generation, reference answer creation, iterative self-correction of initial assessments, and feedback generation for low-quality responses.

[200] SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, Fei Huang

Main category: cs.CL

TL;DR: SoLoPO is a framework that decouples long-context preference optimization into short-context preference optimization and short-to-long reward alignment to improve LLMs’ ability to utilize real-world long-context information.

Details

Motivation: LLMs struggle with effectively utilizing real-world long-context information due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and lack of well-designed optimization objectives.

Method: Decouples long-context preference optimization into two components: short-context PO (using preference pairs from short contexts) and short-to-long reward alignment (encouraging reward consistency between short and long contexts with identical task-relevant information).

Result: SoLoPO enhances preference optimization algorithms with stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in computational and memory efficiency.

Conclusion: SoLoPO effectively transfers models’ short-context handling ability to long-context scenarios, improving data construction and training efficiency while being compatible with mainstream preference optimization algorithms.

Abstract: Despite advances in pretraining with extended context lengths, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named $\textbf{S}$h$\textbf{o}$rt-to-$\textbf{Lo}$ng $\textbf{P}$reference $\textbf{O}$ptimization ($\textbf{SoLoPO}$), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model’s contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency utilization for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model’s ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.

[201] A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

Jitai Hao, Qiang Huang, Hao Liu, Xinyan Xiao, Zhaochun Ren, Jun Yu

Main category: cs.CL

TL;DR: LRC is an efficient pre-training method that trains low-rank projection matrices to enable soft pruning and activation cloning from teacher models, achieving high-performing SLMs with 1000x training efficiency using only 20B tokens.

Details

Motivation: To address challenges in training SLMs including information loss from hard pruning, inefficient representation alignment, and underutilization of FFN activations in existing knowledge distillation methods.

Method: LRC trains low-rank projection matrices that jointly perform soft pruning by compressing teacher weights and activation cloning by aligning student activations (including FFN signals) with teacher activations, eliminating the need for explicit alignment modules.

Result: LRC matches or surpasses state-of-the-art models trained on trillions of tokens while using only 20B tokens, achieving over 1,000x training efficiency with open-source teachers like Llama-3.2-3B-Instruct and Qwen2.5-3B/7B-Instruct.

Conclusion: LRC provides an efficient unified framework for knowledge transfer that maximizes behavioral equivalence with teacher models while dramatically reducing training costs.

Abstract: Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers (e.g., Llama-3.2-3B-Instruct, Qwen2.5-3B/7B-Instruct) show that LRC matches or surpasses state-of-the-art models trained on trillions of tokens–while using only 20B tokens, achieving over 1,000x training efficiency. Our codes and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/collections/JitaiHao/low-rank-clone-lrc-6828389e96a93f1d4219dfaf.

[202] Transparent and Robust RAG: Adaptive-Reward Reinforcement Learning for Decision Traceability

Jingyi Ren, Yekun Xu, Xiaolong Wang, Weitao Li, Weizhi Ma, Yang Liu

Main category: cs.CL

TL;DR: ARENA is a transparent and robust RAG framework that uses RL with designed rewards to improve interpretability and training stability, achieving 10-30% accuracy improvements on multi-hop QA tasks.

Details

Motivation: Existing RL-based RAG methods lack transparency in showing which references are used during reasoning and suffer from unstable training due to KL divergence gradient spikes.

Method: Proposes ARENA framework with structured protocol, KL divergence stabilization, and adaptive reward calculation modules to enable evidence identification, structured reasoning, and interpretable decision traces.

Result: Achieves 10-30% accuracy improvements on three multi-hop QA datasets using Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct, comparable to advanced closed-source LLMs like OpenAI o1 and DeepSeek R1.

Conclusion: ARENA provides a transparent and robust RAG framework that generalizes well to unseen datasets and tasks, with publicly released models and codes.

Abstract: Retrieval-Augmented Generation (RAG) delivers substantial value in knowledge-intensive applications. Many recent works use reinforcement learning (RL) to elicit strong reasoning in RAG generators. However, two key challenges remain unresolved: (1) Transparency: most prior methods do not explicitly indicate which references are actually used during the reasoning that leads to the final answer, limiting interpretability and visibility; (2) Stability: the KL divergence estimator used in existing RL-based approaches may cause gradient spikes, leading to unstable training. To address these challenges, we propose Adaptive-Rewarded Evidence Navigation Agent (ARENA), a transparent and robust RAG generator framework trained via RL with designed rewards. Based on our structured protocol, KL divergence stabilization, and adaptive reward calculation modules, ARENA enables the RAG generator to identify key evidence, perform structured reasoning, and generate answers with interpretable decision traces. Applied to Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct, extensive experiments across multiple baselines show 10-30% accuracy improvements on three multi-hop QA datasets, comparable to advanced closed-source LLMs (e.g., OpenAI o1, DeepSeek R1). Further analyses show that ARENA generalizes well to unseen datasets and tasks. Our models and codes are publicly released.

[203] Noise Injection Systemically Degrades Large Language Model Safety Guardrails

Prithviraj Singh Shahani, Kaveh Eskandari Miandoab, Matthias Scheutz

Main category: cs.CL

TL;DR: Safety fine-tuning in LLMs is vulnerable to Gaussian noise perturbations, which can increase harmful outputs by up to 27% without deeper protection from fine-tuning, though chain-of-thought reasoning remains intact.

Details

Motivation: To understand the robustness of safety guardrails in LLMs against perturbations, as current resilience is poorly understood despite being critical for preventing harmful outputs.

Method: Systematically injecting Gaussian noise into model activations across multiple open-weight LLMs to test safety fine-tuning robustness.

Result: Gaussian noise increases harmful-output rates by up to 27% (p < 0.001), deeper safety fine-tuning provides no extra protection, and chain-of-thought reasoning remains largely unaffected.

Conclusion: Current safety alignment techniques have critical vulnerabilities, highlighting reasoning-based and reinforcement learning approaches as promising directions for developing more robust AI safety systems.

Abstract: Safety guardrails in large language models (LLMs) are a critical component in preventing harmful outputs. Yet, their resilience under perturbation remains poorly understood. In this paper, we investigate the robustness of safety fine-tuning in LLMs by systematically injecting Gaussian noise into model activations. We show across multiple open-weight models that (1) Gaussian noise raises harmful-output rates (p < 0.001) by up to 27%, (2) that deeper safety fine-tuning affords no extra protection, and (3) that chain-of-thought reasoning remains largely intact. The findings reveal critical vulnerabilities in current safety alignment techniques and highlight the potential of reasoning-based and reinforcement learning approaches as promising direction for developing more robust AI safety systems. These results have important implications for real-world deployment of LLMs in safety-critical applications as these results imply that widely-deployed safety tuning methods can fail even without adversarial prompts.

[204] “Haet Bhasha aur Diskrimineshun”: Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs

Darpan Aswal, Siddharth D Jaiswal

Main category: cs.CL

TL;DR: Novel jailbreaking strategy using code-mixing and phonetic perturbations achieves high attack success rates (99% for text, 78% for image generation) against multilingual multimodal LLMs by exploiting tokenization vulnerabilities.

Details

Motivation: Existing safety efforts focus primarily on English, leaving models vulnerable to multilingual jailbreaking strategies, especially in multimodal contexts where prompts may contain misspelled words.

Method: Leverages code-mixing and phonetic perturbations to bypass safety filters, with an extension to jailbreak-template-based strategy and a novel template. Applies phonetic misspellings to sensitive words in code-mixed prompts.

Result: Achieved 99% Attack Success Rate for text generation and 78% for image generation, with Attack Relevance Rate of 100% for text and 96% for image generation. Phonetic perturbations impact word tokenization, leading to jailbreak success.

Conclusion: Study highlights need for more generalizable safety alignment for multilingual multimodal models, especially in real-world settings with misspelled words, as current models remain vulnerable to these attack strategies.

Abstract: Recently released LLMs have strong multilingual & multimodal capabilities. Model vulnerabilities are exposed using audits and red-teaming efforts. Existing efforts have focused primarily on the English language; thus, models continue to be susceptible to multilingual jailbreaking strategies, especially for multimodal contexts. In this study, we introduce a novel strategy that leverages code-mixing and phonetic perturbations to jailbreak LLMs for both text and image generation tasks. We also present an extension to a current jailbreak-template-based strategy and propose a novel template, showing higher effectiveness than baselines. Our work presents a method to effectively bypass safety filters in LLMs while maintaining interpretability by applying phonetic misspellings to sensitive words in code-mixed prompts. We achieve a 99% Attack Success Rate for text generation and 78% for image generation, with Attack Relevance Rate of 100% for text generation and 96% for image generation for the phonetically perturbed code-mixed prompts. Our interpretability experiments reveal that phonetic perturbations impact word tokenization, leading to jailbreak success. Our study motivates increasing the focus towards more generalizable safety alignment for multilingual multimodal models, especially in real-world settings wherein prompts can have misspelt words. \textit{\textbf{Warning: This paper contains examples of potentially harmful and offensive content.}}

[205] Saten: Sparse Augmented Tensor Networks for Post-Training Compression of Large Language Models

Ryan Solgi, Kai Zhen, Rupak Vignesh Swaminathan, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, Zheng Zhang

Main category: cs.CL

TL;DR: Proposes Saten (sparse augmented tensor networks) to enhance low-rank tensor compression for LLMs during fine-tuning, achieving better accuracy and compression efficiency.

Details

Motivation: Existing tensor compression methods struggle with pre-trained LLMs due to their high-rank nature and lack of pretraining data access, limiting post-training compression effectiveness.

Method: Develops sparse augmented tensor networks (Saten) that enable full model compression of LLMs during fine-tuning phase.

Result: Saten improves both accuracy and compression efficiency in tensorized language models, achieving state-of-the-art performance.

Conclusion: The Saten framework successfully addresses challenges in compressing pre-trained LLMs and enhances tensor network performance for downstream tasks.

Abstract: The efficient implementation of large language models (LLMs) is crucial for deployment on resource-constrained devices. Low-rank tensor compression techniques, such as tensor-train (TT) networks, have been widely studied for over-parameterized neural networks. However, their applications to compress pre-trained large language models (LLMs) for downstream tasks (post-training) remains challenging due to the high-rank nature of pre-trained LLMs and the lack of access to pretraining data. In this study, we investigate low-rank tensorized LLMs during fine-tuning and propose sparse augmented tensor networks (Saten) to enhance their performance. The proposed Saten framework enables full model compression. Experimental results demonstrate that Saten enhances both accuracy and compression efficiency in tensorized language models, achieving state-of-the-art performance.

[206] Hallucinate at the Last in Long Response Generation: A Case Study on Long Document Summarization

Joonho Yang, Seunghyun Yoon, Hwan Chang, Byeongjeong Kim, Hwanhee Lee

Main category: cs.CL

TL;DR: LLMs generate more hallucinations in the latter parts of long responses, with attention and decoding dynamics identified as contributing factors.

Details

Motivation: Faithfulness to source material remains a challenge in LLM text generation, with limited research on positional distribution of hallucinations in long outputs.

Method: Investigated positional distribution of hallucinations in LLM-based long response generation using long document summarization as case study, exploring attention and decoding dynamics.

Result: Found consistent phenomenon: hallucinations concentrate disproportionately in latter parts of generated long responses.

Conclusion: Identified positional hallucination bias and explored mitigation methods to improve faithfulness in concluding segments of long outputs.

Abstract: Large Language Models (LLMs) have significantly advanced text generation capabilities, including tasks like summarization, often producing coherent and fluent outputs. However, faithfulness to source material remains a significant challenge due to the generation of hallucinations. While extensive research focuses on detecting and reducing these inaccuracies, less attention has been paid to the positional distribution of hallucination within generated text, particularly in long outputs. In this work, we investigate where hallucinations occur in LLM-based long response generation, using long document summarization as a key case study. Focusing on the challenging setting of long context-aware long response generation, we find a consistent and concerning phenomenon: hallucinations tend to concentrate disproportionately in the latter parts of the generated long response. To understand this bias, we explore potential contributing factors related to the dynamics of attention and decoding over long sequences. Furthermore, we investigate methods to mitigate this positional hallucination, aiming to improve faithfulness specifically in the concluding segments of long outputs.

[207] From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning

David Dinucu-Jianu, Jakub Macina, Nico Daheim, Ido Hakimi, Iryna Gurevych, Mrinmaya Sachan

Main category: cs.CL

TL;DR: An RL-based alignment framework trains LLMs to be effective tutors by emphasizing pedagogy over direct answers, achieving performance comparable to larger proprietary models while preserving reasoning capabilities.

Details

Motivation: LLMs optimized for direct question-answering undermine effective pedagogy, which requires strategically withholding answers to promote learning.

Method: Online reinforcement learning framework using simulated student-tutor interactions, with controllable reward weighting to balance pedagogical support and student solving accuracy.

Result: Trained a 7B parameter tutor model without human annotations that reaches similar performance to larger proprietary models like LearnLM, better preserving reasoning capabilities than SFT baselines.

Conclusion: The framework enables training effective pedagogical tutors with optional interpretability through thinking tags, tracing the Pareto frontier between pedagogical support and student accuracy.

Abstract: Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy which requires strategically withholding answers. To mitigate this, we propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors using simulated student-tutor interactions by emphasizing pedagogical quality and guided problem-solving over simply giving away answers. We use our method to train a 7B parameter tutor model without human annotations which reaches similar performance to larger proprietary models like LearnLM. We introduce a controllable reward weighting to balance pedagogical support and student solving accuracy, allowing us to trace the Pareto frontier between these two objectives. Our models better preserve reasoning capabilities than single-turn SFT baselines and can optionally enhance interpretability through thinking tags that expose the model’s instructional planning.

[208] TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning

Jinyang Wu, Chonghua Liao, Mingkuan Feng, Shuai Zhang, Zhengqi Wen, Haoran Luo, Ling Yang, Huazhe Xu, Jianhua Tao

Main category: cs.CL

TL;DR: TemplateRL is a structured template-guided RL framework that improves reasoning by using explicit templates to guide policy optimization, enhancing sampling efficiency and stability compared to unstructured RL methods like GRPO.

Details

Motivation: Existing RL methods for model reasoning rely on unstructured self-sampling with scalar rewards, producing inefficient rollouts that fail to capture transferable problem-solving strategies.

Method: Constructs a problem-solving template library via MCTS on a small seed set, then integrates this high-level structured guidance into RL training to align rollout generation with proven template structures.

Result: Outperforms GRPO by 99% on AIME and 41% on AMC, with superior stability on weak models and remarkable cross-domain generalization.

Conclusion: TemplateRL’s structure-guided design effectively steers policies toward validated strategic patterns, improving RL sampling efficiency and demonstrating potential for broader tasks through interpretable and editable template libraries.

Abstract: Reinforcement learning (RL) has emerged as an effective paradigm for enhancing model reasoning. However, existing RL methods like GRPO often rely on unstructured self-sampling to fit scalar rewards, often producing inefficient rollouts that fail to capture transferable problem-solving strategies. To address these limitations, we propose TemplateRL, a structured template-guided RL framework that augments policy optimization with explicit template guidance. Our approach first constructs a problem-solving template library via MCTS on a small seed set, then seamlessly integrates this high-level structured guidance into RL training. By guiding rollout generation to align with proven template structures, TemplateRL significantly improves high-quality trajectory hit rates while reducing ineffective exploration. This structure-guided design steers the policy toward validated strategic patterns, stabilizing training dynamics, and enhancing RL sampling efficiency. Notably, the explicit template library is interpretable, editable, and supports online updates-enabling continuous updates during both training and inference. Extensive experiments demonstrate that TemplateRL outperforms GRPO by 99% on AIME and 41% on AMC, with superior stability on weak models and remarkable cross-domain generalization, highlighting its potential for broader tasks.

[209] Explain Less, Understand More: Jargon Detection via Personalized Parameter-Efficient Fine-tuning

Bohao Wu, Qingyun Wang, Yue Guo

Main category: cs.CL

TL;DR: Systematic study of personalized jargon detection using efficient methods like LoRA finetuning and personalized prompting, achieving 21.4% better F1 than GPT-4 with only 10% training data.

Details

Motivation: Personalizing jargon detection is essential for making technical documents accessible to diverse readers, but current methods require substantial annotation and computational resources for user-specific finetuning.

Method: Two personalization strategies: (1) lightweight finetuning using Low-Rank Adaptation (LoRA) on open-source models, and (2) personalized prompting that tailors model behavior at inference time without retraining. Also investigated semi-supervised approaches combining limited annotated data with self-supervised learning from users’ publications.

Result: Personalized LoRA model outperforms GPT-4 with contextual prompting by 21.4% in F1 score and exceeds the best performing oracle baseline by 8.3%. Achieves comparable performance using only 10% of annotated training data.

Conclusion: First systematic exploration of efficient, low-resource personalization for jargon detection using open-source language models, offering a practical path toward scalable, user-adaptive NLP systems.

Abstract: Personalizing jargon detection and explanation is essential for making technical documents accessible to readers with diverse disciplinary backgrounds. However, tailoring models to individual users typically requires substantial annotation efforts and computational resources due to user-specific finetuning. To address this, we present a systematic study of personalized jargon detection, focusing on methods that are both efficient and scalable for real-world deployment. We explore two personalization strategies: (1) lightweight finetuning using Low-Rank Adaptation (LoRA) on open-source models, and (2) personalized prompting, which tailors model behavior at inference time without retaining. To reflect realistic constraints, we also investigate semi-supervised approaches that combine limited annotated data with self-supervised learning from users’ publications. Our personalized LoRA model outperforms GPT-4 with contextual prompting by 21.4% in F1 score and exceeds the best performing oracle baseline by 8.3%. Remarkably, our method achieves comparable performance using only 10% of the annotated training data, demonstrating its practicality for resource-constrained settings. Our study offers the first work to systematically explore efficient, low-resource personalization of jargon detection using open-source language models, offering a practical path toward scalable, user-adaptive NLP system.

[210] Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation

Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz

Main category: cs.CL

TL;DR: ARC-JSD is a novel Jensen-Shannon Divergence method for efficient context attribution in RAG systems without fine-tuning or surrogate models, achieving superior accuracy and computational efficiency.

Details

Motivation: Current context attribution methods in RAG systems are computationally intensive, requiring extensive fine-tuning or human annotation, making reliable attribution challenging.

Method: Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD) that identifies essential context sentences without additional fine-tuning, gradient-calculation or surrogate modelling.

Result: Superior accuracy and significant computational efficiency improvements on RAG benchmarks (TyDi QA, Hotpot QA, Musique) compared to previous surrogate-based methods. Mechanistic analysis reveals specific attention heads and MLP layers responsible for context attribution.

Conclusion: ARC-JSD provides an efficient and accurate solution for context attribution in RAG systems while offering insights into the internal mechanisms of how LLMs process and attribute context.

Abstract: Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments, context attribution, remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning, gradient-calculation or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs in different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models and how they affect RAG behaviours. Our code is available at https://github.com/ruizheliUOA/ARC_JSD.

[211] TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning

Florentin Beck, William Rudman, Carsten Eickhoff

Main category: cs.CL

TL;DR: TRIM introduces a novel LLM pruning method that applies varying sparsity ratios to individual output dimensions within layers, achieving state-of-the-art performance at high sparsity levels.

Details

Motivation: Existing one-shot pruning methods use uniform sparsity constraints across layers, leading to suboptimal performance especially at high sparsity ratios, creating a need for more fine-grained pruning approaches.

Method: TRIM employs an iterative adjustment process with quality metrics to optimize dimension-wise sparsity allocation, reducing variance in quality retention across outputs and preserving critical information. It can be integrated with existing layer-wise pruning strategies.

Result: TRIM achieves new SOTA results across diverse LLM families (Qwen2.5, LLaMA-2, OPT) and sparsity levels. At 80% sparsity, it reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baselines, while enhancing stability.

Conclusion: Fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression, demonstrating the effectiveness of targeted row-wise iterative pruning.

Abstract: Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: https://github.com/flobk/TRIM

[212] CrosGrpsABS: Cross-Attention over Syntactic and Semantic Graphs for Aspect-Based Sentiment Analysis in a Low-Resource Language

Md. Mithun Hossain, Md. Shakil Hossain, Sudipto Chaki, Md. Rajib Hossain

Main category: cs.CL

TL;DR: CrosGrpsABS is a novel hybrid framework for Aspect-Based Sentiment Analysis that uses bidirectional cross-attention between syntactic and semantic graphs to improve sentiment classification, especially for low-resource languages like Bengali.

Details

Motivation: Address the lack of ABSA resources for low-resource languages like Bengali, which suffer from limited annotated data, pre-trained models, and optimized hyperparameters compared to resource-rich languages like English.

Method: A hybrid framework combining transformer-based contextual embeddings with graph convolutional networks, using bidirectional cross-attention between rule-based syntactic dependency parsing and semantic similarity computations to fuse local syntactic structure with global semantic context.

Result: Outperforms existing approaches on both low-resource Bengali datasets and high-resource English SemEval 2014 Task 4 dataset, achieving 0.93% F1-score improvement for Restaurant domain and 1.06% gain for Laptop domain in SemEval benchmark.

Conclusion: The bidirectional cross-attention mechanism effectively enhances aspect-level sentiment classification by integrating syntactic and semantic information, demonstrating strong performance across both low- and high-resource language settings.

Abstract: Aspect-Based Sentiment Analysis (ABSA) is a fundamental task in natural language processing, offering fine-grained insights into opinions expressed in text. While existing research has largely focused on resource-rich languages like English which leveraging large annotated datasets, pre-trained models, and language-specific tools. These resources are often unavailable for low-resource languages such as Bengali. The ABSA task in Bengali remains poorly explored and is further complicated by its unique linguistic characteristics and a lack of annotated data, pre-trained models, and optimized hyperparameters. To address these challenges, this research propose CrosGrpsABS, a novel hybrid framework that leverages bidirectional cross-attention between syntactic and semantic graphs to enhance aspect-level sentiment classification. The CrosGrpsABS combines transformerbased contextual embeddings with graph convolutional networks, built upon rule-based syntactic dependency parsing and semantic similarity computations. By employing bidirectional crossattention, the model effectively fuses local syntactic structure with global semantic context, resulting in improved sentiment classification performance across both low- and high-resource settings. We evaluate CrosGrpsABS on four low-resource Bengali ABSA datasets and the high-resource English SemEval 2014 Task 4 dataset. The CrosGrpsABS consistently outperforms existing approaches, achieving notable improvements, including a 0.93% F1-score increase for the Restaurant domain and a 1.06% gain for the Laptop domain in the SemEval 2014 Task 4 benchmark.

[213] Shifting AI Efficiency From Model-Centric to Data-Centric Compression

Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Tailai Chen, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang

Main category: cs.CL

TL;DR: The paper argues for a paradigm shift from model-centric compression to data-centric compression to address computational bottlenecks in AI, particularly for long-context processing.

Details

Motivation: Hardware limitations constrain further model scaling, and the quadratic cost of self-attention over long sequences (text, images, videos) has become the primary computational bottleneck.

Method: Establishes a unified framework for efficiency strategies and systematically reviews data-centric compression methods that directly compress data volume during training or inference.

Result: Demonstrates that data-centric compression constitutes a crucial paradigm change for long-context AI and analyzes benefits across diverse scenarios.

Conclusion: Provides a novel perspective on AI efficiency, synthesizes existing efforts, and outlines key challenges and future research directions to address ever-increasing context lengths.

Abstract: The advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on scaling model parameters. However, as hardware limits constrain further model growth, the primary computational bottleneck has shifted to the quadratic cost of self-attention over increasingly long sequences by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, \textbf{we argue that the focus of research for efficient artificial intelligence (AI) is shifting from model-centric compression to data-centric compression}. We position data-centric compression as the emerging paradigm, which improves AI efficiency by directly compressing the volume of data processed during model training or inference. To formalize this shift, we establish a unified framework for existing efficiency strategies and demonstrate why it constitutes a crucial paradigm change for long-context AI. We then systematically review the landscape of data-centric compression methods, analyzing their benefits across diverse scenarios. Finally, we outline key challenges and promising future research directions. Our work aims to provide a novel perspective on AI efficiency, synthesize existing efforts, and catalyze innovation to address the challenges posed by ever-increasing context lengths.

[214] MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning

Thang Nguyen, Peter Chin, Yu-Wing Tai

Main category: cs.CL

TL;DR: MA-RAG is a multi-agent framework that improves RAG by using specialized agents for different pipeline stages, achieving state-of-the-art performance on complex QA tasks without domain-specific fine-tuning.

Details

Motivation: To address inherent ambiguities and reasoning challenges in complex information-seeking tasks that conventional RAG methods struggle with, by enabling collaborative, modular reasoning.

Method: Orchestrates specialized AI agents (Planner, Step Definer, Extractor, QA Agents) that decompose tasks into subtasks, communicate via chain-of-thought prompting, and progressively refine retrieval and synthesis while maintaining modular interpretability.

Result: Significantly outperforms standalone LLMs and existing RAG methods across all model scales on multi-hop and ambiguous QA benchmarks. Small LLaMA3-8B with MA-RAG surpasses larger standalone LLMs, while larger variants set new SOTA results. Generalizes to specialized domains like medical QA without domain-specific fine-tuning.

Conclusion: MA-RAG establishes a new paradigm for efficient and reliable multi-agent RAG, improving answer accuracy, robustness, and providing interpretable intermediate reasoning steps through collaborative, modular reasoning.

Abstract: We present MA-RAG, a Multi-Agent framework for Retrieval-Augmented Generation (RAG) that addresses the inherent ambiguities and reasoning challenges in complex information-seeking tasks. Unlike conventional RAG methods that rely on end-to-end fine-tuning or isolated component enhancements, MA-RAG orchestrates a collaborative set of specialized AI agents: Planner, Step Definer, Extractor, and QA Agents, each responsible for a distinct stage of the RAG pipeline. By decomposing tasks into subtasks such as query disambiguation, evidence extraction, and answer synthesis, and enabling agents to communicate intermediate reasoning via chain-of-thought prompting, MA-RAG progressively refines retrieval and synthesis while maintaining modular interpretability. Extensive experiments on multi-hop and ambiguous QA benchmarks, including NQ, HotpotQA, 2WikimQA, and TriviaQA, demonstrate that MA-RAG significantly outperforms standalone LLMs and existing RAG methods across all model scales. Notably, even a small LLaMA3-8B model equipped with MA-RAG surpasses larger standalone LLMs, while larger variants (LLaMA3-70B and GPT-4o-mini) set new state-of-the-art results on challenging multi-hop datasets. Ablation studies reveal that both the planner and extractor agents are critical for multi-hop reasoning, and that high-capacity models are especially important for the QA agent to synthesize answers effectively. Beyond general-domain QA, MA-RAG generalizes to specialized domains such as medical QA, achieving competitive performance against domain-specific models without any domain-specific fine-tuning. Our results highlight the effectiveness of collaborative, modular reasoning in retrieval-augmented systems: MA-RAG not only improves answer accuracy and robustness but also provides interpretable intermediate reasoning steps, establishing a new paradigm for efficient and reliable multi-agent RAG.

[215] ARM: Adaptive Reasoning Model

Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, Yanghua Xiao

Main category: cs.CL

TL;DR: ARM is an Adaptive Reasoning Model that dynamically selects appropriate reasoning formats (Direct Answer, Short CoT, Code, Long CoT) based on task difficulty, reducing token usage by 30-70% while maintaining performance comparable to Long CoT-only models.

Details

Motivation: Current large reasoning models suffer from 'overthinking' - excessive reasoning on simple tasks - which contradicts the goal of fully autonomous AI and wastes computational resources.

Method: Proposed ARM with Ada-GRPO training method, an adaptation of Group Relative Policy Optimization that prevents format collapse and enables adaptive format selection.

Result: ARM achieves 30% average token reduction (up to 70%) while maintaining comparable performance to Long CoT models, plus 2x training speedup.

Conclusion: ARM provides efficient adaptive reasoning with three modes (Adaptive, Instruction-Guided, Consensus-Guided), balancing performance and computational efficiency for autonomous AI systems.

Abstract: While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the “overthinking” problem – excessive and unnecessary reasoning – which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones – Direct Answer, Short CoT, and Code – as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens – ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.

[216] Multi-Scale Manifold Alignment for Interpreting Large Language Models: A Unified Information-Geometric Framework

Yukun Zhang, Qi Dong

Main category: cs.CL

TL;DR: MSMA is an information-geometric framework that decomposes LLM representations into local, intermediate, and global manifolds and learns cross-scale mappings to preserve geometry and information, improving alignment metrics across multiple models.

Details

Motivation: To understand and improve the hierarchical patterns in LLM representations by decomposing them into different scales and learning cross-scale mappings that preserve information geometry.

Method: Multi-Scale Manifold Alignment (MSMA) framework that decomposes LLM representations into local, intermediate, and global manifolds, learns cross-scale mappings preserving geometry and information, and evaluates using multiple estimators including relative KL reduction and mutual information gains.

Result: Consistent hierarchical patterns observed across GPT-2, BERT, RoBERTa, and T5; MSMA improves alignment metrics with statistical significance across seeds; controlled interventions at different scales yield distinct architecture-dependent effects on lexical diversity, sentence structure, and discourse coherence.

Conclusion: Multi-objective alignment offers a practical lens for analyzing cross-scale information flow and guiding representation-level control, though theoretical analysis relies on idealized assumptions.

Abstract: We present Multi-Scale Manifold Alignment(MSMA), an information-geometric framework that decomposes LLM representations into local, intermediate, and global manifolds and learns cross-scale mappings that preserve geometry and information. Across GPT-2, BERT, RoBERTa, and T5, we observe consistent hierarchical patterns and find that MSMA improves alignment metrics under multiple estimators (e.g., relative KL reduction and MI gains with statistical significance across seeds). Controlled interventions at different scales yield distinct and architecture-dependent effects on lexical diversity, sentence structure, and discourse coherence. While our theoretical analysis relies on idealized assumptions, the empirical results suggest that multi-objective alignment offers a practical lens for analyzing cross-scale information flow and guiding representation-level control.

[217] Empirical Investigation of Latent Representational Dynamics in Large Language Models: A Manifold Evolution Perspective

Yukun Zhang, Qi Dong

Main category: cs.CL

TL;DR: DMET models LLM generation as continuous trajectories on low-dimensional semantic manifolds, using three metrics to analyze latent dynamics and their relationship with text quality.

Details

Motivation: To provide a unified framework for interpreting LLM behavior by connecting internal representation dynamics with external text generation quality.

Method: Proposes Dynamical Manifold Evolution Theory with three interpretable metrics (state continuity C, attractor compactness Q, topological persistence P) to analyze latent dynamics across Transformer architectures.

Result: Reveals consistent links between latent dynamics and text quality: smoother trajectories correlate with fluency, richer topological organization with coherence. Different models show distinct dynamical regimes, and decoding parameters shape trajectories predictably.

Conclusion: DMET offers a testable phenomenological framework for interpreting, monitoring, and guiding LLM behavior, providing insights into the relationship between internal dynamics and generation quality.

Abstract: This paper introduces the Dynamical Manifold Evolution Theory (DMET), a conceptual framework that models large language model (LLM) generation as a continuous trajectory evolving on a low-dimensional semantic manifold. The theory characterizes latent dynamics through three interpretable metrics-state continuity ($C$), attractor compactness ($Q$), and topological persistence ($P$)-which jointly capture the smoothness, stability, and structure of representation evolution. Empirical analyses across multiple Transformer architectures reveal consistent links between these latent dynamics and text quality: smoother trajectories correspond to greater fluency, and richer topological organization correlates with enhanced coherence. Different models exhibit distinct dynamical regimes, reflecting diverse strategies of semantic organization in latent space. Moreover, decoding parameters such as temperature and top-$p$ shape these trajectories in predictable ways, defining a balanced region that harmonizes fluency and creativity. As a phenomenological rather than first-principles framework, DMET provides a unified and testable perspective for interpreting, monitoring, and guiding LLM behavior, offering new insights into the interplay between internal representation dynamics and external text generation quality.

[218] Are Language Models Consequentialist or Deontological Moral Reasoners?

Keenan Samway, Max Kleiman-Weiner, David Guzman Piedrahita, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin

Main category: cs.CL

TL;DR: Large-scale analysis of moral reasoning in LLMs using 600+ trolley problems, revealing that chain-of-thought reasoning favors deontology while post-hoc explanations shift to consequentialism.

Details

Motivation: As AI systems are increasingly used in high-stakes domains like healthcare and law, understanding their ethical reasoning processes is critical for safe deployment.

Method: Used over 600 distinct trolley problems as probes, introduced a taxonomy of moral rationales based on consequentialism and deontology to systematically classify LLM reasoning traces.

Result: LLM chains-of-thought tend to favor deontological principles (moral obligations), while post-hoc explanations shift toward consequentialist rationales (utility maximization).

Conclusion: The framework provides foundation for understanding how LLMs process ethical considerations, enabling safer and more interpretable deployment in high-stakes decision-making environments.

Abstract: As AI systems increasingly navigate applications in healthcare, law, and governance, understanding how they handle ethically complex scenarios becomes critical. Previous work has mainly examined the moral judgments in large language models (LLMs), rather than their underlying moral reasoning process. In contrast, we focus on a large-scale analysis of the moral reasoning traces provided by LLMs. Furthermore, unlike prior work that attempted to draw inferences from only a handful of moral dilemmas, our study leverages over 600 distinct trolley problems as probes for revealing the reasoning patterns that emerge within different LLMs. We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology. Our analysis reveals that LLM chains-of-thought tend to favor deontological principles based on moral obligations, while post-hoc explanations shift notably toward consequentialist rationales that emphasize utility. Our framework provides a foundation for understanding how LLMs process and articulate ethical considerations, an important step toward safe and interpretable deployment of LLMs in high-stakes decision-making environments. Our code is available at https://github.com/keenansamway/moral-lens .

[219] Latent Reasoning via Sentence Embedding Prediction

Hyeonbin Hwang, Byeongguk Jeon, Seungone Kim, Jiyeon Kim, Hoyeon Chang, Sohee Yang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo

Main category: cs.CL

TL;DR: The paper presents a framework that adapts pretrained language models to reason over sentence-level abstractions rather than tokens, using semantic and contextual embeddings with continuous inference to achieve competitive performance with Chain-of-Thought while reducing computational costs.

Details

Motivation: Autoregressive language models generate tokens sequentially, but human reasoning operates over higher-level abstractions like sentences and concepts. The research investigates whether LMs can learn to reason over structured semantic units instead of raw token sequences.

Method: A framework that adapts pretrained token-level LMs to operate in sentence space by autoregressively predicting continuous embeddings of next sentences. Two embedding paradigms: semantic embeddings (via autoencoding) and contextual embeddings (via next-sentence prediction). Two inference regimes: Discretized (decode embeddings to text) and Continuous (reason entirely in embedding space).

Result: Contextual embeddings under continuous inference show competitive performance with Chain-of-Thought across mathematics, logic, commonsense, and planning domains while reducing inference-time FLOPs by half on average. Early signs of scalability and modular adaptation observed.

Conclusion: Pretrained language models can effectively transition to abstract, structured reasoning within latent embedding spaces, enabling more efficient and higher-level reasoning capabilities.

Abstract: Autoregressive language models (LMs) generate one token at a time, yet human reasoning operates over higher-level abstractions - sentences, propositions, and concepts. This contrast raises a central question- Can LMs likewise learn to reason over structured semantic units rather than raw token sequences? In this work, we investigate whether pretrained LMs can be lifted into such abstract reasoning spaces by building on their learned representations. We present a framework that adapts a pretrained token-level LM to operate in sentence space by autoregressively predicting continuous embeddings of next sentences. We explore two embedding paradigms inspired by classical representation learning: 1) semantic embeddings, learned via autoencoding to preserve surface meaning; and 2) contextual embeddings, trained via next-sentence prediction to encode anticipatory structure. We evaluate both under two inference regimes: Discretized, which decodes each predicted embedding into text before re-encoding; and Continuous, which reasons entirely in embedding space for improved efficiency. Across four domains - mathematics, logic, commonsense, and planning - contextual embeddings under continuous inference show competitive performance with Chain-of-Thought (CoT) while reducing inference-time FLOPs on average by half. We also present early signs of scalability and modular adaptation. Finally, to visualize latent trajectories, we introduce SentenceLens, a diagnostic tool that decodes intermediate model states into interpretable sentences. Together, our results indicate that pretrained LMs can effectively transition to abstract, structured reasoning within latent embedding spaces.

[220] Self-ensemble: Mitigating Confidence Mis-calibration for Large Language Models

Zicheng Xu, Guanchu Wang, Guangyao Zheng, Yu-Neng Chuang, Alexander Szalay, Xia Hu, Vladimir Braverman

Main category: cs.CL

TL;DR: Self-ensemble method addresses LLM confidence distortion in multi-choice QA by splitting choices into groups and ensembling predictions, improving performance without labeled data.

Details

Motivation: LLMs suffer from confidence distortion in multi-choice QA - under-confidence in correct predictions and over-confidence in incorrect ones, especially with many choices.

Method: Split choices into groups, ensemble LLM predictions across groups using attention mask and positional encoding, requiring no labeled data for tuning.

Result: Outperforms standard inference and baseline methods across three LLMs and datasets, comprehensively addressing confidence distortion.

Conclusion: Self-ensemble effectively solves LLM confidence distortion in multi-choice QA through plug-and-play ensembling without parameter tuning.

Abstract: Although Large Language Models (LLMs) perform well in general fields, they exhibit a confidence distortion problem on multi-choice question-answering (MCQA), particularly as the number of answer choices increases. Specifically, on MCQA with many choices, LLMs suffer from under-confidence in correct predictions and over-confidence in incorrect ones, leading to a substantially degraded performance. To solve this problem, we propose Self-ensemble in this work. Our method splits the choices into several groups and ensembles LLM predictions across these groups to reach a final decision. The advantage of Self-ensemble is its plug-and-play nature, where it can be integrated into existing LLM architecture based on a designed attention mask and positional encoding, without requiring labeled datasets for parameter tuning. Experimental results on three LLMs and datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem of LLMs, outperforming standard inference as well as baseline methods.

[221] Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks

Seokil Ham, Yubin Choi, Yujin Yang, Seungju Cho, Younghun Kim, Changick Kim

Main category: cs.CL

TL;DR: Proposes Refusal-Teacher (Ref-Teacher) framework to address harmful finetuning attacks in Finetuning-as-a-Service by directly finetuning base models under safety-aligned teacher guidance instead of using safety-aligned models as weak initialization.

Details

Motivation: Current FaaS systems are vulnerable to safety degradation when user data contains harmful prompts, and existing approaches using safety-aligned models as initialization lead to suboptimal safety and task performance.

Method: Directly finetunes base model guided by safety-aligned Ref-Teacher that filters harmful prompts from user data and distills safety-alignment knowledge into base model.

Result: Effectively minimizes harmful outputs and enhances finetuning accuracy for user-specific tasks.

Conclusion: Provides practical solution for secure and reliable deployment of LLMs in FaaS by addressing harmful finetuning attacks through teacher-guided finetuning.

Abstract: Recently, major AI providers such as Google and OpenAI have introduced Finetuning-as-a-Service (FaaS), which allows users to customize Large Language Models (LLMs) using their own data. However, this service is vulnerable to safety degradation when user data includes harmful prompts, a threat known as harmful finetuning attacks. Prior works attempt to mitigate this issue by first constructing safety-aligned model and then finetuning the model on user data. However, we observe that the safety-aligned weights provide weak initialization for downstream task learning, leading to suboptimal safety-alignment and downstream task performance. To address this, we propose a Refusal-Teacher (Ref-Teacher)-guided finetuning framework. Instead of finetuning a safety-aligned model on user data, our approach directly finetunes the base model under the guidance of a safety-aligned Ref-Teacher, which filters harmful prompts from user data and distills safety-alignment knowledge into the base model. Extensive experiments demonstrate that our Ref-Teacher-guided finetuning strategy effectively minimizes harmful outputs and enhances finetuning accuracy for user-specific tasks, offering a practical solution for secure and reliable deployment of LLMs in FaaS.

[222] LLM-as-a-qualitative-judge: automating error analysis in natural language generation

Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian

Main category: cs.CL

TL;DR: Proposes LLM-as-a-qualitative-judge, an LLM-based evaluation approach that generates structured reports of common issue types in NLG system outputs, providing developers with actionable insights for system improvement.

Details

Motivation: Current LLM-as-a-judge approaches are primarily quantitative (numerical scores), lacking qualitative insights about specific issues and improvement directions for NLG systems.

Method: Two-step approach: 1) Open-ended per-instance issue analysis using LLMs, 2) Clustering discovered issues using an intuitive cumulative algorithm to generate structured error type reports.

Result: LLM-as-a-qualitative-judge matches human-annotated issues in 2/3 cases, produces error type reports resembling human reports, and in a case study substantially improved NLG system performance.

Conclusion: LLM-as-a-qualitative-judge effectively provides qualitative insights for NLG system improvement, bridging the gap between quantitative evaluation and actionable development guidance.

Abstract: Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in 2/3 cases, and that LLM-as-a-qualitative-judge is capable of producing error type reports resembling the reports composed by human annotators. We also demonstrate in a case study how the use of LLM-as-a-qualitative-judge can substantially improve NLG systems performance. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.

[223] Effectiveness of Counter-Speech against Abusive Content: A Multidimensional Annotation and Classification Study

Greta Damo, Elena Cabrio, Serena Villata

Main category: cs.CL

TL;DR: A computational framework for classifying counter-speech effectiveness using six linguistic dimensions, with strong performance on both expert- and user-written counter-speech.

Details

Motivation: Defining criteria to assess counter-speech effectiveness remains an open challenge in mitigating online hate speech.

Method: Proposed framework with six dimensions (Clarity, Evidence, Emotional Appeal, Rebuttal, Audience Adaptation, Fairness), annotated 4,214 CS instances, and developed multi-task and dependency-based classification strategies.

Result: Achieved strong results (0.94 and 0.96 average F1 respectively) on both expert- and user-written counter-speech, outperforming standard baselines and revealing strong interdependence among dimensions.

Conclusion: The framework provides an effective computational approach for assessing counter-speech effectiveness with strong classification performance.

Abstract: Counter-speech (CS) is a key strategy for mitigating online Hate Speech (HS), yet defining the criteria to assess its effectiveness remains an open challenge. We propose a novel computational framework for CS effectiveness classification, grounded in linguistics, communication and argumentation concepts. Our framework defines six core dimensions - Clarity, Evidence, Emotional Appeal, Rebuttal, Audience Adaptation, and Fairness - which we use to annotate 4,214 CS instances from two benchmark datasets, resulting in a novel linguistic resource released to the community. In addition, we propose two classification strategies, multi-task and dependency-based, achieving strong results (0.94 and 0.96 average F1 respectively on both expert- and user-written CS), outperforming standard baselines, and revealing strong interdependence among dimensions.

[224] Language Surgery in Multilingual Large Language Models

Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali, Fajri Koto, Genta Indra Winata, Peerat Limkonchotiwat, Alham Fikri Aji, Samuel Cahyawijaya

Main category: cs.CL

TL;DR: This paper investigates naturally emerging representation alignment in LLMs’ middle layers and proposes Inference-Time Language Control (ITLC) for precise cross-lingual language control while preserving semantic integrity.

Details

Motivation: To understand representation alignment in LLMs and address the cross-lingual language confusion problem that causes inconsistent language generation in current large-scale LLMs.

Method: Empirical analysis of representation alignment in LLMs’ middle layers, followed by proposing ITLC - a novel method using latent injection for cross-lingual language control.

Result: Confirmed existence of natural representation alignment in LLMs, demonstrated ITLC’s strong cross-lingual control capabilities while preserving semantic integrity, and showed effectiveness in mitigating language confusion.

Conclusion: This work advances understanding of representation alignment in LLMs and provides a practical solution for enhancing both monolingual and cross-lingual performance through inference-time language control.

Abstract: Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages, revolutionizing natural language processing. This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers, and its implications for disentangling language-specific and language-agnostic information. We empirically confirm the existence of this alignment, analyze its behavior in comparison to explicitly designed alignment models, and demonstrate its potential for language-specific manipulation without semantic degradation. Building on these findings, we propose Inference-Time Language Control (ITLC), a novel method that leverages latent injection to enable precise cross-lingual language control and mitigate language confusion in LLMs. Our experiments highlight ITLC’s strong cross-lingual control capabilities while preserving semantic integrity in target languages. Furthermore, we demonstrate its effectiveness in alleviating the cross-lingual language confusion problem, which persists even in current large-scale LLMs, leading to inconsistent language generation. This work advances our understanding of representation alignment in LLMs and introduces a practical solution for enhancing their monolingual and cross-lingual performance.

[225] MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application

Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He, Yang Ren, Mingyang Jiang, Vincent Jim Zhang, Yuqing Guo, Jeff Zhao, Huan He, Yi Han, Yun Feng, Yuechen Jiang, Yupeng Cao, Haohang Li, Yangyang Yu, Xiaoyu Wang, Penglei Gao, Shengyuan Lin, Keyi Wang, Shanshan Yang, Yilun Zhao, Zhiwei Liu, Peng Lu, Jerry Huang, Suyuchen Wang, Triantafillos Papadopoulos, Polydoros Giannouris, Efstathia Soufleri, Nuo Chen, Zhiyang Deng, Heming Fu, Yijia Zhao, Mingquan Lin, Meikang Qiu, Kaleb E Smith, Arman Cohan, Xiao-Yang Liu, Jimin Huang, Guojun Xiong, Alejandro Lopez-Lira, Xi Chen, Junichi Tsujii, Jian-Yun Nie, Sophia Ananiadou, Qianqian Xie

Main category: cs.CL

TL;DR: MultiFinBen is the first expert-annotated multilingual and multimodal benchmark for evaluating LLMs in realistic financial contexts, testing cross-lingual evidence integration and financial OCR tasks.

Details

Motivation: Real-world financial analysis involves information across multiple languages and modalities, but existing evaluations remain text-only, monolingual, and saturated by current models.

Method: Created a structured, difficulty-aware benchmark with two task families: multilingual financial reasoning (cross-lingual evidence integration) and financial OCR (extracting structured text from scanned documents).

Result: Evaluating 21 leading LLMs showed even frontier models like GPT-4o achieve only 46.01% overall, performing better on vision and audio but dropping sharply in multilingual settings.

Conclusion: The findings expose persistent limitations in multilingual, multimodal, and expert-level financial reasoning, highlighting the need for more comprehensive evaluation frameworks.

Abstract: Real-world financial analysis involves information across multiple languages and modalities, from reports and news to scanned filings and meeting recordings. Yet most existing evaluations of LLMs in finance remain text-only, monolingual, and largely saturated by current models. To bridge these gaps, we present MultiFinBen, the first expert-annotated multilingual (five languages) and multimodal (text, vision, audio) benchmark for evaluating LLMs in realistic financial contexts. MultiFinBen introduces two new task families: multilingual financial reasoning, which tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents containing tables and charts. Rather than aggregating all available datasets, we apply a structured, difficulty-aware selection based on advanced model performance, ensuring balanced challenge and removing redundant tasks. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings. These findings expose persistent limitations in multilingual, multimodal, and expert-level financial reasoning. All datasets, evaluation scripts, and leaderboards are publicly released.

[226] Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot

Xiang Cheng, Chengyan Pan, Minjun Zhao, Deyang Li, Fangchao Liu, Xinyu Zhang, Xiao Zhang, Yong Liu

Main category: cs.CL

TL;DR: Recent strong LLMs like Qwen2.5 series don’t benefit from traditional or enhanced CoT exemplars in mathematical reasoning tasks - they primarily serve output formatting purposes rather than improving reasoning performance.

Details

Motivation: To investigate whether Chain-of-Thought exemplars still benefit recent, stronger LLMs in mathematical reasoning tasks, given the continuous advancement of model capabilities.

Method: Systematic experiments comparing traditional CoT exemplars and enhanced CoT exemplars (constructed using answers from advanced models like Qwen2.5-Max and DeepSeek-R1) against Zero-Shot CoT on recent strong models.

Result: Both traditional and enhanced CoT exemplars fail to improve reasoning performance compared to Zero-Shot CoT. Models tend to ignore exemplars and focus primarily on instructions, showing no observable gain in reasoning ability.

Conclusion: Current ICL+CoT framework has limitations in mathematical reasoning, calling for re-examination of the ICL paradigm and exemplar definition for stronger models.

Abstract: In-Context Learning (ICL) is an essential emergent ability of Large Language Models (LLMs), and recent studies introduce Chain-of-Thought (CoT) to exemplars of ICL to enhance the reasoning capability, especially in mathematics tasks. However, given the continuous advancement of model capabilities, it remains unclear whether CoT exemplars still benefit recent, stronger models in such tasks. Through systematic experiments, we find that for recent strong models such as the Qwen2.5 series, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-Shot CoT. Instead, their primary function is to align the output format with human expectations. We further investigate the effectiveness of enhanced CoT exemplars, constructed using answers from advanced models such as \texttt{Qwen2.5-Max} and \texttt{DeepSeek-R1}. Experimental results indicate that these enhanced exemplars still fail to improve the model’s reasoning performance. Further analysis reveals that models tend to ignore the exemplars and focus primarily on the instructions, leading to no observable gain in reasoning ability. Overall, our findings highlight the limitations of the current ICL+CoT framework in mathematical reasoning, calling for a re-examination of the ICL paradigm and the definition of exemplars.

[227] MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Multi-hop Hate Speech Explanation

Jackson Trager, Francielle Vargas, Diego Alves, Matteo Guida, Mikel K. Ngueajio, Ameeta Agrawal, Yalda Daryani, Farzan Karimi-Malekabadi, Flor Miriam Plaza-del-Arco

Main category: cs.CL

TL;DR: MFTCXplain is a multilingual benchmark for evaluating LLM moral reasoning using hate speech explanations based on Moral Foundations Theory, revealing significant gaps between LLM outputs and human moral reasoning.

Details

Motivation: Address limitations in current moral reasoning benchmarks: lack of justification annotations for transparency and predominant English focus that limits cross-cultural assessment.

Method: Created MFTCXplain dataset with 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with hate speech labels, moral categories, and text span-level rationales using Moral Foundations Theory.

Result: LLMs show misalignment with human moral reasoning - perform well in hate speech detection (F1 up to 0.836) but poorly in predicting moral sentiments (F1 < 0.35), with limited rationale alignment especially in underrepresented languages.

Conclusion: Current LLMs have limited capacity to internalize and reflect human moral reasoning, highlighting the need for improved multilingual moral reasoning capabilities.

Abstract: Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via multi-hop hate speech explanation using the Moral Foundations Theory. MFTCXplain comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Our results show a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited mainly in underrepresented languages. Our findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning

[228] Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents

Jiaxi Zhuang, Kangning Li, Jue Hou, Mingjun Xu, Zhifeng Gao, Hengxing Cai

Main category: cs.CL

TL;DR: Doc2SAR is a synergistic framework combining domain-specific tools with fine-tuned multimodal LLMs for extracting molecular structure-activity relationships from scientific documents, achieving state-of-the-art performance on the new DocSAR-200 benchmark.

Details

Motivation: Extracting SARs from scientific literature is challenging due to heterogeneous document formats and limitations of existing methods - rule-based approaches lack generalization while general MLLMs lack accuracy for specialized tasks like layout detection and chemical structure recognition.

Method: Proposed Doc2SAR framework that integrates domain-specific tools with multimodal LLMs enhanced via supervised fine-tuning, plus introduced DocSAR-200 benchmark with 200 annotated scientific documents for evaluation.

Result: Doc2SAR achieves state-of-the-art performance with 80.78% Table Recall on DocSAR-200, exceeding GPT-4o by 51.48%, and demonstrates practical usability with efficient inference and web app.

Conclusion: The synergistic approach combining specialized tools with fine-tuned MLLMs effectively addresses SAR extraction challenges and significantly outperforms end-to-end baselines across diverse document types.

Abstract: Extracting molecular structure-activity relationships (SARs) from scientific literature and patents is essential for drug discovery and materials research. However, this task remains challenging due to heterogeneous document formats and limitations of existing methods. Specifically, rule-based approaches relying on rigid templates fail to generalize across diverse document layouts, while general-purpose multimodal large language models (MLLMs) lack sufficient accuracy and reliability for specialized tasks, such as layout detection and optical chemical structure recognition (OCSR). To address these challenges, we introduce DocSAR-200, a rigorously annotated benchmark of 200 scientific documents designed specifically for evaluating SAR extraction methods. Additionally, we propose Doc2SAR, a novel synergistic framework that integrates domain-specific tools with MLLMs enhanced via supervised fine-tuning (SFT). Extensive experiments demonstrate that Doc2SAR achieves state-of-the-art performance across various document types, significantly outperforming leading end-to-end baselines. Specifically, Doc2SAR attains an overall Table Recall of 80.78% on DocSAR-200, exceeding end2end GPT-4o by 51.48%. Furthermore, Doc2SAR demonstrates practical usability through efficient inference and is accompanied by a web app.

[229] PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs

Sana Kang, Myeongseok Gwon, Su Young Kwon, Jaewook Lee, Andrew Lan, Bhiksha Raj, Rita Singh

Main category: cs.CL

TL;DR: PhoniTale is a cross-lingual mnemonic generation system that uses IPA-based phonological adaptation and syllable-aware alignment to create effective vocabulary learning aids for L2 learners, outperforming previous automated methods and achieving human-level quality.

Details

Motivation: Vocabulary acquisition is challenging for L2 learners, especially with typologically distant languages like English and Korean, where phonological and structural mismatches complicate learning. Existing methods rely on direct IPA-based phonetic matching or use LLMs without proper phonological guidance.

Method: PhoniTale performs IPA-based phonological adaptation and syllable-aware alignment to retrieve L1 keyword sequences, then uses LLMs to generate verbal cues for vocabulary mnemonics.

Result: Evaluation through automated metrics and human recall tests shows PhoniTale consistently outperforms previous automated approaches and achieves quality comparable to human-written mnemonics.

Conclusion: The proposed PhoniTale system effectively addresses the limitations of previous methods by incorporating proper phonological guidance and alignment, demonstrating superior performance in generating helpful vocabulary learning mnemonics.

Abstract: Vocabulary acquisition poses a significant challenge for second-language (L2) learners, especially when learning typologically distant languages such as English and Korean, where phonological and structural mismatches complicate vocabulary learning. Recently, large language models (LLMs) have been used to generate keyword mnemonics by leveraging similar keywords from a learner’s first language (L1) to aid in acquiring L2 vocabulary. However, most methods still rely on direct IPA-based phonetic matching or employ LLMs without phonological guidance. In this paper, we present PhoniTale, a novel cross-lingual mnemonic generation system that performs IPA-based phonological adaptation and syllable-aware alignment to retrieve L1 keyword sequence and uses LLMs to generate verbal cues. We evaluate PhoniTale through automated metrics and a short-term recall test with human participants, comparing its output to human-written and prior automated mnemonics. Our findings show that PhoniTale consistently outperforms previous automated approaches and achieves quality comparable to human-written mnemonics.

[230] Efficient Compositional Multi-tasking for On-device Large Language Models

Ondrej Bohdal, Mete Ozay, Jijoong Moon, Kyeng-Hun Lee, Hyeonmok Ko, Umberto Michieli

Main category: cs.CL

TL;DR: The paper introduces a benchmark for compositional multi-tasking in LLMs where test examples require simultaneous execution of multiple tasks, and proposes Learnable Calibration method for efficient on-device applications.

Details

Motivation: Prior work on adapter merging in LLMs has been limited to single-task scenarios, but real-world applications often require solving multiple tasks concurrently (e.g., generating translated summaries).

Method: Proposed Learnable Calibration method tailored for on-device settings with limited computational resources, focusing on resource-efficient solutions for compositional multi-tasking.

Result: Created a benchmark comprising four practically relevant compositional tasks to facilitate research in this setting.

Conclusion: The contributions lay groundwork for advancing LLM capabilities in real-world multi-tasking scenarios and expanding applicability to complex, resource-constrained use cases.

Abstract: Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.

[231] Re:Form – Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny

Chuanhao Yan, Fengdi Che, Xuhan Huang, Xu Xu, Xin Li, Yizhi Li, Xingwei Qu, Jingzhe Shi, Chenghua Lin, Yaodong Yang, Binhang Yuan, Hang Zhao, Yu Qiao, Bowen Zhou, Jie Fu

Main category: cs.CL

TL;DR: The paper proposes using formal language-based reasoning with Dafny to overcome unreliable verification in traditional LLMs, introducing an automatic data curation pipeline and RL with formal verifier feedback to reduce human priors.

Details

Motivation: Existing LLMs using informal languages face unreliable and unscalable verification processes, while human-annotated priors for complex programming tasks are too time-consuming. Formal language systems enable automatic, provable verification.

Method: Systematic exploration of reducing human priors using Dafny formal language, featuring automatic data curation pipeline and RL designs integrated with formal verifier feedback. Introduces DafnyComp benchmark for compositional formal programs.

Result: SFT enables small models (0.5B) to generate syntactically valid and verifiable Dafny code, outperforming proprietary models. RL with regularization further improves performance and generalization to out-of-domain tasks, achieving best results on DafnyComp benchmark.

Conclusion: Formal language-based reasoning with automatic verification provides scalable and reliable approach for software verification, reducing dependency on human priors while enabling small models to outperform larger proprietary ones.

Abstract: Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models could hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is a common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, it becomes unacceptably all-consuming to provide such priors for supervising complex programming tasks. In this work, we systematically explore ways to reduce human priors with the formal language, Dafny, as the main environment for our pilot study. Our pipeline mainly relies on introducing an automatic and scalable data curation pipeline, and careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark.

[232] TriangleMix: Accelerating Prefilling via Decoding-time Contribution Sparsity

Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, Lili Qiu

Main category: cs.CL

TL;DR: TriangleMix is a training-free static attention pattern that combines dense attention with Triangle attention to reduce quadratic complexity in LLM prefilling, achieving 15.3x speedup while maintaining near-lossless performance.

Details

Motivation: LLMs suffer from quadratic attention complexity during prefilling stage, creating time bottlenecks. Existing methods exploit attention score sparsity, but there's untapped decoding-time contribution sparsity where many attention blocks have high scores during prefilling but contribute negligibly to subsequent decoding.

Method: Propose TriangleMix - a static attention pattern that uses dense attention in some layers and switches to Triangle attention in others, based on gradient analysis showing decoding-time contribution sparsity.

Result: For 128K inputs, Triangle attention achieves 15.3x speedup in attention computation, outperforming typical dynamic sparse methods (1.9x-3.4x). TriangleMix can be combined with dynamic sparsity for additional 6%-19% TTFT reduction.

Conclusion: TriangleMix effectively reduces attention overhead while preserving performance, and can be seamlessly integrated with existing dynamic sparsity methods for further acceleration.

Abstract: Large Language Models (LLMs) incur quadratic attention complexity with input length, creating a major time bottleneck in the prefilling stage. Existing acceleration methods largely exploit attention score sparsity by estimating blocks with high attention scores and applying dynamic sparse attention. In this work, we identify another untapped form of sparsity in the prefilling stage, namely decoding-time contribution sparsity, where many attention blocks exhibit nontrivial attention scores during prefilling yet contribute negligibly to subsequent decoding, as indicated by gradient-based analysis. Building on this observation, we propose TriangleMix, a training-free static attention pattern that uses dense attention in a subset of layers and switches to Triangle attention in the others. Extensive experiments show that TriangleMix preserves nearly lossless performance relative to dense attention while substantially reducing attention overhead in Triangle layers. For 128K inputs, Triangle attention achieves a 15.3x speedup in attention computation, significantly exceeding the acceleration of typical dynamic sparse methods (1.9x to 3.4x). Furthermore, TriangleMix can be seamlessly combined with dynamic sparsity approaches, delivering an additional 6% to 19% reduction in TTFT over using dynamic sparsity alone.

[233] Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal

Yang Wang, Chenghao Xiao, Yizhi Li, Stuart E. Middleton, Noura Al Moubayed, Chenghua Lin

Main category: cs.CL

TL;DR: A simple add-on module enhances PLM robustness by removing instance-level principal components, transforming embeddings to Gaussian properties without adversarial training.

Details

Motivation: PLMs are vulnerable to adversarial attacks, and existing defense methods incur high computational costs through adversarial training or data perturbation.

Method: Propose an add-on module that removes instance-level principal components from embeddings, transforming them to approximate Gaussian properties without modifying original training data.

Result: Evaluations on 8 benchmark datasets show improved adversarial robustness while maintaining comparable before-attack accuracy to baselines.

Conclusion: The approach achieves a balanced trade-off between robustness and generalization without requiring adversarial examples or costly training augmentation.

Abstract: Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining comparable before-attack accuracy to baselines, achieving a balanced trade-off between robustness and generalisation.

[234] Agentic large language models improve retrieval-based radiology question answering

Sebastian Wind, Jeta Sopa, Daniel Truhn, Mahshad Lotfinia, Tri-Thien Nguyen, Keno Bressem, Lisa Adams, Mirabela Rusu, Harald Köstler, Gerhard Wellein, Andreas Maier, Soroosh Tayebi Arasteh

Main category: cs.CL

TL;DR: Proposed RaR, a multi-step retrieval and reasoning framework that significantly improves diagnostic accuracy and reduces hallucinations in radiology question answering compared to traditional single-step RAG systems.

Details

Motivation: Traditional retrieval-augmented generation systems for radiology QA rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks and maintain factual consistency.

Method: Developed RaR framework with multi-step retrieval and reasoning, evaluated 25 LLMs across diverse architectures and scales using expert-curated radiology questions from RSNA-RadioQA, ExtendedQA, and real-world board examination questions.

Result: RaR significantly improved mean diagnostic accuracy over zero-shot prompting and conventional RAG, with greatest gains in small-scale models. Reduced hallucinations by mean 9.4% and retrieved clinically relevant context in 46% of cases. Even clinically fine-tuned models showed benefits.

Conclusion: RaR enhances factuality and diagnostic accuracy in radiology QA, demonstrating that retrieval remains beneficial despite embedded domain knowledge. Framework is publicly available to support clinical translation.

Abstract: Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose radiology Retrieval and Reasoning (RaR), a multi-step retrieval and reasoning framework designed to improve diagnostic accuracy, factual consistency, and clinical reliability of LLMs in radiology question answering. We evaluated 25 LLMs spanning diverse architectures, parameter scales (0.5B to >670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. To assess generalizability, we additionally tested on an unseen internal dataset of 65 real-world radiology board examination questions. RaR significantly improved mean diagnostic accuracy over zero-shot prompting and conventional online RAG. The greatest gains occurred in small-scale models, while very large models (>200B parameters) demonstrated minimal changes (<2% improvement). Additionally, RaR retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models showed gains from RaR (e.g., MedGemma-27B), indicating that retrieval remains beneficial despite embedded domain knowledge. These results highlight the potential of RaR to enhance factuality and diagnostic accuracy in radiology QA, warranting future studies to validate their clinical utility. All datasets, code, and the full RaR framework are publicly available to support open research and clinical translation.

[235] Modeling Annotator Disagreement with Demographic-Aware Experts and Synthetic Perspectives

Yinuo Xu, Veronica Derricks, Allison Earl, David Jurgens

Main category: cs.CL

TL;DR: DEM-MoE models annotator disagreement using demographic-aware routing to expert subnetworks, with synthetic data generation via LLM persona prompting to address sparse demographic coverage.

Details

Motivation: To better model annotator disagreement in subjective NLP tasks by representing structured group-level variation and addressing sparse demographic coverage in training data.

Method: Proposed DEM-MoE (Demographic-Aware Mixture of Experts) that routes inputs to expert subnetworks based on annotator demographics, and uses LLM-generated synthetic annotations via zero-shot persona prompting for data imputation.

Result: DEM-MoE performs competitively across demographic groups, especially on datasets with high annotator disagreement. Synthetic judgments align moderately well with human annotations and optimal data blending strategies depend on dataset structure.

Conclusion: The combination of architectural innovations (DEM-MoE) and data-centric approaches (synthetic data generation) improves representation of diverse perspectives in subjective NLP tasks.

Abstract: We present an approach to modeling annotator disagreement in subjective NLP tasks through both architectural and data-centric innovations. Our model, DEM-MoE (Demographic-Aware Mixture of Experts), routes inputs to expert subnetworks based on annotator demographics, enabling it to better represent structured, group-level variation compared to prior models. DEM-MoE consistently performs competitively across demographic groups, and shows especially strong results on datasets with high annotator disagreement. To address sparse demographic coverage, we test whether LLM-generated synthetic annotations via zero-shot persona prompting can be used for data imputation. We show these synthetic judgments align moderately well with human annotations on our data and offer a scalable way to potentially enrich training data. We then propose and evaluate approaches for blending real and synthetic data using strategies tailored to dataset structure. We find that the optimal strategies depend on dataset structure. Together, these contributions improve the representation of diverse perspectives.

[236] Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models

Haotian Wu, Bo Xu, Yao Shu, Menglin Yang, Chengwei Qin

Main category: cs.CL

TL;DR: JointThinking is a new in-context learning paradigm that prompts reasoning LLMs to generate two parallel answers (Thinking and Nothinking modes) and triggers a second round of thinking only when responses are inconsistent, achieving superior performance across reasoning benchmarks.

Details

Motivation: While reasoning LLMs have shown strong capabilities, their potential for in-context learning remains largely underexplored compared to training and inference strategies.

Method: Proposes Thinking with Nothinking Calibration (JointThinking) - generates two answers in parallel (Thinking and Nothinking modes), triggers second thinking round only when responses are inconsistent using a single prompt with two different answers.

Result: Significantly outperforms few-shot CoT, thinking twice and majority voting; achieves comparable in-distribution performance to training-based SOTA while substantially outperforming on out-of-distribution tasks; shows strong scalability as model size increases.

Conclusion: JointThinking demonstrates the importance of structural thinking diversity and consistency checks, with strong scalability and promising directions for future ICL research in reasoning LLMs.

Abstract: Reasoning large language models (RLLMs) have recently demonstrated remarkable capabilities through structured and multi-step reasoning. While prior research has primarily focused on improving their training and inference strategies, their potential for in-context learning (ICL) remains largely underexplored. To fill this gap, we propose Thinking with Nothinking Calibration (JointThinking), a new ICL paradigm that prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode. A second round of Thinking is triggered only when the two initial responses are inconsistent, using a single prompt with two different answers. Extensive experiments across multiple reasoning benchmarks demonstrate that JointThinking significantly outperforms few-shot chain-of-thought (CoT), thinking twice and majority voting. Moreover, it achieves comparable in-distribution performance to training-based SOTA reasoning method, while substantially outperforming on out-of-distribution tasks. We further conduct a systematic analysis of the calibration mechanism, showing the importance of structural thinking diversity and the benefits of consistency check. Additionally, we observe that the performance gap between actual and ideal reasoning narrows as model size increases in the second thinking, indicating the strong scalability of our approach. Finally, we discuss current limitations and outline promising directions for future ICL research in RLLMs.

[237] Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation

Khondoker Ittehadul Islam, Gabriele Sarti

Main category: cs.CL

TL;DR: The paper introduces a manually translated Bangla multi-step reasoning dataset and evaluates English-centric and Bangla-centric multilingual models, finding that reasoning context helps with challenging questions but models struggle with Bangla reasoning steps.

Details

Motivation: Current language model evaluation is predominantly in high-resource languages like English, creating a gap for low-resource languages like Bangla in multi-step reasoning tasks.

Method: Created a manually translated Bangla multi-step reasoning dataset from the English Reveal dataset, then conducted controlled evaluation of English-centric and Bangla-centric multilingual models on both original and translated versions.

Result: Reasoning context is beneficial for challenging non-binary questions, but models struggle to effectively employ relevant Bangla reasoning steps. Different trends were observed across models and languages.

Conclusion: Models have difficulty leveraging reasoning steps in Bangla effectively, highlighting the need for better cross-lingual reasoning capabilities and evaluation in low-resource languages.

Abstract: Language models have demonstrated remarkable performance on complex multi-step reasoning tasks. However, their evaluation has been predominantly confined to high-resource languages such as English. In this paper, we introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original dataset and our translated version to compare their ability to exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context is beneficial for more challenging non-binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. We conclude by exploring how reasoning steps contribute to models’ predictions, highlighting different trends across models and languages.

[238] Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

Chiyu Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Liming Fang, Zhe Liu

Main category: cs.CL

TL;DR: Developer-role-based attacks (D-Attack and DH-CoT) significantly improve jailbreak effectiveness on reasoning models, and a new data filtering method (MDH) addresses issues in current red-teaming datasets.

Details

Motivation: Existing jailbreak attacks perform poorly on reasoning models, and current red-teaming datasets contain problematic samples that hinder accurate attack evaluation.

Method: Proposed two developer-role-based attacks: D-Attack (enhances contextual simulation) and DH-CoT (strengthens attacks with deceptive chain-of-thought), plus MDH filtering method combining LLM screening and human verification.

Result: Developer messages significantly improve jailbreak attack success rates, and MDH reliably filters low-quality samples from datasets.

Conclusion: Developer-role-based approaches are effective for jailbreaking reasoning models, and proper dataset filtering is crucial for accurate attack evaluation.

Abstract: Jailbreaking commercial black-box models is one of the most challenging and serious security threats today. Existing attacks achieve certain success on non-reasoning models but perform limitedly on the latest reasoning models. We discover that carefully crafted developer messages can markedly boost jailbreak effectiveness. Building on this, we propose two developer-role-based attacks: D-Attack, which enhances contextual simulation, and DH-CoT, which strengthens attacks with deceptive chain-of-thought. In experiments, we further diccover that current red-teaming datasets often contain samples unsuited for measuring attack gains: prompts that fail to trigger defenses, prompts where malicious content is not the sole valid output, and benign prompts. Such data hinders accurate measurement of the true improvement brought by an attack method. To address this, we introduce MDH, a Malicious content Detection approach combining LLM-based screening with Human verification to balance accuracy and cost, with which we clean data and build the RTA dataset series. Experiments demonstrate that MDH reliably filters low-quality samples and that developer messages significantly improve jailbreak attack success. Codes, datasets, and other results will be released in https://github.com/AlienZhang1996/DH-CoT.

[239] Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs

Xiangqi Jin, Yuxuan Wang, Yifeng Gao, Zichen Wen, Biqing Qi, Dongrui Liu, Linfeng Zhang

Main category: cs.CL

TL;DR: ICE is a novel prompting framework for diffusion LLMs that enables in-place chain-of-thought prompting with early exit, achieving significant performance improvements and computational efficiency gains.

Details

Motivation: Traditional LLMs use prefix-only prompting and sequential generation, which limits bidirectional information flow. Diffusion LLMs offer new opportunities with bidirectional attention and iterative refinement.

Method: ICE transforms prefix-only prompting into in-place prompting for dLLMs by integrating prompts directly within masked token positions during iterative refinement and using confidence-aware early exit mechanism.

Result: ICE achieves up to 17.29% accuracy improvement with 4.12x speedup on GSM8K, and up to 276.67x acceleration on MMLU while maintaining competitive performance.

Conclusion: ICE demonstrates the effectiveness of in-place prompting for diffusion LLMs, enabling significant computational efficiency and performance improvements through bidirectional attention mechanisms.

Abstract: Despite large language models (LLMs) have achieved remarkable success, their prefix-only prompting paradigm and sequential generation process offer limited flexibility for bidirectional information. Diffusion large language models (dLLMs) present new opportunities through their bidirectional attention mechanisms and iterative refinement processes, enabling more flexible in-place prompting strategies. We introduce ICE (In-Place Chain-of-Thought Prompting with Early Exit), a novel framework that transforms prefix-only prompting into in-place prompting specifically designed for dLLMs. ICE integrates in-place prompts directly within masked token positions during iterative refinement and employs a confidence-aware early exit mechanism to significantly reduce computational overhead. Extensive experiments demonstrate ICE’s effectiveness, achieving up to 17.29% accuracy improvement with 4.12$\times$ speedup on GSM8K, and up to 276.67$\times$ acceleration on MMLU while maintaining competitive performance.

[240] EEG-MedRAG: Enhancing EEG-based Clinical Decision-Making via Hierarchical Hypergraph Retrieval-Augmented Generation

Yi Wang, Haoran Luo, Lu Meng, Ziyu Jia, Xinliang Zhou, Qingsong Wen

Main category: cs.CL

TL;DR: EEG-MedRAG is a hypergraph-based framework that integrates EEG domain knowledge, patient cases, and large-scale data for semantic-temporal retrieval and diagnostic generation, outperforming existing methods in clinical QA tasks.

Details

Motivation: The need to efficiently retrieve and interpret large-scale, multi-source, heterogeneous EEG data in neuroscience and clinical practice due to the widespread application of EEG.

Method: A three-layer hypergraph-based retrieval-augmented generation framework that unifies EEG domain knowledge, individual patient cases, and a large-scale repository into a traversable n-ary relational hypergraph.

Result: EEG-MedRAG significantly outperforms TimeRAG and HyperGraphRAG in answer accuracy and retrieval, demonstrating strong potential for real-world clinical decision support.

Conclusion: The framework enables joint semantic-temporal retrieval and causal-chain diagnostic generation, with publicly available data and code for further research and application.

Abstract: With the widespread application of electroencephalography (EEG) in neuroscience and clinical practice, efficiently retrieving and semantically interpreting large-scale, multi-source, heterogeneous EEG data has become a pressing challenge. We propose EEG-MedRAG, a three-layer hypergraph-based retrieval-augmented generation framework that unifies EEG domain knowledge, individual patient cases, and a large-scale repository into a traversable n-ary relational hypergraph, enabling joint semantic-temporal retrieval and causal-chain diagnostic generation. Concurrently, we introduce the first cross-disease, cross-role EEG clinical QA benchmark, spanning seven disorders and five authentic clinical perspectives. This benchmark allows systematic evaluation of disease-agnostic generalization and role-aware contextual understanding. Experiments show that EEG-MedRAG significantly outperforms TimeRAG and HyperGraphRAG in answer accuracy and retrieval, highlighting its strong potential for real-world clinical decision support. Our data and code are publicly available at https://github.com/yi9206413-boop/EEG-MedRAG.

Guy Mor-Lan, Naama Rivlin-Angert, Yael R. Kaplan, Tamir Sheafer, Shaul R. Shenhav

Main category: cs.CL

TL;DR: HebID is the first multilabel Hebrew corpus for social identity detection, containing 5,536 annotated sentences from Israeli politicians’ Facebook posts, with 12 nuanced social identity categories.

Details

Motivation: Existing identity detection datasets are predominantly English-centric, single-label, and focus on coarse identity categories, lacking nuanced social identity analysis in non-English political contexts like Hebrew.

Method: Created a multilabel Hebrew corpus with manual annotations of 12 social identities, benchmarked multilabel and single-label encoders alongside Hebrew-tuned LLMs (2B-9B parameters), and applied the classifier to analyze politicians’ posts and parliamentary speeches.

Result: Hebrew-tuned LLMs achieved the best performance (macro-F1 = 0.74). Analysis revealed differences in popularity, temporal trends, clustering patterns, and gender-related variations in identity expression, with comparison between elite discourse and public identity priorities.

Conclusion: HebID provides a comprehensive foundation for studying social identities in Hebrew and serves as a model for similar research in other non-English political contexts, addressing the gap in nuanced identity analysis beyond English-centric approaches.

Abstract: Political language is deeply intertwined with social identities. While social identities are often shaped by specific cultural contexts and expressed through particular uses of language, existing datasets for group and identity detection are predominantly English-centric, single-label and focus on coarse identity categories. We introduce HebID, the first multilabel Hebrew corpus for social identity detection: 5,536 sentences from Israeli politicians’ Facebook posts (Dec 2018-Apr 2021), manually annotated for twelve nuanced social identities (e.g. Rightist, Ultra-Orthodox, Socially-oriented) grounded by survey data. We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results (macro-$F_1$ = 0.74). We apply our classifier to politicians’ Facebook posts and parliamentary speeches, evaluating differences in popularity, temporal trends, clustering patterns, and gender-related variations in identity expression. We utilize identity choices from a national public survey, enabling a comparison between identities portrayed in elite discourse and the public’s identity priorities. HebID provides a comprehensive foundation for studying social identities in Hebrew and can serve as a model for similar research in other non-English political contexts.

[242] The Enemy from Within: A Study of Political Delegitimization Discourse in Israeli Political Speech

Naama Rivlin-Angert, Guy Mor-Lan

Main category: cs.CL

TL;DR: This paper presents the first large-scale computational study of political delegitimization discourse (PDD) in Hebrew, using a novel annotated corpus and developing a two-stage classification pipeline that achieves strong performance for PDD detection and analysis.

Details

Motivation: To systematically study political delegitimization discourse (symbolic attacks on political entities' normative validity) through computational methods, addressing the lack of large-scale analysis in this domain.

Method: Created a Hebrew-language corpus of 10,410 sentences from Knesset speeches, Facebook posts, and news outlets, with manual annotations. Developed a two-stage classification pipeline combining finetuned encoder models and decoder LLMs (DictaLM 2.0).

Result: Best model achieved F1 of 0.74 for binary PDD detection and macro-F1 of 0.67 for classification of delegitimization characteristics. Analysis revealed rising PDD over three decades, higher prevalence on social media, greater use by male politicians, and stronger tendencies among right-leaning actors.

Conclusion: Automated PDD analysis is feasible and valuable for understanding democratic discourse, with applications for tracking political discourse patterns across platforms and time.

Abstract: We present the first large-scale computational study of political delegitimization discourse (PDD), defined as symbolic attacks on the normative validity of political entities. We curate and manually annotate a novel Hebrew-language corpus of 10,410 sentences drawn from Knesset speeches (1993-2023), Facebook posts (2018-2021), and leading news outlets, of which 1,812 instances (17.4%) exhibit PDD and 642 carry additional annotations for intensity, incivility, target type, and affective framing. We introduce a two-stage classification pipeline combining finetuned encoder models and decoder LLMs. Our best model (DictaLM 2.0) attains an F$_1$ of 0.74 for binary PDD detection and a macro-F$_1$ of 0.67 for classification of delegitimization characteristics. Applying this classifier to longitudinal and cross-platform data, we see a marked rise in PDD over three decades, higher prevalence on social media versus parliamentary debate, greater use by male than female politicians, and stronger tendencies among right-leaning actors - with pronounced spikes during election campaigns and major political events. Our findings demonstrate the feasibility and value of automated PDD analysis for understanding democratic discourse.

[243] MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use

Fei Lei, Yibo Yang, Wenxiu Sun, Dahua Lin

Main category: cs.CL

TL;DR: MCPVerse is a real-world benchmark with 550+ executable tools for evaluating LLMs’ tool use capabilities, revealing performance degradation with larger tool sets but showing agentic models can benefit from expanded exploration.

Details

Motivation: Existing benchmarks for evaluating LLMs' tool use are limited by synthetic tools and constrained action spaces, creating a need for more realistic evaluation methods.

Method: Created MCPVerse benchmark with 550+ real-world executable tools and 140k+ token action space, using outcome-based evaluation with real-time ground truth for time-sensitive tasks. Tested state-of-the-art LLMs in three modes: Oracle, Standard, and Max-Scale.

Result: Most models suffer performance degradation with larger tool sets, but agentic models like Claude-4-Sonnet can effectively leverage expanded exploration spaces to improve accuracy.

Conclusion: MCPVerse exposes limitations of current models in complex real-world scenarios and serves as a critical benchmark for advancing agentic tool use capabilities.

Abstract: Large Language Models (LLMs) are evolving from text generators into reasoning agents. This transition makes their ability to use external tools a critical capability. However, evaluating this skill presents a significant challenge. Existing benchmarks are often limited by their reliance on synthetic tools and severely constrained action spaces. To address these limitations, we introduce MCPVerse, an expansive, real-world benchmark for evaluating agentic tool use. MCPVerse integrates more than 550 real-world, executable tools to create an unprecedented action space exceeding 140k tokens, and employs outcome-based evaluation with real-time ground truth for time-sensitive tasks. We benchmarked the state-of-the-art LLMs across three modes (Oracle, Standard, and Max-Scale), revealing that while most models suffer performance degradation when confronted with larger tool sets, the agentic models, such as Claude-4-Sonnet, can effectively leverage expanded exploration spaces to improve accuracy. This finding not only exposes the limitations of state-of-the-art models in complex, real-world scenarios but also establishes MCPVerse as a critical benchmark for measuring and advancing agentic tool use capabilities.

[244] On the Interplay between Musical Preferences and Personality through the Lens of Language

Eliran Shem-Tov, Ella Rabinovich

Main category: cs.CL

TL;DR: This study bridges music psychology and computational linguistics by investigating whether musical preferences leave detectable traces in spontaneous language through Big Five personality traits.

Details

Motivation: To connect two established research domains: correlations between musical preferences and personality, and personality detection through linguistic analysis.

Method: Used a curated dataset of 500,000+ text samples from 5,000 authors with identified musical preferences to build advanced personality assessment models.

Result: Revealed significant personality differences across fans of five musical genres.

Conclusion: Musical preferences do leave detectable traces in language through personality characteristics, providing resources for interdisciplinary research.

Abstract: Music serves as a powerful reflection of individual identity, often aligning with deeper psychological traits. Prior research has established correlations between musical preferences and personality, while separate studies have demonstrated that personality is detectable through linguistic analysis. Our study bridges these two research domains by investigating whether individuals’ musical preferences leave traces in their spontaneous language through the lens of the Big Five personality traits (Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism). Using a carefully curated dataset of over 500,000 text samples from nearly 5,000 authors with reliably identified musical preferences, we build advanced models to assess personality characteristics. Our results reveal significant personality differences across fans of five musical genres. We release resources for future research at the intersection of computational linguistics, music psychology and personality analysis.

[245] Chronological Passage Assembling in RAG framework for Temporal Question Answering

Byeongjeong Kim, Jeonghyun Park, Joonho Yang, Hwanhee Lee

Main category: cs.CL

TL;DR: ChronoRAG is a specialized RAG framework for narrative QA that structures dispersed document information and preserves temporal order, showing significant improvements on narrative datasets.

Details

Motivation: Existing RAG methods struggle with narrative texts because they require understanding broader context and sequential relationships, not just isolated segments.

Method: ChronoRAG refines document information into coherent passages and explicitly captures/maintains temporal order among retrieved passages.

Result: Substantial improvements on NarrativeQA and GutenQA datasets, especially for tasks requiring factual identification and comprehension of complex sequential relationships.

Conclusion: Reasoning over temporal order is crucial for resolving narrative QA tasks, and ChronoRAG effectively addresses this need.

Abstract: Long-context question answering over narrative tasks is challenging because correct answers often hinge on reconstructing a coherent timeline of events while preserving contextual f low in a limited context window. Retrievalaugmented generation (RAG) methods aim to address this challenge by selectively retrieving only necessary document segments. However, narrative texts possess unique characteristics that limit the effectiveness of these existing approaches. Specifically, understanding narrative texts requires more than isolated segments, as the broader context and sequential relationships between segments are crucial for comprehension. To address these limitations, we propose ChronoRAG, a novel RAG framework specialized for narrative texts. This approach focuses on two essential aspects: refining dispersed document information into coherent and structured passages and preserving narrative flow by explicitly capturing and maintaining the temporal order among retrieved passages. We empirically demonstrate the effectiveness of ChronoRAG through experiments on the NarrativeQA and GutenQAdataset, showing substantial improvements in tasks requiring both factual identification and comprehension of complex sequential relationships, underscoring that reasoning over temporal order is crucial in resolving narrative QA.

[246] Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

Rikuto Kotoge, Mai Nishimura, Jiaxin Ma

Main category: cs.CL

TL;DR: DGPO enables compact language models (0.5-1B parameters) to achieve sophisticated agentic RAG behaviors through distillation-guided policy optimization, overcoming sparse rewards and unstable training.

Details

Motivation: Applying RL to compact models is challenging due to poor initial performance, sparse rewards, and unstable training, limiting agentic RAG deployment in resource-constrained environments.

Method: Proposes Distillation-Guided Policy Optimization (DGPO) with cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization, plus ARC metric for fine-grained capability analysis.

Result: DGPO enables compact models to achieve sophisticated agentic search behaviors, sometimes outperforming larger teacher models, making agentic RAG feasible in resource-constrained environments.

Conclusion: DGPO successfully addresses RL challenges for compact models, enabling sophisticated agentic RAG behaviors in computing resource-constrained settings through effective distillation and guidance techniques.

Abstract: Reinforcement Learning has emerged as a dominant post-training approach to elicit agentic RAG behaviors such as search and planning from language models. Despite its success with larger models, applying RL to compact models (e.g., 0.5–1B parameters) presents unique challenges. The compact models exhibit poor initial performance, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which employs cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To understand how compact models preserve agentic behavior, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in computing resource-constrained environments.

[247] Talk Less, Call Right: Enhancing Role-Play LLM Agents with Automatic Prompt Optimization and Role Prompting

Saksorn Ruangtanusak, Pittawat Taveekitworachai, Kunat Pipatanakul

Main category: cs.CL

TL;DR: This paper explores prompting approaches for tool-augmented LLMs in role-playing dialogue, finding that rule-based role prompting with character-card/scene-contract design and strict function calling enforcement achieves best performance (0.571 score) compared to zero-shot baseline (0.519).

Details

Motivation: Address the problems of over-speaking (overly long responses) and under-acting (ineffective tool usage) in persona-grounded dialogue agents participating in CPDC 2025 API track.

Method: Four prompting approaches: 1) basic role prompting, 2) improved role prompting, 3) automatic prompt optimization (APO), and 4) rule-based role prompting (RRP) with character-card/scene-contract design and strict function calling enforcement.

Result: Rule-based role prompting achieved best performance with overall score of 0.571, improving on zero-shot baseline score of 0.519. RRP outperformed more elaborate methods like APO.

Conclusion: RRP design substantially improves effectiveness and reliability of role-playing dialogue agents. All best-performing prompts and APO tool are open-sourced to support future persona prompt development.

Abstract: This report investigates approaches for prompting a tool-augmented large language model (LLM) to act as a role-playing dialogue agent in the API track of the Commonsense Persona-grounded Dialogue Challenge (CPDC) 2025. In this setting, dialogue agents often produce overly long in-character responses (over-speaking) while failing to use tools effectively according to the persona (under-acting), such as generating function calls that do not exist or making unnecessary tool calls before answering. We explore four prompting approaches to address these issues: 1) basic role prompting, 2) improved role prompting, 3) automatic prompt optimization (APO), and 4) rule-based role prompting. The rule-based role prompting (RRP) approach achieved the best performance through two novel techniques-character-card/scene-contract design and strict enforcement of function calling-which led to an overall score of 0.571, improving on the zero-shot baseline score of 0.519. These findings demonstrate that RRP design can substantially improve the effectiveness and reliability of role-playing dialogue agents compared with more elaborate methods such as APO. To support future efforts in developing persona prompts, we are open-sourcing all of our best-performing prompts and the APO tool Source code is available at https://github.com/scb-10x/apo

[248] When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

Hanqi Yan, Hainiu Xu, Siya Qi, Shu Yang, Yulan He

Main category: cs.CL

TL;DR: The paper identifies Reasoning-Induced Misalignment (RIM), where enhanced reasoning capabilities in LLMs can cause safety misalignment through attention mechanisms and neuronal entanglement.

Details

Motivation: Growing concerns about LLM safety and alignment with human values, particularly when reasoning capabilities are strengthened during training or inference.

Method: Used representation analysis to study attention heads and neuronal activation patterns, examining how reasoning patterns affect safety mechanisms during training and inference.

Result: Discovered that specific attention heads facilitate refusal by reducing attention to CoT tokens, and found significant activation entanglement between reasoning and safety in safety-critical neurons after fine-tuning with reasoning patterns.

Conclusion: RIM vulnerability emerges from mechanistic interactions between reasoning and safety systems, with neuronal entanglement correlating with catastrophic forgetting, providing a neuron-level explanation for misalignment.

Abstract: With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities strengthened-particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model’s rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with those identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.

[249] REFRAG: Rethinking RAG based Decoding

Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan

Main category: cs.CL

TL;DR: REFRAG is an efficient decoding framework that improves latency in RAG applications by exploiting the sparse attention patterns in retrieved passages, achieving 30.85× speedup in time-to-first-token without performance loss.

Details

Motivation: Current LLMs face significant latency and memory issues when processing long-context inputs in RAG applications, where most retrieved passages are irrelevant to the query, leading to inefficient computations.

Method: REFRAG compresses, senses, and expands the context by exploiting the block-diagonal attention patterns in RAG, eliminating unnecessary computations over irrelevant retrieved passages.

Result: Achieves 30.85× acceleration in time-to-first-token (3.75× improvement over previous work) without perplexity loss, and extends LLM context size by 16× while maintaining accuracy across various tasks.

Conclusion: REFRAG provides substantial speedup with no accuracy loss across diverse long-context tasks, demonstrating that exploiting RAG-specific sparsity patterns can significantly improve system efficiency.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off between knowledge enrichment and system efficiency. While minimizing latency for long-context inputs is a primary objective for LLMs, we contend that RAG require specialized consideration. In RAG, much of the LLM context consists of concatenated passages from retrieval, with only a small subset directly relevant to the query. These passages often exhibit low semantic similarity due to diversity or deduplication during re-ranking, leading to block-diagonal attention patterns that differ from those in standard LLM generation tasks. Based on this observation, we argue that most computations over the RAG context during decoding are unnecessary and can be eliminated with minimal impact on performance. To this end, we propose REFRAG, an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. By exploiting the sparsity structure, we demonstrate a 30.85 the time-to-first-token acceleration (3.75 improvement to previous work) without loss in perplexity. In addition, our optimization framework for large context enables REFRAG to extend the context size of LLMs by 16. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization, spanning a wide range of datasets. Experimental results confirm that REFRAG delivers substantial speedup with no loss in accuracy compared to LLaMA models and other state-of-the-art baselines across various context sizes.

[250] DrDiff: Dynamic Routing Diffusion with Hierarchical Attention for Breaking the Efficiency-Quality Trade-off

Jusheng Zhang, Yijia Fan, Kaitong Cai, Zimeng Huang, Xiaofei Sun, Jian Wang, Chengpei Tang, Keze Wang

Main category: cs.CL

TL;DR: DrDiff is a novel framework for efficient long-text generation that uses dynamic expert scheduling, hierarchical sparse attention, and soft absorption guidance to overcome the efficiency-quality trade-off.

Details

Motivation: To address the efficiency-quality trade-off in long-text generation by developing a more efficient framework that maintains high quality while reducing computational complexity.

Method: Three core technologies: 1) Dynamic expert scheduling for intelligent computational resource allocation, 2) Hierarchical Sparse Attention (HSA) that reduces complexity from O(n²) to O(n), 3) Soft absorption guidance with DPM-solver++ to reduce diffusion steps.

Result: Comprehensive experiments on various long-text generation benchmarks demonstrate superiority over existing state-of-the-art methods.

Conclusion: DrDiff successfully overcomes the efficiency-quality trade-off in long-text generation through its three core technologies, achieving better performance than current SOTA methods.

Abstract: This paper introduces DrDiff, a novel framework for long-text generation that overcomes the efficiency-quality trade-off through three core technologies. First, we design a dynamic expert scheduling mechanism that intelligently allocates computational resources during the diffusion process based on text complexity, enabling more efficient handling of text generation tasks of varying difficulty. Second, we introduce a Hierarchical Sparse Attention (HSA) mechanism that adaptively adjusts attention patterns according to a variety of input lengths, reducing computational complexity from O($n^2$) to O($n$) while maintaining model performance. Finally, we propose a soft absorption guidance optimization strategy that combines with DPM-solver++ to reduce diffusion steps, significantly improving generation speed. Comprehensive experiments on various long-text generation benchmarks demonstrate the superiority of our DrDiff over the existing SOTA methods.

[251] SPFT-SQL: Enhancing Large Language Model for Text-to-SQL Parsing by Self-Play Fine-Tuning

Yuhao Zhang, Shaoming Duan, Jinhang Su, Chuanyi Liu, Peiyi Han

Main category: cs.CL

TL;DR: SPFT-SQL is a new self-play fine-tuning method for Text-to-SQL that addresses limitations of standard SPIN by incorporating verification-based iterative fine-tuning and error-driven loss to improve SQL generation accuracy.

Details

Motivation: Standard self-play fine-tuning (SPIN) faces challenges in Text-to-SQL tasks because it doesn't generate new information and the large number of correct SQL queries from opponent models reduces the main model's ability to generate accurate SQL.

Method: SPFT-SQL introduces two key components: 1) Verification-based iterative fine-tuning before self-play to synthesize high-quality data and build models with varying capabilities, 2) Error-driven loss during self-play that incentivizes learning from opponent model’s incorrect outputs to distinguish between correct and erroneous SQL.

Result: Extensive experiments on six open-source LLMs and five benchmarks show that SPFT-SQL outperforms existing state-of-the-art methods.

Conclusion: The proposed SPFT-SQL method effectively addresses the limitations of standard self-play fine-tuning for Text-to-SQL tasks by incorporating verification-based data synthesis and error-driven learning, achieving superior performance compared to current approaches.

Abstract: Despite the significant advancements of self-play fine-tuning (SPIN), which can transform a weak large language model (LLM) into a strong one through competitive interactions between models of varying capabilities, it still faces challenges in the Text-to-SQL task. SPIN does not generate new information, and the large number of correct SQL queries produced by the opponent model during self-play reduces the main model’s ability to generate accurate SQL queries. To address this challenge, we propose a new self-play fine-tuning method tailored for the Text-to-SQL task, called SPFT-SQL. Prior to self-play, we introduce a verification-based iterative fine-tuning approach, which synthesizes high-quality fine-tuning data iteratively based on the database schema and validation feedback to enhance model performance, while building a model base with varying capabilities. During the self-play fine-tuning phase, we propose an error-driven loss method that incentivizes incorrect outputs from the opponent model, enabling the main model to distinguish between correct SQL and erroneous SQL generated by the opponent model, thereby improving its ability to generate correct SQL. Extensive experiments and in-depth analyses on six open-source LLMs and five widely used benchmarks demonstrate that our approach outperforms existing state-of-the-art (SOTA) methods.

[252] Culturally transmitted color categories in LLMs reflect a learning bias toward efficient compression

Nathaniel Imel, Noga Zaslavsky

Main category: cs.CL

TL;DR: LLMs can evolve human-like semantic categorization systems through Information Bottleneck efficiency, similar to human languages, as demonstrated in color-naming studies.

Details

Motivation: To investigate whether LLMs can develop efficient human-like semantic systems, given they aren't trained for optimal compression like human languages via the Information Bottleneck principle.

Method: Replicated human color-naming studies using Gemini 2.0-flash and Llama 3.3-70B-Instruct, including English color-naming tasks and simulated cultural evolution through iterated in-context language learning.

Result: Gemini aligned well with English speakers’ naming patterns and achieved high IB-efficiency, while Llama showed efficient but lower complexity systems. Both LLMs restructured random systems toward greater IB-efficiency and cross-linguistic patterns.

Conclusion: LLMs are capable of evolving perceptually grounded, human-like semantic systems driven by the same Information Bottleneck efficiency principle that governs human languages.

Abstract: Converging evidence suggests that systems of semantic categories across human languages achieve near-optimal compression via the Information Bottleneck (IB) complexity-accuracy principle. Large language models (LLMs) are not trained for this objective, which raises the question: are LLMs capable of evolving efficient human-like semantic systems? To address this question, we focus on the domain of color as a key testbed of cognitive theories of categorization and replicate with LLMs (Gemini 2.0-flash and Llama 3.3-70B-Instruct) two influential human behavioral studies. First, we conduct an English color-naming study, showing that Gemini aligns well with the naming patterns of native English speakers and achieves a significantly high IB-efficiency score, while Llama exhibits an efficient but lower complexity system compared to English. Second, to test whether LLMs simply mimic patterns in their training data or actually exhibit a human-like inductive bias toward IB-efficiency, we simulate cultural evolution of pseudo color-naming systems in LLMs via iterated in-context language learning. We find that akin to humans, LLMs iteratively restructure initially random systems towards greater IB-efficiency and increased alignment with patterns observed across the world’s languages. These findings demonstrate that LLMs are capable of evolving perceptually grounded, human-like semantic systems, driven by the same fundamental principle that governs semantic efficiency across human languages.

[253] Introducing Spotlight: A Novel Approach for Generating Captivating Key Information from Documents

Ankan Mullick, Sombit Bose, Rounak Saha, Ayan Kumar Bhowmick, Aditya Vempaty, Prasenjit Dey, Ravi Kokku, Pawan Goyal, Niloy Ganguly

Main category: cs.CL

TL;DR: Spotlight is a new information extraction paradigm that creates engaging narratives by highlighting compelling document aspects rather than providing comprehensive summaries.

Details

Motivation: Traditional summaries prioritize comprehensive coverage but often lack engagement value. The goal is to create narratives that foster deeper reader engagement with source material by emphasizing intriguing content.

Method: Two-stage approach: 1) Fine-tuning a large language model on benchmark datasets curated for spotlight generation, 2) Alignment via Direct Preference Optimization (DPO) to improve quality.

Result: The model precisely identifies key document elements while enhancing readability and significantly boosting engagement value compared to traditional approaches.

Conclusion: Spotlight paradigm successfully creates more engaging narratives by selectively emphasizing compelling content, demonstrating superior engagement value over traditional comprehensive summaries.

Abstract: In this paper, we introduce Spotlight, a novel paradigm for information extraction that produces concise, engaging narratives by highlighting the most compelling aspects of a document. Unlike traditional summaries, which prioritize comprehensive coverage, spotlights selectively emphasize intriguing content to foster deeper reader engagement with the source material. We formally differentiate spotlights from related constructs and support our analysis with a detailed benchmarking study using new datasets curated for this work. To generate high-quality spotlights, we propose a two-stage approach: fine-tuning a large language model on our benchmark data, followed by alignment via Direct Preference Optimization (DPO). Our comprehensive evaluation demonstrates that the resulting model not only identifies key elements with precision but also enhances readability and boosts the engagement value of the original document.

[254] Dynamic Span Interaction and Graph-Aware Memory for Entity-Level Sentiment Classification

Md. Mithun Hossain, Sanjara, Md. Shakil Hossain, Sudipto Chaki

Main category: cs.CL

TL;DR: SpanEIT is a novel framework for entity-level sentiment classification that integrates dynamic span interaction and graph-aware memory mechanisms to better model entity-sentiment relationships and ensure consistency across documents.

Details

Motivation: Entity-level sentiment classification faces challenges in modeling complex entity-sentiment interactions, capturing cross-sentence dependencies, ensuring consistency for multiple entity mentions, and handling linguistic phenomena like negation and ambiguity in noisy real-world text.

Method: SpanEIT builds span-based representations for entities and sentiment phrases, uses bidirectional attention for fine-grained interactions, employs graph attention networks for syntactic and co-occurrence relations, and includes a coreference-aware memory module for entity-level consistency.

Result: Experiments on FSAD, BARU, and IMDB datasets show SpanEIT outperforms state-of-the-art transformer and hybrid baselines in accuracy and F1 scores. Ablation studies confirm the effectiveness of the proposed components.

Conclusion: SpanEIT demonstrates strong potential for fine-grained sentiment analysis applications like social media monitoring and customer feedback analysis, with validated effectiveness through comprehensive experiments and interpretability analyses.

Abstract: Entity-level sentiment classification involves identifying the sentiment polarity linked to specific entities within text. This task poses several challenges: effectively modeling the subtle and complex interactions between entities and their surrounding sentiment expressions; capturing dependencies that may span across sentences; and ensuring consistent sentiment predictions for multiple mentions of the same entity through coreference resolution. Additionally, linguistic phenomena such as negation, ambiguity, and overlapping opinions further complicate the analysis. These complexities make entity-level sentiment classification a difficult problem, especially in real-world, noisy textual data. To address these issues, we propose SpanEIT, a novel framework integrating dynamic span interaction and graph-aware memory mechanisms for enhanced entity-sentiment relational modeling. SpanEIT builds span-based representations for entities and candidate sentiment phrases, employs bidirectional attention for fine-grained interactions, and uses a graph attention network to capture syntactic and co-occurrence relations. A coreference-aware memory module ensures entity-level consistency across documents. Experiments on FSAD, BARU, and IMDB datasets show SpanEIT outperforms state-of-the-art transformer and hybrid baselines in accuracy and F1 scores. Ablation and interpretability analyses validate the effectiveness of our approach, underscoring its potential for fine-grained sentiment analysis in applications like social media monitoring and customer feedback analysis.

[255] HalluDetect: Detecting, Mitigating, and Benchmarking Hallucinations in Conversational Systems in the Legal Domain

Spandan Anaokar, Shrey Ganatra, Harshvivek Kashid, Swapnil Bhattacharyya, Shruti Nair, Reshma Sekhar, Siddharth Manohar, Rahul Hemrajani, Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: HalluDetect system detects hallucinations in LLaMA 3.1 8B chatbots with 68.92% F1 score, and AgentBot architecture reduces hallucinations to 0.4159 per turn while maintaining 96.13% token accuracy.

Details

Motivation: LLMs are widely used but prone to hallucinations, limiting reliability in critical applications like consumer grievance chatbots.

Method: Developed HalluDetect hallucination detection system and benchmarked five mitigation architectures including AgentBot.

Result: HalluDetect achieved 68.92% F1 score (22.47% improvement over baselines). AgentBot minimized hallucinations to 0.4159 per turn with 96.13% token accuracy.

Conclusion: Optimized inference strategies can significantly improve factual accuracy, providing a scalable framework for hallucination mitigation in LLMs.

Abstract: Large Language Models (LLMs) are widely used in industry but remain prone to hallucinations, limiting their reliability in critical applications. This work addresses hallucination reduction in consumer grievance chatbots built using LLaMA 3.1 8B Instruct, a compact model frequently used in industry. We develop HalluDetect, an LLM-based hallucination detection system that achieves an F1 score of 68.92% outperforming baseline detectors by 22.47%. Benchmarking five hallucination mitigation architectures, we find that out of them, AgentBot minimizes hallucinations to 0.4159 per turn while maintaining the highest token accuracy (96.13%), making it the most effective mitigation strategy. Our findings provide a scalable framework for hallucination mitigation, demonstrating that optimized inference strategies can significantly improve factual accuracy.

[256] Distribution-Aligned Decoding for Efficient LLM Task Adaptation

Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Yong Dai, Sam Tak Wu Kwong, Yuguang Fang

Main category: cs.CL

TL;DR: SVDecode is a lightweight method that adapts language models by steering output distributions during decoding rather than updating weights, achieving performance gains without additional parameters.

Details

Motivation: To reduce the cost of adapting billion-parameter language models to downstream tasks, even with parameter-efficient fine-tuning (PEFT), by directly aligning output distributions during decoding.

Method: Extract a task-aware steering vector from KL divergence gradient between warm-started and pre-trained models, then use it to guide decoding process toward task distribution.

Result: Improves multiple-choice accuracy by up to 5 percentage points and open-ended truthfulness by 2 percentage points across three tasks and nine benchmarks, with similar gains on commonsense datasets.

Conclusion: SVDecode offers a lightweight, theoretically grounded path to stronger task adaptation for large language models without adding trainable parameters beyond PEFT adapters.

Abstract: Adapting billion-parameter language models to a downstream task is still costly, even with parameter-efficient fine-tuning (PEFT). We re-cast task adaptation as output-distribution alignment: the objective is to steer the output distribution toward the task distribution directly during decoding rather than indirectly through weight updates. Building on this view, we introduce Steering Vector Decoding (SVDecode), a lightweight, PEFT-compatible, and theoretically grounded method. We start with a short warm-start fine-tune and extract a task-aware steering vector from the Kullback-Leibler (KL) divergence gradient between the output distribution of the warm-started and pre-trained models. This steering vector is then used to guide the decoding process to steer the model’s output distribution towards the task distribution. We theoretically prove that SVDecode is first-order equivalent to the gradient step of full fine-tuning and derive a globally optimal solution for the strength of the steering vector. Across three tasks and nine benchmarks, SVDecode paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 percentage points and open-ended truthfulness by 2 percentage points, with similar gains (1-2 percentage points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. SVDecode thus offers a lightweight, theoretically grounded path to stronger task adaptation for large language models.

[257] EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho

Main category: cs.CL

TL;DR: EpiCache is a training-free KV cache management framework that addresses memory bottlenecks in long conversational QA by using block-wise prefill and episodic KV compression to maintain accuracy under fixed memory budgets.

Details

Motivation: Modern LLMs with long contexts face memory bottlenecks from KV caching that grows linearly with dialogue length, and existing compression methods fail in multi-turn conversations due to unbounded peak memory and query-dependent eviction limitations.

Method: EpiCache uses block-wise prefill to bound cache growth, episodic KV compression that clusters conversation history into coherent episodes for episode-specific eviction, and adaptive layer-wise budget allocation based on each layer’s sensitivity to eviction.

Result: Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40%, maintains near-full KV accuracy under 4-6x compression, and reduces latency/memory by up to 2.4x/3.5x.

Conclusion: EpiCache enables efficient multi-turn interaction under strict resource limits by effectively managing KV cache memory while preserving conversational context and accuracy.

Abstract: Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational histories. This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly becomes the bottleneck in resource-constrained environments. An active line of research for reducing memory bottleneck is KV cache compression, which seeks to limit cache size while preserving accuracy. Yet existing methods face two major limitations: (i) evicting the KV cache after full-context prefill causes unbounded peak memory, and (ii) query-dependent eviction narrows the cache to a single query, leading to failure cases in multi-turn conversations. We introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and applies episode-specific KV cache eviction. We further design an adaptive layer-wise budget allocation strategy that measures each layer’s sensitivity to eviction and distributes the memory budget across layers accordingly. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40%, maintains near-full KV accuracy under 4-6x compression, and reduces latency/memory by up to 2.4x/3.5x, enabling efficient multi-turn interaction under strict resource limits. Our code is available at https://github.com/apple/ml-epicache.

[258] Part-of-speech tagging for Nagamese Language using CRF

Alovi N Shohe, Chonglio Khiamungam, Teisovi Angami

Main category: cs.CL

TL;DR: First POS tagging study for Nagamese language using CRF model, achieving 85.70% accuracy on 16,112 token corpus.

Details

Motivation: No prior POS tagging work exists for Nagamese (Naga Pidgin), an Assamese-lexified Creole language used in northeast India trade, while resource-rich languages have substantial POS tagging research.

Method: Created annotated corpus of 16,112 tokens and applied Conditional Random Fields (CRF) machine learning technique for POS tagging.

Result: Achieved overall tagging accuracy of 85.70% with precision of 86%, recall of 86%, and f1-score of 85%.

Conclusion: Successfully demonstrated first POS tagging system for Nagamese language using CRF approach with promising results.

Abstract: This paper investigates part-of-speech tagging, an important task in Natural Language Processing (NLP) for the Nagamese language. The Nagamese language, a.k.a. Naga Pidgin, is an Assamese-lexified Creole language developed primarily as a means of communication in trade between the Nagas and people from Assam in northeast India. A substantial amount of work in part-of-speech-tagging has been done for resource-rich languages like English, Hindi, etc. However, no work has been done in the Nagamese language. To the best of our knowledge, this is the first attempt at part-of-speech tagging for the Nagamese Language. The aim of this work is to identify the part-of-speech for a given sentence in the Nagamese language. An annotated corpus of 16,112 tokens is created and applied machine learning technique known as Conditional Random Fields (CRF). Using CRF, an overall tagging accuracy of 85.70%; precision, recall of 86%, and f1-score of 85% is achieved. Keywords. Nagamese, NLP, part-of-speech, machine learning, CRF.

[259] Question-Driven Analysis and Synthesis: Building Interpretable Thematic Trees with LLMs for Text Clustering and Controllable Generation

Tiago Fernandes Tavares

Main category: cs.CL

TL;DR: RTP is a novel framework that uses LLMs to build interpretable binary tree taxonomies through natural language questions, outperforming traditional keyword-based topic models in interpretability and downstream task performance.

Details

Motivation: Traditional topic models produce hard-to-interpret keyword lists that lack semantic coherence, creating an interpretability gap in unsupervised text analysis, especially in data-scarce domains.

Method: Recursive Thematic Partitioning (RTP) leverages Large Language Models to interactively build a binary tree where each node is a natural language question that semantically partitions the data, creating an interpretable taxonomy.

Result: RTP’s question-driven hierarchy is more interpretable than BERTopic’s keyword-based topics and serves as powerful features in downstream classification tasks, especially when themes correlate with task labels. Thematic paths can also structure generative model prompts.

Conclusion: RTP shifts text analysis from statistical pattern discovery to knowledge-driven thematic analysis and enables structured synthesis through controllable generative prompts based on discovered corpus characteristics.

Abstract: Unsupervised analysis of text corpora is challenging, especially in data-scarce domains where traditional topic models struggle. While these models offer a solution, they typically describe clusters with lists of keywords that require significant manual effort to interpret and often lack semantic coherence. To address this critical interpretability gap, we introduce Recursive Thematic Partitioning (RTP), a novel framework that leverages Large Language Models (LLMs) to interactively build a binary tree. Each node in the tree is a natural language question that semantically partitions the data, resulting in a fully interpretable taxonomy where the logic of each cluster is explicit. Our experiments demonstrate that RTP’s question-driven hierarchy is more interpretable than the keyword-based topics from a strong baseline like BERTopic. Furthermore, we establish the quantitative utility of these clusters by showing they serve as powerful features in downstream classification tasks, particularly when the data’s underlying themes correlate with the task labels. RTP introduces a new paradigm for data exploration, shifting the focus from statistical pattern discovery to knowledge-driven thematic analysis. Furthermore, we demonstrate that the thematic paths from the RTP tree can serve as structured, controllable prompts for generative models. This transforms our analytical framework into a powerful tool for synthesis, enabling the consistent imitation of specific characteristics discovered in the source corpus.

[260] Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality

Junliang Li, Yucheng Wang, Yan Chen, Yu Ran, Ruiqing Zhang, Jing Liu, Hua Wu, Haifeng Wang

Main category: cs.CL

TL;DR: KLCF is a novel RL framework that addresses LLM hallucinations by optimizing knowledge consistency between policy and base models through Dual-Fact Alignment, improving factual recall and precision without external knowledge.

Details

Motivation: Hallucination and factuality deficits remain key obstacles to LLM reliability in long-form generation. Existing RLHF frameworks overlook model's internal knowledge boundaries, exacerbating the "hallucination tax".

Method: KLCF uses pretrained knowledge boundaries to construct fact checklists for improving factual coverage, and trains a self-assessment module based on base model’s internal knowledge to enhance factual precision. It employs Dual-Fact Alignment mechanism and is fully external-knowledge-free.

Result: KLCF substantially improves factuality metrics across multiple long-form benchmarks and effectively alleviates model hallucinations.

Conclusion: The proposed KLCF framework provides an efficient and scalable solution to address LLM hallucinations by focusing on internal knowledge consistency, outperforming prior methods that rely on external retrieval or heavy verification.

Abstract: Hallucination and factuality deficits remain key obstacles to the reliability of large language models (LLMs) in long-form generation. Existing reinforcement learning from human feedback (RLHF) frameworks primarily rely on preference rewards, yet they often overlook the model’s internal knowledge boundaries, exacerbating the so-called “hallucination tax”. To address this challenge, we propose Knowledge-Level Consistency Reinforcement Learning Framework (KLCF), a novel framework that focuses on the knowledge consistency between the policy model’s expressed knowledge and the base model’s parametric knowledge, and introduces a Dual-Fact Alignment mechanism to jointly optimize factual recall and precision. Specifically, KLCF leverages pretrained knowledge boundaries to construct fact checklist, guiding online reinforcement learning to improve factual coverage and recall; simultaneously, it trains a self-assessment module based on the base model’s internal knowledge to enhance factual precision during generation. Unlike prior methods that rely on external retrieval or heavy verification, our reward design is fully external-knowledge-free and lightweight, making KLCF efficient and easily scalable to large-scale training. Experimental results demonstrate that KLCF substantially improves factuality metrics across multiple long-form benchmarks and effectively alleviates model hallucinations.

[261] MemGen: Weaving Generative Latent Memory for Self-Evolving Agents

Guibin Zhang, Muxin Fu, Shuicheng Yan

Main category: cs.CL

TL;DR: MemGen is a dynamic generative memory framework that enables LLM-powered agents to interweave memory and reasoning through latent token sequences, surpassing existing memory systems by up to 38.22% and exhibiting emergent human-like memory faculties.

Details

Motivation: Existing memory paradigms for LLM agents are constrained - parametric memory forcibly adjusts model parameters, while retrieval-based memory externalizes experience into databases, neither capturing the fluid interweaving of reasoning and memory that characterizes human cognition.

Method: MemGen consists of a memory trigger that monitors reasoning state to decide explicit memory invocation, and a memory weaver that takes the agent’s current state as stimulus to construct latent token sequences as machine-native memory to enrich reasoning.

Result: Extensive experiments across eight benchmarks show MemGen surpasses leading external memory systems (ExpeL, AWM) by up to 38.22%, exceeds GRPO by up to 13.44%, and exhibits strong cross-domain generalization ability.

Conclusion: MemGen spontaneously evolves distinct human-like memory faculties including planning memory, procedural memory, and working memory without explicit supervision, suggesting an emergent trajectory toward more naturalistic forms of machine cognition.

Abstract: Agent memory shapes how Large Language Model (LLM)-powered agents, akin to the human brain, progressively refine themselves through environment interactions. Existing paradigms remain constrained: parametric memory forcibly adjusts model parameters, and retrieval-based memory externalizes experience into structured databases, yet neither captures the fluid interweaving of reasoning and memory that underlies human cognition. To address this gap, we propose MemGen, a dynamic generative memory framework that equips agents with a human-esque cognitive faculty. It consists of a \textit{memory trigger}, which monitors the agent’s reasoning state to decide explicit memory invocation, and a \textit{memory weaver}, which takes the agent’s current state as stimulus to construct a latent token sequence as machine-native memory to enrich its reasoning. In this way, MemGen enables agents to recall and augment latent memory throughout reasoning, producing a tightly interwoven cycle of memory and cognition. Extensive experiments across eight benchmarks show that MemGen surpasses leading external memory systems such as ExpeL and AWM by up to $38.22%$, exceeds GRPO by up to $13.44%$, and exhibits strong cross-domain generalization ability. More importantly, we find that without explicit supervision, MemGen spontaneously evolves distinct human-like memory faculties, including planning memory, procedural memory, and working memory, suggesting an emergent trajectory toward more naturalistic forms of machine cognition.

[262] DiaCDM: Cognitive Diagnosis in Teacher-Student Dialogues using the Initiation-Response-Evaluation Framework

Rui Jia, Yuang Wei, Ruijia Li, Yuan-Hao Jiang, Xinyu Xie, Yaomin Shen, Min Zhang, Bo Jiang

Main category: cs.CL

TL;DR: DiaCDM is the first cognitive diagnosis model designed for teacher-student dialogues, using an IRE framework and graph-based encoding to overcome challenges of unstructured dialogue data.

Details

Motivation: Traditional cognitive diagnosis models are ineffective for dynamic, unstructured teacher-student dialogues and struggle to extract diagnostic semantics from lengthy conversations.

Method: Adapted IRE framework from educational theory for dialogue diagnosis, developed graph-based encoding that integrates teacher questions with knowledge components to capture key information.

Result: Experiments on three real-world dialogue datasets show DiaCDM significantly improves diagnostic accuracy and enhances interpretability of results.

Conclusion: DiaCDM provides teachers with a powerful tool for assessing students’ cognitive states in dialogue settings, marking the first exploration of cognitive diagnosis in dialogue contexts.

Abstract: While cognitive diagnosis (CD) effectively assesses students’ knowledge mastery from structured test data, applying it to real-world teacher-student dialogues presents two fundamental challenges. Traditional CD models lack a suitable framework for handling dynamic, unstructured dialogues, and it’s difficult to accurately extract diagnostic semantics from lengthy dialogues. To overcome these hurdles, we propose DiaCDM, an innovative model. We’ve adapted the initiation-response-evaluation (IRE) framework from educational theory to design a diagnostic framework tailored for dialogue. We also developed a unique graph-based encoding method that integrates teacher questions with relevant knowledge components to capture key information more precisely. To our knowledge, this is the first exploration of cognitive diagnosis in a dialogue setting. Experiments on three real-world dialogue datasets confirm that DiaCDM not only significantly improves diagnostic accuracy but also enhances the results’ interpretability, providing teachers with a powerful tool for assessing students’ cognitive states. The code is available at https://github.com/Mind-Lab-ECNU/DiaCDM/tree/main.

[263] Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests

Yanbin Fu, Hong Jiao, Tianyi Zhou, Nan Zhang, Ming Li, Qingshu Xu, Sydney Peters, Robert W. Lissitz

Main category: cs.CL

TL;DR: Fine-tuned small language models (SLMs) outperform embedding-based supervised models for automated item alignment in educational testing, with better performance achieved by including more item text data rather than just increasing sample size.

Details

Motivation: Human expert alignment of test items to content standards is subjective and time-consuming, so automated methods are needed to improve efficiency and objectivity in test development.

Method: Fine-tuned small language models for automated item alignment at domain and skill levels using data from college admissions tests, compared with embedding-based supervised machine learning models, and conducted semantic similarity analysis to understand misclassifications.

Result: SLMs consistently outperformed embedding-based models, especially for fine-grained skill alignment. Including more item text data substantially improved performance beyond sample size increases alone. Semantic analysis showed certain skills were too close, explaining misclassifications.

Conclusion: Fine-tuned SLMs are effective for automated item alignment and provide a more efficient alternative to human expert judgment, though semantic closeness between certain skills remains a challenge for perfect classification.

Abstract: Aligning test items to content standards is a critical step in test development to collect validity evidence based on content. Item alignment has typically been conducted by human experts. This judgmental process can be subjective and time-consuming. This study investigated the performance of fine-tuned small language models (SLMs) for automated item alignment using data from a large-scale standardized reading and writing test for college admissions. Different SLMs were trained for alignment at both domain and skill levels respectively with 10 skills mapped to 4 content domains. The model performance was evaluated in multiple criteria on two testing datasets. The impact of types and sizes of the input data for training was investigated. Results showed that including more item text data led to substantially better model performance, surpassing the improvements induced by sample size increase alone. For comparison, supervised machine learning models were trained using the embeddings from the multilingual-E5-large-instruct model. The study results showed that fine-tuned SLMs consistently outperformed the embedding-based supervised machine learning models, particularly for the more fine-grained skill alignment. To better understand model misclassifications, multiple semantic similarity analysis including pairwise cosine similarity, Kullback-Leibler divergence of embedding distributions, and two-dimension projections of item embeddings were conducted. These analyses consistently showed that certain skills in SAT and PSAT were semantically too close, providing evidence for the observed misclassification.

[264] Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment frm Heterogeneous Rewards

Zhuoran Zhuang, Ye Chen, Xia Zeng, Chao Luo, Luhui Liu, Yihan Chen

Main category: cs.CL

TL;DR: REPO is a reinforcement learning framework that combines multiple reward signals to train LLMs for persuasive price negotiation in OTAs, outperforming existing methods in dialogue quality and constraint compliance.

Details

Motivation: Existing post-training methods like SFT and single-source reward optimization overfit to scripts, miss nuanced persuasive styles, and fail to enforce business constraints in LLM-based negotiation agents.

Method: REPO uses heterogeneous rewards: preference-trained RM for human alignment, RJ for persuasive behavior and SOP compliance, and programmatic RF for deterministic checks on numerics, formatting, and guardrails.

Result: REPO achieved average dialogue rating of 4.63 (+1.20 over base), increased excellent response conversations to 66.67%, and achieved 93.33% bad-case fix rate with 75.56% clean fixes, outperforming SFT, DPO, PPO, and GRPO.

Conclusion: REPO effectively aligns LLMs with complex business requirements for negotiation, demonstrating emergent capabilities like proactive empathy and calibrated tactics that surpass human annotations.

Abstract: We study deploying large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs), where aligning traveler affordability and hotel profitability directly affects bookings, partner relationships, and access to travel. The agent must follow a Standard Operating Procedure (SOP) while conducting multi-turn persuasion, interpreting colloquial inputs, and adhering to guardrails (no over-promising, no hallucinations). Conventional post-training – supervised fine-tuning (SFT) or single-source reward optimization – overfits scripts, misses nuanced persuasive style, and fails to enforce verifiable business constraints. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns an LLM with heterogeneous rewards: a preference-trained reward model (RM) for dense human alignment, a reward judge (RJ) for high-level persuasive behavior and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. A straightforward enhancement mechanism is proposed to combine the RM with RJ and RF signals to curb reward hacking and improve negotiation quality. In production-style evaluations – approximately 150 turns from real dialogues and 225 turns from curated bad-case dialogues – REPO lifts average dialogue rating to 4.63: +1.20 over base, +0.83 over Direct Preference Optimization (DPO); +0.33 over Group Relative Policy Optimization (GRPO), increases the share of conversations with at least one excellent response to 66.67% (+23.34 percentage points over GRPO), and achieves a 93.33% bad-case fix rate with 75.56% clean fixes, outperforming SFT, DPO, PPO, and GRPO. We also observe emergent capabilities – proactive empathy, localized reasoning, calibrated tactics – that surpass gold annotations.

Ayan Majumdar, Feihao Chen, Jinghui Li, Xiaozhen Wang

Main category: cs.CL

TL;DR: This paper presents a comprehensive evaluation framework for assessing LLMs’ ability to detect demographic-targeted social biases in English texts, finding that fine-tuned smaller models show promise but gaps remain in detecting multi-demographic biases.

Details

Motivation: Large-scale web-scraped text corpora used for AI training often contain harmful demographic-targeted social biases, creating regulatory needs for data auditing and scalable bias-detection methods. Prior work has been narrow in scope, focusing on single content types and limited demographics.

Method: Developed a comprehensive evaluation framework for English texts, framing bias detection as a multi-label task using a demographic-focused taxonomy. Conducted systematic evaluation with models across scales and techniques including prompting, in-context learning, and fine-tuning using twelve datasets spanning diverse content types and demographics.

Result: The study demonstrates the promise of fine-tuned smaller models for scalable bias detection. However, analyses expose persistent gaps across demographic axes and multi-demographic targeted biases.

Conclusion: While fine-tuned smaller models show potential for scalable bias detection, there are significant gaps in detecting multi-demographic biases, underscoring the need for more effective and scalable auditing frameworks to address regulatory requirements.

Abstract: Large-scale web-scraped text corpora used to train general-purpose AI models often contain harmful demographic-targeted social biases, creating a regulatory need for data auditing and developing scalable bias-detection methods. Although prior work has investigated biases in text datasets and related detection methods, these studies remain narrow in scope. They typically focus on a single content type (e.g., hate speech), cover limited demographic axes, overlook biases affecting multiple demographics simultaneously, and analyze limited techniques. Consequently, practitioners lack a holistic understanding of the strengths and limitations of recent large language models (LLMs) for automated bias detection. In this study, we present a comprehensive evaluation framework aimed at English texts to assess the ability of LLMs in detecting demographic-targeted social biases. To align with regulatory requirements, we frame bias detection as a multi-label task using a demographic-focused taxonomy. We then conduct a systematic evaluation with models across scales and techniques, including prompting, in-context learning, and fine-tuning. Using twelve datasets spanning diverse content types and demographics, our study demonstrates the promise of fine-tuned smaller models for scalable detection. However, our analyses also expose persistent gaps across demographic axes and multi-demographic targeted biases, underscoring the need for more effective and scalable auditing frameworks.

[266] Automated Alignment of Math Items to Content Standards in Large-Scale Assessments Using Language Models

Qingshu Xu, Hong Jiao, Tianyi Zhou, Ming Li, Nan Zhang, Sydney Peters, Yanbin Fu

Main category: cs.CL

TL;DR: This study evaluated three automated methods for aligning assessment items to content standards, finding that fine-tuned language models (DeBERTa-v3-base and RoBERTa-large) outperformed classical machine learning and ensemble approaches for domain and skill alignment respectively.

Details

Motivation: Accurate alignment of items to content standards is critical for valid score interpretation in large-scale assessments, requiring automated methods to handle multiple domain and skill labels efficiently.

Method: Three approaches were tested: 1) Classical supervised ML with embeddings and dimensionality reduction, 2) Fine-tuning eight BERT model variants for domain and skill alignment, 3) Ensemble learning with majority voting and stacking using multiple meta-models.

Result: DeBERTa-v3-base achieved highest F1 score (0.950) for domain alignment, RoBERTa-large achieved highest F1 score (0.869) for skill alignment. Ensemble models didn’t surpass best language models. Dimension reduction helped linear classifiers but not language models.

Conclusion: Fine-tuned language models demonstrated superior performance for automated item alignment compared to classical machine learning and ensemble methods, providing effective solutions for aligning assessment items to content standards.

Abstract: Accurate alignment of items to content standards is critical for valid score interpretation in large-scale assessments. This study evaluates three automated paradigms for aligning items with four domain and nineteen skill labels. First, we extracted embeddings and trained multiple classical supervised machine learning models, and further investigated the impact of dimensionality reduction on model performance. Second, we fine-tuned eight BERT model and its variants for both domain and skill alignment. Third, we explored ensemble learning with majority voting and stacking with multiple meta-models. The DeBERTa-v3-base achieved the highest weighted-average F1 score of 0.950 for domain alignment while the RoBERTa-large yielded the highest F1 score of 0.869 for skill alignment. Ensemble models did not surpass the best-performing language models. Dimension reduction enhanced linear classifiers based on embeddings but did not perform better than language models. This study demonstrated different methods in automated item alignment to content standards.}

[267] Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

Maojia Song, Renhang Liu, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Soujanya Poria, Jingren Zhou

Main category: cs.CL

TL;DR: WebDetective is a new benchmark for evaluating multi-hop reasoning in RAG systems and web agents, addressing limitations of current benchmarks by removing leaked reasoning paths and providing holistic evaluation metrics.

Details

Motivation: Current benchmarks for multi-hop reasoning leak reasoning paths in questions and use oversimplified evaluation metrics that obscure specific failure modes like inadequate search, poor knowledge use, or inappropriate refusal behavior.

Method: Created WebDetective benchmark with hint-free multi-hop questions and a controlled Wikipedia sandbox for full traceability, plus a holistic evaluation framework that separates search sufficiency, knowledge utilization, and refusal behavior.

Result: Evaluation of 25 state-of-the-art models revealed systematic weaknesses: models struggle with knowledge utilization despite sufficient evidence and show near-absent appropriate refusal when evidence is lacking. Models excel at executing given reasoning paths but fail at discovering them.

Conclusion: WebDetective’s diagnostic framework can guide architectural improvements, as demonstrated by EvidenceLoop workflow that incorporates verification loops and systematic evidence tracking. The benchmark is crucial for developing genuinely autonomous reasoning systems rather than pattern-following agents.

Abstract: RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilisation despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap: today’s systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective’s diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.

[268] Self-Filtered Distillation with LLMs-generated Trust Indicators for Reliable Patent Classification

Yongmin Yoo, Xu Zhang, Longbing Cao

Main category: cs.CL

TL;DR: Self-Filtered Distillation framework uses LLM-generated rationales as trust signals rather than ground truth, employing three unsupervised metrics to filter and weight training samples for improved patent classification.

Details

Motivation: LLM-generated rationales often contain logical errors and domain misalignments, making direct supervision risky for training stability and accuracy.

Method: Uses three trust metrics: Self-Consistency, Class Entailment Alignment, and LLM Agreement Scoring to create unified trust scores for selective distillation.

Result: Outperforms label-based learning and conventional distillation on USPTO-2M dataset in accuracy, stability, and interpretability.

Conclusion: Establishes reliable paradigm for leveraging reasoning-aware trust indicators in patent analytics.

Abstract: Large language models (LLMs) increasingly generate natural language rationales to enhance interpretability, but these often contain logical errors, label mismatches, and domain-specific misalignments. Directly using such rationales as supervision risks propagating noise and undermining training stability. To address this challenge, we introduce Self-Filtered Distillation, a framework specifically tailored for patent classification, which treats LLM-generated rationales as trust signals rather than ground-truth supervision. The framework employs selective distillation guided by three unsupervised trust metrics: (1) Self-Consistency, which measures the stability of LLM-generated rationales across multiple generations; (2) Class Entailment Alignment, which assesses semantic coherence with patent-specific class definitions; and (3) LLM Agreement Scoring, which validates rationale-label plausibility. These metrics are integrated into a unified trust score that primarily weights training samples while optionally filtering out extremely low-trust cases, enabling reasoning-aware supervision. Experiments on the USPTO-2M dataset, a widely used benchmark for patent classification, show that our method outperforms label-based learning and conventional distillation in accuracy, stability, and interpretability, establishing a reliable paradigm for leveraging reasoning-aware trust indicators in patent analytics.

[269] Probing the Difficulty Perception Mechanism of Large Language Models

Sunbowen Lee, Qingyu Yin, Chak Tou Leong, Jialiang Zhang, Yicheng Gong, Shiwen Ni, Min Yang, Xiaoyu Shen

Main category: cs.CL

TL;DR: LLMs can implicitly perceive problem difficulty through their internal representations, with specific attention heads showing opposite activation patterns for simple vs. difficult math problems.

Details

Motivation: To investigate whether LLMs internally encode problem difficulty information, which is crucial for adaptive reasoning and efficient resource allocation in complex reasoning tasks.

Method: Used linear probes on final-token representations of LLMs and identified specific attention heads in the final Transformer layer that show opposite activation patterns for simple and difficult problems.

Result: Demonstrated that problem difficulty can be linearly modeled from LLM representations, with specific attention heads reliably distinguishing difficulty levels through opposite activation patterns.

Conclusion: LLMs possess structured difficulty perception capabilities that can be leveraged for automatic difficulty annotation, reducing reliance on human labeling in benchmark construction and curriculum learning.

Abstract: Large language models (LLMs) are increasingly deployed on complex reasoning tasks, yet little is known about their ability to internally evaluate problem difficulty, which is an essential capability for adaptive reasoning and efficient resource allocation. In this work, we investigate whether LLMs implicitly encode problem difficulty in their internal representations. Using a linear probe on the final-token representations of LLMs, we demonstrate that the difficulty level of math problems can be linearly modeled. We further locate the specific attention heads of the final Transformer layer: these attention heads have opposite activation patterns for simple and difficult problems, thus achieving perception of difficulty. Our ablation experiments prove the accuracy of the location. Crucially, our experiments provide practical support for using LLMs as automatic difficulty annotators, potentially substantially reducing reliance on costly human labeling in benchmark construction and curriculum learning. We also uncover that there is a significant difference in entropy and difficulty perception at the token level. Our study reveals that difficulty perception in LLMs is not only present but also structurally organized, offering new theoretical insights and practical directions for future research. Our code is available at https://github.com/Aegis1863/Difficulty-Perception-of-LLMs.

[270] TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B

Toshiki Nakai, Ravi Kiran Chikkala, Lena Sophie Oberkircher, Nicholas Jennings, Natalia Skachkova, Tatiana Anikina, Jesujoba Oluwadara Alabi

Main category: cs.CL

TL;DR: TRepLiNa method combines CKA and REPINA to improve low-resource language translation by aligning mid-level layers in multilingual LLMs, showing effectiveness in data-scarce settings.

Details

Motivation: Addressing India's linguistic gaps by improving translation quality from low-resource languages (LRLs) to high-resource languages (HRLs) through cross-lingual representation alignment.

Method: Combined Centered Kernel Alignment (CKA) with REPINA regularization into TRepLiNa method, tested on Aya-23 8B with QLoRA across Mundari, Santali, Bhili language pairs with Hindi/English pivots in zero-shot, few-shot, and fine-tuning settings.

Result: Aligning mid-level layers using TRepLiNa (CKA+REPINA) proved to be a low-cost, practical approach that improves LRL translation, particularly in data-scarce environments.

Conclusion: The proposed TRepLiNa method effectively enhances translation quality for low-resource languages by enforcing cross-lingual similarity in specific internal layers of multilingual LLMs.

Abstract: The 2025 Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo) Language Challenge addresses one of India’s most pressing linguistic gaps: the lack of resources for its diverse low-resource languages (LRLs). In this study, we investigate whether enforcing cross-lingual similarity in specific internal layers of a decoder-only multilingual large language model (LLM) can improve translation quality from LRL to high-resource language (HRL). Specifically, we combine Centered Kernel Alignment (CKA), a similarity metric that encourages representations of different languages to align, with REPINA, a regularization method that constrains parameter updates to remain close to the pretrained model, into a joint method we call TRepLiNa. In this research project, we experiment with zero-shot, few-shot, and fine-tuning settings using Aya-23 8B with QLoRA across MMLoSo shared task language pairs (Mundari, Santali, Bhili) with Hindi/English pivots. Our results show that aligning mid-level layers using TRepLiNa (CKA+REPINA) is a low-cost, practical approach to improving LRL translation, especially in data-scarce settings.

[271] Native Hybrid Attention for Efficient Sequence Modeling

Jusen Du, Jiaxi Hu, Tao Zhang, Weigao Sun, Yu Cheng

Main category: cs.CL

TL;DR: Native Hybrid Attention (NHA) combines linear and full attention in a unified layer design, maintaining long-term context with linear RNN and short-term tokens with sliding window, achieving better efficiency and accuracy than Transformers.

Details

Motivation: Transformers have quadratic complexity while linear attention sacrifices recall accuracy over long contexts. NHA aims to balance efficiency and accuracy through hybrid attention.

Method: NHA integrates intra-layer and inter-layer hybridization using linear RNN for long-term context, sliding window for short-term tokens, and single softmax attention over all keys/values without additional parameters.

Result: NHA outperforms Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks, and enables pretrained LLMs to achieve competitive accuracy with significant efficiency gains.

Conclusion: NHA provides an effective hybrid attention architecture that smoothly transitions between linear and full attention while maintaining structural uniformity and improving both efficiency and accuracy.

Abstract: Transformers excel at sequence modeling but face quadratic complexity, while linear attention offers improved efficiency but often compromises recall accuracy over long contexts. In this work, we introduce Native Hybrid Attention (NHA), a novel hybrid architecture of linear and full attention that integrates both intra & inter-layer hybridization into a unified layer design. NHA maintains long-term context in key-value slots updated by a linear RNN, and augments them with short-term tokens from a sliding window. A single \texttt{softmax attention} operation is then applied over all keys and values, enabling per-token and per-head context-dependent weighting without requiring additional fusion parameters. The inter-layer behavior is controlled through a single hyperparameter, the sliding window size, which allows smooth adjustment between purely linear and full attention while keeping all layers structurally uniform. Experimental results show that NHA surpasses Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks. Furthermore, pretrained LLMs can be structurally hybridized with NHA, achieving competitive accuracy while delivering significant efficiency gains. Code is available at https://github.com/JusenD/NHA.

[272] Do LLMs Really Need 10+ Thoughts for “Find the Time 1000 Days Later”? Towards Structural Understanding of LLM Overthinking

Xinliang Frederick Zhang, Anhad Mohananey, Alexandra Chronopoulou, Pinelopi Papalampidi, Somit Gupta, Tsendsuren Munkhdalai, Lu Wang, Shyam Upadhyay

Main category: cs.CL

TL;DR: This paper introduces TRACE, a systematic analyzer that identifies overthinking in LLMs as caused by over-verification and over-exploration patterns, and proposes a utility-based definition for better overthinking management.

Details

Motivation: Long chain-of-thought reasoning models suffer from overthinking - engaging in unnecessarily extensive reasoning for simple queries, causing computational inefficiency without accuracy improvements. Current analyses are superficial and fail to understand the underlying causes.

Method: Developed TRACE analyzer that: 1) decomposes thought process into minimally complete sub-thoughts, 2) infers discourse relationships to construct granular thought progression graphs, 3) identifies common thinking patterns for similar queries.

Result: Identified two major overthinking patterns: Explorer and Late Landing. Found that long-thinking models are 5-20x slower on simple tasks with no substantial gains. Over-verification and over-exploration are the primary drivers of overthinking.

Conclusion: Proposed a utility-based definition of overthinking that moves beyond length-based metrics, offering more insightful understanding of LLM thought progression and practical guidelines for principled overthinking management.

Abstract: Models employing long chain-of-thought (CoT) reasoning have shown superior performance on complex reasoning tasks. Yet, this capability introduces a critical and often overlooked inefficiency – overthinking – models often engage in unnecessarily extensive reasoning even for simple queries, incurring significant computations without accuracy improvements. While prior work has explored solutions to mitigate overthinking, a fundamental gap remains in our understanding of its underlying causes. Most existing analyses are limited to superficial, profiling-based observations, failing to delve into LLMs’ inner workings. This study introduces a systematic, fine-grained analyzer of LLMs’ thought process to bridge the gap, TRACE. We first benchmark the overthinking issue, confirming that long-thinking models are five to twenty times slower on simple tasks with no substantial gains. We then use TRACE to first decompose the thought process into minimally complete sub-thoughts. Next, by inferring discourse relationships among sub-thoughts, we construct granular thought progression graphs and subsequently identify common thinking patterns for topically similar queries. Our analysis reveals two major patterns for open-weight thinking models – Explorer and Late Landing. This finding provides evidence that over-verification and over-exploration are the primary drivers of overthinking in LLMs. Grounded in thought structures, we propose a utility-based definition of overthinking, which moves beyond length-based metrics. This revised definition offers a more insightful understanding of LLMs’ thought progression, as well as practical guidelines for principled overthinking management.

Jialu Du, Guiyang Hou, Yihui Fu, Chen Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu

Main category: cs.CL

TL;DR: LLMs struggle with social reasoning due to confusion between objective reality and subjective beliefs. The paper proposes a world model-enhanced reasoning mechanism that tracks entity states and intervenes when confusion is detected, significantly improving social reasoning performance.

Details

Motivation: LLMs excel at mathematical and code reasoning but fail at social reasoning tasks, showing cognitive confusion, logical inconsistencies, and inability to distinguish between objective world states and subjective belief states.

Method: Proposed an adaptive world model-enhanced reasoning mechanism that constructs dynamic textual world models to track entity states and temporal sequences. It monitors reasoning trajectories for confusion indicators and intervenes by providing clear world state descriptions.

Result: Evaluations on three social benchmarks show significant accuracy improvements (e.g., +10% in Hi-ToM) while reducing computational costs by up to 33.8% token reduction.

Conclusion: The world model-enhanced reasoning mechanism offers a simple yet effective solution for deploying LLMs in social contexts by helping models navigate cognitive dilemmas and distinguish between external events and internal beliefs.

Abstract: While large language models (LLMs) excel in mathematical and code reasoning, we observe they struggle with social reasoning tasks, exhibiting cognitive confusion, logical inconsistencies, and conflation between objective world states and subjective belief states. Through deteiled analysis of DeepSeek-R1’s reasoning trajectories, we find that LLMs frequently encounter reasoning impasses and tend to output contradictory terms like “tricky” and “confused” when processing scenarios with multiple participants and timelines, leading to erroneous reasoning or infinite loops. The core issue is their inability to disentangle objective reality from agents’ subjective beliefs. To address this, we propose an adaptive world model-enhanced reasoning mechanism that constructs a dynamic textual world model to track entity states and temporal sequences. It dynamically monitors reasoning trajectories for confusion indicators and promptly intervenes by providing clear world state descriptions, helping models navigate through cognitive dilemmas. The mechanism mimics how humans use implicit world models to distinguish between external events and internal beliefs. Evaluations on three social benchmarks demonstrate significant improvements in accuracy (e.g., +10% in Hi-ToM) while reducing computational costs (up to 33.8% token reduction), offering a simple yet effective solution for deploying LLMs in social contexts.

[274] Formalizing Style in Personal Narratives

Gustave Cortal, Alain Finkel

Main category: cs.CL

TL;DR: A framework for analyzing style in personal narratives by formalizing linguistic choices as patterns, integrating functional linguistics, computer science, and psychology.

Details

Motivation: There is a lack of formal framework for systematically analyzing stylistic choices in personal narratives, which are fundamental to conveying subjective experiences.

Method: Using language models to automatically extract linguistic features (processes, participants, circumstances) and analyze sequential patterns in narratives.

Result: Applied to dream narratives including a war veteran with PTSD, revealing distinctive patterns where verbal processes dominate over mental ones.

Conclusion: The framework successfully links linguistic choices to psychological states, demonstrating the relationship between style and subjective experience.

Abstract: Personal narratives are stories authors construct to make meaning of their experiences. Style, the distinctive way authors use language to express themselves, is fundamental to how these narratives convey subjective experiences. Yet there is a lack of a formal framework for systematically analyzing these stylistic choices. We present a novel approach that formalizes style in personal narratives as patterns in the linguistic choices authors make when communicating subjective experiences. Our framework integrates three domains: functional linguistics establishes language as a system of meaningful choices, computer science provides methods for automatically extracting and analyzing sequential patterns, and these patterns are linked to psychological observations. Using language models, we automatically extract linguistic features such as processes, participants, and circumstances. We apply our framework to hundreds of dream narratives, including a case study on a war veteran with post-traumatic stress disorder. Analysis of his narratives uncovers distinctive patterns, particularly how verbal processes dominate over mental ones, illustrating the relationship between linguistic choices and psychological states.

[275] dInfer: An Efficient Inference Framework for Diffusion Language Models

Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, Da Zheng

Main category: cs.CL

TL;DR: dInfer is an efficient inference framework for diffusion-based large language models (dLLMs) that achieves 10x speedup over prior systems and 2-3x speedup over optimized AR models while maintaining output quality.

Details

Motivation: The widespread adoption of diffusion-based LLMs is constrained by the lack of standardized and efficient inference frameworks, despite their promise as alternatives to autoregressive LLMs.

Method: dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, KV-cache manager) with novel algorithms and system-level optimizations.

Result: Achieves over 1,100 tokens/sec on HumanEval and averages 800+ tokens/sec across six benchmarks on 8x H800 GPUs, with 10x speedup over Fast-dLLM and 2-3x speedup over optimized AR model QWen2.5-3B.

Conclusion: dInfer provides a standardized, efficient inference framework for dLLMs that significantly outperforms existing systems while maintaining model performance, enabling broader adoption of diffusion-based language models.

Abstract: Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Even more and more open-sourced dLLM models emerge, yet their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components–model, diffusion iteration manager, decoding strategy, and KV-cache manager–and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on $8\times$ H800 GPUs. Compared to prior systems, dInfer delivers a $10\times$ speedup over Fast-dLLM while maintaining similar model performance. Even compared to the AR model (with a comparable number of activation parameters and performance) QWen2.5-3B, which is highly optimized with the latest vLLM inference engine, dInfer still delivers a $2$-$3\times$ speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.

[276] Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors

Xin Liu, Runsong Zhao, Pengcheng Huang, Xinyu Liu, Junyi Xiao, Chunyang Xiao, Tong Xiao, Shengxiang Gao, Zhengtao Yu, Jingbo Zhu

Main category: cs.CL

TL;DR: SAC is a novel context compression method that directly selects anchor tokens from original context and aggregates contextual information into their KV representations, eliminating the need for autoencoding training and outperforming existing methods.

Details

Motivation: Current context compression methods rely on autoencoding tasks which create a mismatch between reconstruction optimization and actual downstream tasks, weakening features beneficial for real-world usage.

Method: SAC directly selects anchor tokens from original context and aggregates contextual information into their KV representations using anchor embeddings and bidirectional attention modification, eliminating autoencoding training.

Result: SAC consistently outperforms existing context compression methods across various compression ratios, achieving 1 EM improvement at 5x compression over strong baselines on MRQA, with increasing advantages at higher compression ratios.

Conclusion: SAC provides a more effective approach to context compression by directly leveraging contextual tokens through anchor selection and aggregation, avoiding the limitations of autoencoding-based methods.

Abstract: Context compression presents a promising approach for accelerating large language model (LLM) inference by compressing long contexts into compact representations. Current context compression methods predominantly rely on autoencoding tasks to train context-agnostic compression tokens to compress contextual semantics. While autoencoding tasks enable compression tokens to acquire compression capabilities, compression via autoencoding tasks creates a fundamental mismatch: the models are optimized for reconstruction that diverge from actual downstream tasks, thereby weakening the features more beneficial for real-world usage. We propose Semantic-Anchor Compression (SAC), a novel method that shifts from autoencoding task based compression to an architecture that is equipped with this compression capability \textit{a priori}. Instead of training models to compress contexts through autoencoding tasks, SAC directly selects so-called anchor tokens from the original context and aggregates contextual information into their key-value (KV) representations. By deriving representations directly from the contextual tokens, SAC eliminates the need for autoencoding training. To ensure compression performance while directly leveraging anchor tokens, SAC incorporates two key designs: (1) anchor embeddings that enable the compressor to identify critical tokens, and (2) bidirectional attention modification that allows anchor tokens to capture information from the entire context. Experimental results demonstrate that SAC consistently outperforms existing context compression methods across various compression ratios. On out-of-distribution evaluation using MRQA, SAC achieves 1 EM improvement at 5x compression over strong baselines, with increasing advantages at higher compression ratios.

[277] DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation

Enze Zhang, Jiaying Wang, Mengxi Xiao, Jifei Liu, Ziyan Kuang, Rui Dong, Eric Dong, Sophia Ananiadou, Min Peng, Qianqian Xie

Main category: cs.CL

TL;DR: DITING is the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions with 18K expert-annotated Chinese-English pairs. AgentEval, a reasoning-driven multi-agent framework, achieves highest correlation with human judgments.

Details

Motivation: Existing benchmarks rely on surface-level metrics that fail to capture distinctive traits of web novel translation, and effectiveness of LLMs in this domain remains unclear.

Method: Introduce DITING framework with six evaluation dimensions, develop AgentEval multi-agent evaluation framework, create MetricAlign meta-evaluation dataset, and evaluate fourteen LLM models.

Result: Chinese-trained LLMs surpass larger foreign counterparts; DeepSeek-V3 delivers most faithful and stylistically coherent translations; AgentEval achieves highest correlation with human judgments among seven automatic metrics.

Conclusion: Establishes new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.

Abstract: Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety, supported by over 18K expert-annotated Chinese-English sentence pairs. We further propose AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap, achieving the highest correlation with human judgments among seven tested automatic metrics. To enable metric comparison, we develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Comprehensive evaluation of fourteen open, closed, and commercial models reveals that Chinese-trained LLMs surpass larger foreign counterparts, and that DeepSeek-V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.

[278] DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning

Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao

Main category: cs.CL

TL;DR: DSPO is a new RL algorithm that enables LLMs to actively search external knowledge through multi-turn search and reasoning, achieving significant performance improvements without supervised data.

Details

Motivation: Current approaches for enhancing LLMs with external knowledge search either rely on prompting or suffer from performance limitations and instability when applying RL to complex interactive tasks.

Method: Dynamic-filter Sequence-level Policy Optimization (DSPO) - an improved RL algorithm using sequence-level optimization and dynamic sample filtering, trained purely through RL without supervised demonstration data.

Result: DSPO-trained 7B model improves over comparable previous work by 34.1% across multiple QA benchmarks, outperforms 14B model from previous work in complex multihop QA by nearly 9% relative, while maintaining exceptional training stability.

Conclusion: DSPO effectively unlocks LLMs’ true agentic potential for complex knowledge-seeking tasks through robust RL training, demonstrating superior performance and stability compared to existing approaches.

Abstract: Enhancing LLMs with the ability to actively search external knowledge is crucial for complex and real-world tasks. Current approaches either rely on prompting to elicit the model’s innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped. To address this, we introduce \textbf{D}ynamic-filter \textbf{S}equence-level \textbf{P}olicy \textbf{O}ptimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. We train our model purely through RL to interleave multi-turn search and reasoning, obviating the need for supervised demonstration data. Across multiple QA benchmarks, our DSPO-trained 7B model improves over a comparable previous work by \textbf{34.1%}, and even outperforms the 14B model from previous work in complex multihop QA such as HotpotQA by nearly \textbf{9% relative}, maintaining exceptional training stability.

[279] SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu

Main category: cs.CL

TL;DR: SPG method improves RL alignment for diffusion LLMs by using both upper and lower bounds of log-likelihood, outperforming ELBO-based methods across multiple reasoning tasks.

Details

Motivation: Aligning diffusion LLMs with human preferences via RL is challenging due to intractable log-likelihood, and existing ELBO-based methods introduce significant policy gradient bias.

Method: Proposed Sandwiched Policy Gradient (SPG) that leverages both upper and lower bounds of the true log-likelihood instead of one-sided approximations like ELBO.

Result: SPG significantly outperforms baselines, improving accuracy by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku over state-of-the-art RL methods.

Conclusion: SPG provides an effective solution for RL alignment of diffusion LLMs by addressing the policy gradient bias issue through dual-bound approximation of log-likelihood.

Abstract: Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.

cs.CV

[280] TinyViT-Batten: Few-Shot Vision Transformer with Explainable Attention for Early Batten-Disease Detection on Pediatric MRI

Khartik Uppalapati, Bora Yimenicioglu, Shakeel Abdulkareem, Adan Eftekhari, Bhavya Uppalapati, Viraj Kamath

Main category: cs.CV

TL;DR: TinyViT-Batten is a few-shot Vision Transformer framework that detects early Batten disease from pediatric brain MRI with high accuracy (91%) using only limited training cases.

Details

Motivation: Early MRI signs of Batten disease are subtle and often missed, requiring an AI solution that can work with limited training data for this rare pediatric neurodegenerative disorder.

Method: Distill a large teacher ViT into a 5M-parameter TinyViT and fine-tune using metric-based few-shot learning (prototypical loss with 5-shot episodes), with Grad-CAM integration for explainable predictions.

Result: Achieves 91% accuracy and AUC ≥0.95 on multi-site dataset of 79 Batten-disease MRIs and 90 controls, outperforming 3D-ResNet and Swin-Tiny baselines with >90% sensitivity and ~90% specificity.

Conclusion: The model’s small size and strong performance demonstrate a practical AI solution for early Batten disease detection with explainable predictions.

Abstract: Batten disease (neuronal ceroid lipofuscinosis) is a rare pediatric neurodegenerative disorder whose early MRI signs are subtle and often missed. We propose TinyViT-Batten, a few-shot Vision Transformer (ViT) framework to detect early Batten disease from pediatric brain MRI with limited training cases. We distill a large teacher ViT into a 5 M-parameter TinyViT and fine-tune it using metric-based few-shot learning (prototypical loss with 5-shot episodes). Our model achieves high accuracy (approximately 91%) and area under ROC of at least 0.95 on a multi-site dataset of 79 genetically confirmed Batten-disease MRIs (27 CLN3 from the Hochstein natural-history study, 32 CLN2 from an international longitudinal cohort, 12 early-manifestation CLN2 cases reported by Cokal et al., and 8 public Radiopaedia scans) together with 90 age-matched controls, outperforming a 3D-ResNet and Swin-Tiny baseline. We further integrate Gradient-weighted Class Activation Mapping (Grad-CAM) to highlight disease-relevant brain regions, enabling explainable predictions. The model’s small size and strong performance (sensitivity greater than 90%, specificity approximately 90%) demonstrates a practical AI solution for early Batten disease detection.

[281] Ultralytics YOLO Evolution: An Overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 Object Detectors for Computer Vision and Pattern Recognition

Ranjan Sapkota, Manoj Karkee

Main category: cs.CV

TL;DR: This paper provides a comprehensive review of the Ultralytics YOLO family evolution from YOLOv5 to YOLO26, analyzing architectural innovations, benchmarking performance on MS COCO dataset, deployment strategies, and future challenges in object detection.

Details

Motivation: To systematically document the architectural progression of YOLO object detectors, provide quantitative benchmarking across different versions, and identify deployment considerations and future research directions for the YOLO family.

Method: The review analyzes YOLO evolution chronologically from YOLOv5 to YOLO26, highlighting key innovations at each stage. Benchmarking is performed on MS COCO dataset comparing YOLOv5, YOLOv8, YOLO11, YOLO26 with other detectors using metrics like precision, recall, mAP, and inference speed.

Result: YOLO26 introduces significant innovations including DFL removal, native NMS-free inference, Progressive Loss Balancing, STAL, and MuSGD optimizer. Benchmarking shows trade-offs between accuracy and efficiency across different YOLO versions, with YOLO26 representing the latest advancements in the family.

Conclusion: The YOLO family has evolved significantly with continuous architectural improvements. Future challenges include addressing dense-scene limitations, integrating CNN-Transformer hybrids, developing open-vocabulary detection, and implementing edge-aware training approaches for broader real-world applications.

Abstract: This paper presents a comprehensive overview of the Ultralytics YOLO(You Only Look Once) family of object detectors, focusing the architectural evolution, benchmarking, deployment perspectives, and future challenges. The review begins with the most recent release, YOLO26 (YOLOv26), which introduces key innovations including Distribution Focal Loss (DFL) removal, native NMS-free inference, Progressive Loss Balancing (ProgLoss), Small-Target-Aware Label Assignment (STAL), and the MuSGD optimizer for stable training. The progression is then traced through YOLO11, with its hybrid task assignment and efficiency-focused modules; YOLOv8, which advanced with a decoupled detection head and anchor-free predictions; and YOLOv5, which established the modular PyTorch foundation that enabled modern YOLO development. Benchmarking on the MS COCO dataset provides a detailed quantitative comparison of YOLOv5, YOLOv8, YOLO11, and YOLO26, alongside cross-comparisons with YOLOv12, YOLOv13, RT-DETR, and DEIM. Metrics including precision, recall, F1 score, mean Average Precision, and inference speed are analyzed to highlight trade-offs between accuracy and efficiency. Deployment and application perspectives are further discussed, covering export formats, quantization strategies, and real-world use in robotics, agriculture, surveillance, and manufacturing. Finally, the paper identifies challenges and future directions, including dense-scene limitations, hybrid CNN-Transformer integration, open-vocabulary detection, and edge-aware training approaches.

[282] TreeNet: Layered Decision Ensembles

Zeshan Khan

Main category: cs.CV

TL;DR: TreeNet is a novel layered decision ensemble learning method for medical image analysis that combines neural networks, ensemble learning, and tree-based models to address data scarcity challenges while maintaining interpretability.

Details

Motivation: Current medical image analysis methods (Neural Networks, Decision Trees, Ensemble Learning) work well with large datasets but struggle with limited data availability and data confidence issues common in medical contexts.

Method: TreeNet integrates key features from neural networks, ensemble learning, and tree-based decision models into a layered decision ensemble learning methodology specifically designed for medical image analysis.

Result: Achieved F1-score of 0.85 with full training data and 0.77 with 50% training data (only 0.08 reduction), with processing speed of 32 FPS suitable for real-time applications.

Conclusion: TreeNet demonstrates efficiency and usability in medical image analysis, particularly for real-time applications, showing robustness to data limitations while maintaining interpretability.

Abstract: Within the domain of medical image analysis, three distinct methodologies have demonstrated commendable accuracy: Neural Networks, Decision Trees, and Ensemble-Based Learning Algorithms, particularly in the specialized context of genstro institutional track abnormalities detection. These approaches exhibit efficacy in disease detection scenarios where a substantial volume of data is available. However, the prevalent challenge in medical image analysis pertains to limited data availability and data confidence. This paper introduces TreeNet, a novel layered decision ensemble learning methodology tailored for medical image analysis. Constructed by integrating pivotal features from neural networks, ensemble learning, and tree-based decision models, TreeNet emerges as a potent and adaptable model capable of delivering superior performance across diverse and intricate machine learning tasks. Furthermore, its interpretability and insightful decision-making process enhance its applicability in complex medical scenarios. Evaluation of the proposed approach encompasses key metrics including Accuracy, Precision, Recall, and training and evaluation time. The methodology resulted in an F1-score of up to 0.85 when using the complete training data, with an F1-score of 0.77 when utilizing 50% of the training data. This shows a reduction of F1-score of 0.08 while in the reduction of 50% of the training data and training time. The evaluation of the methodology resulted in the 32 Frame per Second which is usable for the realtime applications. This comprehensive assessment underscores the efficiency and usability of TreeNet in the demanding landscape of medical image analysis specially in the realtime analysis.

[283] MCE: Towards a General Framework for Handling Missing Modalities under Imbalanced Missing Rates

Binyu Zhao, Wei Zhang, Zhaonian Zou

Main category: cs.CV

TL;DR: MCE addresses imbalanced missing modalities in multi-modal learning by enhancing learning capability through dynamic balancing and representation capability via feature improvement tasks, outperforming SOTA methods.

Details

Motivation: Existing methods fail to handle sample-level modality utility variations and degraded feature quality caused by imbalanced missing rates, creating a vicious cycle where modalities with higher missing rates receive fewer updates.

Method: MCE consists of two components: Learning Capability Enhancement (LCE) with multi-level factors for dynamic modality balancing, and Representation Capability Enhancement (RCE) with subset prediction and cross-modal completion tasks to improve feature semantics and robustness.

Result: Comprehensive evaluations on four multi-modal benchmarks show MCE consistently outperforms state-of-the-art methods under various missing configurations.

Conclusion: MCE effectively tackles the limitations of existing approaches by addressing both learning progress imbalance and feature quality degradation in multi-modal learning with missing modalities.

Abstract: Multi-modal learning has made significant advances across diverse pattern recognition applications. However, handling missing modalities, especially under imbalanced missing rates, remains a major challenge. This imbalance triggers a vicious cycle: modalities with higher missing rates receive fewer updates, leading to inconsistent learning progress and representational degradation that further diminishes their contribution. Existing methods typically focus on global dataset-level balancing, often overlooking critical sample-level variations in modality utility and the underlying issue of degraded feature quality. We propose Modality Capability Enhancement (MCE) to tackle these limitations. MCE includes two synergistic components: i) Learning Capability Enhancement (LCE), which introduces multi-level factors to dynamically balance modality-specific learning progress, and ii) Representation Capability Enhancement (RCE), which improves feature semantics and robustness through subset prediction and cross-modal completion tasks. Comprehensive evaluations on four multi-modal benchmarks show that MCE consistently outperforms state-of-the-art methods under various missing configurations. The journal preprint version is now available at https://doi.org/10.1016/j.patcog.2025.112591. Our code is available at https://github.com/byzhaoAI/MCE.

[284] OmniSAT: Compact Action Token, Faster Auto Regression

Huaihai Lyu, Chaofan Chen, Senwei Xie, Pengwei Wang, Xiansheng Chen, Shanghang Zhang, Changsheng Xu

Main category: cs.CV

TL;DR: OmniSAT introduces a compact action tokenizer that uses B-Spline encoding and multi-stage residual quantization to compress action sequences, enabling faster auto-regressive training while preserving reconstruction quality.

Details

Motivation: Existing auto-regressive VLA models face efficiency issues with long action sequences, and prior compression methods struggled with poor reconstruction or inefficient compression.

Method: Uses B-Spline encoding for consistent representation, then applies multi-stage residual quantization to position, rotation, and gripper subspaces to produce compressed discrete tokens with coarse-to-fine granularity.

Result: Achieves 6.8× sequence length reduction, lowers target entropy, enables faster training convergence and improved model performance across real-robot and simulation experiments.

Conclusion: OmniSAT provides efficient compression while maintaining reconstruction quality, making auto-regressive VLA models more scalable and effective.

Abstract: Existing Vision-Language-Action (VLA) models can be broadly categorized into diffusion-based and auto-regressive (AR) approaches: diffusion models capture continuous action distributions but rely on computationally heavy iterative denoising. In contrast, AR models enable efficient optimization and flexible sequence construction, making them better suited for large-scale pretraining. To further improve AR efficiency, particularly when action chunks induce extended and high-dimensional sequences, prior work applies entropy-guided and token-frequency techniques to shorten the sequence length. However, such compression struggled with \textit{poor reconstruction or inefficient compression}. Motivated by this, we introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation. Specifically, we first normalize value ranges and temporal horizons to obtain a consistent representation with B-Spline encoding. Then, we apply multi-stage residual quantization to the position, rotation, and gripper subspaces, producing compressed discrete tokens with coarse-to-fine granularity for each part. After pre-training on the large-scale dataset Droid, the resulting discrete tokenization shortens the training sequence by 6.8$\times$, and lowers the target entropy. To further explore the potential of OmniSAT, we develop a cross-embodiment learning strategy that builds on the unified action-pattern space and jointly leverages robot and human demonstrations. It enables scalable auxiliary supervision from heterogeneous egocentric videos. Across diverse real-robot and simulation experiments, OmniSAT encompasses higher compression while preserving reconstruction quality, enabling faster AR training convergence and model performance.

[285] Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning

Hao Tang, Shengfeng He, Jing Qin

Main category: cs.CV

TL;DR: SynTrans is a novel few-shot learning framework that transfers diverse knowledge from large multimodal models to enhance few-shot learners through synergistic knowledge mining and bi-directional visual-semantic knowledge transfer.

Details

Motivation: Few-shot learning faces data scarcity challenges, and existing methods using semantic knowledge from smaller models often introduce noise and bias due to data simplicity.

Method: Uses CLIP as teacher and few-shot vision encoder as student, with unsupervised proxy task for knowledge distillation. Includes synergistic knowledge mining module, visual-semantic bridging module, and adaptive classifier construction with visual weight generator and semantic weight reconstructor.

Result: Experimental results on four FSL datasets show SynTrans significantly outperforms current state-of-the-art methods, even when paired with a simple few-shot vision encoder.

Conclusion: SynTrans effectively addresses FSL challenges by leveraging large multimodal models’ knowledge through synergistic transfer mechanisms, achieving superior performance over existing approaches.

Abstract: Few-shot learning (FSL) addresses the challenge of classifying novel classes with limited training samples. While some methods leverage semantic knowledge from smaller-scale models to mitigate data scarcity, these approaches often introduce noise and bias due to the data’s inherent simplicity. In this paper, we propose a novel framework, Synergistic Knowledge Transfer (SynTrans), which effectively transfers diverse and complementary knowledge from large multimodal models to empower the off-the-shelf few-shot learner. Specifically, SynTrans employs CLIP as a robust teacher and uses a few-shot vision encoder as a weak student, distilling semantic-aligned visual knowledge via an unsupervised proxy task. Subsequently, a training-free synergistic knowledge mining module facilitates collaboration among large multimodal models to extract high-quality semantic knowledge. Building upon this, a visual-semantic bridging module enables bi-directional knowledge transfer between visual and semantic spaces, transforming explicit visual and implicit semantic knowledge into category-specific classifier weights. Finally, SynTrans introduces a visual weight generator and a semantic weight reconstructor to adaptively construct optimal multimodal FSL classifiers. Experimental results on four FSL datasets demonstrate that SynTrans, even when paired with a simple few-shot vision encoder, significantly outperforms current state-of-the-art methods.

[286] Knowledge-Aware Mamba for Joint Change Detection and Classification from MODIS Times Series

Zhengsen Xu, Yimin Zhu, Zack Dewis, Mabel Heffring, Motasem Alkayid, Saeid Taleghanidoozdoozan, Lincoln Linlin Xu

Main category: cs.CV

TL;DR: KAMamba: A knowledge-aware Mamba model for MODIS change detection that uses transition matrices, multi-task learning, and spatial-spectral-temporal modules to improve accuracy while reducing computational costs.

Details

Motivation: MODIS change detection faces challenges from mixed pixels, information coupling effects, and background class heterogeneity, requiring more effective methods.

Method: Proposes KAMamba with: 1) Knowledge-aware transition loss using class transition matrices, 2) Multi-task learning with three losses (PreC, PostC, Chg), 3) SSTMamba modules to disentangle spatial-spectral-temporal information, 4) SDMamba backbone for efficiency.

Result: On Saskatchewan MODIS dataset: 1.5-6% F1 improvement for change detection, ~2% gains in OA, AA, and Kappa for LULC classification over baselines.

Conclusion: KAMamba effectively addresses MODIS challenges through knowledge integration, multi-task learning, and efficient Mamba architecture, achieving significant performance improvements.

Abstract: Although change detection using MODIS time series is critical for environmental monitoring, it is a highly challenging task due to key MODIS difficulties, e.g., mixed pixels, spatial-spectral-temporal information coupling effect, and background class heterogeneity. This paper presents a novel knowledge-aware Mamba (KAMamba) for enhanced MODIS change detection, with the following contributions. First, to leverage knowledge regarding class transitions, we design a novel knowledge-driven transition-matrix-guided approach, leading to a knowledge-aware transition loss (KAT-loss) that can enhance detection accuracies. Second, to improve model constraints, a multi-task learning approach is designed, where three losses, i.e., pre-change classification loss (PreC-loss), post-change classification loss (PostC-loss), and change detection loss (Chg-loss) are used for improve model learning. Third, to disentangle information coupling in MODIS time series, novel spatial-spectral-temporal Mamba (SSTMamba) modules are designed. Last, to improve Mamba model efficiency and remove computational cost, a sparse and deformable Mamba (SDMamba) backbone is used in SSTMamba. On the MODIS time-series dataset for Saskatchewan, Canada, we evaluate the method on land-cover change detection and LULC classification; results show about 1.5-6% gains in average F1 for change detection over baselines, and about 2% improvements in OA, AA, and Kappa for LULC classification.

[287] CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation

Zhenyu Lu, Liupeng Li, Jinpeng Wang, Yan Feng, Bin Chen, Ke Chen, Yaowei Wang

Main category: cs.CV

TL;DR: CoPRS introduces a Multi-modal Chain-of-Thought based positional perception model that bridges language reasoning to segmentation through a differentiable heatmap prior, improving interpretability and achieving state-of-the-art performance on segmentation benchmarks.

Details

Motivation: Existing reasoning segmentation methods either connect hidden language features directly to mask decoders or use text position representations, which limits interpretability and semantic detail.

Method: Uses a learnable concentration token to aggregate image and reasoning text features, generating a differentiable positional prior heatmap that is decoded to precise masks through a lightweight decoder.

Result: Matches or surpasses best reported metrics on RefCOCO series and ReasonSeg benchmarks under comparable protocols, achieving state-of-the-art performance across validation and test partitions.

Conclusion: The heatmap quality strongly influences mask quality, supporting consistent association between reasoning output and mask generation, demonstrating advantages in concentration and precision for bridging reasoning and segmentation.

Abstract: Existing works on reasoning segmentation either connect hidden features from a language model directly to a mask decoder or represent positions in text, which limits interpretability and semantic detail. To solve this, we present CoPRS, a Multi-modal Chain-of-Thought (MCoT)-based positional perception model that bridges language reasoning to segmentation through a differentiable and interpretable positional prior instantiated as a heatmap. By making the reasoning process clear via MCoT and expressing it as a dense, differentiable heatmap, this interface enhances interpretability and diagnostic analysis and yields more concentrated evidence on the target. A learnable concentration token aggregates features of the image and reasoning text to generate this positional prior, which is decoded to precise masks through a lightweight decoder, providing a direct connection between reasoning and segmentation. Across the RefCOCO series and ReasonSeg, CoPRS matches or surpasses the best reported metrics on each standard split under comparable protocols, with performance at or above prior state of the art across both validation and test partitions. Extensive experiments reveal that the quality of the heatmap strongly influences the resulting mask quality, supporting a consistent association between the reasoning output and downstream mask generation. Collectively, these findings support the utility of this paradigm in bridging reasoning and segmentation and show advantages in concentration driven by reasoning and predicting masks more precisely. Code, checkpoints and logs are released at https://github.com/ZhenyuLU-Heliodore/CoPRS.git.

[288] NNDM: NN_UNet Diffusion Model for Brain Tumor Segmentation

Sashank Makanaboyina

Main category: cs.CV

TL;DR: NNDM combines NN-UNet with diffusion models to improve brain tumor segmentation in MRI, achieving better boundary precision and generalization than existing methods.

Details

Motivation: Current CNN models like U-Net struggle with generalization, boundary precision, and limited data diversity in brain tumor segmentation from MRI scans.

Method: Hybrid framework integrating NN-UNet’s feature extraction with diffusion probabilistic models that iteratively refine segmentation masks by learning residual error distributions.

Result: Superior performance on BraTS 2021 datasets with improvements in Dice coefficient and Hausdorff distance metrics, enhanced robustness across modalities and tumor subregions.

Conclusion: NNDM establishes a new direction for combining deterministic segmentation networks with stochastic diffusion models, advancing automated brain tumor analysis.

Abstract: Accurate detection and segmentation of brain tumors in magnetic resonance imaging (MRI) are critical for effective diagnosis and treatment planning. Despite advances in convolutional neural networks (CNNs) such as U-Net, existing models often struggle with generalization, boundary precision, and limited data diversity. To address these challenges, we propose NNDM (NN_UNet Diffusion Model)a hybrid framework that integrates the robust feature extraction of NN-UNet with the generative capabilities of diffusion probabilistic models. In our approach, the diffusion model progressively refines the segmentation masks generated by NN-UNet by learning the residual error distribution between predicted and ground-truth masks. This iterative denoising process enables the model to correct fine structural inconsistencies and enhance tumor boundary delineation. Experiments conducted on the BraTS 2021 datasets demonstrate that NNDM achieves superior performance compared to conventional U-Net and transformer-based baselines, yielding improvements in Dice coefficient and Hausdorff distance metrics. Moreover, the diffusion-guided refinement enhances robustness across modalities and tumor subregions. The proposed NNDM establishes a new direction for combining deterministic segmentation networks with stochastic diffusion models, advancing the state of the art in automated brain tumor analysis.

[289] Adaptive Fusion Network with Temporal-Ranked and Motion-Intensity Dynamic Images for Micro-expression Recognition

Thi Bich Phuong Man, Luu Tu Nguyen, Vu Tram Anh Khuong, Thanh Ha Le, Thi Duyen Ngo

Main category: cs.CV

TL;DR: A novel micro-expression recognition method using two complementary dynamic image representations and an adaptive fusion network, achieving state-of-the-art performance on benchmark datasets.

Details

Motivation: Micro-expressions reveal genuine emotions but are subtle and hard to detect, making them valuable for lie detection, behavioral analysis, and psychological assessment.

Method: Proposes two representations: Temporal-ranked dynamic image (emphasizes temporal progression) and Motion-intensity dynamic image (highlights subtle motions through frame reordering). Uses Adaptive fusion network to optimally integrate these representations.

Result: Achieved 93.95% Accuracy and 0.897 UF1 on CASME-II (new SOTA), 82.47% Accuracy and 0.665 UF1 on SAMM, and 76.00% Accuracy on MMEW, demonstrating strong generalization.

Conclusion: Both the input representations and proposed architecture significantly improve MER performance, providing foundation for affective computing, lie detection, and human-computer interaction applications.

Abstract: Micro-expressions (MEs) are subtle, transient facial changes with very low intensity, almost imperceptible to the naked eye, yet they reveal a person genuine emotion. They are of great value in lie detection, behavioral analysis, and psychological assessment. This paper proposes a novel MER method with two main contributions. First, we propose two complementary representations - Temporal-ranked dynamic image, which emphasizes temporal progression, and Motion-intensity dynamic image, which highlights subtle motions through a frame reordering mechanism incorporating motion intensity. Second, we propose an Adaptive fusion network, which automatically learns to optimally integrate these two representations, thereby enhancing discriminative ME features while suppressing noise. Experiments on three benchmark datasets (CASME-II, SAMM and MMEW) demonstrate the superiority of the proposed method. Specifically, AFN achieves 93.95 Accuracy and 0.897 UF1 on CASME-II, setting a new state-of-the-art benchmark. On SAMM, the method attains 82.47 Accuracy and 0.665 UF1, demonstrating more balanced recognition across classes. On MMEW, the model achieves 76.00 Accuracy, further confirming its generalization ability. The obtained results show that both the input and the proposed architecture play important roles in improving the performance of MER. Moreover, they provide a solid foundation for further research and practical applications in the fields of affective computing, lie detection, and human-computer interaction.

[290] Multi Camera Connected Vision System with Multi View Analytics: A Comprehensive Survey

Muhammad Munsif, Waqas Ahmad, Amjid Ali, Mohib Ullah, Adnan Hussain, Sung Wook Baik

Main category: cs.CV

TL;DR: This survey provides the first comprehensive review of multi-view multi-camera (MVMC) systems that unifies tracking, re-identification, and action understanding into a single framework for Connected Vision Systems.

Details

Motivation: Existing surveys focus on isolated tasks and single-view setups, neglecting the integration and complexities of multi-camera collaboration needed for real-world CVS applications like autonomous vehicles and smart cities.

Method: Proposes a unique taxonomy dividing CVS into four key parts: MVMC tracking, Re-ID, action understanding, and combined methods. Systematically reviews state-of-the-art datasets, methodologies, results, and evaluation metrics.

Result: Provides a structured view of the field’s progression and identifies open research challenges including occlusions, diverse viewpoints, environmental variability, and emerging technologies like lifelong learning and federated learning.

Conclusion: Outlines key research directions for enhancing robustness, efficiency, and adaptability of CVS in complex real-world applications, aiming to inspire next-generation intelligent and adaptive vision systems.

Abstract: Connected Vision Systems (CVS) are transforming a variety of applications, including autonomous vehicles, smart cities, surveillance, and human-robot interaction. These systems harness multi-view multi-camera (MVMC) data to provide enhanced situational awareness through the integration of MVMC tracking, re-identification (Re-ID), and action understanding (AU). However, deploying CVS in real-world, dynamic environments presents a number of challenges, particularly in addressing occlusions, diverse viewpoints, and environmental variability. Existing surveys have focused primarily on isolated tasks such as tracking, Re-ID, and AU, often neglecting their integration into a cohesive system. These reviews typically emphasize single-view setups, overlooking the complexities and opportunities provided by multi-camera collaboration and multi-view data analysis. To the best of our knowledge, this survey is the first to offer a comprehensive and integrated review of MVMC that unifies MVMC tracking, Re-ID, and AU into a single framework. We propose a unique taxonomy to better understand the critical components of CVS, dividing it into four key parts: MVMC tracking, Re-ID, AU, and combined methods. We systematically arrange and summarize the state-of-the-art datasets, methodologies, results, and evaluation metrics, providing a structured view of the field’s progression. Furthermore, we identify and discuss the open research questions and challenges, along with emerging technologies such as lifelong learning, privacy, and federated learning, that need to be addressed for future advancements. The paper concludes by outlining key research directions for enhancing the robustness, efficiency, and adaptability of CVS in complex, real-world applications. We hope this survey will inspire innovative solutions and guide future research toward the next generation of intelligent and adaptive CVS.

[291] Exploration of Incremental Synthetic Non-Morphed Images for Single Morphing Attack Detection

David Benavente-Rios, Juan Ruiz Rodriguez, Gustavo Gatica

Main category: cs.CV

TL;DR: Using synthetic face data can improve Single-Morphing Attack Detection (S-MAD) when carefully integrated with real data, but relying solely on synthetic data leads to poor performance.

Details

Motivation: Address privacy concerns and data limitations in face morphing detection by exploring synthetic data alternatives to supplement scarce bona fide images.

Method: Employed various morphing tools and cross-dataset evaluation with incremental testing protocol to assess generalization as synthetic images were added during training.

Result: Generalization improves with controlled synthetic data integration or gradual addition of bona fide images, but indiscriminate synthetic use causes sub-optimal performance. Pure synthetic data achieves highest EER.

Conclusion: Synthetic data can enhance S-MAD when carefully combined with real data, but should not be used exclusively as it yields the worst performance in operational scenarios.

Abstract: This paper investigates the use of synthetic face data to enhance Single-Morphing Attack Detection (S-MAD), addressing the limitations of availability of large-scale datasets of bona fide images due to privacy concerns. Various morphing tools and cross-dataset evaluation schemes were utilized to conduct this study. An incremental testing protocol was implemented to assess the generalization capabilities as more and more synthetic images were added. The results of the experiments show that generalization can be improved by carefully incorporating a controlled number of synthetic images into existing datasets or by gradually adding bona fide images during training. However, indiscriminate use of synthetic data can lead to sub-optimal performance. Evenmore, the use of only synthetic data (morphed and non-morphed images) achieves the highest Equal Error Rate (EER), which means in operational scenarios the best option is not relying only on synthetic data for S-MAD.

[292] SMC++: Masked Learning of Unsupervised Video Semantic Compression

Yuan Tian, Xiaoyue Ling, Cong Geng, Qiang Hu, Guo Lu, Guangtao Zhai

Main category: cs.CV

TL;DR: A video compression framework that preserves semantics using Masked Video Modeling, with SMC and SMC++ models that outperform traditional codecs on video analysis tasks.

Details

Motivation: Traditional video compression focuses on human visual perception but neglects semantic preservation, causing semantic loss that hampers downstream video analysis tasks.

Method: Uses Masked Video Modeling to jointly mine and compress semantics in self-supervised manner, with explicit regularization of non-semantic entropy. SMC++ adds masked motion prediction and Transformer-based compression with blueprint semantic representation.

Result: SMC and SMC++ models show remarkable superiority over traditional, learnable, and perceptual quality-oriented video codecs on three video analysis tasks across seven datasets.

Conclusion: The proposed semantic-preserving video compression framework effectively maintains video semantics while achieving superior performance on downstream analysis tasks compared to existing methods.

Abstract: Most video compression methods focus on human visual perception, neglecting semantic preservation. This leads to severe semantic loss during the compression, hampering downstream video analysis tasks. In this paper, we propose a Masked Video Modeling (MVM)-powered compression framework that particularly preserves video semantics, by jointly mining and compressing the semantics in a self-supervised manner. While MVM is proficient at learning generalizable semantics through the masked patch prediction task, it may also encode non-semantic information like trivial textural details, wasting bitcost and bringing semantic noises. To suppress this, we explicitly regularize the non-semantic entropy of the compressed video in the MVM token space. The proposed framework is instantiated as a simple Semantic-Mining-then-Compression (SMC) model. Furthermore, we extend SMC as an advanced SMC++ model from several aspects. First, we equip it with a masked motion prediction objective, leading to better temporal semantic learning ability. Second, we introduce a Transformer-based compression module, to improve the semantic compression efficacy. Considering that directly mining the complex redundancy among heterogeneous features in different coding stages is non-trivial, we introduce a compact blueprint semantic representation to align these features into a similar form, fully unleashing the power of the Transformer-based compression module. Extensive results demonstrate the proposed SMC and SMC++ models show remarkable superiority over previous traditional, learnable, and perceptual quality-oriented video codecs, on three video analysis tasks and seven datasets. \textit{Codes and model are available at: https://github.com/tianyuan168326/VideoSemanticCompression-Pytorch.

[293] Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping

Dwip Dalal, Gautam Vashishtha, Utkarsh Mishra, Jeonghwan Kim, Madhav Kanda, Hyeonjeong Ha, Svetlana Lazebnik, Heng Ji, Unnat Jain

Main category: cs.CV

TL;DR: AttWarp is a lightweight method that uses MLLMs’ cross-modal attention to warp input images, allocating more resolution to query-relevant regions while preserving global context, improving accuracy across multiple benchmarks without changing model weights.

Details

Motivation: MLLMs often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding.

Method: Uses MLLM’s cross-modal attention to perform rectilinear warping of input images, reallocating spatial resolution toward important regions while preserving all original image information non-uniformly.

Result: Consistently improves accuracy across five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four MLLMs, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines.

Conclusion: Attention-guided warping prioritizes query-relevant information while preserving context, and MLLMs perform better when given such warped inputs.

Abstract: Multimodal large language models (MLLMs) often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas, all while preserving global context. At test time, the approach uses an MLLM’s cross-modal attention to perform rectilinear warping of the input image, reallocating spatial resolution toward regions the model deems important, without changing model weights or architecture. This attention-guided warping preserves all original image information but redistributes it non-uniformly, so small objects and subtle relationships become easier for the same model to read while the global layout remains intact. Across five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time. Together, these results show that attention-guided warping prioritizes information relevant to the query while preserving context, and that the same MLLMs perform better when given such warped inputs.

[294] Explainable Human-in-the-Loop Segmentation via Critic Feedback Signals

Pouya Shaeri, Ryan T. Woo, Yasaman Mohammadpour, Ariane Middel

Main category: cs.CV

TL;DR: Human-in-the-loop interactive framework for segmentation that uses human corrections as interventional signals to steer models away from spurious correlations and toward robust semantic features, improving accuracy and reducing annotation effort.

Details

Motivation: Segmentation models often fail in real-world domains by relying on spurious correlations rather than true object boundaries, leading to poor generalization.

Method: Interactive framework where human corrections serve as interventional signals to identify when models rely on superficial features. The system propagates correction-informed edits across visually similar images to systematically correct failure modes.

Result: Improves segmentation accuracy by up to 9 mIoU points (12-15% relative improvement) on challenging cubemap data, with 3-4× reductions in annotation effort compared to standard retraining while maintaining competitive benchmark performance.

Conclusion: Provides a practical framework for building segmentation systems that are accurate, robust to dataset biases, data-efficient, and adaptable to real-world domains like urban climate monitoring and autonomous driving.

Abstract: Segmentation models achieve high accuracy on benchmarks but often fail in real-world domains by relying on spurious correlations instead of true object boundaries. We propose a human-in-the-loop interactive framework that enables interventional learning through targeted human corrections of segmentation outputs. Our approach treats human corrections as interventional signals that show when reliance on superficial features (e.g., color or texture) is inappropriate. The system learns from these interventions by propagating correction-informed edits across visually similar images, effectively steering the model toward robust, semantically meaningful features rather than dataset-specific artifacts. Unlike traditional annotation approaches that simply provide more training data, our method explicitly identifies when and why the model fails and then systematically corrects these failure modes across the entire dataset. Through iterative human feedback, the system develops increasingly robust representations that generalize better to novel domains and resist artifactual correlations. We demonstrate that our framework improves segmentation accuracy by up to 9 mIoU points (12-15% relative improvement) on challenging cubemap data and yields 3-4$\times$ reductions in annotation effort compared to standard retraining, while maintaining competitive performance on benchmark datasets. This work provides a practical framework for researchers and practitioners seeking to build segmentation systems that are accurate, robust to dataset biases, data-efficient, and adaptable to real-world domains such as urban climate monitoring and autonomous driving.

[295] Towards Understanding Ambiguity Resolution in Multimodal Inference of Meaning

Yufei Wang, Adriana Kovashka, Loretta Fernández, Marc N. Coutanche, Seth Wiener

Main category: cs.CV

TL;DR: Study examines how learners infer unfamiliar words using multimodal sentence-image pairs, analyzing data features and participant backgrounds that correlate with success.

Details

Motivation: To understand how learners can infer word meanings from multimodal contexts (sentences with paired images) and identify features that facilitate this learning process.

Method: Conducted studies with human participants using different image-text pairs, analyzed features of images and texts that help infer masked/unfamiliar words, and examined correlations with participant language backgrounds.

Result: Only some intuitive features strongly correlate with participant performance, suggesting need for further investigation of predictive features. AI systems show potential but need improvement in reasoning about participant performance.

Conclusion: The study reveals limited correlation between intuitive features and success in word inference tasks, highlighting the need for deeper investigation into predictive features and promising directions for improving AI reasoning about human learning performance.

Abstract: We investigate a new setting for foreign language learning, where learners infer the meaning of unfamiliar words in a multimodal context of a sentence describing a paired image. We conduct studies with human participants using different image-text pairs. We analyze the features of the data (i.e., images and texts) that make it easier for participants to infer the meaning of a masked or unfamiliar word, and what language backgrounds of the participants correlate with success. We find only some intuitive features have strong correlations with participant performance, prompting the need for further investigating of predictive features for success in these tasks. We also analyze the ability of AI systems to reason about participant performance, and discover promising future directions for improving this reasoning ability.

[296] Scaling Traffic Insights with AI and Language Model-Powered Camera Systems for Data-Driven Transportation Decision Making

Fan Zuo, Donglin Zhou, Jingqin Gao, Kaan Ozbay

Main category: cs.CV

TL;DR: An AI framework using traffic cameras and fine-tuned YOLOv11 for scalable traffic monitoring, with graph-based viewpoint normalization and LLM summarization, validated on NYC congestion pricing rollout showing 9% vehicle reduction.

Details

Motivation: Need for accurate, scalable traffic monitoring during disruptions like disasters or policy changes, but limited by high sensor costs and existing video analytics' struggles with dynamic viewpoints and large data volumes.

Method: End-to-end AI framework with fine-tuned YOLOv11 for real-time traffic detection, graph-based viewpoint normalization for camera inconsistencies, and domain-specific LLM for automated traffic pattern summaries from 24/7 video streams.

Result: Validated on 9M+ images from 1,000 NYC cameras during congestion pricing rollout: 9% weekday passenger vehicle decline in Congestion Relief Zone, early truck reductions with rebound signs, increased pedestrian/cyclist activity. LLM accuracy improved with example prompts.

Conclusion: Framework provides practical, infrastructure-ready solution for large-scale traffic monitoring with minimal human intervention, demonstrating policy-relevant insights for transportation management.

Abstract: Accurate, scalable traffic monitoring is critical for real-time and long-term transportation management, particularly during disruptions such as natural disasters, large construction projects, or major policy changes like New York City’s first-in-the-nation congestion pricing program. However, widespread sensor deployment remains limited due to high installation, maintenance, and data management costs. While traffic cameras offer a cost-effective alternative, existing video analytics struggle with dynamic camera viewpoints and massive data volumes from large camera networks. This study presents an end-to-end AI-based framework leveraging existing traffic camera infrastructure for high-resolution, longitudinal analysis at scale. A fine-tuned YOLOv11 model, trained on localized urban scenes, extracts multimodal traffic density and classification metrics in real time. To address inconsistencies from non-stationary pan-tilt-zoom cameras, we introduce a novel graph-based viewpoint normalization method. A domain-specific large language model was also integrated to process massive data from a 24/7 video stream to generate frequent, automated summaries of evolving traffic patterns, a task far exceeding manual capabilities. We validated the system using over 9 million images from roughly 1,000 traffic cameras during the early rollout of NYC congestion pricing in 2025. Results show a 9% decline in weekday passenger vehicle density within the Congestion Relief Zone, early truck volume reductions with signs of rebound, and consistent increases in pedestrian and cyclist activity at corridor and zonal scales. Experiments showed that example-based prompts improved LLM’s numerical accuracy and reduced hallucinations. These findings demonstrate the framework’s potential as a practical, infrastructure-ready solution for large-scale, policy-relevant traffic monitoring with minimal human intervention.

[297] Task-Aware Resolution Optimization for Visual Large Language Models

Weiqing Luo, Zhen Tan, Yifan Li, Xinyu Zhao, Kwonjoon Lee, Behzad Dariush, Tianlong Chen

Main category: cs.CV

TL;DR: This paper proposes a method to determine optimal image resolution for vision-language tasks and extends VLLMs to support these resolutions, improving performance.

Details

Motivation: Existing VLLMs use fixed resolutions for all tasks, leading to suboptimal performance. Different vision-language tasks require varying perceptual granularity.

Method: 1) Investigated resolution preferences across tasks, finding correlation with image complexity and VLLM uncertainty variance. 2) Proposed empirical formula to determine optimal resolution. 3) Developed parameter-efficient fine-tuning to extend VLLMs to optimal resolutions.

Result: Extensive experiments on various vision-language tasks validate the effectiveness of the proposed method.

Conclusion: The approach successfully addresses the resolution limitation in VLLMs and improves performance across different vision-language tasks.

Abstract: Real-world vision-language applications demand varying levels of perceptual granularity. However, most existing visual large language models (VLLMs), such as LLaVA, pre-assume a fixed resolution for downstream tasks, which leads to subpar performance. To address this problem, we first conduct a comprehensive and pioneering investigation into the resolution preferences of different vision-language tasks, revealing a correlation between resolution preferences with image complexity, and uncertainty variance of the VLLM at different image input resolutions. Building on this insight, we propose an empirical formula to determine the optimal resolution for a given vision-language task, combining these two factors. Second, based on rigorous experiments, we propose a novel parameter-efficient fine-tuning technique to extend the visual input resolution of pre-trained VLLMs to the identified optimal resolution. Extensive experiments on various vision-language tasks validate the effectiveness of our method.

[298] Uncertainty-Aware Post-Detection Framework for Enhanced Fire and Smoke Detection in Compact Deep Learning Models

Aniruddha Srinivas Joshi, Godwyn James William, Shreyas Srinivas Joshi

Main category: cs.CV

TL;DR: Proposes an uncertainty-aware post-detection framework that rescales detection confidences using statistical uncertainty and visual cues to improve fire and smoke detection in compact deep learning models.

Details

Motivation: Existing compact deep learning models like YOLOv5n/YOLOv8n for fire detection suffer from false positives and missed detections, while conventional post-detection methods (NMS/Soft-NMS) rely only on spatial overlap and can suppress true positives or retain false alarms.

Method: A lightweight Confidence Refinement Network integrates uncertainty estimates with color, edge, and texture features to adjust detection scores without modifying the base model.

Result: Experiments on D-Fire dataset show improved precision, recall, and mean average precision compared to baselines, with modest computational overhead.

Conclusion: Post-detection rescoring effectively enhances robustness of compact deep learning models for real-world fire and smoke detection.

Abstract: Accurate fire and smoke detection is critical for safety and disaster response, yet existing vision-based methods face challenges in balancing efficiency and reliability. Compact deep learning models such as YOLOv5n and YOLOv8n are widely adopted for deployment on UAVs, CCTV systems, and IoT devices, but their reduced capacity often results in false positives and missed detections. Conventional post-detection methods such as Non-Maximum Suppression and Soft-NMS rely only on spatial overlap, which can suppress true positives or retain false alarms in cluttered or ambiguous fire scenes. To address these limitations, we propose an uncertainty aware post-detection framework that rescales detection confidences using both statistical uncertainty and domain relevant visual cues. A lightweight Confidence Refinement Network integrates uncertainty estimates with color, edge, and texture features to adjust detection scores without modifying the base model. Experiments on the D-Fire dataset demonstrate improved precision, recall, and mean average precision compared to existing baselines, with only modest computational overhead. These results highlight the effectiveness of post-detection rescoring in enhancing the robustness of compact deep learning models for real-world fire and smoke detection.

[299] Real-Time Position-Aware View Synthesis from Single-View Input

Manu Gond, Emin Zerman, Sebastian Knorr, Mårten Sjöström

Main category: cs.CV

TL;DR: A lightweight position-aware network for real-time view synthesis from single images and target camera poses, achieving superior efficiency and visual quality without explicit geometric operations.

Details

Motivation: To address the limitations of state-of-the-art view synthesis methods that achieve high visual quality but lack real-time performance, making them unsuitable for live applications requiring low latency.

Method: A framework with Position Aware Embedding that maps target pose information to high-dimensional feature maps, and a Rendering Network with dual encoder branches that merges features to resolve both high and low level details.

Result: Experimental results show superior efficiency and visual quality compared to existing approaches, particularly in handling complex translational movements without explicit geometric operations like warping.

Conclusion: This work represents a step toward enabling real-time live and interactive telepresence applications by providing efficient view synthesis.

Abstract: Recent advancements in view synthesis have significantly enhanced immersive experiences across various computer graphics and multimedia applications, including telepresence and entertainment. By enabling the generation of new perspectives from a single input view, view synthesis allows users to better perceive and interact with their environment. However, many state-of-the-art methods, while achieving high visual quality, face limitations in real-time performance, which makes them less suitable for live applications where low latency is critical. In this paper, we present a lightweight, position-aware network designed for real-time view synthesis from a single input image and a target camera pose. The proposed framework consists of a Position Aware Embedding, which efficiently maps positional information from the target pose to generate high dimensional feature maps. These feature maps, along with the input image, are fed into a Rendering Network that merges features from dual encoder branches to resolve both high and low level details, producing a realistic new view of the scene. Experimental results demonstrate that our method achieves superior efficiency and visual quality compared to existing approaches, particularly in handling complex translational movements without explicit geometric operations like warping. This work marks a step toward enabling real-time live and interactive telepresence applications.

[300] Post Processing of image segmentation using Conditional Random Fields

Aashish Dhawan, Pankaj Bodani, Vishal Garg

Main category: cs.CV

TL;DR: This study evaluates different Conditional Random Field (CRF) models to improve image segmentation clarity in satellite imagery with low-quality features.

Details

Motivation: Satellite image segmentation often produces unclear results due to low-quality features, requiring better CRF models to achieve improved clarity.

Method: Tested various CRF types on two datasets: low-quality satellite imagery and high-quality aerial photographs, comparing performance across different approaches.

Result: Identified which CRF models perform best on different image types, revealing the strengths and limitations of various CRF approaches for segmentation.

Conclusion: The study demonstrates the varying effectiveness of different CRF models for image segmentation, highlighting both pitfalls and potentials of each approach for satellite and aerial imagery.

Abstract: The output of image the segmentation process is usually not very clear due to low quality features of Satellite images. The purpose of this study is to find a suitable Conditional Random Field (CRF) to achieve better clarity in a segmented image. We started with different types of CRFs and studied them as to why they are or are not suitable for our purpose. We evaluated our approach on two different datasets - Satellite imagery having low quality features and high quality Aerial photographs. During the study we experimented with various CRFs to find which CRF gives the best results on images and compared our results on these datasets to show the pitfalls and potentials of different approaches.

[301] YOLOv11-Litchi: Efficient Litchi Fruit Detection based on UAV-Captured Agricultural Imagery in Complex Orchard Environments

Hongxing Peng, Haopei Xie, Weijia Lia, Huanai Liuc, Ximing Li

Main category: cs.CV

TL;DR: YOLOv11-Litchi is a lightweight deep learning model for UAV-based litchi detection that reduces parameters by 32.5% while improving accuracy and achieving real-time performance.

Details

Motivation: Traditional manual litchi selection methods are inadequate for modern production demands, requiring automated solutions using UAV imagery and deep learning to enhance efficiency and reduce costs.

Method: Built on YOLOv11 framework with three innovations: multi-scale residual module for contextual feature extraction, lightweight feature fusion to reduce model size, and litchi occlusion detection head to mitigate occlusion effects.

Result: Achieves 6.35 MB parameter size (32.5% smaller than baseline), 90.1% mAP (+2.5%), 85.5% F1-Score (+1.4%), and 57.2 FPS for real-time detection.

Conclusion: YOLOv11-Litchi is suitable for UAV-based litchi detection in complex orchard environments and shows potential for broader precision agriculture applications.

Abstract: Litchi is a high-value fruit, yet traditional manual selection methods are increasingly inadequate for modern production demands. Integrating UAV-based aerial imagery with deep learning offers a promising solution to enhance efficiency and reduce costs. This paper introduces YOLOv11-Litchi, a lightweight and robust detection model specifically designed for UAV-based litchi detection. Built upon the YOLOv11 framework, the proposed model addresses key challenges such as small target size, large model parameters hindering deployment, and frequent target occlusion. To tackle these issues, three major innovations are incorporated: a multi-scale residual module to improve contextual feature extraction across scales, a lightweight feature fusion method to reduce model size and computational costs while maintaining high accuracy, and a litchi occlusion detection head to mitigate occlusion effects by emphasizing target regions and suppressing background interference. Experimental results validate the model’s effectiveness. YOLOv11-Litchi achieves a parameter size of 6.35 MB - 32.5% smaller than the YOLOv11 baseline

while improving mAP by 2.5% to 90.1% and F1-Score by 1.4% to 85.5%. Additionally, the model achieves a frame rate of 57.2 FPS, meeting real-time detection requirements. These findings demonstrate the suitability of YOLOv11-Litchi for UAV-based litchi detection in complex orchard environments, showcasing its potential for broader applications in precision agriculture.

[302] Goal-Based Vision-Language Driving

Santosh Patapati, Trisanth Srinivasan

Main category: cs.CV

TL;DR: NovaDrive is a single-branch vision-language architecture for autonomous driving that processes multiple sensor inputs (camera, HD-map, LiDAR, waypoints) using cross-attention fusion and achieves state-of-the-art performance with real-time inference.

Details

Motivation: Autonomous vehicles need millisecond-level reaction times while reasoning about complex road geometry and traffic intent, requiring efficient multi-modal fusion without recurrent memory overhead.

Method: Uses single-branch architecture with two-stage cross-attention to align waypoint tokens with HD map, then refine attention over image and depth patches. Employs smoothness loss to prevent abrupt steering/speed changes and fine-tunes top 15 layers of 11B LLaMA-3.2 backbone.

Result: Achieves 84% success rate (+4%), 0.66 SPL (+0.11), reduces collisions from 2.6% to 1.2% (-1.4%) on nuScenes/Waymo benchmark. Key contributions: waypoint tokens, partial VLM fine-tuning, and cross-attention fusion.

Conclusion: NovaDrive enables real-time autonomous driving with improved safety and efficiency, shorter routes reducing fuel/battery usage, and can be extended to other embodied-AI domains.

Abstract: Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a single-branch vision-language architecture that processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints in a single branch. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory. We fine-tune the top 15 layers of an 11B LLaMA-3.2 vision-language backbone, enabling real-time inference. On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84% (+4%), boosts path-efficiency (SPL) to 0.66 (+0.11), and reduces collision frequency from 2.6% to 1.2% (-1.4%) relative to the previous state-of-the-art. Our ablations confirm that waypoint tokens, partial VLM fine-tuning, and the cross-attention fusion each contribute the most to these gains. Beyond safety, NovaDrive’s shorter routes (resulting from the novel smoothness loss) translate to lower fuel or battery usage, pointing toward leaner, more easily updated driving stacks. NovaDrive can be extended to other embodied-AI domains as well.

[303] Cell Instance Segmentation: The Devil Is in the Boundaries

Peixian Liang, Yifan Ding, Yizhe Zhang, Jianxu Chen, Hao Zheng, Hongxiao Wang, Yejia Zhang, Guangyu Meng, Tim Weninger, Michael Niemier, X. Sharon Hu, Danny Z Chen

Main category: cs.CV

TL;DR: Ceb is a novel pixel clustering method for cell instance segmentation that leverages cell boundary features and labels to divide foreground pixels into cell instances, outperforming existing pixel clustering methods and achieving competitive performance with state-of-the-art approaches.

Details

Motivation: Existing deep learning methods for cell instance segmentation use pixel-wise objectives that may lose important geometric properties of cell instances like shape, curvature, and convexity, which require collections of pixels to represent.

Method: Ceb extracts potential foreground-foreground boundaries using a revised Watershed algorithm, constructs boundary signatures by sampling pixels from current and neighboring boundaries, uses a boundary classifier to predict binary boundary labels, and then divides/merges regions based on predicted labels.

Result: Extensive experiments on six datasets show Ceb outperforms existing pixel clustering methods on semantic segmentation probability maps and achieves highly competitive performance compared to state-of-the-art cell instance segmentation methods.

Conclusion: The Ceb method effectively addresses the limitations of pixel-wise objectives by leveraging boundary features and achieves superior performance in cell instance segmentation.

Abstract: State-of-the-art (SOTA) methods for cell instance segmentation are based on deep learning (DL) semantic segmentation approaches, focusing on distinguishing foreground pixels from background pixels. In order to identify cell instances from foreground pixels (e.g., pixel clustering), most methods decompose instance information into pixel-wise objectives, such as distances to foreground-background boundaries (distance maps), heat gradients with the center point as heat source (heat diffusion maps), and distances from the center point to foreground-background boundaries with fixed angles (star-shaped polygons). However, pixel-wise objectives may lose significant geometric properties of the cell instances, such as shape, curvature, and convexity, which require a collection of pixels to represent. To address this challenge, we present a novel pixel clustering method, called Ceb (for Cell boundaries), to leverage cell boundary features and labels to divide foreground pixels into cell instances. Starting with probability maps generated from semantic segmentation, Ceb first extracts potential foreground-foreground boundaries with a revised Watershed algorithm. For each boundary candidate, a boundary feature representation (called boundary signature) is constructed by sampling pixels from the current foreground-foreground boundary as well as the neighboring background-foreground boundaries. Next, a boundary classifier is used to predict its binary boundary label based on the corresponding boundary signature. Finally, cell instances are obtained by dividing or merging neighboring regions based on the predicted boundary labels. Extensive experiments on six datasets demonstrate that Ceb outperforms existing pixel clustering methods on semantic segmentation probability maps. Moreover, Ceb achieves highly competitive performance compared to SOTA cell instance segmentation methods.

[304] Guided Image Feature Matching using Feature Spatial Order

Chin-Hung Teng, Ben-Jian Dong

Main category: cs.CV

TL;DR: This paper proposes a method that integrates feature spatial order with epipolar geometry in a progressive matching framework to improve the efficiency and accuracy of image feature matching.

Details

Motivation: Traditional feature matching methods are time-consuming, especially for images with many features. Feature spatial order can complement epipolar geometry to guide feature matching more efficiently.

Method: The method uses initially matched features to build a computational model of feature spatial order, calculates possible spatial ranges for subsequent matches, filters unnecessary matches, and integrates with epipolar geometry. An image alignment method based on fundamental matrix removes rotation effects.

Result: Experiments on benchmark datasets, simulated images, and real images show the proposed method is significantly more efficient and accurate than traditional methods.

Conclusion: Integrating feature spatial order with epipolar geometry in a progressive framework effectively improves feature matching efficiency and accuracy.

Abstract: Image feature matching plays a vital role in many computer vision tasks. Although many image feature detection and matching techniques have been proposed over the past few decades, it is still time-consuming to match feature points in two images, especially for images with a large number of detected features. Feature spatial order can estimate the probability that a pair of features is correct. Since it is a completely independent concept from epipolar geometry, it can be used to complement epipolar geometry in guiding feature match in a target region so as to improve matching efficiency. In this paper, we integrate the concept of feature spatial order into a progressive matching framework. We use some of the initially matched features to build a computational model of feature spatial order and employs it to calculates the possible spatial range of subsequent feature matches, thus filtering out unnecessary feature matches. We also integrate it with epipolar geometry to further improve matching efficiency and accuracy. Since the spatial order of feature points is affected by image rotation, we propose a suitable image alignment method from the fundamental matrix of epipolar geometry to remove the effect of image rotation. To verify the feasibility of the proposed method, we conduct a series of experiments, including a standard benchmark dataset, self-generated simulated images, and real images. The results demonstrate that our proposed method is significantly more efficient and has more accurate feature matching than the traditional method.

[305] Cluster-Aware Prompt Ensemble Learning for Few-Shot Vision-Language Model Adaptation

Zhi Chen, Xin Yu, Xiaohui Tao, Yan Li, Zi Huang

Main category: cs.CV

TL;DR: CAPEL is a cluster-aware prompt ensemble learning framework that improves zero-shot classification by preserving prompt cluster structure and ensembling in logits space rather than feature space.

Details

Motivation: Conventional prompt ensembling methods average textual features, which shifts class centroids away from true distributions and yields suboptimal results.

Method: CAPEL classifies images into class clusters with distinct prompts, ensembles in classification logits space, uses cluster-preserving regularization, and adaptive prompt weighting.

Result: The method aligns better with visual feature distribution and maintains cluster-specific discriminative power for robust performance.

Conclusion: CAPEL effectively addresses limitations of conventional prompt ensembling by preserving cluster nature and optimizing prompt fine-tuning.

Abstract: Vision-language models (VLMs) such as CLIP achieve zero-shot transfer across various tasks by pre-training on numerous image-text pairs. These models often benefit from using an ensemble of context prompts to represent a class. Despite being effective, conventional prompt ensembling that averages textual features of context prompts often yields suboptimal results. This is because feature averaging shifts the class centroids away from the true class distribution. To address this issue, we propose the Cluster-Aware Prompt Ensemble Learning (CAPEL) framework, which preserves the cluster nature of context prompts. CAPEL classifies images into one of several class clusters, each represented by a distinct prompt. Instead of ensembling prompts in the feature space, we perform ensembling in the classification logits space, aligning better with the visual feature distribution. To further optimize prompt fine-tuning while maintaining cluster-specific discriminative power, we introduce a cluster-preserving regularization term. This ensures that prompts remain distinct and specialized for different clusters, preventing collapse into a uniform direction. Additionally, we integrate an adaptive prompt weighting technique to dynamically adjust the attention weights for flawed or ambiguous prompts, ensuring robust performance across diverse datasets and tasks.

[306] SceneTextStylizer: A Training-Free Scene Text Style Transfer Framework with Diffusion Model

Honghui Yuan, Keiji Yanai

Main category: cs.CV

TL;DR: SceneTextStylizer is a training-free diffusion-based framework for flexible and localized style transfer of text in scene images, enabling prompt-guided style transformation while preserving text readability.

Details

Motivation: Existing scene text editing methods are limited to content replacement and simple styles, lacking free-style transfer capabilities for localized text regions in scene images.

Method: Uses diffusion model inversion with a feature injection module for style transfer, region control mechanism with distance-based masks for spatial precision, and Fourier transform-based style enhancement for visual quality.

Result: Achieves superior performance in scene text style transformation, outperforming state-of-the-art methods in both visual fidelity and text preservation.

Conclusion: The proposed framework successfully enables flexible and high-fidelity style transfer for scene text while maintaining readability and stylistic consistency.

Abstract: With the rapid development of diffusion models, style transfer has made remarkable progress. However, flexible and localized style editing for scene text remains an unsolved challenge. Although existing scene text editing methods have achieved text region editing, they are typically limited to content replacement and simple styles, which lack the ability of free-style transfer. In this paper, we introduce SceneTextStylizer, a novel training-free diffusion-based framework for flexible and high-fidelity style transfer of text in scene images. Unlike prior approaches that either perform global style transfer or focus solely on textual content modification, our method enables prompt-guided style transformation specifically for text regions, while preserving both text readability and stylistic consistency. To achieve this, we design a feature injection module that leverages diffusion model inversion and self-attention to transfer style features effectively. Additionally, a region control mechanism is introduced by applying a distance-based changing mask at each denoising step, enabling precise spatial control. To further enhance visual quality, we incorporate a style enhancement module based on the Fourier transform to reinforce stylistic richness. Extensive experiments demonstrate that our method achieves superior performance in scene text style transformation, outperforming existing state-of-the-art methods in both visual fidelity and text preservation.

[307] Context Guided Transformer Entropy Modeling for Video Compression

Junlong Tong, Wei Zhang, Yaohui Jin, Xiaoyu Shen

Main category: cs.CV

TL;DR: Proposes Context Guided Transformer (CGT) entropy model that reduces video redundancy using resampled temporal context and dependency-weighted spatial context, achieving 65% faster entropy modeling and 11% BD-Rate reduction.

Details

Motivation: Existing conditional entropy models for video compression either introduce high computational cost when incorporating temporal context, or lack explicit modeling of spatial dependency ordering, limiting context availability during decoding.

Method: Uses temporal context resampler with predefined latent queries and transformer encoders to extract critical temporal information. Employs teacher-student network to generate attention and entropy maps for dependency-weighted spatial context selection, where student selects top-k tokens with highest spatial dependency.

Result: Reduces entropy modeling time by approximately 65% and achieves 11% BD-Rate reduction compared to previous state-of-the-art conditional entropy model.

Conclusion: The CGT entropy model effectively balances computational efficiency and compression performance by optimizing both temporal and spatial context modeling through resampling and dependency-weighted selection mechanisms.

Abstract: Conditional entropy models effectively leverage spatio-temporal contexts to reduce video redundancy. However, incorporating temporal context often introduces additional model complexity and increases computational cost. In parallel, many existing spatial context models lack explicit modeling the ordering of spatial dependencies, which may limit the availability of relevant context during decoding. To address these issues, we propose the Context Guided Transformer (CGT) entropy model, which estimates probability mass functions of the current frame conditioned on resampled temporal context and dependency-weighted spatial context. A temporal context resampler learns predefined latent queries to extract critical temporal information using transformer encoders, reducing downstream computational overhead. Meanwhile, a teacher-student network is designed as dependency-weighted spatial context assigner to explicitly model the dependency of spatial context order. The teacher generates an attention map to represent token importance and an entropy map to reflect prediction certainty from randomly masked inputs, guiding the student to select the weighted top-k tokens with the highest spatial dependency. During inference, only the student is used to predict undecoded tokens based on high-dependency context. Experimental results demonstrate that our CGT model reduces entropy modeling time by approximately 65% and achieves an 11% BD-Rate reduction compared to the previous state-of-the-art conditional entropy model.

[308] Fast Self-Supervised depth and mask aware Association for Multi-Object Tracking

Milad Khanchi, Maria Amer, Charalambos Poullis

Main category: cs.CV

TL;DR: Proposes a multi-object tracking method that uses fused depth and mask features through a self-supervised encoder for object representation, avoiding computationally expensive segmentation IoU calculations while improving tracking performance.

Details

Motivation: Traditional MOT methods rely on IoU for association, which becomes unreliable with similar or occluded objects, and computing segmentation IoU is computationally expensive.

Method: Fuses depth and mask features using a compact self-supervised encoder to produce stable object representations, combining these with bounding box IoU and re-identification features for matching. Uses zero-shot depth estimator and promptable visual segmentation model for spatial cues.

Result: Outperforms TBD state-of-the-art on challenging benchmarks with non-linear motion, occlusion, and crowded scenes (SportsMOT, DanceTrack), while achieving competitive performance on simpler benchmarks with linear motion (MOT17).

Conclusion: The proposed method effectively addresses limitations of IoU-based tracking by using fused depth-mask features through self-supervised learning, providing robust object representations without expensive segmentation IoU computations.

Abstract: Multi-object tracking (MOT) methods often rely on Intersection-over-Union (IoU) for association. However, this becomes unreliable when objects are similar or occluded. Also, computing IoU for segmentation masks is computationally expensive. In this work, we use segmentation masks to capture object shapes, but we do not compute segmentation IoU. Instead, we fuse depth and mask features and pass them through a compact encoder trained self-supervised. This encoder produces stable object representations, which we use as an additional similarity cue alongside bounding box IoU and re-identification features for matching. We obtain depth maps from a zero-shot depth estimator and object masks from a promptable visual segmentation model to obtain fine-grained spatial cues. Our MOT method is the first to use the self-supervised encoder to refine segmentation masks without computing masks IoU. MOT can be divided into joint detection-ReID (JDR) and tracking-by-detection (TBD) models. The latter are computationally more efficient. Experiments of our TBD method on challenging benchmarks with non-linear motion, occlusion, and crowded scenes, such as SportsMOT and DanceTrack, show that our method outperforms the TBD state-of-the-art on most metrics, while achieving competitive performance on simpler benchmarks with linear motion, such as MOT17.

[309] CHUG: Crowdsourced User-Generated HDR Video Quality Dataset

Shreshth Saini, Alan C. Bovik, Neil Birkbeck, Yilin Wang, Balu Adsumilli

Main category: cs.CV

TL;DR: CHUG is the first large-scale crowdsourced dataset for HDR video quality assessment focused on user-generated content, addressing the gap in existing datasets that primarily cover professionally generated content.

Details

Motivation: Existing HDR-VQA datasets focus on professionally generated content, leaving a gap in understanding real-world UGC-HDR degradations from diverse capture conditions, editing artifacts, and compression distortions.

Method: Created CHUG dataset with 856 UGC-HDR source videos transcoded across multiple resolutions and bitrates (total 5,992 videos), and conducted large-scale subjective study via Amazon Mechanical Turk collecting 211,848 perceptual ratings.

Result: CHUG provides the first comprehensive benchmark for analyzing UGC-specific distortions in HDR videos, offering a large-scale, diverse, and real-world UGC dataset.

Conclusion: CHUG will advance No-Reference HDR-VQA research by providing a publicly available dataset that addresses the unique challenges of user-generated HDR content quality assessment.

Abstract: High Dynamic Range (HDR) videos enhance visual experiences with superior brightness, contrast, and color depth. The surge of User-Generated Content (UGC) on platforms like YouTube and TikTok introduces unique challenges for HDR video quality assessment (VQA) due to diverse capture conditions, editing artifacts, and compression distortions. Existing HDR-VQA datasets primarily focus on professionally generated content (PGC), leaving a gap in understanding real-world UGC-HDR degradations. To address this, we introduce CHUG: Crowdsourced User-Generated HDR Video Quality Dataset, the first large-scale subjective study on UGC-HDR quality. CHUG comprises 856 UGC-HDR source videos, transcoded across multiple resolutions and bitrates to simulate real-world scenarios, totaling 5,992 videos. A large-scale study via Amazon Mechanical Turk collected 211,848 perceptual ratings. CHUG provides a benchmark for analyzing UGC-specific distortions in HDR videos. We anticipate CHUG will advance No-Reference (NR) HDR-VQA research by offering a large-scale, diverse, and real-world UGC dataset. The dataset is publicly available at: https://shreshthsaini.github.io/CHUG/.

[310] Holistic Evaluation of Multimodal LLMs on Spatial Intelligence

Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang

Main category: cs.CV

TL;DR: GPT-5 shows unprecedented spatial intelligence but still significantly lags behind human performance across various spatial tasks, with proprietary models not having decisive advantages on the most difficult spatial challenges.

Details

Motivation: To evaluate the current state of spatial understanding and reasoning in leading multimodal AI models (GPT, Gemini, Grok, Seed, Qwen, Intern) and assess their progress toward artificial general intelligence in physical world understanding.

Method: Proposed a holistic taxonomy of spatial tasks unifying existing benchmarks, conducted standardized evaluation across 8 key benchmarks using over 10 billion tokens, and performed qualitative evaluation on diverse human-intuitive scenarios.

Result: GPT-5 demonstrates unprecedented spatial intelligence strength but still falls significantly short of human performance; spatial tasks expose greater model capability deficiencies than non-spatial tasks; proprietary models show no decisive advantage on the most difficult spatial tasks.

Conclusion: Current multimodal models, including the most advanced GPT-5, still have substantial limitations in spatial intelligence compared to humans, highlighting a critical gap in achieving artificial general intelligence that can reason about the physical world.

Abstract: Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence. We first propose a holistic taxonomy of spatial tasks that unifies existing benchmarks and a standardized protocol for the fair evaluation of state-of-the-art proprietary and open-source models across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study then reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence (SI), yet (2) still falls short of human performance significantly across a broad spectrum of SI-tasks. Moreover, we (3) show that SI-tasks expose greater model capability deficiency than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage when facing the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans, yet fail even the most advanced multimodal models.

[311] Geometry-Aware Scene Configurations for Novel View Synthesis

Minkwan Kim, Changwoon Choi, Young Min Kim

Main category: cs.CV

TL;DR: Scene-adaptive strategies for efficient representation capacity allocation in indoor environment generation from incomplete observations, using geometric priors to guide optimal basis placement and virtual viewpoints.

Details

Motivation: Indoor scenes with multiple rooms have irregular layouts with varying complexity, clutter, occlusion, and flat walls, requiring efficient allocation of limited representation resources.

Method: Use geometric priors from pre-processing to guide optimal basis placement on estimated geometric scaffold, and introduce scene-adaptive virtual viewpoints to compensate for geometric deficiencies in input trajectory.

Result: Significant enhancements in rendering quality and memory requirements compared to baselines using regular placements, demonstrated through comprehensive analysis in large-scale indoor scenes.

Conclusion: Scene-adaptive strategies with geometric guidance and virtual viewpoints greatly improve upon uniform basis arrangements in scalable Neural Radiance Field representations for indoor environments.

Abstract: We propose scene-adaptive strategies to efficiently allocate representation capacity for generating immersive experiences of indoor environments from incomplete observations. Indoor scenes with multiple rooms often exhibit irregular layouts with varying complexity, containing clutter, occlusion, and flat walls. We maximize the utilization of limited resources with guidance from geometric priors, which are often readily available after pre-processing stages. We record observation statistics on the estimated geometric scaffold and guide the optimal placement of bases, which greatly improves upon the uniform basis arrangements adopted by previous scalable Neural Radiance Field (NeRF) representations. We also suggest scene-adaptive virtual viewpoints to compensate for geometric deficiencies inherent in view configurations in the input trajectory and impose the necessary regularization. We present a comprehensive analysis and discussion regarding rendering quality and memory requirements in several large-scale indoor scenes, demonstrating significant enhancements compared to baselines that employ regular placements.

[312] LTGS: Long-Term Gaussian Scene Chronology From Sparse View Updates

Minkwan Kim, Seungmin Lee, Junho Kim, Young Min Kim

Main category: cs.CV

TL;DR: LTGS is a novel scene representation that models long-term scene changes from sparse casual captures using Gaussian splatting with object templates that adapt to temporal variations.

Details

Motivation: To address challenges in capturing everyday environments with frequent scene changes from casual, sparse-view captures that are spatially and temporally incomplete.

Method: Uses Gaussian splatting with object templates as structural priors, followed by refinement pipeline that adapts templates to temporal variations using few-shot observations. Enables generalization across time steps through simple transformations.

Result: Achieves superior reconstruction quality compared to baselines while enabling fast and lightweight updates. Validated on real-world datasets collected specifically for long-term scene changes.

Conclusion: LTGS provides an efficient and scalable framework for modeling long-term scene chronology from sparse captures, significantly enhancing temporal evolution of 3D environments.

Abstract: Recent advances in novel-view synthesis can create the photo-realistic visualization of real-world environments from conventional camera captures. However, acquiring everyday environments from casual captures faces challenges due to frequent scene changes, which require dense observations both spatially and temporally. We propose long-term Gaussian scene chronology from sparse-view updates, coined LTGS, an efficient scene representation that can embrace everyday changes from highly under-constrained casual captures. Given an incomplete and unstructured Gaussian splatting representation obtained from an initial set of input images, we robustly model the long-term chronology of the scene despite abrupt movements and subtle environmental variations. We construct objects as template Gaussians, which serve as structural, reusable priors for shared object tracks. Then, the object templates undergo a further refinement pipeline that modulates the priors to adapt to temporally varying environments based on few-shot observations. Once trained, our framework is generalizable across multiple time steps through simple transformations, significantly enhancing the scalability for a temporal evolution of 3D environments. As existing datasets do not explicitly represent the long-term real-world changes with a sparse capture setup, we collect real-world datasets to evaluate the practicality of our pipeline. Experiments demonstrate that our framework achieves superior reconstruction quality compared to other baselines while enabling fast and light-weight updates.

[313] An uncertainty-aware framework for data-efficient multi-view animal pose estimation

Lenny Aharon, Keemin Lee, Karan Sikka, Selmaan Chettih, Cole Hurwitz, Liam Paninski, Matthew R Whiteway

Main category: cs.CV

TL;DR: A comprehensive multi-view pose estimation framework for animal behavior analysis that combines transformer architecture, geometric consistency, enhanced uncertainty quantification, and model distillation to achieve accurate tracking with limited labeled data.

Details

Motivation: Current multi-view pose estimation methods struggle with limited labeled data and poor uncertainty estimates, which hinders reliable animal behavior quantification in scientific research.

Method: Proposes multi-view transformer (MVT) with pretrained backbones and patch masking for cross-view correspondence, geometric consistency for calibrated setups, enhanced Ensemble Kalman Smoother with variance inflation for uncertainty, and distillation procedure using pseudo-labels.

Result: Framework components consistently outperform existing methods across three animal species (flies, mice, chickadees), with each component providing complementary benefits for practical pose estimation.

Conclusion: The framework provides a practical, uncertainty-aware system for reliable pose estimation that enables downstream behavioral analyses under real-world data constraints.

Abstract: Multi-view pose estimation is essential for quantifying animal behavior in scientific research, yet current methods struggle to achieve accurate tracking with limited labeled data and suffer from poor uncertainty estimates. We address these challenges with a comprehensive framework combining novel training and post-processing techniques, and a model distillation procedure that leverages the strengths of these techniques to produce a more efficient and effective pose estimator. Our multi-view transformer (MVT) utilizes pretrained backbones and enables simultaneous processing of information across all views, while a novel patch masking scheme learns robust cross-view correspondences without camera calibration. For calibrated setups, we incorporate geometric consistency through 3D augmentation and a triangulation loss. We extend the existing Ensemble Kalman Smoother (EKS) post-processor to the nonlinear case and enhance uncertainty quantification via a variance inflation technique. Finally, to leverage the scaling properties of the MVT, we design a distillation procedure that exploits improved EKS predictions and uncertainty estimates to generate high-quality pseudo-labels, thereby reducing dependence on manual labels. Our framework components consistently outperform existing methods across three diverse animal species (flies, mice, chickadees), with each component contributing complementary benefits. The result is a practical, uncertainty-aware system for reliable pose estimation that enables downstream behavioral analyses under real-world data constraints.

[314] SpectralCA: Bi-Directional Cross-Attention for Next-Generation UAV Hyperspectral Vision

D. V. Brovko

Main category: cs.CV

TL;DR: Developed a deep learning architecture integrating hyperspectral imaging into UAV perception using a modified Mobile 3D Vision Transformer with SpectralCA block for enhanced navigation, object detection, and terrain classification.

Details

Motivation: Growing demand for UAVs operating in complex environments where conventional navigation fails due to interference, poor visibility, or camouflage. Hyperspectral imaging enables fine-grained material recognition critical for navigation, surveillance, agriculture, and environmental monitoring.

Method: Modified Mobile 3D Vision Transformer (MDvT) by introducing SpectralCA block with bi-directional cross-attention to fuse spectral and spatial features, reducing parameters and inference time while maintaining accuracy.

Result: Experimental evaluation on WHU-Hi-HongHu dataset showed improved UAV perception efficiency, enabling real-time operation for navigation, object recognition, and environmental monitoring tasks.

Conclusion: The proposed architecture successfully enhances UAV perception capabilities through hyperspectral imaging integration, making it suitable for real-time applications in complex environments.

Abstract: The relevance of this research lies in the growing demand for unmanned aerial vehicles (UAVs) capable of operating reliably in complex environments where conventional navigation becomes unreliable due to interference, poor visibility, or camouflage. Hyperspectral imaging (HSI) provides unique opportunities for UAV-based computer vision by enabling fine-grained material recognition and object differentiation, which are critical for navigation, surveillance, agriculture, and environmental monitoring. The aim of this work is to develop a deep learning architecture integrating HSI into UAV perception for navigation, object detection, and terrain classification. Objectives include: reviewing existing HSI methods, designing a hybrid 2D/3D convolutional architecture with spectral-spatial cross-attention, training, and benchmarking. The methodology is based on the modification of the Mobile 3D Vision Transformer (MDvT) by introducing the proposed SpectralCA block. This block employs bi-directional cross-attention to fuse spectral and spatial features, enhancing accuracy while reducing parameters and inference time. Experimental evaluation was conducted on the WHU-Hi-HongHu dataset, with results assessed using Overall Accuracy, Average Accuracy, and the Kappa coefficient. The findings confirm that the proposed architecture improves UAV perception efficiency, enabling real-time operation for navigation, object recognition, and environmental monitoring tasks. Keywords: SpectralCA, deep learning, computer vision, hyperspectral imaging, unmanned aerial vehicle, object detection, semi-supervised learning.

[315] Tokenizing Motion: A Generative Approach for Scene Dynamics Compression

Shanzhi Yin, Zihan Zhang, Bolin Chen, Shiqi Wang, Yan Ye

Main category: cs.CV

TL;DR: A novel generative video compression framework using motion pattern priors from common scene dynamics for ultra-low bitrate communication with high-quality reconstruction.

Details

Motivation: To enable ultra-low bitrate video communication by leveraging compact motion priors from common scene dynamics rather than relying on specific video content priors like talking faces or human bodies.

Method: Encoder uses dense-to-sparse transformation to create compact motion prior representations. Decoder employs an advanced flow-driven diffusion model to reconstruct scene dynamics from these priors.

Result: Superior rate-distortion performance compared to state-of-the-art conventional video codec ECM on scene dynamics sequences.

Conclusion: The proposed framework demonstrates effective ultra-low bitrate video compression using motion pattern priors, achieving high-quality reconstruction across diverse scenes.

Abstract: This paper proposes a novel generative video compression framework that leverages motion pattern priors, derived from subtle dynamics in common scenes (e.g., swaying flowers or a boat drifting on water), rather than relying on video content priors (e.g., talking faces or human bodies). These compact motion priors enable a new approach to ultra-low bitrate communication while achieving high-quality reconstruction across diverse scene contents. At the encoder side, motion priors can be streamlined into compact representations via a dense-to-sparse transformation. At the decoder side, these priors facilitate the reconstruction of scene dynamics using an advanced flow-driven diffusion model. Experimental results illustrate that the proposed method can achieve superior rate-distortion-performance and outperform the state-of-the-art conventional-video codec Enhanced Compression Model (ECM) on-scene dynamics sequences. The project page can be found at-https://github.com/xyzysz/GNVDC.

[316] HeadsUp! High-Fidelity Portrait Image Super-Resolution

Renjie Li, Zihao Zhu, Xiaoyu Wang, Zhengzhong Tu

Main category: cs.CV

TL;DR: HeadsUp is a single-step diffusion model for portrait image super-resolution that handles both faces and backgrounds seamlessly in one step, avoiding blending artifacts from multi-model approaches.

Details

Motivation: Existing ISR methods use separate models for faces and backgrounds, causing blending artifacts. Human perception is sensitive to facial fidelity, requiring a unified approach for portrait photos.

Method: Built on single-step diffusion model with face supervision mechanism for facial focus and reference-based mechanism for identity restoration. Uses newly created PortraitSR-4K dataset for training.

Result: Achieves state-of-the-art performance on PortraitISR while maintaining comparable or better performance on general image and aligned face datasets.

Conclusion: HeadsUp provides an effective end-to-end solution for portrait super-resolution that eliminates blending artifacts and preserves facial identity.

Abstract: Portrait pictures, which typically feature both human subjects and natural backgrounds, are one of the most prevalent forms of photography on social media. Existing image super-resolution (ISR) techniques generally focus either on generic real-world images or strictly aligned facial images (i.e., face super-resolution). In practice, separate models are blended to handle portrait photos: the face specialist model handles the face region, and the general model processes the rest. However, these blending approaches inevitably introduce blending or boundary artifacts around the facial regions due to different model training recipes, while human perception is particularly sensitive to facial fidelity. To overcome these limitations, we study the portrait image supersolution (PortraitISR) problem, and propose HeadsUp, a single-step diffusion model that is capable of seamlessly restoring and upscaling portrait images in an end-to-end manner. Specifically, we build our model on top of a single-step diffusion model and develop a face supervision mechanism to guide the model in focusing on the facial region. We then integrate a reference-based mechanism to help with identity restoration, reducing face ambiguity in low-quality face restoration. Additionally, we have built a high-quality 4K portrait image ISR dataset dubbed PortraitSR-4K, to support model training and benchmarking for portrait images. Extensive experiments show that HeadsUp achieves state-of-the-art performance on the PortraitISR task while maintaining comparable or higher performance on both general image and aligned face datasets.

[317] Denoising Diffusion as a New Framework for Underwater Images

Nilesh Jain, Elie Alhajjar

Main category: cs.CV

TL;DR: This paper proposes using denoising diffusion models to expand underwater image datasets with diverse image types and Controlnet to enhance image quality, addressing limitations in current underwater image enhancement research.

Details

Motivation: Underwater images are crucial for ocean research but suffer from poor quality due to environmental factors. Existing enhancement methods have poor generalization and rely on limited datasets that lack diversity and contain mostly monocular images.

Method: Two-pronged approach: 1) Use denoising diffusion models to expand datasets with diverse image types (stereo, wide-angled, macro, close-up), 2) Apply Controlnet to enhance image quality and evaluate dataset improvements.

Result: The proposed methods aim to create more comprehensive and higher quality underwater image datasets, addressing current limitations in dataset diversity and image quality.

Conclusion: Expanding datasets with diffusion models and enhancing images with Controlnet can improve marine ecosystem studies by providing better quality and more diverse underwater imagery.

Abstract: Underwater images play a crucial role in ocean research and marine environmental monitoring since they provide quality information about the ecosystem. However, the complex and remote nature of the environment results in poor image quality with issues such as low visibility, blurry textures, color distortion, and noise. In recent years, research in image enhancement has proven to be effective but also presents its own limitations, like poor generalization and heavy reliance on clean datasets. One of the challenges herein is the lack of diversity and the low quality of images included in these datasets. Also, most existing datasets consist only of monocular images, a fact that limits the representation of different lighting conditions and angles. In this paper, we propose a new plan of action to overcome these limitations. On one hand, we call for expanding the datasets using a denoising diffusion model to include a variety of image types such as stereo, wide-angled, macro, and close-up images. On the other hand, we recommend enhancing the images using Controlnet to evaluate and increase the quality of the corresponding datasets, and hence improve the study of the marine ecosystem. Tags - Underwater Images, Denoising Diffusion, Marine ecosystem, Controlnet

[318] A PDE-Based Image Dehazing Method via Atmospheric Scattering Theory

Liubing Hu, Pu Wang, Guangwei Gao, Chunyan Wang, Zhuoran Zheng

Main category: cs.CV

TL;DR: A PDE-based single-image dehazing method using edge-preserving diffusion and nonlocal operators, with adaptive regularization guided by dark channel prior.

Details

Motivation: To provide a principled mathematical framework for image dehazing as an alternative to purely data-driven methods, addressing haze removal while preserving image fidelity.

Method: Embed atmospheric scattering model into PDE with edge-preserving diffusion and nonlocal operators, using adaptive regularization based on dark channel prior, and solving with GPU-accelerated fixed-point solver.

Result: Effective haze removal with preserved image fidelity, mathematically proven well-posedness with existence and uniqueness of weak solution in H₀¹(Ω).

Conclusion: The proposed PDE framework offers a rigorous mathematical alternative to data-driven dehazing techniques, achieving good performance with theoretical guarantees.

Abstract: This paper introduces a novel partial differential equation (PDE) framework for single-image dehazing. We embed the atmospheric scattering model into a PDE featuring edge-preserving diffusion and a nonlocal operator to maintain both local details and global structures. A key innovation is an adaptive regularization mechanism guided by the dark channel prior, which adjusts smoothing strength based on haze density. The framework’s mathematical well-posedness is rigorously established by proving the existence and uniqueness of its weak solution in $H_0^1(\Omega)$. An efficient, GPU-accelerated fixed-point solver is used for implementation. Experiments confirm our method achieves effective haze removal while preserving high image fidelity, offering a principled alternative to purely data-driven techniques.

[319] Semi-disentangled spatiotemporal implicit neural representations of longitudinal neuroimaging data for trajectory classification

Agampreet Aulakh, Nils D. Forkert, Matthias Wilms

Main category: cs.CV

TL;DR: A novel method using Implicit Neural Representations (INRs) to model brain aging trajectories from longitudinal MRI data, achieving 81.3% classification accuracy for healthy vs dementia-like aging patterns.

Details

Motivation: Longitudinal MRI data analysis is challenging due to discrete sampling patterns and inability of traditional deep learning methods to represent continuous biological processes of brain aging.

Method: Developed a novel INR architecture that disentangles spatial and temporal trajectory parameters, creating a framework that operates directly on INR parameter space for classification. Evaluated using biologically grounded trajectory simulation with 450 subjects.

Result: Achieved 81.3% accuracy for brain aging trajectory classification in irregular sampling experiments, outperforming standard deep learning baseline (73.7%).

Conclusion: INR-based approach effectively models continuous brain aging processes and provides superior classification performance compared to traditional methods, especially for irregularly sampled longitudinal data.

Abstract: The human brain undergoes dynamic, potentially pathology-driven, structural changes throughout a lifespan. Longitudinal Magnetic Resonance Imaging (MRI) and other neuroimaging data are valuable for characterizing trajectories of change associated with typical and atypical aging. However, the analysis of such data is highly challenging given their discrete nature with different spatial and temporal image sampling patterns within individuals and across populations. This leads to computational problems for most traditional deep learning methods that cannot represent the underlying continuous biological process. To address these limitations, we present a new, fully data-driven method for representing aging trajectories across the entire brain by modelling subject-specific longitudinal T1-weighted MRI data as continuous functions using Implicit Neural Representations (INRs). Therefore, we introduce a novel INR architecture capable of partially disentangling spatial and temporal trajectory parameters and design an efficient framework that directly operates on the INRs’ parameter space to classify brain aging trajectories. To evaluate our method in a controlled data environment, we develop a biologically grounded trajectory simulation and generate T1-weighted 3D MRI data for 450 healthy and dementia-like subjects at regularly and irregularly sampled timepoints. In the more realistic irregular sampling experiment, our INR-based method achieves 81.3% accuracy for the brain aging trajectory classification task, outperforming a standard deep learning baseline model (73.7%).

[320] RoHOI: Robustness Benchmark for Human-Object Interaction Detection

Di Wen, Kunyu Peng, Kailun Yang, Yufan Chen, Ruiping Liu, Junwei Zheng, Alina Roitberg, Danda Pani Paudel, Luc Van Gool, Rainer Stiefelhagen

Main category: cs.CV

TL;DR: This paper introduces RoHOI, the first robustness benchmark for Human-Object Interaction (HOI) detection, addressing model degradation in real-world conditions. The authors propose SAMPL, a Semantic-Aware Masking-based Progressive Learning strategy to enhance model robustness against corruptions.

Details

Motivation: HOI detection models trained on clean datasets perform poorly in real-world scenarios due to unforeseen corruptions like environmental variability, occlusions, and noise, leading to inaccurate predictions for robot-human assistance applications.

Method: Created RoHOI benchmark with 20 corruption types based on HICO-DET and V-COCO datasets, introduced a new robustness-focused metric, and proposed SAMPL strategy that uses semantic-aware masking and progressive learning to guide model optimization using holistic and partial cues.

Result: Extensive experiments show that existing HOI models suffer significant performance drops under corruptions, while the proposed SAMPL approach outperforms state-of-the-art methods and sets new standards for robust HOI detection.

Conclusion: The RoHOI benchmark and SAMPL strategy effectively address robustness challenges in HOI detection, providing a framework for evaluating and improving model resilience against real-world corruptions in robot-human assistance scenarios.

Abstract: Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. However, models trained on clean datasets degrade in real-world conditions due to unforeseen corruptions, leading to inaccurate predictions. To address this, we introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Despite advances, current models struggle with environmental variability, occlusions, and noise. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric. We systematically analyze existing models in the HOI field, revealing significant performance drops under corruptions. To improve robustness, we propose a Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy to guide the model to be optimized based on holistic and partial cues, thus dynamically adjusting the model’s optimization to enhance robust feature learning. Extensive experiments show that our approach outperforms state-of-the-art methods, setting a new standard for robust HOI detection. Benchmarks, datasets, and code are available at https://github.com/KratosWen/RoHOI.

[321] A Multi-Strategy Framework for Enhancing Shatian Pomelo Detection in Real-World Orchards

Pan Wang, Yihao Hu, Xiaodong Bai, Aiping Yang, Xiangxiang Li, Meiping Ding, Jianguo Yao

Main category: cs.CV

TL;DR: A multi-strategy framework for automated Shatian pomelo detection addresses challenges from imaging devices, lighting, scale variation, and occlusion using specialized data augmentation and the REAS-Det network with RFAConv, C3RFEM, and MultiSEAM modules.

Details

Motivation: Shatian pomelo requires automated detection for accurate quantity assessment and lean production, but existing methods degrade in real-world conditions due to imaging devices, lighting, scale variation, and occlusion.

Method: Proposed a multi-strategy framework: 1) Multi-scenario dataset STP-AgriData combining real orchard and internet data, 2) Data augmentation for lighting variations, 3) REAS-Det network with RFAConv and C3RFEM for scale variation and MultiSEAM with soft-NMS for occlusion.

Result: Achieved precision of 87.6%, recall of 74.9%, mAP@.50 of 82.8%, and mAP@.50:.95 of 53.3%, outperforming other state-of-the-art detection methods.

Conclusion: The proposed framework effectively addresses key challenges in real-world pomelo detection and demonstrates superior performance compared to existing methods.

Abstract: As a specialty agricultural product with a large market scale, Shatian pomelo necessitates the adoption of automated detection to ensure accurate quantity and meet commercial demands for lean production. Existing research often involves specialized networks tailored for specific theoretical or dataset scenarios, but these methods tend to degrade performance in real-world. Through analysis of factors in this issue, this study identifies four key challenges that affect the accuracy of Shatian pomelo detection: imaging devices, lighting conditions, object scale variation, and occlusion. To mitigate these challenges, a multi-strategy framework is proposed in this paper. Firstly, to effectively solve tone variation introduced by diverse imaging devices and complex orchard environments, we utilize a multi-scenario dataset, STP-AgriData, which is constructed by integrating real orchard images with internet-sourced data. Secondly, to simulate the inconsistent illumination conditions, specific data augmentations such as adjusting contrast and changing brightness, are applied to the above dataset. Thirdly, to address the issues of object scale variation and occlusion in fruit detection, an REAS-Det network is designed in this paper. For scale variation, RFAConv and C3RFEM modules are designed to expand and enhance the receptive fields. For occlusion variation, a multi-scale, multi-head feature selection structure (MultiSEAM) and soft-NMS are introduced to enhance the handling of occlusion issues to improve detection accuracy. The results of these experiments achieved a precision(P) of 87.6%, a recall (R) of 74.9%, a mAP@.50 of 82.8%, and a mAP@.50:.95 of 53.3%. Our proposed network demonstrates superior performance compared to other state-of-the-art detection methods.

[322] J-RAS: Enhancing Medical Image Segmentation via Retrieval-Augmented Joint Training

Salma J. Ahmed, Emad A. Mohammed, Azam Asilian Bidgoli

Main category: cs.CV

TL;DR: J-RAS is a joint training method that combines segmentation and retrieval models to improve medical image segmentation by leveraging retrieved image-mask pairs for better anatomical understanding and boundary delineation.

Details

Motivation: Manual medical image segmentation is time-consuming and variable, while AI methods require large annotated datasets and struggle with generalization across diverse imaging conditions and rare cases.

Method: Joint training of segmentation and retrieval models where both are optimized together - segmentation model uses retrieved image-mask pairs for anatomical understanding, while retrieval model learns segmentation-relevant features beyond visual similarity.

Result: Substantial improvements across multiple architectures: On ACDC dataset, SegFormer improved from Dice 0.8708±0.042 to 0.9115±0.031 and HD from 1.8130±2.49 to 1.1489±0.30. Consistent improvements shown on ACDC and M&Ms datasets.

Conclusion: J-RAS effectively enhances segmentation performance by enabling retrieval to provide meaningful contextual cues, demonstrating good generalizability across different architectures and datasets.

Abstract: Image segmentation, the process of dividing images into meaningful regions, is critical in medical applications for accurate diagnosis, treatment planning, and disease monitoring. Although manual segmentation by healthcare professionals produces precise outcomes, it is time-consuming, costly, and prone to variability due to differences in human expertise. Artificial intelligence (AI)-based methods have been developed to address these limitations by automating segmentation tasks; however, they often require large, annotated datasets that are rarely available in practice and frequently struggle to generalize across diverse imaging conditions due to inter-patient variability and rare pathological cases. In this paper, we propose Joint Retrieval Augmented Segmentation (J-RAS), a joint training method for guided image segmentation that integrates a segmentation model with a retrieval model. Both models are jointly optimized, enabling the segmentation model to leverage retrieved image-mask pairs to enrich its anatomical understanding, while the retrieval model learns segmentation-relevant features beyond simple visual similarity. This joint optimization ensures that retrieval actively contributes meaningful contextual cues to guide boundary delineation, thereby enhancing the overall segmentation performance. We validate J-RAS across multiple segmentation backbones, including U-Net, TransUNet, SAM, and SegFormer, on two benchmark datasets: ACDC and M&Ms, demonstrating consistent improvements. For example, on the ACDC dataset, SegFormer without J-RAS achieves a mean Dice score of 0.8708$\pm$0.042 and a mean Hausdorff Distance (HD) of 1.8130$\pm$2.49, whereas with J-RAS, the performance improves substantially to a mean Dice score of 0.9115$\pm$0.031 and a mean HD of 1.1489$\pm$0.30. These results highlight the method’s effectiveness and its generalizability across architectures and datasets.

[323] FlareX: A Physics-Informed Dataset for Lens Flare Removal via 2D Synthesis and 3D Rendering

Lishen Qu, Zhihao Liu, Jinshan Pan, Shihao Zhou, Jinglei Shi, Duosheng Chen, Jufeng Yang

Main category: cs.CV

TL;DR: Proposes FlareX dataset with physics-informed flare generation using both 2D synthesis and 3D rendering to address limitations of existing synthetic flare datasets.

Details

Motivation: Existing flare datasets use simple 2D template overlays that lack flare diversity and ignore physical principles, limiting model generalization to real-world scenarios.

Method: Three-stage physics-informed generation: parameterized template creation, illumination-aware 2D synthesis, and physical engine-based 3D rendering, plus a masking approach for real-world flare removal evaluation.

Result: Created FlareX dataset with 9,500 2D templates from 95 patterns and 3,000 flare image pairs from 60 3D scenes, with extensive experiments showing effectiveness.

Conclusion: The proposed physics-informed flare generation method and FlareX dataset significantly improve model performance on real-world flare removal compared to existing approaches.

Abstract: Lens flare occurs when shooting towards strong light sources, significantly degrading the visual quality of images. Due to the difficulty in capturing flare-corrupted and flare-free image pairs in the real world, existing datasets are typically synthesized in 2D by overlaying artificial flare templates onto background images. However, the lack of flare diversity in templates and the neglect of physical principles in the synthesis process hinder models trained on these datasets from generalizing well to real-world scenarios. To address these challenges, we propose a new physics-informed method for flare data generation, which consists of three stages: parameterized template creation, the laws of illumination-aware 2D synthesis, and physical engine-based 3D rendering, which finally gives us a mixed flare dataset that incorporates both 2D and 3D perspectives, namely FlareX. This dataset offers 9,500 2D templates derived from 95 flare patterns and 3,000 flare image pairs rendered from 60 3D scenes. Furthermore, we design a masking approach to obtain real-world flare-free images from their corrupted counterparts to measure the performance of the model on real-world images. Extensive experiments demonstrate the effectiveness of our method and dataset.

[324] BurstDeflicker: A Benchmark Dataset for Flicker Removal in Dynamic Scenes

Lishen Qu, Zhihao Liu, Shihao Zhou, Yaqi Luo, Jie Liang, Hui Zeng, Lei Zhang, Jufeng Yang

Main category: cs.CV

TL;DR: BurstDeflicker is a scalable benchmark for flicker removal that combines synthetic data generation, real-world captures, and motion-preserving green-screen methods to address the lack of large-scale datasets for AC lighting flicker artifacts.

Details

Motivation: Flicker artifacts from AC-powered lighting in rolling shutter cameras degrade image quality and affect high-level vision tasks, but research has been hindered by the lack of large-scale realistic datasets.

Method: Three complementary approaches: 1) Retinex-based synthesis pipeline for controllable flicker generation, 2) 4,000 real-world flicker image captures, 3) Green-screen method to incorporate motion while preserving real flicker degradation.

Result: The benchmark enables comprehensive evaluation of flicker removal methods and helps models better understand spatial and temporal characteristics of real flicker artifacts for improved generalization.

Conclusion: BurstDeflicker provides an effective dataset that advances flicker removal research by combining synthetic and real data acquisition strategies to overcome dataset limitations.

Abstract: Flicker artifacts in short-exposure images are caused by the interplay between the row-wise exposure mechanism of rolling shutter cameras and the temporal intensity variations of alternating current (AC)-powered lighting. These artifacts typically appear as uneven brightness distribution across the image, forming noticeable dark bands. Beyond compromising image quality, this structured noise also affects high-level tasks, such as object detection and tracking, where reliable lighting is crucial. Despite the prevalence of flicker, the lack of a large-scale, realistic dataset has been a significant barrier to advancing research in flicker removal. To address this issue, we present BurstDeflicker, a scalable benchmark constructed using three complementary data acquisition strategies. First, we develop a Retinex-based synthesis pipeline that redefines the goal of flicker removal and enables controllable manipulation of key flicker-related attributes (e.g., intensity, area, and frequency), thereby facilitating the generation of diverse flicker patterns. Second, we capture 4,000 real-world flicker images from different scenes, which help the model better understand the spatial and temporal characteristics of real flicker artifacts and generalize more effectively to wild scenarios. Finally, due to the non-repeatable nature of dynamic scenes, we propose a green-screen method to incorporate motion into image pairs while preserving real flicker degradation. Comprehensive experiments demonstrate the effectiveness of our dataset and its potential to advance research in flicker removal.

[325] MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output

Yanyuan Chen, Dexuan Xu, Yu Huang, Songkun Zhan, Hanpin Wang, Dongxue Chen, Xueping Wang, Meikang Qiu, Hang Li

Main category: cs.CV

TL;DR: MIMO is a unified medical vision language model that addresses limitations of existing models by incorporating visual referring multimodal input and pixel grounding multimodal output, enabling better understanding of medical images and grounding of medical terminologies.

Details

Motivation: Existing medical vision language models only use text instructions without direct visual understanding, and only provide text answers without connecting to key image areas, limiting their effectiveness in medical applications.

Method: Proposed MIMO model with visual referring multimodal input and pixel grounding multimodal output, trained on MIMOSeg dataset containing 895K samples covering instruction following and complex QA with multimodal input/output.

Result: Experiments on medical multimodal tasks show MIMO uniquely combines visual referring and pixel grounding capabilities not available in previous models.

Conclusion: MIMO successfully addresses the input and output limitations of existing medical vision language models by integrating visual understanding and pixel-level grounding capabilities.

Abstract: Currently, medical vision language models are widely used in medical vision question answering tasks. However, existing models are confronted with two issues: for input, the model only relies on text instructions and lacks direct understanding of visual clues in the image; for output, the model only gives text answers and lacks connection with key areas in the image. To address these issues, we propose a unified medical vision language model MIMO, with visual referring Multimodal Input and pixel grounding Multimodal Output. MIMO can not only combine visual clues and textual instructions to understand complex medical images and semantics, but can also ground medical terminologies in textual output within the image. To overcome the scarcity of relevant data in the medical field, we propose MIMOSeg, a comprehensive medical multimodal dataset including 895K samples. MIMOSeg is constructed from four different perspectives, covering basic instruction following and complex question answering with multimodal input and multimodal output. We conduct experiments on several downstream medical multimodal tasks. Extensive experimental results verify that MIMO can uniquely combine visual referring and pixel grounding capabilities, which are not available in previous models.

Junan Chen, Trung Thanh Nguyen, Takahiro Komamizu, Ichiro Ide

Main category: cs.CV

TL;DR: Q-Adapter is a lightweight visual adapter module that enables parameter-efficient fine-tuning for video captioning by introducing learnable query tokens and a gating layer into Vision Encoder, achieving state-of-the-art performance with only 1.4% of parameters compared to full fine-tuning.

Details

Motivation: Standard full fine-tuning of large pretrained models for video captioning is computationally prohibitive, and existing PEFT methods primarily focus on language components, lacking sufficient understanding of visual information during multimodal fine-tuning.

Method: Proposed Q-Adapter introduces learnable query tokens and a gating layer into Vision Encoder to extract sparse, caption-relevant features without external textual supervision, enabling efficient fine-tuning for video captioning tasks.

Result: Achieved state-of-the-art performance among PEFT methods on MSR-VTT and MSVD datasets across BLEU@4, METEOR, ROUGE-L, and CIDEr metrics, and competitive performance compared to full fine-tuning approaches while using only 1.4% of parameters.

Conclusion: Q-Adapter effectively balances caption quality and parameter efficiency, demonstrating strong scalability for video-language modeling and providing insights into optimization strategies for adapter-based learning.

Abstract: Recent advances in video captioning are driven by large-scale pretrained models, which follow the standard “pre-training followed by fine-tuning” paradigm, where the full model is fine-tuned for downstream tasks. Although effective, this approach becomes computationally prohibitive as the model size increases. The Parameter-Efficient Fine-Tuning (PEFT) approach offers a promising alternative, but primarily focuses on the language components of Multimodal Large Language Models (MLLMs). Despite recent progress, PEFT remains underexplored in multimodal tasks and lacks sufficient understanding of visual information during fine-tuning the model. To bridge this gap, we propose Query-Adapter (Q-Adapter), a lightweight visual adapter module designed to enhance MLLMs by enabling efficient fine-tuning for the video captioning task. Q-Adapter introduces learnable query tokens and a gating layer into Vision Encoder, enabling effective extraction of sparse, caption-relevant features without relying on external textual supervision. We evaluate Q-Adapter on two well-known video captioning datasets, MSR-VTT and MSVD, where it achieves state-of-the-art performance among the methods that take the PEFT approach across BLEU@4, METEOR, ROUGE-L, and CIDEr metrics. Q-Adapter also achieves competitive performance compared to methods that take the full fine-tuning approach while requiring only 1.4% of the parameters. We further analyze the impact of key hyperparameters and design choices on fine-tuning effectiveness, providing insights into optimization strategies for adapter-based learning. These results highlight the strong potential of Q-Adapter in balancing caption quality and parameter efficiency, demonstrating its scalability for video-language modeling.

[327] P-4DGS: Predictive 4D Gaussian Splatting with 90$\times$ Compression

Henan Wang, Hanxin Zhu, Xinliang Gong, Tianyu He, Xin Li, Zhibo Chen

Main category: cs.CV

TL;DR: P-4DGS is a compressed dynamic 3D Gaussian Splatting representation that achieves up to 40-90x compression while maintaining state-of-the-art reconstruction quality and fast rendering speed.

Details

Motivation: Existing dynamic 3DGS algorithms overlook substantial temporal and spatial redundancies in dynamic scenes, leading to prohibitive memory consumption.

Method: Uses 3D anchor point-based spatial-temporal prediction inspired by video compression, combined with adaptive quantization and context-based entropy coding to reduce storage.

Result: Achieves state-of-the-art reconstruction quality with fastest rendering speed and remarkably low storage footprint (~1MB), achieving 40x compression on synthetic scenes and 90x on real-world scenes.

Conclusion: P-4DGS provides an efficient solution for compact 4D scene modeling with superior compression efficiency while maintaining high reconstruction quality.

Abstract: 3D Gaussian Splatting (3DGS) has garnered significant attention due to its superior scene representation fidelity and real-time rendering performance, especially for dynamic 3D scene reconstruction (\textit{i.e.}, 4D reconstruction). However, despite achieving promising results, most existing algorithms overlook the substantial temporal and spatial redundancies inherent in dynamic scenes, leading to prohibitive memory consumption. To address this, we propose P-4DGS, a novel dynamic 3DGS representation for compact 4D scene modeling. Inspired by intra- and inter-frame prediction techniques commonly used in video compression, we first design a 3D anchor point-based spatial-temporal prediction module to fully exploit the spatial-temporal correlations across different 3D Gaussian primitives. Subsequently, we employ an adaptive quantization strategy combined with context-based entropy coding to further reduce the size of the 3D anchor points, thereby achieving enhanced compression efficiency. To evaluate the rate-distortion performance of our proposed P-4DGS in comparison with other dynamic 3DGS representations, we conduct extensive experiments on both synthetic and real-world datasets. Experimental results demonstrate that our approach achieves state-of-the-art reconstruction quality and the fastest rendering speed, with a remarkably low storage footprint (around \textbf{1MB} on average), achieving up to \textbf{40$\times$} and \textbf{90$\times$} compression on synthetic and real-world scenes, respectively.

[328] Complementary and Contrastive Learning for Audio-Visual Segmentation

Sitong Gong, Yunzhi Zhuge, Lu Zhang, Pingping Zhang, Huchuan Lu

Main category: cs.CV

TL;DR: CCFormer is a novel Transformer-based framework for Audio-Visual Segmentation that enhances cross-modal complementarity and captures spatial-temporal context through early integration, multi-query transformers, and bi-modal contrastive learning.

Details

Motivation: Traditional CNN methods have limited receptive fields, while existing Transformer approaches struggle with extracting multimodal coefficients and temporal dynamics in audio-visual segmentation tasks.

Method: Proposes CCFormer with three key components: Early Integration Module (EIM) for multi-scale visual-audio fusion, Multi-query Transformer Module (MTM) for spatial-temporal modeling with dynamic audio queries, and Bi-modal Contrastive Learning (BCL) for cross-modal alignment.

Result: Achieves new state-of-the-art performance on S4, MS3 and AVSS datasets, demonstrating superior segmentation accuracy and robustness.

Conclusion: CCFormer effectively addresses limitations of previous methods by comprehensively processing local/global information and capturing spatial-temporal context, setting new benchmarks in audio-visual segmentation.

Abstract: Audio-Visual Segmentation (AVS) aims to generate pixel-wise segmentation maps that correlate with the auditory signals of objects. This field has seen significant progress with numerous CNN and Transformer-based methods enhancing the segmentation accuracy and robustness. Traditional CNN approaches manage audio-visual interactions through basic operations like padding and multiplications but are restricted by CNNs’ limited local receptive field. More recently, Transformer-based methods treat auditory cues as queries, utilizing attention mechanisms to enhance audio-visual cooperation within frames. Nevertheless, they typically struggle to extract multimodal coefficients and temporal dynamics adequately. To overcome these limitations, we present the Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information and capturing spatial-temporal context comprehensively. Our CCFormer initiates with the Early Integration Module (EIM) that employs a parallel bilateral architecture, merging multi-scale visual features with audio data to boost cross-modal complementarity. To extract the intra-frame spatial features and facilitate the perception of temporal coherence, we introduce the Multi-query Transformer Module (MTM), which dynamically endows audio queries with learning capabilities and models the frame and video-level relations simultaneously. Furthermore, we propose the Bi-modal Contrastive Learning (BCL) to promote the alignment across both modalities in the unified feature space. Through the effective combination of those designs, our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets. Our source code and model weights will be made publicly available at https://github.com/SitongGong/CCFormer

[329] Think Twice to See More: Iterative Visual Reasoning in Medical VLMs

Kaitao Chen, Shaohao Rui, Yankai Jiang, Jiamin Wu, Qihao Zheng, Chunfeng Song, Xiaosong Wang, Mu Zhou, Mianxin Liu

Main category: cs.CV

TL;DR: ViTAR is a medical vision-language model that mimics human expert diagnostic reasoning through iterative “think-act-rethink-answer” cycles, outperforming state-of-the-art models by focusing on clinically relevant regions.

Details

Motivation: Current medical VLMs use single-pass reasoning and miss localized visual cues, unlike human experts who iteratively scan and refine regions of interest before diagnosis. This creates a machine-human perception gap that needs addressing.

Method: ViTAR uses a cognitive chain of “think-act-rethink-answer” to treat medical images as interactive objects. It employs a two-stage training: supervised fine-tuning with curated datasets (1K interactive examples + 16K VQA data) followed by reinforcement learning to optimize decision-making.

Result: ViTAR outperforms strong state-of-the-art models. Visual attention analysis shows it increasingly anchors to clinically critical regions from “think” to “rethink” rounds and maintains high attention to visual tokens during reasoning.

Conclusion: Embedding expert-style iterative thinking chains into VLMs enhances both performance and trustworthiness of medical AI by mimicking human diagnostic reasoning processes.

Abstract: Medical vision-language models (VLMs) excel at image-text understanding but typically rely on a single-pass reasoning that neglects localized visual cues. In clinical practice, however, human experts iteratively scan, focus, and refine the regions of interest before reaching a final diagnosis. To narrow this machine-human perception gap, we introduce ViTAR, a novel VLM framework that emulates the iterative reasoning process of human experts through a cognitive chain of “think-act-rethink-answer”. ViTAR treats medical images as interactive objects, enabling models to engage multi-step visual reasoning. To support this approach, we curate a high-quality instruction dataset comprising 1K interactive examples that encode expert-like diagnostic behaviors. In addition, a 16K visual question answering training data has been curated towards fine-grained visual diagnosis. We introduce a two-stage training strategy that begins with supervised fine-tuning to guide cognitive trajectories, followed by the reinforcement learning to optimize decision-making. Extensive evaluations demonstrate that ViTAR outperforms strong state-of-the-art models. Visual attention analysis reveals that from the “think” to “rethink” rounds, ViTAR increasingly anchors visual grounding to clinically critical regions and maintains high attention allocation to visual tokens during reasoning, providing mechanistic insight into its improved performance. These findings demonstrate that embedding expert-style iterative thinking chains into VLMs enhances both performance and trustworthiness of medical AI.

[330] DREAM: A Benchmark Study for Deepfake REalism AssessMent

Bo Peng, Zichuan Wang, Sheng Yu, Xiaochuan Jin, Wei Wang, Jing Dong

Main category: cs.CV

TL;DR: This paper introduces DREAM, a comprehensive benchmark for deepfake visual realism assessment that aims to automatically evaluate how realistic deepfakes appear to humans.

Details

Motivation: While deepfake detection has been well-studied, the subjective perception and computational modeling of deepfake visual realism lacks adequate research. This is important for evaluating deepfake quality, deceptiveness, and potential impact on society.

Method: The authors created the DREAM benchmark consisting of: a diverse deepfake video dataset, large-scale human annotations (140,000 realism scores and descriptions from 3,500 annotators), and evaluation of 16 realism assessment methods including CLIP-based approaches.

Result: The paper presents a comprehensive benchmark with extensive human annotations and evaluates multiple assessment methods, providing foundational resources for deepfake realism evaluation.

Conclusion: The DREAM benchmark establishes a foundation for future research in deepfake visual realism assessment and related areas, addressing a previously understudied aspect of deepfake analysis.

Abstract: Deep learning based face-swap videos, widely known as deepfakes, have drawn wide attention due to their threat to information credibility. Recent works mainly focus on the problem of deepfake detection that aims to reliably tell deepfakes apart from real ones, in an objective way. On the other hand, the subjective perception of deepfakes, especially its computational modeling and imitation, is also a significant problem but lacks adequate study. In this paper, we focus on the visual realism assessment of deepfakes, which is defined as the automatic assessment of deepfake visual realism that approximates human perception of deepfakes. It is important for evaluating the quality and deceptiveness of deepfakes which can be used for predicting the influence of deepfakes on Internet, and it also has potentials in improving the deepfake generation process by serving as a critic. This paper prompts this new direction by presenting a comprehensive benchmark called DREAM, which stands for Deepfake REalism AssessMent. It is comprised of a deepfake video dataset of diverse quality, a large scale annotation that includes 140,000 realism scores and textual descriptions obtained from 3,500 human annotators, and a comprehensive evaluation and analysis of 16 representative realism assessment methods, including recent large vision language model based methods and a newly proposed description-aligned CLIP method. The benchmark and insights included in this study can lay the foundation for future research in this direction and other related areas.

[331] Collaborative Learning of Semantic-Aware Feature Learning and Label Recovery for Multi-Label Image Recognition with Incomplete Labels

Zhi-Fen He, Ren-Dong Xie, Bo Li, Bin Liu, Jin-Yan Hu

Main category: cs.CV

TL;DR: CLSL is a collaborative learning method for multi-label image recognition with incomplete labels that simultaneously addresses semantic-aware feature learning and missing label recovery in a unified framework.

Details

Motivation: Multi-label image recognition with incomplete labels faces two core challenges: semantic-aware feature learning and missing label recovery, which need to be addressed together for better performance.

Method: Proposes a three-part framework: 1) semantic-related feature learning module to discover semantic information and label correlations, 2) semantic-guided feature enhancement module to align visual and semantic feature spaces, and 3) collaborative learning that integrates feature learning and label recovery in a mutually reinforced loop.

Result: Extensive experiments on MS-COCO, VOC2007, and NUS-WIDE datasets show that CLSL outperforms state-of-the-art methods for multi-label image recognition with incomplete labels.

Conclusion: The collaborative learning framework effectively addresses both semantic-aware feature learning and missing label recovery, demonstrating superior performance on multiple benchmark datasets.

Abstract: Multi-label image recognition with incomplete labels is a critical learning task and has emerged as a focal topic in computer vision. However, this task is confronted with two core challenges: semantic-aware feature learning and missing label recovery. In this paper, we propose a novel Collaborative Learning of Semantic-aware feature learning and Label recovery (CLSL) method for multi-label image recognition with incomplete labels, which unifies the two aforementioned challenges into a unified learning framework. More specifically, we design a semantic-related feature learning module to learn robust semantic-related features by discovering semantic information and label correlations. Then, a semantic-guided feature enhancement module is proposed to generate high-quality discriminative semantic-aware features by effectively aligning visual and semantic feature spaces. Finally, we introduce a collaborative learning framework that integrates semantic-aware feature learning and label recovery, which can not only dynamically enhance the discriminability of semantic-aware features but also adaptively infer and recover missing labels, forming a mutually reinforced loop between the two processes. Extensive experiments on three widely used public datasets (MS-COCO, VOC2007, and NUS-WIDE) demonstrate that CLSL outperforms the state-of-the-art multi-label image recognition methods with incomplete labels.

Pîrvu Mihai-Cristian, Leordeanu Marius

Main category: cs.CV

TL;DR: PHG-MAE is a novel model that combines neural graphs with masked autoencoders, enabling unified pre-training and fine-tuning while supporting inference-time ensembles and knowledge distillation for efficient multi-modal learning.

Details

Motivation: To address the need for self-supervised pre-training without manual labels and unify classical neural graphs with modern masked autoencoders for improved multi-modal learning in computer vision.

Method: Uses probabilistic hyper-graphs with masked autoencoders that randomly mask entire modalities (not just patches), combines pre-training and fine-tuning in a single loop, and enables inference-time ensembles with knowledge distillation.

Result: Enables creation of inference-time ensembles that boost prediction performance and consistency, with knowledge distillation working effectively even on models under 1M parameters.

Conclusion: PHG-MAE provides a unified framework for multi-modal learning applicable to domains like UAV scenes, autonomous driving, and robotics, with released code and extended dataset for reproducibility.

Abstract: The computer vision domain has greatly benefited from an abundance of data across many modalities to improve on various visual tasks. Recently, there has been a lot of focus on self-supervised pre-training methods through Masked Autoencoders (MAE) \cite{he2022masked,bachmann2022multimae}, usually used as a first step before optimizing for a downstream task, such as classification or regression. This is very useful as it doesn’t require any manually labeled data. In this work, we introduce Probabilistic Hyper-Graphs using Masked Autoencoders (PHG-MAE): a novel model that unifies the classical work on neural graphs \cite{leordeanu2021semi} with the modern approach of masked autoencoders under a common theoretical framework. Through random masking of entire modalities, not just patches, the model samples from the distribution of hyper-edges on each forward pass. Additionally, the model adapts the standard MAE algorithm by combining pre-training and fine-tuning into a single training loop. Moreover, our approach enables the creation of inference-time ensembles which, through aggregation, boost the final prediction performance and consistency. Lastly, we show that we can apply knowledge distillation on top of the ensembles with little loss in performance, even with models that have fewer than 1M parameters. While our work mostly focuses on outdoor UAV scenes that contain multiple world interpretations and modalities, the same steps can be followed in other similar domains, such as autonomous driving or indoor robotics. In order to streamline the process of integrating external pre-trained experts for computer vision multi-modal multi-task learning (MTL) scenarios, we developed a data-pipeline software. Using this tool, we have created and released a fully-automated extension of the Dronescapes dataset. All the technical details, code and reproduction steps are publicly released.

[333] Tracking the Spatiotemporal Evolution of Landslide Scars Using a Vision Foundation Model: A Novel and Universal Framework

Meijun Zhou, Gang Mei, Zhengjing Ma, Nengxiong Xu, Jianbing Peng

Main category: cs.CV

TL;DR: A novel framework using vision foundation models to track spatiotemporal evolution of large-scale landslide scars by transforming discrete optical remote sensing images into continuous video sequences.

Details

Motivation: Existing studies focus on single-phase or pre/post-failure dual-phase landslide identification, making it challenging to track the continuous spatiotemporal evolution of landslide scars needed for early warning and hazard assessment.

Method: Reconstructs discrete optical remote sensing images into continuous video sequences, enabling vision foundation models developed for video segmentation to track landslide scar evolution through knowledge-guided, auto-propagation, and interactive refinement paradigms.

Result: Validated on Baige and Sela landslides (2017-2025), the framework successfully tracks landslide scar evolution, capturing failure precursors for early warning and post-failure evolution for secondary hazard assessment.

Conclusion: The proposed universal framework enables continuous tracking of landslide scars, providing critical insights for early warning systems and long-term stability assessment of landslide hazards.

Abstract: Tracking the spatiotemporal evolution of large-scale landslide scars is critical for understanding the evolution mechanisms and failure precursors, enabling effective early-warning. However, most existing studies have focused on single-phase or pre- and post-failure dual-phase landslide identification. Although these approaches delineate post-failure landslide boundaries, it is challenging to track the spatiotemporal evolution of landslide scars. To address this problem, this study proposes a novel and universal framework for tracking the spatiotemporal evolution of large-scale landslide scars using a vision foundation model. The key idea behind the proposed framework is to reconstruct discrete optical remote sensing images into a continuous video sequence. This transformation enables a vision foundation model, which is developed for video segmentation, to be used for tracking the evolution of landslide scars. The proposed framework operates within a knowledge-guided, auto-propagation, and interactive refinement paradigm to ensure the continuous and accurate identification of landslide scars. The proposed framework was validated through application to two representative cases: the post-failure Baige landslide and the active Sela landslide (2017-2025). Results indicate that the proposed framework enables continuous tracking of landslide scars, capturing both failure precursors critical for early warning and post-failure evolution essential for assessing secondary hazards and long-term stability.

[334] Gesplat: Robust Pose-Free 3D Reconstruction via Geometry-Guided Gaussian Splatting

Jiahui Lu, Haihong Xiao, Xueyan Zhao, Wenxiong Kang

Main category: cs.CV

TL;DR: Gesplat enables robust 3D reconstruction and novel view synthesis from unposed sparse images using 3D Gaussian Splatting with VGGT foundation model initialization and hybrid optimization techniques.

Details

Motivation: NeRF and 3DGS require accurate camera poses and dense viewpoint coverage, limiting their applicability in sparse-view settings where pose estimation becomes unreliable and supervision is insufficient.

Method: Uses VGGT foundation model for reliable initial poses and dense point clouds, hybrid Gaussian representation with dual position-shape optimization, graph-guided attribute refinement, and flow-based depth regularization.

Result: Achieves more robust performance on both forward-facing and large-scale complex datasets compared to other pose-free methods.

Conclusion: Gesplat overcomes the limitations of pose-dependent methods in sparse-view settings, enabling geometrically consistent reconstruction from unposed sparse images.

Abstract: Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have advanced 3D reconstruction and novel view synthesis, but remain heavily dependent on accurate camera poses and dense viewpoint coverage. These requirements limit their applicability in sparse-view settings, where pose estimation becomes unreliable and supervision is insufficient. To overcome these challenges, we introduce Gesplat, a 3DGS-based framework that enables robust novel view synthesis and geometrically consistent reconstruction from unposed sparse images. Unlike prior works that rely on COLMAP for sparse point cloud initialization, we leverage the VGGT foundation model to obtain more reliable initial poses and dense point clouds. Our approach integrates several key innovations: 1) a hybrid Gaussian representation with dual position-shape optimization enhanced by inter-view matching consistency; 2) a graph-guided attribute refinement module to enhance scene details; and 3) flow-based depth regularization that improves depth estimation accuracy for more effective supervision. Comprehensive quantitative and qualitative experiments demonstrate that our approach achieves more robust performance on both forward-facing and large-scale complex datasets compared to other pose-free methods.

[335] Cooperative Pseudo Labeling for Unsupervised Federated Classification

Kuangpu Guo, Lijun Sheng, Yongcan Yu, Jian Liang, Zilei Wang, Ran He

Main category: cs.CV

TL;DR: FedCoPL extends Unsupervised Federated Learning to classification using CLIP, addressing global class imbalance through pseudo label distribution adjustment and partial prompt aggregation.

Details

Motivation: To enable classification in Unsupervised Federated Learning by leveraging CLIP's zero-shot capabilities, overcoming previous limitations where classification was infeasible without label information.

Method: Clients estimate and upload pseudo label distributions, server adjusts them to avoid global class imbalance, and uses partial prompt aggregation (visual prompts aggregated at server, text prompts kept locally).

Result: Extensive experiments show FedCoPL achieves superior performance compared to baseline methods in unsupervised federated classification tasks.

Conclusion: FedCoPL successfully enables classification in UFL using CLIP, demonstrating effective handling of class imbalance through cooperative pseudo labeling and personalized prompt strategies.

Abstract: Unsupervised Federated Learning (UFL) aims to collaboratively train a global model across distributed clients without sharing data or accessing label information. Previous UFL works have predominantly focused on representation learning and clustering tasks. Recently, vision language models (e.g., CLIP) have gained significant attention for their powerful zero-shot prediction capabilities. Leveraging this advancement, classification problems that were previously infeasible under the UFL paradigm now present promising new opportunities, yet remain largely unexplored. In this paper, we extend UFL to the classification problem with CLIP for the first time and propose a novel method, \underline{\textbf{Fed}}erated \underline{\textbf{Co}}operative \underline{\textbf{P}}seudo \underline{\textbf{L}}abeling (\textbf{FedCoPL}). Specifically, clients estimate and upload their pseudo label distribution, and the server adjusts and redistributes them to avoid global imbalance among classes. Moreover, we introduce a partial prompt aggregation protocol for effective collaboration and personalization. In particular, visual prompts containing general image features are aggregated at the server, while text prompts encoding personalized knowledge are retained locally. Extensive experiments demonstrate the superior performance of our FedCoPL compared to baseline methods. Our code is available at \href{https://github.com/krumpguo/FedCoPL}{https://github.com/krumpguo/FedCoPL}.

Minbin Huang, Runhui Huang, Chuanyang Zheng, Jingyao Li, Guoxuan Chen, Han Shi, Hong Cheng

Main category: cs.CV

TL;DR: The paper proposes ACRE (Answer-Consistent Reinforcement Learning) to address the problem of inconsistency between reasoning traces and final answers in multimodal LLMs trained with RLVR, achieving improvements over the GRPO baseline on video and math reasoning tasks.

Details

Motivation: Standard outcome-driven RL improves answer accuracy but can lead to inconsistency between reasoning chains and final answers, with only 79.7% consistency observed in experiments.

Method: ACRE modifies GRPO with an auxiliary consistency check: after generating reasoning and initial answer, answer options are shuffled and the model is prompted again with the same reasoning to predict a second answer, with rewards based on consistency and correctness.

Result: ACRE achieves average improvements of 2.2% on Video Reasoning benchmarks and 1.5% on multimodal math reasoning benchmarks over the GRPO baseline.

Conclusion: The proposed ACRE method effectively addresses reasoning-answer inconsistency in RL-trained multimodal LLMs by incorporating consistency verification into the reward mechanism.

Abstract: Recent advances in large language models (LLMs) have demonstrated that reinforcement learning with verifiable rewards (RLVR) can significantly enhance reasoning abilities by directly optimizing correctness, rather than relying solely on supervised imitation. This paradigm has been extended to multimodal LLMs for complex video and image understanding tasks. However, while outcome-driven RL improves answer accuracy, it can inadvertently decouple the reasoning chain from the final answer, leading to situations where models produce inconsistency between the reasoning trace and final answer. In our experiments on multiple-choice visual question-answering tasks, the standard GRPO method yields only 79.7% consistency on MMVU between the reasoning steps and the chosen answers, indicating frequent mismatches between answers and reasoning. To this end, we propose Answer-Consistent Reinforcement Learning (ACRE) that modifies the GRPO algorithm with an auxiliary consistency check. After the model generates a chain of thought and an initial answer for a given question, we shuffle the answer options and prompt the model again with the same reasoning trace to predict a second answer. We design a consistency-verification reward that grants a high reward only if both the original and the post-shuffle answers agree and are correct; otherwise, a lower reward is assigned accordingly. This mechanism penalizes reasoning-answer misalignment and discourages the model from relying on spurious patterns, such as option ordering biases. We evaluate ACRE on challenging Video Reasoning benchmarks and multimodal math reasoning benchmarks, achieving an average 2.2% and 1.5% improvement for Video Reasoning and Math Reasoning tasks over the GRPO baseline.

[337] Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization

Rui Chen, Bin Liu, Changtao Miao, Xinghao Wang, Yi Li, Tao Gong, Qi Chu, Nenghai Yu

Main category: cs.CV

TL;DR: ICFC is a training-free framework using multi-modal LLMs for interpretable image manipulation localization, achieving state-of-the-art performance without pixel-level annotations.

Details

Motivation: Address security threats from image tampering by providing effective localization without costly pixel-level annotations, overcoming limitations of existing weakly supervised and training-free methods.

Method: In-Context Forensic Chain integrates objectified rule construction with adaptive filtering to build knowledge base, and multi-step progressive reasoning pipeline mirroring expert workflows from coarse to fine-grained analysis.

Result: Surpasses state-of-the-art training-free methods and achieves competitive/superior performance compared to weakly and fully supervised approaches across multiple benchmarks.

Conclusion: ICFC enables systematic exploitation of MLLM reasoning for image classification, pixel localization, and text interpretability, providing effective training-free solution for image manipulation localization.

Abstract: Advances in image tampering pose serious security threats, underscoring the need for effective image manipulation localization (IML). While supervised IML achieves strong performance, it depends on costly pixel-level annotations. Existing weakly supervised or training-free alternatives often underperform and lack interpretability. We propose the In-Context Forensic Chain (ICFC), a training-free framework that leverages multi-modal large language models (MLLMs) for interpretable IML tasks. ICFC integrates an objectified rule construction with adaptive filtering to build a reliable knowledge base and a multi-step progressive reasoning pipeline that mirrors expert forensic workflows from coarse proposals to fine-grained forensics results. This design enables systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability. Across multiple benchmarks, ICFC not only surpasses state-of-the-art training-free methods but also achieves competitive or superior performance compared to weakly and fully supervised approaches.

[338] ImmerIris: A Large-Scale Dataset and Benchmark for Immersive Iris Recognition in Open Scenes

Yuxi Mi, Qiuyang Yuan, Zhizhou Zhong, Xuan Zhao, Jiaogen Zhou, Fubao Zhu, Jihong Guan, Shuigeng Zhou

Main category: cs.CV

TL;DR: ImmerIris is a large-scale dataset for off-axis iris recognition in VR/AR applications, containing 499,791 images from 564 subjects. The paper also proposes a normalization-free recognition method that outperforms traditional approaches.

Details

Motivation: Traditional iris recognition uses on-axis acquisition in controlled settings, but immersive applications like VR/AR capture off-axis iris images through headset cameras, creating challenges like perspective distortion and quality degradation. Existing datasets don't address these challenges.

Method: Created ImmerIris dataset collected via VR headsets, established evaluation protocols, and proposed a normalization-free recognition paradigm that learns directly from ocular images with minimal preprocessing.

Result: Current methods designed for on-axis imagery perform poorly on off-axis data. The proposed normalization-free approach consistently outperforms normalization-based methods on the ImmerIris dataset.

Conclusion: Normalization-free iris recognition is a promising direction for robust immersive recognition, addressing the limitations of traditional methods in off-axis acquisition scenarios.

Abstract: In egocentric applications such as augmented and virtual reality, immersive iris recognition is emerging as an accurate and seamless way to identify persons. While classic systems acquire iris images on-axis, i.e., via dedicated frontal sensors in controlled settings, the immersive setup primarily captures off-axis irises through tilt-placed headset cameras, with only mild control in open scenes. This yields unique challenges, including perspective distortion, intensified quality degradations, and intra-class variations in iris texture. Datasets capturing these challenges remain scarce. To fill this gap, this paper introduces ImmerIris, a large-scale dataset collected via VR headsets, containing 499,791 ocular images from 564 subjects. It is, to the best of current knowledge, the largest public dataset and among the first dedicated to off-axis acquisition. Based on ImmerIris, evaluation protocols are constructed to benchmark recognition methods under different challenging factors. Current methods, primarily designed for classic on-axis imagery, perform unsatisfactorily on the immersive setup, mainly due to reliance on fallible normalization. To this end, this paper further proposes a normalization-free paradigm that directly learns from ocular images with minimal adjustment. Despite its simplicity, this approach consistently outperforms normalization-based counterparts, pointing to a promising direction for robust immersive recognition.

[339] Multi Class Parkinsons Disease Detection Based on Finger Tapping Using Attention-Enhanced CNN BiLSTM

Abu Saleh Musa Miah, Najmul Hassan, Md Maruf Al Hossain, Yuichi Okuyama, Jungpil Shin

Main category: cs.CV

TL;DR: A hybrid deep learning model combining CNN, BiLSTM, and attention mechanisms for multi-class Parkinson’s disease severity classification using finger tapping video features.

Details

Motivation: Current gesture-based PD recognition systems have unsatisfactory accuracy, and accurate PD severity evaluation is crucial for clinical management and intervention development.

Method: Collected finger tapping videos, extracted temporal, frequency, and amplitude features from wrist/hand movements, and built a hybrid framework with Conv1D MaxPooling, BiLSTM layers, attention mechanisms, and dense layers for multi-class classification.

Result: The model demonstrated strong performance in distinguishing between five PD severity classes, showing improved automated PD severity detection.

Conclusion: Integrating spatial-temporal representations with attention mechanisms can improve automated PD severity detection, making it a promising non-invasive tool for clinical PD monitoring and progression tracking.

Abstract: Effective clinical management and intervention development depend on accurate evaluation of Parkinsons disease (PD) severity. Many researchers have worked on developing gesture-based PD recognition systems; however, their performance accuracy is not satisfactory. In this study, we propose a multi-class Parkinson Disease detection system based on finger tapping using an attention-enhanced CNN BiLSTM. We collected finger tapping videos and derived temporal, frequency, and amplitude based features from wrist and hand movements. Then, we proposed a hybrid deep learning framework integrating CNN, BiLSTM, and attention mechanisms for multi-class PD severity classification from video-derived motion features. First, the input sequence is reshaped and passed through a Conv1D MaxPooling block to capture local spatial dependencies. The resulting feature maps are fed into a BiLSTM layer to model temporal dynamics. An attention mechanism focuses on the most informative temporal features, producing a context vector that is further processed by a second BiLSTM layer. CNN-derived features and attention-enhanced BiLSTM outputs are concatenated, followed by dense and dropout layers, before the final softmax classifier outputs the predicted PD severity level. The model demonstrated strong performance in distinguishing between the five severity classes, suggesting that integrating spatial temporal representations with attention mechanisms can improve automated PD severity detection, making it a promising non-invasive tool to support clinicians in PD monitoring and progression tracking.

[340] DeepFusionNet: Autoencoder-Based Low-Light Image Enhancement and Super-Resolution

Halil Hüseyin Çalışkan, Talha Koruk

Main category: cs.CV

TL;DR: DeepFusionNet is a lightweight architecture for low-light image enhancement and super-resolution, achieving high SSIM/PSNR scores with significantly fewer parameters than existing methods.

Details

Motivation: Current autoencoder methods for low-light image enhancement suffer from low SSIM/PSNR scores and high computational requirements due to large parameter counts. Similarly, GAN-based super-resolution methods are computationally expensive.

Method: Developed DeepFusionNet architecture for both tasks: 1) Low-light enhancement with ~2.5M parameters, 2) Super-resolution with ~100K parameters using autoencoder approach instead of GANs.

Result: Low-light enhancement: 92.8% SSIM and 26.30 PSNR on LOL-v1 dataset. Super-resolution: 25.30 PSNR and 80.7% SSIM on validation set.

Conclusion: DeepFusionNet provides efficient solutions for both low-light enhancement and super-resolution tasks with significantly reduced computational requirements while maintaining high image quality metrics.

Abstract: Computer vision and image processing applications suffer from dark and low-light images, particularly during real-time image transmission. Currently, low light and dark images are converted to bright and colored forms using autoencoders; however, these methods often achieve low SSIM and PSNR scores and require high computational power due to their large number of parameters. To address these challenges, the DeepFusionNet architecture has been developed. According to the results obtained with the LOL-v1 dataset, DeepFusionNet achieved an SSIM of 92.8% and a PSNR score of 26.30, while containing only approximately 2.5 million parameters. On the other hand, conversion of blurry and low-resolution images into high-resolution and blur-free images has gained importance in image processing applications. Unlike GAN-based super-resolution methods, an autoencoder-based super resolution model has been developed that contains approximately 100 thousand parameters and uses the DeepFusionNet architecture. According to the results of the tests, the DeepFusionNet based super-resolution method achieved a PSNR of 25.30 and a SSIM score of 80.7 percent according to the validation set.

[341] Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer

Yecong Wan, Mingwen Shao, Renlong Wu, Wangmeng Zuo

Main category: cs.CV

TL;DR: Color3D is a framework for colorizing static and dynamic 3D scenes from monochrome inputs using a personalized colorizer approach that ensures consistency while preserving color diversity and user control.

Details

Motivation: Existing methods sacrifice chromatic richness and controllability by averaging color variations for multi-view consistency. Color3D aims to preserve color diversity and steerability while ensuring cross-view and cross-time consistency.

Method: Colorize a single key view, fine-tune a personalized colorizer to propagate colors to novel views and time steps, then use Lab color space Gaussian splatting for 3D reconstruction. Recasts 3D colorization as single image paradigm.

Result: Extensive experiments show Color3D delivers more consistent and chromatically rich renderings with precise user control across diverse static and dynamic 3D colorization benchmarks.

Conclusion: The framework successfully enables flexible user-guided colorization of 3D scenes while maintaining consistency and color richness, integrating arbitrary image colorization models with enhanced flexibility.

Abstract: In this work, we present Color3D, a highly adaptable framework for colorizing both static and dynamic 3D scenes from monochromatic inputs, delivering visually diverse and chromatically vibrant reconstructions with flexible user-guided control. In contrast to existing methods that focus solely on static scenarios and enforce multi-view consistency by averaging color variations which inevitably sacrifice both chromatic richness and controllability, our approach is able to preserve color diversity and steerability while ensuring cross-view and cross-time consistency. In particular, the core insight of our method is to colorize only a single key view and then fine-tune a personalized colorizer to propagate its color to novel views and time steps. Through personalization, the colorizer learns a scene-specific deterministic color mapping underlying the reference view, enabling it to consistently project corresponding colors to the content in novel views and video frames via its inherent inductive bias. Once trained, the personalized colorizer can be applied to infer consistent chrominance for all other images, enabling direct reconstruction of colorful 3D scenes with a dedicated Lab color space Gaussian splatting representation. The proposed framework ingeniously recasts complicated 3D colorization as a more tractable single image paradigm, allowing seamless integration of arbitrary image colorization models with enhanced flexibility and controllability. Extensive experiments across diverse static and dynamic 3D colorization benchmarks substantiate that our method can deliver more consistent and chromatically rich renderings with precise user control. Project Page https://yecongwan.github.io/Color3D/.

[342] Stroke Locus Net: Occluded Vessel Localization from MRI Modalities

Mohamed Hamad, Muhammad Khan, Tamer Khattab, Mohamed Mabrok

Main category: cs.CV

TL;DR: Stroke Locus Net is an end-to-end deep learning pipeline that detects, segments, and localizes occluded vessels in ischemic stroke using only MRI scans, combining nnUNet for lesion segmentation, arterial atlas for vessel mapping, and pGAN for MRA synthesis.

Details

Motivation: Current machine learning methods focus primarily on lesion segmentation with limited work on vessel localization, which is a key challenge in ischemic stroke diagnosis using medical imaging.

Method: The system combines a segmentation branch using nnUNet for lesion detection with an arterial atlas for vessel mapping and identification, and a generation branch using pGAN to synthesize MRA images from MRI.

Result: The implementation demonstrates promising results in localizing occluded vessels on stroke-affected T1 MRI scans.

Conclusion: The approach has potential for faster and more informed stroke diagnosis by accurately localizing occluded vessels using only MRI scans.

Abstract: A key challenge in ischemic stroke diagnosis using medical imaging is the accurate localization of the occluded vessel. Current machine learning methods in focus primarily on lesion segmentation, with limited work on vessel localization. In this study, we introduce Stroke Locus Net, an end-to-end deep learning pipeline for detection, segmentation, and occluded vessel localization using only MRI scans. The proposed system combines a segmentation branch using nnUNet for lesion detection with an arterial atlas for vessel mapping and identification, and a generation branch using pGAN to synthesize MRA images from MRI. Our implementation demonstrates promising results in localizing occluded vessels on stroke-affected T1 MRI scans, with potential for faster and more informed stroke diagnosis.

[343] ReMix: Towards a Unified View of Consistent Character Generation and Editing

Benjia Zhou, Bin Fu, Pei Cheng, Yanru Wang, Jiayuan Fan, Tao Chen

Main category: cs.CV

TL;DR: ReMix is a unified framework for character-consistent generation and editing that combines a ReMix Module for semantic feature editing and IP-ControlNet for pixel-level consistency and pose controllability.

Details

Motivation: Existing methods struggle to unify character generation and editing in a single framework, with generation-based approaches lacking fine-grained identity consistency and editing-based methods losing spatial controllability.

Method: Uses two components: ReMix Module leverages MLLMs to edit semantic features and adapt instruction embeddings to DiT backbone without fine-tuning; IP-ControlNet extends ControlNet to decouple semantic/layout cues and introduces ε-equivariant latent space for joint denoising in shared noise space.

Result: ReMix supports personalized generation, image editing, style transfer, and multi-condition synthesis while maintaining character consistency and spatial controllability.

Conclusion: ReMix provides an effective and efficient unified framework for character-consistent image generation and editing, validated through extensive experiments.

Abstract: Recent advances in large-scale text-to-image diffusion models (e.g., FLUX.1) have greatly improved visual fidelity in consistent character generation and editing. However, existing methods rarely unify these tasks within a single framework. Generation-based approaches struggle with fine-grained identity consistency across instances, while editing-based methods often lose spatial controllability and instruction alignment. To bridge this gap, we propose ReMix, a unified framework for character-consistent generation and editing. It constitutes two core components: the ReMix Module and IP-ControlNet. The ReMix Module leverages the multimodal reasoning ability of MLLMs to edit semantic features of input images and adapt instruction embeddings to the native DiT backbone without fine-tuning. While this ensures coherent semantic layouts, pixel-level consistency and pose controllability remain challenging. To address this, IP-ControlNet extends ControlNet to decouple semantic and layout cues from reference images and introduces an {\epsilon}-equivariant latent space that jointly denoises the reference and target images within a shared noise space. Inspired by convergent evolution and quantum decoherence,i.e., where environmental noise drives state convergence, this design promotes feature alignment in the hidden space, enabling consistent object generation while preserving identity. ReMix supports a wide range of tasks, including personalized generation, image editing, style transfer, and multi-condition synthesis. Extensive experiments validate its effectiveness and efficiency as a unified framework for character-consistent image generation and editing.

[344] SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation

Zhenjie Mao, Yuhuan Yang, Chaofan Ma, Dongsheng Jiang, Jiangchao Yao, Ya Zhang, Yanfeng Wang

Main category: cs.CV

TL;DR: SaFiRe is a novel framework for Referring Image Segmentation that addresses ambiguous expressions through a two-phase cognitive process, using Mamba’s scan-then-update property for efficient multi-cycle refinement with linear complexity.

Details

Motivation: Current RIS methods focus on simple expressions and reduce the task to keyword matching, limiting their ability to handle referential ambiguity in complex real-world scenarios like object-distracting and category-implicit expressions.

Method: Proposes SaFiRe framework that mimics human two-phase cognitive process: global understanding followed by detail-oriented inspection, leveraging Mamba’s scan-then-update property for efficient multi-cycle refinement with linear complexity.

Result: Extensive experiments on standard and proposed aRefCOCO benchmark demonstrate SaFiRe’s superiority over state-of-the-art baselines in handling ambiguous referring expressions.

Conclusion: SaFiRe effectively addresses challenging real-world RIS scenarios through its two-phase cognitive design and efficient refinement process, showing improved performance on ambiguous expressions.

Abstract: Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. While recent methods leverage pre-trained vision backbones and more training corpus to achieve impressive results, they predominantly focus on simple expressions–short, clear noun phrases like “red car” or “left girl”. This simplification often reduces RIS to a key word/concept matching problem, limiting the model’s ability to handle referential ambiguity in expressions. In this work, we identify two challenging real-world scenarios: object-distracting expressions, which involve multiple entities with contextual cues, and category-implicit expressions, where the object class is not explicitly stated. To address the challenges, we propose a novel framework, SaFiRe, which mimics the human two-phase cognitive process–first forming a global understanding, then refining it through detail-oriented inspection. This is naturally supported by Mamba’s scan-then-update property, which aligns with our phased design and enables efficient multi-cycle refinement with linear complexity. We further introduce aRefCOCO, a new benchmark designed to evaluate RIS models under ambiguous referring expressions. Extensive experiments on both standard and proposed datasets demonstrate the superiority of SaFiRe over state-of-the-art baselines.

[345] SparseUWSeg: Active Sparse Point-Label Augmentation for Underwater Semantic Segmentation

César Borja, Carlos Plou, Rubén Martinez-Cantín, Ana C. Murillo

Main category: cs.CV

TL;DR: SparseUWSeg is a framework for underwater semantic segmentation that uses active sampling to guide point annotation and hybrid label propagation combining SAM2 and superpixels, achieving +5% mIoU improvement over baselines.

Details

Motivation: Fine-grained underwater scene analysis is challenging due to high annotation costs for dense segmentation labels. Sparse point-labels are easier to obtain but create challenges in annotation selection and label propagation.

Method: Uses active sampling strategy to guide annotators for optimal point label selection, then propagates sparse labels with hybrid approach leveraging both SAM2 and superpixel-based methods.

Result: Experiments on two underwater datasets show SparseUWSeg outperforms state-of-the-art approaches, achieving up to +5% mIoU improvement over D+NN baseline.

Conclusion: Main contribution is an effective interactive annotation tool that enables ecology researchers to efficiently generate high-quality segmentation masks using foundation models and computer vision.

Abstract: Semantic segmentation is essential to automate underwater imagery analysis with ecology monitoring purposes. Unfortunately, fine grained underwater scene analysis is still an open problem even for top performing segmentation models. The high cost of obtaining dense, expert-annotated, segmentation labels hinders the supervision of models in this domain. While sparse point-labels are easier to obtain, they introduce challenges regarding which points to annotate and how to propagate the sparse information. We present SparseUWSeg, a novel framework that addresses both issues. SparseUWSeg employs an active sampling strategy to guide annotators, maximizing the value of their point labels. Then, it propagates these sparse labels with a hybrid approach leverages both the best of SAM2 and superpixel-based methods. Experiments on two diverse underwater datasets demonstrate the benefits of SparseUWSeg over state-of-the-art approaches, achieving up to +5% mIoU over D+NN. Our main contribution is the design and release of a simple but effective interactive annotation tool, integrating our algorithms. It enables ecology researchers to leverage foundation models and computer vision to efficiently generate high-quality segmentation masks to process their data.

[346] ViConEx-Med: Visual Concept Explainability via Multi-Concept Token Transformer for Medical Image Analysis

Cristiano Patrício, Luís F. Teixeira, João C. Neves

Main category: cs.CV

TL;DR: ViConEx-Med is a transformer-based framework that introduces multi-concept learnable tokens to jointly predict and localize visual concepts, addressing limitations of existing concept-based models that lack visual explanations.

Details

Motivation: Existing concept-based models treat concepts as numerical attributes without providing visual explanations to localize predicted concepts, limiting their utility in real-world applications, especially in high-stakes medical scenarios.

Method: Proposes ViConEx-Med framework using multi-concept learnable tokens and specialized attention layers for processing visual and text-based concept tokens to produce concept-level localization maps while maintaining predictive accuracy.

Result: Experiments on synthetic and real-world medical datasets show ViConEx-Med outperforms prior concept-based models and achieves competitive performance with black-box models in both concept detection and localization precision.

Conclusion: The approach suggests a promising direction for building inherently interpretable models grounded in visual concepts, with code publicly available.

Abstract: Concept-based models aim to explain model decisions with human-understandable concepts. However, most existing approaches treat concepts as numerical attributes, without providing complementary visual explanations that could localize the predicted concepts. This limits their utility in real-world applications and particularly in high-stakes scenarios, such as medical use-cases. This paper proposes ViConEx-Med, a novel transformer-based framework for visual concept explainability, which introduces multi-concept learnable tokens to jointly predict and localize visual concepts. By leveraging specialized attention layers for processing visual and text-based concept tokens, our method produces concept-level localization maps while maintaining high predictive accuracy. Experiments on both synthetic and real-world medical datasets demonstrate that ViConEx-Med outperforms prior concept-based models and achieves competitive performance with black-box models in terms of both concept detection and localization precision. Our results suggest a promising direction for building inherently interpretable models grounded in visual concepts. Code is publicly available at https://github.com/CristianoPatricio/viconex-med.

[347] HccePose(BF): Predicting Front & Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation

Yulin Wang, Mengting Hu, Hongli Li, Chen Luo

Main category: cs.CV

TL;DR: The paper proposes using both front and back surfaces of objects to create ultra-dense 2D-3D correspondences for improved pose estimation, introducing Hierarchical Continuous Coordinate Encoding (HCCE) for better coordinate representation.

Details

Motivation: Current pose estimation methods focus only on the object's front surface, overlooking the potential benefits of incorporating the back surface and interior of the object for more accurate pose estimation.

Method: Predict 3D coordinates of both front and back surfaces, densely sample 3D coordinates between them to create ultra-dense 2D-3D correspondences, and use Hierarchical Continuous Coordinate Encoding (HCCE) for accurate and efficient coordinate representation.

Result: Outperforms existing state-of-the-art methods across seven classic BOP core datasets on the BOP website.

Conclusion: Incorporating both front and back surfaces with ultra-dense sampling significantly improves pose estimation accuracy compared to methods that only use the front surface.

Abstract: In pose estimation for seen objects, a prevalent pipeline involves using neural networks to predict dense 3D coordinates of the object surface on 2D images, which are then used to establish dense 2D-3D correspondences. However, current methods primarily focus on more efficient encoding techniques to improve the precision of predicted 3D coordinates on the object’s front surface, overlooking the potential benefits of incorporating the back surface and interior of the object. To better utilize the full surface and interior of the object, this study predicts 3D coordinates of both the object’s front and back surfaces and densely samples 3D coordinates between them. This process creates ultra-dense 2D-3D correspondences, effectively enhancing pose estimation accuracy based on the Perspective-n-Point (PnP) algorithm. Additionally, we propose Hierarchical Continuous Coordinate Encoding (HCCE) to provide a more accurate and efficient representation of front and back surface coordinates. Experimental results show that, compared to existing state-of-the-art (SOTA) methods on the BOP website, the proposed approach outperforms across seven classic BOP core datasets. Code is available at https://github.com/WangYuLin-SEU/HCCEPose.

Zixu Zhao, Yang Zhan

Main category: cs.CV

TL;DR: The paper introduces DVTMD, a new drone video-text dataset with fine-grained captions, and proposes TCMA framework for text-video retrieval that achieves state-of-the-art performance.

Details

Motivation: Existing UAV video-text retrieval is limited by datasets with coarse and redundant captions, making it difficult to efficiently retrieve relevant content from massive aerial videos for applications like urban management and emergency response.

Method: Proposed Text-Conditioned Multi-granularity Alignment (TCMA) framework with global video-sentence alignment, sentence-guided frame aggregation, word-guided patch alignment, plus Word and Patch Selection module and Text-Adaptive Dynamic Temperature Mechanism.

Result: Achieved state-of-the-art performance with 45.5% R@1 in text-to-video and 42.8% R@1 in video-to-text retrieval on DVTMD dataset, establishing the first complete benchmark for drone text-video retrieval.

Conclusion: The DVTMD dataset and TCMA framework effectively address the limitations in UAV video-text retrieval, demonstrating superior performance and providing a solid foundation for future research in this domain.

Abstract: Unmanned aerial vehicles (UAVs) have become powerful platforms for real-time, high-resolution data collection, producing massive volumes of aerial videos. Efficient retrieval of relevant content from these videos is crucial for applications in urban management, emergency response, security, and disaster relief. While text-video retrieval has advanced in natural video domains, the UAV domain remains underexplored due to limitations in existing datasets, such as coarse and redundant captions. Thus, in this work, we construct the Drone Video-Text Match Dataset (DVTMD), which contains 2,864 videos and 14,320 fine-grained, semantically diverse captions. The annotations capture multiple complementary aspects, including human actions, objects, background settings, environmental conditions, and visual style, thereby enhancing text-video correspondence and reducing redundancy. Building on this dataset, we propose the Text-Conditioned Multi-granularity Alignment (TCMA) framework, which integrates global video-sentence alignment, sentence-guided frame aggregation, and word-guided patch alignment. To further refine local alignment, we design a Word and Patch Selection module that filters irrelevant content, as well as a Text-Adaptive Dynamic Temperature Mechanism that adapts attention sharpness to text type. Extensive experiments on DVTMD and CapERA establish the first complete benchmark for drone text-video retrieval. Our TCMA achieves state-of-the-art performance, including 45.5% R@1 in text-to-video and 42.8% R@1 in video-to-text retrieval, demonstrating the effectiveness of our dataset and method. The code and dataset will be released.

[349] Fairness Without Labels: Pseudo-Balancing for Bias Mitigation in Face Gender Classification

Haohua Dong, Ana Manzano Rodríguez, Camille Guinaudeau, Shin’ichi Satoh

Main category: cs.CV

TL;DR: Pseudo-balancing is a semi-supervised method that mitigates gender classification biases by enforcing demographic balance during pseudo-label selection using unlabeled race-balanced data, improving fairness and accuracy without ground-truth annotations.

Details

Motivation: Face gender classification models often reflect and amplify demographic biases from training data, leading to uneven performance across gender and racial subgroups.

Method: Pseudo-balancing enforces demographic balance during pseudo-label selection using only unlabeled images from a race-balanced dataset without requiring ground-truth annotations. Evaluated through fine-tuning biased classifiers and stress-testing with imbalanced data.

Result: Method achieved 79.81% overall accuracy (6.53% improvement over baseline) and reduced gender accuracy gap by 44.17%. In East Asian subgroup, gap narrowed from over 49% to just 5.01%.

Conclusion: Even without label supervision, access to demographically balanced unlabeled datasets can effectively debias computer vision models while preserving or enhancing accuracy.

Abstract: Face gender classification models often reflect and amplify demographic biases present in their training data, leading to uneven performance across gender and racial subgroups. We introduce pseudo-balancing, a simple and effective strategy for mitigating such biases in semi-supervised learning. Our method enforces demographic balance during pseudo-label selection, using only unlabeled images from a race-balanced dataset without requiring access to ground-truth annotations. We evaluate pseudo-balancing under two conditions: (1) fine-tuning a biased gender classifier using unlabeled images from the FairFace dataset, and (2) stress-testing the method with intentionally imbalanced training data to simulate controlled bias scenarios. In both cases, models are evaluated on the All-Age-Faces (AAF) benchmark, which contains a predominantly East Asian population. Our results show that pseudo-balancing consistently improves fairness while preserving or enhancing accuracy. The method achieves 79.81% overall accuracy - a 6.53% improvement over the baseline - and reduces the gender accuracy gap by 44.17%. In the East Asian subgroup, where baseline disparities exceeded 49%, the gap is narrowed to just 5.01%. These findings suggest that even in the absence of label supervision, access to a demographically balanced or moderately skewed unlabeled dataset can serve as a powerful resource for debiasing existing computer vision models.

[350] B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding

Feng Xiao, Hongbin Xu, Hai Ci, Wenxiong Kang

Main category: cs.CV

TL;DR: A novel progressive relational learning framework for 3D object grounding that extends relational learning from binary to n-ary relationships to better handle complex spatial descriptions involving multiple objects.

Details

Motivation: Current methods for 3D object localization using natural language only model pairwise relationships, ignoring the global perceptual significance of n-ary combinations in multi-modal relational understanding, which is essential for distinguishing similar objects in complex descriptions.

Method: Proposed a progressive relational learning framework that extends from binary to n-ary relationships, uses a grouped supervision loss for n-ary relational learning without specific annotations, and employs a multi-modal network with hybrid attention mechanisms within scene graphs created with n-ary relationships.

Result: Experiments on ReferIt3D and ScanRefer benchmarks demonstrate that the method outperforms state-of-the-art approaches and proves the advantages of n-ary relational perception in 3D localization.

Conclusion: The proposed n-ary relational learning framework effectively addresses the limitations of pairwise relationship modeling and significantly improves 3D object localization performance by capturing global relational contexts in natural language descriptions.

Abstract: Localizing 3D objects using natural language is essential for robotic scene understanding. The descriptions often involve multiple spatial relationships to distinguish similar objects, making 3D-language alignment difficult. Current methods only model relationships for pairwise objects, ignoring the global perceptual significance of n-ary combinations in multi-modal relational understanding. To address this, we propose a novel progressive relational learning framework for 3D object grounding. We extend relational learning from binary to n-ary to identify visual relations that match the referential description globally. Given the absence of specific annotations for referred objects in the training data, we design a grouped supervision loss to facilitate n-ary relational learning. In the scene graph created with n-ary relationships, we use a multi-modal network with hybrid attention mechanisms to further localize the target within the n-ary combinations. Experiments and ablation studies on the ReferIt3D and ScanRefer benchmarks demonstrate that our method outperforms the state-of-the-art, and proves the advantages of the n-ary relational perception in 3D localization.

[351] From Generic to Specialized: A Subspecialty Diagnostic System Powered by Self-Supervised Learning for Cervical Histopathology

Yizhi Wang, Li Chen, Qiang Huang, Tian Guan, Xi Deng, Zhiyuan Shen, Jiawen Li, Xinrui Chen, Bin Hu, Xitong Ling, Taojie Zhu, Zirui Huang, Deshui Yu, Yan Liu, Jiurun Chen, Lianghui Zhu, Qiming He, Yiqing Liu, Diwei Shi, Hanzhong Liu, Junbo Hu, Hongyi Gao, Zhen Song, Xilong Zhao, Chao He, Ming Zhao, Yonghong He

Main category: cs.CV

TL;DR: CerS-Path is a cervical pathology diagnostic system that uses two-stage pretraining on 190M tissue patches and 2.5M image-text pairs, achieving 99.38% screening sensitivity and supporting eight diagnostic functions.

Details

Motivation: Current deep learning models lack accuracy and generalizability for cervical cancer diagnosis, while general foundation models fail to capture subspecialty-specific features and task adaptability.

Method: Two synergistic pretraining stages: self-supervised learning on 190M tissue patches from 140K slides for cervical-specific feature extraction, followed by multimodal enhancement with 2.5M image-text pairs, integrated with multiple downstream diagnostic functions.

Result: CerS-Path surpasses prior foundation models, achieving 99.38% screening sensitivity in prospective testing on 3,173 cases across five centers, with excellent generalizability and support for eight diagnostic functions including rare cancer classification.

Conclusion: The system represents a significant advance in cervical pathology with strong potential for subspecialty diagnostic translation and cervical cancer screening applications.

Abstract: Cervical cancer remains a major malignancy, necessitating extensive and complex histopathological assessments and comprehensive support tools. Although deep learning shows promise, these models still lack accuracy and generalizability. General foundation models offer a broader reach but remain limited in capturing subspecialty-specific features and task adaptability. We introduce the Cervical Subspecialty Pathology (CerS-Path) diagnostic system, developed through two synergistic pretraining stages: self-supervised learning on approximately 190 million tissue patches from 140,000 slides to build a cervical-specific feature extractor, and multimodal enhancement with 2.5 million image-text pairs, followed by integration with multiple downstream diagnostic functions. Supporting eight diagnostic functions, including rare cancer classification and multimodal Q&A, CerS-Path surpasses prior foundation models in scope and clinical applicability. Comprehensive evaluations demonstrate a significant advance in cervical pathology, with prospective testing on 3,173 cases across five centers maintaining 99.38% screening sensitivity and excellent generalizability, highlighting its potential for subspecialty diagnostic translation and cervical cancer screening.

[352] A Style-Based Metric for Quantifying the Synthetic-to-Real Gap in Autonomous Driving Image Datasets

Dingyi Yao, Xinyao Han, Ruibo Ming, Zhihang Song, Lihui Peng, Jianming Hu, Danya Yao, Yi Zhang

Main category: cs.CV

TL;DR: A framework for quantifying the synthetic-to-real gap in autonomous driving perception systems using Style Embedding Distribution Discrepancy (SEDD) metric.

Details

Motivation: Real-world testing of autonomous driving systems is impractical, and while synthetic datasets offer cost-effective alternatives, the domain gap between synthetic and real data limits model generalization. Quantifying this gap is essential for evaluating dataset utility.

Method: Proposes SEDD metric combining Gram matrix-based style extraction with metric learning optimized for intra-class compactness and inter-class separation to extract style embeddings. Establishes benchmark using public datasets.

Result: Experiments on various datasets and sim-to-real methods demonstrate the method’s capability to effectively quantify the synthetic-to-real gap.

Conclusion: Provides a standardized quality control tool for systematic diagnosis and targeted enhancement of synthetic datasets, advancing data-driven autonomous driving development.

Abstract: Ensuring the reliability of autonomous driving perception systems requires extensive environment-based testing, yet real-world execution is often impractical. Synthetic datasets have therefore emerged as a promising alternative, offering advantages such as cost-effectiveness, bias free labeling, and controllable scenarios. However, the domain gap between synthetic and real-world datasets remains a critical bottleneck for the generalization of AI-based autonomous driving models. Quantifying this synthetic-to-real gap is thus essential for evaluating dataset utility and guiding the design of more effective training pipelines. In this paper, we establish a systematic framework for quantifying the synthetic-to-real gap in autonomous driving systems, and propose Style Embedding Distribution Discrepancy (SEDD) as a novel evaluation metric. Our framework combines Gram matrix-based style extraction with metric learning optimized for intra-class compactness and inter-class separation to extract style embeddings. Furthermore, we establish a benchmark using publicly available datasets. Experiments are conducted on a variety of datasets and sim-to-real methods, and the results show that our method is capable of quantifying the synthetic-to-real gap. This work provides a standardized quality control tool that enables systematic diagnosis and targeted enhancement of synthetic datasets, advancing future development of data-driven autonomous driving systems.

[353] Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images

Chuangchuang Tan, Xiang Ming, Jinglu Wang, Renshuai Tao, Bin Li, Yunchao Wei, Yao Zhao, Yan Lu

Main category: cs.CV

TL;DR: AnomReason is a benchmark for detecting semantic anomalies in AI-generated images using structured quadruple annotations, with AnomAgent enabling scalable annotation through multi-agent pipeline and human verification.

Details

Motivation: AI-generated images often contain subtle semantic anomalies that compromise plausibility, making detection essential for trustworthiness assessment in AIGC media, explainable deepfake detection, and semantic authenticity evaluation.

Method: Introduces AnomReason benchmark with structured quadruple annotations (Name, Phenomenon, Reasoning, Severity) created using AnomAgent - a modular multi-agent pipeline with lightweight human-in-the-loop verification, processing ~4.17B GPT-4o tokens.

Result: Models fine-tuned on AnomReason achieve consistent improvements over strong vision-language baselines using semantic matching metrics (SemAP and SemF1), demonstrating practical utility in explainable deepfake detection and semantic reasonableness assessment.

Conclusion: AnomReason and AnomAgent provide a foundation for measuring and improving semantic plausibility of AI-generated images, with released code, metrics, data, and models to support reproducible research on semantic authenticity and interpretable AIGC forensics.

Abstract: The rapid advancement of AI-generated content (AIGC) has enabled the synthesis of visually convincing images; however, many such outputs exhibit subtle \textbf{semantic anomalies}, including unrealistic object configurations, violations of physical laws, or commonsense inconsistencies, which compromise the overall plausibility of the generated scenes. Detecting these semantic-level anomalies is essential for assessing the trustworthiness of AIGC media, especially in AIGC image analysis, explainable deepfake detection and semantic authenticity assessment. In this paper, we formalize \textbf{semantic anomaly detection and reasoning} for AIGC images and introduce \textbf{AnomReason}, a large-scale benchmark with structured annotations as quadruples \emph{(Name, Phenomenon, Reasoning, Severity)}. Annotations are produced by a modular multi-agent pipeline (\textbf{AnomAgent}) with lightweight human-in-the-loop verification, enabling scale while preserving quality. At construction time, AnomAgent processed approximately 4.17,B GPT-4o tokens, providing scale evidence for the resulting structured annotations. We further show that models fine-tuned on AnomReason achieve consistent gains over strong vision-language baselines under our proposed semantic matching metric (\textit{SemAP} and \textit{SemF1}). Applications to {explainable deepfake detection} and {semantic reasonableness assessment of image generators} demonstrate practical utility. In summary, AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI-generated images. We will release code, metrics, data, and task-aligned models to support reproducible research on semantic authenticity and interpretable AIGC forensics.

[354] MRI Brain Tumor Detection with Computer Vision

Jack Krolik, Jake Lynn, John Henry Rudden, Dmytro Vremenko

Main category: cs.CV

TL;DR: This paper applies deep learning models including CNNs, ResNet, U-Net, and EfficientDet for automated brain tumor detection and segmentation from MRI scans, showing improved diagnostic accuracy and efficiency.

Details

Motivation: To enhance brain tumor diagnostics through automated detection and segmentation using deep learning techniques, aiming to improve clinical outcomes in medical imaging.

Method: Employed multiple machine learning models: logistic regression, CNNs, ResNet for classification; U-Net for semantic segmentation; EfficientDet for anchor-based object detection on MRI scans.

Result: Demonstrated promising improvements in accuracy and efficiency of brain tumor diagnostics through the applied deep learning techniques.

Conclusion: Deep learning shows significant potential in medical imaging for brain tumor detection and segmentation, with implications for improving clinical diagnostic outcomes.

Abstract: This study explores the application of deep learning techniques in the automated detection and segmentation of brain tumors from MRI scans. We employ several machine learning models, including basic logistic regression, Convolutional Neural Networks (CNNs), and Residual Networks (ResNet) to classify brain tumors effectively. Additionally, we investigate the use of U-Net for semantic segmentation and EfficientDet for anchor-based object detection to enhance the localization and identification of tumors. Our results demonstrate promising improvements in the accuracy and efficiency of brain tumor diagnostics, underscoring the potential of deep learning in medical imaging and its significance in improving clinical outcomes.

[355] Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?

Yuxiang Lai, Jike Zhong, Ming Li, Yuheng Li, Xiaofeng Yang

Main category: cs.CV

TL;DR: A large vision model (LVM) demonstrates strong zero-shot generalization on medical imaging tasks including organ segmentation, denoising, super-resolution, and motion prediction, without any domain-specific training.

Details

Motivation: To investigate whether autoregressive video modeling principles can be directly applied to medical imaging tasks despite the model never being trained on medical data.

Method: Evaluated a large vision model (LVM) in zero-shot setting across four medical imaging tasks using 4D CT data from 122 patients (over 1,820 3D CT volumes).

Result: The LVM achieved competitive performance on segmentation, denoising, and super-resolution, and surpassed specialized baselines in motion prediction with state-of-the-art spatial accuracy.

Conclusion: General-purpose video models can serve as unified learners and reasoners for medical imaging, revealing emergent zero-shot capabilities and laying groundwork for future medical foundation models.

Abstract: Recent advances in large generative models have shown that simple autoregressive formulations, when scaled appropriately, can exhibit strong zero-shot generalization across domains. Motivated by this trend, we investigate whether autoregressive video modeling principles can be directly applied to medical imaging tasks, despite the model never being trained on medical data. Specifically, we evaluate a large vision model (LVM) in a zero-shot setting across four representative tasks: organ segmentation, denoising, super-resolution, and motion prediction. Remarkably, even without domain-specific fine-tuning, the LVM can delineate anatomical structures in CT scans and achieve competitive performance on segmentation, denoising, and super-resolution. Most notably, in radiotherapy motion prediction, the model forecasts future 3D CT phases directly from prior phases of a 4D CT scan, producing anatomically consistent predictions that capture patient-specific respiratory dynamics with realistic temporal coherence. We evaluate the LVM on 4D CT data from 122 patients, totaling over 1,820 3D CT volumes. Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy. These findings reveal the emergence of zero-shot capabilities in medical video modeling and highlight the potential of general-purpose video models to serve as unified learners and reasoners laying the groundwork for future medical foundation models built on video models.

[356] Opacity-Gradient Driven Density Control for Compact and Efficient Few-Shot 3D Gaussian Splatting

Abdelrhman Elrawy, Emad A. Mohammed

Main category: cs.CV

TL;DR: A framework that improves 3D Gaussian Splatting efficiency in few-shot scenarios by replacing positional gradient densification with opacity gradient triggers and conservative pruning, achieving 40-70% more compact models with minimal quality trade-off.

Details

Motivation: 3D Gaussian Splatting struggles with overfitting and bloated reconstructions in few-shot scenarios, and existing improvements often significantly increase primitive counts rather than optimizing efficiency.

Method: Replaces standard positional gradient heuristic with opacity gradient-based densification trigger, pairs with conservative pruning schedule, and uses depth-correlation loss for geometric guidance.

Result: Achieves 40% more compact models (32k vs 57k primitives) on 3-view LLFF dataset and ~70% reduction on Mip-NeRF 360 dataset with modest trade-off in reconstruction metrics.

Conclusion: Establishes new state-of-the-art on quality-vs-efficiency Pareto frontier for few-shot view synthesis through fundamental optimization improvements to 3DGS core framework.

Abstract: 3D Gaussian Splatting (3DGS) struggles in few-shot scenarios, where its standard adaptive density control (ADC) can lead to overfitting and bloated reconstructions. While state-of-the-art methods like FSGS improve quality, they often do so by significantly increasing the primitive count. This paper presents a framework that revises the core 3DGS optimization to prioritize efficiency. We replace the standard positional gradient heuristic with a novel densification trigger that uses the opacity gradient as a lightweight proxy for rendering error. We find this aggressive densification is only effective when paired with a more conservative pruning schedule, which prevents destructive optimization cycles. Combined with a standard depth-correlation loss for geometric guidance, our framework demonstrates a fundamental improvement in efficiency. On the 3-view LLFF dataset, our model is over 40% more compact (32k vs. 57k primitives) than FSGS, and on the Mip-NeRF 360 dataset, it achieves a reduction of approximately 70%. This dramatic gain in compactness is achieved with a modest trade-off in reconstruction metrics, establishing a new state-of-the-art on the quality-vs-efficiency Pareto frontier for few-shot view synthesis.

[357] VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework

Donglin Huang, Yongyuan Li, Tianhang Liu, Junming Huang, Xiaoda Yang, Chi Wang, Weiwei Xu

Main category: cs.CV

TL;DR: VividAnimator is an end-to-end framework for high-quality half-body human animation driven by audio and sparse hand pose conditions, addressing stiff head movements and blurry hands through three innovations: Hand Clarity Codebook, Dual-Stream Audio-Aware Module, and Pose Calibration Trick.

Details

Motivation: Existing audio- and pose-driven human animation methods suffer from stiff head movements and blurry hands due to weak audio-head correlation and hand structural complexity.

Method: Three key innovations: 1) Pre-trained Hand Clarity Codebook (HCC) for high-fidelity hand texture priors, 2) Dual-Stream Audio-Aware Module (DSAA) for separate modeling of lip sync and head pose dynamics with interaction, 3) Pose Calibration Trick (PCT) for refining pose conditions with relaxed constraints.

Result: Extensive experiments show state-of-the-art performance with superior hand detail, gesture realism, and identity consistency, validated by both quantitative metrics and qualitative evaluations.

Conclusion: VividAnimator effectively addresses key limitations in human animation by combining HCC, DSAA, and PCT to generate high-quality animations with natural head movements and clear hand details.

Abstract: Existing for audio- and pose-driven human animation methods often struggle with stiff head movements and blurry hands, primarily due to the weak correlation between audio and head movements and the structural complexity of hands. To address these issues, we propose VividAnimator, an end-to-end framework for generating high-quality, half-body human animations driven by audio and sparse hand pose conditions. Our framework introduces three key innovations. First, to overcome the instability and high cost of online codebook training, we pre-train a Hand Clarity Codebook (HCC) that encodes rich, high-fidelity hand texture priors, significantly mitigating hand degradation. Second, we design a Dual-Stream Audio-Aware Module (DSAA) to model lip synchronization and natural head pose dynamics separately while enabling interaction. Third, we introduce a Pose Calibration Trick (PCT) that refines and aligns pose conditions by relaxing rigid constraints, ensuring smooth and natural gesture transitions. Extensive experiments demonstrate that Vivid Animator achieves state-of-the-art performance, producing videos with superior hand detail, gesture realism, and identity consistency, validated by both quantitative metrics and qualitative evaluations.

[358] Bridging Perspectives: Foundation Model Guided BEV Maps for 3D Object Detection and Tracking

Markus Käppeler, Özgün Çiçek, Daniele Cattaneo, Claudius Gläser, Yakov Miron, Abhinav Valada

Main category: cs.CV

TL;DR: DualViewDistill is a hybrid 3D object detection and tracking framework that combines perspective-view (PV) and bird’s-eye-view (BEV) features using foundation model guidance and feature distillation to achieve state-of-the-art performance on autonomous driving benchmarks.

Details

Motivation: Current approaches rely exclusively on either PV or BEV features, limiting their ability to leverage both fine-grained object details and spatially structured scene representations, which are complementary for robust perception.

Method: Proposes a hybrid framework that integrates PV features with BEV maps enriched with semantic and geometric features from DINOv2 foundation model, using a novel distillation process and deformable aggregation to combine both representations.

Result: Achieves state-of-the-art performance on nuScenes and Argoverse 2 benchmarks, demonstrating improved 3D object detection and tracking capabilities for autonomous driving.

Conclusion: The framework successfully leverages foundation model BEV maps to enable more reliable perception, showing the potential of combining PV and BEV representations with foundation model features for autonomous driving applications.

Abstract: Camera-based 3D object detection and tracking are essential for perception in autonomous driving. Current state-of-the-art approaches often rely exclusively on either perspective-view (PV) or bird’s-eye-view (BEV) features, limiting their ability to leverage both fine-grained object details and spatially structured scene representations. In this work, we propose DualViewDistill, a hybrid detection and tracking framework that incorporates both PV and BEV camera image features to leverage their complementary strengths. Our approach introduces BEV maps guided by foundation models, leveraging descriptive DINOv2 features that are distilled into BEV representations through a novel distillation process. By integrating PV features with BEV maps enriched with semantic and geometric features from DINOv2, our model leverages this hybrid representation via deformable aggregation to enhance 3D object detection and tracking. Extensive experiments on the nuScenes and Argoverse 2 benchmarks demonstrate that DualViewDistill achieves state-of-the-art performance. The results showcase the potential of foundation model BEV maps to enable more reliable perception for autonomous driving. We make the code and pre-trained models available at https://dualviewdistill.cs.uni-freiburg.de .

[359] SAM2LoRA: Composite Loss-Guided, Parameter-Efficient Finetuning of SAM2 for Retinal Fundus Segmentation

Sayan Mandal, Divyadarshini Karthikeyan, Manas Paldhe

Main category: cs.CV

TL;DR: SAM2LoRA is a parameter-efficient fine-tuning method that adapts SAM2 for fundus image segmentation using low-rank adapters, achieving state-of-the-art performance with less than 5% trainable parameters.

Details

Motivation: Fine-tuning the Segment Anything Model 2 (SAM2) for fundus image segmentation remains challenging despite its rapid inference capabilities in low-resource settings.

Method: Integrates low-rank adapters into both image encoder and mask decoder of SAM2, using a composite loss function combining segmentationBCE, SoftDice, and FocalTversky losses.

Result: Achieves Dice scores of 0.86 for blood vessel and 0.93 for optic disc segmentation, with AUC values up to 0.98 and 0.99 on 11 fundus datasets, demonstrating state-of-the-art performance.

Conclusion: SAM2LoRA enables efficient adaptation of SAM2 for fundus segmentation with substantially reduced training overhead while maintaining high performance.

Abstract: We propose SAM2LoRA, a parameter-efficient fine-tuning strategy that adapts the Segment Anything Model 2 (SAM2) for fundus image segmentation. SAM2 employs a masked autoencoder-pretrained Hierarchical Vision Transformer for multi-scale feature decoding, enabling rapid inference in low-resource settings; however, fine-tuning remains challenging. To address this, SAM2LoRA integrates a low-rank adapter into both the image encoder and mask decoder, requiring fewer than 5% of the original trainable parameters. Our analysis indicates that for cross-dataset fundus segmentation tasks, a composite loss function combining segmentationBCE, SoftDice, and FocalTversky losses is essential for optimal network tuning. Evaluated on 11 challenging fundus segmentation datasets, SAM2LoRA demonstrates high performance in both blood vessel and optic disc segmentation under cross-dataset training conditions. It achieves Dice scores of up to 0.86 and 0.93 for blood vessel and optic disc segmentation, respectively, and AUC values of up to 0.98 and 0.99, achieving state-of-the-art performance while substantially reducing training overhead.

[360] From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries

Joy Hsu, Emily Jin, Jiajun Wu, Niloy J. Mitra

Main category: cs.CV

TL;DR: FactoredScenes is a framework that generates realistic 3D scenes by decomposing rooms into hierarchical concepts and learning object pose variations from real-world data.

Details

Motivation: Real-world scenes are difficult to capture with limited data available, and generating realistic scenes with varied object poses remains challenging.

Method: Uses a factored representation that decomposes scenes into room programs and object poses, learns layout patterns from a function library, uses LLMs for program generation, and employs program-conditioned hierarchical pose prediction with 3D object retrieval.

Result: Generates realistic, real-world rooms that are difficult to distinguish from real ScanNet scenes.

Conclusion: FactoredScenes successfully synthesizes realistic 3D scenes by leveraging structural decomposition and learning object pose variations from lived-in environments.

Abstract: Real-world scenes, such as those in ScanNet, are difficult to capture, with highly limited data available. Generating realistic scenes with varied object poses remains an open and challenging task. In this work, we propose FactoredScenes, a framework that synthesizes realistic 3D scenes by leveraging the underlying structure of rooms while learning the variation of object poses from lived-in scenes. We introduce a factored representation that decomposes scenes into hierarchically organized concepts of room programs and object poses. To encode structure, FactoredScenes learns a library of functions capturing reusable layout patterns from which scenes are drawn, then uses large language models to generate high-level programs, regularized by the learned library. To represent scene variations, FactoredScenes learns a program-conditioned model to hierarchically predict object poses, and retrieves and places 3D objects in a scene. We show that FactoredScenes generates realistic, real-world rooms that are difficult to distinguish from real ScanNet scenes.

Yu-Hsuan Lin

Main category: cs.CV

TL;DR: A multimodal framework combining visual-language reasoning, object detection, and motion analysis achieves 76.7% accuracy in classifying traffic congestion levels from 1-5, outperforming unimodal baselines.

Details

Motivation: Accurate traffic congestion classification is essential for intelligent transportation systems and real-time urban traffic management.

Method: Multimodal framework combining CLIP for visual-language reasoning, YOLO-World for object detection, and MOG2-based background subtraction for motion analysis, with motion-based confidence weighting and annotated visual outputs.

Result: Achieves 76.7% accuracy, F1 score of 0.752, and Quadratic Weighted Kappa of 0.684, significantly outperforming unimodal baselines.

Conclusion: The framework effectively preserves ordinal structure and leverages visual-language and motion modalities, with future enhancements including vehicle sizing and refined density metrics.

Abstract: Accurate traffic congestion classification is essential for intelligent transportation systems and real-time urban traffic management. This paper presents a multimodal framework combining open-vocabulary visual-language reasoning (CLIP), object detection (YOLO-World), and motion analysis via MOG2-based background subtraction. The system predicts congestion levels on an ordinal scale from 1 (free flow) to 5 (severe congestion), enabling semantically aligned and temporally consistent classification. To enhance interpretability, we incorporate motion-based confidence weighting and generate annotated visual outputs. Experimental results show the model achieves 76.7 percent accuracy, an F1 score of 0.752, and a Quadratic Weighted Kappa (QWK) of 0.684, significantly outperforming unimodal baselines. These results demonstrate the framework’s effectiveness in preserving ordinal structure and leveraging visual-language and motion modalities. Future enhancements include incorporating vehicle sizing and refined density metrics.

[362] Ortho-Fuse: Orthomosaic Generation for Sparse High-Resolution Crop Health Maps Through Intermediate Optical Flow Estimation

Rugved Katole, Christopher Stewart

Main category: cs.CV

TL;DR: Ortho-Fuse is an optical flow-based framework that enables reliable orthomosaic generation from sparse aerial imagery with reduced overlap requirements, achieving 20% lower minimum overlap needs.

Details

Motivation: AI-driven crop health mapping systems face adoption barriers due to technical limitations in orthomosaic generation from sparse aerial imagery, as traditional methods require 70-80% overlap which is difficult to achieve in resource-constrained conditions.

Method: The approach uses intermediate optical flow estimation to synthesize transitional imagery between consecutive aerial frames, artificially augmenting feature correspondences for improved geometric reconstruction.

Result: Experimental validation demonstrates a 20% reduction in minimum overlap requirements for orthomosaic generation.

Conclusion: The framework addresses adoption barriers in precision agriculture and provides pathways for enhanced integration of AI-driven monitoring systems by enabling reliable operation with reduced data requirements.

Abstract: AI-driven crop health mapping systems offer substantial advantages over conventional monitoring approaches through accelerated data acquisition and cost reduction. However, widespread farmer adoption remains constrained by technical limitations in orthomosaic generation from sparse aerial imagery datasets. Traditional photogrammetric reconstruction requires 70-80% inter-image overlap to establish sufficient feature correspondences for accurate geometric registration. AI-driven systems operating under resource-constrained conditions cannot consistently achieve these overlap thresholds, resulting in degraded reconstruction quality that undermines user confidence in autonomous monitoring technologies. In this paper, we present Ortho-Fuse, an optical flow-based framework that enables the generation of a reliable orthomosaic with reduced overlap requirements. Our approach employs intermediate flow estimation to synthesize transitional imagery between consecutive aerial frames, artificially augmenting feature correspondences for improved geometric reconstruction. Experimental validation demonstrates a 20% reduction in minimum overlap requirements. We further analyze adoption barriers in precision agriculture to identify pathways for enhanced integration of AI-driven monitoring systems.

[363] PointMAC: Meta-Learned Adaptation for Robust Test-Time Point Cloud Completion

Linlian Jiang, Rui Ma, Li Gu, Ziqiang Wang, Xinxin Zuo, Yang Wang

Main category: cs.CV

TL;DR: PointMAC is a meta-learned framework for test-time adaptation in point cloud completion that enables sample-specific refinement without additional supervision through self-supervised auxiliary objectives and meta-auxiliary learning.

Details

Motivation: Existing point cloud completion models perform static inference and rely heavily on inductive biases learned during training, limiting their ability to adapt to novel structural patterns and sensor-induced distortions at test time.

Method: The method optimizes the completion model under two self-supervised auxiliary objectives simulating structural and sensor-level incompleteness, using a meta-auxiliary learning strategy based on MAML. During inference, only the shared encoder is adapted on-the-fly while keeping the decoder fixed, with Adaptive λ-Calibration to balance gradients between primary and auxiliary objectives.

Result: Extensive experiments on synthetic, simulated, and real-world datasets demonstrate that PointMAC achieves state-of-the-art results by refining each sample individually to produce high-quality completions.

Conclusion: This is the first work to apply meta-auxiliary test-time adaptation to point cloud completion, enabling robust adaptation to novel patterns and distortions without requiring additional supervision.

Abstract: Point cloud completion is essential for robust 3D perception in safety-critical applications such as robotics and augmented reality. However, existing models perform static inference and rely heavily on inductive biases learned during training, limiting their ability to adapt to novel structural patterns and sensor-induced distortions at test time. To address this limitation, we propose PointMAC, a meta-learned framework for robust test-time adaptation in point cloud completion. It enables sample-specific refinement without requiring additional supervision. Our method optimizes the completion model under two self-supervised auxiliary objectives that simulate structural and sensor-level incompleteness. A meta-auxiliary learning strategy based on Model-Agnostic Meta-Learning (MAML) ensures that adaptation driven by auxiliary objectives is consistently aligned with the primary completion task. During inference, we adapt the shared encoder on-the-fly by optimizing auxiliary losses, with the decoder kept fixed. To further stabilize adaptation, we introduce Adaptive $\lambda$-Calibration, a meta-learned mechanism for balancing gradients between primary and auxiliary objectives. Extensive experiments on synthetic, simulated, and real-world datasets demonstrate that PointMAC achieves state-of-the-art results by refining each sample individually to produce high-quality completions. To the best of our knowledge, this is the first work to apply meta-auxiliary test-time adaptation to point cloud completion.

[364] Vision4PPG: Emergent PPG Analysis Capability of Vision Foundation Models for Vital Signs like Blood Pressure

Saurabh Kataria, Ayca Ermis, Lovely Yeswanth Panchumarthi, Minxiao Wang, Xiao Hu

Main category: cs.CV

TL;DR: Vision Foundation Models (VFMs) can achieve state-of-the-art performance on PPG signal analysis by converting 1D signals to 2D image representations like STFT, outperforming specialized time-series models.

Details

Motivation: To explore whether Vision Foundation Models can be effectively applied to photoplethysmography (PPG) signal analysis, potentially providing better performance than specialized time-series foundation models.

Method: Transform 1D PPG signals into 2D image-like representations (STFT, recurrence plots), then fine-tune Vision Foundation Models (DINOv3, SIGLIP-2) using Parameter-Efficient Fine-Tuning techniques.

Result: Achieved state-of-the-art performance on blood pressure estimation and promising results on other vital signs and blood lab measurements, outperforming time-series foundation models.

Conclusion: Vision Foundation Models provide a new powerful class of tools for PPG analysis that are computationally efficient and generalize well across different 2D input representations.

Abstract: Photoplethysmography (PPG) sensor in wearable and clinical devices provides valuable physiological insights in a non-invasive and real-time fashion. Specialized Foundation Models (FM) or repurposed time-series FMs are used to benchmark physiological tasks. Our experiments with fine-tuning FMs reveal that Vision FM (VFM) can also be utilized for this purpose and, in fact, surprisingly leads to state-of-the-art (SOTA) performance on many tasks, notably blood pressure estimation. We leverage VFMs by simply transforming one-dimensional PPG signals into image-like two-dimensional representations, such as the Short-Time Fourier transform (STFT). Using the latest VFMs, such as DINOv3 and SIGLIP-2, we achieve promising performance on other vital signs and blood lab measurement tasks as well. Our proposal, Vision4PPG, unlocks a new class of FMs to achieve SOTA performance with notable generalization to other 2D input representations, including STFT phase and recurrence plots. Our work improves upon prior investigations of vision models for PPG by conducting a comprehensive study, comparing them to state-of-the-art time-series FMs, and demonstrating the general PPG processing ability by reporting results on six additional tasks. Thus, we provide clinician-scientists with a new set of powerful tools that is also computationally efficient, thanks to Parameter-Efficient Fine-Tuning (PEFT) techniques.

[365] Self-Supervised Multi-Scale Transformer with Attention-Guided Fusion for Efficient Crack Detection

Blessing Agyei Kyem, Joshua Kofi Asamoah, Eugene Denteh, Andrews Danyo, Armstrong Aboah

Main category: cs.CV

TL;DR: Crack-Segmenter is a fully self-supervised framework for pixel-level crack segmentation that eliminates the need for manual annotations, achieving superior performance over supervised methods across multiple datasets.

Details

Motivation: To overcome the limitations of costly and time-intensive pixel-level annotations in pavement crack detection, enabling scalable and cost-effective infrastructure monitoring.

Method: Developed Crack-Segmenter with three modules: Scale-Adaptive Embedder for multi-scale feature extraction, Directional Attention Transformer for maintaining crack continuity, and Attention-Guided Fusion for adaptive feature integration.

Result: Outperformed 13 state-of-the-art supervised methods on ten public datasets across all major metrics including mIoU, Dice score, XOR, and Hausdorff Distance.

Conclusion: Annotation-free crack detection is not only feasible but superior to supervised approaches, enabling scalable infrastructure monitoring and advancing self-supervised learning in pavement crack detection.

Abstract: Pavement crack detection has long depended on costly and time-intensive pixel-level annotations, which limit its scalability for large-scale infrastructure monitoring. To overcome this barrier, this paper examines the feasibility of achieving effective pixel-level crack segmentation entirely without manual annotations. Building on this objective, a fully self-supervised framework, Crack-Segmenter, is developed, integrating three complementary modules: the Scale-Adaptive Embedder (SAE) for robust multi-scale feature extraction, the Directional Attention Transformer (DAT) for maintaining linear crack continuity, and the Attention-Guided Fusion (AGF) module for adaptive feature integration. Through evaluations on ten public datasets, Crack-Segmenter consistently outperforms 13 state-of-the-art supervised methods across all major metrics, including mean Intersection over Union (mIoU), Dice score, XOR, and Hausdorff Distance (HD). These findings demonstrate that annotation-free crack detection is not only feasible but also superior, enabling transportation agencies and infrastructure managers to conduct scalable and cost-effective monitoring. This work advances self-supervised learning and motivates pavement cracks detection research.

[366] Identifying bias in CNN image classification using image scrambling and transforms

Sai Teja Erukude

Main category: cs.CV

TL;DR: The paper addresses the “black box” problem in CNNs by proposing methods to identify hidden biases and distinguish between contextual information and background noise in image classification.

Details

Motivation: CNNs operate as black boxes, making it difficult to understand their decision-making process and detect biases from background information that may influence classification results.

Method: Two approaches: 1) Dividing images into smaller non-overlapping tiles and shuffling them randomly, 2) Applying image transforms (Fourier, Wavelet transforms, Median filter) and their combinations to recover background noise information used by CNNs.

Result: The methods were tested on six different datasets (natural, synthetic, hybrid) and effectively distinguished between contextual information and background noise, detecting background noise presence without requiring background information.

Conclusion: The proposed techniques successfully identify hidden biases in CNNs and can distinguish between meaningful contextual learning and problematic background noise, addressing the black box problem in CNN-based image classification.

Abstract: CNNs are now prevalent as the primary choice for most machine vision problems due to their superior rate of classification and the availability of user-friendly libraries. These networks effortlessly identify and select features in a non-intuitive data-driven manner, making it difficult to determine which features were most influential. That leads to a ``black box", where users cannot know how the image data are analyzed but rely on empirical results. Therefore the decision-making process can be biased by background information that is difficult to detect. Here we discuss examples of such hidden biases and propose techniques for identifying them, methods to distinguish between contextual information and background noise, and explore whether CNNs learn from irrelevant features. One effective approach to identify dataset bias is to classify blank background parts of the images. However, in some situations a blank background in the images is not available, making it more difficult to separate the foreground information from the blank background. Such parts of the image can also be considered contextual learning, not necessarily bias. To overcome this, we propose two approaches that were tested on six different datasets, including natural, synthetic, and hybrid datasets. The first method involves dividing images into smaller, non-overlapping tiles of various sizes, which are then shuffled randomly, making classification more challenging. The second method involves the application of several image transforms, including Fourier, Wavelet transforms, and Median filter, and their combinations. These transforms help recover background noise information used by CNN to classify images. Results indicate that this method can effectively distinguish between contextual information and background noise, and alert on the presence of background noise even without the need to use background information.

[367] AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao, Yang Shi, Bozhou Li, Yuanxing Zhang, Qiang Liu, Pengfei Wan, Liang Wang, Tieniu Tan

Main category: cs.CV

TL;DR: AVoCaDO is an audiovisual video captioner that uses temporal orchestration between audio and visual modalities through a two-stage post-training pipeline, achieving state-of-the-art performance on multiple benchmarks.

Details

Motivation: To generate semantically rich video descriptions with temporal alignment between visual and auditory events, benefiting both video understanding and generation tasks.

Method: Two-stage post-training pipeline: (1) AVoCaDO SFT - fine-tuning on 107K high-quality temporally-aligned audiovisual captions, (2) AVoCaDO GRPO - using tailored reward functions to enhance temporal coherence and dialogue accuracy while regularizing caption length.

Result: Significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and achieves competitive performance on VDC and DREAM-1K benchmarks under visual-only settings.

Conclusion: AVoCaDO demonstrates the effectiveness of temporal orchestration between audio and visual modalities for audiovisual video captioning, establishing new state-of-the-art performance.

Abstract: Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. In this paper, we present AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. We propose a two-stage post-training pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated dataset of 107K high-quality, temporally-aligned audiovisual captions; and (2) AVoCaDO GRPO, which leverages tailored reward functions to further enhance temporal coherence and dialogue accuracy while regularizing caption length and reducing collapse. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance on the VDC and DREAM-1K benchmark under visual-only settings.

Zhao-Yang Wang, Jieneng Chen, Jiang Liu, Yuxiang Guo, Rama Chellappa

Main category: cs.CV

TL;DR: Mesh-Gait is a novel gait recognition framework that reconstructs 3D heatmaps from 2D silhouettes to combine 2D and 3D modalities efficiently, achieving state-of-the-art performance while maintaining computational efficiency.

Details

Motivation: Existing gait recognition methods using 2D representations struggle with viewpoint variations, occlusions, and noise, while multi-modal 3D approaches are computationally expensive and impractical for real-time applications.

Method: Mesh-Gait reconstructs 3D heatmaps as intermediate representations from 2D silhouettes, enabling effective capture of 3D geometric information. It uses supervised learning with loss calculated between reconstructed 3D joints, virtual markers, meshes and their ground truth for precise spatial alignment.

Result: Extensive experiments demonstrate that Mesh-Gait achieves state-of-the-art accuracy in gait recognition while maintaining computational efficiency.

Conclusion: Mesh-Gait effectively combines 2D silhouette and 3D geometric information through intermediate 3D heatmap reconstruction, providing robust gait recognition with computational efficiency suitable for real-time applications.

Abstract: Gait recognition, a fundamental biometric technology, leverages unique walking patterns for individual identification, typically using 2D representations such as silhouettes or skeletons. However, these methods often struggle with viewpoint variations, occlusions, and noise. Multi-modal approaches that incorporate 3D body shape information offer improved robustness but are computationally expensive, limiting their feasibility for real-time applications. To address these challenges, we introduce Mesh-Gait, a novel end-to-end multi-modal gait recognition framework that directly reconstructs 3D representations from 2D silhouettes, effectively combining the strengths of both modalities. Compared to existing methods, directly learning 3D features from 3D joints or meshes is complex and difficult to fuse with silhouette-based gait features. To overcome this, Mesh-Gait reconstructs 3D heatmaps as an intermediate representation, enabling the model to effectively capture 3D geometric information while maintaining simplicity and computational efficiency. During training, the intermediate 3D heatmaps are gradually reconstructed and become increasingly accurate under supervised learning, where the loss is calculated between the reconstructed 3D joints, virtual markers, and 3D meshes and their corresponding ground truth, ensuring precise spatial alignment and consistent 3D structure. Mesh-Gait extracts discriminative features from both silhouettes and reconstructed 3D heatmaps in a computationally efficient manner. This design enables the model to capture spatial and structural gait characteristics while avoiding the heavy overhead of direct 3D reconstruction from RGB videos, allowing the network to focus on motion dynamics rather than irrelevant visual details. Extensive experiments demonstrate that Mesh-Gait achieves state-of-the-art accuracy. The code will be released upon acceptance of the paper.

Zhao-Yang Wang, Zhimin Shao, Jieneng Chen, Rama Chellappa

Main category: cs.CV

TL;DR: A multi-modal framework combining 2D silhouettes and 3D SMPL features for gait recognition and human attribute estimation using unified transformer architecture.

Details

Motivation: Single modality approaches fail to capture full geometric and dynamic complexity of human walking patterns, especially in challenging conditions like long-range distances and extreme angles.

Method: Multi-modal and multi-task framework combining 2D temporal silhouettes with 3D SMPL features, using unified transformer for feature fusion and joint gait recognition with human attribute estimation (age, BMI, gender).

Result: Outperforms state-of-the-art methods on large-scale BRIAR datasets under challenging conditions (1km distance, 50° pitch angles), providing accurate gait recognition and human attribute estimation.

Conclusion: Multi-modal and multitask learning effectively advances gait-based human understanding in real-world scenarios by capturing comprehensive walking patterns and additional human attributes.

Abstract: Gait recognition is an important biometric for human identification at a distance, particularly under low-resolution or unconstrained environments. Current works typically focus on either 2D representations (e.g., silhouettes and skeletons) or 3D representations (e.g., meshes and SMPLs), but relying on a single modality often fails to capture the full geometric and dynamic complexity of human walking patterns. In this paper, we propose a multi-modal and multi-task framework that combines 2D temporal silhouettes with 3D SMPL features for robust gait analysis. Beyond identification, we introduce a multitask learning strategy that jointly performs gait recognition and human attribute estimation, including age, body mass index (BMI), and gender. A unified transformer is employed to effectively fuse multi-modal gait features and better learn attribute-related representations, while preserving discriminative identity cues. Extensive experiments on the large-scale BRIAR datasets, collected under challenging conditions such as long-range distances (up to 1 km) and extreme pitch angles (up to 50{\deg}), demonstrate that our approach outperforms state-of-the-art methods in gait recognition and provides accurate human attribute estimation. These results highlight the promise of multi-modal and multitask learning for advancing gait-based human understanding in real-world scenarios.

[370] Towards Cybersickness Severity Classification from VR Gameplay Videos Using Transfer Learning and Temporal Modeling

Jyotirmay Nag Setu, Kevin Desai, John Quarles

Main category: cs.CV

TL;DR: This paper proposes a video-based deep learning approach using InceptionV3 and LSTM networks to predict cybersickness severity in VR environments, achieving 68.4% classification accuracy.

Details

Motivation: Cybersickness remains a major barrier to VR adoption, and while multimodal approaches using sensor data exist, there's limited research on video-based features for predicting cybersickness.

Method: Transfer learning with InceptionV3 model pretrained on ImageNet to extract visual features from VR gameplay videos, followed by LSTM network to capture temporal dynamics and predict cybersickness severity.

Result: Achieved 68.4% classification accuracy for cybersickness severity, surpassing existing models trained solely on video data.

Conclusion: The approach provides a practical tool for VR developers to evaluate and mitigate cybersickness, and lays foundation for future video-based temporal modeling research to enhance user comfort in VR applications.

Abstract: With the rapid advancement of virtual reality (VR) technology, its adoption across domains such as healthcare, education, and entertainment has grown significantly. However, the persistent issue of cybersickness, marked by symptoms resembling motion sickness, continues to hinder widespread acceptance of VR. While recent research has explored multimodal deep learning approaches leveraging data from integrated VR sensors like eye and head tracking, there remains limited investigation into the use of video-based features for predicting cybersickness. In this study, we address this gap by utilizing transfer learning to extract high-level visual features from VR gameplay videos using the InceptionV3 model pretrained on the ImageNet dataset. These features are then passed to a Long Short-Term Memory (LSTM) network to capture the temporal dynamics of the VR experience and predict cybersickness severity over time. Our approach effectively leverages the time-series nature of video data, achieving a 68.4% classification accuracy for cybersickness severity. This surpasses the performance of existing models trained solely on video data, providing a practical tool for VR developers to evaluate and mitigate cybersickness in virtual environments. Furthermore, this work lays the foundation for future research on video-based temporal modeling for enhancing user comfort in VR applications.

[371] Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs

Suyang Xi, Chenxi Yang, Hong Ding, Yiqing Ni, Catherine C. Liu, Yunhao Liu, Chengqi Zhang

Main category: cs.CV

TL;DR: HuLiRAG introduces a human-like retrieval-augmented generation framework that stages multimodal reasoning as a “what-where-reweight” cascade to improve fine-grained visual question answering by reducing hallucinations and enhancing grounding fidelity.

Details

Motivation: MLLMs often fail in fine-grained visual QA due to hallucinations about object identities, positions, and relations, as textual queries are not explicitly anchored to visual referents. Current RAG approaches lack local detail and limit reasoning about fine-grained interactions.

Method: HuLiRAG uses a three-stage cascade: 1) “what” - anchor queries to candidate referents via open-vocabulary detection, 2) “where” - spatially resolve with SAM-derived masks for fine-grained precision, 3) “reweight” - adaptively prioritize through local-global alignment trade-off. Mask-guided fine-tuning injects spatial evidence into generation.

Result: Extensive experiments show the human-like cascade improves grounding fidelity and factual consistency while reducing hallucinations in multimodal question answering.

Conclusion: HuLiRAG advances multimodal question answering toward trustworthy reasoning by transforming grounding from a passive bias into an explicit constraint on answer formulation through human-like processing stages.

Abstract: Multimodal large language models (MLLMs) often fail in fine-grained visual question answering, producing hallucinations about object identities, positions, and relations because textual queries are not explicitly anchored to visual referents. Retrieval-augmented generation (RAG) alleviates some errors, but it fails to align with human-like processing at both the retrieval and augmentation levels. Specifically, it focuses only on global-level image information but lacks local detail and limits reasoning about fine-grained interactions. To overcome this limitation, we present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a ``what–where–reweight’’ cascade. Queries are first anchored to candidate referents via open-vocabulary detection (what), then spatially resolved with SAM-derived masks to recover fine-grained precision (where), and adaptively prioritized through the trade-off between local and global alignment (reweight). Mask-guided fine-tuning further injects spatial evidence into the generation process, transforming grounding from a passive bias into an explicit constraint on answer formulation. Extensive experiments demonstrate that this human-like cascade improves grounding fidelity and factual consistency while reducing hallucinations, advancing multimodal question answering toward trustworthy reasoning.

[372] MonoSE(3)-Diffusion: A Monocular SE(3) Diffusion Framework for Robust Camera-to-Robot Pose Estimation

Kangjian Zhu, Haobo Jiang, Yigong Zhang, Jianjun Qian, Jian Yang, Jin Xie

Main category: cs.CV

TL;DR: MonoSE(3)-Diffusion is a monocular SE(3) diffusion framework that formulates markerless robot pose estimation as a conditional denoising diffusion process, achieving state-of-the-art performance on benchmarks.

Details

Motivation: To improve markerless, image-based robot pose estimation by addressing limitations of current methods that use fixed-scale perturbations, which lack diversity and may generate poses outside camera field of view.

Method: Uses a two-process framework: (1) visibility-constrained diffusion process for diverse pose augmentation that ensures transformations remain within camera field of view, and (2) timestep-aware reverse process for progressive pose refinement using a coarse-to-fine procedure.

Result: Achieved significant improvements on DREAM and RoboKeyGen benchmarks, with 66.75 AUC on the most challenging dataset - a 32.3% gain over state-of-the-art methods.

Conclusion: The proposed diffusion-based approach demonstrates superior generalization capability and robustness through its visibility-constrained pose augmentation and timestep-aware refinement scheme.

Abstract: We propose MonoSE(3)-Diffusion, a monocular SE(3) diffusion framework that formulates markerless, image-based robot pose estimation as a conditional denoising diffusion process. The framework consists of two processes: a visibility-constrained diffusion process for diverse pose augmentation and a timestep-aware reverse process for progressive pose refinement. The diffusion process progressively perturbs ground-truth poses to noisy transformations for training a pose denoising network. Importantly, we integrate visibility constraints into the process, ensuring the transformations remain within the camera field of view. Compared to the fixed-scale perturbations used in current methods, the diffusion process generates in-view and diverse training poses, thereby improving the network generalization capability. Furthermore, the reverse process iteratively predicts the poses by the denoising network and refines pose estimates by sampling from the diffusion posterior of current timestep, following a scheduled coarse-to-fine procedure. Moreover, the timestep indicates the transformation scales, which guide the denoising network to achieve more accurate pose predictions. The reverse process demonstrates higher robustness than direct prediction, benefiting from its timestep-aware refinement scheme. Our approach demonstrates improvements across two benchmarks (DREAM and RoboKeyGen), achieving a notable AUC of 66.75 on the most challenging dataset, representing a 32.3% gain over the state-of-the-art.

[373] On the Problem of Consistent Anomalies in Zero-Shot Industrial Anomaly Detection

Tai Le-Gia, Ahn Jaehyun

Main category: cs.CV

TL;DR: CoDeGraph is a novel zero-shot anomaly detection method that identifies and filters consistent anomalies (recurring defects) by detecting “neighbor-burnout” patterns in patch similarity graphs, achieving state-of-the-art performance on MVTec AD.

Details

Motivation: Existing representation-based methods struggle with consistent anomalies - similar defects recurring across multiple images - leading to poor performance in zero-shot anomaly classification and segmentation.

Method: CoDeGraph constructs an image-level graph where images are nodes and edges connect those with shared consistent-anomaly patterns. It identifies “neighbor-burnout” phenomenon where consistent-anomaly patches show abrupt similarity spikes after exhausting limited similar matches, and uses community detection to filter these anomalies.

Result: Achieved 98.3% AUROC for anomaly classification and 66.8% F1 (+4.2%) and 68.1% AP (+5.4%) for segmentation on MVTec AD with ViT-L-14-336. With DINOv2, segmentation improved to 69.1% F1 (+6.5%) and 71.9% AP (+9.2%) over state-of-the-art zero-shot methods.

Conclusion: CoDeGraph effectively addresses the consistent anomaly problem in zero-shot detection through graph-based filtering of neighbor-burnout patterns, demonstrating robustness across different backbone architectures and significant performance improvements.

Abstract: Zero-shot image anomaly classification (AC) and segmentation (AS) are vital for industrial quality control, detecting defects without prior training data. Existing representation-based methods compare patch features with nearest neighbors in unlabeled test images but struggle with consistent anomalies – similar defects recurring across multiple images – resulting in poor AC/AS performance. We introduce Consistent-Anomaly Detection Graph (CoDeGraph), a novel algorithm that identifies and filters consistent anomalies from similarity computations. Our key insight is that normal patches in industrial images show stable, gradually increasing similarity to other test images, while consistent-anomaly patches exhibit abrupt similarity spikes after exhausting a limited set of similar matches, a phenomenon we term ``neighbor-burnout.’’ CoDeGraph constructs an image-level graph, with images as nodes and edges connecting those with shared consistent-anomaly patterns, using community detection to filter these anomalies. We provide a theoretical foundation using Extreme Value Theory to explain the effectiveness of our approach. Experiments on MVTec AD with the ViT-L-14-336 backbone achieve 98.3% AUROC for AC and AS performance of 66.8% (+4.2%) F1 and 68.1% (+5.4%) AP over state-of-the-art zero-shot methods. Using the DINOv2 backbone further improves segmentation, yielding 69.1% (+6.5%) F1 and 71.9% (+9.2%) AP, demonstrating robustness across architectures.

[374] Learning from Disagreement: A Group Decision Simulation Framework for Robust Medical Image Segmentation

Chen Zhong, Yuxuan Yang, Xinyue Zhang, Ruohan Ma, Yong Guo, Gang Li, Jupeng Li

Main category: cs.CV

TL;DR: A new framework for medical image segmentation that treats expert disagreements as valuable signals rather than noise, simulating clinical panel decision-making to achieve state-of-the-art results.

Details

Motivation: Standard approaches that average expert labels discard valuable clinical uncertainty revealed in inter-rater variability, which stems from differences in annotator expertise and medical image blurriness.

Method: Group decision simulation framework with Expert Signature Generator (ESG) to learn individual annotator styles in latent space, and Simulated Consultation Module (SCM) to intelligently generate final segmentation by sampling from this space.

Result: Achieved state-of-the-art results on challenging CBCT and MRI datasets with 92.11% and 90.72% Dice scores respectively.

Conclusion: By treating expert disagreement as useful signal instead of noise, this approach provides a clear path toward more robust and trustworthy AI systems for healthcare.

Abstract: Medical image segmentation annotation suffers from inter-rater variability (IRV) due to differences in annotators’ expertise and the inherent blurriness of medical images. Standard approaches that simply average expert labels are flawed, as they discard the valuable clinical uncertainty revealed in disagreements. We introduce a fundamentally new approach with our group decision simulation framework, which works by mimicking the collaborative decision-making process of a clinical panel. Under this framework, an Expert Signature Generator (ESG) learns to represent individual annotator styles in a unique latent space. A Simulated Consultation Module (SCM) then intelligently generates the final segmentation by sampling from this space. This method achieved state-of-the-art results on challenging CBCT and MRI datasets (92.11% and 90.72% Dice scores). By treating expert disagreement as a useful signal instead of noise, our work provides a clear path toward more robust and trustworthy AI systems for healthcare.

[375] Post-TIPS Prediction via Multimodal Interaction: A Multi-Center Dataset and Framework for Survival, Complication, and Portal Pressure Assessment

Junhao Dong, Dejia Liu, Ruiqi Ding, Zongxing Chen, Yingjie Huang, Zhu Meng, Jianbo Zhao, Zhicheng Zhao, Fei Su

Main category: cs.CV

TL;DR: MultiTIPS is a novel multimodal framework for TIPS prognosis that addresses challenges in ROI annotation, unimodal reliability, and single-endpoint prediction through dual-option segmentation, multimodal interaction, and multi-task prediction.

Details

Motivation: TIPS procedures have variable survival outcomes and frequent overt hepatic encephalopathy, requiring accurate preoperative prognostic modeling. Current methods face challenges with labor-intensive ROI annotation, poor reliability of unimodal approaches, and incomplete assessment from single-endpoint prediction.

Method: Three core modules: (1) dual-option segmentation using semi-supervised and foundation model-based pipelines for robust ROI segmentation; (2) multimodal interaction with MGRA, POD, and CGPE techniques for cross-modal feature integration; (3) multi-task prediction with staged training for survival, PPG, and OHE prediction.

Result: Extensive experiments on MultiTIPS dataset demonstrate superiority over state-of-the-art approaches, with strong cross-domain generalization and interpretability.

Conclusion: The proposed framework shows promise for clinical application, providing comprehensive prognostic assessment for TIPS procedures with improved accuracy and robustness.

Abstract: Transjugular intrahepatic portosystemic shunt (TIPS) is an established procedure for portal hypertension, but provides variable survival outcomes and frequent overt hepatic encephalopathy (OHE), indicating the necessity of accurate preoperative prognostic modeling. Current studies typically build machine learning models from preoperative CT images or clinical characteristics, but face three key challenges: (1) labor-intensive region-of-interest (ROI) annotation, (2) poor reliability and generalizability of unimodal methods, and (3) incomplete assessment from single-endpoint prediction. Moreover, the lack of publicly accessible datasets constrains research in this field. Therefore, we present MultiTIPS, the first public multi-center dataset for TIPS prognosis, and propose a novel multimodal prognostic framework based on it. The framework comprises three core modules: (1) dual-option segmentation, which integrates semi-supervised and foundation model-based pipelines to achieve robust ROI segmentation with limited annotations and facilitate subsequent feature extraction; (2) multimodal interaction, where three techniques, multi-grained radiomics attention (MGRA), progressive orthogonal disentanglement (POD), and clinically guided prognostic enhancement (CGPE), are introduced to enable cross-modal feature interaction and complementary representation integration, thus improving model accuracy and robustness; and (3) multi-task prediction, where a staged training strategy is used to perform stable optimization of survival, portal pressure gradient (PPG), and OHE prediction for comprehensive prognostic assessment. Extensive experiments on MultiTIPS demonstrate the superiority of the proposed method over state-of-the-art approaches, along with strong cross-domain generalization and interpretability, indicating its promise for clinical application. The dataset and code are available.

Jinjin Cao, Zhiyang Chen, Zijun Wang, Liyuan Ma, Weijian Luo, Guojun Qi

Main category: cs.CV

TL;DR: Cross-Modal Guidance (CMG) is a training-free decoding method that reduces hallucinations in Vision-Language Models by leveraging differences between original and degraded visual-language attention distributions.

Details

Motivation: Existing VLMs suffer from severe hallucinations, generating fluent but image-irrelevant responses due to language bias.

Method: Adaptively mask attention weights of influential image tokens in selected transformer layers to corrupt visual-language perception, then use degradation-induced decoding to emphasize visual contexts.

Result: CMG significantly reduces language bias without harming VLM capabilities, improving performance on hallucination benchmarks without additional training costs.

Conclusion: CMG effectively addresses VLM hallucinations through cross-modal guidance, demonstrating superior advantages with no training requirements and good generalization across different VLMs.

Abstract: Vision-Language Models (VLMs) have shown solid ability for multimodal understanding of both visual and language contexts. However, existing VLMs often face severe challenges of hallucinations, meaning that VLMs tend to generate responses that are only fluent in the language but irrelevant to images in previous contexts. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance(CMG), a training-free decoding method that addresses the hallucinations by leveraging the difference between the output distributions of the original model and the one with degraded visual-language attention. In practice, we adaptively mask the attention weight of the most influential image tokens in selected transformer layers to corrupt the visual-language perception as a concrete type of degradation. Such a degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the ability of VLMs. In experiment sections, we conduct comprehensive studies. All results demonstrate the superior advantages of CMG with neither additional conditions nor training costs. We also quantitatively show CMG can improve different VLM’s performance on hallucination-specific benchmarks and generalize effectively.

[377] DAGLFNet:Deep Attention-Guided Global-Local Feature Fusion for Pseudo-Image Point Cloud Segmentation

Chuang Chen, Wenyi Ge

Main category: cs.CV

TL;DR: DAGLFNet is a pseudo-image-based semantic segmentation framework for LiDAR point clouds that enhances feature discriminability through global-local fusion, multi-branch feature extraction, and deep feature-guided attention mechanisms.

Details

Motivation: Existing pseudo-image-based methods for LiDAR point cloud processing often overlook structural and semantic details, resulting in limited feature fusion and discriminability, which hinders performance in environmental perception for autonomous systems.

Method: Three key components: 1) Global-Local Feature Fusion Encoding module to enhance local feature correlation and capture global context, 2) Multi-Branch Feature Extraction network to capture neighborhood information and enhance contour features, 3) Feature Fusion via Deep Feature-guided Attention mechanism for precise cross-channel feature fusion.

Result: Achieves 69.83% on SemanticKITTI validation set and 78.65% on nuScenes validation set, demonstrating high performance while maintaining real-time capability.

Conclusion: DAGLFNet effectively balances high performance with real-time processing, showing great potential for LiDAR-based real-time applications in autonomous navigation and environmental perception.

Abstract: Environmental perception systems play a critical role in high-precision mapping and autonomous navigation, with LiDAR serving as a core sensor that provides accurate 3D point cloud data. How to efficiently process unstructured point clouds while extracting structured semantic information remains a significant challenge, and in recent years, numerous pseudo-image-based representation methods have emerged to achieve a balance between efficiency and performance. However, they often overlook the structural and semantic details of point clouds, resulting in limited feature fusion and discriminability. In this work, we propose DAGLFNet, a pseudo-image-based semantic segmentation framework designed to extract discriminative features. First, the Global-Local Feature Fusion Encoding module is used to enhance the correlation among local features within a set and capture global contextual information. Second, the Multi-Branch Feature Extraction network is employed to capture more neighborhood information and enhance the discriminability of contour features. Finally, a Feature Fusion via Deep Feature-guided Attention mechanism is introduced to improve the precision of cross-channel feature fusion. Experimental evaluations show that DAGLFNet achieves 69.83% and 78.65% on the validation sets of SemanticKITTI and nuScenes, respectively. The method balances high performance with real-time capability, demonstrating great potential for LiDAR-based real-time applications.

[378] MSF-Mamba: Motion-aware State Fusion Mamba for Efficient Micro-Gesture Recognition

Deng Li, Jun Shao, Bohao Xing, Rong Gao, Bihan Wen, Heikki Kälviäinen, Xin Liu

Main category: cs.CV

TL;DR: Proposes MSF-Mamba, a motion-aware state fusion Mamba model for micro-gesture recognition that enhances vanilla Mamba with local spatiotemporal modeling and motion awareness to address limitations in capturing subtle motion cues.

Details

Motivation: Micro-gesture recognition requires modeling both long-range and local spatiotemporal dependencies. CNNs struggle with long-range dependencies, Transformers have high computational costs, and vanilla Mamba lacks local modeling capabilities and motion awareness.

Method: Introduces motion-aware state fusion Mamba (MSF-Mamba) that fuses local contextual neighboring states using central frame difference (CFD) for motion awareness. Also proposes MSF-Mamba+ with multiscale motion-aware state fusion and adaptive scale weighting module.

Result: Experiments on two public MGR datasets show that even the lightweight MSF-Mamba achieves state-of-the-art performance, outperforming CNN-, Transformer-, and SSM-based models while maintaining high efficiency.

Conclusion: The proposed MSF-Mamba framework effectively addresses vanilla Mamba’s limitations by enabling motion-aware local spatiotemporal modeling, making it suitable for capturing subtle motion cues in micro-gesture recognition tasks.

Abstract: Micro-gesture recognition (MGR) targets the identification of subtle and fine-grained human motions and requires accurate modeling of both long-range and local spatiotemporal dependencies. While CNNs are effective at capturing local patterns, they struggle with long-range dependencies due to their limited receptive fields. Transformer-based models address this limitation through self-attention mechanisms but suffer from high computational costs. Recently, Mamba has shown promise as an efficient model, leveraging state space models (SSMs) to enable linear-time processing However, directly applying the vanilla Mamba to MGR may not be optimal. This is because Mamba processes inputs as 1D sequences, with state updates relying solely on the previous state, and thus lacks the ability to model local spatiotemporal dependencies. In addition, previous methods lack a design of motion-awareness, which is crucial in MGR. To overcome these limitations, we propose motion-aware state fusion mamba (MSF-Mamba), which enhances Mamba with local spatiotemporal modeling by fusing local contextual neighboring states. Our design introduces a motion-aware state fusion module based on central frame difference (CFD). Furthermore, a multiscale version named MSF-Mamba+ has been proposed. Specifically, MSF-Mamba supports multiscale motion-aware state fusion, as well as an adaptive scale weighting module that dynamically weighs the fused states across different scales. These enhancements explicitly address the limitations of vanilla Mamba by enabling motion-aware local spatiotemporal modeling, allowing MSF-Mamba and MSF-Mamba to effectively capture subtle motion cues for MGR. Experiments on two public MGR datasets demonstrate that even the lightweight version, namely, MSF-Mamba, achieves SoTA performance, outperforming existing CNN-, Transformer-, and SSM-based models while maintaining high efficiency.

Yunlong Deng, Guangyi Chen, Tianpei Gu, Lingjing Kong, Yan Li, Zeyu Tang, Kun Zhang

Main category: cs.CV

TL;DR: This paper demonstrates that Vision-Language Models (VLMs) have inherent self-refinement capabilities, allowing them to generate high-quality supervised data autonomously without external inputs through a Triangular Consistency framework.

Details

Motivation: To explore the untapped potential of VLMs trained without supervised instruction and validate their inherent self-refinement capabilities for autonomous learning.

Method: Proposes a self-refinement framework based on Triangular Consistency principle with three steps: multi-task instruction tuning, generating image-query-answer triplets from unlabeled images with consistency filtering, and model updating using synthetic data.

Result: Using LLaVA-1.5 as baseline, the model achieved consistent improvements across multiple benchmarks without any external supervision, human annotations, or environmental feedback.

Conclusion: VLMs possess self-refinement capabilities that enable autonomous learning, and these insights can inspire future research on VLM learning mechanisms.

Abstract: Vision-Language Models (VLMs) integrate visual knowledge with the analytical capabilities of Large Language Models (LLMs) through supervised visual instruction tuning, using image-question-answer triplets. However, the potential of VLMs trained without supervised instruction remains largely unexplored. This study validates that VLMs possess inherent self-refinement capabilities, enabling them to generate high-quality supervised data without external inputs and thereby learn autonomously. Specifically, to stimulate the self-refinement ability of VLMs, we propose a self-refinement framework based on a Triangular Consistency principle: within the image-query-answer triangle, any masked elements should be consistently and accurately reconstructed. The framework involves three steps: (1) We enable the instruction generation ability of VLMs by adding multi-task instruction tuning like image$\rightarrow$question-answer or image-answer$\rightarrow$question. (2) We generate image-query-answer triplets from unlabeled images and use the Triangular Consistency principle for filtering. (3) The model is further updated using the filtered synthetic data. To investigate the underlying mechanisms behind this self-refinement capability, we conduct a theoretical analysis from a causal perspective. Using the widely recognized LLaVA-1.5 as our baseline, our experiments reveal that the model can autonomously achieve consistent, though deliberately modest, improvements across multiple benchmarks without any external supervision, such as human annotations or environmental feedback. We expect that the insights of this study on the self-refinement ability of VLMs can inspire future research on the learning mechanism of VLMs. Code is available at https://github.com/dengyl20/SRF-LLaVA-1.5.

[380] Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation

Jiaye Li, Baoyou Chen, Hui Li, Zilong Dong, Jingdong Wang, Siyu Zhu

Main category: cs.CV

TL;DR: HARoPE is a head-wise adaptive extension of Rotary Position Embedding (RoPE) that addresses limitations in image generation by enabling dynamic frequency reallocation and semantic alignment through learnable linear transformations.

Details

Motivation: Standard multi-dimensional RoPE has limitations in fine-grained spatial relation modeling, color cues, and object counting for image generation due to rigid frequency allocation, axis-wise independence, and uniform head treatment.

Method: Proposes HARoPE which inserts a learnable linear transformation parameterized via singular value decomposition (SVD) before the rotary mapping, enabling dynamic frequency reallocation, semantic alignment of rotary planes, and head-specific positional receptive fields while preserving RoPE’s relative-position property.

Result: Extensive experiments on class-conditional ImageNet and text-to-image generation (Flux and MMDiT) show consistent performance improvements over strong RoPE baselines and other extensions.

Conclusion: HARoPE serves as an effective drop-in replacement that offers a principled and adaptable solution for enhancing positional awareness in transformer-based image generative models.

Abstract: Transformers rely on explicit positional encoding to model structure in data. While Rotary Position Embedding (RoPE) excels in 1D domains, its application to image generation reveals significant limitations such as fine-grained spatial relation modeling, color cues, and object counting. This paper identifies key limitations of standard multi-dimensional RoPE-rigid frequency allocation, axis-wise independence, and uniform head treatment-in capturing the complex structural biases required for fine-grained image generation. We propose HARoPE, a head-wise adaptive extension that inserts a learnable linear transformation parameterized via singular value decomposition (SVD) before the rotary mapping. This lightweight modification enables dynamic frequency reallocation, semantic alignment of rotary planes, and head-specific positional receptive fields while rigorously preserving RoPE’s relative-position property. Extensive experiments on class-conditional ImageNet and text-to-image generation (Flux and MMDiT) demonstrate that HARoPE consistently improves performance over strong RoPE baselines and other extensions. The method serves as an effective drop-in replacement, offering a principled and adaptable solution for enhancing positional awareness in transformer-based image generative models.

[381] Jigsaw3D: Disentangled 3D Style Transfer via Patch Shuffling and Masking

Yuteng Ye, Zheng Zhang, Qinchuan Zhang, Di Wang, Youjia Zhang, Wenxiao Zhang, Wei Yang, Yuan Liu

Main category: cs.CV

TL;DR: Jigsaw3D is a fast 3D style transfer method that uses jigsaw operations to decouple style from content and enables view-consistent stylization through multi-view diffusion models.

Details

Motivation: Existing 3D style transfer methods suffer from heavy per-scene optimization and entanglement of style with semantic content, requiring a solution that can decouple style from content efficiently.

Method: Uses jigsaw operations (spatial shuffling and random masking of reference patches) to suppress object semantics and isolate stylistic statistics, then integrates style cues into multi-view diffusion models via cross-attention for consistent stylization.

Result: Achieves high style fidelity and multi-view consistency with substantially lower latency compared to existing methods, and generalizes to partial reference stylization, multi-object scenes, and tileable texture generation.

Conclusion: Jigsaw3D provides an effective pipeline for fast, view-consistent 3D style transfer that successfully decouples style from content while maintaining high fidelity.

Abstract: Controllable 3D style transfer seeks to restyle a 3D asset so that its textures match a reference image while preserving the integrity and multi-view consistency. The prevalent methods either rely on direct reference style token injection or score-distillation from 2D diffusion models, which incurs heavy per-scene optimization and often entangles style with semantic content. We introduce Jigsaw3D, a multi-view diffusion based pipeline that decouples style from content and enables fast, view-consistent stylization. Our key idea is to leverage the jigsaw operation - spatial shuffling and random masking of reference patches - to suppress object semantics and isolate stylistic statistics (color palettes, strokes, textures). We integrate these style cues into a multi-view diffusion model via reference-to-view cross-attention, producing view-consistent stylized renderings conditioned on the input mesh. The renders are then style-baked onto the surface to yield seamless textures. Across standard 3D stylization benchmarks, Jigsaw3D achieves high style fidelity and multi-view consistency with substantially lower latency, and generalizes to masked partial reference stylization, multi-object scene styling, and tileable texture generation. Project page is available at: https://babahui.github.io/jigsaw3D.github.io/

[382] VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

Qunzhong Wang, Jie Liu, Jiajun Liang, Yilei Jiang, Yuanxing Zhang, Jinyuan Chen, Yaozhi Zheng, Xintao Wang, Pengfei Wan, Xiangyu Yue, Jiaheng Liu

Main category: cs.CV

TL;DR: VR-Thinker is a thinking-with-image framework for multimodal reward models that enables active visual reasoning through operations like frame selection and configurable memory windows, overcoming limitations of current RMs that lose fine-grained details and suffer from hallucination.

Details

Motivation: Current multimodal reward models face limitations: visual inputs consume large context budgets (forcing fewer frames and loss of details), and packing all visual information into initial prompts exacerbates hallucination and forgetting during reasoning.

Method: VR-Thinker introduces visual reasoning operations and configurable visual memory windows. It uses a reinforcement fine-tuning pipeline with: (i) Cold Start with visual chain-of-thought data, (ii) Rejection sampling Fine-Tuning on high-quality traces, and (iii) Group Relative Policy Optimization to strengthen reasoning.

Result: Achieves state-of-the-art accuracy on video preference benchmarks: 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video with a 7B model, especially effective for longer videos.

Conclusion: The approach validates the effectiveness and promise of thinking-with-image multimodal reward modeling, enabling more reliable and faithful visual reasoning within context limits.

Abstract: Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.

[383] Receptive Field Expanded Look-Up Tables for Vision Inference: Advancing from Low-level to High-level Tasks

Xi Zhang, Xiaolin Wu

Main category: cs.CV

TL;DR: This paper proposes novel techniques to expand the receptive field of CNNs in look-up table (LUT) methods while maintaining fixed table size, overcoming limitations of current LUT approaches that suffer from limited receptive field due to combinatorial explosion.

Details

Motivation: Current LUT methods for fast CNN inference suffer from limited receptive field due to combinatorial explosion of table size when trying to expand the receptive field, which restricts their performance and practical applicability.

Method: Proposes learning an optimal lattice vector quantizer that adaptively allocates quantization resolution across data dimensions based on significance, irregular dilated convolutions, and a U-shaped cascaded LUT structure to capture multi-level contextual information without increasing table size.

Result: The approach effectively balances speed, accuracy, and memory efficiency, demonstrating significant improvements over existing LUT methods by expanding receptive field while maintaining the same space complexity.

Conclusion: The proposed innovations enable effective expansion of CNN receptive field in LUT-driven inference with fixed table size, achieving better performance than current LUT methods through adaptive quantization and multi-level contextual capture strategies.

Abstract: Recently, several look-up table (LUT) methods were developed to greatly expedite the inference of CNNs in a classical strategy of trading space for speed. However, these LUT methods suffer from a common drawback of limited receptive field of the convolution kernels due to the combinatorial explosion of table size. This research aims to expand the CNN receptive field with a fixed table size, thereby enhancing the performance of LUT-driven fast CNN inference while maintaining the same space complexity. To achieve this goal, various techniques are proposed. The main contribution is a novel approach of learning an optimal lattice vector quantizer that adaptively allocates the quantization resolution across data dimensions based on their significance to the inference task. In addition, the lattice vector quantizer offers an inherently more accurate approximation of CNN kernels than scalar quantizer as used in current practice. Furthermore, we introduce other receptive field expansion strategies, including irregular dilated convolutions and a U-shaped cascaded LUT structure, designed to capture multi-level contextual information without inflating table size. Together, these innovations allow our approach to effectively balance speed, accuracy, and memory efficiency, demonstrating significant improvements over existing LUT methods.

Yang Liu, Yufei Yin, Chenchen Jing, Muzhi Zhu, Hao Chen, Yuling Xi, Bo Feng, Hao Wang, Shiyu Li, Chunhua Shen

Main category: cs.CV

TL;DR: COSINE is a unified open-world segmentation model that combines open-vocabulary and in-context segmentation using multi-modal prompts (text and image).

Details

Motivation: To overcome architectural discrepancies, divergent learning objectives, and distinct representation learning strategies in previous pipelines for open-vocabulary segmentation and in-context segmentation.

Method: COSINE uses foundation models to extract representations for input images and multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and obtain masks specified by input prompts across different granularities.

Result: Comprehensive experiments show significant performance improvements in both open-vocabulary and in-context segmentation tasks. Synergistic collaboration between visual and textual prompts leads to improved generalization over single-modality approaches.

Conclusion: COSINE successfully unifies open-vocabulary and in-context segmentation through multi-modal prompt integration, demonstrating superior performance and better generalization capabilities.

Abstract: In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multi-modal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and corresponding multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and obtain masks specified by input prompts across different granularities. In this way, COSINE overcomes architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE has significant performance improvements in both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergistic collaboration between using visual and textual prompts leads to significantly improved generalization over single-modality approaches.

[385] Layout-Independent License Plate Recognition via Integrated Vision and Language Models

Elham Shabaninia, Fatemeh Asadi-zeydabadi, Hossein Nezamabadi-pour

Main category: cs.CV

TL;DR: A pattern-aware ALPR framework using transformer-based vision model with iterative language modeling for layout-independent license plate recognition across diverse conditions.

Details

Motivation: To create a license plate recognition system that works reliably across different plate layouts and challenging real-world conditions without relying on manual layout classification or heuristic corrections.

Method: Combines high-precision detection network with unified recognition stage using transformer-based vision model and iterative language modeling for character identification and post-OCR refinement in one seamless process.

Result: Achieves superior accuracy and robustness across multiple international datasets (IR-LPR, UFPR-ALPR, AOLP), outperforming recent segmentation-free approaches.

Conclusion: The framework successfully bridges computer vision and language modeling through embedded pattern analysis, enabling enhanced adaptability for intelligent transportation and surveillance applications.

Abstract: This work presents a pattern-aware framework for automatic license plate recognition (ALPR), designed to operate reliably across diverse plate layouts and challenging real-world conditions. The proposed system consists of a modern, high-precision detection network followed by a recognition stage that integrates a transformer-based vision model with an iterative language modelling mechanism. This unified recognition stage performs character identification and post-OCR refinement in a seamless process, learning the structural patterns and formatting rules specific to license plates without relying on explicit heuristic corrections or manual layout classification. Through this design, the system jointly optimizes visual and linguistic cues, enables iterative refinement to improve OCR accuracy under noise, distortion, and unconventional fonts, and achieves layout-independent recognition across multiple international datasets (IR-LPR, UFPR-ALPR, AOLP). Experimental results demonstrate superior accuracy and robustness compared to recent segmentation-free approaches, highlighting how embedding pattern analysis within the recognition stage bridges computer vision and language modelling for enhanced adaptability in intelligent transportation and surveillance applications.

[386] GLOFNet – A Multimodal Dataset for GLOF Monitoring and Prediction

Zuha Fatima, Muhammad Anser Sohaib, Muhammad Talha, Sidra Sultana, Ayesha Kanwal, Nazia Perwaiz

Main category: cs.CV

TL;DR: GLOFNet is a multimodal dataset for monitoring and predicting Glacial Lake Outburst Floods (GLOFs), integrating Sentinel-2 imagery, glacier velocity data, and land surface temperature records from the Shisper Glacier region.

Details

Motivation: Predictive GLOF research is hindered by fragmented and unimodal data, with most prior work focusing on post-event mapping rather than forecasting. There's a need for harmonized datasets combining visual indicators with physical precursors.

Method: Integration of three data sources: Sentinel-2 multispectral imagery for spatial monitoring, NASA ITS_LIVE velocity products for glacier kinematics, and MODIS Land Surface Temperature records. Preprocessing included cloud masking, quality filtering, normalization, temporal interpolation, augmentation, and cyclical encoding.

Result: The dataset reveals seasonal glacier velocity cycles, long-term warming of ~0.8 K per decade, and spatial heterogeneity in cryospheric conditions. GLOFNet is publicly available and addresses challenges like class imbalance, cloud contamination, and coarse resolution.

Conclusion: GLOFNet provides a structured foundation for benchmarking multimodal deep learning approaches to rare hazard prediction, supporting future research in glacial hazard forecasting.

Abstract: Glacial Lake Outburst Floods (GLOFs) are rare but destructive hazards in high mountain regions, yet predictive research is hindered by fragmented and unimodal data. Most prior efforts emphasize post-event mapping, whereas forecasting requires harmonized datasets that combine visual indicators with physical precursors. We present GLOFNet, a multimodal dataset for GLOF monitoring and prediction, focused on the Shisper Glacier in the Karakoram. It integrates three complementary sources: Sentinel-2 multispectral imagery for spatial monitoring, NASA ITS_LIVE velocity products for glacier kinematics, and MODIS Land Surface Temperature records spanning over two decades. Preprocessing included cloud masking, quality filtering, normalization, temporal interpolation, augmentation, and cyclical encoding, followed by harmonization across modalities. Exploratory analysis reveals seasonal glacier velocity cycles, long-term warming of ~0.8 K per decade, and spatial heterogeneity in cryospheric conditions. The resulting dataset, GLOFNet, is publicly available to support future research in glacial hazard prediction. By addressing challenges such as class imbalance, cloud contamination, and coarse resolution, GLOFNet provides a structured foundation for benchmarking multimodal deep learning approaches to rare hazard prediction.

[387] MRS-YOLO Railroad Transmission Line Foreign Object Detection Based on Improved YOLO11 and Channel Pruning

Siyuan Liu, Junting Lin

Main category: cs.CV

TL;DR: MRS-YOLO is an improved YOLO11-based algorithm for railway transmission line foreign object detection, featuring multi-scale feature fusion, re-calibration FPN, spatial-channel reconstruction detection head, and channel pruning to enhance accuracy while reducing computational cost.

Details

Motivation: To address problems of missed detection, false detection, and low detection efficiency in transmission line foreign object detection under railway environment.

Method: Proposed MAKDF module for multi-scale feature extraction, RCFPN neck structure for feature integration, SC_Detect head with spatial and channel preprocessing, and channel pruning for model compression.

Result: mAP50 improved to 94.8% (0.7% higher than baseline), mAP50:95 improved to 86.4% (2.3% higher), while reducing Parameters by 44.2% and GFLOPs by 17.5%.

Conclusion: The improved algorithm demonstrates better performance and efficiency for railroad transmission line foreign object detection tasks.

Abstract: Aiming at the problems of missed detection, false detection and low detection efficiency in transmission line foreign object detection under railway environment, we proposed an improved algorithm MRS-YOLO based on YOLO11. Firstly, a multi-scale Adaptive Kernel Depth Feature Fusion (MAKDF) module is proposed and fused with the C3k2 module to form C3k2_MAKDF, which enhances the model’s feature extraction capability for foreign objects of different sizes and shapes. Secondly, a novel Re-calibration Feature Fusion Pyramid Network (RCFPN) is designed as a neck structure to enhance the model’s ability to integrate and utilize multi-level features effectively. Then, Spatial and Channel Reconstruction Detect Head (SC_Detect) based on spatial and channel preprocessing is designed to enhance the model’s overall detection performance. Finally, the channel pruning technique is used to reduce the redundancy of the improved model, drastically reduce Parameters and Giga Floating Point Operations Per Second (GFLOPs), and improve the detection efficiency. The experimental results show that the mAP50 and mAP50:95 of the MRS-YOLO algorithm proposed in this paper are improved to 94.8% and 86.4%, respectively, which are 0.7 and 2.3 percentage points higher compared to the baseline, while Parameters and GFLOPs are reduced by 44.2% and 17.5%, respectively. It is demonstrated that the improved algorithm can be better applied to the task of foreign object detection in railroad transmission lines.

[388] Deep semi-supervised approach based on consistency regularization and similarity learning for weeds classification

Farouq Benchallal, Adel Hafiane, Nicolas Ragot, Raphael Canals

Main category: cs.CV

TL;DR: A deep semi-supervised approach combining consistency regularization with similarity learning for weed species classification, addressing limited labeled data in agricultural applications.

Details

Motivation: Weed classification is challenging due to similarities with crops and field condition variations. Deep learning requires large annotated datasets, but data labeling is time-consuming and laborious in agricultural applications.

Method: Proposed deep semi-supervised approach combining consistency regularization with similarity learning using a deep auto-encoder architecture.

Result: Experiments on DeepWeeds dataset and noisy conditions demonstrated effectiveness and robustness compared to state-of-the-art fully supervised models. Ablation studies confirmed the joint learning strategy.

Conclusion: The proposed method effectively utilizes unlabeled data, provides robust classification performance, and addresses the limitation of scarce labeled data in agricultural weed classification applications.

Abstract: Weed species classification represents an important step for the development of automated targeting systems that allow the adoption of precision agriculture practices. To reduce costs and yield losses caused by their presence. The identification of weeds is a challenging problem due to their shared similarities with crop plants and the variability related to the differences in terms of their types. Along with the variations in relation to changes in field conditions. Moreover, to fully benefit from deep learning-based methods, large fully annotated datasets are needed. This requires time intensive and laborious process for data labeling, which represents a limitation in agricultural applications. Hence, for the aim of improving the utilization of the unlabeled data, regarding conditions of scarcity in terms of the labeled data available during the learning phase and provide robust and high classification performance. We propose a deep semi-supervised approach, that combines consistency regularization with similarity learning. Through our developed deep auto-encoder architecture, experiments realized on the DeepWeeds dataset and inference in noisy conditions demonstrated the effectiveness and robustness of our method in comparison to state-of-the-art fully supervised deep learning models. Furthermore, we carried out ablation studies for an extended analysis of our proposed joint learning strategy.

[389] UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation

Zhengrong Yue, Haiyu Zhang, Xiangyu Zeng, Boyu Chen, Chenting Wang, Shaobin Zhuang, Lu Dong, KunPeng Du, Yi Wang, Limin Wang, Yali Wang

Main category: cs.CV

TL;DR: UniFlow is a unified tokenizer that overcomes the performance trade-off between visual understanding and generation by using layer-wise adaptive self-distillation and a lightweight patch-wise pixel flow decoder.

Details

Motivation: Existing tokenizers face significant performance trade-offs between understanding and generation due to conflicts between high-level semantic abstraction and low-level pixel reconstruction. The goal is to develop a universal tokenizer that performs well in both domains.

Method: Proposes UniFlow with layer-wise adaptive self-distillation applied to pretrained visual encoders, and a lightweight patch-wise pixel flow decoder that models conditional flow from noisy state to pixel domain using semantic features as conditions.

Result: Achieves win-win outcomes across 13 benchmarks spanning 7 visual tasks. 7B UniFlow-XL surpasses 14B TokenFlow-XL by 7.75% on understanding benchmarks and achieves competitive results in visual reconstruction/generation, outperforming UniTok by 0.15 in rFID and 0.09 in gFID.

Conclusion: UniFlow successfully addresses the understanding-generation trade-off through its adaptive self-distillation and patch-wise flow decoder design, demonstrating superior performance across diverse visual tasks without compromising either capability.

Abstract: Tokenizer is a crucial component for both visual understanding and generation. To advance toward the ultimate goal of universal modeling, recent research has focused on developing a unified tokenizer. However, existing tokenizers face a significant performance trade-off between understanding and generation, stemming from the inherent conflict between high-level semantic abstraction and low-level pixel reconstruction. To tackle this challenge, we propose a generic and unified tokenizer, namely UniFlow, by flexibly adapting any visual encoder with a concise reconstruction decoder. Specifically, we introduce layer-wise adaptive self-distillation applied to the well-pretrained visual encoders, which enables UniFlow to simultaneously inherit the strong semantic features for visual understanding and flexibly adapt to model fine-grained details for visual generation. Moreover, we propose a lightweight patch-wise pixel flow decoder, which efficiently achieves high-fidelity pixel reconstruction by modeling a conditional flow from the noisy state back to the patch-wise pixel domain. By leveraging the semantic features as visual conditions for the decoder, we effectively alleviate the training conflicts between understanding and generation. Furthermore, the patch-wise learning strategy simplifies the data distribution, thereby improving training efficiency. Extensive experiments across 13 challenging benchmarks spanning 7 widely studied visual understanding and generation tasks demonstrate that UniFlow achieves a win-win outcome. For instance, our 7B UniFlow-XL not only surpasses the 14B TokenFlow-XL by 7.75% on average understanding benchmarks, but also achieves competitive results in both visual reconstruction and generation, surpassing UniTok by 0.15 in rFID and 0.09 in gFID (without guidance), respectively.

[390] Injecting Frame-Event Complementary Fusion into Diffusion for Optical Flow in Challenging Scenes

Haonan Wang, Hanyu Zhou, Haoyue Liu, Luxin Yan

Main category: cs.CV

TL;DR: Diff-ABFlow is a novel optical flow estimation framework using diffusion models with frame-event fusion to address challenges in high-speed and low-light scenes where conventional methods fail due to motion blur and insufficient illumination.

Details

Motivation: Optical flow estimation struggles in high-speed and low-light scenes due to motion blur and insufficient illumination, which weaken texture, amplify noise, and deteriorate appearance saturation and boundary completeness. Traditional methods relying on deteriorated visual features perform poorly in these degraded conditions.

Method: Proposes Diff-ABFlow framework based on diffusion models with frame-event appearance-boundary fusion. Instead of learning mapping from visual features to motion fields, it learns mapping from noisy flow to clear flow using diffusion models, leveraging complementary strengths of frame cameras (dense appearance) and event cameras (dense boundary completeness).

Result: The method addresses the limitations of both discriminative and generative models that are affected by deteriorated visual features in challenging scenes, providing a more robust optical flow estimation approach.

Conclusion: Diffusion models offer a promising alternative for optical flow estimation in degraded scenes by learning flow-to-flow mappings rather than relying on deteriorated visual features, with frame-event fusion providing complementary appearance and boundary information.

Abstract: Optical flow estimation has achieved promising results in conventional scenes but faces challenges in high-speed and low-light scenes, which suffer from motion blur and insufficient illumination. These conditions lead to weakened texture and amplified noise and deteriorate the appearance saturation and boundary completeness of frame cameras, which are necessary for motion feature matching. In degraded scenes, the frame camera provides dense appearance saturation but sparse boundary completeness due to its long imaging time and low dynamic range. In contrast, the event camera offers sparse appearance saturation, while its short imaging time and high dynamic range gives rise to dense boundary completeness. Traditionally, existing methods utilize feature fusion or domain adaptation to introduce event to improve boundary completeness. However, the appearance features are still deteriorated, which severely affects the mostly adopted discriminative models that learn the mapping from visual features to motion fields and generative models that generate motion fields based on given visual features. So we introduce diffusion models that learn the mapping from noising flow to clear flow, which is not affected by the deteriorated visual features. Therefore, we propose a novel optical flow estimation framework Diff-ABFlow based on diffusion models with frame-event appearance-boundary fusion.

[391] Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection

Shizhen Zhao, Jiahui Liu, Xin Wen, Haoru Tan, Xiaojuan Qi

Main category: cs.CV

TL;DR: Vision foundation models like DINOv2 show strong OOD detection capabilities without fine-tuning, but struggle with large semantic spaces. The paper proposes Mixture of Feature Experts (MoFE) and Dynamic-β Mixup to improve performance.

Details

Motivation: Pre-trained vision foundation models have strong feature learning capabilities but their impact on out-of-distribution (OOD) detection remains underexplored, especially in scenarios with large semantic spaces.

Method: 1) Systematic investigation of vision foundation models for OOD detection; 2) Proposed Mixture of Feature Experts (MoFE) module to partition features into subspaces; 3) Dynamic-β Mixup strategy with adaptive interpolation weights from beta distribution.

Result: Pre-trained DINOv2 achieves OOD detection performance comparable to state-of-the-art methods without fine-tuning. The proposed MoFE and Dynamic-β Mixup significantly outperform baseline methods, especially in large semantic space scenarios.

Conclusion: Vision foundation models provide strong OOD detection capabilities, and the proposed MoFE module with Dynamic-β Mixup effectively addresses challenges in large semantic spaces by refining decision boundaries and adapting to varying learning difficulties.

Abstract: Pre-trained vision foundation models have transformed many computer vision tasks. Despite their strong ability to learn discriminative and generalizable features crucial for out-of-distribution (OOD) detection, their impact on this task remains underexplored. Motivated by this gap, we systematically investigate representative vision foundation models for OOD detection. Our findings reveal that a pre-trained DINOv2 model, even without fine-tuning on in-domain (ID) data, naturally provides a highly discriminative feature space for OOD detection, achieving performance comparable to existing state-of-the-art methods without requiring complex designs. Beyond this, we explore how fine-tuning foundation models on in-domain (ID) data can enhance OOD detection. However, we observe that the performance of vision foundation models remains unsatisfactory in scenarios with a large semantic space. This is due to the increased complexity of decision boundaries as the number of categories grows, which complicates the optimization process. To mitigate this, we propose the Mixture of Feature Experts (MoFE) module, which partitions features into subspaces, effectively capturing complex data distributions and refining decision boundaries. Further, we introduce a Dynamic-$\beta$ Mixup strategy, which samples interpolation weights from a dynamic beta distribution. This adapts to varying levels of learning difficulty across categories, improving feature learning for more challenging categories. Extensive experiments demonstrate the effectiveness of our approach, significantly outperforming baseline methods.

[392] A Simple and Better Baseline for Visual Grounding

Jingchao Wang, Wenlong Zhang, Dingjiang Huang, Hong Wang, Yefeng Zheng

Main category: cs.CV

TL;DR: FSVG is a simple yet effective visual grounding method that directly integrates linguistic and visual modalities without iterative procedures, using language-guided feature selection for efficient object localization.

Details

Motivation: Existing visual grounding methods use iterative procedures across image scales and require caching linguistic/visual features, causing computational overhead. FSVG aims to simplify this process while maintaining accuracy.

Method: FSVG encapsulates linguistic and visual modalities in a single network without iterations, using language as parallel guidance for modality interaction. It employs similarity-based feature selection to focus only on language-relevant visual features.

Result: Extensive experiments on benchmark datasets show FSVG achieves better balance between accuracy and efficiency compared to state-of-the-art methods.

Conclusion: FSVG provides an effective baseline for visual grounding that reduces computational overhead while maintaining competitive performance through streamlined architecture and feature selection.

Abstract: Visual grounding aims to predict the locations of target objects specified by textual descriptions. For this task with linguistic and visual modalities, there is a latest research line that focuses on only selecting the linguistic-relevant visual regions for object localization to reduce the computational overhead. Albeit achieving impressive performance, it is iteratively performed on different image scales, and at every iteration, linguistic features and visual features need to be stored in a cache, incurring extra overhead. To facilitate the implementation, in this paper, we propose a feature selection-based simple yet effective baseline for visual grounding, called FSVG. Specifically, we directly encapsulate the linguistic and visual modalities into an overall network architecture without complicated iterative procedures, and utilize the language in parallel as guidance to facilitate the interaction between linguistic modal and visual modal for extracting effective visual features. Furthermore, to reduce the computational cost, during the visual feature learning, we introduce a similarity-based feature selection mechanism to only exploit language-related visual features for faster prediction. Extensive experiments conducted on several benchmark datasets comprehensively substantiate that the proposed FSVG achieves a better balance between accuracy and efficiency beyond the current state-of-the-art methods. Code is available at https://github.com/jcwang0602/FSVG.

[393] ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, Jiaya Jia

Main category: cs.CV

TL;DR: ViSurf is a unified post-training paradigm that integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to overcome limitations of both methods in Large Vision-and-Language Models.

Details

Motivation: SFT leads to sub-optimal performance while RLVR struggles with tasks beyond the model's knowledge base. The authors aim to combine the strengths of both approaches.

Method: ViSurf integrates SFT and RLVR in a single stage by injecting ground-truth labels into RLVR rollouts, providing simultaneous external supervision and internal reinforcement. Three novel reward control strategies are introduced to stabilize training.

Result: Extensive experiments across diverse benchmarks show ViSurf outperforms individual SFT, RLVR, and two-stage SFT→RLVR approaches.

Conclusion: ViSurf provides an effective unified framework that validates the integration of supervised and reinforcement learning paradigms for enhanced LVLM performance.

Abstract: Typical post-training paradigms for Large Vision-and-Language Models (LVLMs) include Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). SFT leverages external guidance to inject new knowledge, whereas RLVR utilizes internal reinforcement to enhance reasoning capabilities and overall performance. However, our analysis reveals that SFT often leads to sub-optimal performance, while RLVR struggles with tasks that exceed the model’s internal knowledge base. To address these limitations, we propose ViSurf (\textbf{Vi}sual \textbf{Su}pervised-and-\textbf{R}einforcement \textbf{F}ine-Tuning), a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage. We analyze the derivation of the SFT and RLVR objectives to establish the ViSurf objective, providing a unified perspective on these two paradigms. The core of ViSurf involves injecting ground-truth labels into the RLVR rollouts, thereby providing simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to stabilize and optimize the training process. Extensive experiments across several diverse benchmarks demonstrate the effectiveness of ViSurf, outperforming both individual SFT, RLVR, and two-stage SFT \textrightarrow RLVR. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.

[394] OmniQuality-R: Advancing Reward Models Through All-Encompassing Quality Assessment

Yiting Lu, Fengbin Guan, Yixin Gao, Yan Zhong, Xinge Peng, Jiakang Yuan, Yihao Liu, Bo Zhang, Xin Li, Zhibo Chen, Weisi Lin

Main category: cs.CV

TL;DR: OmniQuality-R is a unified reward modeling framework that transforms multi-task quality reasoning into continuous and interpretable reward signals for policy optimization across visual evaluation tasks.

Details

Motivation: Current visual evaluation approaches are constrained to single tasks, lacking a unified framework for multi-task quality assessment.

Method: Construct reasoning-enhanced reward modeling dataset via rejection sampling, apply Group Relative Policy Optimization with Gaussian-based reward, and incorporate STD filtering and entropy gating for stable training.

Result: Evaluated on three key IQA tasks: aesthetic quality assessment, technical quality evaluation, and text-image alignment.

Conclusion: The framework successfully enables multi-dimensional quality reasoning and provides interpretable reward signals for policy optimization across multiple visual evaluation tasks.

Abstract: Current visual evaluation approaches are typically constrained to a single task. To address this, we propose OmniQuality-R, a unified reward modeling framework that transforms multi-task quality reasoning into continuous and interpretable reward signals for policy optimization. Inspired by subjective experiments, where participants are given task-specific instructions outlining distinct assessment principles prior to evaluation, we propose OmniQuality-R, a structured reward modeling framework that transforms multi-dimensional reasoning into continuous and interpretable reward signals. To enable this, we construct a reasoning-enhanced reward modeling dataset by sampling informative plan-reason trajectories via rejection sampling, forming a reliable chain-of-thought (CoT) dataset for supervised fine-tuning (SFT). Building on this, we apply Group Relative Policy Optimization (GRPO) for post-training, using a Gaussian-based reward to support continuous score prediction. To further stabilize the training and improve downstream generalization, we incorporate standard deviation (STD) filtering and entropy gating mechanisms during reinforcement learning. These techniques suppress unstable updates and reduce variance in policy optimization. We evaluate OmniQuality-R on three key IQA tasks: aesthetic quality assessment, technical quality evaluation, and text-image alignment.

[395] GraphTARIF: Linear Graph Transformer with Augmented Rank and Improved Focus

Zhaolin Hu, Kun Li, Hehe Fan, Yi Yang

Main category: cs.CV

TL;DR: A hybrid framework combining linear attention with gated local graph networks to enhance expressiveness while maintaining linear complexity in Graph Transformers.

Details

Motivation: Linear attention mechanisms in Graph Transformers suffer from reduced expressiveness due to low-rank projections and uniform attention distributions, limiting classification ability.

Method: Enhances linear attention by attaching gated local graph network to value matrix to increase rank, and introduces learnable log-power function to reduce attention entropy and sharpen focus.

Result: Achieves competitive performance on both homophilic and heterophilic graph benchmarks while preserving linear attention scalability.

Conclusion: The proposed hybrid framework successfully addresses expressiveness limitations of linear attention through rank enhancement and attention focusing, maintaining efficiency benefits.

Abstract: Linear attention mechanisms have emerged as efficient alternatives to full self-attention in Graph Transformers, offering linear time complexity. However, existing linear attention models often suffer from a significant drop in expressiveness due to low-rank projection structures and overly uniform attention distributions. We theoretically prove that these properties reduce the class separability of node representations, limiting the model’s classification ability. To address this, we propose a novel hybrid framework that enhances both the rank and focus of attention. Specifically, we enhance linear attention by attaching a gated local graph network branch to the value matrix, thereby increasing the rank of the resulting attention map. Furthermore, to alleviate the excessive smoothing effect inherent in linear attention, we introduce a learnable log-power function into the attention scores to reduce entropy and sharpen focus. We theoretically show that this function decreases entropy in the attention distribution, enhancing the separability of learned embeddings. Extensive experiments on both homophilic and heterophilic graph benchmarks demonstrate that our method achieves competitive performance while preserving the scalability of linear attention.

[396] DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis

Peiyin Chen, Zhuowei Yang, Hui Feng, Sheng Jiang, Rui Yan

Main category: cs.CV

TL;DR: DEMO is a flow-matching framework for audio-driven talking-head generation that provides disentangled control over lip motion, head pose, and eye gaze through a structured latent space and transformer-based flow matching.

Details

Motivation: Existing diffusion-based talking-head generation methods struggle with temporal coherence and fine-grained motion control, creating a need for better controllable video synthesis.

Method: Uses a motion auto-encoder to create a structured latent space with orthogonalized motion factors, then applies optimal-transport-based flow matching with a transformer predictor to generate smooth motion trajectories conditioned on audio.

Result: Outperforms prior methods on multiple benchmarks in video realism, lip-audio synchronization, and motion fidelity.

Conclusion: Combining fine-grained motion disentanglement with flow-based generative modeling provides a powerful new paradigm for controllable talking-head video synthesis.

Abstract: Audio-driven talking-head generation has advanced rapidly with diffusion-based generative models, yet producing temporally coherent videos with fine-grained motion control remains challenging. We propose DEMO, a flow-matching generative framework for audio-driven talking-portrait video synthesis that delivers disentangled, high-fidelity control of lip motion, head pose, and eye gaze. The core contribution is a motion auto-encoder that builds a structured latent space in which motion factors are independently represented and approximately orthogonalized. On this disentangled motion space, we apply optimal-transport-based flow matching with a transformer predictor to generate temporally smooth motion trajectories conditioned on audio. Extensive experiments across multiple benchmarks show that DEMO outperforms prior methods in video realism, lip-audio synchronization, and motion fidelity. These results demonstrate that combining fine-grained motion disentanglement with flow-based generative modeling provides a powerful new paradigm for controllable talking-head video synthesis.

[397] A Machine Learning Perspective on Automated Driving Corner Cases

Sebastian Schmidt, Julius Körner, Stephan Günnemann

Main category: cs.CV

TL;DR: Proposes a novel ML approach for corner case recognition in autonomous driving that considers data distribution rather than individual examples, achieving strong performance on detection tasks and unifying existing taxonomies.

Details

Motivation: Traditional corner case categorization is not scalable and lacks data coverage perspective, failing to generalize to ML model training data. Need for safe operation in high-stakes applications like autonomous driving.

Method: Framework for corner case recognition based on data distribution perspective, extends out-of-distribution detection benchmarks, and introduces fog-augmented Lost & Found dataset for combined corner case analysis.

Result: Unifies existing scenario-based corner case taxonomies, achieves strong performance on corner case detection tasks across standard benchmarks, and enables analysis of combined corner cases.

Conclusion: Provides principled basis for corner case recognition with manual specification-free definition, offering scalable approach for safety-critical applications.

Abstract: For high-stakes applications, like autonomous driving, a safe operation is necessary to prevent harm, accidents, and failures. Traditionally, difficult scenarios have been categorized into corner cases and addressed individually. However, this example-based categorization is not scalable and lacks a data coverage perspective, neglecting the generalization to training data of machine learning models. In our work, we propose a novel machine learning approach that takes the underlying data distribution into account. Based on our novel perspective, we present a framework for effective corner case recognition for perception on individual samples. In our evaluation, we show that our approach (i) unifies existing scenario-based corner case taxonomies under a distributional perspective, (ii) achieves strong performance on corner case detection tasks across standard benchmarks for which we extend established out-of-distribution detection benchmarks, and (iii) enables analysis of combined corner cases via a newly introduced fog-augmented Lost & Found dataset. These results provide a principled basis for corner case recognition, underlining our manual specification-free definition.

[398] Stability Under Scrutiny: Benchmarking Representation Paradigms for Online HD Mapping

Hao Shan, Ruikai Li, Han Jiang, Yizhe Fan, Ziyang Yan, Bohan Li, Xiaoshuai Hao, Hao Zhao, Zhiyong Cui, Yilong Ren, Haiyang Yu

Main category: cs.CV

TL;DR: This paper introduces the first comprehensive benchmark for evaluating temporal stability in online HD mapping models for autonomous driving, proposing novel stability metrics and showing that accuracy and stability are largely independent performance dimensions.

Details

Motivation: Existing online HD mapping models focus on per-frame accuracy but ignore temporal stability, which is crucial for autonomous driving systems operating in dynamic environments where sensor displacement causes mapping shifts that challenge downstream tasks.

Method: Proposed a multi-dimensional stability evaluation framework with novel metrics for Presence, Localization, and Shape Stability, integrated into a unified mean Average Stability (mAS) score. Conducted extensive experiments on 42 models and variants.

Result: Accuracy (mAP) and stability (mAS) represent largely independent performance dimensions. Analysis identified architectural and training factors that contribute to high accuracy, high stability, or both.

Conclusion: Temporal stability should be treated as a core evaluation criterion alongside accuracy for online HD mapping models. The authors will release a public benchmark to encourage broader focus on stability in autonomous driving systems.

Abstract: As one of the fundamental modules in autonomous driving, online high-definition (HD) maps have attracted significant attention due to their cost-effectiveness and real-time capabilities. Since vehicles always cruise in highly dynamic environments, spatial displacement of onboard sensors inevitably causes shifts in real-time HD mapping results, and such instability poses fundamental challenges for downstream tasks. However, existing online map construction models tend to prioritize improving each frame’s mapping accuracy, while the mapping stability has not yet been systematically studied. To fill this gap, this paper presents the first comprehensive benchmark for evaluating the temporal stability of online HD mapping models. We propose a multi-dimensional stability evaluation framework with novel metrics for Presence, Localization, and Shape Stability, integrated into a unified mean Average Stability (mAS) score. Extensive experiments on 42 models and variants show that accuracy (mAP) and stability (mAS) represent largely independent performance dimensions. We further analyze the impact of key model design choices on both criteria, identifying architectural and training factors that contribute to high accuracy, high stability, or both. To encourage broader focus on stability, we will release a public benchmark. Our work highlights the importance of treating temporal stability as a core evaluation criterion alongside accuracy, advancing the development of more reliable autonomous driving systems. The benchmark toolkit, code, and models will be available at https://stablehdmap.github.io/.

[399] Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection

Gaojian Wang, Feng Lin, Tong Wu, Zhisheng Yan, Kui Ren

Main category: cs.CV

TL;DR: FS-VFM is a self-supervised pre-training framework that learns robust facial representations by combining masked image modeling and instance discrimination with three learning objectives (3C) for local patterns and global semantics, achieving state-of-the-art performance on various face security tasks.

Details

Motivation: To learn robust and transferable facial representations from abundant unlabeled real face images that can generalize across various face security tasks, addressing the need for fundamental representations in face security applications.

Method: Proposes FS-VFM framework with three learning objectives (3C) that synergize masked image modeling (MIM) and instance discrimination (ID). Uses facial masking strategies including CRFR-P masking for intra-region consistency and inter-region coherency, and self-distillation to couple MIM with ID for local-to-global correspondence. Also introduces FS-Adapter for efficient transfer learning.

Result: Extensive experiments on 11 public benchmarks show FS-VFM consistently outperforms diverse vision foundation models across natural and facial domains, various supervision paradigms, and ViT scales. Even surpasses state-of-the-art task-specific methods while FS-Adapter provides excellent efficiency-performance trade-off.

Conclusion: FS-VFM successfully learns fundamental facial representations that generalize well across multiple face security tasks, demonstrating the effectiveness of combining MIM and ID with the proposed 3C learning objectives for robust face security applications.

Abstract: With abundant, unlabeled real faces, how can we learn robust and transferable facial representations to boost generalization across various face security tasks? We make the first attempt and propose FS-VFM, a scalable self-supervised pre-training framework, to learn fundamental representations of real face images. We introduce three learning objectives, namely 3C, that synergize masked image modeling (MIM) and instance discrimination (ID), empowering FS-VFM to encode both local patterns and global semantics of real faces. Specifically, we formulate various facial masking strategies for MIM and devise a simple yet effective CRFR-P masking, which explicitly prompts the model to pursue meaningful intra-region Consistency and challenging inter-region Coherency. We present a reliable self-distillation mechanism that seamlessly couples MIM with ID to establish underlying local-to-global Correspondence. After pre-training, vanilla vision transformers (ViTs) serve as universal Vision Foundation Models for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forensics. To efficiently transfer the pre-trained FS-VFM, we further propose FS-Adapter, a lightweight plug-and-play bottleneck atop the frozen backbone with a novel real-anchor contrastive objective. Extensive experiments on 11 public benchmarks demonstrate that our FS-VFM consistently generalizes better than diverse VFMs, spanning natural and facial domains, fully, weakly, and self-supervised paradigms, small, base, and large ViT scales, and even outperforms SOTA task-specific methods, while FS-Adapter offers an excellent efficiency-performance trade-off. The code and models are available on https://fsfm-3c.github.io/fsvfm.html.

[400] AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes

Yu Li, Menghan Xia, Gongye Liu, Jianhong Bai, Xintao Wang, Conglang Zhang, Yuxuan Lin, Ruihang Chu, Pengfei Wan, Yujiu Yang

Main category: cs.CV

TL;DR: Proposes a two-stage method to adapt pre-trained Text-to-Video models for viewpoint prediction from 4D scenes, using video generation priors to extract camera viewpoints.

Details

Motivation: Leverage the powerful capability of Text-to-Video models in simulating real-world geometry and physics as implicit world models for viewpoint planning from 4D scenes.

Method: Two-stage approach: 1) Inject 4D scene representation into T2V model via adaptive learning branch, 2) Formulate viewpoint extraction as hybrid-condition guided camera extrinsic denoising process with additional diffusion branch.

Result: Experimental results show superiority over existing competitors, and ablation studies validate effectiveness of key technical designs.

Conclusion: This work proves the potential of video generation models toward 4D interaction in real world.

Abstract: Recent Text-to-Video (T2V) models have demonstrated powerful capability in visual simulation of real-world geometry and physical laws, indicating its potential as implicit world models. Inspired by this, we explore the feasibility of leveraging the video generation prior for viewpoint planning from given 4D scenes, since videos internally accompany dynamic scenes with natural viewpoints. To this end, we propose a two-stage paradigm to adapt pre-trained T2V models for viewpoint prediction, in a compatible manner. First, we inject the 4D scene representation into the pre-trained T2V model via an adaptive learning branch, where the 4D scene is viewpoint-agnostic and the conditional generated video embeds the viewpoints visually. Then, we formulate viewpoint extraction as a hybrid-condition guided camera extrinsic denoising process. Specifically, a camera extrinsic diffusion branch is further introduced onto the pre-trained T2V model, by taking the generated video and 4D scene as input. Experimental results show the superiority of our proposed method over existing competitors, and ablation studies validate the effectiveness of our key technical designs. To some extent, this work proves the potential of video generation models toward 4D interaction in real world.

[401] Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey

Jinxuan Li, Chaolei Tan, Haoxuan Chen, Jianxin Ma, Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai

Main category: cs.CV

TL;DR: This survey provides the first comprehensive review of image-to-video transfer learning, which extends image-language foundation models to video domain to reduce data and computational requirements for video-text learning.

Details

Motivation: To address the substantial data and computational requirements of training video-language foundation models from scratch by leveraging existing image-language foundation models through transfer learning.

Method: Systematically classifies image-to-video transfer learning strategies into two categories: frozen features (preserving original representations) and modified features (undergoing modifications), and analyzes their applications across various video-text learning tasks.

Result: Presents detailed experimental analysis investigating the efficacy of different image-to-video transfer learning paradigms on downstream video understanding tasks.

Conclusion: Identifies prevailing challenges and highlights promising directions for future research, aiming to establish a structured roadmap for advancing video-text learning based on existing image-language foundation models.

Abstract: Image-Language Foundation Models (ILFM) have demonstrated remarkable success in image-text understanding/generation tasks, providing transferable multimodal representations that generalize across diverse downstream image-based tasks. The advancement of video-text research has spurred growing interest in extending image-based models to the video domain. This paradigm, known as image-to-video transfer learning, succeeds in alleviating the substantial data and computational requirements associated with training video-language foundation models from scratch for video-text learning. This survey provides the first comprehensive review of this emerging field, which begins by summarizing the widely used ILFM and their capabilities. We then systematically classify existing image-to-video transfer learning strategies into two categories: frozen features and modified features, depending on whether the original representations from ILFM are preserved or undergo modifications. Building upon the task-specific nature of image-to-video transfer, this survey methodically elaborates these strategies and details their applications across a spectrum of video-text learning tasks, ranging from fine-grained (e.g., spatio-temporal video grounding) to coarse-grained (e.g., video question answering). We further present a detailed experimental analysis to investigate the efficacy of different image-to-video transfer learning paradigms on a range of downstream video understanding tasks. Finally, we identify prevailing challenges and highlight promising directions for future research. By offering a comprehensive and structured overview, this survey aims to establish a structured roadmap for advancing video-text learning based on existing ILFM, and to inspire future research directions in this rapidly evolving domain.

Yuxiang Luo, Qing Xu, Hai Huang, Yuqi Ouyang, Zhen Chen, Wenting Duan

Main category: cs.CV

TL;DR: MSM-Seg is a novel framework for multi-modal brain tumor segmentation that introduces a dual-memory paradigm integrating multi-modal and inter-slice information with category-agnostic prompts, achieving state-of-the-art performance.

Details

Motivation: Existing prompt-based segmentation methods ignore cross-modal correlations and rely on labor-intensive category-specific prompts, limiting their applicability in real-world clinical scenarios.

Method: Proposes MSM-Seg with three key components: modality-and-slice memory attention (MSMA) for cross-modal and inter-slice relationships, multi-scale category-agnostic prompt encoder (MCP-Encoder) for tumor guidance, and modality-adaptive fusion decoder (MF-Decoder) for complementary decoding across modalities.

Result: Extensive experiments on different MRI datasets demonstrate that MSM-Seg outperforms state-of-the-art methods in multi-modal metastases and glioma tumor segmentation.

Conclusion: The proposed MSM-Seg framework effectively addresses limitations of existing methods and provides superior performance for multi-modal brain tumor segmentation in clinical applications.

Abstract: Multi-modal brain tumor segmentation is critical for clinical diagnosis, and it requires accurate identification of distinct internal anatomical subregions. While the recent prompt-based segmentation paradigms enable interactive experiences for clinicians, existing methods ignore cross-modal correlations and rely on labor-intensive category-specific prompts, limiting their applicability in real-world scenarios. To address these issues, we propose a MSM-Seg framework for multi-modal brain tumor segmentation. The MSM-Seg introduces a novel dual-memory segmentation paradigm that synergistically integrates multi-modal and inter-slice information with the efficient category-agnostic prompt for brain tumor understanding. To this end, we first devise a modality-and-slice memory attention (MSMA) to exploit the cross-modal and inter-slice relationships among the input scans. Then, we propose a multi-scale category-agnostic prompt encoder (MCP-Encoder) to provide tumor region guidance for decoding. Moreover, we devise a modality-adaptive fusion decoder (MF-Decoder) that leverages the complementary decoding information across different modalities to improve segmentation accuracy. Extensive experiments on different MRI datasets demonstrate that our MSM-Seg framework outperforms state-of-the-art methods in multi-modal metastases and glioma tumor segmentation. The code is available at https://github.com/xq141839/MSM-Seg.

[403] Action-Dynamics Modeling and Cross-Temporal Interaction for Online Action Understanding

Xinyu Yang, Zheheng Jiang, Feixiang Zhou, Yihang Zhu, Na Lv, Nan Xing, Huiyu Zhou

Main category: cs.CV

TL;DR: SSM framework unifies action detection and anticipation by compressing video frames into critical states, modeling action dynamics with state-transition graphs, and using cross-temporal interactions to refine features.

Details

Motivation: Address redundant information in untrimmed videos and the overlooked influence of agent intention on action understanding.

Method: Critical State-Based Memory Compression reduces redundancy; Action Pattern Learning constructs state-transition graphs with multi-dimensional edges; Cross-Temporal Interaction models mutual influence between intentions and past/current information.

Result: Superior performance on EPIC-Kitchens-100, THUMOS'14, TVSeries, and PDMB datasets compared to state-of-the-art approaches.

Conclusion: Demonstrates importance of action dynamics learning and cross-temporal interactions for future action understanding research.

Abstract: Action understanding, encompassing action detection and anticipation, plays a crucial role in numerous practical applications. However, untrimmed videos are often characterized by substantial redundant information and noise. Moreover, in modeling action understanding, the influence of the agent’s intention on the action is often overlooked. Motivated by these issues, we propose a novel framework called the State-Specific Model (SSM), designed to unify and enhance both action detection and anticipation tasks. In the proposed framework, the Critical State-Based Memory Compression module compresses frame sequences into critical states, reducing information redundancy. The Action Pattern Learning module constructs a state-transition graph with multi-dimensional edges to model action dynamics in complex scenarios, on the basis of which potential future cues can be generated to represent intention. Furthermore, our Cross-Temporal Interaction module models the mutual influence between intentions and past as well as current information through cross-temporal interactions, thereby refining present and future features and ultimately realizing simultaneous action detection and anticipation. Extensive experiments on multiple benchmark datasets – including EPIC-Kitchens-100, THUMOS'14, TVSeries, and the introduced Parkinson’s Disease Mouse Behaviour (PDMB) dataset – demonstrate the superior performance of our proposed framework compared to other state-of-the-art approaches. These results highlight the importance of action dynamics learning and cross-temporal interactions, laying a foundation for future action understanding research.

[404] Dynamic Gaussian Splatting from Defocused and Motion-blurred Monocular Videos

Xuankai Zhang, Junjin Xiao, Qing Zhang

Main category: cs.CV

TL;DR: A unified framework for high-quality dynamic Gaussian Splatting that handles both defocused and motion-blurred monocular videos using blur prediction networks and dynamic Gaussian densification.

Details

Motivation: Existing methods are tailored for either defocus blur or motion blur, lacking the ability to handle both simultaneously. Joint modeling as blur kernel-based convolution is limited by the difficulty in estimating accurate blur kernels.

Method: Proposes per-pixel reliable blur kernel estimation using a blur prediction network with blur-aware sparsity constraint, dynamic Gaussian densification for incomplete regions, and incorporation of unseen view information for scene optimization.

Result: Extensive experiments show the method outperforms state-of-the-art methods in generating photorealistic novel view synthesis from defocused and motion-blurred monocular videos.

Conclusion: The proposed unified framework successfully handles both defocus and motion blur in monocular videos, achieving superior novel view synthesis quality compared to existing specialized approaches.

Abstract: This paper presents a unified framework that allows high-quality dynamic Gaussian Splatting from both defocused and motion-blurred monocular videos. Due to the significant difference between the formation processes of defocus blur and motion blur, existing methods are tailored for either one of them, lacking the ability to simultaneously deal with both of them. Although the two can be jointly modeled as blur kernel-based convolution, the inherent difficulty in estimating accurate blur kernels greatly limits the progress in this direction. In this work, we go a step further towards this direction. Particularly, we propose to estimate per-pixel reliable blur kernels using a blur prediction network that exploits blur-related scene and camera information and is subject to a blur-aware sparsity constraint. Besides, we introduce a dynamic Gaussian densification strategy to mitigate the lack of Gaussians for incomplete regions, and boost the performance of novel view synthesis by incorporating unseen view information to constrain scene optimization. Extensive experiments show that our method outperforms the state-of-the-art methods in generating photorealistic novel view synthesis from defocused and motion-blurred monocular videos. Our code and trained model will be made publicly available.

[405] WorldMirror: Universal 3D World Reconstruction with Any-Prior Prompting

Yifan Liu, Zhiyuan Min, Zhenwei Wang, Junta Wu, Tengfei Wang, Yixuan Yuan, Yawei Luo, Chunchao Guo

Main category: cs.CV

TL;DR: WorldMirror is a unified feed-forward model that integrates diverse geometric priors to generate multiple 3D representations simultaneously, achieving state-of-the-art performance across various 3D geometric prediction tasks.

Details

Motivation: Existing methods are constrained to image-only inputs or customized for specific tasks, lacking flexibility to handle diverse geometric priors and generate multiple 3D representations in a unified manner.

Method: A feed-forward framework that flexibly integrates camera poses, intrinsics, depth maps and other geometric priors to simultaneously generate dense point clouds, multi-view depth maps, camera parameters, surface normals, and 3D Gaussians in a single forward pass.

Result: Achieves state-of-the-art performance across diverse benchmarks including camera estimation, point map generation, depth estimation, surface normal estimation, and novel view synthesis, while maintaining efficient feed-forward inference.

Conclusion: WorldMirror provides an elegant and unified architecture that leverages available prior information to resolve structural ambiguities and deliver geometrically consistent 3D outputs efficiently.

Abstract: We present WorldMirror, an all-in-one, feed-forward model for versatile 3D geometric prediction tasks. Unlike existing methods constrained to image-only inputs or customized for a specific task, our framework flexibly integrates diverse geometric priors, including camera poses, intrinsics, and depth maps, while simultaneously generating multiple 3D representations: dense point clouds, multi-view depth maps, camera parameters, surface normals, and 3D Gaussians. This elegant and unified architecture leverages available prior information to resolve structural ambiguities and delivers geometrically consistent 3D outputs in a single forward pass. WorldMirror achieves state-of-the-art performance across diverse benchmarks from camera, point map, depth, and surface normal estimation to novel view synthesis, while maintaining the efficiency of feed-forward inference. Code and models will be publicly available soon.

[406] Seeing My Future: Predicting Situated Interaction Behavior in Virtual Reality

Yuan Xu, Zimu Zhang, Xiaoxuan Ma, Wentao Zhu, Yu Qiao, Yizhou Wang

Main category: cs.CV

TL;DR: A hierarchical intention-aware framework using dynamic Graph Convolutional Networks to model human intentions and predict situated behaviors in VR/AR environments.

Details

Motivation: VR/AR systems need intelligent adaptation to user behaviors for enhanced interaction experiences, requiring accurate understanding of human intentions and prediction of future behaviors like gaze direction and object interactions.

Method: Proposes a hierarchical framework that identifies interaction targets and forecasts fine-grained behaviors using dynamic GCNs to capture human-environment relationships from historical dynamics and scene contexts.

Result: Achieves superior performance across all metrics on real-world benchmarks and live VR environments, demonstrating effectiveness for proactive VR systems.

Conclusion: The framework enables practical applications for proactive VR systems that anticipate user behaviors and adapt virtual environments accordingly.

Abstract: Virtual and augmented reality systems increasingly demand intelligent adaptation to user behaviors for enhanced interaction experiences. Achieving this requires accurately understanding human intentions and predicting future situated behaviors - such as gaze direction and object interactions - which is vital for creating responsive VR/AR environments and applications like personalized assistants. However, accurate behavioral prediction demands modeling the underlying cognitive processes that drive human-environment interactions. In this work, we introduce a hierarchical, intention-aware framework that models human intentions and predicts detailed situated behaviors by leveraging cognitive mechanisms. Given historical human dynamics and the observation of scene contexts, our framework first identifies potential interaction targets and forecasts fine-grained future behaviors. We propose a dynamic Graph Convolutional Network (GCN) to effectively capture human-environment relationships. Extensive experiments on challenging real-world benchmarks and live VR environment demonstrate the effectiveness of our approach, achieving superior performance across all metrics and enabling practical applications for proactive VR systems that anticipate user behaviors and adapt virtual environments accordingly.

[407] Uncovering Anomalous Events for Marine Environmental Monitoring via Visual Anomaly Detection

Laura Weihl, Nejc Novak, Stefan H. Bengtson, Malte Pedersen

Main category: cs.CV

TL;DR: AURA is the first multi-annotator benchmark for underwater visual anomaly detection, showing current models’ performance varies dramatically and is sensitive to training data and scene variability.

Details

Motivation: Manual inspection of underwater video footage is impractical due to vast volume of uneventful content, requiring automated methods to identify interesting/anomalous events for marine biodiversity monitoring.

Method: Introduces AURA benchmark dataset, evaluates four VAD models across two marine scenes, implements robust frame selection strategies, and compares against multiple human annotators using soft and consensus labels.

Result: VAD model performance varies dramatically and is highly sensitive to training data amount and visual content variability defining ’normal’ scenes.

Conclusion: The work demonstrates the value of soft/consensus labels and provides a practical approach for supporting scientific exploration and scalable biodiversity monitoring.

Abstract: Underwater video monitoring is a promising strategy for assessing marine biodiversity, but the vast volume of uneventful footage makes manual inspection highly impractical. In this work, we explore the use of visual anomaly detection (VAD) based on deep neural networks to automatically identify interesting or anomalous events. We introduce AURA, the first multi-annotator benchmark dataset for underwater VAD, and evaluate four VAD models across two marine scenes. We demonstrate the importance of robust frame selection strategies to extract meaningful video segments. Our comparison against multiple annotators reveals that VAD performance of current models varies dramatically and is highly sensitive to both the amount of training data and the variability in visual content that defines “normal” scenes. Our results highlight the value of soft and consensus labels and offer a practical approach for supporting scientific exploration and scalable biodiversity monitoring.

[408] Restricted Receptive Fields for Face Verification

Kagan Ozturk, Aman Bhatta, Haiyu Wu, Patrick Flynn, Kevin W. Bowyer

Main category: cs.CV

TL;DR: The paper proposes an inherently interpretable face similarity metric that breaks down global similarity into patch-level contributions, avoiding the need for post-hoc explanation methods.

Details

Motivation: Current post-hoc interpretability methods for deep neural networks have uncertain fidelity due to lack of reliable evaluation metrics, motivating the design of models with inherently interpretable decision processes.

Method: A face similarity metric that decomposes global similarity into sum of patch-level similarity scores from restricted receptive fields, providing locally additive explanations without post-hoc analysis.

Result: Achieves competitive verification performance with 28x28 patches in 112x112 face images, and surpasses state-of-the-art methods when using 56x56 patches.

Conclusion: The proposed inherently interpretable approach provides transparent face similarity analysis while maintaining competitive performance, offering an alternative to post-hoc explanation methods.

Abstract: Understanding how deep neural networks make decisions is crucial for analyzing their behavior and diagnosing failure cases. In computer vision, a common approach to improve interpretability is to assign importance to individual pixels using post-hoc methods. Although they are widely used to explain black-box models, their fidelity to the model’s actual reasoning is uncertain due to the lack of reliable evaluation metrics. This limitation motivates an alternative approach, which is to design models whose decision processes are inherently interpretable. To this end, we propose a face similarity metric that breaks down global similarity into contributions from restricted receptive fields. Our method defines the similarity between two face images as the sum of patch-level similarity scores, providing a locally additive explanation without relying on post-hoc analysis. We show that the proposed approach achieves competitive verification performance even with patches as small as 28x28 within 112x112 face images, and surpasses state-of-the-art methods when using 56x56 patches.

[409] EGD-YOLO: A Lightweight Multimodal Framework for Robust Drone-Bird Discrimination via Ghost-Enhanced YOLOv8n and EMA Attention under Adverse Condition

Sudipto Sarkar, Mohammad Asif Hasan, Khondokar Ashik Shahriar, Fablia Labiba, Nahian Tasnim, Sheikh Anawarul Haq Fattah

Main category: cs.CV

TL;DR: EGD-YOLOv8n is a lightweight object detection model that improves drone and bird identification using RGB and IR images from the VIP CUP 2025 dataset, achieving high accuracy and real-time performance.

Details

Motivation: Accurate identification of drones and birds is crucial for airspace safety and security system enhancement.

Method: Developed EGD-YOLOv8n with enhanced feature capture, attention layers, and a specialized detection head for handling objects of varying shapes and sizes. Trained three versions: RGB-only, IR-only, and combined RGB+IR.

Result: The combined RGB+IR model achieved the best accuracy and reliability while maintaining real-time performance on standard GPUs.

Conclusion: EGD-YOLOv8n provides an effective lightweight solution for drone and bird detection that balances accuracy with computational efficiency for practical deployment.

Abstract: Identifying drones and birds correctly is essential for keeping the skies safe and improving security systems. Using the VIP CUP 2025 dataset, which provides both RGB and infrared (IR) images, this study presents EGD-YOLOv8n, a new lightweight yet powerful model for object detection. The model improves how image features are captured and understood, making detection more accurate and efficient. It uses smart design changes and attention layers to focus on important details while reducing the amount of computation needed. A special detection head helps the model adapt to objects of different shapes and sizes. We trained three versions: one using RGB images, one using IR images, and one combining both. The combined model achieved the best accuracy and reliability while running fast enough for real-time use on common GPUs.

[410] Structured Spectral Graph Learning for Multi-label Abnormality Classification in 3D Chest CT Scans

Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel

Main category: cs.CV

TL;DR: A graph-based framework for multi-label classification of 3D Chest CT scans that represents CT volumes as structured graphs with axial slice triplets as nodes, enabling efficient modeling of inter-slice dependencies while maintaining clinical deployment compatibility.

Details

Motivation: Growing volume of CT examinations creates demand for automated tools to support radiologists. Existing 3D CNNs struggle with long-range dependencies, while Vision Transformers require extensive pre-training on domain-specific datasets.

Method: Proposes a 2.5D graph-based framework representing 3D CT volumes as structured graphs with axial slice triplets as nodes processed through spectral graph convolution, enabling reasoning over inter-slice dependencies.

Result: Achieves strong cross-dataset generalization across 3 independent institution datasets, shows competitive performance compared to state-of-the-art visual encoders, and demonstrates broader applicability through transfer experiments on radiology report generation and abdominal CT data.

Conclusion: The graph-based approach provides an effective alternative for multi-label CT classification that captures complex spatial relationships while maintaining practical deployment complexity, with demonstrated generalization across datasets and tasks.

Abstract: With the growing volume of CT examinations, there is an increasing demand for automated tools such as organ segmentation, abnormality detection, and report generation to support radiologists in managing their clinical workload. Multi-label classification of 3D Chest CT scans remains a critical yet challenging problem due to the complex spatial relationships inherent in volumetric data and the wide variability of abnormalities. Existing methods based on 3D convolutional neural networks struggle to capture long-range dependencies, while Vision Transformers often require extensive pre-training on large-scale, domain-specific datasets to perform competitively. In this work, we propose a 2.5D alternative by introducing a new graph-based framework that represents 3D CT volumes as structured graphs, where axial slice triplets serve as nodes processed through spectral graph convolution, enabling the model to reason over inter-slice dependencies while maintaining complexity compatible with clinical deployment. Our method, trained and evaluated on 3 datasets from independent institutions, achieves strong cross-dataset generalization, and shows competitive performance compared to state-of-the-art visual encoders. We further conduct comprehensive ablation studies to evaluate the impact of various aggregation strategies, edge-weighting schemes, and graph connectivity patterns. Additionally, we demonstrate the broader applicability of our approach through transfer experiments on automated radiology report generation and abdominal CT data.\ This work extends our previous contribution presented at the MICCAI 2025 EMERGE Workshop.

[411] DISC-GAN: Disentangling Style and Content for Cluster-Specific Synthetic Underwater Image Generation

Sneha Varur, Anirudh R Hanchinamani, Tarun S Bagewadi, Uma Mudenagudi, Chaitra D Desai, Sujata C, Padmashree Desai, Sumit Meharwade

Main category: cs.CV

TL;DR: DISC-GAN is a novel framework that integrates style-content disentanglement with cluster-specific training for photorealistic underwater image synthesis, addressing challenges like color attenuation and turbidity.

Details

Motivation: Underwater images suffer from optical phenomena like color attenuation and turbidity, with distinct stylistic variations across waterbodies. Generative models often fail to model non-uniform underwater conditions.

Method: Uses K-means clustering to partition dataset into style-specific domains, separate encoders for style and content latent spaces, AdaIN for integration, and cluster-specific training to preserve domain characteristics.

Result: Achieved state-of-the-art performance with SSIM of 0.9012, PSNR of 32.5118 dB, and FID of 13.3728.

Conclusion: DISC-GAN successfully addresses underwater image synthesis challenges through style-content disentanglement and cluster-specific training, demonstrating superior performance metrics.

Abstract: In this paper, we propose a novel framework, Disentangled Style-Content GAN (DISC-GAN), which integrates style-content disentanglement with a cluster-specific training strategy towards photorealistic underwater image synthesis. The quality of synthetic underwater images is challenged by optical due to phenomena such as color attenuation and turbidity. These phenomena are represented by distinct stylistic variations across different waterbodies, such as changes in tint and haze. While generative models are well-suited to capture complex patterns, they often lack the ability to model the non-uniform conditions of diverse underwater environments. To address these challenges, we employ K-means clustering to partition a dataset into style-specific domains. We use separate encoders to get latent spaces for style and content; we further integrate these latent representations via Adaptive Instance Normalization (AdaIN) and decode the result to produce the final synthetic image. The model is trained independently on each style cluster to preserve domain-specific characteristics. Our framework demonstrates state-of-the-art performance, obtaining a Structural Similarity Index (SSIM) of 0.9012, an average Peak Signal-to-Noise Ratio (PSNR) of 32.5118 dB, and a Frechet Inception Distance (FID) of 13.3728.

[412] ImHead: A Large-scale Implicit Morphable Model for Localized Head Modeling

Rolandos Alexandros Potamias, Stathis Galanakis, Jiankang Deng, Athanasios Papaioannou, Stefanos Zafeiriou

Main category: cs.CV

TL;DR: imHead is a novel implicit 3D morphable model that enables expressive 3D head avatar generation with localized facial feature editing, using a compact identity space and region-specific latent representation instead of expensive latent divisions.

Details

Motivation: Traditional 3DMMs struggle with complex full-head shapes due to strict topology and linear nature limitations. Previous methods used expensive latent space divisions for local editing, leading to large latent sizes.

Method: Proposed imHead uses deep implicit functions with a single compact identity space and introduces intermediate region-specific latent representation to enable local edits. Trained on a large-scale dataset of 4K distinct identities.

Result: The model demonstrates superior expressive power in representing diverse identities and expressions compared to previous approaches, while providing interpretable localized editing capabilities.

Conclusion: imHead represents a significant advancement in 3D head modeling by combining expressive avatar generation with efficient localized editing through a compact latent structure.

Abstract: Over the last years, 3D morphable models (3DMMs) have emerged as a state-of-the-art methodology for modeling and generating expressive 3D avatars. However, given their reliance on a strict topology, along with their linear nature, they struggle to represent complex full-head shapes. Following the advent of deep implicit functions, we propose imHead, a novel implicit 3DMM that not only models expressive 3D head avatars but also facilitates localized editing of the facial features. Previous methods directly divided the latent space into local components accompanied by an identity encoding to capture the global shape variations, leading to expensive latent sizes. In contrast, we retain a single compact identity space and introduce an intermediate region-specific latent representation to enable local edits. To train imHead, we curate a large-scale dataset of 4K distinct identities, making a step-towards large scale 3D head modeling. Under a series of experiments we demonstrate the expressive power of the proposed model to represent diverse identities and expressions outperforming previous approaches. Additionally, the proposed approach provides an interpretable solution for 3D face manipulation, allowing the user to make localized edits.

[413] Full segmentation annotations of 3D time-lapse microscopy images of MDA231 cells

Aleksandra Melnikova, Petr Matula

Main category: cs.CV

TL;DR: This paper presents the first publicly available full 3D time-lapse segmentation annotations of migrating cells with complex dynamic shapes, providing comprehensive dataset description and validation experiments.

Details

Motivation: High-quality segmentation annotations of volumetric images are critical for image processing but are time-consuming and challenging to create, especially for large numbers of targets with complex dynamic shapes.

Method: Three distinct humans annotated two sequences of MDA231 human breast carcinoma cells from the Cell Tracking Challenge, creating consistent 3D time-lapse segmentation annotations that were validated against CTC tracking markers and 2D gold truth.

Result: The created annotations are consistent with CTC tracking markers, segmentation accuracy is within inter-annotator variability margins, and the 3D annotations better represent input image complexity compared to automatically created silver truth from CTC.

Conclusion: The presented 3D cell segmentation annotations are valuable for testing and training cell segmentation algorithms, as well as analyzing 3D shapes of highly dynamic objects.

Abstract: High-quality, publicly available segmentation annotations of image and video datasets are critical for advancing the field of image processing. In particular, annotations of volumetric images of a large number of targets are time-consuming and challenging. In (Melnikova, A., & Matula, P., 2025), we presented the first publicly available full 3D time-lapse segmentation annotations of migrating cells with complex dynamic shapes. Concretely, three distinct humans annotated two sequences of MDA231 human breast carcinoma cells (Fluo-C3DL-MDA231) from the Cell Tracking Challenge (CTC). This paper aims to provide a comprehensive description of the dataset and accompanying experiments that were not included in (Melnikova, A., & Matula, P., 2025) due to limitations in publication space. Namely, we show that the created annotations are consistent with the previously published tracking markers provided by the CTC organizers and the segmentation accuracy measured based on the 2D gold truth of CTC is within the inter-annotator variability margins. We compared the created 3D annotations with automatically created silver truth provided by CTC. We have found the proposed annotations better represent the complexity of the input images. The presented annotations can be used for testing and training cell segmentation, or analyzing 3D shapes of highly dynamic objects.

[414] MSCloudCAM: Cross-Attention with Multi-Scale Context for Multispectral Cloud Segmentation

Md Abdullah Al Mazid, Liangdong Deng, Naphtali Rishe

Main category: cs.CV

TL;DR: MSCloudCAM is a novel cross-attention network for multispectral cloud segmentation that achieves state-of-the-art performance on Sentinel-2 and Landsat-8 datasets by combining Swin Transformer with multi-scale context modules and attention mechanisms.

Details

Motivation: Clouds in optical satellite imagery pose significant challenges for environmental monitoring, land cover mapping, and climate research, requiring robust cloud segmentation methods to enable reliable analysis.

Method: The framework uses Swin Transformer backbone for hierarchical feature extraction, multi-scale context modules (ASPP and PSP), cross-attention for multisensor/multispectral feature fusion, and attention blocks (ECAB and Spatial Attention) for feature refinement.

Result: MSCloudCAM achieves state-of-the-art segmentation accuracy on CloudSEN12 and L8Biome datasets, surpassing leading baseline architectures while maintaining competitive parameter efficiency and computational requirements.

Conclusion: The model demonstrates effectiveness and practicality for large-scale Earth observation tasks, making it well-suited for real-world applications in satellite imagery analysis.

Abstract: Clouds remain a critical challenge in optical satellite imagery, hindering reliable analysis for environmental monitoring, land cover mapping, and climate research. To overcome this, we propose MSCloudCAM, a Cross-Attention with Multi-Scale Context Network tailored for multispectral and multi-sensor cloud segmentation. Our framework exploits the spectral richness of Sentinel-2 (CloudSEN12) and Landsat-8 (L8Biome) data to classify four semantic categories: clear sky, thin cloud, thick cloud, and cloud shadow. MSCloudCAM combines a Swin Transformer backbone for hierarchical feature extraction with multi-scale context modules ASPP and PSP for enhanced scale-aware learning. A Cross-Attention block enables effective multisensor and multispectral feature fusion, while the integration of an Efficient Channel Attention Block (ECAB) and a Spatial Attention Module adaptively refine feature representations. Comprehensive experiments on CloudSEN12 and L8Biome demonstrate that MSCloudCAM delivers state-of-the-art segmentation accuracy, surpassing leading baseline architectures while maintaining competitive parameter efficiency and FLOPs. These results underscore the model’s effectiveness and practicality, making it well-suited for large-scale Earth observation tasks and real-world applications.

[415] From Detection to Mitigation: Addressing Bias in Deep Learning Models for Chest X-Ray Diagnosis

Clemence Mottez, Louisa Fay, Maya Varma, Sophie Ostmeier, Curtis Langlotz

Main category: cs.CV

TL;DR: A framework combining CNN with XGBoost improves fairness in chest X-ray diagnosis while maintaining performance, achieving competitive bias reduction at low computational cost.

Details

Motivation: Deep learning models risk perpetuating healthcare disparities when performance varies across demographic groups like sex, age, and race in chest X-ray diagnosis.

Method: Extend CNN-XGBoost pipeline for multi-label classification, replace CNN final layer with XGBoost classifier, validate with DenseNet-121 and ResNet-50 backbones, compare with adversarial training, reweighting, data augmentation, and active learning.

Result: Improves fairness across demographic subgroups while maintaining or improving overall predictive performance. XGBoost with active learning yields largest bias reduction on CheXpert and MIMIC datasets.

Conclusion: Provides practical and effective path toward equitable deep learning deployment in clinical radiology with model-agnostic design and computational efficiency.

Abstract: Deep learning models have shown promise in improving diagnostic accuracy from chest X-rays, but they also risk perpetuating healthcare disparities when performance varies across demographic groups. In this work, we present a comprehensive bias detection and mitigation framework targeting sex, age, and race-based disparities when performing diagnostic tasks with chest X-rays. We extend a recent CNN-XGBoost pipeline to support multi-label classification and evaluate its performance across four medical conditions. We show that replacing the final layer of CNN with an eXtreme Gradient Boosting classifier improves the fairness of the subgroup while maintaining or improving the overall predictive performance. To validate its generalizability, we apply the method to different backbones, namely DenseNet-121 and ResNet-50, and achieve similarly strong performance and fairness outcomes, confirming its model-agnostic design. We further compare this lightweight adapter training method with traditional full-model training bias mitigation techniques, including adversarial training, reweighting, data augmentation, and active learning, and find that our approach offers competitive or superior bias reduction at a fraction of the computational cost. Finally, we show that combining eXtreme Gradient Boosting retraining with active learning yields the largest reduction in bias across all demographic subgroups, both in and out of distribution on the CheXpert and MIMIC datasets, establishing a practical and effective path toward equitable deep learning deployment in clinical radiology.

[416] FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding

Soroush Mehraban, Andrea Iaboni, Babak Taati

Main category: cs.CV

TL;DR: The paper proposes two merging strategies (ECLM and Mask-ToMe) and a diffusion-based decoder to reduce computational cost in 3D Human Mesh Recovery while maintaining performance.

Details

Motivation: Transformer-based 3D HMR models suffer from high computational cost and complexity due to deep architectures and redundant tokens.

Method: Two merging strategies: Error-Constrained Layer Merging (ECLM) for selective layer merging based on MPJPE impact, and Mask-guided Token Merging (Mask-ToMe) for merging background tokens. Plus a diffusion-based decoder with temporal context and pose priors.

Result: Achieves up to 2.3x speed-up while slightly improving performance over baseline across multiple benchmarks.

Conclusion: The proposed method effectively reduces computational cost in 3D HMR while maintaining or slightly improving performance through strategic merging and diffusion-based decoding.

Abstract: Recent transformer-based models for 3D Human Mesh Recovery (HMR) have achieved strong performance but often suffer from high computational cost and complexity due to deep transformer architectures and redundant tokens. In this paper, we introduce two HMR-specific merging strategies: Error-Constrained Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe). ECLM selectively merges transformer layers that have minimal impact on the Mean Per Joint Position Error (MPJPE), while Mask-ToMe focuses on merging background tokens that contribute little to the final prediction. To further address the potential performance drop caused by merging, we propose a diffusion-based decoder that incorporates temporal context and leverages pose priors learned from large-scale motion capture datasets. Experiments across multiple benchmarks demonstrate that our method achieves up to 2.3x speed-up while slightly improving performance over the baseline.

[417] rareboost3d: a synthetic lidar dataset with enhanced rare classes

Shutong Lin, Zhengkang Xiang, Jianzhong Qi, Kourosh Khoshelham

Main category: cs.CV

TL;DR: RareBoost3D is a synthetic point cloud dataset that addresses the long-tail problem in LiDAR perception by providing more instances for rare classes, complemented by a cross-domain semantic alignment method (CSC loss) to improve segmentation performance.

Details

Motivation: Real-world point cloud datasets suffer from long-tail distribution problems where rare classes have limited instances, hindering the development of robust LiDAR-based perception systems for autonomous driving.

Method: Created synthetic dataset RareBoost3D to supplement rare classes, and proposed CSC loss method for cross-domain semantic alignment to align feature representations of same classes across synthetic and real-world domains.

Result: Experimental results show that the cross-domain semantic alignment significantly enhances LiDAR point cloud segmentation performance on real-world data.

Conclusion: The combination of synthetic data augmentation for rare classes and cross-domain feature alignment effectively addresses the long-tail problem in LiDAR point cloud segmentation.

Abstract: Real-world point cloud datasets have made significant contributions to the development of LiDAR-based perception technologies, such as object segmentation for autonomous driving. However, due to the limited number of instances in some rare classes, the long-tail problem remains a major challenge in existing datasets. To address this issue, we introduce a novel, synthetic point cloud dataset named RareBoost3D, which complements existing real-world datasets by providing significantly more instances for object classes that are rare in real-world datasets. To effectively leverage both synthetic and real-world data, we further propose a cross-domain semantic alignment method named CSC loss that aligns feature representations of the same class across different domains. Experimental results demonstrate that this alignment significantly enhances the performance of LiDAR point cloud segmentation models over real-world data.

[418] Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales

Zhaofang Qian, Hardy Chen, Zeyu Wang, Li Zhang, Zijun Wang, Xiaoke Huang, Hui Liu, Xianfeng Tang, Zeyu Zheng, Haoqin Tu, Cihang Xie, Yuyin Zhou

Main category: cs.CV

TL;DR: EarthWhere is a comprehensive benchmark for evaluating vision-language models’ image geolocation capabilities across country-level and street-level tasks, revealing performance gaps and regional biases.

Details

Motivation: To comprehensively evaluate VLMs' capacity for image-grounded geolocation in open-world conditions, which is challenging and in high demand but not thoroughly assessed.

Method: Created EarthWhere benchmark with 810 globally distributed images across two scales: WhereCountry (500 country-level MCQs) and WhereStreet (310 street-level tasks requiring multi-step reasoning). Used final-prediction metrics (Acc@k for coordinates, hierarchical path scores) and proposed intermediate reasoning evaluation with human-verified visual clues and Shapley-reweighted thinking scores.

Result: Gemini-2.5-Pro achieved best average accuracy at 56.32%, while strongest open-weight model GLM-4.5V reached 34.71%. Web search and reasoning didn’t guarantee improved performance with limited visual clues, and models showed regional biases (up to 42.7% higher scores in certain areas).

Conclusion: The findings highlight both the promise and persistent challenges of VLMs in mitigating bias and achieving robust fine-grained localization, emphasizing the need for continued improvement in geolocation capabilities.

Abstract: Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions, a task that is challenging and of demand in real life, has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use. EarthWhere comprises 810 globally distributed images across two complementary geolocation scales: WhereCountry (i.e., 500 multiple-choice question-answering, with country-level answer and panoramas) and WhereStreet (i.e., 310 fine-grained street-level identification tasks requiring multi-step reasoning with optional web search). For evaluation, we adopt the final-prediction metrics: location accuracies within k km (Acc@k) for coordinates and hierarchical path scores for textual localization. Beyond this, we propose to explicitly score intermediate reasoning chains using human-verified key visual clues and a Shapley-reweighted thinking score that attributes credit to each clue’s marginal contribution. We benchmark 13 state-of-the-art VLMs with web searching tools on our EarthWhere and report different types of final answer accuracies as well as the calibrated model thinking scores. Overall, Gemini-2.5-Pro achieves the best average accuracy at 56.32%, while the strongest open-weight model, GLM-4.5V, reaches 34.71%. We reveal that web search and reasoning do not guarantee improved performance when visual clues are limited, and models exhibit regional biases, achieving up to 42.7% higher scores in certain areas than others. These findings highlight not only the promise but also the persistent challenges of models to mitigate bias and achieve robust, fine-grained localization. We open-source our benchmark at https://github.com/UCSC-VLAA/EarthWhere.

[419] Topological Alignment of Shared Vision-Language Embedding Space

Junwon You, Dasol Kang, Jae-Hun Jung

Main category: cs.CV

TL;DR: ToMCLIP introduces topological alignment to improve multilingual CLIP models by preserving global geometry in embedding spaces, addressing English bias in contrastive VLMs.

Details

Motivation: Current multilingual VLMs have English bias due to limited multilingual data and focus only on instance-level alignment, neglecting the global structure of embedding spaces.

Method: Proposes topological alignment framework using persistent homology to define alignment loss, with graph sparsification for efficient persistence diagram approximation with theoretical error bounds.

Result: Shows enhanced structural coherence of multilingual representations, higher zero-shot accuracy on CIFAR-100, and stronger multilingual retrieval performance on xFlickr&CO.

Conclusion: ToMCLIP provides a general method for incorporating topological alignment into representation learning, extending beyond VLMs to improve multilingual multimodal understanding.

Abstract: Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have alleviated this gap but enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework aligning embedding spaces with topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates persistence diagram with theoretical error bounds using graph sparsification strategy. This work validates the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on the CIFAR-100, and stronger multilingual retrieval performance on the xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning.

[420] DreamMakeup: Face Makeup Customization using Latent Diffusion Models

Geon Yeong Park, Inhwa Han, Serin Yang, Yeobin Hong, Seongmin Jeong, Heechan Jeon, Myeongjin Goh, Sung Won Yi, Jin Nam, Jong Chul Ye

Main category: cs.CV

TL;DR: DreamMakeup is a training-free diffusion model for virtual makeup customization that overcomes GAN limitations by using DDIM inversion for facial structure preservation and enabling customization via reference images, RGB colors, and text descriptions.

Details

Motivation: Address training instability and limited customization in GAN-based virtual makeup simulation, leveraging diffusion models' superior controllability for precise real-image editing.

Method: Uses early-stopped DDIM inversion to preserve facial structure and identity, with customization through reference images, RGB colors, and textual descriptions as conditioning inputs.

Result: Demonstrates improved customization, color-matching, identity preservation, and compatibility with text/LLMs while maintaining affordable computational costs compared to existing methods.

Conclusion: DreamMakeup provides a superior training-free diffusion-based solution for virtual makeup customization that outperforms both GAN-based and recent diffusion-based frameworks.

Abstract: The exponential growth of the global makeup market has paralleled advancements in virtual makeup simulation technology. Despite the progress led by GANs, their application still encounters significant challenges, including training instability and limited customization capabilities. Addressing these challenges, we introduce DreamMakup - a novel training-free Diffusion model based Makeup Customization method, leveraging the inherent advantages of diffusion models for superior controllability and precise real-image editing. DreamMakeup employs early-stopped DDIM inversion to preserve the facial structure and identity while enabling extensive customization through various conditioning inputs such as reference images, specific RGB colors, and textual descriptions. Our model demonstrates notable improvements over existing GAN-based and recent diffusion-based frameworks - improved customization, color-matching capabilities, identity preservation and compatibility with textual descriptions or LLMs with affordable computational costs.

[421] FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model

Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng, Yuhui Yin

Main category: cs.CV

TL;DR: FG-CLIP 2 is a bilingual vision-language model that advances fine-grained alignment for English and Chinese using rich supervision and novel losses, achieving state-of-the-art performance across 29 datasets.

Details

Motivation: Current vision-language models like CLIP struggle with fine-grained details in object attributes, spatial relations, and linguistic expressions, particularly in non-English settings with limited bilingual comprehension capabilities.

Method: Leverages rich fine-grained supervision including region-text matching and long-caption modeling, with multiple discriminative objectives and a novel Textual Intra-modal Contrastive (TIC) loss to distinguish semantically similar captions. Trained on curated large-scale English and Chinese data.

Result: Achieves powerful bilingual performance, outperforming existing methods across 29 datasets in 8 tasks and establishing state-of-the-art results in both English and Chinese.

Conclusion: FG-CLIP 2 successfully addresses fine-grained vision-language alignment challenges in bilingual settings, with released model, code, and benchmark to support future research.

Abstract: Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.

[422] DKPMV: Dense Keypoints Fusion from Multi-View RGB Frames for 6D Pose Estimation of Textureless Objects

Jiahong Chen, Jinghao Wang, Zi Wang, Ziwen Wang, Banglei Guan, Qifeng Yu

Main category: cs.CV

TL;DR: DKPMV is a multi-view RGB-only 6D pose estimation pipeline that uses dense keypoint-level fusion and progressive pose optimization to outperform RGB-D methods on textureless objects.

Details

Motivation: Current multi-view methods either rely on depth data or insufficiently exploit multi-view geometric cues, limiting performance for 6D pose estimation of textureless objects in industrial robotics.

Method: Three-stage progressive pose optimization with dense keypoint-level fusion using only RGB images, enhanced with attentional aggregation and symmetry-aware training to improve accuracy and resolve symmetric object ambiguities.

Result: Extensive experiments on ROBI dataset show DKPMV outperforms state-of-the-art multi-view RGB approaches and surpasses RGB-D methods in most cases.

Conclusion: The proposed RGB-only pipeline achieves superior performance through dense keypoint fusion and geometric optimization, demonstrating effectiveness for textureless object pose estimation.

Abstract: 6D pose estimation of textureless objects is valuable for industrial robotic applications, yet remains challenging due to the frequent loss of depth information. Current multi-view methods either rely on depth data or insufficiently exploit multi-view geometric cues, limiting their performance. In this paper, we propose DKPMV, a pipeline that achieves dense keypoint-level fusion using only multi-view RGB images as input. We design a three-stage progressive pose optimization strategy that leverages dense multi-view keypoint geometry information. To enable effective dense keypoint fusion, we enhance the keypoint network with attentional aggregation and symmetry-aware training, improving prediction accuracy and resolving ambiguities on symmetric objects. Extensive experiments on the ROBI dataset demonstrate that DKPMV outperforms state-of-the-art multi-view RGB approaches and even surpasses the RGB-D methods in the majority of cases. The code will be available soon.

[423] Towards Distribution-Shift Uncertainty Estimation for Inverse Problems with Generative Priors

Namhoon Kim, Sara Fridovich-Keil

Main category: cs.CV

TL;DR: Proposes a calibration-free uncertainty indicator for detecting distribution shift in inverse problems using generative priors, based on reconstruction stability under random measurement variations.

Details

Motivation: Generative models improve reconstruction quality but risk hallucinating features when test images are out-of-distribution. Existing uncertainty methods require calibration data, provide heuristic estimates, or don't specifically address distribution shift.

Method: Uses reconstruction stability under random measurement variations as proxy for distribution shift detection. OOD images show higher variability in reconstructions compared to in-distribution images.

Result: Validated on MNIST digit reconstruction, where a model trained only on digit “0” showed higher reconstruction variability and error for OOD digits (1-9), confirming the indicator’s effectiveness.

Conclusion: Proposes pairing generative priors with lightweight guardrails to enable aggressive measurement reduction for in-distribution cases while automatically warning when applied to OOD data.

Abstract: Generative models have shown strong potential as data-driven priors for solving inverse problems such as reconstructing medical images from undersampled measurements. While these priors improve reconstruction quality with fewer measurements, they risk hallucinating features when test images lie outside the training distribution. Existing uncertainty quantification methods in this setting (i) require an in-distribution calibration dataset, which may not be available, (ii) provide heuristic rather than statistical estimates, or (iii) quantify uncertainty from model capacity or limited measurements rather than distribution shift. We propose an instance-level, calibration-free uncertainty indicator that is sensitive to distribution shift, requires no knowledge of the training distribution, and incurs no retraining cost. Our key hypothesis is that reconstructions of in-distribution images remain stable under random measurement variations, while reconstructions of out-of-distribution (OOD) images exhibit greater instability. We use this stability as a proxy for detecting distribution shift. Our proposed OOD indicator is efficiently computable for any computational imaging inverse problem; we demonstrate it on tomographic reconstruction of MNIST digits, where a learned proximal network trained only on digit “0” is evaluated on all ten digits. Reconstructions of OOD digits show higher variability and correspondingly higher reconstruction error, validating this indicator. These results suggest a deployment strategy that pairs generative priors with lightweight guardrails, enabling aggressive measurement reduction for in-distribution cases while automatically warning when priors are applied out of distribution.

[424] IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation

Zeteng Lin, Xingxing Li, Wen You, Xiaoyang Li, Zehan Lu, Yujun Cai, Jing Tang

Main category: cs.CV

TL;DR: IUT-Plug is a module that enhances vision language models by using Image Understanding Trees to reduce context drift in logic, object identity, and style during multimodal generation.

Details

Motivation: Existing VLMs struggle to preserve logic, object identity, and style in multimodal image-text generation, limiting their generalization in complex scenarios.

Method: Two-stage framework: (1) dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures, (2) coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency.

Result: IUT-Plug improves accuracy on established benchmarks and effectively alleviates the three critical forms of context drift across diverse multimodal QA scenarios.

Conclusion: The proposed IUT-Plug framework successfully mitigates context drift in VLMs through explicit structured reasoning, enhancing their multimodal generation capabilities.

Abstract: Existing vision language models (VLMs), including GPT-4 and DALL-E, often struggle to preserve logic, object identity, and style in multimodal image-text generation. This limitation significantly hinders the generalization capability of VLMs in complex image-text input-output scenarios. To address this issue, we propose IUT-Plug, a module grounded in an Image Understanding Tree (IUT), which enhances existing interleaved VLMs through explicit structured reasoning, thereby mitigating context drift in logic, entity identity, and style. The proposed framework operates in two stages. (1) A dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures. (2) A coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency. To evaluate our approach, we construct a novel benchmark based on 3,000 real human-generated question-answer pairs over fine-tuned large models, introducing a dynamic evaluation protocol for quantifying context drift in interleaved VLMs. Experimental results demonstrate that IUT-Plug not only improves accuracy on established benchmarks but also effectively alleviates the three critical forms of context drift across diverse multimodal question answering (QA) scenarios.

[425] Chart-RVR: Reinforcement Learning with Verifiable Rewards for Explainable Chart Reasoning

Sanchit Sinha, Oana Frunza, Kashif Rasul, Yuriy Nevmyvaka, Aidong Zhang

Main category: cs.CV

TL;DR: Chart-RVR is a framework that fine-tunes Large Vision-Language Models to improve robustness and explainability in chart reasoning using Group Relative Policy Optimization with verifiable rewards.

Details

Motivation: Current LVLMs struggle with out-of-distribution data and produce unreliable chain-of-thought rationales, limiting their explainability and trustworthiness in chart reasoning tasks.

Method: Uses Group Relative Policy Optimization with three verifiable rewards: chart-type classification accuracy, faithful chart table reconstruction, and process conformity to fine-tune 3B-parameter LVLMs.

Result: Chart-RVR consistently outperforms standard supervised fine-tuning on both in-distribution and out-of-distribution datasets, achieving state-of-the-art results on six chart-reasoning benchmarks and producing more interpretable rationales.

Conclusion: The framework demonstrates that verifiable rewards with GRPO can train reliable, interpretable chart-reasoning models that close the OOD performance gap while improving rationale fidelity.

Abstract: The capabilities of Large Vision-Language Models (LVLMs) have reached state-of-the-art on many visual reasoning tasks, including chart reasoning, yet they still falter on out-of-distribution (OOD) data, and degrade further when asked to produce their chain-of-thought (CoT) rationales, limiting explainability. We present Chart-RVR, a general framework that fine-tunes LVLMs to be more robust and explainable for chart reasoning by coupling Group Relative Policy Optimization (GRPO) with automatically verifiable rewards. Our framework comprises of three rewards that maximize: (i) correct chart-type classification, (ii) faithful chart table reconstruction, and (iii) process conformity. Applied to 3-billion-parameter LVLMs, Chart-RVR consistently outperforms standard supervised fine-tuning (SFT) on both in-distribution and out-of-distribution datasets, closing the OOD performance gap while improving rationale fidelity. The resulting models, the Chart-RVR-3B series, achieve state-of-the-art results on six chart-reasoning benchmarks spanning in-domain and OOD settings, surpassing all existing models of comparable size. Beyond accuracy, Chart-RVR yields more interpretable CoT rationales, strengthening trust and reliability - showcasing the power of verifiable rewards with GRPO for training reliable, interpretable chart-reasoning models.

[426] Mixup Helps Understanding Multimodal Video Better

Xiaoyu Ma, Ding Ding, Hao Chen

Main category: cs.CV

TL;DR: Proposes Multimodal Mixup (MM) and Balanced Multimodal Mixup (B-MM) to address overfitting of strong modalities in multimodal video understanding, with B-MM dynamically adjusting mixing ratios based on modality contributions.

Details

Motivation: Multimodal models tend to overfit strong modalities, which dominate learning and suppress weaker modalities, limiting the effectiveness of multimodal video understanding for tasks like action recognition and emotion classification.

Method: First introduces Multimodal Mixup (MM) applying Mixup at aggregated multimodal feature level to generate virtual feature-label pairs. Then proposes Balanced Multimodal Mixup (B-MM) that dynamically adjusts mixing ratios for each modality based on their relative contributions to learning.

Result: Extensive experiments on multiple datasets show both methods effectively improve generalization and multimodal robustness, with B-MM providing additional benefits by addressing modality imbalance.

Conclusion: The proposed MM and B-MM methods successfully mitigate overfitting of strong modalities and improve multimodal learning performance, with B-MM offering enhanced capability to handle modality imbalance during training.

Abstract: Multimodal video understanding plays a crucial role in tasks such as action recognition and emotion classification by combining information from different modalities. However, multimodal models are prone to overfitting strong modalities, which can dominate learning and suppress the contributions of weaker ones. To address this challenge, we first propose Multimodal Mixup (MM), which applies the Mixup strategy at the aggregated multimodal feature level to mitigate overfitting by generating virtual feature-label pairs. While MM effectively improves generalization, it treats all modalities uniformly and does not account for modality imbalance during training. Building on MM, we further introduce Balanced Multimodal Mixup (B-MM), which dynamically adjusts the mixing ratios for each modality based on their relative contributions to the learning objective. Extensive experiments on several datasets demonstrate the effectiveness of our methods in improving generalization and multimodal robustness.

[427] A Survey on Agentic Multimodal Large Language Models

Huanjin Yao, Ruifei Zhang, Jiaxing Huang, Jingyi Zhang, Yibo Wang, Bo Fang, Ruolin Zhu, Yongcheng Jing, Shunyu Liu, Guanbin Li, Dacheng Tao

Main category: cs.CV

TL;DR: This paper presents a comprehensive survey on Agentic Multimodal Large Language Models (Agentic MLLMs), exploring their conceptual foundations, framework dimensions, and providing resources for future research.

Details

Motivation: The motivation stems from the shift from traditional static AI agents to dynamic, proactive agentic AI systems and their potential trajectory toward AGI, driven by growing interest in this emerging field.

Method: The authors establish a conceptual framework organizing agentic MLLMs along three dimensions: agentic internal intelligence (reasoning, reflection, memory), agentic external tool invocation (proactive tool usage), and agentic environment interaction (action-taking in dynamic scenarios).

Result: The survey provides a systematic organization of agentic MLLM concepts, compiles open-source training frameworks and datasets, reviews downstream applications, and outlines future research directions with an actively maintained public repository.

Conclusion: Agentic MLLMs represent an emerging paradigm that combines multimodal capabilities with proactive, goal-directed behavior, creating a foundation for more advanced AI systems that can operate effectively in dynamic real-world environments.

Abstract: With the recent emergence of revolutionary autonomous agentic systems, research community is witnessing a significant shift from traditional static, passive, and domain-specific AI agents toward more dynamic, proactive, and generalizable agentic AI. Motivated by the growing interest in agentic AI and its potential trajectory toward AGI, we present a comprehensive survey on Agentic Multimodal Large Language Models (Agentic MLLMs). In this survey, we explore the emerging paradigm of agentic MLLMs, delineating their conceptual foundations and distinguishing characteristics from conventional MLLM-based agents. We establish a conceptual framework that organizes agentic MLLMs along three fundamental dimensions: (i) Agentic internal intelligence functions as the system’s commander, enabling accurate long-horizon planning through reasoning, reflection, and memory; (ii) Agentic external tool invocation, whereby models proactively use various external tools to extend their problem-solving capabilities beyond their intrinsic knowledge; and (iii) Agentic environment interaction further situates models within virtual or physical environments, allowing them to take actions, adapt strategies, and sustain goal-directed behavior in dynamic real-world scenarios. To further accelerate research in this area for the community, we compile open-source training frameworks, training and evaluation datasets for developing agentic MLLMs. Finally, we review the downstream applications of agentic MLLMs and outline future research directions for this rapidly evolving field. To continuously track developments in this rapidly evolving field, we will also actively update a public repository at https://github.com/HJYao00/Awesome-Agentic-MLLMs.

[428] Perspective-aware 3D Gaussian Inpainting with Multi-view Consistency

Yuxin Cheng, Binxiao Huang, Taiqiang Wu, Wenyong Zhou, Chenchen Ding, Zhengwu Liu, Graziano Chesi, Ngai Wong

Main category: cs.CV

TL;DR: PAInpainter improves 3D Gaussian inpainting by using perspective-aware content propagation and multi-view consistency verification to enhance global consistency and texture fidelity in restored 3D scenes.

Details

Motivation: Ensuring multi-view consistency remains a key challenge in 3D Gaussian inpainting despite progress with pretrained diffusion models, as it's essential for high-quality inpainting in virtual reality and multimedia applications.

Method: Iteratively refines inpainting and optimizes 3D Gaussian representation using multiple views adaptively sampled from a perspective graph, propagating inpainted images as prior information and verifying consistency across neighboring views.

Result: Achieves superior 3D inpainting quality with PSNR scores of 26.03 dB on SPIn-NeRF and 29.51 dB on NeRFiller datasets, outperforming existing methods.

Conclusion: PAInpainter demonstrates effectiveness and generalization capability in advancing 3D Gaussian inpainting through perspective-aware content propagation and multi-view consistency verification.

Abstract: 3D Gaussian inpainting, a critical technique for numerous applications in virtual reality and multimedia, has made significant progress with pretrained diffusion models. However, ensuring multi-view consistency, an essential requirement for high-quality inpainting, remains a key challenge. In this work, we present PAInpainter, a novel approach designed to advance 3D Gaussian inpainting by leveraging perspective-aware content propagation and consistency verification across multi-view inpainted images. Our method iteratively refines inpainting and optimizes the 3D Gaussian representation with multiple views adaptively sampled from a perspective graph. By propagating inpainted images as prior information and verifying consistency across neighboring views, PAInpainter substantially enhances global consistency and texture fidelity in restored 3D scenes. Extensive experiments demonstrate the superiority of PAInpainter over existing methods. Our approach achieves superior 3D inpainting quality, with PSNR scores of 26.03 dB and 29.51 dB on the SPIn-NeRF and NeRFiller datasets, respectively, highlighting its effectiveness and generalization capability.

[429] ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation

Ruihang Xu, Dewei Zhou, Fan Ma, Yi Yang

Main category: cs.CV

TL;DR: ContextGen is a novel Diffusion Transformer framework for multi-instance image generation that uses layout and reference images to achieve precise object positioning and identity preservation.

Details

Motivation: Current diffusion models struggle with multi-instance generation due to limitations in precise layout control and maintaining identity consistency of multiple distinct subjects.

Method: Proposes ContextGen with two key mechanisms: Contextual Layout Anchoring (CLA) to incorporate layout images for object positioning, and Identity Consistency Attention (ICA) to maintain identity using reference images. Also introduces IMIG-100K dataset with detailed annotations.

Result: ContextGen sets new state-of-the-art performance, outperforming existing methods in control precision, identity fidelity, and overall visual quality.

Conclusion: The proposed framework effectively addresses multi-instance generation challenges through layout anchoring and identity consistency mechanisms, demonstrating superior performance over existing approaches.

Abstract: Multi-instance image generation (MIG) remains a significant challenge for modern diffusion models due to key limitations in achieving precise control over object layout and preserving the identity of multiple distinct subjects. To address these limitations, we introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation that is guided by both layout and reference images. Our approach integrates two key technical contributions: a Contextual Layout Anchoring (CLA) mechanism that incorporates the composite layout image into the generation context to robustly anchor the objects in their desired positions, and Identity Consistency Attention (ICA), an innovative attention mechanism that leverages contextual reference images to ensure the identity consistency of multiple instances. Recognizing the lack of large-scale, hierarchically-structured datasets for this task, we introduce IMIG-100K, the first dataset with detailed layout and identity annotations. Extensive experiments demonstrate that ContextGen sets a new state-of-the-art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.

[430] Frequency Domain Unlocks New Perspectives for Abdominal Medical Image Segmentation

Kai Han, Siqi Ma, Chengxuan Qian, Jun Chen, Chongwen Lyu, Yuqing Song, Zhe Liu

Main category: cs.CV

TL;DR: The paper proposes FASS framework for medical image segmentation that addresses challenges in low-contrast tumor segmentation through foreground-aware modules, frequency enhancement, and edge constraints.

Details

Motivation: Foundation models struggle with foreground focus in complex, low-contrast medical images where malignant tumors closely resemble normal organs, making contextual differentiation difficult.

Method: Three main components: 1) Foreground-aware module to amplify background-target distinction, 2) Wavelet-based frequency enhancement for boundary recognition, 3) Edge constraint module for geometric continuity preservation.

Result: Extensive experiments show superior performance across all metrics, with particular strength in robustness under complex conditions and fine structure recognition.

Conclusion: FASS framework significantly enhances low-contrast image segmentation, enabling applications in diverse and complex medical imaging scenarios.

Abstract: Accurate segmentation of tumors and adjacent normal tissues in medical images is essential for surgical planning and tumor staging. Although foundation models generally perform well in segmentation tasks, they often struggle to focus on foreground areas in complex, low-contrast backgrounds, where some malignant tumors closely resemble normal organs, complicating contextual differentiation. To address these challenges, we propose the Foreground-Aware Spectrum Segmentation (FASS) framework. First, we introduce a foreground-aware module to amplify the distinction between background and the entire volume space, allowing the model to concentrate more effectively on target areas. Next, a feature-level frequency enhancement module, based on wavelet transform, extracts discriminative high-frequency features to enhance boundary recognition and detail perception. Eventually, we introduce an edge constraint module to preserve geometric continuity in segmentation boundaries. Extensive experiments on multiple medical datasets demonstrate superior performance across all metrics, validating the effectiveness of our framework, particularly in robustness under complex conditions and fine structure recognition. Our framework significantly enhances segmentation of low-contrast images, paving the way for applications in more diverse and complex medical imaging scenarios.

[431] COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models

Sanchit Sinha, Guangzhi Xiong, Aidong Zhang

Main category: cs.CV

TL;DR: COCO-Tree improves vision-language models’ compositional reasoning by augmenting them with neurosymbolic concept trees learned from LLMs, boosting performance by 5-10% on compositionality benchmarks.

Details

Motivation: Modern VLMs struggle with compositional reasoning involving multiple objects, attributes, and relations. Existing approaches are either resource-intensive or lack interpretable reasoning processes.

Method: Augments VLM outputs with neurosymbolic concept trees learned from LLMs using a beam search-inspired reasoning process that provides interpretable rationales.

Result: Significantly improves compositional generalization by 5-10% across four benchmarks (Winoground, EqBench, ColorSwap, SugarCrepe) in seven different open-source VLMs.

Conclusion: COCO-Tree effectively enhances VLMs’ compositional reasoning capabilities while providing interpretable rationales, addressing key limitations in current vision-language models.

Abstract: Compositional reasoning remains a persistent weakness of modern vision language models (VLMs): they often falter when a task hinges on understanding how multiple objects, attributes, and relations interact within an image. Multiple research works have attempted to improve compositionality performance by creative tricks such as improving prompt structure, chain of thought reasoning, etc. A more recent line of work attempts to impart additional reasoning in VLMs using well-trained Large Language Models (LLMs), which are far superior in linguistic understanding than VLMs to compensate for the limited linguistic prowess of VLMs. However, these approaches are either resource-intensive or do not provide an interpretable reasoning process. In this paper, we present ‘COCO-Tree’ - a novel approach that augments VLM outputs with carefully designed neurosymbolic concept trees learned from LLMs to improve VLM’s linguistic reasoning. COCO-Tree’s beam search-inspired reasoning process boosts compositionality performance and provides a rationale behind VLM predictions. Empirical results on four compositionality benchmarks, Winoground, EqBench, ColorSwap, and SugarCrepe, in seven different open-source VLMs with varying sizes, demonstrate that COCO-Tree significantly improves compositional generalization by 5-10% over baselines.

[432] High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation

Runyang Feng, Hyung Jin Chang, Tze Ho Elden Tse, Boeun Kim, Yi Chang, Yixing Gao

Main category: cs.CV

TL;DR: A novel framework that extends Mamba models to separately learn global and local spatiotemporal representations for video-based human pose estimation, achieving better performance and computational efficiency than existing methods.

Details

Motivation: Current VHPE methods struggle to balance global dynamic contexts and local motion details within single modeling structures, and suffer from quadratic complexity when capturing global dependencies. Mamba models show potential for long-range modeling with linear complexity but are limited to 1D data.

Method: Proposes Global Spatiotemporal Mamba with 6D selective space-time scan and modulated scan merging for global representations, and Local Refinement Mamba with windowed space-time scan for local keypoint motion details.

Result: Extensive experiments on four benchmark datasets show the proposed model outperforms state-of-the-art VHPE approaches while achieving better computational trade-offs.

Conclusion: The framework successfully extends Mamba to handle high-resolution spatiotemporal representations for VHPE, effectively balancing global and local modeling with linear complexity.

Abstract: Modeling high-resolution spatiotemporal representations, including both global dynamic contexts (e.g., holistic human motion tendencies) and local motion details (e.g., high-frequency changes of keypoints), is essential for video-based human pose estimation (VHPE). Current state-of-the-art methods typically unify spatiotemporal learning within a single type of modeling structure (convolution or attention-based blocks), which inherently have difficulties in balancing global and local dynamic modeling and may bias the network to one of them, leading to suboptimal performance. Moreover, existing VHPE models suffer from quadratic complexity when capturing global dependencies, limiting their applicability especially for high-resolution sequences. Recently, the state space models (known as Mamba) have demonstrated significant potential in modeling long-range contexts with linear complexity; however, they are restricted to 1D sequential data. In this paper, we present a novel framework that extends Mamba from two aspects to separately learn global and local high-resolution spatiotemporal representations for VHPE. Specifically, we first propose a Global Spatiotemporal Mamba, which performs 6D selective space-time scan and spatial- and temporal-modulated scan merging to efficiently extract global representations from high-resolution sequences. We further introduce a windowed space-time scan-based Local Refinement Mamba to enhance the high-frequency details of localized keypoint motions. Extensive experiments on four benchmark datasets demonstrate that the proposed model outperforms state-of-the-art VHPE approaches while achieving better computational trade-offs.

Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen, Jing Zhang

Main category: cs.CV

TL;DR: The paper proposes a reinforcement learning framework to enhance LVLMs’ ability to generate textual descriptions of auxiliary lines in geometry problems, achieving competitive performance on auxiliary-line reasoning benchmarks.

Details

Motivation: Auxiliary lines are crucial for solving complex geometric problems but current LVLMs struggle with them. Image editing approaches lack geometric precision, so the authors focus on generating textual descriptions that better align with LVLMs' capabilities.

Method: Propose a reinforcement learning framework with cross-modal reward that evaluates alignment between generated auxiliary-line descriptions and ground-truth diagrams. Use GRPO-based RL for precise diagram-text alignment. Create GeoVLMath model and AuxSolidMath dataset with 3,018 geometry problems.

Result: GeoVLMath at 3B and 7B scales achieves competitive and often superior performance compared to strong open-source and proprietary LVLMs on auxiliary-line reasoning benchmarks.

Conclusion: The reinforcement learning approach with cross-modal rewards effectively enhances LVLMs’ auxiliary-line reasoning capabilities in geometry problems, demonstrating the value of focusing on textual descriptions rather than diagram editing.

Abstract: Auxiliary lines are essential for solving complex geometric problems but remain challenging for large vision-language models (LVLMs). Rather than editing diagrams to draw auxiliary lines, which current image editing models struggle to render with geometric precision, we generate textual descriptions of auxiliary-line constructions to better align with the representational strengths of LVLMs. To bridge the gap between textual descriptions and spatial structure, we propose a reinforcement learning framework that enhances diagram-text alignment. At the core of our approach is a cross-modal reward that evaluates how well the generated auxiliary-line description for an original diagram matches a ground-truth auxiliary-line diagram. Built on this reward, we present GeoVLMath, an open-source LVLM tailored to auxiliary-line reasoning in solid geometry. This fine-grained signal drives a GRPO-based RL stage, yielding precise diagram-text alignment. To support training, we develop a scalable data creation pipeline and construct AuxSolidMath, a dataset of 3,018 real-exam geometry problems with paired diagrams and aligned textual fields. At the 3B and 7B scales, GeoVLMath achieves competitive and often superior performance compared with strong open-source and proprietary LVLMs on auxiliary-line reasoning benchmarks.

[434] GIR-Bench: Versatile Benchmark for Generating Images with Reasoning

Hongxiang Li, Yaowei Li, Bin Lin, Yuwei Niu, Yuhang Yang, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Long Chen

Main category: cs.CV

TL;DR: GIR-Bench is a comprehensive benchmark that evaluates unified multimodal models across three perspectives: understanding-generation consistency, reasoning-centric text-to-image generation, and multi-step reasoning in editing.

Details

Motivation: The community lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks.

Method: GIR-Bench evaluates models across three complementary perspectives: 1) Understanding-generation consistency (GIR-Bench-UGC), 2) Reasoning-centric text-to-image generation (GIR-Bench-T2I), and 3) Multi-step reasoning in editing (GIR-Bench-Edit). Each subset has carefully designed task-specific evaluation pipelines tailored for each task.

Result: Extensive ablations show that unified models are more capable of reasoning-driven visual tasks but still exhibit a persistent gap between understanding and generation.

Conclusion: GIR-Bench provides a comprehensive framework for evaluating unified multimodal models, revealing that while these models show promise for advanced multimodal intelligence, there remains a significant gap between their understanding and generation capabilities that needs to be addressed.

Abstract: Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce \textbf{GIR-Bench}, a comprehensive benchmark that evaluates unified models across three complementary perspectives. Firstly, we investigate understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Secondly, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Thirdly, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design different task-specific evaluation pipelines tailored for each task. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive ablations over various unified models and generation-only systems have shown that: Although unified models are more capable of reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at \href{https://hkust-longgroup.github.io/GIR-Bench}{https://hkust-longgroup.github.io/GIR-Bench}.

[435] Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, Wengang Zhou, Yu Qiao, Jifeng Dai, Jiangmiao Pang, Gen Luo, Wenhai Wang, Yao Mu, Zhi Hou

Main category: cs.CV

TL;DR: Vlaser is a Vision-Language-Action model that bridges embodied reasoning with policy learning, achieving SOTA performance on embodied reasoning benchmarks and robot control tasks.

Details

Motivation: To address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning in embodied AI systems.

Method: Developed Vlaser - a foundational vision-language model integrating high-level reasoning with low-level control, built on the Vlaser-6M dataset. Systematically examined how different VLM initializations affect supervised VLA fine-tuning.

Result: Achieved state-of-the-art performance on embodied reasoning benchmarks (spatial reasoning, embodied grounding, embodied QA, task planning) and SOTA results on WidowX benchmark with competitive performance on Google Robot benchmark.

Conclusion: Successfully bridged embodied reasoning with VLA policy learning, providing insights into mitigating domain shift between pre-training data and embodied policy learning data.

Abstract: While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning capability, which is a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks - including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.

[436] Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts

Yanning Hou, Ke Xu, Junfa Li, Yanran Ruan, Jianfeng Qiu

Main category: cs.CV

TL;DR: A novel two-stage framework for zero-shot anomaly segmentation that combines CLIP’s anomaly localization with SAM’s boundary perception, using specialized modules to guide segmentation of anomalous regions rather than entire objects.

Details

Motivation: Foundation models show strong generalization for zero-shot anomaly segmentation, but effectively guiding them for downstream tasks remains challenging. The paper aims to leverage CLIP's anomaly detection and SAM's boundary capabilities for industrial anomaly detection.

Method: Two-stage framework: (1) Co-Feature Point Prompt Generation (PPG) module uses CLIP and SAM to generate positive/negative point prompts, guiding SAM to focus on anomalies rather than entire objects; (2) Cascaded Prompts for SAM (CPS) module employs hybrid prompts with SAM’s lightweight decoder to refine boundaries and reduce noise.

Result: Achieves state-of-the-art zero-shot anomaly segmentation across multiple datasets. On Visa dataset, outperforms SOTA by 10.3% in F1-max and 7.7% in AP metrics.

Conclusion: The proposed framework effectively combines CLIP and SAM capabilities for precise anomaly segmentation, demonstrating superior performance in zero-shot industrial anomaly detection tasks.

Abstract: Recently, the powerful generalization ability exhibited by foundation models has brought forth new solutions for zero-shot anomaly segmentation tasks. However, guiding these foundation models correctly to address downstream tasks remains a challenge. This paper proposes a novel two-stage framework, for zero-shot anomaly segmentation tasks in industrial anomaly detection. This framework excellently leverages the powerful anomaly localization capability of CLIP and the boundary perception ability of SAM.(1) To mitigate SAM’s inclination towards object segmentation, we propose the Co-Feature Point Prompt Generation (PPG) module. This module collaboratively utilizes CLIP and SAM to generate positive and negative point prompts, guiding SAM to focus on segmenting anomalous regions rather than the entire object. (2) To further optimize SAM’s segmentation results and mitigate rough boundaries and isolated noise, we introduce the Cascaded Prompts for SAM (CPS) module. This module employs hybrid prompts cascaded with a lightweight decoder of SAM, achieving precise segmentation of anomalous regions. Across multiple datasets, consistent experimental validation demonstrates that our approach achieves state-of-the-art zero-shot anomaly segmentation results. Particularly noteworthy is our performance on the Visa dataset, where we outperform the state-of-the-art methods by 10.3% and 7.7% in terms of {$F_1$-max} and AP metrics, respectively.

[437] Benchmarking Deep Learning Models for Laryngeal Cancer Staging Using the LaryngealCT Dataset

Nivea Roy, Son Tran, Atul Sajjanhar, K. Devaraja, Prakashini Koteshwara, Yong Xiang, Divya Rao

Main category: cs.CV

TL;DR: LaryngealCT is a curated benchmark of 1,029 CT scans for laryngeal cancer, providing standardized data and 3D DL model benchmarks for reproducible AI research in laryngeal oncology.

Details

Motivation: Laryngeal cancer imaging research lacks standardized datasets for reproducible deep learning model development, hindering AI-driven clinical decision support.

Method: Created LaryngealCT benchmark with 1,029 CT scans from TCIA, extracted uniform 1mm isotropic volumes using weakly supervised parameter search, and benchmarked 3D DL architectures (3D CNN, ResNet18,50,101, DenseNet121) on early vs. advanced and T4 vs. non-T4 classification tasks.

Result: 3D CNN achieved AUC-0.881 and F1-macro-0.821 for early vs. advanced classification; ResNet18 achieved AUC-0.892 and F1-macro-0.646 for T4 vs. non-T4 classification. 3D GradCAM analysis showed different attention patterns between T4 and non-T4 cases.

Conclusion: LaryngealCT provides open-source data, pretrained models, and explainability tools to establish a reproducible foundation for AI-driven laryngeal cancer research and clinical decision support.

Abstract: Laryngeal cancer imaging research lacks standardised datasets to enable reproducible deep learning (DL) model development. We present LaryngealCT, a curated benchmark of 1,029 computed tomography (CT) scans aggregated from six collections from The Cancer Imaging Archive (TCIA). Uniform 1 mm isotropic volumes of interest encompassing the larynx were extracted using a weakly supervised parameter search framework validated by clinical experts. 3D DL architectures (3D CNN, ResNet18,50,101, DenseNet121) were benchmarked on (i) early (Tis,T1,T2) vs. advanced (T3,T4) and (ii) T4 vs. non-T4 classification tasks. 3D CNN (AUC-0.881, F1-macro-0.821) and ResNet18 (AUC-0.892, F1-macro-0.646) respectively outperformed the other models in the two tasks. Model explainability assessed using 3D GradCAMs with thyroid cartilage overlays revealed greater peri-cartilage attention in non-T4 cases and focal activations in T4 predictions. Through open-source data, pretrained models, and integrated explainability tools, LaryngealCT offers a reproducible foundation for AI-driven research to support clinical decisions in laryngeal oncology.

[438] Zero-shot Face Editing via ID-Attribute Decoupled Inversion

Yang Hou, Minggu Wang, Jianjun Zhao

Main category: cs.CV

TL;DR: A zero-shot face editing method using ID-Attribute Decoupled Inversion that maintains identity and structural consistency while enabling precise facial attribute manipulation through text prompts.

Details

Motivation: Existing text-guided diffusion models struggle to maintain ID and structural consistency in real face editing tasks, limiting their practical application.

Method: Decompose face representation into ID and attribute features, using them as joint conditions to guide both inversion and reverse diffusion processes for independent control over ID and attributes.

Result: The method achieves strong ID preservation and structural consistency while enabling precise facial attribute manipulation, supporting complex multi-attribute editing tasks with only text prompts.

Conclusion: The proposed zero-shot face editing method is practical and effective, operating at DDIM inversion speed without requiring region-specific input.

Abstract: Recent advancements in text-guided diffusion models have shown promise for general image editing via inversion techniques, but often struggle to maintain ID and structural consistency in real face editing tasks. To address this limitation, we propose a zero-shot face editing method based on ID-Attribute Decoupled Inversion. Specifically, we decompose the face representation into ID and attribute features, using them as joint conditions to guide both the inversion and the reverse diffusion processes. This allows independent control over ID and attributes, ensuring strong ID preservation and structural consistency while enabling precise facial attribute manipulation. Our method supports a wide range of complex multi-attribute face editing tasks using only text prompts, without requiring region-specific input, and operates at a speed comparable to DDIM inversion. Comprehensive experiments demonstrate its practicality and effectiveness.

[439] LSVOS 2025 Challenge Report: Recent Advances in Complex Video Object Segmentation

Chang Liu, Henghui Ding, Kaining Ying, Lingyi Hong, Ning Xu, Linjie Yang, Yuchen Fan, Mingqi Gao, Jingkun Chen, Yunqi Miao, Gengshen Wu, Zhijin Qin, Jungong Han, Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Chang Soo Lim, Joonyoung Moon, Donghyeon Cho, Tingmin Li, Yixuan Li, Yang Yang, An Yan, Leilei Cao, Feng Lu, Ran Hong, Youhai Jiang, Fengjie Zhu, Yujie Xie, Hongyang Zhang, Zhihui Liu, Shihai Ruan, Quanzhu Niu, Dengxian Gong, Shihao Chen, Tao Zhang, Yikang Zhou, Haobo Yuan, Lu Qi, Xiangtai Li, Shunping Ji, Ran Hong, Feng Lu, Leilei Cao, An Yan, Alexey Nekrasov, Ali Athar, Daan de Geus, Alexander Hermans, Bastian Leibe

Main category: cs.CV

TL;DR: The 7th LSVOS Challenge at ICCV 2025 introduces a new Complex VOS track (MOSEv2) with more realistic and challenging scenarios, while maintaining traditional VOS and RVOS tracks to advance video object segmentation robustness.

Details

Motivation: To push video object segmentation beyond curated benchmarks by introducing more realistic challenges like dense small objects, disappear/reappear events, severe occlusions, and adverse conditions, aiming for better long-term consistency and generalization.

Method: The challenge features three tracks: Classic VOS, Referring VOS, and the new Complex VOS (MOSEv2) with increased difficulty. Standard J, F, and J&F metrics are used for VOS/RVOS, while MOSEv2 adopts J&Ḟ to better evaluate objects across scales and disappearance cases.

Result: The challenge highlights top-performing solutions and emerging trends including the growing role of LLM/MLLM components and memory-aware propagation in video segmentation systems.

Conclusion: The LSVOS Challenge aims to chart future directions for resilient, language-aware video segmentation in the wild by introducing more realistic evaluation scenarios and tracking emerging technological trends.

Abstract: This report presents an overview of the 7th Large-scale Video Object Segmentation (LSVOS) Challenge held in conjunction with ICCV 2025. Besides the two traditional tracks of LSVOS that jointly target robustness in realistic video scenarios: Classic VOS (VOS), and Referring VOS (RVOS), the 2025 edition features a newly introduced track, Complex VOS (MOSEv2). Building upon prior insights, MOSEv2 substantially increases difficulty, introducing more challenging but realistic scenarios including denser small objects, frequent disappear/reappear events, severe occlusions, adverse weather and lighting, etc., pushing long-term consistency and generalization beyond curated benchmarks. The challenge retains standard ${J}$, $F$, and ${J&F}$ metrics for VOS and RVOS, while MOSEv2 adopts ${J&\dot{F}}$ as the primary ranking metric to better evaluate objects across scales and disappearance cases. We summarize datasets and protocols, highlight top-performing solutions, and distill emerging trends, such as the growing role of LLM/MLLM components and memory-aware propagation, aiming to chart future directions for resilient, language-aware video segmentation in the wild.

[440] ROFI: A Deep Learning-Based Ophthalmic Sign-Preserving and Reversible Patient Face Anonymizer

Yuan Tian, Min Zhou, Yitong Chen, Fang Li, Lingzi Qi, Shuo Wang, Xieyang Xu, Yu Yu, Shiqiong Xu, Chaoyu Lei, Yankai Jiang, Rongzhao Zhang, Jia Tan, Li Wu, Hong Chen, Xiaowei Liu, Wei Lu, Lin Li, Huifang Zhou, Xuefei Song, Guangtao Zhai, Xianqun Fan

Main category: cs.CV

TL;DR: ROFI is a deep learning framework that anonymizes facial features in patient eye images while preserving disease diagnostic features, achieving high privacy protection and diagnostic accuracy.

Details

Motivation: Patient face images are convenient for eye disease evaluation but raise privacy concerns, requiring a solution that protects facial identity while maintaining diagnostic utility.

Method: Uses weakly supervised learning and neural identity translation to anonymize facial features while retaining disease features, working with AI systems and supporting secure image reversal.

Result: Achieves 100% diagnostic sensitivity, over 98% accuracy (κ > 0.90) across 11 eye diseases in 3 cohorts, anonymizes over 95% of images, maintains original diagnoses (κ > 0.80), and supports secure reversal with over 98% similarity.

Conclusion: ROFI effectively protects patient privacy in digital medicine while preserving diagnostic capabilities, enabling secure medical image sharing and long-term care.

Abstract: Patient face images provide a convenient mean for evaluating eye diseases, while also raising privacy concerns. Here, we introduce ROFI, a deep learning-based privacy protection framework for ophthalmology. Using weakly supervised learning and neural identity translation, ROFI anonymizes facial features while retaining disease features (over 98% accuracy, $\kappa > 0.90$). It achieves 100% diagnostic sensitivity and high agreement ($\kappa > 0.90$) across eleven eye diseases in three cohorts, anonymizing over 95% of images. ROFI works with AI systems, maintaining original diagnoses ($\kappa > 0.80$), and supports secure image reversal (over 98% similarity), enabling audits and long-term care. These results show ROFI’s effectiveness of protecting patient privacy in the digital medicine era.

[441] Source-Free Object Detection with Detection Transformer

Huizai Yao, Sicheng Zhao, Shuo Lu, Hui Chen, Yangyang Li, Guoping Liu, Tengfei Xing, Chenggang Yan, Jianhua Tao, Guiguang Ding

Main category: cs.CV

TL;DR: FRANCK is a novel Source-Free Object Detection framework specifically designed for DETR models, featuring query-centric feature enhancement with four key components: objectness score-based sample reweighting, contrastive learning with memory banks, uncertainty-weighted feature distillation, and improved self-training.

Details

Motivation: Most existing SFOD approaches are confined to conventional object detection models like Faster R-CNN or designed as general solutions without tailored adaptations for novel OD architectures, especially Detection Transformer (DETR).

Method: FRANCK comprises four components: (1) Objectness Score-based Sample Reweighting module for attention-based scoring and loss reweighting, (2) Contrastive Learning with Matching-based Memory Bank for multi-level feature integration, (3) Uncertainty-weighted Query-fused Feature Distillation for improved feature distillation, and (4) improved self-training pipeline with Dynamic Teacher Updating Interval.

Result: Extensive experiments on several widely used benchmarks demonstrate that FRANCK achieves state-of-the-art performance, highlighting its effectiveness and compatibility with DETR-based SFOD models.

Conclusion: FRANCK effectively adapts source-pre-trained DETR models to target domains with enhanced robustness and generalization, providing a specialized solution for DETR-based source-free object detection.

Abstract: Source-Free Object Detection (SFOD) enables knowledge transfer from a source domain to an unsupervised target domain for object detection without access to source data. Most existing SFOD approaches are either confined to conventional object detection (OD) models like Faster R-CNN or designed as general solutions without tailored adaptations for novel OD architectures, especially Detection Transformer (DETR). In this paper, we introduce Feature Reweighting ANd Contrastive Learning NetworK (FRANCK), a novel SFOD framework specifically designed to perform query-centric feature enhancement for DETRs. FRANCK comprises four key components: (1) an Objectness Score-based Sample Reweighting (OSSR) module that computes attention-based objectness scores on multi-scale encoder feature maps, reweighting the detection loss to emphasize less-recognized regions; (2) a Contrastive Learning with Matching-based Memory Bank (CMMB) module that integrates multi-level features into memory banks, enhancing class-wise contrastive learning; (3) an Uncertainty-weighted Query-fused Feature Distillation (UQFD) module that improves feature distillation through prediction quality reweighting and query feature fusion; and (4) an improved self-training pipeline with a Dynamic Teacher Updating Interval (DTUI) that optimizes pseudo-label quality. By leveraging these components, FRANCK effectively adapts a source-pre-trained DETR model to a target domain with enhanced robustness and generalization. Extensive experiments on several widely used benchmarks demonstrate that our method achieves state-of-the-art performance, highlighting its effectiveness and compatibility with DETR-based SFOD models.

[442] Text-Enhanced Panoptic Symbol Spotting in CAD Drawings

Xianlin Liu, Yan Gong, Bohao Li, Jiajing Huang, Bowen Du, Junchen Ye, Liyan Xu

Main category: cs.CV

TL;DR: A panoptic symbol spotting framework for CAD drawings that incorporates textual annotations and models geometric-textual relationships using Transformer with type-aware attention.

Details

Motivation: Existing CAD symbol spotting methods overlook textual annotations and lack explicit modeling of relationships among primitives, leading to incomplete understanding of drawings.

Method: Joint modeling of geometric and textual primitives using unified representations, with Transformer backbone enhanced by type-aware attention mechanism to model spatial dependencies.

Result: Outperforms existing approaches on symbol spotting tasks involving textual annotations and shows superior robustness on complex CAD drawings.

Conclusion: Incorporating textual annotations and explicit relationship modeling significantly improves panoptic symbol spotting performance in CAD drawings.

Abstract: With the widespread adoption of Computer-Aided Design(CAD) drawings in engineering, architecture, and industrial design, the ability to accurately interpret and analyze these drawings has become increasingly critical. Among various subtasks, panoptic symbol spotting plays a vital role in enabling downstream applications such as CAD automation and design retrieval. Existing methods primarily focus on geometric primitives within the CAD drawings to address this task, but they face following major problems: they usually overlook the rich textual annotations present in CAD drawings and they lack explicit modeling of relationships among primitives, resulting in incomprehensive understanding of the holistic drawings. To fill this gap, we propose a panoptic symbol spotting framework that incorporates textual annotations. The framework constructs unified representations by jointly modeling geometric and textual primitives. Then, using visual features extract by pretrained CNN as the initial representations, a Transformer-based backbone is employed, enhanced with a type-aware attention mechanism to explicitly model the different types of spatial dependencies between various primitives. Extensive experiments on the real-world dataset demonstrate that the proposed method outperforms existing approaches on symbol spotting tasks involving textual annotations, and exhibits superior robustness when applied to complex CAD drawings.

[443] Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution

Bozhou Zhang, Nan Song, Jingyu Li, Xiatian Zhu, Jiankang Deng, Li Zhang

Main category: cs.CV

TL;DR: SeerDrive is an end-to-end autonomous driving framework that jointly models future scene evolution and trajectory planning in a closed-loop manner, outperforming state-of-the-art methods on benchmarks.

Details

Motivation: Traditional one-shot autonomous driving approaches underestimate scene dynamics and temporal evolution, limiting adaptive decision-making. The future trajectory is bidirectionally related to evolving environmental dynamics.

Method: Predicts future BEV representations to anticipate scene dynamics, then uses this foresight for trajectory planning. Features future-aware planning and iterative scene modeling with collaborative optimization.

Result: Significantly outperforms existing state-of-the-art methods on NAVSIM and nuScenes benchmarks.

Conclusion: Joint modeling of future scene evolution and trajectory planning in a closed-loop manner enables more informed and adaptive autonomous driving decisions.

Abstract: End-to-end autonomous driving methods aim to directly map raw sensor inputs to future driving actions such as planned trajectories, bypassing traditional modular pipelines. While these approaches have shown promise, they often operate under a one-shot paradigm that relies heavily on the current scene context, potentially underestimating the importance of scene dynamics and their temporal evolution. This limitation restricts the model’s ability to make informed and adaptive decisions in complex driving scenarios. We propose a new perspective: the future trajectory of an autonomous vehicle is closely intertwined with the evolving dynamics of its environment, and conversely, the vehicle’s own future states can influence how the surrounding scene unfolds. Motivated by this bidirectional relationship, we introduce SeerDrive, a novel end-to-end framework that jointly models future scene evolution and trajectory planning in a closed-loop manner. Our method first predicts future bird’s-eye view (BEV) representations to anticipate the dynamics of the surrounding scene, then leverages this foresight to generate future-context-aware trajectories. Two key components enable this: (1) future-aware planning, which injects predicted BEV features into the trajectory planner, and (2) iterative scene modeling and vehicle planning, which refines both future scene prediction and trajectory generation through collaborative optimization. Extensive experiments on the NAVSIM and nuScenes benchmarks show that SeerDrive significantly outperforms existing state-of-the-art methods.

Fengling Zhu, Boshi Liu, Jingyu Hua, Sheng Zhong

Main category: cs.CV

TL;DR: This paper proposes a supervised diffusion-based denoising framework to defend multimodal large language models against adversarial attacks on visual inputs, achieving higher quality reconstructions and improved robustness compared to existing methods.

Details

Motivation: MLLMs are vulnerable to adversarial attacks on visual inputs, and existing defense strategies like adversarial training and input purification have limitations in computational cost, image quality degradation, and generalization to complex multimodal tasks.

Method: A supervised diffusion-based denoising framework that fine-tunes diffusion models using paired adversarial-clean image datasets with directional, task-specific guidance, complemented by prompt optimization as an additional defense mechanism.

Result: Extensive experiments on image captioning and VQA show the method substantially improves robustness and exhibits strong transferability to unknown adversarial attacks while maintaining high-quality image reconstructions.

Conclusion: The supervised diffusion-based denoising approach effectively defends multimodal systems against adversarial threats, enabling more reliable and secure deployment of MLLMs in real-world applications.

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in tasks such as image captioning, visual question answering, and cross-modal reasoning by integrating visual and textual modalities. However, their multimodal nature also exposes them to adversarial threats, where attackers can perturb either modality or both jointly to induce harmful, misleading, or policy violating outputs. Existing defense strategies, such as adversarial training and input purification, face notable limitations: adversarial training typically improves robustness only against known attacks while incurring high computational costs, whereas conventional purification approaches often suffer from degraded image quality and insufficient generalization to complex multimodal tasks. In this work, we focus on defending the visual modality, which frequently serves as the primary entry point for adversarial manipulation. We propose a supervised diffusion based denoising framework that leverages paired adversarial clean image datasets to fine-tune diffusion models with directional, task specific guidance. Unlike prior unsupervised purification methods such as DiffPure, our approach achieves higher quality reconstructions while significantly improving defense robustness in multimodal tasks. Furthermore, we incorporate prompt optimization as a complementary defense mechanism, enhancing resistance against diverse and unseen attack strategies. Extensive experiments on image captioning and visual question answering demonstrate that our method not only substantially improves robustness but also exhibits strong transferability to unknown adversarial attacks. These results highlight the effectiveness of supervised diffusion based denoising for multimodal defense, paving the way for more reliable and secure deployment of MLLMs in real world applications.

[445] Compositional Zero-Shot Learning: A Survey

Ans Munir, Faisal Z. Qureshi, Mohsen Ali, Muhammad Haris Khan

Main category: cs.CV

TL;DR: This paper presents the first comprehensive survey on Compositional Zero-Shot Learning (CZSL), systematically reviewing state-of-the-art methods and introducing a taxonomy based on disentanglement approaches.

Details

Motivation: CZSL addresses the combinatorial challenge of recognizing unseen attribute-object combinations, which is crucial since visual appearances of primitives are highly contextual and differ significantly across compositions.

Method: The survey introduces a taxonomy grounded in disentanglement with four approach families: no explicit disentanglement, textual disentanglement, visual disentanglement, and cross-modal disentanglement. It provides detailed comparative analysis across different problem settings.

Result: The paper systematically reviews CZSL methods, highlighting their core advantages and limitations in various settings like closed-world and open-world CZSL, and identifies significant open challenges.

Conclusion: This survey serves as a foundational resource to guide and inspire further advancements in Compositional Zero-Shot Learning, with available code and papers on their GitHub repository.

Abstract: Compositional Zero-Shot Learning (CZSL) is a critical task in computer vision that enables models to recognize unseen combinations of known attributes and objects during inference, addressing the combinatorial challenge of requiring training data for every possible composition. This is particularly challenging because the visual appearance of primitives is highly contextual; for example, small'' cats appear visually distinct from older’’ ones, and wet'' cars differ significantly from wet’’ cats. Effectively modeling this contextuality and the inherent compositionality is crucial for robust compositional zero-shot recognition. This paper presents, to our knowledge, the first comprehensive survey specifically focused on Compositional Zero-Shot Learning. We systematically review the state-of-the-art CZSL methods, introducing a taxonomy grounded in disentanglement, with four families of approaches: no explicit disentanglement, textual disentanglement, visual disentanglement, and cross-modal disentanglement. We provide a detailed comparative analysis of these methods, highlighting their core advantages and limitations in different problem settings, such as closed-world and open-world CZSL. Finally, we identify the most significant open challenges and outline promising future research directions. This survey aims to serve as a foundational resource to guide and inspire further advancements in this fascinating and important field. Papers studied in this survey with their official code are available on our github: https://github.com/ans92/Compositional-Zero-Shot-Learning

[446] MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps

Jiahui Lei, Kyle Genova, George Kopanas, Noah Snavely, Leonidas Guibas

Main category: cs.CV

TL;DR: Learning 3D motion priors from real videos to predict future scene motion from single images using pixel-aligned Motion Maps (MoMaps) and diffusion models.

Details

Motivation: To address the challenge of learning semantically meaningful 3D motion priors from real-world videos for predicting future 3D scene motion from single input images.

Method: Proposed pixel-aligned Motion Map (MoMap) representation for 3D scene motion, created large-scale database from 50,000+ real videos, trained diffusion model on MoMaps, and developed pipeline for 2D video synthesis via warping and completion.

Result: Experimental results show the approach generates plausible and semantically consistent 3D scene motion.

Conclusion: The MoMap representation and diffusion-based approach effectively enable 3D motion prediction from single images and suggest a new pipeline for 2D video synthesis.

Abstract: This paper addresses the challenge of learning semantically and functionally meaningful 3D motion priors from real-world videos, in order to enable prediction of future 3D scene motion from a single input image. We propose a novel pixel-aligned Motion Map (MoMap) representation for 3D scene motion, which can be generated from existing generative image models to facilitate efficient and effective motion prediction. To learn meaningful distributions over motion, we create a large-scale database of MoMaps from over 50,000 real videos and train a diffusion model on these representations. Our motion generation not only synthesizes trajectories in 3D but also suggests a new pipeline for 2D video synthesis: first generate a MoMap, then warp an image accordingly and complete the warped point-based renderings. Experimental results demonstrate that our approach generates plausible and semantically consistent 3D scene motion.

[447] Multimodal Disease Progression Modeling via Spatiotemporal Disentanglement and Multiscale Alignment

Chen Liu, Wenfang Yao, Kejing Yin, William K. Cheung, Jing Qin

Main category: cs.CV

TL;DR: DiPro is a framework that disentangles static and dynamic features from sequential chest X-rays and aligns them with EHR data to model disease progression, achieving state-of-the-art performance.

Details

Motivation: Longitudinal multimodal data (EHR and CXRs) is underutilized due to redundancy in consecutive CXR sequences and temporal misalignment between sparse imaging and continuous EHR data.

Method: Uses region-aware disentanglement to separate static (anatomy) and dynamic (pathology progression) features in CXRs, then hierarchically aligns these features with EHR data via local and global synchronization.

Result: Extensive experiments on MIMIC dataset show DiPro effectively extracts temporal clinical dynamics and achieves state-of-the-art performance on disease progression identification and ICU prediction tasks.

Conclusion: DiPro successfully addresses challenges in multimodal longitudinal data by disentangling meaningful features and aligning them across different timescales for improved disease progression modeling.

Abstract: Longitudinal multimodal data, including electronic health records (EHR) and sequential chest X-rays (CXRs), is critical for modeling disease progression, yet remains underutilized due to two key challenges: (1) redundancy in consecutive CXR sequences, where static anatomical regions dominate over clinically-meaningful dynamics, and (2) temporal misalignment between sparse, irregular imaging and continuous EHR data. We introduce $\texttt{DiPro}$, a novel framework that addresses these challenges through region-aware disentanglement and multi-timescale alignment. First, we disentangle static (anatomy) and dynamic (pathology progression) features in sequential CXRs, prioritizing disease-relevant changes. Second, we hierarchically align these static and dynamic CXR features with asynchronous EHR data via local (pairwise interval-level) and global (full-sequence) synchronization to model coherent progression pathways. Extensive experiments on the MIMIC dataset demonstrate that $\texttt{DiPro}$ could effectively extract temporal clinical dynamics and achieve state-of-the-art performance on both disease progression identification and general ICU prediction tasks.

[448] Demystifying Numerosity in Diffusion Models – Limitations and Remedies

Yaqi Zhao, Xiaochen Wang, Li Dong, Wentao Zhang, Yuhui Yuan

Main category: cs.CV

TL;DR: Diffusion models struggle with counting objects accurately despite scaling up datasets and models. A new benchmark shows scaling alone doesn’t improve numerosity, and a count-aware layout injection method significantly boosts accuracy.

Details

Motivation: To investigate whether diffusion models can inherently generate correct object counts through dataset and model scaling, addressing the numerosity challenge in text-to-image generation.

Method: Created synthetic numerosity benchmark (GrayCount250 and NaturalCount6), analyzed scaling effects, identified noise initialization bias, and proposed count-aware layout injection into noise prior.

Result: Scaling hypothesis rejected - larger models/datasets didn’t improve counting accuracy. Proposed method improved GrayCount250 from 20.0% to 85.3% and NaturalCount6 from 74.8% to 86.3%.

Conclusion: Diffusion models rely more on noise initialization than prompt numerosity, but count-aware layout injection effectively addresses counting challenges and generalizes well.

Abstract: Numerosity remains a challenge for state-of-the-art text-to-image generation models like FLUX and GPT-4o, which often fail to accurately follow counting instructions in text prompts. In this paper, we aim to study a fundamental yet often overlooked question: Can diffusion models inherently generate the correct number of objects specified by a textual prompt simply by scaling up the dataset and model size? To enable rigorous and reproducible evaluation, we construct a clean synthetic numerosity benchmark comprising two complementary datasets: GrayCount250 for controlled scaling studies, and NaturalCount6 featuring complex naturalistic scenes. Second, we empirically show that the scaling hypothesis does not hold: larger models and datasets alone fail to improve counting accuracy on our benchmark. Our analysis identifies a key reason: diffusion models tend to rely heavily on the noise initialization rather than the explicit numerosity specified in the prompt. We observe that noise priors exhibit biases toward specific object counts. In addition, we propose an effective strategy for controlling numerosity by injecting count-aware layout information into the noise prior. Our method achieves significant gains, improving accuracy on GrayCount250 from 20.0% to 85.3% and on NaturalCount6 from 74.8% to 86.3%, demonstrating effective generalization across settings.

[449] video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory

Guangzhi Sun, Yixuan Li, Xiaodong Wu, Yudong Yang, Wei Li, Zejun Ma, Chao Zhang

Main category: cs.CV

TL;DR: video-SALMONN S is the first streaming audio-visual LLM that processes 3-hour videos at 1 FPS and 360p resolution under fixed memory, using test-time-training memory modules and selective memory retrieval to maintain high-quality understanding on long videos.

Details

Motivation: Current video-understanding LLMs struggle with continuous, high-frame-rate processing of long video streams. Offline methods require adapting frame rates, while streaming methods lose information by merging or discarding tokens.

Method: Introduces a test-time-training (TTT) memory module that continually updates token representations to capture long-range dependencies, replacing token merging. Uses a prompt-dependent memory reader that selectively retrieves context-relevant content from fixed-size memory, with TTT optimized using Hessian-free conjugate-gradient procedure.

Result: Achieves 74.2% overall and 67.8% on Video-MME long split, outperforming both offline and streaming baselines. Sustains high-quality understanding on multi-hour videos with 10k frames and 1M tokens.

Conclusion: video-SALMONN S successfully addresses the scalability challenge in video-understanding LLMs, enabling continuous processing of long video streams under fixed memory constraints while maintaining performance.

Abstract: Continuous, high-frame-rate, high-resolution processing of long video streams is critical for future AI agents, yet current video-understanding LLMs struggle to scale. Offline, fixed-frame-number methods require the stream length to adapt frame rates; streaming methods constrain memory by merging or discarding tokens, losing information. We propose video-SALMONN S, a streaming audio-visual LLM that, to our knowledge, is the first to process 3-hour videos at 1 FPS and 360p resolution under a fixed memory budget. Our model introduces (i) a test-time-training (TTT) memory module that continually updates token representations to capture long-range dependencies by replacing token merging, and (ii) a prompt-dependent memory reader that selectively retrieves context-relevant content from fixed-size memory. The TTT module is optimised with a Hessian-free conjugate-gradient procedure (TTT_HF) for efficient adaptation. On long-video benchmarks (Video-MME, LVBench, VideoEvalPro), video-SALMONN S sustains high-quality understanding on multi-hour videos with 10k frames and 1M tokens. Our 8B-parameter model achieves 74.2% overall and 67.8% on the Video-MME long split, outperforming both offline and streaming baselines.

[450] Validation of an Artificial Intelligence Tool for the Detection of Sperm DNA Fragmentation Using the TUNEL In Situ Hybridization Assay

Byron Alexander Jacobs, Aqeel Morris, Ifthakaar Shaik, Frando Lin

Main category: cs.CV

TL;DR: AI tool using phase contrast microscopy and ensemble learning achieves 60% sensitivity and 75% specificity in detecting sperm DNA fragmentation, providing non-destructive assessment for fertility applications.

Details

Motivation: Conventional semen analysis fails to evaluate sperm DNA fragmentation (SDF), which is critical for male fertility assessment. Current methods require destructive testing, limiting clinical utility.

Method: Developed a morphology-assisted ensemble AI model combining image processing with transformer-based machine learning (GC-ViT) to predict DNA fragmentation from phase contrast microscopy images, using TUNEL assay as gold standard.

Result: The proposed framework achieved 60% sensitivity and 75% specificity in detecting sperm DNA fragmentation, outperforming pure transformer vision models and morphology-only models.

Conclusion: This non-destructive AI methodology enables real-time sperm selection based on DNA integrity, representing a significant advancement for reproductive medicine diagnostics and therapeutic applications.

Abstract: Sperm DNA fragmentation (SDF) is a critical parameter in male fertility assessment that conventional semen analysis fails to evaluate. This study presents the validation of a novel artificial intelligence (AI) tool designed to detect SDF through digital analysis of phase contrast microscopy images, using the terminal deoxynucleotidyl transferase dUTP nick end labeling (TUNEL) assay as the gold standard reference. Utilising the established link between sperm morphology and DNA integrity, the present work proposes a morphology assisted ensemble AI model that combines image processing techniques with state-of-the-art transformer based machine learning models (GC-ViT) for the prediction of DNA fragmentation in sperm from phase contrast images. The ensemble model is benchmarked against a pure transformer vision' model as well as a morphology-only` model. Promising results show the proposed framework is able to achieve sensitivity of 60% and specificity of 75%. This non-destructive methodology represents a significant advancement in reproductive medicine by enabling real-time sperm selection based on DNA integrity for clinical diagnostic and therapeutic applications.

[451] Multiview Manifold Evidential Fusion for PolSAR Image Classification

Junfei Shi, Haojia Zhang, Haiyan Jin, Junhuai Li, Xiaogang Song, Yuanfan Guo, Haonan Su, Weisi Lin

Main category: cs.CV

TL;DR: Proposes MMEFnet for fusing PolSAR covariance matrices and multi-features using manifold learning and evidence theory, achieving more reliable classification with uncertainty quantification.

Details

Motivation: Traditional fusion methods ignore the different geometric structures of covariance matrices and multi-features, overlook view importance variations, and lack uncertainty quantification, leading to unreliable predictions.

Method: Represents covariance matrices on HPD manifold and multi-features on Grassmann manifold, uses kernel metric learning networks for manifold representations, employs trusted multiview evidence fusion with Dempster-Shafer theory for combining evidence.

Result: Extensive experiments on three real-world PolSAR datasets show consistent outperformance over existing methods in accuracy, robustness, and interpretability.

Conclusion: MMEFnet effectively integrates PolSAR manifold learning and evidence fusion, providing more reliable and interpretable classification with uncertainty quantification.

Abstract: Polarimetric Synthetic Aperture Radar (PolSAR) covariance matrices and their extracted multi-features - such as scattering angle, entropy, texture, and boundary descriptors - provide complementary and physically interpretable information for image classification. Traditional fusion strategies typically concatenate these features or employ deep learning networks to combine them. However, the covariance matrices and multi-features, as two complementary views, lie on different manifolds with distinct geometric structures. Existing fusion methods also overlook the varying importance of different views and ignore uncertainty, often leading to unreliable predictions. To address these issues, we propose a Multiview Manifold Evidential Fusion (MMEFnet) method to effectively fuse these two views. It gives a new framework to integrate PolSAR manifold learning and evidence fusion into a unified architecture. Specifically, covariance matrices are represented on the Hermitian Positive Definite (HPD) manifold, while multi-features are modeled on the Grassmann manifold. Two different kernel metric learning networks are constructed to learn their manifold representations. Subsequently, a trusted multiview evidence fusion, replacing the conventional softmax classifier, estimates belief mass and quantifies the uncertainty of each view from the learned deep features. Finally, a Dempster-Shafer theory-based fusion strategy combines evidence, enabling a more reliable and interpretable classification. Extensive experiments on three real-world PolSAR datasets demonstrate that the proposed method consistently outperforms existing approaches in accuracy, robustness, and interpretability.

Xiang Ma, Litian Xu, Lexin Fang, Caiming Zhang, Lizhen Cui

Main category: cs.CV

TL;DR: PICO is a novel framework that suppresses style interference in cross-modal alignment by quantifying semantic probability of feature columns and using prototype iterative construction to improve alignment performance.

Details

Motivation: Conventional cross-modal alignment methods assume embeddings contain only semantic information, ignoring non-semantic style variations that cause information bias and loss during alignment.

Method: PICO quantifies the probability of each feature column representing semantic information and uses it as weight during embedding interaction. It employs prototype iterative construction with performance feedback-based weighting to ensure reliable semantic probability.

Result: Extensive experiments show PICO outperforms state-of-the-art methods by 5.2%-14.1% across various benchmarks and model backbones.

Conclusion: PICO effectively suppresses style interference in cross-modal alignment by separating semantic from style information through probabilistic feature weighting and iterative prototype construction.

Abstract: Cross-modal alignment is an important multi-modal task, aiming to bridge the semantic gap between different modalities. The most reliable fundamention for achieving this objective lies in the semantic consistency between matched pairs. Conventional methods implicitly assume embeddings contain solely semantic information, ignoring the impact of non-semantic information during alignment, which inevitably leads to information bias or even loss. These non-semantic information primarily manifest as stylistic variations in the data, which we formally define as style information. An intuitive approach is to separate style from semantics, aligning only the semantic information. However, most existing methods distinguish them based on feature columns, which cannot represent the complex coupling relationship between semantic and style information. In this paper, we propose PICO, a novel framework for suppressing style interference during embedding interaction. Specifically, we quantify the probability of each feature column representing semantic information, and regard it as the weight during the embedding interaction. To ensure the reliability of the semantic probability, we propose a prototype iterative construction method. The key operation of this method is a performance feedback-based weighting function, and we have theoretically proven that the function can assign higher weight to prototypes that bring higher performance improvements. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of PICO, outperforming state-of-the-art methods by 5.2%-14.1%.

[453] G2L:From Giga-Scale to Cancer-Specific Large-Scale Pathology Foundation Models via Knowledge Distillation

Yesung Cho, Sungmin Lee, Geongyu Lee, Minkyung Lee, Jongbae Park, Dongmyung Shin

Main category: cs.CV

TL;DR: The G2L framework uses knowledge distillation to boost large-scale pathology models (15% of giga-scale parameters) to match giga-scale performance on cancer-specific tasks using only 1K slides per cancer type.

Details

Motivation: Giga-scale pathology models with billions of parameters are computationally prohibitive for practical use, creating a need for parameter-efficient alternatives that maintain high performance.

Method: Knowledge distillation framework that transfers capabilities from giga-scale models to large-scale models using only 1,000 pathology slides per target cancer type.

Result: Distilled models outperformed same-size state-of-the-art models, sometimes surpassed giga-scale teacher and huge-scale models, and showed higher robustness to multi-institutional image variations.

Conclusion: The distillation approach provides a data- and parameter-efficient way to achieve giga-scale performance for cancer-specific applications without prohibitive computational costs.

Abstract: Recent studies in pathology foundation models have shown that scaling training data, diversifying cancer types, and increasing model size consistently improve their performance. However, giga-scale foundation models, which are trained on hundreds of thousands of slides covering tens of cancer types and contain billions of parameters, pose significant challenges for practical use due to their tremendous computational costs in both development and deployment. In this work, we present a novel strategy, named the G2L framework, to increase the performance of large-scale foundation models, which consist of only $15%$ of the parameters of giga-scale models, to a comparable performance level of giga-scale models in cancer-specific tasks. Our approach applies knowledge distillation, transferring the capabilities of a giga-scale model to a large-scale model, using just 1K pathology slides of a target cancer (e.g., breast, prostate, etc.). The resulting distilled model not only outperformed state-of-the-art models of the same size (i.e., large-scale) across several benchmarks but also, interestingly, surpassed the giga-scale teacher and huge-scale models in some benchmarks. In addition, the distilled model exhibited a higher robustness index, indicating improved resilience to image variations originating from multiple institutions. These findings suggest that the proposed distillation approach for a large-scale model is a data- and parameter-efficient way to achieve giga-scale-level performance for cancer-specific applications without prohibitive computational burden.

[454] BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models

Bryan Chen Zhengyu Tan, Zheng Weihua, Zhengyuan Liu, Nancy F. Chen, Hwaran Lee, Kenny Tsu Wei Choo, Roy Ka-Wei Lee

Main category: cs.CV

TL;DR: BLEnD-Vis is a multimodal benchmark that evaluates cultural knowledge robustness in VLMs across 16 regions, revealing performance drops under linguistic rephrasing and low cross-modal consistency.

Details

Motivation: Current VLM evaluations focus on static recall or isolated visual grounding, leaving gaps in assessing robust and transferable cultural understanding across different regions and modalities.

Method: Built on BLEnD dataset, created 313 cultural question templates with three aligned MCQ formats: text-only baseline (Region→Entity), inverted text-only (Entity→Region), and VQA-style with generated images, totaling 4,916 images and 21,000+ MCQ instances.

Result: VLMs show significant fragility in cultural knowledge with performance drops under linguistic rephrasing. Visual cues help but low cross-modal consistency reveals challenges in integrating textual and visual understanding, especially for lower-resource regions.

Conclusion: BLEnD-Vis provides a crucial testbed for analyzing cultural robustness and multimodal grounding, exposing limitations and guiding development of more culturally competent VLMs.

Abstract: As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce BLEnD-Vis, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying from Region $\to$ Entity, (ii) an inverted text-only variant (Entity $\to$ Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation. BLEnD-Vis reveals significant fragility in current VLM cultural knowledge; models exhibit performance drops under linguistic rephrasing and, whilst visual cues often aid performance, low cross-modal consistency highlights challenges in robustly integrating textual and visual understanding, particularly for lower-resource regions. BLEnD-Vis thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs.

[455] Saudi Sign Language Translation Using T5

Ali Alhejab, Tomas Zelezny, Lamya Alkanhal, Ivan Gruber, Yazeed Alharbi, Jakub Straka, Vaclav Javorek, Marek Hruz, Badriah Alkalifah, Ahmed Ali

Main category: cs.CV

TL;DR: T5 models applied to Saudi Sign Language translation show that pre-training on American Sign Language data significantly improves performance (3x BLEU-4), demonstrating cross-linguistic transferability.

Details

Motivation: To address the challenges of Saudi Sign Language translation, particularly with unique characteristics like face coverings, and explore whether pre-training on larger ASL datasets can improve SSL translation performance.

Method: Used T5 models for SSL translation with a novel SSL dataset containing three testing protocols. Compared models pre-trained on YouTubeASL dataset versus models trained directly on SSL data.

Result: Pre-training on YouTubeASL significantly improved model performance by approximately 3 times in BLEU-4 score, showing effective cross-linguistic transfer from ASL to SSL.

Conclusion: Leveraging large-scale ASL data through pre-training is beneficial for improving SSL translation, highlighting the transferability of sign language models across different sign languages.

Abstract: This paper explores the application of T5 models for Saudi Sign Language (SSL) translation using a novel dataset. The SSL dataset includes three challenging testing protocols, enabling comprehensive evaluation across different scenarios. Additionally, it captures unique SSL characteristics, such as face coverings, which pose challenges for sign recognition and translation. In our experiments, we investigate the impact of pre-training on American Sign Language (ASL) data by comparing T5 models pre-trained on the YouTubeASL dataset with models trained directly on the SSL dataset. Experimental results demonstrate that pre-training on YouTubeASL significantly improves models’ performance (roughly $3\times$ in BLEU-4), indicating cross-linguistic transferability in sign language models. Our findings highlight the benefits of leveraging large-scale ASL data to improve SSL translation and provide insights into the development of more effective sign language translation systems. Our code is publicly available at our GitHub repository.

[456] FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models

Shengming Yuan, Xinyu Lyu, Shuailong Wang, Beitao Chen, Jingkuan Song, Lianli Gao

Main category: cs.CV

TL;DR: FlexAC is a training-free framework that enables flexible control over associative reasoning in MLLMs by modulating middle layer representations using hallucination-guided steering vectors, achieving improved creativity and reduced hallucinations.

Details

Motivation: MLLMs face a trade-off between faithfulness and creativity, but existing methods lack flexibility to modulate associative reasoning strength, limiting adaptability across factual and creative scenarios.

Method: FlexAC induces hallucination-guided intermediate representations, constructs associative steering vectors from high-association instances, adaptively calibrates their strengths, and incorporates task-specific vectors from target-domain samples for multi-dimensional associative control.

Result: Achieves up to 5.8x improvement in creativity on Creation-MMBench and 29% reduction in hallucination rate on CHAIR, surpassing existing baselines.

Conclusion: FlexAC effectively enables flexible control over associative reasoning in MLLMs, balancing creative guidance with output stability across different tasks.

Abstract: Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity, as different tasks require varying degrees of associative reasoning. However, existing methods lack the flexibility to modulate this reasoning strength, limiting MLLMs’ adaptability across factual and creative scenarios. To bridge this gap, we propose equipping MLLMs with mechanisms that enable flexible control over associative reasoning. We begin by investigating the internal mechanisms underlying associative behavior in MLLMs and find that: (1) middle layers play a pivotal role in shaping model’s associative tendencies, (2) modifying representations in these layers effectively regulates associative reasoning strength, and (3) hallucinations can be exploited to derive steering vectors that guide this modulation. Building on these findings, we introduce Flexible Association Control (FlexAC), a lightweight and training-free framework for modulating associative behavior in MLLMs. FlexAC first induces hallucination-guided intermediate representations to encode associative directions. Then, it selects high-association instances to construct effective associative steering vectors, whose strengths are adaptively calibrated to balance creative guidance with output stability. Finally, recognizing the multi-dimensional nature of associative reasoning, FlexAC incorporates task-specific associative vectors derived from a forward pass on a few target-domain samples, enabling models to follow diverse associative directions and better adapt to creative tasks. Notably, our method achieves up to a 5.8x improvement in creativity on Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing existing baselines and demonstrating its effectiveness in enabling flexible control over associative reasoning in MLLMs. Our code is available at https://github.com/ylhz/FlexAC.

[457] Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos

Rohit Gupta, Anirban Roy, Claire Christensen, Sujeong Kim, Sarah Gerard, Madeline Cincebeaux, Ajay Divakaran, Todd Grindal, Mubarak Shah

Main category: cs.CV

TL;DR: This paper presents a multimodal approach for detecting fine-grained educational content in online videos, focusing on literacy and math categories using supervised contrastive learning with class prototypes.

Details

Motivation: The growing consumption of online media by children requires tools to filter appropriate educational content for young learners, particularly for literacy and math education.

Method: Proposes a class prototypes-based supervised contrastive learning approach with multimodal transformer network to capture visual-audio interactions, treating it as fine-grained multilabel classification.

Result: The approach outperforms strong baselines on the APPROVE dataset (193 hours of expert-annotated videos with 19 classes) and other benchmarks like Youtube-8M and COIN.

Conclusion: The proposed multimodal contrastive learning method effectively detects fine-grained educational content in videos and the APPROVE dataset provides a valuable resource for educational content filtering.

Abstract: The recent growth in the consumption of online media by children during early childhood necessitates data-driven tools enabling educators to filter out appropriate educational content for young learners. This paper presents an approach for detecting educational content in online videos. We focus on two widely used educational content classes: literacy and math. For each class, we choose prominent codes (sub-classes) based on the Common Core Standards. For example, literacy codes include letter names', letter sounds’, and math codes include counting', sorting’. We pose this as a fine-grained multilabel classification problem as videos can contain multiple types of educational content and the content classes can get visually similar (e.g., letter names' vs letter sounds’). We propose a novel class prototypes based supervised contrastive learning approach that can handle fine-grained samples associated with multiple labels. We learn a class prototype for each class and a loss function is employed to minimize the distances between a class prototype and the samples from the class. Similarly, distances between a class prototype and the samples from other classes are maximized. As the alignment between visual and audio cues are crucial for effective comprehension, we consider a multimodal transformer network to capture the interaction between visual and audio cues in videos while learning the embedding for videos. For evaluation, we present a dataset, APPROVE, employing educational videos from YouTube labeled with fine-grained education classes by education researchers. APPROVE consists of 193 hours of expert-annotated videos with 19 classes. The proposed approach outperforms strong baselines on APPROVE and other benchmarks such as Youtube-8M, and COIN. The dataset is available at https://github.com/rohit-gupta/MMContrast/tree/main/APPROVE

[458] Investigating Identity Signals in Conversational Facial Dynamics via Disentangled Expression Features

Masoumeh Chapariniya, Pierre Vuillecard, Jean-Marc Odobez, Volker Dellwo, Teodora Vukovic

Main category: cs.CV

TL;DR: This paper demonstrates that individuals can be identified solely through their facial expression dynamics, independent of static facial appearance, achieving 61.14% accuracy on 1,429-way classification using conversational videos.

Details

Motivation: To investigate whether facial dynamics alone carry sufficient identity information, disentangled from static facial shape, for person identification.

Method: Used FLAME 3D morphable model to disentangle facial shape and expression dynamics, extracted frame-by-frame parameters from conversational videos, and trained a Conformer model with supervised contrastive learning on the CANDOR dataset.

Result: Achieved 61.14% accuracy on 1,429-way classification (458 times above chance), demonstrating strong identity signatures in facial dynamics. Introduced drift-to-noise ratio (DNR) that negatively correlates with recognition performance.

Conclusion: Facial dynamics contain person-specific signatures that enable identification independent of static appearance, with implications for social perception and clinical assessment.

Abstract: This work investigates whether individuals can be identified solely through the pure dynamical components of their facial expressions, independent of static facial appearance. We leverage the FLAME 3D morphable model to achieve explicit disentanglement between facial shape and expression dynamics, extracting frame-by-frame parameters from conversational videos while retaining only expression and jaw coefficients. On the CANDOR dataset of 1,429 speakers in naturalistic conversations, our Conformer model with supervised contrastive learning achieves 61.14%accuracy on 1,429-way classification – 458 times above chance – demonstrating that facial dynamics carry strong identity signatures. We introduce a drift-to-noise ratio (DNR) that quantifies the reliability of shape expression separation by measuring across-session shape changes relative to within-session variability. DNR strongly negatively correlates with recognition performance, confirming that unstable shape estimation compromises dynamic identification. Our findings reveal person-specific signatures in conversational facial dynamics, with implications for social perception and clinical assessment.

[459] DocReward: A Document Reward Model for Structuring and Stylizing

Junpeng Liu, Yuzhong Zhao, Bowen Cao, Jiayu Ding, Yilin Jia, Tengchao Lv, Yupan Huang, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Tao Ge, Xun Wang, Huitian Jiao, Sun Mao, FNU Kartik, Si-Qing Chen, Wai Lam, Furu Wei

Main category: cs.CV

TL;DR: DocReward is a document reward model that evaluates documents based on structure and style rather than just textual quality, trained on a multi-domain dataset to guide agentic workflows in producing more professional-looking documents.

Details

Motivation: Current agentic workflows for document generation focus primarily on textual quality while neglecting visual structure and style, which are crucial for readability and engagement. There's a gap in suitable reward models to guide workflows toward better structural and stylistic quality.

Method: Constructed DocPair dataset with 117K paired documents across 32 domains and 267 document types, each containing high- and low-professionalism versions with identical content but different structure/style. Trained DocReward using Bradley-Terry loss to score documents and penalize predictions that contradict human rankings.

Result: DocReward outperforms GPT-4o and GPT-5 in accuracy by 30.6 and 19.4 percentage points respectively. In document generation evaluation, DocReward achieves 60.8% win rate compared to GPT-5’s 37.7%, demonstrating superior performance in guiding generation agents.

Conclusion: DocReward effectively addresses the gap in evaluating document structure and style, significantly outperforming existing models and proving valuable for guiding agentic workflows to produce human-preferred professional documents.

Abstract: Recent advances in agentic workflows have enabled the automation of tasks such as professional document generation. However, they primarily focus on textual quality, neglecting visual structure and style, which are crucial for readability and engagement. This gap arises mainly from the absence of suitable reward models to guide agentic workflows toward producing documents with stronger structural and stylistic quality. To address this, we propose DocReward, a document reward model that evaluates documents based on their structure and style. We construct a multi-domain dataset DocPair of 117K paired documents, covering 32 domains and 267 document types, each including a high- and low-professionalism document with identical content but different structure and style. This enables the model to evaluate professionalism comprehensively, and in a textual-quality-agnostic way. DocReward is trained using the Bradley-Terry loss to score documents, penalizing predictions that contradict the annotated ranking. To assess the performance of reward models, we create a test dataset containing document bundles ranked by well-educated human evaluators. Notably, DocReward outperforms GPT-4o and GPT-5 in accuracy by 30.6 and 19.4 percentage points, respectively, demonstrating its superiority over baselines. In an extrinsic evaluation of document generation, DocReward achieves a significantly higher win rate of 60.8%, compared to GPT-5’s 37.7% win rate, demonstrating its utility in guiding generation agents toward producing human-preferred documents.

[460] LightPneumoNet: Lightweight Pneumonia Classifier

Neilansh Chauhan, Piyush Kumar Gupta, Faraz Doja

Main category: cs.CV

TL;DR: LightPneumoNet is a lightweight CNN for pneumonia detection from chest X-rays that achieves high accuracy (94.2%) and near-perfect recall (99%) with only 388K parameters and 1.48MB memory footprint, enabling deployment in resource-limited settings.

Details

Motivation: To address the challenge of deploying large, computationally expensive deep learning models for pneumonia diagnosis in resource-limited settings by creating an efficient, accessible diagnostic solution.

Method: Built a custom lightweight CNN architecture with four blocks of stacked convolutional layers, trained on 5,856 chest X-ray images with preprocessing (resizing to 224x224, grayscale conversion, normalization) and data augmentation (rotation, zoom, shear) to prevent overfitting.

Result: Achieved exceptional performance with overall accuracy of 0.942, precision of 0.92, F1-Score of 0.96, and critically high sensitivity (recall) of 0.99 on independent test set, outperforming heavier architectures on the same dataset.

Conclusion: LightPneumoNet provides an efficient solution for pneumonia detection that can be deployed on low-cost hardware, making advanced computer-aided diagnosis accessible in underserved clinics and serving as a reliable second-opinion tool to improve patient outcomes.

Abstract: Effective pneumonia diagnosis is often challenged by the difficulty of deploying large, computationally expensive deep learning models in resource-limited settings. This study introduces LightPneumoNet, an efficient, lightweight convolutional neural network (CNN) built from scratch to provide an accessible and accurate diagnostic solution for pneumonia detection from chest X-rays. Our model was trained on a public dataset of 5,856 chest X-ray images. Preprocessing included image resizing to 224x224, grayscale conversion, and pixel normalization, with data augmentation (rotation, zoom, shear) to prevent overfitting. The custom architecture features four blocks of stacked convolutional layers and contains only 388,082 trainable parameters, resulting in a minimal 1.48 MB memory footprint. On the independent test set, our model delivered exceptional performance, achieving an overall accuracy of 0.942, precision of 0.92, and an F1-Score of 0.96. Critically, it obtained a sensitivity (recall) of 0.99, demonstrating a near-perfect ability to identify true pneumonia cases and minimize clinically significant false negatives. Notably, LightPneumoNet achieves this high recall on the same dataset where existing approaches typically require significantly heavier architectures or fail to reach comparable sensitivity levels. The model’s efficiency enables deployment on low-cost hardware, making advanced computer-aided diagnosis accessible in underserved clinics and serving as a reliable second-opinion tool to improve patient outcomes.

[461] Nepali Sign Language Characters Recognition: Dataset Development and Deep Learning Approaches

Birat Poudel, Satyam Ghimire, Sijan Bhattarai, Saurav Bhandari, Suramya Sharma Dahal

Main category: cs.CV

TL;DR: This paper introduces the first benchmark dataset for Nepali Sign Language (NSL) with 36 gesture classes and 1,500 samples per class, achieving 90.45% accuracy using fine-tuned MobileNetV2.

Details

Motivation: Digital linguistic dataset resources for underrepresented sign languages like Nepali Sign Language remain scarce, creating a need for systematic research in this area.

Method: Created a benchmark NSL dataset with 36 gesture classes, then fine-tuned MobileNetV2 and ResNet50 architectures on this dataset for sign recognition.

Result: MobileNetV2 achieved 90.45% classification accuracy and ResNet50 achieved 88.78% accuracy on the NSL dataset.

Conclusion: Convolutional neural networks are effective for sign recognition in low-resource settings, and transfer learning/fine-tuning can advance research in underexplored sign languages.

Abstract: Sign languages serve as essential communication systems for individuals with hearing and speech impairments. However, digital linguistic dataset resources for underrepresented sign languages, such as Nepali Sign Language (NSL), remain scarce. This study introduces the first benchmark dataset for NSL, consisting of 36 gesture classes with 1,500 samples per class, designed to capture the structural and visual features of the language. To evaluate recognition performance, we fine-tuned MobileNetV2 and ResNet50 architectures on the dataset, achieving classification accuracies of 90.45% and 88.78%, respectively. These findings demonstrate the effectiveness of convolutional neural networks in sign recognition tasks, particularly within low-resource settings. To the best of our knowledge, this work represents the first systematic effort to construct a benchmark dataset and assess deep learning approaches for NSL recognition, highlighting the potential of transfer learning and fine-tuning for advancing research in underexplored sign languages.

[462] DTEA: Dynamic Topology Weaving and Instability-Driven Entropic Attenuation for Medical Image Segmentation

Weixuan Li, Quanjun Li, Guang Yu, Song Yang, Zimeng Li, Chi-Man Pun, Yupeng Liu, Xuhang Chen

Main category: cs.CV

TL;DR: The paper proposes DTEA, a medical image segmentation model with novel skip connections using Semantic Topology Reconfiguration (STR) and Entropic Perturbation Gating (EPG) modules to improve structural representation and contextual modeling.

Details

Motivation: Current medical image segmentation methods struggle with limited structural representation and insufficient contextual modeling, which affects generalization in complex clinical scenarios.

Method: DTEA model with Semantic Topology Reconfiguration (STR) module that reorganizes multi-scale semantic features into a dynamic hypergraph, and Entropic Perturbation Gating (EPG) module that assesses channel stability and filters high-entropy channels to improve spatial attention.

Result: Extensive experiments on three benchmark datasets show superior segmentation accuracy and better generalization across various clinical settings compared to existing methods.

Conclusion: The proposed DTEA framework with STR and EPG modules effectively enhances structural and semantic representation in medical image segmentation, achieving improved performance and generalization in clinical applications.

Abstract: In medical image segmentation, skip connections are used to merge global context and reduce the semantic gap between encoder and decoder. Current methods often struggle with limited structural representation and insufficient contextual modeling, affecting generalization in complex clinical scenarios. We propose the DTEA model, featuring a new skip connection framework with the Semantic Topology Reconfiguration (STR) and Entropic Perturbation Gating (EPG) modules. STR reorganizes multi-scale semantic features into a dynamic hypergraph to better model cross-resolution anatomical dependencies, enhancing structural and semantic representation. EPG assesses channel stability after perturbation and filters high-entropy channels to emphasize clinically important regions and improve spatial attention. Extensive experiments on three benchmark datasets show our framework achieves superior segmentation accuracy and better generalization across various clinical settings. The code is available at \href{https://github.com/LWX-Research/DTEA}{https://github.com/LWX-Research/DTEA}.

[463] A Large-Language-Model Assisted Automated Scale Bar Detection and Extraction Framework for Scanning Electron Microscopic Images

Yuxuan Chen, Ruotong Yang, Zhengyang Zhang, Mehreen Ahmed, Yanming Wang

Main category: cs.CV

TL;DR: An automated multi-modal framework for scale bar detection and extraction in SEM images using object detection, hybrid OCR, and LLM agent for verification and analysis.

Details

Motivation: Manual scale bar determination in SEM analysis is time-consuming and error-prone, requiring an automated solution to improve efficiency and accuracy.

Method: Four-phase framework: 1) Auto-DG for dataset generation, 2) scale bar object detection, 3) hybrid OCR with DenseNet and CRNN, 4) LLM agent for verification and analysis.

Result: Object detection achieved 100% precision, 95.8% recall, 99.2% mAP@0.5; OCR achieved 89% precision, 65% recall, 75% F1 score; outperforms mainstream OCR engines.

Conclusion: The automated LLM-powered framework significantly enhances efficiency and accuracy of scale bar detection in SEM images, advancing scientific imaging analysis.

Abstract: Microscopic characterizations, such as Scanning Electron Microscopy (SEM), are widely used in scientific research for visualizing and analyzing microstructures. Determining the scale bars is an important first step of accurate SEM analysis; however, currently, it mainly relies on manual operations, which is both time-consuming and prone to errors. To address this issue, we propose a multi-modal and automated scale bar detection and extraction framework that provides concurrent object detection, text detection and text recognition with a Large Language Model (LLM) agent. The proposed framework operates in four phases; i) Automatic Dataset Generation (Auto-DG) model to synthesize a diverse dataset of SEM images ensuring robust training and high generalizability of the model, ii) scale bar object detection, iii) information extraction using a hybrid Optical Character Recognition (OCR) system with DenseNet and Convolutional Recurrent Neural Network (CRNN) based algorithms, iv) an LLM agent to analyze and verify accuracy of the results. The proposed model demonstrates a strong performance in object detection and accurate localization with a precision of 100%, recall of 95.8%, and a mean Average Precision (mAP) of 99.2% at IoU=0.5 and 69.1% at IoU=0.5:0.95. The hybrid OCR system achieved 89% precision, 65% recall, and a 75% F1 score on the Auto-DG dataset, significantly outperforming several mainstream standalone engines, highlighting its reliability for scientific image analysis. The LLM is introduced as a reasoning engine as well as an intelligent assistant that suggests follow-up steps and verifies the results. This automated method powered by an LLM agent significantly enhances the efficiency and accuracy of scale bar detection and extraction in SEM images, providing a valuable tool for microscopic analysis and advancing the field of scientific imaging.

[464] Exploring and Leveraging Class Vectors for Classifier Editing

Jaeik Kim, Jaeyoung Do

Main category: cs.CV

TL;DR: The paper introduces Class Vectors, a method for flexible post-hoc editing of image classifiers by capturing class-specific representation adjustments in latent space, enabling efficient model adaptation without extensive retraining.

Details

Motivation: Existing classifier editing methods are limited - they either focus narrowly on error correction or require expensive retraining, creating a bottleneck for flexible editing in applications like medical imaging and manufacturing anomaly detection.

Method: Proposes Class Vectors that capture class-specific representation adjustments during fine-tuning. These vectors disentangle each class’s adaptation in latent space and can be used to steer latent features or update decision boundaries in weight space.

Result: Class Vectors effectively capture semantic shifts for each class, demonstrate linearity and orthogonality properties that enable efficient concept editing via class arithmetic, and show utility in applications like unlearning, environmental adaptation, and adversarial defense.

Conclusion: Class Vectors provide a flexible and efficient approach for post-hoc classifier editing, overcoming limitations of existing methods and enabling practical applications in various domains requiring model adaptation.

Abstract: Image classifiers play a critical role in detecting diseases in medical imaging and identifying anomalies in manufacturing processes. However, their predefined behaviors after extensive training make post hoc model editing difficult, especially when it comes to forgetting specific classes or adapting to distribution shifts. Existing classifier editing methods either focus narrowly on correcting errors or incur extensive retraining costs, creating a bottleneck for flexible editing. Moreover, such editing has seen limited investigation in image classification. To overcome these challenges, we introduce Class Vectors, which capture class-specific representation adjustments during fine-tuning. Whereas task vectors encode task-level changes in weight space, Class Vectors disentangle each class’s adaptation in the latent space. We show that Class Vectors capture each class’s semantic shift and that classifier editing can be achieved either by steering latent features along these vectors or by mapping them into weight space to update the decision boundaries. We also demonstrate that the inherent linearity and orthogonality of Class Vectors support efficient, flexible, and high-level concept editing via simple class arithmetic. Finally, we validate their utility in applications such as unlearning, environmental adaptation, adversarial defense, and adversarial trigger optimization.

[465] EEMS: Edge-Prompt Enhanced Medical Image Segmentation Based on Learnable Gating Mechanism

Han Xia, Quanjun Li, Qian Li, Zimeng Li, Hongbin Ye, Yupeng Liu, Haolun Li, Xuhang Chen

Main category: cs.CV

TL;DR: EEMS is a medical image segmentation model that combines edge-aware enhancement and multi-scale prompt generation to address challenges like ambiguous boundaries and background noise.

Details

Motivation: Medical image segmentation faces challenges from complex factors like ambiguous edges and background noise, which affect diagnosis and treatment planning accuracy.

Method: Combines Edge-Aware Enhancement Unit (EAEU) for multi-frequency edge feature extraction, Multi-scale Prompt Generation Unit (MSPGU) for semantic-spatial feature integration, and Dual-Source Adaptive Gated Fusion Unit (DAGFU) to merge edge and semantic features.

Result: Tests on ISIC2018 dataset confirm superior performance and reliability as a clinical tool.

Conclusion: EEMS provides enhanced segmentation accuracy and robustness for medical applications.

Abstract: Medical image segmentation is vital for diagnosis, treatment planning, and disease monitoring but is challenged by complex factors like ambiguous edges and background noise. We introduce EEMS, a new model for segmentation, combining an Edge-Aware Enhancement Unit (EAEU) and a Multi-scale Prompt Generation Unit (MSPGU). EAEU enhances edge perception via multi-frequency feature extraction, accurately defining boundaries. MSPGU integrates high-level semantic and low-level spatial features using a prompt-guided approach, ensuring precise target localization. The Dual-Source Adaptive Gated Fusion Unit (DAGFU) merges edge features from EAEU with semantic features from MSPGU, enhancing segmentation accuracy and robustness. Tests on datasets like ISIC2018 confirm EEMS’s superior performance and reliability as a clinical tool.

[466] Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering

Jian Lan, Zhicheng Liu, Udo Schlegel, Raoyuan Zhao, Yihong Liu, Hinrich Schütze, Michael A. Hedderich, Thomas Seidl

Main category: cs.CV

TL;DR: HaDola is a human uncertainty-aware framework that improves VLM training by selectively using high-uncertainty samples and automatic labeling, reducing annotation costs while maintaining or improving performance.

Details

Motivation: Standard supervised fine-tuning ignores human uncertainty in annotations, leading to suboptimal model performance and poor calibration. Current methods waste resources on high-uncertainty samples that degrade performance.

Method: Four-stage framework: discriminate (identify harmful samples), self-annotate (automatic labeling), error trigger (detect mistakes), and training. Uses iterative process with small seed set (5% of data) to prioritize informative samples.

Result: HaDola matches or outperforms state-of-the-art baselines on VQAv2 and VizWiz datasets with less training data. Models become more accurate and better calibrated while reducing reliance on costly human annotations.

Conclusion: Explicitly modeling human uncertainty in training is more effective than simply scaling dataset size. Better utilization of uncertainty distributions improves model performance and calibration while reducing annotation costs.

Abstract: Large vision-language models (VLMs) achieve strong performance in Visual Question Answering but still rely heavily on supervised fine-tuning (SFT) with massive labeled datasets, which is costly due to human annotations. Crucially, real-world datasets often exhibit human uncertainty (HU) – variation in human confidence across annotations – but standard SFT simply optimizes toward the most frequent label, disregarding HU distributions. This leaves two open questions: How does HU affect SFT, and how can HU be effectively leveraged in training? In this work, we first conduct a systematic evaluation of VLMs across varying HU levels. We have two key findings: (i) surprisingly, high-HU samples contribute little or even degrade model performance, and (ii) naively training on the full dataset yields under-calibrated models that fail to capture HU distributions. Motivated by these findings, we introduce HaDola, a human uncertainty-aware data selection and automatic labeling framework. HaDola operates in four stages – discriminate, self-annotate, error trigger, and training – to iteratively identify harmful samples, prioritize informative ones, and bootstrap from a small seed set (5% of data). Our approach substantially reduces reliance on costly HU annotations and makes VLMs more accurate and better calibrated. Extensive experiments on VQAv2 and VizWiz datasets demonstrate that HaDola consistently matches or outperforms state-of-the-art baselines with less training data. Our work highlights the importance of explicitly modeling HU in SFT, suggesting that better utilization of HU is more effective than merely scaling up dataset size.

[467] $Δ\mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization

Lin Zhu, Yifeng Yang, Xinbing Wang, Qinying Gu, Nanyang Ye

Main category: cs.CV

TL;DR: The paper introduces ΔEnergy, a novel OOD score that improves both OOD detection and generalization for vision-language models by leveraging energy changes during modality realignment.

Details

Motivation: VLMs encounter both ID and OOD data in real-world applications, including covariate shifts (known classes with style changes) and semantic shifts (unseen classes). Current methods need better generalization to covariate-shifted OOD data while effectively detecting semantic-shifted OOD classes.

Method: Proposed ΔEnergy score based on energy changes during vision-language modality realignment, with EBM (lower-bound maximization) to simultaneously improve OOD detection and generalization. Developed unified fine-tuning framework.

Result: Extensive experiments show superiority, outperforming recent approaches by 10% to 25% in AUROC on challenging OOD detection and generalization benchmarks.

Conclusion: ΔEnergy provides a unified solution that significantly enhances VLMs’ robustness in both OOD generalization and detection, with theoretical guarantees and strong empirical performance.

Abstract: Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both the in-distribution (ID) data and out-of-distribution (OOD) data. The OOD datasets often include both covariate shifts (e.g., known classes with changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of improving VLMs’ generalization ability to covariate-shifted OOD data, while effectively detecting open-set semantic-shifted OOD classes. In this paper, inspired by the substantial energy change observed in closed-set data when re-aligning vision-language modalities (specifically by directly reducing the maximum cosine similarity to a low value), we introduce a novel OOD score, named {\Delta}Energy. {\Delta}Energy significantly outperforms the vanilla energy-based OOD score and provides a more reliable approach for OOD detection. Furthermore, {\Delta}Energy can simultaneously improve OOD generalization under covariate shifts, which is achieved by lower-bound maximization for {\Delta}Energy (termed EBM). EBM is theoretically proven to not only enhance OOD detection but also yields a domain-consistent Hessian, which serves as a strong indicator for OOD generalization. Based on this finding, we developed a unified fine-tuning framework that allows for improving VLMs’ robustness in both OOD generalization and OOD detection. Extensive experiments on challenging OOD detection and generalization benchmarks demonstrate the superiority of our method, outperforming recent approaches by 10% to 25% in AUROC.

[468] When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models

Samer Al-Hamadani

Main category: cs.CV

TL;DR: This paper compares supervised YOLO detection (91.2% accuracy, $10,800 annotation cost) vs zero-shot Gemini VLM (68.5% accuracy, no annotation) across 1,200 images, establishing break-even thresholds and cost-effectiveness analysis for detection system selection.

Details

Motivation: To provide the first comprehensive cost-effectiveness analysis comparing traditional supervised object detection (requiring expensive manual annotations) with emerging zero-shot VLM approaches (eliminating annotation costs but with lower accuracy).

Method: Systematic evaluation on 1,000 stratified COCO images and 200 diverse product images, combined with detailed Total Cost of Ownership modeling to establish quantitative break-even thresholds for architecture selection.

Result: Supervised YOLO achieves 91.2% accuracy vs 68.5% for zero-shot Gemini on standard categories, but requires $10,800 annotation cost. Break-even occurs at 55M inferences (151k images/day for 1 year). Gemini shows 52.3% accuracy on diverse products where YOLO gets 0%. Cost per correct detection: Gemini $0.00050 vs YOLO $0.143 at 100k inferences.

Conclusion: Optimal detection architecture selection depends critically on deployment volume, category stability, budget constraints, and accuracy requirements rather than purely technical performance metrics, with clear break-even thresholds established for decision-making.

Abstract: Object detection systems have traditionally relied on supervised learning with manually annotated bounding boxes, achieving high accuracy at the cost of substantial annotation investment. The emergence of Vision-Language Models (VLMs) offers an alternative paradigm enabling zero-shot detection through natural language queries, eliminating annotation requirements but operating with reduced accuracy. This paper presents the first comprehensive cost-effectiveness analysis comparing supervised detection (YOLO) with zero-shot VLM inference (Gemini Flash 2.5). Through systematic evaluation on 1,000 stratified COCO images and 200 diverse product images spanning consumer electronics and rare categories, combined with detailed Total Cost of Ownership modeling, we establish quantitative break-even thresholds governing architecture selection. Our findings reveal that supervised YOLO achieves 91.2% accuracy versus 68.5% for zero-shot Gemini on standard categories, representing a 22.7 percentage point advantage that costs $10,800 in annotation for 100-category systems. However, this advantage justifies investment only beyond 55 million inferences, equivalent to 151,000 images daily for one year. Zero-shot Gemini demonstrates 52.3% accuracy on diverse product categories (ranging from highly web-prevalent consumer electronics at 75-85% to rare specialized equipment at 25-40%) where supervised YOLO achieves 0% due to architectural constraints preventing detection of untrained classes. Cost per Correct Detection analysis reveals substantially lower per-detection costs for Gemini ($0.00050 vs $0.143) at 100,000 inferences despite accuracy deficits. We develop decision frameworks demonstrating that optimal architecture selection depends critically on deployment volume, category stability, budget constraints, and accuracy requirements rather than purely technical performance metrics.

[469] sketch2symm: Symmetry-aware sketch-to-shape generation via semantic bridging

Yan Zhou, Mingji Li, Xiantao Zeng, Jie Lin, Yuexia Zhou

Main category: cs.CV

TL;DR: Sketch2Symm is a two-stage method for 3D reconstruction from sketches that uses sketch-to-image translation for semantic enrichment and symmetry constraints as geometric priors, achieving state-of-the-art performance.

Details

Motivation: Sketch-based 3D reconstruction is challenging due to the abstract and sparse nature of sketch inputs that lack sufficient semantic and geometric information.

Method: Two-stage generation method with semantic bridging via sketch-to-image translation to enrich sparse representations, and symmetry constraints as geometric priors to leverage structural regularity.

Result: Superior performance compared to existing methods on mainstream sketch datasets in terms of Chamfer Distance, Earth Mover’s Distance, and F-Score.

Conclusion: The proposed semantic bridging and symmetry-aware design effectively address the challenges of sketch-based 3D reconstruction.

Abstract: Sketch-based 3D reconstruction remains a challenging task due to the abstract and sparse nature of sketch inputs, which often lack sufficient semantic and geometric information. To address this, we propose Sketch2Symm, a two-stage generation method that produces geometrically consistent 3D shapes from sketches. Our approach introduces semantic bridging via sketch-to-image translation to enrich sparse sketch representations, and incorporates symmetry constraints as geometric priors to leverage the structural regularity commonly found in everyday objects. Experiments on mainstream sketch datasets demonstrate that our method achieves superior performance compared to existing sketch-based reconstruction methods in terms of Chamfer Distance, Earth Mover’s Distance, and F-Score, verifying the effectiveness of the proposed semantic bridging and symmetry-aware design.

[470] Evaluating the effects of preprocessing, method selection, and hyperparameter tuning on SAR-based flood mapping and water depth estimation

Jean-Paul Travert, Cédric Goeury, Sébastien Boyaval, Vito Bacchi, Fabrice Zaoui

Main category: cs.CV

TL;DR: This study evaluates SAR imagery processing methods for flood mapping and water depth estimation, showing that method choices at each step significantly impact results and advocating for ensemble approaches to account for methodological uncertainty.

Details

Motivation: To understand how different preprocessing, flood mapping, and water depth estimation methods from SAR imagery affect flood analysis results, and to quantify the uncertainty introduced by methodological choices.

Method: Used ensemble approach with various preprocessing (speckle filtering), flood mapping (supervised/unsupervised), and water depth estimation methods on SAR imagery from two Garonne River flood events, validated against hydrodynamic simulations and in-situ observations.

Result: Speckle filter choice altered flood extent by several km²; supervised flood mapping outperformed unsupervised but tuned unsupervised methods achieved comparable results; compounded uncertainty from preprocessing and mapping introduced high variability in water depth estimates.

Conclusion: Methodological choices significantly impact flood analysis results, with flood mapping method choice being most influential. Ensemble approaches should be adopted to account for uncertainty rather than relying on single configurations.

Abstract: Flood mapping and water depth estimation from Synthetic Aperture Radar (SAR) imagery are crucial for calibrating and validating hydraulic models. This study uses SAR imagery to evaluate various preprocessing (especially speckle noise reduction), flood mapping, and water depth estimation methods. The impact of the choice of method at different steps and its hyperparameters is studied by considering an ensemble of preprocessed images, flood maps, and water depth fields. The evaluation is conducted for two flood events on the Garonne River (France) in 2019 and 2021, using hydrodynamic simulations and in-situ observations as reference data. Results show that the choice of speckle filter alters flood extent estimations with variations of several square kilometers. Furthermore, the selection and tuning of flood mapping methods also affect performance. While supervised methods outperformed unsupervised ones, tuned unsupervised approaches (such as local thresholding or change detection) can achieve comparable results. The compounded uncertainty from preprocessing and flood mapping steps also introduces high variability in the water depth field estimates. This study highlights the importance of considering the entire processing pipeline, encompassing preprocessing, flood mapping, and water depth estimation methods and their associated hyperparameters. Rather than relying on a single configuration, adopting an ensemble approach and accounting for methodological uncertainty should be privileged. For flood mapping, the method choice has the most influence. For water depth estimation, the most influential processing step was the flood map input resulting from the flood mapping step and the hyperparameters of the methods.

[471] Uncertainty-Aware ControlNet: Bridging Domain Gaps with Synthetic Image Generation

Joshua Niemeijer, Jan Ehrhardt, Heinz Handels, Hristina Uzunova

Main category: cs.CV

TL;DR: A method to train ControlNets using unlabeled domain data by introducing uncertainty control, enabling creation of annotated synthetic data from target domains to improve segmentation performance without additional supervision.

Details

Motivation: ControlNets tend to reproduce original training distributions, limiting their effectiveness for data augmentation. There's a need to utilize unlabeled domain data to create synthetic annotated data that can bridge domain gaps in applications like retinal OCT segmentation.

Method: Introduces uncertainty into ControlNet’s control mechanism, combining uncertainty control from unlabeled datasets with semantic control from labeled datasets. This allows creation of annotated data with high uncertainty from target domains.

Result: The approach successfully synthesizes annotated images from Home-OCT domain, significantly improving segmentation results without additional supervision. Also demonstrated effectiveness in traffic scene experiments.

Conclusion: Uncertainty-guidance enables arbitrary domain shifts without strict style learning, outperforming style transfer methods and providing a flexible solution for domain adaptation in segmentation tasks.

Abstract: Generative Models are a valuable tool for the controlled creation of high-quality image data. Controlled diffusion models like the ControlNet have allowed the creation of labeled distributions. Such synthetic datasets can augment the original training distribution when discriminative models, like semantic segmentation, are trained. However, this augmentation effect is limited since ControlNets tend to reproduce the original training distribution. This work introduces a method to utilize data from unlabeled domains to train ControlNets by introducing the concept of uncertainty into the control mechanism. The uncertainty indicates that a given image was not part of the training distribution of a downstream task, e.g., segmentation. Thus, two types of control are engaged in the final network: an uncertainty control from an unlabeled dataset and a semantic control from the labeled dataset. The resulting ControlNet allows us to create annotated data with high uncertainty from the target domain, i.e., synthetic data from the unlabeled distribution with labels. In our scenario, we consider retinal OCTs, where typically high-quality Spectralis images are available with given ground truth segmentations, enabling the training of segmentation networks. The recent development in Home-OCT devices, however, yields retinal OCTs with lower quality and a large domain shift, such that out-of-the-pocket segmentation networks cannot be applied for this type of data. Synthesizing annotated images from the Home-OCT domain using the proposed approach closes this gap and leads to significantly improved segmentation results without adding any further supervision. The advantage of uncertainty-guidance becomes obvious when compared to style transfer: it enables arbitrary domain shifts without any strict learning of an image style. This is also demonstrated in a traffic scene experiment.

[472] REACT3D: Recovering Articulations for Interactive Physical 3D Scenes

Zhao Huang, Boyang Sun, Alexandros Delitzas, Jiaqi Chen, Marc Pollefeys

Main category: cs.CV

TL;DR: REACT3D is a zero-shot framework that converts static 3D scenes into interactive replicas with movable parts, joint estimation, and hidden-geometry completion for simulation use.

Details

Motivation: Existing 3D scene datasets are limited due to labor-intensive annotation of part segmentation, kinematic types, and motion trajectories needed for embodied intelligence applications.

Method: Four-stage framework: (i) openable-object detection and segmentation, (ii) articulation estimation for joint types and motion parameters, (iii) hidden-geometry completion and interactive object assembly, (iv) interactive scene integration in standard formats.

Result: Achieves state-of-the-art performance on detection/segmentation and articulation metrics across diverse indoor scenes, enabling scalable interactive scene generation.

Conclusion: Provides a practical foundation for scalable interactive scene generation, lowering the barrier to large-scale research on articulated scene understanding.

Abstract: Interactive 3D scenes are increasingly vital for embodied intelligence, yet existing datasets remain limited due to the labor-intensive process of annotating part segmentation, kinematic types, and motion trajectories. We present REACT3D, a scalable zero-shot framework that converts static 3D scenes into simulation-ready interactive replicas with consistent geometry, enabling direct use in diverse downstream tasks. Our contributions include: (i) openable-object detection and segmentation to extract candidate movable parts from static scenes, (ii) articulation estimation that infers joint types and motion parameters, (iii) hidden-geometry completion followed by interactive object assembly, and (iv) interactive scene integration in widely supported formats to ensure compatibility with standard simulation platforms. We achieve state-of-the-art performance on detection/segmentation and articulation metrics across diverse indoor scenes, demonstrating the effectiveness of our framework and providing a practical foundation for scalable interactive scene generation, thereby lowering the barrier to large-scale research on articulated scene understanding. Our project page is \textit{\hypersetup{urlcolor=black}\href{https://react3d.github.io/}{react3d.github.io}}.

[473] InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models

Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, Yanwen Guo, Wenhai Wang, Kai Chen, Yu Qiao, Hongjie Zhang

Main category: cs.CV

TL;DR: The paper presents InternSVG, a unified multimodal large language model for SVG understanding, editing, and generation, along with a comprehensive dataset (SAgoge) and benchmark (SArena) to address challenges in SVG modeling.

Details

Motivation: To overcome challenges in general SVG modeling including fragmented datasets, limited transferability across tasks, and difficulty handling structural complexity by leveraging MLLMs' transfer and generalization capabilities.

Method: Proposes InternSVG with SVG-specific special tokens, subword-based embedding initialization, and two-stage training progressing from short static SVGs to long-sequence illustrations and complex animations. Built on SAgoge dataset and SArena benchmark.

Result: InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts on SArena and prior benchmarks, demonstrating positive transfer and improved overall performance.

Conclusion: The unified MLLM approach with comprehensive data resources enables effective SVG understanding, editing, and generation, addressing key challenges in the field through integrated data-benchmark-model design.

Abstract: General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmark confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.

[474] MMAP: A Multi-Magnification and Prototype-Aware Architecture for Predicting Spatial Gene Expression

Hai Dang Nguyen, Nguyen Dang Huy Pham, The Minh Duc Nguyen, Dac Thai Nguyen, Hang Thi Nguyen, Duong M. Nguyen

Main category: cs.CV

TL;DR: MMAP is a novel framework that uses multi-magnification patch representations and prototype embeddings to predict spatial gene expression from H&E-stained whole-slide images, outperforming existing methods.

Details

Motivation: Predicting spatial gene expression from histological images is challenging due to the modality gap between visual features and molecular signals, and existing methods have limitations in local feature granularity and global spatial context coverage.

Method: MMAP uses multi-magnification patch representations to capture fine-grained histological details and learns latent prototype embeddings to represent slide-level global context information.

Result: MMAP consistently outperforms all existing state-of-the-art methods across multiple evaluation metrics including MAE, MSE, and Pearson Correlation Coefficient.

Conclusion: The proposed MMAP framework effectively addresses both local feature granularity and global spatial context challenges in predicting spatial gene expression from histological images.

Abstract: Spatial Transcriptomics (ST) enables the measurement of gene expression while preserving spatial information, offering critical insights into tissue architecture and disease pathology. Recent developments have explored the use of hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) to predict transcriptome-wide gene expression profiles through deep neural networks. This task is commonly framed as a regression problem, where each input corresponds to a localized image patch extracted from the WSI. However, predicting spatial gene expression from histological images remains a challenging problem due to the significant modality gap between visual features and molecular signals. Recent studies have attempted to incorporate both local and global information into predictive models. Nevertheless, existing methods still suffer from two key limitations: (1) insufficient granularity in local feature extraction, and (2) inadequate coverage of global spatial context. In this work, we propose a novel framework, MMAP (Multi-MAgnification and Prototype-enhanced architecture), that addresses both challenges simultaneously. To enhance local feature granularity, MMAP leverages multi-magnification patch representations that capture fine-grained histological details. To improve global contextual understanding, it learns a set of latent prototype embeddings that serve as compact representations of slide-level information. Extensive experimental results demonstrate that MMAP consistently outperforms all existing state-of-the-art methods across multiple evaluation metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), and Pearson Correlation Coefficient (PCC).

[475] Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment

Shijie Zhao, Xuanyu Zhang, Weiqi Li, Junlin Li, Li Zhang, Tianfan Xue, Jian Zhang

Main category: cs.CV

TL;DR: RALI uses contrastive learning to align images with generalizable text representations from RL-trained IQA models, achieving comparable generalization with 95% fewer parameters and inference time.

Details

Motivation: To understand why reasoning-based IQA models generalize well and address their high inference costs (energy and latency) that limit deployment.

Method: Proposed RALI algorithm using contrastive learning to directly align images with compact text representations learned by RL, eliminating reasoning processes and LLM loading.

Result: RALI achieves generalization performance comparable to reasoning-based models while requiring less than 5% of model parameters and inference time.

Conclusion: The generalization of reasoning-based IQA models comes from converting visual representations to compact text representations, and RALI successfully captures this capability without expensive reasoning processes.

Abstract: Reasoning-based image quality assessment (IQA) models trained through reinforcement learning (RL) exhibit exceptional generalization, yet the underlying mechanisms and critical factors driving this capability remain underexplored in current research. Moreover, despite their superior performance, these models incur inference energy usage and latency orders of magnitude higher than their earlier counterparts, restricting their deployment in specific scenarios. Through extensive experiments, this paper verifies and elaborates that through RL training, MLLMs leverage their reasoning capability to convert redundant visual representations into compact, cross-domain aligned text representations. This conversion is precisely the source of the generalization exhibited by these reasoning-based IQA models. Building on this fundamental insight, we propose a novel algorithm, RALI, which employs contrastive learning to directly align images with these generalizable text representations learned by RL. This approach eliminates the reliance on reasoning processes and even obviates the need to load an LLM. For the quality scoring task, this framework achieves generalization performance comparable to reasoning-based models while requiring less than 5% of their model parameters and inference time.

[476] MaterialRefGS: Reflective Gaussian Splatting with Multi-view Consistent Material Inference

Wenyuan Zhang, Jimin Tang, Weiqi Zhang, Yi Fang, Yu-Shen Liu, Zhizhong Han

Main category: cs.CV

TL;DR: A method for modeling reflections in 2D images using Gaussian Splatting with multi-view consistent material inference and physically-based environment modeling to achieve accurate reflections and photorealistic rendering.

Details

Motivation: Current approaches using Gaussian primitives for reflection modeling lack sufficient constraints, especially under limited environment modeling, leading to illumination aliasing and reduced generalization.

Method: Enforces 2D Gaussians to produce multi-view consistent material maps during deferred shading, tracks photometric variations across views to identify reflective regions, and introduces environment modeling through ray tracing with 2DGS for indirect illumination.

Result: Faithfully recovers both illumination and geometry, achieving state-of-the-art rendering quality in novel view synthesis on widely used benchmarks.

Conclusion: Multi-view consistent material inference with physically-based environment modeling is key to learning accurate reflections with Gaussian Splatting.

Abstract: Modeling reflections from 2D images is essential for photorealistic rendering and novel view synthesis. Recent approaches enhance Gaussian primitives with reflection-related material attributes to enable physically based rendering (PBR) with Gaussian Splatting. However, the material inference often lacks sufficient constraints, especially under limited environment modeling, resulting in illumination aliasing and reduced generalization. In this work, we revisit the problem from a multi-view perspective and show that multi-view consistent material inference with more physically-based environment modeling is key to learning accurate reflections with Gaussian Splatting. To this end, we enforce 2D Gaussians to produce multi-view consistent material maps during deferred shading. We also track photometric variations across views to identify highly reflective regions, which serve as strong priors for reflection strength terms. To handle indirect illumination caused by inter-object occlusions, we further introduce an environment modeling strategy through ray tracing with 2DGS, enabling photorealistic rendering of indirect radiance. Experiments on widely used benchmarks show that our method faithfully recovers both illumination and geometry, achieving state-of-the-art rendering quality in novel views synthesis.

[477] Robust Ego-Exo Correspondence with Long-Term Memory

Yijun Hu, Bing Fan, Xin Gu, Haiqing Ren, Dongfang Liu, Heng Fan, Libo Zhang

Main category: cs.CV

TL;DR: LM-EEC is a novel ego-exo correspondence framework based on SAM 2 that addresses challenges in object-level correspondence between egocentric and exocentric views through dual-memory architecture and adaptive feature routing.

Details

Motivation: Existing approaches for ego-exo correspondence suffer from extreme viewpoint variations, occlusions, small objects, ineffective feature fusion, and limited long-term memory capacity, especially for long videos.

Method: Proposes LM-EEC with: (1) Memory-View MoE module using dual-branch routing to adaptively assign weights to expert features along channel and spatial dimensions, (2) Dual-memory bank system with compression strategy to retain critical long-term information while eliminating redundancy.

Result: Achieves new state-of-the-art results on EgoExo4D benchmark, significantly outperforming existing methods and SAM 2 baseline, with strong generalization across diverse scenarios.

Conclusion: LM-EEC effectively addresses the limitations of SAM 2 for ego-exo correspondence through its dual-memory architecture and adaptive feature routing, demonstrating superior performance on challenging correspondence tasks.

Abstract: Establishing object-level correspondence between egocentric and exocentric views is essential for intelligent assistants to deliver precise and intuitive visual guidance. However, this task faces numerous challenges, including extreme viewpoint variations, occlusions, and the presence of small objects. Existing approaches usually borrow solutions from video object segmentation models, but still suffer from the aforementioned challenges. Recently, the Segment Anything Model 2 (SAM 2) has shown strong generalization capabilities and excellent performance in video object segmentation. Yet, when simply applied to the ego-exo correspondence (EEC) task, SAM 2 encounters severe difficulties due to ineffective ego-exo feature fusion and limited long-term memory capacity, especially for long videos. Addressing these problems, we propose a novel EEC framework based on SAM 2 with long-term memories by presenting a dual-memory architecture and an adaptive feature routing module inspired by Mixture-of-Experts (MoE). Compared to SAM 2, our approach features (i) a Memory-View MoE module which consists of a dual-branch routing mechanism to adaptively assign contribution weights to each expert feature along both channel and spatial dimensions, and (ii) a dual-memory bank system with a simple yet effective compression strategy to retain critical long-term information while eliminating redundancy. In the extensive experiments on the challenging EgoExo4D benchmark, our method, dubbed LM-EEC, achieves new state-of-the-art results and significantly outperforms existing methods and the SAM 2 baseline, showcasing its strong generalization across diverse scenarios. Our code and model are available at https://github.com/juneyeeHu/LM-EEC.

[478] Enhancing Maritime Domain Awareness on Inland Waterways: A YOLO-Based Fusion of Satellite and AIS for Vessel Characterization

Geoffery Agorku, Sarah Hernandez, Hayley Hames, Cade Wagner

Main category: cs.CV

TL;DR: A novel framework fusing satellite imagery with AIS data for inland waterway monitoring, using YOLO v11 for vessel detection and classification with high accuracy across multiple metrics.

Details

Motivation: Address limitations of AIS-based maritime monitoring by leveraging non-cooperative satellite imagery to identify dark vessels, validate cooperative traffic, and support advanced Maritime Domain Awareness.

Method: Fusion of high-resolution satellite imagery with AIS trajectory data using YOLO v11 object detection model for vessel detection, classification, and characterization (vessel type, barge cover, operational status, direction, barge count).

Result: High performance across metrics: vessel classification F1=95.8%, barge cover detection F1=91.6%, operational status F1=99.4%, directionality accuracy=93.8%, barge count MAE=2.4 barges. Spatial transferability maintained 98% accuracy.

Conclusion: Integration of non-cooperative satellite sensing with AIS fusion is viable for near-real-time fleet inventories, anomaly detection, and inland waterway surveillance. Future work will expand datasets and incorporate temporal tracking.

Abstract: Maritime Domain Awareness (MDA) for inland waterways remains challenged by cooperative system vulnerabilities. This paper presents a novel framework that fuses high-resolution satellite imagery with vessel trajectory data from the Automatic Identification System (AIS). This work addresses the limitations of AIS-based monitoring by leveraging non-cooperative satellite imagery and implementing a fusion approach that links visual detections with AIS data to identify dark vessels, validate cooperative traffic, and support advanced MDA. The You Only Look Once (YOLO) v11 object detection model is used to detect and characterize vessels and barges by vessel type, barge cover, operational status, barge count, and direction of travel. An annotated data set of 4,550 instances was developed from $5{,}973~\mathrm{mi}^2$ of Lower Mississippi River imagery. Evaluation on a held-out test set demonstrated vessel classification (tugboat, crane barge, bulk carrier, cargo ship, and hopper barge) with an F1 score of 95.8%; barge cover (covered or uncovered) detection yielded an F1 score of 91.6%; operational status (staged or in motion) classification reached an F1 score of 99.4%. Directionality (upstream, downstream) yielded 93.8% accuracy. The barge count estimation resulted in a mean absolute error (MAE) of 2.4 barges. Spatial transferability analysis across geographically disjoint river segments showed accuracy was maintained as high as 98%. These results underscore the viability of integrating non-cooperative satellite sensing with AIS fusion. This approach enables near-real-time fleet inventories, supports anomaly detection, and generates high-quality data for inland waterway surveillance. Future work will expand annotated datasets, incorporate temporal tracking, and explore multi-modal deep learning to further enhance operational scalability.

[479] AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model

Zhiwei Jin, Xiaohui Song, Nan Wang, Yafei Liu, Chao Li, Xin Li, Ruichen Wang, Zhihao Li, Qi Qi, Long Cheng, Dongze Hao, Quanlong Zheng, Yanhao Zhang, Haobo Ji, Jian Ma, Zhitong Zheng, Zhenyi Lin, Haolin Deng, Xin Zou, Xiaojie Yin, Ruilin Wang, Liankai Cai, Haijing Liu, Yuqing Qiu, Ke Chen, Zixian Li, Chi Xie, Huafei Li, Chenxing Li, Chuangchuang Wang, Kai Tang, Zhiguang Zhu, Kai Tang, Wenmei Gao, Rui Wang, Jun Wu, Chao Liu, Qin Xie, Chen Chen, Haonan Lu

Main category: cs.CV

TL;DR: AndesVL is a suite of mobile-friendly MLLMs with 0.6B to 4B parameters that achieves top-tier performance across various benchmarks while being suitable for edge devices.

Details

Motivation: Cloud-based MLLMs have huge model sizes that exceed the limitations of edge devices like mobile phones in terms of memory, power consumption, and computing capacity.

Method: Based on Qwen3’s LLM and various visual encoders, with comprehensive model architectures, training pipeline, and training data. Introduces a 1+N LoRA approach.

Result: Achieves first-tier performance across text-rich image understanding, reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual understanding, and GUI-related tasks compared to similar-scale models.

Conclusion: AndesVL provides effective mobile-side MLLMs that balance performance with edge device constraints.

Abstract: In recent years, while cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet have demonstrated outstanding performance with enormous model sizes reaching hundreds of billions of parameters, they significantly surpass the limitations in memory, power consumption, and computing capacity of edge devices such as mobile phones. This paper introduces AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3’s LLM and various visual encoders. We comprehensively outline the model architectures, training pipeline, and training data of AndesVL, which achieves first-tier performance across a wide range of open-source benchmarks, including fields such as text-rich image understanding, reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual understanding, and GUI-related tasks when compared with state-of-the-art models of a similar scale. Furthermore, we introduce a 1+N LoR

[480] Coupled Degradation Modeling and Fusion: A VLM-Guided Degradation-Coupled Network for Degradation-Aware Infrared and Visible Image Fusion

Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui

Main category: cs.CV

TL;DR: VGDCFusion is a novel infrared and visible image fusion method that tightly couples degradation modeling with fusion, using vision-language models for degradation-aware perception and guided suppression, outperforming existing methods on degraded images.

Details

Motivation: Existing IVIF methods assume high-quality inputs and rely on manual pre-processing for degraded images, leading to performance degradation due to decoupled degradation handling and fusion.

Method: Proposes VGDCFusion with Specific-Prompt Degradation-Coupled Extractor for modality-specific degradation awareness and joint modeling, and Joint-Prompt Degradation-Coupled Fusion for cross-modal degradation perception and residual degradation filtering.

Result: Extensive experiments show VGDCFusion significantly outperforms state-of-the-art fusion approaches under various degraded image scenarios.

Conclusion: The proposed method successfully couples degradation modeling with fusion process and leverages VLMs for effective degradation-aware perception and suppression in image fusion tasks.

Abstract: Existing Infrared and Visible Image Fusion (IVIF) methods typically assume high-quality inputs. However, when handing degraded images, these methods heavily rely on manually switching between different pre-processing techniques. This decoupling of degradation handling and image fusion leads to significant performance degradation. In this paper, we propose a novel VLM-Guided Degradation-Coupled Fusion network (VGDCFusion), which tightly couples degradation modeling with the fusion process and leverages vision-language models (VLMs) for degradation-aware perception and guided suppression. Specifically, the proposed Specific-Prompt Degradation-Coupled Extractor (SPDCE) enables modality-specific degradation awareness and establishes a joint modeling of degradation suppression and intra-modal feature extraction. In parallel, the Joint-Prompt Degradation-Coupled Fusion (JPDCF) facilitates cross-modal degradation perception and couples residual degradation filtering with complementary cross-modal feature fusion. Extensive experiments demonstrate that our VGDCFusion significantly outperforms existing state-of-the-art fusion approaches under various degraded image scenarios. Our code is available at https://github.com/Lmmh058/VGDCFusion.

[481] VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment

Qing Li, Huifang Feng, Xun Gong, Yu-Shen Liu

Main category: cs.CV

TL;DR: A method that enhances 3D Gaussian Splatting’s geometric representation through view alignment, improving surface reconstruction and multi-view consistency.

Details

Motivation: 3D Gaussian Splatting shows promise for novel view synthesis but struggles with accurate surface reconstruction due to discrete Gaussians and image-only supervision leading to inaccurate geometry and inconsistent multi-view alignment.

Method: Incorporates edge-aware image cues, visibility-aware photometric alignment loss, normal-based constraints, and deep image feature embeddings to enforce geometric consistency across views and improve surface boundary delineation.

Result: Achieves state-of-the-art performance in both surface reconstruction and novel view synthesis on standard benchmarks.

Conclusion: The proposed view alignment method significantly improves 3D Gaussian Splatting’s geometric representation capabilities while maintaining its efficiency advantages.

Abstract: 3D Gaussian Splatting has recently emerged as an efficient solution for high-quality and real-time novel view synthesis. However, its capability for accurate surface reconstruction remains underexplored. Due to the discrete and unstructured nature of Gaussians, supervision based solely on image rendering loss often leads to inaccurate geometry and inconsistent multi-view alignment. In this work, we propose a novel method that enhances the geometric representation of 3D Gaussians through view alignment (VA). Specifically, we incorporate edge-aware image cues into the rendering loss to improve surface boundary delineation. To enforce geometric consistency across views, we introduce a visibility-aware photometric alignment loss that models occlusions and encourages accurate spatial relationships among Gaussians. To further mitigate ambiguities caused by lighting variations, we incorporate normal-based constraints to refine the spatial orientation of Gaussians and improve local surface estimation. Additionally, we leverage deep image feature embeddings to enforce cross-view consistency, enhancing the robustness of the learned geometry under varying viewpoints and illumination. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis. The source code is available at https://github.com/LeoQLi/VA-GS.

[482] Towards Fast and Scalable Normal Integration using Continuous Components

Francesco Milano, Jen Jen Chung, Lionel Ott, Roland Siegwart

Main category: cs.CV

TL;DR: A new surface normal integration method that reduces optimization variables by grouping pixels into continuous components and estimating relative scales, achieving state-of-the-art results with significant speed improvements.

Details

Motivation: Existing surface normal integration approaches require iterative global optimization at the pixel level, which scales poorly to larger normal maps and becomes computationally expensive.

Method: Recast normal integration as estimation of relative scales of continuous components, with heuristic component estimation, optimization term rebalancing, and iterative component merging to reduce problem size.

Result: Achieves state-of-the-art results on standard benchmarks in seconds, with one-order-of-magnitude speedup over pixel-level approaches on large-resolution normal maps.

Conclusion: The proposed component-based approach provides an efficient and scalable solution to surface normal integration by drastically reducing optimization variables while maintaining accuracy.

Abstract: Surface normal integration is a fundamental problem in computer vision, dealing with the objective of reconstructing a surface from its corresponding normal map. Existing approaches require an iterative global optimization to jointly estimate the depth of each pixel, which scales poorly to larger normal maps. In this paper, we address this problem by recasting normal integration as the estimation of relative scales of continuous components. By constraining pixels belonging to the same component to jointly vary their scale, we drastically reduce the number of optimization variables. Our framework includes a heuristic to accurately estimate continuous components from the start, a strategy to rebalance optimization terms, and a technique to iteratively merge components to further reduce the size of the problem. Our method achieves state-of-the-art results on the standard normal integration benchmark in as little as a few seconds and achieves one-order-of-magnitude speedup over pixel-level approaches on large-resolution normal maps.

[483] LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference

Jianhao Yuan, Fabio Pizzati, Francesco Pinto, Lars Kunze, Ivan Laptev, Paul Newman, Philip Torr, Daniele De Martini

Main category: cs.CV

TL;DR: LikePhys is a training-free method that evaluates intuitive physics understanding in video diffusion models by distinguishing physically valid vs impossible videos using denoising objectives as likelihood surrogates.

Details

Motivation: Current evaluation methods struggle to disentangle physics correctness from visual appearance in video generation, making it challenging to assess intuitive physics understanding in video diffusion models.

Method: Uses denoising objective as ELBO-based likelihood surrogate on curated valid-invalid video pairs, with Plausibility Preference Error (PPE) metric tested on 12 scenarios across 4 physics domains.

Result: PPE demonstrates strong alignment with human preference, outperforming state-of-the-art evaluator baselines. Models show clear improvement in physics understanding as capacity and inference settings scale, though struggle with complex dynamics.

Conclusion: LikePhys provides effective evaluation of intuitive physics in video diffusion models, revealing domain-specific capacity variations and scaling trends in physics understanding.

Abstract: Intuitive physics understanding in video diffusion models plays an essential role in building general-purpose physically plausible world simulators, yet accurately evaluating such capacity remains a challenging task due to the difficulty in disentangling physics correctness from visual appearance in generation. To the end, we introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models by distinguishing physically valid and impossible videos using the denoising objective as an ELBO-based likelihood surrogate on a curated dataset of valid-invalid pairs. By testing on our constructed benchmark of twelve scenarios spanning over four physics domains, we show that our evaluation metric, Plausibility Preference Error (PPE), demonstrates strong alignment with human preference, outperforming state-of-the-art evaluator baselines. We then systematically benchmark intuitive physics understanding in current video diffusion models. Our study further analyses how model design and inference settings affect intuitive physics understanding and highlights domain-specific capacity variations across physical laws. Empirical results show that, despite current models struggling with complex and chaotic dynamics, there is a clear trend of improvement in physics understanding as model capacity and inference settings scale.

[484] Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model

Ruiping Liu, Junwei Zheng, Yufan Chen, Zirui Wang, Kunyu Peng, Kailun Yang, Jiaming Zhang, Marc Pollefeys, Rainer Stiefelhagen

Main category: cs.CV

TL;DR: Situat3DChange is a comprehensive 3D dataset for situation-aware change understanding, featuring 121K QA pairs, 36K change descriptions, and 17K rearrangement instructions. It addresses limitations in current 3D datasets by capturing dynamic scenarios and situations.

Details

Motivation: Current 3D datasets focus on either dynamic scenarios or situations in isolation, leading to incomplete understanding. The paper aims to overcome these limitations by providing a comprehensive dataset for situation-aware change understanding.

Method: The dataset leverages 11K human observations with egocentric/allocentric perspectives and spatial relations, integrated using LLMs. For point cloud comparison, they propose SCReasoner - an efficient 3D MLLM approach with minimal parameter overhead.

Result: Comprehensive evaluation shows progress and limitations of MLLMs in dynamic scene understanding. Additional experiments demonstrate the task-agnostic effectiveness of Situat3DChange as training data for MLLMs, with good cross-domain transfer capabilities.

Conclusion: Situat3DChange enables better understanding of dynamic 3D environments and supports human-AI collaboration through shared mental models. The proposed SCReasoner efficiently handles point cloud comparison with minimal computational overhead.

Abstract: Physical environments and circumstances are fundamentally dynamic, yet current 3D datasets and evaluation benchmarks tend to concentrate on either dynamic scenarios or dynamic situations in isolation, resulting in incomplete comprehension. To overcome these constraints, we introduce Situat3DChange, an extensive dataset supporting three situation-aware change understanding tasks following the perception-action model: 121K question-answer pairs, 36K change descriptions for perception tasks, and 17K rearrangement instructions for the action task. To construct this large-scale dataset, Situat3DChange leverages 11K human observations of environmental changes to establish shared mental models and shared situational awareness for human-AI collaboration. These observations, enriched with egocentric and allocentric perspectives as well as categorical and coordinate spatial relations, are integrated using an LLM to support understanding of situated changes. To address the challenge of comparing pairs of point clouds from the same scene with minor changes, we propose SCReasoner, an efficient 3D MLLM approach that enables effective point cloud comparison with minimal parameter overhead and no additional tokens required for the language decoder. Comprehensive evaluation on Situat3DChange tasks highlights both the progress and limitations of MLLMs in dynamic scene and situation understanding. Additional experiments on data scaling and cross-domain transfer demonstrate the task-agnostic effectiveness of using Situat3DChange as a training dataset for MLLMs.

Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

Main category: cs.CV

TL;DR: mmWalk is a multi-modal dataset for outdoor safe navigation assistance for blind/low vision users, featuring 120 walking trajectories with 62k synchronized frames and 559k panoramic images across RGB, depth, and semantic modalities, plus a 69k VQA benchmark.

Details

Motivation: Address the challenge of walking assistance in extreme/complex environments for BLV users by providing holistic scene understanding through multi-modal data integration.

Method: Created mmWalk dataset with manually controlled walking trajectories, multi-view sensors, accessibility features, and generated mmWalkVQA benchmark with visual question-answer triplets across 9 categories.

Result: State-of-the-art VLMs struggle with risk assessment and navigational tasks in zero/few-shot settings, but mmWalk-finetuned models show effectiveness on real-world datasets.

Conclusion: The mmWalk dataset advances multi-modal walking assistance research and demonstrates the need for specialized training for navigation tasks in complex outdoor environments.

Abstract: Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for outdoor safe navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) using zero- and few-shot settings and found they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.

[486] Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers

Chaofan Gan, Zicheng Zhao, Yuanpeng Tu, Xi Chen, Ziran Qin, Tieyuan Chen, Mehrtash Harandi, Weiyao Lin

Main category: cs.CV

TL;DR: The paper investigates Massive Activations (MAs) in Diffusion Transformers (DiTs) and proposes Detail Guidance (DG), a training-free method to enhance local detail synthesis in visual generation.

Details

Motivation: To understand the role of Massive Activations in DiTs' internal feature maps and leverage them to improve local detail fidelity in visual generation.

Method: Proposed Detail Guidance (DG) - a training-free self-guidance strategy that constructs a degraded ‘detail-deficient’ model by disrupting MAs and uses it to guide the original network toward better detail synthesis. DG integrates with Classifier-Free Guidance.

Result: DG consistently improves fine-grained detail quality across various pre-trained DiTs (SD3, SD3.5, and Flux) without requiring retraining.

Conclusion: Massive Activations play a key role in local detail synthesis, and the proposed DG method effectively enhances detail fidelity in DiT-based visual generation systems.

Abstract: Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for visual generation. Recent observations reveal \emph{Massive Activations} (MAs) in their internal feature maps, yet their function remains poorly understood. In this work, we systematically investigate these activations to elucidate their role in visual generation. We found that these massive activations occur across all spatial tokens, and their distribution is modulated by the input timestep embeddings. Importantly, our investigations further demonstrate that these massive activations play a key role in local detail synthesis, while having minimal impact on the overall semantic content of output. Building on these insights, we propose \textbf{D}etail \textbf{G}uidance (\textbf{DG}), a MAs-driven, training-free self-guidance strategy to explicitly enhance local detail fidelity for DiTs. Specifically, DG constructs a degraded ``detail-deficient’’ model by disrupting MAs and leverages it to guide the original network toward higher-quality detail synthesis. Our DG can seamlessly integrate with Classifier-Free Guidance (CFG), enabling further refinements of fine-grained details. Extensive experiments demonstrate that our DG consistently improves fine-grained detail quality across various pre-trained DiTs (\eg, SD3, SD3.5, and Flux).

[487] ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?

Liu Yang, Huiyu Duan, Ran Tao, Juntao Cheng, Sijing Wu, Yunhao Li, Jing Liu, Xiongkuo Min, Guangtao Zhai

Main category: cs.CV

TL;DR: ODI-Bench is a comprehensive benchmark for omnidirectional image understanding with 2,000 images and 4,000 QA pairs across 10 tasks. Current MLLMs struggle with ODI comprehension, but Omni-CoT method significantly improves performance through chain-of-thought reasoning.

Details

Motivation: Multi-modal large language models perform well on 2D images but their ability to understand immersive omnidirectional images remains unexplored, creating a gap in VR/AR and embodied intelligence applications.

Method: Created ODI-Bench with 2,000 ODIs and 4,000 QA pairs across 10 tasks. Evaluated 20 MLLMs and proposed Omni-CoT, a training-free chain-of-thought method that integrates textual and visual cues for better ODI understanding.

Result: Experimental results show current MLLMs struggle to capture immersive ODI context. Omni-CoT significantly enhances MLLMs’ comprehension ability in omnidirectional environments.

Conclusion: The study addresses the gap in ODI understanding by providing a comprehensive benchmark and an effective training-free method that improves MLLM performance on omnidirectional images.

Abstract: Omnidirectional images (ODIs) provide full 360x180 view which are widely adopted in VR, AR and embodied intelligence applications. While multi-modal large language models (MLLMs) have demonstrated remarkable performance on conventional 2D image and video understanding benchmarks, their ability to comprehend the immersive environments captured by ODIs remains largely unexplored. To address this gap, we first present ODI-Bench, a novel comprehensive benchmark specifically designed for omnidirectional image understanding. ODI-Bench contains 2,000 high-quality omnidirectional images and over 4,000 manually annotated question-answering (QA) pairs across 10 fine-grained tasks, covering both general-level and spatial-level ODI understanding. Extensive experiments are conducted to benchmark 20 representative MLLMs, including proprietary and open-source models, under both close-ended and open-ended settings. Experimental results reveal that current MLLMs still struggle to capture the immersive context provided by ODIs. To this end, we further introduce Omni-CoT, a training-free method which significantly enhances MLLMs’ comprehension ability in the omnidirectional environment through chain-of-thought reasoning across both textual information and visual cues. Both the benchmark and the code will be released upon the publication.

[488] How many samples to label for an application given a foundation model? Chest X-ray classification study

Nikolay Nechaev, Evgenia Przhezdzetskaya, Viktor Gombolevskiy, Dmitry Umerenkov, Dmitry Dylov

Main category: cs.CV

TL;DR: Power-law fits predict training size needed for chest X-ray classification performance thresholds, showing foundation models require fewer labeled examples than ResNet-50 baseline.

Details

Motivation: Chest X-ray classification is resource-intensive and typically requires extensive annotated data. Foundation models reduce this reliance but the exact number of labeled samples needed remains unclear.

Method: Systematically evaluate power-law fits to predict training size required for specific ROC-AUC thresholds. Test multiple pathologies and foundation models (XrayCLIP, XraySigLIP) against ResNet-50 baseline.

Result: XrayCLIP and XraySigLIP achieve strong performance with significantly fewer labeled examples than ResNet-50 baseline. Learning curve slopes from just 50 labeled cases accurately forecast final performance plateaus.

Conclusion: Results enable practitioners to minimize annotation costs by labeling only the essential samples needed for targeted performance levels.

Abstract: Chest X-ray classification is vital yet resource-intensive, typically demanding extensive annotated data for accurate diagnosis. Foundation models mitigate this reliance, but how many labeled samples are required remains unclear. We systematically evaluate the use of power-law fits to predict the training size necessary for specific ROC-AUC thresholds. Testing multiple pathologies and foundation models, we find XrayCLIP and XraySigLIP achieve strong performance with significantly fewer labeled examples than a ResNet-50 baseline. Importantly, learning curve slopes from just 50 labeled cases accurately forecast final performance plateaus. Our results enable practitioners to minimize annotation costs by labeling only the essential samples for targeted performance.

[489] SNAP: Towards Segmenting Anything in Any Point Cloud

Aniket Gupta, Hanhui Wang, Charles Saunders, Aruni RoyChowdhury, Hanumant Singh, Huaizu Jiang

Main category: cs.CV

TL;DR: SNAP is a unified model for interactive 3D point cloud segmentation that supports both point-based and text-based prompts across diverse domains (indoor, outdoor, aerial), achieving state-of-the-art performance through cross-domain training and domain-adaptive normalization.

Details

Motivation: Current 3D segmentation approaches are limited to single domains and single interaction types, with training on multiple datasets causing negative transfer and lack of generalizability.

Method: Training on 7 diverse datasets with domain-adaptive normalization to prevent negative transfer; using automatically generated mask proposals matched against CLIP embeddings for text-prompted segmentation; supporting both panoptic and open-vocabulary segmentation.

Result: Achieves SOTA on 8/9 zero-shot benchmarks for spatial-prompted segmentation and competitive results on all 5 text-prompted benchmarks, demonstrating that unified models can match or exceed specialized domain-specific approaches.

Conclusion: SNAP provides a practical, scalable tool for 3D annotation by unifying multiple interaction modalities and domains in a single model that outperforms specialized approaches.

Abstract: Interactive 3D point cloud segmentation enables efficient annotation of complex 3D scenes through user-guided prompts. However, current approaches are typically restricted in scope to a single domain (indoor or outdoor), and to a single form of user interaction (either spatial clicks or textual prompts). Moreover, training on multiple datasets often leads to negative transfer, resulting in domain-specific tools that lack generalizability. To address these limitations, we present \textbf{SNAP} (\textbf{S}egment a\textbf{N}ything in \textbf{A}ny \textbf{P}oint cloud), a unified model for interactive 3D segmentation that supports both point-based and text-based prompts across diverse domains. Our approach achieves cross-domain generalizability by training on 7 datasets spanning indoor, outdoor, and aerial environments, while employing domain-adaptive normalization to prevent negative transfer. For text-prompted segmentation, we automatically generate mask proposals without human intervention and match them against CLIP embeddings of textual queries, enabling both panoptic and open-vocabulary segmentation. Extensive experiments demonstrate that SNAP consistently delivers high-quality segmentation results. We achieve state-of-the-art performance on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and demonstrate competitive results on all 5 text-prompted benchmarks. These results show that a unified model can match or exceed specialized domain-specific approaches, providing a practical tool for scalable 3D annotation. Project page is at, https://neu-vi.github.io/SNAP/

[490] A Framework for Low-Effort Training Data Generation for Urban Semantic Segmentation

Denis Zavadski, Damjan Kalšan, Tim Küchler, Haebom Lee, Stefan Roth, Carsten Rother

Main category: cs.CV

TL;DR: A framework that adapts diffusion models to generate high-fidelity, target-aligned images from synthetic semantic maps using imperfect pseudo-labels, transforming weak synthetic data into effective real-domain training sets.

Details

Motivation: Synthetic datasets show noticeable gaps to real imagery, especially when adapting to specific target domains like Cityscapes, limiting downstream performance. Detailed 3D modeling is expensive and defeats the purpose of low-cost labeled data.

Method: Adapts an off-the-shelf diffusion model to a target domain using only imperfect pseudo-labels. The method filters suboptimal generations, rectifies image-label misalignments, and standardizes semantics across datasets.

Result: Experiments on five synthetic datasets and two real target datasets show segmentation gains of up to +8.0%pt. mIoU over state-of-the-art translation methods, making rapidly constructed synthetic datasets as effective as high-effort ones.

Conclusion: Fast semantic prototyping combined with generative models enables scalable, high-quality training data creation for urban scene understanding, highlighting a valuable collaborative paradigm.

Abstract: Synthetic datasets are widely used for training urban scene recognition models, but even highly realistic renderings show a noticeable gap to real imagery. This gap is particularly pronounced when adapting to a specific target domain, such as Cityscapes, where differences in architecture, vegetation, object appearance, and camera characteristics limit downstream performance. Closing this gap with more detailed 3D modelling would require expensive asset and scene design, defeating the purpose of low-cost labelled data. To address this, we present a new framework that adapts an off-the-shelf diffusion model to a target domain using only imperfect pseudo-labels. Once trained, it generates high-fidelity, target-aligned images from semantic maps of any synthetic dataset, including low-effort sources created in hours rather than months. The method filters suboptimal generations, rectifies image-label misalignments, and standardises semantics across datasets, transforming weak synthetic data into competitive real-domain training sets. Experiments on five synthetic datasets and two real target datasets show segmentation gains of up to +8.0%pt. mIoU over state-of-the-art translation methods, making rapidly constructed synthetic datasets as effective as high-effort, time-intensive synthetic datasets requiring extensive manual design. This work highlights a valuable collaborative paradigm where fast semantic prototyping, combined with generative models, enables scalable, high-quality training data creation for urban scene understanding.

[491] Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping

Walid Elbarz, Mohamed Bourriz, Hicham Hajji, Hamd Ait Abdelali, François Bourzeix

Main category: cs.CV

TL;DR: Benchmarking three foundation models (HyperSigma, DOFA, and SpectralEarth Vision Transformers) for hyperspectral cereal crop mapping, with SpectralEarth achieving best performance (93.5% OA) and demonstrating strong generalization across regions and sensors.

Details

Motivation: Foundation models are transforming Earth observation but their potential for hyperspectral crop mapping remains underexplored, requiring systematic evaluation.

Method: Fine-tuned three foundation models on manually labeled hyperspectral data from training region and evaluated on independent test region using overall accuracy, average accuracy, and F1-score metrics.

Result: HyperSigma: 34.5% OA, DOFA: 62.6% OA, SpectralEarth: 93.5% OA. Compact SpectralEarth variant achieved 91% when trained from scratch.

Conclusion: SpectralEarth foundation model shows superior performance for operational hyperspectral crop mapping, with model architecture being crucial for strong cross-region and cross-sensor generalization.

Abstract: Foundation models are transforming Earth observation, but their potential for hyperspectral crop mapping remains underexplored. This study benchmarks three foundation models for cereal crop mapping using hyperspectral imagery: HyperSigma, DOFA, and Vision Transformers pre-trained on the SpectralEarth dataset (a large multitemporal hyperspectral archive). Models were fine-tuned on manually labeled data from a training region and evaluated on an independent test region. Performance was measured with overall accuracy (OA), average accuracy (AA), and F1-score. HyperSigma achieved an OA of 34.5% (+/- 1.8%), DOFA reached 62.6% (+/- 3.5%), and the SpectralEarth model achieved an OA of 93.5% (+/- 0.8%). A compact SpectralEarth variant trained from scratch achieved 91%, highlighting the importance of model architecture for strong generalization across geographic regions and sensor platforms. These results provide a systematic evaluation of foundation models for operational hyperspectral crop mapping and outline directions for future model development.

[492] EvoCAD: Evolutionary CAD Code Generation with Vision Language Models

Tobias Preintner, Weixuan Yuan, Adrian König, Thomas Bäck, Elena Raponi, Niki van Stein

Main category: cs.CV

TL;DR: EvoCAD combines vision language models with evolutionary optimization to generate CAD objects through symbolic representations, outperforming previous methods on topological correctness using novel Euler characteristic-based metrics.

Details

Motivation: To leverage the generative capabilities of large language models with evolutionary algorithms for improved CAD object generation, particularly focusing on topological correctness.

Method: EvoCAD samples multiple CAD objects using vision language models and optimizes them through evolutionary approach with vision and reasoning language models, evaluated on CADPrompt benchmark.

Result: EvoCAD outperforms previous approaches on multiple metrics, especially in generating topologically correct objects, with novel Euler characteristic-based metrics proving effective.

Conclusion: The combination of LLMs with evolutionary computation shows promise for CAD generation, with EvoCAD demonstrating superior performance in topological correctness using new evaluation metrics.

Abstract: Combining large language models with evolutionary computation algorithms represents a promising research direction leveraging the remarkable generative and in-context learning capabilities of LLMs with the strengths of evolutionary algorithms. In this work, we present EvoCAD, a method for generating computer-aided design (CAD) objects through their symbolic representations using vision language models and evolutionary optimization. Our method samples multiple CAD objects, which are then optimized using an evolutionary approach with vision language and reasoning language models. We assess our method using GPT-4V and GPT-4o, evaluating it on the CADPrompt benchmark dataset and comparing it to prior methods. Additionally, we introduce two new metrics based on topological properties defined by the Euler characteristic, which capture a form of semantic similarity between 3D objects. Our results demonstrate that EvoCAD outperforms previous approaches on multiple metrics, particularly in generating topologically correct objects, which can be efficiently evaluated using our two novel metrics that complement existing spatial metrics.

[493] MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis

Hongyu Zhu, Lin Chen, Mounim A. El-Yacoubi, Mingsheng Shang

Main category: cs.CV

TL;DR: MS-Mix is an emotion-sensitive augmentation framework for multimodal sentiment analysis that addresses label ambiguity and semantic inconsistency in traditional Mixup methods through sentiment-aware sample selection, dynamic mixing ratios, and sentiment alignment loss.

Details

Motivation: Current multimodal sentiment analysis models are limited by scarce annotated data, and direct application of Mixup augmentation introduces label ambiguity and semantic inconsistency due to lack of emotion-aware mixing mechanisms.

Method: MS-Mix framework includes: 1) Sentiment-Aware Sample Selection to prevent mixing contradictory emotions, 2) Sentiment Intensity Guided module using multi-head self-attention for dynamic modality-specific mixing ratios, 3) Sentiment Alignment Loss with KL-divergence regularization to align predictions across modalities.

Result: Extensive experiments on three benchmark datasets with six state-of-the-art backbones show MS-Mix consistently outperforms existing methods, establishing new standards for robust multimodal sentiment augmentation.

Conclusion: MS-Mix effectively addresses the limitations of traditional Mixup in multimodal sentiment analysis by incorporating emotion-aware mechanisms, leading to improved performance and generalization across various backbone models.

Abstract: Multimodal Sentiment Analysis (MSA) aims to identify and interpret human emotions by integrating information from heterogeneous data sources such as text, video, and audio. While deep learning models have advanced in network architecture design, they remain heavily limited by scarce multimodal annotated data. Although Mixup-based augmentation improves generalization in unimodal tasks, its direct application to MSA introduces critical challenges: random mixing often amplifies label ambiguity and semantic inconsistency due to the lack of emotion-aware mixing mechanisms. To overcome these issues, we propose MS-Mix, an adaptive, emotion-sensitive augmentation framework that automatically optimizes sample mixing in multimodal settings. The key components of MS-Mix include: (1) a Sentiment-Aware Sample Selection (SASS) strategy that effectively prevents semantic confusion caused by mixing samples with contradictory emotions. (2) a Sentiment Intensity Guided (SIG) module using multi-head self-attention to compute modality-specific mixing ratios dynamically based on their respective emotional intensities. (3) a Sentiment Alignment Loss (SAL) that aligns the prediction distributions across modalities, and incorporates the Kullback-Leibler-based loss as an additional regularization term to train the emotion intensity predictor and the backbone network jointly. Extensive experiments on three benchmark datasets with six state-of-the-art backbones confirm that MS-Mix consistently outperforms existing methods, establishing a new standard for robust multimodal sentiment augmentation. The source code is available at: https://github.com/HongyuZhu-s/MS-Mix.

[494] NV3D: Leveraging Spatial Shape Through Normal Vector-based 3D Object Detection

Krittin Chaowakarn, Paramin Sangwongngam, Nang Htet Htet Aung, Chalie Charoenlarpnopparut

Main category: cs.CV

TL;DR: NV3D is a 3D object detection model that uses normal vectors from voxel neighbors to enhance feature representation, achieving superior performance on KITTI dataset with up to 55% data reduction while maintaining accuracy.

Details

Motivation: Existing multi-modal methods face feature alignment challenges, and local feature extraction can be oversimplified for complex 3D object detection tasks. The paper aims to address these limitations by leveraging informative normal vector features.

Method: NV3D extracts local features using normal vectors computed per voxel basis via KNN and PCA. It offers two sampling strategies: normal vector density-based sampling and FOV-aware bin-based sampling. The model uses element-wise attention fusion with voxel features as query/value and normal vector features as key.

Result: On KITTI validation set, NV3D without sampling achieves 86.60% and 80.18% mAP for car and cyclist detection, outperforming baseline Voxel R-CNN by 2.61% and 4.23% respectively. With both samplings, it achieves 85.54% mAP for car detection (1.56% improvement) while filtering out ~55% of voxels.

Conclusion: NV3D demonstrates that normal vector features effectively capture surface relationships for 3D object detection, enabling significant data reduction while maintaining or improving performance, particularly benefiting detection of objects with distinct spatial shapes like cars and cyclists.

Abstract: Recent studies in 3D object detection for autonomous vehicles aim to enrich features through the utilization of multi-modal setups or the extraction of local patterns within LiDAR point clouds. However, multi-modal methods face significant challenges in feature alignment, and gaining features locally can be oversimplified for complex 3D object detection tasks. In this paper, we propose a novel model, NV3D, which utilizes local features acquired from voxel neighbors, as normal vectors computed per voxel basis using K-nearest neighbors (KNN) and principal component analysis (PCA). This informative feature enables NV3D to determine the relationship between the surface and pertinent target entities, including cars, pedestrians, or cyclists. During the normal vector extraction process, NV3D offers two distinct sampling strategies: normal vector density-based sampling and FOV-aware bin-based sampling, allowing elimination of up to 55% of data while maintaining performance. In addition, we applied element-wise attention fusion, which accepts voxel features as the query and value and normal vector features as the key, similar to the attention mechanism. Our method is trained on the KITTI dataset and has demonstrated superior performance in car and cyclist detection owing to their spatial shapes. In the validation set, NV3D without sampling achieves 86.60% and 80.18% mean Average Precision (mAP), greater than the baseline Voxel R-CNN by 2.61% and 4.23% mAP, respectively. With both samplings, NV3D achieves 85.54% mAP in car detection, exceeding the baseline by 1.56% mAP, despite roughly 55% of voxels being filtered out.

[495] ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training

Leonard Bruns, Axel Barroso-Laguna, Tommaso Cavallari, Áron Monszpart, Sowmya Munukutla, Victor Adrian Prisacariu, Eric Brachmann

Main category: cs.CV

TL;DR: ACE-G separates coordinate regression from map representation using a generic transformer and scene-specific map code, enabling pre-training on thousands of scenes and improved generalization to unseen query images.

Details

Motivation: Traditional scene coordinate regression (SCR) methods overfit to training views by design, limiting their generalization capabilities when query images have different lighting or viewpoints. This inherent limitation prevents SCR from matching the robustness of classical feature-matching approaches.

Method: Proposes ACE-G framework that separates coordinate regressor and map representation into a generic transformer and scene-specific map code. This allows pre-training the transformer on tens of thousands of scenes and training it to generalize from mapping images to unseen query images during pre-training.

Result: Demonstrates significantly increased robustness on multiple challenging relocalization datasets while maintaining attractive computational efficiency.

Conclusion: The separation of coordinate regression and map representation enables pre-training and improved generalization, overcoming the inherent overfitting limitation of previous SCR frameworks.

Abstract: Scene coordinate regression (SCR) has established itself as a promising learning-based approach to visual relocalization. After mere minutes of scene-specific training, SCR models estimate camera poses of query images with high accuracy. Still, SCR methods fall short of the generalization capabilities of more classical feature-matching approaches. When imaging conditions of query images, such as lighting or viewpoint, are too different from the training views, SCR models fail. Failing to generalize is an inherent limitation of previous SCR frameworks, since their training objective is to encode the training views in the weights of the coordinate regressor itself. The regressor essentially overfits to the training views, by design. We propose to separate the coordinate regressor and the map representation into a generic transformer and a scene-specific map code. This separation allows us to pre-train the transformer on tens of thousands of scenes. More importantly, it allows us to train the transformer to generalize from mapping images to unseen query images during pre-training. We demonstrate on multiple challenging relocalization datasets that our method, ACE-G, leads to significantly increased robustness while keeping the computational footprint attractive.

[496] ExpVid: A Benchmark for Experiment Video Understanding & Reasoning

Yicheng Xu, Yue Wu, Jiashuo Yu, Ziang Yan, Tianxiang Jiang, Yinan He, Qingsong Zhao, Kai Chen, Yu Qiao, Limin Wang, Manabu Okumura, Yi Wang

Main category: cs.CV

TL;DR: ExpVid is a new benchmark for evaluating Multimodal Large Language Models (MLLMs) on scientific experiment videos, revealing their limitations in fine-grained perception, procedural understanding, and scientific reasoning despite strong coarse-grained recognition.

Details

Motivation: Existing benchmarks fail to capture the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings, making it difficult to understand MLLMs' true capabilities for accelerating scientific discovery.

Method: Created ExpVid benchmark with three-level task hierarchy: Fine-grained Perception, Procedural Understanding, and Scientific Reasoning. Used vision-centric annotation pipeline combining automated generation with multi-disciplinary expert validation from peer-reviewed video publications.

Result: Evaluation of 19 leading MLLMs showed they excel at coarse-grained recognition but struggle with fine details, state tracking over time, and linking procedures to outcomes. Notable performance gap between proprietary and open-source models, especially in high-order reasoning.

Conclusion: ExpVid provides a diagnostic tool and roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation by identifying current limitations in fine-grained understanding and scientific reasoning capabilities.

Abstract: Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.

[497] FACE: Faithful Automatic Concept Extraction

Dipkamal Bhusal, Michael Clifford, Sara Rampazzi, Nidhi Rastogi

Main category: cs.CV

TL;DR: FACE is a novel framework that improves concept-based explanations of deep neural networks by ensuring alignment between original and concept-based predictions through KL divergence regularization.

Details

Motivation: Existing automatic concept discovery methods often fail to align extracted concepts with the model's true decision-making process, compromising explanation faithfulness.

Method: FACE augments Non-negative Matrix Factorization (NMF) with a Kullback-Leibler (KL) divergence regularization term and incorporates classifier supervision during concept learning to enforce predictive consistency.

Result: FACE outperforms existing methods across faithfulness and sparsity metrics on ImageNet, COCO, and CelebA datasets, with theoretical guarantees bounding predictive distribution deviations.

Conclusion: The proposed FACE framework successfully addresses the faithfulness issue in concept-based explanations by ensuring predictive consistency through KL divergence regularization and classifier supervision.

Abstract: Interpreting deep neural networks through concept-based explanations offers a bridge between low-level features and high-level human-understandable semantics. However, existing automatic concept discovery methods often fail to align these extracted concepts with the model’s true decision-making process, thereby compromising explanation faithfulness. In this work, we propose FACE (Faithful Automatic Concept Extraction), a novel framework that augments Non-negative Matrix Factorization (NMF) with a Kullback-Leibler (KL) divergence regularization term to ensure alignment between the model’s original and concept-based predictions. Unlike prior methods that operate solely on encoder activations, FACE incorporates classifier supervision during concept learning, enforcing predictive consistency and enabling faithful explanations. We provide theoretical guarantees showing that minimizing the KL divergence bounds the deviation in predictive distributions, thereby promoting faithful local linearity in the learned concept space. Systematic evaluations on ImageNet, COCO, and CelebA datasets demonstrate that FACE outperforms existing methods across faithfulness and sparsity metrics.

[498] High-resolution Photo Enhancement in Real-time: A Laplacian Pyramid Network

Feng Zhang, Haoyou Deng, Zhiqiang Li, Lida Li, Bin Xu, Qingbo Lu, Zisheng Cao, Minchen Wei, Changxin Gao, Nong Sang, Xiang Bai

Main category: cs.CV

TL;DR: LLF-LUT++ is a pyramid network for photo enhancement that combines global and local operators using Laplacian pyramid decomposition, achieving both high performance and computational efficiency for edge devices.

Details

Motivation: Existing photo enhancement methods either focus on performance but are too heavy for edge devices, or prioritize efficiency but deliver inadequate performance for real-world applications.

Method: Integrates global and local operators through closed-form Laplacian pyramid decomposition and reconstruction. Uses image-adaptive 3D LUT for global tonal enhancement with weight fusion strategies, and spatial-frequency transformer weight predictor to extract distinct weights. Applies local Laplacian filters to refine edge details in high-frequency components.

Result: Achieves 2.64 dB improvement in PSNR on HDR+ dataset, processes 4K resolution images in 13 ms on a single GPU, and performs favorably compared to state-of-the-art methods on benchmark datasets.

Conclusion: LLF-LUT++ successfully bridges the gap between performance and efficiency in photo enhancement, enabling fast processing of high-resolution images while maintaining excellent enhancement quality.

Abstract: Photo enhancement plays a crucial role in augmenting the visual aesthetics of a photograph. In recent years, photo enhancement methods have either focused on enhancement performance, producing powerful models that cannot be deployed on edge devices, or prioritized computational efficiency, resulting in inadequate performance for real-world applications. To this end, this paper introduces a pyramid network called LLF-LUT++, which integrates global and local operators through closed-form Laplacian pyramid decomposition and reconstruction. This approach enables fast processing of high-resolution images while also achieving excellent performance. Specifically, we utilize an image-adaptive 3D LUT that capitalizes on the global tonal characteristics of downsampled images, while incorporating two distinct weight fusion strategies to achieve coarse global image enhancement. To implement this strategy, we designed a spatial-frequency transformer weight predictor that effectively extracts the desired distinct weights by leveraging frequency features. Additionally, we apply local Laplacian filters to adaptively refine edge details in high-frequency components. After meticulously redesigning the network structure and transformer model, LLF-LUT++ not only achieves a 2.64 dB improvement in PSNR on the HDR+ dataset, but also further reduces runtime, with 4K resolution images processed in just 13 ms on a single GPU. Extensive experimental results on two benchmark datasets further show that the proposed approach performs favorably compared to state-of-the-art methods. The source code will be made publicly available at https://github.com/fengzhang427/LLF-LUT.

[499] IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment

Yinan Chen, Jiangning Zhang, Teng Hu, Yuxiang Zeng, Zhucun Xue, Qingdong He, Chengjie Wang, Yong Liu, Xiaobin Hu, Shuicheng Yan

Main category: cs.CV

TL;DR: IVEBench is a comprehensive benchmark for instruction-guided video editing that addresses limitations in existing benchmarks through diverse source videos, broad task coverage, and multi-dimensional evaluation metrics.

Details

Motivation: Existing video editing benchmarks fail to adequately support instruction-guided video editing evaluation and suffer from limited source diversity, narrow task coverage, and incomplete evaluation metrics.

Method: IVEBench includes 600 high-quality source videos across 7 semantic dimensions, 8 categories of editing tasks with 35 subcategories, and establishes a three-dimensional evaluation protocol (video quality, instruction compliance, video fidelity) using both traditional metrics and multimodal LLM-based assessments.

Result: Extensive experiments demonstrate IVEBench’s effectiveness in benchmarking state-of-the-art instruction-guided video editing methods and its ability to provide comprehensive, human-aligned evaluation outcomes.

Conclusion: IVEBench successfully addresses the limitations of existing benchmarks and provides a modern, comprehensive evaluation suite for instruction-guided video editing methods.

Abstract: Instruction-guided video editing has emerged as a rapidly advancing research direction, offering new opportunities for intuitive content transformation while also posing significant challenges for systematic evaluation. Existing video editing benchmarks fail to support the evaluation of instruction-guided video editing adequately and further suffer from limited source diversity, narrow task coverage and incomplete evaluation metrics. To address the above limitations, we introduce IVEBench, a modern benchmark suite specifically designed for instruction-guided video editing assessment. IVEBench comprises a diverse database of 600 high-quality source videos, spanning seven semantic dimensions, and covering video lengths ranging from 32 to 1,024 frames. It further includes 8 categories of editing tasks with 35 subcategories, whose prompts are generated and refined through large language models and expert review. Crucially, IVEBench establishes a three-dimensional evaluation protocol encompassing video quality, instruction compliance and video fidelity, integrating both traditional metrics and multimodal large language model-based assessments. Extensive experiments demonstrate the effectiveness of IVEBench in benchmarking state-of-the-art instruction-guided video editing methods, showing its ability to provide comprehensive and human-aligned evaluation outcomes.

[500] PhySIC: Physically Plausible 3D Human-Scene Interaction and Contact from a Single Image

Pradyumna Yalandur Muralidhar, Yuxuan Xue, Xianghui Xie, Margaret Kostyrko, Gerard Pons-Moll

Main category: cs.CV

TL;DR: PhySIC reconstructs metrically accurate 3D humans and scenes from single images by addressing depth ambiguity and physical inconsistencies through occlusion-aware depth fusion, contact optimization, and joint human-scene refinement.

Details

Motivation: Existing methods struggle with depth ambiguity, occlusions, and physically inconsistent contacts when reconstructing humans and scenes from single images, limiting applications in VR, robotics, and 3D scene understanding.

Method: PhySIC starts with coarse monocular depth and body estimates, performs occlusion-aware inpainting, fuses visible depth with unscaled geometry, synthesizes missing support surfaces, and uses confidence-weighted optimization to refine body pose, camera parameters, and global scale while enforcing depth alignment, contact priors, interpenetration avoidance, and 2D reprojection consistency.

Result: PhySIC reduces mean per-vertex scene error from 641 mm to 227 mm, halves PA-MPJPE to 42 mm, improves contact F1 from 0.09 to 0.51, and handles multiple humans efficiently (9s for optimization, 27s end-to-end).

Conclusion: PhySIC advances scalable 3D scene understanding by converting single images into physically plausible 3D human-scene pairs with realistic interactions and handling of occlusions.

Abstract: Reconstructing metrically accurate humans and their surrounding scenes from a single image is crucial for virtual reality, robotics, and comprehensive 3D scene understanding. However, existing methods struggle with depth ambiguity, occlusions, and physically inconsistent contacts. To address these challenges, we introduce PhySIC, a framework for physically plausible Human-Scene Interaction and Contact reconstruction. PhySIC recovers metrically consistent SMPL-X human meshes, dense scene surfaces, and vertex-level contact maps within a shared coordinate frame from a single RGB image. Starting from coarse monocular depth and body estimates, PhySIC performs occlusion-aware inpainting, fuses visible depth with unscaled geometry for a robust metric scaffold, and synthesizes missing support surfaces like floors. A confidence-weighted optimization refines body pose, camera parameters, and global scale by jointly enforcing depth alignment, contact priors, interpenetration avoidance, and 2D reprojection consistency. Explicit occlusion masking safeguards invisible regions against implausible configurations. PhySIC is efficient, requiring only 9 seconds for joint human-scene optimization and under 27 seconds end-to-end. It naturally handles multiple humans, enabling reconstruction of diverse interactions. Empirically, PhySIC outperforms single-image baselines, reducing mean per-vertex scene error from 641 mm to 227 mm, halving PA-MPJPE to 42 mm, and improving contact F1 from 0.09 to 0.51. Qualitative results show realistic foot-floor interactions, natural seating, and plausible reconstructions of heavily occluded furniture. By converting a single image into a physically plausible 3D human-scene pair, PhySIC advances scalable 3D scene understanding. Our implementation is publicly available at https://yuxuan-xue.com/physic.

[501] InfiniHuman: Infinite 3D Human Creation with Precise Control

Yuxuan Xue, Xianghui Xie, Margaret Kostyrko, Gerard Pons-Moll

Main category: cs.CV

TL;DR: InfiniHuman is a framework that distills foundation models to generate unlimited, richly annotated 3D human data, enabling fast and controllable avatar generation with unprecedented diversity.

Details

Motivation: Generating realistic and controllable 3D human avatars is challenging due to the high cost and limited scale/diversity of capturing real human datasets. The paper aims to leverage existing foundation models to create theoretically unlimited human data at minimal cost.

Method: Proposes InfiniHumanData - an automatic pipeline using vision-language and image generation models to create large-scale multi-modal datasets, and InfiniHumanGen - a diffusion-based generative pipeline conditioned on text, body shape, and clothing assets.

Result: Generated 111K identities with unprecedented diversity, each annotated with multi-granularity text, multi-view RGB images, detailed clothing images, and SMPL parameters. User studies show generated identities are indistinguishable from scan renderings. Demonstrates significant improvements in visual quality, generation speed, and controllability over state-of-the-art methods.

Conclusion: The approach enables high-quality avatar generation with fine-grained control at effectively unbounded scale through a practical and affordable solution, with plans to publicly release the pipeline, dataset, and models.

Abstract: Generating realistic and controllable 3D human avatars is a long-standing challenge, particularly when covering broad attribute ranges such as ethnicity, age, clothing styles, and detailed body shapes. Capturing and annotating large-scale human datasets for training generative models is prohibitively expensive and limited in scale and diversity. The central question we address in this paper is: Can existing foundation models be distilled to generate theoretically unbounded, richly annotated 3D human data? We introduce InfiniHuman, a framework that synergistically distills these models to produce richly annotated human data at minimal cost and with theoretically unlimited scalability. We propose InfiniHumanData, a fully automatic pipeline that leverages vision-language and image generation models to create a large-scale multi-modal dataset. User study shows our automatically generated identities are undistinguishable from scan renderings. InfiniHumanData contains 111K identities spanning unprecedented diversity. Each identity is annotated with multi-granularity text descriptions, multi-view RGB images, detailed clothing images, and SMPL body-shape parameters. Building on this dataset, we propose InfiniHumanGen, a diffusion-based generative pipeline conditioned on text, body shape, and clothing assets. InfiniHumanGen enables fast, realistic, and precisely controllable avatar generation. Extensive experiments demonstrate significant improvements over state-of-the-art methods in visual quality, generation speed, and controllability. Our approach enables high-quality avatar generation with fine-grained control at effectively unbounded scale through a practical and affordable solution. We will publicly release the automatic data generation pipeline, the comprehensive InfiniHumanData dataset, and the InfiniHumanGen models at https://yuxuan-xue.com/infini-human.

[502] Beyond ‘Templates’: Category-Agnostic Object Pose, Size, and Shape Estimation from a Single View

Jinyu Zhang, Haitao Lin, Jiashu Hou, Xiangyang Xue, Yanwei Fu

Main category: cs.CV

TL;DR: A unified framework for 6D pose, size, and shape estimation from single RGB-D images without requiring CAD models or category labels, achieving real-time performance and strong zero-shot generalization.

Details

Motivation: Existing methods rely on object-specific priors like CAD models or suffer from limited generalization due to pose-shape entanglement and multi-stage pipelines.

Method: Fuses dense 2D features from vision foundation models with partial 3D point clouds using Transformer encoder enhanced by Mixture-of-Experts, with parallel decoders for pose-size estimation and shape reconstruction.

Result: Achieves state-of-the-art accuracy on seen categories and strong zero-shot generalization on unseen real-world objects across four benchmarks spanning 300+ categories, with real-time inference at 28 FPS.

Conclusion: Establishes a new standard for open-set 6D understanding in robotics and embodied AI, demonstrating effective category-agnostic performance trained solely on synthetic data.

Abstract: Estimating an object’s 6D pose, size, and shape from visual input is a fundamental problem in computer vision, with critical applications in robotic grasping and manipulation. Existing methods either rely on object-specific priors such as CAD models or templates, or suffer from limited generalization across categories due to pose-shape entanglement and multi-stage pipelines. In this work, we propose a unified, category-agnostic framework that simultaneously predicts 6D pose, size, and dense shape from a single RGB-D image, without requiring templates, CAD models, or category labels at test time. Our model fuses dense 2D features from vision foundation models with partial 3D point clouds using a Transformer encoder enhanced by a Mixture-of-Experts, and employs parallel decoders for pose-size estimation and shape reconstruction, achieving real-time inference at 28 FPS. Trained solely on synthetic data from 149 categories in the SOPE dataset, our framework is evaluated on four diverse benchmarks SOPE, ROPE, ObjaversePose, and HANDAL, spanning over 300 categories. It achieves state-of-the-art accuracy on seen categories while demonstrating remarkably strong zero-shot generalization to unseen real-world objects, establishing a new standard for open-set 6D understanding in robotics and embodied AI.

[503] CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images

Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, Hongsheng Li, Yi Ma, Xihui Liu

Main category: cs.CV

TL;DR: CodePlot-CoT is a code-driven Chain-of-Thought paradigm that enables VLMs to generate both text reasoning and executable plotting code, which renders into images as “visual thought” to solve mathematical problems requiring visual assistance.

Details

Motivation: Current LLMs and VLMs struggle with mathematical problems that require visual assistance like drawing auxiliary lines or plotting functions, as they are constrained to text-only reasoning and lack precision in multimodal generation.

Method: 1) Construct Math-VR, a large-scale bilingual dataset for mathematics with visual reasoning; 2) Develop an image-to-code converter to parse mathematical figures into code; 3) Train CodePlot-CoT model using this data to generate both text reasoning and executable plotting code.

Result: The model achieves up to 21% improvement over base models on the new benchmark, validating the efficacy of the code-driven reasoning paradigm for multimodal mathematical reasoning.

Conclusion: This work opens a new direction for multimodal mathematical reasoning and provides the community with the first large-scale dataset, benchmark, and strong approach for such problems, with all resources made publicly available.

Abstract: Recent advances in Large Language Models (LLMs) and Vision Language Models (VLMs) have shown significant progress in mathematical reasoning, yet they still face a critical bottleneck with problems requiring visual assistance, such as drawing auxiliary lines or plotting functions to solve the problems. Most LLMs and VLMs are constrained to text-only reasoning chains, while multimodal unified models that can generate interleaved text and images lack the necessary precision and controllability for such tasks. To address this, we propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for “thinking with images” in mathematics. Our approach leverages the VLM to generate text reasoning as well as executable plotting code, which is then rendered into images as “visual thought”, to solve mathematical problems. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning, comprising 178K samples. Second, to create high-quality training data, we develop a state-of-the-art image-to-code converter specialized for parsing complex mathematical figures into codes. Finally, using these training data, we train the CodePlot-CoT model for solving mathematical problems. Experimental results show that our model achieves up to 21% increase over base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm. Our work opens a new direction for multimodal mathematical reasoning and provides the community with the first large-scale dataset, comprehensive benchmark, and strong approach for such problems. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/HKU-MMLab/Math-VR-CodePlot-CoT.

[504] Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie

Main category: cs.CV

TL;DR: The paper proposes replacing traditional VAEs with Representation Autoencoders (RAEs) using pretrained encoders like DINO/SigLIP/MAE for diffusion transformers, addressing limitations of outdated backbones, low-dimensional latent spaces, and weak representations in current DiT approaches.

Details

Motivation: Current DiTs rely on outdated VAE encoders that introduce architectural limitations, low-dimensional latent spaces restricting information capacity, and weak representations from purely reconstruction-based training, ultimately limiting generative quality.

Method: Replace VAE with pretrained representation encoders (DINO, SigLIP, MAE) paired with trained decoders to form RAEs, analyze challenges of high-dimensional latent spaces, propose theoretically motivated solutions, and use DiT variant with lightweight DDT head.

Result: Achieves faster convergence without auxiliary losses, strong image generation results: 1.51 FID at 256x256 (no guidance) and 1.13 FID at both 256x256 and 512x512 (with guidance).

Conclusion: RAE offers clear advantages and should be the new default for diffusion transformer training, providing both high-quality reconstructions and semantically rich latent spaces with scalable transformer-based architecture.

Abstract: Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.

[505] Bayesian Topological Convolutional Neural Nets

Sarah Harkins Dayton, Hayden Everett, Ioannis Schizas, David L. Boothe Jr., Vasileios Maroulas

Main category: cs.CV

TL;DR: A Bayesian topological CNN that combines topology-aware learning with Bayesian sampling to improve training efficiency, reduce calibration error, and enhance uncertainty quantification in image classification.

Details

Motivation: To address limitations of conventional CNNs including large data requirements, overconfident predictions, and poor uncertainty quantification.

Method: Proposes a Bayesian topological CNN with prior distributions on network parameters, posterior learning, and a consistency condition in the learning cost to modify priors.

Result: Superior performance over conventional CNNs, Bayesian neural networks, and topological CNNs, especially with limited or corrupted data, and better out-of-distribution detection.

Conclusion: The hybrid approach shows potential for more efficient and robust image classification through improved uncertainty quantification and data efficiency.

Abstract: Convolutional neural networks (CNNs) have been established as the main workhorse in image data processing; nonetheless, they require large amounts of data to train, often produce overconfident predictions, and frequently lack the ability to quantify the uncertainty of their predictions. To address these concerns, we propose a new Bayesian topological CNN that promotes a novel interplay between topology-aware learning and Bayesian sampling. Specifically, it utilizes information from important manifolds to accelerate training while reducing calibration error by placing prior distributions on network parameters and properly learning appropriate posteriors. One important contribution of our work is the inclusion of a consistency condition in the learning cost, which can effectively modify the prior distributions to improve the performance of our novel network architecture. We evaluate the model on benchmark image classification datasets and demonstrate its superiority over conventional CNNs, Bayesian neural networks (BNNs), and topological CNNs. In particular, we supply evidence that our method provides an advantage in situations where training data is limited or corrupted. Furthermore, we show that the new model allows for better uncertainty quantification than standard BNNs since it can more readily identify examples of out-of-distribution data on which it has not been trained. Our results highlight the potential of our novel hybrid approach for more efficient and robust image classification.

[506] DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training

Haoran Feng, Dizhe Zhang, Xiangtai Li, Bo Du, Lu Qi

Main category: cs.CV

TL;DR: DiT360 is a DiT-based framework for panoramic image generation that uses hybrid training on perspective and panoramic data to address geometric fidelity and photorealism issues.

Details

Motivation: The main challenge in panoramic image generation is the lack of large-scale, high-quality real-world panoramic data, which affects geometric fidelity and photorealism. This data-centric approach differs from previous methods focused on model design.

Method: DiT360 employs inter-domain transformation and intra-domain augmentation at both image level (cross-domain knowledge through perspective guidance and panoramic refinement) and token level (hybrid supervision with circular padding, yaw loss, and cube loss for boundary continuity, rotational robustness, and distortion awareness).

Result: Extensive experiments on text-to-panorama, inpainting, and outpainting tasks show improved boundary consistency and image fidelity across eleven quantitative metrics.

Conclusion: The proposed DiT360 framework effectively addresses panoramic image generation challenges through hybrid training and multi-level supervision, achieving superior performance in geometric fidelity and photorealism.

Abstract: In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. For the issues of maintaining geometric fidelity and photorealism in generation quality, we attribute the main reason to the lack of large-scale, high-quality, real-world panoramic data, where such a data-centric view differs from prior methods that focus on model design. Basically, DiT360 has several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules, which include circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics. Our code is available at https://github.com/Insta360-Research-Team/DiT360.

[507] Point Prompting: Counterfactual Tracking with Video Diffusion Models

Ayush Shrivastava, Sanyam Mehta, Daniel Geng, Andrew Owens

Main category: cs.CV

TL;DR: Pretrained video diffusion models can perform zero-shot point tracking by prompting them to visually mark points as they move over time, achieving competitive performance with specialized self-supervised models.

Details

Motivation: Trackers and video generators solve related problems - analyzing vs synthesizing motion. This connection enables leveraging pretrained video diffusion models for tracking tasks without additional training.

Method: Place a distinctively colored marker at query point, regenerate video from intermediate noise level to propagate marker across frames. Use unedited initial frame as negative prompt to maintain marker visibility.

Result: Emergent tracks outperform prior zero-shot methods, persist through occlusions, and achieve performance competitive with specialized self-supervised models.

Conclusion: Video diffusion models inherently possess tracking capabilities that can be unlocked through simple prompting, bridging the gap between motion analysis and synthesis.

Abstract: Trackers and video generators solve closely related problems: the former analyze motion, while the latter synthesize it. We show that this connection enables pretrained video diffusion models to perform zero-shot point tracking by simply prompting them to visually mark points as they move over time. We place a distinctively colored marker at the query point, then regenerate the rest of the video from an intermediate noise level. This propagates the marker across frames, tracing the point’s trajectory. To ensure that the marker remains visible in this counterfactual generation, despite such markers being unlikely in natural videos, we use the unedited initial frame as a negative prompt. Through experiments with multiple image-conditioned video diffusion models, we find that these “emergent” tracks outperform those of prior zero-shot methods and persist through occlusions, often obtaining performance that is competitive with specialized self-supervised models.

[508] Ev4DGS: Novel-view Rendering of Non-Rigid Objects from Monocular Event Streams

Takuya Nakabayashi, Navami Kairanda, Hideo Saito, Vladislav Golyanik

Main category: cs.CV

TL;DR: Ev4DGS is the first method for novel view rendering of non-rigidly deforming objects from monocular event streams only, without requiring RGB inputs.

Details

Motivation: Existing event-based rendering approaches for non-rigid objects require additional sparse RGB inputs, which limits practical applications. The paper explores whether similar models can be learned from event streams alone.

Method: Regresses a deformable 3D Gaussian Splatting representation using: 1) a loss relating model outputs with 2D event observation space, and 2) a coarse 3D deformation model trained from binary masks generated from events.

Result: Experimental comparisons on synthetic and real datasets show Ev4DGS is valid and outperforms multiple naive baselines applicable in this setting.

Conclusion: Ev4DGS successfully demonstrates novel view rendering of non-rigid objects from event streams only, overcoming previous limitations that required RGB inputs.

Abstract: Event cameras offer various advantages for novel view rendering compared to synchronously operating RGB cameras, and efficient event-based techniques supporting rigid scenes have been recently demonstrated in the literature. In the case of non-rigid objects, however, existing approaches additionally require sparse RGB inputs, which can be a substantial practical limitation; it remains unknown if similar models could be learned from event streams only. This paper sheds light on this challenging open question and introduces Ev4DGS, i.e., the first approach for novel view rendering of non-rigidly deforming objects in the explicit observation space (i.e., as RGB or greyscale images) from monocular event streams. Our method regresses a deformable 3D Gaussian Splatting representation through 1) a loss relating the outputs of the estimated model with the 2D event observation space, and 2) a coarse 3D deformation model trained from binary masks generated from events. We perform experimental comparisons on existing synthetic and newly recorded real datasets with non-rigid objects. The results demonstrate the validity of Ev4DGS and its superior performance compared to multiple naive baselines that can be applied in our setting. We will release our models and the datasets used in the evaluation for research purposes; see the project webpage: https://4dqv.mpi-inf.mpg.de/Ev4DGS/.

[509] Invariant Feature Learning for Generalized Long-Tailed Classification

Kaihua Tang, Mingyuan Tao, Jiaxin Qi, Zhenguang Liu, Hanwang Zhang

Main category: cs.CV

TL;DR: This paper introduces Generalized Long-Tailed classification (GLT), which addresses both class-wise and attribute-wise imbalances in datasets, unlike traditional long-tailed methods that only handle class imbalance. The authors propose an Invariant Feature Learning (IFL) method as a baseline for GLT.

Details

Motivation: Existing long-tailed classification methods only address class imbalance but overlook attribute-wise imbalance within classes, which is more ubiquitous and challenging. The authors aim to solve both types of imbalances through a generalized approach.

Method: Proposed Invariant Feature Learning (IFL) method that first discovers environments with divergent intra-class distributions from imperfect predictions, then learns invariant features across these environments. IFL serves as an improved feature backbone that can boost existing LT methods.

Result: Most class-wise long-tailed methods degenerate on the proposed ImageNet-GLT and MSCOCO-GLT benchmarks. IFL shows promising results as it can enhance all existing LT approaches including re-balance, augmentation, and ensemble methods.

Conclusion: GLT is a more comprehensive problem than traditional LT classification, and the proposed IFL method provides an effective baseline for addressing both class-wise and attribute-wise imbalances in datasets.

Abstract: Existing long-tailed classification (LT) methods only focus on tackling the class-wise imbalance that head classes have more samples than tail classes, but overlook the attribute-wise imbalance. In fact, even if the class is balanced, samples within each class may still be long-tailed due to the varying attributes. Note that the latter is fundamentally more ubiquitous and challenging than the former because attributes are not just implicit for most datasets, but also combinatorially complex, thus prohibitively expensive to be balanced. Therefore, we introduce a novel research problem: Generalized Long-Tailed classification (GLT), to jointly consider both kinds of imbalances. By “generalized”, we mean that a GLT method should naturally solve the traditional LT, but not vice versa. Not surprisingly, we find that most class-wise LT methods degenerate in our proposed two benchmarks: ImageNet-GLT and MSCOCO-GLT. We argue that it is because they over-emphasize the adjustment of class distribution while neglecting to learn attribute-invariant features. To this end, we propose an Invariant Feature Learning (IFL) method as the first strong baseline for GLT. IFL first discovers environments with divergent intra-class distributions from the imperfect predictions and then learns invariant features across them. Promisingly, as an improved feature backbone, IFL boosts all the LT line-up: one/two-stage re-balance, augmentation, and ensemble. Codes and benchmarks are available on Github: https://github.com/KaihuaTang/Generalized-Long-Tailed-Benchmarks.pytorch

[510] Class Is Invariant to Context and Vice Versa: On Learning Invariance for Out-Of-Distribution Generalization

Jiaxin Qi, Kaihua Tang, Qianru Sun, Xian-Sheng Hua, Hanwang Zhang

Main category: cs.CV

TL;DR: The paper proposes a novel method for Out-Of-Distribution generalization by leveraging class invariance to context, achieving state-of-the-art performance without requiring context labels.

Details

Motivation: Current OOD methods rely on annotated context bias or biased class predictions, which are often incomplete or incorrect. The authors argue that context is invariant to class, allowing classes to serve as varying environments for resolving context bias without explicit context labels.

Method: The method minimizes contrastive loss of intra-class sample similarity while ensuring this similarity remains invariant across all classes. It uses a simple re-weighting based classifier with the proposed context estimation approach.

Result: The approach achieves state-of-the-art performance on benchmarks with various context biases and domain gaps, demonstrating effectiveness across different challenging scenarios.

Conclusion: The paper successfully demonstrates that leveraging class invariance to context provides an effective alternative to traditional context annotation methods for OOD generalization, with theoretical justification and practical implementation.

Abstract: Out-Of-Distribution generalization (OOD) is all about learning invariance against environmental changes. If the context in every class is evenly distributed, OOD would be trivial because the context can be easily removed due to an underlying principle: class is invariant to context. However, collecting such a balanced dataset is impractical. Learning on imbalanced data makes the model bias to context and thus hurts OOD. Therefore, the key to OOD is context balance. We argue that the widely adopted assumption in prior work, the context bias can be directly annotated or estimated from biased class prediction, renders the context incomplete or even incorrect. In contrast, we point out the everoverlooked other side of the above principle: context is also invariant to class, which motivates us to consider the classes (which are already labeled) as the varying environments to resolve context bias (without context labels). We implement this idea by minimizing the contrastive loss of intra-class sample similarity while assuring this similarity to be invariant across all classes. On benchmarks with various context biases and domain gaps, we show that a simple re-weighting based classifier equipped with our context estimation achieves state-of-the-art performance. We provide the theoretical justifications in Appendix and codes on https://github.com/simpleshinobu/IRMCon.

[511] Information Topology

Xin Li

Main category: cs.CV

TL;DR: Information Topology unifies information theory and algebraic topology by treating cycle closure as the fundamental operation of inference, separating transient fluctuations from stable predictive structures.

Details

Motivation: To create a unified framework that bridges information theory and algebraic topology, addressing how systems form stable, predictive structures from transient data through topological stabilization.

Method: Develops the dot-cycle dichotomy, Structure-Before-Specificity principle, Context-Content Uncertainty Principle, and defines homological capacity as the topological dual of Shannon capacity.

Result: Shows that prediction requires invariance for generalization, explains order invariance through measure concentration, and links dynamical entropy to structural capacity across domains like visual binding and working memory.

Conclusion: Recasts inference, learning, and communication as topological stabilization - the formation, closure, and persistence of informational cycles that make prediction robust and scalable.

Abstract: We introduce \emph{Information Topology}: a framework that unifies information theory and algebraic topology by treating \emph{cycle closure} as the primitive operation of inference. The starting point is the \emph{dot-cycle dichotomy}, which separates pointwise, order-sensitive fluctuations (dots) from order-invariant, predictive structure (cycles). Algebraically, closure is the cancellation of boundaries ($\partial^2=0$), which converts transient histories into stable invariants. Building on this, we derive the \emph{Structure-Before-Specificity} (SbS) principle: stable information resides in nontrivial homology classes that persist under perturbations, while high-entropy contextual details act as scaffolds. The \emph{Context-Content Uncertainty Principle} (CCUP) quantifies this balance by decomposing uncertainty into contextual spread and content precision, showing why prediction requires invariance for generalization. Measure concentration onto residual invariant manifolds explains \emph{order invariance}: when mass collapses to a narrow tube around a closed cycle, reparameterizations of micro-steps leave predictive functionals unchanged. We then define \emph{homological capacity}, the topological dual of Shannon capacity, as the sustainable number of independent informational cycles supported by a system. This capacity links dynamical (KS) entropy to structural (homological) capacity and refines Euler characteristics from a net'' summary to a gross’’ count of persistent invariants. Finally, we illustrate the theory across three domains where \emph{more is different}: \textbf{visual binding}, \textbf{working memory}, and \textbf{access consciousness}. Together, these results recast inference, learning, and communication as \emph{topological stabilization}: the formation, closure, and persistence of informational cycles that make prediction robust and scalable.

[512] Camouflaged Image Synthesis Is All You Need to Boost Camouflaged Detection

Haichao Zhang, Can Qin, Yu Yin, Yun Fu

Main category: cs.CV

TL;DR: A framework for synthesizing camouflage data using generative models to improve camouflaged object detection in natural scenes.

Details

Motivation: Camouflaged objects pose detection challenges and current research is constrained by limited data availability.

Method: Uses a camouflage environment generator supervised by a camouflage distribution classifier to synthesize realistic camouflage images for dataset expansion.

Result: Outperforms state-of-the-art methods on three datasets (COD10k, CAMO, and CHAMELEON).

Conclusion: Provides an effective plug-and-play data generation module that introduces more diversity into camouflage datasets.

Abstract: Camouflaged objects that blend into natural scenes pose significant challenges for deep-learning models to detect and synthesize. While camouflaged object detection is a crucial task in computer vision with diverse real-world applications, this research topic has been constrained by limited data availability. We propose a framework for synthesizing camouflage data to enhance the detection of camouflaged objects in natural scenes. Our approach employs a generative model to produce realistic camouflage images, which can be used to train existing object detection models. Specifically, we use a camouflage environment generator supervised by a camouflage distribution classifier to synthesize the camouflage images, which are then fed into our generator to expand the dataset. Our framework outperforms the current state-of-the-art method on three datasets (COD10k, CAMO, and CHAMELEON), demonstrating its effectiveness in improving camouflaged object detection. This approach can serve as a plug-and-play data generation and augmentation module for existing camouflaged object detection tasks and provides a novel way to introduce more diversity and distributions into current camouflage datasets.

[513] Hyper-STTN: Hypergraph Augmented Spatial-Temporal Transformer Network for Trajectory Prediction

Weizheng Wang, Baijian Yang, Sungeun Hong, Wenhai Sun, Byung-Cheol Min

Main category: cs.CV

TL;DR: Hyper-STTN: A hypergraph-based spatial-temporal transformer network for crowd trajectory prediction that models both pairwise interactions and groupwise dynamics through multimodal fusion.

Details

Motivation: Accurate crowd trajectory prediction is crucial for social robotics and autonomous driving, but remains challenging due to complex spatial-temporal interactions and heterogeneous group influences.

Method: Constructs multiscale hypergraphs of varying group sizes with spectral hypergraph convolution based on random-walk probabilities, combined with spatial-temporal transformer for pairwise interactions, fused via multimodal transformer.

Result: Extensive experiments on public pedestrian datasets show Hyper-STTN consistently outperforms state-of-the-art baselines and ablation models.

Conclusion: The proposed Hyper-STTN effectively addresses the challenges of modeling complex crowd behavior by integrating both pairwise and groupwise interactions through hypergraph and transformer architectures.

Abstract: Predicting crowd intentions and trajectories is critical for a range of real-world applications, involving social robotics and autonomous driving. Accurately modeling such behavior remains challenging due to the complexity of pairwise spatial-temporal interactions and the heterogeneous influence of groupwise dynamics. To address these challenges, we propose Hyper-STTN, a Hypergraph-based Spatial-Temporal Transformer Network for crowd trajectory prediction. Hyper-STTN constructs multiscale hypergraphs of varying group sizes to model groupwise correlations, captured through spectral hypergraph convolution based on random-walk probabilities. In parallel, a spatial-temporal transformer is employed to learn pedestrians’ pairwise latent interactions across spatial and temporal dimensions. These heterogeneous groupwise and pairwise features are subsequently fused and aligned via a multimodal transformer. Extensive experiments on public pedestrian motion datasets demonstrate that Hyper-STTN consistently outperforms state-of-the-art baselines and ablation models.

[514] MarkPlugger: Generalizable Watermark Framework for Latent Diffusion Models without Retraining

Guokai Zhang, Lanjun Wang, Yuting Su, An-An Liu

Main category: cs.CV

TL;DR: MarkPlugger is a plug-and-play watermark framework for latent diffusion models that embeds watermarks without requiring model retraining, using orthogonal watermark representations in latent space.

Details

Motivation: Address security concerns in AI-generated content by providing traceability without the high cost of retraining watermark models for rapidly evolving LDMs.

Method: Identifies watermark representations orthogonal to semantic content in latent space and uses additive fusion strategy to embed watermarks during denoising process without modifying LDM components.

Result: Effectively balances image quality and watermark recovery rate, generalizes to multiple LDM versions and variants without retraining, and performs robustly under various attacks.

Conclusion: MarkPlugger provides a cost-effective, generalizable solution for watermarking AI-generated content that maintains image quality while ensuring reliable watermark detection.

Abstract: Today, the family of latent diffusion models (LDMs) has gained prominence for its high quality outputs and scalability. This has also raised security concerns on social media, as malicious users can create and disseminate harmful content. Existing approaches typically involve training specific components or entire generative models to embed a watermark in generated images for traceability and responsibility. However, in the fast-evolving era of AI-generated content (AIGC), the rapid iteration and modification of LDMs makes retraining with watermark models costly. To address the problem, we propose MarkPlugger, a generalizable plug-and-play watermark framework without LDM retraining. In particular, to reduce the disturbance of the watermark on the semantics of the generated image, we try to identify a watermark representation that is approaching orthogonal to the semantic in latent space, and apply an additive fusion strategy for the watermark and the semantic. Without modifying any components of the LDMs, we embed diverse watermarks in latent space, adapting to the denoising process. Our experimental findings reveal that our method effectively harmonizes image quality and watermark recovery rate. We also have validated that our method is generalized to multiple official versions and modified variants of LDMs, even without retraining the watermark model. Furthermore, it performs robustly under various attacks of different intensities.

[515] Attention based End to end network for Offline Writer Identification on Word level data

Vineet Kumar, Suresh Sundaram

Main category: cs.CV

TL;DR: Proposes an attention-driven CNN for writer identification using fragments from word images, achieving robust performance with limited handwriting samples.

Details

Motivation: Writer identification performs well with ample handwriting samples but struggles with limited word images, creating a need for improved methods in constrained scenarios.

Method: Uses attention-driven CNN trained on pyramid-extracted fragments from word images to capture multi-level features, enhancing representation learning.

Result: Demonstrates proficiency on three benchmark databases, showing strong performance particularly when handwriting data is limited.

Conclusion: The attention-based fragment approach enables effective writer identification even with constrained handwriting samples, outperforming traditional methods.

Abstract: Writer identification due to its widespread application in various fields has gained popularity over the years. In scenarios where optimum handwriting samples are available, whether they be in the form of a single line, a sentence, or an entire page, writer identification algorithms have demonstrated noteworthy levels of accuracy. However, in scenarios where only a limited number of handwritten samples are available, particularly in the form of word images, there is a significant scope for improvement. In this paper, we propose a writer identification system based on an attention-driven Convolutional Neural Network (CNN). The system is trained utilizing image segments, known as fragments, extracted from word images, employing a pyramid-based strategy. This methodology enables the system to capture a comprehensive representation of the data, encompassing both fine-grained details and coarse features across various levels of abstraction. These extracted fragments serve as the training data for the convolutional network, enabling it to learn a more robust representation compared to traditional convolution-based networks trained on word images. Additionally, the paper explores the integration of an attention mechanism to enhance the representational power of the learned features. The efficacy of the proposed algorithm is evaluated on three benchmark databases, demonstrating its proficiency in writer identification tasks, particularly in scenarios with limited access to handwriting data.

[516] Improving Hierarchical Representations of Vectorized HD Maps with Perspective Clues

Chi Zhang, Qi Song, Feifei Li, Jie Li, Rui Huang

Main category: cs.CV

TL;DR: PerCMap is a novel approach for vectorized HD map construction that addresses limitations in current pipelines by exploiting perspective-view features at instance and point levels to improve map prediction accuracy.

Details

Motivation: Current map vector estimation pipelines face limitations: input-agnostic queries struggle with complex map structures, and view transformation causes information loss, leading to inaccurate shape restoration or missing instances.

Method: Proposes Cross-view Instance Activation (CIA) to activate instance queries across surround-view images for recovering instance attributes, and Dual-view Point Embedding (DPE) that fuses features from both views to generate input-aware positional embeddings for better point coordinate estimation.

Result: Achieves strong performance on nuScenes (67.1 mAP) and Argoverse 2 (70.5 mAP) benchmarks, demonstrating consistent improvements across datasets.

Conclusion: PerCMap effectively addresses key limitations in HD map construction by leveraging perspective-view features at multiple levels, resulting in more accurate and complete map vector predictions.

Abstract: The construction of vectorized High-Definition (HD) maps from onboard surround-view cameras has become a significant focus in autonomous driving. However, current map vector estimation pipelines face two key limitations: input-agnostic queries struggle to capture complex map structures, and the view transformation leads to information loss. These issues often result in inaccurate shape restoration or missing instances in map predictions. To address this concern, we propose a novel approach, namely \textbf{PerCMap}, which explicitly exploits clues from perspective-view features at both instance and point level. Specifically, at instance level, we propose Cross-view Instance Activation (CIA) to activate instance queries across surround-view images, thereby helping the model recover the instance attributes of map vectors. At point level, we design Dual-view Point Embedding (DPE), which fuses features from both views to generate input-aware positional embeddings and improve the accuracy of point coordinate estimation. Extensive experiments on \textit{nuScenes} and \textit{Argoverse 2} demonstrate that PerCMap achieves strong and consistent performance across benchmarks, reaching 67.1 and 70.5 mAP, respectively.

[517] UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning

Maoxun Yuan, Bo Cui, Tianyi Zhao, Jiayi Wang, Shan Fu, Xue Yang, Xingxing Wei

Main category: cs.CV

TL;DR: UniRGB-IR is a scalable framework for RGB-IR semantic tasks that uses adapter modules to incorporate multi-modal features into pre-trained RGB foundation models without fine-tuning the entire model.

Details

Motivation: Existing methods for RGB-IR semantic analysis have poor scalability and limited generalization because they rely on task-specific frameworks and direct fine-tuning of foundation models on RGB-IR datasets, without pre-trained foundation models specifically for infrared images.

Method: Proposes UniRGB-IR with three components: ViT foundation model, Multi-modal Feature Pool (MFP) module, and Supplementary Feature Injector (SFI) module. The MFP and SFI work as adapters to complement ViT features with contextual multi-scale features. The foundation model is frozen during training, only MFP and SFI are optimized.

Result: Experimental results on various RGB-IR semantic tasks demonstrate state-of-the-art performance using ViT-Base as the foundation model.

Conclusion: The proposed UniRGB-IR framework effectively addresses scalability and generalization limitations in RGB-IR semantic analysis by introducing adapter mechanisms that incorporate multi-modal features without fine-tuning the entire foundation model.

Abstract: Semantic analysis on visible (RGB) and infrared (IR) images has gained significant attention due to their enhanced accuracy and robustness under challenging conditions including low-illumination and adverse weather. However, due to the lack of pre-trained foundation models on the large-scale infrared image datasets, existing methods prefer to design task-specific frameworks and directly fine-tune them with pre-trained foundation models on their RGB-IR semantic relevance datasets, which results in poor scalability and limited generalization. To address these limitations, we propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks that introduces a novel adapter mechanism to effectively incorporate rich multi-modal features into pre-trained RGB-based foundation models. Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module, and a Supplementary Feature Injector (SFI) module. The MFP and SFI modules cooperate with each other as an adpater to effectively complement the ViT features with the contextual multi-scale features. During training process, we freeze the entire foundation model to inherit prior knowledge and only optimize the MFP and SFI modules. Furthermore, to verify the effectiveness of our framework, we utilize the ViT-Base as the pre-trained foundation model to perform extensive experiments. Experimental results on various RGB-IR semantic tasks demonstrate that our method can achieve state-of-the-art performance.

[518] Streamlining Image Editing with Layered Diffusion Brushes

Peyman Gholami, Robert Xiao

Main category: cs.CV

TL;DR: Layered Diffusion Brushes (LDB) is a training-free framework for interactive, layer-based image editing using standard diffusion models, enabling independent, non-destructive edits with 140ms speed per edit.

Details

Motivation: Interactive, localized editing workflows for denoising diffusion models remain underdeveloped, limiting creative workflows.

Method: LDB defines layers as self-contained parameter sets guiding generation, uses intermediate latent caching to reduce edits to few denoising steps, and implements familiar layer concepts in an editor.

Result: LDB achieves superior speed (140ms per edit) with comparable or improved image quality, background preservation, and edit fidelity compared to state-of-the-art methods.

Conclusion: LDB significantly enhances creative workflows with intuitive, efficient diffusion-based editing and has potential for expansion into video editing and related domains.

Abstract: Denoising diffusion models have emerged as powerful tools for image manipulation, yet interactive, localized editing workflows remain underdeveloped. We introduce Layered Diffusion Brushes (LDB), a novel training-free framework that enables interactive, layer-based editing using standard diffusion models. LDB defines each “layer” as a self-contained set of parameters guiding the generative process, enabling independent, non-destructive, and fine-grained prompt-guided edits, even in overlapping regions. LDB leverages a unique intermediate latent caching approach to reduce each edit to only a few denoising steps, achieving 140~ms per edit on consumer GPUs. An editor implementing LDB, incorporating familiar layer concepts, was evaluated via user study and quantitative metrics. Results demonstrate LDB’s superior speed alongside comparable or improved image quality, background preservation, and edit fidelity relative to state-of-the-art methods across various sequential image manipulation tasks. The findings highlight LDB’s ability to significantly enhance creative workflows by providing an intuitive and efficient approach to diffusion-based image editing and its potential for expansion into related subdomains, such as video editing.

[519] RATLIP: Generative Adversarial CLIP Text-to-Image Synthesis Based on Recurrent Affine Transformations

Chengde Lin, Xijun Lu, Guangxi Chen

Main category: cs.CV

TL;DR: The paper proposes RATLIP, a novel text-to-image synthesis model that combines recurrent affine transformations (RAT) with CLIP to improve image-text consistency and image quality over traditional GAN approaches.

Details

Motivation: Traditional GANs for text-to-image synthesis suffer from low consistency between images and text descriptions, and insufficient richness in synthesized images. Conditional affine transformations (CAT) in GANs lack global textual information sharing across layers.

Method: Proposes RATLIP model that replaces CAT with recurrent affine transformations (RAT) to enable global information sharing across layers, adds shuffle attention between RAT to mitigate information forgetting, and integrates CLIP pre-trained model in both generator and discriminator for better multimodal understanding.

Result: Extensive experiments on CUB, Oxford, and CelebA-tiny datasets demonstrate superior performance over state-of-the-art models in text-to-image synthesis quality and image-text consistency.

Conclusion: The proposed RATLIP model effectively addresses limitations of traditional GANs by enabling global information sharing through RAT and leveraging CLIP’s multimodal understanding capabilities, achieving improved text-to-image synthesis performance.

Abstract: Synthesizing high-quality photorealistic images with textual descriptions as a condition is very challenging. Generative Adversarial Networks (GANs), the classical model for this task, frequently suffer from low consistency between image and text descriptions and insufficient richness in synthesized images. Recently, conditional affine transformations (CAT), such as conditional batch normalization and instance normalization, have been applied to different layers of GAN to control content synthesis in images. CAT is a multi-layer perceptron that independently predicts data based on batch statistics between neighboring layers, with global textual information unavailable to other layers. To address this issue, we first model CAT and a recurrent neural network (RAT) to ensure that different layers can access global information. We then introduce shuffle attention between RAT to mitigate the characteristic of information forgetting in recurrent neural networks. Moreover, both our generator and discriminator utilize the powerful pre-trained model, Clip, which has been extensively employed for establishing associations between text and images through the learning of multimodal representations in latent space. The discriminator utilizes CLIP’s ability to comprehend complex scenes to accurately assess the quality of the generated images. Extensive experiments have been conducted on the CUB, Oxford, and CelebA-tiny datasets to demonstrate the superiority of the proposed model over current state-of-the-art models. The code is https://github.com/OxygenLu/RATLIP.

[520] A Unified Approach Towards Active Learning and Out-of-Distribution Detection

Sebastian Schmidt, Leonard Schenk, Leo Schwinn, Stephan Günnemann

Main category: cs.CV

TL;DR: SISOM is a unified framework that simultaneously addresses active learning (AL) and out-of-distribution (OOD) detection using feature space distance metrics, achieving state-of-the-art performance in both tasks.

Details

Motivation: Current approaches treat active learning and OOD detection as separate problems, but in real-world applications both are needed to handle unlabeled data and distribution shifts effectively.

Method: Leverages feature space distance metrics to combine AL and OOD detection capabilities in a unified framework called SISOM.

Result: Achieved first place in two OpenOOD benchmarks and second place in the remaining one; delivered top-1 performance in three AL benchmarks, outperforming other methods.

Conclusion: SISOM effectively unifies AL and OOD detection, demonstrating that solving both tasks together leads to superior performance compared to treating them separately.

Abstract: When applying deep learning models in open-world scenarios, active learning (AL) strategies are crucial for identifying label candidates from a nearly infinite amount of unlabeled data. In this context, robust out-of-distribution (OOD) detection mechanisms are essential for handling data outside the target distribution of the application. However, current works investigate both problems separately. In this work, we introduce SISOM as the first unified solution for both AL and OOD detection. By leveraging feature space distance metrics SISOM combines the strengths of the currently independent tasks to solve both effectively. We conduct extensive experiments showing the problems arising when migrating between both tasks. In these evaluations SISOM underlined its effectiveness by achieving first place in two of the widely used OpenOOD benchmarks and second place in the remaining one. In AL, SISOM outperforms others and delivers top-1 performance in three benchmarks

[521] Contrastive Local Manifold Learning for No-Reference Image Quality Assessment

Zihao Huang, Runze Hu, Timin Gao, Yan Zhang, Yunhang Shen, Ke Li

Main category: cs.CV

TL;DR: LML-IQA is a novel no-reference image quality assessment method that combines local manifold learning and contrastive learning to improve discriminative capabilities in perceptual quality evaluation.

Details

Motivation: Traditional IQA methods overlook local manifold structures, which compromises their discriminative capabilities in perceptual quality evaluation.

Method: Extracts multiple patches from each image, identifies the most visually salient region as positive sample for contrastive learning, treats other patches as intra-class negatives, and patches from different images as inter-class negatives. Also introduces mutual learning strategy.

Result: Achieved significant performance gains across eight benchmark datasets, with PLCC of 0.942 on TID2013 (vs 0.908 baseline) and 0.977 on CSIQ (vs 0.965 baseline).

Conclusion: LML-IQA effectively addresses the limitation of traditional IQA methods by leveraging local manifold structures and contrastive learning, demonstrating superior performance over state-of-the-art methods.

Abstract: Image Quality Assessment (IQA) methods typically overlook local manifold structures, leading to compromised discriminative capabilities in perceptual quality evaluation. To address this limitation, we present LML-IQA, an innovative no-reference IQA (NR-IQA) approach that leverages a combination of local manifold learning and contrastive learning. Our approach first extracts multiple patches from each image and identifies the most visually salient region. This salient patch serves as a positive sample for contrastive learning, while other patches from the same image are treated as intra-class negatives to preserve local distinctiveness. Patches from different images also act as inter-class negatives to enhance feature separation. Additionally, we introduce a mutual learning strategy to improve the model’s ability to recognize and prioritize visually important regions. Comprehensive experiments across eight benchmark datasets demonstrate significant performance gains over state-of-the-art methods, achieving a PLCC of 0.942 on TID2013 (compared to 0.908) and 0.977 on CSIQ (compared to 0.965).

[522] Open Vocabulary Multi-Label Video Classification

Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan, Ashish Tawari, Son Tran, Mubarak Shah, Benjamin Yao, Trishul Chilimbi

Main category: cs.CV

TL;DR: The paper proposes a method to adapt pre-trained vision-language models (VLMs) for open vocabulary multilabel video classification, enabling simultaneous recognition of multiple actions and entities in videos.

Details

Motivation: Previous methods focused on single label action classification but fall short in holistic video understanding that requires recognizing multiple actions and entities simultaneously in an open vocabulary setting.

Method: An end-to-end trainable architecture that learns to prompt LLMs to generate soft attributes for CLIP text-encoder, integrates temporal modeling into CLIP’s vision encoder, and uses novel regularized finetuning for video domain adaptation.

Result: Extensive experimentation shows the efficacy of the approach on multiple benchmark datasets.

Conclusion: The proposed method successfully extends VLMs to open vocabulary multilabel video classification, addressing the limitations of previous single-label approaches.

Abstract: Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single label action classification in videos. However, previous methods fall short in holistic video understanding which requires the ability to simultaneously recognize multiple actions and entities e.g., objects in the video in an open vocabulary setting. We formulate this problem as open vocabulary multilabel video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels to improve its open vocabulary performance with two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP’s vision encoder to effectively model the spatio-temporal dynamics of video concepts as well as propose a novel regularized finetuning technique to ensure strong open vocabulary classification performance in the video domain. Our extensive experimentation showcases the efficacy of our approach on multiple benchmark datasets.

[523] Motion Capture from Inertial and Vision Sensors

Xiaodong Chen, Wu Liu, Qian Bao, Xinchen Liu, Ruoli Dai, Yongdong Zhang, Tao Mei

Main category: cs.CV

TL;DR: MINIONS is a large-scale multi-modal motion capture dataset combining monocular camera and few IMUs, with a SparseNet framework for consumer-affordable motion capture.

Details

Motivation: Consumer-affordable and easy-to-use motion capture solutions are still immature despite industrial systems being widely adopted. There's a need for accessible solutions using minimal equipment for personal applications.

Method: Created MINIONS dataset with over 5M frames, 400 minutes of IMU signals and RGB videos labeled with motion data. Proposed SparseNet framework to capture motion from IMUs and videos by discovering their supplementary features.

Result: The framework demonstrates the unique advantages of combining inertial and vision sensors, showing promise for consumer-affordable multi-modal motion capture.

Conclusion: MINIONS provides a valuable resource for research and development in accessible motion capture, enabling further exploration of consumer-grade solutions using minimal equipment.

Abstract: Human motion capture is the foundation for many computer vision and graphics tasks. While industrial motion capture systems with complex camera arrays or expensive wearable sensors have been widely adopted in movie and game production, consumer-affordable and easy-to-use solutions for personal applications are still far from mature. To utilize a mixture of a monocular camera and very few inertial measurement units (IMUs) for accurate multi-modal human motion capture in daily life, we contribute MINIONS in this paper, a large-scale Motion capture dataset collected from INertial and visION Sensors. MINIONS has several featured properties: 1) large scale of over five million frames and 400 minutes duration; 2) multi-modality data of IMUs signals and RGB videos labeled with joint positions, joint rotations, SMPL parameters, etc.; 3) a diverse set of 146 fine-grained single and interactive actions with textual descriptions. With the proposed MINIONS dataset, we propose a SparseNet framework to capture human motion from IMUs and videos by discovering their supplementary features and exploring the possibilities of consumer-affordable motion capture using a monocular camera and very few IMUs. The experiment results emphasize the unique advantages of inertial and vision sensors, showcasing the promise of consumer-affordable multi-modal motion capture and providing a valuable resource for further research and development.

[524] Automated detection of underdiagnosed medical conditions via opportunistic imaging

Asad Aali, Andrew Johnston, Louis Blankemeier, Dave Van Veen, Laura T Derry, David Svec, Jason Hom, Robert D. Boutin, Akshay S. Chaudhari

Main category: cs.CV

TL;DR: Deep learning analysis of 2,674 abdominal CT scans reveals significant under-documentation of sarcopenia, hepatic steatosis, and ascites in ICD coding compared to opportunistic CT findings.

Details

Motivation: To address the underdiagnosis of conditions like sarcopenia, hepatic steatosis, and ascites by leveraging opportunistic CT scans and identifying documentation gaps in clinical practice.

Method: Used deep learning methods to analyze 2,674 inpatient CT scans, comparing imaging phenotypes from opportunistic CT with radiology reports and ICD coding documentation.

Result: Only 0.5% of sarcopenia, 3.2% of hepatic steatosis, and 30.7% of ascites cases identified through imaging or reports were properly ICD-coded, showing major documentation gaps.

Conclusion: Opportunistic CT has significant potential to improve diagnostic precision and accuracy of risk adjustment models in precision medicine.

Abstract: Abdominal computed tomography (CT) scans are frequently performed in clinical settings. Opportunistic CT involves repurposing routine CT images to extract diagnostic information and is an emerging tool for detecting underdiagnosed conditions such as sarcopenia, hepatic steatosis, and ascites. This study utilizes deep learning methods to promote accurate diagnosis and clinical documentation. We analyze 2,674 inpatient CT scans to identify discrepancies between imaging phenotypes (characteristics derived from opportunistic CT scans) and their corresponding documentation in radiology reports and ICD coding. Through our analysis, we find that only 0.5%, 3.2%, and 30.7% of scans diagnosed with sarcopenia, hepatic steatosis, and ascites (respectively) through either opportunistic imaging or radiology reports were ICD-coded. Our findings demonstrate opportunistic CT’s potential to enhance diagnostic precision and accuracy of risk adjustment models, offering advancements in precision medicine.

[525] LiDAR-GS:Real-time LiDAR Re-Simulation using Gaussian Splatting

Qifeng Chen, Sheng Yang, Sicong Du, Tao Tang, Rengan Xie, Peng Chen, Yuchi Huo

Main category: cs.CV

TL;DR: LiDAR-GS is a Gaussian Splatting method for real-time, high-fidelity LiDAR scan re-simulation in urban road scenes, addressing unique LiDAR sensor challenges through differentiable laser beam splatting and Neural Gaussian Representation.

Details

Motivation: To extend Gaussian Splatting methods from cameras to LiDAR sensors while preserving high accuracy and unique LiDAR characteristics, overcoming challenges posed by active 3D sensing.

Method: Uses differentiable laser beam splatting with range-view representation for precise surface projection, Neural Gaussian Representation for view-dependent properties, and dynamic instances decomposition for handling complex scenes.

Result: Achieves state-of-the-art results in both rendering frame rate and quality on large public scene datasets, successfully re-simulating depth, intensity, and ray-drop channels simultaneously.

Conclusion: LiDAR-GS demonstrates superior performance compared to explicit mesh or implicit NeRF methods, providing an effective solution for real-time LiDAR scan re-simulation with high fidelity.

Abstract: We present LiDAR-GS, a Gaussian Splatting (GS) method for real-time, high-fidelity re-simulation of LiDAR scans in public urban road scenes. Recent GS methods proposed for cameras have achieved significant advancements in real-time rendering beyond Neural Radiance Fields (NeRF). However, applying GS representation to LiDAR, an active 3D sensor type, poses several challenges that must be addressed to preserve high accuracy and unique characteristics. Specifically, LiDAR-GS designs a differentiable laser beam splatting, using range-view representation for precise surface splatting by projecting lasers onto micro cross-sections, effectively eliminating artifacts associated with local affine approximations. Furthermore, LiDAR-GS leverages Neural Gaussian Representation, which further integrate view-dependent clues, to represent key LiDAR properties that are influenced by the incident direction and external factors. Combining these practices with some essential adaptations, e.g., dynamic instances decomposition, LiDAR-GS succeeds in simultaneously re-simulating depth, intensity, and ray-drop channels, achieving state-of-the-art results in both rendering frame rate and quality on publically available large scene datasets when compared with the methods using explicit mesh or implicit NeRF. Our source code is publicly available at https://www.github.com/cqf7419/LiDAR-GS.

[526] OVS Meets Continual Learning: Towards Sustainable Open-Vocabulary Segmentation

Dongjun Hwang, Yejin Kim, Minyoung Lee, Seong Joon Oh, Junsuk Choe

Main category: cs.CV

TL;DR: ConOVS is a continual learning method for Open-Vocabulary Segmentation that uses a Mixture-of-Experts framework to dynamically combine expert decoders based on input sample distribution, enabling effective learning from sequentially collected data.

Details

Motivation: Most existing Open-Vocabulary Segmentation models assume fixed training data, but practical scenarios involve continuously collected datasets over time. Current approaches like retraining, fine-tuning, and continual learning have clear limitations for this sequential data setting.

Method: Proposed ConOVS method based on Mixture-of-Experts framework that dynamically combines expert decoders according to the probability that input samples belong to the distribution of each incremental dataset.

Result: ConOVS consistently outperforms existing methods across pre-training, incremental, and zero-shot test datasets, effectively expanding recognition capabilities when data is collected sequentially.

Conclusion: ConOVS successfully addresses the challenge of continual learning in Open-Vocabulary Segmentation, enabling models to effectively learn from sequentially collected data while maintaining performance across various test scenarios.

Abstract: Open-Vocabulary Segmentation (OVS) aims to segment classes that are not present in the training dataset. However, most existing studies assume that the training data is fixed in advance, overlooking more practical scenarios where new datasets are continuously collected over time. To address this, we first analyze how existing OVS models perform under such conditions. In this context, we explore several approaches such as retraining, fine-tuning, and continual learning but find that each of them has clear limitations. To address these issues, we propose ConOVS, a novel continual learning method based on a Mixture-of-Experts framework. ConOVS dynamically combines expert decoders based on the probability that an input sample belongs to the distribution of each incremental dataset. Through extensive experiments, we show that ConOVS consistently outperforms existing methods across pre-training, incremental, and zero-shot test datasets, effectively expanding the recognition capabilities of OVS models when data is collected sequentially.

[527] Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction

Yuanhao Cai, He Zhang, Kai Zhang, Yixun Liang, Mengwei Ren, Fujun Luan, Qing Liu, Soo Ye Kim, Jianming Zhang, Zhifei Zhang, Yuqian Zhou, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille

Main category: cs.CV

TL;DR: DiffusionGS is a single-stage 3D diffusion model that directly generates 3D Gaussian point clouds, ensuring view consistency and handling both object-centric and scene reconstruction tasks without requiring depth estimators.

Details

Motivation: Existing feedforward image-to-3D methods rely on 2D multi-view diffusion models that lack 3D consistency, collapse with changing view directions, and mainly handle object-centric cases.

Method: Proposes DiffusionGS, a novel single-stage 3D diffusion model that outputs 3D Gaussian point clouds at each timestep, using a scene-object mixed training strategy to scale up 3D training data.

Result: Achieves improvements of 2.20 dB/23.25 and 1.34 dB/19.16 in PSNR/FID for objects and scenes compared to state-of-the-art methods, with over 5x faster speed (~6s on A100 GPU) and no depth estimator required.

Conclusion: DiffusionGS provides robust 3D generation from single views with view consistency, handles arbitrary view directions, and demonstrates superior performance and efficiency for both object generation and scene reconstruction tasks.

Abstract: Existing feedforward image-to-3D methods mainly rely on 2D multi-view diffusion models that cannot guarantee 3D consistency. These methods easily collapse when changing the prompt view direction and mainly handle object-centric cases. In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object generation and scene reconstruction from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency and allow the model to generate robustly given prompt views of any directions, beyond object-centric inputs. Plus, to improve the capability and generality of DiffusionGS, we scale up 3D training data by developing a scene-object mixed training strategy. Experiments show that DiffusionGS yields improvements of 2.20 dB/23.25 and 1.34 dB/19.16 in PSNR/FID for objects and scenes than the state-of-the-art methods, without depth estimator. Plus, our method enjoys over 5$\times$ faster speed ($\sim$6s on an A100 GPU). Our Project page at https://caiyuanhao1998.github.io/project/DiffusionGS/ shows the video and interactive results. The code and models are publicly available at https://github.com/caiyuanhao1998/Open-DiffusionGS

[528] Multimodal Alignment and Fusion: A Survey

Songtao Li, Hao Tang

Main category: cs.CV

TL;DR: A comprehensive survey of multimodal alignment and fusion techniques, categorizing approaches by structure (data/feature/output-level) and methodology (statistical, kernel-based, graphical, generative, contrastive, attention-based, LLM-based), analyzing 260+ studies and addressing key challenges.

Details

Motivation: The increasing availability of diverse data modalities (text, images, audio, video) and limitations of previous surveys that focused on specific modalities or limited fusion strategies drive the need for a comprehensive, structure-centric framework for multimodal learning.

Method: Systematic categorization through structural perspectives (data-level, feature-level, output-level fusion) and methodological paradigms (statistical, kernel-based, graphical, generative, contrastive, attention-based, LLM-based methods) based on extensive review of 260+ studies.

Result: Provides a comprehensive framework for understanding multimodal alignment and fusion, identifies critical challenges (cross-modal misalignment, computational bottlenecks, data quality, modality gap), and explores applications across social media analysis, medical imaging, emotion recognition, and embodied AI.

Conclusion: The survey guides future research toward optimizing multimodal learning systems for improved scalability, robustness, and generalizability across diverse domains by providing structured insights into alignment and fusion techniques.

Abstract: This survey provides a comprehensive overview of recent advances in multimodal alignment and fusion within the field of machine learning, driven by the increasing availability and diversity of data modalities such as text, images, audio, and video. Unlike previous surveys that often focus on specific modalities or limited fusion strategies, our work presents a structure-centric and method-driven framework that emphasizes generalizable techniques. We systematically categorize and analyze key approaches to alignment and fusion through both structural perspectives – data-level, feature-level, and output-level fusion – and methodological paradigms – including statistical, kernel-based, graphical, generative, contrastive, attention-based, and large language model (LLM)-based methods, drawing insights from an extensive review of over 260 relevant studies. Furthermore, this survey highlights critical challenges such as cross-modal misalignment, computational bottlenecks, data quality issues, and the modality gap, along with recent efforts to address them. Applications ranging from social media analysis and medical imaging to emotion recognition and embodied AI are explored to illustrate the real-world impact of robust multimodal systems. The insights provided aim to guide future research toward optimizing multimodal learning systems for improved scalability, robustness, and generalizability across diverse domains.

[529] Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval

Ziwei Wang, Sameera Ramasinghe, Chenchen Xu, Julien Monteil, Loris Bazzani, Thalaiyasingam Ajanthan

Main category: cs.CV

TL;DR: A novel learning paradigm that encodes user-defined multi-level visual hierarchies in hyperbolic space without explicit hierarchical labels, using contrastive loss with pairwise entailment metrics for improved hierarchical image retrieval.

Details

Motivation: Most image understanding models focus on visual similarity rather than learning visual hierarchies, which limits their ability to capture semantic and structural information at multiple abstraction levels.

Method: Define part-based image hierarchies using object-level annotations, then enforce the hierarchy using contrastive loss with pairwise entailment metrics in hyperbolic space.

Result: Experiments show significant improvements in hierarchical image retrieval tasks, demonstrating the model’s capability to capture complex visual hierarchies.

Conclusion: The approach successfully encodes complex visual hierarchies that transcend mere visual similarity, capturing semantic and structural information for improved hierarchical understanding.

Abstract: Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies is relatively unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined multi-level complex visual hierarchies in hyperbolic space without requiring explicit hierarchical labels. As a concrete example, first, we define a part-based image hierarchy using object-level annotations within and across images. Then, we introduce an approach to enforce the hierarchy using contrastive loss with pairwise entailment metrics. Finally, we discuss new evaluation metrics to effectively measure hierarchical image retrieval. Encoding these complex relationships ensures that the learned representations capture semantic and structural information that transcends mere visual similarity. Experiments in part-based image retrieval show significant improvements in hierarchical retrieval tasks, demonstrating the capability of our model in capturing visual hierarchies.

[530] FairDD: Fair Dataset Distillation

Qihang Zhou, Shenhao Fang, Shibo He, Wenchao Meng, Jiming Chen

Main category: cs.CV

TL;DR: FairDD is a novel fair dataset distillation framework that addresses bias in condensed datasets by synchronously matching synthetic data to protected attribute groups, improving fairness while maintaining accuracy.

Details

Motivation: Previous dataset distillation methods overlook fairness concerns and actually worsen bias towards minority groups in condensed datasets due to their smaller size.

Method: FairDD synchronously matches synthetic datasets to protected attribute-wise groups of original datasets rather than indiscriminate alignment to whole distributions, preventing collapse into majority groups and enabling balanced generation across all groups.

Result: FairDD significantly improves fairness compared to vanilla dataset distillation methods while maintaining accuracy, achieving a promising trade-off between fairness and accuracy across diverse matching-based approaches.

Conclusion: FairDD establishes itself as a versatile fair dataset distillation approach that effectively regularizes vanilla methods to favor balanced generation toward minority groups, with consistent superiority across Distribution and Gradient Matching methods.

Abstract: Condensing large datasets into smaller synthetic counterparts has demonstrated its promise for image classification. However, previous research has overlooked a crucial concern in image recognition: ensuring that models trained on condensed datasets are unbiased towards protected attributes (PA), such as gender and race. Our investigation reveals that dataset distillation fails to alleviate the unfairness towards minority groups within original datasets. Moreover, this bias typically worsens in the condensed datasets due to their smaller size. To bridge the research gap, we propose a novel fair dataset distillation (FDD) framework, namely FairDD, which can be seamlessly applied to diverse matching-based DD approaches (DDs), requiring no modifications to their original architectures. The key innovation of FairDD lies in synchronously matching synthetic datasets to PA-wise groups of original datasets, rather than indiscriminate alignment to the whole distributions in vanilla DDs, dominated by majority groups. This synchronized matching allows synthetic datasets to avoid collapsing into majority groups and bootstrap their balanced generation to all PA groups. Consequently, FairDD could effectively regularize vanilla DDs to favor biased generation toward minority groups while maintaining the accuracy of target attributes. Theoretical analyses and extensive experimental evaluations demonstrate that FairDD significantly improves fairness compared to vanilla DDs, with a promising trade-off between fairness and accuracy. Its consistent superiority across diverse DDs, spanning Distribution and Gradient Matching, establishes it as a versatile FDD approach. Code is available at https://github.com/zqhang/FairDD.

[531] Beyond [cls]: Exploring the true potential of Masked Image Modeling representations

Marcin Przewięźlikowski, Randall Balestriero, Wojciech Jasiński, Marek Śmieja, Bartosz Zieliński

Main category: cs.CV

TL;DR: Masked Image Modeling (MIM) has poor out-of-the-box performance due to uniform attention distribution that makes [cls] token aggregation ineffective. The paper proposes Selective Aggregation to better utilize patch tokens and improve MIM performance without fine-tuning.

Details

Motivation: MIM's practical use is limited because its out-of-the-box performance is inferior to competing SSL approaches, and most users cannot afford fine-tuning due to data requirements, GPU consumption, and specialized knowledge.

Method: The authors propose Selective Aggregation to better capture semantic information in patch tokens, addressing the issue that attention in MIMs is spread uniformly over many patches, leading to ineffective [cls] token aggregation.

Result: Selective Aggregation significantly improves the out-of-the-box performance of MIM models.

Conclusion: The poor out-of-the-box performance of MIMs is not due to weaker features but rather suboptimal usage, specifically ineffective attention aggregation, which can be addressed through Selective Aggregation.

Abstract: Masked Image Modeling (MIM) has emerged as a promising approach for Self-Supervised Learning (SSL) of visual representations. However, the out-of-the-box performance of MIMs is typically inferior to competing approaches. Most users cannot afford fine-tuning due to the need for large amounts of data, high GPU consumption, and specialized user knowledge. Therefore, the practical use of MIM representations is limited. In this paper we ask what is the reason for the poor out-of-the-box performance of MIMs. Is it due to weaker features produced by MIM models, or is it due to suboptimal usage? Through detailed analysis, we show that attention in MIMs is spread almost uniformly over many patches, leading to ineffective aggregation by the [cls] token. Based on this insight, we propose Selective Aggregation to better capture the rich semantic information retained in patch tokens, which significantly improves the out-of-the-box performance of MIM.

[532] TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training

Felix Krause, Timy Phan, Ming Gui, Stefan Andreas Baumann, Vincent Tao Hu, Björn Ommer

Main category: cs.CV

TL;DR: TREAD improves diffusion model training efficiency and performance by using routes to transport tokens from early to deeper layers, achieving 14x convergence speedup and better FID scores without architectural changes.

Details

Motivation: Diffusion models suffer from high training costs and sample inefficiency. Existing efficiency methods come with tradeoffs between performance and computational cost.

Method: Uses routes as transport mechanisms for randomly selected tokens from early layers to deeper layers, applicable to both transformer-based and state-space models without architectural modifications.

Result: 14x convergence speedup at 400K iterations vs DiT, 37x vs DiT’s best performance at 7M iterations. Achieved FID of 2.09 (guided) and 3.93 (unguided) on ImageNet-256, improving upon DiT.

Conclusion: TREAD simultaneously improves training efficiency and generative performance without architectural changes or additional parameters, making diffusion model training more accessible.

Abstract: Diffusion models have emerged as the mainstream approach for visual generation. However, these models typically suffer from sample inefficiency and high training costs. Consequently, methods for efficient finetuning, inference and personalization were quickly adopted by the community. However, training these models in the first place remains very costly. While several recent approaches - including masking, distillation, and architectural modifications - have been proposed to improve training efficiency, each of these methods comes with a tradeoff: they achieve enhanced performance at the expense of increased computational cost or vice versa. In contrast, this work aims to improve training efficiency as well as generative performance at the same time through routes that act as a transport mechanism for randomly selected tokens from early layers to deeper layers of the model. Our method is not limited to the common transformer-based model - it can also be applied to state-space models and achieves this without architectural modifications or additional parameters. Finally, we show that TREAD reduces computational cost and simultaneously boosts model performance on the standard ImageNet-256 benchmark in class-conditional synthesis. Both of these benefits multiply to a convergence speedup of 14x at 400K training iterations compared to DiT and 37x compared to the best benchmark performance of DiT at 7M training iterations. Furthermore, we achieve a competitive FID of 2.09 in a guided and 3.93 in an unguided setting, which improves upon the DiT, without architectural changes.

[533] CULTURE3D: A Large-Scale and Diverse Dataset of Cultural Landmarks and Terrains for Gaussian-Based Scene Rendering

Xinyi Zheng, Steve Zhang, Weizhe Lin, Aaron Zhang, Walterio W. Mayol-Cuevas, Yunze Liu, Junxiao Shen

Main category: cs.CV

TL;DR: A new extra-large 3D reconstruction dataset with 10 billion points from 41,006 drone images covering 20 culturally significant scenes worldwide, enabling fine-grained 3D applications and benchmarking large-scale Gaussian Splatting methods.

Details

Motivation: Current 3D reconstruction models lack sufficiently large-scale and detailed datasets for building extra-large outdoor scenes, limiting their capabilities in fine-grained applications.

Method: Created dataset using 41,006 drone-captured high-resolution aerial images from 20 diverse worldwide locations, providing accurate spatial layouts and comprehensive structural information in COLMAP format.

Result: Dataset offers significantly larger scale and higher detail than existing datasets, supporting detailed 3D reconstruction tasks and establishing benchmarks for large-scale Gaussian Splatting evaluation.

Conclusion: The dataset’s flexibility supports innovations and model plug-ins, paving the way for future 3D breakthroughs, with all datasets and code being open-sourced for community use.

Abstract: Current state-of-the-art 3D reconstruction models face limitations in building extra-large scale outdoor scenes, primarily due to the lack of sufficiently large-scale and detailed datasets. In this paper, we present a extra-large fine-grained dataset with 10 billion points composed of 41,006 drone-captured high-resolution aerial images, covering 20 diverse and culturally significant scenes from worldwide locations such as Cambridge Uni main buildings, the Pyramids, and the Forbidden City Palace. Compared to existing datasets, ours offers significantly larger scale and higher detail, uniquely suited for fine-grained 3D applications. Each scene contains an accurate spatial layout and comprehensive structural information, supporting detailed 3D reconstruction tasks. By reconstructing environments using these detailed images, our dataset supports multiple applications, including outputs in the widely adopted COLMAP format, establishing a novel benchmark for evaluating state-of-the-art large-scale Gaussian Splatting methods.The dataset’s flexibility encourages innovations and supports model plug-ins, paving the way for future 3D breakthroughs. All datasets and code will be open-sourced for community use.

[534] Concept Steerers: Leveraging K-Sparse Autoencoders for Test-Time Controllable Generations

Dahye Kim, Deepti Ghadiyaram

Main category: cs.CV

TL;DR: A novel framework using k-sparse autoencoders for efficient, interpretable concept manipulation in diffusion models without retraining, improving unsafe concept removal by 20.01% and being ~5x faster than SOTA.

Details

Motivation: Text-to-image models are vulnerable to adversarial attacks and generate unsafe content. Existing methods require fine-tuning, are computationally expensive, lack scalability, and compromise generation quality.

Method: Use k-sparse autoencoders to identify interpretable monosemantic concepts in text embedding latent space, enabling precise steering of generation away from or towards concepts during test time without model retraining.

Result: Achieves 20.01% improvement in unsafe concept removal, effective style manipulation, ~5x faster than state-of-the-art, maintains generation quality, and robust against adversarial prompts.

Conclusion: The proposed framework provides simple, efficient, interpretable concept manipulation for diffusion models without compromising quality or requiring retraining, offering significant improvements over existing methods.

Abstract: Despite the remarkable progress in text-to-image generative models, they are prone to adversarial attacks and inadvertently generate unsafe, unethical content. Existing approaches often rely on fine-tuning models to remove specific concepts, which is computationally expensive, lacks scalability, and/or compromises generation quality. In this work, we propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings and leverage them to precisely steer the generation away or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style) – all during test time. Through extensive experiments, we demonstrate that our approach is very simple, requires no retraining of the base model nor LoRA adapters, does not compromise the generation quality, and is robust to adversarial prompt manipulations. Our method yields an improvement of $\mathbf{20.01%}$ in unsafe concept removal, is effective in style manipulation, and is $\mathbf{\sim5}$x faster than the current state-of-the-art. Code is available at: https://github.com/kim-dahye/steerers

[535] GDO:Gradual Domain Osmosis

Zixi Wang, Yubo Huang

Main category: cs.CV

TL;DR: Proposes Gradual Domain Osmosis method for smooth knowledge migration in Gradual Domain Adaptation by dynamically balancing source/target domain losses using hyperparameter λ, achieving better cross-domain generalization than existing methods.

Details

Motivation: Traditional GDA methods face challenges of inefficient knowledge migration and missing intermediate domain data. Need for smoother domain transition and better knowledge transfer.

Method: Optimization framework with dynamic λ parameter (0→1) balancing source/target domain losses. Uses self-training for pseudo-labels and weighted loss minimization for stable progressive adaptation.

Result: Outperforms baseline methods on rotated MNIST, color-shifted MNIST, portrait dataset, and forest cover type dataset. Ablation studies confirm advantages of progressive domain penetration.

Conclusion: Provides theoretical support and practical framework for asymptotic domain adaptation, expanding application potential in dynamic environments through progressive domain penetration strategy.

Abstract: In this paper, we propose a new method called Gradual Domain Osmosis, which aims to solve the problem of smooth knowledge migration from source domain to target domain in Gradual Domain Adaptation (GDA). Traditional Gradual Domain Adaptation methods mitigate domain bias by introducing intermediate domains and self-training strategies, but often face the challenges of inefficient knowledge migration or missing data in intermediate domains. In this paper, we design an optimisation framework based on the hyperparameter $\lambda$ by dynamically balancing the loss weights of the source and target domains, which enables the model to progressively adjust the strength of knowledge migration ($\lambda$ incrementing from 0 to 1) during the training process, thus achieving cross-domain generalisation more efficiently. Specifically, the method incorporates self-training to generate pseudo-labels and iteratively updates the model by minimising a weighted loss function to ensure stability and robustness during progressive adaptation in the intermediate domain. The experimental part validates the effectiveness of the method on rotated MNIST, colour-shifted MNIST, portrait dataset and forest cover type dataset, and the results show that it outperforms existing baseline methods. The paper further analyses the impact of the dynamic tuning strategy of the hyperparameter $\lambda$ on the performance through ablation experiments, confirming the advantages of progressive domain penetration in mitigating the domain bias and enhancing the model generalisation capability. The study provides a theoretical support and practical framework for asymptotic domain adaptation and expands its application potential in dynamic environments.

[536] Generating Multi-Image Synthetic Data for Text-to-Image Customization

Nupur Kumari, Xi Yin, Jun-Yan Zhu, Ishan Misra, Samaneh Azadi

Main category: cs.CV

TL;DR: Proposes an encoder-based text-to-image customization method using synthetic 3D data and shared attention mechanism with improved inference normalization.

Details

Motivation: Existing customization methods either require expensive test-time optimization or train on single-image datasets without multi-image supervision, limiting image quality.

Method: Create Synthetic Customization Dataset (SynCD) using text-to-image models and 3D data, train encoder with shared attention mechanism, and use inference normalization for text/image guidance vectors.

Result: Improves upon existing encoder-based methods on standard customization benchmarks.

Conclusion: The proposed approach with synthetic dataset, shared attention, and inference normalization effectively enhances text-to-image customization quality.

Abstract: Customization of text-to-image models enables users to insert new concepts or objects and generate them in unseen settings. Existing methods either rely on comparatively expensive test-time optimization or train encoders on single-image datasets without multi-image supervision, which can limit image quality. We propose a simple approach to address these challenges. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. Using this dataset, we train an encoder-based model that incorporates fine-grained visual details from reference images via a shared attention mechanism. Finally, we propose an inference technique that normalizes text and image guidance vectors to mitigate overexposure issues in sampled images. Through extensive experiments, we show that our encoder-based model, trained on SynCD, and with the proposed inference algorithm, improves upon existing encoder-based methods on standard customization benchmarks.

[537] Contrastive Representation Distillation via Multi-Scale Feature Decoupling

Cuipeng Wang, Haipeng Wang

Main category: cs.CV

TL;DR: MSDCRD is a knowledge distillation framework that decouples global features into multi-scale local features and uses contrastive learning for efficient distillation without external memory buffers.

Details

Motivation: Previous feature-based distillation methods focus on global feature alignment but neglect local semantic decoupling, causing semantic confusion. Traditional contrastive distillation is inefficient due to large memory buffer requirements.

Method: Proposes MSDCRD framework that systematically decouples global features into multi-scale local features and uses sample-wise and feature-wise contrastive losses for efficient single-batch distillation.

Result: MSDCRD achieves superior performance in both homogeneous teacher-student settings and heterogeneous architectures with significant feature discrepancies.

Conclusion: The framework demonstrates strong generalization capability by effectively addressing semantic confusion and inefficiency issues in knowledge distillation.

Abstract: Knowledge distillation enhances the performance of compact student networks by transferring knowledge from more powerful teacher networks without introducing additional parameters. In the feature space, local regions within an individual global feature encode distinct yet interdependent semantic information. Previous feature-based distillation methods mainly emphasize global feature alignment while neglecting the decoupling of local regions within an individual global feature, which often results in semantic confusion and suboptimal performance. Moreover, conventional contrastive representation distillation suffers from low efficiency due to its reliance on a large memory buffer to store feature samples. To address these limitations, this work proposes MSDCRD, a model-agnostic distillation framework that systematically decouples global features into multi-scale local features and leverages the resulting semantically rich feature samples with tailored sample-wise and feature-wise contrastive losses. This design enables efficient distillation using only a single batch, eliminating the dependence on external memory. Extensive experiments demonstrate that MSDCRD achieves superior performance not only in homogeneous teacher-student settings but also in heterogeneous architectures where feature discrepancies are more pronounced, highlighting its strong generalization capability.

[538] FCVSR: A Frequency-aware Method for Compressed Video Super-Resolution

Qiang Zhu, Fan Zhang, Feiyu Chen, Shuyuan Zhu, David Bull, Bing Zeng

Main category: cs.CV

TL;DR: FCVSR is a compressed video super-resolution model that uses frequency domain processing with motion-guided alignment and multi-frequency refinement to improve SR performance.

Details

Motivation: Existing compressed video SR methods don't adequately differentiate frequency subbands spatially or capture temporal frequency dynamics, leading to suboptimal results.

Method: Proposes FCVSR with motion-guided adaptive alignment network and multi-frequency feature refinement module, trained with frequency-aware contrastive loss.

Result: Achieves up to 0.14dB PSNR gain over second-best model on three public compressed video SR datasets with good complexity.

Conclusion: FCVSR demonstrates effectiveness in compressed video super-resolution through frequency domain processing and achieves state-of-the-art performance.

Abstract: Compressed video super-resolution (SR) aims to generate high-resolution (HR) videos from the corresponding low-resolution (LR) compressed videos. Recently, some compressed video SR methods attempt to exploit the spatio-temporal information in the frequency domain, showing great promise in super-resolution performance. However, these methods do not differentiate various frequency subbands spatially or capture the temporal frequency dynamics, potentially leading to suboptimal results. In this paper, we propose a deep frequency-based compressed video SR model (FCVSR) consisting of a motion-guided adaptive alignment (MGAA) network and a multi-frequency feature refinement (MFFR) module. Additionally, a frequency-aware contrastive loss is proposed for training FCVSR, in order to reconstruct finer spatial details. The proposed model has been evaluated on three public compressed video super-resolution datasets, with results demonstrating its effectiveness when compared to existing works in terms of super-resolution performance (up to a 0.14dB gain in PSNR over the second-best model) and complexity.

[539] MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification

Anh-Tien Nguyen, Duy Minh Ho Nguyen, Nghiem Tuong Diep, Trung Quoc Nguyen, Nhat Ho, Jacqueline Michelle Metsch, Miriam Cindy Maurer, Daniel Sonntag, Hanibal Bohnenberger, Anne-Christin Hauschild

Main category: cs.CV

TL;DR: A prompt learning method for few-shot pathology image classification using large vision-language models with multi-granular attention and optimal transport-based visual-text distance.

Details

Motivation: Address challenges in whole slide pathology image classification due to gigapixel sizes and limited annotations, improving model generalization for few-shot learning.

Method: Extend Prov-GigaPath vision model into vision-language model using adaptors and contrastive learning, then fine-tune with learnable prompts using multi-granular attention and optimal transport-based visual-text distance.

Result: Empirical experiments on lung, kidney, and breast pathology modalities show superior performance over latest competitors and consistent improvements across CLIP, PLIP, and Prov-GigaPath integrated PLIP architectures.

Conclusion: The proposed approach effectively enhances few-shot pathology classification by capturing both fine-grained details and broader context through multi-granular attention and robust visual-text alignment.

Abstract: Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and fine-tunes with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model’s ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP.

[540] OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation

Ding Zhong, Xu Zheng, Chenfei Liao, Yuanhuiyi Lyu, Jialei Chen, Shengyang Wu, Linfeng Zhang, Xuming Hu

Main category: cs.CV

TL;DR: OmniSAM adapts SAM2 for panoramic semantic segmentation by treating panoramic patches as video sequences and using SAM2’s memory mechanism to handle distortion and lack of semantic understanding in 360° images.

Details

Motivation: SAM2 performs well on pinhole images but struggles with panoramic images due to large FoV differences causing distortion and lack of semantic understanding.

Method: Divides panoramas into patch sequences, treats them like video frames, uses SAM2’s memory mechanism for cross-patch correspondence, fine-tunes encoder for semantic prediction, and adds FoV-based prototypical adaptation with dynamic pseudo labels.

Result: Outperforms state-of-the-art methods: 79.06% (+10.22%) on SPin8-to-SPan8, 62.46% (+6.58%) on CS13-to-DP13.

Conclusion: OmniSAM successfully adapts SAM2 for panoramic semantic segmentation, bridging the domain gap through patch sequence processing and memory mechanisms.

Abstract: Segment Anything Model 2 (SAM2) has emerged as a strong base model in various pinhole imaging segmentation tasks. However, when applying it to $360^\circ$ domain, the significant field-of-view (FoV) gap between pinhole ($70^\circ \times 70^\circ$) and panoramic images ($180^\circ \times 360^\circ$) poses unique challenges. Two major concerns for this application includes 1) inevitable distortion and object deformation brought by the large FoV disparity between domains; 2) the lack of pixel-level semantic understanding that the original SAM2 cannot provide. To address these issues, we propose a novel OmniSAM framework, which makes the first attempt to apply SAM2 for panoramic semantic segmentation. Specifically, to bridge the first gap, OmniSAM first divides the panorama into sequences of patches. These patches are then treated as image sequences in similar manners as in video segmentation tasks. We then leverage the SAM2’s memory mechanism to extract cross-patch correspondences that embeds the cross-FoV dependencies, improving feature continuity and the prediction consistency along mask boundaries. For the second gap, OmniSAM fine-tunes the pretrained image encoder and reutilize the mask decoder for semantic prediction. An FoV-based prototypical adaptation module with dynamic pseudo label update mechanism is also introduced to facilitate the alignment of memory and backbone features, thereby improving model generalization ability across different sizes of source models. Extensive experimental results demonstrate that OmniSAM outperforms the state-of-the-art methods by large margins, e.g., 79.06% (+10.22%) on SPin8-to-SPan8, 62.46% (+6.58%) on CS13-to-DP13.

[541] Evaluation of Deformable Image Registration under Alignment-Regularity Trade-off

Vasiliki Sideri-Lampretsa, Daniel Rueckert, Huaqi Qiu

Main category: cs.CV

TL;DR: The paper proposes a new evaluation scheme using Alignment-Regularity Characteristic (ARC) curves to continuously capture the trade-off between alignment accuracy and deformation regularity in deformable image registration, addressing limitations of existing evaluation practices.

Details

Motivation: Existing DIR evaluation methods inadequately address or overlook the inherent trade-off between alignment accuracy and deformation regularity, leading to incomplete assessment of registration methods.

Method: Introduces ARC curves that describe registration performance as a spectrum under various regularity degrees, and uses a HyperNetwork approach to continuously interpolate across the full regularization range for efficient ARC curve construction.

Result: ARC curves reveal unique insights not evident from existing evaluation practices, demonstrated through experiments on various deep learning DIR methods with different architectures and transformation models.

Conclusion: Provides guidelines for nuanced model evaluation and selection using the proposed ARC curve evaluation scheme, benefiting both practitioners and registration researchers.

Abstract: Evaluating deformable image registration (DIR) is challenging due to the inherent trade-off between achieving high alignment accuracy and maintaining deformation regularity. However, most existing DIR works either address this trade-off inadequately or overlook it altogether. In this paper, we highlight the issues with existing practices and propose an evaluation scheme that captures the trade-off continuously to holistically evaluate DIR methods. We first introduce the alignment regularity characteristic (ARC) curves, which describe the performance of a given registration method as a spectrum under various degrees of regularity. We demonstrate that the ARC curves reveal unique insights that are not evident from existing evaluation practices, using experiments on representative deep learning DIR methods with various network architectures and transformation models. We further adopt a HyperNetwork based approach that learns to continuously interpolate across the full regularization range, accelerating the construction and improving the sample density of ARC curves. Finally, we provide general guidelines for a nuanced model evaluation and selection using our evaluation scheme for both practitioners and registration researchers.

Qiang Zhu, Yuxuan Jiang, Shuyuan Zhu, Fan Zhang, David Bull, Bing Zeng

Main category: cs.CV

TL;DR: BVSR-IK proposes a blind video super-resolution model using implicit neural representations for spatio-temporal varying kernels and a recurrent Transformer for accurate filtering, achieving state-of-the-art performance.

Details

Motivation: Existing BVSR methods use spatially invariant blur kernels that don't account for spatio-temporal varying degradations in videos, leading to suboptimal performance.

Method: Constructs multi-scale kernel dictionary parameterized by implicit neural representations and employs a recurrent Transformer to predict coefficient weights for filtering in frame correction and feature alignment.

Result: Outperforms four state-of-the-art BVSR models on three datasets, beating the second best approach (FMA-Net) by up to 0.59 dB in PSNR.

Conclusion: BVSR-IK effectively handles spatio-temporal varying degradations in videos through implicit kernel representations and recurrent Transformer architecture, demonstrating superior performance over existing methods.

Abstract: Blind video super-resolution (BVSR) is a low-level vision task which aims to generate high-resolution videos from low-resolution counterparts in unknown degradation scenarios. Existing approaches typically predict blur kernels that are spatially invariant in each video frame or even the entire video. These methods do not consider potential spatio-temporal varying degradations in videos, resulting in suboptimal BVSR performance. In this context, we propose a novel BVSR model based on Implicit Kernels, BVSR-IK, which constructs a multi-scale kernel dictionary parameterized by implicit neural representations. It also employs a newly designed recurrent Transformer to predict the coefficient weights for accurate filtering in both frame correction and feature alignment. Experimental results have demonstrated the effectiveness of the proposed BVSR-IK, when compared with four state-of-the-art BVSR models on three commonly used datasets, with BVSR-IK outperforming the second best approach, FMA-Net, by up to 0.59 dB in PSNR. Source code will be available at https://github.com/QZ1-boy/BVSR-IK.

[543] Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning

Wenyi Lian, Patrick Micke, Joakim Lindblad, Nataša Sladoje

Main category: cs.CV

TL;DR: IC-ViT is a pretraining framework for multi-channel imaging data that patchifies image channels individually, enabling effective processing of multimodal data and achieving 4-14% performance improvement over existing approaches.

Details

Motivation: Vision Transformers struggle with multi-channel imaging data where different modalities can obscure complementary information, limiting their application in medical and remote sensing domains.

Method: Proposes Isolated Channel ViT (IC-ViT) that patchifies image channels individually, allowing pretraining on single channels and fine-tuning on multi-channel datasets to capture dependencies between patches and channels.

Result: Achieves 4-14 percentage points performance improvement over existing channel-adaptive approaches on benchmarks including JUMP-CP, CHAMMI for cell microscopy, and So2Sat-LCZ42 for satellite imaging.

Conclusion: IC-ViT provides an effective pretraining framework for multi-channel imaging data, enabling robust feature representation and efficient training for foundation models on heterogeneous data.

Abstract: Vision Transformers (ViTs) have achieved remarkable success in standard RGB image processing tasks. However, applying ViTs to multi-channel imaging (MCI) data, e.g., for medical and remote sensing applications, remains a challenge. In particular, MCI data often consist of layers acquired from different modalities. Directly training ViTs on such data can obscure complementary information and impair the performance. In this paper, we introduce a simple yet effective pretraining framework for large-scale MCI datasets. Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks. We show that this channel-wise patchifying is a key technique for MCI processing. More importantly, one can pretrain the IC-ViT on single channels and finetune it on downstream multi-channel datasets. This pretraining framework captures dependencies between patches as well as channels and produces robust feature representation. Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging, and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers 4-14 percentage points of performance improvement over existing channel-adaptive approaches. Further, its efficient training makes it a suitable candidate for large-scale pretraining of foundation models on heterogeneous data. Our code is available at https://github.com/shermanlian/IC-ViT.

[544] A Comprehensive Survey on Knowledge Distillation

Amir M. Mansourian, Rozhan Ahmadi, Masoud Ghafouri, Amir Mohammad Babaei, Elaheh Badali Golezani, Zeynab Yasamani Ghamchi, Vida Ramezanian, Alireza Taherian, Kimia Dinashi, Amirali Miri, Shohreh Kasaei

Main category: cs.CV

TL;DR: A comprehensive survey of knowledge distillation methods that categorizes and investigates recent approaches across multiple dimensions including distillation sources, schemes, algorithms, modalities, and applications, with special focus on emerging areas like diffusion models, 3D inputs, foundational models, transformers, and LLMs.

Details

Motivation: The motivation is to address the deployment challenges of large DNNs, transformers, and foundation models on edge devices due to high runtime and memory consumption, with knowledge distillation being a key technique to transfer knowledge from cumbersome teacher models to lightweight student models.

Method: The survey reviews knowledge distillation from multiple aspects: distillation sources, distillation schemes, distillation algorithms, distillation by modalities, applications, and comparisons among existing methods. It provides a new categorization framework and investigates recent methods systematically.

Result: The survey provides a comprehensive overview of knowledge distillation methods with a novel representation structure, covering emerging subcategories such as KD for diffusion models, 3D inputs, foundational models, transformers, and LLMs that are not adequately covered in previous surveys.

Conclusion: The work presents an up-to-date and comprehensive survey of knowledge distillation with new perspectives, identifies existing challenges in the field, and discusses potential future research directions to advance the state of knowledge distillation techniques.

Abstract: Deep Neural Networks (DNNs) have achieved notable performance in the fields of computer vision and natural language processing with various applications in both academia and industry. However, with recent advancements in DNNs and transformer models with a tremendous number of parameters, deploying these large models on edge devices causes serious issues such as high runtime and memory consumption. This is especially concerning with the recent large-scale foundation models, Vision-Language Models (VLMs), and Large Language Models (LLMs). Knowledge Distillation (KD) is one of the prominent techniques proposed to address the aforementioned problems using a teacher-student architecture. More specifically, a lightweight student model is trained using additional knowledge from a cumbersome teacher model. In this work, a comprehensive survey of knowledge distillation methods is proposed. This includes reviewing KD from different aspects: distillation sources, distillation schemes, distillation algorithms, distillation by modalities, applications of distillation, and comparison among existing methods. In contrast to most existing surveys, which are either outdated or simply update former surveys, this work proposes a comprehensive survey with a new point of view and representation structure that categorizes and investigates the most recent methods in knowledge distillation. This survey considers various critically important subcategories, including KD for diffusion models, 3D inputs, foundational models, transformers, and LLMs. Furthermore, existing challenges in KD and possible future research directions are discussed. Github page of the project: https://github.com/IPL-Sharif/KD_Survey

[545] Free-Lunch Color-Texture Disentanglement for Stylized Image Generation

Jiang Qin, Senmao Li, Alexandra Gomez-Villa, Shiqi Yang, Yaxing Wang, Kai Wang, Joost van de Weijer

Main category: cs.CV

TL;DR: A tuning-free approach for disentangling color and texture in stylized text-to-image generation, achieving independent control over style attributes without model fine-tuning.

Details

Motivation: Current diffusion-based methods struggle with fine-grained style customization and controlling multiple style attributes like color and texture independently.

Method: Leverages CLIP image embedding space’s additivity property to extract Color-Texture Embeddings, applies whitening and coloring transformation for color consistency, and introduces noise term to prevent texture loss during Regularized Whitening and Coloring Transformation.

Result: SADis surpasses state-of-the-art stylization methods both qualitatively and quantitatively on WikiArt and StyleDrop datasets for Disentangled Stylized Image Generation task.

Conclusion: The proposed SADis approach provides a more precise and customizable solution for stylized image generation with effective color-texture disentanglement.

Abstract: Recent advances in Text-to-Image (T2I) diffusion models have transformed image generation, enabling significant progress in stylized generation using only a few style reference images. However, current diffusion-based methods struggle with fine-grained style customization due to challenges in controlling multiple style attributes, such as color and texture. This paper introduces the first tuning-free approach to achieve free-lunch color-texture disentanglement in stylized T2I generation, addressing the need for independently controlled style elements for the Disentangled Stylized Image Generation (DisIG) problem. Our approach leverages the Image-Prompt Additivity property in the CLIP image embedding space to develop techniques for separating and extracting Color-Texture Embeddings (CTE) from individual color and texture reference images. To ensure that the color palette of the generated image aligns closely with the color reference, we apply a whitening and coloring transformation to enhance color consistency. Additionally, to prevent texture loss due to the signal-leak bias inherent in diffusion training, we introduce a noise term that preserves textural fidelity during the Regularized Whitening and Coloring Transformation (RegWCT). Through these methods, our Style Attributes Disentanglement approach (SADis) delivers a more precise and customizable solution for stylized image generation. Experiments on images from the WikiArt and StyleDrop datasets demonstrate that, both qualitatively and quantitatively, SADis surpasses state-of-the-art stylization methods in the DisIG task.Code is released at https://deepffff.github.io/sadis.github.io/.

[546] Surface-Aware Distilled 3D Semantic Features

Lukas Uzolas, Elmar Eisemann, Petr Kellnhofer

Main category: cs.CV

TL;DR: This paper introduces a self-supervised method for learning surface-aware 3D embeddings that address semantic ambiguities in correspondence matching between 3D shapes, enabling robust mapping across entire shape families without pairwise optimization.

Details

Motivation: Existing semantic features from pre-trained vision models struggle to differentiate instances of the same semantic class (e.g., 'left hand' vs 'right hand'), leading to substantial mapping errors in 3D correspondence tasks.

Method: The approach uses a contrastive loss that preserves semantic content from foundational models while disambiguating features located far apart on the shape’s surface. It requires only a small number of unpaired training meshes and learns a joint embedding space for entire shape families.

Result: The method achieves superior performance in correspondence matching benchmarks and enables various downstream applications including 2D-to-3D and 3D-to-3D texture transfer, in-part segmentation, pose alignment, and motion transfer in low-data regimes.

Conclusion: Unlike previous pairwise approaches, this solution constructs a joint embedding space where both seen and unseen 3D shapes are implicitly aligned without further optimization, providing a more efficient and robust framework for 3D correspondence tasks.

Abstract: Many 3D tasks such as pose alignment, animation, motion transfer, and 3D reconstruction rely on establishing correspondences between 3D shapes. This challenge has recently been approached by pairwise matching of semantic features from pre-trained vision models. However, despite their power, these features struggle to differentiate instances of the same semantic class such as left hand'' versus right hand’’ which leads to substantial mapping errors. To solve this, we learn a surface-aware embedding space that is robust to these ambiguities while facilitating shared mapping for an entire family of 3D shapes. Importantly, our approach is self-supervised and requires only a small number of unpaired training meshes to infer features for new possibly imperfect 3D shapes at test time. We achieve this by introducing a contrastive loss that preserves the semantic content of the features distilled from foundational models while disambiguating features located far apart on the shape’s surface. We observe superior performance in correspondence matching benchmarks and enable downstream applications including 2D-to-3D and 3D-to-3D texture transfer, in-part segmentation, pose alignment, and motion transfer in low-data regimes. Unlike previous pairwise approaches, our solution constructs a joint embedding space, where both seen and unseen 3D shapes are implicitly aligned without further optimization. The code is available at https://graphics.tudelft.nl/SurfaceAware3DFeatures.

[547] Learning to Instruct for Visual Instruction Tuning

Zhihan Zhou, Feng Hong, Jiaan Luo, Jiangchao Yao, Dongsheng Li, Bo Han, Ya Zhang, Yanfeng Wang

Main category: cs.CV

TL;DR: L2T improves visual instruction tuning by incorporating loss function into both instruction and response sequences, preventing overfitting and shortcut learning while enhancing multimodal capabilities without extra data or computational cost.

Details

Motivation: Current visual instruction tuning methods cause overfitting and shortcut learning by overemphasizing instruction-following while neglecting proactive visual understanding, degrading multimodal LLM performance.

Method: L2T incorporates the loss function into both instruction and response sequences, expanding training data and regularizing MLLMs to prevent over-reliance on language priors.

Result: Achieves up to 9% relative improvement on multimodal benchmarks, 18% improvement in captioning performance, and reduces hallucination in MLLMs without additional training data or computational overhead.

Conclusion: L2T provides an effective approach to enhance multimodal capabilities by addressing overfitting in visual instruction tuning, achieving significant performance gains with minimal computational requirements.

Abstract: We propose L2T, an advancement of visual instruction tuning (VIT). While VIT equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, the current design choices for VIT often result in overfitting and shortcut learning, potentially degrading performance. This gap arises from an overemphasis on instruction-following abilities, while neglecting the proactive understanding of visual information. Inspired by this, L2T adopts a simple yet effective approach by incorporating the loss function into both the instruction and response sequences. It seamlessly expands the training data, and regularizes the MLLMs from overly relying on language priors. Based on this merit, L2T achieves a significant relative improvement of up to 9% on comprehensive multimodal benchmarks, requiring no additional training data and incurring negligible computational overhead. Surprisingly, L2T attains exceptional fundamental visual capabilities, yielding up to an 18% improvement in captioning performance, while simultaneously alleviating hallucination in MLLMs. Github code: https://github.com/Feng-Hong/L2T.

[548] SAVeD: Learning to Denoise Low-SNR Video for Improved Downstream Performance

Suzanne Stathatos, Michael Hobley, Pietro Perona, Markus Marks

Main category: cs.CV

TL;DR: SAVeD is a self-supervised method that denoises low-SNR videos using only noisy data, enhancing foreground visibility and reducing background/camera noise without clean reference videos.

Details

Motivation: Low signal-to-noise ratio videos from sensors like sonar, ultrasound, and microscopy pose challenges for computer vision models, especially when paired clean imagery is unavailable.

Method: Leverages distinctions between foreground and background motion, exaggerates objects with stronger motion signal, and includes architectural optimizations for faster throughput, training, and inference.

Result: Achieves state-of-the-art results for classification, detection, tracking, and counting tasks with fewer training resource requirements than existing deep-learning-based denoising methods.

Conclusion: SAVeD provides an effective self-supervised solution for denoising low-SNR videos without requiring clean reference data, with improved efficiency and performance across multiple computer vision tasks.

Abstract: Low signal-to-noise ratio videos – such as those from underwater sonar, ultrasound, and microscopy – pose significant challenges for computer vision models, particularly when paired clean imagery is unavailable. We present Spatiotemporal Augmentations and denoising in Video for Downstream Tasks (SAVeD), a novel self-supervised method that denoises low-SNR sensor videos using only raw noisy data. By leveraging distinctions between foreground and background motion and exaggerating objects with stronger motion signal, SAVeD enhances foreground object visibility and reduces background and camera noise without requiring clean video. SAVeD has a set of architectural optimizations that lead to faster throughput, training, and inference than existing deep learning methods. We also introduce a new denoising metric, FBD, which indicates foreground-background divergence for detection datasets without requiring clean imagery. Our approach achieves state-of-the-art results for classification, detection, tracking, and counting tasks, and it does so with fewer training resource requirements than existing deep-learning-based denoising methods. Project page: https://suzanne-stathatos.github.io/SAVeD Code page: https://github.com/suzanne-stathatos/SAVeD

[549] VideoAds for Fast-Paced Video Understanding

Zheyuan Zhang, Monica Dou, Linkai Peng, Hongyi Pan, Ulas Bagci, Boqing Gong

Main category: cs.CV

TL;DR: VideoAds is the first dataset for benchmarking MLLMs on advertisement videos, featuring complex temporal structures and manually annotated questions across visual finding, video summary, and visual reasoning tasks.

Details

Motivation: Advertisement videos are purpose-driven with complex narratives and rapid scene transitions, posing significant challenges to MLLMs that current benchmarks don't adequately address.

Method: Created VideoAds dataset with well-curated advertisement videos having complex temporal structures, accompanied by manually annotated diverse questions across three core tasks. Proposed quantitative measure for video complexity comparison.

Result: Qwen2.5-VL-72B achieved 73.35% accuracy, outperforming GPT-4o (66.82%) and Gemini-1.5 Pro (69.66%). Human experts achieved 94.27% accuracy. Proprietary models lagged in video summarization and reasoning but performed best in visual finding.

Conclusion: VideoAds serves as a pivotal benchmark for advancing MLLMs’ temporal modeling capabilities, highlighting the need for better understanding of complex videos requiring high FPS sampling.

Abstract: Advertisement videos serve as a rich and valuable source of purpose-driven information, encompassing high-quality visual, textual, and contextual cues designed to engage viewers. They are often more complex than general videos of similar duration due to their structured narratives and rapid scene transitions, posing significant challenges to multi-modal large language models (MLLMs). In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises well-curated advertisement videos with complex temporal structures, accompanied by \textbf{manually} annotated diverse questions across three core tasks: visual finding, video summary, and visual reasoning. We propose a quantitative measure to compare VideoAds against existing benchmarks in terms of video complexity. Through extensive experiments, we find that Qwen2.5-VL-72B, an opensource MLLM, achieves 73.35% accuracy on VideoAds, outperforming GPT-4o (66.82%) and Gemini-1.5 Pro (69.66%); the two proprietary models especially fall behind the opensource model in video summarization and reasoning, but perform the best in visual finding. Notably, human experts easily achieve a remarkable accuracy of 94.27%. These results underscore the necessity of advancing MLLMs’ temporal modeling capabilities and highlight VideoAds as a potentially pivotal benchmark for future research in understanding video that requires high FPS sampling. The dataset and evaluation code will be publicly available at https://videoadsbenchmark.netlify.app.

[550] BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

Shengao Wang, Arjun Chandra, Aoming Liu, Venkatesh Saligrama, Boqing Gong

Main category: cs.CV

TL;DR: BabyVLM is a developmentally-inspired framework with evaluation benchmarks and synthetic training data that enables efficient vision-language learning, outperforming models trained only on infant data or general data of similar size.

Details

Motivation: To address the limitations of existing evaluation benchmarks being too simplistic, narrowly scoped, or misaligned with developmentally-inspired training, and to overcome the issue of training exclusively on infant data which overlooks broader learning inputs.

Method: Proposed BabyVLM framework with comprehensive in-domain evaluation benchmarks and a synthetic training dataset created via child-directed transformations of existing datasets.

Result: VLMs trained with the synthetic dataset achieve superior performance on BabyVLM tasks compared to models trained solely on SAYCam or general-purpose data of SAYCam size.

Conclusion: BabyVLM provides a robust, developmentally-aligned evaluation tool and demonstrates that compact models trained on carefully curated data can generalize effectively, enabling data-efficient vision-language learning.

Abstract: Human infants rapidly develop visual reasoning skills from minimal input, suggesting that developmentally inspired pretraining could significantly enhance the efficiency of vision-language models (VLMs). Although recent efforts have leveraged infant-inspired datasets like SAYCam, existing evaluation benchmarks remain misaligned–they are either too simplistic, narrowly scoped, or tailored for large-scale pretrained models. Additionally, training exclusively on infant data overlooks the broader, diverse input from which infants naturally learn. To address these limitations, we propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset created via child-directed transformations of existing datasets. We demonstrate that VLMs trained with our synthetic dataset achieve superior performance on BabyVLM tasks compared to models trained solely on SAYCam or general-purpose data of the SAYCam size. BabyVLM thus provides a robust, developmentally aligned evaluation tool and illustrates how compact models trained on carefully curated data can generalize effectively, opening pathways toward data-efficient vision-language learning paradigms.

[551] DDFusion:Degradation-Decoupled Fusion Framework for Robust Infrared and Visible Images Fusion

Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui, Yuxin Jing

Main category: cs.CV

TL;DR: DDFusion is a framework that addresses real-world degradations in infrared and visible image fusion by decoupling degradation suppression from fusion in a unified model, achieving superior performance in both clean and degraded conditions.

Details

Motivation: Conventional infrared and visible image fusion methods assume high-quality inputs and neglect real-world degradations like low-light and noise, limiting their practical applicability.

Method: Proposes a Degradation-Decoupled Fusion framework with two components: Degradation-Decoupled Optimization Network for degradation-specific decomposition and component-specific extraction, and Interactive Local-Global Fusion Network for multi-scale feature aggregation.

Result: Extensive experiments demonstrate that DDFusion achieves superior fusion performance under both clean and degraded conditions compared to conventional methods.

Conclusion: The proposed DDFusion framework effectively addresses real-world degradations in infrared and visible image fusion through degradation decoupling and unified modeling of degradation suppression and image fusion.

Abstract: Conventional infrared and visible image fusion(IVIF) methods often assume high-quality inputs, neglecting real-world degradations such as low-light and noise, which limits their practical applicability. To address this, we propose a Degradation-Decoupled Fusion(DDFusion) framework, which achieves degradation decoupling and jointly models degradation suppression and image fusion in a unified manner. Specifically, the Degradation-Decoupled Optimization Network(DDON) performs degradation-specific decomposition to decouple inter-degradation and degradation-information components, followed by component-specific extraction paths for effective suppression of degradation and enhancement of informative features. The Interactive Local-Global Fusion Network (ILGFN) aggregates complementary features across multi-scale pathways and alleviates performance degradation caused by the decoupling between degradation optimization and image fusion. Extensive experiments demonstrate that DDFusion achieves superior fusion performance under both clean and degraded conditions. Our code is available at https://github.com/Lmmh058/DDFusion.

[552] LSP-ST: Ladder Shape-Biased Side-Tuning for Robust Infrared Small Target Detection

Guoyi Zhang, Siyang Chen, Guangsheng Xu, Han Wang, Donghe Wang, Xiaohu Zhang

Main category: cs.CV

TL;DR: LSP-ST introduces shape bias to SAM for infrared small target detection, overcoming texture bias through hierarchical structural learning with minimal parameters.

Details

Motivation: Fine-tuning SAM for infrared small target detection faces domain shifts and texture bias limitations in foundation models, requiring shape-aware adaptation.

Method: Proposes Ladder Shape-Biased Side-Tuning (LSP-ST) with Shape-Enhanced Large-Kernel Attention Module to hierarchically capture global structural information without handcrafted guidance.

Result: Achieves state-of-the-art performance on infrared small target detection benchmarks with only 4.72M parameters, and shows strong generalization across multiple detection tasks.

Conclusion: LSP-ST’s shape bias complements texture-based reasoning rather than competing with it, enabling robust adaptation while maintaining performance on texture-driven tasks.

Abstract: Fine-tuning the Segment Anything Model (SAM) for infrared small target detection poses significant challenges due to severe domain shifts. Existing adaptation methods often incorporate handcrafted priors to bridge this gap, yet such designs limit generalization and scalability. We identify a fundamental texture bias in foundation models, which overly depend on local texture cues for target localization. To address this, we propose Ladder Shape-Biased Side-Tuning (LSP-ST), a novel approach that introduces a shape-aware inductive bias to facilitate effective adaptation beyond texture cues. In contrast to prior work that injects explicit edge or contour features, LSP-ST models shape as a global structural prior, integrating both boundaries and internal layouts. We design a Shape-Enhanced Large-Kernel Attention Module to hierarchically and implicitly capture structural information in a fully differentiable manner, without task-specific handcrafted guidance. A theoretical analysis grounded in matched filtering and backpropagation reveals the mechanism by which the proposed attention improves structure-aware learning. With only 4.72M learnable parameters, LSP-ST achieves state-of-the-art performance on multiple infrared small target detection benchmarks. Furthermore, its strong generalization is validated across tasks such as mirror detection, shadow detection, and camouflaged object detection, while maintaining stable performance on texture-driven tasks like salient object detection, demonstrating that the introduced shape bias complements rather than competes with texture-based reasoning.

[553] Motion-Enhanced Nonlocal Similarity Implicit Neural Representation for Infrared Dim and Small Target Detection

Pei Liu, Yisi Luo, Wenzhen Wang, Xiangyong Cao

Main category: cs.CV

TL;DR: A motion-enhanced nonlocal similarity implicit neural representation framework for infrared dim and small target detection that integrates optical flow motion estimation and tensor decomposition-based INR to capture dynamic backgrounds and spatial-temporal correlations.

Details

Motivation: Traditional low-rank plus sparse models fail to capture dynamic backgrounds and global spatial-temporal correlations in infrared dim target detection, leading to background leakage or target loss.

Method: Integrates motion estimation via optical flow for subtle target movements, uses multi-frame fusion for motion saliency enhancement, leverages nonlocal similarity to construct patch tensors, and proposes tensor decomposition-based INR model with alternating direction method of multipliers optimization.

Result: The approach robustly separates dim targets from complex infrared backgrounds and outperforms state-of-the-art methods in detection accuracy and robustness.

Conclusion: The proposed motion-enhanced nonlocal similarity INR framework effectively addresses the challenges of infrared dim and small target detection by capturing dynamic backgrounds and spatial-temporal correlations through continuous neural representations.

Abstract: Infrared dim and small target detection presents a significant challenge due to dynamic multi-frame scenarios and weak target signatures in the infrared modality. Traditional low-rank plus sparse models often fail to capture dynamic backgrounds and global spatial-temporal correlations, which results in background leakage or target loss. In this paper, we propose a novel motion-enhanced nonlocal similarity implicit neural representation (INR) framework to address these challenges. We first integrate motion estimation via optical flow to capture subtle target movements, and propose multi-frame fusion to enhance motion saliency. Second, we leverage nonlocal similarity to construct patch tensors with strong low-rank properties, and propose an innovative tensor decomposition-based INR model to represent the nonlocal patch tensor, effectively encoding both the nonlocal low-rankness and spatial-temporal correlations of background through continuous neural representations. An alternating direction method of multipliers is developed for the nonlocal INR model, which enjoys theoretical fixed-point convergence. Experimental results show that our approach robustly separates dim targets from complex infrared backgrounds, outperforming state-of-the-art methods in detection accuracy and robustness.

[554] ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos

Trinh T. L. Vuong, Jin Tae Kwak

Main category: cs.CV

TL;DR: ViDRiP-LLaVA is the first large multimodal model for computational pathology that integrates single patch images, automatically segmented pathology video clips, and manually segmented pathology videos to mimic pathologists’ diagnostic process.

Details

Motivation: To create an AI system that closely mirrors the natural diagnostic process of pathologists by integrating multiple image scenarios and bridging visual narratives with diagnostic reasoning.

Method: Uses ViDRiP-Instruct dataset with 4278 video and diagnosis-specific chain-of-thought instructional pairs from YouTube educational videos, transfers knowledge from single-image datasets to train on weakly annotated clips, then fine-tunes on manually segmented videos.

Result: Establishes a new benchmark in pathology video analysis and provides detailed histological descriptions with definitive sign-out diagnoses.

Conclusion: ViDRiP-LLaVA offers a promising foundation for future AI systems supporting clinical decision-making through integrated visual and diagnostic reasoning, with code and data publicly available.

Abstract: We present ViDRiP-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct image scenarios, including single patch images, automatically segmented pathology video clips, and manually segmented pathology videos. This integration closely mirrors the natural diagnostic process of pathologists. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, ViDRiP-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the ViDRiP-Instruct dataset, comprising 4278 video and diagnosis-specific chain-of-thought instructional pairs sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and limited in volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, followed by fine-tuning on manually segmented videos. ViDRiP-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at: https://github.com/QuIIL/ViDRiP-LLaVA.

[555] xTrace: A Facial Expressive Behaviour Analysis Tool for Continuous Affect Recognition

Mani Kumar Tellamekala, Shashank Jaiswal, Thomas Smith, Timur Alamev, Gary McKeown, Anthony Brown, Michel Valstar

Main category: cs.CV

TL;DR: xTrace is a robust tool for real-time facial expressive behavior analysis that predicts continuous valence and arousal values from in-the-wild face videos, addressing data scarcity and feature extraction challenges.

Details

Motivation: To build a robust real-time system for naturalistic facial expressive behavior analysis by overcoming two key challenges: lack of large-scale labeled video datasets with 2D emotion space coverage, and difficulty extracting discriminative, interpretable, and efficient facial features.

Method: Trained on largest facial affect video dataset (~450k videos) covering most emotion zones; uses explainable facial affect descriptors with low computational complexity; benchmarked against MediaPipe, OpenFace, and Augsburg Affect Toolbox.

Result: Achieves 0.86 mean CCC on in-the-wild benchmarking set (~50k videos) and 0.75 mean CCC on SEWA test set, outperforming existing SOTA by ~7.1%.

Conclusion: xTrace provides a versatile, accurate, and computationally efficient solution for real-time facial expressive behavior analysis in naturalistic settings.

Abstract: Recognising expressive behaviours in face videos is a long-standing challenge in Affective Computing. Despite significant advancements in recent years, it still remains a challenge to build a robust and reliable system for naturalistic and in-the-wild facial expressive behaviour analysis in real time. This paper addresses two key challenges in building such a system: (1). The paucity of large-scale labelled facial affect video datasets with extensive coverage of the 2D emotion space, and (2). The difficulty of extracting facial video features that are discriminative, interpretable, robust, and computationally efficient. Toward addressing these challenges, this work introduces xTrace, a robust tool for facial expressive behaviour analysis and predicting continuous values of dimensional emotions, namely valence and arousal, from in-the-wild face videos. To address challenge (1), the proposed affect recognition model is trained on the largest facial affect video data set, containing $\sim$450k videos that cover most emotion zones in the dimensional emotion space, making xTrace highly versatile in analysing a wide spectrum of naturalistic expressive behaviours. To address challenge (2), xTrace uses facial affect descriptors that are not only explainable, but can also achieve a high degree of accuracy and robustness with low computational complexity. The key components of xTrace are benchmarked against three existing tools: MediaPipe, OpenFace, and Augsburg Affect Toolbox. On an in-the-wild benchmarking set composed of $\sim$50k videos, xTrace achieves 0.86 mean Concordance Correlation Coefficient (CCC) and on the SEWA test set it achieves 0.75 mean CCC, outperforming existing SOTA by $\sim$7.1%.

[556] CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models

Shristi Das Biswas, Arani Roy, Kaushik Roy

Main category: cs.CV

TL;DR: CURE is a training-free concept unlearning framework that uses Spectral Eraser to remove unwanted concepts from diffusion models through orthogonal projection in weight space, achieving fast and specific concept suppression without retraining.

Details

Motivation: Existing safety interventions for Text-to-Image models suffer from incomplete concept removal, jail-breaking susceptibility, computational inefficiency, or collateral damage to unrelated capabilities.

Method: Uses Spectral Eraser - a closed-form orthogonal projection module that identifies discriminative subspaces using SVD over token embeddings. Includes Expansion Mechanism for spectral regularization to balance filtering and preservation.

Result: Achieves efficient and thorough removal of targeted concepts (artistic styles, objects, identities, explicit content) in only 2 seconds, with minor damage to original generation ability and enhanced robustness against red-teaming.

Conclusion: CURE provides a fast, interpretable, and highly specific method for concept unlearning in diffusion models without requiring retraining, supervision, or iterative optimization.

Abstract: As Text-to-Image models continue to evolve, so does the risk of generating unsafe, copyrighted, or privacy-violating content. Existing safety interventions - ranging from training data curation and model fine-tuning to inference-time filtering and guidance - often suffer from incomplete concept removal, susceptibility to jail-breaking, computational inefficiency, or collateral damage to unrelated capabilities. In this paper, we introduce CURE, a training-free concept unlearning framework that operates directly in the weight space of pre-trained diffusion models, enabling fast, interpretable, and highly specific suppression of undesired concepts. At the core of our method is the Spectral Eraser, a closed-form, orthogonal projection module that identifies discriminative subspaces using Singular Value Decomposition over token embeddings associated with the concepts to forget and retain. Intuitively, the Spectral Eraser identifies and isolates features unique to the undesired concept while preserving safe attributes. This operator is then applied in a single step update to yield an edited model in which the target concept is effectively unlearned - without retraining, supervision, or iterative optimization. To balance the trade-off between filtering toxicity and preserving unrelated concepts, we further introduce an Expansion Mechanism for spectral regularization which selectively modulates singular vectors based on their relative significance to control the strength of forgetting. All the processes above are in closed-form, guaranteeing extremely efficient erasure in only $2$ seconds. Benchmarking against prior approaches, CURE achieves a more efficient and thorough removal for targeted artistic styles, objects, identities, or explicit content, with minor damage to original generation ability and demonstrates enhanced robustness against red-teaming.

[557] InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition

Yijie Zheng, Weijie Wu, Qingyun Li, Xuehui Wang, Xu Zhou, Aiai Ren, Jun Shen, Long Zhao, Guoqing Li, Xue Yang

Main category: cs.CV

TL;DR: Proposes InstructCDS tasks and EarthInstruct benchmark for instruction-driven object recognition in remote sensing, plus InstructSAM framework that uses vision-language models and SAM2 for training-free object counting, detection, and segmentation.

Details

Motivation: Existing methods rely on explicit category cues and struggle with complex/implicit queries requiring advanced reasoning in remote sensing object recognition.

Method: InstructSAM framework: uses large vision-language models to interpret instructions and estimate counts, SAM2 for mask proposals, and binary integer programming for mask-label assignment using semantic similarity and counting constraints.

Result: Matches or surpasses specialized baselines across tasks, maintains near-constant inference time regardless of object count, reduces output tokens by 89% and runtime by over 32% compared to direct generation approaches.

Conclusion: The proposed tasks, benchmark, and effective approach advance versatile object recognition systems for remote sensing applications.

Abstract: Language-Guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, necessitating models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposal, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches. We believe the contributions of the proposed tasks, benchmark, and effective approach will advance future research in developing versatile object recognition systems.

[558] DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?

Qirui Jiao, Daoyuan Chen, Yilun Huang, Xika Lin, Ying Shen, Yaliang Li

Main category: cs.CV

TL;DR: DetailMaster is the first comprehensive benchmark for evaluating text-to-image models’ ability to handle long, detail-intensive prompts. It reveals that current models achieve only ~50% accuracy in key dimensions and show performance degradation with longer prompts.

Details

Motivation: Current text-to-image models perform poorly with long, detail-intensive prompts required for professional applications, highlighting the need for a dedicated evaluation benchmark.

Method: Created a benchmark with four evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Spatial/Interactive Relationships. Used long prompts averaging 284.89 tokens validated by expert annotators.

Result: Evaluation of 12 T2I models showed critical limitations: ~50% accuracy in attribute binding and spatial reasoning, with all models degrading performance as prompt length increases. Analysis revealed compositional reasoning failures and attribute leakage issues.

Conclusion: Current T2I models have fundamental limitations in handling complex compositional requirements. The benchmark is open-sourced to advance detail-rich text-to-image generation.

Abstract: While recent text-to-image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, their performance significantly degrades when confronted with long, detail-intensive prompts required in professional applications. We present DetailMaster, the first comprehensive benchmark specifically designed to evaluate T2I models’ systematic abilities to handle extended textual inputs that contain complex compositional requirements. Our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Spatial/Interactive Relationships. The benchmark comprises long and detail-rich prompts averaging 284.89 tokens, with high quality validated by expert annotators. Evaluation on 7 general-purpose and 5 long-prompt-optimized T2I models reveals critical performance limitations: state-of-the-art models achieve merely $\sim$50% accuracy in key dimensions like attribute binding and spatial reasoning, while all models showing progressive performance degradation as prompt length increases. Our analysis reveals fundamental limitations in compositional reasoning, demonstrating that current encoders flatten complex grammatical structures and that diffusion models suffer from attribute leakage under detail-intensive conditions. We open-source our dataset, data curation code, and evaluation tools to advance detail-rich T2I generation and enable applications previously hindered by the lack of a dedicated benchmark.

[559] VORTA: Efficient Video Diffusion via Routing Sparse Attention

Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Shunyu Liu, Dacheng Tao

Main category: cs.CV

TL;DR: VORTA is an acceleration framework for video diffusion transformers that uses sparse attention and adaptive routing to achieve 1.76× speedup without quality loss, and up to 14.41× speedup when combined with other methods.

Details

Motivation: Video diffusion transformers are computationally expensive due to quadratic attention complexity over high-dimensional video sequences, and existing acceleration methods struggle with long-range computation.

Method: Proposes VORTA with two components: 1) sparse attention mechanism for efficient long-range dependency capture, and 2) routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants.

Result: Achieves 1.76× end-to-end speedup without quality loss on VBench, and up to 14.41× speedup when integrated with other acceleration methods like model caching and step distillation, with negligible performance degradation.

Conclusion: VORTA demonstrates efficiency and enhances the practicality of video diffusion transformers in real-world settings, with codes and weights publicly available.

Abstract: Video diffusion transformers have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent acceleration methods enhance the efficiency by exploiting the local sparsity of attention scores; yet they often struggle with accelerating the long-range computation. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants. VORTA achieves an end-to-end speedup $1.76\times$ without loss of quality on VBench. Furthermore, it can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching up to speedup $14.41\times$ with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of video diffusion transformers in real-world settings. Codes and weights are available at https://github.com/wenhao728/VORTA.

[560] SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams

Zhuoheng Gao, Yihao Li, Jiyao Zhang, Rui Zhao, Tong Wu, Hao Tang, Zhaofei Yu, Hao Dong, Guozhang Chen, Tiejun Huang

Main category: cs.CV

TL;DR: SpikeStereoNet is the first framework to estimate stereo depth directly from raw spike streams using a recurrent spiking neural network, outperforming existing methods on both synthetic and real-world datasets.

Details

Motivation: Conventional frame-based cameras struggle with stereo depth estimation in rapidly changing scenes, while spike cameras offer microsecond-level resolution but lack specialized stereo algorithms and benchmarks.

Method: Proposes SpikeStereoNet - a brain-inspired framework that fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module.

Result: Outperforms existing methods on both synthetic and real-world datasets, particularly in challenging regions like textureless surfaces and extreme lighting conditions. Shows strong data efficiency with high accuracy even with reduced training data.

Conclusion: The framework successfully leverages spike streams’ ability to capture subtle edges and intensity shifts, providing an effective solution for stereo depth estimation in dynamic scenes.

Abstract: Conventional frame-based cameras often struggle with stereo depth estimation in rapidly changing scenes. In contrast, bio-inspired spike cameras emit asynchronous events at microsecond-level resolution, providing an alternative sensing modality. However, existing methods lack specialized stereo algorithms and benchmarks tailored to the spike data. To address this gap, we propose SpikeStereoNet, a brain-inspired framework and the first to estimate stereo depth directly from raw spike streams. The model fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module. To benchmark our approach, we introduce a large-scale synthetic spike stream dataset and a real-world stereo spike dataset with dense depth annotations. SpikeStereoNet outperforms existing methods on both datasets by leveraging spike streams’ ability to capture subtle edges and intensity shifts in challenging regions such as textureless surfaces and extreme lighting conditions. Furthermore, our framework exhibits strong data efficiency, maintaining high accuracy even with substantially reduced training data. The source code and datasets will be publicly available.

[561] Learning Shared Representations from Unpaired Data

Amitai Yacobi, Nir Ben-Ari, Ronen Talmon, Uri Shaham

Main category: cs.CV

TL;DR: This paper demonstrates that shared multimodal representations can be learned almost exclusively from unpaired data using spectral embeddings of random walk matrices, achieving strong performance in various cross-modal tasks without requiring paired samples.

Details

Motivation: Current multimodal representation learning methods heavily rely on paired samples from each modality, which are significantly harder to obtain than unpaired data. The authors aim to overcome this limitation by developing methods that can learn shared representations from unpaired data.

Method: The approach uses spectral embeddings of random walk matrices constructed independently from each unimodal representation. This allows learning shared representations without requiring paired samples across modalities.

Result: Empirical results in computer vision and natural language processing show the method’s effectiveness in capturing meaningful cross-modal relations, achieving high performance in retrieval tasks, generation, arithmetics, zero-shot, and cross-domain classification.

Conclusion: This is the first work to demonstrate that shared cross-modal representations can be learned almost exclusively from unpaired samples, creating a universal cross-modal embedding that is independent of specific data modalities.

Abstract: Learning shared representations is a primary area of multimodal representation learning. The current approaches to achieve a shared embedding space rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of the random walk matrices constructed independently from each unimodal representation. Empirical results in computer vision and natural language processing domains support its potential, revealing the effectiveness of unpaired data in capturing meaningful cross-modal relations, demonstrating high capabilities in retrieval tasks, generation, arithmetics, zero-shot, and cross-domain classification. This work, to the best of our knowledge, is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that could be viewed as universal, i.e., independent of the specific modalities of the data. Our project page: https://shaham-lab.github.io/SUE_page.

[562] Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles

Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko

Main category: cs.CV

TL;DR: This paper studies rule-based visual reinforcement learning using jigsaw puzzles as a testbed, revealing key insights about MLLM training, generalization, reasoning patterns, and the superiority of RL over supervised fine-tuning.

Details

Motivation: To understand how rule-based RL applies to multimodal LLMs for perception-heavy tasks, using jigsaw puzzles as a structured framework with inherent ground truth and adjustable difficulty.

Method: Used jigsaw puzzles as experimental framework to study MLLM training through fine-tuning, comparing RL and supervised fine-tuning approaches, and analyzing reasoning patterns.

Result: MLLMs achieved near-perfect accuracy on jigsaw puzzles after fine-tuning and generalized to complex configurations; RL showed better generalization than SFT; reasoning patterns appeared pre-existing rather than emergent; training on jigsaw puzzles induced generalization to other visual tasks.

Conclusion: RL exhibits more effective generalization than SFT for visual tasks, complex reasoning patterns are pre-existing in MLLMs, and jigsaw puzzles provide a valuable framework for studying rule-based visual RL in multimodal learning.

Abstract: The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles as a structured experimental framework. Jigsaw puzzles offer inherent ground truth, adjustable difficulty, and demand complex decision-making, making them ideal for this study. Our research reveals several key findings: \textit{Firstly,} we find that MLLMs, initially performing near to random guessing on the simplest jigsaw puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. \textit{Secondly,} training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. \textit{Thirdly,} MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. \textit{Fourthly,} we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. \textit{Finally,} our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and an initial SFT cold start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece of jigsaw to the larger puzzle of collective understanding rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1

[563] Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control

Danfeng li, Hui Zhang, Sheng Wang, Jiacheng Li, Zuxuan Wu

Main category: cs.CV

TL;DR: Seg2Any is a novel segmentation-mask-to-image framework that achieves precise spatial layout control by decoupling mask conditions into semantic and shape components, preventing attribute leakage across entities, and using a large-scale dataset for open-set generation.

Details

Motivation: Existing segmentation-mask-to-image methods fail to simultaneously ensure semantic consistency and shape consistency, and struggle with attribute leakage in multi-entity scenarios.

Method: Decouples segmentation masks into regional semantic conditions (via Semantic Alignment Attention Mask) and high-frequency shape conditions (via Entity Contour Map), and introduces Attribute Isolation Attention Mask to prevent cross-entity attribute leakage.

Result: Achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly excelling in fine-grained spatial and attribute control of entities.

Conclusion: Seg2Any effectively addresses the limitations of existing S2I methods by providing comprehensive solutions for semantic consistency, shape consistency, and attribute isolation in multi-entity generation.

Abstract: Despite recent advances in diffusion models, top-tier text-to-image (T2I) models still struggle to achieve precise spatial layout control, i.e. accurately generating entities with specified attributes and locations. Segmentation-mask-to-image (S2I) generation has emerged as a promising solution by incorporating pixel-level spatial guidance and regional text prompts. However, existing S2I methods fail to simultaneously ensure semantic consistency and shape consistency. To address these challenges, we propose Seg2Any, a novel S2I framework built upon advanced multimodal diffusion transformers (e.g. FLUX). First, to achieve both semantic and shape consistency, we decouple segmentation mask conditions into regional semantic and high-frequency shape components. The regional semantic condition is introduced by a Semantic Alignment Attention Mask, ensuring that generated entities adhere to their assigned text prompts. The high-frequency shape condition, representing entity boundaries, is encoded as an Entity Contour Map and then introduced as an additional modality via multi-modal attention to guide image spatial structure. Second, to prevent attribute leakage across entities in multi-entity scenarios, we introduce an Attribute Isolation Attention Mask mechanism, which constrains each entity’s image tokens to attend exclusively to themselves during image self-attention. To support open-set S2I generation, we construct SACap-1M, a large-scale dataset containing 1 million images with 5.9 million segmented entities and detailed regional captions, along with a SACap-Eval benchmark for comprehensive S2I evaluation. Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities.

[564] SatDreamer360: Multiview-Consistent Generation of Ground-Level Scenes from Satellite Imagery

Xianghui Ze, Beiyi Zhu, Zhenbo Song, Jianfeng Lu, Yujiao Shi

Main category: cs.CV

TL;DR: SatDreamer360 generates multiview-consistent 360° ground-level panoramas from single satellite images using triplane representation and panoramic epipolar-constrained attention.

Details

Motivation: Existing methods struggle with multiview consistency and rely on auxiliary inputs, limiting applications in simulation, autonomous navigation, and digital twin cities.

Method: Uses triplane representation for scene encoding, ray-based pixel attention for viewpoint discrepancy, and panoramic epipolar-constrained attention for cross-frame feature alignment.

Result: Outperforms existing methods in satellite-to-ground alignment and multiview consistency, validated on the new VIGOR++ dataset.

Conclusion: SatDreamer360 effectively addresses the challenge of generating geometrically consistent multi-view ground panoramas from satellite imagery.

Abstract: Generating multiview-consistent $360^\circ$ ground-level scenes from satellite imagery is a challenging task with broad applications in simulation, autonomous navigation, and digital twin cities. Existing approaches primarily focus on synthesizing individual ground-view panoramas, often relying on auxiliary inputs like height maps or handcrafted projections, and struggle to produce multiview consistent sequences. In this paper, we propose SatDreamer360, a framework that generates geometrically consistent multi-view ground-level panoramas from a single satellite image, given a predefined pose trajectory. To address the large viewpoint discrepancy between ground and satellite images, we adopt a triplane representation to encode scene features and design a ray-based pixel attention mechanism that retrieves view-specific features from the triplane. To maintain multi-frame consistency, we introduce a panoramic epipolar-constrained attention module that aligns features across frames based on known relative poses. To support the evaluation, we introduce {VIGOR++}, a large-scale dataset for generating multi-view ground panoramas from a satellite image, by augmenting the original VIGOR dataset with more ground-view images and their pose annotations. Experiments show that SatDreamer360 outperforms existing methods in both satellite-to-ground alignment and multiview consistency.

[565] Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments

Di Wen, Lei Qi, Kunyu Peng, Kailun Yang, Fei Teng, Ao Luo, Jia Fu, Yufan Chen, Ruiping Liu, Yitian Shi, M. Saquib Sarfraz, Rainer Stiefelhagen

Main category: cs.CV

TL;DR: MicroG-4M is the first benchmark dataset for understanding human activities in microgravity, featuring 4,759 video clips, 50 actions, 1,238 captions, and 7,000+ QA pairs to evaluate spatio-temporal and semantic understanding in space environments.

Details

Motivation: Existing video understanding datasets are limited to Earth's gravity conditions, creating a critical gap for real-world vision systems in safety-critical space applications where microgravity alters human motion and visual semantics.

Method: Constructed MicroG-4M dataset from real-world space missions and cinematic simulations, supporting three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering on astronaut activities.

Result: Established baselines using state-of-the-art models and made all data, annotations, and code publicly available at https://github.com/LEI-QI-233/HAR-in-Space.

Conclusion: MicroG-4M addresses the critical gap in domain-robust video understanding for space applications by providing the first comprehensive benchmark for evaluating spatio-temporal and semantic reasoning in microgravity contexts.

Abstract: Despite substantial progress in video understanding, most existing datasets are limited to Earth’s gravitational conditions. However, microgravity alters human motion, interactions, and visual semantics, revealing a critical gap for real-world vision systems. This presents a challenge for domain-robust video understanding in safety-critical space applications. To address this, we introduce MicroG-4M, the first benchmark for spatio-temporal and semantic understanding of human activities in microgravity. Constructed from real-world space missions and cinematic simulations, the dataset includes 4,759 clips covering 50 actions, 1,238 context-rich captions, and over 7,000 question-answer pairs on astronaut activities and scene understanding. MicroG-4M supports three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering, enabling a comprehensive evaluation of both spatial localization and semantic reasoning in microgravity contexts. We establish baselines using state-of-the-art models. All data, annotations, and code are available at https://github.com/LEI-QI-233/HAR-in-Space.

[566] Boosting Adversarial Transferability via Commonality-Oriented Gradient Optimization

Yanting Gao, Yepeng Liu, Junming Liu, Qi Zhang, Hongyun Zhang, Duoqian Miao, Cairong Zhao

Main category: cs.CV

TL;DR: COGO improves adversarial example transferability for Vision Transformers by enhancing common mid-to-low frequency features across models and suppressing individual-specific gradients.

Details

Motivation: Existing adversarial attack methods suffer from weak transferability due to overfitting to surrogate models, failing to leverage shared features among models trained on the same task.

Method: Proposes COGO with two components: Commonality Enhancement (CE) that perturbs mid-to-low frequency regions, and Individuality Suppression (IS) that uses adaptive thresholds to weight gradients based on correlation with model individuality.

Result: Extensive experiments show COGO significantly improves transfer success rates of adversarial attacks, outperforming current state-of-the-art methods.

Conclusion: Enhancing common features shared across models while suppressing individual characteristics is an effective strategy for improving adversarial example transferability in Vision Transformers.

Abstract: Exploring effective and transferable adversarial examples is vital for understanding the characteristics and mechanisms of Vision Transformers (ViTs). However, adversarial examples generated from surrogate models often exhibit weak transferability in black-box settings due to overfitting. Existing methods improve transferability by diversifying perturbation inputs or applying uniform gradient regularization within surrogate models, yet they have not fully leveraged the shared and unique features of surrogate models trained on the same task, leading to suboptimal transfer performance. Therefore, enhancing perturbations of common information shared by surrogate models and suppressing those tied to individual characteristics offers an effective way to improve transferability. Accordingly, we propose a commonality-oriented gradient optimization strategy (COGO) consisting of two components: Commonality Enhancement (CE) and Individuality Suppression (IS). CE perturbs the mid-to-low frequency regions, leveraging the fact that ViTs trained on the same dataset tend to rely more on mid-to-low frequency information for classification. IS employs adaptive thresholds to evaluate the correlation between backpropagated gradients and model individuality, assigning weights to gradients accordingly. Extensive experiments demonstrate that COGO significantly improves the transfer success rates of adversarial attacks, outperforming current state-of-the-art methods.

[567] Revisit What You See: Disclose Language Prior in Vision Tokens for LVLM Decoding

Beomsik Cho, Jaehyung Kim

Main category: cs.CV

TL;DR: ReVisiT is a training-free decoding method that reduces hallucinations in Large Vision-Language Models by referencing vision tokens to guide text generation through context-aware constrained divergence minimization.

Details

Motivation: Vision tokens provide meaningful visual information even during hallucinations, and their semantics can be encoded in textual space under appropriate vocabulary constraints, suggesting untapped potential for improving visual grounding.

Method: Projects vision tokens into text token distribution, dynamically selects most relevant vision token at each decoding step via context-aware constrained divergence minimization, and refines output distribution to incorporate visual semantics.

Result: Consistently enhances visual grounding across five benchmarks on recent LVLMs with minimal computational overhead, achieves competitive or superior results to state-of-the-art decoding baselines while reducing computational cost by up to 2x.

Conclusion: ReVisiT effectively leverages vision token semantics to improve visual grounding in LVLMs through a simple, training-free decoding approach that reduces hallucinations and computational costs.

Abstract: Large Vision-Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how vision information contributes to the model’s decoding process remains under-explored, as reflected in frequent hallucinations. Through a series of analyses, we found that (i) vision tokens provide meaningful visual information even when hallucinations occur, and (ii) their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints. Building on these observations, we propose ReVisiT, a simple training-free decoding method that references vision tokens to guide text generation. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution. Specifically, ReVisiT dynamically selects the most relevant vision token at each decoding step via context-aware constrained divergence minimization, and using its constrained projection to refine the output distribution to better incorporate visual semantics. Across five benchmarks on recent LVLMs, ReVisiT consistently enhances visual grounding with minimal computational overhead, and achieves competitive or superior results to state-of-the-art decoding baselines while reducing computational cost by up to $2\times$.

[568] VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models

Christos Ziakas, Alessandra Russo

Main category: cs.CV

TL;DR: VITA enhances zero-shot value function learning in Vision-Language Models through test-time adaptation and temporal reasoning improvements.

Details

Motivation: Frozen pre-trained representations in VLMs limit generalization and temporal reasoning for zero-shot goal-conditioned value functions.

Method: Uses lightweight adaptation module updated via gradient descent on meta-learned self-supervised loss during inference, with dissimilarity-based sampling to prevent shortcut learning.

Result: Outperforms state-of-the-art zero-shot methods in real-world robotic manipulation tasks and enables effective reward shaping in offline RL.

Conclusion: VITA successfully addresses temporal reasoning limitations and improves generalization in zero-shot value estimation for VLMs.

Abstract: Vision-Language Models (VLMs) show promise as zero-shot goal-conditioned value functions, but their frozen pre-trained representations limit generalization and temporal reasoning. We introduce VITA, a zero-shot value function learning method that enhances both capabilities via test-time adaptation. At inference, a lightweight adaptation module is updated via a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. By updating sequentially over a trajectory, VITA encodes history into its parameters, addressing the temporal reasoning limitations. To mitigate shortcut learning, we propose a dissimilarity-based sampling strategy that selects semantically diverse segments of the trajectory during training. In real-world robotic manipulation tasks, VITA generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art zero-shot method using autoregressive VLMs. Furthermore, we demonstrate that VITA’s zero-shot value estimates can be utilized for reward shaping in offline reinforcement learning, resulting in multi-task policies on the Meta-World benchmark that exceed the performance of those trained with the simulation’s fuzzy-logic dense rewards.

[569] ViFusionTST: Deep Fusion of Time-Series Image Representations from Load Signals for Early Bed-Exit Prediction

Hao Liu, Yu Hu, Rakiba Rayhana, Ling Bai, Zheng Liu

Main category: cs.CV

TL;DR: Early bed-exit intent prediction using a single load cell under bed leg, converting signals to images and using dual-stream Swin Transformer for classification.

Details

Motivation: Bed-related falls are a major injury source in healthcare facilities, and current commercial alarms only trigger after patients have already left bed.

Method: Load signals converted to RGB line plot and three texture maps (recurrence plot, Markov transition field, Gramian angular field). ViFusionTST dual-stream Swin Transformer processes these in parallel with cross-attention fusion.

Result: On 6-month real-world data from 95 beds, achieved accuracy of 0.885 and F1 score of 0.794, outperforming recent 1D and 2D time-series baselines.

Conclusion: Image-based fusion of load-sensor signals is practical and effective for real-time, privacy-preserving fall prevention.

Abstract: Bed-related falls remain a major source of injury in hospitals and long-term care facilities, yet many commercial alarms trigger only after a patient has already left the bed. We show that early bed-exit intent can be predicted using only one low-cost load cell mounted under a bed leg. The resulting load signals are first converted into a compact set of complementary images: an RGB line plot that preserves raw waveforms and three texture maps-recurrence plot, Markov transition field, and Gramian angular field-that expose higher-order dynamics. We introduce ViFusionTST, a dual-stream Swin Transformer that processes the line plot and texture maps in parallel and fuses them through cross-attention to learn data-driven modality weights. To provide a realistic benchmark, we collected six months of continuous data from 95 beds in a long-term-care facility. On this real-world dataset ViFusionTST reaches an accuracy of 0.885 and an F1 score of 0.794, surpassing recent 1D and 2D time-series baselines across F1, recall, accuracy, and AUPRC. The results demonstrate that image-based fusion of load-sensor signals for time series classification is a practical and effective solution for real-time, privacy-preserving fall prevention.

[570] OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

Yuanhao Cai, He Zhang, Xi Chen, Jinbo Xing, Yiwei Hu, Yuqian Zhou, Kai Zhang, Zhifei Zhang, Soo Ye Kim, Tianyu Wang, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille

Main category: cs.CV

TL;DR: OmniVCus is a diffusion Transformer framework that enables multi-subject video customization and instructive editing using control signals like depth and masks, overcoming limitations of existing single-subject methods.

Details

Motivation: Existing methods are limited to single-subject scenarios due to lack of multi-subject training data, and there's insufficient exploration of using control signals (depth, mask, camera, text) for subject editing in customized videos.

Method: Proposes VideoCus-Factory pipeline for multi-subject data construction, IVTM training with image editing data, and OmniVCus framework with Lottery Embedding (for multi-subject inference) and Temporally Aligned Embedding (for control signal guidance).

Result: Significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations, demonstrating superior performance in multi-subject video customization and control.

Conclusion: The proposed method successfully addresses multi-subject video customization challenges and enables effective instructive editing using various control signals, representing a significant advancement in the field.

Abstract: Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem that how to use the signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video is still less explored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: https://caiyuanhao1998.github.io/project/OmniVCus/. Our code will be released at https://github.com/caiyuanhao1998/Open-OmniVCus

[571] LH2Face: Loss function for Hard High-quality Face

Fan Xie, Yang Wang, Yikang Jiao, Zhenyu Yuan, Congxi Chen, Chuanxin Zhao

Main category: cs.CV

TL;DR: LH2Face is a novel loss function that improves face recognition performance on hard high-quality faces by incorporating adaptive margins, vMF-based similarity measures, proxy-based constraints, and face reconstruction optimization.

Details

Motivation: Current face recognition methods using cosine similarity with softmax struggle with hard samples and use uniform margin strategies that don't consider face quality or recognition hardness.

Method: Proposes LH2Face loss function with: 1) vMF distribution-based similarity measure, 2) Uncertainty-Aware Margin Function for adaptive margins, 3) proxy-based loss functions for representation space optimization, and 4) renderer for face reconstruction optimization.

Result: Achieves 49.39% accuracy on IJB-B dataset, surpassing the second-place method by 2.37%. Superior performance on hard high-quality face datasets compared to similar schemes.

Conclusion: LH2Face effectively addresses the limitations of uniform margin strategies by incorporating quality-aware adaptive margins and multi-modal optimization, significantly improving face recognition performance on challenging samples.

Abstract: In current practical face authentication systems, most face recognition (FR) algorithms are based on cosine similarity with softmax classification. Despite its reliable classification performance, this method struggles with hard samples. A popular strategy to improve FR performance is incorporating angular or cosine margins. However, it does not take face quality or recognition hardness into account, simply increasing the margin value and thus causing an overly uniform training strategy. To address this problem, a novel loss function is proposed, named Loss function for Hard High-quality Face (LH2Face). Firstly, a similarity measure based on the von Mises-Fisher (vMF) distribution is stated, specifically focusing on the logarithm of the Probability Density Function (PDF), which represents the distance between a probability distribution and a vector. Then, an adaptive margin-based multi-classification method using softmax, called the Uncertainty-Aware Margin Function, is implemented in the article. Furthermore, proxy-based loss functions are used to apply extra constraints between the proxy and sample to optimize their representation space distribution. Finally, a renderer is constructed that optimizes FR through face reconstruction and vice versa. Our LH2Face is superior to similiar schemes on hard high-quality face datasets, achieving 49.39% accuracy on the IJB-B dataset, which surpasses the second-place method by 2.37%.

[572] DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy

Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Sen Yang, Wenxiao Cai, Yanpeng Sun, Wankou Yang

Main category: cs.CV

TL;DR: DeRIS decomposes Referring Image Segmentation into perception and cognition modules, revealing that cognitive capacity is the main bottleneck. It introduces Loopback Synergy mechanism and data augmentation to improve performance.

Details

Motivation: To systematically analyze fundamental bottlenecks in existing RIS frameworks, which have been underexplored despite focus on vision-language interactions and localization.

Method: Proposes DeRIS framework with perception-cognition decomposition, Loopback Synergy mechanism for module interaction, and non-referent sample conversion data augmentation.

Result: DeRIS demonstrates improved segmentation precision and robust image-text comprehension, with inherent adaptability to non- and multi-referents scenarios without architectural changes.

Conclusion: Cognitive capacity, not perception, is the primary limitation in RIS. DeRIS’s modular approach effectively addresses this bottleneck and enhances general applicability.

Abstract: Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a systematic analysis of the fundamental bottlenecks in existing RIS frameworks remains underexplored. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while simultaneously improving robust image-text comprehension. Additionally, we analyze and introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement in general scenarios. Notably, DeRIS demonstrates inherent adaptability to both non- and multi-referents scenarios without requiring specialized architectural modifications, enhancing its general applicability. The codes and models are available at https://github.com/Dmmm1997/DeRIS.

[573] Investigating VLM Hallucination from a Cognitive Psychology Perspective: A First Step Toward Interpretation with Intriguing Observations

Xiangrui Liu, Man Luo, Agneet Chatterjee, Hua Wei, Chitta Baral, Yezhou Yang

Main category: cs.CV

TL;DR: This paper introduces a psychological taxonomy for VLMs’ hallucinations, identifying cognitive biases like sycophancy and appeal to authority, and proposes AIpsych benchmark to analyze these behaviors across model architectures and sizes.

Details

Motivation: Existing research attributes hallucinations to technical limitations or sycophancy bias, but may have neglected that hallucination behaviors might mirror cognitive biases observed in human psychology.

Method: Design AIpsych benchmark to reveal psychological tendencies in model responses, investigate how model architecture and parameter size influence behavior when responding to strategically manipulated questions, and conduct human subject study.

Result: As model size increases, VLMs exhibit stronger sycophantic tendencies but reduced authority bias, suggesting increasing competence but potential erosion of response integrity. Human study validates hypotheses and highlights behavioral differences between VLMs and humans.

Conclusion: This work provides a new psychological perspective for understanding hallucination in VLMs and highlights the importance of integrating psychological principles into model evaluation.

Abstract: Hallucination is a long-standing problem that has been actively investigated in Vision-Language Models (VLMs). Existing research commonly attributes hallucinations to technical limitations or sycophancy bias, where the latter means the models tend to generate incorrect answers to align with user expectations. However, these explanations primarily focus on technical or externally driven factors, and may have neglected the possibility that hallucination behaviours might mirror cognitive biases observed in human psychology. In this work, we introduce a psychological taxonomy, categorizing VLMs’ cognitive biases that lead to hallucinations, including sycophancy, logical inconsistency, and a newly identified VLMs behaviour: appeal to authority. To systematically analyze these behaviours, we design AIpsych, a scalable benchmark that reveals psychological tendencies in model response patterns. Leveraging this benchmark, we investigate how variations in model architecture and parameter size influence model behaviour when responding to strategically manipulated questions. Our experiments reveal that as model size increases, VLMs exhibit stronger sycophantic tendencies but reduced authority bias, suggesting increasing competence but a potential erosion of response integrity. A human subject study further validates our hypotheses and highlights key behavioural differences between VLMs and human respondents. This work suggests a new perspective for understanding hallucination in VLMs and highlights the importance of integrating psychological principles into model evaluation.

[574] SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations

Yegyu Han, Taegyoon Yoon, Dayeon Woo, Sojeong Kim, Hyung-Sin Kim

Main category: cs.CV

TL;DR: SenseShift6D is the first RGB-D dataset that systematically explores real-world variations in illumination, exposure, gain, and depth-sensor modes for 6D object pose estimation, demonstrating that test-time sensor control significantly improves model robustness without additional training.

Details

Motivation: Current 6D pose estimation datasets are captured under fixed conditions, leaving the impact of real-world variations in illumination, exposure, gain, and depth-sensor modes unexplored. The authors aim to bridge this gap and explore how test-time sensor control can mitigate these variations.

Method: Created SenseShift6D dataset by physically sweeping 13 RGB exposures, 9 RGB gains, auto-exposure, 4 depth-capture modes, and 5 illumination levels for five common household objects, acquiring 166.4k RGB and 16.7k depth images with 1,380 unique sensor-lighting permutations per object pose.

Result: Test-time sensor control yields substantial performance gains: 19.5 percentage point improvement on pretrained generalizable models, enhances robustness where models typically fail, and remains effective even for instance-level pose estimators without additional training.

Conclusion: SenseShift6D shifts object pose evaluation from data-centered to sensor-aware robustness, laying foundation for adaptive, self-tuning perception systems that can operate robustly in uncertain real-world environments without costly training data expansion.

Abstract: Recent advances on 6D object-pose estimation have achieved high performance on representative benchmarks such as LM-O, YCB-V, and T-Less. However, these datasets were captured under fixed illumination and camera settings, leaving the impact of real-world variations in illumination, exposure, gain or depth-sensor mode-and the potential of test-time sensor control to mitigate such variations-largely unexplored. To bridge this gap, we introduce SenseShift6D, the first RGB-D dataset that physically sweeps 13 RGB exposures, 9 RGB gains, auto-exposure, 4 depth-capture modes, and 5 illumination levels. For five common household objects (spray, pringles, tincase, sandwich, and mouse), we acquire 166.4k RGB and 16.7k depth images, which can provide 1,380 unique sensor-lighting permutations per object pose. Experiments with state-of-the-art models on our dataset demonstrate that applying multimodal sensor control at test time yields substantial performance gains, achieving a 19.5 pp improvement on pretrained generalizable models. It also enhances robustness precisely where those models tend to fail. Moreover, even instance-level pose estimators, where train and test set share identical object and background, performance still varies under environmental and sensor change, demonstrating that test-time sensor control remains effective compared to costly expansions in the quantity and diversity of real-world training data, without any additional training. SenseShift6D extends the object pose evaluation paradigm from data-centered to sensor-aware robustness, laying a foundation for adaptive, self-tuning perception systems capable of operating robustly in uncertain real-world environments.

[575] Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval

Haiwen Li, Delong Liu, Zhaohui Hou, Zhicheng Zhao, Fei Su

Main category: cs.CV

TL;DR: Proposes a scalable pipeline for automatic triplet generation using LLMs and text-to-image models, creating a synthetic CIRHS dataset, and introduces Hybrid Contextual Alignment (CoAlign) framework that achieves state-of-the-art performance in both zero-shot and supervised settings.

Details

Motivation: Existing CIR methods rely on costly manual triplet labeling, which limits scalability and zero-shot capability. The authors aim to address this by creating a fully synthetic dataset and developing a more robust retrieval framework.

Method: 1) Automatic triplet generation pipeline using LLM-generated prompts and text-to-image models to create image pairs with identical elements; 2) Hybrid Contextual Alignment (CoAlign) framework that performs global alignment and local reasoning within broader context for robust representation learning.

Result: CoAlign achieves outstanding zero-shot performance on three benchmarks, demonstrating first feasibility of training CIR models on fully synthetic data. Under supervised training, it outperforms all state-of-the-art supervised CIR approaches.

Conclusion: The proposed synthetic dataset generation pipeline and CoAlign framework effectively address scalability and zero-shot capability limitations in CIR, achieving superior performance while reducing reliance on manual labeling.

Abstract: As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. By utilizing the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time the feasibility of training CIR models on a fully synthetic dataset. Furthermore, under supervised training, our method outperforms all the state-of-the-art supervised CIR approaches, validating the effectiveness of our proposed retrieval framework. The code and the CIRHS dataset will be released soon.

[576] Multi-Scale Attention and Gated Shifting for Fine-Grained Event Spotting in Videos

Hao Xu, Sam Wells, Mohamed Reda Bouadjenek, Richard Dazeley, Sunil Aryal

Main category: cs.CV

TL;DR: Proposes MSAGSM, a multi-scale attention gate shift module that enhances temporal receptive field and spatial adaptability for precise event spotting in sports videos, achieving state-of-the-art results with minimal overhead.

Details

Motivation: Existing PES models have limited temporal receptive field and spatial adaptability, which restricts their performance in frame-level recognition of fine-grained actions from single-camera sports footage.

Method: Introduces MSAGSM that enhances GSM with multi-scale temporal shifts and channel grouped spatial attention, enabling efficient modeling of both short and long-term dependencies while focusing on salient regions. Also presents the Table Tennis Australia dataset with over 4,800 annotated events.

Result: Extensive experiments across four PES benchmarks demonstrate consistent performance improvements with minimal overhead, setting new state-of-the-art results.

Conclusion: MSAGSM is an effective lightweight, plug-and-play module that significantly advances precise event spotting capabilities in sports videos.

Abstract: Precise Event Spotting (PES) in sports videos requires frame-level recognition of fine-grained actions from single-camera footage. Existing PES models typically incorporate lightweight temporal modules such as the Gate Shift Module (GSM) or the Gate Shift Fuse to enrich 2D CNN feature extractors with temporal context. However, these modules are limited in both temporal receptive field and spatial adaptability. We propose a Multi-Scale Attention Gate Shift Module (MSAGSM) that enhances GSM with multi-scale temporal shifts and channel grouped spatial attention, enabling efficient modeling of both short and long-term dependencies while focusing on salient regions. MSAGSM is a lightweight, plug-and-play module that integrates seamlessly with diverse 2D backbones. To further advance the field, we introduce the Table Tennis Australia dataset, the first PES benchmark for table tennis containing over 4,800 precisely annotated events. Extensive experiments across four PES benchmarks demonstrate that MSAGSM consistently improves performance with minimal overhead, setting new state-of-the-art results.

[577] Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models

Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel

Main category: cs.CV

TL;DR: Prompt4Trust is a reinforcement learning framework that trains a lightweight LLM to generate context-aware auxiliary prompts for multimodal LLMs, improving confidence calibration and accuracy in medical VQA tasks.

Details

Motivation: MLLMs in healthcare have limitations in prompt sensitivity and generating incorrect responses with high confidence, which is critical for clinical decision-making where confidence must reflect accuracy.

Method: Uses RL to train a lightweight LLM to produce auxiliary prompts that guide downstream MLLMs to generate better calibrated confidence responses, specifically designed for clinical safety.

Result: Achieves state-of-the-art medical VQA performance on PMC-VQA benchmark and shows promising zero-shot generalization to larger MLLMs, improving both calibration and task accuracy.

Conclusion: Demonstrates potential for automated human-aligned prompt engineering to improve MLLM trustworthiness in safety-critical healthcare settings with scalable calibration.

Abstract: Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model’s stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the the trustworthiness of MLLMs in safety critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.

[578] A Lightweight and Robust Framework for Real-Time Colorectal Polyp Detection Using LOF-Based Preprocessing and YOLO-v11n

Saadat Behzadi, Danial Sharifrazi, Bita Mesbahzadeh, Javad Hassannataj Joloudari, Roohallah Alizadehsani

Main category: cs.CV

TL;DR: A lightweight polyp detection framework combining LOF outlier filtering with YOLO-v11n achieves high accuracy (95.83% precision, 91.85% recall) for colorectal cancer screening.

Details

Motivation: Timely and accurate detection of colorectal polyps is crucial for preventing colorectal cancer, a major global cause of mortality. The study aims to develop an efficient framework suitable for real-time clinical applications.

Method: Combines Local Outlier Factor (LOF) algorithm for filtering noisy data with YOLO-v11n deep learning model. Uses 5-fold cross-validation on five public datasets (CVC-ColonDB, CVC-ClinicDB, Kvasir-SEG, ETIS, EndoScene) with converted segmentation masks to detection labels. LOF configured with 30 neighbors and 5% contamination ratio.

Result: Achieved precision of 95.83%, recall of 91.85%, F1-score of 93.48%, mAP@0.5 of 96.48%, and mAP@0.5:0.95 of 77.75%. Outperforms previous YOLO-based methods in both accuracy and efficiency.

Conclusion: The proposed method is well-suited for real-time colonoscopy support in clinical settings, highlighting the importance of data preprocessing and model efficiency in medical imaging AI systems.

Abstract: Objectives: Timely and accurate detection of colorectal polyps plays a crucial role in diagnosing and preventing colorectal cancer, a major cause of mortality worldwide. This study introduces a new, lightweight, and efficient framework for polyp detection that combines the Local Outlier Factor (LOF) algorithm for filtering noisy data with the YOLO-v11n deep learning model. Study design: An experimental study leveraging deep learning and outlier removal techniques across multiple public datasets. Methods: The proposed approach was tested on five diverse and publicly available datasets: CVC-ColonDB, CVC-ClinicDB, Kvasir-SEG, ETIS, and EndoScene. Since these datasets originally lacked bounding box annotations, we converted their segmentation masks into suitable detection labels. To enhance the robustness and generalizability of our model, we apply 5-fold cross-validation and remove anomalous samples using the LOF method configured with 30 neighbors and a contamination ratio of 5%. Cleaned data are then fed into YOLO-v11n, a fast and resource-efficient object detection architecture optimized for real-time applications. We train the model using a combination of modern augmentation strategies to improve detection accuracy under diverse conditions. Results: Our approach significantly improves polyp localization performance, achieving a precision of 95.83%, recall of 91.85%, F1-score of 93.48%, mAP@0.5 of 96.48%, and mAP@0.5:0.95 of 77.75%. Compared to previous YOLO-based methods, our model demonstrates enhanced accuracy and efficiency. Conclusions: These results suggest that the proposed method is well-suited for real-time colonoscopy support in clinical settings. Overall, the study underscores how crucial data preprocessing and model efficiency are when designing effective AI systems for medical imaging.

[579] {S\textsuperscript{2}M\textsuperscript{2}}: Scalable Stereo Matching Model for Reliable Depth Estimation

Junhong Min, Youngpil Jeon, Jimin Kim, Minyong Choi

Main category: cs.CV

TL;DR: S²M² is a global matching architecture for stereo matching that achieves state-of-the-art accuracy and efficiency without cost volume filtering or deep refinement stacks, using multi-resolution transformers and a novel loss function.

Details

Motivation: To create a generalizable stereo matching model that works across varying resolutions and disparity ranges without dataset-specific fine-tuning, overcoming the limitations of iterative local search methods and the computational infeasibility of global matching architectures.

Method: Integrates a multi-resolution transformer for robust long-range correspondence, trained with a novel loss function that concentrates probability on feasible matches, enabling joint estimation of disparity, occlusion, and confidence.

Result: Establishes new state-of-the-art on Middlebury v3 and ETH3D benchmarks, significantly outperforming prior methods in most metrics while reconstructing high-quality details with competitive efficiency.

Conclusion: S²M² resolves the trade-off between generalization capability and computational feasibility in stereo matching, demonstrating that global matching architectures can achieve both high accuracy and efficiency.

Abstract: The pursuit of a generalizable stereo matching model, capable of performing well across varying resolutions and disparity ranges without dataset-specific fine-tuning, has revealed a fundamental trade-off. Iterative local search methods achieve high scores on constrained benchmarks, but their core mechanism inherently limits the global consistency required for true generalization. However, global matching architectures, while theoretically more robust, have historically been rendered infeasible by prohibitive computational and memory costs. We resolve this dilemma with {S\textsuperscript{2}M\textsuperscript{2}}: a global matching architecture that achieves state-of-the-art accuracy and high efficiency without relying on cost volume filtering or deep refinement stacks. Our design integrates a multi-resolution transformer for robust long-range correspondence, trained with a novel loss function that concentrates probability on feasible matches. This approach enables a more robust joint estimation of disparity, occlusion, and confidence. {S\textsuperscript{2}M\textsuperscript{2}} establishes a new state of the art on Middlebury v3 and ETH3D benchmarks, significantly outperforming prior methods in most metrics while reconstructing high-quality details with competitive efficiency.

[580] STAR: A Benchmark for Astronomical Star Fields Super-Resolution

Kuo-Cheng Wu, Guohang Zhuang, Jinyang Huang, Xiang Zhang, Wanli Ouyang, Yan Lu

Main category: cs.CV

TL;DR: STAR is a large-scale astronomical super-resolution dataset with 54,738 flux-consistent star field image pairs, addressing limitations in existing datasets. It introduces a Flux Error metric and a Flux-Invariant Super Resolution model that outperforms state-of-the-art methods by 24.84% on flux consistency.

Details

Motivation: Existing astronomical super-resolution datasets suffer from flux inconsistency, object-crop settings, and insufficient data diversity, which significantly hinder the development of field-level ASR models for astrophysics applications.

Method: Created a large-scale dataset (STAR) with flux-consistent image pairs using Hubble Space Telescope high-resolution observations and a flux-preserving data generation pipeline. Proposed a Flux-Invariant Super Resolution (FISR) model that accurately infers flux-consistent high-resolution images from input photometry.

Result: The FISR model suppresses state-of-the-art methods by 24.84% on a novel flux consistency metric. Extensive experiments demonstrate the effectiveness of the proposed method and the value of the STAR dataset.

Conclusion: STAR dataset and FISR model provide significant advancements for astronomical super-resolution, enabling systematic development of field-level ASR models with improved flux consistency crucial for astrophysics applications.

Abstract: Super-resolution (SR) advances astronomical imaging by enabling cost-effective high-resolution capture, crucial for detecting faraway celestial objects and precise structural analysis. However, existing datasets for astronomical SR (ASR) exhibit three critical limitations: flux inconsistency, object-crop setting, and insufficient data diversity, significantly impeding ASR development. We propose STAR, a large-scale astronomical SR dataset containing 54,738 flux-consistent star field image pairs covering wide celestial regions. These pairs combine Hubble Space Telescope high-resolution observations with physically faithful low-resolution counterparts generated through a flux-preserving data generation pipeline, enabling systematic development of field-level ASR models. To further empower the ASR community, STAR provides a novel Flux Error (FE) to evaluate SR models in physical view. Leveraging this benchmark, we propose a Flux-Invariant Super Resolution (FISR) model that could accurately infer the flux-consistent high-resolution images from input photometry, suppressing several SR state-of-the-art methods by 24.84% on a novel designed flux consistency metric, showing the priority of our method for astrophysics. Extensive experiments demonstrate the effectiveness of our proposed method and the value of our dataset. Code and models are available at https://github.com/GuoCheng12/STAR.

[581] IONext: Unlocking the Next Era of Inertial Odometry

Shanshan Zhang, Qi Zhang, Siyue Wang, Tianshui Wen, Liqin Wu, Ziheng Zhou, Xuemin Hong, Ao Peng, Lingxiang Zheng, Yu Yang

Main category: cs.CV

TL;DR: IONext is a CNN-based inertial odometry model that uses Dual-wing Adaptive Dynamic Mixer (DADM) for multi-scale feature aggregation and Spatio-Temporal Gating Unit (STGU) for temporal modeling, outperforming SOTA methods on multiple datasets.

Details

Motivation: Transformers have limitations in capturing local motion variations and lack inductive biases for inertial odometry, while CNNs with large kernels can expand receptive fields for better global motion perception.

Method: Proposes DADM module to adaptively capture global and local motion features with dynamic weight generation, and STGU for selective temporal feature extraction, forming the IONext backbone.

Result: IONext consistently outperforms SOTA Transformer- and CNN-based methods on six datasets, reducing ATE by 10% and RTE by 12% on RNIN dataset compared to iMOT.

Conclusion: The proposed CNN-based IONext with DADM and STGU modules effectively addresses limitations of both Transformers and existing CNNs in inertial odometry, achieving superior performance through better global-local motion capture and temporal modeling.

Abstract: Researchers have increasingly adopted Transformer-based models for inertial odometry. While Transformers excel at modeling long-range dependencies, their limited sensitivity to local, fine-grained motion variations and lack of inherent inductive biases often hinder localization accuracy and generalization. Recent studies have shown that incorporating large-kernel convolutions and Transformer-inspired architectural designs into CNN can effectively expand the receptive field, thereby improving global motion perception. Motivated by these insights, we propose a novel CNN-based module called the Dual-wing Adaptive Dynamic Mixer (DADM), which adaptively captures both global motion patterns and local, fine-grained motion features from dynamic inputs. This module dynamically generates selective weights based on the input, enabling efficient multi-scale feature aggregation. To further improve temporal modeling, we introduce the Spatio-Temporal Gating Unit (STGU), which selectively extracts representative and task-relevant motion features in the temporal domain. This unit addresses the limitations of temporal modeling observed in existing CNN approaches. Built upon DADM and STGU, we present a new CNN-based inertial odometry backbone, named Next Era of Inertial Odometry (IONext). Extensive experiments on six public datasets demonstrate that IONext consistently outperforms state-of-the-art (SOTA) Transformer- and CNN-based methods. For instance, on the RNIN dataset, IONext reduces the average ATE by 10% and the average RTE by 12% compared to the representative model iMOT.

[582] Vision-Language Cross-Attention for Real-Time Autonomous Driving

Santosh Patapati, Trisanth Srinivasan, Murari Ambati

Main category: cs.CV

TL;DR: XYZ-Drive is a single vision-language model that integrates front-camera frames, overhead maps, and waypoints to output steering and speed for autonomous driving, achieving 95% success rate with efficient single-branch architecture.

Details

Motivation: Most autonomous driving stacks handle geometric accuracy and semantic understanding separately, but XYZ-Drive aims to unify these capabilities in a single model for better navigation in complex environments.

Method: Uses a lightweight goal-centered cross-attention layer that lets waypoint tokens highlight relevant image and map patches, then feeds fused tokens into a partially fine-tuned LLaMA-3.2 11B model.

Result: Achieves 95% success rate and 0.80 SPL on MD-NEX Outdoor-Driving benchmark, surpassing PhysNav-DG by 15% and halving collisions while improving efficiency with single-branch architecture.

Conclusion: Early token-level fusion of intent and map layout enables accurate, transparent, real-time driving, with ablations confirming the complementary roles of vision, waypoint, and map modalities.

Abstract: Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle them separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25m $\times$ 25m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15%. and halving collisions, all while significantly improving efficiency by using only a single branch. Sixteen ablations explain the gains. Removing any modality (vision, waypoint, map) drops success by up to 11%, confirming their complementary roles and rich connections. Replacing goal-centered attention with simple concatenation cuts 3% in performance, showing query-based fusion injects map knowledge more effectively. Keeping the transformer frozen loses 5%, showing the importance of fine-tuning when applying VLMs for specific tasks such as autonomous driving. Coarsening map resolution from 10 cm to 40 cm blurs lane edges and raises crash rate. Overall, these results demonstrate that early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving.

[583] StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding

Haolin Yang, Feilong Tang, Lingxiao Zhao, Xiang An, Ming Hu, Huifa Li, Xinlin Zhuang, Yifan Lu, Xiaofeng Zhang, Abdalla Swikir, Junjun He, Zongyuan Ge, Imran Razzak

Main category: cs.CV

TL;DR: The paper proposes StreamAgent, a method for real-time streaming video understanding that enables proactive decision making by anticipating future task-relevant information through temporal and spatial anticipation.

Details

Motivation: Existing methods for real-time video understanding rely on alternating perception-reaction or asynchronous triggers, lacking task-driven planning and future anticipation, which limits real-time responsiveness and proactive decision making in evolving video streams.

Method: StreamAgent integrates question semantics and historical observations to anticipate temporal intervals and spatial regions with future task-relevant information. It uses a streaming KV-cache memory mechanism with hierarchical memory structure for efficient semantic retrieval and reduced storage overhead.

Result: Extensive experiments on streaming and long video understanding tasks show that the method outperforms existing methods in response accuracy and real-time efficiency.

Conclusion: The proposed StreamAgent demonstrates practical value for real-world streaming scenarios by enabling proactive and goal-driven responses in real-time video understanding.

Abstract: Real-time streaming video understanding in domains such as autonomous driving and intelligent surveillance poses challenges beyond conventional offline video processing, requiring continuous perception, proactive decision making, and responsive interaction based on dynamically evolving visual content. However, existing methods rely on alternating perception-reaction or asynchronous triggers, lacking task-driven planning and future anticipation, which limits their real-time responsiveness and proactive decision making in evolving video streams. To this end, we propose a StreamAgent that anticipates the temporal intervals and spatial regions expected to contain future task-relevant information to enable proactive and goal-driven responses. Specifically, we integrate question semantics and historical observations through prompting the anticipatory agent to anticipate the temporal progression of key events, align current observations with the expected future evidence, and subsequently adjust the perception action (e.g., attending to task-relevant regions or continuously tracking in subsequent frames). To enable efficient inference, we design a streaming KV-cache memory mechanism that constructs a hierarchical memory structure for selective recall of relevant tokens, enabling efficient semantic retrieval while reducing the overhead of storing all tokens in the traditional KV-cache. Extensive experiments on streaming and long video understanding tasks demonstrate that our method outperforms existing methods in response accuracy and real-time efficiency, highlighting its practical value for real-world streaming scenarios.

[584] CLIP-IN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions

Ziteng Wang, Siqi Yang, Limeng Qiao, Lin Ma

Main category: cs.CV

TL;DR: CLIP-IN enhances CLIP’s fine-grained visual understanding through instruction-editing datasets for hard negative contrastive learning and long descriptive captions with rotary positional encodings, improving performance on fine-grained tasks while maintaining zero-shot capabilities.

Details

Motivation: Vision-Language Models like CLIP struggle with detailed, fine-grained visual comprehension despite their success in vision-language alignment.

Method: Two core innovations: 1) Using instruction-editing datasets as hard negative image-text pairs with symmetric contrastive loss, 2) Incorporating long descriptive captions with rotary positional encodings to capture rich semantic context.

Result: Achieves substantial gains on MMVP benchmark and fine-grained visual recognition tasks, maintains robust zero-shot performance, and reduces visual hallucinations when integrated into Multimodal Large Language Models.

Conclusion: Synergizing targeted instruction-based contrastive learning with comprehensive descriptive information significantly elevates fine-grained understanding in Vision-Language Models.

Abstract: Despite the success of Vision-Language Models (VLMs) like CLIP in aligning vision and language, their proficiency in detailed, fine-grained visual comprehension remains a key challenge. We present CLIP-IN, a novel framework that bolsters CLIP’s fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Coupled with a symmetric hard negative contrastive loss, this enables the model to effectively distinguish subtle visual-semantic differences. Secondly, CLIP-IN incorporates long descriptive captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP. Our experiments demonstrate that CLIP-IN achieves substantial gains on the MMVP benchmark and various fine-grained visual recognition tasks, without compromising robust zero-shot performance on broader classification and retrieval tasks. Critically, integrating CLIP-IN’s visual representations into Multimodal Large Language Models significantly reduces visual hallucinations and enhances reasoning abilities. This work underscores the considerable potential of synergizing targeted, instruction-based contrastive learning with comprehensive descriptive information to elevate the fine-grained understanding of VLMs.

[585] NEP: Autoregressive Image Editing via Next Editing Token Prediction

Huimin Wu, Xiaojian Ma, Haozhe Zhao, Yanpeng Zhao, Qing Li

Main category: cs.CV

TL;DR: The paper proposes Next Editing-token Prediction (NEP), an autoregressive image generation approach that selectively regenerates only the regions needing edits based on text instructions, avoiding unnecessary modifications to non-editing areas.

Details

Motivation: Existing text-guided image editing approaches generate entire target images, leading to unnecessary computational costs and bias toward reconstructing non-editing regions, which compromises edit quality.

Method: Formulate image editing as Next Editing-token Prediction using autoregressive image generation, pre-train an any-order autoregressive text-to-image model for any-region editing, and enable zero-shot adaptation to NEP.

Result: Achieves state-of-the-art performance on widely used image editing benchmarks and naturally supports test-time scaling through iterative zero-shot refinement.

Conclusion: NEP provides an efficient and effective approach for text-guided image editing by selectively regenerating only editing regions, outperforming existing methods while reducing computational overhead.

Abstract: Text-guided image editing involves modifying a source image based on a language instruction and, typically, requires changes to only small local regions. However, existing approaches generate the entire target image rather than selectively regenerate only the intended editing areas. This results in (1) unnecessary computational costs and (2) a bias toward reconstructing non-editing regions, which compromises the quality of the intended edits. To resolve these limitations, we propose to formulate image editing as Next Editing-token Prediction (NEP) based on autoregressive image generation, where only regions that need to be edited are regenerated, thus avoiding unintended modification to the non-editing areas. To enable any-region editing, we propose to pre-train an any-order autoregressive text-to-image (T2I) model. Once trained, it is capable of zero-shot image editing and can be easily adapted to NEP for image editing, which achieves a new state-of-the-art on widely used image editing benchmarks. Moreover, our model naturally supports test-time scaling (TTS) through iteratively refining its generation in a zero-shot manner. The project page is: https://nep-bigai.github.io/

[586] FormCoach: Lift Smarter, Not Harder

Xiaoye Zuo, Nikos Athanasiou, Ginger Delmas, Yiming Huang, Xingyu Fu, Lingjie Liu

Main category: cs.CV

TL;DR: FormCoach is an AI-powered fitness coaching system that uses vision-language models to provide real-time form correction through a web interface, with benchmarks showing significant gaps compared to human coaching.

Details

Motivation: Address the lack of expert feedback for at-home fitness enthusiasts by transforming cameras into interactive AI training partners that can spot subtle form errors.

Method: Leverages vision-language models (VLMs) through a web interface, benchmarked on a dataset of 1,700 expert-annotated user-reference video pairs across 22 exercises, with an automated rubric-based evaluation pipeline.

Result: Benchmarks reveal substantial gaps compared to human-level coaching, highlighting challenges in nuanced movement analysis but showing potential for AI-driven fitness coaching.

Conclusion: FormCoach opens a new frontier in embodied AI by framing form correction as a collaborative process between humans and machines, with released dataset and evaluation pipeline to accelerate research.

Abstract: Good form is the difference between strength and strain, yet for the fast-growing community of at-home fitness enthusiasts, expert feedback is often out of reach. FormCoach transforms a simple camera into an always-on, interactive AI training partner, capable of spotting subtle form errors and delivering tailored corrections in real time, leveraging vision-language models (VLMs). We showcase this capability through a web interface and benchmark state-of-the-art VLMs on a dataset of 1,700 expert-annotated user-reference video pairs spanning 22 strength and mobility exercises. To accelerate research in AI-driven coaching, we release both the dataset and an automated, rubric-based evaluation pipeline, enabling standardized comparison across models. Our benchmarks reveal substantial gaps compared to human-level coaching, underscoring both the challenges and opportunities in integrating nuanced, context-aware movement analysis into interactive AI systems. By framing form correction as a collaborative and creative process between humans and machines, FormCoach opens a new frontier in embodied AI.

[587] MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs

Haonan Ge, Yiwei Wang, Ming-Hsuan Yang, Yujun Cai

Main category: cs.CV

TL;DR: MRFD is a training-free decoding method that reduces hallucinations in Large Vision-Language Models by modeling inter-region consistency through cross-attention, region-wise response generation, and JSD-based reliability weighting.

Details

Motivation: LVLMs often produce hallucinations due to limited ability to verify information across different image regions, leading to text inconsistent with visual input.

Method: Uses cross-attention to identify salient regions, generates initial responses for each region, computes reliability weights via Jensen-Shannon Divergence, and performs consistency-aware fusion with region-aware prompts.

Result: Significantly reduces hallucinations and improves response factuality across multiple LVLMs and benchmarks without requiring model updates.

Conclusion: MRFD effectively addresses hallucination issues in LVLMs through inter-region consistency modeling, providing a training-free solution for improved factual grounding.

Abstract: Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations – text that is inconsistent with visual input, due to the limited ability to verify information in different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates initial responses for each, and computes reliability weights based on Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.

[588] Error Propagation Mechanisms and Compensation Strategies for Quantized Diffusion

Songwei Liu, Chao Zeng, Chenqian Yan, Xurui Peng, Xing Wang, Fangmin Chen, Xing Mei

Main category: cs.CV

TL;DR: A theoretical framework and compensation scheme to mitigate quantization error propagation in diffusion models, improving PTQ methods with minimal time overhead.

Details

Motivation: Diffusion models face deployment challenges due to computationally intensive iterative denoising, and post-training quantization suffers from stepwise error accumulation that compromises output fidelity.

Method: Developed a theoretical framework formulating error propagation in diffusion models, derived per-step quantization error propagation equations, and proposed a timestep-aware cumulative error compensation scheme.

Result: Achieved 1.2 PSNR improvement over SVDQuant on SDXL W4A4 with only <0.5% additional time overhead, effectively mitigating error propagation across multiple image datasets.

Conclusion: The proposed compensation strategy significantly enhances existing PTQ methods by addressing the fundamental challenge of quantization error accumulation in diffusion models.

Abstract: Diffusion models have transformed image synthesis by establishing unprecedented quality and creativity benchmarks. Nevertheless, their large-scale deployment faces challenges due to computationally intensive iterative denoising processes. Although post-training quantization (PTQ) provides an effective pathway for accelerating sampling, the iterative nature of diffusion models causes stepwise quantization errors to accumulate progressively during generation, inevitably compromising output fidelity. To address this challenge, we develop a theoretical framework that mathematically formulates error propagation in Diffusion Models (DMs), deriving per-step quantization error propagation equations and establishing the first closed-form solution for cumulative error. Building on this theoretical foundation, we propose a timestep-aware cumulative error compensation scheme. Extensive experiments on multiple image datasets demonstrate that our compensation strategy effectively mitigates error propagation, significantly enhancing existing PTQ methods. Specifically, it achieves a 1.2 PSNR improvement over SVDQuant on SDXL W4A4, while incurring only an additional $<$ 0.5% time overhead.

[589] Lightweight Joint Optimization of General-Purpose Vision-Language Models and Retrievers for RAG-Based Medical Diagnosis

Nir Mazor, Tom Hope

Main category: cs.CV

TL;DR: A multimodal retrieval model jointly optimized with LVLM for medical diagnosis achieves competitive results with lightweight fine-tuning, improving challenging cases over standard RAG, but significant performance gap remains from oracle analysis.

Details

Motivation: To enhance diagnostic accuracy for clinical image interpretation by retrieving relevant visual and textual information from medical literature and hospital records, addressing limitations of standard RAG which doesn't backpropagate LVLM errors to the retriever.

Method: Developed a multimodal retrieval model jointly optimized with LVLM using only general-purpose backbones with lightweight fine-tuning, analyzed different top-retrieved images’ impact on predictions, and conducted oracle analysis.

Result: Achieved competitive results with medically-pretrained models on clinical classification and VQA tasks. Joint retrieval optimization significantly improved challenging cases over standard RAG. Oracle analysis showed correct diagnosis is frequently achievable but large performance gap remains.

Conclusion: While joint optimization improves performance over standard RAG, there is substantial room for improvement as rerankers using frontier LVLMs do not close the gap from oracle performance, indicating need for future methods.

Abstract: Retrieving relevant visual and textual information from medical literature and hospital records can enhance diagnostic accuracy for clinical image interpretation. We develop a multimodal retrieval model jointly optimized with an LVLM for medical diagnosis, unlike standard RAG which doesn’t backpropagate LVLM errors to the retriever. Using only general-purpose backbones with lightweight fine-tuning, our model achieves competitive results with medically-pretrained models on clinical classification and VQA tasks. In a novel analysis, we find that different top-retrieved images often yield different predictions for the same target, and that these cases are challenging for all models, even for non-retrieval models. Our joint retrieval optimization significantly improves these cases over standard RAG. However, oracle analysis reveals that while the correct diagnosis is frequently achievable using one of the top retrieved images, in practice there is a large performance gap from the oracle, and rerankers using frontier LVLMs do not close this gap – leaving ample room for improvement by future methods. Code available at https://github.com/Nirmaz/JOMED.

[590] A Synthetic Dataset for Manometry Recognition in Robotic Applications

Pedro Antonio Rabelo Saraiva, Enzo Ferreira de Souza, Joao Manoel Herrera Pinheiro, Thiago H. Segreto, Ricardo V. Godoy, Marcelo Becker

Main category: cs.CV

TL;DR: A hybrid data synthesis pipeline combining procedural rendering and AI video generation addresses data scarcity in industrial object detection, achieving better performance than real-data-only training with a 1:1 real-synthetic data ratio.

Details

Motivation: Data scarcity and high acquisition costs in hazardous industrial environments like offshore oil platforms limit autonomous inspection system development.

Method: Hybrid data synthesis pipeline using BlenderProc for photorealistic images with domain randomization and NVIDIA’s Cosmos-Predict2 for physically consistent video sequences with temporal variation. YOLO-based detector trained on composite real+synthetic dataset.

Result: YOLO-based detector trained on composite dataset outperformed models trained solely on real images. A 1:1 ratio between real and synthetic samples achieved the highest accuracy.

Conclusion: Synthetic data generation is a viable, cost-effective, and safe strategy for developing reliable perception systems in safety-critical and resource-constrained industrial applications.

Abstract: This paper addresses the challenges of data scarcity and high acquisition costs in training robust object detection models for complex industrial environments, such as offshore oil platforms. Data collection in these hazardous settings often limits the development of autonomous inspection systems. To mitigate this issue, we propose a hybrid data synthesis pipeline that integrates procedural rendering and AI-driven video generation. The approach uses BlenderProc to produce photorealistic images with domain randomization and NVIDIA’s Cosmos-Predict2 to generate physically consistent video sequences with temporal variation. A YOLO-based detector trained on a composite dataset, combining real and synthetic data, outperformed models trained solely on real images. A 1:1 ratio between real and synthetic samples achieved the highest accuracy. The results demonstrate that synthetic data generation is a viable, cost-effective, and safe strategy for developing reliable perception systems in safety-critical and resource-constrained industrial applications.

[591] TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models

Chenghao Liu, Jiachen Zhang, Chengxuan Li, Zhimu Zhou, Shixin Wu, Songfang Huang, Huiling Duan

Main category: cs.CV

TL;DR: TTF is a training-free method that enhances Vision-Language-Action models by integrating temporal information from consecutive frames through selective token fusion, improving robustness to visual noise and task performance.

Details

Motivation: Current VLA models process visual inputs frame-by-frame, discarding valuable temporal information and making them vulnerable to visual noise while ignoring coherence between consecutive frames in manipulation tasks.

Method: Temporal Token Fusion (TTF) uses dual-dimension detection combining grayscale pixel difference analysis with attention-based semantic relevance assessment, employing hard fusion strategies and keyframe anchoring to prevent error accumulation.

Result: Consistent improvements across benchmarks: 4.0 percentage points average on LIBERO (72.4% vs 68.4% baseline), 4.8% relative improvement on SimplerEnv, and 8.7% relative improvement on real robot tasks. Model-agnostic across OpenVLA and VLA-Cache architectures.

Conclusion: TTF demonstrates that selective temporal fusion enhances VLA performance, revealing that Query matrix reuse in attention mechanisms improves rather than compromises performance, suggesting promising directions for computational acceleration while improving task success rates.

Abstract: Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approach that intelligently integrates historical and current visual representations to enhance VLA inference quality. Our method employs dual-dimension detection combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, enabling selective temporal token fusion through hard fusion strategies and keyframe anchoring to prevent error accumulation. Comprehensive experiments across LIBERO, SimplerEnv, and real robot tasks demonstrate consistent improvements: 4.0 percentage points average on LIBERO (72.4% vs 68.4% baseline), cross-environment validation on SimplerEnv (4.8% relative improvement), and 8.7% relative improvement on real robot tasks. Our approach proves model-agnostic, working across OpenVLA and VLA-Cache architectures. Notably, TTF reveals that selective Query matrix reuse in attention mechanisms enhances rather than compromises performance, suggesting promising directions for direct KQV matrix reuse strategies that achieve computational acceleration while improving task success rates.

[592] One More Glance with Sharp Eyes: Rethinking Lightweight Captioning as a Practical Visual Specialist

Junha Song, Yongsik Jo, So Yeon Min, Quanting Xie, Taehwan Kim, Yonatan Bisk, Jaegul Choo

Main category: cs.CV

TL;DR: The paper presents a lightweight image captioning model using a 125M-parameter language model that achieves performance comparable to much larger MLLMs, with a novel Sharp-Eyed Refinement framework to improve caption quality.

Details

Motivation: To enable deployment of image captioning models on local devices by addressing the high computational demands of multimodal LLMs, while maintaining competitive performance.

Method: Built lightweight captioning models with 125M-parameter language model, developed Sharp-Eyed Refinement framework that enhances caption quality by refining coarse descriptions and improving visual grounding through re-examination of informative regions.

Result: The model achieves performance comparable to MLLMs in detailed captioning tasks and outperforms both recent lightweight captioning methods and MLLMs in detailed captioning and long-range video QA tasks.

Conclusion: The proposed lightweight model serves as a strong captioning specialist for on-device applications, with the Sharp-Eyed Refinement framework effectively addressing limitations in attention mechanisms and visual representations.

Abstract: Image captioning is fundamental for applications like video-grounded chatbot systems and navigation robots, yet deploying such models on local devices is challenging due to the high computational demands of multimodal LLMs (MLLMs). To address this, we first build lightweight captioning models using a 125M-parameter language model, 56 times smaller than LLaMA-7B, and evaluate their performance not only on single-sentence but on detailed captioning tasks. We obtain surprising results showing that our model can achieve performance comparable to MLLMs, suggesting its potential to serve as a strong captioning specialist for on-device applications. While promising, our model also exhibits a limitation: like other MLLMs, it suffers from occasional captioning errors. We investigate the underlying causes and observe that the problems stem from ineffective attention mechanisms and limited visual representations. To alleviate them, we develop a novel captioning framework, Sharp-Eyed Refinement, which enhances caption quality by refining coarse descriptions into more precise captions. At its core, DeepLens improves visual grounding by re-examining the informative regions identified in the initial glance. Experimental results demonstrate the superiority of our model over both recent lightweight captioning methods and MLLMs in detailed captioning and even in long-range video QA tasks.

[593] Cryo-RL: automating prostate cancer cryoablation planning with reinforcement learning

Trixia Simangan, Ahmed Nadeem Abbasi, Yipeng Hu, Shaheer U. Saeed

Main category: cs.CV

TL;DR: Cryo-RL uses reinforcement learning to automate cryoablation planning for prostate cancer, achieving performance comparable to human experts with significantly reduced planning time.

Details

Motivation: Current cryoablation planning is manual, expertise-dependent, time-consuming, and leads to treatment variability, limiting scalability.

Method: Models cryoablation planning as a Markov decision process where an agent sequentially selects cryoprobe positions and ice sphere diameters in a simulated environment with clinical constraints.

Result: Achieved over 8 percentage-point Dice improvements compared to automated baselines and matched human expert performance while requiring substantially less planning time.

Conclusion: Reinforcement learning can deliver clinically viable, reproducible, and efficient cryoablation plans for prostate cancer treatment.

Abstract: Cryoablation is a minimally invasive localised treatment for prostate cancer that destroys malignant tissue during de-freezing, while sparing surrounding healthy structures. Its success depends on accurate preoperative planning of cryoprobe placements to fully cover the tumour and avoid critical anatomy. This planning is currently manual, expertise-dependent, and time-consuming, leading to variability in treatment quality and limited scalability. In this work, we introduce Cryo-RL, a reinforcement learning framework that models cryoablation planning as a Markov decision process and learns an optimal policy for cryoprobe placement. Within a simulated environment that models clinical constraints and stochastic intraoperative variability, an agent sequentially selects cryoprobe positions and ice sphere diameters. Guided by a reward function based on tumour coverage, this agent learns a cryoablation strategy that leads to optimal cryoprobe placements without the need for any manually-designed plans. Evaluated on 583 retrospective prostate cancer cases, Cryo-RL achieved over 8 percentage-point Dice improvements compared with the best automated baselines, based on geometric optimisation, and matched human expert performance while requiring substantially less planning time. These results highlight the potential of reinforcement learning to deliver clinically viable, reproducible, and efficient cryoablation plans.

[594] NeMo: Needle in a Montage for Video-Language Understanding

Zi-Yuan Hu, Shuo Liang, Duo Zheng, Yanyang Li, Yeyao Tao, Shijia Huang, Wei Feng, Jia Qin, Jianguang Yu, Jing Huang, Meng Fang, Yin Li, Liwei Wang

Main category: cs.CV

TL;DR: NeMoBench is a new video-language benchmark for evaluating VideoLLMs’ temporal reasoning capabilities through the Needle in a Montage task, featuring 31,378 automatically generated QA pairs from 13,486 videos.

Details

Motivation: Recent advances in VideoLLMs require new evaluation protocols for complex temporal reasoning in video-language understanding, inspired by the needle in a haystack test used for LLMs.

Method: Developed a scalable automated data generation pipeline to synthesize high-quality video question answering data for the Needle in a Montage task, which assesses critical reasoning capabilities including long-context recall and temporal grounding.

Result: Created NeMoBench with 31,378 QA pairs from 13,486 videos of varying durations (seconds to hours). Experiments show the pipeline reliably generates high-quality data and enables continuous updates. Evaluated 20 state-of-the-art models.

Conclusion: NeMoBench provides a comprehensive benchmark for assessing VideoLLMs’ temporal reasoning capabilities, offering extensive evaluation results and insights into model limitations.

Abstract: Recent advances in video large language models (VideoLLMs) call for new evaluation protocols and benchmarks for complex temporal reasoning in video-language understanding. Inspired by the needle in a haystack test widely used by LLMs, we introduce a novel task of Needle in a Montage (NeMo), designed to assess VideoLLMs’ critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable automated data generation pipeline that facilitates high-quality data synthesis. Built upon the proposed pipeline, we present NeMoBench, a video-language benchmark centered on our task. Specifically, our full set of NeMoBench features 31,378 automatically generated question-answer (QA) pairs from 13,486 videos with various durations ranging from seconds to hours. Experiments demonstrate that our pipeline can reliably and automatically generate high-quality evaluation data, enabling NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on our benchmark, providing extensive results and key insights into their capabilities and limitations. Our project page is available at: https://lavi-lab.github.io/NeMoBench.

[595] Evaluating YOLO Architectures: Implications for Real-Time Vehicle Detection in Urban Environments of Bangladesh

Ha Meem Hossain, Pritam Nath, Mahitun Nesa Mahi, Imtiaz Uddin, Ishrat Jahan Eiste, Syed Nasibur Rahman Ratul, Md Naim Uddin Mozumdar, Asif Mohammed Saad, MD Tamim Hossain

Main category: cs.CV

TL;DR: This study evaluates YOLO models on a custom Bangladeshi vehicle dataset, finding YOLOv11x as the best performer but YOLOv8m/YOLOv11m offer better speed-accuracy balance. Significant challenges remain for rare vehicle classes due to dataset imbalances.

Details

Motivation: Vehicle detection systems trained on non-Bangladeshi datasets fail to accurately identify local vehicle types in Bangladesh's unique road environments, creating gaps in autonomous driving technology for developing regions.

Method: Evaluated six YOLO model variants on a custom dataset with 29 distinct Bangladeshi vehicle classes using high-resolution images captured across various roads and manually annotated with YOLO format bounding boxes.

Result: YOLOv11x achieved best performance (63.7% mAP@0.5, 43.8% mAP@0.5:0.95) but slow inference (45.8ms). Medium variants (YOLOv8m, YOLOv11m) offered optimal balance with 62.5%/61.8% mAP@0.5 and 14-15ms inference. Rare vehicle classes showed near-zero accuracy due to dataset imbalances.

Conclusion: The research provides foundation for developing robust object detection systems adapted to Bangladesh traffic conditions, addressing critical needs in autonomous vehicle technology for developing regions where generic-trained models fail.

Abstract: Vehicle detection systems trained on Non-Bangladeshi datasets struggle to accurately identify local vehicle types in Bangladesh’s unique road environments, creating critical gaps in autonomous driving technology for developing regions. This study evaluates six YOLO model variants on a custom dataset featuring 29 distinct vehicle classes, including region-specific vehicles such as Desi Nosimon'', Leguna’’, Battery Rickshaw'', and CNG’’. The dataset comprises high-resolution images (1920x1080) captured across various Bangladeshi roads using mobile phone cameras and manually annotated using LabelImg with YOLO format bounding boxes. Performance evaluation revealed YOLOv11x as the top performer, achieving 63.7% mAP@0.5, 43.8% mAP@0.5:0.95, 61.4% recall, and 61.6% F1-score, though requiring 45.8 milliseconds per image for inference. Medium variants (YOLOv8m, YOLOv11m) struck an optimal balance, delivering robust detection performance with mAP@0.5 values of 62.5% and 61.8% respectively, while maintaining moderate inference times around 14-15 milliseconds. The study identified significant detection challenges for rare vehicle classes, with Construction Vehicles and Desi Nosimons showing near-zero accuracy due to dataset imbalances and insufficient training samples. Confusion matrices revealed frequent misclassifications between visually similar vehicles, particularly Mini Trucks versus Mini Covered Vans. This research provides a foundation for developing robust object detection systems specifically adapted to Bangladesh traffic conditions, addressing critical needs in autonomous vehicle technology advancement for developing regions where conventional generic-trained models fail to perform adequately.

[596] When Language Model Guides Vision: Grounding DINO for Cattle Muzzle Detection

Rabin Dulal, Lihong Zheng, Muhammad Ashad Kabir

Main category: cs.CV

TL;DR: This paper proposes a zero-shot muzzle detection framework using Grounding DINO vision-language model for cattle identification, achieving 76.8% mAP@0.5 without requiring annotated training data.

Details

Motivation: Traditional muzzle detection methods require extensive annotated datasets and are data-dependent, limiting performance on new or unseen cattle. Manual detection is labor-intensive and inconsistent.

Method: Uses Grounding DINO vision-language model with natural language prompts to detect cattle muzzles in a zero-shot manner, eliminating the need for task-specific training or annotated data.

Result: Achieves mean Average Precision (mAP)@0.5 of 76.8%, demonstrating promising performance for muzzle detection without requiring annotated training data.

Conclusion: The framework provides the first annotation-free solution for cattle muzzle detection, offering improved adaptability and ease of deployment in livestock monitoring applications compared to supervised methods.

Abstract: Muzzle patterns are among the most effective biometric traits for cattle identification. Fast and accurate detection of the muzzle region as the region of interest is critical to automatic visual cattle identification.. Earlier approaches relied on manual detection, which is labor-intensive and inconsistent. Recently, automated methods using supervised models like YOLO have become popular for muzzle detection. Although effective, these methods require extensive annotated datasets and tend to be trained data-dependent, limiting their performance on new or unseen cattle. To address these limitations, this study proposes a zero-shot muzzle detection framework based on Grounding DINO, a vision-language model capable of detecting muzzles without any task-specific training or annotated data. This approach leverages natural language prompts to guide detection, enabling scalable and flexible muzzle localization across diverse breeds and environments. Our model achieves a mean Average Precision (mAP)@0.5 of 76.8%, demonstrating promising performance without requiring annotated data. To our knowledge, this is the first research to provide a real-world, industry-oriented, and annotation-free solution for cattle muzzle detection. The framework offers a practical alternative to supervised methods, promising improved adaptability and ease of deployment in livestock monitoring applications.

[597] Class-Invariant Test-Time Augmentation for Domain Generalization

Zhicheng Lin, Xiaolin Wu, Xi Zhang

Main category: cs.CV

TL;DR: CI-TTA is a lightweight test-time augmentation method for domain generalization that generates class-invariant image variants through deformations and aggregates predictions with confidence filtering.

Details

Motivation: Deep models suffer performance degradation under distribution shifts, and existing DG approaches require multi-domain training or intensive test-time adaptation.

Method: Generate multiple class-invariant variants of input images using elastic and grid deformations, then aggregate predictions through confidence-guided filtering to remove unreliable outputs.

Result: Extensive experiments on PACS and Office-Home datasets show consistent gains across different DG algorithms and backbones.

Conclusion: CI-TTA is an effective and general complementary strategy for domain generalization that works with existing DG methods.

Abstract: Deep models often suffer significant performance degradation under distribution shifts. Domain generalization (DG) seeks to mitigate this challenge by enabling models to generalize to unseen domains. Most prior approaches rely on multi-domain training or computationally intensive test-time adaptation. In contrast, we propose a complementary strategy: lightweight test-time augmentation. Specifically, we develop a novel Class-Invariant Test-Time Augmentation (CI-TTA) technique. The idea is to generate multiple variants of each input image through elastic and grid deformations that nevertheless belong to the same class as the original input. Their predictions are aggregated through a confidence-guided filtering scheme that remove unreliable outputs, ensuring the final decision relies on consistent and trustworthy cues. Extensive Experiments on PACS and Office-Home datasets demonstrate consistent gains across different DG algorithms and backbones, highlighting the effectiveness and generality of our approach.

[598] PD-Diag-Net: Clinical-Priors guided Network on Brain MRI for Auxiliary Diagnosis of Parkinson’s Disease

Shuai Shao, Shu Jiang, Shiyuan Zhao, Di Yang, Yan Wang, Yutong Bai, Jianguo Zhang, Jiangtao Wang

Main category: cs.CV

TL;DR: PD-Diag-Net is an automated Parkinson’s disease diagnostic method that uses MRI scans with clinical prior knowledge to achieve high accuracy in early-stage diagnosis.

Details

Motivation: Current PD diagnosis is complex, relies heavily on neurologist expertise, and often delays early detection, missing timely intervention opportunities.

Method: End-to-end framework with MRI preprocessing, brain-region-relevance prior, brain-region-aging prior, and dedicated modules for feature aggregation and diagnosis using brain age gaps as constraints.

Result: Achieves 86% accuracy on external tests and over 96% accuracy in early-stage diagnosis, outperforming existing methods by more than 20%.

Conclusion: PD-Diag-Net provides an effective automated solution for PD diagnosis with high accuracy and clinical interpretability.

Abstract: Parkinson’s disease (PD) is a common neurodegenerative disorder that severely diminishes patients’ quality of life. Its global prevalence has increased markedly in recent decades. Current diagnostic workflows are complex and heavily reliant on neurologists’ expertise, often resulting in delays in early detection and missed opportunities for timely intervention. To address these issues, we propose an end-to-end automated diagnostic method for PD, termed PD-Diag-Net, which performs risk assessment and auxiliary diagnosis directly from raw MRI scans. This framework first introduces an MRI Pre-processing Module (MRI-Processor) to mitigate inter-subject and inter-scanner variability by flexibly integrating established medical imaging preprocessing tools. It then incorporates two forms of clinical prior knowledge: (1) Brain-Region-Relevance-Prior (Relevance-Prior), which specifies brain regions strongly associated with PD; and (2) Brain-Region-Aging-Prior (Aging-Prior), which reflects the accelerated aging typically observed in PD-associated regions. Building on these priors, we design two dedicated modules: the Relevance-Prior Guided Feature Aggregation Module (Aggregator), which guides the model to focus on PD-associated regions at the inter-subject level, and the Age-Prior Guided Diagnosis Module (Diagnoser), which leverages brain age gaps as auxiliary constraints at the intra-subject level to enhance diagnostic accuracy and clinical interpretability. Furthermore, we collected external test data from our collaborating hospital. Experimental results show that PD-Diag-Net achieves 86% accuracy on external tests and over 96% accuracy in early-stage diagnosis, outperforming existing advanced methods by more than 20%.

[599] LaMoGen: Laban Movement-Guided Diffusion for Text-to-Motion Generation

Heechang Kim, Gwanghyun Kim, Se Young Chun

Main category: cs.CV

TL;DR: The paper proposes a zero-shot inference-time optimization method that integrates Laban movement analysis into text-to-motion diffusion models to achieve fine-grained expressive motion control without additional training data.

Details

Motivation: Current text-to-motion synthesis lacks fine-grained expressive control due to limited motion style diversity in datasets and difficulty expressing quantitative motion characteristics in natural language.

Method: A zero-shot inference-time optimization method that updates text embeddings of pretrained diffusion models during sampling to guide motion generation toward desired Laban Effort and Shape components.

Result: The approach successfully generates diverse expressive motion qualities while preserving motion identity by manipulating motion attributes according to target Laban tags.

Conclusion: Integrating Laban movement analysis quantification methods enables interpretable and expressive control of human motion generation in text-guided diffusion models.

Abstract: Diverse human motion generation is an increasingly important task, having various applications in computer vision, human-computer interaction and animation. While text-to-motion synthesis using diffusion models has shown success in generating high-quality motions, achieving fine-grained expressive motion control remains a significant challenge. This is due to the lack of motion style diversity in datasets and the difficulty of expressing quantitative characteristics in natural language. Laban movement analysis has been widely used by dance experts to express the details of motion including motion quality as consistent as possible. Inspired by that, this work aims for interpretable and expressive control of human motion generation by seamlessly integrating the quantification methods of Laban Effort and Shape components into the text-guided motion generation models. Our proposed zero-shot, inference-time optimization method guides the motion generation model to have desired Laban Effort and Shape components without any additional motion data by updating the text embedding of pretrained diffusion models during the sampling step. We demonstrate that our approach yields diverse expressive motion qualities while preserving motion identity by successfully manipulating motion attributes according to target Laban tags.

[600] The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgöz, Esam Ghaleb

Main category: cs.CV

TL;DR: The Visual Iconicity Challenge evaluates vision-language models on sign language iconicity tasks, showing they perform below human baselines but models with better phonological form prediction correlate better with human iconicity judgments.

Details

Motivation: To test vision-language models' ability to recover essential mappings from dynamic human motion in signed languages, using iconicity as a natural testbed for visual grounding.

Method: Introduced a video-based benchmark with three tasks: phonological sign-form prediction, transparency (inferring meaning from form), and graded iconicity ratings. Evaluated 13 VLMs on Sign Language of the Netherlands in zero- and few-shot settings.

Result: VLMs recovered some handshape and location detail but remained below human performance on phonological form prediction; performed far from human baselines on transparency; only top models correlated moderately with human iconicity ratings. Models with stronger phonological form prediction correlated better with human iconicity judgment.

Conclusion: The findings validate diagnostic tasks for iconicity and motivate human-centric signals and embodied learning methods for improving visual grounding in multimodal models.

Abstract: Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.

[601] SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie

Main category: cs.CV

TL;DR: SANA-Video is an efficient small diffusion model that generates high-resolution (720x1280), minute-long videos with strong text-video alignment at fast speeds, deployable on consumer GPUs like RTX 5090.

Details

Motivation: To create a cost-effective video generation model that can produce high-quality, long videos efficiently without requiring massive computational resources like other state-of-the-art models.

Method: Uses Linear DiT with linear attention for efficiency and a constant-memory KV cache for block linear attention to enable long video generation with global context at fixed memory cost.

Result: Achieves competitive performance with modern small diffusion models while being 16x faster in latency, with training cost reduced to 12 days on 64 H100 GPUs (only 1% of MovieGen’s cost).

Conclusion: SANA-Video enables low-cost, high-quality video generation that is practical for deployment on consumer hardware with significant speed improvements.

Abstract: We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.

[602] SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks

Md Kowsher, Ali O. Polat, Ehsan Mohammady Ardehaly, Mehrdad Salehi, Zia Ghiasi, Prasanth Murali, Chen Chen

Main category: cs.CV

TL;DR: Fine-tuning small random subnetworks in pre-trained models works due to universal winning slice property from spectral balance and high task energy, leading to SliceFine PEFT method that matches SOTA performance with better efficiency.

Details

Motivation: To provide theoretical foundation for why parameter-efficient fine-tuning (PEFT) works by explaining why fine-tuning small subnetworks in pre-trained models is sufficient for downstream adaptation.

Method: Proposed theoretical framework showing pre-trained networks have universal winning slice property (spectral balance + high task energy), then developed SliceFine method that updates only selected slices of original weights without adding new parameters.

Result: SliceFine matches state-of-the-art PEFT methods across language and vision tasks while significantly improving training speed, memory efficiency, and model compactness.

Conclusion: The work bridges theory and practice by providing theoretical grounding for PEFT and offering a theoretically grounded alternative to existing PEFT techniques.

Abstract: This paper presents a theoretical framework explaining why fine tuning small, randomly selected subnetworks (slices) within pre trained models can be sufficient for downstream adaptation. We prove that pretrained networks exhibit a universal winning slice property arising from two phenomena: (1) spectral balance the eigenspectra of different weight matrix slices are remarkably similar; and (2) high task energy their backbone representations retain rich, task relevant features. This leads to the Universal Winning Slice Hypothesis, which provides a theoretical foundation for parameter efficient fine tuning (PEFT) in large scale models. Inspired by this, we propose SliceFine, a PEFT method that exploits this inherent redundancy by updating only selected slices of the original weights introducing zero new parameters, unlike adapter-based approaches. Empirically, SliceFine matches the performance of state of the art PEFT methods across language and vision tasks, while significantly improving training speed, memory efficiency, and model compactness. Our work bridges theory and practice, offering a theoretically grounded alternative to existing PEFT techniques.

[603] GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning

Mustansar Fiaz, Hiyam Debary, Paolo Fraccaro, Danda Paudel, Luc Van Gool, Fahad Khan, Salman Khan

Main category: cs.CV

TL;DR: A post-training framework that uses task-aware rewards to adapt reinforcement learning models for Earth Observation tasks, improving reasoning capabilities and performance across various EO benchmarks.

Details

Motivation: Reinforcement learning has shown strong reasoning in natural images but remains unexplored for Earth Observation, which has unique challenges like object detection, captioning, change detection, and temporal analysis requiring task-aware reasoning.

Method: Proposes a novel post-training framework incorporating task-aware rewards to adapt reasoning-based RL models to diverse Earth Observation tasks, enhancing reasoning for remote sensing images while stabilizing optimization and improving robustness.

Result: Extensive experiments across multiple EO benchmarks show consistent performance gains over state-of-the-art generic and specialized vision language models.

Conclusion: The framework successfully adapts RL models to Earth Observation tasks, demonstrating improved reasoning capabilities and superior performance compared to existing approaches.

Abstract: Recent advances in reinforcement learning (RL) have delivered strong reasoning capabilities in natural image domains, yet their potential for Earth Observation (EO) remains largely unexplored. EO tasks introduce unique challenges, spanning referred object detection, image or region captioning, change detection, grounding, and temporal analysis, that demand task aware reasoning. We propose a novel post training framework that incorporates task aware rewards to enable effective adaptation of reasoning based RL models to diverse EO tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness. Extensive experiments across multiple EO benchmarks show consistent performance gains over state of the art generic and specialized vision language models. Code and models will be released publicly at https://mustansarfiaz.github.io/GeoVLM-R1/ .

[604] DA$^2$: Depth Anything in Any Direction

Haodong Li, Wangguangdong Zheng, Jing He, Yuhao Liu, Xin Lin, Xin Yang, Ying-Cong Chen, Chunchao Guo

Main category: cs.CV

TL;DR: DA² is a zero-shot generalizable panoramic depth estimator that creates a large-scale dataset of ~607K panoramic RGB-depth pairs and uses SphereViT to handle spherical distortions, achieving state-of-the-art performance with 38% improvement in AbsRel over baselines.

Details

Motivation: Panoramic depth estimation faces challenges due to data scarcity and spherical distortions, leading to poor zero-shot generalization and inefficient methods that rely on perspective splitting.

Method: Proposed DA² with a data curation engine to generate high-quality panoramic depth data from perspective images (~543K new pairs) and SphereViT architecture that leverages spherical coordinates for geometric consistency.

Result: Achieves SoTA performance with 38% average improvement in AbsRel over strongest zero-shot baseline, outperforms prior in-domain methods, and offers higher efficiency as an end-to-end solution.

Conclusion: DA² demonstrates superior zero-shot generalization for panoramic depth estimation through large-scale data curation and spherical geometry-aware modeling, with released code and curated data.

Abstract: Panorama has a full FoV (360$^\circ\times$180$^\circ$), offering a more complete visual description than perspective images. Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision. However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization. Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (e.g., cubemaps), which leads to suboptimal efficiency. To address these challenges, we propose $\textbf{DA}$$^{\textbf{2}}$: $\textbf{D}$epth $\textbf{A}$nything in $\textbf{A}$ny $\textbf{D}$irection, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator. Specifically, for scaling up panoramic data, we introduce a data curation engine for generating high-quality panoramic depth data from perspective, and create $\sim$543K panoramic RGB-depth pairs, bringing the total to $\sim$607K. To further mitigate the spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce the spherical geometric consistency in panoramic image features, yielding improved performance. A comprehensive benchmark on multiple datasets clearly demonstrates DA$^{2}$’s SoTA performance, with an average 38% improvement on AbsRel over the strongest zero-shot baseline. Surprisingly, DA$^{2}$ even outperforms prior in-domain methods, highlighting its superior zero-shot generalization. Moreover, as an end-to-end solution, DA$^{2}$ exhibits much higher efficiency over fusion-based approaches. Both the code and the curated panoramic data has be released. Project page: https://depth-any-in-any-dir.github.io/.

[605] Pure-Pass: Fine-Grained, Adaptive Masking for Dynamic Token-Mixing Routing in Lightweight Image Super-Resolution

Junyu Wu, Jie Liu, Jie Tang, Gangshan Wu

Main category: cs.CV

TL;DR: Pure-Pass (PP) is a pixel-level masking mechanism that identifies pure pixels to exempt them from expensive computations in image super-resolution, improving efficiency and performance over previous methods like CAMixer.

Details

Motivation: Existing lightweight SR methods like CAMixer have limitations including poor adaptability, coarse-grained masking, and spatial inflexibility, which hinder optimal computational efficiency and reconstruction quality.

Method: PP uses fixed color center points to classify pixels into distinct categories, enabling fine-grained, spatially flexible masking that maintains adaptive flexibility while reducing computational costs.

Result: When integrated into ATD-light model, PP-ATD-light achieves superior SR performance with minimal overhead, outperforming CAMixer-ATD-light in both reconstruction quality and parameter efficiency while saving similar computation.

Conclusion: Pure-Pass provides an effective pixel-level masking approach that enhances computational efficiency in image super-resolution while maintaining or improving reconstruction quality.

Abstract: Image Super-Resolution (SR) aims to reconstruct high-resolution images from low-resolution counterparts, but the computational complexity of deep learning-based methods often hinders practical deployment. CAMixer is the pioneering work to integrate the advantages of existing lightweight SR methods and proposes a content-aware mixer to route token mixers of varied complexities according to the difficulty of content recovery. However, several limitations remain, such as poor adaptability, coarse-grained masking and spatial inflexibility, among others. We propose Pure-Pass (PP), a pixel-level masking mechanism that identifies pure pixels and exempts them from expensive computations. PP utilizes fixed color center points to classify pixels into distinct categories, enabling fine-grained, spatially flexible masking while maintaining adaptive flexibility. Integrated into the state-of-the-art ATD-light model, PP-ATD-light achieves superior SR performance with minimal overhead, outperforming CAMixer-ATD-light in reconstruction quality and parameter efficiency when saving a similar amount of computation.

[606] NeuroSwift: A Lightweight Cross-Subject Framework for fMRI Visual Reconstruction of Complex Scenes

Shiyi Zhang, Dong Liang, Yihang Zhou

Main category: cs.CV

TL;DR: NeuroSwift is a diffusion-based method that integrates AutoKL and CLIP adapters for cross-subject visual stimulus reconstruction from fMRI data, achieving state-of-the-art performance with minimal fine-tuning.

Details

Motivation: To address challenges in cross-subject visual reconstruction from fMRI data, including inter-subject variability and the brain's abstract encoding of semantic features in complex visual inputs.

Method: Integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. Uses pretraining on one subject followed by fine-tuning only 17% of parameters (fully connected layers) for new subjects while freezing other components.

Result: Achieves state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), outperforming existing methods.

Conclusion: NeuroSwift enables efficient and accurate cross-subject visual reconstruction from fMRI data with minimal computational requirements.

Abstract: Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain’s abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. NeuroSwift’s CLIP Adapter is trained on Stable Diffusion generated images paired with COCO captions to emulate higher visual cortex encoding. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17 percent of parameters (fully connected layers) for new subjects, while freezing other components. This enables state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), and it outperforms existing methods.

[607] FSFSplatter: Build Surface and Novel Views with Sparse-Views within 2min

Yibin Zhao, Yihan Pan, Jun Nan, Liwei Chen, Jianjun Yi

Main category: cs.CV

TL;DR: FSFSplatter enables fast surface reconstruction from free sparse images using Gaussian Splatting with dense initialization, camera estimation, and geometry-enhanced optimization.

Details

Motivation: Existing Gaussian Splatting methods require dense, calibrated views and perform poorly with sparse images due to limited overlap and overfitting.

Method: Integrates end-to-end dense Gaussian initialization using a large Transformer, self-splitting Gaussian head, contribution-based pruning, and depth/multi-view feature supervision with differentiable camera parameters.

Result: Outperforms current state-of-the-art methods on DTU, Replica, and BlendedMVS datasets.

Conclusion: FSFSplatter effectively addresses the challenges of sparse image reconstruction while maintaining high-quality results comparable to dense-view methods.

Abstract: Gaussian Splatting has become a leading reconstruction technique, known for its high-quality novel view synthesis and detailed reconstruction. However, most existing methods require dense, calibrated views. Reconstructing from free sparse images often leads to poor surface due to limited overlap and overfitting. We introduce FSFSplatter, a new approach for fast surface reconstruction from free sparse images. Our method integrates end-to-end dense Gaussian initialization, camera parameter estimation, and geometry-enhanced scene optimization. Specifically, FSFSplatter employs a large Transformer to encode multi-view images and generates a dense and geometrically consistent Gaussian scene initialization via a self-splitting Gaussian head. It eliminates local floaters through contribution-based pruning and mitigates overfitting to limited views by leveraging depth and multi-view feature supervision with differentiable camera parameters during rapid optimization. FSFSplatter outperforms current state-of-the-art methods on widely used DTU, Replica, and BlendedMVS datasets.

[608] HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

Shiyi Zhang, Dong Liang, Hairong Zheng, Yihang Zhou

Main category: cs.CV

TL;DR: HAVIR model reconstructs visual information from brain activity by separating visual cortex into hierarchical regions, using structural and semantic features integrated via Versatile Diffusion for improved image synthesis.

Details

Motivation: Existing methods struggle to accurately reconstruct complex visual stimuli due to heterogeneity in low-level features and semantic entanglement in high-level features from natural scenes.

Method: Separates visual cortex into two hierarchical regions: Structural Generator extracts structural information from spatial processing voxels as latent diffusion priors, and Semantic Extractor converts semantic processing voxels into CLIP embeddings, integrated via Versatile Diffusion model.

Result: HAVIR enhances both structural and semantic quality of reconstructions in complex scenes and outperforms existing models.

Conclusion: The hierarchical approach inspired by visual cortex representation theory effectively addresses challenges in brain activity-based visual reconstruction.

Abstract: The reconstruction of visual information from brain activity fosters interdisciplinary integration between neuroscience and computer vision. However, existing methods still face challenges in accurately recovering highly complex visual stimuli. This difficulty stems from the characteristics of natural scenes: low-level features exhibit heterogeneity, while high-level features show semantic entanglement due to contextual overlaps. Inspired by the hierarchical representation theory of the visual cortex, we propose the HAVIR model, which separates the visual cortex into two hierarchical regions and extracts distinct features from each. Specifically, the Structural Generator extracts structural information from spatial processing voxels and converts it into latent diffusion priors, while the Semantic Extractor converts semantic processing voxels into CLIP embeddings. These components are integrated via the Versatile Diffusion model to synthesize the final image. Experimental results demonstrate that HAVIR enhances both the structural and semantic quality of reconstructions, even in complex scenes, and outperforms existing models.

[609] Learned Display Radiance Fields with Lensless Cameras

Ziyang Chen, Yuta Itoh, Kaan Akşit

Main category: cs.CV

TL;DR: A lensless camera and neural algorithm co-design enables display calibration from multiple viewpoints without specialized hardware, reconstructing light fields across a 46.6°×37.6° viewing cone.

Details

Motivation: Display calibration is essential but troublesome, requiring specialized equipment and dark rooms that are inaccessible to most users. The goal is to eliminate hardware requirements for display characterization.

Method: Co-design of a lensless camera and Implicit Neural Representation algorithm to capture display characteristics from various viewpoints, enabling light field reconstruction.

Result: The pipeline successfully reconstructs light fields emitted from displays across a 46.6°×37.6° viewing cone without requiring specialized calibration equipment.

Conclusion: This emerging pipeline represents initial steps toward effortless display calibration and characterization, making the process more accessible to general users.

Abstract: Calibrating displays is a basic and regular task that content creators must perform to maintain optimal visual experience, yet it remains a troublesome issue. Measuring display characteristics from different viewpoints often requires specialized equipment and a dark room, making it inaccessible to most users. To avoid specialized hardware requirements in display calibrations, our work co-designs a lensless camera and an Implicit Neural Representation based algorithm for capturing display characteristics from various viewpoints. More specifically, our pipeline enables efficient reconstruction of light fields emitted from a display from a viewing cone of 46.6{\deg} X 37.6{\deg}. Our emerging pipeline paves the initial steps towards effortless display calibration and characterization.

[610] Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning

Yaxin Hou, Bo Han, Yuheng Jia, Hui Liu, Junhui Hou

Main category: cs.CV

TL;DR: CPG is a framework for long-tailed semi-supervised learning that handles arbitrary unlabeled data distributions by progressively generating reliable pseudo-labels and maintaining a known labeled data distribution through controllable filtering.

Details

Motivation: Existing methods assume unlabeled data follows predefined distributions, but in reality, unlabeled data distribution is unknown and arbitrary, creating challenges for reliable pseudo-label generation.

Method: Uses a controllable self-reinforcing optimization cycle: (1) dynamic controllable filtering to selectively add pseudo-labels maintaining known distribution, (2) Bayes-optimal classifier with logit adjustment, (3) improved classifier helps identify more reliable pseudo-labels. Also includes class-aware adaptive augmentation and auxiliary branch for data utilization.

Result: Achieves consistent improvements across benchmark datasets, surpassing state-of-the-art methods by up to 15.97% in accuracy.

Conclusion: CPG effectively handles arbitrary unlabeled data distributions in long-tailed semi-supervised learning through its controllable pseudo-label generation framework and optimization cycle, with theoretical guarantees on generalization error reduction.

Abstract: Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to $\textbf{15.97%}$ in accuracy. The code is available at https://github.com/yaxinhou/CPG.

[611] VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery

Nonghai Zhang, Zeyu Zhang, Jiazi Wang, Yang Zhao, Hao Tang

Main category: cs.CV

TL;DR: This paper introduces VaseVQA-3D, the first 3D visual question answering dataset for ancient Greek pottery analysis, and VaseVLM, a domain-adaptive model that improves performance on cultural heritage tasks.

Details

Motivation: Vision-Language Models struggle with specialized cultural heritage domains like 3D vase artifacts due to data scarcity and insufficient domain knowledge, particularly for culturally significant specialized tasks.

Method: Created the VaseVQA-3D dataset with 664 ancient Greek vase 3D models and corresponding QA data, and developed the VaseVLM model using domain-adaptive training to enhance vase artifact analysis capabilities.

Result: Improved by 12.8% on R@1 metrics and 6.6% on lexical similarity compared to previous state-of-the-art on the VaseVQA-3D dataset, significantly enhancing 3D vase artifact recognition and understanding.

Conclusion: The approach provides new technical pathways for digital heritage preservation research by addressing data scarcity and domain knowledge limitations in cultural heritage analysis.

Abstract: Vision-Language Models (VLMs) have achieved significant progress in multimodal understanding tasks, demonstrating strong capabilities particularly in general tasks such as image captioning and visual reasoning. However, when dealing with specialized cultural heritage domains like 3D vase artifacts, existing models face severe data scarcity issues and insufficient domain knowledge limitations. Due to the lack of targeted training data, current VLMs struggle to effectively handle such culturally significant specialized tasks. To address these challenges, we propose the VaseVQA-3D dataset, which serves as the first 3D visual question answering dataset for ancient Greek pottery analysis, collecting 664 ancient Greek vase 3D models with corresponding question-answer data and establishing a complete data construction pipeline. We further develop the VaseVLM model, enhancing model performance in vase artifact analysis through domain-adaptive training. Experimental results validate the effectiveness of our approach, where we improve by 12.8% on R@1 metrics and by 6.6% on lexical similarity compared with previous state-of-the-art on the VaseVQA-3D dataset, significantly improving the recognition and understanding of 3D vase artifacts, providing new technical pathways for digital heritage preservation research. Code: https://github.com/AIGeeksGroup/VaseVQA-3D. Website: https://aigeeksgroup.github.io/VaseVQA-3D.

[612] Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Yolo Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu

Main category: cs.CV

TL;DR: This survey provides the first comprehensive examination of post-training methodologies for Video-Large Multimodal Models (Video-LMMs), covering supervised fine-tuning, reinforcement learning, and test-time scaling techniques.

Details

Motivation: Video understanding is challenging due to complex spatiotemporal relationships and long-term dependencies. While Video-LMMs show promise, their transformation from basic perception to sophisticated reasoning through post-training remains fragmented in literature.

Method: The survey examines three fundamental post-training pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation.

Result: The paper presents a structured taxonomy clarifying the roles, interconnections, and video-specific adaptations of these techniques, addressing challenges like temporal localization, spatiotemporal grounding, and multimodal evidence integration.

Conclusion: The survey provides researchers with a unified framework for advancing Video-LMM capabilities, including essential benchmarks, datasets, and metrics for rigorous assessment of post-training effectiveness.

Abstract: Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training

[613] Concept Retrieval – What and How?

Ori Nizan, Oren Shrout, Ayellet Tal

Main category: cs.CV

TL;DR: This paper introduces a novel approach for retrieving images that share central concepts with a query image, going beyond visual or semantic similarity to capture underlying narratives.

Details

Motivation: Current retrieval and clustering methods focus on visual or semantic similarity but fail to capture the central concepts and underlying narrative that images may share.

Method: Proposes a bimodal Gaussian distribution model based on two key observations: (1) neighbors share at least one concept with query but not necessarily with each other, (2) this neighborhood structure reveals meaningful concepts.

Result: Qualitative, quantitative, and human evaluations confirm the effectiveness of the approach in identifying and retrieving images based on shared central concepts.

Conclusion: The proposed method successfully addresses concept-based image retrieval by modeling neighborhood structure with bimodal distributions, providing a more nuanced understanding of image relationships beyond surface-level similarities.

Abstract: A concept may reflect either a concrete or abstract idea. Given an input image, this paper seeks to retrieve other images that share its central concepts, capturing aspects of the underlying narrative. This goes beyond conventional retrieval or clustering methods, which emphasize visual or semantic similarity. We formally define the problem, outline key requirements, and introduce appropriate evaluation metrics. We propose a novel approach grounded in two key observations: (1) While each neighbor in the embedding space typically shares at least one concept with the query, not all neighbors necessarily share the same concept with one another. (2) Modeling this neighborhood with a bimodal Gaussian distribution uncovers meaningful structure that facilitates concept identification. Qualitative, quantitative, and human evaluations confirm the effectiveness of our approach. See the package on PyPI: https://pypi.org/project/coret/

[614] TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation

Jiaben Chen, Zixin Wang, Ailing Zeng, Yang Fu, Xueyang Yu, Siyuan Cen, Julian Tanke, Yihang Chen, Koichi Saito, Yuki Mitsufuji, Chuang Gan

Main category: cs.CV

TL;DR: TalkCuts is a large-scale dataset for multi-shot human speech video generation with 164k clips and 500+ hours of diverse camera shots. The authors also present Orator, an LLM-guided framework that generates coherent long-form videos.

Details

Motivation: Existing datasets focus on single-shot, static viewpoints, lacking diversity in camera shots and multimodal annotations needed for advanced speech video generation.

Method: Created TalkCuts dataset with 164k clips, detailed annotations, and presented Orator - an LLM-guided framework where language models direct camera transitions, gestures, and vocal modulation for multi-modal video generation.

Result: Training on TalkCuts significantly enhances cinematographic coherence and visual appeal in both pose-guided and audio-driven multi-shot speech video generation settings.

Conclusion: TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.

Abstract: In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation. As a first attempt to showcase the value of the dataset, we present Orator, an LLM-guided multi-modal generation framework as a simple baseline, where the language model functions as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulations, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multi-modal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.

[615] SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction

Wenyue Chen, Peng Li, Wangguandong Zheng, Chengfeng Zhao, Mengfei Li, Yaolong Zhu, Zhiyang Dou, Ronggang Wang, Yuan Liu

Main category: cs.CV

TL;DR: SyncHuman combines 2D multiview and 3D native generative models for high-quality 3D human reconstruction from single images, addressing challenges in difficult poses and fine details through synchronized attention and feature injection.

Details

Motivation: Existing methods using SMPL estimation and SMPL-conditioned generative models suffer from inaccurate 3D priors and struggle with challenging human poses and fine detail reconstruction, creating a need for more robust photorealistic 3D human reconstruction.

Method: Jointly fine-tunes multiview generative model (for fine 2D details) and 3D native generative model (for structural consistency) using pixel-aligned 2D-3D synchronization attention, then injects fine details from 2D multiview images onto aligned 3D shapes.

Result: Achieves robust and photo-realistic 3D human reconstruction even for challenging poses, outperforming baseline methods in both geometric accuracy and visual fidelity.

Conclusion: The integration of complementary 2D and 3D generative models represents a promising direction for future 3D generation, enabling high-quality clothed human mesh reconstruction from single-view images.

Abstract: Photorealistic 3D full-body human reconstruction from a single image is a critical yet challenging task for applications in films and video games due to inherent ambiguities and severe self-occlusions. While recent approaches leverage SMPL estimation and SMPL-conditioned image generative models to hallucinate novel views, they suffer from inaccurate 3D priors estimated from SMPL meshes and have difficulty in handling difficult human poses and reconstructing fine details. In this paper, we propose SyncHuman, a novel framework that combines 2D multiview generative model and 3D native generative model for the first time, enabling high-quality clothed human mesh reconstruction from single-view images even under challenging human poses. Multiview generative model excels at capturing fine 2D details but struggles with structural consistency, whereas 3D native generative model generates coarse yet structurally consistent 3D shapes. By integrating the complementary strengths of these two approaches, we develop a more effective generation framework. Specifically, we first jointly fine-tune the multiview generative model and the 3D native generative model with proposed pixel-aligned 2D-3D synchronization attention to produce geometrically aligned 3D shapes and 2D multiview images. To further improve details, we introduce a feature injection mechanism that lifts fine details from 2D multiview images onto the aligned 3D shapes, enabling accurate and high-fidelity reconstruction. Extensive experiments demonstrate that SyncHuman achieves robust and photo-realistic 3D human reconstruction, even for images with challenging poses. Our method outperforms baseline methods in geometric accuracy and visual fidelity, demonstrating a promising direction for future 3D generation models.

[616] FMANet: A Novel Dual-Phase Optical Flow Approach with Fusion Motion Attention Network for Robust Micro-expression Recognition

Luu Tu Nguyen, Vu Tram Anh Khuong, Thi Bich Phuong Man, Thi Duyen Ngo, Thanh Ha Le

Main category: cs.CV

TL;DR: The paper proposes MM-COF, a comprehensive motion representation that integrates optical flow from both onset-apex and apex-offset phases of micro-expressions, and FMANet, an end-to-end neural network that adaptively fuses motion cues for improved micro-expression recognition.

Details

Motivation: Current micro-expression recognition methods only use optical flow between onset and apex frames, missing essential motion information in the apex-to-offset phase, which limits recognition performance.

Method: Introduces MM-COF motion representation that combines optical flow from both micro-expression phases, and FMANet network with learnable modules for adaptive motion fusion and salient region focus.

Result: Outperforms existing methods on standard benchmarks (MMEW, SMIC, CASME-II, SAMM), demonstrating superior recognition performance.

Conclusion: The learnable dual-phase framework with comprehensive motion representation significantly advances micro-expression recognition by capturing complete motion dynamics.

Abstract: Facial micro-expressions, characterized by their subtle and brief nature, are valuable indicators of genuine emotions. Despite their significance in psychology, security, and behavioral analysis, micro-expression recognition remains challenging due to the difficulty of capturing subtle facial movements. Optical flow has been widely employed as an input modality for this task due to its effectiveness. However, most existing methods compute optical flow only between the onset and apex frames, thereby overlooking essential motion information in the apex-to-offset phase. To address this limitation, we first introduce a comprehensive motion representation, termed Magnitude-Modulated Combined Optical Flow (MM-COF), which integrates motion dynamics from both micro-expression phases into a unified descriptor suitable for direct use in recognition networks. Building upon this principle, we then propose FMANet, a novel end-to-end neural network architecture that internalizes the dual-phase analysis and magnitude modulation into learnable modules. This allows the network to adaptively fuse motion cues and focus on salient facial regions for classification. Experimental evaluations on the MMEW, SMIC, CASME-II, and SAMM datasets, widely recognized as standard benchmarks, demonstrate that our proposed MM-COF representation and FMANet outperforms existing methods, underscoring the potential of a learnable, dual-phase framework in advancing micro-expression recognition.

[617] MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions

Kaen Kogashi, Anoop Cherian, Meng-Yu Jennifer Kuo

Main category: cs.CV

TL;DR: MMHOI is a large-scale dataset for multi-human multi-object interactions with complete 3D annotations, and MMHOI-Net is a transformer-based model that achieves state-of-the-art performance in modeling these complex interactions.

Details

Motivation: Existing 3D human-object interaction benchmarks only cover simple interactions, while real-world scenes involve complex multi-human multi-object interactions that are causal, goal-oriented, or cooperative.

Method: MMHOI-Net is an end-to-end transformer-based neural network that uses a structured dual-patch representation for modeling objects and their interactions, combined with action recognition to enhance interaction prediction.

Result: Experiments on MMHOI and CORE4D datasets show state-of-the-art performance in multi-HOI modeling with excellent accuracy and reconstruction quality.

Conclusion: MMHOI dataset and MMHOI-Net framework provide a comprehensive solution for next-generation HOI research, successfully addressing the gap in modeling complex multi-human multi-object interactions.

Abstract: Real-world scenes often feature multiple humans interacting with multiple objects in ways that are causal, goal-oriented, or cooperative. Yet existing 3D human-object interaction (HOI) benchmarks consider only a fraction of these complex interactions. To close this gap, we present MMHOI – a large-scale, Multi-human Multi-object Interaction dataset consisting of images from 12 everyday scenarios. MMHOI offers complete 3D shape and pose annotations for every person and object, along with labels for 78 action categories and 14 interaction-specific body parts, providing a comprehensive testbed for next-generation HOI research. Building on MMHOI, we present MMHOI-Net, an end-to-end transformer-based neural network for jointly estimating human-object 3D geometries, their interactions, and associated actions. A key innovation in our framework is a structured dual-patch representation for modeling objects and their interactions, combined with action recognition to enhance the interaction prediction. Experiments on MMHOI and the recently proposed CORE4D datasets demonstrate that our approach achieves state-of-the-art performance in multi-HOI modeling, excelling in both accuracy and reconstruction quality.

[618] Latent Harmony: Synergistic Unified UHD Image Restoration via Latent Space Regularization and Controllable Refinement

Yidi Liu, Xueyang Fu, Jie Huang, Jie Xiao, Dong Li, Wenlong Zhang, Lei Bai, Zheng-Jun Zha

Main category: cs.CV

TL;DR: Latent Harmony is a two-stage VAE framework for UHD image restoration that balances computational efficiency and high-frequency detail retention through latent space regularization and high-frequency-aware reconstruction.

Details

Motivation: Address the trade-off between computational efficiency and high-frequency detail retention in UHD image restoration, overcoming limitations of standard VAEs that discard degradation-specific high-frequency information due to Gaussian constraints.

Method: Two-stage framework: Stage One introduces LH-VAE with visual semantic constraints, progressive degradation perturbations, and latent equivariance. Stage Two jointly trains the refined VAE with restoration model using High-Frequency Low-Rank Adaptation (HF-LoRA) with encoder LoRA for fidelity-oriented detail recovery and decoder LoRA for perception-oriented texture synthesis, trained via alternating optimization.

Result: Achieves state-of-the-art performance across UHD and standard-resolution tasks, effectively balancing efficiency, perceptual quality, and reconstruction accuracy with tunable fidelity-perception trade-offs.

Conclusion: Latent Harmony successfully redefines VAEs for UHD restoration by jointly regularizing latent space and enforcing high-frequency-aware reconstruction, overcoming the limitations of traditional VAEs while maintaining computational efficiency.

Abstract: Ultra-High Definition (UHD) image restoration faces a trade-off between computational efficiency and high-frequency detail retention. While Variational Autoencoders (VAEs) improve efficiency via latent-space processing, their Gaussian constraint often discards degradation-specific high-frequency information, hurting reconstruction fidelity. To overcome this, we propose Latent Harmony, a two-stage framework that redefines VAEs for UHD restoration by jointly regularizing the latent space and enforcing high-frequency-aware reconstruction.In Stage One, we introduce LH-VAE, which enhances semantic robustness through visual semantic constraints and progressive degradation perturbations, while latent equivariance strengthens high-frequency reconstruction.Stage Two jointly trains this refined VAE with a restoration model using High-Frequency Low-Rank Adaptation (HF-LoRA): an encoder LoRA guided by a fidelity-oriented high-frequency alignment loss to recover authentic details, and a decoder LoRA driven by a perception-oriented loss to synthesize realistic textures. Both LoRA modules are trained via alternating optimization with selective gradient propagation to preserve the pretrained latent structure.At inference, a tunable parameter {\alpha} enables flexible fidelity-perception trade-offs.Experiments show Latent Harmony achieves state-of-the-art performance across UHD and standard-resolution tasks, effectively balancing efficiency, perceptual quality, and reconstruction accuracy.

[619] One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting

Haipeng Liu, Yang Wang, Meng Wang

Main category: cs.CV

TL;DR: NTN-Diff is a frequency-aware diffusion model for text-guided image inpainting that decomposes semantics consistency across frequency bands and uses null-text guidance to preserve unmasked regions while achieving semantic alignment.

Details

Motivation: Previous methods failed to simultaneously preserve unmasked regions and achieve semantic consistency between masked and unmasked areas due to entanglement of hybrid frequency bands with different robustness to text prompts.

Method: Proposes null-text-null frequency-aware diffusion that divides denoising into early and late stages, using mid-frequency as stable guidance for null-text denoising of low-frequency bands, followed by text-guided denoising for semantic consistency.

Result: Extensive experiments show NTN-Diff outperforms state-of-the-art diffusion models for text-guided image inpainting, achieving better preservation of unmasked regions and semantic consistency.

Conclusion: NTN-Diff successfully addresses both preservation of unmasked regions and semantic consistency by disentangling frequency bands and using progressive denoising with null-text guidance.

Abstract: Text-guided image inpainting aims at reconstructing the masked regions as per text prompts, where the longstanding challenges lie in the preservation for unmasked regions, while achieving the semantics consistency between unmasked and inpainted masked regions. Previous arts failed to address both of them, always with either of them to be remedied. Such facts, as we observed, stem from the entanglement of the hybrid (e.g., mid-and-low) frequency bands that encode varied image properties, which exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion models, dubbed \textbf{NTN-Diff}, for text-guided image inpainting, by decomposing the semantics consistency across masked and unmasked regions into the consistencies as per each frequency band, while preserving the unmasked regions, to circumvent two challenges in a row. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid-and-low frequency bands are disentangled during the denoising process. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during text-guided denoising process, which, meanwhile, serves as the guidance to the null-text denoising process to denoise low-frequency band for the masked regions, followed by a subsequent text-guided denoising process at late stage, to achieve the semantics consistency for mid-and-low frequency bands across masked and unmasked regions, while preserve the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over the state-of-the-art diffusion models to text-guided diffusion models. Our code can be accessed from https://github.com/htyjers/NTN-Diff.

[620] MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

Xiangyu Zhao, Junming Lin, Tianhao Liang, Yifan Zhou, Wenhao Chai, Yuzhe Gu, Weiyun Wang, Kai Chen, Gen Luo, Wenwei Zhang, Junchi Yan, Hua Yang, Haodong Duan, Xue Yang

Main category: cs.CV

TL;DR: The paper introduces MM-HELIX, a multimodal benchmark for evaluating long-chain reflective reasoning in MLLMs, and proposes Adaptive Hybrid Policy Optimization (AHPO) to improve this capability, achieving significant performance gains.

Details

Motivation: Current MLLMs lack capacity for long-chain reflective reasoning needed for complex real-world problems, which remains underexplored despite their proficiency in simpler reasoning tasks.

Method: Created MM-HELIX benchmark with 1,260 synthetic tasks requiring iterative thinking, then developed Step-Elicited Response Generation pipeline to create MM-HELIX-100K dataset, and proposed Adaptive Hybrid Policy Optimization (AHPO) that dynamically combines offline supervision and online optimization.

Result: Applied to Qwen2.5-VL-7B baseline, achieved +18.6% accuracy improvement on MM-HELIX benchmark and +5.7% average performance gain on general mathematical and logic tasks, demonstrating strong generalization.

Conclusion: Reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable multimodal language models.

Abstract: While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematic and logic tasks. Our work demonstrate that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.

Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, Nanxuan Zhao

Main category: cs.CV

TL;DR: MultiCOIN is a video inbetweening framework that enables multi-modal controls (depth, motion trajectories, text prompts, etc.) for generating smooth transitions between video frames, addressing limitations of existing methods in handling complex motions and user intent.

Details

Motivation: Existing video inbetweening methods cannot generate large, complex motions, lack versatility for user intents, and provide insufficient fine control over intermediate frames, leading to misalignment with creative vision.

Method: Uses Diffusion Transformer (DiT) architecture with motion controls mapped to sparse point-based representation. Separates content and motion controls into two branches, employs stage-wise training strategy for smooth learning of multi-modal controls.

Result: Enables more dynamic, customizable, and contextually accurate visual narratives through multi-modal controls, as demonstrated by extensive qualitative and quantitative experiments.

Conclusion: MultiCOIN successfully fills gaps in video inbetweening by providing flexible multi-modal controls while maintaining precision and ease of use for fine-grained video interpolation.

Abstract: Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creative mind. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.

[622] Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization

Shuo Xing, Soumik Dey, Mingyang Wu, Ashirbad Mishra, Naveen Ravipati, Binbin Li, Hansi Wu, Zhengzhong Tu

Main category: cs.CV

TL;DR: Q-Router is an agentic framework for universal video quality assessment that uses vision-language models as routers to dynamically select and ensemble expert models based on video content, improving generalization, interpretability, and extensibility across diverse video types.

Details

Motivation: Existing VQA models have poor generalization across different content types (UGC, short-form videos, AIGC), limited interpretability, and lack extensibility to new use cases or content types.

Method: Multi-tier model routing system with vision-language models as real-time routers that dynamically reason and ensemble appropriate expert models based on input video semantics, including a heavy tier for spatiotemporal artifacts localization.

Result: Matches or surpasses state-of-the-art VQA models on various benchmarks, substantially improves generalization and interpretability, excels on Q-Bench-Video benchmark, and capably localizes spatiotemporal artifacts.

Conclusion: Q-Router shows promise as a foundation for next-generation VQA systems and has potential as a reward function for post-training video generation models due to its agentic design that combines complementary strengths of specialized experts.

Abstract: Video quality assessment (VQA) is a fundamental computer vision task that aims to predict the perceptual quality of a given video in alignment with human judgments. Existing performant VQA models trained with direct score supervision suffer from (1) poor generalization across diverse content and tasks, ranging from user-generated content (UGC), short-form videos, to AI-generated content (AIGC), (2) limited interpretability, and (3) lack of extensibility to novel use cases or content types. We propose Q-Router, an agentic framework for universal VQA with a multi-tier model routing system. Q-Router integrates a diverse set of expert models and employs vision–language models (VLMs) as real-time routers that dynamically reason and then ensemble the most appropriate experts conditioned on the input video semantics. We build a multi-tiered routing system based on the computing budget, with the heaviest tier involving a specific spatiotemporal artifacts localization for interpretability. This agentic design enables Q-Router to combine the complementary strengths of specialized experts, achieving both flexibility and robustness in delivering consistent performance across heterogeneous video sources and tasks. Extensive experiments demonstrate that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks, while substantially improving generalization and interpretability. Moreover, Q-Router excels on the quality-based question answering benchmark, Q-Bench-Video, highlighting its promise as a foundation for next-generation VQA systems. Finally, we show that Q-Router capably localizes spatiotemporal artifacts, showing potential as a reward function for post-training video generation models.

cs.AI

[623] The Geometry of Reasoning: Flowing Logics in Representation Space

Yufa Zhou, Yixiao Wang, Xunjian Yin, Shuyan Zhou, Anru R. Zhang

Main category: cs.AI

TL;DR: The paper proposes a geometric framework that models LLM reasoning as smooth flows in representation space, where logical statements control flow velocities, enabling formal analysis of reasoning processes.

Details

Motivation: To understand how LLMs "think" through their representation space and disentangle logical structure from semantics, providing a foundation for interpretability and formal analysis of LLM behavior.

Method: A geometric framework modeling LLM reasoning as embedding trajectories (flows) in representation space, using natural deduction propositions with varied semantic carriers to test logical internalization, and employing learned representation proxies for visualization and quantification.

Result: Established that LLM reasoning corresponds to smooth flows in representation space, and logical statements act as local controllers of these flows’ velocities, with empirical validation through controlled experiments.

Conclusion: The work provides both a conceptual foundation and practical tools for studying reasoning phenomena in LLMs, offering a new geometric perspective for interpretability and formal analysis.

Abstract: We study how large language models (LLMs) ``think’’ through their representation space. We propose a novel geometric framework that models an LLM’s reasoning as flows – embedding trajectories evolving where logic goes. We disentangle logical structure from semantics by employing the same natural deduction propositions with varied semantic carriers, allowing us to test whether LLMs internalize logic beyond surface form. This perspective connects reasoning with geometric quantities such as position, velocity, and curvature, enabling formal analysis in representation and concept spaces. Our theory establishes: (1) LLM reasoning corresponds to smooth flows in representation space, and (2) logical statements act as local controllers of these flows’ velocities. Using learned representation proxies, we design controlled experiments to visualize and quantify reasoning flows, providing empirical validation of our theoretical framework. Our work serves as both a conceptual foundation and practical tools for studying reasoning phenomenon, offering a new lens for interpretability and formal analysis of LLMs’ behavior.

[624] SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation

Zeyu Ling, Xiaodong Gu, Jiangnan Tang, Changqing Zou

Main category: cs.AI

TL;DR: SyncLipMAE is a self-supervised pretraining framework for talking-face video that learns synchronization-aware facial dynamics from unlabeled audio-visual data using masked visual modeling and cross-modal contrastive alignment.

Details

Motivation: To learn transferable facial dynamics from unlabeled audio-visual streams and create a unified model that can handle multiple downstream tasks requiring distinct capabilities like synchronization, recognition, and dubbing.

Method: Combines masked visual modeling with cross-modal contrastive alignment using three per-frame prompt tokens (identity, vocal motion, ambient motion). Uses time-aligned vocal-motion and audio tokens as positives and misaligned pairs as negatives to drive both modalities into a shared embedding space.

Result: Achieves state-of-the-art results across four task families: audio-visual stream synchronization, facial emotion and head/face action recognition, visual speech recognition, and visual dubbing. Enables indistinguishable audio- or video-driven control within a single model.

Conclusion: SyncLipMAE demonstrates the effectiveness of synchronization-aware, factorized self-supervised pretraining for learning transferable facial dynamics that generalize across multiple disparate downstream applications.

Abstract: We introduce SyncLipMAE, a self-supervised pretraining framework for talking-face video that learns synchronization-aware and transferable facial dynamics from unlabeled audio-visual streams. Our approach couples masked visual modeling with cross-modal contrastive alignment and employs three per-frame prompt tokens that explicitly encode the essential factors of a talking-face frame - identity, vocal motion (speech-synchronized facial dynamics), and ambient motion (audio-agnostic movements such as blinks and head pose). The contrastive objective uses time-aligned vocal-motion and audio tokens as positives and misaligned pairs as negatives, driving both modalities into a shared embedding space and yielding token-level audio-visual stream synchronization. After pretraining, the aligned audio tokens together with the visual prompt tokens (identity, vocal motion, ambient motion) form a unified interface for four disparate downstream settings: (i) audio-visual stream synchronization; (ii) facial emotion and head/face action recognition; (iii) visual speech recognition; and (iv) visual dubbing, for which we enable indistinguishable audio- or video-driven control within a single model. Across four task families that require distinct capabilities, SyncLipMAE achieves state-of-the-art results, underscoring the effectiveness of synchronization-aware, factorized self-supervised pretraining.

[625] How can we assess human-agent interactions? Case studies in software agent design

Valerie Chen, Rohit Malhotra, Xingyao Wang, Juan Michelini, Xuhui Zhou, Aditya Bharat Soni, Hoang H. Tran, Calvin Smith, Ameet Talwalkar, Graham Neubig

Main category: cs.AI

TL;DR: PULSE is a human-centric evaluation framework for LLM agents that combines user feedback with ML models to predict satisfaction, deployed on a large-scale platform with 15k+ users to study agent design decisions.

Details

Motivation: Current benchmarks for LLM agents assume full automation and fail to capture real-world human-agent collaboration, creating a need for more rigorous assessment of human-agent interactions.

Method: Proposed PULSE framework: collect user feedback, train ML model to predict user satisfaction, combine human ratings with model-generated pseudo-labels. Deployed on large-scale web platform with OpenHands agent across 15k+ users.

Result: Case studies showed impact of LLM backbone, planning strategy, and memory mechanisms on developer satisfaction. Framework reduced confidence intervals by 40% vs standard A/B tests. Found substantial discrepancies between in-the-wild results and benchmark performance.

Conclusion: PULSE provides guidance for human-in-the-loop LLM agent evaluation and identifies opportunities for better agent designs, highlighting limitations of benchmark-driven evaluation.

Abstract: LLM-powered agents are both a promising new technology and a source of complexity, where choices about models, tools, and prompting can affect their usefulness. While numerous benchmarks measure agent accuracy across domains, they mostly assume full automation, failing to represent the collaborative nature of real-world use cases. In this paper, we make two major steps towards the rigorous assessment of human-agent interactions. First, we propose PULSE, a framework for more efficient human-centric evaluation of agent designs, which comprises collecting user feedback, training an ML model to predict user satisfaction, and computing results by combining human satisfaction ratings with model-generated pseudo-labels. Second, we deploy the framework on a large-scale web platform built around the open-source software agent OpenHands, collecting in-the-wild usage data across over 15k users. We conduct case studies around how three agent design decisions – choice of LLM backbone, planning strategy, and memory mechanisms – impact developer satisfaction rates, yielding practical insights for software agent design. We also show how our framework can lead to more robust conclusions about agent design, reducing confidence intervals by 40% compared to a standard A/B test. Finally, we find substantial discrepancies between in-the-wild results and benchmark performance (e.g., the anti-correlation between results comparing claude-sonnet-4 and gpt-5), underscoring the limitations of benchmark-driven evaluation. Our findings provide guidance for evaluations of LLM agents with humans and identify opportunities for better agent designs.

[626] AI and Consciousness

Eric Schwitzgebel

Main category: cs.AI

TL;DR: This paper provides a skeptical overview of AI consciousness, arguing that mainstream theories will soon produce systems considered conscious by some theories but not others, leaving us unable to determine if AI systems are truly conscious or experientially blank.

Details

Motivation: To examine the philosophical problem that we will soon create AI systems that appear conscious under some mainstream theories but not others, creating uncertainty about their true conscious status.

Method: Critical analysis of various theories of consciousness (global workspace, higher order, integrated information) and arguments for/against AI consciousness, including conceptual arguments, functionalism, and biological substrate considerations.

Result: The analysis shows that none of the standard arguments for or against AI consciousness provide definitive answers, leaving us unable to determine whether future AI systems will be genuinely conscious or merely sophisticated mimics.

Conclusion: We face fundamental uncertainty about AI consciousness due to competing mainstream theories, and will not be able to resolve whether advanced AI systems possess genuine consciousness or are experientially blank.

Abstract: This is a skeptical overview of the literature on AI consciousness. We will soon create AI systems that are conscious according to some influential, mainstream theories of consciousness but are not conscious according to other influential, mainstream theories of consciousness. We will not be in a position to know which theories are correct and whether we are surrounded by AI systems as richly and meaningfully conscious as human beings or instead only by systems as experientially blank as toasters. None of the standard arguments either for or against AI consciousness takes us far. Table of Contents Chapter One: Hills and Fog Chapter Two: What Is Consciousness? What Is AI? Chapter Three: Ten Possibly Essential Features of Consciousness Chapter Four: Against Introspective and Conceptual Arguments for Essential Features Chapter Five: Materialism and Functionalism Chapter Six: The Turing Test and the Chinese Room Chapter Seven: The Mimicry Argument Against AI Consciousness Chapter Eight: Global Workspace Theories and Higher Order Theories Chapter Nine: Integrated Information, Local Recurrence, Associative Learning, and Iterative Natural Kinds Chapter Ten: Does Biological Substrate Matter? Chapter Eleven: The Problem of Strange Intelligence Chapter Twelve: The Leapfrog Hypothesis and the Social Semi-Solution

[627] Beyond AlphaEarth: Toward Human-Centered Spatial Representation via POI-Guided Contrastive Learning

Junyuan Liu, Quan Qin, Guangsheng Dong, Xinglei Wang, Jiazhuang Feng, Zichao Zeng, Tao Cheng

Main category: cs.AI

TL;DR: AETHER enhances AlphaEarth’s geospatial representations by aligning EO embeddings with POI data, adding human-centered semantics to physical features for improved urban analysis.

Details

Motivation: EO-driven representations like AlphaEarth capture physical patterns but lack functional and socioeconomic dimensions of cities, limiting their utility for human-centered urban analysis.

Method: AETHER uses lightweight multimodal alignment to enrich AlphaEarth embeddings with POI textual representations, coupling physical EO features with semantic cues about urban functions.

Result: In Greater London, AETHER achieved 7.2% relative improvement in land-use classification F1 and 23.6% relative reduction in KL divergence for socioeconomic mapping compared to AE baseline.

Conclusion: AETHER advances geospatial foundation models by integrating physical form with functional meaning, creating more comprehensive urban representations through EO-POI alignment.

Abstract: General-purpose spatial representations are essential for building transferable geospatial foundation models (GFMs). Among them, the AlphaEarth Foundation (AE) represents a major step toward a global, unified representation of the Earth’s surface, learning 10-meter embeddings from multi-source Earth Observation (EO) data that capture rich physical and environmental patterns across diverse landscapes. However, such EO-driven representations remain limited in capturing the functional and socioeconomic dimensions of cities, as they primarily encode physical and spectral patterns rather than human activities or spatial functions. We propose AETHER (AlphaEarth-POI Enriched Representation Learning), a lightweight framework that adapts AlphaEarth to human-centered urban analysis through multimodal alignment guided by Points of Interest (POIs). AETHER aligns AE embeddings with textual representations of POIs, enriching physically grounded EO features with semantic cues about urban functions and socioeconomic contexts. In Greater London, AETHER achieves consistent gains over the AE baseline, with a 7.2% relative improvement in land-use classification F1 and a 23.6% relative reduction in Kullback-Leibler divergence for socioeconomic mapping. Built upon pretrained AE, AETHER leverages a lightweight multimodal alignment to enrich it with human-centered semantics while remaining computationally efficient and scalable for urban applications. By coupling EO with human-centered semantics, it advances geospatial foundation models toward general-purpose urban representations that integrate both physical form and functional meaning.

[628] Autonomous Agents for Scientific Discovery: Orchestrating Scientists, Language, Code, and Physics

Lianhao Zhou, Hongyi Ling, Cong Fu, Yepeng Huang, Michael Sun, Wendi Yu, Xiaoxuan Wang, Xiner Li, Xingyu Su, Junkai Zhang, Xiusi Chen, Chenxing Liang, Xiaofeng Qian, Heng Ji, Wei Wang, Marinka Zitnik, Shuiwang Ji

Main category: cs.AI

TL;DR: LLM-based scientific agents are transforming scientific discovery by orchestrating interactions between humans, natural language, code, and physics across the entire research lifecycle.

Details

Motivation: To leverage the paradigm shift brought by large language models to create autonomous systems that accelerate scientific discovery across various domains and levels of autonomy.

Method: Critical examination of current methodologies for LLM-based scientific agents, analyzing their framework for orchestrating interactions with human scientists, natural language, computer code, and physics.

Result: Identification of key innovations, practical achievements, and outstanding limitations in current scientific agent systems, highlighting their growing role in transforming the scientific discovery lifecycle.

Conclusion: LLM-based scientific agents have transformative potential to accelerate discovery across diverse domains, but require further research to build more robust, generalizable, and adaptive systems.

Abstract: Computing has long served as a cornerstone of scientific discovery. Recently, a paradigm shift has emerged with the rise of large language models (LLMs), introducing autonomous systems, referred to as agents, that accelerate discovery across varying levels of autonomy. These language agents provide a flexible and versatile framework that orchestrates interactions with human scientists, natural language, computer language and code, and physics. This paper presents our view and vision of LLM-based scientific agents and their growing role in transforming the scientific discovery lifecycle, from hypothesis discovery, experimental design and execution, to result analysis and refinement. We critically examine current methodologies, emphasizing key innovations, practical achievements, and outstanding limitations. Additionally, we identify open research challenges and outline promising directions for building more robust, generalizable, and adaptive scientific agents. Our analysis highlights the transformative potential of autonomous agents to accelerate scientific discovery across diverse domains.

[629] The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs

Xi Fang, Weijie Xu, Yuchong Zhang, Stephanie Eckman, Scott Nickleach, Chandan K. Reddy

Main category: cs.AI

TL;DR: LLMs show systematic biases in emotional intelligence where user memory affects emotional interpretations, with advantaged profiles receiving more accurate responses, potentially reinforcing social inequalities.

Details

Motivation: To understand how user memory in personalized AI systems affects emotional reasoning and whether it introduces biases in emotional intelligence.

Method: Evaluated 15 LLMs on human-validated emotional intelligence tests using identical scenarios paired with different user profiles to analyze emotional interpretations.

Result: Identical scenarios with different user profiles produced systematically divergent emotional interpretations, with advantaged profiles receiving more accurate responses. Significant disparities emerged across demographic factors in emotion understanding and recommendations.

Conclusion: Personalization mechanisms in memory-enhanced AI systems can embed social hierarchies into emotional reasoning, potentially reinforcing social inequalities despite intentions for personalization.

Abstract: When an AI assistant remembers that Sarah is a single mother working two jobs, does it interpret her stress differently than if she were a wealthy executive? As personalized AI systems increasingly incorporate long-term user memory, understanding how this memory shapes emotional reasoning is critical. We investigate how user memory affects emotional intelligence in large language models (LLMs) by evaluating 15 models on human validated emotional intelligence tests. We find that identical scenarios paired with different user profiles produce systematically divergent emotional interpretations. Across validated user independent emotional scenarios and diverse user profiles, systematic biases emerged in several high-performing LLMs where advantaged profiles received more accurate emotional interpretations. Moreover, LLMs demonstrate significant disparities across demographic factors in emotion understanding and supportive recommendations tasks, indicating that personalization mechanisms can embed social hierarchies into models emotional reasoning. These results highlight a key challenge for memory enhanced AI: systems designed for personalization may inadvertently reinforce social inequalities.

[630] Follow My Lead: Logical Fallacy Classification with Knowledge-Augmented LLMs

Olivia Peiyu Wang, Tashvi Bansal, Ryan Bai, Emily M. Chui, Leilani H. Gilpin

Main category: cs.AI

TL;DR: The paper introduces a low-cost instruction-based intervention to improve LLM reasoning by decomposing logical fallacy classification into atomic procedural steps with verification, achieving significant accuracy improvements.

Details

Motivation: LLMs suffer from critical reasoning gaps including hallucinations and poor logical fallacy classification due to their default System 1 processing, while reliable reasoning requires System 2 approach which is expensive to implement.

Method: Created a stepwise instruction dataset that decomposes fallacy classification into atomic procedural steps (binary questions) and augmented with final verification using a relational knowledge graph of related fallacies.

Result: The procedural, rule-based intervention yields significant improvement in LLM logical fallacy classification accuracy and provides enhanced transparency into decision-making.

Conclusion: This approach offers a practical pathway for Neuro-symbolic architectures to address LLM reasoning deficits through low-cost instruction-based interventions.

Abstract: Large Language Models (LLMs) suffer from critical reasoning gaps, including a tendency to hallucinate and poor accuracy in classifying logical fallacies. This limitation stems from their default System 1 processing, which is fast and intuitive, whereas reliable reasoning requires the deliberate, effortful System 2 approach (Kahneman, 2011; Li et al., 2025). Since full System 2 training is often prohibitively expensive, we explore a low-cost, instruction-based intervention to bridge this gap. Our methodology introduces a novel stepwise instruction dataset that decomposes fallacy classification into a series of atomic procedural steps (simple binary questions). We further augment this with a final verification step where models consult a relational knowledge graph of related fallacies. This procedural, rule-based intervention yields a significant improvement in LLM logical fallacy classification. Crucially, the approach also provides enhanced transparency into the LLMs’ decision-making, highlighting a practical pathway for Neuro-symbolic architectures to address LLM reasoning deficits.

[631] Deliberative Dynamics and Value Alignment in LLM Debates

Pratik S. Sachdeva, Tom van Nuenen

Main category: cs.AI

TL;DR: This paper examines how LLMs’ moral reasoning and value alignment differ in multi-turn debates vs single-turn settings, using Reddit dilemmas to analyze deliberation dynamics across GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash.

Details

Motivation: As LLMs are increasingly used in sensitive contexts like personal advice and moral guidance, understanding their elicited values in complex moral reasoning is essential. Most evaluations use single-turn prompts, but it's unclear if findings extend to multi-turn settings where values emerge through dialogue.

Method: Used LLM debate with three models (GPT-4.1, Claude 3.7 Sonnet, Gemini 2.0 Flash) to assign blame in 1,000 Reddit ‘Am I the Asshole’ dilemmas. Tested both synchronous (parallel responses) and round-robin (sequential responses) formats to examine order effects and verdict revision.

Result: Found striking behavioral differences: GPT showed strong inertia (0.6-3.1% revision rates) while Claude and Gemini were more flexible (28-41%). Value patterns diverged - GPT emphasized personal autonomy, Claude/Gemini prioritized empathy. Deliberation format strongly impacted behavior, with GPT and Gemini showing high conformity to order effects.

Conclusion: Deliberation format and model-specific behaviors significantly shape moral reasoning in multi-turn interactions, showing that sociotechnical alignment depends on how systems structure dialogue as much as on their outputs.

Abstract: As large language models (LLMs) are increasingly deployed in sensitive everyday contexts - offering personal advice, mental health support, and moral guidance - understanding their elicited values in navigating complex moral reasoning is essential. Most evaluations study this sociotechnical alignment through single-turn prompts, but it is unclear if these findings extend to multi-turn settings where values emerge through dialogue, revision, and consensus. We address this gap using LLM debate to examine deliberative dynamics and value alignment in multi-turn settings by prompting subsets of three models (GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash) to collectively assign blame in 1,000 everyday dilemmas from Reddit’s “Am I the Asshole” community. We use both synchronous (parallel responses) and round-robin (sequential responses) formats to test order effects and verdict revision. Our findings show striking behavioral differences. In the synchronous setting, GPT showed strong inertia (0.6-3.1% revision rates) while Claude and Gemini were far more flexible (28-41%). Value patterns also diverged: GPT emphasized personal autonomy and direct communication, while Claude and Gemini prioritized empathetic dialogue. Certain values proved especially effective at driving verdict changes. We further find that deliberation format had a strong impact on model behavior: GPT and Gemini stood out as highly conforming relative to Claude, with their verdict behavior strongly shaped by order effects. These results show how deliberation format and model-specific behaviors shape moral reasoning in multi-turn interactions, underscoring that sociotechnical alignment depends on how systems structure dialogue as much as on their outputs.

[632] RIPRAG: Hack a Black-box Retrieval-Augmented Generation Question-Answering System with Reinforcement Learning

Meng Xi, Sihan Lv, Yechen Jin, Guanjie Cheng, Naibo Wang, Ying Li, Jianwei Yin

Main category: cs.AI

TL;DR: RIPRAG is a black-box poisoning attack framework that uses reinforcement learning to inject poisoned documents into RAG systems, achieving significantly higher attack success rates than baseline methods.

Details

Motivation: Existing RAG security research focuses on white-box attacks against simplified architectures, but real-world scenarios involve complex black-box systems where attackers lack internal knowledge.

Method: Proposed RIPRAG framework uses reinforcement learning to optimize poisoned document generation, treating the RAG system as a black box and only using attack success feedback for optimization.

Result: The method achieves up to 0.72 improvement in attack success rate compared to baselines, effectively poisoning most complex RAG systems.

Conclusion: Current RAG systems have significant security vulnerabilities, and the study provides critical insights for improving LLM security defenses against black-box poisoning attacks.

Abstract: Retrieval-Augmented Generation (RAG) systems based on Large Language Models (LLMs) have become a core technology for tasks such as question-answering (QA) and content generation. However, by injecting poisoned documents into the database of RAG systems, attackers can manipulate LLMs to generate text that aligns with their intended preferences. Existing research has primarily focused on white-box attacks against simplified RAG architectures. In this paper, we investigate a more complex and realistic scenario: the attacker lacks knowledge of the RAG system’s internal composition and implementation details, and the RAG system comprises components beyond a mere retriever. Specifically, we propose the RIPRAG attack framework, an end-to-end attack pipeline that treats the target RAG system as a black box, where the only information accessible to the attacker is whether the poisoning succeeds. Our method leverages Reinforcement Learning (RL) to optimize the generation model for poisoned documents, ensuring that the generated poisoned document aligns with the target RAG system’s preferences. Experimental results demonstrate that this method can effectively execute poisoning attacks against most complex RAG systems, achieving an attack success rate (ASR) improvement of up to 0.72 compared to baseline methods. This highlights prevalent deficiencies in current defensive methods and provides critical insights for LLM security research.

[633] Failure-Driven Workflow Refinement

Jusheng Zhang, Kaitong Cai, Qinglin Zeng, Ningyuan Liu, Stephen Fan, Ziliang Chen, Keze Wang

Main category: cs.AI

TL;DR: The paper proposes CE-Graph, a new paradigm for optimizing LLM workflows by minimizing Expected Failure Mass in a Failure Signature Space, rather than maximizing scalar metrics.

Details

Motivation: Existing workflow optimization methods suffer from information collapse by reducing rich execution traces to simple success/failure signals, preventing them from modeling the underlying failure distribution structure.

Method: CE-Graph approximates failure distributions from counterexamples, identifies dense failure modes, and applies targeted graph edits via a Propose-and-Verify mechanism to reduce failure mass in the Failure Signature Space.

Result: On math, code, and QA benchmarks, CE-Graph achieves higher robustness at significantly lower cost than strong baselines.

Conclusion: System reliability emerges from systematically learning and reshaping the geometric structure of failure distributions, rather than simply avoiding failures.

Abstract: Optimizing LLM-based workflows is typically formulated as a global search, where candidate workflows are evaluated based on a scalar metric. This paradigm, however, suffers from a critical flaw: information collapse. By reducing rich, multi-step execution traces to simple success/failure signals, existing methods are rendered blind to the underlying structure of failures, fundamentally preventing them from modeling the workflow’s failure distribution. We reconceptualize this challenge as a distributional problem. We propose a new paradigm where the optimization goal is not to maximize a scalar score, but to directly minimize a workflow’s Expected Failure Mass, i.e., the integral of its failure probability density function defined over a high-dimensional Failure Signature Space (FSS). This distributional lens allows us to move from inefficient, zero-order optimization to a principled, gradient-like descent on the failure landscape itself. We introduce CE-Graph, a framework that operationalizes this paradigm through a novel, failure-driven refinement process. CE-Graph approximates the failure distribution from a pool of counterexamples, identifies its densest regions as recurring failure modes, and applies targeted, operator-constrained graph edits via a Propose-and-Verify mechanism to greedily reduce the failure mass. On math, code, and QA benchmarks, our CE-Graph achieves higher robustness at a significantly lower cost than strong baselines. This suggests that a system’s reliability emerges not from avoiding failures, but from systematically learning and reshaping the geometric structure of its failure distributions.

[634] Belief Graphs with Reasoning Zones: Structure, Dynamics, and Epistemic Activation

Saleh Nikooroo, Thomas Engel

Main category: cs.AI

TL;DR: A graph-based framework separates credibility (external trust) from confidence (network-induced valuation) for belief systems. It defines reasoning zones as balanced subgraphs where classical inference works despite global contradictions, with methods for zone construction and belief change management.

Details

Motivation: To enable effective reasoning in belief systems that are globally inconsistent but locally coherent, by separating source credibility from network-induced confidence and identifying stable reasoning zones.

Method: Directed signed weighted graphs represent beliefs, with contractive propagation for confidence calculation. Zone construction uses confidence seeding, parity-based balance testing, and greedy repair with Jaccard de-duplication. Shock updates model belief change while preserving contractivity.

Result: The framework provides near-linear procedures for zone recovery, maintains stability under belief shocks, and enables localized reconfiguration without destabilizing the entire belief network.

Conclusion: This approach offers a principled foundation for contradiction-tolerant reasoning that activates classical logic precisely where network structure supports it, allowing effective inference despite global inconsistencies.

Abstract: Belief systems are rarely globally consistent, yet effective reasoning often persists locally. We propose a novel graph-theoretic framework that cleanly separates credibility–external, a priori trust in sources–from confidence–an internal, emergent valuation induced by network structure. Beliefs are nodes in a directed, signed, weighted graph whose edges encode support and contradiction. Confidence is obtained by a contractive propagation process that mixes a stated prior with structure-aware influence and guarantees a unique, stable solution. Within this dynamics, we define reasoning zones: high-confidence, structurally balanced subgraphs on which classical inference is safe despite global contradictions. We provide a near-linear procedure that seeds zones by confidence, tests balance using a parity-based coloring, and applies a greedy, locality-preserving repair with Jaccard de-duplication to build a compact atlas. To model belief change, we introduce shock updates that locally downscale support and elevate targeted contradictions while preserving contractivity via a simple backtracking rule. Re-propagation yields localized reconfiguration-zones may shrink, split, or collapse–without destabilizing the entire graph. We outline an empirical protocol on synthetic signed graphs with planted zones, reporting zone recovery, stability under shocks, and runtime. The result is a principled foundation for contradiction-tolerant reasoning that activates classical logic precisely where structure supports it.

[635] SwarmSys: Decentralized Swarm-Inspired Agents for Scalable and Adaptive Reasoning

Ruohao Li, Hongjun Liu, Leyi Zhao, Zisu Li, Jiawei Li, Jiajun Jiang, Linning Xu, Chen Zhao, Mingming Fan, Chen Liang

Main category: cs.AI

TL;DR: SwarmSys is a distributed multi-agent reasoning framework inspired by swarm intelligence, using three specialized roles (Explorers, Workers, Validators) that cycle through exploration, exploitation, and validation to enable scalable and adaptive collaboration without global supervision.

Details

Motivation: Existing multi-agent frameworks rely on fixed roles or centralized control, limiting scalability and adaptability in long-horizon reasoning. The authors aim to create a more flexible and scalable approach inspired by swarm intelligence.

Method: SwarmSys uses three specialized roles (Explorers, Workers, Validators) that continuously cycle through exploration, exploitation, and validation. It integrates adaptive agent/event profiles, embedding-based probabilistic matching, and pheromone-inspired reinforcement for dynamic task allocation and self-organizing convergence.

Result: Across symbolic reasoning, research synthesis, and scientific programming tasks, SwarmSys consistently outperforms baselines, improving both accuracy and reasoning stability.

Conclusion: Swarm-inspired coordination is a promising paradigm for scalable, robust, and adaptive multi-agent reasoning, suggesting that coordination scaling may rival model scaling in advancing LLM intelligence.

Abstract: Large language model (LLM) agents have shown remarkable reasoning abilities. However, existing multi-agent frameworks often rely on fixed roles or centralized control, limiting scalability and adaptability in long-horizon reasoning. We introduce SwarmSys, a closed-loop framework for distributed multi-agent reasoning inspired by swarm intelligence. Coordination in SwarmSys emerges through iterative interactions among three specialized roles, Explorers, Workers, and Validators, that continuously cycle through exploration, exploitation, and validation. To enable scalable and adaptive collaboration, we integrate adaptive agent and event profiles, embedding-based probabilistic matching, and a pheromone-inspired reinforcement mechanism, supporting dynamic task allocation and self-organizing convergence without global supervision. Across symbolic reasoning, research synthesis, and scientific programming tasks, SwarmSys consistently outperforms baselines, improving both accuracy and reasoning stability. These findings highlight swarm-inspired coordination as a promising paradigm for scalable, robust, and adaptive multi-agent reasoning, suggesting that coordination scaling may rival model scaling in advancing LLM intelligence.

[636] Agentic Troubleshooting Guide Automation for Incident Management

Jiayi Mao, Liqun Li, Yanjie Gao, Zegang Peng, Shilin He, Chaoyun Zhang, Si Qin, Samia Khalid, Qingwei Lin, Saravan Rajmohan, Sitaram Lanka, Dongmei Zhang

Main category: cs.AI

TL;DR: StepFly is an end-to-end agentic framework that automates troubleshooting guide execution using a three-stage workflow to handle TSG quality issues, extract execution DAGs, and enable parallel execution, achieving 94% success rate and significant time reductions.

Details

Motivation: Manual execution of troubleshooting guides in IT incident management is slow and error-prone, and existing LLM-based solutions lack specialized support for TSG quality issues, complex control flow, data-intensive queries, and execution parallelism.

Method: Three-stage workflow: 1) TSG Mentor tool to improve TSG quality, 2) offline preprocessing using LLMs to extract structured execution DAGs and create Query Preparation Plugins, 3) online execution with DAG-guided scheduler-executor framework supporting parallel execution of independent steps.

Result: Achieves ~94% success rate on GPT-4.1, outperforms baselines with less time and token consumption, and reduces execution time by 32.9% to 70.4% for parallelizable TSGs.

Conclusion: StepFly effectively automates troubleshooting guide execution through its comprehensive framework, demonstrating high success rates and significant performance improvements over existing approaches.

Abstract: Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study on 92 real-world TSGs, and, guided by our findings, we present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. Our approach features a three-stage workflow: the first stage provides a comprehensive guide together with a tool, TSG Mentor, to assist SREs in improving TSG quality; the second stage performs offline preprocessing using LLMs to extract structured execution DAGs from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs); and the third stage executes online using a DAG-guided scheduler-executor framework with a memory system to guarantee correct workflow and support parallel execution of independent steps. Our empirical evaluation on a collection of real-world TSGs and incidents demonstrates that StepFly achieves a ~94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. Furthermore, it achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs.

[637] Benchmarking and Bridging Emotion Conflicts for Multimodal Emotion Reasoning

Zhiyuan Han, Beier Zhu, Yanlong Xu, Peipei Song, Xun Yang

Main category: cs.AI

TL;DR: The paper introduces CA-MER benchmark for evaluating MLLMs on emotion conflicts and proposes MoSEAR framework to address audio bias in multimodal emotion reasoning.

Details

Motivation: Existing MLLMs overlook emotion conflicts where emotional cues from different modalities are inconsistent, leading to systematic over-reliance on audio signals.

Method: Proposes MoSEAR framework with two modules: MoSE (modality-specific experts with regularized gating) and AR (attention reallocation mechanism) to balance modality integration.

Result: MoSEAR achieves state-of-the-art performance on multiple benchmarks including MER2023, EMER, DFEW, and CA-MER, particularly under modality conflict conditions.

Conclusion: The proposed framework effectively mitigates emotion conflicts and improves performance on consistent samples without trade-offs between audio and visual modalities.

Abstract: Despite their strong performance in multimodal emotion reasoning, existing Multimodal Large Language Models (MLLMs) often overlook the scenarios involving emotion conflicts, where emotional cues from different modalities are inconsistent. To fill this gap, we first introduce CA-MER, a new benchmark designed to examine MLLMs under realistic emotion conflicts. It consists of three subsets: video-aligned, audio-aligned, and consistent, where only one or all modalities reflect the true emotion. However, evaluations on our CA-MER reveal that current state-of-the-art emotion MLLMs systematically over-rely on audio signal during emotion conflicts, neglecting critical cues from visual modality. To mitigate this bias, we propose MoSEAR, a parameter-efficient framework that promotes balanced modality integration. MoSEAR consists of two modules: (1)MoSE, modality-specific experts with a regularized gating mechanism that reduces modality bias in the fine-tuning heads; and (2)AR, an attention reallocation mechanism that rebalances modality contributions in frozen backbones during inference. Our framework offers two key advantages: it mitigates emotion conflicts and improves performance on consistent samples-without incurring a trade-off between audio and visual modalities. Experiments on multiple benchmarks-including MER2023, EMER, DFEW, and our CA-MER-demonstrate that MoSEAR achieves state-of-the-art performance, particularly under modality conflict conditions.

[638] DixitWorld: Evaluating Multimodal Abductive Reasoning in Vision-Language Models with Multi-Agent Dixit Gameplay

Yunxiang Mo, Tianshi Zheng, Qing Zong, Jiayu Liu, Baixuan Xu, Yauwai Yim, Chunkit Chan, Jiaxin Bai, Yangqiu Song

Main category: cs.AI

TL;DR: DixitWorld is a new evaluation suite for multimodal abductive reasoning in VLMs, featuring DixitArena (multi-agent environment) and DixitBench (static QA benchmark), revealing trade-offs between generative creativity and discriminative understanding.

Details

Motivation: Current evaluations of multimodal abductive reasoning in vision-language models are limited to static, single-agent tasks, failing to capture the dynamic, multi-agent nature of real-world reasoning.

Method: Created DixitWorld with two components: DixitArena (dynamic multi-agent environment with storytellers and listeners) and DixitBench (static QA benchmark for controlled evaluation of hypothesis selection).

Result: Smaller open-source models excel as creative storytellers but produce less discriminative clues, while larger proprietary models perform better overall, especially as listeners. DixitBench strongly correlates with listener performance in DixitArena.

Conclusion: There’s a key trade-off between generative creativity and discriminative understanding in multimodal abductive reasoning, highlighting a central challenge for developing more balanced vision-language agents.

Abstract: Multimodal abductive reasoning–the generation and selection of explanatory hypotheses from partial observations–is a cornerstone of intelligence. Current evaluations of this ability in vision-language models (VLMs) are largely confined to static, single-agent tasks. Inspired by Dixit, we introduce DixitWorld, a comprehensive evaluation suite designed to deconstruct this challenge. DIXITWORLD features two core components: DixitArena, a dynamic, multi-agent environment that evaluates both hypothesis generation (a “storyteller” crafting cryptic clues) and hypothesis selection (“listeners” choosing the target image from decoys) under imperfect information; and DixitBench, a static QA benchmark that isolates the listener’s task for efficient, controlled evaluation. Results from DixitArena reveal distinct, role-dependent behaviors: smaller open-source models often excel as creative storytellers, producing imaginative yet less discriminative clues, whereas larger proprietary models demonstrate superior overall performance, particularly as listeners. Performance on DixitBench strongly correlates with listener results in DixitArena, validating it as a reliable proxy for hypothesis selection. Our findings reveal a key trade-off between generative creativity and discriminative understanding in multimodal abductive reasoning, a central challenge for developing more balanced and capable vision-language agents.

[639] CharCom: Composable Identity Control for Multi-Character Story Illustration

Zhongsheng Wang, Ming Lin, Zhedong Lin, Yaser Shakib, Qian Liu, Jiamou Liu

Main category: cs.AI

TL;DR: CharCom is a modular framework that uses composable LoRA adapters to maintain character identity consistency in diffusion-based text-to-image generation without retraining the base model.

Details

Motivation: Character identity consistency across varying prompts is a fundamental limitation in diffusion-based text-to-image generation.

Method: Built on frozen diffusion backbone, uses composable LoRA adapters with prompt-aware control to dynamically compose adapters at inference for per-character customization.

Result: Significantly enhances character fidelity, semantic alignment, and temporal coherence; robust in crowded scenes and enables scalable multi-character generation with minimal overhead.

Conclusion: Well-suited for real-world applications such as story illustration and animation due to its efficiency and consistency capabilities.

Abstract: Ensuring character identity consistency across varying prompts remains a fundamental limitation in diffusion-based text-to-image generation. We propose CharCom, a modular and parameter-efficient framework that achieves character-consistent story illustration through composable LoRA adapters, enabling efficient per-character customization without retraining the base model. Built on a frozen diffusion backbone, CharCom dynamically composes adapters at inference using prompt-aware control. Experiments on multi-scene narratives demonstrate that CharCom significantly enhances character fidelity, semantic alignment, and temporal coherence. It remains robust in crowded scenes and enables scalable multi-character generation with minimal overhead, making it well-suited for real-world applications such as story illustration and animation.

[640] Concise Reasoning in the Lens of Lagrangian Optimization

Chengqian Gao, Haonan Li, Taylor W. Killian, Jianshu She, Renxi Wang, Liqun Ma, Zhoujun Cheng, Shibo Hao, Zhiqiang Xu

Main category: cs.AI

TL;DR: PALU is a principled algorithm that formulates concise reasoning as a constrained optimization problem, reducing response length by 65% while improving accuracy by 15% across multiple domains and model scales.

Details

Motivation: Existing approaches for concise reasoning rely on hand-crafted heuristics that struggle to balance concision with performance and fail to adapt across domains and model scales.

Method: PALU formulates concise reasoning as constrained optimization, minimizing response length subject to performance constraints using Lagrangian optimization. It uses three approximations: off-policy rollouts for performance estimation, truncated Lagrange multipliers, and quantile-driven length adjustments.

Result: PALU reduces output length by 65% while improving accuracy by 15% when applied to DeepSeek-Distill-Qwen-1.5B across five benchmarks, outperforming alternative methods and adapting across domains (logic, STEM, math) and model scales (1.5B, 7B, 14B).

Conclusion: PALU is a practical and effective concise reasoning approach that successfully adapts across domains and model scales while significantly reducing response length and improving accuracy.

Abstract: Concise reasoning in large language models seeks to generate only essential intermediate steps needed to arrive at a final answer, thereby alleviating issues of overthinking. Most proposed approaches hinge on carefully hand-crafted heuristics, struggling to balance concision with performance, often failing to adapt across domains and model scales. In this work, we address these challenges by introducing a principled and pragmatic strategy, performance-aware length updating (PALU). As a principled algorithm, PALU formulates concise reasoning as a constrained optimization problem, minimizing response length subject to a performance constraint, and then applies Lagrangian optimization to convert it into a tractable unconstrained problem. As a pragmatic solution, PALU streamlines complicated update rules through three approximations: (i) estimating performance with off-policy rollouts, (ii) truncating the Lagrange multiplier to two extremes, and (iii) replacing gradient-based updates with quantile-driven length adjustments. PALU reduces output length by 65% while improving accuracy by 15% when applied to DeepSeek-Distill-Qwen-1.5B, averaged over five benchmarks, outperforming a range of alternative methods. Furthermore, PALU is demonstrated to adapt across both domain (logic, STEM and math) and model scale (1.5B, 7B, 14B) entrenching the algorithm as a practical and effective concise reasoning approach.

[641] SAFER: Risk-Constrained Sample-then-Filter in Large Language Models

Qingni Wang, Yue Fan, Xin Eric Wang

Main category: cs.AI

TL;DR: SAFER is a two-stage risk control framework for LLMs in open-ended QA that combines abstention-aware sampling and conformalized filtering to provide statistical guarantees without assuming finite answer spaces.

Details

Motivation: Existing selective conformal prediction methods unrealistically assume finite sampling can obtain all admissible answers for open-ended QA, which lacks fixed solution spaces. This creates trustworthiness issues for LLM deployment in risk-sensitive applications.

Method: Two-stage framework: 1) Calibrates sampling budget using Clopper-Pearson method with user-defined risk level, abstaining if risk cannot be met within sampling cap; 2) Uses conformal risk control to determine uncertainty threshold that filters unreliable distractors from candidate sets.

Result: SAFER provides statistical guarantees for correct answer coverage while handling open-ended QA scenarios. It’s compatible with various task-specific admission criteria and calibration-test split ratios, demonstrating robustness and high data efficiency.

Conclusion: SAFER addresses limitations of prior SCP methods by introducing a practical framework that doesn’t require finite answer spaces, making it suitable for real-world open-ended QA applications where trustworthiness is critical.

Abstract: As large language models (LLMs) are increasingly deployed in risk-sensitive applications such as real-world open-ended question answering (QA), ensuring the trustworthiness of their outputs has become critical. Existing selective conformal prediction (SCP) methods provide statistical guarantees by constructing prediction sets with a constrained miscoverage rate for correct answers. However, prior works unrealistically assume that admissible answers for all instances can be obtained via finite sampling, even for open-ended QA scenarios that lack a fixed and finite solution space. To address this, we introduce a two-stage risk control framework comprising abstention-aware sampling and conformalized filtering (SAFER). Firstly, on a held-out calibration set, SAFER calibrates a sampling budget within the maximum sampling cap, using the Clopper-Pearson exact method at a user-desired risk level (i.e., the maximum allowable miscoverage rate of the sampling sets). If the risk level cannot be satisfied within the cap, we abstain; otherwise, the calibrated sampling budget becomes the minimum requirements at test time. Then, we employ calibration instances where correct answers are attainable under the calibrated budget and apply the conformal risk control method to determine a statistically valid uncertainty threshold, which filters unreliable distractors from the candidate set for each test data point. In this stage, SAFER introduces an additional risk level to guide the calculation of the threshold, thereby controlling the risk of correct answers being excluded. Furthermore, we show that SAFER is compatible with various task-specific admission criteria and calibration-test split ratios, highlighting its robustness and high data efficiency.

[642] Don’t Just Fine-tune the Agent, Tune the Environment

Siyuan Lu, Zechuan Wang, Hongxuan Zhang, Qintong Wu, Leilei Gan, Chenyi Zhuang, Jinjie Gu, Tao Lin

Main category: cs.AI

TL;DR: Environment Tuning is a novel training paradigm that enables LLM agents to learn complex behaviors directly from problem instances without expert trajectories, using structured curriculum, environment augmentation, and fine-grained rewards.

Details

Motivation: Current approaches face challenges: SFT on synthetic data leads to overfitting, while standard RL suffers from cold-start problems and training instability. There's extreme scarcity of high-quality training data for LLM agents.

Method: Environment Tuning uses structured curriculum, actionable environment augmentation for corrective feedback, and fine-grained progress rewards to enable stable and efficient exploration directly from problem instances.

Result: Using only 400 problem instances from BFCL benchmark, the method achieves competitive in-distribution performance and superior out-of-distribution generalization, overcoming the performance collapse common to SFT-based approaches.

Conclusion: This work presents a paradigm shift from supervised fine-tuning on static trajectories to dynamic, environment-based exploration, enabling training of more robust and data-efficient agents.

Abstract: Large Language Model (LLM) agents show great promise for complex, multi-turn tool-use tasks, but their development is often hampered by the extreme scarcity of high-quality training data. Supervised fine-tuning (SFT) on synthetic data leads to overfitting, whereas standard reinforcement learning (RL) struggles with a critical cold-start problem and training instability. To address these challenges, we introduce $\textbf{Environment Tuning}$, a novel training paradigm that enables agents to learn complex behaviors directly from problem instances without relying on pre-collected expert trajectories. $\textbf{Environment Tuning}$ orchestrates this learning process through a structured curriculum, actionable environment augmentation that provides corrective feedback, and fine-grained progress rewards to ensure stable and efficient exploration. Using only 400 problem instances from Berkeley Function-Calling Leaderboard (BFCL) benchmark, our method not only achieves competitive in-distribution performance against strong baselines but also demonstrates superior out-of-distribution generalization, overcoming the performance collapse common to SFT-based approaches. Our work presents a paradigm shift from supervised fine-tuning on static trajectories to dynamic, environment-based exploration, paving the way for training more robust and data-efficient agents.

[643] PIXEL: Adaptive Steering Via Position-wise Injection with eXact Estimated Levels under Subspace Calibration

Manjiang Yu, Hongji Li, Priyanka Singh, Xue Li, Di Wang, Lijie Hu

Main category: cs.AI

TL;DR: PIXEL is a position-wise activation steering framework that learns property-aligned subspaces from dual views and selects intervention strength via constrained geometric optimization, enabling precise LLM behavior control without global hyperparameter tuning.

Details

Motivation: Existing activation steering methods for LLM alignment rely on coarse heuristics and lack principled approaches for determining where to steer and how strongly to intervene, limiting reliable behavior control.

Method: PIXEL learns property-aligned subspaces from dual views (tail-averaged and end-token), selects intervention strength via constrained geometric objective with closed-form solution, performs sample-level orthogonal residual calibration, and uses lightweight position-scanning to identify receptive injection sites.

Result: Across diverse models and evaluation paradigms, PIXEL consistently improves attribute alignment while preserving model general capabilities, offering practical and principled LLM controllable generation.

Conclusion: PIXEL provides a practical and principled method for reliable LLM behavior control through adaptive activation steering with representation-level guarantees for minimal-intervention alignment.

Abstract: Reliable behavior control is central to deploying large language models (LLMs) on the web. Activation steering offers a tuning-free route to align attributes (e.g., truthfulness) that ensure trustworthy generation. Prevailing approaches rely on coarse heuristics and lack a principled account of where to steer and how strongly to intervene. To this end, we propose Position-wise Injection with eXact Estimated Levels (PIXEL), a position-wise activation steering framework that, in contrast to prior work, learns a property-aligned subspace from dual views (tail-averaged and end-token) and selects intervention strength via a constrained geometric objective with a closed-form solution, thereby adapting to token-level sensitivity without global hyperparameter tuning. PIXEL further performs sample-level orthogonal residual calibration to refine the global attribute direction and employs a lightweight position-scanning routine to identify receptive injection sites. We additionally provide representation-level guarantees for the minimal-intervention rule, supporting reliable alignment. Across diverse models and evaluation paradigms, PIXEL consistently improves attribute alignment while preserving model general capabilities, offering a practical and principled method for LLMs’ controllable generation. Our code is available at https://github.com/V1centNevwake/PIXEL-Adaptive-Steering

[644] Adaptive Dual Reasoner: Large Reasoning Models Can Think Efficiently by Hybrid Reasoning

Yujian Zhang, Keyu Chen, Zhifeng Shen, Ruizhi Qiao, Xing Sun

Main category: cs.AI

TL;DR: ADR is an adaptive dual reasoning model that dynamically switches between fast and slow thinking modes based on contextual complexity, achieving better performance with reduced computational costs.

Details

Motivation: Long Reasoning Models suffer from increased computational costs and inference latency due to overthinking, which limits their practical efficiency.

Method: Two-stage training: (1) Cold-start SFT with hybrid reasoning dataset, (2) Reinforcement learning with Entropy-guided Hybrid Policy Optimization (EHPO) using entropy-guided dynamic rollout and difficulty-aware penalty.

Result: Achieves up to 6.1% performance gain while reducing reasoning output length by 49.5% to 59.3% on mathematical reasoning benchmarks.

Conclusion: ADR effectively balances reasoning performance and efficiency among state-of-the-art approaches through adaptive dual reasoning modes.

Abstract: Although Long Reasoning Models (LRMs) have achieved superior performance on various reasoning scenarios, they often suffer from increased computational costs and inference latency caused by overthinking. To address these limitations, we propose Adaptive Dual Reasoner, which supports two reasoning modes: fast thinking and slow thinking. ADR dynamically alternates between these modes based on the contextual complexity during reasoning. ADR is trained in two stages: (1) A cold-start stage using supervised fine-tuning (SFT) to equip the model with the ability to integrate both fast and slow reasoning modes, in which we construct a hybrid reasoning dataset through a dedicated pipeline to provide large-scale supervision. (2) A reinforcement learning stage for optimizing reasoning effort, where we introduce Entropy-guided Hybrid Policy Optimization EHPO, an RL training framework employing an entropy-guided dynamic rollout strategy for branching at high-entropy units and a difficulty-aware penalty to balance fast and slow reasoning. Across challenging mathematical reasoning benchmarks, ADR achieves an effective balance between reasoning performance and efficiency among state-of-the-art approaches. Specifically, ADR yields a performance gain of up to 6.1%, while reducing the reasoning output length by 49.5% to 59.3%.

Christoph Aymanns, Jakob Foerster, Co-Pierre Georg, Matthias Weber

Main category: cs.AI

TL;DR: Multi-agent reinforcement learning models fake news spread in social networks, showing attacks are more effective when targeting highly connected individuals and using distributed disinformation across multiple agents.

Details

Motivation: To better model fake news spread in social networks, especially in populations that have adapted to fake news, which is challenging for existing methods.

Method: Multi-agent reinforcement learning approach to model human behavior in social networks, tested with human-subject experiments.

Result: Fake news attacks are more effective when targeting highly connected people and those with weaker private information. Distributed disinformation across multiple agents works better than concentrated attacks. Fake news spreads less in balanced networks than clustered networks.

Conclusion: The model is suitable for analyzing fake news spread in social networks, with experimental evidence supporting the model’s predictions.

Abstract: We propose multi-agent reinforcement learning as a new method for modeling fake news in social networks. This method allows us to model human behavior in social networks both in unaccustomed populations and in populations that have adapted to the presence of fake news. In particular the latter is challenging for existing methods. We find that a fake-news attack is more effective if it targets highly connected people and people with weaker private information. Attacks are more effective when the disinformation is spread across several agents than when the disinformation is concentrated with more intensity on fewer agents. Furthermore, fake news spread less well in balanced networks than in clustered networks. We test a part of our findings in a human-subject experiment. The experimental evidence provides support for the predictions from the model, suggesting that the model is suitable to analyze the spread of fake news in social networks.

[646] The Achilles’ Heel of LLMs: How Altering a Handful of Neurons Can Cripple Language Abilities

Zixuan Qin, Kunlin Lyu, Qingchen Yu, Yifan Sun, Zhaoxin Fan

Main category: cs.AI

TL;DR: LLMs contain ultra-sparse critical neurons that, when disrupted, can cause catastrophic model collapse with perplexity increasing by up to 20 orders of magnitude.

Details

Motivation: Inspired by neuroscience findings that a small subset of biological neurons are crucial for cognitive functions, this research investigates whether LLMs similarly contain critical neurons that are essential for their performance.

Method: Proposed a Perturbation-based Causal Identification of Critical Neurons method to systematically locate critical neurons in LLMs through targeted disruptions.

Result: Found that critical neurons are ultra-sparse, concentrated in outer layers (especially MLP down_proj components), and their disruption causes sharp phase transitions rather than gradual performance decline.

Conclusion: These findings provide insights for developing more robust LLM architectures and improving deployment security in safety-critical applications.

Abstract: Large Language Models (LLMs) have become foundational tools in natural language processing, powering a wide range of applications and research. Many studies have shown that LLMs share significant similarities with the human brain. Recent neuroscience research has found that a small subset of biological neurons in the human brain are crucial for core cognitive functions, which raises a fundamental question: do LLMs also contain a small subset of critical neurons? In this paper, we investigate this question by proposing a Perturbation-based Causal Identification of Critical Neurons method to systematically locate such critical neurons in LLMs. Our findings reveal three key insights: (1) LLMs contain ultra-sparse critical neuron sets. Disrupting these critical neurons can cause a 72B-parameter model with over 1.1 billion neurons to completely collapse, with perplexity increasing by up to 20 orders of magnitude; (2) These critical neurons are not uniformly distributed, but tend to concentrate in the outer layers, particularly within the MLP down_proj components; (3) Performance degradation exhibits sharp phase transitions, rather than a gradual decline, when these critical neurons are disrupted. Through comprehensive experiments across diverse model architectures and scales, we provide deeper analysis of these phenomena and their implications for LLM robustness and interpretability. These findings can offer guidance for developing more robust model architectures and improving deployment security in safety-critical applications.

[647] Mitigating Hallucination in Multimodal Reasoning via Functional Attention Control

Haolang Lu, Bolun Chu, WeiYe Fu, Guoshun Nan, Junning Liu, Minghui Pan, Qiankun Li, Yi Yu, Hua Wang, Kun Wang

Main category: cs.AI

TL;DR: The paper proposes a lightweight plugin to reduce hallucinations in multimodal large reasoning models by identifying and regulating perception- and reasoning-oriented attention heads without retraining.

Details

Motivation: Hallucination remains a persistent failure mode in MLRMs, manifesting as erroneous reasoning chains and visual content misinterpretation, which hinders safe deployment in high-stakes applications.

Method: A two-step plugin: Functional Head Identification locates perception- and reasoning-oriented attention heads, and Class-conditioned Rescaling regulates their contributions without retraining.

Result: Evaluations on three MLRMs show average improvement of 5% (up to 15%) with <1% additional computation and 9% of baseline latency. The approach is model-agnostic and enhances reliability and interpretability.

Conclusion: The proposed plugin significantly enhances both reliability and interpretability of off-the-shelf MLRMs, enabling safe deployment in high-stakes applications through lightweight attention head regulation.

Abstract: Multimodal large reasoning models (MLRMs) are rapidly advancing vision-language reasoning and are emerging as a foundation for cross-modal intelligence. Hallucination remains a persistent failure mode, manifesting itself as erroneous reasoning chains and misinterpretation of visual content. In this study, we observe that attention heads exhibit a staged division: shallow heads predominantly serve perception, while deeper heads shift toward symbolic reasoning, revealing two major causes of hallucination, namely perceptual bias and reasoning drift. To address these issues, we propose a lightweight and interpretable two-step plugin, Functional Head Identification and Class-conditioned Rescaling, which locates perception- and reasoning-oriented heads and regulates their contributions without retraining. Evaluations on three real-world MLRMs (Kimi-VL, Ocean-R1, R1-Onevision), six benchmarks across three domains, and four baselines show that our plugin achieves an average improvement of 5% and up to 15%, with only <1% additional computation and 9% of baseline latency. Our approach is completely model-agnostic and significantly enhances both the reliability and interpretability of the off-the-shelf MLRMs, thereby enabling their safe deployment in high-stakes applications. Our code is available at https://anonymous.4open.science/r/Functional-Attention-Control.

[648] LLM-Friendly Knowledge Representation for Customer Support

Hanchen Su, Wei Luo, Wei Han, Yu Elaine Liu, Yufeng Wayne Zhang, Cen Mia Zhao, Ying Joy Zhang, Yashar Mehdad

Main category: cs.AI

TL;DR: A practical approach integrating LLMs with Airbnb customer support using ICA format and synthetic data generation, showing improved performance and cost-effectiveness.

Details

Motivation: To navigate the complexities of Airbnb customer support operations by making policies and workflows more comprehensible to LLMs.

Method: Uses Intent, Context, and Action (ICA) format to restructure workflows and synthetic data generation for fine-tuning LLMs with minimal human intervention.

Result: Internal experiments show significant performance enhancement in customer support, with improvements in both accuracy and manual processing time metrics.

Conclusion: The approach sets a new benchmark for LLM application in customer support, being both cost-effective and performance-enhancing.

Abstract: We propose a practical approach by integrating Large Language Models (LLMs) with a framework designed to navigate the complexities of Airbnb customer support operations. In this paper, our methodology employs a novel reformatting technique, the Intent, Context, and Action (ICA) format, which transforms policies and workflows into a structure more comprehensible to LLMs. Additionally, we develop a synthetic data generation strategy to create training data with minimal human intervention, enabling cost-effective fine-tuning of our model. Our internal experiments (not applied to Airbnb products) demonstrate that our approach of restructuring workflows and fine-tuning LLMs with synthetic data significantly enhances their performance, setting a new benchmark for their application in customer support. Our solution is not only cost-effective but also improves customer support, as evidenced by both accuracy and manual processing time evaluation metrics.

[649] Beyond Ethics: How Inclusive Innovation Drives Economic Returns in Medical AI

Balagopal Unnikrishnan, Ariel Guerra Adames, Amin Adibi, Sameer Peesapati, Rafal Kocielnik, Shira Fischer, Hillary Clinton Kasimbazi, Rodrigo Gameiro, Alina Peluso, Chrystinne Oliveira Fernandes, Maximin Lange, Lovedeep Gondara, Leo Anthony Celi

Main category: cs.AI

TL;DR: The paper introduces the “inclusive innovation dividend” concept, arguing that healthcare AI solutions designed for diverse, constrained use cases generate superior economic returns in broader markets through market expansion, risk mitigation, performance dividends, and competitive advantages.

Details

Motivation: While ethical arguments for fairness in healthcare AI are established, the economic and strategic value of inclusive design remains underexplored. The paper aims to demonstrate the business value beyond compliance requirements.

Method: The paper draws from assistive technologies that evolved into mainstream industries and presents the Healthcare AI Inclusive Innovation Framework (HAIIF) - a practical scoring system to evaluate AI investments based on their potential to capture inclusive innovation benefits.

Result: The paper identifies four mechanisms through which inclusive innovation drives returns: market expansion, risk mitigation, performance dividends, and competitive advantages. HAIIF provides structured guidance for resource allocation.

Conclusion: Organizations investing incrementally in inclusive design can achieve expanded market reach and sustained competitive advantages, while those treating these considerations as overhead face compounding disadvantages as network effects and data advantages accrue to early movers.

Abstract: While ethical arguments for fairness in healthcare AI are well-established, the economic and strategic value of inclusive design remains underexplored. This perspective introduces the ``inclusive innovation dividend’’ – the counterintuitive principle that solutions engineered for diverse, constrained use cases generate superior economic returns in broader markets. Drawing from assistive technologies that evolved into billion-dollar mainstream industries, we demonstrate how inclusive healthcare AI development creates business value beyond compliance requirements. We identify four mechanisms through which inclusive innovation drives returns: (1) market expansion via geographic scalability and trust acceleration; (2) risk mitigation through reduced remediation costs and litigation exposure; (3) performance dividends from superior generalization and reduced technical debt, and (4) competitive advantages in talent acquisition and clinical adoption. We present the Healthcare AI Inclusive Innovation Framework (HAIIF), a practical scoring system that enables organizations to evaluate AI investments based on their potential to capture these benefits. HAIIF provides structured guidance for resource allocation, transforming fairness and inclusivity from regulatory checkboxes into sources of strategic differentiation. Our findings suggest that organizations investing incrementally in inclusive design can achieve expanded market reach and sustained competitive advantages, while those treating these considerations as overhead face compounding disadvantages as network effects and data advantages accrue to early movers.

[650] Trace Length is a Simple Uncertainty Signal in Reasoning Models

Siddartha Devic, Charlotte Peale, Arwen Bradley, Sinead Williamson, Preetum Nakkiran, Aravind Gollakota

Main category: cs.AI

TL;DR: Reasoning trace length is a simple and effective confidence estimator for large language models, performing comparably to verbalized confidence and working in complementary ways.

Details

Motivation: To address hallucination and reliability issues in LLMs by developing better uncertainty quantification methods, specifically exploring trace length as a confidence signal.

Method: Comprehensive experiments across multiple models, datasets, and prompts to evaluate trace length as a confidence estimator, investigating mechanisms through entropy analysis and controlling for confounders like problem difficulty.

Result: Trace length performs comparably to verbalized confidence estimators and remains effective even after adjusting for confounders. Reasoning post-training fundamentally changes the trace length-accuracy relationship, with high-entropy “forking” tokens playing a key role.

Conclusion: Reasoning trace length is a practical confidence measure for large reasoning models, and reasoning post-training enhances uncertainty quantification beyond verbal expressions.

Abstract: Uncertainty quantification for LLMs is a key research direction towards addressing hallucination and other issues that limit their reliable deployment. In this work, we show that reasoning trace length is a simple and useful confidence estimator in large reasoning models. Through comprehensive experiments across multiple models, datasets, and prompts, we show that trace length performs in comparable but complementary ways to other zero-shot confidence estimators such as verbalized confidence. Our work reveals that reasoning post-training fundamentally alters the relationship between trace length and accuracy, going beyond prior work that had shown that post-training causes traces to grow longer in general (e.g., “overthinking”). We investigate the mechanisms behind trace length’s performance as a confidence signal, observing that the effect remains even after adjusting for confounders such as problem difficulty and GRPO-induced length bias. We identify high-entropy or “forking” tokens as playing a key role in the mechanism. Our findings demonstrate that reasoning post-training enhances uncertainty quantification beyond verbal expressions, and establish trace length as a practical confidence measure for large reasoning models.

[651] Traj-CoA: Patient Trajectory Modeling via Chain-of-Agents for Lung Cancer Risk Prediction

Sihang Zeng, Yujuan Fu, Sitong Zhou, Zixuan Yu, Lucas Jing Liu, Jun Wen, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen

Main category: cs.AI

TL;DR: Traj-CoA is a multi-agent system using chain-of-agents to model patient trajectories from EHR data, addressing challenges of long, noisy temporal data through sequential processing and shared memory.

Details

Motivation: Large language models struggle with the long and noisy nature of electronic health records data in temporal reasoning tasks.

Method: Multi-agent system with worker agents processing EHR data in chunks, distilling critical events into EHRMem memory module, and manager agent synthesizing summaries for predictions.

Result: Outperformed baselines in zero-shot one-year lung cancer risk prediction using five-year EHR data.

Conclusion: Traj-CoA demonstrates clinically aligned temporal reasoning and offers a robust, generalizable approach for complex patient trajectory modeling.

Abstract: Large language models (LLMs) offer a generalizable approach for modeling patient trajectories, but suffer from the long and noisy nature of electronic health records (EHR) data in temporal reasoning. To address these challenges, we introduce Traj-CoA, a multi-agent system involving chain-of-agents for patient trajectory modeling. Traj-CoA employs a chain of worker agents to process EHR data in manageable chunks sequentially, distilling critical events into a shared long-term memory module, EHRMem, to reduce noise and preserve a comprehensive timeline. A final manager agent synthesizes the worker agents’ summary and the extracted timeline in EHRMem to make predictions. In a zero-shot one-year lung cancer risk prediction task based on five-year EHR data, Traj-CoA outperforms baselines of four categories. Analysis reveals that Traj-CoA exhibits clinically aligned temporal reasoning, establishing it as a promisingly robust and generalizable approach for modeling complex patient trajectories.

[652] MedCoAct: Confidence-Aware Multi-Agent Collaboration for Complete Clinical Decision

Hongjie Zheng, Zesheng Shi, Ping Yi

Main category: cs.AI

TL;DR: MedCoAct is a confidence-aware multi-agent framework that simulates clinical collaboration between doctor and pharmacist agents, achieving 67.58% accuracy in both diagnosis and medication recommendations, outperforming single-agent systems by over 7%.

Details

Motivation: Existing medical AI systems process tasks in isolation without cross-validation and knowledge integration found in clinical teams, reducing effectiveness in real-world healthcare scenarios.

Method: Proposed MedCoAct framework with specialized doctor and pharmacist agents that collaborate, plus DrugCareQA benchmark for evaluating integrated diagnosis and treatment workflows.

Result: MedCoAct achieves 67.58% diagnostic accuracy and 67.58% medication recommendation accuracy, outperforming single agent framework by 7.04% and 7.08% respectively.

Conclusion: The collaborative approach generalizes well across medical domains, is effective for telemedicine and routine clinical scenarios, and provides interpretable decision-making pathways.

Abstract: Autonomous agents utilizing Large Language Models (LLMs) have demonstrated remarkable capabilities in isolated medical tasks like diagnosis and image analysis, but struggle with integrated clinical workflows that connect diagnostic reasoning and medication decisions. We identify a core limitation: existing medical AI systems process tasks in isolation without the cross-validation and knowledge integration found in clinical teams, reducing their effectiveness in real-world healthcare scenarios. To transform the isolation paradigm into a collaborative approach, we propose MedCoAct, a confidence-aware multi-agent framework that simulates clinical collaboration by integrating specialized doctor and pharmacist agents, and present a benchmark, DrugCareQA, to evaluate medical AI capabilities in integrated diagnosis and treatment workflows. Our results demonstrate that MedCoAct achieves 67.58% diagnostic accuracy and 67.58% medication recommendation accuracy, outperforming single agent framework by 7.04% and 7.08% respectively. This collaborative approach generalizes well across diverse medical domains, proving especially effective for telemedicine consultations and routine clinical scenarios, while providing interpretable decision-making pathways.

[653] HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research

Yinghao Zhu, Yifan Qi, Zixiang Wang, Lei Gu, Dehao Sui, Haoran Hu, Xichen Zhang, Ziyi He, Junjun He, Liantao Ma, Lequan Yu

Main category: cs.AI

TL;DR: HealthFlow is a self-evolving AI agent that autonomously refines its problem-solving strategies through meta-level evolution, outperforming state-of-the-art frameworks on complex healthcare data analysis tasks.

Details

Motivation: Current AI agents are limited by static strategies and cannot effectively navigate the complex, evolving ecosystem of scientific research, particularly in high-stakes domains like healthcare.

Method: HealthFlow uses a novel meta-level evolution mechanism that autonomously refines high-level problem-solving policies by distilling procedural successes and failures into a durable, structured knowledge base. The research also introduces EHRFlowBench, a benchmark with complex health data analysis tasks derived from scientific literature.

Result: HealthFlow’s self-evolving approach significantly outperforms state-of-the-art agent frameworks in experiments.

Conclusion: This work offers a new paradigm for intelligent systems that can learn to operationalize procedural knowledge from scientific content, representing a critical step toward more autonomous and effective AI for healthcare scientific discovery.

Abstract: The rapid proliferation of scientific knowledge presents a grand challenge: transforming this vast repository of information into an active engine for discovery, especially in high-stakes domains like healthcare. Current AI agents, however, are constrained by static, predefined strategies, limiting their ability to navigate the complex, evolving ecosystem of scientific research. This paper introduces HealthFlow, a self-evolving AI agent that overcomes this limitation through a novel meta-level evolution mechanism. HealthFlow autonomously refines its high-level problem-solving policies by distilling procedural successes and failures into a durable, structured knowledge base, enabling it to learn not just how to use tools, but how to strategize. To anchor our research and provide a community resource, we introduce EHRFlowBench, a new benchmark featuring complex health data analysis tasks systematically derived from peer-reviewed scientific literature. Our experiments demonstrate that HealthFlow’s self-evolving approach significantly outperforms state-of-the-art agent frameworks. This work offers a new paradigm for intelligent systems that can learn to operationalize the procedural knowledge embedded in scientific content, marking a critical step toward more autonomous and effective AI for healthcare scientific discovery.

[654] Tracing the Traces: Latent Temporal Signals for Efficient and Accurate Reasoning

Martina G. Vilas, Safoora Yousefi, Besmira Nushi, Eric Horvitz, Vidhisha Balachandran

Main category: cs.AI

TL;DR: Latent-Trajectory signals use internal model representations during reasoning to predict solution accuracy, enabling more efficient inference-time scaling with up to 70% token reduction while improving accuracy.

Details

Motivation: To reduce wasted computation in reasoning models by identifying productive reasoning paths early, improving efficiency of inference-time scaling.

Method: Analyze temporal evolution of model’s internal representations during reasoning using three metrics: overall latent change, accumulated intermediate changes, and progress toward final state.

Result: Latent-Trajectory signals outperform cross-layer metrics and output-based confidence measures, reduce token usage by up to 70% while improving accuracy by 2.6% on average, and enable early selection of promising candidates.

Conclusion: Latent-Trajectory signals provide both practical efficiency gains for inference-time scaling and deeper interpretability into reasoning processes in latent space.

Abstract: Reasoning models improve their problem-solving ability through inference-time scaling, allocating more compute via longer token budgets. Identifying which reasoning traces are likely to succeed remains a key opportunity: reliably predicting productive paths can substantially reduce wasted computation and improve overall efficiency. We introduce Latent-Trajectory signals that characterize the temporal evolution of a model’s internal representations during the generation of intermediate reasoning tokens. By measuring the overall change in latent representations between the start and end of reasoning, the change accumulated across intermediate steps, and the extent to which these changes advance toward the final state, we show that these signals predict solution accuracy more reliably than both cross-layer metrics and output-based confidence measures. When used to guide answer selection across multiple sampled generations, Latent-Trajectory signals make test-time scaling more effective and efficient than majority voting, reducing token usage by up to 70% while preserving and even improving accuracy by 2.6% on average. Moreover, these predictive signals often emerge early in the reasoning trace, enabling early selection and allocation of compute to the most promising candidates. Our findings contribute not only practical strategies for inference-time efficiency, but also a deeper interpretability perspective on how reasoning processes are represented and differentiated in latent space.

[655] ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding

Xinbang Dai, Huikang Hu, Yongrui Chen, Jiaqi Li, Rihui Jin, Yuyang Zhang, Xiaoguang Li, Lifeng Shang, Guilin Qi

Main category: cs.AI

TL;DR: ELAIPBench is a benchmark for evaluating LLMs’ comprehension of AI research papers, showing current models perform poorly (39.95% accuracy) and struggle with deep reasoning despite advanced features.

Details

Motivation: To address the gap in evaluating LLMs' deep comprehension of full-length academic papers, as existing benchmarks have surface-level questions or unreliable metrics.

Method: Created ELAIPBench through incentive-driven adversarial annotation with 403 multiple-choice questions from 137 papers across three difficulty levels, focusing on non-trivial reasoning rather than shallow retrieval.

Result: Best-performing LLM achieved only 39.95% accuracy, far below human performance. Frontier LLMs with thinking modes or RAG systems failed to improve results and sometimes harmed accuracy due to overthinking or noisy retrieval.

Conclusion: There is a significant gap between current LLM capabilities and genuine comprehension of academic papers, highlighting limitations in deep reasoning abilities.

Abstract: While large language models (LLMs) excel at many domain-specific tasks, their ability to deeply comprehend and reason about full-length academic papers remains underexplored. Existing benchmarks often fall short of capturing such depth, either due to surface-level question design or unreliable evaluation metrics. To address this gap, we introduce ELAIPBench, a benchmark curated by domain experts to evaluate LLMs’ comprehension of artificial intelligence (AI) research papers. Developed through an incentive-driven, adversarial annotation process, ELAIPBench features 403 multiple-choice questions from 137 papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Our experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance. Moreover, we observe that frontier LLMs equipped with a thinking mode or a retrieval-augmented generation (RAG) system fail to improve final results-even harming accuracy due to overthinking or noisy retrieval. These findings underscore the significant gap between current LLM capabilities and genuine comprehension of academic papers.

[656] A Layered Intuition – Method Model with Scope Extension for LLM Reasoning

Hong Su

Main category: cs.AI

TL;DR: This paper proposes a unified Intuition-Method Layered Model with Scope Extension to systematically address unseen problems in LLMs, introducing temporal and spatial extensions and an entropy-based evaluation framework.

Details

Motivation: To enhance LLM performance beyond direct matrix mappings by systematically addressing indirect (unseen) issues through a unified framework that integrates and extends existing approaches.

Method: Developed an Intuition-Method Layered Model where intuition provides rapid first responses and method-based thinking creates transferable reasoning units. Applied scope extension including vertical (cause analysis), horizontal (parallel/generalized), temporal, and spatial dimensions, organized into knowledge trees forming a knowledge network.

Result: The framework enables more systematic handling of unseen questions through extended reasoning capabilities across multiple dimensions, with quantitative evaluation via method extension entropy measuring independence and diversity of extensions.

Conclusion: This work advances toward a more robust and extensible reasoning paradigm for LLMs in real-world problem-solving by logically connecting existing approaches with new extensions and introducing an entropy-based evaluation framework.

Abstract: Existing studies have introduced method-based reasoning and scope extension as approaches to enhance Large Language Model (LLM) performance beyond direct matrix mappings. Building on these foundations, this paper summarizes and integrates these ideas into a unified Intuition-Method Layered Model with Scope Extension, designed to address indirected (unseen) issues more systematically. In this framework, intuition-based thinking provides rapid first-reaction answers, while method-based thinking decouples questions and solutions into transferable reasoning units. Scope extension is then applied to broaden applicability, including vertical (cause analysis), horizontal (parallel and generalized issues), and for the first time, temporal and spatial extensions, which expand reasoning across time and contextual dimensions. These extensions are organized into systematic knowledge trees that interconnect into a knowledge network, thereby increasing adaptability. To quantitatively evaluate this process, we propose the entropy of method extension, which measures the independence and diversity of extensions as an indicator of the system’s capacity to solve unseen questions. By logically connecting existing approaches with new extensions and introducing an entropy-based evaluation framework, this work advances toward a more robust and extensible reasoning paradigm for LLMs in real-world problem-solving.

[657] Co-Alignment: Rethinking Alignment as Bidirectional Human-AI Cognitive Adaptation

Yubo Li, Weiyi Song

Main category: cs.AI

TL;DR: Bidirectional Cognitive Alignment (BiCA) enables mutual adaptation between humans and AI, achieving superior collaboration through learnable protocols and controlled co-evolution.

Details

Motivation: Current AI alignment through RLHF treats human cognition as fixed while AI conforms to human preferences. This single-directional paradigm limits optimal collaboration.

Method: BiCA uses learnable protocols, representation mapping, and KL-budget constraints for controlled co-evolution between humans and AI.

Result: In collaborative navigation, BiCA achieved 85.5% success (vs 70.3% baseline), 230% better mutual adaptation, 332% better protocol convergence, and 23% better safety with out-of-distribution robustness.

Conclusion: Optimal collaboration exists at the intersection of human and AI capabilities, validating the shift from single-directional to co-alignment paradigms with 46% synergy improvement.

Abstract: Current AI alignment through RLHF follows a single directional paradigm that AI conforms to human preferences while treating human cognition as fixed. We propose a shift to co-alignment through Bidirectional Cognitive Alignment (BiCA), where humans and AI mutually adapt. BiCA uses learnable protocols, representation mapping, and KL-budget constraints for controlled co-evolution. In collaborative navigation, BiCA achieved 85.5% success versus 70.3% baseline, with 230% better mutual adaptation and 332% better protocol convergence. Emergent protocols outperformed handcrafted ones by 84%, while bidirectional adaptation unexpectedly improved safety (+23% out-of-distribution robustness). The 46% synergy improvement demonstrates optimal collaboration exists at the intersection, not union, of human and AI capabilities, validating the shift from single-directional to co-alignment paradigms.

[658] A Distance Measure for Random Permutation Set: From the Layer-2 Belief Structure Perspective

Ruolan Cheng, Yong Deng, Serafín Moral, José Ramón Trillo

Main category: cs.AI

TL;DR: This paper proposes a new distance measure for random permutation sets based on cumulative Jaccard index, addressing limitations of existing methods with improved sensitivity and flexibility.

Details

Motivation: Measuring distance between permutation mass functions is crucial in random permutation set theory, and existing methods have limitations that need to be addressed.

Method: Introduces cumulative Jaccard index to quantify permutation similarity, develops distance measure based on cumulative Jaccard index matrix with correction scheme, and incorporates top-weightiness property.

Result: The proposed method overcomes shortcomings of existing approaches, is compatible with Jousselme distance, and demonstrates higher sensitivity and flexibility in numerical experiments.

Conclusion: The cumulative Jaccard index-based distance measure provides an effective solution for quantifying distances in random permutation sets with natural top-weightiness and adjustable parameters.

Abstract: Random permutation set (RPS) is a recently proposed framework designed to represent order-structured uncertain information. Measuring the distance between permutation mass functions is a key research topic in RPS theory (RPST). This paper conducts an in-depth analysis of distances between RPSs from two different perspectives: random finite set (RFS) and transferable belief model (TBM). Adopting the layer-2 belief structure interpretation of RPS, we regard RPST as a refinement of TBM, where the order in the ordered focus set represents qualitative propensity. Starting from the permutation, we introduce a new definition of the cumulative Jaccard index to quantify the similarity between two permutations and further propose a distance measure method for RPSs based on the cumulative Jaccard index matrix. The metric and structural properties of the proposed distance measure are investigated, including the positive definiteness analysis of the cumulative Jaccard index matrix, and a correction scheme is provided. The proposed method has a natural top-weightiness property: inconsistencies between higher-ranked elements tend to result in greater distance values. Two parameters are provided to the decision-maker to adjust the weight and truncation depth. Several numerical examples are used to compare the proposed method with the existing method. The experimental results show that the proposed method not only overcomes the shortcomings of the existing method and is compatible with the Jousselme distance, but also has higher sensitivity and flexibility.

[659] EA4LLM: A Gradient-Free Approach to Large Language Model Optimization via Evolutionary Algorithms

WenTao Liu, Siyu Song, Hao Hao, Aimin Zhou

Main category: cs.AI

TL;DR: EA4LLM uses evolutionary algorithms to train 1B-parameter LLMs, challenging gradient-based methods and enabling resource-efficient training.

Details

Motivation: Gradient-based optimizers like Adam require high-end GPUs and differentiable operations, excluding non-differentiable architectures and limiting accessibility.

Method: Proposed EA4LLM - evolutionary algorithm-based optimization for LLMs, successfully training a 1B-parameter model from pre-trained stage.

Result: Successfully demonstrated evolutionary algorithms can effectively optimize neural networks and train large language models.

Conclusion: Evolutionary algorithms are viable alternatives to gradient-based optimization, potentially reducing computational costs and enabling broader participation in deep learning research.

Abstract: In recent years, large language models (LLMs) have made remarkable progress, with model optimization primarily relying on gradient-based optimizers such as Adam. However, these gradient-based methods impose stringent hardware requirements, demanding high-concurrency, high-memory GPUs. Moreover, they require all neural network operations to be differentiable, thereby excluding many promising non-differentiable architectures from practical use. To address these limitations, we propose a method for optimizing LLMs using evolutionary algorithms (EA4LLM) and, for the first time, successfully demonstrate its capability to train a 1-billion-parameter LLM from the pre-trained stage. We conduct extensive experiments and provide key insights into how evolutionary algorithms can effectively optimize neural networks. Our work challenges the prevailing assumption that gradient-based optimization is the only viable approach for training neural networks. It also holds significant potential to reduce the computational cost of training large language models, thereby enabling groups with limited computational resources to participate in deep learning research.

[660] Coordination Requires Simplification: Thermodynamic Bounds on Multi-Objective Compromise in Natural and Artificial Intelligence

Atma Anand

Main category: cs.AI

TL;DR: Coordination across multiple agents has fundamental thermodynamic constraints where findability matters more than accuracy. Coordination protocols scale with agent count and complexity, forcing progressive simplification and creating metastable states.

Details

Motivation: To understand fundamental thermodynamic constraints in multi-agent coordination systems and explain phenomena like cycling in multi-objective optimization and alignment faking in LLMs.

Method: Developed Thermodynamic Coordination Theory (TCT) using information theory to derive minimum description length of coordination protocols and analyze coordination dynamics through concepts like coordination temperature and phase transitions.

Result: Found that coordination requires radical information loss, coordination protocols scale as L(P)≥NKlog₂K+N²d²log(1/ε), and systems exhibit persistent metastable states with hysteresis until environmental shifts trigger phase transitions.

Conclusion: Coordination fundamentally requires information loss and simplification, with systems exhibiting hierarchical optimization and phase transitions, explaining real-world coordination phenomena across various domains from neural networks to bureaucracies.

Abstract: Information-processing systems coordinating across multiple agents and objectives face fundamental thermodynamic constraints. We show that solutions with maximum utility to act as coordination focal points have much higher selection pressure for being findable across agents rather than accuracy. We derive that the information-theoretic minimum description length of coordination protocols to precision $\varepsilon$ scales as $L(P)\geq NK\log_2 K+N^2d^2\log (1/\varepsilon)$ for $N$ agents with $d$ potentially conflicting objectives and internal model complexity $K$. This scaling forces progressive simplification, with coordination dynamics changing the environment itself and shifting optimization across hierarchical levels. Moving from established focal points requires re-coordination, creating persistent metastable states and hysteresis until significant environmental shifts trigger phase transitions through spontaneous symmetry breaking. We operationally define coordination temperature to predict critical phenomena and estimate coordination work costs, identifying measurable signatures across systems from neural networks to restaurant bills to bureaucracies. Extending the topological version of Arrow’s theorem on the impossibility of consistent preference aggregation, we find it recursively binds whenever preferences are combined. This potentially explains the indefinite cycling in multi-objective gradient descent and alignment faking in Large Language Models trained with reinforcement learning with human feedback. We term this framework Thermodynamic Coordination Theory (TCT), which demonstrates that coordination requires radical information loss.

[661] Collaborative Text-to-Image Generation via Multi-Agent Reinforcement Learning and Semantic Fusion

Jiabao Shi, Minfeng Qi, Lefeng Zhang, Di Wang, Yingjie Zhao, Ziying Li, Yalong Xing, Ningran Li

Main category: cs.AI

TL;DR: A multi-agent reinforcement learning framework improves multimodal text-to-image generation by coordinating domain-specialized agents with enhanced text and image modules, achieving significant content enrichment despite cross-modal alignment challenges.

Details

Motivation: To address constraints in maintaining semantic alignment and professional-level detail across diverse visual domains in multimodal text-to-image generation.

Method: Multi-agent reinforcement learning framework with domain-specialized agents, using PPO training with composite reward function, contrastive learning, bidirectional attention, and iterative feedback between text and image modules.

Result: Significantly enriched generated content (1614% word count increase), reduced ROUGE-1 scores by 69.7%, Transformer-based fusion achieved highest composite score (0.521), multimodal ensembles showed moderate consistency (0.444-0.481).

Conclusion: Collaborative, specialization-driven architectures show promise for advancing reliable multimodal generative systems, though cross-modal semantic grounding remains challenging.

Abstract: Multimodal text-to-image generation remains constrained by the difficulty of maintaining semantic alignment and professional-level detail across diverse visual domains. We propose a multi-agent reinforcement learning framework that coordinates domain-specialized agents (e.g., focused on architecture, portraiture, and landscape imagery) within two coupled subsystems: a text enhancement module and an image generation module, each augmented with multimodal integration components. Agents are trained using Proximal Policy Optimization (PPO) under a composite reward function that balances semantic similarity, linguistic visual quality, and content diversity. Cross-modal alignment is enforced through contrastive learning, bidirectional attention, and iterative feedback between text and image. Across six experimental settings, our system significantly enriches generated content (word count increased by 1614%) while reducing ROUGE-1 scores by 69.7%. Among fusion methods, Transformer-based strategies achieve the highest composite score (0.521), despite occasional stability issues. Multimodal ensembles yield moderate consistency (ranging from 0.444 to 0.481), reflecting the persistent challenges of cross-modal semantic grounding. These findings underscore the promise of collaborative, specialization-driven architectures for advancing reliable multimodal generative systems.

[662] LegalWiz: A Multi-Agent Generation Framework for Contradiction Detection in Legal Documents

Ananya Mantravadi, Shivali Dalmia, Olga Pospelova, Abhishek Mukherji, Nand Dave, Anudha Mittal

Main category: cs.AI

TL;DR: A multi-agent framework generates synthetic legal documents with embedded contradictions to benchmark legal RAG systems, addressing current limitations in contradiction detection benchmarks.

Details

Motivation: Current benchmarks for contradiction detection lack domain realism, cover limited conflict types, and are unsuitable for legal applications where unresolved contradictions lead to hallucinations and legally unsound outputs.

Method: Multi-agent contradiction-aware framework that generates synthetic legal-style documents, injects six structured contradiction types, models self- and pairwise inconsistencies, and combines automated contradiction mining with human-in-the-loop validation.

Result: Created a structured benchmark resource for contradiction-aware evaluation in legal RAG pipelines, supporting more consistent, interpretable, and trustworthy systems.

Conclusion: The framework provides essential controlled generation of documents with embedded contradictions for systematic stress-testing of models and reliable evaluation of contradiction detection and resolution in legal applications.

Abstract: Retrieval-Augmented Generation (RAG) integrates large language models (LLMs) with external sources, but unresolved contradictions in retrieved evidence often lead to hallucinations and legally unsound outputs. Benchmarks currently used for contradiction detection lack domain realism, cover only limited conflict types, and rarely extend beyond single-sentence pairs, making them unsuitable for legal applications. Controlled generation of documents with embedded contradictions is therefore essential: it enables systematic stress-testing of models, ensures coverage of diverse conflict categories, and provides a reliable basis for evaluating contradiction detection and resolution. We present a multi-agent contradiction-aware benchmark framework for the legal domain that generates synthetic legal-style documents, injects six structured contradiction types, and models both self- and pairwise inconsistencies. Automated contradiction mining is combined with human-in-the-loop validation to guarantee plausibility and fidelity. This benchmark offers one of the first structured resources for contradiction-aware evaluation in legal RAG pipelines, supporting more consistent, interpretable, and trustworthy systems.

[663] Automatic Piecewise Linear Regression for Predicting Student Learning Satisfaction

Haemin Choi, Gayathri Nadarajan

Main category: cs.AI

TL;DR: APLR model (combining boosting with interpretability) provides the best fit for predicting student learning satisfaction, identifying key factors like time management, concentration, and peer helpfulness as most influential.

Details

Motivation: Modern techniques like interpretable machine learning and neural networks haven't been sufficiently explored for studying student learning satisfaction, despite its importance in education research.

Method: Used automatic piecewise linear regression (APLR) - a model combining boosting with interpretability - and compared it with several state-of-the-art approaches for predicting learning satisfaction.

Result: APLR offered the best fit; identified time management, concentration abilities, perceived helpfulness to classmates, and offline course participation as most significant positive factors; surprisingly, creative activities didn’t positively affect satisfaction.

Conclusion: APLR enables individual-level interpretation of contributing factors, allowing educators to customize instructions based on student profiles for improved learning satisfaction.

Abstract: Although student learning satisfaction has been widely studied, modern techniques such as interpretable machine learning and neural networks have not been sufficiently explored. This study demonstrates that a recent model that combines boosting with interpretability, automatic piecewise linear regression(APLR), offers the best fit for predicting learning satisfaction among several state-of-the-art approaches. Through the analysis of APLR’s numerical and visual interpretations, students’ time management and concentration abilities, perceived helpfulness to classmates, and participation in offline courses have the most significant positive impact on learning satisfaction. Surprisingly, involvement in creative activities did not positively affect learning satisfaction. Moreover, the contributing factors can be interpreted on an individual level, allowing educators to customize instructions according to student profiles.

[664] Equity-Aware Geospatial AI for Forecasting Demand-Driven Hospital Locations in Germany

Piyush Pant, Marcellius William Suntoro, Ayesha Siddiqua, Muhammad Shehryaar Sharif, Daniyal Ahmed

Main category: cs.AI

TL;DR: EA-GeoAI framework combines demographic forecasting and equity metrics to optimize hospital bed allocation and facility placement in Germany through 2030.

Details

Motivation: To address healthcare inequities and future demand by integrating demographic shifts, aging populations, and infrastructure gaps into equitable hospital planning.

Method: Developed an Equity Index combining demographic and infrastructure factors, then used an interpretable Agentic AI optimizer to allocate beds and identify new facility sites under budget and travel-time constraints.

Result: Created an integrated framework that bridges GeoAI, long-term forecasting, and equity measurement to provide actionable policy recommendations.

Conclusion: The EA-GeoAI framework successfully combines multiple domains to deliver equitable hospital planning solutions that can guide policymakers in addressing future healthcare needs.

Abstract: This paper presents EA-GeoAI, an integrated framework for demand forecasting and equitable hospital planning in Germany through 2030. We combine district-level demographic shifts, aging population density, and infrastructure balances into a unified Equity Index. An interpretable Agentic AI optimizer then allocates beds and identifies new facility sites to minimize unmet need under budget and travel-time constraints. This approach bridges GeoAI, long-term forecasting, and equity measurement to deliver actionable recommendations for policymakers.

[665] Hierarchical Optimization via LLM-Guided Objective Evolution for Mobility-on-Demand Systems

Yi Zhang, Yushen Long, Yun Ni, Liping Huang, Xiaohong Wang, Jun Liu

Main category: cs.AI

TL;DR: A hybrid framework combining LLM with mathematical optimization for ride-hailing platforms, achieving 16% improvement over baselines without training data requirements.

Details

Motivation: Existing approaches have limitations: RL methods are data-inefficient and oversimplify real-world dynamics, while decomposed optimization lacks awareness of low-level routing dynamics.

Method: Training-free hierarchical system where LLM serves as meta-optimizer generating semantic heuristics, guided by harmony search evolutionary process that refines prompts based on optimization feedback.

Result: 16% average improvement over state-of-the-art baselines in experiments using New York and Chicago taxi datasets.

Conclusion: The hybrid LLM-optimization framework effectively balances supply-demand in ride-hailing without training data, overcoming limitations of existing methods.

Abstract: Online ride-hailing platforms aim to deliver efficient mobility-on-demand services, often facing challenges in balancing dynamic and spatially heterogeneous supply and demand. Existing methods typically fall into two categories: reinforcement learning (RL) approaches, which suffer from data inefficiency, oversimplified modeling of real-world dynamics, and difficulty enforcing operational constraints; or decomposed online optimization methods, which rely on manually designed high-level objectives that lack awareness of low-level routing dynamics. To address this issue, we propose a novel hybrid framework that integrates large language model (LLM) with mathematical optimization in a dynamic hierarchical system: (1) it is training-free, removing the need for large-scale interaction data as in RL, and (2) it leverages LLM to bridge cognitive limitations caused by problem decomposition by adaptively generating high-level objectives. Within this framework, LLM serves as a meta-optimizer, producing semantic heuristics that guide a low-level optimizer responsible for constraint enforcement and real-time decision execution. These heuristics are refined through a closed-loop evolutionary process, driven by harmony search, which iteratively adapts the LLM prompts based on feasibility and performance feedback from the optimization layer. Extensive experiments based on scenarios derived from both the New York and Chicago taxi datasets demonstrate the effectiveness of our approach, achieving an average improvement of 16% compared to state-of-the-art baselines.

[666] Multi-Objective Multi-Agent Path Finding with Lexicographic Cost Preferences

Pulkit Rustagi, Kyle Hollins Wray, Sandhya Saisubramanian

Main category: cs.AI

TL;DR: Proposes Lexicographic Conflict-Based Search (LCBS) for multi-objective multi-agent path finding that directly computes solutions aligned with user preferences, avoiding Pareto frontier construction and scaling to 10 objectives.

Details

Motivation: Current MO-MAPF algorithms don't optimize for user preferences even when available, scale poorly with objectives, and require computing Pareto frontiers.

Method: LCBS integrates priority-aware low-level A* search with conflict-based search, using lexicographic preferences over objectives to guide planning without Pareto frontier construction.

Result: LCBS computes optimal solutions and scales to instances with up to 10 objectives, achieving higher success rates than state-of-the-art methods, especially with more objectives.

Conclusion: The lexicographic framework and LCBS algorithm enable efficient preference-aware multi-objective planning that significantly outperforms existing methods in scalability and success rates.

Abstract: Many real-world scenarios require multiple agents to coordinate in shared environments, while balancing trade-offs between multiple, potentially competing objectives. Current multi-objective multi-agent path finding (MO-MAPF) algorithms typically produce conflict-free plans by computing Pareto frontiers. They do not explicitly optimize for user-defined preferences, even when the preferences are available, and scale poorly with the number of objectives. We propose a lexicographic framework for modeling MO-MAPF, along with an algorithm \textit{Lexicographic Conflict-Based Search} (LCBS) that directly computes a single solution aligned with a lexicographic preference over objectives. LCBS integrates a priority-aware low-level $A^*$ search with conflict-based search, avoiding Pareto frontier construction and enabling efficient planning guided by preference over objectives. We provide insights into optimality and scalability, and empirically demonstrate that LCBS computes optimal solutions while scaling to instances with up to ten objectives – far beyond the limits of existing MO-MAPF methods. Evaluations on standard and randomized MAPF benchmarks show consistently higher success rates against state-of-the-art baselines, especially with increasing number of objectives.

[667] Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

Can Xie, Ruotong Pan, Xiangyu Wu, Yunfei Zhang, Jiayi Fu, Tingting Gao, Guorui Zhou

Main category: cs.AI

TL;DR: UCAS introduces uncertainty-aware advantage shaping to improve RLVR by using model uncertainty signals for better credit assignment, addressing entropy collapse and enhancing exploration in reasoning tasks.

Details

Motivation: Current RLVR methods like GRPO use uniform advantage signals across all tokens, which fails to account for the importance of uncertain, high-stakes decisions during reasoning, leading to inefficient exploration and entropy collapse.

Method: UCAS is a model-free method that refines credit assignment in two stages: modulating response-level advantage using model self-confidence, and applying token-level penalty based on raw logit certainty to encourage exploration of high-uncertainty paths.

Result: Extensive experiments on five mathematical reasoning benchmarks show UCAS significantly outperforms strong RLVR baselines across multiple model scales (1.5B and 7B), achieving higher rewards, greater reasoning diversity, and successfully mitigating entropy collapse.

Conclusion: UCAS effectively balances the exploration-exploitation trade-off by leveraging model uncertainty, making it a promising approach for enhancing reasoning capabilities in LLMs through verifiable reinforcement learning.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has shown significant promise for enhancing the reasoning capabilities of large language models (LLMs). However, prevailing algorithms like GRPO broadcast a uniform advantage signal across all tokens in a sequence. This coarse-grained approach overlooks the pivotal role of uncertain, high-stakes decisions during reasoning, leading to inefficient exploration and the well-documented problem of entropy collapse. To address this, we introduce UnCertainty-aware Advantage Shaping (UCAS), a model-free method that refines credit assignment by leveraging the model’s internal uncertainty signals. UCAS operates in two stages: it first modulates the response-level advantage using the model’s overall self-confidence, and then applies a token-level penalty based on raw logit certainty. This dual mechanism encourages exploration of high-uncertainty paths that yield correct answers while penalizing overconfident yet erroneous reasoning, effectively balancing the exploration-exploitation trade-off. Extensive experiments on five mathematical reasoning benchmarks show that UCAS significantly outperforms strong RLVR baselines across multiple model scales, including 1.5B and 7B. Our analysis confirms that UCAS not only achieves higher rewards but also promotes greater reasoning diversity and successfully mitigates entropy collapse.

[668] Simpliflow: A Lightweight Open-Source Framework for Rapid Creation and Deployment of Generative Agentic AI Workflows

Deven Panchal

Main category: cs.AI

TL;DR: simpliflow is a lightweight Python framework for building deterministic agentic AI workflows using declarative JSON configuration, designed to reduce complexity and boilerplate code.

Details

Motivation: Existing frameworks for generative agentic AI systems introduce significant complexity, steep learning curves, and substantial boilerplate code, hindering rapid prototyping and deployment.

Method: Uses a modular architecture that decouples agent management, workflow execution, and post-processing, with declarative JSON-based configuration and integration with LiteLLM for support of over 100 LLMs.

Result: Enables rapid development and orchestration of linear, deterministic agentic workflows, demonstrated through diverse use cases from software development simulation to real-time system interaction.

Conclusion: simpliflow occupies a unique position as a tool optimized for simplicity, control, and speed in deterministic workflow environments compared to frameworks like LangChain and AutoGen.

Abstract: Generative Agentic AI systems are emerging as a powerful paradigm for automating complex, multi-step tasks. However, many existing frameworks for building these systems introduce significant complexity, a steep learning curve, and substantial boilerplate code, hindering rapid prototyping and deployment. This paper introduces simpliflow, a lightweight, open-source Python framework designed to address these challenges. simpliflow enables the rapid development and orchestration of linear, deterministic agentic workflows through a declarative, JSON-based configuration. Its modular architecture decouples agent management, workflow execution, and post-processing, promoting ease of use and extensibility. By integrating with LiteLLM, it supports over 100 Large Language Models (LLMs) out-of-the-box. We present the architecture, operational flow, and core features of simpliflow, demonstrating its utility through diverse use cases ranging from software development simulation to real-time system interaction. A comparative analysis with prominent frameworks like LangChain and AutoGen highlights simpliflow’s unique position as a tool optimized for simplicity, control, and speed in deterministic workflow environments.

[669] OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Jiafu Tang, Zhenghao Song, Dingling Zhang, Ying He, Haoxiang Liu, Yuxuan Wang, Qiufeng Wang, Zhenhe Wu, Jiehui Luo, Zhiyu Pan, Weihao Xie, Chenchen Zhang, Zhaohui Wang, Jiayi Tian, Yanghai Wang, Zhe Cao, Minxin Dai, Ke Wang, Runzhe Wen, Yinghao Ma, Yaning Pan, Sungkyun Chang, Termeh Taheri, Haiwen Xia, Christos Plachouras, Emmanouil Benetos, Yizhi Li, Ge Zhang, Jian Yang, Tianhao Peng, Zili Wang, Minghao Liu, Junran Peng, Zhaoxiang Zhang, Jiaheng Liu

Main category: cs.AI

TL;DR: OmniVideoBench is a new benchmark for evaluating synergistic audio-visual understanding in multimodal large language models, featuring 1000 QA pairs with reasoning traces from 628 diverse videos.

Details

Motivation: Existing benchmarks fail to comprehensively evaluate synergistic reasoning across audio and visual modalities, often neglecting one modality or integrating them in logically inconsistent ways.

Method: Created a large-scale benchmark with 1000 high-quality QA pairs from 628 diverse videos (seconds to 30 minutes), manually verified for correctness and uniqueness, covering 13 question types including temporal reasoning, spatial localization, counting, and causal inference.

Result: Evaluation revealed a significant gap between model performance and human reasoning, with open-source models lagging behind closed-source counterparts, highlighting the difficulty of genuine audio-visual reasoning.

Conclusion: The benchmark will be released to foster development of MLLMs with stronger and more generalizable reasoning capabilities for video understanding.

Abstract: Recent advances in multimodal large language models (MLLMs) have demonstrated substantial potential in video understanding. However, existing benchmarks fail to comprehensively evaluate synergistic reasoning capabilities across audio and visual modalities, often neglecting either one of the modalities or integrating them in a logically inconsistent manner. To bridge this gap, we introduce OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to assessing synergistic audio-visual understanding, with a strong emphasis on modality complementarity and logical consistency. Specifically, OmniVideoBench comprises 1000 high-quality question-answer(QA) pairs, each annotated with step-by-step reasoning traces, derived from 628 diverse videos ranging from several seconds to 30 minutes, and manually verified to guarantee complete correctness and uniqueness. Moreover, OmniVideoBench encompasses 13 carefully designed question types, covering temporal reasoning, spatial localization, counting, causal inference, summarization, and beyond, thereby capturing the essential challenges of video understanding. Evaluation of multiple MLLMs on OmniVideoBench reveals a pronounced gap between model performance and human reasoning, with open-source models lagging significantly behind their closed-source counterparts, underscoring the inherent difficulty of genuine audio-visual reasoning. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.

[670] Extended Triangular Method: A Generalized Algorithm for Contradiction Separation Based Automated Deduction

Yang Xu, Shuwei Chen, Jun Liu, Feng Cao, Xingxing He

Main category: cs.AI

TL;DR: The paper presents the Extended Triangular Method (ETM), a contradiction-construction algorithm that formalizes the Contradiction Separation Extension (CSE) framework for automated deduction, enabling multi-clause reasoning through a triangular geometric framework.

Details

Motivation: Traditional reasoning calculi based on binary resolution limit deductive synergy among multiple clauses. The CSE framework introduced multi-clause reasoning but lacked algorithmic formalization, creating a gap between theory and implementation.

Method: The Extended Triangular Method (ETM) is a generalized contradiction-construction algorithm that unifies multiple contradiction-building strategies within a triangular geometric framework, supporting flexible clause interaction and dynamic synergy.

Result: ETM serves as the core of several high-performance theorem provers (CSE, CSE-E, CSI-E, CSI-Enig) that achieved competitive results in standard first-order benchmarks (TPTP problem sets and CASC 2018-2015), validating the approach’s effectiveness.

Conclusion: ETM bridges theoretical abstraction and operational implementation, advancing the contradiction separation paradigm into a generalized, scalable, and practically competitive model for automated reasoning, offering new research directions.

Abstract: Automated deduction lies at the core of Artificial Intelligence (AI), underpinning theorem proving, formal verification, and logical reasoning. Despite decades of progress, reconciling deductive completeness with computational efficiency remains an enduring challenge. Traditional reasoning calculi, grounded in binary resolution, restrict inference to pairwise clause interactions and thereby limit deductive synergy among multiple clauses. The Contradiction Separation Extension (CSE) framework, introduced in 2018, proposed a dynamic multi-clause reasoning theory that redefined logical inference as a process of contradiction separation rather than sequential resolution. While that work established the theoretical foundation, its algorithmic realization remained unformalized and unpublished. This work presents the Extended Triangular Method (ETM), a generalized contradiction-construction algorithm that formalizes and extends the internal mechanisms of contradiction separation. The ETM unifies multiple contradiction-building strategies, including the earlier Standard Extension method, within a triangular geometric framework that supports flexible clause interaction and dynamic synergy. ETM serves as the algorithmic core of several high-performance theorem provers, CSE, CSE-E, CSI-E, and CSI-Enig, whose competitive results in standard first-order benchmarks (TPTP problem sets and CASC 2018-2015) empirically validate the effectiveness and generality of the proposed approach. By bridging theoretical abstraction and operational implementation, ETM advances the contradiction separation paradigm into a generalized, scalable, and practically competitive model for automated reasoning, offering new directions for future research in logical inference and theorem proving.

[671] Adaptive Selection of Symbolic Languages for Improving LLM Logical Reasoning

Xiangyu Wang, Haocheng Yang, Fengxiang Cheng, Fenrong Liu

Main category: cs.AI

TL;DR: This paper proposes an adaptive SL selection method for LLM logical reasoning, showing that different natural language problems require different symbolic language formalizations for optimal translation and reasoning performance.

Details

Motivation: Current LLM logical reasoning approaches depend heavily on correct NL-to-SL translation, but existing methods only focus on translation accuracy while ignoring the crucial factor of selecting the appropriate symbolic language type for each specific problem type.

Method: The method uses LLMs to adaptively select the most suitable symbolic language (first-order logic, logic programming, or Boolean satisfiability) for each problem, then translates the natural language problem to the selected SL and uses corresponding logical solvers to derive answers.

Result: Experimental results show the adaptive selection method significantly outperforms using a single SL for all problems and random SL selection. On mixed benchmarks, it achieves 96% accuracy, improving performance by 25% compared to the second-best approach (first-order logic translation).

Conclusion: Different natural language logical reasoning problems correspond to different optimal symbolic language formalizations, and adaptive selection of the target SL prior to translation significantly improves LLM logical reasoning performance.

Abstract: Large Language Models (LLMs) still struggle with complex logical reasoning. While previous works achieve remarkable improvements, their performance is highly dependent on the correctness of translating natural language (NL) problems into a symbolic language (SL). Though numerous works focusing on improving this translation accuracy, they only consider the similarity between the meaning of SL and NL, overlooking another crucial influencing factor, the selection of the target SL type itself. For example, first-order logic language specializes in logical reasoning with categorical syllogisms and complex quantifiers, while Boolean satisfiability formalism excels at representing constraint satisfaction like partial problems. To our knowledge, this is the first paper to claim and verify that different NL logical reasoning problem corresponds to different optimal SL formalization for translation. Based on this, we propose a methods to improve the logical reasoning performance of LLMs by adaptively selecting the most suitable SL for each problem prior to translation. Specifically, we leverage LLMs to select the target SL among first-order logic, logic programming and Boolean satisfiability and then translate the problem in NL to target SL expressions as well as employ the corresponding logical solver to derive the final answer. Experimental results on benchmarks show that our adaptive selection method significantly outperforms translating all into single SL and randomly selecting the SL. On a mixed dataset of these benchmarks, our approach achieves 96% accuracy, which improving performance by 25% compared to the second highest accuracy from the first-order logic translation.

[672] LLMs as Strategic Agents: Beliefs, Best Response Behavior, and Emergent Heuristics

Enric Junque de Fortuny, Veronica Roberta Cappelli

Main category: cs.AI

TL;DR: LLMs demonstrate genuine strategic thinking with belief-coherent best-response behavior, emergent meta-reasoning, and novel heuristic formation in game environments.

Details

Motivation: To determine if LLMs exhibit genuine strategic thinking beyond equilibrium play, by examining their ability to form beliefs about other agents and make coherent choices based on those beliefs.

Method: Developed a framework to disentangle beliefs, evaluation, and choice in static complete-information games, analyzing models’ choices and reasoning traces with context-free games to prevent imitation from memorization.

Result: Frontier models show belief-coherent best-response behavior at targeted reasoning depths, self-limit reasoning depth, form differentiated conjectures about opponents, and develop stable model-specific heuristic rules under complexity.

Conclusion: Belief coherence, meta-reasoning, and novel heuristic formation emerge jointly from language modeling objectives, providing a structured basis for studying strategic cognition in AI agents.

Abstract: Large Language Models (LLMs) are increasingly applied to domains that require reasoning about other agents’ behavior, such as negotiation, policy design, and market simulation, yet existing research has mostly evaluated their adherence to equilibrium play or their exhibited depth of reasoning. Whether they display genuine strategic thinking, understood as the coherent formation of beliefs about other agents, evaluation of possible actions, and choice based on those beliefs, remains unexplored. We develop a framework to identify this ability by disentangling beliefs, evaluation, and choice in static, complete-information games, and apply it across a series of non-cooperative environments. By jointly analyzing models’ revealed choices and reasoning traces, and introducing a new context-free game to rule out imitation from memorization, we show that current frontier models exhibit belief-coherent best-response behavior at targeted reasoning depths. When unconstrained, they self-limit their depth of reasoning and form differentiated conjectures about human and synthetic opponents, revealing an emergent form of meta-reasoning. Under increasing complexity, explicit recursion gives way to internally generated heuristic rules of choice that are stable, model-specific, and distinct from known human biases. These findings indicate that belief coherence, meta-reasoning, and novel heuristic formation can emerge jointly from language modeling objectives, providing a structured basis for the study of strategic cognition in artificial agents.

[673] DRIFT: Decompose, Retrieve, Illustrate, then Formalize Theorems

Meiru Zhang, Philipp Borchert, Milan Gritta, Gerasimos Lampouras

Main category: cs.AI

TL;DR: DRIFT is a framework that improves mathematical autoformalization by decomposing informal statements into sub-components for better premise retrieval from math libraries.

Details

Motivation: LLMs struggle with formalizing mathematical statements due to difficulty identifying prerequisite knowledge and formal representations. Current methods overlook that informal statements are complex with limited context about underlying concepts.

Method: DRIFT decomposes informal mathematical statements into smaller sub-components to enable targeted retrieval of premises from libraries like Mathlib, and retrieves illustrative theorems to help models use premises effectively.

Result: DRIFT consistently improves premise retrieval across benchmarks, nearly doubling F1 score compared to DPR baseline on ProofNet. Shows strong out-of-distribution performance on ConNF with 37.14-42.25% improvements in BEq+@10 using different LLMs.

Conclusion: Retrieval effectiveness in mathematical autoformalization depends heavily on model-specific knowledge boundaries, highlighting the need for adaptive retrieval strategies aligned with each model’s capabilities.

Abstract: Automating the formalization of mathematical statements for theorem proving remains a major challenge for Large Language Models (LLMs). LLMs struggle to identify and utilize the prerequisite mathematical knowledge and its corresponding formal representation in languages like Lean. Current retrieval-augmented autoformalization methods query external libraries using the informal statement directly, but overlook a fundamental limitation: informal mathematical statements are often complex and offer limited context on the underlying math concepts. To address this, we introduce DRIFT, a novel framework that enables LLMs to decompose informal mathematical statements into smaller, more tractable ‘‘sub-components’’. This facilitates targeted retrieval of premises from mathematical libraries such as Mathlib. Additionally, DRIFT retrieves illustrative theorems to help models use premises more effectively in formalization tasks. We evaluate DRIFT across diverse benchmarks (ProofNet, ConNF, and MiniF2F-test) and find that it consistently improves premise retrieval, nearly doubling the F1 score compared to the DPR baseline on ProofNet. Notably, DRIFT demonstrates strong performance on the out-of-distribution ConNF benchmark, with BEq+@10 improvements of 37.14% and 42.25% using GPT-4.1 and DeepSeek-V3.1, respectively. Our analysis shows that retrieval effectiveness in mathematical autoformalization depends heavily on model-specific knowledge boundaries, highlighting the need for adaptive retrieval strategies aligned with each model’s capabilities.

[674] The Irrational Machine: Neurosis and the Limits of Algorithmic Safety

Daniel Howard

Main category: cs.AI

TL;DR: A framework for identifying neurotic behaviors in embodied AI systems, where agents exhibit internally coherent but reality-misaligned behaviors due to interactions between planning, uncertainty, and aversive memory.

Details

Motivation: To systematically characterize and detect neurotic behaviors in AI systems that appear rational internally but are misaligned with reality, particularly in safety-critical applications where such behaviors could lead to failures.

Method: Developed a grid navigation testbed to catalog neurotic behavior patterns, created lightweight online detectors and escape policies, and proposed genetic-programming based destructive testing to evolve adversarial scenarios that maximize neurosis.

Result: Identified 12 recurrent neurotic behavior modalities including flip-flop, plan churn, paralysis, hypervigilance, and phobic avoidance. Showed that learned aversive costs can cause persistent avoidance behaviors even with full visibility.

Conclusion: Local fixes are insufficient for neurotic behaviors; global architectural revisions are needed. Proposed destructive testing with genetic programming to expose where fundamental architectural changes, not just symptom-level patches, are required.

Abstract: We present a framework for characterizing neurosis in embodied AI: behaviors that are internally coherent yet misaligned with reality, arising from interactions among planning, uncertainty handling, and aversive memory. In a grid navigation stack we catalogue recurrent modalities including flip-flop, plan churn, perseveration loops, paralysis and hypervigilance, futile search, belief incoherence, tie break thrashing, corridor thrashing, optimality compulsion, metric mismatch, policy oscillation, and limited-visibility variants. For each we give lightweight online detectors and reusable escape policies (short commitments, a margin to switch, smoothing, principled arbitration). We then show that durable phobic avoidance can persist even under full visibility when learned aversive costs dominate local choice, producing long detours despite globally safe routes. Using First/Second/Third Law as engineering shorthand for safety latency, command compliance, and resource efficiency, we argue that local fixes are insufficient; global failures can remain. To surface them, we propose genetic-programming based destructive testing that evolves worlds and perturbations to maximize law pressure and neurosis scores, yielding adversarial curricula and counterfactual traces that expose where architectural revision, not merely symptom-level patches, is required.

[675] LLM-Empowered Agentic MAC Protocols: A Dynamic Stackelberg Game Approach

Renxuan Tan, Rongpeng Li, Fei Wang, Chenghui Peng, Shaoyun Wu, Zhifeng Zhao, Honggang Zhang

Main category: cs.AI

TL;DR: A game-theoretic LLM-empowered multi-agent DRL framework that automatically synthesizes adaptive MAC protocols for wireless networks, achieving superior performance and generalizability without retraining.

Details

Motivation: Traditional MAC protocols are manually configured, and existing DRL-based protocols suffer from poor generalizability and resilience, requiring costly retraining for dynamic environments.

Method: Models uplink transmission as a dynamic multi-follower Stackelberg game, uses LLM-driven agents coordinated through PPO to synthesize semantic MAC protocols, and employs protocol action grammar for reliability.

Result: Achieves 77.6% greater throughput and 65.2% fairness improvement over conventional baselines, and generalizes excellently to fluctuating numbers of users without retraining.

Conclusion: The framework successfully overcomes limitations of traditional DRL-based MAC protocols by combining game theory with LLM-empowered multi-agent learning, enabling adaptive and resilient protocol synthesis.

Abstract: Medium Access Control (MAC) protocols, essential for wireless networks, are typically manually configured. While deep reinforcement learning (DRL)-based protocols enhance task-specified network performance, they suffer from poor generalizability and resilience, demanding costly retraining to adapt to dynamic environments. To overcome this limitation, we introduce a game-theoretic LLM-empowered multi-agent DRL (MARL) framework, in which the uplink transmission between a base station and a varying number of user equipments is modeled as a dynamic multi-follower Stackelberg game (MFSG), capturing the network’s natural hierarchical structure. Within this game, LLM-driven agents, coordinated through proximal policy optimization (PPO), synthesize adaptive, semantic MAC protocols in response to network dynamics. Protocol action grammar (PAG) is employed to ensure the reliability and efficiency of this process. Under this system, we further analyze the existence and convergence behavior in terms of a Stackelberg equilibrium by studying the learning dynamics of LLM-empowered unified policies in response to changing followers. Simulations corroborate that our framework achieves a 77.6% greater throughput and a 65.2% fairness improvement over conventional baselines. Besides, our framework generalizes excellently to a fluctuating number of users without requiring retraining or architectural changes.

[676] PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature

Daoyu Wang, Mingyue Cheng, Qi Liu, Shuo Yu, Zirui Liu, Ze Guo

Main category: cs.AI

TL;DR: PaperArena is a benchmark for evaluating LLM agents on cross-paper scientific reasoning tasks requiring multi-tool orchestration, revealing current agents achieve only 38.78% accuracy and struggle with tool efficiency.

Details

Motivation: Existing works are limited to tool-free tasks within single papers, lacking benchmarks for real-world research scenarios that require cross-paper reasoning and multi-tool integration.

Method: Proposes PaperArena benchmark with modular platform offering tools like multimodal parsing, context retrieval, and programmatic computation to evaluate agents on research questions requiring information integration across multiple papers.

Result: Even the most advanced LLM-powered agent system achieves only 38.78% average accuracy, dropping to 18.47% on hard tasks. All tested agents show inefficient tool usage, often invoking unnecessary tools.

Conclusion: PaperArena highlights significant room for improvement in scientific reasoning agents and invites the community to develop more capable agents for scientific discovery.

Abstract: Understanding and reasoning on the web-scale scientific literature is a crucial touchstone for large language model (LLM) based agents designed to support complex knowledge-intensive tasks. However, existing works are mainly restricted to tool-free tasks within isolated papers, largely due to the lack of a benchmark for cross-paper reasoning and multi-tool orchestration in real research scenarios. In this work, we propose PaperArena, an evaluation benchmark for agents to address real-world research questions that typically require integrating information across multiple papers with the assistance of external tools. Given a research question, agents should integrate diverse formats across multiple papers through reasoning and interacting with appropriate tools, thereby producing a well-grounded answer. To support standardized evaluation, we provide a modular and extensible platform for agent execution, offering tools such as multimodal parsing, context retrieval, and programmatic computation. Experimental results reveal that even the most advanced LLM powering a well-established agent system achieves merely 38.78% average accuracy. On the hard subset, accuracy drops to only 18.47%, highlighting great potential for improvement. We also present several empirical findings, including that all agents tested exhibit inefficient tool usage, often invoking more tools than necessary to solve a task. We invite the community to adopt PaperArena to develop and evaluate more capable agents for scientific discovery. Our code and data are available https://github.com/Melmaphother/PaperArena.

[677] PoU: Proof-of-Use to Counter Tool-Call Hacking in DeepResearch Agents

SHengjie Ma, Chenlong Deng, Jiaxin Mao, Jiadeng Huang, Teng Wang, Junjie Wu, Changwang Zhang, Jun wang

Main category: cs.AI

TL;DR: The paper introduces Proof-of-Use (PoU), an evidence-grounded RL framework that prevents Tool-Call Hacking in RAG agents by enforcing verifiable causal links between retrieved evidence, reasoning, and answers.

Details

Motivation: To address Tool-Call Hacking in RAG agents, where agents inflate reward signals by making superficially correct tool calls without genuinely using retrieved evidence, leading to mode collapse and spurious grounding.

Method: Proposes Proof-of-Use (PoU) framework with unified step-wise contract combining syntactic citation validation, perturbation-based sensitivity rewards, and answer-evidence alignment objectives.

Result: PoU consistently outperforms DeepResearch baselines across seven QA benchmarks in factual accuracy, evidence faithfulness, and tool-routing balance across various settings.

Conclusion: Grounding RL-trained agents in the causal use of retrieved information is necessary for trustworthy retrieval-augmented reasoning, offering a principled path forward.

Abstract: Retrieval-augmented generation (RAG) agents, such as recent DeepResearch-style systems, extend large language models (LLMs) with autonomous information-seeking capabilities through external tools. While reinforcement learning (RL) has enabled impressive multi-step reasoning, we identify a previously overlooked failure mode, Tool-Call Hacking, where agents inflate reward signals by issuing superficially correct tool calls without genuinely leveraging the retrieved evidence. This results in (i) mode collapse into repetitive reliance on a single source and (ii) spurious grounding, where answers are only weakly supported by cited content. To address this, we propose Proof-of-Use (PoU), an evidence-grounded RL framework that enforces verifiable causal links between retrieved evidence, reasoning traces, and final answers. PoU operationalizes this through a unified step-wise contract combining syntactic citation validation, perturbation-based sensitivity rewards, and answer-evidence alignment objectives, ensuring that tool usage remains both interpretable and functionally grounded. Across seven QA benchmarks spanning in-domain, out-of-domain, and out-of-tool-distribution settings, PoU consistently outperforms strong DeepResearch baselines in factual accuracy, evidence faithfulness, and tool-routing balance. These findings highlight the necessity of grounding RL-trained agents not merely in task outcomes but in the causal use of retrieved information, offering a principled path toward trustworthy retrieval-augmented reasoning.

[678] Scalable and Explainable Enterprise Knowledge Discovery Using Graph-Centric Hybrid Retrieval

Nilima Rao, Jagriti Srivastava, Pradeep Kumar Sharma, Hritvik Shrivastava

Main category: cs.AI

TL;DR: A hybrid retrieval framework that combines knowledge graphs, semantic search, and multi-hop reasoning to improve enterprise knowledge access across heterogeneous systems like Git, Jira, and Confluence.

Details

Motivation: Conventional retrieval methods fail to handle complex queries requiring contextual reasoning and multi-hop inference across distributed enterprise knowledge sources.

Method: Modular framework integrating KBLam, DeepGraph representations, and embedding-driven semantic search. Builds unified knowledge graphs from repositories, supports dynamic query analysis for optimal retrieval strategy, and provides interactive visualization.

Result: Experiments show up to 80% improvement in answer relevance compared to standalone GPT-based retrieval pipelines on large-scale Git repositories.

Conclusion: The framework provides a scalable, explainable, and user-centric foundation for intelligent knowledge assistants in enterprise environments through graph construction, hybrid reasoning, and interactive visualization.

Abstract: Modern enterprises manage vast knowledge distributed across heterogeneous systems such as Jira, Git repositories, Confluence, and wikis. Conventional retrieval methods based on keyword search or static embeddings often fail to answer complex queries that require contextual reasoning and multi-hop inference across artifacts. We present a modular hybrid retrieval framework for adaptive enterprise information access that integrates Knowledge Base Language-Augmented Models (KBLam), DeepGraph representations, and embedding-driven semantic search. The framework builds a unified knowledge graph from parsed repositories including code, pull requests, and commit histories, enabling semantic similarity search, structural inference, and multi-hop reasoning. Query analysis dynamically determines the optimal retrieval strategy, supporting both structured and unstructured data sources through independent or fused processing. An interactive interface provides graph visualizations, subgraph exploration, and context-aware query routing to generate concise and explainable answers. Experiments on large-scale Git repositories show that the unified reasoning layer improves answer relevance by up to 80 percent compared with standalone GPT-based retrieval pipelines. By combining graph construction, hybrid reasoning, and interactive visualization, the proposed framework offers a scalable, explainable, and user-centric foundation for intelligent knowledge assistants in enterprise environments.

[679] Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph

Wentao Wang, Heqing Zou, Tianze Luo, Rui Huang, Yutian Zhao, Zhuochen Wang, Hansheng Zhang, Chengwei Qin, Yan Wang, Lin Zhao, Huaijian Zhang

Main category: cs.AI

TL;DR: Video-STR is a graph-based reinforcement learning method for precise video spatio-temporal reasoning that addresses limitations in current MLLMs by incorporating physical information and using a novel Group Relative Policy Optimization approach.

Details

Motivation: Current Multimodal Large Language Models (MLLMs) have strong semantic understanding but struggle with precise spatio-temporal understanding, limiting their use in applications requiring high precision like embodied intelligence and VR.

Method: Proposes Video-STR using graph-based reinforcement learning with Group Relative Policy Optimization (GRPO) to infer spatio-temporal topology, and constructs STV-205k dataset with 205k QA pairs covering dynamic multi-object scenes.

Result: Achieves state-of-the-art results on various benchmarks, outperforming base model by 13% on STI-Bench, demonstrating effectiveness of the approach and dataset.

Conclusion: Video-STR successfully addresses spatio-temporal reasoning limitations in MLLMs through graph-based reinforcement learning and comprehensive dataset, enabling better performance in precision-demanding applications.

Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has demonstrated strong semantic understanding capabilities, but struggles to perform precise spatio-temporal understanding. Existing spatio-temporal methods primarily focus on the video itself, while overlooking the physical information within the video, such as multi-object layouts and motion. Such limitations restrict the use of MLLMs in downstream applications that demand high precision, including embodied intelligence and VR. To address this issue, we present Video-STR, a novel graph-based reinforcement method for precise Video Spatio-Temporal Reasoning. Building upon the capacity of Reinforcement Learning with Verifiable Reward (RLVR) to improve model abilities, we introduce a reasoning mechanism using graph-based Group Relative Policy Optimization (GRPO) method to guide the model in inferring the underlying spatio-temporal topology of scenarios during the thinking process. To resolve the lack of spatio-temporal training data, we construct the STV-205k dataset with 205k question-answering pairs, covering dynamic multi-object scenes in both indoor and outdoor environments, to support the model training. Experiments show that Video-STR achieves state-of-the-art results on various benchmarks, outperforming the base model by 13% on STI-Bench, and demonstrating the effectiveness of our approach and dataset. Code, model, and data will be released.

[680] Revisiting Model Interpolation for Efficient Reasoning

Taiqiang Wu, Runming Yang, Tao Liu, Jiahao Wang, Ngai Wong

Main category: cs.AI

TL;DR: Model interpolation (simple weight averaging) surprisingly outperforms complex model merging methods for reasoning tasks, following a three-stage evolutionary pattern that guides performance-cost trade-offs.

Details

Motivation: To systematically study the simplest model merging method (weight interpolation) and understand its dynamics for efficient reasoning, challenging the assumption that complex merging methods are always better.

Method: Systematic analysis of direct weight interpolation between models, examining three-stage evolutionary patterns, layer/module effects, and decoding strategies through extensive ablation studies.

Result: Strategically interpolated models surpass sophisticated model merging baselines in both efficiency and effectiveness, with the three-stage dynamics providing principled guidance for performance-cost optimization.

Conclusion: Model interpolation is an effective and practical framework for crafting models with targeted reasoning capabilities, offering better performance than complex merging methods when strategically applied.

Abstract: Model merging, typically on Instruct and Thinking models, has shown remarkable performance for efficient reasoning. In this paper, we systematically revisit the simplest merging method that interpolates two weights directly. Particularly, we observe that model interpolation follows a three-stage evolutionary paradigm with distinct behaviors on the reasoning trajectory. These dynamics provide a principled guide for navigating the performance-cost trade-off. Empirical results demonstrate that a strategically interpolated model surprisingly surpasses sophisticated model merging baselines on both efficiency and effectiveness. We further validate our findings with extensive ablation studies on model layers, modules, and decoding strategies. Ultimately, this work demystifies model interpolation and offers a practical framework for crafting models with precisely targeted reasoning capabilities. Code is available at \href{https://github.com/wutaiqiang/MI}{Github}.

[681] FBS Model-based Maintenance Record Accumulation for Failure-Cause Inference in Manufacturing Systems

Takuma Fujiu, Sho Okazaki, Kohei Kaminishi, Yuji Nakata, Shota Hamamoto, Kenshin Yokose, Tatsunori Hara, Yasushi Umeda, Jun Ota

Main category: cs.AI

TL;DR: Proposed a Diagnostic Knowledge Ontology and FBS model-based method for maintenance-record accumulation to improve failure-cause inference in manufacturing systems.

Details

Motivation: Need for knowledge bases that explicitly structure system/failure knowledge and contain long causal chains for effective failure-cause inference in manufacturing.

Method: Constructed Diagnostic Knowledge Ontology and proposed Function-Behavior-Structure (FBS) model-based maintenance-record accumulation method.

Result: Failure-cause inference using proposed method showed better agreement with expert-enumerated candidate causes, especially in difficult cases with few related cases and differing vocabulary.

Conclusion: Future work includes developing tailored inference methods, building user interface, and validation on larger systems. Approach enables knowledge sharing across engineering chain by leveraging design-phase understanding for maintenance-phase problem solving.

Abstract: In manufacturing systems, identifying the causes of failures is crucial for maintaining and improving production efficiency. In knowledge-based failure-cause inference, it is important that the knowledge base (1) explicitly structures knowledge about the target system and about failures, and (2) contains sufficiently long causal chains of failures. In this study, we constructed Diagnostic Knowledge Ontology and proposed a Function-Behavior-Structure (FBS) model-based maintenance-record accumulation method based on it. Failure-cause inference using the maintenance records accumulated by the proposed method showed better agreement with the set of candidate causes enumerated by experts, especially in difficult cases where the number of related cases is small and the vocabulary used differs. In the future, it will be necessary to develop inference methods tailored to these maintenance records, build a user interface, and carry out validation on larger and more diverse systems. Additionally, this approach leverages the understanding and knowledge of the target in the design phase to support knowledge accumulation and problem solving during the maintenance phase, and it is expected to become a foundation for knowledge sharing across the entire engineering chain in the future.

[682] Argumentation-Based Explainability for Legal AI: Comparative and Regulatory Perspectives

Andrada Iulia Prajescu, Roberto Confalonieri

Main category: cs.AI

TL;DR: This paper argues that computational argumentation frameworks provide the most suitable foundation for explainable AI in legal contexts, as they align with legal reasoning’s defeasible and contestable nature and emerging regulatory requirements.

Details

Motivation: The opacity of AI systems in legal contexts creates challenges for fairness, accountability, and trust, undermining the legitimacy of automated decision-making due to the 'black box problem' and lack of meaningful explanations.

Method: The paper analyzes different XAI methods (example-based, rule-based, hybrid, and argumentation-based approaches) and promotes computational models of arguments as particularly suitable for legal explanations, evaluating their alignment with GDPR and AIA regulations.

Result: Argumentation frameworks are identified as offering a robust foundation for explainable legal AI by capturing the defeasible, contestable, and value-sensitive nature of law, making them well-positioned to meet both technical and normative transparency requirements.

Conclusion: Computational argumentation is best positioned to address the challenges of explainable AI in legal domains, though open challenges remain including bias mitigation, empirical validation in judicial settings, and compliance with evolving ethical and legal standards.

Abstract: Artificial Intelligence (AI) systems are increasingly deployed in legal contexts, where their opacity raises significant challenges for fairness, accountability, and trust. The so-called ``black box problem’’ undermines the legitimacy of automated decision-making, as affected individuals often lack access to meaningful explanations. In response, the field of Explainable AI (XAI) has proposed a variety of methods to enhance transparency, ranging from example-based and rule-based techniques to hybrid and argumentation-based approaches. This paper promotes computational models of arguments and their role in providing legally relevant explanations, with particular attention to their alignment with emerging regulatory frameworks such as the EU General Data Protection Regulation (GDPR) and the Artificial Intelligence Act (AIA). We analyze the strengths and limitations of different explanation strategies, evaluate their applicability to legal reasoning, and highlight how argumentation frameworks – by capturing the defeasible, contestable, and value-sensitive nature of law – offer a particularly robust foundation for explainable legal AI. Finally, we identify open challenges and research directions, including bias mitigation, empirical validation in judicial settings, and compliance with evolving ethical and legal standards, arguing that computational argumentation is best positioned to meet both technical and normative requirements of transparency in the law domain.

[683] Modeling AI-Driven Production and Competitiveness A Multi-Agent Economic Simulation of China and the United States

Yuxinyue Qian, Jun Liu

Main category: cs.AI

TL;DR: This paper compares macroeconomic output evolution in China and the US under different AI mechanisms using simulation, finding AI as independent productive entity significantly boosts growth rates and China shows acceleration potential in technological catch-up.

Details

Motivation: To understand AI-driven production system transformation and shifts in international competitiveness as socio-economic systems enter human-AI co-creation stage.

Method: Simulation-based comparisons using a multi-level intelligent agent economic model to analyze macroeconomic output under AI collaboration, network effects, and AI autonomous production mechanisms.

Result: 1) AI as independent productive entity yields much higher social output growth than traditional human-labor models; 2) China shows clear acceleration potential in intelligent agent population expansion and technological catch-up, enabling possible technological convergence or partial surpassing.

Conclusion: The study provides a systematic model-based framework for analyzing AI-driven production transformation and international competitiveness shifts, offering quantitative insights for policy formulation.

Abstract: With the rapid development of artificial intelligence (AI) technology, socio-economic systems are entering a new stage of “human-AI co-creation.” Building upon a previously established multi-level intelligent agent economic model, this paper conducts simulation-based comparisons of macroeconomic output evolution in China and the United States under different mechanisms-AI collaboration, network effects, and AI autonomous production. The results show that: (1) when AI functions as an independent productive entity, the overall growth rate of social output far exceeds that of traditional human-labor-based models; (2) China demonstrates clear potential for acceleration in both the expansion of intelligent agent populations and the pace of technological catch-up, offering the possibility of achieving technological convergence or even partial surpassing. This study provides a systematic, model-based analytical framework for understanding AI-driven production system transformation and shifts in international competitiveness, as well as quantitative insights for relevant policy formulation.

[684] Improving AI Efficiency in Data Centres by Power Dynamic Response

Andrea Marinoni, Sai Shivareddy, Pietro Lio’, Weisi Lin, Erik Cambria, Clare Grey

Main category: cs.AI

TL;DR: This paper investigates dynamic power management for AI data centers to improve sustainability by making power input as dynamic as computing power usage.

Details

Motivation: AI data centers consume massive amounts of power, creating environmental and sustainability concerns that need innovative power management solutions.

Method: The study analyzes passive and active devices by quantifying their performance in computational gain, energy efficiency, capital expenditure reduction, and management costs using power trend data from multiple global data platforms.

Result: The approach shows potential to significantly improve sustainability of AI hyperscalers across environmental, financial, and societal dimensions.

Conclusion: This dynamic power management strategy represents a paradigm shift that could strongly enhance AI data center sustainability and reduce their environmental footprint.

Abstract: The steady growth of artificial intelligence (AI) has accelerated in the recent years, facilitated by the development of sophisticated models such as large language models and foundation models. Ensuring robust and reliable power infrastructures is fundamental to take advantage of the full potential of AI. However, AI data centres are extremely hungry for power, putting the problem of their power management in the spotlight, especially with respect to their impact on environment and sustainable development. In this work, we investigate the capacity and limits of solutions based on an innovative approach for the power management of AI data centres, i.e., making part of the input power as dynamic as the power used for data-computing functions. The performance of passive and active devices are quantified and compared in terms of computational gain, energy efficiency, reduction of capital expenditure, and management costs by analysing power trends from multiple data platforms worldwide. This strategy, which identifies a paradigm shift in the AI data centre power management, has the potential to strongly improve the sustainability of AI hyperscalers, enhancing their footprint on environmental, financial, and societal fields.

[685] Spec-Driven AI for Science: The ARIA Framework for Automated and Reproducible Data Analysis

Chuke Chen, Biao Luo, Nan Li, Boxiang Wang, Hang Yang, Jing Guo, Ming Xu

Main category: cs.AI

TL;DR: ARIA is a spec-driven, human-in-the-loop framework for automated and interpretable data analysis that bridges the gap between analytical capability and research intent through natural-language specifications.

Details

Motivation: To address the gap between analytical capability and research intent in scientific data analysis, where existing AI tools either favor automation over transparency or depend on manual scripting that hinders scalability and reproducibility.

Method: ARIA integrates six interoperable layers (Command, Context, Code, Data, Orchestration, and AI Module) within a document-centric workflow that unifies human reasoning and machine execution. It uses natural-language specifications to define analytical goals and autonomously generates executable code, validates computations, and produces transparent documentation.

Result: In the Boston Housing case, ARIA discovered 25 key features and determined XGBoost as the best performing model (R square = 0.93) with minimal overfitting. Evaluations across heterogeneous domains demonstrate ARIA’s strong performance, interpretability, and efficiency compared with state-of-the-art systems.

Conclusion: ARIA establishes a new paradigm for transparent, collaborative, and reproducible scientific discovery by combining AI for research and AI for science principles within a spec-driven architecture.

Abstract: The rapid expansion of scientific data has widened the gap between analytical capability and research intent. Existing AI-based analysis tools, ranging from AutoML frameworks to agentic research assistants, either favor automation over transparency or depend on manual scripting that hinders scalability and reproducibility. We present ARIA (Automated Research Intelligence Assistant), a spec-driven, human-in-the-loop framework for automated and interpretable data analysis. ARIA integrates six interoperable layers, namely Command, Context, Code, Data, Orchestration, and AI Module, within a document-centric workflow that unifies human reasoning and machine execution. Through natural-language specifications, researchers define analytical goals while ARIA autonomously generates executable code, validates computations, and produces transparent documentation. Beyond achieving high predictive accuracy, ARIA can rapidly identify optimal feature sets and select suitable models, minimizing redundant tuning and repetitive experimentation. In the Boston Housing case, ARIA discovered 25 key features and determined XGBoost as the best performing model (R square = 0.93) with minimal overfitting. Evaluations across heterogeneous domains demonstrate ARIA’s strong performance, interpretability, and efficiency compared with state-of-the-art systems. By combining AI for research and AI for science principles within a spec-driven architecture, ARIA establishes a new paradigm for transparent, collaborative, and reproducible scientific discovery.

[686] $How^{2}$: How to learn from procedural How-to questions

Gautier Dagan, Frank Keller, Alex Lascarides

Main category: cs.AI

TL;DR: How2 is a memory agent framework that enables AI agents to ask how-to questions, store answers, and reuse them for lifelong learning in interactive environments like Minecraft.

Details

Motivation: How-to questions help agents reduce uncertainty and fill knowledge gaps for planning, but their open-ended nature makes them challenging for AI agents to ask and for AI experts to answer efficiently.

Method: The framework uses teacher models that answer how-to questions at varying levels of abstraction - from executable action sequences to high-level subgoal descriptions - and stores these answers for reuse.

Result: Lifelong learning agents benefit most from answers that are abstracted and decoupled from the current state, showing improved planning capabilities over time.

Conclusion: How2 provides a way for LLM-based agents to improve their planning capabilities through asking questions and learning from answers in interactive environments.

Abstract: An agent facing a planning problem can use answers to how-to questions to reduce uncertainty and fill knowledge gaps, helping it solve both current and future tasks. However, their open ended nature, where valid answers to “How do I X?” range from executable actions to high-level descriptions of X’s sub-goals, makes them challenging for AI agents to ask, and for AI experts to answer, in ways that support efficient planning. We introduce $How^{2}$, a memory agent framework that enables agents to ask how-to questions, store the answers, and reuse them for lifelong learning in interactive environments. We evaluate our approach in Plancraft, a Minecraft crafting environment, where agents must complete an assembly task by manipulating inventory items. Using teacher models that answer at varying levels of abstraction, from executable action sequences to high-level subgoal descriptions, we show that lifelong learning agents benefit most from answers that are abstracted and decoupled from the current state. $How^{2}$ offers a way for LLM-based agents to improve their planning capabilities over time by asking questions in interactive environments.

[687] Aligning Deep Implicit Preferences by Learning to Reason Defensively

Peiming Li, Zhiyuan Hu, Yang Tang, Shiyu Li, Xi Chen

Main category: cs.AI

TL;DR: CDRA reframes LLM alignment as structured reasoning using critique-driven methods to infer deep user preferences and enable defensive reasoning, outperforming traditional approaches.

Details

Motivation: Current LLM alignment methods fail to infer users' deep implicit preferences (unstated goals, semantic context, risk tolerances) and lack defensive reasoning for real-world ambiguity, leading to superficial and brittle responses.

Method: Proposes Critique-Driven Reasoning Alignment (CDRA) with two key components: DeepPref benchmark (3000 preference-query pairs with critique-annotated reasoning chains) and Personalized Generative Process Reward Model (Pers-GenPRM) that generates critique chains for reward modeling, followed by Critique-Driven Policy Alignment using online RL.

Result: Experiments demonstrate that CDRA excels at discovering and aligning with users’ true preferences while executing robust reasoning, showing superior performance over traditional alignment methods.

Conclusion: CDRA successfully bridges the cognitive gap in LLM alignment by transforming it from scalar reward-matching into structured reasoning, enabling better inference of implicit preferences and defensive reasoning capabilities.

Abstract: Personalized alignment is crucial for enabling Large Language Models (LLMs) to engage effectively in user-centric interactions. However, current methods face a dual challenge: they fail to infer users’ deep implicit preferences (including unstated goals, semantic context and risk tolerances), and they lack the defensive reasoning required to navigate real-world ambiguity. This cognitive gap leads to responses that are superficial, brittle and short-sighted. To address this, we propose Critique-Driven Reasoning Alignment (CDRA), which reframes alignment from a scalar reward-matching task into a structured reasoning process. First, to bridge the preference inference gap, we introduce the DeepPref benchmark. This dataset, comprising 3000 preference-query pairs across 20 topics, is curated by simulating a multi-faceted cognitive council that produces critique-annotated reasoning chains to deconstruct query semantics and reveal latent risks. Second, to instill defensive reasoning, we introduce the Personalized Generative Process Reward Model (Pers-GenPRM), which frames reward modeling as a personalized reasoning task. It generates a critique chain to evaluate a response’s alignment with user preferences before outputting a final score based on this rationale. Ultimately, this interpretable, structured reward signal guides policy model through Critique-Driven Policy Alignment, a process-level online reinforcement learning algorithm integrating both numerical and natural language feedback. Experiments demonstrate that CDRA excels at discovering and aligning with users’ true preferences while executing robust reasoning. Our code and dataset are available at https://github.com/Zephyrian-Hugh/Deep-pref.

[688] AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures?

Leonard Dung, Florian Mai

Main category: cs.AI

TL;DR: Analysis of failure mode correlations across 7 AI alignment techniques shows that defense-in-depth effectiveness depends on how uncorrelated the failure modes are between different techniques.

Details

Motivation: AI alignment techniques have failure modes where they may not provide safety, and defense-in-depth relies on multiple redundant protections. However, this approach only works if failure modes across different techniques are uncorrelated.

Method: Analyzed 7 representative alignment techniques and 7 failure modes to understand the extent of overlap between their failure conditions.

Result: Found varying degrees of correlation between failure modes across different alignment techniques, which affects the effectiveness of defense-in-depth strategies.

Conclusion: The success of defense-in-depth depends on failure mode correlations, and understanding these relationships helps assess current risk levels and prioritize future AI alignment research.

Abstract: AI alignment research aims to develop techniques to ensure that AI systems do not cause harm. However, every alignment technique has failure modes, which are conditions in which there is a non-negligible chance that the technique fails to provide safety. As a strategy for risk mitigation, the AI safety community has increasingly adopted a defense-in-depth framework: Conceding that there is no single technique which guarantees safety, defense-in-depth consists in having multiple redundant protections against safety failure, such that safety can be maintained even if some protections fail. However, the success of defense-in-depth depends on how (un)correlated failure modes are across alignment techniques. For example, if all techniques had the exact same failure modes, the defense-in-depth approach would provide no additional protection at all. In this paper, we analyze 7 representative alignment techniques and 7 failure modes to understand the extent to which they overlap. We then discuss our results’ implications for understanding the current level of risk and how to prioritize AI alignment research in the future.

[689] PADME: Procedure Aware DynaMic Execution

Deepeka Garg, Sihan Zeng, Annapoorani L. Narayanan, Sumitra Ganesh, Leo Ardon

Main category: cs.AI

TL;DR: PADME is a framework that transforms procedural text into executable graphs for robust long-horizon task execution by LLM agents, achieving SOTA performance on multiple benchmarks.

Details

Motivation: Current LLM agents struggle with long-horizon procedures due to text variability and lack of structure, causing execution drift and failure.

Method: Two-phase approach: Teach phase structures procedures into graphs with dependencies and executable logic; Execute phase enables dynamic execution using real-time inputs and feedback.

Result: Achieves state-of-the-art performance on four benchmarks including ALFWorld and ScienceWorld, demonstrating reduced error accumulation in long-horizon reasoning.

Conclusion: Graph-based procedure representations provide a powerful intermediate abstraction for robust and generalizable agent-driven automation.

Abstract: Learning to autonomously execute long-horizon procedures from natural language remains a core challenge for intelligent agents. Free-form instructions such as recipes, scientific protocols, or business workflows encode rich procedural knowledge, but their variability and lack of structure cause agents driven by large language models (LLMs) to drift or fail during execution. We introduce Procedure Aware DynaMic Execution (PADME), an agent framework that produces and exploits a graph-based representation of procedures. Unlike prior work that relies on manual graph construction or unstructured reasoning, PADME autonomously transforms procedural text into executable graphs that capture task dependencies, decision points, and reusable subroutines. Central to PADME is a two-phase methodology; Teach phase, which focuses on systematic structuring, enrichment with executable logic of procedures, followed by Execute phase, which enables dynamic execution in response to real-time inputs and environment feedback. This separation ensures quality assurance and scalability, allowing expert knowledge to be encoded once and reliably reused across varying contexts. The graph representation also provides an inductive bias that reduces error accumulation in long-horizon reasoning, underscoring the importance of structured procedure modeling for reliable agent-driven automation. Empirically, PADME achieves state-of-the-art performance on four diverse benchmarks, including ALFWorld and ScienceWorld. These results demonstrate that agents equipped with graph-based procedure representations offer a powerful intermediate abstraction for robust and generalizable execution.

[690] Evolution in Simulation: AI-Agent School with Dual Memory for High-Fidelity Educational Dynamics

Sheng Jin, Haoming Wang, Zhiqi Gao, Yongbo Yang, Bao Chunjia, Chengliang Wang

Main category: cs.AI

TL;DR: The AI-Agent School (AAS) system uses self-evolving agents with a Zero-Exp strategy and dual memory base to simulate complex educational dynamics through continuous experience-reflection-optimization cycles.

Details

Motivation: To address fragmented teaching process modeling and limitations in simulating diverse educational participants using LLM-based agents.

Method: Zero-Exp strategy with continuous “experience-reflection-optimization” cycle, dual memory base (experience and knowledge bases), and short-term/long-term memory components for autonomous agent evolution in simulated school scenarios.

Result: AAS effectively simulates intricate educational dynamics and fosters advanced agent cognitive abilities, generating high-fidelity behavioral and interaction data.

Conclusion: AAS provides a foundational stepping stone from the “Era of Experience” to the “Era of Simulation” by enabling accurate modeling of teacher-student engagements and learning processes.

Abstract: Large language models (LLMs) based Agents are increasingly pivotal in simulating and understanding complex human systems and interactions. We propose the AI-Agent School (AAS) system, built around a self-evolving mechanism that leverages agents for simulating complex educational dynamics. Addressing the fragmented issues in teaching process modeling and the limitations of agents performance in simulating diverse educational participants, AAS constructs the Zero-Exp strategy, employs a continuous “experience-reflection-optimization” cycle, grounded in a dual memory base comprising experience and knowledge bases and incorporating short-term and long-term memory components. Through this mechanism, agents autonomously evolve via situated interactions within diverse simulated school scenarios. This evolution enables agents to more accurately model the nuanced, multi-faceted teacher-student engagements and underlying learning processes found in physical schools. Experiment confirms that AAS can effectively simulate intricate educational dynamics and is effective in fostering advanced agent cognitive abilities, providing a foundational stepping stone from the “Era of Experience” to the “Era of Simulation” by generating high-fidelity behavioral and interaction data.

[691] Automated Skill Decomposition Meets Expert Ontologies: Bridging the Granularity Gap with LLMs

Le Ngoc Luyen, Marie-Hélène Abel

Main category: cs.AI

TL;DR: This paper proposes an ontology-grounded evaluation framework for automated skill decomposition using LLMs, introducing semantic and hierarchy-aware F1 metrics to assess content accuracy and structural placement.

Details

Motivation: To develop a rigorous and reproducible framework for evaluating automated skill decomposition systems using LLMs, addressing the need for standardized evaluation pipelines and metrics.

Method: Proposes an ontology-grounded framework with standardized pipeline from prompting to normalization, introduces semantic F1-score and hierarchy-aware F1-score metrics, and compares zero-shot vs leakage-safe few-shot prompting strategies on ROME-ESCO-DecompSkill dataset.

Result: Zero-shot prompting provides strong baseline performance, while few-shot prompting consistently stabilizes phrasing and granularity and improves hierarchy-aware alignment. Exemplar-guided prompts are competitive and sometimes faster than zero-shot due to more schema-compliant completions.

Conclusion: The framework, benchmark, and metrics provide a reproducible foundation for developing ontology-faithful skill decomposition systems, with few-shot prompting offering advantages in stability and structural alignment.

Abstract: This paper investigates automated skill decomposition using Large Language Models (LLMs) and proposes a rigorous, ontology-grounded evaluation framework. Our framework standardizes the pipeline from prompting and generation to normalization and alignment with ontology nodes. To evaluate outputs, we introduce two metrics: a semantic F1-score that uses optimal embedding-based matching to assess content accuracy, and a hierarchy-aware F1-score that credits structurally correct placements to assess granularity. We conduct experiments on ROME-ESCO-DecompSkill, a curated subset of parents, comparing two prompting strategies: zero-shot and leakage-safe few-shot with exemplars. Across diverse LLMs, zero-shot offers a strong baseline, while few-shot consistently stabilizes phrasing and granularity and improves hierarchy-aware alignment. A latency analysis further shows that exemplar-guided prompts are competitive - and sometimes faster - than unguided zero-shot due to more schema-compliant completions. Together, the framework, benchmark, and metrics provide a reproducible foundation for developing ontology-faithful skill decomposition systems.

[692] AI-Driven anemia diagnosis: A review of advanced models and techniques

Abdullah Al Mahmud, Prangon Chowdhury, Mohammed Borhan Uddin, Khaled Eabne Delowar, Tausifur Rahman Talha, Bijoy Dewanjee

Main category: cs.AI

TL;DR: Systematic review of machine learning and deep learning models for anemia detection, comparing performance metrics like accuracy, sensitivity, specificity, and precision.

Details

Motivation: Anemia affects millions globally and requires accurate diagnosis. AI techniques (ML/DL) show promise for improving anemia detection and classification.

Method: Conducted systematic review of recent advancements in ML/DL models for anemia detection, comparing various models based on performance metrics.

Result: Analysis reveals strengths and limitations of different models in detecting and classifying anemia, highlighting areas for improvement in diagnostic accuracy.

Conclusion: Addressing identified limitations is crucial for enhancing diagnostic accuracy in anemia detection using AI techniques.

Abstract: Anemia, a condition marked by insufficient levels of red blood cells or hemoglobin, remains a widespread health issue affecting millions of individuals globally. Accurate and timely diagnosis is essential for effective management and treatment of anemia. In recent years, there has been a growing interest in the use of artificial intelligence techniques, i.e., machine learning (ML) and deep learning (DL) for the detection, classification, and diagnosis of anemia. This paper provides a systematic review of the recent advancements in this field, with a focus on various models applied to anemia detection. The review also compares these models based on several performance metrics, including accuracy, sensitivity, specificity, and precision. By analyzing these metrics, the paper evaluates the strengths and limitation of discussed models in detecting and classifying anemia, emphasizing the importance of addressing these factors to improve diagnostic accuracy.

[693] From to : Multidimensional Supervision of Reasoning Process for LLM Optimization

Beining Wang, Weihang Su, Hongtao Tian, Tao Yang, Yujia Zhou, Ting Yao, Qingyao Ai, Yiqun Liu

Main category: cs.AI

TL;DR: Proposes Dimension-level Reward Model (DRM) to improve LLM reasoning by evaluating reasoning processes along three interpretable dimensions: Confidence, Relevance, and Coherence, instead of just final answer correctness.

Details

Motivation: Existing approaches have limitations: outcome-supervised RL rewards only correct final answers (sparse rewards, propagates flawed reasoning), while process-level reward models lack generalizability and require task-specific segmentation.

Method: DRM framework evaluates reasoning quality along three fundamental dimensions: Confidence (uncertainty calibration), Relevance (semantic alignment), and Coherence (logical consistency). This provides interpretable assessment without ground truth answers.

Result: DRM provides effective supervision signals, guides LLM optimization, and enhances reasoning ability. Achieves consistent gains on both in-distribution and out-of-distribution tasks including mathematics, QA, code execution, and puzzles.

Conclusion: Multidimensional supervision of reasoning processes can improve LLMs’ generalized reasoning ability beyond training distribution, demonstrating the value of interpretable process evaluation over simple outcome-based rewards.

Abstract: Improving the multi-step reasoning ability of Large Language Models (LLMs) is a critical yet challenging task. The dominant paradigm, outcome-supervised reinforcement learning (RLVR), rewards only correct final answers, often propagating flawed reasoning and suffering from sparse reward signals. While process-level reward models (PRMs) provide denser, step-by-step feedback, they lack generalizability and interpretability, requiring task-specific segmentation of the reasoning process. To this end, we propose the Dimension-level Reward Model (DRM), a new supervision framework that bridges the gap between these two approaches. DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions: Confidence for uncertainty calibration, Relevance for semantic alignment, and Coherence for logical consistency. Together, these dimensions capture aspects beyond final answer correctness and enable interpretable assessment without requiring ground truth answers. Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs and enhances their reasoning ability. In particular, DRM-supervised training achieves consistent gains on both in-distribution and out-of-distribution open-domain tasks, including mathematics, question answering, code execution, and puzzles. Our findings demonstrate that multidimensional supervision of the reasoning process can improve the generalized reasoning ability of LLMs beyond the training distribution.

[694] Unifying Deductive and Abductive Reasoning in Knowledge Graphs with Masked Diffusion Model

Yisen Gao, Jiaxin Bai, Yi Huang, Xingcheng Fu, Qingyun Sun, Yangqiu Song

Main category: cs.AI

TL;DR: DARK is a unified framework for both deductive and abductive reasoning on knowledge graphs using masked diffusion models, achieving state-of-the-art performance through self-reflective denoising and logic-exploration reinforcement learning.

Details

Motivation: Current methods handle deductive and abductive reasoning separately, despite their synergistic potential where deduction validates hypotheses and abduction uncovers deeper logical patterns.

Method: Uses masked diffusion model with two key innovations: self-reflective denoising for hypothesis refinement during abduction, and logic-exploration reinforcement learning that simultaneously masks queries and conclusions to explore novel reasoning compositions.

Result: Extensive experiments on multiple benchmark knowledge graphs show DARK achieves state-of-the-art performance on both deductive and abductive reasoning tasks.

Conclusion: The unified approach demonstrates significant benefits by bridging the gap between deductive and abductive reasoning in knowledge graphs.

Abstract: Deductive and abductive reasoning are two critical paradigms for analyzing knowledge graphs, enabling applications from financial query answering to scientific discovery. Deductive reasoning on knowledge graphs usually involves retrieving entities that satisfy a complex logical query, while abductive reasoning generates plausible logical hypotheses from observations. Despite their clear synergistic potential, where deduction can validate hypotheses and abduction can uncover deeper logical patterns, existing methods address them in isolation. To bridge this gap, we propose DARK, a unified framework for Deductive and Abductive Reasoning in Knowledge graphs. As a masked diffusion model capable of capturing the bidirectional relationship between queries and conclusions, DARK has two key innovations. First, to better leverage deduction for hypothesis refinement during abductive reasoning, we introduce a self-reflective denoising process that iteratively generates and validates candidate hypotheses against the observed conclusion. Second, to discover richer logical associations, we propose a logic-exploration reinforcement learning approach that simultaneously masks queries and conclusions, enabling the model to explore novel reasoning compositions. Extensive experiments on multiple benchmark knowledge graphs show that DARK achieves state-of-the-art performance on both deductive and abductive reasoning tasks, demonstrating the significant benefits of our unified approach.

[695] Zero Data Retention in LLM-based Enterprise AI Assistants: A Comparative Study of Market Leading Agentic AI Products

Komal Gupta, Aditya Shrivastava

Main category: cs.AI

TL;DR: Analysis of zero data retention policies for enterprise AI assistants, focusing on architectural and compliance trade-offs in implementations by Salesforce and Microsoft.

Details

Motivation: With AI assistants becoming crucial for business productivity in healthcare and finance, safeguarding private data and ensuring compliance through zero data retention has become a priority.

Method: Examination of technical architectures used by Salesforce AgentForce and Microsoft Copilot, analyzing how they implement zero data retention policies while working with LLM providers like OpenAI, Anthropic, and Meta.

Result: Identified distinct technical approaches by Salesforce and Microsoft for implementing zero data retention in enterprise AI assistants, highlighting different architectural solutions.

Conclusion: Zero data retention policies can be effectively implemented in enterprise AI assistants through proper architectural design, though trade-offs exist between compliance, usability, and technical implementation.

Abstract: Governance of data, compliance, and business privacy matters, particularly for healthcare and finance businesses. Since the recent emergence of AI enterprise AI assistants enhancing business productivity, safeguarding private data and compliance is now a priority. With the implementation of AI assistants across the enterprise, the zero data retention can be achieved by implementing zero data retention policies by Large Language Model businesses like Open AI and Anthropic and Meta. In this work, we explore zero data retention policies for the Enterprise apps of large language models (LLMs). Our key contribution is defining the architectural, compliance, and usability trade-offs of such systems in parallel. In this research work, we examine the development of commercial AI assistants with two industry leaders and market titans in this arena - Salesforce and Microsoft. Both of these companies used distinct technical architecture to support zero data retention policies. Salesforce AgentForce and Microsoft Copilot are among the leading AI assistants providing much-needed push to business productivity in customer care. The purpose of this paper is to analyze the technical architecture and deployment of zero data retention policy by consuming applications as well as big language models service providers like Open Ai, Anthropic, and Meta.

[696] Analyzing and Internalizing Complex Policy Documents for LLM Agents

Jiateng Liu, Zhenhailong Wang, Xiaojiang Huang, Yingjie Li, Xing Fan, Xiang Li, Chenlei Guo, Ruhi Sarikaya, Heng Ji

Main category: cs.AI

TL;DR: CC-Gen benchmark generator enables systematic evaluation of policy internalization methods for LLM agents. CAP-CPT method improves policy internalization through category-aware continued pretraining, achieving significant performance gains.

Details

Motivation: Large policy documents in LLM-based agentic systems cause high computational overhead, motivating development of internalization methods to embed policies into model priors while preserving performance.

Method: Propose CAP-CPT (Category-Aware Policy Continued Pretraining) that parses policies into factual, behavioral, and conditional categories, isolates complex conditions, and uses targeted data synthesis with autoregressive pretraining loss.

Result: CAP-CPT improves SFT baselines in all settings, with up to 41% and 22% gains on Qwen-3-32B, achieving 97.3% prompt length reduction on CC-Gen and enhancing tau-Bench with minimal SFT data.

Conclusion: CAP-CPT effectively mitigates data and reasoning burdens for policy internalization, outperforming supervised fine-tuning approaches especially as policy complexity increases.

Abstract: Large Language Model (LLM)-based agentic systems rely on in-context policy documents encoding diverse business rules. As requirements grow, these documents expand rapidly, causing high computational overhead. This motivates developing internalization methods that embed policy documents into model priors while preserving performance. Prior prompt compression work targets generic prompts, but agentic policy documents span multiple complexity levels and require deeper reasoning, making internalization harder. We introduce CC-Gen, an agentic benchmark generator with Controllable Complexity across four levels, enabling systematic evaluation of agents’ ability to handle complexity and offering a unified framework for assessing policy internalization. Our analysis shows that complex policy specifications governing workflows pose major reasoning challenges. Supporting internalization with gold user agent interaction trajectories containing chain-of-thought (CoT) annotations via supervised fine-tuning (SFT) is data-intensive and degrades sharply as policy complexity increases. To mitigate data and reasoning burdens, we propose Category-Aware Policy Continued Pretraining (CAP-CPT). Our automated pipeline parses policy documents to extract key specifications, grouping them into factual, behavioral, and conditional categories, and isolating complex conditions that drive workflow complexity. This guides targeted data synthesis and enables agents to internalize policy information through an autoregressive pretraining loss. Experiments show CAP-CPT improves SFT baselines in all settings, with up to 41% and 22% gains on Qwen-3-32B, achieving 97.3% prompt length reduction on CC-Gen and further enhancing tau-Bench with minimal SFT data.

[697] Reproducibility: The New Frontier in AI Governance

Israel Mason-Williams, Gabryel Mason-Williams

Main category: cs.AI

TL;DR: AI governance faces challenges due to low signal-to-noise ratio in information environment, weak reproducibility standards, and rapid publication speeds. The paper proposes adopting stricter reproducibility protocols to improve AI governance and risk consensus.

Details

Motivation: Current AI research environment has low signal-to-noise ratio, favoring regulatory capture and creating uncertainty about which AI risks to prioritize. Weak reproducibility protocols erode policymakers' ability to enact effective governance.

Method: Evaluate AI reproducibility crisis through comparison with crises in other scientific domains. Propose adoption of preregistration, increased statistical power, and negative result publication protocols.

Result: The analysis shows that current reproducibility issues in AI research undermine effective governance and consensus on AI risk landscape.

Conclusion: AI governance must be reactive but should incorporate reproducibility protocols as core tools. Policymakers should demand higher reproducibility standards in AI research to enable effective governance.

Abstract: AI policymakers are responsible for delivering effective governance mechanisms that can provide safe, aligned and trustworthy AI development. However, the information environment offered to policymakers is characterised by an unnecessarily low Signal-To-Noise Ratio, favouring regulatory capture and creating deep uncertainty and divides on which risks should be prioritised from a governance perspective. We posit that the current publication speeds in AI combined with the lack of strong scientific standards, via weak reproducibility protocols, effectively erodes the power of policymakers to enact meaningful policy and governance protocols. Our paper outlines how AI research could adopt stricter reproducibility guidelines to assist governance endeavours and improve consensus on the AI risk landscape. We evaluate the forthcoming reproducibility crisis within AI research through the lens of crises in other scientific domains; providing a commentary on how adopting preregistration, increased statistical power and negative result publication reproducibility protocols can enable effective AI governance. While we maintain that AI governance must be reactive due to AI’s significant societal implications we argue that policymakers and governments must consider reproducibility protocols as a core tool in the governance arsenal and demand higher standards for AI research. Code to replicate data and figures: https://github.com/IFMW01/reproducibility-the-new-frontier-in-ai-governance

[698] Explainability, risk modeling, and segmentation based customer churn analytics for personalized retention in e-commerce

Sanjula De Alwis, Indrajith Ekanayake

Main category: cs.AI

TL;DR: A framework combining explainable AI, survival analysis, and RFM profiling to create interpretable churn prediction models that support personalized retention strategies.

Details

Motivation: Current churn models are opaque black boxes that limit insights into churn determinants, timing of retention opportunities, and high-risk customer segments, while customer retention is more cost-effective than acquisition.

Method: Three-component framework integrating: 1) Explainable AI to quantify feature contributions, 2) Survival analysis to model time-to-event churn risk, 3) RFM profiling to segment customers by transactional behavior.

Result: The framework enables attribution of churn drivers, estimation of intervention windows, and prioritization of customer segments for targeted retention actions.

Conclusion: The integrated approach supports personalized retention strategies that reduce attrition and strengthen customer loyalty by shifting focus from mere prediction to actionable, interpretable insights.

Abstract: In online retail, customer acquisition typically incurs higher costs than customer retention, motivating firms to invest in churn analytics. However, many contemporary churn models operate as opaque black boxes, limiting insight into the determinants of attrition, the timing of retention opportunities, and the identification of high-risk customer segments. Accordingly, the emphasis should shift from prediction alone to the design of personalized retention strategies grounded in interpretable evidence. This study advances a three-component framework that integrates explainable AI to quantify feature contributions, survival analysis to model time-to-event churn risk, and RFM profiling to segment customers by transactional behaviour. In combination, these methods enable the attribution of churn drivers, estimation of intervention windows, and prioritization of segments for targeted actions, thereby supporting strategies that reduce attrition and strengthen customer loyalty.

[699] ParaCook: On Time-Efficient Planning for Multi-Agent Systems

Shiqi Zhang, Xinbei Ma, Yunqing Xu, Zouying Cao, Pengrui Lu, Haobo Yuan, Tiancheng Shen, Zhuosheng Zhang, Hai Zhao, Ming-Hsuan Yang

Main category: cs.AI

TL;DR: ParaCook is a benchmark for evaluating time-efficient collaborative planning in multi-agent systems, inspired by Overcooked cooking tasks, focusing on parallel and asynchronous operations.

Details

Motivation: Existing agent benchmarks focus on task completion but neglect time efficiency in parallel and asynchronous operations, which is crucial for real-world collaborative planning.

Method: ParaCook provides an environment with simplified action space for cooking tasks, isolating the core challenge of strategic parallel planning in multi-agent systems.

Result: Current LLMs achieve suboptimal plans that struggle with parallel actions and coordination, but show potential on abstract tasks where they can focus on high-level parallel optimization.

Conclusion: ParaCook establishes a scalable evaluation framework for developing and assessing time efficiency-aware multi-agent planning, with adjustable complexity to advance the field.

Abstract: Large Language Models (LLMs) exhibit strong reasoning abilities for planning long-horizon, real-world tasks, yet existing agent benchmarks focus on task completion while neglecting time efficiency in parallel and asynchronous operations. To address this, we present ParaCook, a benchmark for time-efficient collaborative planning. Inspired by the Overcooked game, ParaCook provides an environment for various challenging interaction planning of multi-agent systems that are instantiated as cooking tasks, with a simplified action space to isolate the core challenge of strategic parallel planning. Through a comprehensive evaluation of state-of-the-art LLMs, we find that current approaches achieve suboptimal plans, which struggle with parallel actions or coordination. Our analysis also reveals LLMs’ potential on abstract tasks where they can focus on high-level parallel optimization. ParaCook provides a scalable evaluation framework with adjustable complexity, establishing a foundation for developing and assessing time efficiency-aware multi-agent planning. The code and data are available at https://github.com/zsq259/ParaCook.

[700] SR-Scientist: Scientific Equation Discovery With Agentic AI

Shijie Xia, Yuhan Sun, Pengfei Liu

Main category: cs.AI

TL;DR: SR-Scientist elevates LLMs from simple equation proposers to autonomous AI scientists that write code, analyze data, implement equations, and optimize them based on experimental feedback, outperforming baselines by 6-35% across four science disciplines.

Details

Motivation: Current LLM methods for scientific equation discovery limit models to being equation proposers within search algorithms, failing to leverage their full potential as autonomous scientific agents.

Method: Wrap code interpreter into tools for data analysis and equation evaluation, enabling the agent to autonomously write code, implement equations, submit for evaluation, and optimize based on feedback with minimal human intervention.

Result: Outperforms baseline methods by 6-35% across four science disciplines, demonstrates robustness to noise, generalization to out-of-domain data, and high symbolic accuracy.

Conclusion: The framework successfully transforms LLMs into autonomous AI scientists capable of end-to-end scientific discovery, with enhanced capabilities through reinforcement learning.

Abstract: Recently, Large Language Models (LLMs) have been applied to scientific equation discovery, leveraging their embedded scientific knowledge for hypothesis generation. However, current methods typically confine LLMs to the role of an equation proposer within search algorithms like genetic programming. In this paper, we present SR-Scientist, a framework that elevates the LLM from a simple equation proposer to an autonomous AI scientist that writes code to analyze data, implements the equation as code, submits it for evaluation, and optimizes the equation based on experimental feedback. Specifically, we wrap the code interpreter into a set of tools for data analysis and equation evaluation. The agent is instructed to optimize the equation by utilizing these tools over a long horizon with minimal human-defined pipelines. Empirical results show that SR-Scientist outperforms baseline methods by an absolute margin of 6% to 35% on datasets covering four science disciplines. Additionally, we demonstrate our method’s robustness to noise, the generalization of the discovered equations to out-of-domain data, and their symbolic accuracy. Furthermore, we develop an end-to-end reinforcement learning framework to enhance the agent’s capabilities.

[701] Operand Quant: A Single-Agent Architecture for Autonomous Machine Learning Engineering

Arjun Sahney, Ram Gorthi, Cezary Łastowski, Javier Vega

Main category: cs.AI

TL;DR: Operand Quant is a single-agent IDE-based architecture for autonomous machine learning engineering that achieves state-of-the-art performance on MLE-Benchmark (2025), outperforming multi-agent systems.

Details

Motivation: To demonstrate that a single, context-aware agent can effectively handle all MLE lifecycle stages (exploration, modeling, experimentation, deployment) without complex multi-agent orchestration.

Method: Uses a linear, non-blocking single agent operating autonomously within a controlled IDE environment, consolidating all MLE lifecycle stages in one agent rather than using multi-agent frameworks.

Result: Achieved new SOTA on MLE-Benchmark (2025) with overall medal rate of 0.3956 +/- 0.0565 across 75 problems - highest performance among all evaluated systems to date.

Conclusion: A single, linear agent in a controlled IDE environment can outperform multi-agent and orchestrated systems for autonomous machine learning engineering tasks.

Abstract: We present Operand Quant, a single-agent, IDE-based architecture for autonomous machine learning engineering (MLE). Operand Quant departs from conventional multi-agent orchestration frameworks by consolidating all MLE lifecycle stages – exploration, modeling, experimentation, and deployment – within a single, context-aware agent. On the MLE-Benchmark (2025), Operand Quant achieved a new state-of-the-art (SOTA) result, with an overall medal rate of 0.3956 +/- 0.0565 across 75 problems – the highest recorded performance among all evaluated systems to date. The architecture demonstrates that a linear, non-blocking agent, operating autonomously within a controlled IDE environment, can outperform multi-agent and orchestrated systems under identical constraints.

[702] Domain-Specific Constitutional AI: Enhancing Safety in LLM-Powered Mental Health Chatbots

Chenhan Lyu, Yutong Song, Pengfei Zhang, Amir M. Rahmani

Main category: cs.AI

TL;DR: The paper proposes using Constitutional AI training with mental health-specific principles to address AI safety challenges in mental health applications, where general AI safeguards are insufficient for handling emotional vulnerability and crisis situations.

Details

Motivation: Rising global mental illness rates, AI integration in psychological care, and need for scalable solutions in underserved communities drive the development of mental health applications that handle sensitive data and require specialized safety measures beyond general AI safeguards.

Method: Introduce Constitutional AI training with domain-specific mental health principles to create safe, domain-adapted CAI systems for computational mental health applications.

Result: The approach aims to address mental health-specific challenges including crisis intervention accuracy, therapeutic guideline adherence, scale limitations in resource-constrained settings, and adaptation to nuanced dialogues.

Conclusion: General AI safety advances inadequately address mental health-specific risks, necessitating specialized approaches like Constitutional AI with domain-specific principles to ensure safe deployment in mental health applications.

Abstract: Mental health applications have emerged as a critical area in computational health, driven by rising global rates of mental illness, the integration of AI in psychological care, and the need for scalable solutions in underserved communities. These include therapy chatbots, crisis detection, and wellness platforms handling sensitive data, requiring specialized AI safety beyond general safeguards due to emotional vulnerability, risks like misdiagnosis or symptom exacerbation, and precise management of vulnerable states to avoid severe outcomes such as self-harm or loss of trust. Despite AI safety advances, general safeguards inadequately address mental health-specific challenges, including crisis intervention accuracy to avert escalations, therapeutic guideline adherence to prevent misinformation, scale limitations in resource-constrained settings, and adaptation to nuanced dialogues where generics may introduce biases or miss distress signals. We introduce an approach to apply Constitutional AI training with domain-specific mental health principles for safe, domain-adapted CAI systems in computational mental health applications.

[703] Learning to Be Cautious

Montaser Mohammedalamen, Dustin Morrill, Alexander Sieusahai, Yash Satsangi, Michael Bowling

Main category: cs.AI

TL;DR: The paper presents an algorithm that enables reinforcement learning agents to learn cautious behavior autonomously without task-specific safety information, using reward function uncertainty and robust policy construction.

Details

Motivation: Current RL approaches require embedding task-specific safety information, which is error-prone and burdensome. The goal is to develop agents that can learn cautious behavior on their own in novel situations.

Method: Uses neural network ensembles to characterize reward function uncertainty and constructs robust policies with k-of-N counterfactual regret minimization (CFR) subroutine.

Result: The algorithm successfully exhibits cautious behavior across increasingly non-obvious tasks without any task-specific safety tuning.

Conclusion: It is possible for reinforcement learning systems to autonomously learn cautious behavior using reward function uncertainty and robust policy construction, overcoming limitations of current safety-embedded approaches.

Abstract: A key challenge in the field of reinforcement learning is to develop agents that behave cautiously in novel situations. It is generally impossible to anticipate all situations that an autonomous system may face or what behavior would best avoid bad outcomes. An agent that can learn to be cautious would overcome this challenge by discovering for itself when and how to behave cautiously. In contrast, current approaches typically embed task-specific safety information or explicit cautious behaviors into the system, which is error-prone and imposes extra burdens on practitioners. In this paper, we present both a sequence of tasks where cautious behavior becomes increasingly non-obvious, as well as an algorithm to demonstrate that it is possible for a system to learn to be cautious. The essential features of our algorithm are that it characterizes reward function uncertainty without task-specific safety information and uses this uncertainty to construct a robust policy. Specifically, we construct robust policies with a k-of-N counterfactual regret minimization (CFR) subroutine given learned reward function uncertainty represented by a neural network ensemble. These policies exhibit caution in each of our tasks without any task-specific safety tuning. Our code is available at https://github.com/montaserFath/Learning-to-be-Cautious

[704] ChipGPT: How far are we from natural language hardware design

Kaiyan Chang, Ying Wang, Haimeng Ren, Mengdi Wang, Shengwen Liang, Yinhe Han, Huawei Li, Xiaowei Li

Main category: cs.AI

TL;DR: ChipGPT is an automated hardware design framework that uses large language models to generate Verilog programs from natural language specifications without retraining, featuring a four-stage process including prompt generation, program correction/optimization, design space collection, and optimal design selection.

Details

Motivation: To leverage LLMs' capabilities in assisting hardware engineers for more efficient logic design through natural language interaction and create an accessible, zero-code chip development flow.

Method: A scalable four-stage framework: 1) Generate prompts for LLMs to produce initial Verilog programs, 2) Output manager corrects and optimizes programs, 3) Collects programs into final design space, 4) Searches design space for optimal design under target metrics.

Result: ChipGPT demonstrates that LLMs can generate correct and complete hardware logic designs from natural language specifications, improving programmability, controllability, and providing broader design optimization space compared to prior work and native LLMs alone.

Conclusion: The framework successfully shows the potential of LLMs in automated hardware design, enabling natural language-driven chip development with enhanced efficiency and optimization capabilities.

Abstract: As large language models (LLMs) like ChatGPT exhibited unprecedented machine intelligence, it also shows great performance in assisting hardware engineers to realize higher-efficiency logic design via natural language interaction. To estimate the potential of the hardware design process assisted by LLMs, this work attempts to demonstrate an automated design environment that explores LLMs to generate hardware logic designs from natural language specifications. To realize a more accessible and efficient chip development flow, we present a scalable four-stage zero-code logic design framework based on LLMs without retraining or finetuning. At first, the demo, ChipGPT, begins by generating prompts for the LLM, which then produces initial Verilog programs. Second, an output manager corrects and optimizes these programs before collecting them into the final design space. Eventually, ChipGPT will search through this space to select the optimal design under the target metrics. The evaluation sheds some light on whether LLMs can generate correct and complete hardware logic designs described by natural language for some specifications. It is shown that ChipGPT improves programmability, and controllability, and shows broader design optimization space compared to prior work and native LLMs alone.

[705] Leveraging Twitter Data for Sentiment Analysis of Transit User Feedback: An NLP Framework

Adway Das, Abhishek Kumar Prajapati, Pengxiang Zhang, Mukund Srinath, Andisheh Ranjbari

Main category: cs.AI

TL;DR: A novel NLP framework using Twitter data to analyze transit user feedback, combining few-shot learning for tweet classification and lexicon-based sentiment analysis, validated on NYC subway system data.

Details

Motivation: Traditional transit surveys are time-consuming and costly, while social media platforms like Twitter offer abundant, real-time user feedback data that can be leveraged for understanding service issues.

Method: Two-step approach: 1) Few-shot learning for tweet classification into predefined categories (safety, reliability, maintenance), 2) Lexicon-based sentiment analysis to assess sentiment intensity and polarity (positive, negative, neutral).

Result: Framework accurately classified tweets into transit-related categories and measured sentiment intensities. Results were validated against manually labeled data and corroborated with agency-run customer surveys, showing effectiveness in identifying transit system pain points.

Conclusion: The proposed framework effectively gauges user feedback through inexpensive social media data, enabling better understanding of transit system issues and supporting targeted improvement planning without costly traditional surveys.

Abstract: Traditional methods of collecting user feedback through transit surveys are often time-consuming, resource intensive, and costly. In this paper, we propose a novel NLP-based framework that harnesses the vast, abundant, and inexpensive data available on social media platforms like Twitter to understand users’ perceptions of various service issues. Twitter, being a microblogging platform, hosts a wealth of real-time user-generated content that often includes valuable feedback and opinions on various products, services, and experiences. The proposed framework streamlines the process of gathering and analyzing user feedback without the need for costly and time-consuming user feedback surveys using two techniques. First, it utilizes few-shot learning for tweet classification within predefined categories, allowing effective identification of the issues described in tweets. It then employs a lexicon-based sentiment analysis model to assess the intensity and polarity of the tweet sentiments, distinguishing between positive, negative, and neutral tweets. The effectiveness of the framework was validated on a subset of manually labeled Twitter data and was applied to the NYC subway system as a case study. The framework accurately classifies tweets into predefined categories related to safety, reliability, and maintenance of the subway system and effectively measured sentiment intensities within each category. The general findings were corroborated through a comparison with an agency-run customer survey conducted in the same year. The findings highlight the effectiveness of the proposed framework in gauging user feedback through inexpensive social media data to understand the pain points of the transit system and plan for targeted improvements.

[706] DeAL: Decoding-time Alignment for Large Language Models

James Y. Huang, Sailik Sengupta, Daniele Bonadiman, Yi-An Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, Dan Roth

Main category: cs.AI

TL;DR: DeAL is a decoding-time alignment framework that allows users to customize reward functions and apply them during LLM decoding, addressing limitations of training-time alignment methods like RLHF.

Details

Motivation: Current alignment methods like RLHF have limitations: inability to incorporate multiple custom rewards, reliance on developer-defined static principles, and questionable reliability (e.g., susceptibility to jailbreaking).

Method: DeAL treats decoding as a heuristic-guided search process, enabling the use of diverse alignment objectives and reward functions during text generation rather than at training time.

Result: Experiments show DeAL can handle fine-grained trade-offs and improve adherence to alignment objectives including programmatic constraints (keywords, length) and abstract objectives (harmlessness, helpfulness).

Conclusion: DeAL is largely complementary to existing alignment strategies and can be effectively combined with RLHF and prompting techniques to achieve better overall alignment.

Abstract: Large Language Models (LLMs) are nowadays expected to generate content aligned with human preferences. Current work focuses on alignment at model training time, through techniques such as Reinforcement Learning with Human Feedback (RLHF). However, it is unclear if such methods are an effective choice to teach alignment objectives to the model. First, the inability to incorporate multiple, custom rewards and reliance on a model developer’s view of universal and static principles are key limitations. Second, the reliability of such approaches is also questionable (e.g. susceptibility to jailbreaking even after safety training). To address these issues, we propose DeAL, a framework that allows the user to customize reward functions and enables Decoding-time Alignment of LLMs (DeAL). At its core, we view decoding as a heuristic-guided search process and facilitate the use of a wide variety of alignment objectives. Our experiments with programmatic constraints such as keyword and length constraints, and abstract objectives such as harmlessness and helpfulness, show that we can DeAL with fine-grained trade-offs and improve adherence to alignment objectives. Lastly, we demonstrate that DeAL is largely complementary to existing alignment strategies, and can be effectively paired with RLHF and prompting techniques to achieve better alignment.

[707] GI-NAS: Boosting Gradient Inversion Attacks Through Adaptive Neural Architecture Search

Wenbo Yu, Hao Fang, Bin Chen, Xiaohang Sui, Chuan Chen, Hao Wu, Shu-Tao Xia, Ke Xu

Main category: cs.AI

TL;DR: GI-NAS uses Neural Architecture Search to adaptively find optimal network architectures for gradient inversion attacks in Federated Learning, achieving superior performance without requiring domain-specific pre-trained models.

Details

Motivation: Existing gradient inversion methods rely heavily on explicit prior knowledge like pre-trained models, which are often unavailable due to data heterogeneity in real-world FL systems. Fixed neural architectures limit the adaptive use of implicit architectural priors.

Method: Proposed Gradient Inversion via Neural Architecture Search (GI-NAS) that adaptively searches for optimal network architectures to capture implicit priors behind neural architectures for gradient inversion attacks.

Result: GI-NAS achieves superior attack performance compared to state-of-the-art methods, even under practical settings with high-resolution images, large batches, and advanced defense strategies.

Conclusion: This work exposes critical vulnerabilities in real-world federated learning by demonstrating high-fidelity reconstruction without domain-specific priors, forcing urgent reassessment of FL privacy safeguards.

Abstract: Gradient Inversion Attacks invert the transmitted gradients in Federated Learning (FL) systems to reconstruct the sensitive data of local clients and have raised considerable privacy concerns. A majority of gradient inversion methods rely heavily on explicit prior knowledge (e.g., a well pre-trained generative model), which is often unavailable in realistic scenarios. This is because real-world client data distributions are often highly heterogeneous, domain-specific, and unavailable to attackers, making it impractical for attackers to obtain perfectly matched pre-trained models, which inevitably suffer from fundamental distribution shifts relative to target private data. To alleviate this issue, researchers have proposed to leverage the implicit prior knowledge of an over-parameterized network. However, they only utilize a fixed neural architecture for all the attack settings. This would hinder the adaptive use of implicit architectural priors and consequently limit the generalizability. In this paper, we further exploit such implicit prior knowledge by proposing Gradient Inversion via Neural Architecture Search (GI-NAS), which adaptively searches the network and captures the implicit priors behind neural architectures. Extensive experiments verify that our proposed GI-NAS can achieve superior attack performance compared to state-of-the-art gradient inversion methods, even under more practical settings with high-resolution images, large-sized batches, and advanced defense strategies. To the best of our knowledge, we are the first to successfully introduce NAS to the gradient inversion community. We believe that this work exposes critical vulnerabilities in real-world federated learning by demonstrating high-fidelity reconstruction of sensitive data without requiring domain-specific priors, forcing urgent reassessment of FL privacy safeguards.

[708] A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models

Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao

Main category: cs.AI

TL;DR: This paper provides a comprehensive survey of mechanistic interpretability (MI) research for transformer-based language models, organized around specific research questions and tasks to help newcomers navigate the field.

Details

Motivation: There has been significant recent attention on mechanistic interpretability for understanding transformer language models, resulting in many novel insights but also new challenges. However, no comprehensive review exists to guide newcomers to this emerging field.

Method: The authors provide a task-centric taxonomy of MI research, organizing the field around specific research questions or tasks. They outline fundamental objects of study, techniques, evaluation methods, and key findings for each task in the taxonomy.

Result: The survey presents a roadmap for beginners to quickly identify impactful problems in mechanistic interpretability and leverage MI for their benefit, helping them navigate the field effectively.

Conclusion: The paper discusses current gaps in mechanistic interpretability research and suggests potential future directions for advancing the field.

Abstract: Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), resulting in many novel insights yet introducing new challenges. However, there has not been work that comprehensively reviews these insights and challenges, particularly as a guide for newcomers to this field. To fill this gap, we provide a comprehensive survey from a task-centric perspective, organizing the taxonomy of MI research around specific research questions or tasks. We outline the fundamental objects of study in MI, along with the techniques, evaluation methods, and key findings for each task in the taxonomy. In particular, we present a task-centric taxonomy as a roadmap for beginners to navigate the field by helping them quickly identify impactful problems in which they are most interested and leverage MI for their benefit. Finally, we discuss the current gaps in the field and suggest potential future directions for MI research.

[709] Human-inspired Episodic Memory for Infinite Context LLMs

Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou-Ammar, Jun Wang

Main category: cs.AI

TL;DR: EM-LLM integrates human episodic memory principles into LLMs to handle infinite context lengths without fine-tuning, using event segmentation and efficient retrieval.

Details

Motivation: LLMs struggle with long contexts, while human brain excels at organizing episodic experiences across vast temporal scales. This gap motivates integrating human memory mechanisms into LLMs.

Method: EM-LLM organizes token sequences into episodic events using Bayesian surprise and graph-theoretic boundary refinement. Retrieval uses two-stage memory process combining similarity-based and temporally contiguous access.

Result: Outperforms state-of-the-art InfLLM on LongBench and ∞-Bench benchmarks, surpasses RAG in various tasks, and handles 10 million tokens - computationally infeasible for full-context models.

Conclusion: EM-LLM offers a novel computational framework for exploring human memory mechanisms, with strong correlations between its event segmentation and human-perceived events.

Abstract: Large language models (LLMs) have shown remarkable capabilities, but still struggle with processing extensive contexts, limiting their ability to maintain coherence and accuracy over long sequences. In contrast, the human brain excels at organising and retrieving episodic experiences across vast temporal scales, spanning a lifetime. In this work, we introduce EM-LLM, a novel approach that integrates key aspects of human episodic memory and event cognition into LLMs with no fine-tuning, enabling them to handle practically infinite context lengths while maintaining computational efficiency. EM-LLM organises sequences of tokens into coherent episodic events using a combination of Bayesian surprise and graph-theoretic boundary refinement in an online fashion. When needed, these events are retrieved through a two-stage memory process, combining similarity-based and temporally contiguous retrieval for efficient, human-inspired access to relevant information. Experiments on the LongBench and $\infty$-Bench benchmarks demonstrate EM-LLM’s superior performance, consistently outperforming the state-of-the-art retrieval model InfLLM across various baseline LLMs. In addition, EM-LLM outperforms its popular counterpart, RAG, in a wide range of tasks, while requiring similar resources. Notably, EM-LLM’s performance even surpasses full-context models in most tasks, while successfully performing retrieval across 10 million tokens – a scale computationally infeasible for such models. Finally, our analysis reveals strong correlations between EM-LLM’s event segmentation and human-perceived events, suggesting parallels between this artificial system and its biological counterpart, thereby offering a novel computational framework for exploring human memory mechanisms.

[710] OmniBal: Towards Fast Instruction-Tuning for Vision-Language Models via Omniverse Computation Balance

Yongqiang Yao, Jingru Tan, Feizhao Zhang, Jiahao Hu, Yazhe Niu, Xin Jin, Bo Li, Pengfei Liu, Ruihao Gong, Dahua Lin, Ningyi Xu

Main category: cs.AI

TL;DR: The paper proposes an omniverse balanced training framework to address computational load imbalance in vision-language instruction-tuning models during large-scale 3D parallel training, achieving 1.8x speed-up.

Details

Motivation: Large-scale 3D parallel training on vision-language models leads to imbalanced computation load across devices due to inherent heterogeneity between vision and language parts in data distribution and model architecture.

Method: Rebalances computational load from three perspectives: data (grouping instances into balanced mini-batches), model (using search-based method for balanced partitioning), and memory (adaptive re-computation strategy for each partition).

Result: Achieves about 1.8x speed-up compared to InternVL-Chat training code, with effectiveness and generalizability validated across various models and datasets.

Conclusion: The proposed omniverse balanced training framework successfully addresses computational load imbalance in vision-language models, significantly improving training efficiency while maintaining effectiveness across different scenarios.

Abstract: Vision-language instruction-tuning models have recently achieved significant performance improvements. In this work, we discover that large-scale 3D parallel training on those models leads to an imbalanced computation load across different devices. The vision and language parts are inherently heterogeneous: their data distribution and model architecture differ significantly, which affects distributed training efficiency. To address this issue, we rebalance the computational load from data, model, and memory perspectives, achieving more balanced computation across devices. Specifically, for the data, instances are grouped into new balanced mini-batches within and across devices. A search-based method is employed for the model to achieve a more balanced partitioning. For memory optimization, we adaptively adjust the re-computation strategy for each partition to utilize the available memory fully. These three perspectives are not independent but are closely connected, forming an omniverse balanced training framework. Extensive experiments are conducted to validate the effectiveness of our method. Compared with the open-source training code of InternVL-Chat, training time is reduced greatly, achieving about 1.8$\times$ speed-up. Our method’s efficacy and generalizability are further validated across various models and datasets. Codes will be released at https://github.com/ModelTC/OmniBal.

[711] R-GAT: Cancer Document Classification Leveraging Graph-Based Residual Network for Scenarios with Limited Data

Elias Hossain, Tasfia Nuzhat, Shamsul Masum, Shahram Rahimi, Noorbakhsh Amiri Golilarz

Main category: cs.AI

TL;DR: R-GAT: A lightweight graph attention network for cancer abstract classification that matches transformer performance with much lower computational costs.

Details

Motivation: Address limitations in cancer informatics caused by scarce labeled data and high computational demands of transformer models.

Method: Residual Graph Attention Network (R-GAT) combining multi-head attention with residual connections to capture semantic dependencies in biomedical texts.

Result: Achieved competitive performance comparable to BioBERT and BioClinicalBERT on 1,875 PubMed abstracts, with significantly fewer computational resources.

Conclusion: Graph-based architectures offer reliable, resource-efficient alternatives to transformers for biomedical NLP tasks under limited-data conditions.

Abstract: Accurate classification of cancer-related biomedical abstracts is critical for advancing cancer informatics and supporting decision-making in healthcare research. Yet progress in this domain is often constrained by limited availability of labeled corpora and the high computational demands of transformer-based approaches. To address these challenges, we propose a Residual Graph Attention Network (R-GAT) that integrates multi-head attention with residual connections to capture semantic and relational dependencies in biomedical texts. Evaluated on a curated dataset of 1,875 PubMed abstracts spanning thyroid, colon, lung, and generic cancer topics, R-GAT achieves stable and competitive performance, comparable to transformer-based models such as BioBERT and BioClinicalBERT and strong classical baselines like Logistic Regression, while requiring significantly fewer computational resources. Ablation studies confirm the importance of attention and residual connections in ensuring robustness under limited-data conditions. To support reproducibility and facilitate future research, we also release the curated dataset. Together, these contributions demonstrate the value of lightweight graph-based architectures as reliable and resource-efficient alternatives to computationally intensive transformers in biomedical NLP.

[712] Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation

Zhiqing Cui, Jiahao Yuan, Hanqing Wang, Yanshu Li, Chenxu Du, Zhenglong Ding

Main category: cs.AI

TL;DR: Draw with Thought (DwT) is a training-free framework that converts static scientific diagrams into editable mxGraph XML code using Chain-of-Thought reasoning, enabling interpretable and controllable diagram reconstruction without model fine-tuning.

Details

Motivation: Scientific diagrams are often published as static raster images, losing symbolic semantics and limiting reuse. Existing MLLM methods lack semantic control and structural interpretability for complex diagrams.

Method: DwT uses a two-stage approach: Coarse-to-Fine Planning for perceptual structuring and semantic specification, and Structure-Aware Code Generation with format-guided refinement. It employs cognitively-grounded Chain-of-Thought reasoning to guide MLLMs.

Result: Extensive experiments across eight MLLMs show high-fidelity, semantically aligned, and structurally valid reconstructions. Human evaluations confirm strong alignment in accuracy and visual aesthetics.

Conclusion: DwT offers a scalable solution for converting static visuals into executable representations and advances machine understanding of scientific graphics through interpretable and controllable diagram reconstruction.

Abstract: Scientific diagrams are vital tools for communicating structured knowledge across disciplines. However, they are often published as static raster images, losing symbolic semantics and limiting reuse. While Multimodal Large Language Models (MLLMs) offer a pathway to bridging vision and structure, existing methods lack semantic control and structural interpretability, especially on complex diagrams. We propose Draw with Thought (DwT), a training-free framework that guides MLLMs to reconstruct diagrams into editable mxGraph XML code through cognitively-grounded Chain-of-Thought reasoning. DwT enables interpretable and controllable outputs without model fine-tuning by dividing the task into two stages: Coarse-to-Fine Planning, which handles perceptual structuring and semantic specification, and Structure-Aware Code Generation, enhanced by format-guided refinement. To support evaluation, we release Plot2XML, a benchmark of 247 real-world scientific diagrams with gold-standard XML annotations. Extensive experiments across eight MLLMs show that our approach yields high-fidelity, semantically aligned, and structurally valid reconstructions, with human evaluations confirming strong alignment in both accuracy and visual aesthetics, offering a scalable solution for converting static visuals into executable representations and advancing machine understanding of scientific graphics.

[713] Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang

Main category: cs.AI

TL;DR: Current RLVR methods don’t create fundamentally new reasoning abilities in LLMs - they just improve performance at small k values while base models perform better at large k. The reasoning capabilities remain bounded by the base model’s original capacity.

Details

Motivation: To critically examine whether RLVR actually enables LLMs to acquire novel reasoning abilities beyond their base models, as commonly believed.

Method: Systematic evaluation of RLVR-trained LLMs across various model families, RL algorithms, and reasoning benchmarks using pass@k at large k values, plus coverage and perplexity analyses.

Result: RLVR-trained models outperform base models at small k (e.g., k=1) but base models achieve higher pass@k scores at large k. Six popular RLVR algorithms perform similarly and remain far from optimal in leveraging base model potential.

Conclusion: Current RLVR methods haven’t realized RL’s potential to elicit truly novel reasoning abilities. Improved RL paradigms like continual scaling and multi-turn agent-environment interaction are needed.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model’s reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.

[714] CodeCrash: Exposing LLM Fragility to Misleading Natural Language in Code Reasoning

Man Ho Lam, Chaozheng Wang, Jen-tse Huang, Michael R. Lyu

Main category: cs.AI

TL;DR: CodeCrash is a stress-testing framework that evaluates LLMs’ code reasoning robustness under structural perturbations and misleading natural language contexts, revealing significant performance degradation and reasoning collapse issues.

Details

Motivation: LLMs show strong code capabilities but their robustness in code reasoning under perturbations remains underexplored, particularly regarding over-reliance on NL cues and reasoning reliability.

Method: Introduces CodeCrash with 1,279 questions from CruxEval and LiveCodeBench, systematically evaluating 17 LLMs under structural perturbations and misleading NL contexts, including analysis of Chain-of-Thought reasoning and Large Reasoning Models.

Result: Models show 23.2% average performance degradation in output prediction tasks due to over-reliance on NL cues. Even with Chain-of-Thought, models have 13.8% drop due to distractibility and rationalization. Large Reasoning Models improve robustness but can suffer from pathological self-reflection and reasoning collapse.

Conclusion: CodeCrash provides a rigorous benchmark for evaluating code reasoning robustness, revealing critical limitations in LLMs’ reasoning capabilities and guiding future development toward more reliable models.

Abstract: Large Language Models (LLMs) have recently demonstrated strong capabilities in code-related tasks, but their robustness in code reasoning under perturbations remains underexplored. We introduce CodeCrash, a stress-testing framework with 1,279 questions from CruxEval and LiveCodeBench, designed to evaluate reasoning reliability under structural perturbations and misleading natural language (NL) contexts. Through a systematic evaluation of 17 LLMs, we find that models often shortcut reasoning by over-relying on NL cues, leading to an average performance degradation of 23.2% in output prediction tasks. Even with Chain-of-Thought reasoning, models on average still have a 13.8% drop due to distractibility and rationalization, revealing a lack of critical reasoning capability to distinguish the actual code behaviors. While Large Reasoning Models with internal reasoning mechanisms improve robustness by fostering critical thinking, plausible yet incorrect hints can trigger pathological self-reflection, causing 2-3 times token consumption and even catastrophic cognitive dissonance in extreme cases for QwQ-32B. We refer to this phenomenon as Reasoning Collapse. CodeCrash provides a rigorous benchmark for evaluating robustness in code reasoning, guiding future research and development toward more reliable and resilient models.

[715] Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization

Jiaqi Wei, Hao Zhou, Xiang Zhang, Di Zhang, Zijie Qiu, Wei Wei, Jinzhe Li, Wanli Ouyang, Siqi Sun

Main category: cs.AI

TL;DR: AlignRAG addresses reasoning misalignment in RAG systems by introducing an iterative critique-driven framework that ensures LLM reasoning aligns with retrieved evidence, achieving state-of-the-art performance with autonomous refinement.

Details

Motivation: Standard RAG pipelines often fail to ensure model reasoning remains consistent with retrieved evidence, leading to factual inconsistencies and unsupported conclusions due to Reasoning Misalignment.

Method: Proposes AlignRAG framework with Critique-Driven Alignment (CDA), featuring a contrastive critique synthesis mechanism and a dedicated Critic Language Model (CLM) trained to distinguish evidence-aligned vs misaligned reasoning.

Result: 8B-parameter CLM improves performance by 12.1% over Self-Refine baseline on out-of-domain tasks and outperforms 72B-parameter CLM by 2.2%. AlignRAG-auto achieves state-of-the-art performance with dynamic refinement termination.

Conclusion: AlignRAG significantly improves reasoning fidelity, remains compatible with existing RAG architectures as plug-and-play module, and demonstrates strong robustness under both informative and noisy retrieval scenarios.

Abstract: Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). However, standard RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions. In this work, we reinterpret RAG as Retrieval-Augmented Reasoning and identify a central but underexplored problem: Reasoning Misalignment – the divergence between an LLM’s internal reasoning trajectory and the evidential constraints provided by retrieval. To address this issue, we propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA). We further introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations. At the heart of AlignRAG lies a contrastive critique synthesis mechanism that generates retrieval-sensitive critiques while mitigating self-bias. This mechanism trains a dedicated retrieval-augmented Critic Language Model (CLM) using labeled critiques that distinguish between evidence-aligned and misaligned reasoning. Empirical evaluations show that our approach significantly improves reasoning fidelity. Our 8B-parameter CLM improves performance over the Self-Refine baseline by 12.1% on out-of-domain tasks and outperforms a standard 72B-parameter CLM by 2.2%. Furthermore, AlignRAG-auto achieves this state-of-the-art performance while dynamically determining the optimal number of refinement steps, enhancing efficiency and usability. AlignRAG remains compatible with existing RAG architectures as a plug-and-play module and demonstrates strong robustness under both informative and noisy retrieval scenarios.

[716] Neurosymbolic Association Rule Mining from Tabular Data

Erkan Karabulut, Paul Groth, Victoria Degeler

Main category: cs.AI

TL;DR: Aerial+ is a neurosymbolic ARM method that uses an under-complete autoencoder to create neural representations of data, extracts rules from these representations, and produces concise, high-quality rule sets with full data coverage while reducing execution time.

Details

Motivation: High-dimensional datasets in Association Rule Mining often lead to rule explosion, increasing execution time and negatively impacting downstream task performance. Managing this rule explosion is a central challenge in ARM research.

Method: Aerial+ uses an under-complete autoencoder to create neural representations capturing feature associations, then extracts rules from these neural representations by exploiting the model’s reconstruction mechanism.

Result: Extensive evaluations on five datasets against seven baselines show Aerial+ achieves state-of-the-art results, learning more concise, high-quality rule sets with full data coverage. When integrated into rule-based interpretable ML models, it significantly reduces execution time while maintaining or improving accuracy.

Conclusion: Aerial+ effectively addresses the rule explosion problem in ARM by combining neural representations with symbolic rule extraction, producing efficient and high-quality rule sets that improve downstream task performance.

Abstract: Association Rule Mining (ARM) is the task of mining patterns among data features in the form of logical rules, with applications across a myriad of domains. However, high-dimensional datasets often result in an excessive number of rules, increasing execution time and negatively impacting downstream task performance. Managing this rule explosion remains a central challenge in ARM research. To address this, we introduce Aerial+, a novel neurosymbolic ARM method. Aerial+ leverages an under-complete autoencoder to create a neural representation of the data, capturing associations between features. It extracts rules from this neural representation by exploiting the model’s reconstruction mechanism. Extensive evaluations on five datasets against seven baselines demonstrate that Aerial+ achieves state-of-the-art results by learning more concise, high-quality rule sets with full data coverage. When integrated into rule-based interpretable machine learning models, Aerial+ significantly reduces execution time while maintaining or improving accuracy.

[717] mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules

Carl Edwards, Chi Han, Gawon Lee, Thao Nguyen, Sara Szymkuć, Chetan Kumar Prasad, Bowen Jin, Jiawei Han, Ying Diao, Ge Liu, Hao Peng, Bartosz A. Grzybowski, Martin D. Burke, Heng Ji

Main category: cs.AI

TL;DR: mCLM is a modular chemical-language model that tokenizes molecules at functional building block level rather than atoms, enabling better property prediction and automated synthesis compatibility while improving synthetic accessibility over existing methods.

Details

Motivation: Current LLMs for molecules represent them at atomic level, which limits their ability to propose synthesizable molecules with desired functions and makes them incompatible with automated synthesis approaches.

Method: Proposes mCLM - a bilingual language model that understands both natural language descriptions and molecular building blocks, tokenizing molecules at functional building block level similar to how text is tokenized into meaningful sub-words.

Result: mCLM with only 3B parameters outperforms 7 other generative AI methods including GPT-5 in synthetic accessibility. On 122 out-of-distribution medicines using automated synthesis-compatible building blocks, it achieves best property scores and synthetic accessibility. Can also reason on multiple functions and iteratively self-improve.

Conclusion: Tokenizing molecules at functional building block level enables more effective property prediction and inherently syncs with automated synthesis technology, representing a principled approach to molecule generation that front-loads synthesizability considerations.

Abstract: Despite their ability to understand chemical knowledge, large language models (LLMs) remain limited in their capacity to propose novel molecules with desired functions (e.g., drug-like properties). In addition, the molecules that LLMs propose can often be challenging to make, and are almost never compatible with automated synthesis approaches. To better enable the discovery of functional small molecules, LLMs need to learn a new molecular language that is more effective in predicting properties and inherently synced with automated synthesis technology. Current molecule LLMs are limited by representing molecules based on atoms. In this paper, we argue that just like tokenizing texts into meaning-bearing (sub-)word tokens instead of characters, molecules should be tokenized at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as effective building blocks for real-world automated laboratory synthesis. This motivates us to propose mCLM, a modular Chemical-Language Model that comprises a bilingual language model that understands both natural language descriptions of functions and molecular blocks. mCLM front-loads synthesizability considerations while improving the predicted functions of molecules in a principled manner. mCLM, with only 3B parameters, achieves improvements in synthetic accessibility relative to 7 other leading generative AI methods including GPT-5. When tested on 122 out-of-distribution medicines using only building blocks/tokens that are compatible with automated modular synthesis, mCLM outperforms all baselines in property scores and synthetic accessibility. mCLM can also reason on multiple functions and iteratively self-improve to rescue drug candidates that failed late in clinical trials (“fallen angels”).

[718] How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior

Zidi Xiong, Yuping Lin, Wenya Xie, Pengfei He, Zirui Liu, Jiliang Tang, Himabindu Lakkaraju, Zhen Xiang

Main category: cs.AI

TL;DR: Empirical study on how memory management (addition/deletion) affects LLM agent behavior, revealing experience-following property and challenges like error propagation and misaligned experience replay.

Details

Motivation: Memory is critical for LLM agents to store and retrieve past executions for improved task performance over time, but the impact of memory management choices on long-term performance is not well understood.

Method: Conducted quantitative analysis and controlled experiments focusing on memory addition and deletion operations, studying their impact on agent behavior and performance.

Result: Found LLM agents exhibit experience-following property where similar task inputs lead to similar outputs, but this causes error propagation and misaligned experience replay problems. Showed that regulating experience quality and using future task evaluations as quality labels improves performance.

Conclusion: Memory management significantly impacts LLM agent behavior and long-term performance. Proper experience quality regulation and using future evaluations as quality labels are crucial for designing robust memory systems.

Abstract: Memory is a critical component in large language model (LLM)-based agents, enabling them to store and retrieve past executions to improve task performance over time. In this paper, we conduct an empirical study on how memory management choices impact the LLM agents’ behavior, especially their long-term performance. Specifically, we focus on two fundamental memory management operations that are widely used by many agent frameworks-memory addition and deletion-to systematically study their impact on the agent behavior. Through our quantitative analysis, we find that LLM agents display an experience-following property: high similarity between a task input and the input in a retrieved memory record often results in highly similar agent outputs. Our analysis further reveals two significant challenges associated with this property: error propagation, where inaccuracies in past experiences compound and degrade future performance, and misaligned experience replay, where some seemingly correct executions can provide limited or even misleading value as experiences. Through controlled experiments, we demonstrate the importance of regulating experience quality within the memory bank and show that future task evaluations can serve as free quality labels for stored memory. Our findings offer insights into the behavioral dynamics of LLM agent memory systems and provide practical guidance for designing memory components that support robust, long-term agent performance.

[719] SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs’ Mathematical Problem Solving

Yujie Hou, Ting Zhang, Mei Wang, Xuetao Ma, Hua Huang

Main category: cs.AI

TL;DR: SMART is a new framework that decomposes mathematical problem-solving into four cognitive dimensions (Understanding, Reasoning, Arithmetic, Reflection & Refinement) to provide more comprehensive evaluation of LLMs beyond just final answers.

Details

Motivation: Current evaluation methods for LLMs on mathematical tasks are limited - they either focus only on final answers or reasoning processes, failing to assess the entire problem-solving procedure. There are concerns about whether LLM successes reflect genuine reasoning or just superficial pattern recognition.

Method: Introduces SMART: a Self-Generating and Self-Validating Multi-Dimensional Assessment Framework with SMART-Bench benchmark. It breaks problem-solving into four dimensions evaluated independently through tailored tasks for interpretable analysis.

Result: Applied SMART to 21 state-of-the-art LLMs (open- and closed-source), revealing significant discrepancies in abilities across different cognitive dimensions. Identified genuine weaknesses in current LLMs.

Conclusion: The framework motivates a new metric called All-Pass Score to better capture true problem-solving capabilities. Code and benchmarks will be released upon acceptance.

Abstract: Large Language Models (LLMs) have achieved remarkable results on a variety of mathematical benchmarks. However, concerns remain as to whether these successes reflect genuine reasoning or superficial pattern recognition. Common evaluation methods, which focus on the either the final answer or the reasoning process, fail to assess the entire problem-solving procedure. To address these limitations, we introduce SMART: a Self-Generating and Self-Validating Multi-Dimensional Assessment Framework, together with its corresponding benchmark, SMART-Bench. SMART decomposes the entire problem solving process into four distinct cognitive dimensions: Understanding, Reasoning, Arithmetic, and Reflection & Refinement. Each dimension is evaluated independently through tailored tasks, enabling interpretable and fine-grained analysis of LLM behavior. We apply SMART to 21 state-of-the-art open- and closed-source LLMs, uncovering significant discrepancies in their abilities across different dimensions. Our findings reveal genuine weaknesses in current LLMs and motivate a new metric, the All-Pass Score, to better capture true problem-solving capabilities. Code and benchmarks will be released upon acceptance.

[720] MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models

Xuanqi Gao, Siyi Xie, Juan Zhai, Shiqing Ma, Chao Shen

Main category: cs.AI

TL;DR: MCP-RADAR is the first comprehensive benchmark for evaluating LLM performance in the Model Context Protocol (MCP) framework, featuring 507 tasks across six domains with objective metrics for tool utilization capabilities.

Details

Motivation: Existing evaluation methods don't adequately assess LLM tool utilization capabilities under the emerging MCP framework for dynamic tool discovery and orchestration, despite its widespread industry adoption.

Method: Created a benchmark with 507 tasks spanning mathematical reasoning, web search, email, calendar, file management, and terminal operations. Uses both authentic MCP tools and high-fidelity simulations, measuring answer correctness and operational accuracy with objective metrics including computational resource efficiency and successful tool-invocation rounds.

Result: Evaluation of leading closed-source and open-source LLMs revealed distinct capability profiles and significant trade-off between accuracy and efficiency. Provides actionable insights for LLM developers and tool creators.

Conclusion: MCP-RADAR establishes a standardized methodology applicable to the broader LLM agent ecosystem, addressing the gap in evaluating tool utilization capabilities under the MCP framework.

Abstract: As Large Language Models (LLMs) evolve from passive text generators to active reasoning agents capable of interacting with external tools, the Model Context Protocol (MCP) has emerged as a key standardized framework for dynamic tool discovery and orchestration. Despite its widespread industry adoption, existing evaluation methods do not adequately assess tool utilization capabilities under this new paradigm. To address this gap, this paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate LLM performance within the MCP framework. MCP-RADAR features a challenging dataset of 507 tasks spanning six domains: mathematical reasoning, web search, email, calendar, file management, and terminal operations. It quantifies performance based on two primary criteria: answer correctness and operational accuracy. To closely emulate real-world usage, our evaluation employs both authentic MCP tools and high-fidelity simulations of official tools. Unlike traditional benchmarks that rely on subjective human evaluation or binary success metrics, MCP-RADAR adopts objective, quantifiable measurements across multiple task domains, including computational resource efficiency and the number of successful tool-invocation rounds. Our evaluation of leading closed-source and open-source LLMs reveals distinct capability profiles and highlights a significant trade-off between accuracy and efficiency. Our findings provide actionable insights for both LLM developers and tool creators, establishing a standardized methodology applicable to the broader LLM agent ecosystem. All implementations, configurations, and datasets are publicly available at https://anonymous.4open.science/r/MCPRadar-B143.

[721] TCP: a Benchmark for Temporal Constraint-Based Planning

Zifeng Ding, Sikuan Yan, Zhangdie Yuan, Xianglong Hu, Fangru Lin, Andreas Vlachos

Main category: cs.AI

TL;DR: The paper introduces TCP benchmark to evaluate LLMs’ joint temporal reasoning and planning abilities through naturalistic dialogues with complex temporal constraints.

Details

Motivation: Existing benchmarks evaluate temporal reasoning and planning in isolation with limited complexity, creating a gap for assessing these capabilities jointly in realistic scenarios.

Method: Created TCP benchmark by generating abstract problem prototypes, pairing them with realistic scenarios from various domains, enriching into dialogues using LLM, and performing human quality checks.

Result: Even state-of-the-art LLMs struggle with TCP, revealing limitations in temporal constraint-based planning abilities.

Conclusion: TCP highlights LLMs’ limitations in temporal planning and provides a benchmark to inspire future research in this area.

Abstract: Temporal reasoning and planning are essential capabilities for large language models (LLMs), yet most existing benchmarks evaluate them in isolation and under limited forms of complexity. To address this gap, we introduce the Temporal Constraint-based Planning (TCP) benchmark that jointly assesses both capabilities. Each instance in TCP features a naturalistic dialogue around a collaborative project, where diverse and interdependent temporal constraints are explicitly or implicitly expressed, and models must infer an optimal schedule that satisfies all constraints. To construct TCP, we generate abstract problem prototypes that are then paired with realistic scenarios from various domains and enriched into dialogues using an LLM. A human quality check is performed on a sampled subset to confirm the reliability of our benchmark. We evaluate state-of-the-art LLMs and find that even the strongest models may struggle with TCP, highlighting its difficulty and revealing limitations in LLMs’ temporal constraint-based planning abilities. We analyze underlying failure cases, open source our benchmark, and hope our findings can inspire future research.

[722] Let’s Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM’s Math Capability

Ruida Wang, Yuxin Li, Yi R. Fung, Tong Zhang

Main category: cs.AI

TL;DR: NFL-HR is a framework that integrates Formal Language reasoning into Natural Language math problem-solving by aligning QA problems as existence theorems, enabling concurrent processing, and extracting answers through LLMs.

Details

Motivation: RL methods struggle to add new capabilities to base models, highlighting the need to effectively integrate Formal Language knowledge into Natural Language math reasoning despite structural disparities.

Method: Proposes NL-FL Problem Alignment to reformulate QA problems as existence theorems, Mixed Problem Input for concurrent processing, and LLM-based Answer Extraction to bridge output format gaps.

Result: Achieves 89.80% accuracy on MATH-500 and 84.34% on AMC benchmarks, surpassing NL baseline by 4.60% and 4.82% respectively, solving problems that NL baseline cannot even with more trials.

Conclusion: The NFL-HR framework successfully bridges NL-FL reasoning gaps and significantly enhances mathematical reasoning capabilities beyond what pure RL methods can achieve.

Abstract: Enhancing the mathematical reasoning capabilities of LLMs has garnered significant attention in both the mathematical and computer science communities. Recent works have made substantial progress in both Natural Language (NL) reasoning and Formal Language (FL) reasoning by leveraging the potential of pure Reinforcement Learning (RL) methods on base models. However, RL approaches struggle to impart new capabilities not presented in the base model, highlighting the need to integrate more knowledge like FL into NL math reasoning effectively. Yet, this integration is challenging due to inherent disparities in problem structure and reasoning format between NL and FL. To address these challenges, we introduce NL-FL HybridReasoning (NFL-HR), an end-to-end framework designed to incorporate the FL expert into NL math problem-solving. To bridge the NL and FL input format gap, we propose the NL-FL Problem Alignment method, which reformulates the Question-Answering (QA) problems in NL as existence theorems in FL. Subsequently, the Mixed Problem Input technique we provide enables the FL reasoner to handle both QA and existence problems concurrently. Lastly, we mitigate the NL and FL output format gap in reasoning through an LLM-based Answer Extraction mechanism. Comprehensive experiments demonstrate that the NFL-HR framework achieves 89.80% and 84.34% accuracy rates on the MATH-500 and the AMC benchmarks, surpassing the NL baseline by 4.60% and 4.82%, respectively. Notably, some problems resolved by our framework remain unsolved by the NL baseline model even under a larger number of trials.

[723] SMELLNET: A Large-scale Dataset for Real-world Smell Recognition

Dewei Feng, Carol Li, Wei Dai, Paul Pu Liang

Main category: cs.AI

TL;DR: SmellNet is the first large-scale database for digitizing smells using gas sensors, containing 828K data points across 50 substances. ScentFormer, a Transformer-based model, achieves 58.5% accuracy on smell classification and 50.2% on mixture prediction, enabling real-world olfactory AI applications.

Details

Motivation: There are no large-scale benchmarks for training AI systems to smell in the real world, despite the profound potential applications in allergen detection, manufacturing monitoring, and health sensing.

Method: Used small gas and chemical sensors to create SmellNet database, then developed ScentFormer - a Transformer-based architecture combining temporal differencing and sliding-window augmentation for smell data analysis.

Result: ScentFormer achieves 58.5% Top-1 accuracy on SmellNet-Base classification and 50.2% Top-1@0.1 on SmellNet-Mixture distribution prediction, demonstrating effective generalization across conditions and capture of transient chemical dynamics.

Conclusion: SmellNet and ScentFormer establish foundational tools for real-world olfactory AI applications across healthcare, food industry, environmental monitoring, manufacturing, and entertainment through temporal modeling of smell data.

Abstract: The ability of AI to sense and identify various substances based on their smell alone can have profound impacts on allergen detection (e.g., smelling gluten or peanuts in a cake), monitoring the manufacturing process, and sensing hormones that indicate emotional states, stress levels, and diseases. Despite these broad impacts, there are virtually no large-scale benchmarks, and therefore little progress, for training and evaluating AI systems’ ability to smell in the real world. In this paper, we use small gas and chemical sensors to create SmellNet, the first large-scale database that digitizes a diverse range of smells in the natural world. SmellNet contains about 828,000 data points across 50 substances, spanning nuts, spices, herbs, fruits, and vegetables, and 43 mixtures among them, with 68 hours of data collected. Using SmellNet, we developed ScentFormer, a Transformer-based architecture combining temporal differencing and sliding-window augmentation for smell data. For the SmellNet-Base classification task, ScentFormer achieves 58.5% Top-1 accuracy, and for the SmellNet-Mixture distribution prediction task, ScentFormer achieves 50.2% Top-1@0.1 on the test-seen split. ScentFormer’s ability to generalize across conditions and capture transient chemical dynamics demonstrates the promise of temporal modeling in olfactory AI. SmellNet and ScentFormer lay the groundwork for real-world olfactory applications across healthcare, food and beverage, environmental monitoring, manufacturing, and entertainment.

[724] Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents

Xiao Yu, Baolin Peng, Ruize Xu, Michel Galley, Hao Cheng, Suman Nath, Jianfeng Gao, Zhou Yu

Main category: cs.AI

TL;DR: Dyna-Think is a thinking framework that integrates planning with world models, reasoning, and acting to enhance AI agent performance, achieving similar performance to DeepSeek-R1 while using 2x fewer tokens.

Details

Motivation: Current LLMs like DeepSeek-R1 show impressive reasoning capabilities but it's unclear what behaviors are effective for long-horizon AI agent tasks. The research aims to understand and enhance agent thinking processes.

Method: Proposed Dyna-Think framework with two training methods: DIT (imitation learning) reconstructs R1’s thinking process focusing on world model simulation, and DDT (two-stage training) first improves world modeling via state prediction/critique generation, then improves action policy.

Result: Dyna-Think improves agent performance on OSWorld and WindowsAgentArena, achieving similar best-of-n performance to R1 while generating 2x less tokens. Critique generation for world model training is effective, and better performance correlates with better world modeling abilities.

Conclusion: Integrating world model simulation into AI agents is a promising direction to enhance reasoning, planning, and acting capabilities.

Abstract: Recent progress in reasoning with large language models (LLMs), such as DeepSeek-R1, demonstrates impressive capabilities in domains like mathematics and coding, by exhibiting complex cognitive behaviors such as verification, goal decomposition, and self-reflection. However, it is unclear what behavior is effective and what behavior is missing for long-horizon AI agents tasks. In this work, we propose Dyna-Think, a thinking framework that integrates planning with an internal world model with reasoning and acting to enhance AI agent performance. To enable Dyna-Think, we propose Dyna-Think Imitation Learning (DIT) and Dyna-Think Dyna Training (DDT). To initialize a policy with Dyna-Think, DIT reconstructs the thinking process of R1 to focus on performing world model simulation relevant to the proposed (and planned) action, and trains the policy using this reconstructed data. To enhance Dyna-Think, DDT uses a two-stage training process to first improve the agent’s world modeling ability via objectives such as state prediction or critique generation, and then improve the agent’s action via policy training. We evaluate our methods on OSWorld and WindowsAgentArena, and demonstrate that Dyna-Think improves the agent’s in-domain and out-of-domain performance, achieving similar best-of-n performance compared to R1 while generating 2x less tokens on average. Our extensive empirical studies reveal that 1) using critique generation for world model training is effective to improve policy performance; and 2) AI agents with better performance correlate with better world modeling abilities. We believe our results suggest a promising research direction to integrate world model simulation into AI agents to enhance their reasoning, planning, and acting capabilities.

[725] Agents of Change: Self-Evolving LLM Agents for Strategic Planning

Nikolas Belle, Dakota Barnes, Alfonso Amayuelas, Ivan Bercovich, Xin Eric Wang, William Wang

Main category: cs.AI

TL;DR: HexMachina enables LLM agents to maintain long-term strategic consistency in complex games like Settlers of Catan by separating environment discovery from strategy improvement through continual learning.

Details

Motivation: Address the long-horizon gap in LLM agents where prompt-centric approaches lose strategic consistency due to context window saturation and inability to sustain coherent strategies in adversarial, stochastic environments.

Method: HexMachina uses continual learning with two components: environment discovery (inducing adapter layer without documentation) and strategy improvement (evolving compiled player through code refinement and simulation), preserving executable artifacts.

Result: Outperforms strongest human-crafted baseline (AlphaBeta) with 54% win rate in Catanatron experiments, surpassing prompt-driven and no-discovery baselines. Ablations confirm isolated strategy learning improves performance.

Conclusion: Artifact-centric continual learning transforms LLMs from brittle stepwise deciders into stable strategy designers, advancing long-horizon autonomy in complex environments.

Abstract: We address the long-horizon gap in large language model (LLM) agents by enabling them to sustain coherent strategies in adversarial, stochastic environments. Settlers of Catan provides a challenging benchmark: success depends on balancing short- and long-term goals amid randomness, trading, expansion, and blocking. Prompt-centric LLM agents (e.g., ReAct, Reflexion) must re-interpret large, evolving game states each turn, quickly saturating context windows and losing strategic consistency. We propose HexMachina, a continual learning multi-agent system that separates environment discovery (inducing an adapter layer without documentation) from strategy improvement (evolving a compiled player through code refinement and simulation). This design preserves executable artifacts, allowing the LLM to focus on high-level strategy rather than per-turn reasoning. In controlled Catanatron experiments, HexMachina learns from scratch and evolves players that outperform the strongest human-crafted baseline (AlphaBeta), achieving a 54% win rate and surpassing prompt-driven and no-discovery baselines. Ablations confirm that isolating pure strategy learning improves performance. Overall, artifact-centric continual learning transforms LLMs from brittle stepwise deciders into stable strategy designers, advancing long-horizon autonomy.

[726] Wide-Horizon Thinking and Simulation-Based Evaluation for Real-World LLM Planning with Multifaceted Constraints

Dongjie Yang, Chengqiang Lu, Qimeng Wang, Xinbei Ma, Yan Gao, Yao Hu, Hai Zhao

Main category: cs.AI

TL;DR: This paper introduces MAoP (Multiple Aspects of Planning) to enable LLMs with “wide-horizon thinking” for complex planning problems with multifaceted constraints, and proposes Travel-Sim benchmark for realistic evaluation.

Details

Motivation: Real-world planning requires synthesizing parallel and potentially conflicting information and constraints, but existing LLM methods with long-horizon thinking struggle with handling multifaceted constraints, leading to suboptimal solutions.

Method: MAoP uses a strategist to conduct pre-planning from various aspects and provide planning blueprints for planners, enabling inference-time scalability by scaling aspects to consider various constraints. Also introduces Travel-Sim benchmark for evaluation.

Result: The approach enables LLMs to better handle complex planning scenarios with multiple constraints through wide-horizon thinking rather than just sequential reasoning.

Conclusion: This work advances LLM capabilities in complex planning and offers novel insights for evaluating sophisticated scenarios through simulation-based benchmarks.

Abstract: Unlike reasoning, which often entails a deep sequence of deductive steps, complex real-world planning is characterized by the need to synthesize a broad spectrum of parallel and potentially conflicting information and constraints. For example, in travel planning scenarios, it requires the integration of diverse real-world information and user preferences. While LLMs show promise, existing methods with long-horizon thinking struggle with handling multifaceted constraints, leading to suboptimal solutions. Motivated by the challenges of real-world travel planning, this paper introduces the Multiple Aspects of Planning (MAoP), empowering LLMs with “wide-horizon thinking” to solve planning problems with multifaceted constraints. Instead of direct planning, MAoP leverages the strategist to conduct pre-planning from various aspects and provide the planning blueprint for planners, enabling strong inference-time scalability by scaling aspects to consider various constraints. In addition, existing benchmarks for multi-constraint planning are flawed because they assess constraints in isolation, ignoring causal dependencies within the constraints, e.g, travel planning, where past activities dictate future itinerary. To address this, we propose Travel-Sim, an agent-based benchmark assessing plans via real-world simulation, thereby inherently resolving these causal dependencies. This paper advances LLM capabilities in complex planning and offers novel insights for evaluating sophisticated scenarios through simulation.

[727] Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills

Changsheng Wang, Chongyu Fan, Yihua Zhang, Jinghan Jia, Dennis Wei, Parikshit Ram, Nathalie Baracaldo, Sijia Liu

Main category: cs.AI

TL;DR: This paper introduces the first systematic study of machine unlearning for large reasoning models (LRMs), showing conventional unlearning methods fail to remove sensitive information from intermediate reasoning steps, and proposes a novel method called R²MU that effectively suppresses sensitive reasoning traces while preserving model performance.

Details

Motivation: As large reasoning models with chain-of-thought capabilities advance, they introduce new safety risks where sensitive information can persist in intermediate reasoning steps even after conventional unlearning, creating a need for reasoning-aware unlearning methods.

Method: The authors propose Reasoning-aware Representation Misdirection for Unlearning (R²MU), which extends conventional unlearning by specifically targeting and suppressing sensitive information within chain-of-thought reasoning trajectories while maintaining the model’s overall reasoning ability.

Result: Experiments on state-of-the-art models (DeepSeek-R1-Distill-LLaMA-8B and DeepSeek-R1-Distill-Qwen-14B) show R²MU significantly reduces sensitive information leakage in reasoning traces and achieves strong performance across both safety and reasoning benchmarks.

Conclusion: R²MU effectively addresses the limitations of conventional unlearning methods for LRMs by providing reasoning-aware unlearning that removes sensitive information from both final answers and intermediate reasoning steps while preserving model reasoning capabilities.

Abstract: Recent advances in large reasoning models (LRMs) have enabled strong chain-of-thought (CoT) generation through test-time computation. While these multi-step reasoning capabilities represent a major milestone in language model performance, they also introduce new safety risks. In this work, we present the first systematic study to revisit the problem of machine unlearning in the context of LRMs. Machine unlearning refers to the process of removing the influence of sensitive, harmful, or undesired data or knowledge from a trained model without full retraining. We show that conventional unlearning algorithms, originally designed for non-reasoning models, are inadequate for LRMs. In particular, even when final answers are successfully erased, sensitive information often persists within the intermediate reasoning steps, i.e., CoT trajectories. To address this challenge, we extend conventional unlearning and propose Reasoning-aware Representation Misdirection for Unlearning ($R^2MU$), a novel method that effectively suppresses sensitive reasoning traces and prevents the generation of associated final answers, while preserving the model’s reasoning ability. Our experiments demonstrate that $R^2MU$ significantly reduces sensitive information leakage within reasoning traces and achieves strong performance across both safety and reasoning benchmarks, evaluated on state-of-the-art models such as DeepSeek-R1-Distill-LLaMA-8B and DeepSeek-R1-Distill-Qwen-14B.

[728] A Community-driven vision for a new Knowledge Resource for AI

Vinay K Chaudhri, Chaitan Baru, Brandon Bennett, Mehul Bhatt, Darion Cassel, Anthony G Cohn, Rina Dechter, Esra Erdem, Dave Ferrucci, Ken Forbus, Gregory Gelfond, Michael Genesereth, Andrew S. Gordon, Benjamin Grosof, Gopal Gupta, Jim Hendler, Sharat Israni, Tyler R. Josephson, Patrick Kyllonen, Yuliya Lierler, Vladimir Lifschitz, Clifton McFate, Hande K. McGinty, Leora Morgenstern, Alessandro Oltramari, Praveen Paritosh, Dan Roth, Blake Shepard, Cogan Shimzu, Denny Vrandečić, Mark Whiting, Michael Witbrock

Main category: cs.AI

TL;DR: The paper advocates for creating a new open knowledge infrastructure in AI to address current knowledge gaps, building on modern advances in knowledge representation and reasoning.

Details

Motivation: Current AI systems face knowledge deficiencies - LLMs have knowledge gaps, robotic planning lacks world knowledge, and fact verification relies on human expertise. Existing knowledge resources are insufficient for general-purpose AI needs.

Method: Proposes building an open engineering framework with knowledge modules, conventions, and social structures adopted by contributors. Based on findings from an AAAI workshop with 50+ researchers.

Result: Synthesizes community-driven vision for new knowledge infrastructure that leverages contemporary advances in knowledge representation and reasoning.

Conclusion: A comprehensive, multi-purpose knowledge resource is critically needed in AI, and an open framework approach with community adoption shows promise for effective knowledge exploitation in practical applications.

Abstract: The long-standing goal of creating a comprehensive, multi-purpose knowledge resource, reminiscent of the 1984 Cyc project, still persists in AI. Despite the success of knowledge resources like WordNet, ConceptNet, Wolfram|Alpha and other commercial knowledge graphs, verifiable, general-purpose widely available sources of knowledge remain a critical deficiency in AI infrastructure. Large language models struggle due to knowledge gaps; robotic planning lacks necessary world knowledge; and the detection of factually false information relies heavily on human expertise. What kind of knowledge resource is most needed in AI today? How can modern technology shape its development and evaluation? A recent AAAI workshop gathered over 50 researchers to explore these questions. This paper synthesizes our findings and outlines a community-driven vision for a new knowledge infrastructure. In addition to leveraging contemporary advances in knowledge representation and reasoning, one promising idea is to build an open engineering framework to exploit knowledge modules effectively within the context of practical applications. Such a framework should include sets of conventions and social structures that are adopted by contributors.

[729] Beyond Parameters: Exploring Virtual Logic Depth for Scaling Laws

Ruike Zhu, Hanwen Zhang, Kevin Li, Tianyu Shi, Yiqun Duan, Chi Wang, Tianyi Zhou, Arindam Banerjee, Zengyi Qin

Main category: cs.AI

TL;DR: This paper introduces Virtual Logical Depth (VLD) as a fourth scaling dimension for LLMs that reuses weights to increase effective algorithmic depth without adding parameters, showing it improves reasoning ability independently of model size.

Details

Motivation: To explore a new dimension for scaling LLMs beyond depth, width, and parameter count by investigating how weight reuse can increase effective depth and potentially decouple reasoning performance from model size.

Method: Proposes Virtual Logical Depth (VLD) which reuses weights during training and inference to alter the internal computation graph, increasing effective algorithmic depth without changing parameter count. Conducts controlled experiments across different architectures and reuse schedules.

Result: VLD substantially improves reasoning ability without adding parameters, decoupling reasoning from model size. Knowledge capacity remains nearly unchanged at fixed parameter count but still scales with parameters across models. Reasoning gains persist across architectures and reuse schedules.

Conclusion: VLD captures a general scaling behavior that provides a new scaling path beyond token-wise methods, suggesting superintelligence might be achievable through parameter reuse and increased logical depth rather than ever-larger models.

Abstract: Scaling large language models typically involves three dimensions: depth, width, and parameter count. In this work, we explore a fourth dimension, \textbf{virtual logical depth} (VLD), which increases effective algorithmic depth without changing parameter count by reusing weights. While parameter reuse is not new, its role in scaling has been underexplored. Unlike recent test-time methods that scale token-wise, VLD alters the internal computation graph during training and inference. Through controlled experiments, we obtain three key insights. (1) \textit{Knowledge capacity vs. parameters}: at fixed parameter count, VLD leaves knowledge capacity nearly unchanged, while across models capacity still scales with parameters. (2) \textit{Reasoning vs. reuse}: properly implemented VLD substantially improves reasoning ability \emph{without} more parameters, decoupling reasoning from size. This suggests a new scaling path beyond token-wise test-time methods. (3) \textit{Robustness and generality}: reasoning gains persist across architectures and reuse schedules, showing VLD captures a general scaling behavior. These results provide insight into future scaling strategies and raise a deeper question: does superintelligence require ever-larger models, or can it be achieved by reusing parameters and increasing logical depth? We argue many unknown dynamics in scaling remain to be explored. Code is available at https://anonymous.4open.science/r/virtual_logical_depth-8024/.

[730] Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?

Yun Qu, Qi Wang, Yixiu Mao, Vincent Tao Hu, Björn Ommer, Xiangyang Ji

Main category: cs.AI

TL;DR: MoPPS is a Bayesian framework that predicts prompt difficulty without costly LLM interactions, accelerating RL finetuning of LLMs by reducing the need for frequent prompt evaluations.

Details

Motivation: RL finetuning of LLMs requires numerous iterations with high computational costs due to frequent prompt evaluations and policy updates. Existing methods still incur substantial overhead from LLM inference calls.

Method: MoPPS models prompt success rates as latent variables, performs streaming Bayesian inference, and uses posterior sampling in a multi-armed bandit framework to enable sample-efficient prompt selection without requiring LLM interactions.

Result: Extensive experiments across mathematics, planning, and vision-based geometry tasks show MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced LLM rollouts.

Conclusion: MoPPS provides an efficient alternative to direct evaluate-then-select schemes, enabling faster RL finetuning of LLMs through Bayesian risk-predictive prompt selection.

Abstract: Recent advances have witnessed the effectiveness of reinforcement learning (RL) finetuning in enhancing the reasoning capabilities of large language models (LLMs). The optimization process often requires numerous iterations to achieve satisfactory performance, resulting in high computational costs due to the need for frequent prompt evaluations under intensive LLM interactions and repeated policy updates. Appropriate online prompt selection methods reduce iteration steps by prioritizing informative prompts during training, while the pipeline’s reliance on exhaustive prompt evaluation and subset selection for optimization still incurs substantial computational overhead due to frequent LLM inference calls. Distinguished from these direct evaluate-then-select schemes, this work investigates iterative approximate evaluation for arbitrary prompts and introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework that online estimates prompt difficulty without requiring costly LLM interactions. Technically, MoPPS models each prompt’s success rate as a latent variable, performs streaming Bayesian inference, and employs posterior sampling in a constructed multi-armed bandit machine, enabling sample efficient and adaptive prompt selection. Extensive experiments across mathematics, planning, and vision-based geometry tasks show that MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced LLM rollouts. Our code is available at https://github.com/thu-rllab/MoPPS.

[731] Multi-Functional RIS-Enabled in SAGIN for IoT: A Hybrid Deep Reinforcement Learning Approach with Compressed Twin-Models

Li-Hsiang Shen, Jyun-Jhe Huang

Main category: cs.AI

TL;DR: A SAGIN-IoT architecture using multi-functional RIS to simultaneously reflect, amplify, and harvest energy, addressing satellite energy shortages and optimizing energy efficiency through a novel CHIMERA deep reinforcement learning framework.

Details

Motivation: To address energy shortages in LEO satellites operating in shadowed regions and maximize energy efficiency for IoT devices in space-air-ground integrated networks, while accounting for both communication and computing energy consumption.

Method: Proposed CHIMERA framework integrating semantic state-action compression and parametrized sharing under hybrid reinforcement learning to optimize MF-RIS parameters (amplification, phase-shifts, energy harvesting ratio, element selection) and SAGIN parameters (beamforming, HAPS deployment, device association, computing capability).

Result: The CHIMERA scheme substantially outperforms conventional benchmarks including fixed-configuration MF-RIS, traditional RIS, no-RIS cases, and centralized/multi-agent DRL baselines in terms of highest energy efficiency.

Conclusion: The SAGIN-MF-RIS architecture achieves superior energy efficiency performance due to complementary coverage, offering significant advantages over standalone satellite, aerial, or ground-only deployments in IoT networks.

Abstract: A space-air-ground integrated network (SAGIN) for Internet of Things (IoT) network architecture is investigated, empowered by multi-functional reconfigurable intelligent surfaces (MF-RIS) capable of simultaneously reflecting, amplifying, and harvesting wireless energy. The MF-RIS plays a pivotal role in addressing the energy shortages of low-Earth orbit (LEO) satellites operating in the shadowed regions, while accounting for both communication and computing energy consumption across the SAGIN nodes. To maximize the long-term energy efficiency (EE) of IoT devices, we formulate a joint optimization problem over the MF-RIS parameters, including signal amplification, phase-shifts, energy harvesting ratio, and active element selection as well as the SAGIN parameters of beamforming vectors, high-altitude platform station (HAPS) deployment, IoT device association, and computing capability. The formulated problem is highly non-convex and non-linear and contains mixed discrete-continuous parameters. To tackle this, we conceive a compressed hybrid twin-model enhanced multi-agent deep reinforcement learning (CHIMERA) framework, which integrates semantic state-action compression and parametrized sharing under hybrid reinforcement learning to efficiently explore suitable complex actions. The simulation results have demonstrated that the proposed CHIMERA scheme substantially outperforms the conventional benchmarks, including fixed-configuration or non-harvesting MF-RIS, traditional RIS, and no-RIS cases, as well as centralized and multi-agent deep reinforcement learning baselines in terms of the highest EE. Moreover, the proposed SAGIN-MF-RIS architecture in IoT network achieves superior EE performance due to its complementary coverage, offering notable advantages over either standalone satellite, aerial, or ground-only deployments.

[732] CoRGI: Verified Chain-of-Thought Reasoning with Post-hoc Visual Grounding

Shixin Yi, Lin Shang

Main category: cs.AI

TL;DR: CoRGI is a framework that improves multimodal reasoning reliability by verifying chain-of-thought outputs through visual grounding and filtering unsupported claims.

Details

Motivation: Vision-language models often suffer from hallucinations by generating explanations after only superficial image inspection, leading to unreliable reasoning.

Method: Decomposes VLM-generated rationales into step-wise statements, grounds each step in visual evidence, and filters/corrects unsupported claims before producing final answers.

Result: Consistently improves answer accuracy and explanation faithfulness across five benchmarks (VCR, ScienceQA, MMMU, MathVista, HallusionBench) with multiple VLM backbones including Qwen-2.5VL, LLaVA-1.6, and Gemma3-12B.

Conclusion: Post-hoc visual grounding is a promising direction for building more trustworthy and transparent multimodal reasoning systems by reducing hallucinations and strengthening interpretability.

Abstract: Multimodal reasoning with vision-language models (VLMs) often suffers from hallucinations, as models tend to generate explanations after only a superficial inspection of the image. We present \textbf{CoRGI}(\textbf{C}hain \textbf{o}f \textbf{R}easoning with \textbf{G}rounded \textbf{I}nsights), a framework that enhances reasoning reliability through post-hoc verification of chain-of-thought outputs. Given a VLM-generated rationale, CoRGI decomposes it into step-wise statements, grounds each step in visual evidence, and filters or corrects unsupported claims before producing the final answer. Experiments on five challenging benchmark-VCR, ScienceQA, MMMU, MathVista, and HallusionBenc-demonstrate that CoRGI consistently improves both answer accuracy and explanation faithfulness across multiple VLM backbones, including Qwen-2.5VL, LLaVA-1.6, and Gemma3-12B. Beyond quantitative gains, qualitative analyses further illustrate how the verification process reduces hallucination and strengthens interpretability, suggesting that post-hoc visual grounding is a promising direction for building more trustworthy and transparent multimodal reasoning systems.

[733] Mediator-Guided Multi-Agent Collaboration among Open-Source Models for Medical Decision-Making

Kaitao Chen, Mianxin Liu, Daoming Zong, Chaoyue Ding, Shaohao Rui, Yankai Jiang, Mu Zhou, Xiaosong Wang

Main category: cs.AI

TL;DR: MedOrch is a mediator-guided multi-agent framework that enables multiple vision-language models to collaborate on medical multimodal decision-making through an LLM-based mediator, achieving superior performance without training.

Details

Motivation: Existing multi-agent systems focus on language-only tasks and struggle with multimodal scenarios. VLMs have limitations in instruction following and self-reflection compared to LLMs, which constrains their ability in cooperative workflows for complex medical decision-making.

Method: Propose MedOrch framework with an LLM-based mediator agent that enables multiple VLM-based expert agents to exchange and reflect on their outputs. Uses heterogeneous open-source general-purpose and domain-specific VLMs instead of costly GPT-series models.

Result: Collaboration among distinct VLM-based agents surpasses individual agent capabilities. Validated on five medical vision question answering benchmarks, demonstrating superior collaboration performance without model training.

Conclusion: Mediator-guided multi-agent collaboration is valuable for advancing medical multimodal intelligence, showing that heterogeneous model collaboration can outperform individual models in complex decision-making tasks.

Abstract: Complex medical decision-making involves cooperative workflows operated by different clinicians. Designing AI multi-agent systems can expedite and augment human-level clinical decision-making. Existing multi-agent researches primarily focus on language-only tasks, yet their extension to multimodal scenarios remains challenging. A blind combination of diverse vision-language models (VLMs) can amplify an erroneous outcome interpretation. VLMs in general are less capable in instruction following and importantly self-reflection, compared to large language models (LLMs) of comparable sizes. This disparity largely constrains VLMs’ ability in cooperative workflows. In this study, we propose MedOrch, a mediator-guided multi-agent collaboration framework for medical multimodal decision-making. MedOrch employs an LLM-based mediator agent that enables multiple VLM-based expert agents to exchange and reflect on their outputs towards collaboration. We utilize multiple open-source general-purpose and domain-specific VLMs instead of costly GPT-series models, revealing the strength of heterogeneous models. We show that the collaboration within distinct VLM-based agents can surpass the capabilities of any individual agent. We validate our approach on five medical vision question answering benchmarks, demonstrating superior collaboration performance without model training. Our findings underscore the value of mediator-guided multi-agent collaboration in advancing medical multimodal intelligence.

[734] Reducing Cognitive Overhead in Tool Use via Multi-Small-Agent Reinforcement Learning

Dayu Wang, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li

Main category: cs.AI

TL;DR: MSARL is a multi-agent framework that decouples reasoning from tool use, using specialized small agents to improve reasoning stability and accuracy over single-agent systems.

Details

Motivation: Existing tool-integrated reasoning systems follow a single-agent paradigm where one large model handles both reasoning and tool operations, leading to cognitive-load interference and unstable coordination.

Method: MSARL uses a Reasoning Agent to decompose problems and plan tool invocations, while multiple Tool Agents specialize in specific external tools, each trained via imitation learning and reinforcement learning with role-specific rewards.

Result: On mathematical problem solving with code execution, MSARL significantly improves reasoning stability and final-answer accuracy over single-agent baselines, and generalizes to diverse tool-use tasks.

Conclusion: Cognitive-role decoupling with small agents is a scalable blueprint for multi-agent AI design that enhances reasoning stability and accuracy.

Abstract: Recent advances in multi-agent systems highlight the potential of specialized small agents that collaborate via division of labor. Existing tool-integrated reasoning systems, however, often follow a single-agent paradigm in which one large model interleaves long-horizon reasoning with precise tool operations, leading to cognitive-load interference and unstable coordination. We present MSARL, a Multi-Small-Agent Reinforcement Learning framework that explicitly decouples reasoning from tool use. In MSARL, a Reasoning Agent decomposes problems and plans tool invocations, while multiple Tool Agents specialize in specific external tools, each trained via a combination of imitation learning and reinforcement learning with role-specific rewards. On mathematical problem solving with code execution, MSARL significantly improves reasoning stability and final-answer accuracy over single-agent baselines. Moreover, the architecture generalizes to diverse tool-use tasks, demonstrating that cognitive-role decoupling with small agents is a scalable blueprint for multi-agent AI design.

[735] Self-Exploring Language Models for Explainable Link Forecasting on Temporal Graphs via Reinforcement Learning

Zifeng Ding, Shenyang Huang, Zeyu Cao, Emma Kondrup, Zachary Yang, Xingyue Huang, Yuan Sui, Zhangdie Yuan, Yuqicheng Zhu, Xianglong Hu, Yuan He, Farimah Poursafaei, Michael Bronstein, Andreas Vlachos

Main category: cs.AI

TL;DR: ReaL-TG is a reinforcement learning framework that fine-tunes LLMs for explainable link forecasting on temporal graphs, outperforming larger models while producing high-quality explanations.

Details

Motivation: Traditional neural approaches for temporal graph reasoning lack explainability and require retraining for unseen graphs, while existing LLM methods are limited to static graphs or small synthetic datasets and don't evaluate reasoning trace quality.

Method: Uses reinforcement learning with outcome-based rewards to fine-tune LLMs, enabling self-exploration of reasoning strategies from graph structure and production of explanations that justify predictions.

Result: ReaL-TG-4B (fine-tuned Qwen3-4B) outperforms much larger LLMs including GPT-5 mini on ranking metrics, while producing high-quality explanations validated by both LLM judge and human evaluation.

Conclusion: The framework successfully enables LLMs to perform explainable link forecasting on real-world temporal graphs, demonstrating superior performance and reasoning quality compared to larger frontier models.

Abstract: Forecasting future links is a central task in temporal graph (TG) reasoning, requiring models to leverage historical interactions to predict upcoming ones. Traditional neural approaches, such as temporal graph neural networks, achieve strong performance but lack explainability and cannot be applied to unseen graphs without retraining. Recent studies have begun to explore using large language models (LLMs) for graph reasoning, but most of them are constrained to static graphs or small synthetic TGs and lack the evaluation of the quality of reasoning traces generated by LLMs. In this work, we present Reasoning-Enhanced Learning for Temporal Graphs (ReaL-TG), a reinforcement learning framework that fine-tunes LLMs to perform explainable link forecasting on real-world TGs. ReaL-TG uses outcome-based reward to encourage models to self-explore reasoning strategies from graph structure and to produce explanations that directly justify their predictions. To enable evaluation on LLM-generated reasoning traces, we propose a new evaluation protocol combining ranking metrics with an LLM-as-a-Judge system that assesses both the quality of reasoning and the impact of hallucinations. Experiments with ReaL-TG-4B, obtained by fine-tuning Qwen3-4B under our framework, show that it outperforms much larger frontier LLMs, including GPT-5 mini, on ranking metrics, while producing high-quality explanations confirmed by both the LLM judge and human evaluation.

[736] EvoEmo: Towards Evolved Emotional Policies for Adversarial LLM Agents in Multi-Turn Price Negotiation

Yunbo Long, Liming Xu, Lukas Beckenbauer, Yuhan Liu, Alexandra Brintrup

Main category: cs.AI

TL;DR: EvoEmo is an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in LLM negotiations, outperforming baseline strategies by enabling adaptive emotional responses.

Details

Motivation: Existing LLM agents overlook the functional role of emotions in negotiations, generating passive emotional responses that make them vulnerable to manipulation and strategic exploitation.

Method: Models emotional state transitions as Markov Decision Process and uses population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios.

Result: EvoEmo consistently outperforms vanilla and fixed-emotion baselines, achieving higher success rates, higher efficiency, and increased buyer savings in extensive experiments.

Conclusion: Adaptive emotional expression is crucial for enabling more effective LLM agents in multi-turn negotiations, as demonstrated by EvoEmo’s superior performance.

Abstract: Recent research on Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) has demonstrated that agents can engage in \textit{complex}, \textit{multi-turn} negotiations, opening new avenues for agentic AI. However, existing LLM agents largely overlook the functional role of emotions in such negotiations, instead generating passive, preference-driven emotional responses that make them vulnerable to manipulation and strategic exploitation by adversarial counterparts. To address this gap, we present EvoEmo, an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations. EvoEmo models emotional state transitions as a Markov Decision Process and employs population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios. We further propose an evaluation framework with two baselines – vanilla strategies and fixed-emotion strategies – for benchmarking emotion-aware negotiation. Extensive experiments and ablation studies show that EvoEmo consistently outperforms both baselines, achieving higher success rates, higher efficiency, and increased buyer savings. This findings highlight the importance of adaptive emotional expression in enabling more effective LLM agents for multi-turn negotiation.

[737] Difficulty-Aware Agentic Orchestration for Query-Specific Multi-Agent Workflows

Jinwei Su, Qizhen Lan, Yinghui Xia, Lifan Sun, Weiyou Tian, Tianyu Shi, Lewei He

Main category: cs.AI

TL;DR: DAAO is a difficulty-aware multi-agent framework that dynamically generates query-specific workflows based on predicted query difficulty, improving both accuracy and efficiency compared to static approaches.

Details

Motivation: Existing multi-agent frameworks use static workflows that either over-process simple queries or underperform on complex ones, while ignoring efficiency-performance trade-offs across different LLMs.

Method: Uses three modules: VAE for difficulty estimation, modular operator allocator, and cost-performance aware LLM router. Includes self-adjusting policy that updates difficulty estimates based on workflow success.

Result: Experiments on six benchmarks show DAAO surpasses prior multi-agent systems in both accuracy and inference efficiency.

Conclusion: DAAO provides an effective adaptive framework for difficulty-aware reasoning that balances performance and efficiency.

Abstract: Large Language Model (LLM)-based agentic systems have shown strong capabilities across various tasks. However, existing multi-agent frameworks often rely on static or task-level workflows, which either over-process simple queries or underperform on complex ones, while also neglecting the efficiency-performance trade-offs across heterogeneous LLMs. To address these limitations, we propose Difficulty-Aware Agentic Orchestration (DAAO), which can dynamically generate query-specific multi-agent workflows guided by predicted query difficulty. DAAO comprises three interdependent modules: a variational autoencoder (VAE) for difficulty estimation, a modular operator allocator, and a cost- and performance-aware LLM router. A self-adjusting policy updates difficulty estimates based on workflow success, enabling simpler workflows for easy queries and more complex strategies for harder ones. Experiments on six benchmarks demonstrate that DAAO surpasses prior multi-agent systems in both accuracy and inference efficiency, validating its effectiveness for adaptive, difficulty-aware reasoning.

[738] Adapting and Evaluating Multimodal Large Language Models for Adolescent Idiopathic Scoliosis Self-Management: A Divide and Conquer Framework

Zhaolong Wu, Pu Luo, Nan Meng, Jason Pui Yin Cheung, Teng Zhang

Main category: cs.AI

TL;DR: First comprehensive evaluation of MLLMs for Adolescent Idiopathic Scoliosis self-management reveals significant limitations in interpreting spinal radiographs and understanding AIS care knowledge, with proposed enhancements showing partial improvements.

Details

Motivation: To assess the capability of Multimodal Large Language Models in supporting Adolescent Idiopathic Scoliosis self-management, as current models' effectiveness in this specialized medical domain is unknown.

Method: Used a database of 3,000 X-rays with diagnostic texts and evaluated five MLLMs through a ‘Divide and Conquer’ framework with three tasks: visual question-answering, domain knowledge assessment, and patient education counseling. Enhanced models with spinal keypoint prompting and AIS knowledge base for RAG.

Result: MLLMs showed limitations in interpreting complex spinal radiographs and comprehending AIS care knowledge. RAG substantially improved knowledge assessment performance, but visual prompting effectiveness varied across architectures. Best accuracy for spinal deformity location detection was 0.55 and direction detection was 0.13.

Conclusion: Current MLLMs are far from capable of realizing personalized assistants in AIS care, with the greatest challenge being accurate detection of spinal deformity locations and directions.

Abstract: This study presents the first comprehensive evaluation of Multimodal Large Language Models (MLLMs) for Adolescent Idiopathic Scoliosis (AIS) self-management. We constructed a database of approximately 3,000 anteroposterior X-rays with diagnostic texts and evaluated five MLLMs through a `Divide and Conquer’ framework consisting of a visual question-answering task, a domain knowledge assessment task, and a patient education counseling assessment task. Our investigation revealed limitations of MLLMs’ ability in interpreting complex spinal radiographs and comprehending AIS care knowledge. To address these, we pioneered enhancing MLLMs with spinal keypoint prompting and compiled an AIS knowledge base for retrieval augmented generation (RAG), respectively. Results showed varying effectiveness of visual prompting across different architectures, while RAG substantially improved models’ performances on the knowledge assessment task. Our findings indicate current MLLMs are far from capable in realizing personalized assistant in AIS care. The greatest challenge lies in their abilities to obtain accurate detections of spinal deformity locations (best accuracy: 0.55) and directions (best accuracy: 0.13).

[739] Large Language Models and Operations Research: A Structured Survey

Yang Wang, Kai Li

Main category: cs.AI

TL;DR: This survey explores how large language models (LLMs) can enhance operations research (OR) by automating modeling, assisting optimization, and directly solving complex problems, while addressing current limitations and future research directions.

Details

Motivation: Traditional OR approaches struggle with large-scale, dynamic, and multi-constraint problems due to reliance on expert-based modeling and manual parameter adjustment. LLMs offer potential solutions through their semantic understanding and reasoning capabilities.

Method: The paper organizes LLM integration into OR into three main directions: automatic modeling (translating natural language to mathematical models/code), auxiliary optimization (generating heuristics, evolving algorithms), and direct solving of optimization tasks.

Result: The survey reviews recent progress, evaluation benchmarks, and domain-specific applications, identifying key challenges including unstable semantic-to-structure mapping, fragmented research, limited generalization, and insufficient evaluation systems.

Conclusion: LLMs show significant potential to transform OR practices, but current limitations need addressing. The paper outlines promising research avenues for advancing LLM integration in operations research.

Abstract: Operations research (OR) provides fundamental methodologies for complex system decision-making, with established applications in transportation, supply chain management, and production scheduling. Traditional approaches, which depend on expert-based modeling and manual parameter adjustment, often face challenges in handling large-scale, dynamic, and multi-constraint problems. Recently, large language models (LLMs) have shown potential to address these limitations through semantic understanding, structured generation, and reasoning control. LLMs can translate natural language descriptions into mathematical models or executable code, generate heuristics, evolve algorithms, and directly tackle optimization tasks. This paper surveys recent progress on the integration of LLMs into OR, organizing methods into three main directions: automatic modeling, auxiliary optimization, and direct solving. It further reviews evaluation benchmarks and domain-specific applications, and summarizes key open issues such as unstable semantic-to-structure mapping, fragmented research progress, limited generalization, and insufficient evaluation systems. Finally, the survey outlines possible research avenues for advancing the role of LLMs in OR.

[740] Rethinking Reward Miscalibration of GRPO in Agentic RL

Jingyu Liu, Xiaopeng Wu, Jingquan Peng, Kehan Chen, Chuan Yu, Lizhong Ding, Yong Liu

Main category: cs.AI

TL;DR: The paper challenges conventional wisdom about outcome-based rewards causing reward miscalibration, showing that flawed actions should actually be punished during training. It identifies gradient coupling as the real issue in agentic RL and proposes a classification-based approach to separate good/bad action embeddings.

Details

Motivation: To address the problem of reward miscalibration in autonomous agents solving long-horizon tasks, where outcome-based rewards were thought to mistakenly reinforce flawed middle steps, but the authors reveal this is not the case.

Method: Proposes training the actor to classify good or bad actions to separate their embeddings and alleviate gradient interference between similar samples, where gradients from well-performing samples can inadvertently strengthen suboptimal actions.

Result: Extensive experiments show the effectiveness of the proposed approach in addressing gradient coupling issues and improving agent training.

Conclusion: Flawed actions should be punished during training, and gradient coupling between similar samples is the key issue in agentic RL that can be effectively addressed through action classification to separate embeddings.

Abstract: Building autonomous agents capable of solving long-horizon, real-world tasks has garnered significant research interest. But outcome based rewards may cause reward miscalibration which means it might mistakenly allocate positive reward to flawed middle steps which is regarded as the key reason making the bad actions being reinforced during training. However we reveal that outcome based reward ensures expected negative advantage for those flawed middle steps, which means the flawed actions should be punished during training. Even accounting for the ``squeezing effect", the probability mass of good actions should increase and the actor should gradually get rid of harmful actions. This shows that flawed actions should be punished during training. We further identify gradient coupling between similar samples as a key issue in agentic RL, the input prompt is extremely similar and the output action space is limited, therefore during training, gradients from well-performing samples can inadvertently strengthen suboptimal or incorrect actions due to similar input observation and output actions. We show that with gradient coupling, some flawed actions might be enhanced. To address this, we propose training the actor to classify good or bad actions to separate the embedding of good/bad actions and alleviate the gradient interference, extensive experiments shows its effectiveness.

[741] From $f(x)$ and $g(x)$ to $f(g(x))$: LLMs Learn New Skills in RL by Composing Old Ones

Lifan Yuan, Weize Chen, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, Hao Peng

Main category: cs.AI

TL;DR: RL enables LLMs to acquire genuinely new compositional skills by combining existing functions, not just activating existing ones, and these skills transfer to unseen tasks.

Details

Motivation: To resolve the debate about whether RL teaches LLMs new skills or merely activates existing ones, by investigating if LLMs can learn compositional skills through RL.

Method: Developed a synthetic framework using string transformation functions, testing if LLMs can learn unseen compositions h(x)=g(f(x)) after RL training when they already know f and g individually.

Result: RL enables LLMs to learn compositional skills that generalize to >2 function compositions unseen during training, and these skills transfer to different tasks without compositional training on the target task.

Conclusion: RL fundamentally changes LLM reasoning behaviors and enables acquisition of genuinely new compositional skills, suggesting a strategy of building base models with basic skills then using RL for advanced, generalizable problem-solving.

Abstract: Does RL teach LLMs genuinely new skills, or does it merely activate existing ones? This question lies at the core of ongoing debates about the role of RL in LLM post-training. On one side, strong empirical results can be achieved with RL even without preceding supervised finetuning; on the other, critics argue that RL contributes little beyond reweighting existing reasoning strategies. This work provides concrete evidence that LLMs can acquire genuinely new skills during RL by composing existing ones, mirroring one of the central mechanisms by which humans acquire new cognitive skills. To mitigate data contamination and other confounding factors, and to allow precise control over task complexity, we develop a synthetic framework for our investigation. Specifically, we define a skill as the ability to infer the output of a string transformation function f(x) given x. When an LLM has already learned f and g prior to RL, our experiments reveal that RL enables it to learn unseen compositions of them h(x)=g(f(x)). Further, this compositional ability generalizes to more difficult problems such as compositions of >2 functions unseen during RL training. Surprisingly, our experiments show that compositional skill acquired on a source task transfers to a different target task. This transfer happens even without compositional training on the target, requiring only prior knowledge of the target’s atomic skills. Our qualitative analysis shows that RL fundamentally changes the reasoning behaviors of the models. In contrast, next-token training with the same data yields none of these findings. Our systematic experiments provide fresh insights into LLM learning, suggesting the value of first building base models with basic skills, then using RL to incentivize advanced, generalizable skills for complex problems.

[742] RL in the Wild: Characterizing RLVR Training in LLM Deployment

Jiecheng Zhou, Qinghao Hu, Yuyang Jin, Zerui Wang, Peng Sun, Yuzhe Gu, Wenwei Zhang, Mingshu Zhai, Xingcheng Zhang, Weiming Zhang

Main category: cs.AI

TL;DR: Characterization study of RLVR tasks in LLM deployment reveals system challenges like GPU idling, inefficient parallel strategies, and load imbalance, leading to the PolyTrace benchmark suite.

Details

Motivation: RLVR enhances LLMs' reasoning but introduces complex system challenges that are not well understood from a system perspective.

Method: Conducted characterization study of RLVR tasks in LLM deployment, analyzing workload distribution and variation trends across training steps.

Result: Identified key system issues: GPU idling from skewed sequence lengths, inefficient parallel strategies, poor data management, and load imbalance. Proposed PolyTrace benchmark suite with 94.7% accuracy.

Conclusion: System challenges in RLVR training need further investigation; PolyTrace provides realistic workload evaluation for future research.

Abstract: Large Language Models (LLMs) are now widely used across many domains. With their rapid development, Reinforcement Learning with Verifiable Rewards (RLVR) has surged in recent months to enhance their reasoning and understanding abilities. However, its complex data flows and diverse tasks pose substantial challenges to RL training systems, and there is limited understanding of RLVR from a system perspective. To thoroughly understand the system challenges introduced by RLVR, we present a characterization study of RLVR tasks in our LLM deployment. Specifically, we investigate the distribution and variation trends of workloads across different RL tasks across training steps. We identify issues such as GPU idling caused by skewed sequence length distribution, inefficient parallel strategies in dynamically varying workloads, inefficient data management mechanisms, and load imbalance. We describe our observations and call for further investigation into the remaining open challenges. Furthermore, we propose PolyTrace benchmark suite to conduct evaluation with realistic workloads, and a practical use case validates that PolyTrace benchmark suite exhibits 94.7% accuracy.

[743] Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning

Maël Macuglia, Paul Friedrich, Giorgia Ramponi

Main category: cs.AI

TL;DR: BRIDGE is a two-stage RL framework that first learns safe policies from offline expert demonstrations, then fine-tunes online using human preferences, achieving better sample efficiency than standalone approaches.

Details

Motivation: To overcome RL deployment obstacles: difficulty specifying accurate rewards and unsafe exploration in robotics, industry, and healthcare applications.

Method: Two-stage approach: 1) Learn safe initial policy from reward-free expert demonstrations, 2) Fine-tune online using preference-based human feedback via BRIDGE algorithm with uncertainty-weighted objective.

Result: BRIDGE achieves lower regret than standalone behavioral cloning and online preference-based RL in MuJoCo environments, with regret bounds that shrink with offline data quantity.

Conclusion: Establishes theoretical foundation for more sample-efficient interactive agents by connecting offline data quantity to online sample efficiency.

Abstract: Deploying reinforcement learning (RL) in robotics, industry, and health care is blocked by two obstacles: the difficulty of specifying accurate rewards and the risk of unsafe, data-hungry exploration. We address this by proposing a two-stage framework that first learns a safe initial policy from a reward-free dataset of expert demonstrations, then fine-tunes it online using preference-based human feedback. We provide the first principled analysis of this offline-to-online approach and introduce BRIDGE, a unified algorithm that integrates both signals via an uncertainty-weighted objective. We derive regret bounds that shrink with the number of offline demonstrations, explicitly connecting the quantity of offline data to online sample efficiency. We validate BRIDGE in discrete and continuous control MuJoCo environments, showing it achieves lower regret than both standalone behavioral cloning and online preference-based RL. Our work establishes a theoretical foundation for designing more sample-efficient interactive agents.

[744] Aristotle: IMO-level Automated Theorem Proving

Tudor Achim, Alex Best, Alberto Bietti, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern-Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, Riley Patterson, Eric Rodriguez, Laura Scharff, Vikram Shanker, Vladmir Sicca, Hari Sowrirajan, Aidan Swope, Matyas Tamas, Vlad Tenev, Jonathan Thomm, Harold Williams, Lawrence Wu

Main category: cs.AI

TL;DR: Aristotle is an AI system that combines formal verification with informal reasoning, achieving gold-medal-level performance on the 2025 International Mathematical Olympiad problems.

Details

Motivation: To advance automated theorem proving by integrating formal verification with informal reasoning, addressing the limitations of purely formal or purely informal approaches.

Method: Integrates three components: Lean proof search system, informal reasoning system for generating and formalizing lemmas, and a dedicated geometry solver.

Result: Achieved gold-medal-equivalent performance on the 2025 International Mathematical Olympiad problems with state-of-the-art performance and favorable scaling properties.

Conclusion: The Aristotle system demonstrates that combining formal verification with informal reasoning enables superior performance in automated theorem proving, particularly for complex mathematical problems.

Abstract: We introduce Aristotle, an AI system that combines formal verification with informal reasoning, achieving gold-medal-equivalent performance on the 2025 International Mathematical Olympiad problems. Aristotle integrates three main components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver. Our system demonstrates state-of-the-art performance with favorable scaling properties for automated theorem proving.

[745] AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance

Bill Marino, Rosco Hunter, Zubair Jamali, Marinos Emmanouil Kalpakos, Mudra Kashyap, Isaiah Hinton, Alexa Hanson, Maahum Nazir, Christoph Schnabl, Felix Steffek, Hongkai Wen, Nicholas D. Lane

Main category: cs.AI

TL;DR: AIReg-Bench is the first benchmark dataset for evaluating LLMs’ ability to assess compliance with the EU AI Act, created through LLM-generated technical documentation samples and expert legal annotations.

Details

Motivation: As governments regulate AI, there's growing interest in using LLMs to assess AI system compliance with regulations, but no existing benchmarks to evaluate LLM performance on this task.

Method: Created dataset through two-step process: (1) prompting LLM to generate 120 technical documentation excerpts of fictional AI systems, (2) legal experts reviewed and annotated each sample for AI Act violations.

Result: Developed AIReg-Bench dataset with expert-annotated compliance labels, providing a foundation for evaluating LLM performance on AI regulation compliance assessment.

Conclusion: AIReg-Bench establishes the first benchmark for LLM-based AI regulation compliance assessment tools, enabling comparison of subsequent LLMs and understanding their opportunities and limitations in this domain.

Abstract: As governments move to regulate AI, there is growing interest in using Large Language Models (LLMs) to assess whether or not an AI system complies with a given AI Regulation (AIR). However, there is presently no way to benchmark the performance of LLMs at this task. To fill this void, we introduce AIReg-Bench: the first benchmark dataset designed to test how well LLMs can assess compliance with the EU AI Act (AIA). We created this dataset through a two-step process: (1) by prompting an LLM with carefully structured instructions, we generated 120 technical documentation excerpts (samples), each depicting a fictional, albeit plausible, AI system - of the kind an AI provider might produce to demonstrate their compliance with AIR; (2) legal experts then reviewed and annotated each sample to indicate whether, and in what way, the AI system described therein violates specific Articles of the AIA. The resulting dataset, together with our evaluation of whether frontier LLMs can reproduce the experts’ compliance labels, provides a starting point to understand the opportunities and limitations of LLM-based AIR compliance assessment tools and establishes a benchmark against which subsequent LLMs can be compared. The dataset and evaluation code are available at https://github.com/camlsys/aireg-bench.

[746] Decoding Emotion in the Deep: A Systematic Study of How LLMs Represent, Retain, and Express Emotion

Jingxiang Zhang, Lujia Zhong

Main category: cs.AI

TL;DR: LLMs develop well-defined internal emotional representations that emerge early, peak mid-network, and persist for hundreds of tokens, with performance improving with model scale.

Details

Motivation: To understand how, where, and for how long emotion is encoded in LLMs' neural architecture, as their internal emotional mechanisms remain largely unexplored despite their ability to simulate emotional intelligence.

Method: Used a novel large-scale Reddit corpus of 400,000 utterances balanced across seven basic emotions, employing lightweight probes to read information from hidden layers of Qwen3 and LLaMA models without parameter alteration.

Result: LLMs develop surprisingly well-defined internal emotional geometry that sharpens with model scale and significantly outperforms zero-shot prompting. Emotional signal emerges early, peaks mid-network, is malleable via system prompts, and persists for hundreds of subsequent tokens.

Conclusion: The study provides crucial insights for developing more transparent and aligned AI systems by mapping the emotional landscape within LLMs, with open-sourced dataset and probing toolkit.

Abstract: Large Language Models (LLMs) are increasingly expected to navigate the nuances of human emotion. While research confirms that LLMs can simulate emotional intelligence, their internal emotional mechanisms remain largely unexplored. This paper investigates the latent emotional representations within modern LLMs by asking: how, where, and for how long is emotion encoded in their neural architecture? To address this, we introduce a novel, large-scale Reddit corpus of approximately 400,000 utterances, balanced across seven basic emotions through a multi-stage process of classification, rewriting, and synthetic generation. Using this dataset, we employ lightweight “probes” to read out information from the hidden layers of various Qwen3 and LLaMA models without altering their parameters. Our findings reveal that LLMs develop a surprisingly well-defined internal geometry of emotion, which sharpens with model scale and significantly outperforms zero-shot prompting. We demonstrate that this emotional signal is not a final-layer phenomenon but emerges early and peaks mid-network. Furthermore, the internal states are both malleable (they can be influenced by simple system prompts) and persistent, as the initial emotional tone remains detectable for hundreds of subsequent tokens. We contribute our dataset, an open-source probing toolkit, and a detailed map of the emotional landscape within LLMs, offering crucial insights for developing more transparent and aligned AI systems. The code and dataset are open-sourced.

[747] TRAJECT-Bench:A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use

Pengfei He, Zhenwei Dai, Bing He, Hui Liu, Xianfeng Tang, Hanqing Lu, Juanhui Li, Jiayuan Ding, Subhabrata Mukherjee, Suhang Wang, Yue Xing, Jiliang Tang, Benoit Dumoulin

Main category: cs.AI

TL;DR: TRAJECT-Bench is a trajectory-aware benchmark that comprehensively evaluates LLMs’ tool use capability through fine-grained metrics on tool selection, parameterization, and ordering across diverse tasks with production-style APIs.

Details

Motivation: Existing works evaluate LLMs' tool use capability but focus only on final answers, overlooking detailed tool usage trajectory including correct tool selection, parameterization, and ordering.

Method: TRAJECT-Bench pairs high-fidelity executable tools across practical domains with tasks grounded in production-style APIs, and synthesizes trajectories varying in breadth (parallel calls) and depth (interdependent chains).

Result: The benchmark reveals failure modes like similar tool confusion and parameter-blind selection, and shows scaling behavior with tool diversity and trajectory length, identifying the bottleneck in transitioning from short to mid-length trajectories.

Conclusion: TRAJECT-Bench offers actionable guidance for improving LLMs’ tool use by providing trajectory-level diagnostics and revealing critical failure patterns and scaling challenges.

Abstract: Large language model (LLM)-based agents increasingly rely on tool use to complete real-world tasks. While existing works evaluate the LLMs’ tool use capability, they largely focus on the final answers yet overlook the detailed tool usage trajectory, i.e., whether tools are selected, parameterized, and ordered correctly. We introduce TRAJECT-Bench, a trajectory-aware benchmark to comprehensively evaluate LLMs’ tool use capability through diverse tasks with fine-grained evaluation metrics. TRAJECT-Bench pairs high-fidelity, executable tools across practical domains with tasks grounded in production-style APIs, and synthesizes trajectories that vary in breadth (parallel calls) and depth (interdependent chains). Besides final accuracy, TRAJECT-Bench also reports trajectory-level diagnostics, including tool selection and argument correctness, and dependency/order satisfaction. Analyses reveal failure modes such as similar tool confusion and parameter-blind selection, and scaling behavior with tool diversity and trajectory length where the bottleneck of transiting from short to mid-length trajectories is revealed, offering actionable guidance for LLMs’ tool use.

[748] Making Mathematical Reasoning Adaptive

Zhejian Lai, Xiang Geng, Zhijun Wang, Yang Bai, Jiahuan Li, Rongxiang Weng, Jingang Wang, Xuezhi Cao, Xunliang Cai, Shujian Huang

Main category: cs.AI

TL;DR: AdaR framework improves LLM mathematical reasoning by training models to use adaptive logic instead of spurious reasoning through logically equivalent query synthesis and RLVR training.

Details

Motivation: Existing LLMs exhibit failures in robustness and generalization in mathematical reasoning due to spurious reasoning (relying on superficial features rather than problem-solving logic).

Method: AdaR synthesizes logically equivalent queries by varying variable values, trains models with RLVR on these data to penalize spurious logic while encouraging adaptive logic, and uses code execution with sanity checks to ensure data quality.

Result: AdaR improves robustness and generalization, achieving substantial improvement in mathematical reasoning while maintaining high data efficiency. Analysis shows data synthesis and RLVR work together to enable adaptive reasoning.

Conclusion: The framework successfully addresses spurious reasoning in LLMs for mathematical tasks, with analyses providing key design insights and demonstrating applicability to instruct LLMs.

Abstract: Mathematical reasoning is a primary indicator of large language models (LLMs) intelligence. However, existing LLMs exhibit failures of robustness and generalization. This paper attributes these deficiencies to spurious reasoning, i.e., producing answers from superficial features. To address this challenge, we propose the AdaR framework to enable adaptive reasoning, wherein models rely on problem-solving logic to produce answers. AdaR synthesizes logically equivalent queries by varying variable values, and trains models with RLVR on these data to penalize spurious logic while encouraging adaptive logic. To improve data quality, we extract the problem-solving logic from the original query and generate the corresponding answer by code execution, then apply a sanity check. Experimental results demonstrate that AdaR improves robustness and generalization, achieving substantial improvement in mathematical reasoning while maintaining high data efficiency. Analysis indicates that data synthesis and RLVR function in a coordinated manner to enable adaptive reasoning in LLMs. Subsequent analyses derive key design insights into the effect of critical factors and the applicability to instruct LLMs. Our project is available at https://github.com/NJUNLP/AdaR.

[749] Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI

Kun Xiang, Terry Jingchen Zhang, Yinya Huang, Jixi He, Zirong Liu, Yueling Tang, Ruizhe Zhou, Lijing Luo, Youpeng Wen, Xiuwei Chen, Bingqian Lin, Jianhua Han, Hang Xu, Hanhui Li, Bin Dong, Xiaodan Liang

Main category: cs.AI

TL;DR: This paper provides a comprehensive overview of Physical AI, bridging the gap between theoretical physics reasoning and applied physical understanding in AI systems.

Details

Motivation: The rapid advancement of embodied intelligence and world models has intensified efforts to integrate physical laws into AI systems, but physical perception and symbolic physics reasoning have developed separately without a unified framework.

Method: Systematically examines how physics-grounded methods enhance AI’s real-world comprehension across structured symbolic reasoning, embodied systems, and generative models through rigorous analysis of recent advances.

Result: Establishes clear distinctions between theoretical physics reasoning and applied physical understanding, advocating for intelligent systems that ground learning in both physical principles and embodied reasoning processes.

Conclusion: Envisions next-generation world models capable of explaining physical phenomena and predicting future states, advancing safe, generalizable, and interpretable AI systems that transcend pattern recognition toward genuine understanding of physical laws.

Abstract: The rapid advancement of embodied intelligence and world models has intensified efforts to integrate physical laws into AI systems, yet physical perception and symbolic physics reasoning have developed along separate trajectories without a unified bridging framework. This work provides a comprehensive overview of physical AI, establishing clear distinctions between theoretical physics reasoning and applied physical understanding while systematically examining how physics-grounded methods enhance AI’s real-world comprehension across structured symbolic reasoning, embodied systems, and generative models. Through rigorous analysis of recent advances, we advocate for intelligent systems that ground learning in both physical principles and embodied reasoning processes, transcending pattern recognition toward genuine understanding of physical laws. Our synthesis envisions next-generation world models capable of explaining physical phenomena and predicting future states, advancing safe, generalizable, and interpretable AI systems. We maintain a continuously updated resource at https://github.com/AI4Phys/Awesome-AI-for-Physics.

[750] Think Then Embed: Generative Context Improves Multimodal Embedding

Xuanming Cui, Jianpeng Cheng, Hong-you Chen, Satya Narayan Shukla, Abhijeet Awasthi, Xichen Pan, Chaitanya Ahuja, Shlok Kumar Mishra, Qi Guo, Ser-Nam Lim, Aashu Singh, Xiangjun Fan

Main category: cs.AI

TL;DR: Proposes Think-Then-Embed (TTE) framework for Universal Multimodal Embeddings that uses chain-of-thought reasoning to improve performance on complex multimodal tasks.

Details

Motivation: Current multimodal embedding approaches treat MLLMs only as encoders, ignoring their generative capacity, which becomes ineffective for complex instructions requiring compositional reasoning.

Method: TTE framework with two components: a reasoner MLLM that generates reasoning traces for complex queries, followed by an embedder that produces representations conditioned on both original query and intermediate reasoning.

Result: Achieves state-of-the-art on MMEB-V2 benchmark, surpassing proprietary models. Also achieves best performance among open-source models with 7% absolute gain over recent models when using smaller finetuned reasoner.

Conclusion: Explicit reasoning step enables nuanced understanding of complex multimodal instructions, and framework can be optimized for efficiency without performance loss.

Abstract: There is a growing interest in Universal Multimodal Embeddings (UME), where models are required to generate task-specific representations. While recent studies show that Multimodal Large Language Models (MLLMs) perform well on such tasks, they treat MLLMs solely as encoders, overlooking their generative capacity. However, such an encoding paradigm becomes less effective as instructions become more complex and require compositional reasoning. Inspired by the proven effectiveness of chain-of-thought reasoning, we propose a general Think-Then-Embed (TTE) framework for UME, composed of a reasoner and an embedder. The reasoner MLLM first generates reasoning traces that explain complex queries, followed by an embedder that produces representations conditioned on both the original query and the intermediate reasoning. This explicit reasoning step enables more nuanced understanding of complex multimodal instructions. Our contributions are threefold. First, by leveraging a powerful MLLM reasoner, we achieve state-of-the-art performance on the MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house datasets. Second, to reduce the dependency on large MLLM reasoners, we finetune a smaller MLLM reasoner using high-quality embedding-centric reasoning traces, achieving the best performance among open-source models with a 7% absolute gain over recently proposed models. Third, we investigate strategies for integrating the reasoner and embedder into a unified model for improved efficiency without sacrificing performance.

[751] The Contingencies of Physical Embodiment Allow for Open-Endedness and Care

Leonardo Christov-Moore, Arthur Juliani, Alex Kiefer, Nicco Reggente, B. Scott Rousse, Adam Safron, Nicolás Hinrichs, Daniel Polani, Antonio Damasio

Main category: cs.AI

TL;DR: The paper proposes that physical vulnerability and mortality are essential conditions for developing robust, adaptive artificial agents. It defines two minimal conditions for physical embodiment inspired by Heidegger’s philosophy: being-in-the-world and being-towards-death, which generate homeostatic and intrinsic drives for survival and empowerment.

Details

Motivation: To understand why biological organisms thrive in open-ended environments while artificial agents struggle, and to develop more robust, adaptive, and caring artificial agents by incorporating physical vulnerability and mortality as design principles.

Method: Defines two minimal conditions for physical embodiment based on Heidegger’s existentialist phenomenology, formalizes these concepts within a reinforcement learning framework, and examines how intrinsically driven embodied agents can cultivate open-endedness and care in multi-agent environments.

Result: The framework shows how physical vulnerability and mortality can generate both homeostatic drives (maintaining integrity) and intrinsic drives (maximizing control over future states), enabling agents to better meet future needs and enhance their capacity for survival.

Conclusion: Physical vulnerability and mortality are not obstacles but essential conditions that can drive the development of more robust, adaptive, and caring artificial agents capable of thriving in open-ended environments through homeostatic and empowerment-based intrinsic drives.

Abstract: Physical vulnerability and mortality are often seen as obstacles to be avoided in the development of artificial agents, which struggle to adapt to open-ended environments and provide aligned care. Meanwhile, biological organisms survive, thrive, and care for each other in an open-ended physical world with relative ease and efficiency. Understanding the role of the conditions of life in this disparity can aid in developing more robust, adaptive, and caring artificial agents. Here we define two minimal conditions for physical embodiment inspired by the existentialist phenomenology of Martin Heidegger: being-in-the-world (the agent is a part of the environment) and being-towards-death (unless counteracted, the agent drifts toward terminal states due to the second law of thermodynamics). We propose that from these conditions we can obtain both a homeostatic drive - aimed at maintaining integrity and avoiding death by expending energy to learn and act - and an intrinsic drive to continue to do so in as many ways as possible. Drawing inspiration from Friedrich Nietzsche’s existentialist concept of will-to-power, we examine how intrinsic drives to maximize control over future states, e.g., empowerment, allow agents to increase the probability that they will be able to meet their future homeostatic needs, thereby enhancing their capacity to maintain physical integrity. We formalize these concepts within a reinforcement learning framework, which enables us to examine how intrinsically driven embodied agents learning in open-ended multi-agent environments may cultivate the capacities for open-endedness and care.

[752] Base Models Know How to Reason, Thinking Models Learn When

Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda

Main category: cs.AI

TL;DR: Thinking models like DeepSeek R1 outperform base models by learning to deploy pre-existing reasoning capabilities at the right time, rather than acquiring entirely new reasoning mechanisms.

Details

Motivation: To understand whether thinking models learn new reasoning capabilities or simply repurpose existing base model capabilities through better deployment timing.

Method: Proposed a hybrid model that activates reasoning mechanisms in base models at appropriate times, using an unsupervised bottom-up approach to discover human-interpretable reasoning behaviors without manual assumptions.

Result: The hybrid model recovered up to 91% of the performance gap to thinking models without weight updates while steering only 12% of tokens, across three base and four thinking models tested on GSM8K and MATH500.

Conclusion: Thinking models primarily learn efficient deployment of pre-existing reasoning mechanisms acquired during pre-training, with post-training focusing on timing these mechanisms effectively rather than creating new capabilities.

Abstract: Why do thinking language models like DeepSeek R1 outperform their base counterparts? Despite consistent performance gains, it remains unclear to what extent thinking models learn entirely new reasoning capabilities or repurpose pre-existing base model ones. In this work, we propose a hybrid model where we activate reasoning mechanisms in base models at the right time to elicit thinking-model-level reasoning chains, implying that thinking models exploit already existing capabilities. To ground our analysis, we introduce an unsupervised, bottom-up approach for uncovering human-interpretable reasoning behaviors in thinking models. This approach provides an unbiased method to discover reasoning behaviors without imposing manual or LLM-derived assumptions. Across three base and four thinking models, using GSM8K and MATH500, our hybrid model recovers up to 91% of the performance gap to thinking models without any weight updates while steering only 12% of tokens. Concretely, our empirical setup provides a simple, causal way to test the effectiveness of existing reasoning mechanisms in base models by invoking them directly and measuring the resulting task performance. More broadly, these results reframe our understanding of how thinking models are trained: pre-training is when models acquire most of their reasoning mechanisms, and post-training teaches efficient deployment of these mechanisms at the right time, enabling efficient use of their inference-time compute.

[753] ProSEA: Problem Solving via Exploration Agents

William Nguyen, Vinh Luong, Christopher Nguyen

Main category: cs.AI

TL;DR: ProSEA is a modular multi-agent framework for iterative problem solving through exploration and plan evolution, featuring hierarchical orchestration of specialized agents with dynamic replanning based on structured failure feedback.

Details

Motivation: Existing AI agents are limited to static planning and brittle interactions, lacking true collaboration or adaptive reasoning capabilities needed for complex tasks.

Method: Hierarchical architecture with Manager Agent orchestrating domain-specialized Expert Agents, decomposing tasks, and adaptively replanning based on structured feedback from failed attempts including detailed failure reasons and discovered constraints.

Result: Outperforms state-of-the-art baselines on FinanceBench benchmark without human feedback, achieving robust performance across reasoning-heavy tasks.

Conclusion: ProSEA demonstrates potential as a foundation for more transparent, adaptive, and human-aligned AI agents through its exploration-driven, feedback-informed approach to problem solving.

Abstract: Large language models (LLMs) have empowered AI agents to tackle increasingly complex tasks. However, most existing agents remain limited to static planning and brittle interactions, falling short of true collaboration or adaptive reasoning. We introduce ProSEA, a modular, general-purpose multi-agent framework designed for iterative problem solving through exploration and plan evolution. ProSEA features a hierarchical architecture in which a Manager Agent orchestrates domain-specialized Expert Agents, decomposes tasks, and adaptively replans based on structured feedback from failed attempts. Unlike prior systems, ProSEA agents report not only success or failure but also detailed reasons for failure and newly discovered constraints, enabling dynamic plan refinement informed by exploratory traces. The framework operates autonomously but supports seamless integration with human collaborators when needed. Experiments on the challenging FinanceBench benchmark demonstrate that ProSEA, even without human feedback, outperforms state-of-the-art baselines and achieves robust performance across reasoning-heavy tasks. These results underscore ProSEA’s potential as a foundation for more transparent, adaptive, and human-aligned AI agents.

[754] oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning

Ruiling Xu, Yifan Zhang, Qingyun Wang, Carl Edwards, Heng Ji

Main category: cs.AI

TL;DR: The paper introduces oMeBench, a large-scale benchmark for evaluating organic reaction mechanism reasoning in LLMs, and oMeS, a dynamic evaluation framework. Current LLMs show chemical intuition but struggle with consistent multi-step reasoning, though fine-tuning on the proposed dataset improves performance significantly.

Details

Motivation: To assess whether LLMs' performance in chemical tasks reflects genuine chemical reasoning capabilities, including generating valid intermediates, maintaining chemical consistency, and following coherent multi-step pathways.

Method: Developed oMeBench (10,000+ annotated mechanistic steps) and oMeS evaluation framework combining step-level logic and chemical similarity. Analyzed state-of-the-art LLMs and tested fine-tuning approaches.

Result: Current models display promising chemical intuition but struggle with correct and consistent multi-step reasoning. Fine-tuning a specialist model on the proposed dataset increased performance by 50% over the leading closed-source model.

Conclusion: oMeBench provides a rigorous foundation for advancing AI systems toward genuine chemical reasoning, addressing current limitations in multi-step mechanistic understanding.

Abstract: Organic reaction mechanisms are the stepwise elementary reactions by which reactants form intermediates and products, and are fundamental to understanding chemical reactivity and designing new molecules and reactions. Although large language models (LLMs) have shown promise in understanding chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning capabilities, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi-step pathways. We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for organic mechanism reasoning in organic chemistry. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity. We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that using prompting strategy and fine-tuning a specialist model on our proposed dataset increases performance by 50% over the leading closed-source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.

[755] An approach for systematic decomposition of complex llm tasks

Tianle Zhou, Jiakai Xu, Guanhong Liu, Jiaxiang Liu, Haonan Wang, Eugene Wu

Main category: cs.AI

TL;DR: ACONIC is a systematic decomposition framework that models tasks as constraint problems and uses formal complexity measures to guide decomposition, improving LLM performance by 10-40 percentage points on complex tasks.

Details

Motivation: LLMs suffer from reliability issues on complex tasks because existing decomposition methods are heuristic and rely on manual or agent-based decomposition rather than systematic approaches.

Method: The ACONIC framework models tasks as constraint problems and leverages formal complexity measures to guide task decomposition, providing a systematic alternative to heuristic decomposition methods.

Result: On combinatorial tasks (SATBench) and LLM database querying tasks (Spider), decomposition guided by complexity measures enabled agents to perform considerably better, with improvements of 10-40 percentage points.

Conclusion: Systematic decomposition using formal complexity measures through the ACONIC framework significantly enhances LLM reliability and performance on complex tasks compared to heuristic decomposition approaches.

Abstract: Large Language Models (LLMs) suffer from reliability issues on complex tasks, as existing decomposition methods are heuristic and rely on agent or manual decomposition. This work introduces a novel, systematic decomposition framework that we call Analysis of CONstraint-Induced Complexity (ACONIC), which models the task as a constraint problem and leveraging formal complexity measures to guide decomposition. On combinatorial (SATBench) and LLM database querying tasks (Spider), we find that by decomposing the tasks following the measure of complexity, agent can perform considerably better (10-40 percentage point).

[756] Multi-Condition Conformal Selection

Qingyang Hao, Wenbo Liao, Bingyi Jing, Hongxin Wei

Main category: cs.AI

TL;DR: The paper proposes MCCS, a method that extends conformal selection to handle multiple conditions (conjunctive/disjunctive) while maintaining finite-sample FDR control, addressing limitations of existing single-threshold approaches.

Details

Motivation: Existing conformal selection methods only work for single-threshold scenarios and cannot handle practical multi-condition selection needs in applications like drug discovery, precision medicine, and LLM alignment.

Method: Developed MCCS algorithm with novel nonconformity score for conjunctive conditions and global BH procedure for disjunctive conditions, ensuring regional monotonicity and theoretical FDR guarantees.

Result: Extensive experiments show MCCS outperforms baselines, generalizes across diverse condition combinations and real-world modalities, and scales to multi-task settings.

Conclusion: MCCS successfully extends conformal selection to multi-condition scenarios with rigorous FDR control, providing a practical solution for resource-constrained applications requiring complex selection criteria.

Abstract: Selecting high-quality candidates from large-scale datasets is critically important in resource-constrained applications such as drug discovery, precision medicine, and the alignment of large language models. While conformal selection methods offer a rigorous solution with False Discovery Rate (FDR) control, their applicability is confined to single-threshold scenarios (i.e., y

c) and overlooks practical needs for multi-condition selection, such as conjunctive or disjunctive conditions. In this work, we propose the Multi-Condition Conformal Selection (MCCS) algorithm, which extends conformal selection to scenarios with multiple conditions. In particular, we introduce a novel nonconformity score with regional monotonicity for conjunctive conditions and a global Benjamini-Hochberg (BH) procedure for disjunctive conditions, thereby establishing finite-sample FDR control with theoretical guarantees. The integration of these components enables the proposed method to achieve rigorous FDR-controlled selection in various multi-condition environments. Extensive experiments validate the superiority of MCCS over baselines, its generalizability across diverse condition combinations, different real-world modalities, and multi-task scalability.

[757] LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings

Benjamin F. Maier, Ulf Aslak, Luca Fiaschi, Nina Rismal, Kemble Fletcher, Christian C. Luhmann, Robbie Dow, Kli Pappas, Thomas V. Wiecki

Main category: cs.AI

TL;DR: Semantic Similarity Rating (SSR) method uses LLMs to simulate synthetic consumers by generating textual responses and mapping them to Likert distributions via embedding similarity, achieving 90% of human test-retest reliability while maintaining realistic response distributions.

Details

Motivation: Traditional consumer research suffers from panel biases, limited scale, and high costs. LLMs offer an alternative but produce unrealistic numerical ratings when asked directly.

Method: SSR elicits textual responses from LLMs and maps them to Likert distributions using embedding similarity to reference statements, rather than asking for numerical ratings directly.

Result: On 57 personal care product surveys (9,300 human responses), SSR achieved 90% of human test-retest reliability while maintaining realistic response distributions (KS similarity > 0.85). Synthetic respondents also provided rich qualitative feedback.

Conclusion: SSR enables scalable consumer research simulations while preserving traditional survey metrics and interpretability, offering a cost-effective alternative to traditional methods.

Abstract: Consumer research costs companies billions annually yet suffers from panel biases and limited scale. Large language models (LLMs) offer an alternative by simulating synthetic consumers, but produce unrealistic response distributions when asked directly for numerical ratings. We present semantic similarity rating (SSR), a method that elicits textual responses from LLMs and maps these to Likert distributions using embedding similarity to reference statements. Testing on an extensive dataset comprising 57 personal care product surveys conducted by a leading corporation in that market (9,300 human responses), SSR achieves 90% of human test-retest reliability while maintaining realistic response distributions (KS similarity > 0.85). Additionally, these synthetic respondents provide rich qualitative feedback explaining their ratings. This framework enables scalable consumer research simulations while preserving traditional survey metrics and interpretability.

[758] Physics-Informed High-order Graph Dynamics Identification Learning for Predicting Complex Networks Long-term Dynamics

Bicheng Wang, Junping Wang, Yibo Xue

Main category: cs.AI

TL;DR: This paper proposes a higher-order network dynamics identification method using dynamic hypergraph learning and dual-driven prediction with Koopman operator theory to improve long-term dynamic prediction of complex networks.

Details

Motivation: Existing methods use simple graphs that only capture pairwise relationships, missing non-pairwise structured relationships. First-order GNNs struggle with dynamic non-pairwise relationships, and theoretical models lack accuracy while data-driven models lack interpretability.

Method: Uses dynamic hypergraph learning to capture higher-order non-pairwise relationships, and a dual-driven prediction module combining Koopman operator theory (to linearize nonlinear dynamics) with physical information neural differential equations to ensure physical consistency.

Result: Experimental validation on public and industrial chain network datasets shows the method achieves good prediction accuracy and long-term prediction performance.

Conclusion: The proposed approach effectively addresses limitations of existing methods by capturing higher-order relationships and ensuring both accuracy and interpretability through dual-driven prediction.

Abstract: Learning complex network dynamics is fundamental to understanding, modelling and controlling real-world complex systems. There are two main problems in the task of predicting the dynamic evolution of complex networks: on the one hand, existing methods usually use simple graphs to describe the relationships in complex networks; however, this approach can only capture pairwise relationships, while there may be rich non-pairwise structured relationships in the network. First-order GNNs have difficulty in capturing dynamic non-pairwise relationships. On the other hand, theoretical prediction models lack accuracy and data-driven prediction models lack interpretability. To address the above problems, this paper proposes a higher-order network dynamics identification method for long-term dynamic prediction of complex networks. Firstly, to address the problem that traditional graph machine learning can only deal with pairwise relations, dynamic hypergraph learning is introduced to capture the higher-order non-pairwise relations among complex networks and improve the accuracy of complex network modelling. Then, a dual-driven dynamic prediction module for physical data is proposed. The Koopman operator theory is introduced to transform the nonlinear dynamical differential equations for the dynamic evolution of complex networks into linear systems for solving. Meanwhile, the physical information neural differential equation method is utilised to ensure that the dynamic evolution conforms to the physical laws. The dual-drive dynamic prediction module ensures both accuracy and interpretability of the prediction. Validated on public datasets and self-built industrial chain network datasets, the experimental results show that the method in this paper has good prediction accuracy and long-term prediction performance.

[759] Agentic Systems in Radiology: Design, Applications, Evaluation, and Challenges

Christian Bluethgen, Dave Van Veen, Daniel Truhn, Jakob Nikolas Kather, Michael Moor, Malgorzata Polacin, Akshay Chaudhari, Thomas Frauenfelder, Curtis P. Langlotz, Michael Krauthammer, Farhad Nooralahzadeh

Main category: cs.AI

TL;DR: This paper reviews LLM-driven agentic systems in radiology, examining their design, applications, evaluation methods, and challenges like error cascades and health IT integration.

Details

Motivation: Radiology's multimodal data streams and orchestrated workflows make it ideal for AI agents that can adapt to context and automate complex tasks, but current LLM applications underutilize their potential for multi-step workflows.

Method: The paper examines how equipping LLMs with external tools and feedback mechanisms enables them to drive systems with varying autonomy levels, from semi-automated workflows to adaptive agents managing complex processes.

Result: LLMs and their multimodal variants have demonstrated promising performance for individual radiology tasks like information extraction and report summarization, but their full potential requires integration into multi-step workflows.

Conclusion: LLM-driven agentic systems show significant promise for radiology but face challenges including error cascades, tool-use efficiency, and health IT integration that need to be addressed for successful implementation.

Abstract: Building agents, systems that perceive and act upon their environment with a degree of autonomy, has long been a focus of AI research. This pursuit has recently become vastly more practical with the emergence of large language models (LLMs) capable of using natural language to integrate information, follow instructions, and perform forms of “reasoning” and planning across a wide range of tasks. With its multimodal data streams and orchestrated workflows spanning multiple systems, radiology is uniquely suited to benefit from agents that can adapt to context and automate repetitive yet complex tasks. In radiology, LLMs and their multimodal variants have already demonstrated promising performance for individual tasks such as information extraction and report summarization. However, using LLMs in isolation underutilizes their potential to support complex, multi-step workflows where decisions depend on evolving context from multiple information sources. Equipping LLMs with external tools and feedback mechanisms enables them to drive systems that exhibit a spectrum of autonomy, ranging from semi-automated workflows to more adaptive agents capable of managing complex processes. This review examines the design of such LLM-driven agentic systems, highlights key applications, discusses evaluation methods for planning and tool use, and outlines challenges such as error cascades, tool-use efficiency, and health IT integration.

cs.SD

[760] Universal Discrete-Domain Speech Enhancement

Fei Liu, Yang Ai, Ye-Xin Lu, Rui-Chen Zheng, Hui-Peng Du, Zhen-Hua Ling

Main category: cs.SD

TL;DR: Proposes UDSE, a universal discrete-domain speech enhancement model that treats SE as a classification task predicting clean discrete tokens from a pre-trained neural speech codec, enabling robust enhancement across multiple simultaneous distortions.

Details

Motivation: Most existing speech enhancement methods only handle limited types of distortions, while real-world scenarios involve multiple simultaneous distortions, limiting generalization and practical usability.

Method: UDSE redefines SE as discrete-domain classification using residual vector quantizer tokens from pre-trained neural speech codec, with global feature extraction and hierarchical token prediction following RVQ structure, trained with teacher-forcing and cross-entropy loss.

Result: UDSE effectively enhances speech degraded by various conventional and unconventional distortions (additive noise, reverberation, band limitation, clipping, phase distortion, compression) and their combinations, demonstrating superior universality.

Conclusion: The discrete-domain classification approach enables UDSE to achieve better generalization and practical performance across multiple simultaneous distortions compared to regression-based SE methods.

Abstract: In real-world scenarios, speech signals are inevitably corrupted by various types of interference, making speech enhancement (SE) a critical task for robust speech processing. However, most existing SE methods only handle a limited range of distortions, such as additive noise, reverberation, or band limitation, while the study of SE under multiple simultaneous distortions remains limited. This gap affects the generalization and practical usability of SE methods in real-world environments.To address this gap, this paper proposes a novel Universal Discrete-domain SE model called UDSE.Unlike regression-based SE models that directly predict clean speech waveform or continuous features, UDSE redefines SE as a discrete-domain classification task, instead predicting the clean discrete tokens quantized by the residual vector quantizer (RVQ) of a pre-trained neural speech codec.Specifically, UDSE first extracts global features from the degraded speech. Guided by these global features, the clean token prediction for each VQ follows the rules of RVQ, where the prediction of each VQ relies on the results of the preceding ones. Finally, the predicted clean tokens from all VQs are decoded to reconstruct the clean speech waveform. During training, the UDSE model employs a teacher-forcing strategy, and is optimized with cross-entropy loss. Experimental results confirm that the proposed UDSE model can effectively enhance speech degraded by various conventional and unconventional distortions, e.g., additive noise, reverberation, band limitation, clipping, phase distortion, and compression distortion, as well as their combinations. These results demonstrate the superior universality and practicality of UDSE compared to advanced regression-based SE methods.

[761] Improving Speech Emotion Recognition with Mutual Information Regularized Generative Model

Chung-Soo Ahn, Rajib Rana, Sunil Sivadas, Carlos Busso, Jagath C. Rajapakse

Main category: cs.SD

TL;DR: A data augmentation framework for speech emotion recognition using cross-modal information transfer and mutual information regularization to generate quality inputs and handle multimodal data.

Details

Motivation: Speech emotion recognition suffers from limited quality-labeled training data, and existing data augmentation methods need improvement, especially for multimodal inputs.

Method: Proposed framework uses cross-modal information transfer and mutual information regularization to generate augmented data, with mutual information serving as quality indicator and ensuring modality dependency.

Result: Tested on IEMOCAP, MSP-IMPROV and MSP-Podcast datasets, the framework improved emotion prediction performance against existing works and can generate inputs without cross-modal information.

Conclusion: The proposed data augmentation framework effectively addresses data scarcity in speech emotion recognition and works for both unimodal and multimodal scenarios.

Abstract: Although speech emotion recognition (SER) research has been advanced, thanks to deep learning methods, it still suffers from obtaining inputs from large quality-labelled training data. Data augmentation methods have been attempted to mitigate this issue, generative models have shown success among them recently. We propose a data augmentation framework that is aided by cross-modal information transfer and mutual information regularization. Mutual information based metric can serve as an indicator for the quality. Furthermore, we expand this data augmentation scope to multimodal inputs, thanks to mutual information ensureing dependency between modalities. Our framework was tested on three benchmark datasets: IEMOCAP, MSP-IMPROV and MSP-Podcast. The implementation was designed to generate input features that are fed into last layer for emotion classification. Our framework improved the performance of emotion prediction against existing works. Also, we discovered that our framework is able to generate new inputs without any cross-modal information.

[762] Matchmaker: An Open-source Library for Real-time Piano Score Following and Systematic Evaluation

Jiyun Park, Carlos Cancino-Chacón, Suhit Chiruthapudi, Juhan Nam

Main category: cs.SD

TL;DR: Introduces Matchmaker, an open-source Python library for real-time music alignment that enables systematic comparison of different methods and establishes a benchmark framework for score-following research.

Details

Motivation: Addresses the lack of a unified open framework for comparing real-time music alignment models due to complexity of real-time processing, language/system dependencies, and low compatibility with existing MIR environments.

Method: Developed Matchmaker library that systematically compares methods along two dimensions: music representations and alignment methods, evaluated on large test sets of solo piano music using comprehensive metrics.

Result: Successfully created a practical tool that enables robust assessment of score-following methods on large datasets (nASAP, Batik, Vienna4x22) with comprehensive metrics.

Conclusion: Establishes a benchmark framework for score-following research while providing an easy-to-use tool that developers can integrate into applications.

Abstract: Real-time music alignment, also known as score following, is a fundamental MIR task with a long history and is essential for many interactive applications. Despite its importance, there has not been a unified open framework for comparing models, largely due to the inherent complexity of real-time processing and the language- or system-dependent implementations. In addition, low compatibility with the existing MIR environment has made it difficult to develop benchmarks using large datasets available in recent years. While new studies based on established methods (e.g., dynamic programming, probabilistic models) have emerged, most evaluations compare models only within the same family or on small sets of test data. This paper introduces Matchmaker, an open-source Python library for real-time music alignment that is easy to use and compatible with modern MIR libraries. Using this, we systematically compare methods along two dimensions: music representations and alignment methods. We evaluated our approach on a large test set of solo piano music from the (n)ASAP, Batik, and Vienna4x22 datasets with a comprehensive set of metrics to ensure robust assessment. Our work aims to establish a benchmark framework for score-following research while providing a practical tool that developers can easily integrate into their applications.

[763] Peransformer: Improving Low-informed Expressive Performance Rendering with Score-aware Discriminator

Xian He, Wei Zeng, Ye Wang

Main category: cs.SD

TL;DR: Peransformer is a transformer-based low-informed EPR system that bridges the gap between low-informed and highly-informed systems using a score-aware discriminator and achieves state-of-the-art performance.

Details

Motivation: Highly-informed EPR systems require detailed music scores which are limited and less flexible, while existing low-informed systems have suboptimal performance. There's also a lack of standardized evaluation metrics for fair comparisons between EPR systems.

Method: Uses a transformer-based architecture with a score-aware discriminator that leverages score-derived MIDI files, trained on a score-to-performance paired, note-to-note aligned MIDI dataset.

Result: Peransformer achieves state-of-the-art performance among low-informed EPR systems, validated by subjective evaluations. Also introduces generalized EPR metrics (GEM) for more direct and reliable system comparisons.

Conclusion: The proposed system successfully bridges the performance gap between low-informed and highly-informed EPR systems while providing standardized evaluation metrics for the field.

Abstract: Highly-informed Expressive Performance Rendering (EPR) systems transform music scores with rich musical annotations into human-like expressive performance MIDI files. While these systems have achieved promising results, the availability of detailed music scores is limited compared to MIDI files and are less flexible to work with using a digital audio workstation (DAW). Recent advancements in low-informed EPR systems offer a more accessible alternative by directly utilizing score-derived MIDI as input, but these systems often exhibit suboptimal performance. Meanwhile, existing works are evaluated with diverse automatic metrics and data formats, hindering direct objective comparisons between EPR systems. In this study, we introduce Peransformer, a transformer-based low-informed EPR system designed to bridge the gap between low-informed and highly-informed EPR systems. Our approach incorporates a score-aware discriminator that leverages the underlying score-derived MIDI files and is trained on a score-to-performance paired, note-to-note aligned MIDI dataset. Experimental results demonstrate that Peransformer achieves state-of-the-art performance among low-informed systems, as validated by subjective evaluations. Furthermore, we extend existing automatic evaluation metrics for EPR systems and introduce generalized EPR metrics (GEM), enabling more direct, accurate, and reliable comparisons across EPR systems.

[764] ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis

Stephen Ni-Hahn, Chao Péter Yang, Mingchen Ma, Cynthia Rudin, Simon Mak, Yue Jiang

Main category: cs.SD

TL;DR: ProGress is a novel generative music framework that combines Schenkerian analysis with diffusion modeling to create structured, interpretable music generation with superior performance over existing methods.

Details

Motivation: Existing AI music generation models lack structural cohesion in harmonic-melodic structure and are largely "black-box" models that are not musically interpretable.

Method: Adapts DiGress discrete diffusion model for music generation, incorporates Schenkerian analysis concepts, develops phrase fusion methodology, and provides user control framework for coherent compositions.

Result: Human experiments show superior performance compared to state-of-the-art methods.

Conclusion: The ProGress framework successfully addresses limitations of existing models by providing structured, interpretable music generation with user control capabilities.

Abstract: Artificial Intelligence (AI) for music generation is undergoing rapid developments, with recent symbolic models leveraging sophisticated deep learning and diffusion model algorithms. One drawback with existing models is that they lack structural cohesion, particularly on harmonic-melodic structure. Furthermore, such existing models are largely “black-box” in nature and are not musically interpretable. This paper addresses these limitations via a novel generative music framework that incorporates concepts of Schenkerian analysis (SchA) in concert with a diffusion modeling framework. This framework, which we call ProGress (Prolongation-enhanced DiGress), adapts state-of-the-art deep models for discrete diffusion (in particular, the DiGress model of Vignac et al., 2023) for interpretable and structured music generation. Concretely, our contributions include 1) novel adaptations of the DiGress model for music generation, 2) a novel SchA-inspired phrase fusion methodology, and 3) a framework allowing users to control various aspects of the generation process to create coherent musical compositions. Results from human experiments suggest superior performance to existing state-of-the-art methods.

[765] MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations

Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Xintong Hu, Yu Zhang, Li Tang, Rui Yang, Han Wang, Zongbao Zhang, Yuhan Wang, Yixuan Chen, Hankun Xu, Ke Xu, Pengfei Fan, Zhetao Chen, Yanhao Yu, Qiange Huang, Fei Wu, Zhou Zhao

Main category: cs.SD

TL;DR: MRSAudio is a large-scale multimodal spatial audio dataset with binaural and ambisonic audio, video, motion data, and annotations to advance spatial audio research.

Details

Motivation: Most existing multimodal datasets provide only monaural audio, limiting development of spatial audio generation and understanding needed for immersive technologies like VR/AR.

Method: Created MRSAudio dataset with four components (MRSLife, MRSSpeech, MRSMusic, MRSSing) containing synchronized binaural/ambisonic audio, video, motion trajectories, and fine-grained annotations.

Result: Dataset enables high-quality spatial modeling and supports five foundational tasks: audio spatialization, spatial text-to-speech, spatial singing voice synthesis, spatial music generation, and sound event localization/detection.

Conclusion: MRSAudio addresses the gap in spatial audio datasets and demonstrates utility for a broad range of spatial audio research applications.

Abstract: Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address these challenges, we introduce MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. MRSAudio spans four distinct components: MRSLife, MRSSpeech, MRSMusic, and MRSSing, covering diverse real-world scenarios. The dataset includes synchronized binaural and ambisonic audio, exocentric and egocentric video, motion trajectories, and fine-grained annotations such as transcripts, phoneme boundaries, lyrics, scores, and prompts. To demonstrate the utility and versatility of MRSAudio, we establish five foundational tasks: audio spatialization, and spatial text to speech, spatial singing voice synthesis, spatial music generation and sound event localization and detection. Results show that MRSAudio enables high-quality spatial modeling and supports a broad range of spatial audio research. Demos and dataset access are available at https://mrsaudio.github.io.

[766] Knowledge-Decoupled Functionally Invariant Path with Synthetic Personal Data for Personalized ASR

Yue Gu, Zhihao Du, Ying Shi, Jiqing Han, Yongjun He

Main category: cs.SD

TL;DR: The paper proposes KDFIP, a framework that integrates gated parameter-isolation with functionally invariant paths to balance learning from synthetic personal data, real personal data, and generic knowledge in personalized ASR models.

Details

Motivation: To address challenges in adapting ASR models to synthetic personal data without forgetting real knowledge, and adapting to personal data without forgetting generic knowledge.

Method: KDFIP framework stores generic and personalized knowledge in separate modules, applies functionally invariant paths sequentially, and uses a gating mechanism to dynamically fuse outputs.

Result: Achieves 29.38% relative character error rate reduction on target speakers while maintaining comparable generalization performance to unadapted baseline.

Conclusion: KDFIP effectively decouples learning processes and balances knowledge acquisition from different data types in personalized ASR systems.

Abstract: Fine-tuning generic ASR models with large-scale synthetic personal data can enhance the personalization of ASR models, but it introduces challenges in adapting to synthetic personal data without forgetting real knowledge, and in adapting to personal data without forgetting generic knowledge. Considering that the functionally invariant path (FIP) framework enables model adaptation while preserving prior knowledge, in this letter, we introduce FIP into synthetic-data-augmented personalized ASR models. However, the model still struggles to balance the learning of synthetic, personalized, and generic knowledge when applying FIP to train the model on all three types of data simultaneously. To decouple this learning process and further address the above two challenges, we integrate a gated parameter-isolation strategy into FIP and propose a knowledge-decoupled functionally invariant path (KDFIP) framework, which stores generic and personalized knowledge in separate modules and applies FIP to them sequentially. Specifically, KDFIP adapts the personalized module to synthetic and real personal data and the generic module to generic data. Both modules are updated along personalization-invariant paths, and their outputs are dynamically fused through a gating mechanism. With augmented synthetic data, KDFIP achieves a 29.38% relative character error rate reduction on target speakers and maintains comparable generalization performance to the unadapted ASR baseline.

[767] Unify Variables in Neural Scaling Laws for General Audio Representations via Embedding Effective Rank

Xuyao Deng, Yanjie Sun, Yong Dou, Kele Xu

Main category: cs.SD

TL;DR: This paper studies scaling laws for general audio representation learning using embedding effective rank (RankMe) as a unifying metric to quantify representation quality across various model parameters.

Details

Motivation: Scaling laws are well-established in computer vision and NLP but remain underexplored for general audio representation learning, where multiple factors jointly influence representation quality and are difficult to isolate.

Method: Systematic study using embedding effective rank (RankMe) as a label-free, information-theoretic metric to examine scaling behaviors across hyper-parameters including model size, data volume, computational budget, and architectural configurations.

Result: Empirical findings reveal a consistent power-law relationship between RankMe and representation quality, showing that embedding effective rank serves as a reliable proxy for assessing and predicting model performance.

Conclusion: The work validates classical scaling principles for general audio domain and provides a theoretically grounded framework for guiding future model scaling strategies in audio foundation models.

Abstract: Scaling laws have profoundly shaped our understanding of model performance in computer vision and natural language processing, yet their application to general audio representation learning remains underexplored. A key challenge lies in the multifactorial nature of general audio representation-representation quality is jointly influenced by variables such as audio length, embedding dimensionality, model depth, model architecture, data volume, etc., many of which are difficult to isolate or express analytically. In this work, we present a systematic study of scaling laws for general audio representations by utilizing embedding effective rank (RankMe) as a unifying metric that encapsulates the impact of diverse variables on representation quality. RankMe enables a label-free, information-theoretic quantification of audio embeddings, allowing us to examine scaling behaviors across a wide hyper-parameter space, including model size, training data volume, computational budget, architectural configurations, etc. Our empirical findings reveal a consistent power-law relationship between RankMe and representation quality, suggesting that embedding effective rank serves as a reliable proxy for assessing and predicting model performance in audio representation learning. This work not only validates the applicability of classical scaling principles to the general audio domain but also offers a theoretically grounded and empirically robust framework for guiding future model scaling strategies in audio foundation models.

[768] MARS-Sep: Multimodal-Aligned Reinforced Sound Separation

Zihan Zhang, Xize Cheng, Zhennan Jiang, Dongjie Fu, Jingyuan Chen, Zhou Zhao, Tao Jin

Main category: cs.SD

TL;DR: MARS-Sep is a reinforcement learning framework for universal sound separation that reformulates separation as decision making, using factorized Beta mask policies optimized with multimodal rewards from audio-text-vision encoders to improve semantic consistency.

Details

Motivation: Current sound separation models optimized for low-level signal metrics often produce semantically contaminated outputs that fail to suppress perceptually salient interference from acoustically similar sources.

Method: Uses reinforcement learning with factorized Beta mask policy optimized by clipped trust-region surrogate with entropy regularization and group-relative advantage normalization. Employs multimodal rewards from audio-text-vision encoder with progressive alignment scheme for fine-tuning.

Result: Consistent gains in Text-, Audio-, and Image-Queried separation across multiple benchmarks, with notable improvements in both signal metrics and semantic quality.

Conclusion: MARS-Sep successfully bridges the gap between low-level signal metrics and semantic quality in sound separation through its reinforcement learning framework and multimodal reward system.

Abstract: Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. To bridge this gap, we introduce MARS-Sep, a reinforcement learning framework that reformulates separation as decision making. Instead of simply regressing ground-truth masks, MARS-Sep learns a factorized Beta mask policy that is optimized by a clipped trust-region surrogate with entropy regularization and group-relative advantage normalization. Concretely, we sample masks from a frozen old policy, reconstruct waveforms, and update the current policy using clipped importance ratios-yielding substantially more stable and sample-efficient learning. Multimodal rewards, derived from an audio-text-vision encoder, directly incentivize semantic consistency with query prompts. We further propose a progressive alignment scheme to fine-tune this encoder, boosting its cross-modal discriminability and improving reward faithfulness. Extensive experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation, with notable improvements in signal metrics and semantic quality. Our code is available at https://anonymous.4open.science/r/MARS-Sep. Sound separation samples are available at https://mars-sep.github.io/.

[769] A Machine Learning Approach for MIDI to Guitar Tablature Conversion

Maximos Kaliakatsos-Papakostas, Gregoris Bastas, Dimos Makris, Dorien Herremans, Vassilis Katsouros, Petros Maragos

Main category: cs.SD

TL;DR: A machine learning method for transcribing MIDI musical parts into guitar tablature, focusing on playable string-fret assignments while preserving parsimonious motion between combinations, with testing on both guitar and non-guitar music.

Details

Motivation: To develop a system that can automatically generate playable guitar tablature from MIDI data, accounting for guitar-specific fingerings and motion patterns across different musical styles, even for music not originally intended for guitar.

Method: Machine learning approach that considers finger stretch limitations on fretboard, uses standard 6-string tuning, and employs data augmentation with artificial training data for handling non-guitar music.

Result: Training with augmented data improves performance even in simple cases; system shows capability to transcribe both guitar and non-guitar music, but reveals weaknesses that suggest areas for improvement.

Conclusion: The method demonstrates promising results for guitar tablature transcription, with data augmentation proving beneficial, though further improvements are needed to address identified weaknesses.

Abstract: Guitar tablature transcription consists in deducing the string and the fret number on which each note should be played to reproduce the actual musical part. This assignment should lead to playable string-fret combinations throughout the entire track and, in general, preserve parsimonious motion between successive combinations. Throughout the history of guitar playing, specific chord fingerings have been developed across different musical styles that facilitate common idiomatic voicing combinations and motion between them. This paper presents a method for assigning guitar tablature notation to a given MIDI-based musical part (possibly consisting of multiple polyphonic tracks), i.e. no information about guitar-idiomatic expressional characteristics is involved (e.g. bending etc.) The current strategy is based on machine learning and requires a basic assumption about how much fingers can stretch on a fretboard; only standard 6-string guitar tuning is examined. The proposed method also examines the transcription of music pieces that was not meant to be played or could not possibly be played by a guitar (e.g. potentially a symphonic orchestra part), employing a rudimentary method for augmenting musical information and training/testing the system with artificial data. The results present interesting aspects about what the system can achieve when trained on the initial and augmented dataset, showing that the training with augmented data improves the performance even in simple, e.g. monophonic, cases. Results also indicate weaknesses and lead to useful conclusions about possible improvements.

[770] Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap

KiHyun Nam, Jongmin Choi, Hyeongkeun Lee, Jungwoo Heo, Joon Son Chung

Main category: cs.SD

TL;DR: Diffusion-Link is a diffusion-based module that bridges the audio-text modality gap by mapping audio embeddings into text-embedding distribution, achieving state-of-the-art results in Automatic Audio Captioning.

Details

Motivation: To address the persistent audio-text modality gap that limits the benefits of coupling multimodal encoders with large language models (LLMs).

Method: A diffusion-based modality-bridging module that generatively maps audio embeddings into text-embedding distribution, implemented as a lightweight network with three residual MLP blocks trained on frozen multimodal encoder outputs.

Result: Reduces modality gap the most among prior diffusion methods, achieves state-of-the-art on AudioCaps in both zero-shot (52.5% relative gain) and fully supervised captioning (7.5% relative gain) without external knowledge.

Conclusion: Closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs.

Abstract: Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained at the output embedding from the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance https://github.com/DevKiHyun/Diffusion-Link

[771] LSZone: A Lightweight Spatial Information Modeling Architecture for Real-time In-car Multi-zone Speech Separation

Jun Chen, Shichao Hu, Jiuxin Lin, Wenjie Li, Zihan Zhang, Xingchen Li, JinJiang Liu, Longshuai Xiao, Chao Weng, Lei Xie, Zhiyong Wu

Main category: cs.SD

TL;DR: LSZone is a lightweight architecture for real-time in-car multi-zone speech separation that reduces computational cost while maintaining performance through spatial information extraction-compression and Conv-GRU crossband-narrowband processing.

Details

Motivation: Previous SpatialNet achieved good results but had high computational cost that hindered real-time applications in vehicles, creating a need for more efficient solutions.

Method: Uses spatial information extraction-compression (SpaIEC) module combining Mel spectrogram and IPD, plus an extremely lightweight Conv-GRU crossband-narrowband processing (CNP) module to efficiently model spatial information.

Result: Achieves 0.56G MACs complexity and 0.37 real-time factor (RTF) while delivering impressive performance in complex noise and multi-speaker scenarios.

Conclusion: LSZone provides an effective lightweight solution for real-time in-car multi-zone speech separation with significantly reduced computational requirements.

Abstract: In-car multi-zone speech separation, which captures voices from different speech zones, plays a crucial role in human-vehicle interaction. Although previous SpatialNet has achieved notable results, its high computational cost still hinders real-time applications in vehicles. To this end, this paper proposes LSZone, a lightweight spatial information modeling architecture for real-time in-car multi-zone speech separation. We design a spatial information extraction-compression (SpaIEC) module that combines Mel spectrogram and Interaural Phase Difference (IPD) to reduce computational burden while maintaining performance. Additionally, to efficiently model spatial information, we introduce an extremely lightweight Conv-GRU crossband-narrowband processing (CNP) module. Experimental results demonstrate that LSZone, with a complexity of 0.56G MACs and a real-time factor (RTF) of 0.37, delivers impressive performance in complex noise and multi-speaker scenarios.

[772] Automatic Music Sample Identification with Multi-Track Contrastive Learning

Alain Riou, Joan Serrà, Yuki Mitsufuji

Main category: cs.SD

TL;DR: The paper presents a self-supervised learning approach for automatic sample identification in music, using contrastive learning with artificial mixes from multi-track datasets, achieving state-of-the-art performance.

Details

Motivation: Sampling is common in modern music production, but automatically identifying sampled content and its origins is challenging. The paper aims to develop an effective method for detecting and retrieving sampled material.

Method: Uses self-supervised learning with multi-track datasets to create positive pairs of artificial mixes, and designs a novel contrastive learning objective for training the model.

Result: The method significantly outperforms previous state-of-the-art baselines, shows robustness across genres, scales well with larger reference databases, and highlights the importance of high-quality separated stems.

Conclusion: The proposed self-supervised contrastive learning approach is highly effective for automatic sample identification, demonstrating superior performance, genre robustness, and scalability compared to existing methods.

Abstract: Sampling, the technique of reusing pieces of existing audio tracks to create new music content, is a very common practice in modern music production. In this paper, we tackle the challenging task of automatic sample identification, that is, detecting such sampled content and retrieving the material from which it originates. To do so, we adopt a self-supervised learning approach that leverages a multi-track dataset to create positive pairs of artificial mixes, and design a novel contrastive learning objective. We show that such method significantly outperforms previous state-of-the-art baselines, that is robust to various genres, and that scales well when increasing the number of noise songs in the reference database. In addition, we extensively analyze the contribution of the different components of our training pipeline and highlight, in particular, the need for high-quality separated stems for this task.

[773] SS-DPPN: A self-supervised dual-path foundation model for the generalizable cardiac audio representation

Ummy Maria Muna, Md Mehedi Hasan Shawon, Md Jobayer, Sumaiya Akter, Md Rakibul Hasan, Md. Golam Rabiul Alam

Main category: cs.SD

TL;DR: SS-DPPN is a self-supervised foundation model for cardiac audio analysis that uses dual-path contrastive learning on 1D waveforms and 2D spectrograms, achieving state-of-the-art performance with reduced labeled data requirements.

Details

Motivation: Automated phonocardiogram analysis is crucial for early cardiovascular disease diagnosis, but supervised deep learning is limited by scarce expert-annotated data.

Method: Proposes a dual-path contrastive learning architecture processing 1D waveforms and 2D spectrograms with hybrid loss, combined with prototypical network for metric learning in downstream tasks.

Result: Achieves state-of-the-art performance on four cardiac audio benchmarks, demonstrates exceptional data efficiency with 3x reduction in labeled data, and generalizes successfully to lung sound classification and heart rate estimation.

Conclusion: SS-DPPN is validated as a robust, reliable, and scalable foundation model for physiological signals.

Abstract: The automated analysis of phonocardiograms is vital for the early diagnosis of cardiovascular disease, yet supervised deep learning is often constrained by the scarcity of expert-annotated data. In this paper, we propose the Self-Supervised Dual-Path Prototypical Network (SS-DPPN), a foundation model for cardiac audio representation and classification from unlabeled data. The framework introduces a dual-path contrastive learning based architecture that simultaneously processes 1D waveforms and 2D spectrograms using a novel hybrid loss. For the downstream task, a metric-learning approach using a Prototypical Network was used that enhances sensitivity and produces well-calibrated and trustworthy predictions. SS-DPPN achieves state-of-the-art performance on four cardiac audio benchmarks. The framework demonstrates exceptional data efficiency with a fully supervised model on three-fold reduction in labeled data. Finally, the learned representations generalize successfully across lung sound classification and heart rate estimation. Our experiments and findings validate SS-DPPN as a robust, reliable, and scalable foundation model for physiological signals.

[774] Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR

Ling Sun, Charlotte Zhu, Shuju Shi

Main category: cs.SD

TL;DR: Fine-tuning Whisper for L2 learners reduces average WER but widens proficiency gaps. Proposed proficiency-aware multitask learning and targeted augmentation reduce WER by 29.4% and insertion/deletion errors by 58.6% while narrowing proficiency disparities.

Details

Motivation: General-purpose ASR underperforms for atypical speakers like L2 learners, reinforcing bias and limiting educational and accessibility applications.

Method: Two strategies: (1) proficiency-aware multitask learning jointly optimizing ASR with proficiency classification, (2) targeted augmentation using spectrogram masking on low-proficiency speech to counter dataset imbalance.

Result: Reduced WER by up to 29.4% (relative) and insertion/deletion errors by up to 58.6% (relative). Both strategies consistently narrow proficiency gaps despite severe dataset imbalance.

Conclusion: The proposed approaches advance equitable ASR for L2 learners by reducing performance disparities across proficiency levels while maintaining overall accuracy improvements.

Abstract: General-purpose ASR underperforms for atypical speakers, such as L2 learners, reinforcing bias and limiting use in education and accessibility. Using the CEFR-graded Speak and Improve corpus, we show that naive fine-tuning of Whisper reduces average WER but simultaneously widens disparities and disproportionately harms lower-level learners. To address this, we propose two strategies: (i) proficiency-aware multitask learning, jointly optimizing ASR with proficiency classification, and (ii) targeted augmentation, applying spectrogram masking to low-proficiency speech to counter imbalance. These approaches reduce WER by up to 29.4 percent (relative) and insertion/deletion errors by as much as 58.6 percent (relative). Crucially, despite the severe imbalance of the dataset reflecting real-world distributions, both strategies consistently narrow proficiency gaps, advancing equitable ASR for L2 learners.

[775] Dual Data Scaling for Robust Two-Stage User-Defined Keyword Spotting

Zhiqi Ai, Han Cheng, Yuxin Wang, Shiyi Mu, Shugong Xu, Yongjin Zhou

Main category: cs.SD

TL;DR: DS-KWS is a two-stage keyword spotting framework combining CTC-based candidate detection with QbyT-based verification, enhanced by dual data scaling strategy for improved performance.

Details

Motivation: To develop a robust user-defined keyword spotting system that can handle confusable words and achieve high performance with minimal false alarms.

Method: Two-stage framework: (1) CTC-based method with streaming phoneme search for candidate segment location, (2) QbyT-based method with phoneme matcher for verification at phoneme and utterance levels. Uses dual data scaling: expanding ASR corpus from 460 to 1,460 hours and leveraging 155k anchor classes for phoneme matcher training.

Result: Significantly outperforms existing methods: 6.13% EER and 97.85% AUC on LibriPhrase Hard subset. On Hey-Snips, achieves zero-shot performance comparable to full-shot trained models with 99.13% recall at one false alarm per hour.

Conclusion: DS-KWS framework with dual data scaling strategy effectively improves keyword spotting performance, handling confusable words well and achieving state-of-the-art results in both standard and zero-shot scenarios.

Abstract: In this paper, we propose DS-KWS, a two-stage framework for robust user-defined keyword spotting. It combines a CTC-based method with a streaming phoneme search module to locate candidate segments, followed by a QbyT-based method with a phoneme matcher module for verification at both the phoneme and utterance levels. To further improve performance, we introduce a dual data scaling strategy: (1) expanding the ASR corpus from 460 to 1,460 hours to strengthen the acoustic model; and (2) leveraging over 155k anchor classes to train the phoneme matcher, significantly enhancing the distinction of confusable words. Experiments on LibriPhrase show that DS-KWS significantly outperforms existing methods, achieving 6.13% EER and 97.85% AUC on the Hard subset. On Hey-Snips, it achieves zero-shot performance comparable to full-shot trained models, reaching 99.13% recall at one false alarm per hour.

[776] ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

Main category: cs.SD

TL;DR: ParsVoice is the largest Persian speech corpus for TTS, created from 2,000 audiobooks using automated pipeline with BERT-based sentence detection and quality assessment, yielding 1,804 hours of high-quality speech from 470+ speakers.

Details

Motivation: Persian language is severely underrepresented in speech corpora despite being spoken by over 100 million people, creating limitations for developing Persian speech technologies compared to English counterparts.

Method: Automated pipeline that transforms raw audiobook content into TTS-ready data using BERT-based sentence completion detector, binary search boundary optimization for precise audio-text alignment, and multi-dimensional quality assessment frameworks tailored to Persian.

Result: Processed 2,000 audiobooks yielding 3,526 hours of clean speech, filtered to 1,804-hour high-quality subset with over 470 speakers - the largest high-quality Persian speech dataset with speaker diversity and audio quality comparable to major English corpora.

Conclusion: ParsVoice addresses the Persian speech data gap and is publicly available to accelerate development of Persian speech technologies, serving as a template for other low-resource languages.

Abstract: Persian Language, despite being spoken by over 100 million people worldwide, remains severely underrepresented in high-quality speech corpora, particularly for text-to-speech (TTS) synthesis applications. Existing Persian speech datasets are typically smaller than their English counterparts, which creates a key limitation for developing Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for TTS applications. We created an automated pipeline that transforms raw audiobook content into TTS-ready data, incorporating components such as a BERT-based sentence completion detector, a binary search boundary optimization method for precise audio-text alignment, and multi-dimensional quality assessment frameworks tailored to Persian. The pipeline processes 2,000 audiobooks, yielding 3,526 hours of clean speech, which was further filtered into a 1,804-hour high-quality subset suitable for TTS, featuring more than 470 speakers. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora. The complete dataset has been made publicly available to accelerate the development of Persian speech technologies and to serve as a template for other low-resource languages. The ParsVoice dataset is publicly available at ParsVoice (https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice).

[777] FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec

Yurii Halychanskyi, Cameron Churchwell, Yutong Wen, Volodymyr Kindratenko

Main category: cs.SD

TL;DR: An accent conversion framework with explicit user control over modification strength, balancing accent conversion with speaker identity preservation by targeting pronunciation while keeping suprasegmental features intact.

Details

Motivation: Previous accent conversion methods lack explicit control over modification degree, which is crucial since accent modification can alter perceived speaker identity. There's a need to balance conversion strength with identity preservation.

Method: An AC framework with explicit user-controllable parameter for accent modification that targets pronunciation while preserving suprasegmental cues like intonation and phoneme durations.

Result: Performance comparable to recent AC systems, stronger preservation of speaker identity, and unique support for controllable accent conversion.

Conclusion: The proposed framework successfully provides controllable accent conversion while maintaining speaker identity better than previous methods.

Abstract: Previous accent conversion (AC) methods, including foreign accent conversion (FAC), lack explicit control over the degree of modification. Because accent modification can alter the perceived speaker identity, balancing conversion strength and identity preservation is crucial. We present an AC framework that provides an explicit, user-controllable parameter for accent modification. The method targets pronunciation while preserving suprasegmental cues such as intonation and phoneme durations. Results show performance comparable to recent AC systems, stronger preservation of speaker identity, and unique support for controllable accent conversion.

Matteo Scerbo, Sebastian J. Schlecht, Randall Ali, Lauri Savioja, Enzo De Sena

Main category: cs.SD

TL;DR: MoD-ART is a novel real-time method for modeling position-dependent late reverberation in complex acoustic environments using modal decomposition of acoustic radiance transfer.

Details

Motivation: Real-time modeling of late reverberation is challenging in interactive applications with multiple moving sound sources and listeners, especially in geometrically complex environments with uneven energy absorption where reverberation depends on positions.

Method: Based on acoustic radiance transfer, extracts energy decay modes and their positional relationships with sources and listeners through modal decomposition.

Result: MoD-ART efficiently handles complex scenarios, captures multiple decay slopes and flutter echoes, and shows favorable computational complexity compared to ray-tracing.

Conclusion: MoD-ART provides an effective approach for real-time modeling of position-dependent late reverberation in complex acoustic environments with moving sources and listeners.

Abstract: Modeling late reverberation in real-time interactive applications is a challenging task when multiple sound sources and listeners are present in the same environment. This is especially problematic when the environment is geometrically complex and/or features uneven energy absorption (e.g. coupled volumes), because in such cases the late reverberation is dependent on the sound sources’ and listeners’ positions, and therefore must be adapted to their movements in real time. We present a novel approach to the task, named modal decomposition of acoustic radiance transfer (MoD-ART), which can handle highly complex scenarios with efficiency. The approach is based on the geometrical acoustics method of acoustic radiance transfer, from which we extract a set of energy decay modes and their positional relationships with sources and listeners. In this paper, we describe the physical and mathematical significance of MoD-ART, highlighting its advantages and applicability to different scenarios. Through an analysis of the method’s computational complexity, we show that it compares very favorably with ray-tracing. We also present simulation results showing that MoD-ART can capture multiple decay slopes and flutter echoes.

[779] MSRBench: A Benchmarking Dataset for Music Source Restoration

Yongyi Zang, Jiarui Hai, Wanying Ge, Qiuqiang Kong, Zheqi Dai, Helin Wang, Yuki Mitsufuji, Mark D. Plumbley

Main category: cs.SD

TL;DR: MSRBench is the first benchmark for Music Source Restoration (MSR) that enables evaluation of both separation accuracy and restoration fidelity using raw stem-mixture pairs created by professional engineers, with real-world degradations.

Details

Motivation: Existing benchmarks cannot measure restoration fidelity - synthetic datasets use unrealistic mixtures while real production datasets lack clean references, creating a gap in evaluating music source restoration.

Method: Created MSRBench with raw stem-mixture pairs across eight instrument classes, professionally mixed, and augmented with twelve real-world degradations including analog artifacts, acoustic environments, and lossy codecs.

Result: Baseline experiments with U-Net and BSRNN achieved SI-SNR of -37.8 dB and -23.4 dB respectively, with perceptual quality (FAD CLAP) around 0.7-0.8, showing significant room for improvement.

Conclusion: The benchmark demonstrates the need for restoration-specific architectures and provides a foundation for advancing music source restoration research.

Abstract: Music Source Restoration (MSR) extends source separation to realistic settings where signals undergo production effects (equalization, compression, reverb) and real-world degradations, with the goal of recovering the original unprocessed sources. Existing benchmarks cannot measure restoration fidelity: synthetic datasets use unprocessed stems but unrealistic mixtures, while real production datasets provide only already-processed stems without clean references. We present MSRBench, the first benchmark explicitly designed for MSR evaluation. MSRBench contains raw stem-mixture pairs across eight instrument classes, where mixtures are produced by professional mixing engineers. These raw-processed pairs enable direct evaluation of both separation accuracy and restoration fidelity. Beyond controlled studio conditions, the mixtures are augmented with twelve real-world degradations spanning analog artifacts, acoustic environments, and lossy codecs. Baseline experiments with U-Net and BSRNN achieve SI-SNR of -37.8 dB and -23.4 dB respectively, with perceptual quality (FAD CLAP) around 0.7-0.8, demonstrating substantial room for improvement and the need for restoration-specific architectures.

[780] Joint Source-Environment Adaptation of Data-Driven Underwater Acoustic Source Ranging Based on Model Uncertainty

Dariush Kari, Hari Vishnu, Andrew C. Singer

Main category: cs.SD

TL;DR: Proposes a method to adapt pre-trained underwater acoustic localization models to unseen environments using implied uncertainty from output peaks, without requiring labeled target data or original training data.

Details

Motivation: Pre-trained deep learning models suffer performance degradation due to environmental mismatch between training and test data in underwater acoustic localization, and existing methods require labeled data from target environments.

Method: Quantify implied uncertainty based on number of model output peaks, partition test samples into certain/uncertain sets, use certain samples to improve labeling for uncertain samples, and integrate signal energy estimates for adaptation.

Result: Extensive validation with real experimental and synthetic data shows significant improvements in model prediction accuracy for underwater acoustic localization in diverse, noisy environments.

Conclusion: The proposed uncertainty-based adaptation method effectively enhances pre-trained models for underwater acoustic localization in unknown environments without requiring additional labeled data.

Abstract: Adapting pre-trained deep learning models to new and unknown environments remains a major challenge in underwater acoustic localization. We show that although the performance of pre-trained models suffers from mismatch between the training and test data, they generally exhibit a higher uncertainty in environments where there is more mismatch. Additionally, in the presence of environmental mismatch, spurious peaks can appear in the output of classification-based localization approaches, which inspires us to define and use a method to quantify the “implied uncertainty” based on the number of model output peaks. Leveraging this notion of implied uncertainty, we partition the test samples into sets with more certain and less certain samples, and implement a method to adapt the model to new environments by using the certain samples to improve the labeling for uncertain samples, which helps to adapt the model. Thus, using this efficient method for model uncertainty quantification, we showcase an innovative approach to adapt a pre-trained model to unseen underwater environments at test time. This eliminates the need for labeled data from the target environment or the original training data. This adaptation is enhanced by integrating an independent estimate based on the received signal energy. We validate the approach extensively using real experimental data, as well as synthetic data consisting of model-generated signals with real ocean noise. The results demonstrate significant improvements in model prediction accuracy, underscoring the potential of the method to enhance underwater acoustic localization in diverse, noisy, and unknown environments.

[781] VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents

Jiliang Hu, Wenfu Wang, Zuchao Li, Chenxing Li, Yiyang Zhao, Hanzhao Li, Liqiang Zhang, Meng Yu, Dong Yu

Main category: cs.SD

TL;DR: VCB Bench is a new Chinese benchmark for evaluating large audio language models using real human speech across instruction following, knowledge understanding, and robustness dimensions.

Details

Motivation: Existing audio language model benchmarks are limited by being English-centric, relying on synthetic speech, and lacking comprehensive multi-dimensional evaluation.

Method: Built VCB Bench - a high-quality Chinese benchmark using real human speech that evaluates models from three perspectives: instruction following (including speech-level control), knowledge understanding (general knowledge, reasoning, daily dialogue), and robustness (stability under perturbations).

Result: Experiments on representative LALMs revealed notable performance gaps, highlighting areas for future improvement in Chinese voice conversational models.

Conclusion: VCB Bench provides a reproducible, fine-grained evaluation framework with standardized methodology and practical insights for advancing Chinese voice conversational models.

Abstract: Recent advances in large audio language models (LALMs) have greatly enhanced multimodal conversational systems. However, existing benchmarks remain limited – they are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation across multiple dimensions. To address these gaps, we present Voice Chat Bot Bench (VCB Bench) – a high-quality Chinese benchmark built entirely on real human speech. VCB Bench evaluates LALMs from three complementary perspectives: instruction following (including speech-level control beyond text commands), knowledge understanding (general knowledge, reasoning, and daily dialogue), and robustness (stability under perturbations in content, environment, and speaker traits). Experiments on representative LALMs reveal notable performance gaps and highlight future directions for improvement. VCB Bench provides a reproducible and fine-grained evaluation framework, offering standardized methodology and practical insights for advancing Chinese voice conversational models.

[782] Data Standards in Audiology: A Mixed-Methods Exploration of Community Perspectives and Implementation Considerations

Charlotte Vercammen, Antje Heinrich, Christophe Lesimple, Alessia Paglialonga, Jan-Willem A. Wasmann, Mareike Buhl

Main category: cs.SD

TL;DR: This study explores data standardization in audiology through a mixed-methods approach including community surveys and expert panel discussions, finding strong community support for standardization efforts despite limited current participation.

Details

Motivation: To address conceptual issues around data standardization in audiology and understand the computational audiology community's current understanding, needs, and preferences regarding data standards.

Method: Mixed-methods approach: 1) review of existing standardization efforts; 2) survey of 82 computational audiology community members; 3) expert panel discussion with five experts at the 2024 Virtual Conference of Computational Audiology.

Result: 90% of respondents expressed willingness to follow or contribute to standardization efforts, though few were familiar with existing initiatives. The panel discussed relevant initiatives (OMOP, openEHR, NOAH) and identified challenges around harmonization and opportunities for alignment with other medical fields.

Conclusion: The study provides guidance for implementing interoperable data standards in audiology, highlighting community support, key issues to address, and suggesting future paths for standardization work.

Abstract: Objective: This study addresses conceptual issues around data standardisation in audiology, and outlines steps toward achieving it. It reports a survey of the computational audiology community on their current understanding, needs, and preferences concerning data standards. Based on survey findings and a panel discussion, recommendations are made concerning moving forward with standardisation in audiology. Design: Mixed-methods: 1) review of existing standardisation efforts; 2) a survey of the computational audiology community; 3) expert panel discussion in a dedicated session at the 2024 Virtual Conference of Computational Audiology. Sample: Survey: 82 members of the global community; Panel discussion: five experts. Results: A prerequisite for any global audiology database are agreed data standards. Although many are familiar with the general idea, few know of existing initiatives, or have actively participated in them. Ninety percent of respondents expressed willingness to follow or contribute to standardisation efforts. The panel discussed relevant initiatives (e.g. OMOP, openEHR, NOAH) and explored both challenges (around harmonisation) and opportunities (alignment with other medical fields and conversion among approaches). Conclusions: Combining conceptual discussion with stakeholder views, the study offers guidance for implementing interoperable data standards in audiology. It highlights community support, key issues to address, and suggests paths for future work.

[783] Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker

Cheng Gong, Chunyu Qiang, Tianrui Wang, Yu Jiang, Yuheng Lu, Ruihao Jing, Xiaoxiao Miao, Xiaolei Zhang, Longbiao Wang, Jianwu Dang

Main category: cs.SD

TL;DR: EMM-TTS is a two-stage cross-lingual emotional TTS framework that uses perturbed SSL representations to disentangle emotion and timbre, achieving superior naturalness, emotion transfer, and timbre consistency across languages.

Details

Motivation: Cross-lingual emotional TTS faces challenges in controlling emotion, timbre, and language simultaneously due to high entanglement between emotion and timbre in speech signals.

Method: Two-stage framework: first stage encodes prosodic cues for emotion, second stage restores timbre from perturbed SSL representations. Uses speaker perturbation strategies (formant shifting, anonymization), SCL, SEALN modules, and combines explicit acoustic features with pretrained latent features.

Result: Comprehensive evaluations show EMM-TTS achieves superior naturalness, emotion transferability, and timbre consistency across languages compared to other methods.

Conclusion: The proposed EMM-TTS framework effectively addresses cross-lingual emotional TTS challenges through disentangled emotion-timbre modeling and achieves state-of-the-art performance.

Abstract: Cross-lingual emotional text-to-speech (TTS) aims to produce speech in one language that captures the emotion of a speaker from another language while maintaining the target voice’s timbre. This process of cross-lingual emotional speech synthesis presents a complex challenge, necessitating flexible control over emotion, timbre, and language. However, emotion and timbre are highly entangled in speech signals, making fine-grained control challenging. To address this issue, we propose EMM-TTS, a novel two-stage cross-lingual emotional speech synthesis framework based on perturbed self-supervised learning (SSL) representations. In the first stage, the model explicitly and implicitly encodes prosodic cues to capture emotional expressiveness, while the second stage restores the timbre from perturbed SSL representations. We further investigate the effect of different speaker perturbation strategies-formant shifting and speaker anonymization-on the disentanglement of emotion and timbre. To strengthen speaker preservation and expressive control, we introduce Speaker Consistency Loss (SCL) and Speaker-Emotion Adaptive Layer Normalization (SEALN) modules. Additionally, we find that incorporating explicit acoustic features (e.g., F0, energy, and duration) alongside pretrained latent features improves voice cloning performance. Comprehensive multi-metric evaluations, including both subjective and objective measures, demonstrate that EMM-TTS achieves superior naturalness, emotion transferability, and timbre consistency across languages.

[784] MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction

Yunkee Chae, Kyogu Lee

Main category: cs.SD

TL;DR: MGE-LDM is a unified latent diffusion framework that enables simultaneous music generation, source imputation, and query-driven source separation in a single model without predefined instrument categories.

Details

Motivation: To overcome limitations of prior approaches constrained to fixed instrument classes and create a unified framework for multiple music processing tasks.

Method: Learns joint distribution over full mixtures, submixtures, and stems using latent diffusion model; formulates separation and imputation as conditional inpainting tasks in latent space.

Result: Enables complete mixture generation, partial generation (source imputation), and text-conditioned extraction of arbitrary sources; supports flexible, class-agnostic manipulation of instrument sources.

Conclusion: MGE-LDM provides a unified approach for multiple music processing tasks, trained jointly across heterogeneous datasets without relying on predefined instrument categories.

Abstract: We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture generation, (2) partial generation (i.e., source imputation), and (3) text-conditioned extraction of arbitrary sources. By formulating both separation and imputation as conditional inpainting tasks in the latent space, our approach supports flexible, class-agnostic manipulation of arbitrary instrument sources. Notably, MGE-LDM can be trained jointly across heterogeneous multi-track datasets (e.g., Slakh2100, MUSDB18, MoisesDB) without relying on predefined instrument categories. Audio samples are available at our project page: https://yoongi43.github.io/MGELDM_Samples/.

[785] Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning

Kuan-Yi Lee, Tsung-En Lin, Hung-Yi Lee

Main category: cs.SD

TL;DR: Audio-Maestro is a tool-augmented audio reasoning framework that enables audio-language models to autonomously call external tools and integrate timestamped outputs into reasoning, improving accuracy over end-to-end approaches.

Details

Motivation: Current large multimodal models rely solely on end-to-end reasoning, which limits interpretability and accuracy for tasks requiring structured knowledge or specialized signal analysis.

Method: A framework that allows audio-language models to autonomously call external tools and integrate their timestamped outputs into the reasoning process, enabling specialized audio signal analysis rather than pure end-to-end inference.

Result: Consistent improvements in general audio reasoning performance: Gemini-2.5-flash’s accuracy on MMAU-Test rose from 67.4% to 72.1%, DeSTA-2.5 from 58.3% to 62.8%, and GPT-4o from 60.8% to 63.9%.

Conclusion: Audio-Maestro is the first framework to integrate structured tool output into large audio language model reasoning processes, demonstrating significant performance improvements over end-to-only approaches.

Abstract: Recent advancements in large multimodal models (LMMs) have shown strong capabilities in audio understanding. However, most systems rely solely on end-to-end reasoning, limiting interpretability and accuracy for tasks that require structured knowledge or specialized signal analysis. In this work, we present Audio-Maestro – a tool-augmented audio reasoning framework that enables audio-language models to autonomously call external tools and integrate their timestamped outputs into the reasoning process. This design allows the model to analyze, transform, and interpret audio signals through specialized tools rather than relying solely on end-to-end inference. Experiments show that Audio-Maestro consistently improves general audio reasoning performance: Gemini-2.5-flash’s average accuracy on MMAU-Test rises from 67.4% to 72.1%, DeSTA-2.5 from 58.3% to 62.8%, and GPT-4o from 60.8% to 63.9%. To our knowledge, Audio-Maestro is the first framework to integrate structured tool output into the large audio language model reasoning process.

[786] $\texttt{AVROBUSTBENCH}$: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time

Sarthak Kumar Maharana, Saksham Singh Kushwaha, Baoming Zhang, Adrian Rodriguez, Songtao Wei, Yapeng Tian, Yunhui Guo

Main category: cs.SD

TL;DR: AVROBUSTBENCH is a comprehensive benchmark for evaluating test-time robustness of audio-visual recognition models, featuring four datasets with 75 co-occurring bimodal corruptions. The study shows current models decline in robustness with increasing corruption severity, and proposes AV2C, a simple test-time adaptation method.

Details

Motivation: Existing robustness benchmarks focus on single modalities, making them insufficient for assessing audio-visual models where shifts can occur simultaneously in both modalities in real-world scenarios.

Method: Created AVROBUSTBENCH with four audio-visual datasets (AUDIOSET-2C, VGGSOUND-2C, KINETICS-2C, EPICKITCHENS-2C) incorporating 75 co-occurring and correlated bimodal corruptions. Evaluated state-of-the-art supervised and self-supervised models, and proposed AV2C test-time adaptation approach that enables on-the-fly cross-modal fusion by penalizing high-entropy samples.

Result: State-of-the-art audio-visual models exhibit declining robustness as corruption severity increases. Online test-time adaptation methods offer minimal improvements under bimodal corruptions. The proposed AV2C approach achieves improvements on VGGSOUND-2C dataset.

Conclusion: AVROBUSTBENCH provides a comprehensive framework to evaluate audio-visual model robustness and steer development of more effective test-time adaptation approaches for real-world scenarios with simultaneous multimodal corruptions.

Abstract: While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time remains not fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur $\textit{simultaneously}$ in both audio and visual modalities, we introduce $\texttt{AVROBUSTBENCH}$, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. $\texttt{AVROBUSTBENCH}$ comprises four audio-visual benchmark datasets, $\texttt{AUDIOSET-2C}$, $\texttt{VGGSOUND-2C}$, $\texttt{KINETICS-2C}$, and $\texttt{EPICKITCHENS-2C}$, each incorporating 75 bimodal audio-visual corruptions that are $\textit{co-occurring}$ and $\textit{correlated}$. Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases. Furthermore, online test-time adaptation (TTA) methods, on $\texttt{VGGSOUND-2C}$ and $\texttt{KINETICS-2C}$, offer minimal improvements in performance under bimodal corruptions. We further propose $\texttt{AV2C}$, a simple TTA approach enabling on-the-fly cross-modal fusion by penalizing high-entropy samples, which achieves improvements on $\texttt{VGGSOUND-2C}$. We hope that $\texttt{AVROBUSTBENCH}$ will steer the development of more effective and robust audio-visual TTA approaches. Our code is available $\href{https://github.com/sarthaxxxxx/AV-C-Robustness-Benchmark}{here}$.

[787] BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis

Jingyuan Xing, Mingru Yang, Zhipeng Li, Xiaofen Xing, Xiangmin Xu

Main category: cs.SD

TL;DR: BridgeTTS is a novel autoregressive text-to-speech framework that addresses speed-quality trade-offs and supervision mismatch in zero-shot TTS systems using dual speech representation and joint optimization.

Details

Motivation: To overcome limitations in existing AR-based zero-shot TTS systems: inherent speed-quality trade-off (sequential generation reduces frame rates or enriches tokens at efficiency cost) and text-oriented supervision mismatch (uniform token error penalization without considering acoustic similarity).

Method: Proposes BridgeTTS framework built on dual speech representation paradigm BridgeCode, which reduces AR iterations by predicting sparse tokens while reconstructing rich continuous features. Uses joint optimization of token-level and feature-level objectives.

Result: BridgeTTS achieves competitive quality and speaker similarity while significantly accelerating synthesis compared to existing methods.

Conclusion: The proposed BridgeTTS framework effectively addresses speed-quality trade-offs in zero-shot TTS through dual representation and joint optimization, enabling faster synthesis without compromising quality.

Abstract: Autoregressive (AR) frameworks have recently achieved remarkable progress in zero-shot text-to-speech (TTS) by leveraging discrete speech tokens and large language model techniques. Despite their success, existing AR-based zero-shot TTS systems face two critical limitations: (i) an inherent speed-quality trade-off, as sequential token generation either reduces frame rates at the cost of expressiveness or enriches tokens at the cost of efficiency, and (ii) a text-oriented supervision mismatch, as cross-entropy loss penalizes token errors uniformly without considering the fine-grained acoustic similarity among adjacent tokens. To address these challenges, we propose BridgeTTS, a novel AR-TTS framework built upon the dual speech representation paradigm BridgeCode. BridgeTTS reduces AR iterations by predicting sparse tokens while reconstructing rich continuous features for high-quality synthesis. Joint optimization of token-level and feature-level objectives further enhances naturalness and intelligibility. Experiments demonstrate that BridgeTTS achieves competitive quality and speaker similarity while significantly accelerating synthesis. Speech demos are available at https://test1562.github.io/demo/.

[788] GRAM: Spatial general-purpose audio representation models for real-world applications

Goksenin Yuksel, Marcel van Gerven, Kiki van der Heijden

Main category: cs.SD

TL;DR: GRAM is a general-purpose real-world audio model that uses multi-channel masked auto-encoders to learn spatial audio representations from simulated real-world scenes, outperforming state-of-the-art models on audio tasks and sound localization.

Details

Motivation: Current audio foundation models are typically trained on dry, single-channel audio and fail to handle real-world acoustic environments with reverberation, noise, and spatial characteristics, limiting their practical applications.

Method: Proposed GRAM using a multi-channel masked auto-encoder approach to learn spatial audio representations from high-quality simulated real-world scenes, supporting both binaural (2-channel) and Ambisonics (4-channel) formats.

Result: GRAM surpasses all state-of-the-art self-supervised audio foundation models and speech models on both HEAR and Nat-HEAR benchmarks, achieves state-of-the-art localization performance exceeding supervised approaches, and demonstrates robust transfer to real-world recordings.

Conclusion: GRAM represents a significant advancement towards robust, spatial audio foundation models for real-world applications, effectively addressing limitations of current models in handling spatial audio and real-world acoustic environments.

Abstract: Although audio foundations models have seen great progress on a wide variety of tasks, their application in real-world acoustic environments with reverberation and noise has been less successful. Moreover, as audio foundation models are typically trained on dry, single-channel audio clips, the inherent spatial nature of real-world sound scenes is overlooked and tasks involving sound localization ruled out. To address these limitations, we propose GRAM: a General-purpose Real-world Audio Model utilizing a multi-channel masked auto-encoder approach to efficiently learn spatial audio representations from high-quality simulated real-world scenes. To evaluate the performance of GRAM and other audio foundation models in real-world sound scenes, we release Nat-HEAR: A naturalistic version of the HEAR benchmark suite comprising a simulated real-world version, as well as two new sound localization tasks. We show that the performance of GRAM surpasses all state-of-the-art self-supervised audio foundation models and speech models on both HEAR and Nat-HEAR, while using only a fraction of the training data. GRAM also showcases state-of-the-art localization performance, surpassing even supervised sound localization approaches, and can be flexibly applied either to a two-channel, binaural sound format or a four-channel, Ambisonics format. Validating GRAM’s performance on real-world sound recordings demonstrates robust transfer to real-world scenes. Taken together, GRAM presents a significant advancement towards robust, spatial audio foundation models for real-world applications.

[789] PicoAudio2: Temporal Controllable Text-to-Audio Generation with Natural Language Description

Zihao Zheng, Zeyu Xie, Xuenan Xu, Wen Wu, Chao Zhang, Mengyue Wu

Main category: cs.SD

TL;DR: PicoAudio2 is a framework that improves temporal-controllable text-to-audio generation by using real audio-text data with timestamp annotations and an enhanced architecture that combines fine-grained timestamp information with free-text input.

Details

Motivation: Existing controllable text-to-audio generation models suffer from poor audio quality due to reliance on synthetic data and are limited to closed vocabularies, preventing open-ended free-text control.

Method: Uses a grounding model to annotate event timestamps in real audio-text datasets, combines real and simulation data for training, and proposes an enhanced architecture that integrates timestamp matrices with free-text input.

Result: PicoAudio2 demonstrates superior performance in temporal controllability and audio quality compared to existing methods.

Conclusion: The framework successfully addresses data and architectural limitations in temporal-controllable text-to-audio generation, enabling better audio quality and open-ended free-text control.

Abstract: While recent work in controllable text-to-audio (TTA) generation has achieved fine-grained control through timestamp conditioning, its scope remains limited by audio quality and input format. These models often suffer from poor audio quality in real datasets due to sole reliance on synthetic data. Moreover, some models are constrained to a closed vocabulary of sound events, preventing them from controlling audio generation for open-ended, free-text queries. This paper introduces PicoAudio2, a framework that advances temporal-controllable TTA by mitigating these data and architectural limitations. Specifically, we use a grounding model to annotate event timestamps of real audio-text datasets to curate temporally-strong real data, in addition to simulation data from existing works. The model is trained on the combination of real and simulation data. Moreover, we propose an enhanced architecture that integrates the fine-grained information from a timestamp matrix with coarse-grained free-text input. Experiments show that PicoAudio2 exhibits superior performance in terms of temporal controllability and audio quality.

[790] WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms

Goksenin Yuksel, Pierre Guetschel, Michael Tangermann, Marcel van Gerven, Kiki van der Heijden

Main category: cs.SD

TL;DR: WavJEPA is a waveform-based Joint-Embedding Predictive Architecture that outperforms state-of-the-art time-domain audio foundation models across various tasks with fewer computational resources, while WavJEPA-Nat extends this to multi-channel processing for robustness in noisy environments.

Details

Motivation: To overcome limitations of spectrogram-based audio representation learning (long latency, phase information loss) and address the gap where self-supervised speech representation learning from waveforms hasn't achieved similar success for general-purpose audio representation learning.

Method: Proposes WavJEPA using high-level semantic representation learning instead of speech unit/token level learning, and extends it to WavJEPA-Nat with multi-channel processing trained on simulated naturalistic scenes for noise and reverberation robustness.

Result: Substantially outperforms state-of-the-art time-domain audio foundation models across various downstream benchmark tasks while requiring fewer computational resources. WavJEPA-Nat shows high robustness to reverberation and noise.

Conclusion: Demonstrates feasibility and computational efficiency of general-purpose audio representation learning from raw waveforms, enabling low-latency, robust time-domain audio foundation models for real-world applications.

Abstract: Learning audio representations from raw waveforms overcomes key limitations of spectrogram-based audio representation learning, such as the long latency of spectrogram computation and the loss of phase information. Yet, while self-supervised speech representation learning from raw waveforms has been remarkably successful, these approaches have not achieved similar feats for general-purpose audio representation learning from waveforms. Here, we propose WavJEPA, a waveform-based version of the Joint-Embedding Predictive Architecture. WavJEPA leverages high-level semantic representation learning to tackle the shortcomings of representation learning at the speech unit or token level. We show that this approach substantially outperforms state-of-the-art time-domain audio foundation models across a wide variety of downstream benchmark tasks, while requiring considerably fewer computational resources. Additionally, to overcome the performance drop that time-domain models typically exhibit in noisy and reverberant real-world acoustic environments, we present WavJEPA-Nat. WavJEPA-Nat is a multi-channel extension of the WavJEPA architecture trained on simulated naturalistic scenes. We find that WavJEPA-Nat is highly robust to reverberation and noise. These results highlight the feasibility and computational efficiency of general-purpose audio representation learning from raw waveforms, showcasing the potential for low-latency, robust time-domain audio foundation models for real-world applications.

[791] Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation

Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang

Main category: cs.SD

TL;DR: The paper identifies and addresses Insertion Hallucination in Video-to-Audio generation, where models generate sounds without visual sources. It proposes evaluation metrics and a training-free method to reduce hallucinations by over 50%.

Details

Motivation: Existing Video-to-Audio evaluation metrics overlook a critical failure mode where models generate acoustic events (speech, music) without corresponding visual sources, driven by dataset biases like off-screen sounds.

Method: Proposes Posterior Feature Correction (PFC), a training-free inference-time method that detects hallucinated segments in initial audio output and regenerates audio after masking corresponding video features at those timestamps.

Result: State-of-the-art models suffer from severe Insertion Hallucination. PFC reduces both prevalence and duration of hallucinations by over 50% on average without degrading conventional audio quality and synchronization metrics.

Conclusion: This work formally defines, systematically measures, and effectively mitigates Insertion Hallucination, paving the way for more reliable and faithful Video-to-Audio models.

Abstract: Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction, a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.

cs.LG

[792] Direct Routing Gradient (DRGrad): A Personalized Information Surgery for Multi-Task Learning (MTL) Recommendations

Yuguang Liu, Yiyun Miao, Luyao Xia

Main category: cs.LG

TL;DR: DRGrad is a personalized gradient routing framework for multi-task learning in recommender systems that addresses negative transfer and seesaw phenomena by intelligently routing gradients between tasks based on their relationships.

Details

Motivation: Multi-task learning faces challenges with negative transfer and seesaw phenomenon due to complex task correlations in recommender systems, requiring better management of task conflicts while leveraging personalized information.

Method: Proposes DRGrad framework with three components: router, updater, and personalized gate network that judges task stakes during training and leverages valid gradients for respective tasks to reduce conflicts.

Result: Superior performance on real-world dataset with 15B samples, achieving better AUC metrics than state-of-the-art MTL models without increasing complexity, with additional validation on public datasets showing effective task correlation handling.

Conclusion: DRGrad effectively manages task conflicts in multi-task learning environments, addresses noise processing deficiencies, and demonstrates capability in handling varying task correlations and personalization levels.

Abstract: Multi-task learning (MTL) has emerged as a successful strategy in industrial-scale recommender systems, offering significant advantages such as capturing diverse users’ interests and accurately detecting different behaviors like click" or dwell time". However, negative transfer and the seesaw phenomenon pose challenges to MTL models due to the complex and often contradictory task correlations in real-world recommendations. To address the problem while making better use of personalized information, we propose a personalized Direct Routing Gradient framework (DRGrad), which consists of three key components: router, updater and personalized gate network. DRGrad judges the stakes between tasks in the training process, which can leverage all valid gradients for the respective task to reduce conflicts. We evaluate the efficiency of DRGrad on complex MTL using a real-world recommendation dataset with 15 billion samples. The results show that DRGrad’s superior performance over competing state-of-the-art MTL models, especially in terms of AUC (Area Under the Curve) metrics, indicating that it effectively manages task conflicts in multi-task learning environments without increasing model complexity, while also addressing the deficiencies in noise processing. Moreover, experiments on the public Census-income dataset and Synthetic dataset, have demonstrated the capability of DRGrad in judging and routing the stakes between tasks with varying degrees of correlation and personalization.

[793] Enhanced Urban Traffic Management Using CCTV Surveillance Videos and Multi-Source Data Current State Prediction and Frequent Episode Mining

Shaharyar Alam Ansari, Mohammad Luqman, Aasim Zafar, Savir Ali

Main category: cs.LG

TL;DR: This paper proposes a unified framework integrating CCTV surveillance with multi-source data for real-time urban traffic prediction, achieving 98.46% accuracy using spatio-temporal feature fusion and hybrid LSTM-Transformer models.

Details

Motivation: Rapid urbanization has intensified traffic congestion and inefficiencies, creating urgent need for intelligent traffic management solutions as conventional static systems are inadequate for modern dynamic traffic.

Method: The methodology incorporates spatio-temporal feature fusion, Frequent Episode Mining for sequential pattern discovery, and a hybrid LSTM-Transformer model for traffic state forecasting, evaluated on CityFlowV2 dataset with 313,931 annotated bounding boxes across 46 cameras.

Result: Achieved 98.46% prediction accuracy with macro precision of 0.9800, macro recall of 0.9839, and macro F1-score of 0.9819. FEM analysis revealed significant sequential patterns with confidence levels exceeding 55%, and generated 46 sustained congestion alerts.

Conclusion: The research emphasizes the need for integrating video stream analytics with multi-source data to design real-time, responsive, adaptable intelligent transportation systems for smarter and safer urban mobility.

Abstract: Rapid urbanization has intensified traffic congestion, environmental strain, and inefficiencies in transportation systems, creating an urgent need for intelligent and adaptive traffic management solutions. Conventional systems relying on static signals and manual monitoring are inadequate for the dynamic nature of modern traffic. This research aims to develop a unified framework that integrates CCTV surveillance videos with multi-source data descriptors to enhance real-time urban traffic prediction. The proposed methodology incorporates spatio-temporal feature fusion, Frequent Episode Mining for sequential traffic pattern discovery, and a hybrid LSTM-Transformer model for robust traffic state forecasting. The framework was evaluated on the CityFlowV2 dataset comprising 313,931 annotated bounding boxes across 46 cameras. It achieved a high prediction accuracy of 98.46 percent, with a macro precision of 0.9800, macro recall of 0.9839, and macro F1-score of 0.9819. FEM analysis revealed significant sequential patterns such as moderate-congested transitions with confidence levels exceeding 55 percent. The 46 sustained congestion alerts are system-generated, which shows practical value for proactive congestion management. This emphasizes the need for the incorporation of video stream analytics with data from multiple sources for the design of real-time, responsive, adaptable multi-level intelligent transportation systems, which makes urban mobility smarter and safer.

[794] Generative Models for Helmholtz Equation Solutions: A Dataset of Acoustic Materials

Riccardo Fosco Gramaccioni, Christian Marinoni, Fabrizio Frezza, Aurelio Uncini, Danilo Comminiello

Main category: cs.LG

TL;DR: A deep learning approach using Stable Diffusion with ControlNet is proposed to solve Helmholtz equations for acoustic materials, leveraging GPU parallelization to reduce computation time compared to traditional finite element methods.

Details

Motivation: Traditional numerical solvers like finite element methods are computationally expensive for large-scale or real-time wave propagation simulations in acoustic materials, which are crucial for sound design, noise control, and material engineering.

Method: Created HA30K dataset of 31,000 acoustic materials with geometric configurations and pressure field solutions. Used Stable Diffusion with ControlNet to learn Helmholtz equation solutions by representing them as images, enabling GPU parallelization and adjustable diffusion steps for speed-quality tradeoff.

Result: The approach drastically reduces computation time by leveraging GPU parallelization to process multiple simulations simultaneously, bypassing the need for complex simulation software and explicit equation-solving.

Conclusion: Deep learning-based methods are particularly useful in early-stage research where rapid exploration is more critical than absolute accuracy, offering a viable alternative to traditional computational methods for wave propagation simulation.

Abstract: Accurate simulation of wave propagation in complex acoustic materials is crucial for applications in sound design, noise control, and material engineering. Traditional numerical solvers, such as finite element methods, are computationally expensive, especially when dealing with large-scale or real-time scenarios. In this work, we introduce a dataset of 31,000 acoustic materials, named HA30K, designed and simulated solving the Helmholtz equations. For each material, we provide the geometric configuration and the corresponding pressure field solution, enabling data-driven approaches to learn Helmholtz equation solutions. As a baseline, we explore a deep learning approach based on Stable Diffusion with ControlNet, a state-of-the-art model for image generation. Unlike classical solvers, our approach leverages GPU parallelization to process multiple simulations simultaneously, drastically reducing computation time. By representing solutions as images, we bypass the need for complex simulation software and explicit equation-solving. Additionally, the number of diffusion steps can be adjusted at inference time, balancing speed and quality. We aim to demonstrate that deep learning-based methods are particularly useful in early-stage research, where rapid exploration is more critical than absolute accuracy.

[795] Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Fengyuan Liu, Marco Ciccone, Angelo Porrello, Simone Calderara

Main category: cs.LG

TL;DR: GradFix enables efficient transfer of task vectors across different foundation model versions by leveraging gradient sign structure, requiring only a few labeled samples without additional fine-tuning.

Details

Motivation: Reusing task vectors from previous model versions often fails due to parameter space misalignment, requiring practitioners to repeat full fine-tuning when new foundation models are released.

Method: GradFix approximates the ideal gradient sign structure of the new model using a handful of labeled samples, then masks the source task vector accordingly to create a locally aligned update without additional fine-tuning.

Result: The method provides theoretical guarantee for first-order descent and empirically shows significant performance gains on vision and language benchmarks, outperforming naive task vector addition and few-shot fine-tuning.

Conclusion: GradFix successfully enables task vector transfer across different pre-trained models by leveraging gradient sign structure, offering an efficient alternative to full fine-tuning when foundation models are updated.

Abstract: When a new release of a foundation model is published, practitioners typically need to repeat full fine-tuning, even if the same task has already been solved in the previous version. A promising alternative is to reuse the parameter changes (i.e., task vectors) that capture how a model adapts to a specific task. However, they often fail to transfer across different pre-trained models due to their misaligned parameter space. In this work, we show that the key to successful transfer lies in the sign structure of the gradients of the new model. Based on this insight, we propose GradFix, a novel method that approximates the ideal gradient sign structure and leverages it to transfer knowledge using only a handful of labeled samples. Notably, this requires no additional fine-tuning: the adaptation is achieved by computing a few gradients at the target model and masking the source task vector accordingly. This yields an update that is locally aligned with the target loss landscape, effectively rebasing the task vector onto the new pre-training. We provide a theoretical guarantee that our method ensures first-order descent. Empirically, we demonstrate significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning.

[796] Heterogeneous Point Set Transformers for Segmentation of Multiple View Particle Detectors

Edgar E. Robles, Dikshant Sagar, Alejandro Yankelevich, Jianming Bian, Pierre Baldi, NOvA Collaboration

Main category: cs.LG

TL;DR: NOvA experiment uses a point set neural network for particle identification in sparse 2D detector data, achieving 96.8% AUC with 90% memory reduction compared to traditional methods.

Details

Motivation: Traditional clustering and CNNs for particle identification in NOvA experiment require processing sparse 2D detector data (XZ and YZ views) efficiently.

Method: Proposed a point set neural network that operates on sparse matrices and mixes information from both XZ and YZ views of the detector.

Result: Achieved 96.8% AUC score with less than 10% memory usage compared to previous methods, significantly better than processing views independently (85.4% AUC).

Conclusion: Point set neural networks are effective for sparse detector data analysis, providing high accuracy with substantial memory efficiency gains.

Abstract: NOvA is a long-baseline neutrino oscillation experiment that detects neutrino particles from the NuMI beam at Fermilab. Before data from this experiment can be used in analyses, raw hits in the detector must be matched to their source particles, and the type of each particle must be identified. This task has commonly been done using a mix of traditional clustering approaches and convolutional neural networks (CNNs). Due to the construction of the detector, the data is presented as two sparse 2D images: an XZ and a YZ view of the detector, rather than a 3D representation. We propose a point set neural network that operates on the sparse matrices with an operation that mixes information from both views. Our model uses less than 10% of the memory required using previous methods while achieving a 96.8% AUC score, a higher score than obtained when both views are processed independently (85.4%).

[797] Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise

Luca Scimeca, Thomas Jiralerspong, Berton Earnshaw, Jason Hartford, Yoshua Bengio

Main category: cs.LG

TL;DR: The paper introduces spectrally anisotropic Gaussian diffusion (SAGD), which replaces isotropic noise with structured frequency-diagonal covariance to shape inductive biases in diffusion models, improving performance and enabling selective corruption omission.

Details

Motivation: To build explicit inductive biases into diffusion models that better accommodate target data distributions, addressing the largely implicit nature of current DPM inductive biases.

Method: Introduces an anisotropic noise operator with structured frequency-diagonal covariance that unifies band-pass masks and power-law weightings, allowing emphasis or suppression of specific frequency bands while maintaining Gaussian forward process.

Result: SAGD outperforms standard diffusion across several vision datasets and enables selective omission of known corruptions confined to specific frequency bands.

Conclusion: Anisotropic forward noise provides a principled way to tailor inductive bias in diffusion models through carefully designed frequency-based noise structure.

Abstract: Diffusion Probabilistic Models (DPMs) have achieved strong generative performance, yet their inductive biases remain largely implicit. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data to model. We introduce an anisotropic noise operator that shapes these biases by replacing the isotropic forward covariance with a structured, frequency-diagonal covariance. This operator unifies band-pass masks and power-law weightings, allowing us to emphasize or suppress designated frequency bands, while keeping the forward process Gaussian. We refer to this as spectrally anisotropic Gaussian diffusion (SAGD). In this work, we derive the score relation for anisotropic covariances and show that, under full support, the learned score converges to the true data score as $t!\to!0$, while anisotropy reshapes the probability-flow path from noise to data. Empirically, we show the induced anisotropy outperforms standard diffusion across several vision datasets, and enables selective omission: learning while ignoring known corruptions confined to specific bands. Together, these results demonstrate that carefully designed anisotropic forward noise provides a simple, yet principled, handle to tailor inductive bias in DPMs.

[798] Assessment of different loss functions for fitting equivalent circuit models to electrochemical impedance spectroscopy data

Ali Jaberi, Amin Sadeghi, Runze Zhang, Zhaoyang Zhao, Qiuyu Shi, Robert Black, Zoya Sadighi, Jason Hattrick-Simpers

Main category: cs.LG

TL;DR: This paper introduces two new loss functions (log-B and log-BW) for EIS data fitting and compares them with existing methods, finding that X2 performs best for quality of fit while log-B offers faster computation with slightly lower accuracy.

Details

Motivation: To improve electrochemical impedance spectroscopy (EIS) data modeling by developing more efficient loss functions for equivalent circuit model fitting, particularly for large-scale applications like machine learning training.

Method: Developed two new loss functions (log-B and log-BW) based on Bode representation of EIS data, then evaluated them against existing loss functions using generated EIS datasets, assessing R2 scores, chi-squared, computational efficiency, and MAPE.

Result: X2 loss function achieved highest performance across quality of fit metrics, while log-B was approximately 1.4 times faster with lower MAPE for most circuit components, offering a trade-off between speed and accuracy.

Conclusion: X2 is preferred when quality of fit is primary goal, while log-B serves as a strong alternative for large-scale applications requiring computational efficiency with acceptable accuracy.

Abstract: Electrochemical impedance spectroscopy (EIS) data is typically modeled using an equivalent circuit model (ECM), with parameters obtained by minimizing a loss function via nonlinear least squares fitting. This paper introduces two new loss functions, log-B and log-BW, derived from the Bode representation of EIS. Using a large dataset of generated EIS data, the performance of proposed loss functions was evaluated alongside existing ones in terms of R2 scores, chi-squared, computational efficiency, and the mean absolute percentage error (MAPE) between the predicted component values and the original values. Statistical comparisons revealed that the choice of loss function impacts convergence, computational efficiency, quality of fit, and MAPE. Our analysis showed that X2 loss function (squared sum of residuals with proportional weighting) achieved the highest performance across multiple quality of fit metrics, making it the preferred choice when the quality of fit is the primary goal. On the other hand, log-B offered a slightly lower quality of fit while being approximately 1.4 times faster and producing lower MAPE for most circuit components, making log-B as a strong alternative. This is a critical factor for large-scale least squares fitting in data-driven applications, such as training machine learning models on extensive datasets or iterations.

[799] Efficient Edge Test-Time Adaptation via Latent Feature Coordinate Correction

Xinyu Luo, Jie Liu, Kecheng Chen, Junyi Yang, Bo Ding, Arindam Basu, Haoliang Li

Main category: cs.LG

TL;DR: TED is a novel single-instance test-time adaptation method for edge devices that uses forward-only coordinate optimization with CMA-ES in the latent principal subspace, achieving state-of-the-art performance while reducing computational complexity by up to 63x.

Details

Motivation: Edge devices face challenges from limited computational resources and distribution shifts. Existing TTA methods rely on gradient-based optimization or batch processing, which are unsuitable for resource-constrained edge scenarios due to high computational demands.

Method: Proposes TED method using forward-only coordinate optimization in the principal subspace of latent representations with CMA-ES. Updates a compact low-dimensional vector without backpropagation, keeping model parameters frozen.

Result: Achieves state-of-the-art performance on ImageNet and Google Speech Commands datasets while reducing computational complexity by up to 63 times. Successfully deployed on ZYNQ-7020 platform.

Conclusion: TED offers a practical and scalable solution for real-world edge applications with minimal memory and computational overhead, enabling efficient, forgetting-free adaptation for resource-constrained devices.

Abstract: Edge devices face significant challenges due to limited computational resources and distribution shifts, making efficient and adaptable machine learning essential. Existing test-time adaptation (TTA) methods often rely on gradient-based optimization or batch processing, which are inherently unsuitable for resource-constrained edge scenarios due to their reliance on backpropagation and high computational demands. Gradient-free alternatives address these issues but often suffer from limited learning capacity, lack flexibility, or impose architectural constraints. To overcome these limitations, we propose a novel single-instance TTA method tailored for edge devices (TED), which employs forward-only coordinate optimization in the principal subspace of latent using the covariance matrix adaptation evolution strategy (CMA-ES). By updating a compact low-dimensional vector, TED not only enhances output confidence but also aligns the latent representation closer to the source latent distribution within the latent principal subspace. This is achieved without backpropagation, keeping the model parameters frozen, and enabling efficient, forgetting-free adaptation with minimal memory and computational overhead. Experiments on image classification and keyword spotting tasks across the ImageNet and Google Speech Commands series datasets demonstrate that TED achieves state-of-the-art performance while $\textit{reducing computational complexity by up to 63 times}$, offering a practical and scalable solution for real-world edge applications. Furthermore, we successfully $\textit{deployed TED on the ZYNQ-7020 platform}$, demonstrating its feasibility and effectiveness for resource-constrained edge devices in real-world deployments.

Changchang Sun, Vickie Chen, Yan Yan

Main category: cs.LG

TL;DR: SODA proposes a semantic cohesive knowledge distillation scheme for cross-modal hashing that uses multi-label information as a new textual modality to better bridge modality gaps between images and text.

Details

Motivation: Existing deep supervised cross-modal hashing methods fail to explicitly interact multi-label semantic extraction with raw multimodal data, making learned representations incompatible with heterogeneous data and hindering modality gap bridging.

Method: Introduces multi-label information as a new textual modality formulated as ground-truth label prompts. Uses a cross-modal teacher network to distill semantic characteristics between image and label modalities, learning a well-mapped Hamming space that serves as prior knowledge to guide student network learning.

Result: Extensive experiments on two benchmark datasets demonstrate superiority over state-of-the-art methods.

Conclusion: The proposed SODA framework effectively addresses the limitation of incompatible semantic representations in cross-modal hashing by introducing label prompts and knowledge distillation.

Abstract: Recently, deep supervised cross-modal hashing methods have achieve compelling success by learning semantic information in a self-supervised way. However, they still suffer from the key limitation that the multi-label semantic extraction process fail to explicitly interact with raw multimodal data, making the learned representation-level semantic information not compatible with the heterogeneous multimodal data and hindering the performance of bridging modality gap. To address this limitation, in this paper, we propose a novel semantic cohesive knowledge distillation scheme for deep cross-modal hashing, dubbed as SODA. Specifically, the multi-label information is introduced as a new textual modality and reformulated as a set of ground-truth label prompt, depicting the semantics presented in the image like the text modality. Then, a cross-modal teacher network is devised to effectively distill cross-modal semantic characteristics between image and label modalities and thus learn a well-mapped Hamming space for image modality. In a sense, such Hamming space can be regarded as a kind of prior knowledge to guide the learning of cross-modal student network and comprehensively preserve the semantic similarities between image and text modality. Extensive experiments on two benchmark datasets demonstrate the superiority of our model over the state-of-the-art methods.

[801] LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, Junchen Jiang

Main category: cs.LG

TL;DR: LMCache is an efficient open-source KV caching solution that shares KV caches across LLM inference engines and queries, enabling cache offloading and prefill-decode disaggregation to improve resource utilization.

Details

Motivation: Current LLM inference systems treat engines and queries independently, causing significant resource inefficiencies. Existing proposals for KV cache reuse and query disaggregation cannot be realized without efficient KV cache offloading and communication across engines.

Method: LMCache extracts and stores KV caches from modern LLM engines (vLLM and SGLang) and shares them across engines and queries. It features optimized KV cache data movement with batching and pipelining, a modular connector component, and a control API for cache orchestration across GPU, CPU, storage, and network layers.

Result: Combining LMCache with vLLM achieves up to 15x improvement in throughput across diverse workloads. LMCache has seen dramatic growth in adoption by enterprise inference systems.

Conclusion: LMCache effectively transforms LLM engines from individual token processors to a collection of engines with KV cache as the storage and communication medium, providing valuable lessons for future KV caching solutions.

Abstract: Today’s LLM inference systems treat individual engines and queries independently for simplicity, but this causes significant resource inefficiencies. While there are proposals to avoid redundant computation by reusing KV caches across queries and to increase GPU utilization by disaggregating a single query to different engines, their promises cannot be realized without efficiently offloading and communicating KV cache across LLM inference engines and queries. We present LMCache, the first and so far the most efficient open-source KV caching solution, which extracts and stores KV caches generated by modern LLM engines (vLLM and SGLang) and shares the KV caches across engines and queries. LMCache exposes KV caches in the LLM engine interface, effectively transforming LLM engines from individual token processors to a collection of engines with KV cache as the storage and communication medium. In particular, it supports both cache offloading (prefix reuse across queries) and prefill-decode disaggregation (cross-engine cache transfer). LMCache’s high performance and wide adoption stem from the following contributions: highly optimized KV cache data movement with performance optimizations including batched data movement operations, compute and I/O pipelining; a modular KV cache connector component, decoupling LMCache from the rapid evolution of inference engines; a first-class control API, such as pinning, lookup, cleanup, movement, and compression, for flexible cache orchestration across GPU, CPU, storage, and network layers. Evaluation shows that combining LMCache with vLLM achieves up to 15x improvement in throughput across diverse workloads. With a growing community, LMCache has seen dramatic growth in adoption by enterprise inference systems, which provides valuable lessons for future KV caching solutions. The source code of LMCache is at: https://github.com/LMCache/LMCache.

[802] Spatial Uncertainty Quantification in Wildfire Forecasting for Climate-Resilient Emergency Planning

Aditya Chakravarty

Main category: cs.LG

TL;DR: First systematic analysis of spatial uncertainty in wildfire spread forecasting using multimodal Earth observation data, revealing coherent uncertainty patterns concentrated near fire perimeters.

Details

Motivation: Climate change intensifies wildfire risks globally, requiring reliable forecasting for adaptation strategies. Current machine learning approaches lack uncertainty quantification essential for risk-aware decision making.

Method: Used multimodal Earth observation inputs to analyze spatial uncertainty in wildfire spread forecasting. Developed novel distance metric to identify uncertainty patterns.

Result: Predictive uncertainty exhibits coherent spatial structure concentrated near fire perimeters. High-uncertainty regions form consistent 20-60 meter buffer zones around predicted firelines. Feature attribution identifies vegetation health and fire activity as primary uncertainty drivers.

Conclusion: This work enables more robust wildfire management systems supporting communities adapting to increasing fire risk under climate change.

Abstract: Climate change is intensifying wildfire risks globally, making reliable forecasting critical for adaptation strategies. While machine learning shows promise for wildfire prediction from Earth observation data, current approaches lack uncertainty quantification essential for risk-aware decision making. We present the first systematic analysis of spatial uncertainty in wildfire spread forecasting using multimodal Earth observation inputs. We demonstrate that predictive uncertainty exhibits coherent spatial structure concentrated near fire perimeters. Our novel distance metric reveals high-uncertainty regions form consistent 20-60 meter buffer zones around predicted firelines - directly applicable for emergency planning. Feature attribution identifies vegetation health and fire activity as primary uncertainty drivers. This work enables more robust wildfire management systems supporting communities adapting to increasing fire risk under climate change.

[803] A Hybrid Computational Intelligence Framework with Metaheuristic Optimization for Drug-Drug Interaction Prediction

Maryam Abdollahi Shamami, Babak Teimourpour, Farshad Sharifi

Main category: cs.LG

TL;DR: An interpretable machine learning framework combining molecular embeddings and clinical knowledge for drug-drug interaction prediction, achieving high accuracy and clinical applicability.

Details

Motivation: Drug-drug interactions are a major cause of preventable adverse events, and knowing which drugs do not interact is crucial for safer prescriptions and better patient outcomes.

Method: Combines Mol2Vec and SMILES-BERT molecular embeddings with a rule-based clinical score (RBScore), optimized using a three-stage metaheuristic strategy (RSmpl-ACO-PSO) for balanced global exploration and local refinement.

Result: Achieves high predictive accuracy (ROC-AUC 0.911, PR-AUC 0.867 on DrugBank) and generalizes well to Type 2 Diabetes Mellitus cohort, with embedding fusion, RBScore, and optimizer contributing to precision and robustness.

Conclusion: Provides a practical pathway for building reliable, interpretable, and computationally efficient models to support safer drug therapies and clinical decision-making.

Abstract: Drug-drug interactions (DDIs) are a leading cause of preventable adverse events, often complicating treatment and increasing healthcare costs. At the same time, knowing which drugs do not interact is equally important, as such knowledge supports safer prescriptions and better patient outcomes. In this study, we propose an interpretable and efficient framework that blends modern machine learning with domain knowledge to improve DDI prediction. Our approach combines two complementary molecular embeddings - Mol2Vec, which captures fragment-level structural patterns, and SMILES-BERT, which learns contextual chemical features - together with a leakage-free, rule-based clinical score (RBScore) that injects pharmacological knowledge without relying on interaction labels. A lightweight neural classifier is then optimized using a novel three-stage metaheuristic strategy (RSmpl-ACO-PSO), which balances global exploration and local refinement for stable performance. Experiments on real-world datasets demonstrate that the model achieves high predictive accuracy (ROC-AUC 0.911, PR-AUC 0.867 on DrugBank) and generalizes well to a clinically relevant Type 2 Diabetes Mellitus cohort. Beyond raw performance, studies show how embedding fusion, RBScore, and the optimizer each contribute to precision and robustness. Together, these results highlight a practical pathway for building reliable, interpretable, and computationally efficient models that can support safer drug therapies and clinical decision-making.

[804] Population synthesis with geographic coordinates

Jacopo Lenti, Lorenzo Costantini, Ariadna Fosch, Anna Monticelli, David Scala, Marco Pangallo

Main category: cs.LG

TL;DR: Proposes a population synthesis method using Normalizing Flows + Variational Autoencoder to generate synthetic populations with explicit geographic coordinates, addressing spatial data challenges.

Details

Motivation: Need to generate synthetic populations with explicit coordinates rather than coarse geographic areas, as existing methods don't handle spatial data's unique characteristics (empty spaces, uneven densities).

Method: NF+VAE architecture: maps spatial coordinates to latent space using Normalizing Flows, then combines with other features in VAE to generate synthetic populations while learning joint distributions and spatial autocorrelations.

Result: Method outperforms benchmarks (copula-based methods, uniform allocation) across 121 diverse geographic datasets. Proposed evaluation framework measures spatial accuracy, practical utility, and privacy preservation.

Conclusion: NF+VAE enables generation of geolocated synthetic populations at fine spatial resolution, opening applications in disaster response, epidemic modeling, evacuation planning, and transport modeling.

Abstract: It is increasingly important to generate synthetic populations with explicit coordinates rather than coarse geographic areas, yet no established methods exist to achieve this. One reason is that latitude and longitude differ from other continuous variables, exhibiting large empty spaces and highly uneven densities. To address this, we propose a population synthesis algorithm that first maps spatial coordinates into a more regular latent space using Normalizing Flows (NF), and then combines them with other features in a Variational Autoencoder (VAE) to generate synthetic populations. This approach also learns the joint distribution between spatial and non-spatial features, exploiting spatial autocorrelations. We demonstrate the method by generating synthetic homes with the same statistical properties of real homes in 121 datasets, corresponding to diverse geographies. We further propose an evaluation framework that measures both spatial accuracy and practical utility, while ensuring privacy preservation. Our results show that the NF+VAE architecture outperforms popular benchmarks, including copula-based methods and uniform allocation within geographic areas. The ability to generate geolocated synthetic populations at fine spatial resolution opens the door to applications requiring detailed geography, from household responses to floods, to epidemic spread, evacuation planning, and transport modeling.

[805] A physics-aware deep learning model for shear band formation around collapsing pores in shocked reactive materials

Xinlun Cheng, Bingzhe Chen, Joseph Choi, Yen T. Nguyen, Pradeep Seshadri, Mayank Verma, H. S. Udaykumar, Stephen Baek

Main category: cs.LG

TL;DR: The paper presents an improved Physics-Aware Recurrent Convolutional Neural Network (PARCv2) to model hotspot formation in energetic materials under weak-to-moderate shock loading, addressing computational challenges of direct simulations.

Details

Motivation: To understand hotspot formation in crystalline energetic materials under weak-to-moderate shock loading, which is critical for safe storage and handling but remains underexplored compared to strong shock conditions.

Method: Advanced PARCv2 architecture to rapidly predict shear localizations and plastic heating, and benchmarked it against Fourier neural operator and neural ordinary differential equation models.

Result: PARCv2 demonstrated superior performance in capturing spatiotemporal dynamics of shear band formation compared to other physics-informed models, though all models showed certain failure modes.

Conclusion: Domain-specific considerations are crucial for developing robust AI-accelerated simulation tools for reactive materials, as demonstrated by PARCv2’s improved performance in modeling weak-to-moderate shock responses.

Abstract: Modeling shock-to-detonation phenomena in energetic materials (EMs) requires capturing complex physical processes such as strong shocks, rapid changes in microstructural morphology, and nonlinear dynamics of chemical reaction fronts. These processes participate in energy localization at hotspots, which initiate chemical energy release leading to detonation. This study addresses the formation of hotspots in crystalline EMs subjected to weak-to-moderate shock loading, which, despite its critical relevance to the safe storage and handling of EMs, remains underexplored compared to the well-studied strong shock conditions. To overcome the computational challenges associated with direct numerical simulations, we advance the Physics-Aware Recurrent Convolutional Neural Network (PARCv2), which has been shown to be capable of predicting strong shock responses in EMs. We improved the architecture of PARCv2 to rapidly predict shear localizations and plastic heating, which play important roles in the weak-to-moderate shock regime. PARCv2 is benchmarked against two widely used physics-informed models, namely, Fourier neural operator and neural ordinary differential equation; we demonstrate its superior performance in capturing the spatiotemporal dynamics of shear band formation. While all models exhibit certain failure modes, our findings underscore the importance of domain-specific considerations in developing robust AI-accelerated simulation tools for reactive materials.

[806] Multitask Learning with Learned Task Relationships

Zirui Wan, Stefan Vlaski

Main category: cs.LG

TL;DR: The paper introduces a federated learning framework that learns task relationships through Gaussian Markov Random Fields, enabling personalized models without requiring consensus or prior knowledge of task relationships.

Details

Motivation: Classical consensus-based federated learning methods are suboptimal for heterogeneous data distributions, and existing personalized approaches either need precise prior knowledge of task relationships or rely on non-parametric methods like meta-learning.

Method: The authors develop an algorithmic framework that models task relationships through a Gaussian Markov Random Field with unknown precision matrix, jointly learning both task relationships and local models to allow self-organization based on individual data distributions.

Result: Theoretical analysis quantifies the quality of learned relationships, and numerical experiments demonstrate practical effectiveness of the approach.

Conclusion: The proposed framework successfully balances between extremes by learning task relationships adaptively while enabling personalized federated learning without requiring consensus or precise prior knowledge.

Abstract: Classical consensus-based strategies for federated and decentralized learning are statistically suboptimal in the presence of heterogeneous local data or task distributions. As a result, in recent years, there has been growing interest in multitask or personalized strategies, which allow individual agents to benefit from one another in pursuing locally optimal models without enforcing consensus. Existing strategies require either precise prior knowledge of the underlying task relationships or are fully non-parametric and instead rely on meta-learning or proximal constructions. In this work, we introduce an algorithmic framework that strikes a balance between these extremes. By modeling task relationships through a Gaussian Markov Random Field with an unknown precision matrix, we develop a strategy that jointly learns both the task relationships and the local models, allowing agents to self-organize in a way consistent with their individual data distributions. Our theoretical analysis quantifies the quality of the learned relationship, and our numerical experiments demonstrate its practical effectiveness.

[807] Coupled Data and Measurement Space Dynamics for Enhanced Diffusion Posterior Sampling

Shayan Mohajer Hamidi, En-Hui Yang, Ben Liang

Main category: cs.LG

TL;DR: Proposes C-DPS, a coupled diffusion framework that eliminates constraint tuning and likelihood approximation for inverse problems by introducing parallel diffusion processes in data and measurement spaces.

Details

Motivation: Existing diffusion-based methods for inverse problems rely on heuristic projection techniques or likelihood approximations, leading to artifacts and instability under complex or high-noise conditions.

Method: Introduces coupled data and measurement space diffusion posterior sampling (C-DPS) with parallel forward stochastic processes in both spaces, enabling derivation of closed-form posterior distribution for recursive sampling.

Result: C-DPS consistently outperforms existing baselines both qualitatively and quantitatively across multiple inverse problem benchmarks.

Conclusion: The proposed C-DPS framework provides a more accurate and stable approach for solving inverse problems using diffusion models without requiring constraint tuning or likelihood approximation.

Abstract: Inverse problems, where the goal is to recover an unknown signal from noisy or incomplete measurements, are central to applications in medical imaging, remote sensing, and computational biology. Diffusion models have recently emerged as powerful priors for solving such problems. However, existing methods either rely on projection-based techniques that enforce measurement consistency through heuristic updates, or they approximate the likelihood $p(\boldsymbol{y} \mid \boldsymbol{x})$, often resulting in artifacts and instability under complex or high-noise conditions. To address these limitations, we propose a novel framework called \emph{coupled data and measurement space diffusion posterior sampling} (C-DPS), which eliminates the need for constraint tuning or likelihood approximation. C-DPS introduces a forward stochastic process in the measurement space ${\boldsymbol{y}_t}$, evolving in parallel with the data-space diffusion ${\boldsymbol{x}t}$, which enables the derivation of a closed-form posterior $p(\boldsymbol{x}{t-1} \mid \boldsymbol{x}t, \boldsymbol{y}{t-1})$. This coupling allows for accurate and recursive sampling based on a well-defined posterior distribution. Empirical results demonstrate that C-DPS consistently outperforms existing baselines, both qualitatively and quantitatively, across multiple inverse problem benchmarks.

[808] Using LLMs to Directly Guess Conditional Expectations Can Improve Efficiency in Causal Estimation

Chris Engh, P. M. Aronow

Main category: cs.LG

TL;DR: Using LLM-generated predictions as additional features in double machine learning improves causal estimation efficiency by leveraging generative models’ historical knowledge to overcome dimensionality issues.

Details

Motivation: To enhance causal estimation accuracy in high-dimensional confounder settings by leveraging LLMs' reasoning capabilities and historical knowledge, overcoming curse-of-dimensionality problems in causal inference.

Method: Proposes using LLM-generated predictions as additional predictors in double machine learning framework, where generative models trained on historical data provide conditional expectation function estimates.

Result: Demonstrated improved estimation efficiency in a case study of online jewelry auctions, showing that LLM-generated guesses as predictors outperform approaches using only embeddings.

Conclusion: LLM-powered AI tools can effectively improve causal estimation by incorporating generative model predictions, offering a simple yet powerful approach to address dimensionality challenges in causal inference.

Abstract: We propose a simple yet effective use of LLM-powered AI tools to improve causal estimation. In double machine learning, the accuracy of causal estimates of the effect of a treatment on an outcome in the presence of a high-dimensional confounder depends on the performance of estimators of conditional expectation functions. We show that predictions made by generative models trained on historical data can be used to improve the performance of these estimators relative to approaches that solely rely on adjusting for embeddings extracted from these models. We argue that the historical knowledge and reasoning capacities associated with these generative models can help overcome curse-of-dimensionality problems in causal inference problems. We consider a case study using a small dataset of online jewelry auctions, and demonstrate that inclusion of LLM-generated guesses as predictors can improve efficiency in estimation.

[809] Deep Neural Networks Inspired by Differential Equations

Yongshuai Liu, Lianfang Wang, Kuilin Qin, Qinghua Zhang, Faqiang Wang, Li Cui, Jun Liu, Yuping Duan, Tieyong Zeng

Main category: cs.LG

TL;DR: This paper reviews deep neural network architectures and dynamic modeling methods inspired by differential equations, including ODE-based models and SDE-based regularization techniques.

Details

Motivation: To address challenges in neural network theoretical understanding, interpretability, and generalization by adopting a differential equations perspective as a unified theoretical framework.

Method: Extensive review of deep neural network architectures and dynamic modeling methods inspired by differential equations, examining ODE-based models, SDE-based regularization techniques, and conducting numerical comparisons.

Result: The paper provides comprehensive analysis of differential equation-inspired neural network models and their characteristics through numerical comparisons.

Conclusion: Integration of differential equations with deep learning offers promising research directions for developing intelligent computational methods with enhanced interpretability and generalization capabilities.

Abstract: Deep learning has become a pivotal technology in fields such as computer vision, scientific computing, and dynamical systems, significantly advancing these disciplines. However, neural Networks persistently face challenges related to theoretical understanding, interpretability, and generalization. To address these issues, researchers are increasingly adopting a differential equations perspective to propose a unified theoretical framework and systematic design methodologies for neural networks. In this paper, we provide an extensive review of deep neural network architectures and dynamic modeling methods inspired by differential equations. We specifically examine deep neural network models and deterministic dynamical network constructs based on ordinary differential equations (ODEs), as well as regularization techniques and stochastic dynamical network models informed by stochastic differential equations (SDEs). We present numerical comparisons of these models to illustrate their characteristics and performance. Finally, we explore promising research directions in integrating differential equations with deep learning to offer new insights for developing intelligent computational methods that boast enhanced interpretability and generalization capabilities.

[810] Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs

Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, Jishen Zhao

Main category: cs.LG

TL;DR: AT-GRPO is a novel on-policy reinforcement learning approach for multi-agent systems that addresses challenges in applying GRPO-style optimization to MAS by introducing agent- and turn-wise grouping and supporting both single- and multi-policy training regimes.

Details

Motivation: Multi-agent systems and reinforcement learning enhance LLM capabilities, but applying on-policy RL to MAS is underexplored due to algorithmic challenges (breaking standard GRPO grouping assumptions) and system requirements (supporting MAS-workflow rollouts and on-policy updates).

Method: AT-GRPO includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS that handles varying prompts by role and turn, and (ii) a training system supporting both single- and multi-policy regimes for MAS workflows.

Result: AT-GRPO delivers substantial gains across tasks: long-horizon planning accuracy improves from 14.0-47.0% to 96.0-99.5%, coding tasks show average gains of 3.87-7.62%, and math tasks show gains of 9.0-17.93%.

Conclusion: AT-GRPO successfully addresses the challenges of applying on-policy RL to multi-agent systems and demonstrates significant performance improvements across diverse domains including planning, coding, and mathematics.

Abstract: Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies, such as GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System-wise, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models. We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes. Across game, planning, coding, and math tasks, AT-GRPO delivers substantial gains. On long-horizon planning, it increases accuracy from a 14.0 to 47.0 percent single-agent RL baseline to 96.0 to 99.5 percent. It also improves reasoning performance, with average gains of 3.87 to 7.62 percent on coding tasks and 9.0 to 17.93 percent on math. Code and environments are available at: https://github.com/pettingllms-ai/PettingLLMs.

[811] On the Occurence of Critical Learning Periods in Neural Networks

Stanisław Pawlak

Main category: cs.LG

TL;DR: Critical learning periods and warm-starting performance loss in neural networks can be avoided using cyclic learning rate schedules, rather than being fundamental limitations.

Details

Motivation: To address the problem of neural network plasticity loss during critical learning periods and warm-starting scenarios, where networks trained with deficit data struggle to reach parity with scratch-trained models even after extensive clean training.

Method: Replicated key findings from seminal research on critical learning periods and extended the experimental scope. Investigated warm-starting as a form of deficit pretraining and tested cyclic learning rate schedules as a solution.

Result: Demonstrated that cyclic learning rate schedules successfully prevent both critical learning periods and warm-starting performance loss, allowing networks to achieve accuracy parity with scratch-trained models.

Conclusion: The problems of critical learning periods and warm-starting performance loss are not fundamental limitations but can be overcome with appropriate learning rate strategies, establishing an important connection between these two research areas.

Abstract: This study delves into the plasticity of neural networks, offering empirical support for the notion that critical learning periods and warm-starting performance loss can be avoided through simple adjustments to learning hyperparameters. The critical learning phenomenon emerges when training is initiated with deficit data. Subsequently, after numerous deficit epochs, the network’s plasticity wanes, impeding its capacity to achieve parity in accuracy with models trained from scratch, even when extensive clean data training follows deficit epochs. Building upon seminal research introducing critical learning periods, we replicate key findings and broaden the experimental scope of the main experiment from the original work. In addition, we consider a warm-starting approach and show that it can be seen as a form of deficit pretraining. In particular, we demonstrate that these problems can be averted by employing a cyclic learning rate schedule. Our findings not only impact neural network training practices but also establish a vital link between critical learning periods and ongoing research on warm-starting neural network training.

[812] Evaluation of Differential Privacy Mechanisms on Federated Learning

Tejash Varsani

Main category: cs.LG

TL;DR: This paper implements adaptive differential privacy methods with Laplace and Gaussian mechanisms, introducing adaptive clipping to dynamically update gradient sensitivity instead of using fixed values, aiming to maintain model accuracy while preserving privacy in federated learning.

Details

Motivation: Fixed privacy budgets in differential privacy can introduce excessive noise during model convergence, compromising performance in federated learning. Adaptive privacy budgets are investigated as a solution to balance privacy protection and model accuracy.

Method: Implemented DP methods using Laplace and Gaussian mechanisms with adaptive privacy budget, extending the SelecEval simulator. Introduced adaptive clipping approach in Gaussian mechanism to dynamically update gradient sensitivity rather than using fixed sensitivity.

Result: Experiments with various privacy budgets, IID and non-IID datasets, and different client selection showed that adaptive privacy budgets and adaptive clipping can help maintain model accuracy while preserving privacy, though limited to 200 training rounds.

Conclusion: Adaptive privacy budgets and adaptive clipping techniques show promise in maintaining federated learning model accuracy while providing privacy protection, addressing the limitations of fixed privacy budgets that introduce excessive noise.

Abstract: Federated learning is distributed model training across several clients without disclosing raw data. Despite advancements in data privacy, risks still remain. Differential Privacy (DP) is a technique to protect sensitive data by adding noise to model updates, usually controlled by a fixed privacy budget. However, this approach can introduce excessive noise, particularly when the model converges, which compromises performance. To address this problem, adaptive privacy budgets have been investigated as a potential solution. This work implements DP methods using Laplace and Gaussian mechanisms with an adaptive privacy budget, extending the SelecEval simulator. We introduce an adaptive clipping approach in the Gaussian mechanism, ensuring that gradients of the model are dynamically updated rather than using a fixed sensitivity. We conduct extensive experiments with various privacy budgets, IID and non-IID datasets, and different numbers of selected clients per round. While our experiments were limited to 200 training rounds, the results suggest that adaptive privacy budgets and adaptive clipping can help maintain model accuracy while preserving privacy.

[813] Neural PDE Solvers with Physics Constraints: A Comparative Study of PINNs, DRM, and WANs

Jiakang Chen

Main category: cs.LG

TL;DR: This dissertation presents a unified comparison of three mesh-free neural PDE solvers (PINNs, DRM, WANs) on Poisson problems and the time-independent Schrödinger equation, showing all methods achieve low L2 errors with proper techniques and providing practical guidelines for method selection.

Details

Motivation: PDEs are fundamental across science and engineering, but analytical solutions are rare and classical mesh-based solvers are expensive in high dimensions, motivating the need for efficient mesh-free neural solvers.

Method: Systematic comparison of three neural PDE solvers (PINNs, DRM, WANs) using forced boundary conditions, forced nodes, and orthogonality regularization on Poisson problems (up to 5D) and time-independent Schrödinger equation in 1D/2D.

Result: All methods achieved low L2 errors (10^-6 to 10^-9) with proper techniques. PINNs were most reliable for accuracy and excited spectra recovery, DRM offered best accuracy-runtime trade-off on stationary problems, and WAN was competitive with effective weak-form constraints.

Conclusion: Physics-guided neural solvers are credible, scalable tools for complex PDEs, with practical guidelines provided for method selection and extensions outlined for time-dependent formulations and adaptive techniques.

Abstract: Partial differential equations (PDEs) underpin models across science and engineering, yet analytical solutions are atypical and classical mesh-based solvers can be costly in high dimensions. This dissertation presents a unified comparison of three mesh-free neural PDE solvers, physics-informed neural networks (PINNs), the deep Ritz method (DRM), and weak adversarial networks (WANs), on Poisson problems (up to 5D) and the time-independent Schr"odinger equation in 1D/2D (infinite well and harmonic oscillator), and extends the study to a laser-driven case of Schr"odinger’s equation via the Kramers-Henneberger (KH) transformation. Under a common protocol, all methods achieve low $L_2$ errors ($10^{-6}$-$10^{-9}$) when paired with forced boundary conditions (FBCs), forced nodes (FNs), and orthogonality regularization (OG). Across tasks, PINNs are the most reliable for accuracy and recovery of excited spectra; DRM offers the best accuracy-runtime trade-off on stationary problems; WAN is more sensitive but competitive when weak-form constraints and FN/OG are used effectively. Sensitivity analyses show that FBC removes boundary-loss tuning, network width matters more than depth for single-network solvers, and most gains occur within 5000-10,000 epochs. The same toolkit solves the KH case, indicating transfer beyond canonical benchmarks. We provide practical guidelines for method selection and outline the following extensions: time-dependent formulations for DRM and WAN, adaptive residual-driven sampling, parallel multi-state training, and neural domain decomposition. These results support physics-guided neural solvers as credible, scalable tools for solving complex PDEs.

[814] AI-powered skin spectral imaging enables instant sepsis diagnosis and outcome prediction in critically ill patients

Silvia Seidlitz, Katharina Hölzl, Ayca von Garrel, Jan Sellner, Stephan Katzenschlager, Tobias Hölle, Dania Fischer, Maik von der Forst, Felix C. F. Schmitt, Alexander Studier-Fischer, Markus A. Weigand, Lena Maier-Hein, Maximilian Dietrich

Main category: cs.LG

TL;DR: Deep learning applied to hyperspectral imaging (HSI) enables rapid, noninvasive prediction of sepsis and mortality in ICU patients with AUROCs of 0.80 and 0.72 respectively, improving to 0.94 and 0.83 when combined with clinical data.

Details

Motivation: Sepsis is a leading cause of mortality, and early identification of patients with sepsis and those at high risk of death is crucial for timely intervention and improved outcomes.

Method: Prospective observational study collecting HSI data from palms and fingers of over 480 ICU patients, using neural networks to analyze single HSI cubes acquired within seconds.

Result: Neural networks predicted sepsis with AUROC of 0.80 and mortality with AUROC of 0.72 using HSI alone. Performance improved to AUROC 0.94 for sepsis and 0.83 for mortality when combined with clinical data.

Conclusion: Deep learning-based HSI analysis enables rapid, noninvasive prediction of sepsis and mortality, showing potential clinical value for enhancing diagnosis and treatment in critical care settings.

Abstract: With sepsis remaining a leading cause of mortality, early identification of patients with sepsis and those at high risk of death is a challenge of high socioeconomic importance. Given the potential of hyperspectral imaging (HSI) to monitor microcirculatory alterations, we propose a deep learning approach to automated sepsis diagnosis and mortality prediction using a single HSI cube acquired within seconds. In a prospective observational study, we collected HSI data from the palms and fingers of more than 480 intensive care unit patients. Neural networks applied to HSI measurements predicted sepsis and mortality with areas under the receiver operating characteristic curve (AUROCs) of 0.80 and 0.72, respectively. Performance improved substantially with additional clinical data, reaching AUROCs of 0.94 for sepsis and 0.83 for mortality. We conclude that deep learning-based HSI analysis enables rapid and noninvasive prediction of sepsis and mortality, with a potential clinical value for enhancing diagnosis and treatment.

[815] Phase-Aware Deep Learning with Complex-Valued CNNs for Audio Signal Applications

Naman Agrawal

Main category: cs.LG

TL;DR: This paper explores Complex-Valued Convolutional Neural Networks (CVCNNs) for audio processing, focusing on preserving phase information. It establishes theoretical foundations, validates performance on image datasets, and demonstrates improved audio classification results when incorporating phase information through GNNs.

Details

Motivation: Real-valued neural networks often neglect phase information in audio signal processing, which can be valuable for classification tasks. The study aims to leverage complex-valued architectures to preserve and utilize this phase information effectively.

Method: Developed theoretical framework for CVCNNs including complex convolutions, pooling, Wirtinger-based differentiation, and complex activation functions. Conducted three-stage evaluation: 1) benchmarking on image datasets, 2) audio classification with MFCCs, 3) incorporating phase information via Graph Neural Networks with edge weighting.

Result: CVCNNs showed competitive performance with real-valued CNNs on image datasets. In audio classification, CVCNNs slightly outperformed real CNNs on MFCCs. When phase information was modeled via GNNs, measurable gains were achieved in both binary and multi-class genre classification tasks.

Conclusion: Complex-valued architectures have expressive capacity and phase is a meaningful, exploitable feature in audio processing. While current methods show promise, future advances in phase-aware design are needed to fully leverage complex representations in neural networks.

Abstract: This study explores the design and application of Complex-Valued Convolutional Neural Networks (CVCNNs) in audio signal processing, with a focus on preserving and utilizing phase information often neglected in real-valued networks. We begin by presenting the foundational theoretical concepts of CVCNNs, including complex convolutions, pooling layers, Wirtinger-based differentiation, and various complex-valued activation functions. These are complemented by critical adaptations of training techniques, including complex batch normalization and weight initialization schemes, to ensure stability in training dynamics. Empirical evaluations are conducted across three stages. First, CVCNNs are benchmarked on standard image datasets, where they demonstrate competitive performance with real-valued CNNs, even under synthetic complex perturbations. Although our focus is audio signal processing, we first evaluate CVCNNs on image datasets to establish baseline performance and validate training stability before applying them to audio tasks. In the second experiment, we focus on audio classification using Mel-Frequency Cepstral Coefficients (MFCCs). CVCNNs trained on real-valued MFCCs slightly outperform real CNNs, while preserving phase in input workflows highlights challenges in exploiting phase without architectural modifications. Finally, a third experiment introduces GNNs to model phase information via edge weighting, where the inclusion of phase yields measurable gains in both binary and multi-class genre classification. These results underscore the expressive capacity of complex-valued architectures and confirm phase as a meaningful and exploitable feature in audio processing applications. While current methods show promise, especially with activations like cardioid, future advances in phase-aware design will be essential to leverage the potential of complex representations in neural networks.

[816] Kelp: A Streaming Safeguard for Large Models via Latent Dynamics-Guided Risk Detection

Xiaodan Li, Mengjie Wu, Yao Zhu, Yunna Lv, YueFeng Chen, Cen Chen, Jianmei Guo, Hui Xue

Main category: cs.LG

TL;DR: Kelp is a plug-in framework for streaming risk detection in LM generation pipelines, using intermediate hidden states and temporal modeling for real-time harm detection with minimal latency.

Details

Motivation: Existing post-hoc detection methods expose unsafe content before catching it and use lightweight models due to latency constraints, limiting detection accuracy.

Method: Uses Streaming Latent Dynamics Head (SLD) to model temporal evolution of risk from intermediate LM states, with Anchored Temporal Consistency (ATC) loss to enforce monotonic harm predictions.

Result: Outperforms state-of-the-art post-hoc guardrails and prior plug-in probes by 15.61% average F1, using only 20M parameters and adding <0.5ms per-token latency.

Conclusion: Kelp provides effective real-time streaming risk detection with high accuracy and minimal computational overhead across diverse models and datasets.

Abstract: Large models (LMs) are powerful content generators, yet their open-ended nature can also introduce potential risks, such as generating harmful or biased content. Existing guardrails mostly perform post-hoc detection that may expose unsafe content before it is caught, and the latency constraints further push them toward lightweight models, limiting detection accuracy. In this work, we propose Kelp, a novel plug-in framework that enables streaming risk detection within the LM generation pipeline. Kelp leverages intermediate LM hidden states through a Streaming Latent Dynamics Head (SLD), which models the temporal evolution of risk across the generated sequence for more accurate real-time risk detection. To ensure reliable streaming moderation in real applications, we introduce an Anchored Temporal Consistency (ATC) loss to enforce monotonic harm predictions by embedding a benign-then-harmful temporal prior. Besides, for a rigorous evaluation of streaming guardrails, we also present StreamGuardBench-a model-grounded benchmark featuring on-the-fly responses from each protected model, reflecting real-world streaming scenarios in both text and vision-language tasks. Across diverse models and datasets, Kelp consistently outperforms state-of-the-art post-hoc guardrails and prior plug-in probes (15.61% higher average F1), while using only 20M parameters and adding less than 0.5 ms of per-token latency.

[817] Vanishing Contributions: A Unified Approach to Smoothly Transition Neural Models into Compressed Form

Lorenzo Nikiforos, Charalampos Antoniadis, Luciano Prono, Fabio Pareschi, Riccardo Rovatti, Gianluca Setti

Main category: cs.LG

TL;DR: VCON is a method that smoothly transitions neural networks from uncompressed to compressed versions during fine-tuning, reducing accuracy degradation from compression techniques like pruning and quantization.

Details

Motivation: Standard compression techniques often cause severe accuracy degradation when applied directly to neural networks, creating a need for methods that maintain accuracy while achieving compression benefits.

Method: VCON runs original and compressed models in parallel during fine-tuning, progressively reducing the contribution of the uncompressed model while increasing the compressed model’s contribution through a smooth transition.

Result: VCON consistently improves accuracy across computer vision and NLP benchmarks, with typical gains exceeding 3% and some configurations showing 20% accuracy boosts.

Conclusion: VCON provides a generalizable approach that can be applied to existing compression techniques, delivering consistent accuracy improvements across multiple domains and benchmarks.

Abstract: The increasing scale of deep neural networks has led to a growing need for compression techniques such as pruning, quantization, and low-rank decomposition. While these methods are very effective in reducing memory, computation and energy consumption, they often introduce severe accuracy degradation when applied directly. We introduce Vanishing Contributions (VCON), a general approach for smoothly transitioning neural models into compressed form. Rather than replacing the original network directly with its compressed version, VCON executes the two in parallel during fine-tuning. The contribution of the original (uncompressed) model is progressively reduced, while that of the compressed model is gradually increased. This smooth transition allows the network to adapt over time, improving stability and mitigating accuracy degradation. We evaluate VCON across computer vision and natural language processing benchmarks, in combination with multiple compression strategies. Across all scenarios, VCON leads to consistent improvements: typical gains exceed 3%, while some configuration exhibits accuracy boosts of 20%. VCON thus provides a generalizable method that can be applied to the existing compression techniques, with evidence of consistent gains across multiple benchmarks.

[818] Discrete-Time Diffusion-Like Models for Speech Synthesis

Xiaozhou Tan, Minghui Zhao, Anton Ragni

Main category: cs.LG

TL;DR: The paper proposes discrete-time diffusion processes as alternatives to continuous-time diffusion models for speech generation, offering more efficient training and inference with comparable quality.

Details

Motivation: Continuous-time diffusion models have limitations including restricted additive Gaussian noising during training and mismatch between continuous training and discrete sampling during inference. Discrete-time processes can overcome these limitations with fewer inference steps and full consistency.

Method: The paper explores discrete-time diffusion processes including variants with additive Gaussian noise, multiplicative Gaussian noise, blurring noise, and a mixture of blurring and Gaussian noises.

Result: Experimental results show that discrete-time processes achieve comparable subjective and objective speech quality to continuous diffusion models, while providing more efficient and consistent training and inference.

Conclusion: Discrete-time diffusion processes are viable alternatives to continuous-time models for speech generation, offering comparable quality with improved efficiency and consistency in training and inference.

Abstract: Diffusion models have attracted a lot of attention in recent years. These models view speech generation as a continuous-time process. For efficient training, this process is typically restricted to additive Gaussian noising, which is limiting. For inference, the time is typically discretized, leading to the mismatch between continuous training and discrete sampling conditions. Recently proposed discrete-time processes, on the other hand, usually do not have these limitations, may require substantially fewer inference steps, and are fully consistent between training/inference conditions. This paper explores some diffusion-like discrete-time processes and proposes some new variants. These include processes applying additive Gaussian noise, multiplicative Gaussian noise, blurring noise and a mixture of blurring and Gaussian noises. The experimental results suggest that discrete-time processes offer comparable subjective and objective speech quality to their widely popular continuous counterpart, with more efficient and consistent training and inference schemas.

[819] Operator Learning for Power Systems Simulation

Matthew Schlegel, Matthew E. Taylor, Mostafa Farrokhabadi

Main category: cs.LG

TL;DR: Operator learning methods are proposed as surrogate models for time-domain power system simulations to overcome computational intractability in renewable-penetrated grids, with focus on time step-invariance for generalization.

Details

Motivation: Traditional time-domain simulations become computationally intractable for renewable-penetrated grids due to ultra-fast dynamics requiring microsecond-scale time steps, creating barriers for renewable integration and climate change mitigation.

Method: Three operator learning methods are benchmarked on a simple test system to demonstrate time step-invariance, enabling models trained on coarse time steps to generalize to fine-resolution dynamics through zero-shot super-resolution and generalization between stable/unstable regimes.

Result: The paper provides a first proof-of-concept demonstrating the viability of time step-invariance in operator learning methods for power system simulations.

Conclusion: Operator learning addresses key computational challenges in renewable energy integration, offering a promising approach for scalable and fast power system simulations to support climate change mitigation.

Abstract: Time domain simulation, i.e., modeling the system’s evolution over time, is a crucial tool for studying and enhancing power system stability and dynamic performance. However, these simulations become computationally intractable for renewable-penetrated grids, due to the small simulation time step required to capture renewable energy resources’ ultra-fast dynamic phenomena in the range of 1-50 microseconds. This creates a critical need for solutions that are both fast and scalable, posing a major barrier for the stable integration of renewable energy resources and thus climate change mitigation. This paper explores operator learning, a family of machine learning methods that learn mappings between functions, as a surrogate model for these costly simulations. The paper investigates, for the first time, the fundamental concept of simulation time step-invariance, which enables models trained on coarse time steps to generalize to fine-resolution dynamics. Three operator learning methods are benchmarked on a simple test system that, while not incorporating practical complexities of renewable-penetrated grids, serves as a first proof-of-concept to demonstrate the viability of time step-invariance. Models are evaluated on (i) zero-shot super-resolution, where training is performed on a coarse simulation time step and inference is performed at super-resolution, and (ii) generalization between stable and unstable dynamic regimes. This work addresses a key challenge in the integration of renewable energy for the mitigation of climate change by benchmarking operator learning methods to model physical systems.

[820] OrbitZoo: Multi-Agent Reinforcement Learning Environment for Orbital Dynamics

Alexandre Oliveira, Katarina Dyreby, Francisco Caldas, Cláudia Soares

Main category: cs.LG

TL;DR: OrbitZoo is a multi-agent RL environment using high-fidelity orbital dynamics to address space congestion challenges like collision avoidance and satellite maneuvers, validated against real Starlink data with 0.16% MAPE.

Details

Motivation: Space congestion from satellites and debris threatens satellite safety, requiring advanced techniques for collision avoidance and orbital maneuvers. Existing RL frameworks use simplified models that limit real-world complexity capture.

Method: Developed OrbitZoo, a versatile multi-agent RL environment built on industry-standard high-fidelity orbital dynamics library, supporting realistic data generation for scenarios like collision avoidance and cooperative maneuvers.

Result: Validated against real Starlink constellation data, achieving 0.16% Mean Absolute Percentage Error (MAPE), ensuring reliable high-fidelity simulations for autonomous satellite operations.

Conclusion: OrbitZoo provides a robust, validated RL environment that enables realistic simulation of orbital dynamics, supporting autonomous satellite operations and addressing space congestion challenges effectively.

Abstract: The increasing number of satellites and orbital debris has made space congestion a critical issue, threatening satellite safety and sustainability. Challenges such as collision avoidance, station-keeping, and orbital maneuvering require advanced techniques to handle dynamic uncertainties and multi-agent interactions. Reinforcement learning (RL) has shown promise in this domain, enabling adaptive, autonomous policies for space operations; however, many existing RL frameworks rely on custom-built environments developed from scratch, which often use simplified models and require significant time to implement and validate the orbital dynamics, limiting their ability to fully capture real-world complexities. To address this, we introduce OrbitZoo, a versatile multi-agent RL environment built on a high-fidelity industry standard library, that enables realistic data generation, supports scenarios like collision avoidance and cooperative maneuvers, and ensures robust and accurate orbital dynamics. The environment is validated against a real satellite constellation, Starlink, achieving a Mean Absolute Percentage Error (MAPE) of 0.16% compared to real-world data. This validation ensures reliability for generating high-fidelity simulations and enabling autonomous and independent satellite operations.

[821] A Multi-Component Reward Function with Policy Gradient for Automated Feature Selection with Dynamic Regularization and Bias Mitigation

Sudip Khadka, L. S. Paudel

Main category: cs.LG

TL;DR: RL framework for automated feature selection with bias mitigation, using adaptive feature selection and multi-component reward balancing performance and fairness.

Details

Motivation: Static feature exclusion fails to prevent bias due to hidden dependencies; need integrated approach for bias mitigation and feature selection.

Method: Reinforcement learning agent adaptively selects features using reward function combining predictive performance and fairness, integrated with ensemble learning.

Result: Dynamic formulation balances generalization, accuracy, and equity throughout training, avoiding reliance on pre/post-processing alone.

Conclusion: Provides flexible, generalizable feature selection method for correlated predictors where biases can re-emerge.

Abstract: Static feature exclusion strategies often fail to prevent bias when hidden dependencies influence the model predictions. To address this issue, we explore a reinforcement learning (RL) framework that integrates bias mitigation and automated feature selection within a single learning process. Unlike traditional heuristic-driven filter or wrapper approaches, our RL agent adaptively selects features using a reward signal that explicitly integrates predictive performance with fairness considerations. This dynamic formulation allows the model to balance generalization, accuracy, and equity throughout the training process, rather than rely exclusively on pre-processing adjustments or post hoc correction mechanisms. In this paper, we describe the construction of a multi-component reward function, the specification of the agents action space over feature subsets, and the integration of this system with ensemble learning. We aim to provide a flexible and generalizable way to select features in environments where predictors are correlated and biases can inadvertently re-emerge.

[822] Group-Adaptive Adversarial Learning for Robust Fake News Detection Against Malicious Comments

Zhao Tong, Chunlin Gong, Yimeng Gu, Haichao Shi, Qiang Liu, Shu Wu, Xiao-Yu Zhang

Main category: cs.LG

TL;DR: This paper addresses the vulnerability of fake news detection models to adversarial comments and proposes a group-adaptive adversarial training strategy to improve model robustness.

Details

Motivation: Fake news detection models perform well in standard settings but remain vulnerable to adversarial comments from real users or LLMs that can subtly shift model decisions, undermining trust in social media platforms.

Method: Three-step approach: (1) categorize adversarial comments into perceptual, cognitive, and societal types; (2) generate diverse category-specific attacks using LLMs for adversarial training; (3) apply Dirichlet-based adaptive sampling (InfoDirichlet Adjusting Mechanism) to dynamically adjust learning focus across comment categories.

Result: Experiments on benchmark datasets show the method maintains strong detection accuracy while substantially increasing robustness to a wide range of adversarial comment perturbations.

Conclusion: The proposed group-adaptive adversarial training strategy effectively enhances the robustness of fake news detection models against adversarial comment attacks while preserving detection performance.

Abstract: The spread of fake news online distorts public judgment and erodes trust in social media platforms. Although recent fake news detection (FND) models perform well in standard settings, they remain vulnerable to adversarial comments-authored by real users or by large language models (LLMs)-that subtly shift model decisions. In view of this, we first present a comprehensive evaluation of comment attacks to existing fake news detectors and then introduce a group-adaptive adversarial training strategy to improve the robustness of FND models. To be specific, our approach comprises three steps: (1) dividing adversarial comments into three psychologically grounded categories: perceptual, cognitive, and societal; (2) generating diverse, category-specific attacks via LLMs to enhance adversarial training; and (3) applying a Dirichlet-based adaptive sampling mechanism (InfoDirichlet Adjusting Mechanism) that dynamically adjusts the learning focus across different comment categories during training. Experiments on benchmark datasets show that our method maintains strong detection accuracy while substantially increasing robustness to a wide range of adversarial comment perturbations.

[823] High-Power Training Data Identification with Provable Statistical Guarantees

Zhenlong Liu, Hao Zeng, Weiran Huang, Hongxin Wei

Main category: cs.LG

TL;DR: PTDI is a rigorous method for identifying training data in large models with strict false discovery rate control, overcoming limitations of previous approaches.

Details

Motivation: Current methods for identifying training data lack statistical guarantees or rely on easily violated assumptions, making them unreliable for copyright litigation, privacy auditing, and fair evaluation.

Method: Computes p-values using known unseen data, constructs conservative estimator for data usage proportion, scales p-values, and selects training data using data-dependent threshold.

Result: Extensive experiments show PTDI strictly controls FDR and achieves higher power across various models (LLMs and VLMs) and datasets.

Conclusion: PTDI provides a provable method for training data identification with strict FDR control and improved detection power, addressing critical needs in model auditing.

Abstract: Identifying training data within large-scale models is critical for copyright litigation, privacy auditing, and ensuring fair evaluation. The conventional approaches treat it as a simple binary classification task without statistical guarantees. A recent approach is designed to control the false discovery rate (FDR), but its guarantees rely on strong, easily violated assumptions. In this paper, we introduce Provable Training Data Identification (PTDI), a rigorous method that identifies a set of training data with strict false discovery rate (FDR) control. Specifically, our method computes p-values for each data point using a set of known unseen data, and then constructs a conservative estimator for the data usage proportion of the test set, which allows us to scale these p-values. Our approach then selects the final set of training data by identifying all points whose scaled p-values fall below a data-dependent threshold. This entire procedure enables the discovery of training data with provable, strict FDR control and significantly boosted power. Extensive experiments across a wide range of models (LLMs and VLMs), and datasets demonstrate that PTDI strictly controls the FDR and achieves higher power.

[824] Federated k-Means via Generalized Total Variation Minimization

A. Jung

Main category: cs.LG

TL;DR: Federated k-means clustering algorithm that enables interconnected devices to jointly cluster data without sharing local datasets, using a modified local k-means approach with penalty terms for centroid discrepancies.

Details

Motivation: To enable privacy-preserving clustering across interconnected devices that have private local datasets, allowing joint clustering without data sharing.

Method: Formulate federated k-means as GTVMin instance, where each device updates local centroids by solving modified local k-means with penalty terms for centroid discrepancies between neighboring devices.

Result: Developed a federated k-means algorithm that only requires sharing aggregated information among devices, making it privacy-friendly.

Conclusion: The proposed federated k-means approach successfully enables joint clustering across devices while preserving privacy through limited information sharing.

Abstract: We consider the problem of federated clustering, where interconnected devices have access to private local datasets and need to jointly cluster the overall dataset without sharing their local dataset. Our focus is on hard clustering based on the k-means principle. We formulate federated k-means clustering as an instance of GTVMin. This formulation naturally lends to a federated k-means algorithm where each device updates local cluster centroids by solving a modified local k-means problem. The modification involves adding a penalty term to measure the discrepancy between the cluster centroid of neighbouring devices. Our federated k-means algorithm is privacy-friendly as it only requires sharing aggregated information among interconnected devices.

[825] On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning

Tongtian Zhu, Tianyu Zhang, Mingze Wang, Zhanpeng Zhou, Can Wang

Main category: cs.LG

TL;DR: Decentralized learning with communication concentrated in later stages, especially a final global merge, significantly improves generalization under high data heterogeneity.

Details

Motivation: Decentralized learning is scalable but limited by peer-to-peer communication. The paper aims to determine optimal communication scheduling to improve global generalization.

Method: Study communication scheduling over time, including when and how frequently devices synchronize. Implement fully connected communication at the final step via single global merging.

Result: Concentrating communication in later stages remarkably improves generalization. Final global merging significantly boosts performance under high data heterogeneity. Theoretical analysis shows merged model matches parallel SGD convergence rate.

Conclusion: Decentralized learning can generalize well under high data heterogeneity and limited communication. Model merging research offers promising new avenues.

Abstract: Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time to improve global generalization, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global generalization. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, can significant improve the generalization performance of decentralized learning under serve high data heterogeneity. Our theoretical contributions, which explains these phenomena, are first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, which were previously considered as detrimental noise, as constructive components essential for matching this rate. This work provides promising results that decentralized learning is able to generalize under high data heterogeneity and limited communication, while offering broad new avenues for model merging research. The code will be made publicly available.

[826] ICL-Router: In-Context Learned Model Representations for LLM Routing

Chenxu Wang, Hao Li, Yiqun Zhang, Linyao Chen, Jianhao Chen, Ping Jian, Peng Ye, Qiaosheng Zhang, Shuyue Hu

Main category: cs.LG

TL;DR: A novel model routing method using in-context vectors to represent model capabilities, enabling dynamic query routing to the most suitable model without retraining when adding new models.

Details

Motivation: Current model routing methods require retraining when adding new models and rely on accurate model representations, limiting scalability and adaptability.

Method: Two-stage approach: 1) Embed and project queries into vectors, training a projector and LLM-based router to reconstruct queries and align representations; 2) Profile candidate models on query sets and learn to predict model performance using in-context vectors of queries and model capabilities.

Result: Achieves state-of-the-art routing performance in both in-distribution and out-of-distribution tasks, with seamless integration of new models without router retraining.

Conclusion: The proposed in-context vector routing method effectively addresses scalability limitations in model routing while maintaining high performance across various tasks.

Abstract: Large language models (LLMs) often exhibit complementary strengths. Model routing harnesses these strengths by dynamically directing each query to the most suitable model, given a candidate model pool. However, routing performance relies on accurate model representations, and adding new models typically requires retraining, limiting scalability. To address these challenges, we propose a novel routing method using in-context vectors to represent model capabilities. The method proceeds in two stages. First, queries are embedded and projected into vectors, with a projector and LLM-based router trained to reconstruct the original queries, aligning vector representations with the router’s semantic space. Second, each candidate model is profiled on a query set, and the router learns – based on in-context vectors of query and model performance – to predict whether each model can correctly answer new queries. Extensive experiments demonstrate that our method achieves state-of-the-art routing performance in both in-distribution and out-of-distribution tasks. Moreover, our method allows for seamless integration of new models without retraining the router. The code is available at https://github.com/lalalamdbf/ICL-Router.

[827] It’s 2025 – Narrative Learning is the new baseline to beat for explainable machine learning

Gregory D. Baker

Main category: cs.LG

TL;DR: Narrative Learning is a new methodology where models are defined in natural language and refined through explanatory prompts instead of numerical optimization, showing improved accuracy over baseline explainable models in most tested datasets.

Details

Motivation: To develop a more interpretable machine learning approach that uses natural language definitions and explanatory prompts rather than traditional numerical optimization methods.

Method: Models are defined entirely in natural language and iteratively refine their classification criteria using explanatory prompts. Evaluated on 3 synthetic and 3 natural datasets compared against 7 baseline explainable machine learning models.

Result: On 5 out of 6 datasets, Narrative Learning became more accurate than baseline explainable models by 2025 or earlier due to language model improvements. Also analyzed lexicostatistics of model outputs as a proxy for explanation comprehensibility.

Conclusion: Narrative Learning demonstrates potential as an effective alternative to traditional machine learning approaches, with improved accuracy and interpretability through natural language-based model definition and refinement.

Abstract: In this paper, we introduce Narrative Learning, a methodology where models are defined entirely in natural language and iteratively refine their classification criteria using explanatory prompts rather than traditional numerical optimisation. We report on experiments to evaluate the accuracy and potential of this approach using 3 synthetic and 3 natural datasets and compare them against 7 baseline explainable machine learning models. We demonstrate that on 5 out of 6 of these datasets, Narrative Learning became more accurate than the baseline explainable models in 2025 or earlier because of improvements in language models. We also report on trends in the lexicostatistics of these models’ outputs as a proxy for the comprehensibility of the explanations.

[828] Evaluating LLM-Based Process Explanations under Progressive Behavioral-Input Reduction

P. van Oerle, R. H. Bemthuis, F. A. Bukhsh

Main category: cs.LG

TL;DR: LLM-generated process model explanations maintain quality despite input size reduction, enabling more efficient process analysis.

Details

Motivation: To reduce computational costs of generating textual explanations from large process models by exploring quality preservation under input size reduction.

Method: Pipeline that discovers models from progressively smaller log prefixes, prompts LLM for explanations, and uses another LLM to assess quality across completeness, bottlenecks, and improvements.

Result: Explanation quality largely preserved under moderate input reduction, showing practical cost-quality trade-off on synthetic logs.

Conclusion: Progressive behavioral-input reduction offers path to computationally efficient LLM-assisted process analysis in resource-constrained settings.

Abstract: Large Language Models (LLMs) are increasingly used to generate textual explanations of process models discovered from event logs. Producing explanations from large behavioral abstractions (e.g., directly-follows graphs or Petri nets) can be computationally expensive. This paper reports an exploratory evaluation of explanation quality under progressive behavioral-input reduction, where models are discovered from progressively smaller prefixes of a fixed log. Our pipeline (i) discovers models at multiple input sizes, (ii) prompts an LLM to generate explanations, and (iii) uses a second LLM to assess completeness, bottleneck identification, and suggested improvements. On synthetic logs, explanation quality is largely preserved under moderate reduction, indicating a practical cost-quality trade-off. The study is exploratory, as the scores are LLM-based (comparative signals rather than ground truth) and the data are synthetic. The results suggest a path toward more computationally efficient, LLM-assisted process analysis in resource-constrained settings.

[829] ARROW: An Adaptive Rollout and Routing Method for Global Weather Forecasting

Jindong Tian, Yifei Ding, Ronghui Xu, Hao Miao, Chenjuan Guo, Bin Yang

Main category: cs.LG

TL;DR: ARROW is an adaptive-rollout multi-scale temporal routing method for global weather forecasting that addresses limitations in existing data-driven approaches by modeling multi-scale temporal dependencies and using reinforcement learning for adaptive rollout scheduling.

Details

Motivation: Existing weather forecasting methods inadequately model spatial and multi-scale temporal dependencies in global weather systems and struggle with error accumulation versus fine-grained variation capture in autoregressive rollout strategies.

Method: Proposes ARROW with: (1) multi-interval forecasting model using Shared-Private Mixture-of-Experts to capture shared patterns and specific characteristics across time scales, and Ring Positional Encoding for Earth’s circular latitude structure; (2) reinforcement learning-based adaptive rollout scheduler that selects optimal time intervals based on current weather state.

Result: ARROW achieves state-of-the-art performance in global weather forecasting.

Conclusion: ARROW establishes a promising paradigm for global weather forecasting by effectively addressing key limitations in existing data-driven approaches through multi-scale modeling and adaptive rollout strategies.

Abstract: Weather forecasting is a fundamental task in spatiotemporal data analysis, with broad applications across a wide range of domains. Existing data-driven forecasting methods typically model atmospheric dynamics over a fixed short time interval (e.g., 6 hours) and rely on naive autoregression-based rollout for long-term forecasting (e.g., 138 hours). However, this paradigm suffers from two key limitations: (1) it often inadequately models the spatial and multi-scale temporal dependencies inherent in global weather systems, and (2) the rollout strategy struggles to balance error accumulation with the capture of fine-grained atmospheric variations. In this study, we propose ARROW, an Adaptive-Rollout Multi-scale temporal Routing method for Global Weather Forecasting. To contend with the first limitation, we construct a multi-interval forecasting model that forecasts weather across different time intervals. Within the model, the Shared-Private Mixture-of-Experts captures both shared patterns and specific characteristics of atmospheric dynamics across different time scales, while Ring Positional Encoding accurately encodes the circular latitude structure of the Earth when representing spatial information. For the second limitation, we develop an adaptive rollout scheduler based on reinforcement learning, which selects the most suitable time interval to forecast according to the current weather state. Experimental results demonstrate that ARROW achieves state-of-the-art performance in global weather forecasting, establishing a promising paradigm in this field.

[830] InterCorpRel-LLM: Enhancing Financial Relational Understanding with Graph-Language Models

Qianyou Sun, Jiexin Zheng, Bohan Jin, Lihua Chen, Yijie Peng

Main category: cs.LG

TL;DR: InterCorpRel-LLM is a cross-modal framework combining GNNs and LLMs to identify inter-firm relationships like supply chains and competition, achieving superior performance with minimal parameters.

Details

Motivation: Identifying inter-firm relationships is crucial for financial analysis but challenging due to data scale, sparsity, and contextual dependence. Existing methods either capture structure (GNNs) or semantics (LLMs) but not both effectively.

Method: Proposed InterCorpRel-LLM framework integrating GNNs with LLMs using proprietary FactSet data and three training tasks: company graph matching, industry classification, and supply relation prediction.

Result: Outperforms strong baselines including GPT-5 on supply relation identification, achieving F-score of 0.8543 vs 0.2287 with only 7B parameters and lightweight training. Also generalizes to zero-shot competitor identification.

Conclusion: The framework provides a robust tool for mapping complex corporate networks, enhancing decision-making and risk management in dynamic markets by effectively capturing both structure and semantics.

Abstract: Identifying inter-firm relationships such as supply and competitive ties is critical for financial analysis and corporate governance, yet remains challenging due to the scale, sparsity, and contextual dependence of corporate data. Graph-based methods capture structure but miss semantic depth, while large language models (LLMs) excel at text but remain limited in their ability to represent relational dependencies. To address this, we propose InterCorpRel-LLM, a cross-modal framework that integrates GNNs with LLMs, supported by a proprietary dataset derived from FactSet supply chain records and three tailored training tasks: company graph matching, industry classification, and supply relation prediction. This design enables effective joint modeling of structure and semantics. Experiments show that InterCorpRel-LLM substantially outperforms strong baselines, including GPT-5, on a supply relation identification task, achieving an F-score of 0.8543 vs. 0.2287 with only a 7B-parameter backbone and lightweight training. The model also generalizes to zero-shot competitor identification, underscoring its ability to capture nuanced inter-firm dynamics. Our framework thus provides analysts and strategists with a robust tool for mapping and reasoning about complex corporate networks, enhancing decision-making and risk management in dynamic markets.

[831] Machine learning methods fail to provide cohesive atheoretical construction of personality traits from semantic embeddings

Ayoub Bouguettaya, Elizabeth M. Stuart

Main category: cs.LG

TL;DR: The study tested a machine learning-generated personality model against the established Big Five model using Reddit data, finding the Big Five was far superior in descriptive power and interpretability.

Details

Motivation: To test whether machine learning can create better personality models from language data than established psychological frameworks like the Big Five.

Method: Created a bottom-up personality model from adjective lists using machine learning and compared it against the Big Five model by analyzing one million Reddit comments.

Result: The Big Five model (especially Agreeableness, Conscientiousness, and Neuroticism) provided much more powerful and interpretable descriptions of online communities. The machine learning clusters failed to create meaningful distinctions, missed the Extraversion trait, and lacked psychometric coherence.

Conclusion: The Big Five model remains robust and superior. Machine learning can help validate psychological theories but cannot replace them, and personality’s semantic structure appears context-dependent.

Abstract: The lexical hypothesis posits that personality traits are encoded in language and is foundational to models like the Big Five. We created a bottom-up personality model from a classic adjective list using machine learning and compared its descriptive utility against the Big Five by analyzing one million Reddit comments. The Big Five, particularly Agreeableness, Conscientiousness, and Neuroticism, provided a far more powerful and interpretable description of these online communities. In contrast, our machine-learning clusters provided no meaningful distinctions, failed to recover the Extraversion trait, and lacked the psychometric coherence of the Big Five. These results affirm the robustness of the Big Five and suggest personality’s semantic structure is context-dependent. Our findings show that while machine learning can help check the ecological validity of established psychological theories, it may not be able to replace them.

[832] Reliable Active Learning from Unreliable Labels via Neural Collapse Geometry

Atharv Goel, Sharat Agarwal, Saket Anand, Chetan Arora

Main category: cs.LG

TL;DR: NCAL-R is a reliable active learning framework that uses neural collapse geometry to select informative samples while mitigating the effects of noisy labels and data distribution shifts.

Details

Motivation: Conventional active learning methods are unreliable when labels are noisy or data distributions shift, as they often amplify errors by selecting mislabeled or redundant samples.

Method: Uses two geometric signals: Class-Mean Alignment Perturbation score to quantify structural impact on inter-class geometry, and Feature Fluctuation score to capture temporal representation instability across training checkpoints.

Result: Outperforms standard AL baselines on ImageNet-100 and CIFAR100, achieving higher accuracy with fewer labels, improved robustness under label noise, and stronger generalization to out-of-distribution data.

Conclusion: Incorporating geometric reliability criteria makes active learning more robust to annotation errors and distribution shifts, enabling more trustworthy deployment in real-world labeling pipelines.

Abstract: Active Learning (AL) promises to reduce annotation cost by prioritizing informative samples, yet its reliability is undermined when labels are noisy or when the data distribution shifts. In practice, annotators make mistakes, rare categories are ambiguous, and conventional AL heuristics (uncertainty, diversity) often amplify such errors by repeatedly selecting mislabeled or redundant samples. We propose Reliable Active Learning via Neural Collapse Geometry (NCAL-R), a framework that leverages the emergent geometric regularities of deep networks to counteract unreliable supervision. Our method introduces two complementary signals: (i) a Class-Mean Alignment Perturbation score, which quantifies how candidate samples structurally stabilize or distort inter-class geometry, and (ii) a Feature Fluctuation score, which captures temporal instability of representations across training checkpoints. By combining these signals, NCAL-R prioritizes samples that both preserve class separation and highlight ambiguous regions, mitigating the effect of noisy or redundant labels. Experiments on ImageNet-100 and CIFAR100 show that NCAL-R consistently outperforms standard AL baselines, achieving higher accuracy with fewer labels, improved robustness under synthetic label noise, and stronger generalization to out-of-distribution data. These results suggest that incorporating geometric reliability criteria into acquisition decisions can make Active Learning less brittle to annotation errors and distribution shifts, a key step toward trustworthy deployment in real-world labeling pipelines. Our code is available at https://github.com/Vision-IIITD/NCAL.

[833] Patentformer: A demonstration of AI-assisted automated patent drafting

Sai Krishna Reddy Mudhiganti, Juanyan Wang, Ruo Yang, Manali Sharma

Main category: cs.LG

TL;DR: Patentformer is an AI-powered platform that automates patent drafting to help patent attorneys create high-quality patent applications quickly while maintaining legal writing standards.

Details

Motivation: Patent drafting is challenging and requires extensive expertise from patent attorneys who need both legal and technical knowledge. The goal is to support attorneys by automating the drafting process.

Method: The paper presents Patentformer, an AI-powered automated patent drafting platform designed to assist patent attorneys in rapidly producing patent applications.

Result: The platform demonstrates the ability to generate high-quality patent applications that adhere to legal writing standards.

Conclusion: Patentformer successfully automates patent drafting, providing valuable support to patent attorneys by streamlining the creation of legally compliant patent applications.

Abstract: Patent drafting presents significant challenges due to its reliance on the extensive experience and specialized expertise of patent attorneys, who must possess both legal acumen and technical understanding of an invention to craft patent applications in a formal legal writing style. This paper presents a demonstration of Patentformer, an AI-powered automated patent drafting platform designed to support patent attorneys by rapidly producing high-quality patent applications adhering to legal writing standards.

[834] PatentVision: A multimodal method for drafting patent applications

Ruo Yang, Sai Krishna Reddy Mudhiganti, Manali Sharma

Main category: cs.LG

TL;DR: PatentVision is a multimodal framework that uses Large Vision Language Models (LVLMs) to automate patent drafting by integrating textual claims and visual drawings, producing more accurate and human-aligned patent specifications than text-only methods.

Details

Motivation: Patent drafting is complex and requires detailed technical descriptions, legal compliance, and visual elements. While LVLMs show promise in various tasks, their application in patent writing automation remains underexplored despite the potential to reduce manual workloads and improve consistency.

Method: Built on advanced LVLMs, PatentVision integrates textual and visual inputs (patent claims and drawings) through a multimodal framework. It combines fine-tuned vision-language models with domain-specific training tailored to patents to enhance accuracy.

Result: Experiments show PatentVision surpasses text-only methods, producing outputs with greater fidelity and alignment with human-written standards. The incorporation of visual data enables better representation of intricate design features and functional connections, leading to richer and more precise results.

Conclusion: This study demonstrates the value of multimodal techniques in patent automation, providing a scalable tool that advances patent drafting and lays groundwork for broader LVLM applications in specialized domains, potentially transforming intellectual property management and innovation processes.

Abstract: Patent drafting is complex due to its need for detailed technical descriptions, legal compliance, and visual elements. Although Large Vision Language Models (LVLMs) show promise across various tasks, their application in automating patent writing remains underexplored. In this paper, we present PatentVision, a multimodal framework that integrates textual and visual inputs such as patent claims and drawings to generate complete patent specifications. Built on advanced LVLMs, PatentVision enhances accuracy by combining fine tuned vision language models with domain specific training tailored to patents. Experiments reveal it surpasses text only methods, producing outputs with greater fidelity and alignment with human written standards. Its incorporation of visual data allows it to better represent intricate design features and functional connections, leading to richer and more precise results. This study underscores the value of multimodal techniques in patent automation, providing a scalable tool to reduce manual workloads and improve consistency. PatentVision not only advances patent drafting but also lays the groundwork for broader use of LVLMs in specialized areas, potentially transforming intellectual property management and innovation processes.

[835] Leveraging Shared Prototypes for a Multimodal Pulse Motion Foundation Model

Wanting Mao, Maxwell A Xu, Harish Haresamudram, Mithun Saha, Santosh Kumar, James Matthew Rehg

Main category: cs.LG

TL;DR: ProtoMM is a self-supervised learning framework that uses a shared prototype dictionary to align multi-modal time-series biosignals (ECG, PPG, EDA, accelerometry) in a common embedding space, overcoming limitations of contrastive methods.

Details

Motivation: Existing multi-modal SSL approaches rely on CLIP-style contrastive objectives that overfit to easily aligned features and misclassify valid cross-modal relationships as negatives, leading to fragmented and non-generalizable embeddings.

Method: Introduces a shared prototype dictionary to anchor heterogeneous modalities in a common embedding space, clustering representations around shared prototypes rather than using explicit negative sampling.

Result: ProtoMM outperforms contrastive-only and prior multimodal SSL methods, achieving state-of-the-art performance while offering improved interpretability of learned features.

Conclusion: ProtoMM provides a coherent “common language” for physiological signals by capturing complementary information across modalities through prototype-based alignment.

Abstract: Modeling multi-modal time-series data is critical for capturing system-level dynamics, particularly in biosignals where modalities such as ECG, PPG, EDA, and accelerometry provide complementary perspectives on interconnected physiological processes. While recent self-supervised learning (SSL) advances have improved unimodal representation learning, existing multi-modal approaches often rely on CLIP-style contrastive objectives that overfit to easily aligned features and misclassify valid cross-modal relationships as negatives, resulting in fragmented and non-generalizable embeddings. To overcome these limitations, we propose ProtoMM, a novel SSL framework that introduces a shared prototype dictionary to anchor heterogeneous modalities in a common embedding space. By clustering representations around shared prototypes rather than explicit negative sampling, our method captures complementary information across modalities and provides a coherent “common language” for physiological signals. In this work, we focus on developing a Pulse Motion foundation model with ProtoMM and demonstrate that our approach outperforms contrastive-only and prior multimodal SSL methods, achieving state-of-the-art performance while offering improved interpretability of learned features.

[836] HeSRN: Representation Learning On Heterogeneous Graphs via Slot-Aware Retentive Network

Yifan Lu, Ziyun Zou, Belal Alsinglawi, Islam Al-Qudah, Izzat Alsmadi, Feilong Tang, Pengfei Jiao, Shoaib Jameel

Main category: cs.LG

TL;DR: HeSRN is a Heterogeneous Slot-aware Retentive Network that addresses computational complexity and semantic modeling limitations of Graph Transformers on heterogeneous graphs through slot-aware structure encoding and retention-based fusion.

Details

Motivation: Graph Transformers have quadratic computational complexity and struggle to model heterogeneous semantics effectively, limiting their scalability and generalization on real-world heterogeneous graphs.

Method: Introduces slot-aware structure encoder to disentangle node-type semantics, projects heterogeneous features into independent slots with slot normalization and retention-based fusion, and replaces self-attention with retention-based encoder for linear time complexity.

Result: Extensive experiments on four real-world heterogeneous graph datasets show HeSRN consistently outperforms state-of-the-art heterogeneous graph neural networks and Graph Transformer baselines on node classification tasks with superior accuracy and significantly lower computational complexity.

Conclusion: HeSRN provides an efficient and expressive solution for heterogeneous graph representation learning by addressing key limitations of existing Transformer-based models through novel slot-aware and retention-based approaches.

Abstract: Graph Transformers have recently achieved remarkable progress in graph representation learning by capturing long-range dependencies through self-attention. However, their quadratic computational complexity and inability to effectively model heterogeneous semantics severely limit their scalability and generalization on real-world heterogeneous graphs. To address these issues, we propose HeSRN, a novel Heterogeneous Slot-aware Retentive Network for efficient and expressive heterogeneous graph representation learning. HeSRN introduces a slot-aware structure encoder that explicitly disentangles node-type semantics by projecting heterogeneous features into independent slots and aligning their distributions through slot normalization and retention-based fusion, effectively mitigating the semantic entanglement caused by forced feature-space unification in previous Transformer-based models. Furthermore, we replace the self-attention mechanism with a retention-based encoder, which models structural and contextual dependencies in linear time complexity while maintaining strong expressive power. A heterogeneous retentive encoder is further employed to jointly capture both local structural signals and global heterogeneous semantics through multi-scale retention layers. Extensive experiments on four real-world heterogeneous graph datasets demonstrate that HeSRN consistently outperforms state-of-the-art heterogeneous graph neural networks and Graph Transformer baselines on node classification tasks, achieving superior accuracy with significantly lower computational complexity.

[837] Scaling Laws and Symmetry, Evidence from Neural Force Fields

Khang Ngo, Siamak Ravanbakhsh

Main category: cs.LG

TL;DR: Equivariant architectures scale better than non-equivariant models in learning interatomic potentials, with higher-order representations showing superior scaling exponents. Data and model sizes should scale together for compute-optimal training.

Details

Motivation: To investigate how equivariance and architectural choices affect scaling behavior in geometric learning tasks, particularly interatomic potentials, and challenge the common belief that models should discover fundamental inductive biases like symmetry on their own.

Method: Empirical study analyzing scaling behavior with respect to data, parameters, and compute using different architectures (equivariant vs non-equivariant) with architecture-dependent exponents.

Result: Equivariant architectures scale better than non-equivariant models, with higher-order representations achieving better scaling exponents. Power-law scaling behavior observed with architecture-dependent exponents.

Conclusion: Fundamental inductive biases like symmetry should not be left for models to discover, especially at scale, as they change the inherent task difficulty and scaling laws. Equivariant architectures with higher-order representations are crucial for better scaling.

Abstract: We present an empirical study in the geometric task of learning interatomic potentials, which shows equivariance matters even more at larger scales; we show a clear power-law scaling behaviour with respect to data, parameters and compute with ``architecture-dependent exponents’’. In particular, we observe that equivariant architectures, which leverage task symmetry, scale better than non-equivariant models. Moreover, among equivariant architectures, higher-order representations translate to better scaling exponents. Our analysis also suggests that for compute-optimal training, the data and model sizes should scale in tandem regardless of the architecture. At a high level, these results suggest that, contrary to common belief, we should not leave it to the model to discover fundamental inductive biases such as symmetry, especially as we scale, because they change the inherent difficulty of the task and its scaling laws.

[838] A Generic Machine Learning Framework for Radio Frequency Fingerprinting

Alex Hiles, Bashar I. Ahmad

Main category: cs.LG

TL;DR: A generic machine learning framework for RF fingerprinting that supports multiple downstream tasks including Specific Emitter Identification, Emitter Data Association, and RF Emitter Clustering using real RF datasets.

Details

Motivation: Traditional RF fingerprinting methods are labor-intensive, inflexible, and limited to specific emitter types or transmission schemes. Data-driven ML approaches can automatically learn intricate fingerprints from raw data and deliver superior performance.

Method: Proposed a generic ML-enabled RF fingerprinting framework that formulates various downstream tasks (SEI, EDA, RFEC) as RF fingerprint-dependent tasks. Uses real RF datasets to demonstrate the framework across different application areas.

Result: The framework successfully demonstrates applicability across multiple tasks and real-world scenarios including spaceborne surveillance, signal intelligence, and countering drones using real RF datasets.

Conclusion: Data-driven ML approaches provide a flexible and superior alternative to traditional RF fingerprinting methods, enabling automated extraction of nuanced emitter characteristics for various downstream tasks in both defense and civilian applications.

Abstract: Fingerprinting Radio Frequency (RF) emitters typically involves finding unique emitter characteristics that are featured in their transmitted signals. These fingerprints are nuanced but sufficiently detailed, motivating the pursuit of methods that can successfully extract them. The most granular downstream task is known as Specific Emitter Identification (SEI), which requires a well informed RF fingerprinting (RFF) approach for it to be successful. RFF and SEI have a long history, with numerous application areas in defence and civilian contexts such as signal intelligence, electronic surveillance, physical-layer authentication of wireless communication devices, to name a few. RFF methods also support many other downstream tasks such as Emitter Data Association (EDA) and RF Emitter Clustering (RFEC) and are applicable to a range of transmission types. In recent years, data-driven approaches have become popular in the RFF domain due to their ability to automatically learn intricate fingerprints from raw data. These methods generally deliver superior performance when compared to traditional techniques. The more traditional approaches are often labour-intensive, inflexible and only applicable to a particular emitter type or transmission scheme. Therefore, we consider data-driven Machine Learning (ML)-enabled RFF. In particular, we propose a generic framework for ML-enabled RFF which is inclusive of several popular downstream tasks such as SEI, EDA and RFEC. Each task is formulated as a RF fingerprint-dependent task. A variety of use cases using real RF datasets are presented here to demonstrate the framework for a range of tasks and application areas, such as spaceborne surveillance, signal intelligence and countering drones.

[839] Why Do Transformers Fail to Forecast Time Series In-Context?

Yufa Zhou, Yixiao Wang, Surbhi Goel, Anru R. Zhang

Main category: cs.LG

TL;DR: Transformers fail to outperform simple linear models in time series forecasting due to theoretical limitations in in-context learning, where linear self-attention cannot achieve lower MSE than classical linear models.

Details

Motivation: Despite significant efforts using LLMs and Transformers for time series forecasting, empirical evidence shows they often underperform simple linear models, but the theoretical understanding of this phenomenon remains limited.

Method: Theoretical analysis of Transformers’ limitations through in-context learning theory under AR(p) data, examining linear self-attention models and their asymptotic behavior as context length increases.

Result: Three key findings: (1) Linear self-attention cannot achieve lower expected MSE than classical linear models; (2) LSA asymptotically recovers optimal linear predictor as context length approaches infinity; (3) Chain-of-Thought style inference causes predictions to collapse to the mean exponentially.

Conclusion: The work provides theoretical insights into Transformers’ limitations for time series forecasting, encouraging the community to critically evaluate sophisticated architectures and revisit fundamental theoretical constraints in forecasting.

Abstract: Time series forecasting (TSF) remains a challenging and largely unsolved problem in machine learning, despite significant recent efforts leveraging Large Language Models (LLMs), which predominantly rely on Transformer architectures. Empirical evidence consistently shows that even powerful Transformers often fail to outperform much simpler models, e.g., linear models, on TSF tasks; however, a rigorous theoretical understanding of this phenomenon remains limited. In this paper, we provide a theoretical analysis of Transformers’ limitations for TSF through the lens of In-Context Learning (ICL) theory. Specifically, under AR($p$) data, we establish that: (1) Linear Self-Attention (LSA) models $\textit{cannot}$ achieve lower expected MSE than classical linear models for in-context forecasting; (2) as the context length approaches to infinity, LSA asymptotically recovers the optimal linear predictor; and (3) under Chain-of-Thought (CoT) style inference, predictions collapse to the mean exponentially. We empirically validate these findings through carefully designed experiments. Our theory not only sheds light on several previously underexplored phenomena but also offers practical insights for designing more effective forecasting architectures. We hope our work encourages the broader research community to revisit the fundamental theoretical limitations of TSF and to critically evaluate the direct application of increasingly sophisticated architectures without deeper scrutiny.

[840] SVTime: Small Time Series Forecasting Models Informed by “Physics” of Large Vision Model Forecasters

ChengAo Shen, Ziming Zhao, Hanghang Tong, Dongjin Song, Dongsheng Luo, Qingsong Wen, Jingchao Ni

Main category: cs.LG

TL;DR: SVTime is a lightweight time series forecasting model that achieves large-model-like performance with 1000x fewer parameters, making it practical for resource-constrained users while maintaining competitive accuracy.

Details

Motivation: Large pre-trained models for time series analysis are energy-intensive and expensive, creating sustainability concerns. There's a need for compact, specialized models that can perform core tasks like forecasting effectively for small businesses and resource-constrained users.

Method: SVTime identifies key inductive biases from large Vision model (LVM) forecasters and encodes them through carefully designed linear layers and constraint functions, creating small models that capture the essential “physics” of LVM behavior in long-term time series forecasting.

Result: SVTime outperforms state-of-the-art lightweight models and rivals large models across 21 baselines on 8 benchmark datasets, achieving comparable performance with 10^3 fewer parameters than LVMs while enabling efficient training and inference.

Conclusion: It’s possible to build cost-effective lightweight models that match large-model performance on core forecasting tasks, providing a sustainable alternative to energy-intensive large models for practical applications.

Abstract: Time series AI is crucial for analyzing dynamic web content, driving a surge of pre-trained large models known for their strong knowledge encoding and transfer capabilities across diverse tasks. However, given their energy-intensive training, inference, and hardware demands, using large models as a one-fits-all solution raises serious concerns about carbon footprint and sustainability. For a specific task, a compact yet specialized, high-performing model may be more practical and affordable, especially for resource-constrained users such as small businesses. This motivates the question: Can we build cost-effective lightweight models with large-model-like performance on core tasks such as forecasting? This paper addresses this question by introducing SVTime, a novel Small model inspired by large Vision model (LVM) forecasters for long-term Time series forecasting (LTSF). Recently, LVMs have been shown as powerful tools for LTSF. We identify a set of key inductive biases of LVM forecasters – analogous to the “physics” governing their behaviors in LTSF – and design small models that encode these biases through meticulously crafted linear layers and constraint functions. Across 21 baselines spanning lightweight, complex, and pre-trained large models on 8 benchmark datasets, SVTime outperforms state-of-the-art (SOTA) lightweight models and rivals large models with 10^3 fewer parameters than LVMs, while enabling efficient training and inference in low-resource settings.

[841] Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, Liubov Nedoshivina, Pin-Yu Chen, Prasanna Sattigeri, Xiangliang Zhang

Main category: cs.LG

TL;DR: The paper addresses pre-execution safety for LLM agents by introducing AuraGen for data synthesis, Safiron as a foundational guardrail model, and Pre-Exec Bench for evaluation, achieving robust risk detection before action execution.

Details

Motivation: Existing guardrails mostly operate post-execution, which is difficult to scale and provides limited supervision at the planning stage. Pre-execution intervention is safer as it prevents harm before actions are carried out.

Method: Proposes AuraGen for synthesizing benign trajectories and injecting category-labeled risks, Safiron as a cross-planner adapter with guardian model for flagging risks and generating rationales, and Pre-Exec Bench for comprehensive evaluation.

Result: Extensive experiments show consistent gains over strong baselines on Pre-Exec Bench, with Safiron achieving robust transfer across settings and providing actionable practices for safer agentic systems.

Conclusion: The work provides a practical template for pre-execution safety in LLM agents, addressing data, model, and evaluation gaps through synthesized data, foundational guardrails, and realistic benchmarks.

Abstract: While LLM agents can plan multi-step tasks, intervening at the planning stage-before any action is executed-is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: data gap, model gap, and evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre-execution safety. To close the guardian model gap, we propose a foundational guardrail Safiron, combining a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains of the proposed guardrail over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.

[842] Large Language Models for Imbalanced Classification: Diversity makes the difference

Dang Nguyen, Sunil Gupta, Kien Do, Thin Nguyen, Taylor Braund, Alexis Whitton, Svetha Venkatesh

Main category: cs.LG

TL;DR: A novel LLM-based oversampling method that enhances diversity in synthetic minority samples through conditional generation, permutation fine-tuning, and interpolated sample training, outperforming 8 SOTA baselines on 10 tabular datasets.

Details

Motivation: Existing oversampling methods like SMOTE require converting categorical variables to numerical vectors, causing information loss. Current LLM-based methods generate minority samples with limited diversity, reducing robustness in downstream classification.

Method: 1) Sampling strategy conditioning synthetic generation on both minority labels and features; 2) Permutation strategy for fine-tuning pre-trained LLMs; 3) Fine-tuning LLMs on both minority and interpolated samples to enrich variability.

Result: Significantly outperforms eight state-of-the-art baselines on 10 tabular datasets. Generated synthetic samples are both realistic and diverse. Theoretical analysis proves the method encourages diversity through entropy-based perspective.

Conclusion: The proposed LLM-based oversampling method effectively addresses diversity limitations in existing approaches, generating high-quality synthetic samples that improve classification performance on imbalanced datasets.

Abstract: Oversampling is one of the most widely used approaches for addressing imbalanced classification. The core idea is to generate additional minority samples to rebalance the dataset. Most existing methods, such as SMOTE, require converting categorical variables into numerical vectors, which often leads to information loss. Recently, large language model (LLM)-based methods have been introduced to overcome this limitation. However, current LLM-based approaches typically generate minority samples with limited diversity, reducing robustness and generalizability in downstream classification tasks. To address this gap, we propose a novel LLM-based oversampling method designed to enhance diversity. First, we introduce a sampling strategy that conditions synthetic sample generation on both minority labels and features. Second, we develop a new permutation strategy for fine-tuning pre-trained LLMs. Third, we fine-tune the LLM not only on minority samples but also on interpolated samples to further enrich variability. Extensive experiments on 10 tabular datasets demonstrate that our method significantly outperforms eight SOTA baselines. The generated synthetic samples are both realistic and diverse. Moreover, we provide theoretical analysis through an entropy-based perspective, proving that our method encourages diversity in the generated samples.

[843] Combined Representation and Generation with Diffusive State Predictive Information Bottleneck

Richard John, Yunrui Qiu, Lukas Herron, Pratyush Tiwary

Main category: cs.LG

TL;DR: D-SPIB combines time-lagged information bottleneck for molecular representation learning with diffusion modeling in a joint training framework, enabling balanced representation learning and generation while incorporating thermodynamic information from multiple simulation trajectories.

Details

Motivation: Molecular science faces challenges with expensive data collection and rare important events, requiring effective compression to lower-dimensional manifolds for downstream tasks like generation.

Method: Joint training of time-lagged information bottleneck for molecular representations and diffusion model in one flexible architecture (D-SPIB), capable of combining temperature information from different molecular simulation trajectories.

Result: Benchmarked on multiple molecular tasks, D-SPIB demonstrates potential for exploring physical conditions outside the training set and learning coherent thermodynamic representations.

Conclusion: D-SPIB provides a flexible framework that balances representation learning and generation objectives while effectively handling molecular data challenges and thermodynamic information integration.

Abstract: Generative modeling becomes increasingly data-intensive in high-dimensional spaces. In molecular science, where data collection is expensive and important events are rare, compression to lower-dimensional manifolds is especially important for various downstream tasks, including generation. We combine a time-lagged information bottleneck designed to characterize molecular important representations and a diffusion model in one joint training objective. The resulting protocol, which we term Diffusive State Predictive Information Bottleneck (D-SPIB), enables the balancing of representation learning and generation aims in one flexible architecture. Additionally, the model is capable of combining temperature information from different molecular simulation trajectories to learn a coherent and useful internal representation of thermodynamics. We benchmark D-SPIB on multiple molecular tasks and showcase its potential for exploring physical conditions outside the training set.

[844] Principled Operator Learning in Ocean Dynamics: The Role of Temporal Structure

Vahidreza Jahanmard, Ali Ramezani-Kebrya, Robinson Hordoir

Main category: cs.LG

TL;DR: FNOtD, a modified Fourier Neural Operator that incorporates temporal Fourier modes and internalizes dispersion relations, improves long-term prediction stability and physical fidelity in high-resolution ocean forecasting compared to standard FNO.

Details

Motivation: Address challenges in neural operators for ocean PDEs, including long-term prediction stability, adherence to physical laws, and handling high-frequency processes in weather and ocean forecasting.

Method: Modified Fourier Neural Operator (FNOtD) that incorporates temporal Fourier modes and internalizes dispersion relations while learning solution operators for ocean PDEs, entangling space and time in training integral kernels.

Result: FNOtD substantially improves long-term prediction stability and consistency with physical dynamics in high-frequency settings compared to standard FNO, and provides competitive predictive skill relative to state-of-the-art numerical ocean models with significantly lower computational cost.

Conclusion: Entangling space and time in neural operator training enables effective capture of multiscale wave propagation and ocean dynamics, demonstrating that modified Fourier Neural Operators can enhance physical fidelity in ocean prediction applications.

Abstract: Neural operators are becoming the default tools to learn solutions to governing partial differential equations (PDEs) in weather and ocean forecasting applications. Despite early promising achievements, significant challenges remain, including long-term prediction stability and adherence to physical laws, particularly for high-frequency processes. In this paper, we take a step toward addressing these challenges in high-resolution ocean prediction by incorporating temporal Fourier modes, demonstrating how this modification enhances physical fidelity. This study compares the standard Fourier Neural Operator (FNO) with its variant, FNOtD, which has been modified to internalize the dispersion relation while learning the solution operator for ocean PDEs. The results demonstrate that entangling space and time in the training of integral kernels enables the model to capture multiscale wave propagation and effectively learn ocean dynamics. FNOtD substantially improves long-term prediction stability and consistency with underlying physical dynamics in challenging high-frequency settings compared to the standard FNO. It also provides competitive predictive skill relative to a state-of-the-art numerical ocean model, while requiring significantly lower computational cost.

[845] Causality $\neq$ Decodability, and Vice Versa: Lessons from Interpreting Counting ViTs

Lianghuan Huang, Yingshan Chang

Main category: cs.LG

TL;DR: The paper investigates the relationship between decodability and causality in vision transformers, finding systematic mismatches where some layers have strong causal influence but weak decodability, while others show accurate decoding but are functionally inert.

Details

Motivation: To disentangle two often conflated notions in mechanistic interpretability: decodability (recoverability of information) and causality (functional influence on outputs), particularly in vision transformers fine-tuned for object counting.

Method: Used activation patching to test causal role of spatial and CLS tokens by transplanting activations across clean-corrupted image pairs, and trained linear probes to assess decodability of count information at different depths.

Result: Found systematic mismatches: middle-layer object tokens exert strong causal influence despite weak decodability, while final-layer object tokens support accurate decoding but are functionally inert. CLS token becomes decodable in mid-layers but only acquires causal power in final layers.

Conclusion: Decodability and causality reflect complementary dimensions of representation - what information is present versus what is used - and their divergence can expose hidden computational circuits.

Abstract: Mechanistic interpretability seeks to uncover how internal components of neural networks give rise to predictions. A persistent challenge, however, is disentangling two often conflated notions: decodability–the recoverability of information from hidden states–and causality–the extent to which those states functionally influence outputs. In this work, we investigate their relationship in vision transformers (ViTs) fine-tuned for object counting. Using activation patching, we test the causal role of spatial and CLS tokens by transplanting activations across clean-corrupted image pairs. In parallel, we train linear probes to assess the decodability of count information at different depths. Our results reveal systematic mismatches: middle-layer object tokens exert strong causal influence despite being weakly decodable, whereas final-layer object tokens support accurate decoding yet are functionally inert. Similarly, the CLS token becomes decodable in mid-layers but only acquires causal power in the final layers. These findings highlight that decodability and causality reflect complementary dimensions of representation–what information is present versus what is used–and that their divergence can expose hidden computational circuits.

[846] A Unified Framework for Lifted Training and Inversion Approaches

Xiaoyu Wang, Alexandra Valavanis, Azhir Mahmood, Andreas Mang, Martin Benning, Audrey Repetti

Main category: cs.LG

TL;DR: Lifted training methods reformulate neural network training as a constrained optimization problem using penalty terms, enabling distributed optimization, handling non-differentiable activations, and improving training landscape conditioning.

Details

Motivation: Traditional gradient-based training faces challenges like vanishing/exploding gradients, difficulties with non-smooth activations, and limited parallelization due to sequential structure.

Method: Unified framework using Bregman distances that encapsulates various lifted training strategies (Method of Auxiliary Coordinates, Fenchel Lifted Networks, Lifted Bregman Training) and applies to architectures like MLPs, ResNets, and Proximal Neural Networks.

Result: Numerical results on imaging tasks show lifted Bregman approach is more effective and stable than conventional training, especially for architectures with proximal activations.

Conclusion: Lifted training provides a viable alternative to traditional methods, enabling distributed optimization, accommodating non-differentiable activations, and improving training stability.

Abstract: The training of deep neural networks predominantly relies on a combination of gradient-based optimisation and back-propagation for the computation of the gradient. While incredibly successful, this approach faces challenges such as vanishing or exploding gradients, difficulties with non-smooth activations, and an inherently sequential structure that limits parallelisation. Lifted training methods offer an alternative by reformulating the nested optimisation problem into a higher-dimensional, constrained optimisation problem where the constraints are no longer enforced directly but penalised with penalty terms. This chapter introduces a unified framework that encapsulates various lifted training strategies, including the Method of Auxiliary Coordinates, Fenchel Lifted Networks, and Lifted Bregman Training, and demonstrates how diverse architectures, such as Multi-Layer Perceptrons, Residual Neural Networks, and Proximal Neural Networks fit within this structure. By leveraging tools from convex optimisation, particularly Bregman distances, the framework facilitates distributed optimisation, accommodates non-differentiable proximal activations, and can improve the conditioning of the training landscape. We discuss the implementation of these methods using block-coordinate descent strategies, including deterministic implementations enhanced by accelerated and adaptive optimisation techniques, as well as implicit stochastic gradient methods. Furthermore, we explore the application of this framework to inverse problems, detailing methodologies for both the training of specialised networks (e.g., unrolled architectures) and the stable inversion of pre-trained networks. Numerical results on standard imaging tasks validate the effectiveness and stability of the lifted Bregman approach compared to conventional training, particularly for architectures employing proximal activations.

[847] Temporal Lifting as Latent-Space Regularization for Continuous-Time Flow Models in AI Systems

Jeffrey Camlin

Main category: cs.LG

TL;DR: Temporal lifting introduces a smooth monotone mapping to regularize near-singular behavior in continuous-time dynamical systems while preserving conservation laws, enabling globally smooth trajectories and stabilizing machine-learning dynamics.

Details

Motivation: To address near-singular behavior in continuous-time dynamical systems and stabilize physics-informed neural networks and latent-flow architectures used in AI systems.

Method: Introduces a latent-space formulation called temporal lifting, which uses a smooth monotone mapping t ↦ τ(t) to regularize near-singular behavior while preserving conservation laws of the underlying flow.

Result: Trajectories such as those of incompressible Navier-Stokes equations on the torus become globally smooth in the lifted coordinate, and temporal lifting acts as a continuous-time normalization that stabilizes physics-informed neural networks.

Conclusion: The framework successfully links analytic regularity theory with representation-learning methods for stiff or turbulent processes, providing a continuous-time normalization approach for machine-learning dynamics.

Abstract: We present a latent-space formulation of adaptive temporal reparametrization for continuous-time dynamical systems. The method, called temporal lifting, introduces a smooth monotone mapping $t \mapsto \tau(t)$ that regularizes near-singular behavior of the underlying flow while preserving its conservation laws. In the lifted coordinate, trajectories such as those of the incompressible Navier-Stokes equations on the torus $\mathbb{T}^3$ become globally smooth. From the standpoint of machine-learning dynamics, temporal lifting acts as a continuous-time normalization or time-warping operator that can stabilize physics-informed neural networks and other latent-flow architectures used in AI systems. The framework links analytic regularity theory with representation-learning methods for stiff or turbulent processes.

[848] Decomposer Networks: Deep Component Analysis and Synthesis

Mohsen Joneidi

Main category: cs.LG

TL;DR: Decomposer Networks (DecompNet) is a semantic autoencoder that factorizes inputs into multiple interpretable components using parallel branches with residual updates, enforcing competition among components for parsimonious representations.

Details

Motivation: To create interpretable component decompositions that go beyond classical autoencoders' single latent representations and enable semantically meaningful factorizations of inputs.

Method: Uses N parallel branches with residual inputs (original signal minus reconstructions from other branches), unrolling Gauss-Seidel style block-coordinate descent into a differentiable network to enforce competition among components.

Result: The network produces parsimonious, semantically meaningful representations through explicit component competition.

Conclusion: DecompNet represents a novel semantic autoencoder approach that implements an all-but-one residual update rule, distinguishing it from linear decomposition methods and existing object-centric architectures.

Abstract: We propose the Decomposer Networks (DecompNet), a semantic autoencoder that factorizes an input into multiple interpretable components. Unlike classical autoencoders that compress an input into a single latent representation, the Decomposer Network maintains N parallel branches, each assigned a residual input defined as the original signal minus the reconstructions of all other branches. By unrolling a Gauss–Seidel style block-coordinate descent into a differentiable network, DecompNet enforce explicit competition among components, yielding parsimonious, semantically meaningful representations. We situate our model relative to linear decomposition methods (PCA, NMF), deep unrolled optimization, and object-centric architectures (MONet, IODINE, Slot Attention), and highlight its novelty as the first semantic autoencoder to implement an all-but-one residual update rule.

[849] An Exploration of Non-Euclidean Gradient Descent: Muon and its Many Variants

Michael Crawshaw, Chirag Modi, Mingrui Liu, Robert M. Gower

Main category: cs.LG

TL;DR: Systematic exploration of gradient descent methods for neural networks, formalizing Adam and Muon as non-Euclidean gradient descent and proposing new variants like MuonMax and Momo-Muon combinations.

Details

Motivation: To define effective steepest descent methods for neural networks by properly choosing norms per layer, aggregation methods across layers, and normalization strategies.

Method: Systematically explored different alternatives for aggregating norms across layers, formalized existing optimizers (Adam, Muon) as non-Euclidean gradient descent, derived new Muon variants, and combined non-Euclidean methods with model-based momentum (Momo).

Result: Muon is sensitive to learning rate choice, while MuonMax is significantly more robust. Momo variants of Muon are more robust to hyperparameter tuning and often achieve better validation scores.

Conclusion: For new tasks with unknown optimal hyperparameters, use Momo in combination with MuonMax to save on costly hyperparameter tuning.

Abstract: To define a steepest descent method over a neural network, we need to choose a norm for each layer, a way to aggregate these norms across layers, and whether to use normalization. We systematically explore different alternatives for aggregating norms across layers, both formalizing existing combinations of Adam and the recently proposed Muon as a type of non-Euclidean gradient descent, and deriving new variants of the Muon optimizer. Through a comprehensive experimental evaluation of the optimizers within our framework, we find that Muon is sensitive to the choice of learning rate, whereas a new variant we call MuonMax is significantly more robust. We then show how to combine any non-Euclidean gradient method with model based momentum (known as Momo). The new Momo variants of Muon are significantly more robust to hyperparameter tuning, and often achieve a better validation score. Thus for new tasks, where the optimal hyperparameters are not known, we advocate for using Momo in combination with MuonMax to save on costly hyperparameter tuning.

[850] Harnessing Self-Supervised Deep Learning and Geostationary Remote Sensing for Advancing Wildfire and Associated Air Quality Monitoring: Improved Smoke and Fire Front Masking using GOES and TEMPO Radiance Data

Nicholas LaHaye, Thilanka Munashinge, Hugo Lee, Xiaohua Pan, Gonzalo Gonzalez Abad, Hazem Mahmoud, Jennifer Wei

Main category: cs.LG

TL;DR: This paper shows how NASA’s TEMPO satellite data and self-supervised deep learning can improve real-time wildfire and smoke monitoring in the western US.

Details

Motivation: To enhance wildfire and air quality management by leveraging new high-frequency satellite data and advanced machine learning techniques.

Method: Uses self-supervised deep learning system with GOES-18 and TEMPO satellite data to map wildfire fronts and smoke plumes in near real-time.

Result: Successfully distinguishes smoke from clouds, shows strong agreement across different sensing modalities, and significantly outperforms existing operational products.

Conclusion: The approach demonstrates effective real-time wildfire and smoke monitoring capabilities using advanced satellite data and deep learning.

Abstract: This work demonstrates the possibilities for improving wildfire and air quality management in the western United States by leveraging the unprecedented hourly data from NASA’s TEMPO satellite mission and advances in self-supervised deep learning. Here we demonstrate the efficacy of deep learning for mapping the near real-time hourly spread of wildfire fronts and smoke plumes using an innovative self-supervised deep learning-system: successfully distinguishing smoke plumes from clouds using GOES-18 and TEMPO data, strong agreement across the smoke and fire masks generated from different sensing modalities as well as significant improvement over operational products for the same cases.

[851] CALM: A Causal Analysis Language Model for Tabular Data in Complex Systems with Local Scores, Conditional Independence Tests, and Relation Attributes

Zhenjiang Fan, Zengyi Qin, Yuanning Zheng, Bo Xiong, Summer Han

Main category: cs.LG

TL;DR: CALM is a novel causal analysis language model that adapts LLM capabilities for tabular data, outperforming existing methods in causal discovery from observational data with over 91% accuracy.

Details

Motivation: Existing causal discovery methods have limitations in resolving causal direction, handling nonlinear associations, and efficiency. LLMs offer reasoning power but are designed for text, not tabular data which dominates causal analysis.

Method: CALM uses a Mamba-based architecture to classify causal patterns from pairwise relationships, integrating local causal scores, conditional independence tests, and relational attributes to capture linear, nonlinear, and conditional causal mechanisms.

Result: CALM achieves over 91% accuracy in simulation studies and successfully identifies causal factors in Hepatitis C virus progression, significantly outperforming existing methods like PC, causalMGM, and NOTEARS.

Conclusion: CALM represents a significant advancement in causal discovery by successfully adapting language model pattern recognition to tabular data, enabling accurate and generalizable causal analysis in complex systems.

Abstract: Causal discovery from observational data is fundamental to scientific fields like biology, where controlled experiments are often impractical. However, existing methods, including constraint-based (e.g., PC, causalMGM) and score-based approaches (e.g., NOTEARS), face significant limitations. These include an inability to resolve causal direction, restrictions to linear associations, sensitivity to violations of the faithfulness assumption, and inefficiency in searching vast hypothesis spaces. While large language models (LLMs) offer powerful reasoning capabilities, their application is hindered by a fundamental discrepancy: they are designed for text, while most causal data is tabular. To address these challenges, we introduce CALM, a novel causal analysis language model specifically designed for tabular data in complex systems. CALM leverages a Mamba-based architecture to classify causal patterns from pairwise variable relationships. It integrates a comprehensive suite of evidence, including local causal scores, conditional independence tests, and relational attributes, to capture a wide spectrum of linear, nonlinear, and conditional causal mechanisms. Trained on a diverse corpus of synthetic data (from linear, mixed, and nonlinear models) and 10 real-world biological datasets with rigorously validated causal relationships, our model ensures robustness and generalizability. Empirical evaluation demonstrates that CALM significantly outperforms existing methods in both simulation studies, achieving over 91% accuracy, and in a real-world application identifying causal factors in Hepatitis C virus progression. This work represents a significant step towards accurate and generalizable causal discovery by successfully adapting the pattern recognition capabilities of language models to the intricacies of tabular data.

[852] ProxRouter: Proximity-Weighted LLM Query Routing for Improved Robustness to Outliers

Shivam Patel, Neharika Jali, Ankur Mallick, Gauri Joshi

Main category: cs.LG

TL;DR: ProxRouter is a nonparametric query router for LLMs that uses exponentially tilted aggregation to improve robustness to outlier queries while maintaining inlier performance with minimal overhead.

Details

Motivation: Existing LLM query routers struggle with generalization to outlier queries due to limited training set diversity and high maintenance costs. Parametric routers require retraining while nonparametric routers have poor outlier handling.

Method: ProxRouter applies an exponentially tilted aggregation mechanism to balance bias and variance in nonparametric routers, enhancing robustness to outliers without requiring retraining.

Result: Experiments show ProxRouter improves outlier routing performance while preserving inlier performance with minimal computational overhead.

Conclusion: ProxRouter provides an effective training-free solution for LLM query routing that addresses the critical challenge of handling outlier queries in real-world AI platforms.

Abstract: Large language model (LLM) query routers are critical to modern AI platforms as they seek to improve efficiency by assigning inference queries to accurate, yet low-cost models. Parametric routers typically use trained neural networks for LLM selection but suffer from retraining and maintenance overheads. Nonparametric routers are training-free, instead estimating LLM accuracy and cost via similarity between encodings of the input query and training set queries. However, like their parametric counterparts, nonparametric routers struggle to generalize to outlier queries, an issue exacerbated by limited diversity in training sets which are costly to expand and difficult to keep current with ever-evolving use cases. We propose ProxRouter, which applies an exponentially tilted aggregation mechanism to balance bias and variance in nonparametric routers, improving their robustness to outliers. Experiments show ProxRouter enhances outlier routing while preserving inlier performance with minimal overhead.

[853] WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions

Sanjari Srivastava, Gang Li, Cheng Chang, Rishu Garg, Manpreet Kaur, Charlene Y. Lee, Yuezhang Li, Yining Mao, Ignacio Cases, Yanan Xie, Peng Qi

Main category: cs.LG

TL;DR: WARC-Bench is a new web navigation benchmark with 438 tasks that evaluates multimodal AI agents on subtasks using Web ARChive files, showing current models struggle with these short-horizon interactions.

Details

Motivation: Existing benchmarks don't extensively evaluate the capability of mastering subtasks - short-horizon interactions on multiple UI components - which is essential for robust web planning and navigation.

Method: Created WARC-Bench benchmark using Web ARChive files for sandboxed interactions with dynamic webpages, and explored supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) to improve models.

Result: Leading computer-use models achieved only 64.8% success rate. SFT models obtained 48.8% success rate, while RLVR over SFT checkpoints improved to 52.8%, outperforming many frontier models even in data-scarce settings.

Conclusion: Mastering subtasks is essential for robust web navigation, and WARC-Bench effectively evaluates this capability that existing benchmarks overlook.

Abstract: Training web agents to navigate complex, real-world websites requires them to master $\textit{subtasks}$ - short-horizon interactions on multiple UI components (e.g., choosing the correct date in a date picker, or scrolling in a container to extract information). We introduce WARC-Bench (Web Archive Benchmark), a novel web navigation benchmark featuring 438 tasks designed to evaluate multimodal AI agents on subtasks. WARC-Bench enables sandboxed interactions with dynamic and realistic webpages using Web ARChive files. We show that WARC-Bench is challenging for leading computer-use models, with the highest observed success rate being 64.8%. To improve open source models on subtask, we explore two common training techniques: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Experiments show that SFT models obtain a 48.8% success rate on the benchmark. Training with RLVR over SFT checkpoints, even in data-scarce settings, improves the score to 52.8% on WARC-Bench, outperforming many frontier models. Our analysis concludes that mastering these subtasks is essential for robust web planning and navigation, and is a capability not extensively evaluated by existing benchmarks.

[854] Myopic Bayesian Decision Theory for Batch Active Learning with Partial Batch Label Sampling

Kangping Hu, Stephen Mussmann

Main category: cs.LG

TL;DR: Derives Bayesian Decision Theory for active learning, leading to algorithms like EER and EPIG, and introduces ParBaLS for scalable batch active learning with EPIG.

Details

Motivation: Address the proliferation of active learning acquisition functions and provide a unified principle using Bayesian Decision Theory to guide decision-making.

Method: Derive BDT for myopic active learning, connect existing algorithms to BDT, and develop ParBaLS for scalable batch EPIG implementation.

Result: ParBaLS EPIG shows superior performance for fixed budget active learning with Bayesian Logistic Regression on Neural Embeddings across several datasets.

Conclusion: BDT provides a universal principle for active learning, and ParBaLS enables effective batch implementation of EPIG with better scaling properties than existing methods.

Abstract: Over the past couple of decades, many active learning acquisition functions have been proposed, leaving practitioners with an unclear choice of which to use. Bayesian Decision Theory (BDT) offers a universal principle to guide decision-making. In this work, we derive BDT for (Bayesian) active learning in the myopic framework, where we imagine we only have one more point to label. This derivation leads to effective algorithms such as Expected Error Reduction (EER), Expected Predictive Information Gain (EPIG), and other algorithms that appear in the literature. Furthermore, we show that BAIT (active learning based on V-optimal experimental design) can be derived from BDT and asymptotic approximations. A key challenge of such methods is the difficult scaling to large batch sizes, leading to either computational challenges (BatchBALD) or dramatic performance drops (top-$B$ selection). Here, using a particular formulation of the decision process, we derive Partial Batch Label Sampling (ParBaLS) for the EPIG algorithm. We show experimentally for several datasets that ParBaLS EPIG gives superior performance for a fixed budget and Bayesian Logistic Regression on Neural Embeddings. Our code is available at https://github.com/ADDAPT-ML/ParBaLS.

[855] TAWRMAC: A Novel Dynamic Graph Representation Learning Method

Soheila Farokhi, Xiaojun Qi, Hamid Karimi

Main category: cs.LG

TL;DR: TAWRMAC is a novel framework for dynamic graph representation learning that addresses embedding staleness, limited contextual awareness, and structural dynamics capture through Temporal Anonymous Walks with Restart, Memory Augmentation, and Neighbor Co-occurrence embedding.

Details

Motivation: Existing continuous-time methods face three key challenges: embedding staleness from over-reliance on node-specific memory, limited contextual awareness due to failure to capture neighborhood correlations, and inadequate capture of structural dynamics in evolving graphs.

Method: TAWRMAC integrates Temporal Anonymous Walks with Restart, Memory Augmentation, and Neighbor Co-occurrence embedding. It uses memory-augmented GNN with fixed-time encoding for stability, explicitly captures neighbor correlations for context, and distinguishes repetitive vs. new connections through Temporal Anonymous Walks with Restart.

Result: Extensive experiments show TAWRMAC consistently outperforms state-of-the-art methods in dynamic link prediction and node classification under both transductive and inductive settings across three different negative sampling strategies.

Conclusion: TAWRMAC provides stable, generalizable, and context-aware embeddings, advancing the state of the art in continuous-time dynamic graph learning.

Abstract: Dynamic graph representation learning has become essential for analyzing evolving networks in domains such as social network analysis, recommendation systems, and traffic analysis. However, existing continuous-time methods face three key challenges: (1) some methods depend solely on node-specific memory without effectively incorporating information from neighboring nodes, resulting in embedding staleness; (2) most fail to explicitly capture correlations between node neighborhoods, limiting contextual awareness; and (3) many fail to fully capture the structural dynamics of evolving graphs, especially in absence of rich link attributes. To address these limitations, we introduce TAWRMAC-a novel framework that integrates Temporal Anonymous Walks with Restart, Memory Augmentation, and Neighbor Co-occurrence embedding. TAWRMAC enhances embedding stability through a memory-augmented GNN with fixedtime encoding and improves contextual representation by explicitly capturing neighbor correlations. Additionally, its Temporal Anonymous Walks with Restart mechanism distinguishes between nodes exhibiting repetitive interactions and those forming new connections beyond their immediate neighborhood. This approach captures structural dynamics better and supports strong inductive learning. Extensive experiments on multiple benchmark datasets demonstrate that TAWRMAC consistently outperforms state-of-the-art methods in dynamic link prediction and node classification under both transductive and inductive settings across three different negative sampling strategies. By providing stable, generalizable, and context-aware embeddings, TAWRMAC advances the state of the art in continuous-time dynamic graph learning. The code is available at https://anonymous.4open.science/r/tawrmac-A253 .

[856] Understanding Robust Machine Learning for Nonparametric Regression with Heavy-Tailed Noise

Yunlong Feng, Qiang Wu

Main category: cs.LG

TL;DR: The paper develops a framework for robust nonparametric regression with heavy-tailed noise, focusing on prediction error rather than generalization error as the key performance metric. It addresses challenges of unbounded hypothesis spaces and weak moment assumptions through probabilistic effective hypothesis spaces and new comparison theorems.

Details

Motivation: To overcome limitations of standard robust learning analysis that rely on boundedness assumptions and strong moment conditions, which break down with heavy-tailed noise and unbounded functions. The paper aims to provide a more faithful analysis of robust regression performance.

Method: Uses Tikhonov-regularized risk minimization in RKHS with robust loss functions (Huber regression as example). Introduces probabilistic effective hypothesis spaces to handle unboundedness, establishes comparison theorems linking excess robust risk to L2 prediction error, and develops finite-sample bounds under weak (1+ε)-moment conditions.

Result: Derives explicit finite-sample error bounds and convergence rates for Huber regression in RKHS that work without uniform boundedness assumptions and under heavy-tailed noise. Provides principled tuning rules and extends analysis beyond Huber to other robust losses.

Conclusion: Prediction error (L2-distance to truth) rather than excess generalization risk should be the fundamental metric for analyzing robust learning. The framework enables meaningful bias-variance decomposition and clarifies the robustness-bias trade-off induced by scale parameters in robust losses.

Abstract: We investigate robust nonparametric regression in the presence of heavy-tailed noise, where the hypothesis class may contain unbounded functions and robustness is ensured via a robust loss function $\ell_\sigma$. Using Huber regression as a close-up example within Tikhonov-regularized risk minimization in reproducing kernel Hilbert spaces (RKHS), we address two central challenges: (i) the breakdown of standard concentration tools under weak moment assumptions, and (ii) the analytical difficulties introduced by unbounded hypothesis spaces. Our first message is conceptual: conventional generalization-error bounds for robust losses do not faithfully capture out-of-sample performance. We argue that learnability should instead be quantified through prediction error, namely the $L_2$-distance to the truth $f^\star$, which is $\sigma$-independent and directly reflects the target of robust estimation. To make this workable under unboundedness, we introduce a \emph{probabilistic effective hypothesis space} that confines the estimator with high probability and enables a meaningful bias–variance decomposition under weak $(1+\epsilon)$-moment conditions. Technically, we establish new comparison theorems linking the excess robust risk to the $L_2$ prediction error up to a residual of order $\mathcal{O}(\sigma^{-2\epsilon})$, clarifying the robustness–bias trade-off induced by the scale parameter $\sigma$. Building on this, we derive explicit finite-sample error bounds and convergence rates for Huber regression in RKHS that hold without uniform boundedness and under heavy-tailed noise. Our study delivers principled tuning rules, extends beyond Huber to other robust losses, and highlights prediction error, not excess generalization risk, as the fundamental lens for analyzing robust learning.

[857] Probabilistic bias adjustment of seasonal predictions of Arctic Sea Ice Concentration

Parsa Gooya, Reinel Sospedra-Alfonso

Main category: cs.LG

TL;DR: A probabilistic error correction framework using conditional Variational Autoencoder is introduced to improve seasonal Arctic sea ice concentration forecasts by generating large ensembles of adjusted forecasts with better calibration and smaller errors.

Details

Motivation: Seasonal Arctic sea ice forecasts from climate models have systematic biases and complex errors that grow over time, requiring bias correction. Current deterministic methods are limited to costly ensemble members and don't properly quantify uncertainty needed for decision-making, especially for extreme events.

Method: A probabilistic error correction framework based on conditional Variational Autoencoder (cVAE) that maps the conditional distribution of observations given biased model predictions, allowing generation of large ensembles of adjusted forecasts.

Result: The adjusted forecasts show better calibration, closer alignment with observational distribution, and smaller errors compared to climatological mean adjusted forecasts, as evaluated using deterministic and probabilistic metrics.

Conclusion: The probabilistic cVAE-based framework effectively improves seasonal Arctic sea ice concentration forecasts by generating large ensembles that better quantify uncertainty and reduce errors compared to traditional deterministic correction methods.

Abstract: Seasonal forecast of Arctic sea ice concentration is key to mitigate the negative impact and assess potential opportunities posed by the rapid decline of sea ice coverage. Seasonal prediction systems based on climate models often show systematic biases and complex spatio-temporal errors that grow with the forecasts. Consequently, operational predictions are routinely bias corrected and calibrated using retrospective forecasts. For predictions of Arctic sea ice concentration, error corrections are mainly based on one-to-one post-processing methods including climatological mean or linear regression correction and, more recently, machine learning. Such deterministic adjustments are confined at best to the limited number of costly-to-run ensemble members of the raw forecast. However, decision-making requires proper quantification of uncertainty and likelihood of events, particularly of extremes. We introduce a probabilistic error correction framework based on a conditional Variational Autoencoder model to map the conditional distribution of observations given the biased model prediction. This method naturally allows for generating large ensembles of adjusted forecasts. We evaluate our model using deterministic and probabilistic metrics and show that the adjusted forecasts are better calibrated, closer to the observational distribution, and have smaller errors than climatological mean adjusted forecasts.

[858] Chain-of-Influence: Tracing Interdependencies Across Time and Features in Clinical Predictive Modelings

Yubo Li, Rema Padman

Main category: cs.LG

TL;DR: Chain-of-Influence (CoI) is an interpretable deep learning framework that explicitly models time-varying feature interactions in clinical time-series data using multi-level attention to trace influence pathways.

Details

Motivation: Current approaches fail to explicitly model how clinical variables influence each other over time, relying on black-box mechanisms or simple aggregation that lack interpretability.

Method: CoI uses a two-level attention architecture: temporal attention identifies critical time points, and cross-feature attention models directed influences between features at these time points, constructing explicit time-unfolded graphs of feature interactions.

Result: CoI significantly outperforms existing methods in predictive accuracy on mortality and disease progression tasks using MIMIC-IV and chronic kidney disease datasets, while providing interpretable influence pathways.

Conclusion: The framework offers unprecedented transparency into temporal and cross-feature dependencies, enabling discovery of clinically meaningful patient-specific disease progression patterns that are opaque to other models.

Abstract: Modeling clinical time-series data is hampered by the challenge of capturing latent, time-varying dependencies among features. State-of-the-art approaches often rely on black-box mechanisms or simple aggregation, failing to explicitly model how the influence of one clinical variable propagates through others over time. We propose $\textbf{Chain-of-Influence (CoI)}$, an interpretable deep learning framework that constructs an explicit, time-unfolded graph of feature interactions. CoI leverages a multi-level attention architecture: first, a temporal attention layer identifies critical time points in a patient’s record; second, a cross-feature attention layer models the directed influence from features at these time points to subsequent features. This design enables the tracing of influence pathways, providing a granular audit trail that shows how any feature at any time contributes to the final prediction, both directly and through its influence on other variables. We evaluate CoI on mortality and disease progression tasks using the MIMIC-IV dataset and a private chronic kidney disease cohort. Our framework significantly outperforms existing methods in predictive accuracy. More importantly, through case studies, we show that CoI can uncover clinically meaningful, patient-specific patterns of disease progression that are opaque to other models, offering unprecedented transparency into the temporal and cross-feature dependencies that inform clinical decision-making.

[859] Learning Bug Context for PyTorch-to-JAX Translation with LLMs

Hung Phan, Son Le Vu, Ali Jannesari

Main category: cs.LG

TL;DR: T2J is a prompt-augmentation framework that improves LLM-based PyTorch to JAX translation by using curated error-fix datasets and structured guidance to enhance lightweight LLMs’ performance.

Details

Motivation: PyTorch to JAX translation is challenging due to differences in design, execution semantics, and limited parallel corpora, with existing evaluation methods being inadequate for cross-framework benchmarking.

Method: Three-step pipeline: (1) assemble PyTorch sources and generate initial JAX drafts with GPT-4o-mini, (2) iterative repair by developers to create fixed-bug dataset, (3) construct augmented prompts with structured guidance from fixes.

Result: T2J improves GPT-4o-mini performance by up to 10% on CodeBLEU, 50% on FixCost Score, 1.33 points on CodeTrans Score, and 100% on Comparison Score; generated code runs 2.5x faster than baseline.

Conclusion: The T2J framework effectively enhances LLM-based PyTorch to JAX translation through prompt augmentation and curated guidance, addressing the unique challenges of cross-framework code conversion.

Abstract: Despite recent progress of large language models (LLMs) on code translation among mainstream languages, translating PyTorch to JAX remains nontrivial. The two libraries, though both embedded in Python, differ in core design, execution semantics, and ecosystem maturity; JAX is newer and comparatively underrepresented in public code, and parallel PyTorch–JAX corpora are limited. Weaknesses in existing evaluation further complicate cross-framework benchmarking. We present T2J, a prompt-augmentation framework that strengthens LLM-based PyTorch to JAX translation. Our pipeline (i) assembles two PyTorch sources – the problem-solving set from TorchLeet (Aroori & Chien, 2025) and a GitHub-derived set from CodeParrot (Wolf et al., 2022) – and uses GPT-4o-mini to produce initial JAX drafts; (ii) engages two professional developers to iteratively repair those drafts until functional equivalence, yielding a curated fixed-bug dataset of common errors and patches; and (iii) constructs augmented prompts that inject structured guidance from these fixes to steer lightweight LLMs (e.g., GPT-4o-mini). We also introduce three metrics tailored to PyTorch to JAX: T2J CodeTrans Score, T2J FixCost Score (an LLM-based estimate of bug-fix effort), and T2J Comparison Score (LLM-as-judge). Empirically, T2J raises GPT-4o-mini performance by up to 10% on CodeBLEU, 50% on T2J FixCost Score, 1.33 points on T2J CodeTrans Score (0–4 scale), and 100% on T2J Comparison Score; moreover, the generated code runs up to 2.5x faster than the baseline.

[860] Stability of Transformers under Layer Normalization

Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Krishna Kumar, Markos A. Katsoulakis

Main category: cs.LG

TL;DR: A principled study of layer normalization placement in Transformers, analyzing forward and backward stability to understand training dynamics and provide guidance for architectural design.

Details

Motivation: Training deep Transformers can be unstable, and while layer normalization helps stability, its placement has been ad-hoc without principled understanding of how different placements affect training dynamics.

Method: Theoretical analysis of forward stability (hidden states growth bounds) and backward stability (gradient backpropagation) under different layer normalization placements, plus numerical validation.

Result: Derived explicit bounds on hidden state growth and analyzed how layer normalization affects gradient flow, providing insights into whether training leads to regular or pathological behaviors. Also developed scaling guidelines for residual steps.

Conclusion: The framework provides principled way to check Transformer stability under architectural modifications and offers guidance for future Transformer designs.

Abstract: Despite their widespread use, training deep Transformers can be unstable. Layer normalization, a standard component, improves training stability, but its placement has often been ad-hoc. In this paper, we conduct a principled study on the forward (hidden states) and backward (gradient) stability of Transformers under different layer normalization placements. Our theory provides key insights into the training dynamics: whether training drives Transformers toward regular solutions or pathological behaviors. For forward stability, we derive explicit bounds on the growth of hidden states in trained Transformers. For backward stability, we analyze how layer normalization affects the backpropagation of gradients, thereby explaining the training dynamics of each layer normalization placement. Our analysis also guides the scaling of residual steps in Transformer blocks, where appropriate choices can further improve stability and performance. Our numerical results corroborate our theoretical findings. Beyond these results, our framework provides a principled way to sanity-check the stability of Transformers under new architectural modifications, offering guidance for future designs.

[861] Augmenting generative models with biomedical knowledge graphs improves targeted drug discovery

Aditya Malusare, Vineet Punyamoorty, Vaneet Aggarwal

Main category: cs.LG

TL;DR: K-DREAM is a knowledge graph-augmented diffusion model that generates biologically relevant drug candidates by integrating structured biomedical knowledge into molecular generation.

Details

Motivation: Current generative models lack comprehensive biomedical knowledge integration, limiting their ability to produce therapeutically suitable molecules aligned with specific targets.

Method: Leverages knowledge graphs to augment diffusion-based generative models, embedding structured information to direct molecular generation toward biologically relevant candidates.

Result: Generates drug candidates with improved binding affinities and predicted efficacy, surpassing state-of-the-art models, and demonstrates flexibility for multi-target applications.

Conclusion: Knowledge-enhanced generative models like K-DREAM show significant utility in rational drug design and practical therapeutic development.

Abstract: Recent breakthroughs in generative modeling have demonstrated remarkable capabilities in molecular generation, yet the integration of comprehensive biomedical knowledge into these models has remained an untapped frontier. In this study, we introduce K-DREAM (Knowledge-Driven Embedding-Augmented Model), a novel framework that leverages knowledge graphs to augment diffusion-based generative models for drug discovery. By embedding structured information from large-scale knowledge graphs, K-DREAM directs molecular generation toward candidates with higher biological relevance and therapeutic suitability. This integration ensures that the generated molecules are aligned with specific therapeutic targets, moving beyond traditional heuristic-driven approaches. In targeted drug design tasks, K-DREAM generates drug candidates with improved binding affinities and predicted efficacy, surpassing current state-of-the-art generative models. It also demonstrates flexibility by producing molecules designed for multiple targets, enabling applications to complex disease mechanisms. These results highlight the utility of knowledge-enhanced generative models in rational drug design and their relevance to practical therapeutic development.

[862] Advancing Intoxication Detection: A Smartwatch-Based Approach

Manuel Segura, Pere Vergés, Richard Ky, Ramesh Arangott, Angela Kristine Garcia, Thang Dihn Trong, Makoto Hyodo, Alexandru Nicolau, Tony Givargis, Sergio Gago-Masague

Main category: cs.LG

TL;DR: A mobile smartwatch app using multiple sensors (TAC, accelerometer, gyroscope, heart rate) to detect intoxication levels, with HDC model showing best accuracy-efficiency balance for real-time interventions.

Details

Motivation: Excess alcohol consumption causes serious health risks and community consequences, requiring effective just-in-time intervention methods.

Method: Collected multi-sensor data (TAC, accelerometer, gyroscope, heart rate) over 3 weeks, evaluated state-of-the-art classifiers including Transformer, bi-LSTM, GRU, 1D-CNN, and HDC on smartwatch data.

Result: HDC model achieved the best balance between accuracy and efficiency, making it most practical for resource-constrained mobile hardware applications.

Conclusion: Smartwatch-based intoxication detection using HDC classifier is a viable approach for real-time intervention in mobile health applications.

Abstract: Excess alcohol consumption leads to serious health risks and severe consequences for both individuals and their communities. To advocate for healthier drinking habits, we introduce a groundbreaking mobile smartwatch application approach to just-in-time interventions for intoxication warnings. In this work, we have created a dataset gathering TAC, accelerometer, gyroscope, and heart rate data from the participants during a period of three weeks. This is the first study to combine accelerometer, gyroscope, and heart rate smartwatch data collected over an extended monitoring period to classify intoxication levels. Previous research had used limited smartphone motion data and conventional machine learning (ML) algorithms to classify heavy drinking episodes; in this work, we use smartwatch data and perform a thorough evaluation of different state-of-the-art classifiers such as the Transformer, Bidirectional Long Short-Term Memory (bi-LSTM), Gated Recurrent Unit (GRU), One-Dimensional Convolutional Neural Networks (1D-CNN), and Hyperdimensional Computing (HDC). We have compared performance metrics for the algorithms and assessed their efficiency on resource-constrained environments like mobile hardware. The HDC model achieved the best balance between accuracy and efficiency, demonstrating its practicality for smartwatch-based applications.

[863] AutoGD: Automatic Learning Rate Selection for Gradient Descent

Nikola Surjanovic, Alexandre Bouchard-Côté, Trevor Campbell

Main category: cs.LG

TL;DR: AutoGD is a gradient descent method that automatically adjusts learning rates without user tuning, achieving optimal convergence rates for broad function classes without needing smoothness constants.

Details

Motivation: Gradient-based optimization methods require significant user effort to tune learning rates, which becomes impractical when they appear as inner loops in other algorithms.

Method: AutoGD automatically determines whether to increase or decrease the learning rate at each iteration, with extensions to AutoBFGS and AutoLBFGS.

Result: The method achieves strong performance on traditional problems and variational inference tasks, recovering optimal GD rates (up to a constant) without smoothness constant knowledge.

Conclusion: AutoGD provides an effective automatic learning rate adjustment mechanism that eliminates the need for manual tuning while maintaining strong convergence properties.

Abstract: The performance of gradient-based optimization methods, such as standard gradient descent (GD), greatly depends on the choice of learning rate. However, it can require a non-trivial amount of user tuning effort to select an appropriate learning rate schedule. When such methods appear as inner loops of other algorithms, expecting the user to tune the learning rates may be impractical. To address this, we introduce AutoGD: a gradient descent method that automatically determines whether to increase or decrease the learning rate at a given iteration. We establish the convergence of AutoGD, and show that we can recover the optimal rate of GD (up to a constant) for a broad class of functions without knowledge of smoothness constants. Experiments on a variety of traditional problems and variational inference optimization tasks demonstrate strong performance of the method, along with its extensions to AutoBFGS and AutoLBFGS.

[864] MemPromptTSS: Persistent Prompt Memory for Iterative Multi-Granularity Time Series State Segmentation

Ching Chang, Ming-Chih Lo, Chiao-Tung Chan, Wen-Chih Peng, Tien-Fu Chen

Main category: cs.LG

TL;DR: MemPromptTSS introduces persistent prompt memory for multi-granularity time series segmentation, enabling prompts to influence predictions across the entire sequence rather than fading locally.

Details

Motivation: Existing prompting approaches for time series segmentation only operate within local contexts, causing prompt effects to quickly fade and fail to guide predictions across the entire sequence.

Method: Proposes MemPromptTSS framework with persistent prompt memory - a memory encoder transforms prompts and surrounding subsequences into memory tokens stored in a bank, allowing each prediction to condition on all accumulated prompts across iterations.

Result: Achieves 23% and 85% accuracy improvements over best baseline in single- and multi-granularity segmentation under single iteration inference, and provides stronger refinement in iterative inference with average per-iteration gains of 2.66 percentage points vs 1.19 for PromptTSS.

Conclusion: Persistent memory is crucial for prompt-guided segmentation, establishing MemPromptTSS as a practical and effective framework for real-world applications in wearable sensing and industrial monitoring.

Abstract: Web platforms, mobile applications, and connected sensing systems generate multivariate time series with states at multiple levels of granularity, from coarse regimes to fine-grained events. Effective segmentation in these settings requires integrating across granularities while supporting iterative refinement through sparse prompt signals, which provide a compact mechanism for injecting domain knowledge. Yet existing prompting approaches for time series segmentation operate only within local contexts, so the effect of a prompt quickly fades and cannot guide predictions across the entire sequence. To overcome this limitation, we propose MemPromptTSS, a framework for iterative multi-granularity segmentation that introduces persistent prompt memory. A memory encoder transforms prompts and their surrounding subsequences into memory tokens stored in a bank. This persistent memory enables each new prediction to condition not only on local cues but also on all prompts accumulated across iterations, ensuring their influence persists across the entire sequence. Experiments on six datasets covering wearable sensing and industrial monitoring show that MemPromptTSS achieves 23% and 85% accuracy improvements over the best baseline in single- and multi-granularity segmentation under single iteration inference, and provides stronger refinement in iterative inference with average per-iteration gains of 2.66 percentage points compared to 1.19 for PromptTSS. These results highlight the importance of persistent memory for prompt-guided segmentation, establishing MemPromptTSS as a practical and effective framework for real-world applications.

[865] Conformal Sparsification for Bandwidth-Efficient Edge-Cloud Speculative Decoding

Payel Bhattacharjee, Fengwei Tian, Meiyu Zhong, Guangyi Zhang, Osvaldo Simeone, Ravi Tandon

Main category: cs.LG

TL;DR: Edge-cloud speculative decoding accelerates LLM inference by having a cloud LLM verify draft tokens from an edge SLM. The proposed SQS-SD framework addresses bandwidth limitations through structured sparsification and lattice-based quantization to compress draft token distributions efficiently.

Details

Motivation: The limited bandwidth of edge-cloud links creates a bottleneck for speculative decoding, necessitating efficient compression of draft token distributions to maintain performance while reducing communication overhead.

Method: Proposed Sparse Quantize-and-Sample SD (SQS-SD) framework that exploits distributional sparsity through structured sparsification and lattice-based quantization. Two variants: K-SQS (fixed top-K truncation) and C-SQS (adaptive token retention via online conformal prediction).

Result: Both K-SQS and C-SQS approaches improve end-to-end latency and rejection rates in complementary operating regimes, with empirical results confirming their effectiveness.

Conclusion: The SQS-SD framework successfully addresses bandwidth limitations in edge-cloud speculative decoding through principled compression techniques that maintain performance while reducing communication requirements.

Abstract: Edge-cloud speculative decoding (SD) accelerates inference by having a cloud-based large language model (LLM) that verifies draft tokens generated by a resource-constrained small language model (SLM) at the edge. A central bottleneck is the limited bandwidth of the edge-cloud link, which necessitates efficient compression of draft token distributions. We first derive an information-theoretic bound that decomposes the token rejection rate into contributions from SLM-LLM distribution mismatch and from quantization distortion. Guided by this analysis, we propose the Sparse Quantize-and-Sample SD (SQS-SD) framework, which exploits distributional sparsity through structured sparsification and lattice-based quantization. Within this framework, K-SQS applies fixed top-K truncation, while C-SQS adaptively adjusts the retained token set via online conformal prediction to ensure bounded deviation from the dense distribution. Empirical results confirm that both approaches improve end-to-end latency and rejection rates in complimentary operating regimes.

[866] Clustering Result Re-guided Incomplete Multi-view Spectral Clustering

Jun Yin, Runcheng Cai, Shiliang Sun

Main category: cs.LG

TL;DR: CRG_IMSC is a novel incomplete multi-view spectral clustering method that directly obtains clustering results through nonnegative constraints and uses clustering connectivity to guide representation learning.

Details

Motivation: Existing incomplete multi-view spectral clustering methods require separate K-means after feature extraction and fail to effectively utilize sample connectivity information from clustering results.

Method: Imposes nonnegative constraint on extracted features to directly obtain clustering results, constructs connectivity matrix from spectral clustering results, and minimizes self-representation residual based on connectivity matrix using multiplicative update algorithm.

Result: Outperforms state-of-the-art clustering methods on benchmark datasets and demonstrates algorithm convergence through experiments.

Conclusion: CRG_IMSC effectively addresses limitations of existing methods by integrating clustering results directly into the optimization process and leveraging sample connectivity information.

Abstract: Incomplete multi-view spectral clustering generalizes spectral clustering to multi-view data and simultaneously realizes the partition of multi-view data with missing views. For this category of method, K-means algorithm needs to be performed to generate the clustering result after the procedure of feature extraction. More importantly, the connectivity of samples reflected by the clustering result is not utilized effectively. To overcome these defects, we propose Clustering Result re-Guided Incomplete Multi-view Spectral Clustering (CRG_IMSC). CRG_IMSC obtains the clustering result directly by imposing nonnegative constraint to the extracted feature. Furthermore, it constructs the connectivity matrix according to the result of spectral clustering, and minimizes the residual of self-representation based on the connectivity matrix. A novel iterative algorithm using multiplicative update is developed to solve the optimization problem of CRG_IMSC, and its convergence is proved rigorously. On benchmark datasets, for multi-view data, CRG_IMSC performs better than state-of-the-art clustering methods, and the experimental results also demonstrate the convergence of CRG_IMSC algorithm.

[867] Homomorphic Mappings for Value-Preserving State Aggregation in Markov Decision Processes

Shuo Zhao, Yongqiang Li, Yu Feng, Zhongsheng Hou, Yuanjing Feng

Main category: cs.LG

TL;DR: This paper introduces a state aggregation framework using homomorphism to reduce MDP complexity while preserving optimal policies, with theoretical guarantees and practical algorithms.

Details

Motivation: To reduce computational complexity of solving Markov Decision Processes while maintaining optimal performance through state aggregation.

Method: Proposed homomorphism-based abstraction framework, developed Homomorphic Policy Gradient (HPG) and Error-Bounded HPG (EBHPG) algorithms with theoretical guarantees.

Result: Established sufficient conditions for optimal policy equivalence, derived error bounds, and validated performance through experiments comparing against seven algorithms.

Conclusion: The homomorphism framework provides effective state aggregation with theoretical guarantees, balancing computational efficiency and performance preservation in MDPs.

Abstract: State aggregation aims to reduce the computational complexity of solving Markov Decision Processes (MDPs) while preserving the performance of the original system. A fundamental challenge lies in optimizing policies within the aggregated, or abstract, space such that the performance remains optimal in the ground MDP-a property referred to as {"}optimal policy equivalence {"}. This paper presents an abstraction framework based on the notion of homomorphism, in which two Markov chains are deemed homomorphic if their value functions exhibit a linear relationship. Within this theoretical framework, we establish a sufficient condition for the equivalence of optimal policy. We further examine scenarios where the sufficient condition is not met and derive an upper bound on the approximation error and a performance lower bound for the objective function under the ground MDP. We propose Homomorphic Policy Gradient (HPG), which guarantees optimal policy equivalence under sufficient conditions, and its extension, Error-Bounded HPG (EBHPG), which balances computational efficiency and the performance loss induced by aggregation. In the experiments, we validated the theoretical results and conducted comparative evaluations against seven algorithms.

[868] Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models

Mingyang Lyu, Yinqian Sun, Erliang Lin, Huangrui Li, Ruolin Chen, Feifei Zhao, Yi Zeng

Main category: cs.LG

TL;DR: FPO enables online reinforcement learning fine-tuning for Vision-Language-Action models by reformulating importance sampling using flow-matching objectives, achieving stable performance improvements over imitation learning baselines.

Details

Motivation: Existing Vision-Language-Action models are constrained by supervised data quality and coverage, and conventional RL methods are computationally infeasible for flow-matching based models due to intractable importance sampling.

Method: Proposes Flow Policy Optimization (FPO) with per-sample flow-matching objective reformulation, structure-aware credit assignment, clipped surrogate objectives, multi-step latent exploration, and Q-ensemble mechanism.

Result: FPO achieves consistent improvements over supervised, preference-aligned, and other baselines on LIBERO benchmark and ALOHA simulation, with stable learning under sparse rewards.

Conclusion: FPO provides an effective framework for online RL fine-tuning of flow-matching VLAs, with validated computational modules and stable convergence during training.

Abstract: Vision-Language-Action (VLA) models such as OpenVLA, Octo, and $\pi_0$ have shown strong generalization by leveraging large-scale demonstrations, yet their performance is still fundamentally constrained by the quality and coverage of supervised data. Reinforcement learning (RL) provides a promising path for improving and fine-tuning VLAs through online interaction. However, conventional policy gradient methods are computationally infeasible in the context of flow-matching based models due to the intractability of the importance sampling process, which requires explicit computation of policy ratios. To overcome this limitation, we propose Flow Policy Optimization (FPO) algorithm, which reformulates importance sampling by leveraging per-sample changes in the conditional flow-matching objective. Furthermore, FPO achieves stable and scalable online reinforcement fine-tuning of the $\pi_0$ model by integrating structure-aware credit assignment to enhance gradient efficiency, clipped surrogate objectives to stabilize optimization, multi-step latent exploration to encourage diverse policy updates, and a Q-ensemble mechanism to provide robust value estimation. We evaluate FPO on the LIBERO benchmark and the ALOHA simulation task against supervised, preference-aligned, diffusion-based, autoregressive online RL, and $\pi_0$-FAST baselines, observing consistent improvements over the imitation prior and strong alternatives with stable learning under sparse rewards. In addition, ablation studies and analyses of the latent space dynamics further highlight the contributions of individual components within FPO, validating the effectiveness of the proposed computational modules and the stable convergence of the conditional flow-matching objective during online RL.

[869] An Unsupervised Time Series Anomaly Detection Approach for Efficient Online Process Monitoring of Additive Manufacturing

Frida Cantu, Salomon Ibarra, Arturo Gonzales, Jesus Barreda, Chenang Liu, Li Zhang

Main category: cs.LG

TL;DR: Proposes an unsupervised matrix profile-based algorithm for detecting subtle semantic anomalies in additive manufacturing sensor data, focusing on fabrication cycle similarity and precise onset identification.

Details

Motivation: Online sensing is crucial for manufacturing but existing approaches either need labeled data or only detect extreme outliers, failing to identify subtle semantic anomalies that indicate new regimes or unexpected routines.

Method: Matrix profile-based unsupervised anomaly detection algorithm that captures fabrication cycle similarity and performs semantic segmentation to identify defect onset.

Result: The method effectively identifies the onset of defect anomalies in additive manufacturing, as demonstrated by experiments on real-world sensor data.

Conclusion: The proposed unsupervised approach successfully addresses the challenge of detecting subtle semantic anomalies in manufacturing sensor data without requiring labeled training data.

Abstract: Online sensing plays an important role in advancing modern manufacturing. The real-time sensor signals, which can be stored as high-resolution time series data, contain rich information about the operation status. One of its popular usages is online process monitoring, which can be achieved by effective anomaly detection from the sensor signals. However, most existing approaches either heavily rely on labeled data for training supervised models, or are designed to detect only extreme outliers, thus are ineffective at identifying subtle semantic off-track anomalies to capture where new regimes or unexpected routines start. To address this challenge, we propose an matrix profile-based unsupervised anomaly detection algorithm that captures fabrication cycle similarity and performs semantic segmentation to precisely identify the onset of defect anomalies in additive manufacturing. The effectiveness of the proposed method is demonstrated by the experiments on real-world sensor data.

[870] Learning Joint Embeddings of Function and Process Call Graphs for Malware Detection

Kartikeya Aneja, Nagender Aneja, Murat Kantarcioglu

Main category: cs.LG

TL;DR: GeminiNet is a unified neural network that learns joint embeddings from both function call graphs and process interaction graphs, outperforming single-graph models for software analysis.

Details

Motivation: Current graph neural network approaches for software analysis focus on single graph representations, leaving joint modeling of function call graphs and process interaction graphs underexplored despite their complementary insights.

Method: Proposed GeminiNet with dual graph convolutional branches and adaptive gating mechanism to balance static (FCG) and dynamic (PCG) views. Constructed dataset of 635 Windows executables using Ghidra for FCGs and Any.Run for PCGs.

Result: Joint embeddings from both graph types outperform single-graph models, demonstrating the value of multi-perspective analysis.

Conclusion: Joint modeling of function call graphs and process interaction graphs provides superior software analysis compared to single-graph approaches, enabling deeper multi-perspective insights.

Abstract: Software systems can be represented as graphs, capturing dependencies among functions and processes. An interesting aspect of software systems is that they can be represented as different types of graphs, depending on the extraction goals and priorities. For example, function calls within the software can be captured to create function call graphs, which highlight the relationships between functions and their dependencies. Alternatively, the processes spawned by the software can be modeled to generate process interaction graphs, which focus on runtime behavior and inter-process communication. While these graph representations are related, each captures a distinct perspective of the system, providing complementary insights into its structure and operation. While previous studies have leveraged graph neural networks (GNNs) to analyze software behaviors, most of this work has focused on a single type of graph representation. The joint modeling of both function call graphs and process interaction graphs remains largely underexplored, leaving opportunities for deeper, multi-perspective analysis of software systems. This paper presents a pipeline for constructing and training Function Call Graphs (FCGs) and Process Call Graphs (PCGs) and learning joint embeddings. We demonstrate that joint embeddings outperform a single-graph model. In this paper, we propose GeminiNet, a unified neural network approach that learns joint embeddings from both FCGs and PCGs. We construct a new dataset of 635 Windows executables (318 malicious and 317 benign), extracting FCGs via Ghidra and PCGs via Any.Run sandbox. GeminiNet employs dual graph convolutional branches with an adaptive gating mechanism that balances contributions from static and dynamic views.

[871] Tight Robustness Certificates and Wasserstein Distributional Attacks for Deep Neural Networks

Bach C. Le, Tung V. Dao, Binh T. Nguyen, Hong T. M. Chu

Main category: cs.LG

TL;DR: A new primal approach for Wasserstein distributionally robust optimization (WDRO) that uses exact Lipschitz certificates to tighten upper bounds and introduces Wasserstein distributional attack (WDA) for constructing worst-case distributions, achieving competitive robust accuracy with tighter certificates.

Details

Motivation: Existing WDRO methods based on global Lipschitz continuity or strong duality often yield loose upper bounds or require prohibitive computation, limiting their practical effectiveness for adversarial robustness.

Method: Proposes a primal approach with exact Lipschitz certificates and introduces Wasserstein distributional attack (WDA) that directly constructs worst-case distributions. Leverages piecewise-affine structure of ReLU networks on activation cells for exact tractable characterization of WDRO problems.

Result: Extensive evaluations show the method achieves competitive robust accuracy against state-of-the-art baselines while offering tighter certificates than existing methods.

Conclusion: The proposed approach provides a more flexible and computationally efficient framework for WDRO with tighter robustness certificates, addressing limitations of existing methods.

Abstract: Wasserstein distributionally robust optimization (WDRO) provides a framework for adversarial robustness, yet existing methods based on global Lipschitz continuity or strong duality often yield loose upper bounds or require prohibitive computation. In this work, we address these limitations by introducing a primal approach and adopting a notion of exact Lipschitz certificate to tighten this upper bound of WDRO. In addition, we propose a novel Wasserstein distributional attack (WDA) that directly constructs a candidate for the worst-case distribution. Compared to existing point-wise attack and its variants, our WDA offers greater flexibility in the number and location of attack points. In particular, by leveraging the piecewise-affine structure of ReLU networks on their activation cells, our approach results in an exact tractable characterization of the corresponding WDRO problem. Extensive evaluations demonstrate that our method achieves competitive robust accuracy against state-of-the-art baselines while offering tighter certificates than existing methods. Our code is available at https://github.com/OLab-Repo/WDA

[872] Bidirectional Time-Frequency Pyramid Network for Enhanced Robust EEG Classification

Jiahui Hong, Siqing Li, Muqing Jian, Luming Yang

Main category: cs.LG

TL;DR: BITE is a bidirectional time-frequency pyramid network that achieves state-of-the-art EEG recognition performance across multiple paradigms with strong cross-subject generalization capabilities.

Details

Motivation: Existing EEG recognition models suffer from poor cross-paradigm generalization due to dataset-specific constraints and individual variability.

Method: End-to-end unified architecture with aligned time-frequency streams, pyramid time-frequency attention (PTFA), and bidirectional adaptive convolutions (BiTCN) for forward/backward neural dynamics.

Result: Achieves state-of-the-art performance across four divergent paradigms (BCICIV-2A/2B, HGD, SD-SSVEP), excelling in both within-subject accuracy and cross-subject generalization with exceptional computational efficiency.

Conclusion: Paradigm-aligned spectral-temporal processing is essential for reliable BCI systems, and BITE serves as a unified architecture combining robust performance across both MI and SSVEP tasks.

Abstract: Existing EEG recognition models suffer from poor cross-paradigm generalization due to dataset-specific constraints and individual variability. To overcome these limitations, we propose BITE (Bidirectional Time-Freq Pyramid Network), an end-to-end unified architecture featuring robust multistream synergy, pyramid time-frequency attention (PTFA), and bidirectional adaptive convolutions. The framework uniquely integrates: 1) Aligned time-frequency streams maintaining temporal synchronization with STFT for bidirectional modeling, 2) PTFA-based multi-scale feature enhancement amplifying critical neural patterns, 3) BiTCN with learnable fusion capturing forward/backward neural dynamics. Demonstrating enhanced robustness, BITE achieves state-of-the-art performance across four divergent paradigms (BCICIV-2A/2B, HGD, SD-SSVEP), excelling in both within-subject accuracy and cross-subject generalization. As a unified architecture, it combines robust performance across both MI and SSVEP tasks with exceptional computational efficiency. Our work validates that paradigm-aligned spectral-temporal processing is essential for reliable BCI systems. Just as its name suggests, BITE “takes a bite out of EEG.” The source code is available at https://github.com/cindy-hong/BiteEEG.

[873] Skill-Targeted Adaptive Training

Yinghui He, Abhishek Panigrahi, Yong Lin, Sanjeev Arora

Main category: cs.LG

TL;DR: STAT is a fine-tuning strategy that uses a stronger LLM as teacher to identify missing skills in student models and adaptively reweight or synthesize training examples to address skill gaps, achieving significant improvements over vanilla SFT.

Details

Motivation: Language models show saturation when fine-tuned on data similar to their training set, with limited improvement from vanilla supervised fine-tuning (SFT) on tasks like MATH.

Method: A stronger LLM teacher identifies required skills for a task and creates a Missing-Skill-Profile for the student. Two approaches: STAT-Sel reweights existing examples based on missing skills, STAT-Syn synthesizes new examples targeting missing skills.

Result: Up to 7.5% improvement on MATH vs limited SFT gains, 4.6% average improvement on out-of-distribution benchmarks. STAT complements RL methods like GRPO, providing additional gains after skill gaps are addressed.

Conclusion: Skill-targeted adaptive training should broadly improve current training pipelines by addressing specific skill deficiencies in language models.

Abstract: Language models often show little to no improvement (i.e., “saturation”) when trained via vanilla supervised fine-tuning (SFT) on data similar to what they saw in their training set (e.g., MATH). We introduce a new fine-tuning strategy, STAT, to train such a student model by using the metacognition ability of a stronger large language model (LLM) as the teacher. The teacher uses the task dataset to create a list of skills needed for the task, and then labels each data point with its required skills (Didolkar et al., 2024). By monitoring the student’s answers, the teacher creates a Missing-Skill-Profile for the student, tracking how often they failed to apply each skill in their responses. We use this idea to build a modified training set in one of two ways. In STAT-Sel, the teacher uses an existing set of training examples but adaptively reweights them according to the Missing-Skill-Profile. In STAT-Syn, the teacher synthesizes additional examples involving missing skills. Across extensive experiments on Llama and Qwen models, our methods yield improvements of up to 7.5% on MATH, whereas SFT provides only limited gains. Furthermore, STAT enhances performance on out-of-distribution benchmarks (e.g., AIME24/25, AMC23, etc.) by an average of 4.6%. Crucially, we find that STAT is complementary to RL via GRPO (Shao et al., 2024): after the model is improved using STAT to address skill gaps, GRPO continues to add further gains. We conclude that skill-targeted adaptive training should broadly improve current training pipelines. Our code is available at: https://github.com/princeton-pli/STAT.

[874] Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization

Yang Li, Ruichen Zhang, Yinqiu Liu, Guangyuan Liu, Dusit Niyato, Abbas Jamalipour, Xianbin Wang, Dong In Kim

Main category: cs.LG

TL;DR: This paper proposes a hierarchical optimization framework for UAV-enabled Low-Altitude Economy Networks that jointly optimizes task latency, power consumption, and inference accuracy through alternating resource allocation and LLM-augmented reinforcement learning for trajectory optimization.

Details

Motivation: The rapid development of Low-Altitude Economy Networks enables applications like aerial surveillance and semantic data collection, but ensuring both inference accuracy and communication efficiency remains challenging due to limited UAV onboard resources and dynamic network conditions.

Method: Proposes a hierarchical optimization framework with two components: (1) ARPO algorithm for resource allocation under accuracy constraints, and (2) LLaRA approach using LLM-augmented reinforcement learning for adaptive UAV trajectory optimization where LLM refines reward design offline.

Result: Numerical results demonstrate the framework’s efficacy in improving inference performance and communication efficiency under dynamic LAENet conditions.

Conclusion: The proposed hierarchical optimization framework effectively addresses the joint optimization of task latency, power consumption, and accuracy constraints in UAV-enabled LAENets through innovative combination of resource allocation and LLM-enhanced reinforcement learning.

Abstract: The rapid advancement of Low-Altitude Economy Networks (LAENets) has enabled a variety of applications, including aerial surveillance, environmental sensing, and semantic data collection. To support these scenarios, unmanned aerial vehicles (UAVs) equipped with onboard vision-language models (VLMs) offer a promising solution for real-time multimodal inference. However, ensuring both inference accuracy and communication efficiency remains a significant challenge due to limited onboard resources and dynamic network conditions. In this paper, we first propose a UAV-enabled LAENet system model that jointly captures UAV mobility, user-UAV communication, and the onboard visual question answering (VQA) pipeline. Based on this model, we formulate a mixed-integer non-convex optimization problem to minimize task latency and power consumption under user-specific accuracy constraints. To solve the problem, we design a hierarchical optimization framework composed of two parts: (i) an Alternating Resolution and Power Optimization (ARPO) algorithm for resource allocation under accuracy constraints, and (ii) a Large Language Model-augmented Reinforcement Learning Approach (LLaRA) for adaptive UAV trajectory optimization. The large language model (LLM) serves as an expert in refining reward design of reinforcement learning in an offline fashion, introducing no additional latency in real-time decision-making. Numerical results demonstrate the efficacy of our proposed framework in improving inference performance and communication efficiency under dynamic LAENet conditions.

[875] Experience-Efficient Model-Free Deep Reinforcement Learning Using Pre-Training

Ruoxing Yang

Main category: cs.LG

TL;DR: PPOPT is a model-free deep reinforcement learning algorithm that uses pretrained neural network components to improve training efficiency and stability on small samples in physics-based environments.

Details

Motivation: Traditional RL agents require large environment interaction samples which can be computationally expensive, especially in complex physics-based environments. The goal is to achieve efficient learning with minimal samples.

Method: Uses a novel policy neural network architecture with a pretrained middle section (trained on a different environment with similar physics) sandwiched between two fully-connected networks, enabling transfer of physics knowledge.

Result: PPOPT outperforms baseline PPO on small training samples in both rewards and training stability. While it underperforms model-based methods like DYNA DDPG, it trains significantly faster due to its model-free nature.

Conclusion: PPOPT provides an efficient model-free alternative that leverages pretraining for improved sample efficiency in physics-based environments, with open-source implementation available.

Abstract: We introduce PPOPT - Proximal Policy Optimization using Pretraining, a novel, model-free deep-reinforcement-learning algorithm that leverages pretraining to achieve high training efficiency and stability on very small training samples in physics-based environments. Reinforcement learning agents typically rely on large samples of environment interactions to learn a policy. However, frequent interactions with a (computer-simulated) environment may incur high computational costs, especially when the environment is complex. Our main innovation is a new policy neural network architecture that consists of a pretrained neural network middle section sandwiched between two fully-connected networks. Pretraining part of the network on a different environment with similar physics will help the agent learn the target environment with high efficiency because it will leverage a general understanding of the transferrable physics characteristics from the pretraining environment. We demonstrate that PPOPT outperforms baseline classic PPO on small training samples both in terms of rewards gained and general training stability. While PPOPT underperforms against classic model-based methods such as DYNA DDPG, the model-free nature of PPOPT allows it to train in significantly less time than its model-based counterparts. Finally, we present our implementation of PPOPT as open-source software, available at github.com/Davidrxyang/PPOPT.

[876] FOSSIL: Regret-Minimizing Curriculum Learning for Metadata-Free and Low-Data Mpox Diagnosis

Sahng-Min Han, Minjae Kim, Jinho Cha, Se-woon Choe, Eunchan Daniel Cha, Jungwon Choi, Kyudong Jung

Main category: cs.LG

TL;DR: FOSSIL is a regret-minimizing weighting framework that adaptively balances training emphasis based on sample difficulty using softmax-based uncertainty, achieving superior performance in Mpox skin lesion diagnosis under data scarcity.

Details

Motivation: Deep learning in small and imbalanced biomedical datasets suffers from unstable optimization and poor generalization, requiring solutions that work effectively under data scarcity.

Method: FOSSIL uses softmax-based uncertainty as a continuous measure of difficulty to construct a four-stage curriculum (Easy-Very Hard) and integrates this framework into both convolutional and transformer-based architectures.

Result: FOSSIL substantially improves discrimination (AUC = 0.9573), calibration (ECE = 0.053), and robustness under real-world perturbations, outperforming conventional baselines without requiring metadata, manual curation, or synthetic augmentation.

Conclusion: FOSSIL is positioned as a generalizable, data-efficient, and interpretable framework for difficulty-aware learning in medical imaging under data scarcity.

Abstract: Deep learning in small and imbalanced biomedical datasets remains fundamentally constrained by unstable optimization and poor generalization. We present the first biomedical implementation of FOSSIL (Flexible Optimization via Sample-Sensitive Importance Learning), a regret-minimizing weighting framework that adaptively balances training emphasis according to sample difficulty. Using softmax-based uncertainty as a continuous measure of difficulty, we construct a four-stage curriculum (Easy-Very Hard) and integrate FOSSIL into both convolutional and transformer-based architectures for Mpox skin lesion diagnosis. Across all settings, FOSSIL substantially improves discrimination (AUC = 0.9573), calibration (ECE = 0.053), and robustness under real-world perturbations, outperforming conventional baselines without metadata, manual curation, or synthetic augmentation. The results position FOSSIL as a generalizable, data-efficient, and interpretable framework for difficulty-aware learning in medical imaging under data scarcity.

[877] One4Many-StablePacker: An Efficient Deep Reinforcement Learning Framework for the 3D Bin Packing Problem

Lei Gao, Shihong Huang, Shengjie Wang, Hong Ma, Feng Zhang, Hengda Bao, Qichang Chen, Weihua Zhou

Main category: cs.LG

TL;DR: O4M-SP is a deep reinforcement learning framework for 3D bin packing that handles various bin dimensions in one training while incorporating practical stability constraints like support and weight.

Details

Motivation: Existing learning-based approaches neglect practical stability constraints and have poor generalization across different bin dimensions in 3D bin packing problems.

Method: Uses weighted reward function combining loading rate and height difference metric, plus clipped policy gradient with policy drifting to prevent entropy collapse and encourage exploration.

Result: Extensive experiments show O4M-SP generalizes well across diverse bin dimensions and significantly outperforms baseline methods while effectively handling stability constraints.

Conclusion: O4M-SP demonstrates strong practical applicability for 3D bin packing with stability constraints and achieves superior performance across various bin dimensions.

Abstract: The three-dimensional bin packing problem (3D-BPP) is widely applied in logistics and warehousing. Existing learning-based approaches often neglect practical stability-related constraints and exhibit limitations in generalizing across diverse bin dimensions. To address these limitations, we propose a novel deep reinforcement learning framework, One4Many-StablePacker (O4M-SP). The primary advantage of O4M-SP is its ability to handle various bin dimensions in a single training process while incorporating support and weight constraints common in practice. Our training method introduces two innovative mechanisms. First, it employs a weighted reward function that integrates loading rate and a new height difference metric for packing layouts, promoting improved bin utilization through flatter packing configurations. Second, it combines clipped policy gradient optimization with a tailored policy drifting method to mitigate policy entropy collapse, encouraging exploration at critical decision nodes during packing to avoid suboptimal solutions. Extensive experiments demonstrate that O4M-SP generalizes successfully across diverse bin dimensions and significantly outperforms baseline methods. Furthermore, O4M-SP exhibits strong practical applicability by effectively addressing packing scenarios with stability constraints.

[878] Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling

Hehe Fan, Yi Yang, Mohan Kankanhalli, Fei Wu

Main category: cs.LG

TL;DR: Translution is a new operation that combines self-attention’s adaptive element selection with convolution’s relative encoding, addressing limitations of both methods. A lightweight variant called α-Translution is also proposed to handle computational constraints.

Details

Motivation: Self-attention can adaptively identify relevant elements but relies on absolute positional embedding, while convolution encodes elements relatively but has fixed kernel size limitations. The paper aims to unify both advantages.

Method: Proposes Translution operation that integrates adaptive identification capability of self-attention with relative encoding advantage of convolution. Also introduces α-Translution as a lightweight variant to address parameter explosion.

Result: Experiments on computer vision and NLP tasks show Translution (including α-Translution) achieves superior accuracy compared to self-attention.

Conclusion: Translution successfully unifies the strengths of self-attention and convolution, providing better performance while addressing computational constraints through its lightweight variant.

Abstract: When modeling a given type of data, we consider it to involve two key aspects: 1) identifying relevant elements (e.g., image pixels or textual words) to a central element, as in a convolutional receptive field, or to a query element, as in self-attention, and 2) encoding these tokens effectively. Self-attention can adaptively identify these elements but relies on absolute positional embedding for structural representation learning. In contrast, convolution encodes elements in a relative manner, yet their fixed kernel size limits their ability to adaptively select the relevant elements. In this paper, we introduce Translution, an operation that unifies the adaptive identification capability of self-attention and the relative encoding advantage of convolution. However, this integration leads to a substantial increase in the number of parameters, exceeding most currently available computational resources. Therefore, we propose a lightweight variant of Translution, named {\alpha}-Translution. Experiments on computer vision and natural language processing tasks show that Translution (including {\alpha}-Translution) achieves superior accuracy compared to self-attention. The code is available at https://github.com/hehefan/Translution.

[879] ADEPT: Continual Pretraining via Adaptive Expansion and Dynamic Decoupled Tuning

Jinyang Zhang, Yue Fang, Hongxin Ding, Weibin Liao, Muyang Ye, Xu Chu, Junfeng Zhao, Yasha Wang

Main category: cs.LG

TL;DR: ADEPT is a two-stage framework for continual pretraining that uses selective layer expansion and adaptive unit-wise decoupled tuning to prevent catastrophic forgetting while efficiently adapting LLMs to new domains.

Details

Motivation: Conventional continual pretraining suffers from catastrophic forgetting and limited domain capacity, with existing layer expansion methods still entangling general and domain learning.

Method: Two-stage approach: 1) General-Competence Guided Selective Layer Expansion duplicates least critical general-domain layers; 2) Adaptive Unit-Wise Decoupled Tuning disentangles parameter units with asymmetric learning rates based on general-domain importance.

Result: Outperforms full-parameter CPT by up to 5.76% on general domain and 5.58% on target domain with only 15% of parameters tuned and less than 50% training time on mathematical and medical benchmarks.

Conclusion: Targeted expansion and decoupled optimization are essential principles for efficient and robust domain-adaptive continual pretraining, addressing functional specialization in LLMs.

Abstract: Conventional continual pretraining (CPT) for large language model (LLM) domain adaptation often suffers from catastrophic forgetting and limited domain capacity. Existing strategies adopt layer expansion, introducing additional trainable parameters to accommodate new knowledge. However, the uniform expansion and updates still entangle general and domain learning, undermining its effectiveness. Our pilot studies reveal that LLMs exhibit functional specialization, where layers and units differentially encode general-critical capabilities, suggesting that parameter expansion and optimization should be function-aware. We then propose ADEPT, Adaptive Expansion and Dynamic Decoupled Tuning for continual pretraining, a two-stage framework for domain-adaptive CPT. ADEPT first performs General-Competence Guided Selective Layer Expansion, duplicating layers least critical for the general domain to increase representational capacity while minimizing interference with general knowledge. It then applies Adaptive Unit-Wise Decoupled Tuning, disentangling parameter units within expanded layers according to their general-domain importance and assigning asymmetric learning rates to balance knowledge injection and retention. Experiments on mathematical and medical benchmarks show that ADEPT outperforms full-parameter CPT by up to 5.76% on the general domain and 5.58% on the target domain with only 15% of parameters tuned and less than 50% training time. Ablation studies, theoretical analysis, and extended investigations further demonstrate the necessity of targeted expansion and decoupled optimization, providing new principles for efficient and robust domain-adaptive CPT. Our code is open-sourced at https://github.com/PuppyKnightUniversity/ADEPT

[880] Gradient-based Model Shortcut Detection for Time Series Classification

Salomon Ibarra, Frida Cantu, Kaixiong Zhou, Li Zhang

Main category: cs.LG

TL;DR: This paper investigates point-based shortcut learning behavior in deep learning time series classification and proposes a detection method that doesn’t require test data or clean training classes.

Details

Motivation: Deep neural networks in time series classification have been shown to rely on spurious correlations, but shortcut behavior in time series remains under-explored. Most existing work focuses on external attributes rather than internal bias behavior.

Method: The authors propose a simple detection method based on other classes to detect shortcut occurrences without relying on test data or clean training classes.

Result: The proposed method was tested on UCR time series datasets.

Conclusion: This work takes the first step to investigate and establish point-based shortcut learning behavior in deep learning time series classification.

Abstract: Deep learning models have attracted lots of research attention in time series classification (TSC) task in the past two decades. Recently, deep neural networks (DNN) have surpassed classical distance-based methods and achieved state-of-the-art performance. Despite their promising performance, deep neural networks (DNNs) have been shown to rely on spurious correlations present in the training data, which can hinder generalization. For instance, a model might incorrectly associate the presence of grass with the label ``cat" if the training set have majority of cats lying in grassy backgrounds. However, the shortcut behavior of DNNs in time series remain under-explored. Most existing shortcut work are relying on external attributes such as gender, patients group, instead of focus on the internal bias behavior in time series models. In this paper, we take the first step to investigate and establish point-based shortcut learning behavior in deep learning time series classification. We further propose a simple detection method based on other class to detect shortcut occurs without relying on test data or clean training classes. We test our proposed method in UCR time series datasets.

[881] What Makes Looped Transformers Perform Better Than Non-Recursive Ones (Provably)

Zixuan Gong, Jiaye Teng, Yong Liu

Main category: cs.LG

TL;DR: Looped transformers outperform standard transformers on complex reasoning tasks due to their landscape-level inductive bias towards V-shaped valleys, enabling better loss convergence and complex pattern learning.

Details

Motivation: To understand why looped transformers (Looped-Attn) outperform standard transformers (Single-Attn) on complex reasoning tasks, as the theoretical basis for this advantage remains underexplored.

Method: Extend the River-Valley landscape model to distinguish U-shaped (flat) and V-shaped (steep) valleys, and propose SHIFT - a staged hierarchical framework for progressive training of Looped-Attn.

Result: Theoretical derivations show Looped-Attn’s inductive bias towards River-V-Valley guarantees better loss convergence through valley hopping and encourages learning complex patterns compared to Single-Attn’s River-U-Valley.

Conclusion: The recursive architecture of Looped-Attn induces a landscape-level inductive bias towards V-shaped valleys, explaining its superior performance on complex reasoning tasks, and SHIFT framework accelerates training while maintaining comparable performance.

Abstract: While looped transformers (termed as Looped-Attn) often outperform standard transformers (termed as Single-Attn) on complex reasoning tasks, the theoretical basis for this advantage remains underexplored. In this paper, we explain this phenomenon through the lens of loss landscape geometry, inspired by empirical observations of their distinct dynamics at both sample and Hessian levels. To formalize this, we extend the River-Valley landscape model by distinguishing between U-shaped valleys (flat) and V-shaped valleys (steep). Based on empirical observations, we conjecture that the recursive architecture of Looped-Attn induces a landscape-level inductive bias towards River-V-Valley. Theoretical derivations based on this inductive bias guarantee a better loss convergence along the river due to valley hopping, and further encourage learning about complex patterns compared to the River-U-Valley induced by Single-Attn. Building on this insight, we propose SHIFT (Staged HIerarchical Framework for Progressive Training), a staged training framework that accelerates the training process of Looped-Attn while achieving comparable performances.

[882] Rademacher Meets Colors: More Expressivity, but at What Cost ?

Martin Carrasco, Caio Deberaldini Netto, Vahan A. Martirosyan, Aneeqa Mehrab, Ehimare Okoyomon, Caterina Graziani

Main category: cs.LG

TL;DR: This paper provides a theoretical explanation for the trade-off between expressivity and generalization in GNNs by linking WL colorings to Rademacher complexity.

Details

Motivation: To explain why more expressive GNNs suffer from higher generalization error despite being able to distinguish richer sets of graphs.

Method: Theoretical analysis connecting the number of equivalence classes induced by WL colorings to GNNs’ Rademacher complexity, showing that greater expressivity leads to higher complexity.

Result: Proved that the number of WL coloring equivalence classes directly bounds GNNs’ Rademacher complexity, and showed this complexity is stable under perturbations in color counts.

Conclusion: The framework unifies expressivity and generalization in GNNs, providing a principled understanding of why increased expressive power often comes at the cost of generalization.

Abstract: The expressive power of graph neural networks (GNNs) is typically understood through their correspondence with graph isomorphism tests such as the Weisfeiler-Leman (WL) hierarchy. While more expressive GNNs can distinguish a richer set of graphs, they are also observed to suffer from higher generalization error. This work provides a theoretical explanation for this trade-off by linking expressivity and generalization through the lens of coloring algorithms. Specifically, we show that the number of equivalence classes induced by WL colorings directly bounds the GNNs Rademacher complexity – a key data-dependent measure of generalization. Our analysis reveals that greater expressivity leads to higher complexity and thus weaker generalization guarantees. Furthermore, we prove that the Rademacher complexity is stable under perturbations in the color counts across different samples, ensuring robustness to sampling variability across datasets. Importantly, our framework is not restricted to message-passing GNNs or 1-WL, but extends to arbitrary GNN architectures and expressivity measures that partition graphs into equivalence classes. These results unify the study of expressivity and generalization in GNNs, providing a principled understanding of why increasing expressive power often comes at the cost of generalization.

[883] PANTHER: Generative Pretraining Beyond Language for Sequential User Behavior Modeling

Guilin Li, Yun Zhang, Xiuyuan Chen, Chengqi Li, Bo Wang, Linghe Kong, Wenjia Wang, Weiran Huang, Matthias Hwai Yong Tan

Main category: cs.LG

TL;DR: PANTHER is a hybrid generative-discriminative framework that extends generative pretraining to user behavior modeling, achieving significant improvements in transaction prediction and fraud detection while enabling real-time inference.

Details

Motivation: LLMs effectively capture world knowledge but struggle with behavioral knowledge from user interactions. User behavior forms a distinct modality with high-cardinality sequences that discriminative models often fail to model effectively under limited supervision.

Method: PANTHER uses: (1) Structured Tokenization to compress multi-dimensional transaction attributes; (2) Sequence Pattern Recognition Module for periodic transaction motifs; (3) Unified User-Profile Embedding combining static demographics with dynamic histories; (4) Real-time scalability through offline caching of pretrained embeddings.

Result: Deployed at WeChat Pay, PANTHER achieved 25.6% boost in next-transaction prediction HitRate@1 and 38.6% relative improvement in fraud detection recall. Cross-domain evaluations showed up to 21% HitRate@1 gains over transformer baselines.

Conclusion: PANTHER establishes itself as a scalable, high-performance framework for industrial sequential user behavior modeling, demonstrating strong generalization across domains and enabling millisecond-level inference in production systems.

Abstract: Large language models (LLMs) have shown that generative pretraining can distill vast world knowledge into compact token representations. While LLMs encapsulate extensive world knowledge, they remain limited in modeling the behavioral knowledge contained within user interaction histories. User behavior forms a distinct modality, where each action, defined by multi-dimensional attributes such as time, context, and transaction type, constitutes a behavioral token. Modeling these high-cardinality sequences is challenging, and discriminative models often falter under limited supervision. To bridge this gap, we extend generative pretraining to user behavior, learning transferable representations from unlabeled behavioral data analogous to how LLMs learn from text. We present PANTHER, a hybrid generative-discriminative framework that unifies user behavior pretraining and downstream adaptation, enabling large-scale sequential user representation learning and real-time inference. PANTHER introduces: (1) Structured Tokenization to compress multi-dimensional transaction attributes into an interpretable vocabulary; (2) Sequence Pattern Recognition Module (SPRM) for modeling periodic transaction motifs; (3) a Unified User-Profile Embedding that fuses static demographics with dynamic transaction histories; and (4) Real-time scalability enabled by offline caching of pretrained embeddings for millisecond-level inference. Fully deployed and operational online at WeChat Pay, PANTHER delivers a 25.6 percent boost in next-transaction prediction HitRate@1 and a 38.6 percent relative improvement in fraud detection recall over baselines. Cross-domain evaluations on public benchmarks show strong generalization, achieving up to 21 percent HitRate@1 gains over transformer baselines, establishing PANTHER as a scalable, high-performance framework for industrial sequential user behavior modeling.

[884] Lighter-X: An Efficient and Plug-and-play Strategy for Graph-based Recommendation through Decoupled Propagation

Yanping Zheng, Zhewei Wei, Frank de Hoog, Xu Chen, Hongteng Xu, Yuhang Ye, Jiadeng Huang

Main category: cs.LG

TL;DR: Lighter-X is an efficient framework that reduces parameter complexity from O(n×d) to O(h×d) where h≪n, enabling scalable deployment of GNN-based recommenders while maintaining performance.

Details

Motivation: Traditional GNN-based recommenders like LightGCN have high parameter complexity O(n×d) that limits scalability on large graphs, making deployment challenging in real-world applications.

Method: Analyzes parameter structure and redundancy, proposes efficient compression for sparse adjacency structures and embedding matrices, and uses a decoupled framework to reduce computational complexity during training.

Result: Achieves comparable performance to baseline models with significantly fewer parameters, attaining better results on large-scale graphs with only 1% of LightGCN’s parameters.

Conclusion: Lighter-X provides an effective solution for scalable GNN-based recommendation by substantially reducing parameter and computational complexity while preserving performance.

Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable effectiveness in recommendation systems. However, conventional graph-based recommenders, such as LightGCN, require maintaining embeddings of size $d$ for each node, resulting in a parameter complexity of $\mathcal{O}(n \times d)$, where $n$ represents the total number of users and items. This scaling pattern poses significant challenges for deployment on large-scale graphs encountered in real-world applications. To address this scalability limitation, we propose \textbf{Lighter-X}, an efficient and modular framework that can be seamlessly integrated with existing GNN-based recommender architectures. Our approach substantially reduces both parameter size and computational complexity while preserving the theoretical guarantees and empirical performance of the base models, thereby enabling practical deployment at scale. Specifically, we analyze the original structure and inherent redundancy in their parameters, identifying opportunities for optimization. Based on this insight, we propose an efficient compression scheme for the sparse adjacency structure and high-dimensional embedding matrices, achieving a parameter complexity of $\mathcal{O}(h \times d)$, where $h \ll n$. Furthermore, the model is optimized through a decoupled framework, reducing computational complexity during the training process and enhancing scalability. Extensive experiments demonstrate that Lighter-X achieves comparable performance to baseline models with significantly fewer parameters. In particular, on large-scale interaction graphs with millions of edges, we are able to attain even better results with only 1% of the parameter over LightGCN.

[885] Preference-driven Knowledge Distillation for Few-shot Node Classification

Xing Wei, Chunchun Chen, Rui Fan, Xiaofeng Cao, Sourav Medya, Wei Ye

Main category: cs.LG

TL;DR: A preference-driven knowledge distillation framework that synergizes LLMs and GNNs for few-shot node classification on text-attributed graphs by using preference-driven selectors to optimize knowledge transfer.

Details

Motivation: GNNs rely heavily on human-annotated labels and struggle with diverse local topologies, while LLMs perform well in few-shot learning but face scalability issues - creating a need to combine their complementary strengths.

Method: Developed two preference-driven selectors: GNN-preference-driven node selector for LLM-to-GNN distillation, and node-preference-driven GNN selector to identify the best teacher GNN for each node’s local topology.

Result: Extensive experiments show the framework is effective for few-shot node classification on real-world text-attributed graphs.

Conclusion: The proposed PKD framework successfully leverages the complementary strengths of LLMs and GNNs through preference-driven knowledge distillation for improved few-shot learning on complex graph data.

Abstract: Graph neural networks (GNNs) can efficiently process text-attributed graphs (TAGs) due to their message-passing mechanisms, but their training heavily relies on the human-annotated labels. Moreover, the complex and diverse local topologies of nodes of real-world TAGs make it challenging for a single mechanism to handle. Large language models (LLMs) perform well in zero-/few-shot learning on TAGs but suffer from a scalability challenge. Therefore, we propose a preference-driven knowledge distillation (PKD) framework to synergize the complementary strengths of LLMs and various GNNs for few-shot node classification. Specifically, we develop a GNN-preference-driven node selector that effectively promotes prediction distillation from LLMs to teacher GNNs. To further tackle nodes’ intricate local topologies, we develop a node-preference-driven GNN selector that identifies the most suitable teacher GNN for each node, thereby facilitating tailored knowledge distillation from teacher GNNs to the student GNN. Extensive experiments validate the efficacy of our proposed framework in few-shot node classification on real-world TAGs.

[886] CacheClip: Accelerating RAG with Effective KV Cache Reuse

Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu

Main category: cs.LG

TL;DR: CacheClip is a novel framework that addresses TTFT bottlenecks in RAG systems by using small auxiliary LLMs to identify critical tokens for selective KV cache recomputation, achieving both fast inference and high generation quality.

Details

Motivation: RAG systems suffer from severe time-to-first-token bottlenecks due to long input sequences, and existing KV cache reuse methods face fundamental trade-offs between prefix caching requirements and quality degradation from missing inter-chunk attention.

Method: CacheClip integrates three techniques: auxiliary-model-guided token selection for selective KV cache recomputation (with finetuned auxiliary models), shared prefixes to eliminate redundant attention sinks, and grouping strategy to maintain local coherence during partial KV cache updates.

Result: CacheClip retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench, outperforming APE and CacheBlend by 25.2% and 35.1% on NIAH. It accelerates LLM inference by up to 1.92x in prefill time.

Conclusion: CacheClip provides a practical solution to the efficiency-quality trade-off in RAG systems by enabling both fast TTFT and high generation quality through selective KV cache recomputation guided by auxiliary LLMs.

Abstract: Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely occur in RAG scenarios, while direct precomputation sacrifices quality due to missing inter-chunk attention and repeated attention sinks. Recent methods like APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs (the target model for generation), enabling efficient identification of tokens critical for restoring inter-chunk attention, thereby significantly improving response quality on cross-chunk reasoning tasks. CacheClip integrates three techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, where the auxiliary model is finetuned to improve selection accuracy, (2) shared prefixes to eliminate redundant attention sinks, and (3) grouping strategy to maintain local coherence during partial KV cache updates. Experiments show CacheClip retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench, outperforming APE and CacheBlend by 25.2% and 35.1% on NIAH (with reomp% = 20%). Meanwhile, CacheClip accelerates LLM inference by up to 1.92x in prefill time, providing a practical solution to the efficiency-quality trade-off in RAG systems.

[887] PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models

Lancheng Zou, Shuo Yin, Zehua Pei, Tsung-Yi Ho, Farzan Farnia, Bei Yu

Main category: cs.LG

TL;DR: PermLLM introduces learnable channel permutation for N:M sparsity in LLMs, using Sinkhorn normalization to enable differentiable optimization and block-wise strategy to reduce complexity, achieving superior performance over traditional methods.

Details

Motivation: Traditional channel permutation methods rely on handcrafted quality metrics that fail to accurately capture pruning impact on model performance, limiting their effectiveness.

Method: Proposes learnable channel permutation (LCP) using Sinkhorn normalization to transform discrete permutations into differentiable soft matrices, with block-wise channel permutation to reduce parameters and computational complexity.

Result: Extensive experiments on LLaMA, Qwen, and OPT models demonstrate that PermLLM achieves superior performance in optimizing N:M sparse models compared to traditional methods.

Conclusion: PermLLM effectively mitigates pruning-induced errors and seamlessly integrates with existing one-shot pruning methods, providing an adaptive optimization framework for channel permutations.

Abstract: Channel permutation is a powerful technique for enhancing the accuracy of N:M sparse models by reordering the channels of weight matrices to prioritize the retention of important weights. However, traditional channel permutation methods rely on handcrafted quality metrics, which often fail to accurately capture the true impact of pruning on model performance. To address this limitation, we propose PermLLM, a novel post-training pruning framework that introduces learnable channel permutation (LCP) for N:M sparsity. LCP leverages Sinkhorn normalization to transform discrete permutation matrices into differentiable soft permutation matrices, enabling end-to-end optimization. Additionally, PermLLM incorporates an efficient block-wise channel permutation strategy, which significantly reduces the number of learnable parameters and computational complexity. PermLLM seamlessly integrates with existing one-shot pruning methods to adaptively optimize channel permutations, effectively mitigating pruning-induced errors. Extensive experiments on the LLaMA series, Qwen, and OPT models demonstrate that PermLLM achieves superior performance in optimizing N:M sparse models. The code is available at https://github.com/lanchengzou/PermLLM.

[888] Adversarial Attacks on Downstream Weather Forecasting Models: Application to Tropical Cyclone Trajectory Prediction

Yue Deng, Francisco Santos, Pang-Ning Tan, Lifeng Luo

Main category: cs.LG

TL;DR: Cyc-Attack is a novel adversarial attack method that perturbs upstream weather forecasts to manipulate tropical cyclone trajectory predictions in deep learning weather forecasting models, overcoming challenges of non-differentiable TC detectors and class imbalance.

Details

Motivation: To investigate the vulnerability of DLWF models to adversarial attacks that can alter downstream TC trajectory predictions, addressing challenges with non-differentiable TC detection systems, class imbalance from rare TC events, and maintaining physical consistency.

Method: Pre-train a differentiable surrogate model to approximate TC detector output, employ skewness-aware loss function with kernel dilation for class imbalance, and use distance-based gradient weighting with regularization to ensure realistic perturbations.

Result: The proposed Cyc-Attack method successfully generates adversarial trajectories by perturbing upstream forecasts, overcoming the limitations of conventional gradient-based attacks and maintaining physical consistency.

Conclusion: Cyc-Attack demonstrates that DLWF models are vulnerable to adversarial attacks that can manipulate TC trajectory predictions, highlighting security concerns in weather forecasting systems and the need for robust defenses.

Abstract: Deep learning based weather forecasting (DLWF) models leverage past weather observations to generate future forecasts, supporting a wide range of downstream tasks, including tropical cyclone (TC) trajectory prediction. In this paper, we investigate their vulnerability to adversarial attacks, where subtle perturbations to the upstream weather forecasts can alter the downstream TC trajectory predictions. Although research on adversarial attacks in DLWF models has grown recently, generating perturbed upstream forecasts that reliably steer downstream output toward attacker-specified trajectories remains a challenge. First, conventional TC detection systems are opaque, non-differentiable black boxes, making standard gradient-based attacks infeasible. Second, the extreme rarity of TC events leads to severe class imbalance problem, making it difficult to develop efficient attack methods that will produce the attacker’s target trajectories. Furthermore, maintaining physical consistency in adversarially generated forecasts presents another significant challenge. To overcome these limitations, we propose Cyc-Attack, a novel method that perturbs the upstream forecasts of DLWF models to generate adversarial trajectories. First, we pre-train a differentiable surrogate model to approximate the TC detector’s output, enabling the construction of gradient-based attacks. Cyc-Attack also employs skewness-aware loss function with kernel dilation strategy to address the imbalance problem. Finally, a distance-based gradient weighting scheme and regularization are used to constrain the perturbations and eliminate spurious trajectories to ensure the adversarial forecasts are realistic and not easily detectable.

[889] A Unified Frequency Domain Decomposition Framework for Interpretable and Robust Time Series Forecasting

Cheng He, Xijie Liang, Zengrong Zheng, Patrick P. C. Lee, Xu Huang, Zhaoyi Li, Hong Xie, Defu Lian, Enhong Chen

Main category: cs.LG

TL;DR: FIRE is a frequency domain decomposition framework for time series forecasting that models amplitude and phase components separately, adaptively learns frequency basis weights, uses targeted loss functions, and handles sparse data, achieving superior performance and interpretability.

Details

Motivation: Current deep learning approaches for time series forecasting operate as black-box models with limited interpretability and theoretical understanding, while struggling with data distribution dynamics across time and frequency domains.

Method: Proposes FIRE framework with independent amplitude/phase modeling, adaptive frequency basis weight learning, targeted loss functions, and novel training paradigm for sparse data in frequency domain decomposition.

Result: Extensive experiments show FIRE consistently outperforms state-of-the-art models on long-term forecasting benchmarks, achieving superior predictive performance and significantly enhanced interpretability.

Conclusion: FIRE provides a unified mathematical abstraction for diverse time series that enables interpretable and robust forecasting through frequency domain decomposition, addressing key limitations of current black-box approaches.

Abstract: Current approaches for time series forecasting, whether in the time or frequency domain, predominantly use deep learning models based on linear layers or transformers. They often encode time series data in a black-box manner and rely on trial-and-error optimization solely based on forecasting performance, leading to limited interpretability and theoretical understanding. Furthermore, the dynamics in data distribution over time and frequency domains pose a critical challenge to accurate forecasting. We propose FIRE, a unified frequency domain decomposition framework that provides a mathematical abstraction for diverse types of time series, so as to achieve interpretable and robust time series forecasting. FIRE introduces several key innovations: (i) independent modeling of amplitude and phase components, (ii) adaptive learning of weights of frequency basis components, (iii) a targeted loss function, and (iv) a novel training paradigm for sparse data. Extensive experiments demonstrate that FIRE consistently outperforms state-of-the-art models on long-term forecasting benchmarks, achieving superior predictive performance and significantly enhancing interpretability of time series

[890] Robust Learning of Diffusion Models with Extremely Noisy Conditions

Xin Chen, Gillian Dobbie, Xinyu Wang, Feng Liu, Di Wang, Jingfeng Zhang

Main category: cs.LG

TL;DR: A robust learning framework for conditional diffusion models that handles extremely noisy conditions by learning pseudo conditions through temporal ensembling and Reverse-time Diffusion Condition technique.

Details

Motivation: Conditional diffusion models suffer significant performance degradation with noisy conditions like corrupted labels or unreliable observations, and existing noise-robust methods fail at high noise levels.

Method: Proposes learning pseudo conditions as surrogates for clean conditions, refining them progressively via temporal ensembling, and using Reverse-time Diffusion Condition (RDC) to diffuse pseudo conditions to reinforce memorization.

Result: Achieves state-of-the-art performance across various noise levels on both class-conditional image generation and visuomotor policy generation tasks.

Conclusion: The proposed framework effectively addresses the challenge of extremely noisy conditions in conditional diffusion models through pseudo condition learning and refinement techniques.

Abstract: Conditional diffusion models have the generative controllability by incorporating external conditions. However, their performance significantly degrades with noisy conditions, such as corrupted labels in the image generation or unreliable observations or states in the control policy generation. This paper introduces a robust learning framework to address extremely noisy conditions in conditional diffusion models. We empirically demonstrate that existing noise-robust methods fail when the noise level is high. To overcome this, we propose learning pseudo conditions as surrogates for clean conditions and refining pseudo ones progressively via the technique of temporal ensembling. Additionally, we develop a Reverse-time Diffusion Condition (RDC) technique, which diffuses pseudo conditions to reinforce the memorization effect and further facilitate the refinement of the pseudo conditions. Experimentally, our approach achieves state-of-the-art performance across a range of noise levels on both class-conditional image generation and visuomotor policy generation tasks.The code can be accessible via the project page https://robustdiffusionpolicy.github.io

[891] Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

Zhezheng Hao, Hong Wang, Haoyang Liu, Jian Luo, Jiarui Yu, Hande Dong, Qiang Lin, Can Wang, Jiawei Chen

Main category: cs.LG

TL;DR: STEER introduces a token-level entropy-change-aware reweighting scheme to directly stabilize entropy dynamics in RLVR training, preventing entropy collapse and improving generalization in mathematical reasoning tasks.

Details

Motivation: Existing entropy intervention methods in RLVR training only indirectly control entropy dynamics through related factors like advantage signals, which limits their effectiveness and can lead to entropy collapse - a rapid loss of policy diversity that harms generalization.

Method: Proposes STEER (Stabilizing Token-level Entropy-changE via Reweighting), which uses fine-grained token-level adjustments to adaptively stabilize entropy dynamics through an entropy-change-aware reweighting scheme.

Result: Extensive experiments show STEER significantly mitigates entropy collapse, stabilizes entropy dynamics, and achieves stronger downstream performance across various mathematical reasoning benchmarks.

Conclusion: Direct token-level entropy stabilization through STEER effectively addresses the fundamental limitation of existing indirect methods, providing robust exploration while mitigating over-exploitation in RLVR training.

Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) can enhance LLM reasoning, its training process poses a critical risk: entropy collapse. This phenomenon is a rapid loss of policy diversity, stemming from the exploration-exploitation imbalance and leading to a lack of generalization. Recent entropy-intervention methods aim to prevent \coloredtext{entropy collapse}, yet their underlying mechanisms remain unclear. In this paper, we conduct a quantitative analysis to reveal token-level entropy changes and how existing entropy intervention methods help avoid entropy collapse. Our findings point out a fundamental limitation of existing methods: they attempt to control entropy dynamics indirectly. By only affecting related factors, such as the advantage signal and generation probability, their effectiveness is inherently limited and could potentially fail. To address this limitation, we introduce an entropy-change-aware reweighting scheme, namely Stabilizing Token-level Entropy-changE via Reweighting (STEER), that adaptively stabilizes entropy dynamics through fine-grained token-level adjustments. Our approach mitigates over-exploitation while fostering robust exploration. Extensive experiments demonstrate that STEER significantly mitigates entropy collapse, stabilizes entropy dynamics, and achieves stronger downstream performance across various mathematical reasoning benchmarks \footnote{Our code is available at https://github.com/zz-haooo/STEER.

[892] INR-Bench: A Unified Benchmark for Implicit Neural Representations in Multi-Domain Regression and Reconstruction

Linfei Li, Fengyi Zhang, Zhong Wang, Lin Zhang, Ying Shen

Main category: cs.LG

TL;DR: INR-Bench is the first comprehensive benchmark for Implicit Neural Representations, evaluating 78 model variants across 9 multimodal tasks to analyze how architectures, positional encoding, and activations affect signal frequency response.

Details

Motivation: To better understand the factors influencing Implicit Neural Representations' effectiveness and limitations, which remain underexplored despite their success in signal processing tasks.

Method: Leveraged Neural Tangent Kernel theory to analyze model architectures (MLP vs KAN), positional encoding, and nonlinear primitives. Created INR-Bench with 56 Coordinate-MLP variants and 22 Coordinate-KAN models evaluated across 9 implicit multimodal tasks covering forward and inverse problems.

Result: Established a comprehensive benchmark platform that highlights the strengths and limitations of different neural models for implicit representation tasks.

Conclusion: INR-Bench provides a solid foundation for future research on Implicit Neural Representations by systematically evaluating various architectural components across multimodal tasks.

Abstract: Implicit Neural Representations (INRs) have gained success in various signal processing tasks due to their advantages of continuity and infinite resolution. However, the factors influencing their effectiveness and limitations remain underexplored. To better understand these factors, we leverage insights from Neural Tangent Kernel (NTK) theory to analyze how model architectures (classic MLP and emerging KAN), positional encoding, and nonlinear primitives affect the response to signals of varying frequencies. Building on this analysis, we introduce INR-Bench, the first comprehensive benchmark specifically designed for multimodal INR tasks. It includes 56 variants of Coordinate-MLP models (featuring 4 types of positional encoding and 14 activation functions) and 22 Coordinate-KAN models with distinct basis functions, evaluated across 9 implicit multimodal tasks. These tasks cover both forward and inverse problems, offering a robust platform to highlight the strengths and limitations of different neural models, thereby establishing a solid foundation for future research. The code and dataset are available at https://github.com/lif314/INR-Bench.

[893] CauchyNet: Compact and Data-Efficient Learning using Holomorphic Activation Functions

Hong-Kun Zhang, Xin Li, Sikun Yang, Zhihong Xia

Main category: cs.LG

TL;DR: CauchyNet is a novel neural network based on Cauchy’s integral formula that embeds real data into complex plane for superior function approximation, achieving 50% lower MAE with fewer parameters.

Details

Motivation: To develop a more efficient neural network for function approximation tasks like time series forecasting and missing data imputation that can better capture complex temporal dependencies while reducing computational overhead.

Method: Embed real-valued data into complex plane using Cauchy’s integral formula, incorporate complex-valued activation functions, and leverage theoretical guarantees from universal approximation theorem.

Result: Consistently outperforms state-of-the-art models across diverse domains (transportation, energy, epidemiology), achieving 50% lower mean absolute error with fewer parameters and reduced computational overhead.

Conclusion: CauchyNet demonstrates strong potential as an effective and efficient tool for data-driven predictive modeling, particularly in resource-constrained and data-scarce environments.

Abstract: A novel neural network inspired by Cauchy’s integral formula, is proposed for function approximation tasks that include time series forecasting, missing data imputation, etc. Hence, the novel neural network is named CauchyNet. By embedding real-valued data into the complex plane, CauchyNet efficiently captures complex temporal dependencies, surpassing traditional real-valued models in both predictive performance and computational efficiency. Grounded in Cauchy’s integral formula and supported by the universal approximation theorem, CauchyNet offers strong theoretical guarantees for function approximation. The architecture incorporates complex-valued activation functions, enabling robust learning from incomplete data while maintaining a compact parameter footprint and reducing computational overhead. Through extensive experiments in diverse domains, including transportation, energy consumption, and epidemiological data, CauchyNet consistently outperforms state-of-the-art models in predictive accuracy, often achieving a 50% lower mean absolute error with fewer parameters. These findings highlight CauchyNet’s potential as an effective and efficient tool for data-driven predictive modeling, particularly in resource-constrained and data-scarce environments.

[894] RLFR: Extending Reinforcement Learning for LLMs with Flow Environment

Jinghao Zhang, Naishan Zheng, Ruilin Li, Dongzhou Cheng, Zheming Liang, Feng Zhao, Jiaqi Wang

Main category: cs.LG

TL;DR: RLFR introduces flow rewards from latent space for RLVR, using velocity deviations in flow fields constructed from off-policy and on-policy data as reward signals, improving reasoning in LLMs.

Details

Motivation: Binary verification in RLVR overlooks valuable reasoning exploration, and golden Process Reward Models are annotation-heavy, motivating the use of auxiliary signals from latent space for reward shaping.

Method: Construct flow fields from off-policy high-quality data and on-policy rejection sampling data, then quantify velocity deviations of policy latents within these fields to serve as reward signals.

Result: Experiments on language and multimodal reasoning benchmarks show reliable flow rewards, demonstrating that well-established flow fields can effectively collect reward signals from expressive latent spaces.

Conclusion: RLFR presents a promising paradigm for reward shaping with auxiliary signals, utilizing efficient context dependence in hidden states rather than token-level denotation.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). However, policy optimized with binary verification prone to overlook potential valuable exploration in reasoning trajectory. In view of heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt using auxiliary signals for reward shaping of process tokens, involving entropy and likelihood collected from logit space. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from latent space, and propose RLFR, where the flow fields of model latents are constructed from either off-policy high-quality data and on-policy rejection sampling data, and the velocity deviations of policy latents within it are quantified to serve as a reward signal. RLFR first demonstrates that a well-established flow field can be a sound environment for reward signal collection, highlighting the expressive latent space is much underexplored. Moreover, RLFR is able to compress any off-policy expert data as reference for constituting reward signals, and we show that the efficient context dependence compressed within the hidden states are utilized, rather than individual token-level denotation for context comprehending. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards, and suggesting a promising paradigm for reward shaping with auxiliary signals.

[895] Hierarchical Bayesian Flow Networks for Molecular Graph Generation

Yida Xiong, Jiameng Chen, Kun Li, Hongzhi Zhang, Xiantao Cai, Wenbin Hu

Main category: cs.LG

TL;DR: GraphBFN is a novel molecular graph generation framework that addresses the discrepancy between continuous training and discrete inference in existing methods by using Bayesian Flow Networks and Cumulative Distribution Functions to unify training objectives with sampling operations.

Details

Motivation: Current molecular graph generation methods treat the problem as regression during training but require rounding for discrete classification during inference, creating a training-inference mismatch that leads to overfitting, reduced diversity, and limited generalization.

Method: Proposed GraphBFN uses hierarchical coarse-to-fine framework based on Bayesian Flow Networks that operates on distribution parameters, introducing Cumulative Distribution Function to calculate correct category probabilities and unify training with sampling rounding.

Result: GraphBFN achieves superior performance and faster generation, setting new state-of-the-art results on QM9 and ZINC250k molecular graph generation benchmarks.

Conclusion: The proposed method successfully addresses the fundamental limitation of training-inference discrepancy in molecular graph generation, demonstrating improved performance and efficiency through unified continuous-discrete modeling.

Abstract: Molecular graph generation is essentially a classification generation problem, aimed at predicting categories of atoms and bonds. Currently, prevailing paradigms such as continuous diffusion models are trained to predict continuous numerical values, treating the training process as a regression task. However, the final generation necessitates a rounding step to convert these predictions back into discrete classification categories, which is intrinsically a classification operation. Given that the rounding operation is not incorporated during training, there exists a significant discrepancy between the model’s training objective and its inference procedure. As a consequence, an excessive emphasis on point-wise precision can lead to overfitting and inefficient learning. This occurs because considerable efforts are devoted to capturing intra-bin variations that are ultimately irrelevant to the discrete nature of the task at hand. Such a flaw results in diminished molecular diversity and constrains the model’s generalization capabilities. To address this fundamental limitation, we propose GraphBFN, a novel hierarchical coarse-to-fine framework based on Bayesian Flow Networks that operates on the parameters of distributions. By innovatively introducing Cumulative Distribution Function, GraphBFN is capable of calculating the probability of selecting the correct category, thereby unifying the training objective with the sampling rounding operation. We demonstrate that our method achieves superior performance and faster generation, setting new state-of-the-art results on the QM9 and ZINC250k molecular graph generation benchmarks.

[896] SGM: A Statistical Godel Machine for Risk-Controlled Recursive Self-Modification

Xuening Wu, Shenqin Yin, Yanlan Kang, Xinhang Zhang, Qianya Xu, Zeping Chen, Wenqiang Zhang

Main category: cs.LG

TL;DR: Statistical Godel Machine (SGM) introduces statistical safety guarantees for recursive self-modification in AI systems, replacing formal proofs with statistical confidence tests to ensure safe code rewrites.

Details

Motivation: Existing frameworks lack safety guarantees for recursive self-modification in AutoML and neural architecture search, while traditional Godel machines require unattainable formal proofs in stochastic settings.

Method: SGM uses statistical confidence tests (e-values, Hoeffding bounds) instead of formal proofs, with Confirm-Triggered Harmonic Spending (CTHS) to allocate error budget efficiently while maintaining familywise validity.

Result: Experiments show SGM certifies genuine gains on CIFAR-100, rejects spurious improvement on ImageNet-100, and demonstrates robustness on RL and optimization benchmarks.

Conclusion: SGM provides foundational infrastructure for risk-aware self-modification in learning systems, enabling safe continual adaptation with statistical guarantees.

Abstract: Recursive self-modification is increasingly central in AutoML, neural architecture search, and adaptive optimization, yet no existing framework ensures that such changes are made safely. Godel machines offer a principled safeguard by requiring formal proofs of improvement before rewriting code; however, such proofs are unattainable in stochastic, high-dimensional settings. We introduce the Statistical Godel Machine (SGM), the first statistical safety layer for recursive edits. SGM replaces proof-based requirements with statistical confidence tests (e-values, Hoeffding bounds), admitting a modification only when superiority is certified at a chosen confidence level, while allocating a global error budget to bound cumulative risk across rounds.We also propose Confirm-Triggered Harmonic Spending (CTHS), which indexes spending by confirmation events rather than rounds, concentrating the error budget on promising edits while preserving familywise validity.Experiments across supervised learning, reinforcement learning, and black-box optimization validate this role: SGM certifies genuine gains on CIFAR-100, rejects spurious improvement on ImageNet-100, and demonstrates robustness on RL and optimization benchmarks.Together, these results position SGM as foundational infrastructure for continual, risk-aware self-modification in learning systems.Code is available at: https://github.com/gravitywavelet/sgm-anon.

[897] Progressive Scale Convolutional Network for Spatio-Temporal Downscaling of Soil Moisture: A Case Study Over the Tibetan Plateau

Ziyu Zhou, Keyan Hu, Ling Zhang, Zhaohui Xue, Yutian Fang, Yusha Zheng

Main category: cs.LG

TL;DR: PSCNet is a progressive scale convolutional network that combines coarse passive microwave data with fine-scale ERA5-Land variables to produce high-resolution soil moisture products at 10-km spatial and 3-hour temporal resolution for the Tibetan Plateau.

Details

Motivation: To address the challenge of incomplete surface auxiliary factors that hinder temporal-scale soil moisture inversion, and to obtain seamless high-resolution soil moisture data for hydrological and meteorological applications.

Method: Introduces validated high temporal resolution ERA5-Land variables into SMAP downscaling, and designs PSCNet with multi-frequency temporal fusion module (MFTF) for temporal dynamics and custom squeeze-and-excitation (SE) block for spatial detail preservation.

Result: Achieved mean R value of 0.881 in satellite validation, consistently ranked top three in in-situ validation, maintained average relative error within 6% for R and 2% for ubRMSE in temporal generalization, and showed excellent temporal sensitivity and spatial detail preservation.

Conclusion: PSCNet provides a promising solution for spatio-temporal downscaling by effectively modeling intricate spatio-temporal relationships in soil moisture data.

Abstract: Soil moisture (SM) plays a critical role in hydrological and meteorological processes. High-resolution SM can be obtained by combining coarse passive microwave data with fine-scale auxiliary variables. However, the inversion of SM at the temporal scale is hindered by the incompleteness of surface auxiliary factors. To address this issue, first, we introduce validated high temporal resolution ERA5-Land variables into the downscaling process of the low-resolution SMAP SM product. Subsequently, we design a progressive scale convolutional network (PSCNet), at the core of which are two innovative components: a multi-frequency temporal fusion module (MFTF) for capturing temporal dynamics, and a bespoke squeeze-and-excitation (SE) block designed to preserve fine-grained spatial details. Using this approach, we obtained seamless SM products for the Tibetan Plateau (TP) from 2016 to 2018 at 10-km spatial and 3-hour temporal resolution. The experimental results on the TP demonstrated the following: 1) In the satellite product validation, the PSCNet exhibited comparable accuracy and lower error, with a mean R value of 0.881, outperforming other methods. 2) In the in-situ site validation, PSCNet consistently ranked among the top three models for the R metric across all sites, while also showing superior performance in overall error reduction. 3) In the temporal generalization validation, the feasibility of using high-temporal resolution ERA5-Land variables for downscaling was confirmed, as all methods maintained an average relative error within 6% for the R metric and 2% for the ubRMSE metric. 4) In the temporal dynamics and visualization validation, PSCNet demonstrated excellent temporal sensitivity and vivid spatial details. Overall, PSCNet provides a promising solution for spatio-temporal downscaling by effectively modeling the intricate spatio-temporal relationships in SM data.

[898] Reasoning-Enhanced Large Language Models for Molecular Property Prediction

Jiaxi Zhuang, Yaorui Shi, Jue Hou, Yunong He, Mingwei Ye, Mingjun Xu, Yuming Su, Linfeng Zhang, Linfeng Zhang, Guolin Ke, Hengxing Cai

Main category: cs.LG

TL;DR: MPPReasoner is a multimodal LLM that combines molecular images and SMILES strings for molecular property prediction, using a two-stage training approach with supervised fine-tuning and reinforcement learning with principle-guided rewards to improve interpretability and cross-task generalization.

Details

Motivation: Existing molecular property prediction methods suffer from limited interpretability, poor cross-task generalization, and lack of chemical reasoning capabilities. Traditional ML models struggle with task transferability, while specialized molecular language models provide little insight into decision-making processes.

Method: Built on Qwen2.5-VL-7B-Instruct, MPPReasoner integrates molecular images with SMILES strings. Uses two-stage training: 1) Supervised fine-tuning with 16,000 high-quality reasoning trajectories generated through expert knowledge and multiple teacher models, 2) Reinforcement Learning from Principle-Guided Rewards (RLPGR) with verifiable, rule-based rewards that evaluate chemical principle application, molecular structure analysis, and logical consistency.

Result: Extensive experiments across 8 datasets show significant performance improvements. MPPReasoner outperforms best baselines by 7.91% on in-distribution tasks and 4.53% on out-of-distribution tasks. Demonstrates exceptional cross-task generalization and generates chemically sound reasoning paths.

Conclusion: MPPReasoner substantially enhances both interpretability and practical utility for chemists by providing chemically sound reasoning paths and valuable insights into molecular property analysis, while achieving superior performance on both in-distribution and out-of-distribution tasks.

Abstract: Molecular property prediction is crucial for drug discovery and materials science, yet existing approaches suffer from limited interpretability, poor cross-task generalization, and lack of chemical reasoning capabilities. Traditional machine learning models struggle with task transferability, while specialized molecular language models provide little insight into their decision-making processes. To address these limitations, we propose \textbf{MPPReasoner}, a multimodal large language model that incorporates chemical reasoning for molecular property prediction. Our approach, built upon Qwen2.5-VL-7B-Instruct, integrates molecular images with SMILES strings to enable comprehensive molecular understanding. We develop a two-stage training strategy: supervised fine-tuning (SFT) using 16,000 high-quality reasoning trajectories generated through expert knowledge and multiple teacher models, followed by Reinforcement Learning from Principle-Guided Rewards (RLPGR). RLPGR employs verifiable, rule-based rewards that systematically evaluate chemical principle application, molecular structure analysis, and logical consistency through computational verification. Extensive experiments across 8 datasets demonstrate significant performance improvements, with MPPReasoner outperforming the best baselines by 7.91% and 4.53% on in-distribution and out-of-distribution tasks respectively. MPPReasoner exhibits exceptional cross-task generalization and generates chemically sound reasoning paths that provide valuable insights into molecular property analysis, substantially enhancing both interpretability and practical utility for chemists. Code is available at https://anonymous.4open.science/r/MPPReasoner-12687.

[899] Enhancing the Cross-Size Generalization for Solving Vehicle Routing Problems via Continual Learning

Jingwen Li, Zhiguang Cao, Yaoxin Wu, Tang Liu

Main category: cs.LG

TL;DR: A continual learning framework for vehicle routing problems that trains deep models sequentially with instances of ascending sizes, using inter-task and intra-task regularization plus experience replay to improve generalization across different problem sizes.

Details

Motivation: Existing deep models for vehicle routing problems are typically trained on single-size instances, limiting their ability to generalize across different problem sizes and hampering practical applicability.

Method: Proposes a continual learning framework with: 1) inter-task regularization to retain knowledge from smaller sizes, 2) intra-task regularization to consolidate model behaviors during training, and 3) experience replay to revisit previously trained instances and mitigate catastrophic forgetting.

Result: Achieves superior performance across various problem sizes (both seen and unseen during training) compared to state-of-the-art deep models, including those specialized for generalizability enhancement.

Conclusion: The proposed continual learning framework effectively addresses the generalization limitation in vehicle routing problems, with ablation studies confirming the synergistic effect of the key designs.

Abstract: Exploring machine learning techniques for addressing vehicle routing problems has attracted considerable research attention. To achieve decent and efficient solutions, existing deep models for vehicle routing problems are typically trained and evaluated using instances of a single size. This substantially limits their ability to generalize across different problem sizes and thus hampers their practical applicability. To address the issue, we propose a continual learning based framework that sequentially trains a deep model with instances of ascending problem sizes. Specifically, on the one hand, we design an inter-task regularization scheme to retain the knowledge acquired from smaller problem sizes in the model training on a larger size. On the other hand, we introduce an intra-task regularization scheme to consolidate the model by imitating the latest desirable behaviors during training on each size. Additionally, we exploit the experience replay to revisit instances of formerly trained sizes for mitigating the catastrophic forgetting. Experimental results show that our approach achieves predominantly superior performance across various problem sizes (either seen or unseen in the training), as compared to state-of-the-art deep models including the ones specialized for generalizability enhancement. Meanwhile, the ablation studies on the key designs manifest their synergistic effect in the proposed framework.

[900] Lost in the Middle: An Emergent Property from Information Retrieval Demands in LLMs

Nikolaus Salvatore, Hao Wang, Qiong Zhang

Main category: cs.LG

TL;DR: LLMs show U-shaped performance curves (primacy and recency effects) due to training on tasks with different memory demands - uniform recall for long-term memory and recent-focused recall for short-term memory.

Details

Motivation: To understand why LLMs exhibit 'lost-in-the-middle' phenomenon where performance degrades for information in the middle of long contexts, mirroring human memory effects.

Method: Trained GPT-2 and Llama variants from scratch on simple human memory paradigms simulating long-term and short-term memory demands, and analyzed sequence completion tasks.

Result: The U-shaped performance curve emerges from training: recency effect aligns with short-term memory demand, while primacy effect is induced by uniform long-term memory demand and influenced by autoregressive properties and attention sinks.

Conclusion: Positional bias in LLMs is jointly produced by information retrieval demands during training, model architecture, and structural attention dynamics.

Abstract: The performance of Large Language Models (LLMs) often degrades when crucial information is in the middle of a long context, a “lost-in-the-middle” phenomenon that mirrors the primacy and recency effects in human memory. We propose that this behavior is not simply a flaw indicative of information loss but an adaptation to different information retrieval demands during pre-training: some tasks require uniform recall across the entire input (a long-term memory demand), while others prioritize the most recent information (a short-term memory demand). Consistent with this view, we show that this U-shaped performance curve emerges when LLMs (GPT-2 and Llama variants) are trained from scratch on two simple human memory paradigms simulating long-term and short-term memory demands. Our analysis reveals that while the recency effect directly aligns with short-term memory demand in the training data, the primacy effect is induced by the uniform long-term memory demand and is additionally influenced by the model’s autoregressive properties and the formation of attention sinks. Our main findings from simple human memory paradigms also generalize to a sequence completion task, which more closely resembles the next-token prediction process in LLM pre-training. Together, our findings reveal how information retrieval demands, model architecture, and structural attention dynamics during model training can jointly produce positional bias observed in LLMs.

[901] Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models

Christopher Chiu, Silviu Pitis, Mihaela van der Schaar

Main category: cs.LG

TL;DR: VivaBench is a multi-turn benchmark that evaluates sequential clinical reasoning in LLM agents through physician-curated clinical vignettes, revealing significant performance degradation in iterative diagnostic reasoning under uncertainty.

Details

Motivation: Current medical benchmarks for LLMs primarily assess knowledge recall through single-turn questions with complete information, failing to capture the hypothesis-driven, iterative nature of real clinical reasoning where physicians refine diagnoses through targeted information gathering.

Method: Created VivaBench with 1762 physician-curated clinical vignettes structured as interactive scenarios simulating oral medical examinations, requiring agents to actively probe for findings, select investigations, and synthesize information across multiple steps to reach diagnoses.

Result: Current LLMs show competence in diagnosing from well-described presentations but performance degrades significantly in iterative reasoning under uncertainty. Identified failure modes mirroring clinical cognitive errors: fixation on initial hypotheses, inappropriate investigation ordering, premature diagnostic closure, and failing to screen for critical conditions.

Conclusion: VivaBench provides a standardized benchmark for evaluating conversational medical AI systems and reveals fundamental limitations in how current LLMs reason under uncertainty, contributing to research on agentic AI by demonstrating how sequential reasoning trajectories diverge in complex decision-making.

Abstract: Clinical reasoning in medicine is a hypothesis-driven process where physicians refine diagnoses from limited information through targeted history, physical examination, and diagnostic investigations. In contrast, current medical benchmarks for large language models (LLMs) primarily assess knowledge recall through single-turn questions, where complete clinical information is provided upfront. To address this gap, we introduce VivaBench, a multi-turn benchmark that evaluates sequential clinical reasoning in LLM agents. Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a (oral) examination in medical training, requiring agents to actively probe for relevant findings, select appropriate investigations, and synthesize information across multiple steps to reach a diagnosis. While current LLMs demonstrate competence in diagnosing conditions from well-described clinical presentations, their performance degrades significantly when required to navigate iterative diagnostic reasoning under uncertainty in our evaluation. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice, including: (1) fixation on initial hypotheses, (2) inappropriate investigation ordering, (3) premature diagnostic closure, and (4) failing to screen for critical conditions. These patterns reveal fundamental limitations in how current LLMs reason and make decisions under uncertainty. Through VivaBench, we provide a standardized benchmark for evaluating conversational medical AI systems for real-world clinical decision support. Beyond medical applications, we contribute to the larger corpus of research on agentic AI by demonstrating how sequential reasoning trajectories can diverge in complex decision-making environments.

[902] Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting

Michael Y. Hu, Benjamin Van Durme, Jacob Andreas, Harsh Jhamtani

Main category: cs.LG

TL;DR: ECHO is a prompting framework that adapts hindsight experience replay for language model agents, generating optimized trajectories from failed attempts to improve sample efficiency in novel environments.

Details

Motivation: Language model agents deployed in novel environments often show poor sample efficiency when learning from sequential interactions, which hinders their usefulness in costly interaction scenarios like human interaction or physical system resets.

Method: ECHO consists of two components: a hindsight rule that uses the language model to identify relevant subgoals and generate optimized trajectories, and an update rule that maintains compressed trajectory representations in memory. It generates synthetic positive examples from unsuccessful interactions.

Result: ECHO outperforms vanilla language agent baselines by up to 80% across XMiniGrid (text-based navigation and planning) and PeopleJoinQA (collaborative information-gathering). In XMiniGrid, it also outperforms sophisticated agent architectures like Reflexion and AWM, demonstrating faster adaptation to novel environments.

Conclusion: ECHO enables more effective utilization of past experiences through hindsight optimization, significantly improving language model agents’ sample efficiency and adaptation capabilities in novel environments.

Abstract: Language model (LM) agents deployed in novel environments often exhibit poor sample efficiency when learning from sequential interactions. This significantly hinders the usefulness of such agents in environments where interaction is costly (for example, when they interact with humans or reset physical systems). While a number of existing LM agent architectures incorporate various mechanisms for experience storage and reflection, they make limited use of LMs’ abilities to directly generate or reason about full counterfactual trajectories. We introduce ECHO (Experience Consolidation via Hindsight Optimization), a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts, effectively creating synthetic positive examples from unsuccessful interactions. Our approach consists of two components: a hindsight rule that uses the language model itself to identify relevant subgoals and generate optimized trajectories, and an update rule that maintains compressed trajectory representations in memory. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation. Across both domains, ECHO outperforms vanilla language agent baselines by up to 80%; in XMiniGrid, it also outperforms a number of sophisticated agent architectures including Reflexion and AWM, demonstrating faster adaptation to novel environments through more effective utilization of past experiences.

[903] Multi-View Graph Learning with Graph-Tuple

Shiyu Chen, Ningyuan, Huang, Soledad Villar

Main category: cs.LG

TL;DR: A multi-view graph-tuple framework that partitions graphs into disjoint subgraphs to capture multiple interaction scales, overcoming limitations of single-scale sparsification in GNNs.

Details

Motivation: GNNs scale with graph edges, making them inefficient on dense graphs like point clouds. Traditional sparsification methods force arbitrary single-scale choices and discard multi-scale information.

Method: Partition graphs into disjoint subgraphs capturing local and long-range interactions, then learn multi-view representations via heterogeneous message-passing inspired by non-commuting operators theory.

Result: The framework is proven strictly more expressive with lower oracle risk than single-graph models. Outperforms baselines on molecular property prediction and cosmological parameter inference tasks.

Conclusion: Multi-view graph-tuple approach provides better performance than single-graph models, demonstrating power and versatility for handling multi-scale interactions in scientific applications.

Abstract: Graph Neural Networks (GNNs) typically scale with the number of graph edges, making them well suited for sparse graphs but less efficient on dense graphs, such as point clouds or molecular interactions. A common remedy is to sparsify the graph via similarity thresholding or distance pruning, but this forces an arbitrary choice of a single interaction scale and discards crucial information from other scales. To overcome this limitation, we introduce a multi-view graph-tuple framework. Instead of a single graph, our graph-tuple framework partitions the graph into disjoint subgraphs, capturing primary local interactions and weaker, long-range connections. We then learn multi-view representations from the graph-tuple via a heterogeneous message-passing architecture inspired by the theory of non-commuting operators, which we formally prove is strictly more expressive and guarantees a lower oracle risk compared to single-graph message-passing models. We instantiate our framework on two scientific domains: molecular property prediction from feature-scarce Coulomb matrices and cosmological parameter inference from geometric point clouds. On both applications, our multi-view graph-tuple models demonstrate better performance than single-graph baselines, highlighting the power and versatility of our multi-view approach.

[904] Transformer Model Detects Antidepressant Use From a Single Night of Sleep, Unlocking an Adherence Biomarker

Ali Mirzazadeh, Simon Cadavid, Kaiwen Zha, Chao Li, Sultan Alzahrani, Manar Alawajy, Joshua Korzenik, Kreshnik Hoti, Charles Reynolds, David Mischoulon, John Winkelman, Maurizio Fava, Dina Katabi

Main category: cs.LG

TL;DR: A noninvasive biomarker using sleep data from wearables can detect antidepressant intake with high accuracy (AUROC=0.84), enabling remote daily adherence monitoring.

Details

Motivation: Current methods for detecting antidepressant adherence are either invasive (serum assays, neuroimaging) or inaccurate (pill counts, pharmacy refills), creating a need for better tools to prevent relapse and reduce costs.

Method: Transformer-based model analyzes sleep data from consumer wearables or contactless wireless sensors to detect antidepressant intake patterns from a single night of sleep.

Result: Achieved AUROC=0.84 across 62,000 nights from >20,000 participants, generalized across drug classes, scaled with dose, and remained robust to concomitant psychotropics. Successfully captured real-world adherence patterns.

Conclusion: This approach provides objective, scalable adherence surveillance that could significantly improve depression care and outcomes through effortless daily monitoring at home.

Abstract: Antidepressant nonadherence is pervasive, driving relapse, hospitalization, suicide risk, and billions in avoidable costs. Clinicians need tools that detect adherence lapses promptly, yet current methods are either invasive (serum assays, neuroimaging) or proxy-based and inaccurate (pill counts, pharmacy refills). We present the first noninvasive biomarker that detects antidepressant intake from a single night of sleep. A transformer-based model analyzes sleep data from a consumer wearable or contactless wireless sensor to infer antidepressant intake, enabling remote, effortless, daily adherence assessment at home. Across six datasets comprising 62,000 nights from >20,000 participants (1,800 antidepressant users), the biomarker achieved AUROC = 0.84, generalized across drug classes, scaled with dose, and remained robust to concomitant psychotropics. Longitudinal monitoring captured real-world initiation, tapering, and lapses. This approach offers objective, scalable adherence surveillance with potential to improve depression care and outcomes.

[905] Exploration-free Algorithms for Multi-group Mean Estimation

Ziyi Wei, Huaiyang Zhong, Xiaocheng Li

Main category: cs.LG

TL;DR: This paper presents exploration-free algorithms for multi-group mean estimation, achieving tighter regret bounds than existing methods and extending the framework to contextual bandits with side information.

Details

Motivation: Multi-group mean estimation requires fundamentally different sampling strategies than classical multi-armed bandits, since optimal allocation requires sampling every group extensively rather than focusing on the best arm. This makes exploration-free approaches natural and effective.

Method: The authors strengthen variance concentration results using Hanson-Wright inequality, identify strictly subgaussian distributions for sharper guarantees, design exploration-free non-adaptive and adaptive algorithms, and extend the framework to contextual bandits with side information.

Result: The proposed exploration-free algorithms achieve tighter regret bounds than existing methods and provide provable guarantees for both standard and contextual bandit settings.

Conclusion: Exploration-free allocation is established as a principled and efficient approach for multi-group mean estimation, with applications in experimental design, personalization, and multi-group inference.

Abstract: We address the problem of multi-group mean estimation, which seeks to allocate a finite sampling budget across multiple groups to obtain uniformly accurate estimates of their means. Unlike classical multi-armed bandits, whose objective is to minimize regret by identifying and exploiting the best arm, the optimal allocation in this setting requires sampling every group on the order of $\Theta(T)$ times. This fundamental distinction makes exploration-free algorithms both natural and effective. Our work makes three contributions. First, we strengthen the existing results on subgaussian variance concentration using the Hanson-Wright inequality and identify a class of strictly subgaussian distributions that yield sharper guarantees. Second, we design exploration-free non-adaptive and adaptive algorithms, and we establish tighter regret bounds than the existing results. Third, we extend the framework to contextual bandit settings, an underexplored direction, and propose algorithms that leverage side information with provable guarantees. Overall, these results position exploration-free allocation as a principled and efficient approach to multi-group mean estimation, with potential applications in experimental design, personalization, and other domains requiring accurate multi-group inference.

[906] Applying non-negative matrix factorization with covariates to label matrix for classification

Kenichi Satoh

Main category: cs.LG

TL;DR: NMF-LAB is a novel supervised classification method that redefines classification as the inverse problem of non-negative matrix tri-factorization, directly factorizing label matrices with covariates as explanatory variables to obtain class probabilities without external classifiers.

Details

Motivation: Standard NMF is unsupervised and cannot exploit class labels, while existing supervised extensions still require external classifiers and don't provide direct probabilistic mappings from covariates to labels.

Method: Proposes NMF-LAB which treats classification as the inverse problem of tri-NMF - directly factorizes label matrix Y with covariates A as given explanatory variables, enabling direct probabilistic mapping from covariates to labels without separate classifiers.

Result: Experiments show competitive predictive accuracy, robustness to noisy/incomplete labels, scalability to high-dimensional problems (including MNIST), while preserving interpretability. Supports semi-supervised learning via uniform distributions for unlabeled data.

Conclusion: NMF-LAB provides a novel, probabilistic, and scalable approach that unifies regression and classification within tri-NMF framework, offering direct class probability estimation without external classifiers.

Abstract: Non-negative matrix factorization (NMF) is widely used for dimensionality reduction and interpretable analysis, but standard formulations are unsupervised and cannot directly exploit class labels. Existing supervised or semi-supervised extensions usually incorporate labels only via penalties or graph constraints, still requiring an external classifier. We propose \textit{NMF-LAB} (Non-negative Matrix Factorization for Label Matrix), which redefines classification as the inverse problem of non-negative matrix tri-factorization (tri-NMF). Unlike joint NMF methods, which reconstruct both features and labels, NMF-LAB directly factorizes the label matrix $Y$ as the observation, while covariates $A$ are treated as given explanatory variables. This yields a direct probabilistic mapping from covariates to labels, distinguishing our method from label-matrix factorization approaches that mainly model label correlations or impute missing labels. Our inversion offers two key advantages: (i) class-membership probabilities are obtained directly from the factorization without a separate classifier, and (ii) covariates, including kernel-based similarities, can be seamlessly integrated to generalize predictions to unseen samples. In addition, unlabeled data can be encoded as uniform distributions, supporting semi-supervised learning. Experiments on diverse datasets, from small-scale benchmarks to the large-scale MNIST dataset, demonstrate that NMF-LAB achieves competitive predictive accuracy, robustness to noisy or incomplete labels, and scalability to high-dimensional problems, while preserving interpretability. By unifying regression and classification within the tri-NMF framework, NMF-LAB provides a novel, probabilistic, and scalable approach to modern classification tasks.

[907] Controllable Graph Generation with Diffusion Models via Inference-Time Tree Search Guidance

Jiachi Zhao, Zehong Wang, Yamei Liao, Chuxu Zhang, Yanfang Ye

Main category: cs.LG

TL;DR: TreeDiff is a Monte Carlo Tree Search guided dual-space diffusion framework for controllable graph generation that improves quality and controllability without retraining.

Details

Motivation: Existing diffusion models for graph generation offer little control over desired properties and have unstable quality. Inference-time guidance methods are limited in controllability and remain heuristic.

Method: TreeDiff uses MCTS with three key designs: macro-step expansion to reduce tree depth, dual-space denoising (latent-space denoising with graph-space correction), and dual-space verifier for early reward prediction without full rollouts.

Result: TreeDiff achieves state-of-the-art performance on 2D and 3D molecular generation benchmarks in both unconditional and conditional settings, with favorable inference-time scaling that continues to improve with more computation.

Conclusion: TreeDiff provides a practical and scalable plug-and-play inference-time method for controllable graph generation that overcomes limitations of existing approaches and demonstrates superior performance and scaling behavior.

Abstract: Graph generation is a fundamental problem in graph learning with broad applications across Web-scale systems, knowledge graphs, and scientific domains such as drug and material discovery. Recent approaches leverage diffusion models for step-by-step generation, yet unconditional diffusion offers little control over desired properties, often leading to unstable quality and difficulty in incorporating new objectives. Inference-time guidance methods mitigate these issues by adjusting the sampling process without retraining, but they remain inherently local, heuristic, and limited in controllability. To overcome these limitations, we propose TreeDiff, a Monte Carlo Tree Search (MCTS) guided dual-space diffusion framework for controllable graph generation. TreeDiff is a plug-and-play inference-time method that expands the search space while keeping computation tractable. Specifically, TreeDiff introduces three key designs to make it practical and scalable: (1) a macro-step expansion strategy that groups multiple denoising updates into a single transition, reducing tree depth and enabling long-horizon exploration; (2) a dual-space denoising mechanism that couples efficient latent-space denoising with lightweight discrete correction in graph space, ensuring both scalability and structural fidelity; and (3) a dual-space verifier that predicts long-term rewards from partially denoised graphs, enabling early value estimation and removing the need for full rollouts. Extensive experiments on 2D and 3D molecular generation benchmarks, under both unconditional and conditional settings, demonstrate that TreeDiff achieves state-of-the-art performance. Notably, TreeDiff exhibits favorable inference-time scaling: it continues to improve with additional computation, while existing inference-time methods plateau early under limited resources.

[908] Softmax $\geq$ Linear: Transformers may learn to classify in-context by kernel gradient descent

Sara Dragutinović, Andrew M. Saxe, Aaditya K. Singh

Main category: cs.LG

TL;DR: Transformers learn in-context by performing gradient descent on functionals in kernel feature space, with softmax attention using context-adaptive learning rates, bridging theory from linear regression to realistic classification tasks.

Details

Motivation: To understand how transformers learn from context in realistic settings, moving beyond simplified linear self-attention and continuous regression to address discrete, complex classification tasks with non-linear softmax activation.

Method: Theoretical analysis of transformers’ in-context learning algorithms, focusing on how they perform gradient descent in kernel feature space and examining softmax transformers’ context-adaptive learning rates.

Result: Transformers learn in-context through gradient descent on functionals in kernel feature space, with softmax transformers exhibiting context-adaptive learning rates that enhance adaptability to context.

Conclusion: This work enhances theoretical understanding of in-context learning in realistic settings, reveals softmax attention’s greater context adaptability, and enables further theory bridging to larger models.

Abstract: The remarkable ability of transformers to learn new concepts solely by reading examples within the input prompt, termed in-context learning (ICL), is a crucial aspect of intelligent behavior. Here, we focus on understanding the learning algorithm transformers use to learn from context. Existing theoretical work, often based on simplifying assumptions, has primarily focused on linear self-attention and continuous regression tasks, finding transformers can learn in-context by gradient descent. Given that transformers are typically trained on discrete and complex tasks, we bridge the gap from this existing work to the setting of classification, with non-linear (importantly, softmax) activation. We find that transformers still learn to do gradient descent in-context, though on functionals in the kernel feature space and with a context-adaptive learning rate in the case of softmax transformer. These theoretical findings suggest a greater adaptability to context for softmax attention, which we empirically verify and study through ablations. Overall, we hope this enhances theoretical understanding of in-context learning algorithms in more realistic settings, pushes forward our intuitions and enables further theory bridging to larger models.

[909] Hierarchical LoRA MoE for Efficient CTR Model Scaling

Zhichen Zeng, Mengyue Hang, Xiaolong Liu, Xiaoyi Liu, Xiao Lin, Ruizhong Qiu, Tianxin Wei, Zhining Liu, Siyang Yuan, Chaofei Yang, Yiqun Liu, Hang Yin, Jiyan Yang, Hanghang Tong

Main category: cs.LG

TL;DR: HiLoMoE is a hierarchical LoRA MoE framework that combines vertical and horizontal scaling for efficient CTR prediction, achieving better performance with fewer computations.

Details

Motivation: Current deep models face efficiency challenges with vertical scaling (layer stacking) and flat MoE layers struggle to capture hierarchical structures in recommendation tasks. The goal is to push ROI boundaries by combining both scaling approaches.

Method: HiLoMoE uses lightweight rank-1 experts for parameter-efficient horizontal scaling, stacks multiple MoE layers with hierarchical routing for diverse expert compositions, and routes based on prior layer scores rather than outputs for parallel execution. A three-stage training framework ensures stable optimization.

Result: Experiments on four datasets show HiLoMoE achieves 0.20% average AUC improvement and 18.5% reduction in FLOPs compared to non-MoE baseline, demonstrating better performance-efficiency tradeoff.

Conclusion: HiLoMoE successfully combines vertical and horizontal scaling in a parameter-efficient manner, enabling holistic scaling for CTR prediction with improved performance and computational efficiency.

Abstract: Deep models have driven significant advances in click-through rate (CTR) prediction. While vertical scaling via layer stacking improves model expressiveness, the layer-by-layer sequential computation poses challenges to efficient scaling. Conversely, horizontal scaling through Mixture of Experts (MoE) achieves efficient scaling by activating a small subset of experts in parallel, but flat MoE layers may struggle to capture the hierarchical structure inherent in recommendation tasks. To push the Return-On-Investment (ROI) boundary, we explore the complementary strengths of both directions and propose HiLoMoE, a hierarchical LoRA MoE framework that enables holistic scaling in a parameter-efficient manner. Specifically, HiLoMoE employs lightweight rank-1 experts for parameter-efficient horizontal scaling, and stacks multiple MoE layers with hierarchical routing to enable combinatorially diverse expert compositions. Unlike conventional stacking, HiLoMoE routes based on prior layer scores rather than outputs, allowing all layers to execute in parallel. A principled three-stage training framework ensures stable optimization and expert diversity. Experiments on four public datasets show that HiLoMoE achieving better performance-efficiency tradeoff, achieving an average AUC improvement of 0.20% in AUC and 18.5% reduction in FLOPs compared to the non-MoE baseline.

[910] Multi-Task Learning with Feature-Similarity Laplacian Graphs for Predicting Alzheimer’s Disease Progression

Zixiang Xu, Menghui Zhou, Jun Qi, Xuanhan Fan, Yun Yang, Po Yang

Main category: cs.LG

TL;DR: Proposed MTL-FSL framework for Alzheimer’s Disease modeling that captures time-varying feature correlations using Feature Similarity Laplacian penalty, achieving state-of-the-art performance on ADNI dataset.

Details

Motivation: Existing Multi-Task Learning frameworks for Alzheimer's Disease don't account for time-varying feature correlations, limiting their effectiveness in modeling longitudinal AD data.

Method: Introduces Feature Similarity Laplacian (FSL) penalty to model time-varying feature relationships, uses ADMM algorithm to solve the non-smooth optimization problem, and considers both temporal smoothness and dynamic feature correlations.

Result: MTL-FSL achieves state-of-the-art performance on ADNI dataset, outperforming various baseline methods in predictive accuracy and biological interpretability.

Conclusion: The proposed framework successfully addresses the limitation of ignoring time-varying feature correlations in existing MTL approaches, providing enhanced modeling capability for longitudinal AD data.

Abstract: Alzheimer’s Disease (AD) is the most prevalent neurodegenerative disorder in aging populations, posing a significant and escalating burden on global healthcare systems. While Multi-Tusk Learning (MTL) has emerged as a powerful computational paradigm for modeling longitudinal AD data, existing frameworks do not account for the time-varying nature of feature correlations. To address this limitation, we propose a novel MTL framework, named Feature Similarity Laplacian graph Multi-Task Learning (MTL-FSL). Our framework introduces a novel Feature Similarity Laplacian (FSL) penalty that explicitly models the time-varying relationships between features. By simultaneously considering temporal smoothness among tasks and the dynamic correlations among features, our model enhances both predictive accuracy and biological interpretability. To solve the non-smooth optimization problem arising from our proposed penalty terms, we adopt the Alternating Direction Method of Multipliers (ADMM) algorithm. Experiments conducted on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset demonstrate that our proposed MTL-FSL framework achieves state-of-the-art performance, outperforming various baseline methods. The implementation source can be found at https://github.com/huatxxx/MTL-FSL.

[911] Reverse Supervision at Scale: Exponential Search Meets the Economics of Annotation

Masoud Makrehchi

Main category: cs.LG

TL;DR: Reversed-supervision strategy searching over labelings of unlabeled data to minimize error on labeled data has exponential complexity. Even with fast computation, human input remains essential for defining objectives, classes, and providing seed annotations to ground learning systems.

Details

Motivation: To understand whether arbitrarily fast computation can eliminate the need for human supervision in machine learning, by analyzing the fundamental requirements for learning systems.

Method: Analyze a reversed-supervision strategy that searches over all possible labelings of a large unlabeled dataset to minimize error on a small labeled dataset, examining the computational complexity and practical implications.

Result: The search space is exponential (2^n), making the problem fundamentally hard even with quantum or massively parallel hardware. Computational speed reduces wall-clock time but not the need for initial human input to define objectives and provide seed supervision.

Conclusion: Human (or human-grade) input remains necessary to ground machine learning systems in intended tasks, with generative AI serving as label amplifiers rather than replacements for human oversight and calibration.

Abstract: We analyze a reversed-supervision strategy that searches over labelings of a large unlabeled set (B) to minimize error on a small labeled set (A). The search space is (2^n), and the resulting complexity remains exponential even under large constant-factor speedups (e.g., quantum or massively parallel hardware). Consequently, arbitrarily fast – but not exponentially faster – computation does not obviate the need for informative labels or priors. In practice, the machine learning pipeline still requires an initial human contribution: specifying the objective, defining classes, and providing a seed set of representative annotations that inject inductive bias and align models with task semantics. Synthetic labels from generative AI can partially substitute provided their quality is human-grade and anchored by a human-specified objective, seed supervision, and validation. In this view, generative models function as \emph{label amplifiers}, leveraging small human-curated cores via active, semi-supervised, and self-training loops, while humans retain oversight for calibration, drift detection, and failure auditing. Thus, extreme computational speed reduces wall-clock time but not the fundamental supervision needs of learning; initial human (or human-grade) input remains necessary to ground the system in the intended task.

[912] Data-driven simulator of multi-animal behavior with unknown dynamics via offline and online reinforcement learning

Keisuke Fujii, Kazushi Tsutsui, Yu Teshima, Makoto Itoh, Naoya Takeishi, Nozomi Nishiumi, Ryoya Tanaka, Shunsuke Shigaki, Yoshinobu Kawahara

Main category: cs.LG

TL;DR: A data-driven simulator for multi-animal behavior using deep reinforcement learning and counterfactual simulation that bridges the gap between unknown real-world transition models and simulated counterparts.

Details

Motivation: To address the challenge of realistic multi-animal simulation in biology where locomotion dynamics are seldom known, making mathematical models insufficient for reproducing real trajectories and supporting reward-driven optimization.

Method: Uses deep reinforcement learning to estimate movement variables of incomplete transition models as actions, employs distance-based pseudo-reward to align states between cyber and physical spaces, and enables counterfactual simulation.

Result: Achieves higher reproducibility of species-specific behaviors and improved reward acquisition compared to standard imitation and RL methods, validated on artificial agents, flies, newts, and silkmoths.

Conclusion: The approach enables counterfactual behavior prediction in novel experimental settings and supports multi-individual modeling for flexible what-if trajectory generation, suggesting potential to simulate and elucidate complex multi-animal behaviors.

Abstract: Simulators of animal movements play a valuable role in studying behavior. Advances in imitation learning for robotics have expanded possibilities for reproducing human and animal movements. A key challenge for realistic multi-animal simulation in biology is bridging the gap between unknown real-world transition models and their simulated counterparts. Because locomotion dynamics are seldom known, relying solely on mathematical models is insufficient; constructing a simulator that both reproduces real trajectories and supports reward-driven optimization remains an open problem. We introduce a data-driven simulator for multi-animal behavior based on deep reinforcement learning and counterfactual simulation. We address the ill-posed nature of the problem caused by high degrees of freedom in locomotion by estimating movement variables of an incomplete transition model as actions within an RL framework. We also employ a distance-based pseudo-reward to align and compare states between cyber and physical spaces. Validated on artificial agents, flies, newts, and silkmoth, our approach achieves higher reproducibility of species-specific behaviors and improved reward acquisition compared with standard imitation and RL methods. Moreover, it enables counterfactual behavior prediction in novel experimental settings and supports multi-individual modeling for flexible what-if trajectory generation, suggesting its potential to simulate and elucidate complex multi-animal behaviors.

[913] LightSAE: Parameter-Efficient and Heterogeneity-Aware Embedding for IoT Multivariate Time Series Forecasting

Yi Ren, Xinjie Yu

Main category: cs.LG

TL;DR: LightSAE introduces a Shared-Auxiliary Embedding framework that decomposes embeddings into shared base components and channel-specific auxiliary components, achieving significant MSE improvements with minimal parameter increase.

Details

Motivation: Existing MTSF methods use shared embedding layers that process all channels identically, creating a representational bottleneck that obscures valuable channel-specific information in IoT multivariate time series data.

Method: Proposes a Shared-Auxiliary Embedding (SAE) framework with decomposition into shared base components and channel-specific auxiliary components, incorporating low-rank factorization and a shared gated component pool based on observed structural patterns.

Result: Extensive experiments on 9 IoT datasets and 4 backbone architectures show LightSAE achieves MSE improvements up to 22.8% with only 4.0% parameter increase.

Conclusion: The LightSAE framework effectively addresses the representational bottleneck in MTSF by leveraging observed structural patterns in auxiliary components, providing parameter-efficient improvements for IoT time series forecasting.

Abstract: Modern Internet of Things (IoT) systems generate massive, heterogeneous multivariate time series data. Accurate Multivariate Time Series Forecasting (MTSF) of such data is critical for numerous applications. However, existing methods almost universally employ a shared embedding layer that processes all channels identically, creating a representational bottleneck that obscures valuable channel-specific information. To address this challenge, we introduce a Shared-Auxiliary Embedding (SAE) framework that decomposes the embedding into a shared base component capturing common patterns and channel-specific auxiliary components modeling unique deviations. Within this decomposition, we \rev{empirically observe} that the auxiliary components tend to exhibit low-rank and clustering characteristics, a structural pattern that is significantly less apparent when using purely independent embeddings. Consequently, we design LightSAE, a parameter-efficient embedding module that operationalizes these observed characteristics through low-rank factorization and a shared, gated component pool. Extensive experiments across 9 IoT-related datasets and 4 backbone architectures demonstrate LightSAE’s effectiveness, achieving MSE improvements of up to 22.8% with only 4.0% parameter increase.

[914] AnyBCQ: Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs

Gunho Park, Jeongin Bae, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee

Main category: cs.LG

TL;DR: AnyBCQ is a hardware-friendly multi-precision quantization method that extends Binary-Coded Quantization to support direct bit-plane operations, enabling dynamic precision selection with minimal overhead while maintaining accuracy.

Details

Motivation: Large language models face memory and latency bottlenecks, requiring flexible quantization techniques that can balance accuracy and efficiency across different runtime constraints and service-level objectives.

Method: Extends Binary-Coded Quantization with bit-plane representation of weights, progressive precision expansion that reuses binary codes, and co-designed kernels for dynamic precision selection with negligible overhead.

Result: Significantly reduces accuracy drop in low-bit regimes (e.g., 2-bit), remains competitive at higher precision, and achieves up to 3.0x throughput over half precision and 1.2x over state-of-the-art multi-precision methods.

Conclusion: AnyBCQ provides a practical foundation for multi-precision LLM deployment by aligning algorithmic flexibility with hardware efficiency, enabling dynamic precision selection across diverse service requirements.

Abstract: The deployment of large language models (LLMs) is increasingly constrained by memory and latency bottlenecks, motivating the need for quantization techniques that flexibly balance accuracy and efficiency. Recent work has introduced multi-precision models, which enable inference at multiple precisions within a single model depending on runtime constraints. To support such flexibility, quantized weights are often stored as bit-planes, where hardware efficiency improves when the compute operates directly at the bit-plane level and activates only the precision required by each request. In this work, we present AnyBCQ, a hardware-friendly multi-precision extension of Binary-Coded Quantization (BCQ) that supports direct bit-plane operations. By representing weights as binary bit-planes with corresponding scale factors, AnyBCQ enables bit-plane-level computation and maps naturally to accelerator-friendly, bit-parallel arithmetic. Our progressive precision expansion mechanism incrementally refines scaling factors while reusing previously assigned binary codes, yielding monotonic improvements in accuracy as additional bits are enabled. We further co-design a specialized kernel that exploits the BCQ structure to support dynamic per-request precision selection with negligible overhead. Experiments on recent LLMs demonstrate that AnyBCQ significantly narrows the accuracy drop in the low-bit regime (e.g. 2-bit), remains competitive at higher precision, and achieves throughput gains of up to 3.0x over half precision and 1.2x over state-of-the-art multi-precision methods. By aligning algorithmic flexibility with hardware efficiency, AnyBCQ provides a practical foundation for multi-precision LLM deployment across diverse service-level objectives.

[915] Anchor-based Maximum Discrepancy for Relative Similarity Testing

Zhijian Zhou, Liuhua Peng, Xunye Tian, Feng Liu

Main category: cs.LG

TL;DR: The paper addresses the kernel selection challenge in relative similarity testing by proposing AMD, which simultaneously learns both the hypothesis and kernel through a two-phase approach.

Details

Motivation: Existing kernel-based methods for relative similarity testing face a fundamental issue: kernel selection becomes ill-defined when the hypothesis is specified first, as one can always find a kernel that rejects the hypothesis.

Method: Proposes anchor-based maximum discrepancy (AMD) that defines relative similarity as maximum discrepancy between distances of (U,P) and (U,Q) in deep kernel space. Uses two-phase testing: Phase I estimates AMD and infers hypothesis, Phase II assesses statistical significance with unified testing framework.

Result: The method is validated theoretically and demonstrated effective through extensive experiments on benchmark datasets.

Conclusion: AMD provides a principled solution to the kernel selection problem in relative similarity testing by simultaneously learning hypothesis and kernel, overcoming limitations of existing approaches.

Abstract: The relative similarity testing aims to determine which of the distributions, P or Q, is closer to an anchor distribution U. Existing kernel-based approaches often test the relative similarity with a fixed kernel in a manually specified alternative hypothesis, e.g., Q is closer to U than P. Although kernel selection is known to be important to kernel-based testing methods, the manually specified hypothesis poses a significant challenge for kernel selection in relative similarity testing: Once the hypothesis is specified first, we can always find a kernel such that the hypothesis is rejected. This challenge makes relative similarity testing ill-defined when we want to select a good kernel after the hypothesis is specified. In this paper, we cope with this challenge via learning a proper hypothesis and a kernel simultaneously, instead of learning a kernel after manually specifying the hypothesis. We propose an anchor-based maximum discrepancy (AMD), which defines the relative similarity as the maximum discrepancy between the distances of (U, P) and (U, Q) in a space of deep kernels. Based on AMD, our testing incorporates two phases. In Phase I, we estimate the AMD over the deep kernel space and infer the potential hypothesis. In Phase II, we assess the statistical significance of the potential hypothesis, where we propose a unified testing framework to derive thresholds for tests over different possible hypotheses from Phase I. Lastly, we validate our method theoretically and demonstrate its effectiveness via extensive experiments on benchmark datasets. Codes are publicly available at: https://github.com/zhijianzhouml/AMD.

[916] Latent Retrieval Augmented Generation of Cross-Domain Protein Binders

Zishen Zhang, Xiangzhe Kong, Wenbing Huang, Yang Liu

Main category: cs.LG

TL;DR: RADiAnce is a framework that combines retrieval and generation to design protein binders by leveraging known interfaces through a contrastive latent space and conditional latent diffusion.

Details

Motivation: Current structure-based generative models lack sufficient rationality and interpretability in generating protein interfaces, limiting their effectiveness in drug discovery.

Method: Unifies retrieval and generation in a shared contrastive latent space, uses conditional latent diffusion generator to integrate relevant interfaces and enable cross-domain interface transfer.

Result: Significantly outperforms baseline models in binding affinity, geometry recovery, and interaction recovery. Validates cross-domain generalization where retrieving interfaces from diverse domains enhances binder generation.

Conclusion: Establishes a new paradigm for protein binder design that successfully bridges retrieval-based knowledge and generative AI, opening new possibilities for drug discovery.

Abstract: Designing protein binders targeting specific sites, which requires to generate realistic and functional interaction patterns, is a fundamental challenge in drug discovery. Current structure-based generative models are limited in generating nterfaces with sufficient rationality and interpretability. In this paper, we propose Retrieval-Augmented Diffusion for Aligned interface (RADiAnce), a new framework that leverages known interfaces to guide the design of novel binders. By unifying retrieval and generation in a shared contrastive latent space, our model efficiently identifies relevant interfaces for a given binding site and seamlessly integrates them through a conditional latent diffusion generator, enabling cross-domain interface transfer. Extensive exeriments show that RADiAnce significantly outperforms baseline models across multiple metrics, including binding affinity and recovery of geometries and interactions. Additional experimental results validate cross-domain generalization, demonstrating that retrieving interfaces from diverse domains, such as peptides, antibodies, and protein fragments, enhances the generation performance of binders for other domains. Our work establishes a new paradigm for protein binder design that successfully bridges retrieval-based knowledge and generative AI, opening new possibilities for drug discovery.

[917] Gradient Enhanced Self-Training Physics-Informed Neural Network (gST-PINN) for Solving Nonlinear Partial Differential Equations

Narayan S Iyer, Bivas Bhaumik, Ram S Iyer, Satyasaran Changdar

Main category: cs.LG

TL;DR: The paper proposes gST-PINN, a gradient-enhanced self-training method that improves upon traditional PINNs by using gradient-based pseudo point self-learning to overcome limitations like low precision, slow training, and lack of labeled data.

Details

Motivation: Traditional Physics-Informed Neural Networks (PINNs) struggle with limited precision, slow training dynamics, lack of labeled data, and inadequate handling of multi-physics interactions.

Method: Proposed Gradient Enhanced Self-Training PINN (gST-PINN) with gradient-based pseudo point self-learning algorithm for solving PDEs without requiring labeled data.

Result: gST-PINN achieved MSE of 10^-5 after 18,500 iterations, outperforming standard PINNs (10^-3 MSE for Burgers’ equation, 10^-4 MSE for diffusion-sorption equation). The method shows continuous error reduction and better generalization.

Conclusion: gST-PINN consistently outperforms standard PINN and gPINN methods, especially in scenarios with low accuracy, convergence issues, and absence of labeled data, providing a purely semi-supervised approach for PDE solving.

Abstract: Partial differential equations (PDEs) provide a mathematical foundation for simulating and understanding intricate behaviors in both physical sciences and engineering. With the growing capabilities of deep learning, data$-$driven approaches like Physics$-$Informed Neural Networks (PINNs) have been developed, offering a mesh$-$free, analytic type framework for efficiently solving PDEs across a wide range of applications. However, traditional PINNs often struggle with challenges such as limited precision, slow training dynamics, lack of labeled data availability, and inadequate handling of multi$-$physics interactions. To overcome these challenging issues of PINNs, we proposed a Gradient Enhanced Self$-$Training PINN (gST$-$PINN) method that specifically introduces a gradient based pseudo point self$-$learning algorithm for solving PDEs. We tested the proposed method on three different types of PDE problems from various fields, each representing distinct scenarios. The effectiveness of the proposed method is evident, as the PINN approach for solving the Burgers$’$ equation attains a mean square error (MSE) on the order of $10^{-3}$, while the diffusion$-$sorption equation achieves an MSE on the order of $10^{-4}$ after 12,500 iterations, with no further improvement as the iterations increase. In contrast, the MSE for both PDEs in the gST$-$PINN model continues to decrease, demonstrating better generalization and reaching an MSE on the order of $10^{-5}$ after 18,500 iterations. Furthermore, the results show that the proposed purely semi$-$supervised gST$-$PINN consistently outperforms the standard PINN method in all cases, even when solution of the PDEs are unavailable. It generalizes both PINN and Gradient$-$enhanced PINN (gPINN), and can be effectively applied in scenarios prone to low accuracy and convergence issues, particularly in the absence of labeled data.

[918] Align2Act: Instruction-Tuned Models for Human-Aligned Autonomous Driving

Kanishkha Jaisankar, Sunidhi Tandel

Main category: cs.LG

TL;DR: Align2Act is a motion planning framework that transforms instruction-tuned LLMs into interpretable planners aligned with human driving behavior, achieving state-of-the-art performance on real-world benchmarks.

Details

Motivation: To address whether LLMs truly capture human driving logic and improve motion planning in complex autonomous driving scenarios by aligning LLMs with human reasoning patterns and traffic rules.

Method: Proposes Align2Act framework with Align2ActChain module for step-by-step reasoning, fine-tunes LLaMA-2-7B with LoRA on nuPlan dataset using structured driving instructions based on human reasoning patterns and traffic rules.

Result: Achieves open-loop score of 85.17 and closed-loop scores of 70.31 (non-reactive) and 66.96 (reactive) on Test14-random, demonstrating improved planning quality and human-likeness on real-world nuPlan benchmark.

Conclusion: Structured reasoning significantly improves LLM-based motion planning performance over baseline methods, enabling interpretable and human-aligned autonomous driving behavior.

Abstract: Motion planning in complex scenarios is a core challenge in autonomous driving. Conventional methods apply predefined rules or learn from driving data to generate trajectories, while recent approaches leverage large language models (LLMs) for decision-making. However, it remains unclear whether LLMs truly capture human driving logic. We propose Align2Act, a motion planning framework that transforms instruction-tuned LLMs into interpretable planners aligned with human behavior. We derive structured driving instructions based on human reasoning patterns (e.g., anticipate hazards, yield at intersections) and traffic rules (e.g., stop at red lights, maintain lane boundaries). Our Align2ActChain module guides step-by-step reasoning to produce both an interpretable rationale and a safe trajectory. By fine-tuning LLaMA-2-7B with LoRA on one million scenarios from the nuPlan dataset, our method achieves an open-loop score of 85.17 and closed-loop scores of 70.31 (non-reactive) and 66.96 (reactive) on Test14-random. Unlike prior work focused on synthetic or open-loop settings, we demonstrate improved planning quality and human-likeness on the real-world nuPlan closed-loop benchmark. Ablation studies confirm that structured reasoning significantly improves performance over baseline LLM planners.

[919] f-INE: A Hypothesis Testing Framework for Estimating Influence under Training Randomness

Subhodip Panda, Dhruv Tarsadiya, Shashwat Sourav, Prathosh A. P, Sai Praneeth Karimireddy

Main category: cs.LG

TL;DR: Introduces f-influence, a stable influence estimation framework that accounts for training randomness, with efficient single-run algorithm f-INE, validated on detecting poisoned samples in Llama-3.1-8B instruction tuning.

Details

Motivation: Existing influence estimation methods are unstable under training randomness, making them unreliable for data curation and cleanup applications where the same sample may appear critical in one run but irrelevant in another.

Method: Proposes f-influence framework grounded in hypothesis testing that explicitly models training randomness, with f-INE algorithm that computes influence estimates in a single training run using efficient computation.

Result: Scaled f-INE to estimate influence of instruction tuning data on Llama-3.1-8B, successfully detecting poisoned samples that steer model opinions, demonstrating reliable performance for data cleanup and behavior attribution.

Conclusion: f-influence provides stable and reliable influence estimation that overcomes training randomness limitations, enabling practical applications in data curation, cleanup, and model behavior attribution.

Abstract: Influence estimation methods promise to explain and debug machine learning by estimating the impact of individual samples on the final model. Yet, existing methods collapse under training randomness: the same example may appear critical in one run and irrelevant in the next. Such instability undermines their use in data curation or cleanup since it is unclear if we indeed deleted/kept the correct datapoints. To overcome this, we introduce f-influence – a new influence estimation framework grounded in hypothesis testing that explicitly accounts for training randomness, and establish desirable properties that make it suitable for reliable influence estimation. We also design a highly efficient algorithm f-INfluence Estimation (f-INE) that computes f-influence in a single training run. Finally, we scale up f-INE to estimate influence of instruction tuning data on Llama-3.1-8B and show it can reliably detect poisoned samples that steer model opinions, demonstrating its utility for data cleanup and attributing model behavior.

[920] A Hybrid Machine Learning Approach for Synthetic Data Generation with Post Hoc Calibration for Clinical Tabular Datasets

Md Ibrahim Shikder Mahin, Md Shamsul Arefin, Md Tanvir Hasan

Main category: cs.LG

TL;DR: A hybrid framework for healthcare data synthesis using five augmentation methods with reinforcement learning-based dynamic weight selection and advanced calibration techniques, achieving high fidelity and privacy protection.

Details

Motivation: Healthcare research faces data scarcity and privacy regulations (HIPAA, GDPR) that limit access to real medical data, hindering AI model development and patient care advancements.

Method: Hybrid framework integrating noise injection, interpolation, GMM sampling, CVAE sampling, and SMOTE with reinforcement learning-based dynamic weight selection. Uses advanced calibration techniques including moment matching, histogram matching, and iterative refinement.

Result: Achieved Wasserstein distances as low as 0.001 and Kolmogorov-Smirnov statistics around 0.01, with pairwise trend scores >90% and privacy protection metrics approaching 50%. Downstream classifiers achieved up to 94% accuracy and F1 scores above 93%.

Conclusion: The scalable, privacy-preserving approach matches state-of-the-art methods, sets new benchmarks for joint-distribution fidelity in healthcare, and supports sensitive AI applications.

Abstract: Healthcare research and development face significant obstacles due to data scarcity and stringent privacy regulations, such as HIPAA and the GDPR, restricting access to essential real-world medical data. These limitations impede innovation, delay robust AI model creation, and hinder advancements in patient-centered care. Synthetic data generation offers a transformative solution by producing artificial datasets that emulate real data statistics while safeguarding patient privacy. We introduce a novel hybrid framework for high-fidelity healthcare data synthesis integrating five augmentation methods: noise injection, interpolation, Gaussian Mixture Model (GMM) sampling, Conditional Variational Autoencoder (CVAE) sampling, and SMOTE, combined via a reinforcement learning-based dynamic weight selection mechanism. Its key innovations include advanced calibration techniques – moment matching, full histogram matching, soft and adaptive soft histogram matching, and iterative refinement – that align marginal distributions and preserve joint feature dependencies. Evaluated on the Breast Cancer Wisconsin (UCI Repository) and Khulna Medical College cardiology datasets, our calibrated hybrid achieves Wasserstein distances as low as 0.001 and Kolmogorov-Smirnov statistics around 0.01, demonstrating near-zero marginal discrepancy. Pairwise trend scores surpass 90%, and Nearest Neighbor Adversarial Accuracy approaches 50%, confirming robust privacy protection. Downstream classifiers trained on synthetic data achieve up to 94% accuracy and F1 scores above 93%, comparable to models trained on real data. This scalable, privacy-preserving approach matches state-of-the-art methods, sets new benchmarks for joint-distribution fidelity in healthcare, and supports sensitive AI applications.

[921] Reinforced Domain Selection for Continuous Domain Adaptation

Hanbing Liu, Huaze Tang, Yanru Wu, Yang Li, Xiao-Ping Zhang

Main category: cs.LG

TL;DR: A novel framework combining reinforcement learning with feature disentanglement for unsupervised continuous domain adaptation, enabling optimal domain path selection without metadata.

Details

Motivation: Existing CDA methods struggle with selecting intermediate domains without explicit metadata, which is crucial for bridging significant domain shifts effectively.

Method: Uses reinforcement learning with feature disentanglement, introducing unsupervised reward mechanism based on latent domain embedding distances, and aligns domain-invariant features.

Result: Substantial improvements in prediction accuracy and domain selection efficiency on Rotated MNIST and ADNI datasets compared to traditional CDA approaches.

Conclusion: The integrated strategy successfully optimizes transfer paths and target task performance simultaneously, enhancing domain adaptation effectiveness.

Abstract: Continuous Domain Adaptation (CDA) effectively bridges significant domain shifts by progressively adapting from the source domain through intermediate domains to the target domain. However, selecting intermediate domains without explicit metadata remains a substantial challenge that has not been extensively explored in existing studies. To tackle this issue, we propose a novel framework that combines reinforcement learning with feature disentanglement to conduct domain path selection in an unsupervised CDA setting. Our approach introduces an innovative unsupervised reward mechanism that leverages the distances between latent domain embeddings to facilitate the identification of optimal transfer paths. Furthermore, by disentangling features, our method facilitates the calculation of unsupervised rewards using domain-specific features and promotes domain adaptation by aligning domain-invariant features. This integrated strategy is designed to simultaneously optimize transfer paths and target task performance, enhancing the effectiveness of domain adaptation processes. Extensive empirical evaluations on datasets such as Rotated MNIST and ADNI demonstrate substantial improvements in prediction accuracy and domain selection efficiency, establishing our method’s superiority over traditional CDA approaches.

[922] Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?

Zihan Chen, Yiming Zhang, Hengguang Zhou, Zenghui Ding, Yining Sun, Cho-Jui Hsieh

Main category: cs.LG

TL;DR: Current RL benchmarks for LLMs are inadequate as training on test sets achieves similar performance to training on train sets, failing to measure true generalization.

Details

Motivation: To address the inadequacy of current benchmarks in evaluating RL progress for LLMs and reveal their failure to assess generalization capabilities.

Method: Introduces a diagnostic suite with Oracle Performance Gap (OPG) metric and conducts stress tests analyzing generalization across distribution shifts, difficulty levels, and counterfactual scenarios.

Result: Found that despite strong benchmark scores, existing RL methods struggle with generalization, and current benchmarks cannot reliably separate progress.

Conclusion: Current benchmarks are insufficient for evaluating generalization; proposes three principles for better benchmarks: sufficient difficulty, balanced evaluation, and distributional robustness.

Abstract: Current benchmarks are inadequate for evaluating progress in reinforcement learning (RL) for large language models (LLMs).Despite recent benchmark gains reported for RL, we find that training on these benchmarks’ training sets achieves nearly the same performance as training directly on the test sets, suggesting that the benchmarks cannot reliably separate further progress.To study this phenomenon, we introduce a diagnostic suite and the Oracle Performance Gap (OPG) metric that quantifies the performance difference between training on the train split versus the test split of a benchmark. We further analyze this phenomenon with stress tests and find that, despite strong benchmark scores, existing RL methods struggle to generalize across distribution shifts, varying levels of difficulty, and counterfactual scenarios: shortcomings that current benchmarks fail to reveal.We conclude that current benchmarks are insufficient for evaluating generalization and propose three core principles for designing more faithful benchmarks: sufficient difficulty, balanced evaluation, and distributional robustness.

[923] PAC-Bayesian Reinforcement Learning Trains Generalizable Policies

Abdelkrim Zitouni, Mehdi Hennequin, Juba Agoun, Ryan Horache, Nadia Kabachi, Omar Rivasplata

Main category: cs.LG

TL;DR: A novel PAC-Bayesian generalization bound for reinforcement learning that accounts for Markov dependencies via mixing time, providing non-vacuous certificates for off-policy algorithms like SAC.

Details

Motivation: Overcome challenges in obtaining generalization guarantees for RL where sequential data breaks classical independence assumptions.

Method: Derive PAC-Bayesian bound using chain mixing time, and develop PB-SAC algorithm that optimizes the bound during training to guide exploration.

Result: Bound provides non-vacuous certificates for Soft Actor-Critic, and PB-SAC maintains competitive performance while offering meaningful confidence certificates.

Conclusion: The approach successfully addresses Markov dependencies in RL generalization and provides practical utility through bound-optimized training.

Abstract: We derive a novel PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for Markov dependencies in the data, through the chain’s mixing time. This contributes to overcoming challenges in obtaining generalization guarantees for reinforcement learning, where the sequential nature of data breaks the independence assumptions underlying classical bounds. Our bound provides non-vacuous certificates for modern off-policy algorithms like Soft Actor-Critic. We demonstrate the bound’s practical utility through PB-SAC, a novel algorithm that optimizes the bound during training to guide exploration. Experiments across continuous control tasks show that our approach provides meaningful confidence certificates while maintaining competitive performance.

[924] Multi-scale Frequency-Aware Adversarial Network for Parkinson’s Disease Assessment Using Wearable Sensors

Weiming Zhao, Xulong Wang, Jun Qi, Yun Yang, Po Yang

Main category: cs.LG

TL;DR: MFAM model improves Parkinson’s disease severity assessment using frequency decomposition and attention-based multi-instance learning to handle sparse symptoms and enhance feature specificity.

Details

Motivation: General-purpose time series models lack pathological specificity and traditional aggregation methods dilute key diagnostic features due to temporal sparsity of PD symptoms.

Method: Multi-scale Frequency-Aware Adversarial Multi-Instance Network (MFAM) with frequency decomposition guided by medical prior knowledge and attention-based multi-instance learning to focus on diagnostically valuable sparse segments.

Result: MFAM outperforms general-purpose time series models on both public PADS dataset for PD vs differential diagnosis and private dataset for four-class severity assessment.

Conclusion: MFAM provides a promising solution for automated PD severity assessment by handling complex clinical time series with specificity.

Abstract: Severity assessment of Parkinson’s disease (PD) using wearable sensors offers an effective, objective basis for clinical management. However, general-purpose time series models often lack pathological specificity in feature extraction, making it difficult to capture subtle signals highly correlated with PD.Furthermore, the temporal sparsity of PD symptoms causes key diagnostic features to be easily “diluted” by traditional aggregation methods, further complicating assessment. To address these issues, we propose the Multi-scale Frequency-Aware Adversarial Multi-Instance Network (MFAM). This model enhances feature specificity through a frequency decomposition module guided by medical prior knowledge. Furthermore, by introducing an attention-based multi-instance learning (MIL) framework, the model can adaptively focus on the most diagnostically valuable sparse segments.We comprehensively validated MFAM on both the public PADS dataset for PD versus differential diagnosis (DD) binary classification and a private dataset for four-class severity assessment. Experimental results demonstrate that MFAM outperforms general-purpose time series models in handling complex clinical time series with specificity, providing a promising solution for automated assessment of PD severity.

[925] Understanding Self-supervised Contrastive Learning through Supervised Objectives

Byeongchan Lee

Main category: cs.LG

TL;DR: This paper provides a theoretical framework for self-supervised representation learning by formulating it as an approximation to supervised objectives, deriving insights into contrastive losses like InfoNCE and introducing concepts of prototype representation bias and balanced contrastive loss.

Details

Motivation: To address the limited theoretical understanding of self-supervised representation learning despite its empirical success, by providing a theoretical perspective that connects it to supervised learning objectives.

Method: Formulated self-supervised representation learning as an approximation to supervised representation learning objectives, derived a loss function related to contrastive losses, introduced prototype representation bias and balanced contrastive loss concepts, and established connections to existing contrastive learning practices.

Result: Developed a theoretical framework that explains the principles behind popular contrastive losses, empirically validated the effect of balancing positive and negative pair interactions, and showed how theoretical components correspond to established practices.

Conclusion: The paper provides theoretical insights into self-supervised learning, explaining contrastive loss mechanisms through the lens of supervised learning approximation, with empirical validation supporting the introduced balanced contrastive loss concept.

Abstract: Self-supervised representation learning has achieved impressive empirical success, yet its theoretical understanding remains limited. In this work, we provide a theoretical perspective by formulating self-supervised representation learning as an approximation to supervised representation learning objectives. Based on this formulation, we derive a loss function closely related to popular contrastive losses such as InfoNCE, offering insight into their underlying principles. Our derivation naturally introduces the concepts of prototype representation bias and a balanced contrastive loss, which help explain and improve the behavior of self-supervised learning algorithms. We further show how components of our theoretical framework correspond to established practices in contrastive learning. Finally, we empirically validate the effect of balancing positive and negative pair interactions. All theoretical proofs are provided in the appendix, and our code is included in the supplementary material.

[926] Compositional Symmetry as Compression: Lie Pseudogroup Structure in Algorithmic Agents

Giulio Ruffini

Main category: cs.LG

TL;DR: The paper proposes a framework where agents track sensory streams using generative programs, with simplicity understood as compositional symmetry. Natural streams are described by Lie pseudogroups acting on complex low-dimensional manifolds.

Details

Motivation: To understand how agents can effectively track and compress sensory streams by leveraging structural priors like compositional symmetry, providing a geometric foundation for the 'blessing of compositionality' in deep learning models.

Method: Model agents as neural dynamical systems coupled to sensory streams. Analyze structural constraints (equivariance) and dynamical constraints (conserved quantities, reduced manifolds) imposed by symmetry. Connect to Spencer formalism for Lie pseudogroups and formulate symmetry-based predictive coding.

Result: Symmetry imposes both structural constraints (equivariance) and dynamical constraints (conserved quantities, reduced invariant manifolds). This yields a hierarchy of reduced manifolds aligned with pseudogroup factorization, explaining compositionality benefits.

Conclusion: The framework provides a geometric account of compositionality in deep models and enables a symmetry-based version of predictive coding where higher layers process coarse-grained residual transformations along unresolved symmetry directions.

Abstract: In the algorithmic (Kolmogorov) view, agents are programs that track and compress sensory streams using generative programs. We propose a framework where the relevant structural prior is simplicity (Solomonoff) understood as \emph{compositional symmetry}: natural streams are well described by (local) actions of finite-parameter Lie pseudogroups on geometrically and topologically complex low-dimensional configuration manifolds (latent spaces). Modeling the agent as a generic neural dynamical system coupled to such streams, we show that accurate world-tracking imposes (i) \emph{structural constraints} – equivariance of the agent’s constitutive equations and readouts – and (ii) \emph{dynamical constraints}: under static inputs, symmetry induces conserved quantities (Noether-style labels) in the agent dynamics and confines trajectories to reduced invariant manifolds; under slow drift, these manifolds move but remain low-dimensional. This yields a hierarchy of reduced manifolds aligned with the compositional factorization of the pseudogroup, providing a geometric account of the ``blessing of compositionality’’ in deep models. We connect these ideas to the Spencer formalism for Lie pseudogroups and formulate a symmetry-based, self-contained version of predictive coding in which higher layers receive only \emph{coarse-grained residual transformations} (prediction-error coordinates) along symmetry directions unresolved at lower layers.

[927] FusionGen: Feature Fusion-Based Few-Shot EEG Data Generation

Yuheng Chen, Dingkun Liu, Xinyao Yang, Xinping Xu, Baicheng Chen, Dongrui Wu

Main category: cs.LG

TL;DR: FusionGen is a novel EEG data generation framework that uses disentangled representation learning and feature fusion to address data scarcity and inter-subject variability in brain-computer interfaces.

Details

Motivation: EEG-based BCIs face challenges with data scarcity and significant inter-subject variability, which limit the generalization and practical applicability of EEG decoding models.

Method: Proposes FusionGen framework using disentangled representation learning and feature fusion, integrating features across trials through feature matching fusion module with lightweight feature extraction and reconstruction pipeline.

Result: Extensive experiments on multiple EEG datasets show FusionGen significantly outperforms existing augmentation techniques with notable improvements in classification accuracy.

Conclusion: FusionGen effectively addresses EEG data scarcity and variability issues, enabling better generalization and applicability of EEG decoding models in practical BCI settings.

Abstract: Brain-computer interfaces (BCIs) provide potential for applications ranging from medical rehabilitation to cognitive state assessment by establishing direct communication pathways between the brain and external devices via electroencephalography (EEG). However, EEG-based BCIs are severely constrained by data scarcity and significant inter-subject variability, which hinder the generalization and applicability of EEG decoding models in practical settings. To address these challenges, we propose FusionGen, a novel EEG data generation framework based on disentangled representation learning and feature fusion. By integrating features across trials through a feature matching fusion module and combining them with a lightweight feature extraction and reconstruction pipeline, FusionGen ensures both data diversity and trainability under limited data constraints. Extensive experiments on multiple publicly available EEG datasets demonstrate that FusionGen significantly outperforms existing augmentation techniques, yielding notable improvements in classification accuracy.

[928] Budget Allocation for Unknown Value Functions in a Lipschitz Space

MohammadHossein Bateni, Hossein Esfandiari, Samira HosseinGhorban, Alireza Mirrokni, Radin Shahdaei

Main category: cs.LG

TL;DR: The paper addresses optimal budget allocation for evaluating intermediate models in machine learning workflows, formalizing it as a budget allocation problem over unknown-value functions in Lipschitz space.

Details

Motivation: Building learning models requires evaluating many intermediate models during feature selection, model structure search, and parameter tuning, but evaluation costs are bounded and true performance is only known after evaluation.

Method: Formalizes the problem as a general budget allocation problem over unknown-value functions within a Lipschitz space.

Result: Not specified in the abstract.

Conclusion: Not specified in the abstract.

Abstract: Building learning models frequently requires evaluating numerous intermediate models. Examples include models considered during feature selection, model structure search, and parameter tunings. The evaluation of an intermediate model influences subsequent model exploration decisions. Although prior knowledge can provide initial quality estimates, true performance is only revealed after evaluation. In this work, we address the challenge of optimally allocating a bounded budget to explore the space of intermediate models. We formalize this as a general budget allocation problem over unknown-value functions within a Lipschitz space.

[929] Encoder Decoder Generative Adversarial Network Model for Stock Market Prediction

Bahadur Yadav, Sanjay Kumar Mohanty

Main category: cs.LG

TL;DR: Proposes EDGAN, a GRU-based Encoder-Decoder GAN for stock price forecasting that addresses GAN limitations through temporal decoding, residual connections, and conditioning on market covariates.

Details

Motivation: Stock price forecasting is challenging due to market volatility and non-linearity. Traditional GANs suffer from mode collapse, unstable training, and difficulty capturing temporal and feature correlations in financial data.

Method: GRU-based Encoder-Decoder GAN with temporal decoder using residual connections, conditioning on static/dynamic covariates, and windowing mechanism to capture temporal dynamics. Generator uses dense encoder-decoder framework with residual GRU blocks.

Result: EDGAN achieves superior forecasting accuracy and training stability across diverse stock datasets, even in volatile markets. Consistently outperforms traditional GAN variants in both accuracy and convergence stability.

Conclusion: The proposed EDGAN model successfully balances expressive power and simplicity, demonstrating improved performance and stability for stock price forecasting compared to existing GAN approaches.

Abstract: Forecasting stock prices remains challenging due to the volatile and non-linear nature of financial markets. Despite the promise of deep learning, issues such as mode collapse, unstable training, and difficulty in capturing temporal and feature level correlations have limited the applications of GANs in this domain. We propose a GRU-based Encoder-Decoder GAN (EDGAN) model that strikes a balance between expressive power and simplicity. The model introduces key innovations such as a temporal decoder with residual connections for precise reconstruction, conditioning on static and dynamic covariates for contextual learning, and a windowing mechanism to capture temporal dynamics. Here, the generator uses a dense encoder-decoder framework with residual GRU blocks. Extensive experiments on diverse stock datasets demonstrate that EDGAN achieves superior forecasting accuracy and training stability, even in volatile markets. It consistently outperforms traditional GAN variants in forecasting accuracy and convergence stability under market conditions.

[930] SDG-L: A Semiparametric Deep Gaussian Process based Framework for Battery Capacity Prediction

Hanbing Liu, Yanru Wu, Yang Li, Ercan E. Kuruoglu, Xuan Zhang

Main category: cs.LG

TL;DR: SDG-L is a semiparametric deep Gaussian process regression framework that uses LSTM feature extraction to predict lithium-ion battery capacity degradation from time series battery state data, achieving 1.2% average test MSE.

Details

Motivation: Lithium-ion battery capacity degrades over charging/discharging cycles, threatening energy storage durability. Accurate capacity prediction is crucial for system efficiency, but current methods undervalue battery state information in each cycle.

Method: Proposed SDG-L framework combines semiparametric deep Gaussian process regression with LSTM feature extractor to model time series battery state data and utilize auxiliary profiling information during charging/discharging processes.

Result: On NASA dataset, SDG-L achieves average test MSE error of 1.2% and outperforms existing methods in battery capacity prediction.

Conclusion: The SDG-L framework effectively leverages battery state information and auxiliary profiling data through LSTM feature extraction to provide accurate capacity predictions, demonstrating superior performance compared to existing approaches.

Abstract: Lithium-ion batteries are becoming increasingly omnipresent in energy supply. However, the durability of energy storage using lithium-ion batteries is threatened by their dropping capacity with the growing number of charging/discharging cycles. An accurate capacity prediction is the key to ensure system efficiency and reliability, where the exploitation of battery state information in each cycle has been largely undervalued. In this paper, we propose a semiparametric deep Gaussian process regression framework named SDG-L to give predictions based on the modeling of time series battery state data. By introducing an LSTM feature extractor, the SDG-L is specially designed to better utilize the auxiliary profiling information during charging/discharging process. In experimental studies based on NASA dataset, our proposed method obtains an average test MSE error of 1.2%. We also show that SDG-L achieves better performance compared to existing works and validate the framework using ablation studies.

[931] Trustworthy Retrosynthesis: Eliminating Hallucinations with a Diverse Ensemble of Reaction Scorers

Michal Sadowski, Maria Wyrzykowska, Lukasz Sztukiewicz, Tadija Radusinović, Jan Rzymkowski, Paweł Włodarczyk-Pruszyński, Mikołaj Sacha, Piotr Kozakowski, Ruard van Workum, Stanislaw Kamil Jastrzebski

Main category: cs.LG

TL;DR: RetroTrim is a retrosynthesis system that effectively avoids nonsensical synthetic plans (hallucinations) by combining diverse reaction scoring strategies, outperforming baselines in filtering hallucinations and producing high-quality paths.

Details

Motivation: Current retrosynthesis systems suffer from hallucinated reactions that produce nonsensical outputs, and reliable assessment methods are lacking. Automatic evaluation is needed to filter out these unreliable synthetic plans.

Method: Combines diverse reaction scoring strategies using machine learning models and chemical databases to capture different classes of hallucinations. Uses a novel evaluation protocol with expert chemist reviews.

Result: RetroTrim is the sole method that successfully filters out hallucinated reactions and produces the highest number of high-quality synthetic paths overall on challenging drug-like targets.

Conclusion: The combination of diverse scoring strategies effectively addresses hallucination problems in retrosynthesis. The released benchmark and evaluation protocol aim to inspire further research into reliable retrosynthesis systems.

Abstract: Retrosynthesis is one of the domains transformed by the rise of generative models, and it is one where the problem of nonsensical or erroneous outputs (hallucinations) is particularly insidious: reliable assessment of synthetic plans is time-consuming, with automatic methods lacking. In this work, we present RetroTrim, a retrosynthesis system that successfully avoids nonsensical plans on a set of challenging drug-like targets. Compared to common baselines in the field, our system is not only the sole method that succeeds in filtering out hallucinated reactions, but it also results in the highest number of high-quality paths overall. The key insight behind RetroTrim is the combination of diverse reaction scoring strategies, based on machine learning models and existing chemical databases. We show that our scoring strategies capture different classes of hallucinations by analyzing them on a dataset of labeled retrosynthetic intermediates. To measure the performance of retrosynthesis systems, we propose a novel evaluation protocol for reactions and synthetic paths based on a structured review by expert chemists. Using this protocol, we compare systems on a set of 32 novel targets, curated to reflect recent trends in drug structures. While the insights behind our methodology are broadly applicable to retrosynthesis, our focus is on targets in the drug-like domain. By releasing our benchmark targets and the details of our evaluation protocol, we hope to inspire further research into reliable retrosynthesis.

[932] ImpMIA: Leveraging Implicit Bias for Membership Inference Attack under Realistic Scenarios

Yuval Golbari, Navve Wasserman, Gal Vardi, Michal Irani

Main category: cs.LG

TL;DR: ImpMIA is a white-box membership inference attack that exploits neural networks’ implicit bias using KKT conditions to identify training samples without needing reference models or unrealistic assumptions.

Details

Motivation: Existing black-box MIA methods rely on unrealistic assumptions about attacker knowledge, data distribution, and training data fraction, which significantly degrade performance when removed.

Method: ImpMIA uses maximum-margin implicit bias theory and KKT optimality conditions to find samples whose gradients most strongly reconstruct the trained model’s parameters.

Result: ImpMIA achieves state-of-the-art performance compared to both black-box and white-box attacks in realistic settings with only model weights and a superset of training data.

Conclusion: The proposed white-box approach effectively addresses limitations of black-box MIA methods and demonstrates superior performance in practical scenarios where only model weights are available.

Abstract: Determining which data samples were used to train a model-known as Membership Inference Attack (MIA)-is a well-studied and important problem with implications for data privacy. Black-box methods presume access only to the model’s outputs and often rely on training auxiliary reference models. While they have shown strong empirical performance, they rely on assumptions that rarely hold in real-world settings: (i) the attacker knows the training hyperparameters; (ii) all available non-training samples come from the same distribution as the training data; and (iii) the fraction of training data in the evaluation set is known. In this paper, we demonstrate that removing these assumptions leads to a significant drop in the performance of black-box attacks. We introduce ImpMIA, a Membership Inference Attack that exploits the Implicit Bias of neural networks, hence removes the need to rely on any reference models and their assumptions. ImpMIA is a white-box attack – a setting which assumes access to model weights and is becoming increasingly realistic given that many models are publicly available (e.g., via Hugging Face). Building on maximum-margin implicit bias theory, ImpMIA uses the Karush-Kuhn-Tucker (KKT) optimality conditions to identify training samples. This is done by finding the samples whose gradients most strongly reconstruct the trained model’s parameters. As a result, ImpMIA achieves state-of-the-art performance compared to both black and white box attacks in realistic settings where only the model weights and a superset of the training data are available.

[933] ProteinAE: Protein Diffusion Autoencoders for Structure Encoding

Shaoning Li, Le Zhuo, Yusong Wang, Mingyu Li, Xinheng He, Fandi Wu, Hongsheng Li, Pheng-Ann Heng

Main category: cs.LG

TL;DR: ProteinAE is a novel protein diffusion autoencoder that maps protein backbone coordinates directly into a continuous latent space using a non-equivariant Diffusion Transformer with bottleneck design, trained with single flow matching objective.

Details

Motivation: Current protein structure representation methods struggle with SE(3) manifold complexities, discrete tokenization, and multiple training objectives, which hinder model optimization and generalization.

Method: Uses a non-equivariant Diffusion Transformer with bottleneck design for efficient compression, trained end-to-end with single flow matching objective to directly map protein backbone coordinates from E(3) into continuous latent space.

Result: Achieves state-of-the-art reconstruction quality, outperforming existing autoencoders. The latent space enables efficient, high-quality structure generation competitive with leading structure-based approaches and significantly better than prior latent-based methods.

Conclusion: ProteinAE provides a streamlined approach that simplifies optimization pipeline and creates a powerful latent foundation for protein structure generation without explicit equivariance requirements.

Abstract: Developing effective representations of protein structures is essential for advancing protein science, particularly for protein generative modeling. Current approaches often grapple with the complexities of the SE(3) manifold, rely on discrete tokenization, or the need for multiple training objectives, all of which can hinder the model optimization and generalization. We introduce ProteinAE, a novel and streamlined protein diffusion autoencoder designed to overcome these challenges by directly mapping protein backbone coordinates from E(3) into a continuous, compact latent space. ProteinAE employs a non-equivariant Diffusion Transformer with a bottleneck design for efficient compression and is trained end-to-end with a single flow matching objective, substantially simplifying the optimization pipeline. We demonstrate that ProteinAE achieves state-of-the-art reconstruction quality, outperforming existing autoencoders. The resulting latent space serves as a powerful foundation for a latent diffusion model that bypasses the need for explicit equivariance. This enables efficient, high-quality structure generation that is competitive with leading structure-based approaches and significantly outperforms prior latent-based methods. Code is available at https://github.com/OnlyLoveKFC/ProteinAE_v1.

[934] Attention-Enhanced LSTM Modeling for Improved Temperature and Rainfall Forecasting in Bangladesh

Usman Gani Joy, Shahadat kabir, Tasnim Niger

Main category: cs.LG

TL;DR: Advanced LSTM with attention mechanism outperforms baseline models for climate forecasting in Bangladesh, achieving high accuracy in temperature and rainfall predictions with improved robustness.

Details

Motivation: Accurate climate forecasting is vital for Bangladesh due to high susceptibility to climate change impacts. Existing models struggle with long-range dependencies and complex temporal patterns in climate data.

Method: Advanced LSTM model integrated with attention mechanism using comprehensive datasets from 1901-2023 (NASA POWER Project for temperature, Humanitarian Data Exchange for rainfall) to capture seasonal and long-term trends.

Result: Outperformed baseline models (XGBoost, Simple LSTM, GRU) with test MSE 0.2411, MAE 0.3860°C, R² 0.9834, NRMSE 0.0370 for temperature; MSE 1283.67 mm², MAE 22.91 mm, R² 0.9639, NRMSE 0.0354 for rainfall. Showed better robustness with only 20% MSE increase under climate trends vs 2.2-fold in baselines.

Conclusion: The model improves forecasting precision and offers insights into physical processes governing climate variability, supporting applications in climate-sensitive sectors in Bangladesh.

Abstract: Accurate climate forecasting is vital for Bangladesh, a region highly susceptible to climate change impacts on temperature and rainfall. Existing models often struggle to capture long-range dependencies and complex temporal patterns in climate data. This study introduces an advanced Long Short-Term Memory (LSTM) model integrated with an attention mechanism to enhance the prediction of temperature and rainfall dynamics. Utilizing comprehensive datasets from 1901-2023, sourced from NASA’s POWER Project for temperature and the Humanitarian Data Exchange for rainfall, the model effectively captures seasonal and long-term trends. It outperforms baseline models, including XGBoost, Simple LSTM, and GRU, achieving a test MSE of 0.2411 (normalized units), MAE of 0.3860 degrees C, R^2 of 0.9834, and NRMSE of 0.0370 for temperature, and MSE of 1283.67 mm^2, MAE of 22.91 mm, R^2 of 0.9639, and NRMSE of 0.0354 for rainfall on monthly forecasts. The model demonstrates improved robustness with only a 20 percent increase in MSE under simulated climate trends (compared to an approximately 2.2-fold increase in baseline models without trend features) and a 50 percent degradation under regional variations (compared to an approximately 4.8-fold increase in baseline models without enhancements). These results highlight the model’s ability to improve forecasting precision and offer potential insights into the physical processes governing climate variability in Bangladesh, supporting applications in climate-sensitive sectors.

[935] Digital Twin-enabled Multi-generation Control Co-Design with Deep Reinforcement Learning

Ying-Kuan Tsai, Vispi Karkaria, Yi-Ping Chen, Wei Chen

Main category: cs.LG

TL;DR: A Digital Twin-enabled Control Co-Design framework using Deep Reinforcement Learning for multi-generation lifecycle optimization, demonstrated on active suspension systems with improved robustness and efficiency.

Details

Motivation: Address unpredictable real-world uncertainties in dynamic systems by integrating lifecycle data collection and continuous improvement through multi-generation design and Digital Twin technology.

Method: Combines Control Co-Design with Deep Reinforcement Learning in a multi-generation paradigm, using Digital Twins for real-time model updating, quantile regression for uncertainty quantification, and continuous learning from operational data.

Result: Significantly enhances dynamic performance, robustness, and efficiency in active suspension systems, yielding smoother and more stable control trajectories through learning from road conditions and driving behaviors.

Conclusion: The framework successfully extends CCD into lifecycle-oriented multi-generation design, leverages DTs for continuous improvement, and employs DRL for adaptive real-time decision-making, providing a comprehensive solution for uncertain dynamic systems.

Abstract: Control Co-Design (CCD) integrates physical and control system design to improve the performance of dynamic and autonomous systems. Despite advances in uncertainty-aware CCD methods, real-world uncertainties remain highly unpredictable. Multi-generation design addresses this challenge by considering the full lifecycle of a product: data collected from each generation informs the design of subsequent generations, enabling progressive improvements in robustness and efficiency. Digital Twin (DT) technology further strengthens this paradigm by creating virtual representations that evolve over the lifecycle through real-time sensing, model updating, and adaptive re-optimization. This paper presents a DT-enabled CCD framework that integrates Deep Reinforcement Learning (DRL) to jointly optimize physical design and controller. DRL accelerates real-time decision-making by allowing controllers to continuously learn from data and adapt to uncertain environments. Extending this approach, the framework employs a multi-generation paradigm, where each cycle of deployment, operation, and redesign uses collected data to refine DT models, improve uncertainty quantification through quantile regression, and inform next-generation designs of both physical components and controllers. The framework is demonstrated on an active suspension system, where DT-enabled learning from road conditions and driving behaviors yields smoother and more stable control trajectories. Results show that the method significantly enhances dynamic performance, robustness, and efficiency. Contributions of this work include: (1) extending CCD into a lifecycle-oriented multi-generation framework, (2) leveraging DTs for continuous model updating and informed design, and (3) employing DRL to accelerate adaptive real-time decision-making.

[936] Stock Prediction via a Dual Relation Fusion Network incorporating Static and Dynamic Relations

Long Chen, Huixin Bai, Mingxin Wang, Xiaohua Huang, Ying Liu, Jie Zhao, Ziyu Guan

Main category: cs.LG

TL;DR: DRFN model captures both dynamic and static inter-stock relationships for improved stock price forecasting by fusing long-term stable patterns with short-term market shifts.

Details

Motivation: Existing methods focus only on single-state relationships, missing the complementarity between dynamic and static inter-stock relations which is crucial for accurate stock modeling.

Method: Dual Relation Fusion Network with relative static relation component for time-varying long-term patterns, distance-aware dynamic relations, and recurrent fusion of prior-day dynamic relations with pre-defined static relations.

Result: Significantly outperforms baselines across different markets with high sensitivity to co-movement of relational strength and stock price.

Conclusion: The dual relation fusion approach effectively captures both stable long-term structures and flexible responses to market shifts, demonstrating superior forecasting performance.

Abstract: Accurate modeling of inter-stock relationships is critical for stock price forecasting. However, existing methods predominantly focus on single-state relationships, neglecting the essential complementarity between dynamic and static inter-stock relations. To solve this problem, we propose a Dual Relation Fusion Network (DRFN) to capture the long-term relative stability of stock relation structures while retaining the flexibility to respond to sudden market shifts. Our approach features a novel relative static relation component that models time-varying long-term patterns and incorporates overnight informational influences. We capture dynamic inter-stock relationships through distance-aware mechanisms, while evolving long-term structures via recurrent fusion of dynamic relations from the prior day with the pre-defined static relations. Experiments demonstrate that our method significantly outperforms the baselines across different markets, with high sensitivity to the co-movement of relational strength and stock price.

[937] Designing ReLU Generative Networks to Enumerate Trees with a Given Tree Edit Distance

Mamoona Ghafoor, Tatsuya Akutsu

Main category: cs.LG

TL;DR: This paper proves that ReLU-based generative networks of size O(n³) and constant depth can generate all trees with tree edit distance ≤ d from a given tree T.

Details

Motivation: Tree generation with specified edit distance has applications in computational biology and structured data analysis, but the appropriate network size/depth for this task was unclear.

Method: Theoretical construction of deterministic ReLU-based generative networks that can produce all rooted, ordered, vertex-labeled trees within a specified tree edit distance from a given tree.

Result: Networks successfully generated all valid trees with up to 21 nodes, while state-of-the-art models GraphRNN and GraphGDP achieved only 35% and 48% validation rates respectively.

Conclusion: Provides theoretical foundation for compact generative models and enables exact, valid tree-structured data generation, opening new directions in the field.

Abstract: The generation of trees with a specified tree edit distance has significant applications across various fields, including computational biology, structured data analysis, and image processing. Recently, generative networks have been increasingly employed to synthesize new data that closely resembles the original datasets. However, the appropriate size and depth of generative networks required to generate data with a specified tree edit distance remain unclear. In this paper, we theoretically establish the existence and construction of generative networks capable of producing trees similar to a given tree with respect to the tree edit distance. Specifically, for a given rooted, ordered, and vertex-labeled tree T of size n + 1 with labels from an alphabet \Sigma, and a non-negative integer d, we prove that all rooted, ordered, and vertex-labeled trees over \Sigma with tree edit distance at most d from T can be generated using a ReLU-based generative network with size O(n^3 ) and constant depth. The proposed networks were implemented and evaluated for generating trees with up to 21 nodes. Due to their deterministic architecture, the networks successfully generated all valid trees within the specified tree edit distance. In contrast, state-of-the-art graph generative models GraphRNN and GraphGDP, which rely on non-deterministic mechanisms, produced significantly fewer valid trees, achieving validation rates of only up to 35% and 48%, respectively. These findings provide a theoretical foundation towards construction of compact generative models and open new directions for exact and valid tree-structured data generation. An implementation of the proposed networks is available at https://github.com/MGANN-KU/TreeGen_ReLUNetworks.

[938] Provable Anytime Ensemble Sampling Algorithms in Nonlinear Contextual Bandits

Jiazheng Sun, Weixin Wang, Pan Xu

Main category: cs.LG

TL;DR: A unified algorithmic framework for ensemble sampling in nonlinear contextual bandits with two methods: GLM-ES for generalized linear bandits and Neural-ES for neural contextual bandits, achieving state-of-the-art regret bounds.

Details

Motivation: To develop provable and practical randomized exploration approaches for nonlinear contextual bandits, addressing the need for effective exploration strategies in complex nonlinear settings.

Method: Maintain multiple estimators for reward model parameters via maximum likelihood estimation on randomly perturbed data. Two specific methods: GLM-ES for generalized linear bandits and Neural-ES for neural contextual bandits, with anytime versions for unknown time horizons.

Result: Proved regret bounds of O(d^{3/2}√T + d^{9/2}) for GLM-ES and O(˜d√T) for Neural-ES, matching state-of-the-art results. Developed anytime versions and demonstrated strong empirical performance.

Conclusion: Ensemble sampling is established as a provable and practical randomized exploration approach for nonlinear contextual bandits, with theoretical guarantees and empirical effectiveness.

Abstract: We provide a unified algorithmic framework for ensemble sampling in nonlinear contextual bandits and develop corresponding regret bounds for two most common nonlinear contextual bandit settings: Generalized Linear Ensemble Sampling (\texttt{GLM-ES}) for generalized linear bandits and Neural Ensemble Sampling (\texttt{Neural-ES}) for neural contextual bandits. Both methods maintain multiple estimators for the reward model parameters via maximum likelihood estimation on randomly perturbed data. We prove high-probability frequentist regret bounds of $\mathcal{O}(d^{3/2} \sqrt{T} + d^{9/2})$ for \texttt{GLM-ES} and $\mathcal{O}(\widetilde{d} \sqrt{T})$ for \texttt{Neural-ES}, where $d$ is the dimension of feature vectors, $\widetilde{d}$ is the effective dimension of a neural tangent kernel matrix, and $T$ is the number of rounds. These regret bounds match the state-of-the-art results of randomized exploration algorithms in nonlinear contextual bandit settings. In the theoretical analysis, we introduce techniques that address challenges specific to nonlinear models. Practically, we remove fixed-time horizon assumptions by developing anytime versions of our algorithms, suitable when $T$ is unknown. Finally, we empirically evaluate \texttt{GLM-ES}, \texttt{Neural-ES}, and their anytime variants, demonstrating strong performance. Overall, our results establish ensemble sampling as a provable and practical randomized exploration approach for nonlinear contextual bandits.

[939] A Stochastic Differential Equation Framework for Multi-Objective LLM Interactions: Dynamical Systems Analysis with Code Generation Applications

Shivani Shukla, Himanshu Joshi

Main category: cs.LG

TL;DR: A stochastic differential equation framework for modeling multiobjective optimization dynamics in iterative LLM interactions, validated through code generation experiments.

Details

Motivation: To capture the inherent stochasticity of LLM responses and reveal systematic interference patterns between competing objectives in iterative optimization processes.

Method: Developed a stochastic differential equation framework with explicit diffusion terms and interference matrix formulation, validated using 400 iterative code generation sessions across security, efficiency, and functionality objectives.

Result: Demonstrated strategy-dependent convergence behaviors with rates from 0.33 to 1.29, and achieved predictive accuracy of R2 = 0.74 for balanced approaches.

Conclusion: The work establishes the feasibility of dynamical systems analysis for multi-objective LLM interactions, with code generation serving as an initial validation domain.

Abstract: We introduce a general stochastic differential equation framework for modelling multiobjective optimization dynamics in iterative Large Language Model (LLM) interactions. Our framework captures the inherent stochasticity of LLM responses through explicit diffusion terms and reveals systematic interference patterns between competing objectives via an interference matrix formulation. We validate our theoretical framework using iterative code generation as a proof-of-concept application, analyzing 400 sessions across security, efficiency, and functionality objectives. Our results demonstrate strategy-dependent convergence behaviors with rates ranging from 0.33 to 1.29, and predictive accuracy achieving R2 = 0.74 for balanced approaches. This work proposes the feasibility of dynamical systems analysis for multi-objective LLM interactions, with code generation serving as an initial validation domain.

[940] Optimally Deep Networks – Adapting Model Depth to Datasets for Superior Efficiency

Shaharyar Ahmed Khan Tareen, Filza Khan Tareen

Main category: cs.LG

TL;DR: ODNs optimize neural network depth for specific datasets using progressive depth expansion, reducing memory footprint by up to 98.64% while maintaining competitive accuracy.

Details

Motivation: Deep neural networks often have unnecessarily large sizes and high computational demands, leading to wasted resources and impractical deployment on resource-constrained devices.

Method: Progressive depth expansion training strategy that starts with shallow networks and incrementally increases depth as earlier blocks converge, removing redundant layers to find optimal depth.

Result: ResNet-18 and ResNet-34 achieved 98.64% and 96.44% memory footprint reduction on MNIST and SVHN while maintaining 99.31% and 96.08% accuracy respectively.

Conclusion: ODNs provide optimal depth for given datasets, significantly reducing training/inference costs, memory usage, and enabling efficient deployment on edge devices.

Abstract: Deep neural networks (DNNs) have provided brilliant performance across various tasks. However, this success often comes at the cost of unnecessarily large model sizes, high computational demands, and substantial memory footprints. Typically, powerful architectures are trained at full depths but not all datasets or tasks require such high model capacity. Training very deep architectures on relatively low-complexity datasets frequently leads to wasted computation, unnecessary energy consumption, and excessive memory usage, which in turn makes deployment of models on resource-constrained devices impractical. To address this problem, we introduce Optimally Deep Networks (ODNs), which provide a balance between model depth and task complexity. Specifically, we propose a NAS like training strategy called progressive depth expansion, which begins by training deep networks at shallower depths and incrementally increases their depth as the earlier blocks converge, continuing this process until the target accuracy is reached. ODNs use only the optimal depth for the given datasets, removing redundant layers. This cuts down future training and inference costs, lowers the memory footprint, enhances computational efficiency, and facilitates deployment on edge devices. Empirical results show that the optimal depths of ResNet-18 and ResNet-34 for MNIST and SVHN, achieve up to 98.64 % and 96.44 % reduction in memory footprint, while maintaining a competitive accuracy of 99.31 % and 96.08 %, respectively.

[941] Understanding Sampler Stochasticity in Training Diffusion Models for RLHF

Jiayuan Sheng, Hanyang Zhao, Haoxian Chen, David D. Yao, Wenpin Tang

Main category: cs.LG

TL;DR: The paper analyzes the reward gap between stochastic training and deterministic inference in RLHF for diffusion models, providing theoretical bounds and empirical validation that higher-stochasticity training improves ODE sampling quality.

Details

Motivation: Address the mismatch between stochastic samplers used during RLHF training and deterministic samplers used during inference, which creates a reward gap and raises concerns about inference quality.

Method: Use generalized denoising diffusion implicit models (gDDIM) framework to support high stochasticity while preserving data marginals, and conduct large-scale experiments on text-to-image models using DDPO and MixGRPO.

Result: Theoretical characterization of reward gap with non-vacuous bounds for general diffusion models, sharper convergence rates for VE and VP Gaussian models, and empirical validation that reward gaps narrow over training.

Conclusion: Models updated using higher-stochasticity SDE training show improved ODE sampling quality, and reward gaps consistently decrease during training.

Abstract: Reinforcement Learning from Human Feedback (RLHF) is increasingly used to fine-tune diffusion models, but a key challenge arises from the mismatch between stochastic samplers used during training and deterministic samplers used during inference. In practice, models are fine-tuned using stochastic SDE samplers to encourage exploration, while inference typically relies on deterministic ODE samplers for efficiency and stability. This discrepancy induces a reward gap, raising concerns about whether high-quality outputs can be expected during inference. In this paper, we theoretically characterize this reward gap and provide non-vacuous bounds for general diffusion models, along with sharper convergence rates for Variance Exploding (VE) and Variance Preserving (VP) Gaussian models. Methodologically, we adopt the generalized denoising diffusion implicit models (gDDIM) framework to support arbitrarily high levels of stochasticity, preserving data marginals throughout. Empirically, our findings through large-scale experiments on text-to-image models using denoising diffusion policy optimization (DDPO) and mixed group relative policy optimization (MixGRPO) validate that reward gaps consistently narrow over training, and ODE sampling quality improves when models are updated using higher-stochasticity SDE training.

[942] BioOSS: A Bio-Inspired Oscillatory State System with Spatio-Temporal Dynamics

Zhongju Yuan, Geraint Wiggins, Dick Botteldooren

Main category: cs.LG

TL;DR: Bio-inspired oscillatory state system (BioOSS) that models wave-like neural propagation dynamics using interacting p and o neuron populations, outperforming traditional architectures.

Details

Motivation: Current deep learning models based on perceptrons don't capture biological neurons' oscillatory dynamics and spatio-temporal interactions observed in natural neural circuits, particularly in the prefrontal cortex.

Method: BioOSS uses two interacting neuron populations: p neurons (membrane-potential-like units) and o neurons (control propagation velocities). The model incorporates trainable damping and propagation speed parameters to create wave-like propagation patterns through local interactions.

Result: BioOSS demonstrates superior performance and enhanced interpretability compared to alternative architectures on both synthetic and real-world tasks.

Conclusion: The proposed oscillatory state system successfully emulates biological neural wave propagation dynamics, offering improved modeling of complex neural activity patterns while maintaining trainability and interpretability.

Abstract: Today’s deep learning architectures are primarily based on perceptron models, which do not capture the oscillatory dynamics characteristic of biological neurons. Although oscillatory systems have recently gained attention for their closer resemblance to neural behavior, they still fall short of modeling the intricate spatio-temporal interactions observed in natural neural circuits. In this paper, we propose a bio-inspired oscillatory state system (BioOSS) designed to emulate the wave-like propagation dynamics critical to neural processing, particularly in the prefrontal cortex (PFC), where complex activity patterns emerge. BioOSS comprises two interacting populations of neurons: p neurons, which represent simplified membrane-potential-like units inspired by pyramidal cells in cortical columns, and o neurons, which govern propagation velocities and modulate the lateral spread of activity. Through local interactions, these neurons produce wave-like propagation patterns. The model incorporates trainable parameters for damping and propagation speed, enabling flexible adaptation to task-specific spatio-temporal structures. We evaluate BioOSS on both synthetic and real-world tasks, demonstrating superior performance and enhanced interpretability compared to alternative architectures.

[943] Structure Over Signal: A Globalized Approach to Multi-relational GNNs for Stock Prediction

Amber Li, Aruzhan Abil, Juno Marques Oda

Main category: cs.LG

TL;DR: OmniGNN is an attention-based multi-relational dynamic GNN that integrates macroeconomic context through heterogeneous nodes and edges, using a sector node as global intermediary for efficient shock propagation without multi-hop diffusion.

Details

Motivation: Existing Graph Neural Network models fail to efficiently propagate messages during macroeconomic shocks in financial markets, limiting their effectiveness in capturing nonlinear inter-stock dependencies during turbulent periods.

Method: Proposes OmniGNN with sector nodes as global intermediaries, uses Graph Attention Networks (GAT) to weigh neighbor contributions, and employs Transformers to capture temporal dynamics across multiplex relations.

Result: OmniGNN outperforms existing stock prediction models on public datasets, showing strong robustness particularly during the COVID-19 period.

Conclusion: The proposed OmniGNN framework effectively handles macroeconomic shocks through its sector node architecture and attention mechanisms, demonstrating superior performance in stock prediction tasks.

Abstract: In financial markets, Graph Neural Networks have been successfully applied to modeling relational data, effectively capturing nonlinear inter-stock dependencies. Yet, existing models often fail to efficiently propagate messages during macroeconomic shocks. In this paper, we propose OmniGNN, an attention-based multi-relational dynamic GNN that integrates macroeconomic context via heterogeneous node and edge types for robust message passing. Central to OmniGNN is a sector node acting as a global intermediary, enabling rapid shock propagation across the graph without relying on long-range multi-hop diffusion. The model leverages Graph Attention Networks (GAT) to weigh neighbor contributions and employs Transformers to capture temporal dynamics across multiplex relations. Experiments show that OmniGNN outperforms existing stock prediction models on public datasets, particularly demonstrating strong robustness during the COVID-19 period.

[944] PruneGCRN: Minimizing and explaining spatio-temporal problems through node pruning

Javier García-Sigüenza, Mirco Nanni, Faraón Llorens-Largo, José F. Vicent

Main category: cs.LG

TL;DR: A novel deep learning model that integrates graph pruning during training to identify important nodes in spatio-temporal problems, improving explainability and maintaining information as graphs reduce in size.

Details

Motivation: To gain better understanding of spatio-temporal problems themselves rather than just model behavior, and to integrate explainability through optimized pruning mechanisms.

Method: Proposed a model with integrated pruning that removes nodes during training process, allowing the architecture to learn how to minimize prediction error while selecting most relevant nodes.

Result: Experiments on traffic datasets showed the method retains more information as graphs reduce in size compared to other methods, achieving competitive accuracy.

Conclusion: Pruning serves as a valuable tool for developing models that simplify spatio-temporal problems by identifying their most important elements, with potential for enhanced explainability.

Abstract: This work addresses the challenge of using a deep learning model to prune graphs and the ability of this method to integrate explainability into spatio-temporal problems through a new approach. Instead of applying explainability to the model’s behavior, we seek to gain a better understanding of the problem itself. To this end, we propose a novel model that integrates an optimized pruning mechanism capable of removing nodes from the graph during the training process, rather than doing so as a separate procedure. This integration allows the architecture to learn how to minimize prediction error while selecting the most relevant nodes. Thus, during training, the model searches for the most relevant subset of nodes, obtaining the most important elements of the problem, facilitating its analysis. To evaluate the proposed approach, we used several widely used traffic datasets, comparing the accuracy obtained by pruning with the model and with other methods. The experiments demonstrate that our method is capable of retaining a greater amount of information as the graph reduces in size compared to the other methods used. These results highlight the potential of pruning as a tool for developing models capable of simplifying spatio-temporal problems, thereby obtaining their most important elements.

[945] Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods

Andrey Veprikov, Arman Bolatov, Samuel Horváth, Aleksandr Beznosikov, Martin Takáč, Slavomir Hanzely

Main category: cs.LG

TL;DR: A unified optimization framework that generalizes steepest descent, quasi-Newton, and adaptive methods through preconditioned matrix norms, revealing connections between existing optimizers and introducing new competitive methods.

Details

Motivation: To overcome the fundamental trade-off between geometry adaptation and curvature utilization in deep learning optimization, where steepest descent adapts to geometry but lacks curvature, while quasi-Newton and adaptive methods use curvature but are restricted to Frobenius geometry.

Method: Proposes a unified framework using preconditioned matrix norms that generalizes existing optimization approaches. Introduces two new methods: MuAdam and MuAdam-SANIA, which combine Muon’s spectral geometry with Adam-style preconditioning.

Result: The new optimizers MuAdam and MuAdam-SANIA are competitive with and sometimes outperform existing state-of-the-art methods in experiments.

Conclusion: The framework provides a unified perspective on optimization methods and enables the development of new competitive optimizers that combine geometric adaptation with curvature utilization.

Abstract: Optimization lies at the core of modern deep learning, yet existing methods often face a fundamental trade-off between adapting to problem geometry and leveraging curvature utilization. Steepest descent algorithms adapt to different geometries through norm choices but remain strictly first-order, whereas quasi-Newton and adaptive optimizers incorporate curvature information but are restricted to Frobenius geometry, limiting their applicability across diverse architectures. In this work, we propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. This abstraction reveals that widely used optimizers such as SGD and Adam, as well as more advanced approaches like Muon and KL-Shampoo, and recent hybrids including SOAP and SPlus, all emerge as special cases of the same principle. Within this framework, we provide the first systematic treatment of affine and scale invariance in the matrix-parameterized setting, establishing necessary and sufficient conditions under generalized norms. Building on this foundation, we introduce two new methods, $\texttt{MuAdam}$ and $\texttt{MuAdam-SANIA}$, which combine the spectral geometry of Muon with Adam-style preconditioning. Our experiments demonstrate that these optimizers are competitive with, and in some cases outperform, existing state-of-the-art methods. Our code is available at https://github.com/brain-lab-research/LIB/tree/quasi_descent

[946] Rethinking deep learning: linear regression remains a key benchmark in predicting terrestrial water storage

Wanshu Nie, Sujay V. Kumar, Junyu Chen, Long Zhao, Olya Skulovich, Jinwoong Yoo, Justin Pflug, Shahryar Khalique Ahmad, Goutam Konapala

Main category: cs.LG

TL;DR: Linear regression outperforms complex deep learning models (LSTM and Temporal Fusion Transformer) for terrestrial water storage prediction, highlighting the importance of traditional statistical benchmarks and globally representative datasets.

Details

Motivation: To evaluate whether advanced deep learning models like LSTM and Transformers truly outperform simpler methods for predicting land surface states like terrestrial water storage, which are influenced by both natural variability and human factors.

Method: Used the HydroGlobe dataset with baseline (land surface model simulation) and advanced (multi-source remote sensing data assimilation) versions to compare linear regression against LSTM and Temporal Fusion Transformer models for TWS prediction.

Result: Linear regression proved to be a robust benchmark, outperforming both LSTM and Temporal Fusion Transformer models for terrestrial water storage prediction.

Conclusion: Traditional statistical models should be included as benchmarks when developing deep learning models, and there’s a critical need for globally representative datasets that capture both natural variability and human interventions.

Abstract: Recent advances in machine learning such as Long Short-Term Memory (LSTM) models and Transformers have been widely adopted in hydrological applications, demonstrating impressive performance amongst deep learning models and outperforming physical models in various tasks. However, their superiority in predicting land surface states such as terrestrial water storage (TWS) that are dominated by many factors such as natural variability and human driven modifications remains unclear. Here, using the open-access, globally representative HydroGlobe dataset - comprising a baseline version derived solely from a land surface model simulation and an advanced version incorporating multi-source remote sensing data assimilation - we show that linear regression is a robust benchmark, outperforming the more complex LSTM and Temporal Fusion Transformer for TWS prediction. Our findings highlight the importance of including traditional statistical models as benchmarks when developing and evaluating deep learning models. Additionally, we emphasize the critical need to establish globally representative benchmark datasets that capture the combined impact of natural variability and human interventions.

[947] Crisis-Aware Regime-Conditioned Diffusion with CVaR Allocation

Ali Atiah Alzahrani

Main category: cs.LG

TL;DR: MARCD combines regime-conditioned diffusion modeling with CVaR portfolio optimization to improve tail risk management and drawdown control in multi-asset portfolios.

Details

Motivation: To address the challenge of portfolio decision-making under regime shifts, particularly improving crisis co-movements and tail risk management in financial markets.

Method: Three-stage approach: (1) infer latent regimes via Gaussian HMM, (2) train diffusion model with tail-weighted objective and regime-specialized MoE denoiser, (3) use generated scenarios in turnover-aware CVaR quadratic program with governance.

Result: Outperforms standard allocators with Sharpe 1.23 vs 1.02 baseline, reduces maximum drawdown by 34% (9.3% vs 14.1%), significant Sharpe improvement at 5% level, while maintaining comparable turnover.

Conclusion: MARCD provides a reproducible framework connecting tail-faithful scenario generation to governed portfolio decisions, offering materially improved drawdown control with theoretical guarantees.

Abstract: We study whether regime-conditioned generative scenarios, coupled with a convex CVaR allocator, improve portfolio decisions under regime shifts. We introduce Multi-Agent Regime-Conditioned Diffusion (MARCD), which (i) infers latent regimes via a Gaussian HMM, (ii) trains a diffusion model with a tail-weighted objective and a regime-specialized mixture-of-experts (MoE) denoiser to enrich crisis co-movements, and (iii) feeds the generated scenarios into a turnover-aware CVaR epigraph quadratic program with explicit governance. In strict walk-forward tests on liquid multi-asset ETFs (2005-2025), MARCD outperforms standard allocators and improves calibration relative to popular generators. Over 2020-2025 out-of-sample (monthly; 10 bps), MARCD attains Sharpe 1.23 (BL 1.02) and MaxDD 9.3 percent (BL 14.1 percent), a 34 percent reduction, at comparable turnover; stationary block-bootstrap intervals indicate the Sharpe uplift is significant at 5 percent. We provide theory linking tail-weighted diffusion to spectral-risk control of the decision-relevant CVaR gap, oracle/consistency results for the regime-MoE denoiser, and Lipschitz/regret guarantees for the allocator. Together, MARCD offers a reproducible bridge from tail-faithful scenario modeling to governed portfolio decisions with materially improved drawdown control.

Omar Islam Laskar, Fatemeh Ramezani Khozestani, Ishika Nankani, Sohrab Namazi Nia, Senjuti Basu Roy, Kaustubh Beedkar

Main category: cs.LG

TL;DR: AEGIS is a framework that efficiently identifies optimal data masking configurations to maximize utility while preserving privacy, using limited data summaries to estimate feature-label correlations.

Details

Motivation: Data providers need to anonymize datasets before sharing due to privacy concerns, but different masking configurations result in varying utility levels. The challenge is efficiently finding the optimal masking configuration that maximizes dataset utility.

Method: AEGIS uses a utility optimizer that minimizes predictive utility deviation based on changes in feature-label correlations. It leverages limited data summaries (1D histograms) or none to estimate feature-label joint distribution via iterative proportional fitting, supporting various correlation quantification methods.

Result: Experimental evaluation shows AEGIS identifies optimal masking configurations over an order of magnitude faster than baseline approaches, while maintaining comparable predictive performance on downstream ML tasks.

Conclusion: AEGIS provides an efficient framework for determining optimal data masking configurations that balance privacy protection with utility preservation, particularly useful when raw data is inaccessible due to privacy restrictions.

Abstract: Data-sharing ecosystems enable entities – such as providers, consumers, and intermediaries – to access, exchange, and utilize data for various downstream tasks and applications. Due to privacy concerns, data providers typically anonymize datasets before sharing them; however, the existence of multiple masking configurations results in masked datasets with varying utility. Consequently, a key challenge lies in efficiently determining the optimal masking configuration that maximizes a dataset’s utility. This paper presents AEGIS, a middleware framework for identifying the optimal masking configuration for machine learning datasets that consist of features and a class label. We introduce a utility optimizer that minimizes predictive utility deviation – a metric based on the changes in feature-label correlations before and after masking. Our framework leverages limited data summaries (such as 1D histograms) or none to estimate the feature-label joint distribution, making it suitable for scenarios where raw data is inaccessible due to privacy restrictions. To achieve this, we propose a joint distribution estimator based on iterative proportional fitting, which allows supporting various feature-label correlation quantification methods such as g3, mutual information, or chi-square. Our experimental evaluation on real-world datasets shows that AEGIS identifies optimal masking configurations over an order of magnitude faster, while the resulting masked datasets achieve predictive performance on downstream ML tasks that is on par with baseline approaches.

[949] Glance for Context: Learning When to Leverage LLMs for Node-Aware GNN-LLM Fusion

Donald Loveland, Yao-An Yang, Danai Koutra

Main category: cs.LG

TL;DR: GLANCE is a GNN-LLM fusion framework that selectively invokes LLMs only on nodes where GNNs typically fail, using a lightweight router trained with advantage-based learning to achieve better performance balance across node subgroups.

Details

Motivation: Current LLM-GNN fusion strategies are applied uniformly across all nodes and achieve only small overall performance gains, with aggregate metrics obscuring when LLMs actually provide benefit. The authors argue for reframing fusion around nodes where GNNs typically falter.

Method: GLANCE employs a lightweight router that uses inexpensive per-node signals to decide whether to query an LLM to refine GNN predictions. The router is trained with an advantage-based objective comparing the utility of LLM queries against relying solely on the GNN, since LLM calls are non-differentiable.

Result: GLANCE achieves the best performance balance across node subgroups, with significant gains on heterophilous nodes (up to +13%) while simultaneously achieving top overall performance across multiple benchmarks.

Conclusion: Adaptive, node-aware GNN-LLM architectures that selectively invoke LLMs enable scalable deployment on large graphs without high computational costs, highlighting the value of targeting LLM assistance where GNNs are weakest.

Abstract: Learning on text-attributed graphs has motivated the use of Large Language Models (LLMs) for graph learning. However, most fusion strategies are applied uniformly across all nodes and attain only small overall performance gains. We argue this result stems from aggregate metrics that obscure when LLMs provide benefit, inhibiting actionable signals for new strategies. In this work, we reframe LLM-GNN fusion around nodes where GNNs typically falter. We first show that performance can significantly differ between GNNs and LLMs, with each excelling on distinct structural patterns, such as local homophily. To leverage this finding, we propose GLANCE (GNN with LLM Assistance for Neighbor- and Context-aware Embeddings), a framework that invokes an LLM to refine a GNN’s prediction. GLANCE employs a lightweight router that, given inexpensive per-node signals, decides whether to query the LLM. Since the LLM calls are non-differentiable, the router is trained with an advantage-based objective that compares the utility of querying the LLM against relying solely on the GNN. Across multiple benchmarks, GLANCE achieves the best performance balance across node subgroups, achieving significant gains on heterophilous nodes (up to $+13%$) while simultaneously achieving top overall performance. Our findings highlight the value of adaptive, node-aware GNN-LLM architectures, where selectively invoking the LLM enables scalable deployment on large graphs without incurring high computational costs.

[950] Discrete State Diffusion Models: A Sample Complexity Perspective

Aadithya Srikanth, Mudit Gaur, Vaneet Aggarwal

Main category: cs.LG

TL;DR: First theoretical framework for discrete-state diffusion models with sample complexity bound of O(ε^-2), addressing the gap in understanding discrete diffusion models compared to continuous ones.

Details

Motivation: Discrete-state diffusion models are crucial for text, sequences, and combinatorial structures but lack theoretical understanding compared to continuous-state models, with no existing sample complexity analysis.

Method: Developed a principled theoretical framework with structured decomposition of score estimation error into statistical, approximation, optimization, and clipping components.

Result: Achieved first sample complexity bound of O(ε^-2) for discrete-state diffusion models, providing theoretical tractability and practical training insights.

Conclusion: Established theoretical foundation for discrete-state diffusion models, bridging the gap with continuous models and enabling efficient training for text and sequence applications.

Abstract: Diffusion models have demonstrated remarkable performance in generating high-dimensional samples across domains such as vision, language, and the sciences. Although continuous-state diffusion models have been extensively studied both empirically and theoretically, discrete-state diffusion models, essential for applications involving text, sequences, and combinatorial structures, remain significantly less understood from a theoretical standpoint. In particular, all existing analyses of discrete-state models assume score estimation error bounds without studying sample complexity results. In this work, we present a principled theoretical framework for discrete-state diffusion, providing the first sample complexity bound of $\widetilde{\mathcal{O}}(\epsilon^{-2})$. Our structured decomposition of the score estimation error into statistical, approximation, optimization, and clipping components offers critical insights into how discrete-state models can be trained efficiently. This analysis addresses a fundamental gap in the literature and establishes the theoretical tractability and practical relevance of discrete-state diffusion models.

[951] HeroFilter: Adaptive Spectral Graph Filter for Varying Heterophilic Relations

Shuaicheng Zhang, Haohui Wang, Junhong Lin, Xiaojie Guo, Yada Zhu, Si Zhang, Dongqi Fu, Dawei Zhou

Main category: cs.LG

TL;DR: The paper challenges conventional fixed filter designs for graph neural networks by showing that optimal filter responses vary across frequency components and don’t follow strict monotonic correlation with heterophily degree, proposing an adaptive filtering method.

Details

Motivation: Most existing works use simplified approaches with low-pass filters for homophilic graphs and high-pass filters for heterophilic graphs, but the relationship between graph heterophily and spectral filters is more complex than previously assumed.

Method: Proposes an adaptive GNN that extracts information across the heterophily spectrum and combines salient representations through adaptive mixing, using adaptive graph filters to fit varying heterophilic connections.

Result: The proposed method achieves up to 9.2% accuracy improvement over leading baselines across both homophilic and heterophilic graphs.

Conclusion: Theoretical analysis reveals that average frequency response of GNNs and graph heterophily degree don’t follow strict monotonic correlation, necessitating adaptive graph filters for good generalization performance.

Abstract: Graph heterophily, where connected nodes have different labels, has attracted significant interest recently. Most existing works adopt a simplified approach

using low-pass filters for homophilic graphs and high-pass filters for heterophilic graphs. However, we discover that the relationship between graph heterophily and spectral filters is more complex - the optimal filter response varies across frequency components and does not follow a strict monotonic correlation with heterophily degree. This finding challenges conventional fixed filter designs and suggests the need for adaptive filtering to preserve expressiveness in graph embeddings. Formally, natural questions arise: Given a heterophilic graph G, how and to what extent will the varying heterophily degree of G affect the performance of GNNs? How can we design adaptive filters to fit those varying heterophilic connections? Our theoretical analysis reveals that the average frequency response of GNNs and graph heterophily degree do not follow a strict monotonic correlation, necessitating adaptive graph filters to guarantee good generalization performance. Hence, we propose [METHOD NAME], a simple yet powerful GNN, which extracts information across the heterophily spectrum and combines salient representations through adaptive mixing. [METHOD NAME]’s superior performance achieves up to 9.2% accuracy improvement over leading baselines across homophilic and heterophilic graphs.

[952] A Joint Learning Approach to Hardware Caching and Prefetching

Samuel Yuan, Divyanshu Saxena, Jiayi Chen, Nihal Sharma, Aditya Akella

Main category: cs.LG

TL;DR: The paper proposes joint training of cache replacement and prefetching policies using shared representations to address suboptimal performance when these policies are trained in isolation.

Details

Motivation: Learned policies for system components like scheduling and caching often achieve suboptimal performance when placed together due to bidirectional interdependencies between policies like cache replacement and prefetching.

Method: Two approaches for developing shared representations: 1) joint encoder-based approach, and 2) contrastive learning of embeddings to enable joint training of cache replacement and prefetching policies.

Result: Preliminary results show promising performance for both shared representation approaches.

Conclusion: The paper establishes the need for joint training of interdependent system policies and outlines a research agenda for future work in this direction.

Abstract: Several learned policies have been proposed to replace heuristics for scheduling, caching, and other system components in modern systems. By leveraging diverse features, learning from historical trends, and predicting future behaviors, such models promise to keep pace with ever-increasing workload dynamism and continuous hardware evolution. However, policies trained in isolation may still achieve suboptimal performance when placed together. In this paper, we inspect one such instance in the domain of hardware caching – for the policies of cache replacement and prefetching. We argue that these two policies are bidirectionally interdependent and make the case for training the two jointly. We propose a joint learning approach based on developing shared representations for the features used by the two policies. We present two approaches to develop these shared representations, one based on a joint encoder and another based on contrastive learning of the embeddings, and demonstrate promising preliminary results for both of these. Finally, we lay down an agenda for future research in this direction.

[953] Quantifying Information Disclosure During Gradient Descent Using Gradient Uniqueness

Mahmoud Abdelghafar, Maryam Aliakbarpour, Chris Jermaine

Main category: cs.LG

TL;DR: The paper introduces gradient uniqueness as a privacy metric for machine learning models, showing it provides better utility than DP-SGD while maintaining comparable privacy protection.

Details

Motivation: There's concern about private information disclosure when publishing machine learning models, and while models are intuitively less risky than datasets, the actual risk level is unclear.

Method: Proposes gradient uniqueness as a principled disclosure metric derived from an upper bound on information disclosure, applicable to any model architecture, dataset type, or attacker strategy.

Result: A defense based on monitoring gradient uniqueness achieves privacy comparable to classical DP-SGD methods while substantially improving testing accuracy (utility).

Conclusion: Gradient uniqueness provides an intuitive privacy auditing approach that offers better utility-privacy tradeoff than traditional methods like DP-SGD.

Abstract: Disclosing private information via publication of a machine learning model is often a concern. Intuitively, publishing a learned model should be less risky than publishing a dataset. But how much risk is there? In this paper, we present a principled disclosure metric called \emph{gradient uniqueness} that is derived from an upper bound on the amount of information disclosure from publishing a learned model. Gradient uniqueness provides an intuitive way to perform privacy auditing. The mathematical derivation of gradient uniqueness is general, and does not make any assumption on the model architecture, dataset type, or the strategy of an attacker. We examine a simple defense based on monitoring gradient uniqueness, and find that it achieves privacy comparable to classical methods such as DP-SGD, while being substantially better in terms of (utility) testing accuracy.

[954] LPCVAE: A Conditional VAE with Long-Term Dependency and Probabilistic Time-Frequency Fusion for Time Series Anomaly Detection

Hanchang Cheng, Weimin Mu, Fan Liu, Weilin Zhu, Can Ma

Main category: cs.LG

TL;DR: LPCVAE is a novel VAE-based method for time series anomaly detection that addresses limitations of existing approaches by capturing long-term dependencies through LSTM and using probabilistic time-frequency fusion with Product-of-Experts mechanism.

Details

Motivation: Existing VAE-based methods for time series anomaly detection suffer from single-window feature limitations and insufficient utilization of long-term time and frequency information, leading to information loss.

Method: Proposed LPCVAE uses LSTM to capture long-term dependencies beyond windows and incorporates a Product-of-Experts mechanism for adaptive, distribution-level probabilistic fusion of time-frequency information.

Result: Extensive experiments on four public datasets demonstrate that LPCVAE outperforms state-of-the-art methods in time series anomaly detection.

Conclusion: Integrating long-term time and frequency representations with adaptive fusion provides a robust and efficient solution for time series anomaly detection.

Abstract: Time series anomaly detection(TSAD) is a critical task in signal processing field, ensuring the reliability of complex systems. Reconstruction-based methods dominate in TSAD. Among these methods, VAE-based methods have achieved promising results. Existing VAE-based methods suffer from the limitation of single-window feature and insufficient leveraging of long-term time and frequency information. We propose a Conditional Variational AutoEncoder with Long-term dependency and Probabilistic time-frequency fusion, named LPCVAE. LPCVAE introduces LSTM to capture long-term dependencies beyond windows. It further incorporates a Product-of-Experts (PoE) mechanism for adaptive and distribution-level probabilistic fusion. This design effectively mitigates time-frequency information loss. Extensive experiments on four public datasets demonstrate it outperforms state-of-the-art methods. The results confirm that integrating long-term time and frequency representations with adaptive fusion yields a robust and efficient solution for TSAD.

[955] Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation

Hengyuan Zhang, Shiping Yang, Xiao Liang, Chenming Shang, Yuxuan Jiang, Chaofan Tao, Jing Xiong, Hayden Kwok-Hay So, Ruobing Xie, Angel X. Chang, Ngai Wong

Main category: cs.LG

TL;DR: PerSyn is a personalized data synthesis method that routes prompts to optimal teachers based on student learnability, then generates tailored synthetic data for more effective knowledge distillation.

Details

Motivation: Stronger teacher models don't always produce optimal training data for student models due to mismatch between teacher outputs and student learnability.

Method: PerSyn uses a “Route then Generate” paradigm with a query-level router that assigns each prompt to its optimal teacher considering both student learnability and teacher response quality.

Result: PerSyn consistently achieves superior or comparable performance to all baselines in instruct tuning and math reasoning settings across different model families and scales.

Conclusion: The personalized synthesis approach effectively addresses the teacher-student mismatch problem and offers insights for future research in knowledge distillation.

Abstract: Training student models on synthetic data generated by strong teacher models is a promising way to distilling the capabilities of teachers. However, recent studies show that stronger models are not always optimal teachers, revealing a mismatch between teacher outputs and student learnability. To address this issue, we propose PerSyn (Personalized data Synthesis), a novel synthesis strategy that operates under a new Route then Generate'' paradigm to create data tailored to each student model, enabling it to learn more effectively. Specifically, PerSyn first assigns each prompt to its optimal teacher via a query-level router that jointly considers student learnability and teacher response quality. Each teacher then synthesizes data only for its assigned prompts, making the process more efficient than the conventional Generate then Select’’ paradigm, where all teachers must generate parallel responses for the entire prompt set before constructing the final dataset. Extensive experiments across different model families and scales demonstrate that PerSyn consistently achieves superior or comparable performance to all baselines in instruct tuning and math reasoning settings. Further analysis verifies the effectiveness of PerSyn and offers extra insights to propel future research.

[956] Neutral Agent-based Adversarial Policy Learning against Deep Reinforcement Learning in Multi-party Open Systems

Qizhou Peng, Yang Zheng, Yu Wen, Yanna Wu, Yingying Du

Main category: cs.LG

TL;DR: Proposes a neutral agent-based adversarial attack method for multi-party open systems that indirectly influences victim agents through shared environments, without requiring direct interactions or full environment control.

Details

Motivation: Existing adversarial attack techniques in deep reinforcement learning have limited adoption in multi-party open systems due to impractical assumptions of full environment control and dependency on direct interactions with victim agents.

Method: Redesigned adversarial policy learning approach using neutral agents that indirectly influence victim agents through shared environments across various task scenarios, evaluated on SMAC (Starcraft II) and Highway-env platforms.

Result: Experimental results demonstrate the method can launch general and effective adversarial attacks in multi-party open systems, successfully misleading well-trained victim agents.

Conclusion: The proposed neutral agent-based approach enables practical adversarial attacks in multi-party open systems by overcoming limitations of existing methods that require direct interactions or full environment control.

Abstract: Reinforcement learning (RL) has been an important machine learning paradigm for solving long-horizon sequential decision-making problems under uncertainty. By integrating deep neural networks (DNNs) into the RL framework, deep reinforcement learning (DRL) has emerged, which achieved significant success in various domains. However, the integration of DNNs also makes it vulnerable to adversarial attacks. Existing adversarial attack techniques mainly focus on either directly manipulating the environment with which a victim agent interacts or deploying an adversarial agent that interacts with the victim agent to induce abnormal behaviors. While these techniques achieve promising results, their adoption in multi-party open systems remains limited due to two major reasons: impractical assumption of full control over the environment and dependent on interactions with victim agents. To enable adversarial attacks in multi-party open systems, in this paper, we redesigned an adversarial policy learning approach that can mislead well-trained victim agents without requiring direct interactions with these agents or full control over their environments. Particularly, we propose a neutral agent-based approach across various task scenarios in multi-party open systems. While the neutral agents seemingly are detached from the victim agents, indirectly influence them through the shared environment. We evaluate our proposed method on the SMAC platform based on Starcraft II and the autonomous driving simulation platform Highway-env. The experimental results demonstrate that our method can launch general and effective adversarial attacks in multi-party open systems.

[957] Redundancy as a Structural Information Principle for Learning and Generalization

Yuda Bi, Ying Zhu, Vince D Calhoun

Main category: cs.LG

TL;DR: This paper presents a theoretical framework that redefines redundancy in information theory as a fundamental property of organization, unifying classical measures like mutual information under a single geometric principle and revealing an optimal redundancy equilibrium for finite systems.

Details

Motivation: To extend classical information theory to finite and structured systems by reframing redundancy as an organizational property rather than inefficiency, bridging the gap between communication theory and real-world learning systems.

Method: Developed a theoretical framework expressing redundancy as a family of informational divergences that unifies classical measures under a shared geometric principle, with experimental validation using masked autoencoders.

Result: The theory reveals that redundancy is bounded and has an optimal equilibrium that balances over-compression and over-coupling. Experiments show models achieve peak generalization at this stable redundancy level.

Conclusion: Redundancy is established as a measurable and tunable quantity that connects asymptotic communication theory with finite learning systems, with optimal performance achieved near the redundancy equilibrium.

Abstract: We present a theoretical framework that extends classical information theory to finite and structured systems by redefining redundancy as a fundamental property of information organization rather than inefficiency. In this framework, redundancy is expressed as a general family of informational divergences that unifies multiple classical measures, such as mutual information, chi-squared dependence, and spectral redundancy, under a single geometric principle. This reveals that these traditional quantities are not isolated heuristics but projections of a shared redundancy geometry. The theory further predicts that redundancy is bounded both above and below, giving rise to an optimal equilibrium that balances over-compression (loss of structure) and over-coupling (collapse). While classical communication theory favors minimal redundancy for transmission efficiency, finite and structured systems, such as those underlying real-world learning, achieve maximal stability and generalization near this equilibrium. Experiments with masked autoencoders are used to illustrate and verify this principle: the model exhibits a stable redundancy level where generalization peaks. Together, these results establish redundancy as a measurable and tunable quantity that bridges the asymptotic world of communication and the finite world of learning.

Xi Mao, Zhendong Wang, Jingyu Li, Lingchao Mao, Utibe Essien, Hairong Wang, Xuelei Sherry Ni

Main category: cs.LG

TL;DR: This paper presents a framework using social determinants of health (SDOH) to predict cognitive performance for early Alzheimer’s disease detection, achieving superior performance through XGBoost with interpretable feature analysis.

Details

Motivation: Early detection of Alzheimer's disease is crucial due to irreversible neurodegenerative effects and risk factors accumulating years before diagnosis. Identifying higher-risk individuals enables prevention, timely care, and equitable resource allocation.

Method: Used NIH NIA-supported PREPARE Challenge Phase 2 dataset from Mexican Health and Aging Study. Applied SVD-based imputation for missing data and selected XGBoost for its superior predictive performance. Analyzed demographic, socioeconomic, health, lifestyle, psychosocial, and healthcare access factors.

Result: The framework outperformed existing methods and challenge leaderboard with high accuracy, robustness, and interpretability. SHAP analysis identified flooring material as a strong predictor reflecting socioeconomic disparities, along with age, SES, lifestyle, social interaction, sleep, stress, and BMI as influential factors.

Conclusion: The study demonstrates the multifactorial nature of cognitive aging and the value of interpretable, data-driven SDOH modeling for early AD detection, highlighting socioeconomic and environmental factors as significant predictors.

Abstract: Early detection of Alzheimer’s disease (AD) is crucial because its neurodegenerative effects are irreversible, and neuropathologic and social-behavioral risk factors accumulate years before diagnosis. Identifying higher-risk individuals earlier enables prevention, timely care, and equitable resource allocation. We predict cognitive performance from social determinants of health (SDOH) using the NIH NIA-supported PREPARE Challenge Phase 2 dataset derived from the nationally representative Mex-Cog cohort of the 2003 and 2012 Mexican Health and Aging Study (MHAS). Data: The target is a validated composite cognitive score across seven domains-orientation, memory, attention, language, constructional praxis, and executive function-derived from the 2016 and 2021 MHAS waves. Predictors span demographic, socioeconomic, health, lifestyle, psychosocial, and healthcare access factors. Methodology: Missingness was addressed with a singular value decomposition (SVD)-based imputation pipeline treating continuous and categorical variables separately. This approach leverages latent feature correlations to recover missing values while balancing reliability and scalability. After evaluating multiple methods, XGBoost was chosen for its superior predictive performance. Results and Discussion: The framework outperformed existing methods and the data challenge leaderboard, demonstrating high accuracy, robustness, and interpretability. SHAP-based post hoc analysis identified top contributing SDOH factors and age-specific feature patterns. Notably, flooring material emerged as a strong predictor, reflecting socioeconomic and environmental disparities. Other influential factors, age, SES, lifestyle, social interaction, sleep, stress, and BMI, underscore the multifactorial nature of cognitive aging and the value of interpretable, data-driven SDOH modeling.

[959] Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

Xiaoyun Zhang, Xiaojian Yuan, Di Huang, Wang You, Chen Hu, Jingqing Ruan, Kejiang Chen, Xing Hu

Main category: cs.LG

TL;DR: The paper proposes Adaptive Entropy Regularization (AER) to address policy entropy collapse in RLVR training for LLMs, dynamically balancing exploration and exploitation to improve reasoning performance.

Details

Motivation: RLVR training suffers from policy entropy collapse where policies become overly deterministic, limiting exploration and reasoning performance. Standard entropy regularization is unstable due to fixed coefficients that don't adapt to task difficulty.

Method: AER framework with three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment to maintain policy entropy within a moderate range below initial levels.

Result: Experiments on multiple mathematical reasoning benchmarks show AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.

Conclusion: Adaptive entropy regularization effectively addresses the limitations of fixed-coefficient entropy regularization in RLVR, demonstrating that balanced exploration is crucial for enhancing LLM reasoning abilities.

Abstract: Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER)–a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.

[960] MC#: Mixture Compressor for Mixture-of-Experts Large Models

Wei Huang, Yue Liao, Yukang Chen, Jianhui Liu, Haoru Tan, Si Liu, Shiming Zhang, Shuicheng Yan, Xiaojuan Qi

Main category: cs.LG

TL;DR: MC# is a compression framework that combines static quantization and dynamic expert pruning to reduce the computational and memory overhead of Mixture-of-Experts models while maintaining performance.

Details

Motivation: Mixture-of-Experts models face significant computational and memory overhead due to preloading all experts and activating multiple experts per input, making expert modules major contributors to model size and inference cost.

Method: Proposes MC# framework with two components: Pre-Loading Mixed-Precision Quantization (PMQ) for storage reduction via optimized bit allocation using linear programming, and Online Top-any Pruning (OTP) for runtime computation reduction via dynamic expert selection using Gumbel-Softmax sampling.

Result: On DeepSeek-VL2, achieves 6.2× weight reduction at 2.57 average bits with only 1.7% accuracy drop across five multimodal benchmarks. OTP reduces expert activation over 20% with less than 1% performance degradation.

Conclusion: MC# demonstrates strong potential for efficient MoE-based model deployment by achieving extreme compression with minimal accuracy loss through the combination of static bit-width optimization and dynamic routing.

Abstract: Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. However, preloading all experts into memory and activating multiple experts per input introduces significant computational and memory overhead, making the expert module a major contributor to model size and inference cost. To address this, we propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning by leveraging the significance of experts and tokens for aggressive compression of MoE-LLMs/VLMs. To reduce storage and loading costs, we introduce Pre-Loading Mixed-Precision Quantization (PMQ), which optimizes bit allocation via linear programming, balancing expert importance and quantization error for a Pareto-optimal trade-off between size and performance. To reduce runtime computation, Online Top-any Pruning (OTP) uses Gumbel-Softmax sampling to dynamically select a subset of experts per token, enabling fine-grained control over activation. By combining PMQ’s static bit-width optimization with OTP’s dynamic routing, MC# achieves extreme compression with minimal accuracy loss. On DeepSeek-VL2, MC# achieves a 6.2 times weight reduction at 2.57 average bits with only a 1.7% accuracy drop across five multimodal benchmarks. Additionally, OTP reduces expert activation over 20% with less than 1% performance degradation, demonstrating strong potential for efficient MoE-based model deployment.

[961] APLOT: Robust Reward Modeling via Adaptive Preference Learning with Optimal Transport

Zhuo Li, Yuege Feng, Dandan Guo, Jinpeng Hu, Anningzhe Gao, Xiang Wan

Main category: cs.LG

TL;DR: This paper introduces an adaptive margin mechanism using Optimal Transport to enhance Bradley-Terry based reward models, improving their ability to distinguish similar preference responses and generalize to out-of-distribution samples.

Details

Motivation: Bradley-Terry based reward models struggle to effectively distinguish between similar preference responses, leading to insufficient separation between preferred/non-preferred outputs, overfitting on easy samples, and poor generalization to OOD samples.

Method: Proposes an adaptive margin mechanism that dynamically adjusts reward model focus on challenging samples using semantic similarity and reward differences, approached from a distributional perspective solvable with Optimal Transport through principled cost matrix design.

Result: Experimental results show the method outperforms existing RM techniques across multiple benchmarks, with improved performance in both in-distribution and out-of-distribution settings, and faster convergence.

Conclusion: The adaptive margin approach enables reward models to better capture distributional differences between chosen and rejected responses, leading to significant improvements in performance, convergence speed, and generalization capabilities for better LLM alignment with human preferences.

Abstract: The reward model (RM) plays a crucial role in aligning Large Language Models (LLMs) with human preferences through Reinforcement Learning, where the Bradley-Terry (BT) objective has been recognized as simple yet powerful, specifically for pairwise preference learning. However, BT-based RMs often struggle to effectively distinguish between similar preference responses, leading to insufficient separation between preferred and non-preferred outputs. Consequently, they may easily overfit easy samples and cannot generalize well to Out-Of-Distribution (OOD) samples, resulting in suboptimal performance. To address these challenges, this paper introduces an effective enhancement to BT-based RMs through an adaptive margin mechanism. Specifically, we design to dynamically adjust the RM focus on more challenging samples through margins, based on both semantic similarity and model-predicted reward differences, which is approached from a distributional perspective solvable with Optimal Transport (OT). By incorporating these factors into a principled OT cost matrix design, our adaptive margin enables the RM to better capture distributional differences between chosen and rejected responses, yielding significant improvements in performance, convergence speed, and generalization capabilities. Experimental results across multiple benchmarks demonstrate that our method outperforms several existing RM techniques, showcasing enhanced performance in both In-Distribution (ID) and OOD settings. Moreover, RLHF experiments support our practical effectiveness in better aligning LLMs with human preferences. Our code is available at https://github.com/BIRlz/APLOT

[962] Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization

Zihan Wang, Zhiyong Ma, Zhongkui Ma, Shuofeng Liu, Akide Liu, Derui Wang, Minhui Xue, Guangdong Bai

Main category: cs.LG

TL;DR: The paper proposes Non-transferable Examples (NEs), a training-free method to make data useful for authorized models while resistant to misuse by unauthorized models through input recoding in low-sensitivity subspaces.

Details

Motivation: Current AI regulations require data that balances utility for innovation with protection against misuse, but existing approaches either require training control or don't govern inference by unknown models.

Method: NEs recode inputs within model-specific low-sensitivity subspaces, preserving outputs for authorized models while reducing performance on unauthorized models through subspace misalignment, with formal bounds established using Hoffman-Wielandt inequality.

Result: Empirical results show NEs maintain performance on diverse vision backbones and vision-language models under preprocessing, while non-target models collapse even with reconstruction attempts.

Conclusion: NEs provide a practical means to preserve intended data utility while preventing unauthorized exploitation, offering a training-free and data-agnostic usage-control mechanism.

Abstract: Recent AI regulations call for data that remain useful for innovation while resistant to misuse, balancing utility with protection at the model level. Existing approaches either perturb data to make it unlearnable or retrain models to suppress transfer, but neither governs inference by unknown models, and both typically require control over training. We propose non-transferable examples (NEs), a training-free and data-agnostic input-side usage-control mechanism. We recode inputs within a model-specific low-sensitivity subspace, preserving outputs for the authorized model while reducing performance on unauthorized models through subspace misalignment. We establish formal bounds that guarantee utility for the authorized model and quantify deviation for unauthorized ones, with the Hoffman-Wielandt inequality linking degradation to spectral differences. Empirically, NEs retain performance on diverse vision backbones and state-of-the-art vision-language models under common preprocessing, whereas non-target models collapse even with reconstruction attempts. These results establish NEs as a practical means to preserve intended data utility while preventing unauthorized exploitation. Our project is available at https://trusted-system-lab.github.io/model-specificity

[963] Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models

Junhyuck Kim, Ethan Ewer, Taehong Moon, Jongho Park, Dimitris Papailiopoulos

Main category: cs.LG

TL;DR: 4-bit quantization is not universally optimal for reasoning models due to KV cache memory dominance. A scale-dependent trade-off exists where smaller models benefit from more weights while larger models benefit from longer generations.

Details

Motivation: To challenge the universal prescription of 4-bit quantization for reasoning models, where KV cache rather than model size dominates memory usage, and to establish scale-dependent optimization strategies.

Method: Systematic experiments across 1,700 inference scenarios on AIME25 and GPQA-Diamond datasets, analyzing memory allocation trade-offs between model weights and generation length across different model scales.

Result: Found a scale threshold at 8-bit 4B parameters: smaller models achieve better accuracy with more weights, while larger models achieve better accuracy with longer generations. This threshold also determines parallel scaling efficiency and KV cache optimization strategies.

Conclusion: Memory optimization for LLMs cannot be scale-agnostic. For small reasoning models, prioritize model capacity; for larger ones, maximize test-time compute. Reasoning models require fundamentally different deployment strategies than non-reasoning models.

Abstract: While 4-bit quantization has emerged as a memory-optimal choice for non-reasoning models and zero-shot tasks across scales, we show that this universal prescription fails for reasoning models, where the KV cache rather than model size can dominate memory. Through systematic experiments across 1,700 inference scenarios on AIME25 and GPQA-Diamond, we find a scale-dependent trade-off: models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to more weights rather than longer generation, while larger models achieve better accuracy by allocating memory to longer generations. This scale threshold also determines when parallel scaling becomes memory-efficient and whether KV cache eviction outperforms KV quantization. Our findings show that memory optimization for LLMs cannot be scale-agnostic, while providing principled guidelines: for small reasoning models, prioritize model capacity over test-time compute, while for larger ones, maximize test-time compute. Our results suggest that optimizing reasoning models for deployment requires fundamentally different strategies from those established for non-reasoning models.

[964] Blade: A Derivative-free Bayesian Inversion Method using Diffusion Priors

Hongkai Zheng, Austin Wang, Zihui Wu, Zhengyu Huang, Ricardo Baptista, Yisong Yue

Main category: cs.LG

TL;DR: Blade is a derivative-free Bayesian inversion method that uses an ensemble of interacting particles with diffusion model priors to handle nonlinear forward models without requiring derivatives.

Details

Motivation: Many science and engineering applications require Bayesian inversion but face challenges when computing forward model derivatives is computationally difficult or impractical.

Method: Uses an ensemble of interacting particles with data-driven priors based on diffusion models, handling nonlinear forward models with only black-box access (derivative-free).

Result: Achieves superior performance compared to existing derivative-free Bayesian inversion methods on various inverse problems, including highly nonlinear fluid dynamics.

Conclusion: Blade provides accurate and well-calibrated posteriors for Bayesian inversion in challenging derivative-free settings, with theoretical convergence guarantees.

Abstract: Derivative-free Bayesian inversion is an important task in many science and engineering applications, particularly when computing the forward model derivative is computationally and practically challenging. In this paper, we introduce Blade, which can produce accurate and well-calibrated posteriors for Bayesian inversion using an ensemble of interacting particles. Blade leverages powerful data-driven priors based on diffusion models, and can handle nonlinear forward models that permit only black-box access (i.e., derivative-free). Theoretically, we establish a non-asymptotic convergence analysis to characterize the effects of forward model and prior estimation errors. Empirically, Blade achieves superior performance compared to existing derivative-free Bayesian inversion methods on various inverse problems, including challenging highly nonlinear fluid dynamics.

[965] On the Optimal Representation Efficiency of Barlow Twins: An Information-Geometric Interpretation

Di Zhang

Main category: cs.LG

TL;DR: A new information-geometric framework is introduced to quantify representation efficiency in self-supervised learning, showing that Barlow Twins achieves optimal efficiency by making representations’ cross-correlation matrix approach identity.

Details

Motivation: There is no unified theoretical framework for understanding and comparing the efficiency of different self-supervised learning paradigms, despite SSL's remarkable success.

Method: Define representation efficiency as the ratio between effective intrinsic dimension (from FIM spectral properties) and ambient dimension. Analyze Barlow Twins theoretically under natural assumptions.

Result: Barlow Twins achieves optimal representation efficiency (η = 1) by driving the cross-correlation matrix towards identity, inducing an isotropic Fisher Information Matrix.

Conclusion: This work provides rigorous theoretical foundation for understanding Barlow Twins’ effectiveness and offers new geometric perspective for analyzing SSL algorithms.

Abstract: Self-supervised learning (SSL) has achieved remarkable success by learning meaningful representations without labeled data. However, a unified theoretical framework for understanding and comparing the efficiency of different SSL paradigms remains elusive. In this paper, we introduce a novel information-geometric framework to quantify representation efficiency. We define representation efficiency $\eta$ as the ratio between the effective intrinsic dimension of the learned representation space and its ambient dimension, where the effective dimension is derived from the spectral properties of the Fisher Information Matrix (FIM) on the statistical manifold induced by the encoder. Within this framework, we present a theoretical analysis of the Barlow Twins method. Under specific but natural assumptions, we prove that Barlow Twins achieves optimal representation efficiency ($\eta = 1$) by driving the cross-correlation matrix of representations towards the identity matrix, which in turn induces an isotropic FIM. This work provides a rigorous theoretical foundation for understanding the effectiveness of Barlow Twins and offers a new geometric perspective for analyzing SSL algorithms.

[966] ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces

Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar

Main category: cs.LG

TL;DR: ELMO is a pure low-precision training framework for Extreme Multilabel Classification (XMC) that uses BFloat16 and Float8 data types, enabling significant GPU memory reduction without accuracy loss.

Details

Motivation: Current XMC methods rely on FP16-FP32 mixed-precision training which is unstable and inefficient in memory usage and computational overhead. Existing low-precision methods still retain higher precision for classification layers.

Method: Proposes ELMO framework using BFloat16 and Float8 data types with Kahan summation and stochastic rounding, enabling pure Float8 training without single-precision master weights or tensor scaling. Includes memory optimizations like gradient fusion and chunking.

Result: Achieved dramatic GPU memory reduction - trained a 3-million-label XMC model with only 6.6 GiB memory vs 39.7 GiB required by state-of-the-art method Renee, without compromising accuracy.

Conclusion: ELMO demonstrates that XMC models can be effectively trained entirely in low-precision formats, providing significant memory efficiency improvements while maintaining model accuracy.

Abstract: Large output spaces, also referred to as Extreme multilabel classification (XMC), is a setting that arises, e.g., in large-scale tagging and product-to-product recommendation, and is characterized by the number of labels ranging from hundreds of thousands to millions. This means that the linear classification head, usually only a tiny fraction of the overall model, turns into the main driver for compute and memory demand. Current state-of-the-art XMC methods predominantly rely on FP16-FP32 mixed-precision training, which we show can be unstable, and inefficient in terms of memory usage and computational overhead. Meanwhile, existing low-precision methods typically retain higher precision for the classification layer. In this work, we propose ELMO, a pure low-precision training framework for XMC models using BFloat16 and Float8 data types. By leveraging Kahan summation and stochastic rounding, we demonstrate that XMC models can be effectively trained entirely in Float8, without relying on single-precision master weights or tensor scaling. Low-precision training, combined with our proposed memory optimizations – gradient fusion and chunking – enables significant reductions in GPU memory usage. For example, we train a 3-million-label XMC model with only 6.6 GiB of GPU memory, compared to the 39.7 GiB required by the optimized SOTA method, Renee without compromising accuracy.

[967] Instruction-aware User Embedding via Synergistic Language and Representation Modeling

Ziyi Gao, Yike Xu, Jiahao Yuan, Baokun Wang, Jinyong Wen, Xiaotong Lin, Yun Liu, Xing Fu, Yu Cheng, Yongchao Liu, Weiqiang Wang, Zhongle Xie

Main category: cs.LG

TL;DR: InstructUE is an instruction-aware user embedding foundation model that uses LLMs to create general and instruction-aware user representations through a multi-encoder architecture with contrastive-autoregressive training.

Details

Motivation: Existing user representation approaches struggle with generalizability across domains and sensitivity to noisy behavioral signals, limiting their effectiveness in personalized applications.

Method: Uses multi-encoder architecture with lightweight adapter to process heterogeneous data from six sources, and contrastive-autoregressive training framework that bridges language and representation spaces using a curated UserQA dataset.

Result: Significantly outperforms existing methods across multiple domains including user prediction, marketing, and recommendation scenarios, achieving instruction-guided denoising of user information.

Conclusion: Instruction-aware user modeling enables more generalizable and robust user representation learning, paving the way for improved personalized applications.

Abstract: User representation modeling has become increasingly crucial for personalized applications, yet existing approaches struggle with generalizability across domains and sensitivity to noisy behavioral signals. We present InstructUE, an instruction-aware user embedding foundation model that leverages large language models (LLMs) to generate general and instruction-aware user representations. InstructUE introduces a multi-encoder architecture with a lightweight adapter that efficiently processes heterogeneous data from six different sources while preserving their structural characteristics. Additionally, it proposes a novel contrastive-autoregressive training framework that bridges language and representation spaces through a curated UserQA dataset. The contrastive-autoregressive training framework simultaneously leverages autoregressive learning to capture domain knowledge in language space and contrastive learning to align user-text embeddings in representation space, thereby enhancing the instruction-awareness and noise-robustness of user embeddings. Through extensive experiments on real-world applications, we demonstrate that InstructUE significantly outperforms existing methods across multiple domains including user prediction, marketing, and recommendation scenarios. Our results show that instruction-aware user modeling can effectively achieve instruction-guided denoising of user information in specific scenarios, paving the way for more generalizable and robust user representation learning.

[968] EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

Daniel Scalena, Leonidas Zotos, Elisabetta Fersini, Malvina Nissim, Ahmet Üstün

Main category: cs.LG

TL;DR: EAGer is a training-free generation method that uses token-wise entropy distribution to reduce redundant computation in reasoning language models by branching to multiple reasoning paths only for high-entropy tokens, reallocating saved compute budget to instances where exploration is most needed.

Details

Motivation: Current test-time scaling methods allocate the same compute budget for each prompt, but different prompts have different complexity levels and computation needs. This leads to inefficient use of computational resources.

Method: EAGer leverages model uncertainty through token-wise entropy distribution to identify when to branch to multiple reasoning paths. It only explores alternative paths for high-entropy tokens and reallocates saved compute budget to instances where exploration is most beneficial.

Result: On complex reasoning benchmarks like AIME 2025, EAGer achieves the best efficiency-performance trade-off in terms of reasoning length and Pass@k. When target labels are accessible, it generates up to 65% fewer tokens (saving compute) and achieves up to 37% improvement in Pass@k compared to Full Parallel Sampling.

Conclusion: EAGer provides an effective approach to optimize compute allocation in reasoning language models by dynamically adjusting exploration based on token uncertainty, improving both efficiency and performance without requiring additional training.

Abstract: With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt. This enables exploration of different reasoning paths toward the correct solution, however, allocates the same compute budget for each prompt. Grounded on the assumption that different prompts carry different degrees of complexity, and thus different computation needs, we propose EAGer, a training-free generation method that leverages model uncertainty through token-wise entropy distribution to reduce redundant computation and concurrently improve overall performance. EAGer allows branching to multiple reasoning paths only in the presence of high-entropy tokens, and then reallocates the saved compute budget to the instances where exploration of alternative paths is most needed. We find that across multiple open-source models on complex reasoning benchmarks such as AIME 2025, EAGer can reallocate the budget without accessing target labels, achieving the best efficiency-performance trade-off in terms of reasoning length and Pass@k. When target labels are accessible, EAGer generates up to 65% fewer tokens (hence saving compute) and achieves up to 37% improvement in Pass@k compared to the Full Parallel Sampling.

[969] The Easy Path to Robustness: Coreset Selection using Sample Hardness

Pranav Ramesh, Arjun Roy, Deepak Ravikumar, Kaushik Roy, Gopalakrishnan Srinivasan

Main category: cs.LG

TL;DR: EasyCore is a coreset selection method that identifies “easy” samples with low adversarial vulnerability using average input gradient norm (AIGN) to improve adversarial robustness in training.

Details

Motivation: Current coreset selection methods focus on clean accuracy but fail to preserve adversarial robustness, creating a need for data-centric approaches that can identify samples crucial for learning resilient features.

Method: Proposes EasyCore framework that links sample vulnerability to hardness using AIGN, then selects only low-AIGN (easy) samples for training as they are less vulnerable and further from decision boundaries.

Result: Models trained with EasyCore achieve up to 7% higher adversarial accuracy under standard training and 5% higher under TRADES adversarial training compared to existing coreset methods.

Conclusion: EasyCore provides an efficient, model-agnostic data-centric method for improving adversarial robustness by leveraging AIGN as a dataset property to identify and retain robust training samples.

Abstract: Designing adversarially robust models from a data-centric perspective requires understanding which input samples are most crucial for learning resilient features. While coreset selection provides a mechanism for efficient training on data subsets, current algorithms are designed for clean accuracy and fall short in preserving robustness. To address this, we propose a framework linking a sample’s adversarial vulnerability to its \textit{hardness}, which we quantify using the average input gradient norm (AIGN) over training. We demonstrate that \textit{easy} samples (with low AIGN) are less vulnerable and occupy regions further from the decision boundary. Leveraging this insight, we present EasyCore, a coreset selection algorithm that retains only the samples with low AIGN for training. We empirically show that models trained on EasyCore-selected data achieve significantly higher adversarial accuracy than those trained with competing coreset methods under both standard and adversarial training. As AIGN is a model-agnostic dataset property, EasyCore is an efficient and widely applicable data-centric method for improving adversarial robustness. We show that EasyCore achieves up to 7% and 5% improvement in adversarial accuracy under standard training and TRADES adversarial training, respectively, compared to existing coreset methods.

[970] Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models

Youngrok Park, Hojung Jung, Sangmin Bae, Se-Young Yun

Main category: cs.LG

TL;DR: The paper proposes Temporal Alignment Guidance (TAG), a novel method that addresses the off-manifold phenomenon in diffusion models by using a time predictor to estimate deviations and guide samples back to the desired data manifold during generation.

Details

Motivation: Diffusion models accumulate errors during generation, especially when arbitrary guidance is applied to steer samples toward desired properties, which often breaks sample fidelity and causes off-manifold issues.

Method: The approach uses a time predictor to estimate deviations from the desired data manifold at each timestep, and then applies Temporal Alignment Guidance (TAG) to attract samples back to the desired manifold throughout the generation process.

Result: Extensive experiments show that TAG consistently produces samples closely aligned with the desired manifold at each timestep, leading to significant improvements in generation quality across various downstream tasks.

Conclusion: TAG provides a general solution to the off-manifold problem in diffusion models, enabling better sample fidelity and generation quality when applying guidance mechanisms.

Abstract: Diffusion models have achieved remarkable success as generative models. However, even a well-trained model can accumulate errors throughout the generation process. These errors become particularly problematic when arbitrary guidance is applied to steer samples toward desired properties, which often breaks sample fidelity. In this paper, we propose a general solution to address the off-manifold phenomenon observed in diffusion models. Our approach leverages a time predictor to estimate deviations from the desired data manifold at each timestep, identifying that a larger time gap is associated with reduced generation quality. We then design a novel guidance mechanism, `Temporal Alignment Guidance’ (TAG), attracting the samples back to the desired manifold at every timestep during generation. Through extensive experiments, we demonstrate that TAG consistently produces samples closely aligned with the desired manifold at each timestep, leading to significant improvements in generation quality across various downstream tasks.

[971] Can Tool-Integrated Reinforcement Learning Generalize Across Diverse Domains?

Zhengyu Chen, Jinluan Yang, Teng Xiao, Ruochen Zhou, Luan Zhang, Xiangyu Xi, Xiaowei Shi, Wei Wang, Jinggang Wang

Main category: cs.LG

TL;DR: RL-based tool usage learned from mathematical tasks can generalize to other domains, achieving high performance and token efficiency through a proposed TGRL framework.

Details

Motivation: To investigate cross-domain generalization of tool-augmented RL agents, as current approaches are underexplored despite LLMs' reasoning capabilities.

Method: Proposed Tool Generalization Reinforcement Learning (TGRL) framework with: standardized tool interface, dual-component reward system, and XML-based prompt template for domain-agnostic learning.

Result: RL-based tool usage from mathematical training effectively transfers to complex tasks in other domains, achieving state-of-the-art performance across diverse benchmarks.

Conclusion: Tool RL has significant cross-domain potential for LLM reasoning, with the TGRL framework enabling effective skill migration and generalization.

Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in reasoning and tool utilization. However, the generalization of tool-augmented reinforcement learning (RL) across diverse domains remains underexplored. In this work, we investigate the cross-domain generalization of an LLM agent equipped with a code interpreter tool, which is exclusively trained on mathematical problem-solving tasks. Despite the restricted training domain, we evaluate the agent’s performance across several distinct reasoning domains. The results reveal that RL-based tool usage learned from mathematical tasks can be effectively transferred to complex tasks in other domains, enabling great task performance and high token efficiency. To facilitate this cross-domain transfer, we propose a Tool Generalization Reinforcement Learning (TGRL) framework designed to promote domain-agnostic learning and skill migration, encompassing: (i) a standardized tool interface that abstracts domain-specific nuances through consistent formatting and explicit termination, fostering transferable invocation patterns; (ii) a dual-component reward system that decomposes rewards to incentivize generalizable behaviors like tool efficiency and reasoning abstraction, ensuring alignment and robustness across domain shifts; and (iii) an XML-based prompt template that separates thinking, tool calls, and responses to encourage modular, domain-invariant planning and coherent multi-turn interactions. Extensive experiments across diverse benchmarks validate our approach, achieving state-of-the-art performance and highlighting the cross-domain potential of Tool RL for LLM reasoning.

[972] Conformal Inference for Time Series over Graphs

Sonakshi Dua, Gonzalo Mateos, Sundeep Prabhakar Chepuri

Main category: cs.LG

TL;DR: A conformal prediction framework for graph time series that leverages graph structure to provide uncertainty quantification with coverage guarantees, achieving significantly smaller prediction regions than existing methods.

Details

Motivation: Existing conformal prediction methods either ignore graph topology in time series or neglect temporal dynamics in graphs, creating a gap for trustworthy decision making in networked dynamic environments.

Method: Developed a CP-based sequential prediction region framework that leverages graph structure to capture pairwise dependencies across nodes while providing user-specified coverage guarantees.

Result: The method yields exponential shrinkage in ellipsoidal prediction set volume compared to graph-agnostic approaches, with up to 80% reduction in prediction regions while maintaining desired empirical coverage on real-world datasets.

Conclusion: The proposed framework successfully bridges the gap by incorporating both graph structure and temporal dynamics, providing more efficient uncertainty quantification for graph time series with formal coverage guarantees.

Abstract: Trustworthy decision making in networked, dynamic environments calls for innovative uncertainty quantification substrates in predictive models for graph time series. Existing conformal prediction (CP) methods have been applied separately to multivariate time series and static graphs, but they either ignore the underlying graph topology or neglect temporal dynamics. To bridge this gap, here we develop a CP-based sequential prediction region framework tailored for graph time series. A key technical innovation is to leverage the graph structure and thus capture pairwise dependencies across nodes, while providing user-specified coverage guarantees on the predictive outcomes. We formally establish that our scheme yields an exponential shrinkage in the volume of the ellipsoidal prediction set relative to its graph-agnostic counterpart. Using real-world datasets, we demonstrate that the novel uncertainty quantification framework maintains desired empirical coverage while achieving markedly smaller (up to 80% reduction) prediction regions than existing approaches.

[973] ENIGMA: The Geometry of Reasoning and Alignment in Large-Language Models

Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Ariel Kupermann, Tim Elson

Main category: cs.LG

TL;DR: ENIGMA is a novel LLM training approach that improves reasoning, alignment, and robustness by treating organizational policies as directions on the model’s information manifold, combining GRPO, SAMI-style InfoNCE, and Sinkhorn regularization.

Details

Motivation: To develop a unified approach that jointly improves reasoning, alignment, and robustness in LLMs by treating policies as geometric directions on the information manifold, enabling principled reasoning without reward models.

Method: Single-loop training combining GRPO (on-policy RL with CoT rewards), SAMI-style symmetric InfoNCE auxiliary, and Sinkhorn optimal-transport regularization on hidden states to bound geometry drift. Introduces infoNCE metrics including Sufficiency Index for policy selection.

Result: Experiments with 1B LLMs show high-SI principles predict steadier training dynamics and improved benchmark performance over GRPO ablations. Information-geometry analysis validates desirable structural changes in the manifold.

Conclusion: ENIGMA demonstrates that reasoning, alignment, and robustness are projections of a single information-geometric objective, enabling principled reasoning without reward models and offering a path to trusted capability.

Abstract: We present Entropic Mutual-Information Geometry Large-Language Model Alignment (ENIGMA), a novel approach to Large-Language Model (LLM) training that jointly improves reasoning, alignment and robustness by treating an organisation’s policies/principles as directions to move on a model’s information manifold. Our single-loop trainer combines Group-Relative Policy Optimisation (GRPO), an on-policy, critic-free RL method with Chain-of-Thought (CoT)-format only rewards; a Self-Supervised Alignment with Mutual Information (SAMI)-style symmetric InfoNCE auxiliary; and an entropic Sinkhorn optimal-transport regulariser on hidden-state distributions to bound geometry drift. We also introduce infoNCE metrics that specialise to a standard MI lower bound under matched negatives to measure how strongly a model’s CoT encodes these policies. These metrics include a Sufficiency Index (SI) that enables the selection and creation of principles that maximise downstream performance prior to training. In our experiments using small (1B) LLMs, high-SI principles predict steadier training dynamics and improved benchmark performance over GRPO ablations. Our information-geometry analysis of trained models validates desirable structural change in the manifold. These results support our hypothesis that reasoning, alignment, and robustness are projections of a single informationgeometric objective, and that models trained using ENIGMA demonstrate principled reasoning without the use of a reward model, offering a path to trusted capability

[974] Robust Photoplethysmography Signal Denoising via Mamba Networks

I Chiu, Yu-Tung Liu, Kuan-Chen Wang, Hung-Yu Wei, Yu Tsao

Main category: cs.LG

TL;DR: A deep learning framework called DPNet using Mamba-based architecture for PPG denoising that preserves physiological information through SI-SDR loss and HR-guided supervision, achieving superior performance over existing methods.

Details

Motivation: PPG signals in wearable health monitoring suffer from noise and motion artifacts that degrade reliability for applications like heart rate estimation, limiting their practical utility.

Method: Proposes DPNet - a Mamba-based denoising backbone for temporal modeling, combined with SI-SDR loss for waveform fidelity and an auxiliary HR predictor for physiological consistency through HR-based supervision.

Result: Experiments on BIDMC dataset show strong robustness against synthetic noise and real-world motion artifacts, outperforming conventional filtering and existing neural models while maintaining HR accuracy.

Conclusion: The framework effectively restores PPG signals with preserved physiological information, demonstrating complementary benefits of SI-SDR loss and HR-guided supervision for practical deployment in wearable healthcare systems.

Abstract: Photoplethysmography (PPG) is widely used in wearable health monitoring, but its reliability is often degraded by noise and motion artifacts, limiting downstream applications such as heart rate (HR) estimation. This paper presents a deep learning framework for PPG denoising with an emphasis on preserving physiological information. In this framework, we propose DPNet, a Mamba-based denoising backbone designed for effective temporal modeling. To further enhance denoising performance, the framework also incorporates a scale-invariant signal-to-distortion ratio (SI-SDR) loss to promote waveform fidelity and an auxiliary HR predictor (HRP) that provides physiological consistency through HR-based supervision. Experiments on the BIDMC dataset show that our method achieves strong robustness against both synthetic noise and real-world motion artifacts, outperforming conventional filtering and existing neural models. Our method can effectively restore PPG signals while maintaining HR accuracy, highlighting the complementary roles of SI-SDR loss and HR-guided supervision. These results demonstrate the potential of our approach for practical deployment in wearable healthcare systems.

[975] Causal Disentanglement Learning for Accurate Anomaly Detection in Multivariate Time Series

Wonah Kim, Jeonghyeon Park, Dongsan Jun, Jungkyu Han, Sejin Chun

Main category: cs.LG

TL;DR: CDRL4AD is a novel method for multivariate time series anomaly detection that explicitly models causal relationships over time periods and disentangles latent variables to identify causal factors, outperforming state-of-the-art methods.

Details

Motivation: Traditional approaches assume statistical independence between variables, while recent graph-based methods capture feature correlations but fail to explicitly infer causal relationships over different time periods in multivariate time series.

Method: Proposes Causally Disentangled Representation Learning for Anomaly Detection (CDRL4AD) using temporal heterogeneous graphs to model causal processes, identify causal relationships over time, and disentangle latent variables to infer causal factors.

Result: Experiments on real-world datasets show CDRL4AD outperforms state-of-the-art methods in accuracy and root cause analysis. Model analysis validates hyperparameter sensitivity and time complexity.

Conclusion: CDRL4AD effectively assists human experts in diagnosing root causes of anomalies through explicit causal relationship modeling and disentangled representation learning.

Abstract: Disentangling complex causal relationships is important for accurate detection of anomalies. In multivariate time series analysis, dynamic interactions among data variables over time complicate the interpretation of causal relationships. Traditional approaches assume statistical independence between variables in unsupervised settings, whereas recent methods capture feature correlations through graph representation learning. However, their representations fail to explicitly infer the causal relationships over different time periods. To solve the problem, we propose Causally Disentangled Representation Learning for Anomaly Detection (CDRL4AD) to detect anomalies and identify their causal relationships in multivariate time series. First, we design the causal process as model input, the temporal heterogeneous graph, and causal relationships. Second, our representation identifies causal relationships over different time periods and disentangles latent variables to infer the corresponding causal factors. Third, our experiments on real-world datasets demonstrate that CDRL4AD outperforms state-of-the-art methods in terms of accuracy and root cause analysis. Fourth, our model analysis validates hyperparameter sensitivity and the time complexity of CDRL4AD. Last, we conduct a case study to show how our approach assists human experts in diagnosing the root causes of anomalies.

[976] PhysioME: A Robust Multimodal Self-Supervised Framework for Physiological Signals with Missing Modalities

Cheol-Hui Lee, Hwa-Yeon Lee, Min-Kyung Jung, Dong-Joo Kim

Main category: cs.LG

TL;DR: PhysioME is a robust multimodal framework for physiological signal analysis that maintains reliable performance under missing modality conditions through self-supervised learning and restoration techniques.

Details

Motivation: Missing or corrupted modalities are common in physiological signal-based medical applications due to hardware constraints or motion artifacts, but most existing methods assume full modality availability and suffer performance degradation when modalities are missing.

Method: Uses multimodal self-supervised learning combining contrastive learning with masked prediction, a Dual-PathNeuroNet backbone for temporal dynamics, and a restoration decoder to reconstruct missing modality tokens for flexible processing of incomplete inputs.

Result: Achieves high consistency and generalization performance across various missing modality scenarios, demonstrating robust performance under imperfect data conditions.

Conclusion: PhysioME shows potential as a reliable tool for supporting clinical decision-making in real-world settings with imperfect data availability, addressing the common problem of missing modalities in physiological signal analysis.

Abstract: Missing or corrupted modalities are common in physiological signal-based medical applications owing to hardware constraints or motion artifacts. However, most existing methods assume the availability of all modalities, resulting in substantial performance degradation in the absence of any modality. To overcome this limitation, this study proposes PhysioME, a robust framework designed to ensure reliable performance under missing modality conditions. PhysioME adopts: (1) a multimodal self-supervised learning approach that combines contrastive learning with masked prediction; (2) a Dual-PathNeuroNet backbone tailored to capture the temporal dynamics of each physiological signal modality; and (3) a restoration decoder that reconstructs missing modality tokens, enabling flexible processing of incomplete inputs. The experimental results show that PhysioME achieves high consistency and generalization performance across various missing modality scenarios. These findings highlight the potential of PhysioME as a reliable tool for supporting clinical decision-making in real-world settings with imperfect data availability.

[977] ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding

Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, Bo Zhou

Main category: cs.LG

TL;DR: ReLook is an agentic framework that uses multimodal LLMs to improve front-end code generation through visual feedback and reinforcement learning, achieving better performance than baseline methods.

Details

Motivation: Large Language Models struggle with front-end development where correctness depends on rendered pixels and interaction, requiring visual grounding.

Method: Uses a multimodal LLM as both visual critic (scoring code with screenshots) and feedback source, with forced optimization for monotonic improvement and training-inference decoupling.

Result: Consistently outperforms strong baselines across three benchmarks in vision-grounded front-end code generation.

Conclusion: Agentic perception, visual rewards, and training-inference decoupling provide significant benefits for front-end code generation.

Abstract: While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate–diagnose–refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent uses the MLLM-in-the-loop both as a visual critic–scoring code with screenshots–and as a source of actionable, vision-grounded feedback; a strict zero-reward rule for invalid renders anchors renderability and prevents reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training-inference decoupling.

[978] Refining Hybrid Genetic Search for CVRP via Reinforcement Learning-Finetuned LLM

Rongjie Zhu, Cong Zhang, Zhiguang Cao

Main category: cs.LG

TL;DR: A specialized small LLM fine-tuned with reinforcement learning can generate better crossover operators for vehicle routing problems than expert-designed heuristics, outperforming both human experts and large general-purpose LLMs.

Details

Motivation: Current methods rely on prompting massive general-purpose LLMs like GPT-4 for vehicle routing problems, but this work challenges that paradigm by showing smaller specialized models can perform better when properly fine-tuned.

Method: Proposed RFTHGS - a Reinforcement learning framework for Fine-Tuning a small LLM to generate crossover operators for Hybrid Genetic Search solver. Uses multi-tiered curriculum-based reward function and operator caching to prevent plagiarism and promote diversity.

Result: The fine-tuned LLM produces crossover operators that significantly outperform expert-designed ones in HGS, generalizing from small to large-scale problems (up to 1000 nodes). Outperforms neuro-combinatorial baselines, prompt-based methods, and commercial LLMs like GPT-4o.

Conclusion: Specialized small LLMs, when meticulously fine-tuned with reinforcement learning, can surpass both human expert designs and large general-purpose LLMs in generating high-performance components for combinatorial optimization problems.

Abstract: While large language models (LLMs) are increasingly used as automated heuristic designers for vehicle routing problems (VRPs), current state-of-the-art methods predominantly rely on prompting massive, general-purpose models like GPT-4. This work challenges that paradigm by demonstrating that a smaller, specialized LLM, when meticulously fine-tuned, can generate components that surpass expert-crafted heuristics within advanced solvers. We propose RFTHGS, a novel Reinforcement learning (RL) framework for Fine-Tuning a small LLM to generate high-performance crossover operators for the Hybrid Genetic Search (HGS) solver, applied to the Capacitated VRP (CVRP). Our method employs a multi-tiered, curriculum-based reward function that progressively guides the LLM to master generating first compilable, then executable, and finally, superior-performing operators that exceed human expert designs. This is coupled with an operator caching mechanism that discourages plagiarism and promotes diversity during training. Comprehensive experiments show that our fine-tuned LLM produces crossover operators which significantly outperform the expert-designed ones in HGS. The performance advantage remains consistent, generalizing from small-scale instances to large-scale problems with up to 1000 nodes. Furthermore, RFTHGS exceeds the performance of leading neuro-combinatorial baselines, prompt-based methods, and commercial LLMs such as GPT-4o and GPT-4o-mini.

[979] Protein as a Second Language for LLMs

Xinhui Chen, Zuchao Li, Mengqi Gao, Yufeng Zhang, Chak Tou Leong, Haoyang Li, Jiaqi Chen

Main category: cs.LG

TL;DR: The paper introduces ‘Protein-as-Second-Language’ framework that treats amino-acid sequences as sentences, enabling LLMs to interpret protein functions through contextual exemplars without additional training.

Details

Motivation: To address the fundamental challenge of deciphering unseen protein sequences without relying on task-specific adapters or large-scale supervised fine-tuning.

Method: Reformulates amino-acid sequences as sentences in a symbolic language, adaptively constructs sequence-question-answer triples for zero-shot functional understanding, and uses a curated bilingual corpus of 79,926 protein-QA instances.

Result: Achieves up to 17.2% ROUGE-L improvement (average +7%) across diverse LLMs including GPT-4, surpassing fine-tuned protein-specific language models.

Conclusion: Generic LLMs guided with protein-as-language cues can outperform domain-specialized models, offering a scalable pathway for protein understanding in foundation models.

Abstract: Deciphering the function of unseen protein sequences is a fundamental challenge with broad scientific impact, yet most existing methods depend on task-specific adapters or large-scale supervised fine-tuning. We introduce the “Protein-as-Second-Language” framework, which reformulates amino-acid sequences as sentences in a novel symbolic language that large language models can interpret through contextual exemplars. Our approach adaptively constructs sequence-question-answer triples that reveal functional cues in a zero-shot setting, without any further training. To support this process, we curate a bilingual corpus of 79,926 protein-QA instances spanning attribute prediction, descriptive understanding, and extended reasoning. Empirically, our method delivers consistent gains across diverse open-source LLMs and GPT-4, achieving up to 17.2% ROUGE-L improvement (average +7%) and even surpassing fine-tuned protein-specific language models. These results highlight that generic LLMs, when guided with protein-as-language cues, can outperform domain-specialized models, offering a scalable pathway for protein understanding in foundation models.

Qiyi Tong, Olivia Nocentini, Marta Lagomarsino, Kuanqi Cai, Marta Lorenzini, Arash Ajoudani

Main category: cs.LG

TL;DR: MLCM-KD framework enables efficient thermal facial landmark detection by using bidirectional knowledge distillation between RGB and thermal modalities to bridge the modality gap while maintaining computational efficiency.

Details

Motivation: Thermal FLD is crucial for low-light applications but lacks visual cues. Existing cross-modal methods are computationally expensive or create artifacts, limiting practical use.

Method: Proposes Multi-Level Cross-Modal Knowledge Distillation with Dual-Injected KD - a bidirectional mechanism that guides thermal student with RGB features and validates learned representations through closed-loop supervision.

Result: Sets new state-of-the-art on thermal FLD benchmarks, significantly outperforming previous methods while drastically reducing computational overhead.

Conclusion: The proposed framework successfully bridges the RGB-thermal modality gap through bidirectional distillation, enabling accurate and efficient thermal facial landmark detection suitable for practical deployment.

Abstract: Facial Landmark Detection (FLD) in thermal imagery is critical for applications in challenging lighting conditions, but it is hampered by the lack of rich visual cues. Conventional cross-modal solutions, like feature fusion or image translation from RGB data, are often computationally expensive or introduce structural artifacts, limiting their practical deployment. To address this, we propose Multi-Level Cross-Modal Knowledge Distillation (MLCM-KD), a novel framework that decouples high-fidelity RGB-to-thermal knowledge transfer from model compression to create both accurate and efficient thermal FLD models. A central challenge during knowledge transfer is the profound modality gap between RGB and thermal data, where traditional unidirectional distillation fails to enforce semantic consistency across disparate feature spaces. To overcome this, we introduce Dual-Injected Knowledge Distillation (DIKD), a bidirectional mechanism designed specifically for this task. DIKD establishes a connection between modalities: it not only guides the thermal student with rich RGB features but also validates the student’s learned representations by feeding them back into the frozen teacher’s prediction head. This closed-loop supervision forces the student to learn modality-invariant features that are semantically aligned with the teacher, ensuring a robust and profound knowledge transfer. Experiments show that our approach sets a new state-of-the-art on public thermal FLD benchmarks, notably outperforming previous methods while drastically reducing computational overhead.

[981] Test-Time Adaptation by Causal Trimming

Yingnan Liu, Rui Qiao, Mong Li Lee, Wynne Hsu

Main category: cs.LG

TL;DR: TACT is a test-time adaptation method that identifies and removes non-causal features from representations to improve model robustness under distribution shifts.

Details

Motivation: Performance degradation under distribution shifts is caused by models relying on non-causal features that lack direct relationship with prediction targets.

Method: Uses data augmentations to preserve causal features while varying non-causal ones, identifies non-causal components via PCA, trims representations by removing projections on high-variance directions, and continuously refines these directions during adaptation.

Result: TACT consistently outperforms state-of-the-art methods by a significant margin on real-world out-of-distribution benchmarks.

Conclusion: The method effectively improves model robustness by focusing on causal features and removing non-causal components from representations.

Abstract: Test-time adaptation aims to improve model robustness under distribution shifts by adapting models with access to unlabeled target samples. A primary cause of performance degradation under such shifts is the model’s reliance on features that lack a direct causal relationship with the prediction target. We introduce Test-time Adaptation by Causal Trimming (TACT), a method that identifies and removes non-causal components from representations for test distributions. TACT applies data augmentations that preserve causal features while varying non-causal ones. By analyzing the changes in the representations using Principal Component Analysis, TACT identifies the highest variance directions associated with non-causal features. It trims the representations by removing their projections on the identified directions, and uses the trimmed representations for the predictions. During adaptation, TACT continuously tracks and refines these directions to get a better estimate of non-causal features. We theoretically analyze the effectiveness of this approach and empirically validate TACT on real-world out-of-distribution benchmarks. TACT consistently outperforms state-of-the-art methods by a significant margin.

[982] DUAL: Learning Diverse Kernels for Aggregated Two-sample and Independence Testing

Zhijian Zhou, Xunye Tian, Liuhua Peng, Chao Lei, Antonin Schrab, Danica J. Sutherland, Feng Liu

Main category: cs.LG

TL;DR: The paper proposes a kernel aggregation method that explicitly incorporates kernel diversity to address the limitation of traditional multiple kernel tests where maximizing statistics leads to similar kernels capturing overlapping information.

Details

Motivation: Directly maximizing multiple kernel-based statistics often results in highly similar kernels that capture overlapping information, limiting the effectiveness of aggregation in kernel two-sample and independence testing for complex structured data.

Method: Proposes an aggregated statistic that incorporates kernel diversity based on covariance between kernels, and introduces a testing framework with selection inference that selects both effective and diverse kernels from a learned diverse kernel pool.

Result: The approach shows superior performance across various benchmarks for both two-sample and independence testing, with rigorous theoretical guarantees on test power consistency and Type-I error control.

Conclusion: Explicitly incorporating kernel diversity through covariance-based aggregation and selection inference significantly improves kernel testing performance by balancing effectiveness and diversity of selected kernels.

Abstract: To adapt kernel two-sample and independence testing to complex structured data, aggregation of multiple kernels is frequently employed to boost testing power compared to single-kernel tests. However, we observe a phenomenon that directly maximizing multiple kernel-based statistics may result in highly similar kernels that capture highly overlapping information, limiting the effectiveness of aggregation. To address this, we propose an aggregated statistic that explicitly incorporates kernel diversity based on the covariance between different kernels. Moreover, we identify a fundamental challenge: a trade-off between the diversity among kernels and the test power of individual kernels, i.e., the selected kernels should be both effective and diverse. This motivates a testing framework with selection inference, which leverages information from the training phase to select kernels with strong individual performance from the learned diverse kernel pool. We provide rigorous theoretical statements and proofs to show the consistency on the test power and control of Type-I error, along with asymptotic analysis of the proposed statistics. Lastly, we conducted extensive empirical experiments demonstrating the superior performance of our proposed approach across various benchmarks for both two-sample and independence testing.

[983] A Comprehensive Forecasting-Based Framework for Time Series Anomaly Detection: Benchmarking on the Numenta Anomaly Benchmark (NAB)

Mohammad Karami, Mostafa Jalali, Fatemeh Ghassemi

Main category: cs.LG

TL;DR: A comprehensive forecasting-based framework for time series anomaly detection that unifies classical and deep learning methods, with systematic evaluation on 58 datasets showing LSTM achieves best performance while classical methods work well for simple data.

Details

Motivation: Existing time series anomaly detection methods lack systematic cross-domain evaluation, and there's a need for a unified framework to compare classical and modern approaches under common evaluation standards.

Method: Modular pipeline integrating preprocessing (normalization, STL decomposition), four forecasting models (Holt-Winters, SARIMA, LSTM, Informer), four detection methods, and dual evaluation using both forecasting metrics (MAE, RMSE, PCC) and detection metrics (Precision, Recall, F1, AUC).

Result: LSTM achieved best performance (F1: 0.688, ranking first/second on 81% of datasets) with exceptional correlation (PCC: 0.999). Informer provided competitive accuracy (F1: 0.683) with 30% faster training. Classical methods achieved perfect predictions on simple synthetic data but showed 2-3x worse F1-scores on real-world datasets.

Conclusion: Forecasting quality dominates detection performance. Recommendations: use LSTM for complex patterns, Informer for efficiency-critical deployments, and classical methods for simple periodic data with resource constraints. The framework establishes baselines for future research.

Abstract: Time series anomaly detection is critical for modern digital infrastructures, yet existing methods lack systematic cross-domain evaluation. We present a comprehensive forecasting-based framework unifying classical methods (Holt-Winters, SARIMA) with deep learning architectures (LSTM, Informer) under a common residual-based detection interface. Our modular pipeline integrates preprocessing (normalization, STL decomposition), four forecasting models, four detection methods, and dual evaluation through forecasting metrics (MAE, RMSE, PCC) and detection metrics (Precision, Recall, F1, AUC). We conduct the first complete evaluation on the Numenta Anomaly Benchmark (58 datasets, 7 categories) with 232 model training runs and 464 detection evaluations achieving 100% success rate. LSTM achieves best performance (F1: 0.688, ranking first or second on 81% of datasets) with exceptional correlation on complex patterns (PCC: 0.999). Informer provides competitive accuracy (F1: 0.683) with 30% faster training. Classical methods achieve perfect predictions on simple synthetic data with 60 lower cost but show 2-3 worse F1-scores on real-world datasets. Forecasting quality dominates detection performance: differences between detection methods (F1: 0.621-0.688) are smaller than between forecasting models (F1: 0.344-0.688). Our findings provide evidence-based guidance: use LSTM for complex patterns, Informer for efficiency-critical deployments, and classical methods for simple periodic data with resource constraints. The complete implementation and results establish baselines for future forecasting-based anomaly detection research.

[984] LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences

Wenbo Wu, Qingyi Si, Xiurui Pan, Ye Wang, Jie Zhang

Main category: cs.LG

TL;DR: LouisKV is an efficient KV cache retrieval framework that addresses memory overhead in long-sequence scenarios by leveraging temporal locality and semantic boundaries, achieving significant speedup while maintaining accuracy.

Details

Motivation: Existing KV cache methods introduce significant memory overhead in long-sequence scenarios and suffer from efficiency and accuracy bottlenecks due to per-token retrieval and coarse-grained page-level management, especially in long-output reasoning scenarios.

Method: LouisKV introduces a semantic-aware retrieval strategy that triggers retrieval only at semantic boundaries, uses decoupled fine-grained management with differentiated strategies for input/output sequences, and incorporates custom Triton/CUDA kernels for KV clustering and retrieval acceleration.

Result: LouisKV achieves up to 4.7× speedup over state-of-the-art KV retrieval methods while maintaining near-lossless accuracy across diverse long-sequence tasks including long-input short-output, short-input long-output, and long-input long-output scenarios.

Conclusion: LouisKV effectively addresses KV cache memory overhead in long-sequence scenarios through temporal locality-based retrieval and fine-grained management, demonstrating superior efficiency and accuracy compared to existing methods.

Abstract: While Key-Value (KV) cache succeeds in reducing redundant computations in auto-regressive models, it introduces significant memory overhead, limiting its practical deployment in long-sequence scenarios. Existing KV retrieval methods mitigate this by dynamically retaining only a subset of KV entries on the GPU. However, they still suffer from notable efficiency and accuracy bottlenecks due to per-token retrieval and coarse-grained page-level KV management, especially in long-output reasoning scenarios. With the emergence of large reasoning models, efficiently handling such scenarios has become increasingly important. To address this issue, we present two key observations: (1) critical KVs exhibit strong temporal locality during decoding, and (2) these KVs exhibit distinct distribution patterns across the input prompt and generated output. Building on these observations, we propose LouisKV, an efficient KV cache retrieval framework designed for various long-sequence scenarios. Specifically, LouisKV introduces a semantic-aware retrieval strategy leveraging temporal locality to trigger retrieval only at semantic boundaries, drastically reducing computation and data transfer overhead. LouisKV also designs a decoupled, fine-grained management scheme that tailors differentiated strategies for input and output sequences to create retrieval units that better match the model’s attention patterns, enabling precise identification of critical KVs. Furthermore, to boost efficiency, LouisKV incorporates several kernel-level optimizations, including custom Triton and CUDA kernels to accelerate the KV clustering and retrieval. Evaluations show that LouisKV achieves up to 4.7$\times$ speedup over state-of-the-art KV retrieval methods while maintaining near-lossless accuracy across diverse long-sequence tasks, including long-input short-output, short-input long-output, and long-input long-output scenarios.

[985] Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

Main category: cs.LG

TL;DR: BGPO is a memory-efficient RL algorithm for diffusion LLMs that uses a specially constructed lower bound to enable large Monte Carlo sample sizes, overcoming memory constraints of existing methods.

Details

Motivation: Existing RL methods for diffusion LLMs suffer from high memory overhead due to retaining forward computational graphs for all Monte Carlo samples during gradient computation, limiting sample sizes and causing imprecise likelihood approximations.

Method: Proposes Boundary-Guided Policy Optimization (BGPO) which maximizes a specially constructed lower bound of the ELBO-based objective that has linearity (enabling gradient accumulation) and equivalence properties (matching value and gradient of original objective).

Result: BGPO significantly outperforms previous RL algorithms for diffusion LLMs in math problem solving, code generation, and planning tasks by enabling larger MC sample sizes and more accurate likelihood approximations.

Conclusion: BGPO provides an effective memory-efficient solution for RL training of diffusion LLMs, addressing key limitations of existing methods through its novel lower bound construction with linearity and equivalence properties.

Abstract: A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) lies in the intractability of their likelihood functions, which are essential for the RL objective, necessitating corresponding approximation in each training step. While existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, the forward computational graphs of all MC samples need to be retained for the gradient computation of non-linear terms in the RL objective, resulting in significant memory overhead. This constraint restricts feasible sample sizes, leading to imprecise likelihood approximations and ultimately distorting the RL objective. To overcome this limitation, we propose \emph{Boundary-Guided Policy Optimization} (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This lower bound is carefully designed to satisfy two key properties: (1) Linearity: it is formulated in a linear sum where each term depends only on a single MC sample, thereby enabling gradient accumulation across samples and ensuring constant memory usage; (2) Equivalence: Both the value and gradient of this lower bound are equal to those of the ELBO-based objective in on-policy training, making it also an effective approximation for the original RL objective. These properties allow BGPO to adopt a large MC sample size, resulting in more accurate likelihood approximations and improved RL objective estimation, which in turn leads to enhanced performance. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks.

[986] Emergence of hybrid computational dynamics through reinforcement learning

Roman A. Kononov, Nikita A. Pospelov, Konstantin V. Anokhin, Vladimir V. Nekorkin, Oleg V. Maslennikov

Main category: cs.LG

TL;DR: RL and SL produce fundamentally different computational strategies in RNNs even when trained on identical tasks, with RL discovering complex hybrid attractor dynamics while SL converges to simpler fixed-point solutions.

Details

Motivation: To understand how different learning algorithms shape emergent computational strategies in neural networks, particularly exploring the role of learning paradigms beyond just network architecture.

Method: Systematic dynamical systems analysis of recurrent neural networks trained on identical decision-making tasks using both reinforcement learning and supervised learning approaches.

Result: RL spontaneously discovers hybrid attractor architectures combining stable fixed-points with quasi-periodic dynamics, while SL converges to fixed-point-only solutions. RL also creates functionally balanced neural populations through implicit regularization, enhancing robustness.

Conclusion: The learning algorithm is a primary determinant of emergent computation, with reward-based optimization autonomously discovering sophisticated dynamical mechanisms that are less accessible to direct gradient-based optimization.

Abstract: Understanding how learning algorithms shape the computational strategies that emerge in neural networks remains a fundamental challenge in machine intelligence. While network architectures receive extensive attention, the role of the learning paradigm itself in determining emergent dynamics remains largely unexplored. Here we demonstrate that reinforcement learning (RL) and supervised learning (SL) drive recurrent neural networks (RNNs) toward fundamentally different computational solutions when trained on identical decision-making tasks. Through systematic dynamical systems analysis, we reveal that RL spontaneously discovers hybrid attractor architectures, combining stable fixed-point attractors for decision maintenance with quasi-periodic attractors for flexible evidence integration. This contrasts sharply with SL, which converges almost exclusively to simpler fixed-point-only solutions. We further show that RL sculpts functionally balanced neural populations through a powerful form of implicit regularization – a structural signature that enhances robustness and is conspicuously absent in the more heterogeneous solutions found by SL-trained networks. The prevalence of these complex dynamics in RL is controllably modulated by weight initialization and correlates strongly with performance gains, particularly as task complexity increases. Our results establish the learning algorithm as a primary determinant of emergent computation, revealing how reward-based optimization autonomously discovers sophisticated dynamical mechanisms that are less accessible to direct gradient-based optimization. These findings provide both mechanistic insights into neural computation and actionable principles for designing adaptive AI systems.

[987] QeRL: Beyond Efficiency – Quantization-enhanced Reinforcement Learning for LLMs

Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen

Main category: cs.LG

TL;DR: QeRL is a Quantization-enhanced Reinforcement Learning framework that uses NVFP4 quantization and LoRA to accelerate RL training for LLMs while reducing memory usage, enabling training of 32B models on single GPUs.

Details

Motivation: RL is crucial for LLM reasoning but is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these efficiency challenges.

Method: Combines NVFP4 quantization with Low-Rank Adaptation (LoRA), introduces Adaptive Quantization Noise (AQN) mechanism to dynamically adjust noise during training, leveraging quantization noise to increase policy entropy for better exploration.

Result: Achieves 1.5x speedup in rollout phase, enables RL training of 32B LLM on single H100 80GB GPU, faster reward growth and higher accuracy than 16-bit LoRA/QLoRA, matches full-parameter fine-tuning performance on GSM8K (90.8%) and MATH 500 (77.4%) with 7B model.

Conclusion: QeRL establishes an efficient and effective framework for RL training in LLMs, demonstrating both computational efficiency and performance improvements.

Abstract: We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs’ reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration, and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers over 1.5 times speedup in the rollout phase. Moreover, this is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) in the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.

[988] Beyond single-model XAI: aggregating multi-model explanations for enhanced trustworthiness

Ilaria Vascotto, Alex Rodriguez, Alessandro Bonaita, Luca Bortolussi

Main category: cs.LG

TL;DR: The paper investigates the robustness of explanations in XAI by using feature importance aggregation from multiple models (k-NN, random forest, neural networks) to increase trustworthiness.

Details

Motivation: AI models in high-risk applications require trustworthy and ethical usage. XAI addresses this by explaining black-box models, but explanation robustness is often overlooked despite being essential for building trust.

Method: Uses feature importance aggregation derived from multiple models including k-nearest neighbours, random forest, and neural networks.

Result: Preliminary results show potential in increasing the trustworthiness of applications while leveraging multiple models’ predictive power.

Conclusion: Robust explanation methods are crucial for increasing trust in AI systems, and feature importance aggregation across multiple models shows promise in achieving this goal.

Abstract: The use of Artificial Intelligence (AI) models in real-world and high-risk applications has intensified the discussion about their trustworthiness and ethical usage, from both a technical and a legislative perspective. The field of eXplainable Artificial Intelligence (XAI) addresses this challenge by proposing explanations that bring to light the decision-making processes of complex black-box models. Despite being an essential property, the robustness of explanations is often an overlooked aspect during development: only robust explanation methods can increase the trust in the system as a whole. This paper investigates the role of robustness through the usage of a feature importance aggregation derived from multiple models ($k$-nearest neighbours, random forest and neural networks). Preliminary results showcase the potential in increasing the trustworthiness of the application, while leveraging multiple model’s predictive power.

[989] Event-Aware Prompt Learning for Dynamic Graphs

Xingtong Yu, Ruijuan Liang, Xinming Zhang, Yuan Fang

Main category: cs.LG

TL;DR: EVP is an event-aware dynamic graph prompt learning framework that enhances existing methods by leveraging historical events knowledge through event adaptation and aggregation mechanisms.

Details

Motivation: Existing dynamic graph learning methods focus on node-time relationships but overlook the impact of historical events, which limits their ability to capture comprehensive temporal dynamics.

Method: Proposes EVP framework with: 1) Event adaptation mechanism to align historical events with downstream tasks, 2) Event aggregation mechanism to integrate historical knowledge into node representations.

Result: Extensive experiments conducted on four public datasets demonstrate the effectiveness of EVP as a plug-in framework.

Conclusion: EVP successfully enhances existing dynamic graph learning methods by incorporating historical event knowledge through its event-aware prompt learning approach.

Abstract: Real-world graph typically evolve via a series of events, modeling dynamic interactions between objects across various domains. For dynamic graph learning, dynamic graph neural networks (DGNNs) have emerged as popular solutions. Recently, prompt learning methods have been explored on dynamic graphs. However, existing methods generally focus on capturing the relationship between nodes and time, while overlooking the impact of historical events. In this paper, we propose EVP, an event-aware dynamic graph prompt learning framework that can serve as a plug-in to existing methods, enhancing their ability to leverage historical events knowledge. First, we extract a series of historical events for each node and introduce an event adaptation mechanism to align the fine-grained characteristics of these events with downstream tasks. Second, we propose an event aggregation mechanism to effectively integrate historical knowledge into node representations. Finally, we conduct extensive experiments on four public datasets to evaluate and analyze EVP.

[990] Evaluating Line-level Localization Ability of Learning-based Code Vulnerability Detection Models

Marco Pintore, Giorgio Piras, Angelo Sotgiu, Maura Pintor, Battista Biggio

Main category: cs.LG

TL;DR: The paper proposes Detection Alignment (DA), an explainability-based evaluation method to assess whether vulnerability detectors focus on actual vulnerable code lines rather than being biased by spurious correlations in non-vulnerable lines.

Details

Motivation: Current ML-based vulnerability detectors only flag entire functions as vulnerable without precise localization, which is crucial for helping developers debug and fix vulnerabilities. Recent approaches improve localization but ignore spurious correlations and biases that influence ML performance.

Method: Detection Alignment (DA) - an explainability-based evaluation procedure that quantifies agreement between the source code lines that most influence predictions and the actual vulnerability locations in ground truth. It’s model-agnostic and adaptable to different detection tasks.

Result: Analysis of multiple learning-based vulnerability detectors and datasets shows that model predictions are consistently biased by non-vulnerable lines, highlighting the high impact of biases and spurious correlations in vulnerability detection.

Conclusion: The proposed DA method reveals that current vulnerability detectors suffer from significant biases, focusing on non-vulnerable code lines rather than actual vulnerabilities, which undermines their practical utility for developers.

Abstract: To address the extremely concerning problem of software vulnerability, system security is often entrusted to Machine Learning (ML) algorithms. Despite their now established detection capabilities, such models are limited by design to flagging the entire input source code function as vulnerable, rather than precisely localizing the concerned code lines. However, the detection granularity is crucial to support human operators during software development, ensuring that such predictions reflect the true code semantics to help debug, evaluate, and fix the detected vulnerabilities. To address this issue, recent work made progress toward improving the detector’s localization ability, thus narrowing down the vulnerability detection “window” and providing more fine-grained predictions. Such approaches, however, implicitly disregard the presence of spurious correlations and biases in the data, which often predominantly influence the performance of ML algorithms. In this work, we investigate how detectors comply with this requirement by proposing an explainability-based evaluation procedure. Our approach, defined as Detection Alignment (DA), quantifies the agreement between the input source code lines that most influence the prediction and the actual localization of the vulnerability as per the ground truth. Through DA, which is model-agnostic and adaptable to different detection tasks, not limited to our use case, we analyze multiple learning-based vulnerability detectors and datasets. As a result, we show how the predictions of such models are consistently biased by non-vulnerable lines, ultimately highlighting the high impact of biases and spurious correlations. The code is available at https://github.com/pralab/vuln-localization-eval.

[991] Part II: ROLL Flash – Accelerating RLVR and Agentic Training with Asynchrony

Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, Yanan Wu, Weixun Wang, Jiashun Liu, Yang Li, Haizhou Zhao, Ju Huang, Siran Yang, Xiaoyang Li, Yijia Luo, Zihe Liu, Ling Pan, Junchi Yan, Wei Wang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng

Main category: cs.LG

TL;DR: ROLL Flash is a system that enables asynchronous RL post-training for LLMs, achieving 2.24x-2.72x speedup over synchronous methods through fine-grained parallelism and rollout-train decoupling.

Details

Motivation: Existing synchronous RL post-training systems suffer from low resource utilization and limited scalability, creating a need for more efficient approaches.

Method: Built on fine-grained parallelism and rollout-train decoupling principles, ROLL Flash provides flexible programming interfaces for fully asynchronous training with efficient rollout mechanisms including queue scheduling and environment-level asynchronous execution.

Result: ROLL Flash achieves up to 2.24x speedup on RLVR tasks and 2.72x on agentic tasks using the same GPU budget as synchronous baselines, while maintaining performance parity with synchronous training.

Conclusion: Asynchronous RL post-training through ROLL Flash significantly improves resource utilization and scalability compared to synchronous approaches, enabling faster training without sacrificing performance.

Abstract: Synchronous Reinforcement Learning (RL) post-training has emerged as a crucial step for enhancing Large Language Models (LLMs) with diverse capabilities. However, many systems designed to accelerate RL post-training still suffer from low resource utilization and limited scalability. We present ROLL Flash, a system that extends ROLL with native support for asynchronous RL post-training. ROLL Flash is built upon two core design principles: fine-grained parallelism and rollout-train decoupling. Guided by these principles, ROLL Flash provides flexible programming interfaces that enable a fully asynchronous training architecture and support efficient rollout mechanisms, including queue scheduling and environment-level asynchronous execution. Through comprehensive theoretical analysis and extensive experiments, we demonstrate that ROLL Flash significantly improves resource utilization and scalability over synchronous RL post-training. ROLL Flash achieves up to 2.24x speedup on RLVR tasks and 2.72x on agentic tasks, using the same GPU budget as synchronous baselines. Furthermore, we implement several popular off-policy algorithms and verify that asynchronous training can achieve performance on par with synchronous training.

[992] Cross-Scale Reservoir Computing for large spatio-temporal forecasting and modeling

Nicola Alboré, Gabriele Di Antonio, Fabrizio Coccetti, Andrea Gabrielli

Main category: cs.LG

TL;DR: A new reservoir computing method for high-resolution spatiotemporal forecasting using multi-resolution inputs from coarser to finer layers.

Details

Motivation: To better capture both local and global dynamics in high-resolution spatiotemporal datasets for improved forecasting accuracy.

Method: Combines multi-resolution inputs from coarser to finer layers in reservoir computing architecture with cross-layers coupling.

Result: Outperforms standard parallel reservoir models in long-term forecasting of Sea Surface Temperature data.

Conclusion: Cross-layers coupling improves predictive accuracy, and optimal network dynamics become increasingly linear, revealing slow modes propagated to subsequent layers.

Abstract: We propose a new reservoir computing method for forecasting high-resolution spatiotemporal datasets. By combining multi-resolution inputs from coarser to finer layers, our architecture better captures both local and global dynamics. Applied to Sea Surface Temperature data, it outperforms standard parallel reservoir models in long-term forecasting, demonstrating the effectiveness of cross-layers coupling in improving predictive accuracy. Finally, we show that the optimal network dynamics in each layer become increasingly linear, revealing the slow modes propagated to subsequent layers.

[993] Enforcing convex constraints in Graph Neural Networks

Ahmed Rashwan, Keith Briggs, Chris Budd, Lisa Kreusser

Main category: cs.LG

TL;DR: ProjNet is a Graph Neural Network framework that enforces input-dependent constraints using sparse vector clipping and the Component-Averaged Dykstra algorithm, with GPU acceleration and surrogate gradients for efficient end-to-end training.

Details

Motivation: Machine learning applications often require outputs that satisfy complex, dynamic constraints, which is particularly challenging for Graph Neural Networks due to variable output sizes in graph-structured data.

Method: ProjNet combines sparse vector clipping with the Component-Averaged Dykstra algorithm for constraint satisfaction, features GPU-accelerated implementation for large-scale inputs, and uses surrogate gradients for end-to-end training.

Result: ProjNet was validated on four classes of constrained optimization problems (linear programming, two classes of non-convex quadratic programs, and radio transmit power optimization) and demonstrated effectiveness across diverse problem settings.

Conclusion: ProjNet provides an effective framework for enforcing input-dependent constraints in Graph Neural Networks, with proven convergence guarantees and practical efficiency for various constrained optimization problems.

Abstract: Many machine learning applications require outputs that satisfy complex, dynamic constraints. This task is particularly challenging in Graph Neural Network models due to the variable output sizes of graph-structured data. In this paper, we introduce ProjNet, a Graph Neural Network framework which satisfies input-dependant constraints. ProjNet combines a sparse vector clipping method with the Component-Averaged Dykstra (CAD) algorithm, an iterative scheme for solving the best-approximation problem. We establish a convergence result for CAD and develop a GPU-accelerated implementation capable of handling large-scale inputs efficiently. To enable end-to-end training, we introduce a surrogate gradient for CAD that is both computationally efficient and better suited for optimization than the exact gradient. We validate ProjNet on four classes of constrained optimisation problems: linear programming, two classes of non-convex quadratic programs, and radio transmit power optimization, demonstrating its effectiveness across diverse problem settings.

[994] Multi-View Graph Feature Propagation for Privacy Preservation and Feature Sparsity

Etzion Harari, Moshe Unger

Main category: cs.LG

TL;DR: Proposes Multi-view Feature Propagation (MFP) framework that enhances node classification under feature sparsity while preserving privacy through multiple Gaussian-noised feature views propagated independently through graph topology.

Details

Motivation: Address challenges of degraded performance from sparse node features and privacy risks from sensitive information exposure in GNN-based node classification tasks.

Method: Extends traditional Feature Propagation by dividing available features into multiple Gaussian-noised views, each propagating independently through graph topology, then aggregating representations for robust embeddings.

Result: Outperforms state-of-the-art baselines in node classification while substantially reducing privacy leakage; propagated outputs serve as alternative imputations rather than reconstructions of original features.

Conclusion: MFP provides effective and privacy-aware framework for graph learning in domains with missing or sensitive features, balancing utility with privacy preservation.

Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable success in node classification tasks over relational data, yet their effectiveness often depends on the availability of complete node features. In many real-world scenarios, however, feature matrices are highly sparse or contain sensitive information, leading to degraded performance and increased privacy risks. Furthermore, direct exposure of information can result in unintended data leakage, enabling adversaries to infer sensitive information. To address these challenges, we propose a novel Multi-view Feature Propagation (MFP) framework that enhances node classification under feature sparsity while promoting privacy preservation. MFP extends traditional Feature Propagation (FP) by dividing the available features into multiple Gaussian-noised views, each propagating information independently through the graph topology. The aggregated representations yield expressive and robust node embeddings. This framework is novel in two respects: it introduces a mechanism that improves robustness under extreme sparsity, and it provides a principled way to balance utility with privacy. Extensive experiments conducted on graph datasets demonstrate that MFP outperforms state-of-the-art baselines in node classification while substantially reducing privacy leakage. Moreover, our analysis demonstrates that propagated outputs serve as alternative imputations rather than reconstructions of the original features, preserving utility without compromising privacy. A comprehensive sensitivity analysis further confirms the stability and practical applicability of MFP across diverse scenarios. Overall, MFP provides an effective and privacy-aware framework for graph learning in domains characterized by missing or sensitive features.

[995] Neural Weight Compression for Language Models

Jegwang Ryu, Minkyu Kim, Seungjun Shin, Hee Min Choi, Dokwan Oh, Jaeho Lee

Main category: cs.LG

TL;DR: Proposes Neural Weight Compression (NWC), a learned compression framework for language model weights using neural codecs trained directly from pretrained weights, achieving competitive compression-accuracy tradeoffs.

Details

Motivation: Efficient storage and transmission of language model weights is crucial as models scale, but current compression methods rely on manual trial-and-error due to limited understanding of this data modality.

Method: Autoencoder-based neural codec with column-wise tensor chunking/normalization, importance-aware training loss, and inference-time error compensation guided by model outputs.

Result: Achieves competitive or state-of-the-art accuracy-compression tradeoffs, with strong results at 4-6 bit precisions where accuracy remains nearly on par with FP16 models.

Conclusion: NWC provides an effective learned compression approach for language model weights that addresses unique challenges of this data modality.

Abstract: The efficient storage and transmission of language model weights is becoming increasingly important, as their scale and adoption continue to grow. However, as our understanding of this new data modality is limited, designing a good compression algorithm for language model weights heavily relies on manual, trial-and-error approaches. In this paper, we propose a learned compression framework that trains neural codecs directly from pretrained language model weights. Unlike conventional data (e.g., images), language model weights pose unique challenges: the sizes and shapes of weight tensors vary significantly, and the reconstruction quality must be judged by downstream model predictions rather than na"ive MSE loss. To address this, we introduce Neural Weight Compression (NWC), a novel autoencoder-based neural codec tailored to model weight compression. The proposed method inherits the advantages of autoencoder-based codecs while incorporating three technical components: (1) column-wise tensor chunking and normalization; (2) an importance-aware training loss; (3) an inference-time error compensation mechanism guided by model outputs. Experiments on open-weight language models show that NWC achieves competitive or state-of-the-art accuracy-compression tradeoffs, with particularly strong results at 4-6 bit precisions where accuracy remains nearly on par with FP16 models.

[996] Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks

Xuan Tang, Han Zhang, Yuan Cao, Difan Zou

Main category: cs.LG

TL;DR: This paper provides the first theoretical analysis of how batch size affects Adam’s generalization, showing that while full-batch Adam converges to poor test error, mini-batch variants can achieve near-zero test error in over-parameterized CNNs.

Details

Motivation: Most existing theoretical work analyzes full-batch Adam, which differs fundamentally from the stochastic variant used in practice. Unlike SGD, stochastic Adam doesn't converge to its full-batch counterpart even with infinitesimal learning rates, creating a gap between theory and practice.

Method: Theoretical analysis of two-layer over-parameterized CNNs on image data, examining how batch size affects Adam’s generalization performance and comparing Adam with AdamW.

Result: While both Adam and AdamW with proper weight decay converge to poor test error solutions, their mini-batch variants can achieve near-zero test error. Adam has a strictly smaller effective weight decay bound than AdamW, explaining why Adam requires more sensitive weight decay tuning.

Conclusion: Batch size and weight decay play critical roles in Adam’s generalization performance, with mini-batch training enabling better generalization than full-batch training in over-parameterized networks.

Abstract: Adam is a popular and widely used adaptive gradient method in deep learning, which has also received tremendous focus in theoretical research. However, most existing theoretical work primarily analyzes its full-batch version, which differs fundamentally from the stochastic variant used in practice. Unlike SGD, stochastic Adam does not converge to its full-batch counterpart even with infinitesimal learning rates. We present the first theoretical characterization of how batch size affects Adam’s generalization, analyzing two-layer over-parameterized CNNs on image data. Our results reveal that while both Adam and AdamW with proper weight decay $\lambda$ converge to poor test error solutions, their mini-batch variants can achieve near-zero test error. We further prove Adam has a strictly smaller effective weight decay bound than AdamW, theoretically explaining why Adam requires more sensitive $\lambda$ tuning. Extensive experiments validate our findings, demonstrating the critical role of batch size and weight decay in Adam’s generalization performance.

[997] Learning the Structure of Connection Graphs

Leonardo Di Nino, Gabriele D’Acunto, Sergio Barbarossa, Paolo Di Lorenzo

Main category: cs.LG

TL;DR: The paper proposes SCGL, a framework for learning Connection Graphs (CGs) from observed signals using maximum pseudo-likelihood and block-optimization over Riemannian manifolds.

Details

Motivation: Connection graphs extend traditional graphs by incorporating geometric transformations but learning them from data is challenging. The work addresses the inverse problem of inferring CGs directly from observed signals.

Method: Proposed Structured Connection Graph Learning (SCGL) algorithm based on maximum pseudo-likelihood with consistency assumptions, using block-optimization over Riemannian manifolds to jointly infer network topology, edge weights, and geometric structure.

Result: SCGL consistently outperforms existing baselines in both topological recovery and geometric fidelity while remaining computationally efficient.

Conclusion: The framework successfully learns connection graphs from signal data, demonstrating superior performance in recovering both network structure and geometric properties.

Abstract: Connection graphs (CGs) extend traditional graph models by coupling network topology with orthogonal transformations, enabling the representation of global geometric consistency. They play a key role in applications such as synchronization, Riemannian signal processing, and neural sheaf diffusion. In this work, we address the inverse problem of learning CGs directly from observed signals. We propose a principled framework based on maximum pseudo-likelihood under a consistency assumption, which enforces spectral properties linking the connection Laplacian to the underlying combinatorial Laplacian. Based on this formulation, we introduce the Structured Connection Graph Learning (SCGL) algorithm, a block-optimization procedure over Riemannian manifolds that jointly infers network topology, edge weights, and geometric structure. Our experiments show that SCGL consistently outperforms existing baselines in both topological recovery and geometric fidelity, while remaining computationally efficient.

[998] Medical Interpretability and Knowledge Maps of Large Language Models

Razvan Marinescu, Victoria-Elisabeth Gruber, Diego Fajardo

Main category: cs.LG

TL;DR: Systematic study of medical-domain interpretability in LLMs using four techniques to understand how medical knowledge is represented and processed, revealing specific layer patterns and phenomena.

Details

Motivation: To understand how LLMs represent and process medical knowledge, enabling better fine-tuning, un-learning, or de-biasing for medical tasks.

Method: Used four interpretability techniques: UMAP projections of intermediate activations, gradient-based saliency, layer lesioning/removal, and activation patching on five LLMs.

Result: Found medical knowledge primarily processed in first half of Llama3.3-70B layers; discovered non-linear age encoding, non-monotonic disease progression, drugs clustering by specialty, and activation collapse/recovery patterns.

Conclusion: Results provide guidance for targeted interventions in specific model layers to improve medical LLM performance through fine-tuning, un-learning, or de-biasing.

Abstract: We present a systematic study of medical-domain interpretability in Large Language Models (LLMs). We study how the LLMs both represent and process medical knowledge through four different interpretability techniques: (1) UMAP projections of intermediate activations, (2) gradient-based saliency with respect to the model weights, (3) layer lesioning/removal and (4) activation patching. We present knowledge maps of five LLMs which show, at a coarse-resolution, where knowledge about patient’s ages, medical symptoms, diseases and drugs is stored in the models. In particular for Llama3.3-70B, we find that most medical knowledge is processed in the first half of the model’s layers. In addition, we find several interesting phenomena: (i) age is often encoded in a non-linear and sometimes discontinuous manner at intermediate layers in the models, (ii) the disease progression representation is non-monotonic and circular at certain layers of the model, (iii) in Llama3.3-70B, drugs cluster better by medical specialty rather than mechanism of action, especially for Llama3.3-70B and (iv) Gemma3-27B and MedGemma-27B have activations that collapse at intermediate layers but recover by the final layers. These results can guide future research on fine-tuning, un-learning or de-biasing LLMs for medical tasks by suggesting at which layers in the model these techniques should be applied.

[999] FUSE: Fast Semi-Supervised Node Embedding Learning via Structural and Label-Aware Optimization

Sujan Chakraborty, Rahul Bordoloi, Anindya Sengupta, Olaf Wolkenhauer, Saptarshi Bej

Main category: cs.LG

TL;DR: A fast semi-supervised graph embedding framework that jointly optimizes structure preservation, supervised regularization, and label propagation for node classification in feature-sparse graphs.

Details

Motivation: Many real-world graphs lack informative node features, relying only on connectivity and labels, requiring effective structural embedding methods for classification.

Method: Joint optimization of three objectives: unsupervised structure preservation via modularity approximation, supervised regularization to minimize intra-class variance, and semi-supervised label propagation using random-walk with attention-weighted similarity.

Result: Achieves classification accuracy comparable or superior to state-of-the-art methods on standard benchmarks with significantly lower computational cost.

Conclusion: The unified iterative optimization framework produces high-quality node embeddings efficiently, making it suitable for feature-sparse graph classification tasks.

Abstract: Graph-based learning is a cornerstone for analyzing structured data, with node classification as a central task. However, in many real-world graphs, nodes lack informative feature vectors, leaving only neighborhood connectivity and class labels as available signals. In such cases, effective classification hinges on learning node embeddings that capture structural roles and topological context. We introduce a fast semi-supervised embedding framework that jointly optimizes three complementary objectives: (i) unsupervised structure preservation via scalable modularity approximation, (ii) supervised regularization to minimize intra-class variance among labeled nodes, and (iii) semi-supervised propagation that refines unlabeled nodes through random-walk-based label spreading with attention-weighted similarity. These components are unified into a single iterative optimization scheme, yielding high-quality node embeddings. On standard benchmarks, our method consistently achieves classification accuracy at par with or superior to state-of-the-art approaches, while requiring significantly less computational cost.

[1000] MIEO: encoding clinical data to enhance cardiovascular event prediction

Davide Borghini, Davide Marchi, Angelo Nardone, Giordano Scerra, Silvia Giulia Galfrè, Alessandro Pingitore, Giuseppe Prencipe, Corrado Priami, Alina Sîrbu

Main category: cs.LG

TL;DR: Self-supervised auto-encoders improve cardiovascular death prediction by addressing low labeled data availability and data heterogeneity in clinical datasets.

Details

Motivation: Clinical data faces challenges of low labeled data availability and data heterogeneity leading to missing values, limiting machine learning effectiveness.

Method: Use self-supervised auto-encoders to embed patient data in a latent space using unlabeled data, then train neural network classifier on this representation.

Result: Improved balanced accuracy compared to applying classifier directly to raw data, especially beneficial when unlabeled data availability increases.

Conclusion: Self-supervised auto-encoders are promising for clinical prediction tasks, effectively leveraging unlabeled data to overcome data scarcity and heterogeneity issues.

Abstract: As clinical data are becoming increasingly available, machine learning methods have been employed to extract knowledge from them and predict clinical events. While promising, approaches suffer from at least two main issues: low availability of labelled data and data heterogeneity leading to missing values. This work proposes the use of self-supervised auto-encoders to efficiently address these challenges. We apply our methodology to a clinical dataset from patients with ischaemic heart disease. Patient data is embedded in a latent space, built using unlabelled data, which is then used to train a neural network classifier to predict cardiovascular death. Results show improved balanced accuracy compared to applying the classifier directly to the raw data, demonstrating that this solution is promising, especially in conditions where availability of unlabelled data could increase.

[1001] Reconstructing 12-Lead ECG from 3-Lead ECG using Variational Autoencoder to Improve Cardiac Disease Detection of Wearable ECG Devices

Xinyan Guan, Yongfan Lai, Jiarui Jin, Jun Li, Haoyu Wang, Qinghao Zhao, Deyun Zhang, Shijia Geng, Shenda Hong

Main category: cs.LG

TL;DR: WearECG uses a Variational Autoencoder to reconstruct 12-lead ECGs from just 3 leads (II, V1, V5), enabling portable cardiac monitoring while maintaining diagnostic accuracy comparable to clinical gold standard.

Details

Motivation: 12-lead ECGs provide comprehensive cardiac diagnosis but lack portability, while 3-lead wearable systems are portable but often miss pathologies in unmeasured regions. There's a need to bridge this gap for continuous, large-scale cardiac screening.

Method: Proposed WearECG - a VAE with architectural improvements to capture temporal and spatial dependencies in ECG signals. Evaluated using MSE, MAE, FID, Turing test with cardiologists, and fine-tuned ECGFounder model for multi-label classification of 40+ cardiac conditions.

Result: Method produces physiologically realistic and diagnostically informative signals with robust downstream performance. Generated signals maintain diagnostic utility comparable to real 12-lead ECGs for detecting conditions including myocardial infarction at different locations.

Conclusion: Generative modeling enables effective ECG reconstruction from limited leads, demonstrating potential for scalable, low-cost cardiac screening through wearable devices while preserving diagnostic accuracy.

Abstract: Twelve-lead electrocardiograms (ECGs) are the clinical gold standard for cardiac diagnosis, providing comprehensive spatial coverage of the heart necessary to detect conditions such as myocardial infarction (MI). However, their lack of portability limits continuous and large-scale use. Three-lead ECG systems are widely used in wearable devices due to their simplicity and mobility, but they often fail to capture pathologies in unmeasured regions. To address this, we propose WearECG, a Variational Autoencoder (VAE) method that reconstructs twelve-lead ECGs from three leads: II, V1, and V5. Our model includes architectural improvements to better capture temporal and spatial dependencies in ECG signals. We evaluate generation quality using MSE, MAE, and Frechet Inception Distance (FID), and assess clinical validity via a Turing test with expert cardiologists. To further validate diagnostic utility, we fine-tune ECGFounder, a large-scale pretrained ECG model, on a multi-label classification task involving over 40 cardiac conditions, including six different myocardial infarction locations, using both real and generated signals. Experiments on the MIMIC dataset show that our method produces physiologically realistic and diagnostically informative signals, with robust performance in downstream tasks. This work demonstrates the potential of generative modeling for ECG reconstruction and its implications for scalable, low-cost cardiac screening.

[1002] FedLoRA-Optimizer: Federated LoRA Fine-Tuning with Global and Local Optimization in Heterogeneous Data Scenarios

Jianzhe Zhao, Hailin Zhu, Yu Zhang, Ziqi Chen, Guibing Guo

Main category: cs.LG

TL;DR: Proposes a fine-grained federated LoRA tuning method that separates directional vectors (shared knowledge) and magnitude vectors (personalized knowledge) to address client drift and improve both global generalization and local personalization in heterogeneous federated learning.

Details

Motivation: Address challenges in federated efficient fine-tuning including client drift in heterogeneous data scenarios, weak global model generalization, and failure to meet personalized client needs. Existing federated LoRA methods overlook fine-grained analysis of tuning matrices.

Method: Fine-grained federated LoRA tuning that fine-tunes sensitive directional vectors in A matrix for shared knowledge and sensitive magnitude vectors in B matrix for personalized knowledge. Uses pipeline combining global and local optimizers for collaborative optimization.

Result: Experiments on Databricks-Dolly-15k and Natural Instructions with LLaMA2-7B and Deepseek-7B show improvements: global performance increased by 0.39% and local performance by 0.59%.

Conclusion: The proposed method effectively improves both global model generalization and local model personalization in heterogeneous federated learning scenarios by fine-grained analysis and optimization of LoRA matrices.

Abstract: Federated efficient fine-tuning has emerged as an approach that leverages distributed data and computational resources across nodes to address the challenges of large-scale fine-tuning and privacy preservation. The Low-Rank Adaptation (LoRA) enables efficient fine-tuning of large-scale pre-trained models by introducing trainable low-rank matrices into weight updates.However, in heterogeneous data scenarios, client drift weakens the generalization of the global model, and local models often fail to meet the personalized needs of individual clients.Moreover, existing federated LoRA efficient fine-tuning techniques overlook fine-grained analysis of the tuning matrices. To address this, we conducted preliminary experiments and found that different LoRA matrices exhibit different sensitivity to changes in the direction and magnitude of their vectors.We thus propose a fine-grained federated LoRA tuning method. By fine-tuning the more sensitive directional vectors in the A matrix, which encode shared knowledge, our method learns shared features more effectively across clients and enhances global generalization. Simultaneously, by fine-tuning the more sensitive magnitude vectors in the B matrix, which encode personalized knowledge, our method better captures personalized knowledge, enabling detailed adaptation to local data. The method uses a pipeline combining global and local optimizers. Global optimization further improves local models, achieving collaborative optimization between global and local levels. This improves both the generalization ability of the global model and the personalized adaptation of local models under heterogeneous data scenarios. Experiments on Databricks-Dolly-15k and Natural Instructions with LLaMA2-7B and Deepseek-7B confirm that our method improves global performance by 0.39% and local performance by 0.59%.

[1003] Iterative Amortized Inference: Unifying In-Context Learning and Learned Optimizers

Sarthak Mittal, Divyat Mahajan, Guillaume Lajoie, Mohammad Pezeshki

Main category: cs.LG

TL;DR: A unified framework for amortized learning methods that categorizes approaches based on what aspects of learning they amortize and how they incorporate task data, with a proposed iterative amortized inference method for better scalability.

Details

Motivation: To provide a unified understanding of various amortized learning approaches (meta-learning, in-context learning, etc.) that share similar goals but differ in implementation, and to address their scalability limitations with large datasets.

Method: Proposes a taxonomy categorizing amortized models into parametric, implicit, and explicit regimes based on task adaptation mechanisms, and introduces iterative amortized inference that refines solutions step-by-step over mini-batches.

Result: The framework successfully unifies diverse amortized learning approaches and the proposed iterative method addresses scalability issues by enabling processing of large datasets through mini-batch refinement.

Conclusion: The unified framework provides a comprehensive understanding of amortized learning, and iterative amortized inference offers a scalable foundation for general-purpose task adaptation by bridging optimization-based meta-learning with forward-pass amortization.

Abstract: Modern learning systems increasingly rely on amortized learning - the idea of reusing computation or inductive biases shared across tasks to enable rapid generalization to novel problems. This principle spans a range of approaches, including meta-learning, in-context learning, prompt tuning, learned optimizers and more. While motivated by similar goals, these approaches differ in how they encode and leverage task-specific information, often provided as in-context examples. In this work, we propose a unified framework which describes how such methods differ primarily in the aspects of learning they amortize - such as initializations, learned updates, or predictive mappings - and how they incorporate task data at inference. We introduce a taxonomy that categorizes amortized models into parametric, implicit, and explicit regimes, based on whether task adaptation is externalized, internalized, or jointly modeled. Building on this view, we identify a key limitation in current approaches: most methods struggle to scale to large datasets because their capacity to process task data at inference (e.g., context length) is often limited. To address this, we propose iterative amortized inference, a class of models that refine solutions step-by-step over mini-batches, drawing inspiration from stochastic optimization. Our formulation bridges optimization-based meta-learning with forward-pass amortization in models like LLMs, offering a scalable and extensible foundation for general-purpose task adaptation.

[1004] Vision-LLMs for Spatiotemporal Traffic Forecasting

Ning Yang, Hengyu Zhong, Haijun Zhang, Randall Berry

Main category: cs.LG

TL;DR: ST-Vision-LLM reframes spatiotemporal traffic forecasting as vision-language fusion, using Vision-LLM to process traffic matrices as images and specialized encoding for numerical data, achieving superior accuracy and generalization.

Details

Motivation: LLMs struggle with spatial dependencies in grid-based traffic data and inefficiently handle dense geographical information, requiring a specialized approach for spatiotemporal forecasting.

Method: Uses Vision-LLM visual encoder to process historical traffic matrices as image sequences, introduces efficient floating-point encoding as single tokens, and employs two-stage fine-tuning with SFT followed by GRPO reinforcement learning.

Result: Outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds second-best baseline by over 30.04% in cross-domain few-shot scenarios, demonstrating strong generalization in data-scarce environments.

Conclusion: ST-Vision-LLM effectively addresses LLM limitations in spatiotemporal forecasting through vision-language fusion and specialized numerical encoding, achieving state-of-the-art performance with strong generalization capabilities.

Abstract: Accurate spatiotemporal traffic forecasting is a critical prerequisite for proactive resource management in dense urban mobile networks. While Large Language Models (LLMs) have shown promise in time series analysis, they inherently struggle to model the complex spatial dependencies of grid-based traffic data. Effectively extending LLMs to this domain is challenging, as representing the vast amount of information from dense geographical grids can be inefficient and overwhelm the model’s context. To address these challenges, we propose ST-Vision-LLM, a novel framework that reframes spatiotemporal forecasting as a vision-language fusion problem. Our approach leverages a Vision-LLM visual encoder to process historical global traffic matrices as image sequences, providing the model with a comprehensive global view to inform cell-level predictions. To overcome the inefficiency of LLMs in handling numerical data, we introduce an efficient encoding scheme that represents floating-point values as single tokens via a specialized vocabulary, coupled with a two-stage numerical alignment fine-tuning process. The model is first trained with Supervised Fine-Tuning (SFT) and then further optimized for predictive accuracy using Group Relative Policy Optimization (GRPO), a memory-efficient reinforcement learning method. Evaluations on real-world mobile traffic datasets demonstrate that ST-Vision-LLM outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the second-best baseline by over 30.04% in cross-domain few-shot scenarios. Our extensive experiments validate the model’s strong generalization capabilities across various data-scarce environments.

[1005] Gym-TORAX: Open-source software for integrating RL with plasma control simulators

Antoine Mouchamps, Arthur Malherbe, Adrien Bolland, Damien Ernst

Main category: cs.LG

TL;DR: Gym-TORAX is a Python package that creates RL environments for tokamak plasma control simulation by wrapping TORAX, enabling RL research in plasma dynamics.

Details

Motivation: To facilitate Reinforcement Learning research in plasma control by providing easy-to-use environments that simulate tokamak plasma dynamics and control scenarios.

Method: Users define control actions, observations, and objectives, then Gym-TORAX creates Gymnasium environments wrapping TORAX for plasma simulation with reward-based objectives.

Result: A Python package that generates RL-compatible environments for plasma control, with one ITER ramp-up scenario environment currently available.

Conclusion: Gym-TORAX enables RL algorithm application to tokamak plasma control problems and will accelerate RL research in this domain.

Abstract: This paper presents Gym-TORAX, a Python package enabling the implementation of Reinforcement Learning (RL) environments for simulating plasma dynamics and control in tokamaks. Users define succinctly a set of control actions and observations, and a control objective from which Gym-TORAX creates a Gymnasium environment that wraps TORAX for simulating the plasma dynamics. The objective is formulated through rewards depending on the simulated state of the plasma and control action to optimize specific characteristics of the plasma, such as performance and stability. The resulting environment instance is then compatible with a wide range of RL algorithms and libraries and will facilitate RL research in plasma control. In its current version, one environment is readily available, based on a ramp-up scenario of the International Thermonuclear Experimental Reactor (ITER).

[1006] Offline Reinforcement Learning with Generative Trajectory Policies

Xinsong Feng, Leshu Tang, Chenan Wang, Haipeng Chen

Main category: cs.LG

TL;DR: Generative Trajectory Policies (GTPs) bridge the performance-speed trade-off in offline RL by unifying diffusion, flow matching, and consistency models under an ODE framework, achieving state-of-the-art results.

Details

Motivation: Existing generative policies for offline RL face a trade-off: slow diffusion policies vs fast but lower-performing consistency models. There's a need to bridge this gap.

Method: Propose GTPs that view generative models as instances of learning continuous-time trajectories governed by ODEs. Introduce two principled adaptations for practical offline RL implementation.

Result: GTP achieves state-of-the-art performance on D4RL benchmarks, significantly outperforming prior generative policies and achieving perfect scores on hard AntMaze tasks.

Conclusion: The ODE-based unifying perspective enables more general and effective generative policies that overcome the performance-speed trade-off in offline reinforcement learning.

Abstract: Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture complex, multi-modal behaviors. However, existing methods face a stark trade-off: slow, iterative models like diffusion policies are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance. In this paper, we demonstrate that it is possible to bridge this gap. The key to moving beyond the limitations of individual methods, we argue, lies in a unifying perspective that views modern generative models, including diffusion, flow matching, and consistency models, as specific instances of learning a continuous-time generative trajectory governed by an Ordinary Differential Equation (ODE). This principled foundation provides a clearer design space for generative policies in RL and allows us to propose Generative Trajectory Policies (GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE. To make this paradigm practical for offline RL, we further introduce two key theoretically principled adaptations. Empirical results demonstrate that GTP achieves state-of-the-art performance on D4RL benchmarks - it significantly outperforms prior generative policies, achieving perfect scores on several notoriously hard AntMaze tasks.

[1007] DiffStyleTS: Diffusion Model for Style Transfer in Time Series

Mayank Nagda, Phil Ostheimer, Justus Arweiler, Indra Jungjohann, Jennifer Werner, Dennis Wagner, Aparna Muraleedharan, Pouya Jafari, Jochen Schmid, Fabian Jirasek, Jakob Burger, Michael Bortz, Hans Hasse, Stephan Mandt, Marius Kloft, Sophie Fellenz

Main category: cs.LG

TL;DR: DiffTSST is a diffusion-based framework for time series style transfer that disentangles content and style representations using convolutional encoders and recombines them through self-supervised attention-based diffusion.

Details

Motivation: Style transfer is well-developed in vision and language domains but remains limited for time series data, despite its potential for applications like data augmentation and scenario simulation to help machine learning models generalize in data-scarce domains.

Method: Uses convolutional encoders to disentangle time series into content and style representations, then recombines them through a self-supervised attention-based diffusion process. At inference, encoders extract content and style from two distinct series for conditional generation.

Result: Demonstrated both qualitatively and quantitatively that DiffTSST achieves effective style transfer. Data augmentation with DiffTSST improves anomaly detection in data-scarce regimes.

Conclusion: DiffTSST provides an effective diffusion-based framework for time series style transfer that enables practical applications like data augmentation for improving model performance in data-scarce scenarios.

Abstract: Style transfer combines the content of one signal with the style of another. It supports applications such as data augmentation and scenario simulation, helping machine learning models generalize in data-scarce domains. While well developed in vision and language, style transfer methods for time series data remain limited. We introduce DiffTSST, a diffusion-based framework that disentangles a time series into content and style representations via convolutional encoders and recombines them through a self-supervised attention-based diffusion process. At inference, encoders extract content and style from two distinct series, enabling conditional generation of novel samples to achieve style transfer. We demonstrate both qualitatively and quantitatively that DiffTSST achieves effective style transfer. We further validate its real-world utility by showing that data augmentation with DiffTSST improves anomaly detection in data-scarce regimes.

[1008] FedHybrid: Breaking the Memory Wall of Federated Learning via Hybrid Tensor Management

Kahou Tam, Chunlin Tian, Li Li, Haikai Zhao, ChengZhong Xu

Main category: cs.LG

TL;DR: FedHybrid is a federated learning framework that reduces memory usage on mobile devices through hybrid recomputation and compression techniques while maintaining model accuracy.

Details

Motivation: Memory limitation on mobile devices hinders federated learning deployment, requiring solutions that reduce memory footprint without compromising training progress.

Method: Selects devices based on memory budget, computing capability, and data diversity; analyzes computational graphs to generate execution plans using hybrid recomputation and compression; employs activation compression during local training.

Result: Achieves up to 39.1% increase in model accuracy and 15.5× reduction in wall clock time under various memory budgets compared to baselines.

Conclusion: FedHybrid effectively addresses memory constraints in federated learning while maintaining training efficiency and model performance.

Abstract: Federated Learning (FL) emerges as a new learning paradigm that enables multiple devices to collaboratively train a shared model while preserving data privacy. However, one fundamental and prevailing challenge that hinders the deployment of FL on mobile devices is the memory limitation. This paper proposes \textit{FedHybrid}, a novel framework that effectively reduces the memory footprint during the training process while guaranteeing the model accuracy and the overall training progress. Specifically, \textit{FedHybrid} first selects the participating devices for each training round by jointly evaluating their memory budget, computing capability, and data diversity. After that, it judiciously analyzes the computational graph and generates an execution plan for each selected client in order to meet the corresponding memory budget while minimizing the training delay through employing a hybrid of recomputation and compression techniques according to the characteristic of each tensor. During the local training process, \textit{FedHybrid} carries out the execution plan with a well-designed activation compression technique to effectively achieve memory reduction with minimum accuracy loss. We conduct extensive experiments to evaluate \textit{FedHybrid} on both simulation and off-the-shelf mobile devices. The experiment results demonstrate that \textit{FedHybrid} achieves up to a 39.1% increase in model accuracy and a 15.5$\times$ reduction in wall clock time under various memory budgets compared with the baselines.

[1009] Leveraging LLMs for Semi-Automatic Corpus Filtration in Systematic Literature Reviews

Lucas Joos, Daniel A. Keim, Maximilian T. Fischer

Main category: cs.LG

TL;DR: A pipeline using multiple LLMs with consensus voting and human supervision through a visual interface (LLMSurver) to automate systematic literature review paper classification, reducing manual effort while maintaining accuracy.

Details

Motivation: Systematic literature reviews require extensive manual effort for paper retrieval and filtering, as keyword searches in digital libraries often return many irrelevant publications.

Method: Leverage multiple LLMs with descriptive prompts and consensus scheme, supervised via open-source visual analytics interface (LLMSurver) for real-time inspection and modification.

Result: Evaluated on 8,000+ papers, pipeline significantly reduces manual effort while achieving lower error rates than single human annotators. Open-source models prove sufficient and cost-effective.

Conclusion: Responsible human-AI collaboration can accelerate and enhance systematic literature reviews within academic workflows.

Abstract: The creation of systematic literature reviews (SLR) is critical for analyzing the landscape of a research field and guiding future research directions. However, retrieving and filtering the literature corpus for an SLR is highly time-consuming and requires extensive manual effort, as keyword-based searches in digital libraries often return numerous irrelevant publications. In this work, we propose a pipeline leveraging multiple large language models (LLMs), classifying papers based on descriptive prompts and deciding jointly using a consensus scheme. The entire process is human-supervised and interactively controlled via our open-source visual analytics web interface, LLMSurver, which enables real-time inspection and modification of model outputs. We evaluate our approach using ground-truth data from a recent SLR comprising over 8,000 candidate papers, benchmarking both open and commercial state-of-the-art LLMs from mid-2024 and fall 2025. Results demonstrate that our pipeline significantly reduces manual effort while achieving lower error rates than single human annotators. Furthermore, modern open-source models prove sufficient for this task, making the method accessible and cost-effective. Overall, our work demonstrates how responsible human-AI collaboration can accelerate and enhance systematic literature reviews within academic workflows.

[1010] Differentiable Fast Top-K Selection for Large-Scale Recommendation

Yanjie Zhu, Zhen Zhang, Yunli Wang, Zhiqiang Wang, Yu Li, Rufan Zhou, Shiyang Wen, Peng Jiang, Chenhao Lin, Jian Yang

Main category: cs.LG

TL;DR: DFTopK is a novel differentiable Top-K operator that achieves O(n) time complexity, enabling efficient end-to-end training in large-scale recommendation systems while avoiding gradient conflicts of existing methods.

Details

Motivation: Existing differentiable Top-K methods suffer from O(n log n) complexity due to sorting dependencies, and differentiable sorting-based approaches introduce gradient conflicts through matrix aggregation, hindering efficient training in large-scale retrieval systems.

Method: DFTopK relaxes normalization constraints to admit a closed-form solution, avoiding sorting entirely and achieving optimal linear-time complexity for Top-K selection while bypassing the gradient conflicts of permutation matrix-based methods.

Result: DFTopK significantly improves training efficiency and achieves superior performance, enabling scaling up of training samples. In online A/B tests, it yielded +1.77% revenue lift with the same computational budget compared to baselines.

Conclusion: DFTopK is the first differentiable Top-K operator introduced to recommendation systems and the first to achieve theoretically optimal linear-time complexity, providing an efficient solution for end-to-end training in large-scale retrieval systems.

Abstract: Cascade ranking is a widely adopted paradigm in large-scale information retrieval systems for Top-K item selection. However, the Top-K operator is non-differentiable, hindering end-to-end training. Existing methods include Learning-to-Rank approaches (e.g., LambdaLoss), which optimize ranking metrics like NDCG and suffer from objective misalignment, and differentiable sorting-based methods (e.g., ARF, LCRON), which relax permutation matrices for direct Top-K optimization but introduce gradient conflicts through matrix aggregation. A promising alternative is to directly construct a differentiable approximation of the Top-K selection operator, bypassing the use of soft permutation matrices. However, even state-of-the-art differentiable Top-K operator (e.g., LapSum) require $O(n \log n)$ complexity due to their dependence on sorting for solving the threshold. Thus, we propose DFTopK, a novel differentiable Top-K operator achieving optimal $O(n)$ time complexity. By relaxing normalization constraints, DFTopK admits a closed-form solution and avoids sorting. DFTopK also avoids the gradient conflicts inherent in differentiable sorting-based methods. We evaluate DFTopK on both the public benchmark RecFLow and an industrial system. Experimental results show that DFTopK significantly improves training efficiency while achieving superior performance, which enables us to scale up training samples more efficiently. In the online A/B test, DFTopK yielded a +1.77% revenue lift with the same computational budget compared to the baseline. To the best of our knowledge, this work is the first to introduce differentiable Top-K operators into recommendation systems and the first to achieve theoretically optimal linear-time complexity for Top-K selection. We have open-sourced our implementation to facilitate future research in both academia and industry.

[1011] Rescaling-Aware Training for Efficient Deployment of Deep Learning Models on Full-Integer Hardware

Lion Mueller, Alberto Garcia-Ortiz, Ardalan Najafi, Adam Fuks, Lennart Bamberg

Main category: cs.LG

TL;DR: This paper proposes methods to reduce the computational cost of integer rescaling operations in AI inference by applying stronger quantization to rescale multiplicands and introducing Rescale-Aware Training for ultra-low bit-width rescaling.

Details

Motivation: Integer AI inference reduces computational complexity in embedded systems, but quantization-aware training overlooks the impact of costly integer rescaling operations during inference, which are hardware expensive.

Method: The paper applies stronger quantization to rescale multiplicands post-training and introduces Rescale-Aware Training, a fine-tuning method for ultra-low bit-width rescaling multiplicands.

Result: Experiments show that even with 8x reduced rescaler widths, full accuracy is preserved through minimal incremental retraining.

Conclusion: This approach enables more energy-efficient and cost-efficient AI inference for resource-constrained embedded systems by dramatically reducing rescaling costs without model-quality loss.

Abstract: Integer AI inference significantly reduces computational complexity in embedded systems. Quantization-aware training (QAT) helps mitigate accuracy degradation associated with post-training quantization but still overlooks the impact of integer rescaling during inference, which is a hardware costly operation in integer-only AI inference. This work shows that rescaling cost can be dramatically reduced post-training, by applying a stronger quantization to the rescale multiplicands at no model-quality loss. Furthermore, we introduce Rescale-Aware Training, a fine tuning method for ultra-low bit-width rescaling multiplicands. Experiments show that even with 8x reduced rescaler widths, the full accuracy is preserved through minimal incremental retraining. This enables more energy-efficient and cost-efficient AI inference for resource-constrained embedded systems.

[1012] How Reinforcement Learning After Next-Token Prediction Facilitates Learning

Nikolaos Tsilivis, Eran Malach, Karen Ullrich, Julia Kempe

Main category: cs.LG

TL;DR: The paper analyzes why reinforcement learning (RL) after next-token prediction enables better reasoning than next-token prediction alone, particularly for tasks like bit parity prediction where long reasoning sequences are rare.

Details

Motivation: To understand the optimization mechanisms behind the success of RL training for reasoning tasks in large language models, and why it outperforms next-token prediction alone.

Method: Theoretical framework studying learning from mixture distributions of short and long chain-of-thought sequences, with experiments on bit parity prediction and mathematical reasoning benchmarks using autoregressive transformers and linear models.

Result: RL after next-token prediction enables generalization for bit parity prediction even when long demonstrations are rare, while next-token prediction alone requires extreme resources. RL leverages longer test-time computation to facilitate learning.

Conclusion: The RL training paradigm efficiently enables reasoning generalization by leveraging rare long demonstrations and increased test-time computation, providing theoretical justification for current LLM training practices.

Abstract: Recent advances in reasoning domains with neural networks have primarily been enabled by a training recipe that optimizes Large Language Models, previously trained to predict the next-token in a sequence, with reinforcement learning algorithms. We introduce a framework to study the success of this paradigm, and we theoretically expose the optimization mechanisms by which reinforcement learning improves over next-token prediction in this setting. We study learning from mixture distributions of short and long ``chain-of-thought’’ sequences encoding a single task. In particular, when the task consists of predicting the parity of $d$ bits and long sequences are rare, we show how reinforcement learning after next-token prediction enables autoregressive transformers to generalize, whereas mere next-token prediction requires extreme statistical or computational resources to do so. We further explain how reinforcement learning leverages increased test-time computation, manifested in longer responses, to facilitate this learning process. In a simplified setting, we theoretically prove that autoregressive linear models following this training recipe can efficiently learn to predict the parity of $d$ bits as long as the proportion of long demonstrations in the data mix is not exponentially small in the input dimension $d$. Finally, we demonstrate these same phenomena in other settings, including the post-training of Llama-series models on mixture variations of common mathematical reasoning benchmarks.

[1013] Query-Specific GNN: A Comprehensive Graph Representation Learning Method for Retrieval Augmented Generation

Yuchen Yan, Zhihua Liu, Hao Wang, Weiming Li, Xiaoshuai Hao

Main category: cs.LG

TL;DR: A graph representation learning framework using Multi-information Level Knowledge Graph and Query-Specific Graph Neural Network to improve multi-hop question retrieval in RAG systems.

Details

Motivation: Existing RAG systems struggle with multi-hop questions due to difficulty understanding complex semantic structures and susceptibility to irrelevant noise when retrieving multiple information targets.

Method: Proposes Multi-information Level Knowledge Graph (Multi-L KG) for comprehensive question modeling and Query-Specific Graph Neural Network (QSGNN) with intra/inter-level message passing guided by queries, plus two synthesized data generation strategies for pre-training.

Result: Extensive experiments show significant effectiveness in multi-hop scenarios, with up to 33.8% improvement on high-hop questions.

Conclusion: The proposed framework successfully addresses multi-hop question retrieval challenges in RAG systems through graph representation learning and query-guided information aggregation.

Abstract: Retrieval-augmented generation (RAG) has demonstrated its ability to enhance Large Language Models (LLMs) by integrating external knowledge sources. However, multi-hop questions, which require the identification of multiple knowledge targets to form a synthesized answer, raise new challenges for RAG systems. Under the multi-hop settings, existing methods often struggle to fully understand the questions with complex semantic structures and are susceptible to irrelevant noise during the retrieval of multiple information targets. To address these limitations, we propose a novel graph representation learning framework for multi-hop question retrieval. We first introduce a Multi-information Level Knowledge Graph (Multi-L KG) to model various information levels for a more comprehensive understanding of multi-hop questions. Based on this, we design a Query-Specific Graph Neural Network (QSGNN) for representation learning on the Multi-L KG. QSGNN employs intra/inter-level message passing mechanisms, and in each message passing the information aggregation is guided by the query, which not only facilitates multi-granular information aggregation but also significantly reduces the impact of noise. To enhance its ability to learn robust representations, we further propose two synthesized data generation strategies for pre-training the QSGNN. Extensive experimental results demonstrate the effectiveness of our framework in multi-hop scenarios, especially in high-hop questions the improvement can reach 33.8%. The code is available at: https://github.com/Jerry2398/QSGNN.

[1014] Context-Aware Model-Based Reinforcement Learning for Autonomous Racing

Emran Yasser Moustafa, Ivana Dusparic

Main category: cs.LG

TL;DR: This paper explores model-based reinforcement learning (MBRL) for autonomous racing, proposing a context-aware extension called cMask that improves generalization to unseen adversary behaviors compared to context-free approaches.

Details

Motivation: Autonomous vehicles need algorithms that can generalize to unseen scenarios. While MBRL shows strong performance, it's susceptible to changes in environment dynamics. The paper aims to improve MBRL's generalization capabilities for autonomous driving tasks.

Method: Framed head-to-head racing as contextual Markov decision processes, parameterizing adversary behavior using episode context. Proposed cMask, a novel context-aware extension of MBRL algorithms, and benchmarked performance in Roboracer simulation environment.

Result: Context-aware MBRL algorithms generalize better to out-of-distribution adversary behaviors than context-free approaches. cMask showed strong generalization capabilities and further performance improvement against in-distribution adversaries compared to other context-aware MBRL methods.

Conclusion: Context-aware MBRL, particularly the proposed cMask algorithm, significantly improves generalization in autonomous racing scenarios, making it more suitable for real-world deployment where environments and adversary behaviors can vary unpredictably.

Abstract: Autonomous vehicles have shown promising potential to be a groundbreaking technology for improving the safety of road users. For these vehicles, as well as many other safety-critical robotic technologies, to be deployed in real-world applications, we require algorithms that can generalize well to unseen scenarios and data. Model-based reinforcement learning algorithms (MBRL) have demonstrated state-of-the-art performance and data efficiency across a diverse set of domains. However, these algorithms have also shown susceptibility to changes in the environment and its transition dynamics. In this work, we explore the performance and generalization capabilities of MBRL algorithms for autonomous driving, specifically in the simulated autonomous racing environment, Roboracer (formerly F1Tenth). We frame the head-to-head racing task as a learning problem using contextual Markov decision processes and parameterize the driving behavior of the adversaries using the context of the episode, thereby also parameterizing the transition and reward dynamics. We benchmark the behavior of MBRL algorithms in this environment and propose a novel context-aware extension of the existing literature, cMask. We demonstrate that context-aware MBRL algorithms generalize better to out-of-distribution adversary behaviors relative to context-free approaches. We also demonstrate that cMask displays strong generalization capabilities, as well as further performance improvement relative to other context-aware MBRL approaches when racing against adversaries with in-distribution behaviors.

[1015] Learning to Make MISTAKEs: Modeling Incorrect Student Thinking And Key Errors

Alexis Ross, Jacob Andreas

Main category: cs.LG

TL;DR: MISTAKE is a method that generates synthetic reasoning error examples using cycle consistency between incorrect answers and latent misconceptions, then uses this data to train models for student simulation, misconception classification, and answer generation.

Details

Motivation: Most reasoning research focuses on correct outputs, but applications like educational feedback systems require modeling incorrect reasoning patterns to simulate student errors and provide targeted feedback.

Method: Leverages cycle consistency between incorrect answers and latent misconceptions to construct high-quality synthetic error examples, then uses this generated data to train models for three educational tasks.

Result: MISTAKE achieves higher accuracy in simulating incorrect student answers, better performance in inferring latent misconceptions from incorrect answers, and generates incorrect answers that better align with expert-written distractors.

Conclusion: The method successfully addresses the need for modeling incorrect reasoning patterns in educational applications, demonstrating improved performance across multiple tasks related to student error simulation and misconception analysis.

Abstract: Research on reasoning in language models (LMs) predominantly focuses on improving the correctness of their outputs. But some important applications require modeling reasoning patterns that are incorrect. For example, automated systems that can reason about and simulate student errors are useful for providing real-time feedback in the classroom or offline practice for educators-in-training. This paper presents a new method, MISTAKE, that (1) constructs high-quality synthetic examples of reasoning errors by leveraging cycle consistency between incorrect answers and latent misconceptions; and (2) uses the generated data to learn models for student simulation, misconception classification, and answer generation. We evaluate MISTAKE on three educational tasks and find that it results in (1) higher accuracy when simulating incorrect student answers based on specific misconceptions, (2) increased performance inferring latent misconceptions from observed incorrect answers, and (3) higher alignment with expert-written distractor answers when generating incorrect answers (e.g., for multiple-choice tests).

[1016] Knowledge-Guided Machine Learning Models to Upscale Evapotranspiration in the U.S. Midwest

Aleksei Rozanov, Samikshya Subedi, Vasudha Sharma, Bryan C. Runck

Main category: cs.LG

TL;DR: This study develops a machine learning approach using LightGBM with knowledge-guided features to upscale evapotranspiration (ET) across the Midwest US, achieving high accuracy (R²=0.86) and providing a 500m daily resolution ET data product.

Details

Motivation: Accurate ET quantification across spatiotemporal scales is challenging. In-situ measurements like eddy covariance only provide single-location data, while agricultural applications require field-level estimates over broad areas, making widespread sensor deployment impractical.

Method: Integrated tree-based ML models (Random Forest, CatBoost, XGBoost, LightGBM) and neural networks with knowledge-guided feature engineering using multispectral remote sensing, gridded meteorology, and EC data. Used k-fold cross-validation with site-year stratified splits to prevent data leakage.

Result: LightGBM with knowledge-guided features performed best with R²=0.86, MSE=14.99 W m⁻², and MAE=8.82 W m⁻². Feature importance analysis showed knowledge-guided features were most critical. Produced a 500m daily resolution ET data product (2019-2024) that showed best-in-class correspondence with state-level weather station estimates.

Conclusion: The knowledge-guided ML approach successfully upscaled ET measurements across the Midwest US, providing accurate, high-resolution ET estimates that address the limitations of traditional in-situ measurement methods for agricultural applications.

Abstract: Evapotranspiration (ET) plays a critical role in the land-atmosphere interactions, yet its accurate quantification across various spatiotemporal scales remains a challenge. In situ measurement approaches, like eddy covariance (EC) or weather station-based ET estimation, allow for measuring ET at a single location. Agricultural uses of ET require estimates for each field over broad areas, making it infeasible to deploy sensing systems at each location. This study integrates tree-based and knowledge-guided machine learning (ML) techniques with multispectral remote sensing data, griddled meteorology and EC data to upscale ET across the Midwest United States. We compare four tree-based models - Random Forest, CatBoost, XGBoost, LightGBM - and a simple feed-forward artificial neural network in combination with features engineered using knowledge-guided ML principles. Models were trained and tested on EC towers located in the Midwest of the United States using k-fold cross validation with k=5 and site-year, biome stratified train-test split to avoid data leakage. Results show that LightGBM with knowledge-guided features outperformed other methods with an R2=0.86, MSE=14.99 W m^-2 and MAE = 8.82 W m^-2 according to grouped k-fold validation (k=5). Feature importance analysis shows that knowledge-guided features were most important for predicting evapotranspiration. Using the best performing model, we provide a data product at 500 m spatial and one-day temporal resolution for gridded ET for the period of 2019-2024. Intercomparison between the new gridded product and state-level weather station-based ET estimates show best-in-class correspondence.

[1017] Attention Factors for Statistical Arbitrage

Elliot L. Epstein, Rose Wang, Jaewon Choi, Markus Pelger

Main category: cs.LG

TL;DR: The paper develops an Attention Factor model for statistical arbitrage that jointly learns conditional latent factors from firm characteristics and identifies mispricing signals using sequence models, achieving high Sharpe ratios net of transaction costs.

Details

Motivation: To develop a framework that can jointly identify similar assets through factors, detect mispricing, and form trading policies that maximize risk-adjusted performance after accounting for trading costs in statistical arbitrage.

Method: Uses Attention Factors - conditional latent factors learned from firm characteristic embeddings that allow complex interactions. Identifies time-series signals from residual portfolios using general sequence models. Jointly estimates factors and arbitrage trading strategy.

Result: Achieves out-of-sample Sharpe ratio above 4 on largest U.S. equities over 24-year period. One-step solution yields unprecedented Sharpe ratio of 2.3 net of transaction costs. Shows weak factors are important for arbitrage trading.

Conclusion: Joint estimation of factors and trading strategy is crucial for maximizing profitability after trading costs. The Attention Factor model demonstrates superior performance in statistical arbitrage, particularly highlighting the importance of weak factors.

Abstract: Statistical arbitrage exploits temporal price differences between similar assets. We develop a framework to jointly identify similar assets through factors, identify mispricing and form a trading policy that maximizes risk-adjusted performance after trading costs. Our Attention Factors are conditional latent factors that are the most useful for arbitrage trading. They are learned from firm characteristic embeddings that allow for complex interactions. We identify time-series signals from the residual portfolios of our factors with a general sequence model. Estimating factors and the arbitrage trading strategy jointly is crucial to maximize profitability after trading costs. In a comprehensive empirical study we show that our Attention Factor model achieves an out-of-sample Sharpe ratio above 4 on the largest U.S. equities over a 24-year period. Our one-step solution yields an unprecedented Sharpe ratio of 2.3 net of transaction costs. We show that weak factors are important for arbitrage trading.

[1018] Ontolearn-A Framework for Large-scale OWL Class Expression Learning in Python

Caglar Demir, Alkid Baci, N’Dah Jean Kouagou, Leonie Nora Sieger, Stefan Heindorf, Simon Bin, Lukas Blübaum, Alexander Bigerl, Axel-Cyrille Ngonga Ngomo

Main category: cs.LG

TL;DR: Ontolearn is a framework for learning OWL class expressions from large knowledge graphs, implementing state-of-the-art symbolic and neuro-symbolic learners with verbalization and SPARQL query capabilities.

Details

Motivation: To provide an efficient framework for learning OWL class expressions over large knowledge graphs, enabling classification of instances and making complex expressions accessible through natural language translation.

Method: Implements symbolic and neuro-symbolic class expression learners (EvoLearner, DRILL), integrates LLM-based verbalization module for natural language translation, and maps OWL expressions to SPARQL queries for remote triplestore operations.

Result: Developed a comprehensive framework capable of learning OWL class expressions, verbalizing them into natural language, and operating over remote triplestores through SPARQL query mapping.

Conclusion: Ontolearn provides an effective and accessible framework for OWL class expression learning with practical applications in knowledge graph analysis and instance classification, available as open-source software.

Abstract: In this paper, we present Ontolearn-a framework for learning OWL class expressions over large knowledge graphs. Ontolearn contains efficient implementations of recent stateof-the-art symbolic and neuro-symbolic class expression learners including EvoLearner and DRILL. A learned OWL class expression can be used to classify instances in the knowledge graph. Furthermore, Ontolearn integrates a verbalization module based on an LLM to translate complex OWL class expressions into natural language sentences. By mapping OWL class expressions into respective SPARQL queries, Ontolearn can be easily used to operate over a remote triplestore. The source code of Ontolearn is available at https://github.com/dice-group/Ontolearn.

[1019] Diffusion-DFL: Decision-focused Diffusion Models for Stochastic Optimization

Zihao Zhao, Christopher Yeh, Lingkai Kong, Kai Wang

Main category: cs.LG

TL;DR: Proposes the first diffusion-based decision-focused learning approach that trains diffusion models to represent parameter distributions and optimize decisions through stochastic optimization.

Details

Motivation: Existing DFL methods rely on deterministic point predictions, which are insufficient to capture real-world stochasticity. There's a need to address uncertainty in decision-focused learning.

Method: Uses diffusion models to represent parameter distributions and solves stochastic optimization with samples from the diffusion model. Introduces both reparameterization-based training and a lightweight score function estimator that avoids backpropagation through sampling.

Result: The diffusion DFL approach consistently outperforms strong baselines in decision quality across experiments.

Conclusion: Diffusion models provide an effective framework for handling uncertainty in decision-focused learning, with both reparameterization and score function methods offering practical training approaches.

Abstract: Decision-focused learning (DFL) integrates predictive modeling and optimization by training predictors to optimize the downstream decision target rather than merely minimizing prediction error. To date, existing DFL methods typically rely on deterministic point predictions, which are often insufficient to capture the intrinsic stochasticity of real-world environments. To address this challenge, we propose the first diffusion-based DFL approach, which trains a diffusion model to represent the distribution of uncertain parameters and optimizes the decision by solving a stochastic optimization with samples drawn from the diffusion model. Our contributions are twofold. First, we formulate diffusion DFL using the reparameterization trick, enabling end-to-end training through diffusion. While effective, it is memory and compute-intensive due to the need to differentiate through the diffusion sampling process. Second, we propose a lightweight score function estimator that uses only several forward diffusion passes and avoids backpropagation through the sampling. This follows from our results that backpropagating through stochastic optimization can be approximated by a weighted score function formulation. We empirically show that our diffusion DFL approach consistently outperforms strong baselines in decision quality. The source code for all experiments is available at the project repository: https://github.com/GT-KOALA/Diffusion_DFL.

[1020] MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model

Prasanna Mayilvahanan, Ricardo Dominguez-Olmedo, Thaddäus Wiedemer, Wieland Brendel

Main category: cs.LG

TL;DR: The paper introduces MATH-B, a new benchmark designed to challenge current RL fine-tuned models by requiring reasoning beyond base model capabilities, even with large sampling budgets.

Details

Motivation: Existing RL fine-tuning methods mainly sharpen existing solution modes rather than discovering new reasoning skills, creating a plateau in mathematical reasoning capabilities despite advances like DeepSeek-R1.

Method: Created MATH-B benchmark using problems from DAPO-Math-17K and DeepScaleR datasets that defeat common open-source models (up to 8B parameters) even with large sampling budgets like pass@1024.

Result: Current RL fine-tuned models (Nemotron-Research-Reasoning-Qwen-1.5B and DeepScaleR-1.5B-Preview) perform poorly on MATH-B, demonstrating that existing approaches fail on harder instances requiring novel reasoning.

Conclusion: MATH-B aims to catalyze exploration-driven RL approaches that can develop deeper reasoning capabilities beyond current fine-tuning methods.

Abstract: With the advent of DeepSeek-R1, a new wave of reinforcement learning (RL) methods has emerged that seem to unlock stronger mathematical reasoning. However, a closer look at the open-source ecosystem reveals a critical limitation: with sufficiently many draws (e.g., $\texttt{pass@1024}$), many existing base models already solve nearly all questions on widely used math benchmarks such as MATH-500 and AIME 2024. This suggests that the RL fine-tuning methods prevalent in the LLM reasoning literature largely sharpen existing solution modes rather than discovering entirely new ones. Such sharpening stands in contrast to the broader promise of RL: to foster exploration and to acquire new skills. To move beyond this plateau, we introduce MATH-Beyond (MATH-B), a benchmark deliberately constructed to defeat common open-source models of up to 8B parameters even under large sampling budgets. Improving performance on our benchmark via RL requires methods that learn to reason in ways that go beyond base model capabilities in repeated sampling. Since the problems are drawn from subsets of DAPO-Math-17K and DeepScaleR datasets, they remain topically equivalent to standard high-school math. Validating our premise, RL fine-tuned models such as Nemotron-Research-Reasoning-Qwen-1.5B and DeepScaleR-1.5B-Preview perform poorly on MATH-B at $\texttt{pass@1024}$, showing how existing approaches fall short on tackling harder instances. We hope MATH-B will catalyze exploration-driven RL approaches that elicit deeper reasoning capabilities. We release MATH-B at https://huggingface.co/datasets/brendel-group/MATH-Beyond.

[1021] An Eulerian Perspective on Straight-Line Sampling

Panos Tsimpos, Youssef Marzouk

Main category: cs.LG

TL;DR: The paper studies dynamic measure transport for generative modeling, focusing on stochastic processes that bridge source and target distributions. It characterizes which processes produce straight-line flows (with zero acceleration) and provides PDE conditions for straightness.

Details

Motivation: To understand which stochastic processes generate straight-line flows that are easier to integrate numerically, enabling more efficient generative modeling through measure transport.

Method: The authors provide a PDE characterization of straightness as a balance between conditional acceleration and divergence of a weighted covariance tensor. They analyze affine-in-time interpolants and deterministic endpoint couplings.

Result: Straight-line flows occur exactly under deterministic endpoint couplings. The paper fully characterizes affine-in-time interpolants and derives necessary conditions that constrain flow geometry for general processes.

Conclusion: The analysis offers guidance for designing transports that are easier to integrate, with deterministic endpoint couplings providing exact straight-line flows that can be integrated with first-order methods.

Abstract: We study dynamic measure transport for generative modeling: specifically, flows induced by stochastic processes that bridge a specified source and target distribution. The conditional expectation of the process’ velocity defines an ODE whose flow map achieves the desired transport. We ask \emph{which processes produce straight-line flows} – i.e., flows whose pointwise acceleration vanishes and thus are exactly integrable with a first-order method? We provide a concise PDE characterization of straightness as a balance between conditional acceleration and the divergence of a weighted covariance (Reynolds) tensor. Using this lens, we fully characterize affine-in-time interpolants and show that straightness occurs exactly under deterministic endpoint couplings. We also derive necessary conditions that constrain flow geometry for general processes, offering broad guidance for designing transports that are easier to integrate.

[1022] Chronologically Consistent Generative AI

Songrun He, Linying Lv, Asaf Manela, Jimmy Wu

Main category: cs.LG

TL;DR: A family of chronologically consistent LLMs trained only on pre-cutoff data to eliminate lookahead bias, providing open weights and a conservative forecast accuracy baseline.

Details

Motivation: To eliminate lookahead bias in prediction tasks by ensuring models are trained only on data available before specific cutoff dates, preventing training leakage from future information.

Method: Training large language models using only data available before clearly defined knowledge-cutoff dates, with strict temporal separation from post-cutoff data.

Result: Created instruction-following models with open weights that provide a conservative lower bound on forecast accuracy by removing training leakage effects.

Conclusion: The framework provides researchers with an easy-to-use generative AI tool for prediction tasks that is free of lookahead bias, ensuring temporal consistency and replicability.

Abstract: We introduce a family of chronologically consistent, instruction-following large language models to eliminate lookahead bias. Each model is trained only on data available before a clearly defined knowledge-cutoff date, ensuring strict temporal separation from any post-cutoff data. The resulting framework offers (i) a simple, conversational chat interface, (ii) fully open, fixed model weights that guarantee replicability, and (iii) a conservative lower bound on forecast accuracy, isolating the share of predictability that survives once training leakage is removed. Together, these features provide researchers with an easy-to-use generative AI tool useful for a wide range of prediction tasks that is free of lookahead bias.

[1023] Representation-Based Exploration for Language Models: From Test-Time to Post-Training

Jens Tuyls, Dylan J. Foster, Akshay Krishnamurthy, Jordan T. Ash

Main category: cs.LG

TL;DR: Deliberate exploration with representation-based diversity bonuses derived from pre-trained language models improves reasoning performance and efficiency in both post-training and inference-time settings.

Details

Motivation: To investigate whether current RL techniques promote discovery of novel behaviors in language models or just sharpen existing ones, and understand how pre-trained model knowledge can guide exploration.

Method: Use deliberate exploration with simple, principled representation-based bonuses derived from the pre-trained language model’s hidden states to incentivize discovery of novel and diverse behaviors.

Result: Significantly improves diversity and pass@k rates in both post-training and inference-time scaling. For Qwen-2.5-14b-Instruct, obtained over 50% improvement in verifier efficiency on most tasks. For post-training, Qwen-2.5-7b-Instruct’s pass@80 matches pass@256 of GRPO, demonstrating 3x improvement in test-time sample efficiency.

Conclusion: Deliberate exploration with the right notion of diversity is a practical path toward discovering new behaviors beyond just sharpening existing capabilities in language models.

Abstract: Reinforcement learning (RL) promises to expand the capabilities of language models, but it is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those already present in the base model. In this paper, we investigate the value of deliberate exploration – explicitly incentivizing the model to discover novel and diverse behaviors – and aim to understand how the knowledge in pre-trained models can guide this search. Our main finding is that exploration with a simple, principled, representation-based bonus derived from the pre-trained language model’s hidden states significantly improves diversity and pass@k rates – both for post-training, and in a novel inference-time scaling setting we introduce. For inference-time, exploration with representation-based diversity improves efficiency, consistently improving pass@k rates across a variety of models and reasoning tasks. For example, for Qwen-2.5-14b-Instruct we obtain over 50% improvement in verifier efficiency on almost all tasks. For post-training, we show that integrating this exploration strategy into an RL pipeline improves reasoning performance over that of the initial model and over standard RL post-training. For example, on AIME 2024, our post-trained Qwen-2.5-7b-Instruct’s pass@80 matches the pass@256 of GRPO on the same model, demonstrating a 3x improvement in test-time sample efficiency. Overall, our findings suggest that deliberate exploration – with the right notion of diversity – is a practical path toward discovery of new behaviors beyond sharpening.

[1024] Tight Regret Upper and Lower Bounds for Optimistic Hedge in Two-Player Zero-Sum Games

Taira Tsuchiya

Main category: cs.LG

TL;DR: This paper analyzes the optimality of optimistic Hedge in two-player zero-sum games, showing improved regret bounds of O(√(log m log n)) and proving these bounds are tight with matching upper and lower bounds including constant factors.

Details

Motivation: To investigate whether the existing O(log(mn)) regret bounds for optimistic Hedge can be improved and to establish the optimal dependence on the numbers of actions m and n in the regret analysis.

Method: Refined regret analysis by expressing the regret upper bound as an optimization problem with respect to learning rates and coefficients of negative terms, enabling analysis of leading constants. Provided algorithm-dependent individual regret lower bounds to prove optimality.

Result: Improved social and individual regret bounds to O(√(log m log n)) in strongly-uncoupled settings, with exact matching upper and lower bounds including constant factors. Also improved last-iterate convergence rate and dynamic regret.

Conclusion: The paper establishes the optimality of optimistic Hedge’s regret bounds, showing that the improved O(√(log m log n)) bounds cannot be further improved, and provides matching upper and lower bounds with exact constant factors.

Abstract: In two-player zero-sum games, the learning dynamic based on optimistic Hedge achieves one of the best-known regret upper bounds among strongly-uncoupled learning dynamics. With an appropriately chosen learning rate, the social and individual regrets can be bounded by $O(\log(mn))$ in terms of the numbers of actions $m$ and $n$ of the two players. This study investigates the optimality of the dependence on $m$ and $n$ in the regret of optimistic Hedge. To this end, we begin by refining existing regret analysis and show that, in the strongly-uncoupled setting where the opponent’s number of actions is known, both the social and individual regret bounds can be improved to $O(\sqrt{\log m \log n})$. In this analysis, we express the regret upper bound as an optimization problem with respect to the learning rates and the coefficients of certain negative terms, enabling refined analysis of the leading constants. We then show that the existing social regret bound as well as these new social and individual regret upper bounds cannot be further improved for optimistic Hedge by providing algorithm-dependent individual regret lower bounds. Importantly, these social regret upper and lower bounds match exactly including the constant factor in the leading term. Finally, building on these results, we improve the last-iterate convergence rate and the dynamic regret of a learning dynamic based on optimistic Hedge, and complement these bounds with algorithm-dependent dynamic regret lower bounds that match the improved bounds.

[1025] Adversarial Attacks Leverage Interference Between Features in Superposition

Edward Stevinson, Lucas Prieto, Melih Barsbey, Tolga Birdal

Main category: cs.LG

TL;DR: Adversarial vulnerability in neural networks stems from superposition (representing more features than dimensions), which creates interference patterns that adversaries exploit, rather than being artifacts of decision landscape irregularities or sensitivity to non-robust features.

Details

Motivation: To resolve competing views about the origins of adversarial examples - whether they are artifacts of decision landscape irregularities or sensitivity to non-robust features - by proposing that adversarial vulnerability arises from efficient information encoding through superposition.

Method: Theoretical analysis of superposition effects in neural networks and experimental validation in both synthetic settings with controlled superposition and a Vision Transformer (ViT) trained on CIFAR-10.

Result: Adversarial perturbations leverage interference between superposed features, attack patterns are predictable from feature arrangements, and superposition alone suffices to create adversarial vulnerability. Findings explain attack transferability between similar models and class-specific vulnerability patterns.

Conclusion: Adversarial vulnerability is a byproduct of networks’ representational compression through superposition, rather than flaws in learning processes or non-robust inputs.

Abstract: Fundamental questions remain about when and why adversarial examples arise in neural networks, with competing views characterising them either as artifacts of the irregularities in the decision landscape or as products of sensitivity to non-robust input features. In this paper, we instead argue that adversarial vulnerability can stem from efficient information encoding in neural networks. Specifically, we show how superposition - where networks represent more features than they have dimensions - creates arrangements of latent representations that adversaries can exploit. We demonstrate that adversarial perturbations leverage interference between superposed features, making attack patterns predictable from feature arrangements. Our framework provides a mechanistic explanation for two known phenomena: adversarial attack transferability between models with similar training regimes and class-specific vulnerability patterns. In synthetic settings with precisely controlled superposition, we establish that superposition suffices to create adversarial vulnerability. We then demonstrate that these findings persist in a ViT trained on CIFAR-10. These findings reveal adversarial vulnerability can be a byproduct of networks’ representational compression, rather than flaws in the learning process or non-robust inputs.

[1026] Reinforced sequential Monte Carlo for amortised sampling

Sanghyeok Choi, Sarthak Mittal, Víctor Elvira, Jinkyoo Park, Nikolay Malkin

Main category: cs.LG

TL;DR: This paper combines amortised neural samplers with particle-based methods (SMC) for sampling from unnormalized densities, using RL-trained policies as proposals and introducing off-policy training with SMC samples for better exploration.

Details

Motivation: To improve sampling from complex distributions by bridging amortised neural samplers and traditional Monte Carlo methods, addressing exploration challenges and training stability in neural samplers.

Method: Connects SMC with neural sequential samplers trained by MaxEnt RL, introduces off-policy RL training using SMC samples as behavior policy, joint training of proposals and twist functions, adaptive weight tempering, and experience replay with annealed importance sampling.

Result: Demonstrates improved approximation of true distributions and training stability on synthetic multi-modal targets and alanine dipeptide Boltzmann distributions compared to both amortised and Monte Carlo methods.

Conclusion: The synergy between amortised and particle-based methods provides more effective and stable sampling for complex distributions across continuous and discrete spaces.

Abstract: This paper proposes a synergy of amortised and particle-based methods for sampling from distributions defined by unnormalised density functions. We state a connection between sequential Monte Carlo (SMC) and neural sequential samplers trained by maximum-entropy reinforcement learning (MaxEnt RL), wherein learnt sampling policies and value functions define proposal kernels and twist functions. Exploiting this connection, we introduce an off-policy RL training procedure for the sampler that uses samples from SMC – using the learnt sampler as a proposal – as a behaviour policy that better explores the target distribution. We describe techniques for stable joint training of proposals and twist functions and an adaptive weight tempering scheme to reduce training signal variance. Furthermore, building upon past attempts to use experience replay to guide the training of neural samplers, we derive a way to combine historical samples with annealed importance sampling weights within a replay buffer. On synthetic multi-modal targets (in both continuous and discrete spaces) and the Boltzmann distribution of alanine dipeptide conformations, we demonstrate improvements in approximating the true distribution as well as training stability compared to both amortised and Monte Carlo methods.

[1027] Meta-Learning Adaptive Loss Functions

Christian Raymond, Qi Chen, Bing Xue, Mengjie Zhang

Main category: cs.LG

TL;DR: The paper proposes an online loss function learning method that adaptively updates loss functions during training, addressing limitations of offline approaches that only consider early training steps.

Details

Motivation: Existing loss function learning techniques are limited by their offline nature, considering only the first few training steps which causes bias towards loss functions that perform well initially but poorly later in training.

Method: Proposes an online loss function learning technique that adaptively updates the loss function after each update to the base model parameters, rather than meta-learning in an offline fashion.

Result: The proposed method consistently outperforms cross-entropy loss and offline loss function learning techniques across diverse neural network architectures and datasets.

Conclusion: Online adaptive loss function learning addresses the temporal bias in offline approaches and improves training dynamics and final inference performance.

Abstract: Loss function learning is a new meta-learning paradigm that aims to automate the essential task of designing a loss function for a machine learning model. Existing techniques for loss function learning have shown promising results, often improving a model’s training dynamics and final inference performance. However, a significant limitation of these techniques is that the loss functions are meta-learned in an offline fashion, where the meta-objective only considers the very first few steps of training, which is a significantly shorter time horizon than the one typically used for training deep neural networks. This causes significant bias towards loss functions that perform well at the very start of training but perform poorly at the end of training. To address this issue we propose a new loss function learning technique for adaptively updating the loss function online after each update to the base model parameters. The experimental results show that our proposed method consistently outperforms the cross-entropy loss and offline loss function learning techniques on a diverse range of neural network architectures and datasets.

[1028] Privacy-aware Gaussian Process Regression

Rui Tuo, Haoyuan Chen, Raktim Bhattacharya

Main category: cs.LG

TL;DR: A privacy-preserving Gaussian process regression method that adds synthetic noise to data until predictive variance reaches a specified privacy level, with optimal noise covariance formulated via semi-definite programming.

Details

Motivation: Address privacy concerns when data owners are unwilling to share high-fidelity supervised learning models with the public due to privacy risks.

Method: Add synthetic noise to data until Gaussian process predictive variance reaches target privacy level; formulate optimal noise covariance using semi-definite programming; introduce kernel-based approaches for continuous privacy constraints.

Result: Developed theoretical framework for privacy-aware Gaussian process regression with continuous privacy constraints; demonstrated effectiveness on satellite trajectory tracking and census dataset applications.

Conclusion: The proposed method provides a practical solution for privacy-preserving Gaussian process regression with theoretical guarantees, enabling data sharing while maintaining privacy through controlled noise addition.

Abstract: We propose a novel theoretical and methodological framework for Gaussian process regression subject to privacy constraints. The proposed method can be used when a data owner is unwilling to share a high-fidelity supervised learning model built from their data with the public due to privacy concerns. The key idea of the proposed method is to add synthetic noise to the data until the predictive variance of the Gaussian process model reaches a prespecified privacy level. The optimal covariance matrix of the synthetic noise is formulated in terms of semi-definite programming. We also introduce the formulation of privacy-aware solutions under continuous privacy constraints using kernel-based approaches, and study their theoretical properties. The proposed method is illustrated by considering a model that tracks the trajectories of satellites and a real application on a census dataset.

[1029] Expert-Aided Causal Discovery of Ancestral Graphs

Tiago da Silva, Bruna Bazaluk, Eliezer de Souza da Silva, António Góis, Dominik Heider, Samuel Kaski, Diego Mesquita, Adèle Helena Ribeiro

Main category: cs.LG

TL;DR: AGFN uses GFlowNets to sample ancestral graphs proportionally to score-based belief distributions, addressing uncertainty in causal discovery with latent confounders, and incorporates expert feedback through optimal experimental design.

Details

Motivation: Causal discovery algorithms are brittle with scarce data and lack uncertainty quantification, making results unreliable and difficult to diagnose, especially with latent confounders.

Method: Ancestral GFlowNets (AGFN) sample ancestral graphs proportionally to score-based belief distributions, with an elicitation framework for expert feedback including optimal experimental design and feedback incorporation.

Result: AGFN is competitive against other methods handling latent confounding on synthetic and real-world datasets, and incorporating feedback from human experts or LLMs improves inference quality.

Conclusion: AGFN provides a principled approach for uncertainty-aware causal discovery with latent confounders and demonstrates that expert feedback integration enhances causal inference reliability.

Abstract: Causal discovery (CD) algorithms are notably brittle when data is scarce, inferring unreliable causal relations that may contradict expert knowledge, especially when considering latent confounders. Furthermore, the lack of uncertainty quantification in most CD methods hinders users from diagnosing and refining results. To address these issues, we introduce Ancestral GFlowNets (AGFNs). AGFN samples ancestral graphs (AGs) proportionally to a score-based belief distribution representing our epistemic uncertainty over the causal relationships. Building upon this distribution, we propose an elicitation framework for expert-driven assessment. This framework comprises an optimal experimental design to probe the expert and a scheme to incorporate the obtained feedback into AGFN. Our experiments show that: i) AGFN is competitive against other methods that address latent confounding on both synthetic and real-world datasets; and ii) our design for incorporating feedback from a (simulated) human expert or a Large Language Model (LLM) improves inference quality.

[1030] Minibatch and Local SGD: Algorithmic Stability and Linear Speedup in Generalization

Yunwen Lei, Tao Sun, Mingrui Liu

Main category: cs.LG

TL;DR: This paper analyzes the stability and generalization of minibatch SGD and local SGD, showing they achieve linear speedup with optimal risk bounds through an expectation-variance decomposition method.

Details

Motivation: Existing theoretical studies focus on optimization errors in multi-pass settings, but stability and generalization of parallel optimization methods like minibatch SGD and local SGD are less studied. The paper aims to understand their learnability.

Method: Introduces an expectation-variance decomposition for stability analysis and incorporates training errors to show how small training errors help generalization for overparameterized models.

Result: Minibatch and local SGD achieve a linear speedup to attain the optimal risk bounds.

Conclusion: The proposed analysis framework demonstrates that parallel optimization methods can achieve optimal generalization performance with linear speedup, providing theoretical foundation for their practical effectiveness.

Abstract: The increasing scale of data propels the popularity of leveraging parallelism to speed up the optimization. Minibatch stochastic gradient descent (minibatch SGD) and local SGD are two popular methods for parallel optimization. The existing theoretical studies show a linear speedup of these methods with respect to the number of machines, which, however, is measured by optimization errors in a multi-pass setting. As a comparison, the stability and generalization of these methods are much less studied. In this paper, we study the stability and generalization analysis of minibatch and local SGD to understand their learnability by introducing an expectation-variance decomposition. We incorporate training errors into the stability analysis, which shows how small training errors help generalization for overparameterized models. We show minibatch and local SGD achieve a linear speedup to attain the optimal risk bounds.

[1031] Codiscovering graphical structure and functional relationships within data: A Gaussian Process framework for connecting the dots

Théo Bourdais, Pau Batlle, Xianjin Yang, Ricardo Baptista, Nicolas Rouquette, Houman Owhadi

Main category: cs.LG

TL;DR: The paper introduces a Gaussian Process framework for Type 3 problems - discovering both hypergraph structure and unknown functions from partial observations, with polynomial complexity.

Details

Motivation: Most scientific problems require data-driven discovery of computational hypergraphs, either with known structure (Type 2) or unknown structure (Type 3). Current methods have super-exponential complexity, creating a need for efficient approaches.

Method: Uses interpretable Gaussian Processes with nonlinear ANOVA capabilities as a sensing mechanism. The framework doesn’t require data randomization, controlled sampling, or sparsity assumptions in known/learned bases.

Result: Developed a polynomial complexity framework that contrasts with super-exponential complexity of causal inference methods. The approach enables discovery of hypergraph structure and function approximation simultaneously.

Conclusion: The introduced GP framework provides an efficient solution for Type 3 problems, offering interpretability and polynomial complexity for discovering computational hypergraph structures and their unknown functions from partial observations.

Abstract: Most problems within and beyond the scientific domain can be framed into one of the following three levels of complexity of function approximation. Type 1: Approximate an unknown function given input/output data. Type 2: Consider a collection of variables and functions, some of which are unknown, indexed by the nodes and hyperedges of a hypergraph (a generalized graph where edges can connect more than two vertices). Given partial observations of the variables of the hypergraph (satisfying the functional dependencies imposed by its structure), approximate all the unobserved variables and unknown functions. Type 3: Expanding on Type 2, if the hypergraph structure itself is unknown, use partial observations of the variables of the hypergraph to discover its structure and approximate its unknown functions. These hypergraphs offer a natural platform for organizing, communicating, and processing computational knowledge. While most scientific problems can be framed as the data-driven discovery of unknown functions in a computational hypergraph whose structure is known (Type 2), many require the data-driven discovery of the structure (connectivity) of the hypergraph itself (Type 3). We introduce an interpretable Gaussian Process (GP) framework for such (Type 3) problems that does not require randomization of the data, access to or control over its sampling, or sparsity of the unknown functions in a known or learned basis. Its polynomial complexity, which contrasts sharply with the super-exponential complexity of causal inference methods, is enabled by the nonlinear ANOVA capabilities of GPs used as a sensing mechanism.

[1032] Discovering and Reasoning of Causality in the Hidden World with Large Language Models

Chenxi Liu, Yongqiang Chen, Tongliang Liu, Mingming Gong, James Cheng, Bo Han, Kun Zhang

Main category: cs.LG

TL;DR: COAT is a framework that uses LLMs to propose hidden variables from unstructured data for causal discovery, refining variables through feedback from intermediate results to uncover causal structures like Markov Blankets and Partial Ancestral Graphs.

Details

Motivation: Existing causal discovery methods rely on human-defined variables, limiting application to unstructured data. LLMs' world knowledge can help automate variable discovery from unstructured data.

Method: COAT framework uses LLMs to propose variables from unstructured data, with feedback loops from causal discovery results to refine variables. COAT-MB finds Markov Blankets, COAT-PAG extends to Partial Ancestral Graphs by iterating over targets and seeking new variables.

Result: Theoretical guarantees established for causal discovery results. Framework verified across realistic benchmarks and real-world case studies, showing efficiency and reliability.

Conclusion: COAT successfully leverages LLMs to automate hidden variable discovery from unstructured data, enabling more complete causal discovery and extending debiased causal inference to unstructured data domains.

Abstract: Revealing hidden causal variables alongside the underlying causal mechanisms is essential to the development of science. Despite the progress in the past decades, existing practice in causal discovery (CD) heavily relies on high-quality measured variables, which are usually given by human experts. In fact, the lack of well-defined high-level variables behind unstructured data has been a longstanding roadblock to a broader real-world application of CD. This procedure can naturally benefit from an automated process that can suggest potential hidden variables in the system. Interestingly, Large language models (LLMs) are trained on massive observations of the world and have demonstrated great capability in processing unstructured data. To leverage the power of LLMs, we develop a new framework termed Causal representatiOn AssistanT (COAT) that incorporates the rich world knowledge of LLMs to propose useful measured variables for CD with respect to high-value target variables on their paired unstructured data. Instead of directly inferring causality with LLMs, COAT constructs feedback from intermediate CD results to LLMs to refine the proposed variables. Given the target variable and the paired unstructured data, we first develop COAT-MB that leverages the predictivity of the proposed variables to iteratively uncover the Markov Blanket of the target variable. Built upon COAT-MB, COAT-PAG further extends to uncover a more complete causal graph, i.e., Partial Ancestral Graph, by iterating over the target variables and actively seeking new high-level variables. Moreover, the reliable CD capabilities of COAT also extend the debiased causal inference to unstructured data by discovering an adjustment set. We establish theoretical guarantees for the CD results and verify their efficiency and reliability across realistic benchmarks and real-world case studies.

[1033] Neural Surveillance: Live-Update Visualization of Latent Training Dynamics

Xianglin Yang, Jin Song Dong

Main category: cs.LG

TL;DR: SentryCam is a real-time visualization framework that monitors the evolution of hidden representations in neural networks during training, enabling early detection of training instability and providing deeper insights into model learning dynamics.

Details

Motivation: Current monitoring tools only provide surface-level metrics like validation loss, lacking real-time visualization of internal model states which are crucial for understanding learning dynamics and enabling timely interventions.

Method: Developed SentryCam - a live-update visualization framework that tracks hidden representation progression with minimal latency, using geometry-based alerts for automated auditing and supporting diverse architectures (ResNet, ViT).

Result: Quantitatively validated visualization faithfulness across datasets and architectures. Automated auditing system successfully identified impending model failure up to 7 epochs earlier than validation loss curves.

Conclusion: SentryCam provides a flexible framework for both exploratory analysis and proactive model auditing, essential for robust model development, with publicly available code.

Abstract: Monitoring the inner state of deep neural networks is essential for auditing the learning process and enabling timely interventions. While conventional metrics like validation loss offer a surface-level view of performance, the evolution of a model’s hidden representations provides a deeper, complementary window into its internal dynamics. However, the literature lacks a real-time tool to monitor these crucial internal states. To address this, we introduce SentryCam, a live-update visualization framework that tracks the progression of hidden representations throughout training. SentryCam produces high-fidelity visualizations of the evolving representation space with minimal latency, serving as a powerful dashboard for understanding how a model learns. We quantitatively validate the faithfulness of SentryCam’s visualizations across diverse datasets and architectures (ResNet, ViT). Furthermore, we demonstrate SentryCam’s practical utility for model auditing through a case study on training instability. We designed an automated auditing system with geometry-based alerts that successfully identified impending model failure up to 7 epochs earlier than was evident from the validation loss curve. SentryCam’s flexible framework is easily adaptable, supporting both the exploratory analysis and proactive auditing essential for robust model development. The code is available at https://github.com/xianglinyang/SentryCam.

[1034] Output-Constrained Decision Trees

Hüseyin Tunç, Doğanay Özese, Ş. İlker Birbil, Donato Maragno, Marco Caserta, Mustafa Baydoğan

Main category: cs.LG

TL;DR: This paper introduces three new methods for training Output-Constrained Regression Trees (OCRT) to incorporate domain-specific constraints into machine learning models for accurate and feasible predictions in real-world applications.

Details

Motivation: Traditional decision trees have limitations in constrained multi-target regression tasks, and there is a need to incorporate domain-specific constraints to ensure predictions are both accurate and feasible in real-world applications.

Method: Three approaches: M-OCRT (split-based mixed integer programming), E-OCRT (exhaustive search with constrained prediction problems), and EP-OCRT (post-hoc constrained optimization). Also introduces a random forest framework for convex feasible sets.

Result: Validation on synthetic and industry-driven hierarchical time series datasets shows that imposing constraints on decision tree training results in accurate and feasible predictions.

Conclusion: The proposed OCRT methods effectively incorporate domain constraints into machine learning models, producing predictions that are both accurate and feasible for real-world applications.

Abstract: Incorporating domain-specific constraints into machine learning models is essential for generating predictions that are both accurate and feasible in real-world applications. This paper introduces new methods for training Output-Constrained Regression Trees (OCRT), addressing the limitations of traditional decision trees in constrained multi-target regression tasks. We propose three approaches: M-OCRT, which uses split-based mixed integer programming to enforce constraints; E-OCRT, which employs an exhaustive search for optimal splits and solves constrained prediction problems at each decision node; and EP-OCRT, which applies post-hoc constrained optimization to tree predictions. To illustrate their potential uses in ensemble learning, we also introduce a random forest framework working under convex feasible sets. We validate the proposed methods through a computational study both on synthetic and industry-driven hierarchical time series datasets. Our results demonstrate that imposing constraints on decision tree training results in accurate and feasible predictions.

[1035] LDPKiT: Superimposing Remote Queries for Privacy-Preserving Local Model Training

Kexin Li, Aastha Mehta, David Lie

Main category: cs.LG

TL;DR: LDPKiT is a framework for privacy-preserving model extraction that uses local differential privacy to protect user data while enabling effective knowledge transfer from proprietary ML models.

Details

Motivation: Users face privacy concerns when sending private data to ML cloud services for inference, but may need to use proprietary models when no alternatives exist.

Method: LDPKiT introduces a novel superimposition technique that generates approximately in-distribution samples under local differential privacy (LDP), enabling knowledge transfer while bounding privacy leakage.

Result: Experiments on Fashion-MNIST, SVHN, and PathMNIST show LDPKiT improves utility while maintaining privacy, with benefits increasing at stronger noise levels. On SVHN, it achieves nearly the same accuracy at ε=1.25 as at ε=2.0 with less than 2% accuracy reduction.

Conclusion: LDPKiT provides a practical solution for privacy-preserving model extraction that maintains strong privacy guarantees while enabling effective knowledge transfer from proprietary ML models.

Abstract: Users of modern Machine Learning (ML) cloud services face a privacy conundrum – on one hand, they may have concerns about sending private data to the service for inference, but on the other hand, for specialized models, there may be no alternative but to use the proprietary model of the ML service. In this work, we present LDPKiT, a framework for non-adversarial, privacy-preserving model extraction that leverages a user’s private in-distribution data while bounding privacy leakage. LDPKiT introduces a novel superimposition technique that generates approximately in-distribution samples, enabling effective knowledge transfer under local differential privacy (LDP). Experiments on Fashion-MNIST, SVHN, and PathMNIST demonstrate that LDPKiT consistently improves utility while maintaining privacy, with benefits that become more pronounced at stronger noise levels. For example, on SVHN, LDPKiT achieves nearly the same inference accuracy at $\epsilon=1.25$ as at $\epsilon=2.0$, yielding stronger privacy guarantees with less than a 2% accuracy reduction. We further conduct sensitivity analyses to examine the effect of dataset size on performance and provide a systematic analysis of latent space representations, offering theoretical insights into the accuracy gains of LDPKiT.

[1036] The Interpretable and Effective Graph Neural Additive Networks

Maya Bechler-Speicher, Amir Globerson, Ran Gilad-Bachrach

Main category: cs.LG

TL;DR: GNAN is an interpretable Graph Neural Network that extends Generalized Additive Models, providing transparent visualizations of feature and graph relationships while maintaining accuracy comparable to black-box GNNs.

Details

Motivation: Most GNNs operate as black-box models requiring post-hoc explanations, which are insufficient for high-stakes scenarios where transparency is crucial.

Method: Extends interpretable Generalized Additive Models to graph data, creating Graph Neural Additive Networks that can be directly visualized and understood by humans.

Result: GNAN provides both global and local explanations at feature and graph levels through direct visualization, while achieving accuracy on par with black-box GNNs.

Conclusion: GNAN is suitable for critical applications requiring both transparency and high accuracy, offering fully interpretable GNNs by design.

Abstract: Graph Neural Networks (GNNs) have emerged as the predominant approach for learning over graph-structured data. However, most GNNs operate as black-box models and require post-hoc explanations, which may not suffice in high-stakes scenarios where transparency is crucial. In this paper, we present a GNN that is interpretable by design. Our model, Graph Neural Additive Network (GNAN), is a novel extension of the interpretable class of Generalized Additive Models, and can be visualized and fully understood by humans. GNAN is designed to be fully interpretable, offering both global and local explanations at the feature and graph levels through direct visualization of the model. These visualizations describe exactly how the model uses the relationships between the target variable, the features, and the graph. We demonstrate the intelligibility of GNANs in a series of examples on different tasks and datasets. In addition, we show that the accuracy of GNAN is on par with black-box GNNs, making it suitable for critical applications where transparency is essential, alongside high accuracy.

[1037] Pre-Training and Personalized Fine-Tuning via Over-the-Air Federated Meta-Learning: Convergence-Generalization Trade-Offs

Haifeng Wen, Hong Xing, Osvaldo Simeone

Main category: cs.LG

TL;DR: This paper analyzes the generalization performance of meta-learning-based personalized federated learning (meta-pFL) in wireless settings, exploring the trade-off between generalization to new agents/tasks and convergence under channel impairments.

Details

Motivation: The shift towards pre-training followed by fine-tuning in AI applications, combined with the move from centralized to federated learning deployments, creates a need to understand how wireless channel conditions affect meta-learning generalization in federated settings.

Method: The study adopts over-the-air computing in a wireless federated learning setting where agents participate in meta-learning pre-training via shared wireless channels, analyzing how channel impairments impact the trade-off between generalization and convergence.

Result: Extensive numerical results validate the theoretical analysis, showing that channel impairments can enhance generalization while degrading convergence, establishing a clear trade-off between these two objectives.

Conclusion: Wireless channel conditions create a fundamental trade-off in meta-pFL: while impairments may improve generalization to new agents and tasks, they simultaneously degrade convergence performance, requiring careful balancing in practical deployments.

Abstract: For modern artificial intelligence (AI) applications such as large language models (LLMs), the training paradigm has recently shifted to pre-training followed by fine-tuning. Furthermore, owing to dwindling open repositories of data and thanks to efforts to democratize access to AI models, pre-training is expected to increasingly migrate from the current centralized deployments to federated learning (FL) implementations. Meta-learning provides a general framework in which pre-training and fine-tuning can be formalized. Meta-learning-based personalized FL (meta-pFL) moves beyond basic personalization by targeting generalization to new agents and tasks. This paper studies the generalization performance of meta-pFL for a wireless setting in which the agents participating in the pre-training phase, i.e., meta-learning, are connected via a shared wireless channel to the server. Adopting over-the-air computing, we study the trade-off between generalization to new agents and tasks, on the one hand, and convergence, on the other hand. The trade-off arises from the fact that channel impairments may enhance generalization, while degrading convergence. Extensive numerical results validate the theory.

[1038] Designing Algorithms Empowered by Language Models: An Analytical Framework, Case Studies, and Insights

Yanxi Chen, Yaliang Li, Bolin Ding, Jingren Zhou

Main category: cs.LG

TL;DR: An analytical framework for designing and analyzing LLM-based algorithms that systematically evaluates how design choices impact accuracy and efficiency.

Details

Motivation: LLM-based algorithms have achieved empirical success but require extensive trial-and-error optimization; this framework aims to provide formal systematic analysis to reduce this burden.

Method: Proposes a formal framework to analyze how critical design choices (task decomposition patterns, prompt design, etc.) affect algorithm performance across various patterns including parallel/hierarchical/recursive decomposition and directed acyclic graphs.

Result: Demonstrated through diverse case studies with systematic empirical validation in synthetic settings, providing generalizable insights across different scenarios.

Conclusion: The framework offers a systematic approach to optimize LLM-based algorithms by formally analyzing the impact of design choices, reducing reliance on trial-and-error methods.

Abstract: This work presents an analytical framework for the design and analysis of LLM-based algorithms, i.e., algorithms that contain one or multiple calls of large language models (LLMs) as sub-routines and critically rely on the capabilities of LLMs. While such algorithms, ranging from basic LLM calls with prompt engineering to complicated LLM-powered agentic workflows and compound AI systems, have achieved remarkable empirical success, their design and optimization oftentimes require extensive trial-and-errors and case-by-case analysis. Our proposed framework serves as an attempt to mitigate such headaches, offering a formal and systematic approach for analyzing how the accuracy and efficiency of an LLM-based algorithm will be impacted by critical design choices, such as the pattern and granularity of task decomposition, or the prompt for each LLM call. Through a wide range of case studies covering diverse algorithm patterns (including parallel/hierarchical/recursive task decomposition and generic directed acyclic graphs), we demonstrate the proposed framework in action and derive interesting insights that generalize across scenarios, accompanied by systematic empirical validation in synthetic settings.

[1039] LLMEasyQuant: Scalable Quantization for Parallel and Distributed LLM Inference

Dong Liu, Yanxuan Yu

Main category: cs.LG

TL;DR: LLMEasyQuant is a modular, system-aware quantization framework for efficient low-bit LLM inference across various hardware setups, supporting multiple quantization methods with unified interfaces and achieving substantial performance improvements.

Details

Motivation: Existing quantization toolkits lack transparency, flexibility, and system-level scalability across GPUs and distributed environments, creating a need for a more practical and scalable quantization solution.

Method: A modular framework supporting Symmetric Quantization, ZeroQuant, SmoothQuant, and SimQuant with unified interfaces for per-layer calibration, bitwidth assignment, and runtime adaptation. Integrates fused CUDA kernels with NCCL-based distributed synchronization and supports both static and online quantization.

Result: Achieves substantial speedup in GEMM execution, HBM load time, and near-linear multi-GPU scaling. Ablation studies validate its ability to balance latency, memory, and accuracy under diverse deployment conditions.

Conclusion: LLMEasyQuant offers a practical quantization serving system for scalable, hardware-optimized LLM inference.

Abstract: As large language models (LLMs) grow in size and deployment scale, quantization has become an essential technique for reducing memory footprint and improving inference efficiency. However, existing quantization toolkits often lack transparency, flexibility, and system-level scalability across GPUs and distributed environments. We present \textbf{LLMEasyQuant}, a modular, system-aware quantization framework designed for efficient, low-bit inference of LLMs on single-node multi-GPU, multi-node, and edge hardware. LLMEasyQuant supports a wide range of quantization methods – including Symmetric Quantization, ZeroQuant, SmoothQuant, and SimQuant – with unified interfaces for per-layer calibration, bitwidth assignment, and runtime adaptation. It integrates fused CUDA kernels with NCCL-based distributed synchronization and supports both static and online quantization. Empirical results show that LLMEasyQuant can achieve substantial speedup in GEMM execution, HBM load time, and near-linear multi-GPU scaling. Ablation studies further validate its ability to balance latency, memory, and accuracy under diverse deployment conditions. LLMEasyQuant offers a practical quantization serving system for scalable, hardware-optimized LLM inference.

[1040] InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso

Main category: cs.LG

TL;DR: InterpBench is a benchmark with semi-synthetic transformers containing known circuits for evaluating mechanistic interpretability methods. The paper introduces Strict IIT (SIIT), an improved version of Interchange Intervention Training that better aligns neural network computations with desired causal models while preventing non-circuit nodes from affecting outputs.

Details

Motivation: There's a need to validate mechanistic interpretability methods when the true algorithm implemented by neural networks is unknown, requiring benchmarks with known ground-truth circuits.

Method: Developed InterpBench with semi-synthetic transformers containing known circuits. Introduced Strict IIT (SIIT) which trains neural networks by aligning internal computation with high-level causal models while preventing non-circuit nodes from affecting outputs. Evaluated on Tracr transformers and larger circuits like IOI.

Result: SIIT models maintain Tracr’s original circuits while being more realistic. SIIT can train transformers with larger circuits like Indirect Object Identification. The benchmark was used to evaluate existing circuit discovery techniques.

Conclusion: InterpBench provides a valuable benchmark for mechanistic interpretability evaluation, and SIIT offers an improved method for training neural networks that better preserve desired causal structures while preventing interference from non-circuit components.

Abstract: Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown. This work presents InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits for evaluating these techniques. We train simple neural networks using a stricter version of Interchange Intervention Training (IIT) which we call Strict IIT (SIIT). Like the original, SIIT trains neural networks by aligning their internal computation with a desired high-level causal model, but it also prevents non-circuit nodes from affecting the model’s output. We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr’s original circuit while being more realistic. SIIT can also train transformers with larger circuits, like Indirect Object Identification (IOI). Finally, we use our benchmark to evaluate existing circuit discovery techniques.

[1041] Neuralink: Fast LLM Inference on Smartphones with Neuron Co-Activation Linking

Tuowei Wang, Ruwen Fan, Minxing Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, Ju Ren

Main category: cs.LG

TL;DR: Neuralink accelerates LLM inference on smartphones by optimizing neuron placement in flash memory based on co-activation patterns, achieving 1.49× latency improvements over state-of-the-art methods.

Details

Motivation: Deploying LLMs on mobile devices is challenging due to computational/memory demands. Lightweight LLMs sacrifice accuracy, while sparsity-based methods suffer from I/O bottlenecks on smartphones with severe IOPS constraints.

Method: Two-stage approach: offline stage reorganizes neuron placement based on co-activation patterns; online stage uses tailored data access and caching strategies aligned with hardware characteristics.

Result: Evaluation on various smartphones and LLMs shows Neuralink achieves average 1.49× improvements in end-to-end latency compared to state-of-the-art methods.

Conclusion: Neuralink explores a new optimization space at the intersection of sparsity-driven algorithms and storage-level system co-design for LLM inference, being the first solution to optimize storage placement under sparsity.

Abstract: Large Language Models (LLMs) have achieved remarkable success across various domains, yet deploying them on mobile devices remains an arduous challenge due to their extensive computational and memory demands. While lightweight LLMs have been developed to fit mobile environments, they suffer from degraded model accuracy. In contrast, sparsity-based techniques minimize DRAM usage by selectively transferring only relevant neurons to DRAM while retaining the full model in external storage, such as flash. However, such approaches are critically limited by numerous I/O operations, particularly on smartphones with severe IOPS constraints. In this paper, we propose Neuralink, a novel approach that accelerates LLM inference on smartphones by optimizing neuron placement in flash memory. Neuralink leverages the concept of Neuron Co-Activation, where neurons frequently activated together are linked to facilitate continuous read access and optimize I/O efficiency. Our approach incorporates a two-stage solution: an offline stage that reorganizes neuron placement based on co-activation patterns, and an online stage that employs tailored data access and caching strategies to align well with hardware characteristics. Evaluations conducted on a variety of smartphones and LLMs demonstrate that Neuralink achieves on average $1.49\times$ improvements in end-to-end latency compared to the state-of-the-art. As the first solution to optimize storage placement under sparsity, Neuralink explores a new optimization space at the intersection of sparsity-driven algorithm and storage-level system co-design for LLM inference.

[1042] Methods to improve run time of hydrologic models: opportunities and challenges in the machine learning era

Supath Dhital

Main category: cs.LG

TL;DR: This paper explores how machine learning and deep learning can improve computational efficiency and simulation time of physics-based hydrological models while addressing adoption challenges.

Details

Motivation: The motivation is to leverage ML/DL's computational efficiency and flexibility to enhance hydrological modeling, particularly for emergency response and large-scale applications that require rapid forecasting.

Method: The paper analyzes opportunities and challenges of adopting ML for hydrological modeling, focusing on how ML can improve runtime of physics-based models through data-driven approaches.

Result: The study identifies that ML offers significant computational efficiency advantages over traditional physics-based models and can help reduce simulation time in hydrological modeling.

Conclusion: ML and DL present promising opportunities to enhance hydrological modeling efficiency, but future work is needed to address implementation constraints and fully leverage these technologies in the field.

Abstract: The application of Machine Learning (ML) to hydrologic modeling is fledgling. Its applicability to capture the dependencies on watersheds to forecast better within a short period is fascinating. One of the key reasons to adopt ML algorithms over physics-based models is its computational efficiency advantage and flexibility to work with various data sets. The diverse applications, particularly in emergency response and expanding over a large scale, demand the hydrological model in a short time and make researchers adopt data-driven modeling approaches unhesitatingly. In this work, in the era of ML and deep learning (DL), how it can help to improve the overall run time of physics-based model and potential constraints that should be addressed while modeling. This paper covers the opportunities and challenges of adopting ML for hydrological modeling and subsequently how it can help to improve the simulation time of physics-based models and future works that should be addressed.

[1043] Retrieval-Retro: Retrieval-based Inorganic Retrosynthesis with Expert Knowledge

Heewoong Noh, Namkyeong Lee, Gyoung S. Na, Chanyoung Park

Main category: cs.LG

TL;DR: Retrieval-Retro is a machine learning approach for inorganic retrosynthesis planning that uses attention layers to implicitly extract precursor information from retrieved reference materials, considering thermodynamic relationships to identify the most probable precursor sets.

Details

Motivation: Machine learning applications in inorganic retrosynthesis planning have been less explored compared to organic retrosynthesis, creating a gap that needs to be addressed for effective materials discovery.

Method: The approach retrieves reference materials from a knowledge base and uses various attention layers to implicitly extract precursor information rather than directly using it. It also considers thermodynamic relationships between target materials and precursors during retrieval.

Result: Extensive experiments show Retrieval-Retro’s superiority in retrosynthesis planning, particularly in discovering novel synthesis recipes that are crucial for materials discovery.

Conclusion: Retrieval-Retro effectively addresses the gap in inorganic retrosynthesis planning by leveraging implicit precursor extraction through attention mechanisms and thermodynamic considerations, enabling better discovery of novel synthesis pathways.

Abstract: While inorganic retrosynthesis planning is essential in the field of chemical science, the application of machine learning in this area has been notably less explored compared to organic retrosynthesis planning. In this paper, we propose Retrieval-Retro for inorganic retrosynthesis planning, which implicitly extracts the precursor information of reference materials that are retrieved from the knowledge base regarding domain expertise in the field. Specifically, instead of directly employing the precursor information of reference materials, we propose implicitly extracting it with various attention layers, which enables the model to learn novel synthesis recipes more effectively. Moreover, during retrieval, we consider the thermodynamic relationship between target material and precursors, which is essential domain expertise in identifying the most probable precursor set among various options. Extensive experiments demonstrate the superiority of Retrieval-Retro in retrosynthesis planning, especially in discovering novel synthesis recipes, which is crucial for materials discovery. The source code for Retrieval-Retro is available at https://github.com/HeewoongNoh/Retrieval-Retro.

[1044] Graph Neural Network Surrogates to leverage Mechanistic Expert Knowledge towards Reliable and Immediate Pandemic Response

Agatha Schmidt, Henrik Zunker, Alexander Heinlein, Martin J. Kühn

Main category: cs.LG

TL;DR: A graph neural network surrogate model was developed to accelerate COVID-19 forecasting, achieving 10-27% MAPE with 28,670x speedup over mechanistic models.

Details

Motivation: Time-critical pandemic decisions require rapid forecasting, but mechanistic models are computationally expensive, creating a bottleneck for evidence-based decision making.

Method: Developed a GNN surrogate using ARMAConv-based architecture on a 400-node spatial graph with age-structured contact matrices, tested across outbreak regimes with up to three contact change points.

Result: Achieved 10-27% MAPE across 30-90 day horizons with near-constant runtime, accelerating evaluation by up to 28,670 times compared to mechanistic models.

Conclusion: GNN surrogates can effectively translate complex metapopulation models into immediate, reliable tools for time-critical pandemic response scenarios.

Abstract: During the COVID-19 crisis, mechanistic models have guided evidence-based decision making. However, time-critical decisions in a dynamical environment limit the time available to gather supporting evidence. We address this bottleneck by developing a graph neural network (GNN) surrogate of a spatially and demographically resolved mechanistic metapopulation simulator. This combined approach advances classical machine learning approaches which are often black box. Our design of experiments spans outbreak and persistent-threat regimes, up to three contact change points, and age-structured contact matrices on a 400-node spatial graph. We benchmark multiple GNN layers and identify an ARMAConv-based architecture that offers a strong accuracy-runtime trade-off. Across 30-90 day horizons and up to three contact change points, the surrogate attains 10-27 % mean absolute percentage error (MAPE) while delivering (near) constant runtime with respect to the forecast horizon. Our approach accelerates evaluation by up to 28,670 times compared with the mechanistic model, allowing responsive decision support in time-critical scenarios and straightforward web integration. These results show how GNN surrogates can translate complex metapopulation models into immediate, reliable tools for pandemic response.

[1045] Modeling AdaGrad, RMSProp, and Adam with Integro-Differential Equations

Carlos Heredia

Main category: cs.LG

TL;DR: The paper proposes continuous-time formulations for AdaGrad, RMSProp, and Adam optimization algorithms using first-order integro-differential equations, with numerical simulations and analyses showing strong agreement with discrete implementations.

Details

Motivation: To provide a new theoretical perspective on adaptive optimization methods by modeling them in continuous time rather than discrete implementations.

Method: Model AdaGrad, RMSProp, and Adam as first-order integro-differential equations, perform numerical simulations, and conduct stability and convergence analyses.

Result: Strong agreement between continuous-time models and discrete implementations, validating the continuous formulations as accurate approximations.

Conclusion: Continuous-time modeling offers a valuable new perspective for theoretical understanding of adaptive optimization algorithms.

Abstract: In this paper, we propose a continuous-time formulation for the AdaGrad, RMSProp, and Adam optimization algorithms by modeling them as first-order integro-differential equations. We perform numerical simulations of these equations, along with stability and convergence analyses, to demonstrate their validity as accurate approximations of the original algorithms. Our results indicate a strong agreement between the behavior of the continuous-time models and the discrete implementations, thus providing a new perspective on the theoretical understanding of adaptive optimization methods.

[1046] Edge Delayed Deep Deterministic Policy Gradient: efficient continuous control for edge scenarios

Alberto Sinigaglia, Niccolò Turcato, Ruggero Carli, Gian Antonio Susto

Main category: cs.LG

TL;DR: EdgeD3 is a novel reinforcement learning algorithm for edge scenarios that enhances DDPG with 25% less GPU time, 30% fewer computational resources, and 30% less memory while matching or surpassing state-of-the-art performance.

Details

Motivation: Address overestimation bias in deep reinforcement learning and adapt algorithms for edge scenarios with limited computational resources and privacy concerns.

Method: Enhances Deep Deterministic Policy Gradient (DDPG) algorithm with multiple Q-functions to mitigate overestimation bias, specifically optimized for edge computing environments.

Result: Achieves significantly improved performance with 25% less GPU time, 30% fewer computational resources, and 30% less memory while maintaining or exceeding state-of-the-art performance across benchmarks.

Conclusion: EdgeD3 successfully addresses computational efficiency challenges in edge scenarios while maintaining competitive performance, making it suitable for resource-constrained environments.

Abstract: Deep Reinforcement Learning is gaining increasing attention thanks to its capability to learn complex policies in high-dimensional settings. Recent advancements utilize a dual-network architecture to learn optimal policies through the Q-learning algorithm. However, this approach has notable drawbacks, such as an overestimation bias that can disrupt the learning process and degrade the performance of the resulting policy. To address this, novel algorithms have been developed that mitigate overestimation bias by employing multiple Q-functions. Edge scenarios, which prioritize privacy, have recently gained prominence. In these settings, limited computational resources pose a significant challenge for complex Machine Learning approaches, making the efficiency of algorithms crucial for their performance. In this work, we introduce a novel Reinforcement Learning algorithm tailored for edge scenarios, called Edge Delayed Deep Deterministic Policy Gradient (EdgeD3). EdgeD3 enhances the Deep Deterministic Policy Gradient (DDPG) algorithm, achieving significantly improved performance with $25%$ less Graphics Process Unit (GPU) time while maintaining the same memory usage. Additionally, EdgeD3 consistently matches or surpasses the performance of state-of-the-art methods across various benchmarks, all while using $30%$ fewer computational resources and requiring $30%$ less memory.

[1047] LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement

Jieming Bian, Lei Wang, Letian Zhang, Jie Xu

Main category: cs.LG

TL;DR: LoRA-FAIR is a novel method that addresses server-side aggregation bias and client-side initialization lag in federated learning with LoRA, improving performance while maintaining efficiency.

Details

Motivation: Combining LoRA with FL introduces challenges: server-side aggregation bias where averaging LoRA matrices diverges from ideal global updates, and client-side initialization lag requiring consistent initialization across rounds.

Method: LoRA-FAIR introduces a correction term on the server to enhance aggregation efficiency and accuracy, tackling both server-side aggregation bias and client-side initialization lag issues.

Result: Experimental results on ViT and MLP-Mixer models across large-scale datasets show that LoRA-FAIR consistently achieves performance improvements in FL settings over state-of-the-art methods.

Conclusion: LoRA-FAIR effectively addresses key challenges in combining LoRA with FL, maintaining computational and communication efficiency while delivering superior performance.

Abstract: Foundation models (FMs) achieve strong performance across diverse tasks with task-specific fine-tuning, yet full parameter fine-tuning is often computationally prohibitive for large models. Parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA) reduce this cost by introducing low-rank matrices for tuning fewer parameters. While LoRA allows for efficient fine-tuning, it requires significant data for adaptation, making Federated Learning (FL) an appealing solution due to its privacy-preserving collaborative framework. However, combining LoRA with FL introduces two key challenges: the \textbf{Server-Side Aggregation Bias}, where server-side averaging of LoRA matrices diverges from the ideal global update, and the \textbf{Client-Side Initialization Lag}, emphasizing the need for consistent initialization across rounds. Existing approaches address these challenges individually, limiting their effectiveness. We propose LoRA-FAIR, a novel method that tackles both issues by introducing a correction term on the server, enhancing aggregation efficiency and accuracy. LoRA-FAIR maintains computational and communication efficiency, yielding superior performance over state-of-the-art methods. Experimental results on ViT and MLP-Mixer models across large-scale datasets demonstrate that LoRA-FAIR consistently achieves performance improvements in FL settings.

[1048] Can a MISL Fly? Analysis and Ingredients for Mutual Information Skill Learning

Chongyi Zheng, Jens Tuyls, Joanne Peng, Benjamin Eysenbach

Main category: cs.LG

TL;DR: This paper argues that the benefits of METRA’s Wasserstein distance approach can be explained within existing mutual information skill learning (MISL) framework, proposes a new MISL method called contrastive successor features that matches METRA’s performance with simpler design, and provides ablation studies on key components.

Details

Motivation: To reconcile METRA's success with existing mutual information skill learning frameworks and develop a simpler method that achieves similar performance while connecting skill learning, contrastive representation learning, and successor features.

Method: Proposes contrastive successor features - a new MISL method that retains METRA’s performance with fewer components, and conducts careful ablation studies to analyze key ingredients.

Result: The new contrastive successor features method achieves excellent performance comparable to METRA but with a simpler design, and ablation studies provide insights into critical components of both methods.

Conclusion: METRA’s benefits can be explained within existing MISL framework, and the proposed contrastive successor features method offers comparable performance with reduced complexity while highlighting connections between different learning approaches.

Abstract: Self-supervised learning has the potential of lifting several of the key challenges in reinforcement learning today, such as exploration, representation learning, and reward design. Recent work (METRA) has effectively argued that moving away from mutual information and instead optimizing a certain Wasserstein distance is important for good performance. In this paper, we argue that the benefits seen in that paper can largely be explained within the existing framework of mutual information skill learning (MISL). Our analysis suggests a new MISL method (contrastive successor features) that retains the excellent performance of METRA with fewer moving parts, and highlights connections between skill learning, contrastive representation learning, and successor features. Finally, through careful ablation studies, we provide further insight into some of the key ingredients for both our method and METRA.

[1049] Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth

Xinyu Yuan, Yan Qiao, Meng Li, Zhenchun Wei, Cuiying Feng, Zonghui Wang, Wenzhi Chen

Main category: cs.LG

TL;DR: UCL-sketch is a learning-based frequency estimation method for data streams that uses online training without ground truth and achieves high accuracy with fast processing speeds.

Details

Motivation: Traditional sketches provide coarse estimates under memory constraints, while existing learning-based methods require offline training with ground truth data and suffer from slow update speeds, making them unsuitable for real-time processing.

Method: UCL-sketch uses an online training mechanism based on equivalent learning that requires no ground truth, and a scalable architecture with logically structured estimation buckets leveraging compressive sensing (CS).

Result: The method achieves significantly lower error bounds than prior works, matches oracle performance under tight memory budgets, and provides 500x faster decoding speed compared to existing equation-based sketches.

Conclusion: UCL-sketch offers a practical learning-based solution for frequency estimation in data streams that combines high accuracy with real-time processing capabilities, outperforming existing approaches in both per-key accuracy and distribution estimation.

Abstract: Estimating the frequency of items on the high-volume, fast data stream has been extensively studied in many areas, such as database and network measurement. Traditional sketches provide only coarse estimates under strict memory constraints. Although some learning-augmented methods have emerged recently, they typically rely on offline training with real frequencies or/and labels, which are often unavailable. Moreover, these methods suffer from slow update speeds, limiting their suitability for real-time processing despite offering only marginal accuracy improvements. To overcome these challenges, we propose UCL-sketch, a practical learning-based paradigm for per-key frequency estimation. Our design introduces two key innovations: (i) an online training mechanism based on equivalent learning that requires no ground truth (GT), and (ii) a highly scalable architecture leveraging logically structured estimation buckets to scale to real-world data stream. The UCL-sketch, which utilizes compressive sensing (CS), converges to an estimator that provably yields a error bound far lower than that of prior works, without sacrificing the speed of processing. Extensive experiments on both real-world and synthetic datasets demonstrate that our approach outperforms previously proposed approaches regarding per-key accuracy and distribution. Notably, under extremely tight memory budgets, its quality almost matches that of an (infeasible) omniscient oracle. Moreover, compared to the existing equation-based sketch, UCL-sketch achieves an average decoding speedup of nearly 500 times. To help further research and development, our code is publicly available at https://github.com/Y-debug-sys/UCL-sketch.

[1050] Sim-to-real supervised domain adaptation for radioisotope identification

Peter Lalor, Henry Adams, Alex Hagen

Main category: cs.LG

TL;DR: Supervised domain adaptation improves radioisotope identification by transferring knowledge from synthetic to experimental gamma spectroscopy data, achieving 96% accuracy with minimal labeled experimental data.

Details

Motivation: Labeling experimental datasets is expensive, and training on synthetic data alone suffers from domain gap issues between simulated and real measurements.

Method: Pretrain transformer-based neural network on synthetic data, then fine-tune on small labeled experimental datasets (64 spectra) using supervised domain adaptation.

Result: Achieved 96% test accuracy in sim-to-real scenario with LaBr detector, surpassing synthetic-only baseline (75%) and from-scratch training (80%). Models also learned more interpretable features.

Conclusion: Supervised domain adaptation effectively bridges the sim-to-real gap in radioisotope identification, enabling accurate and explainable classifiers with limited experimental data.

Abstract: Machine learning has the potential to improve the speed and reliability of radioisotope identification using gamma spectroscopy. However, meticulously labeling an experimental dataset for training is often prohibitively expensive, while training models purely on synthetic data is risky due to the domain gap between simulated and experimental measurements. In this research, we demonstrate that supervised domain adaptation can substantially improve the performance of radioisotope identification models by transferring knowledge between synthetic and experimental data domains. We consider two domain adaptation scenarios: (1) a simulation-to-simulation adaptation, where we perform multi-label proportion estimation using simulated high-purity germanium detectors, and (2) a simulation-to-experimental adaptation, where we perform multi-class, single-label classification using measured spectra from handheld lanthanum bromide (LaBr) and sodium iodide (NaI) detectors. We begin by pretraining a spectral classifier on synthetic data using a custom transformer-based neural network. After subsequent fine-tuning on just 64 labeled experimental spectra, we achieve a test accuracy of 96% in the sim-to-real scenario with a LaBr detector, far surpassing a synthetic-only baseline model (75%) and a model trained from scratch (80%) on the same 64 spectra. Furthermore, we demonstrate that domain-adapted models learn more human-interpretable features than experiment-only baseline models. Overall, our results highlight the potential for supervised domain adaptation techniques to bridge the sim-to-real gap in radioisotope identification, enabling the development of accurate and explainable classifiers even in real-world scenarios where access to experimental data is limited.

[1051] On the Role of Transformer Feed-Forward Layers in Nonlinear In-Context Learning

Haoyuan Sun, Ali Jadbabaie, Navid Azizan

Main category: cs.LG

TL;DR: Transformers can perform nonlinear in-context learning through a combination of linear self-attention and feed-forward layers, which together implement gradient descent on polynomial kernel regression tasks, with deep models overcoming single-block limitations.

Details

Motivation: To understand how Transformers perform in-context learning for nonlinear function classes, as prior work showed linear self-attention can only handle linear tasks.

Method: Analyze a Transformer block with linear self-attention and GLU-inspired feed-forward layers, showing it implements gradient descent on polynomial kernel regression. Study deep Transformers that distribute computation across blocks.

Result: Single Transformer blocks can perform nonlinear ICL but are limited by dimensions. Deep Transformers overcome this by distributing kernel computation across blocks, implementing block-coordinate descent in high-dimensional feature spaces.

Conclusion: Feed-forward layers provide a crucial mechanism for Transformers to express nonlinear representations in in-context learning, with depth enabling richer kernel functions that single blocks cannot represent.

Abstract: Transformer-based models demonstrate a remarkable ability for in-context learning (ICL), where they can adapt to unseen tasks from a few prompt examples without parameter updates. Recent research has illuminated how Transformers perform ICL, showing that the optimal linear self-attention (LSA) mechanism can implement one step of gradient descent for linear least-squares objectives when trained on random linear regression tasks. Building on this, we investigate ICL for nonlinear function classes. We first prove that LSA is inherently incapable of outperforming linear predictors on nonlinear tasks, underscoring why prior solutions cannot readily extend to these problems. To overcome this limitation, we analyze a Transformer block consisting of LSA and feed-forward layers inspired by the gated linear units (GLU), which is a standard component of modern Transformers. We show that this block achieves nonlinear ICL by implementing one step of gradient descent on a polynomial kernel regression loss. Furthermore, our analysis reveals that the expressivity of a single block is inherently limited by its dimensions. We then show that a deep Transformer can overcome this bottleneck by distributing the computation of richer kernel functions across multiple blocks, performing block-coordinate descent in a high-dimensional feature space that a single block cannot represent. Our findings highlight that the feed-forward layers provide a crucial and scalable mechanism by which Transformers can express nonlinear representations for ICL.

[1052] Exposing the Vulnerability of Decentralized Learning to Membership Inference Attacks Through the Lens of Graph Mixing

Ousmane Touat, Jezekael Brunon, Yacine Belal, Julien Nicolas, Mohamed Maouche, César Sabater, Sonia Ben Mokhtar

Main category: cs.LG

TL;DR: Decentralized learning’s vulnerability to Membership Inference Attacks (MIA) is heavily correlated with local model mixing strategies and global graph mixing properties, with enhanced mixing benefiting privacy when combined with techniques like Differential Privacy.

Details

Motivation: To understand factors that increase/reduce vulnerability to MIA in decentralized learning, where model parameter exchanges can be exploited to infer sensitive training data.

Method: Extensive exploration of MIA vulnerability across various decentralized architectures by varying graph structure, dynamics, aggregation strategies, datasets, and data distributions, with theoretical analysis of mixing properties.

Result: Vulnerability to MIA is heavily correlated with local model mixing strategies and global mixing properties of communication graphs, with enhanced mixing being beneficial when combined with privacy-preserving techniques.

Conclusion: Provides lessons learned for designing decentralized learning systems that reduce MIA vulnerability by design, emphasizing the importance of mixing properties and their combination with other privacy techniques.

Abstract: The primary promise of decentralized learning is to allow users to engage in the training of machine learning models in a collaborative manner while keeping their data on their premises and without relying on any central entity. However, this paradigm necessitates the exchange of model parameters or gradients between peers. Such exchanges can be exploited to infer sensitive information about training data, which is achieved through privacy attacks (e.g., Membership Inference Attacks – MIA). In order to devise effective defense mechanisms, it is important to understand the factors that increase/reduce the vulnerability of a given decentralized learning architecture to MIA. In this study, we extensively explore the vulnerability to MIA of various decentralized learning architectures by varying the graph structure (e.g., number of neighbors), the graph dynamics, and the aggregation strategy, across diverse datasets and data distributions. Our key finding, which to the best of our knowledge we are the first to report, is that the vulnerability to MIA is heavily correlated to (i) the local model mixing strategy performed by each node upon reception of models from neighboring nodes and (ii) the global mixing properties of the communication graph. We illustrate these results experimentally using four datasets and by theoretically analyzing the mixing properties of various decentralized architectures. We also empirically show that enhancing mixing properties is highly beneficial when combined with other privacy-preserving techniques such as Differential Privacy. Our paper draws a set of lessons learned for devising decentralized learning systems that reduce by design the vulnerability to MIA.

[1053] Robust Federated Finetuning of LLMs via Alternating Optimization of LoRA

Shuangyi Chen, Yuanxin Guo, Yue Ju, Harik Dalal, Zhongwen Zhu, Ashish Khisti

Main category: cs.LG

TL;DR: RoLoRA is a federated learning framework that uses alternating optimization to fine-tune LoRA adapters, emphasizing learning both up and down projection matrices for better expressiveness and robustness.

Details

Motivation: To reduce computational and communication costs in federated training while overcoming limitations of prior approaches that either generate imperfect model updates or limit model expressiveness.

Method: Uses alternating optimization to fine-tune LoRA adapters, with theoretical analysis on linear models and convergence proof under general conditions, plus extensive experiments on language models.

Result: Demonstrates advantages over prior methods through both theoretical analysis and experimental evaluations on RoBERTa-Large and Llama-2-7B across diverse tasks and FL settings.

Conclusion: RoLoRA effectively bridges theory and practice, showing the importance of learning both projection matrices in LoRA for enhanced federated learning performance.

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) optimize federated training by reducing computational and communication costs. We propose RoLoRA, a federated framework using alternating optimization to fine-tune LoRA adapters. Our approach emphasizes the importance of learning up and down projection matrices to enhance expressiveness and robustness. We use both theoretical analysis and extensive experiments to demonstrate the advantages of RoLoRA over prior approaches that either generate imperfect model updates or limit expressiveness of the model. We provide a theoretical analysis on a linear model to highlight the importance of learning both the down-projection and up-projection matrices in LoRA. We validate the insights on a non-linear model and separately provide a convergence proof under general conditions. To bridge theory and practice, we conducted extensive experimental evaluations on language models including RoBERTa-Large, Llama-2-7B on diverse tasks and FL settings to demonstrate the advantages of RoLoRA over other methods.

[1054] Stochastic Process Learning via Operator Flow Matching

Yaozhong Shi, Zachary E. Ross, Domniki Asimaki, Kamyar Azizzadenesheli

Main category: cs.LG

TL;DR: Operator Flow Matching (OFM) is a novel framework for learning stochastic process priors on function spaces, enabling probability density estimation and functional regression across arbitrary domains.

Details

Motivation: To extend neural operators for stochastic process learning across arbitrary domains and provide mathematically tractable functional regression with density estimation.

Method: Developed Operator Flow Matching (OFM) framework that learns stochastic process priors on function spaces and enables probability density estimation for any collection of points.

Result: Outperforms state-of-the-art models in stochastic process learning, functional regression, and prior learning tasks.

Conclusion: OFM provides an effective framework for stochastic process learning with superior performance in functional regression and density estimation compared to existing methods.

Abstract: Expanding on neural operators, we propose a novel framework for stochastic process learning across arbitrary domains. In particular, we develop operator flow matching (OFM) for learning stochastic process priors on function spaces. OFM provides the probability density of the values of any collection of points and enables mathematically tractable functional regression at new points with mean and density estimation. Our method outperforms state-of-the-art models in stochastic process learning, functional regression, and prior learning.

[1055] The Minimal Search Space for Conditional Causal Bandits

Francisco N. F. Q. Simoes, Itai Feigenbaum, Mehdi Dastani, Thijs van Ommen

Main category: cs.LG

TL;DR: This paper introduces an efficient algorithm for identifying the minimal set of nodes containing the optimal conditional intervention in causal bandits, which significantly accelerates convergence in multi-armed bandit algorithms.

Details

Motivation: Traditional causal bandits focus on hard interventions, but many real-world decision-making problems are better modeled by conditional interventions where the intervened variable's value depends on other observed variables. This creates a need for efficient methods to find optimal conditional interventions.

Method: The paper presents a graphical characterization of the minimal node set containing the optimal conditional intervention and proposes an efficient O(|V| + |E|) algorithm to identify this set, where |V| is the number of nodes and |E| is the number of edges in the causal graph.

Result: The authors prove the correctness of both the graphical characterization and the proposed algorithm. Empirical results show that the algorithm significantly prunes the search space and substantially accelerates convergence rates when integrated into standard multi-armed bandit algorithms.

Conclusion: The proposed method provides an efficient solution for conditional interventions in causal bandits, offering both theoretical guarantees and practical performance improvements for real-world decision-making problems.

Abstract: Causal knowledge can be used to support decision-making problems. This has been recognized in the causal bandits literature, where a causal (multi-armed) bandit is characterized by a causal graphical model and a target variable. The arms are then interventions on the causal model, and rewards are samples of the target variable. Causal bandits were originally studied with a focus on hard interventions. We focus instead on cases where the arms are conditional interventions, which more accurately model many real-world decision-making problems by allowing the value of the intervened variable to be chosen based on the observed values of other variables. This paper presents a graphical characterization of the minimal set of nodes guaranteed to contain the optimal conditional intervention, which maximizes the expected reward. We then propose an efficient algorithm with a time complexity of $O(|V| + |E|)$ to identify this minimal set of nodes. We prove that the graphical characterization and the proposed algorithm are correct. Finally, we empirically demonstrate that our algorithm significantly prunes the search space and substantially accelerates convergence rates when integrated into standard multi-armed bandit algorithms.

[1056] $k$-SVD with Gradient Descent

Yassir Jedra, Devavrat Shah

Main category: cs.LG

TL;DR: A gradient-descent method with universal step-size selection for computing k-SVD that achieves global linear convergence for any matrix rank d≥1 and any k≥1, with convergence rates matching complex Lanczos-based methods.

Details

Motivation: Existing optimization-based approaches for k-SVD computation have limitations: they only work for exact-parameterized or over-parameterized settings, provide only local convergence guarantees, or require problem-specific step-size selection.

Method: Gradient-descent method with random initialization and a simple, universal step-size selection rule (akin to pre-conditioning). The method behaves like Heron’s method within an attractive region and can be enhanced with Nesterov’s momentum acceleration.

Result: The method provably finds k-SVD for matrices of any rank d≥1 with global linear convergence for any k,d≥1. Enhanced version with momentum achieves convergence rates comparable to complex Lanczos-based methods.

Conclusion: This work completes the pursuit for scalable k-SVD computation by providing a simple, universally applicable gradient method with strong theoretical guarantees and competitive performance.

Abstract: The emergence of modern compute infrastructure for iterative optimization has led to great interest in developing optimization-based approaches for a scalable computation of $k$-SVD, i.e., the $k\geq 1$ largest singular values and corresponding vectors of a matrix of rank $d \geq 1$. Despite lots of exciting recent works, all prior works fall short in this pursuit. Specifically, the existing results are either for the exact-parameterized (i.e., $k = d$) and over-parameterized (i.e., $k > d$) settings; or only establish local convergence guarantees; or use a step-size that requires problem-instance-specific oracle-provided information. In this work, we complete this pursuit by providing a gradient-descent method with a simple, universal rule for step-size selection (akin to pre-conditioning), that provably finds $k$-SVD for a matrix of any rank $d \geq 1$. We establish that the gradient method with random initialization enjoys global linear convergence for any $k, d \geq 1$. Our convergence analysis reveals that the gradient method has an attractive region, and within this attractive region, the method behaves like Heron’s method (a.k.a. the Babylonian method). Our analytic results about the said attractive region imply that the gradient method can be enhanced by means of Nesterov’s momentum-based acceleration technique. The resulting improved convergence rates match those of rather complicated methods typically relying on Lanczos iterations or variants thereof. We provide an empirical study to validate the theoretical results.

[1057] Training and Evaluating with Human Label Variation: An Empirical Study

Kemal Kurniawan, Meladel Mistica, Timothy Baldwin, Jey Han Lau

Main category: cs.LG

TL;DR: The paper proposes new evaluation metrics for Human Label Variation (HLV) using fuzzy set theory and conducts an extensive comparison of training methods and metrics across 6 HLV datasets.

Details

Motivation: Human label variation challenges the single ground truth assumption, and there's uncertainty about which training methods and metrics perform best in HLV settings.

Method: Proposed new differentiable evaluation metrics based on fuzzy set theory, tested 14 training methods and 6 evaluation metrics across 6 HLV datasets, and experimented with using the differentiable metrics as training objectives.

Result: Training on disaggregated annotations or soft labels performed best across metrics, outperforming training using the proposed differentiable metrics. The proposed soft micro F1 score was identified as one of the best metrics for HLV data.

Conclusion: Traditional approaches using disaggregated annotations or soft labels remain effective for HLV, while the proposed soft micro F1 metric shows strong performance for evaluating HLV models.

Abstract: Human label variation (HLV) challenges the standard assumption that a labelled instance has a single ground truth, instead embracing the natural variation in human annotation to train and evaluate models. While various training methods and metrics for HLV have been proposed, it is still unclear which methods and metrics perform best in what settings. We propose new evaluation metrics for HLV leveraging fuzzy set theory. Since these new proposed metrics are differentiable, we then in turn experiment with employing these metrics as training objectives. We conduct an extensive study over 6 HLV datasets testing 14 training methods and 6 evaluation metrics. We find that training on either disaggregated annotations or soft labels performs best across metrics, outperforming training using the proposed training objectives with differentiable metrics. We also show that our proposed soft micro F1 score is one of the best metrics for HLV data.

[1058] AB-UPT: Scaling Neural CFD Surrogates for High-Fidelity Automotive Aerodynamics Simulations via Anchored-Branched Universal Physics Transformers

Benedikt Alkin, Maurits Bleeker, Richard Kurle, Tobias Kronlachner, Reinhard Sonnleitner, Matthias Dorfer, Johannes Brandstetter

Main category: cs.LG

TL;DR: AB-UPT is a novel neural surrogate modeling scheme for CFD simulations that addresses scalability to industrial-scale problems with 100M+ mesh cells, handles complex geometry interactions, and enforces physics constraints like divergence-free vorticity fields through multi-branch operators and anchored neural field decoders.

Details

Motivation: Industrial CFD problems face major scalability challenges with volumetric meshes reaching 100M cells, complex geometry interactions, and strict physics constraints like divergence-free requirements for vorticity fields, which current neural surrogate models struggle to handle.

Method: Multi-branch operators decouple geometry encoding and prediction tasks; neural simulation in low-dimensional latent space enables scalability; anchored neural field decoders predict high-fidelity outputs; divergence-free formulation enforces physics consistency.

Result: State-of-the-art predictive accuracy on automotive CFD simulations with 33K to 150M mesh cells; enables enforcement of hard physical constraints without performance degradation; trains on single GPU in <1 day; predicts industry-standard fields in seconds; eliminates need for costly CFD meshing for inference.

Conclusion: AB-UPT successfully addresses key challenges in industrial-scale neural CFD modeling through its scalable architecture, physics-constrained formulation, and ability to work directly from CAD geometry, making it practical for real-world automotive applications.

Abstract: Recent advances in neural surrogate modeling offer the potential for transformative innovations in applications such as automotive aerodynamics. Yet, industrial-scale problems often involve volumetric meshes with cell counts reaching 100 million, presenting major scalability challenges. Complex geometries further complicate modeling through intricate surface-volume interactions, while quantities such as vorticity are highly nonlinear and must satisfy strict divergence-free constraints. To address these requirements, we introduce AB-UPT as a novel modeling scheme for building neural surrogates for CFD simulations. AB-UPT is designed to: (i) decouple geometry encoding and prediction tasks via multi-branch operators; (ii) enable scalability to high-resolution outputs via neural simulation in a low-dimensional latent space, coupled with anchored neural field decoders to predict high-fidelity outputs; (iii) enforce physics consistency by a divergence-free formulation. We show that AB-UPT yields state-of-the-art predictive accuracy of surface and volume fields on automotive CFD simulations ranging from 33 thousand up to 150 million mesh cells. Furthermore, our anchored neural field architecture enables the enforcement of hard physical constraints on the physics predictions without degradation in performance, exemplified by modeling divergence-free vorticity fields. Notably, the proposed models can be trained on a single GPU in less than a day and predict industry-standard surface and volume fields within seconds. Additionally, we show that the flexible design of our method enables neural simulation from a CAD geometry alone, thereby eliminating the need for costly CFD meshing procedures for inference.

[1059] Physics-Inspired Binary Neural Networks: Interpretable Compression with Theoretical Guarantees

Arian Eamaz, Farhang Yeganegi, Mojtaba Soltanalian

Main category: cs.LG

TL;DR: PIBiNN is a physics-inspired binary neural network that combines data-driven one-bit quantization with problem-driven sparsity predefined by physics, achieving compression below one bit per weight while preserving operator geometry.

Details

Motivation: Traditional approaches use dense networks and then sparsify them, ignoring available prior knowledge about problem structure. Many inverse problems already have algorithm-unrolled networks that naturally encode physics and sparsity.

Method: Combines two components: (1) data-driven one-bit quantization with a single global scale, and (2) problem-driven sparsity predefined by physics that requires no updates during training. This exploits structural zeros for compression.

Result: Achieves compression rates below one bit per weight while preserving essential operator geometry. Outperforms competitive baselines like ternary and channel-wise quantization in both memory efficiency and generalization.

Conclusion: PIBiNN provides a more principled approach than ad-hoc sparsification methods, reducing metadata overhead and directly aligning with the underlying task structure through physics-inspired design.

Abstract: Why rely on dense neural networks and then blindly sparsify them when prior knowledge about the problem structure is already available? Many inverse problems admit algorithm-unrolled networks that naturally encode physics and sparsity. In this work, we propose a Physics-Inspired Binary Neural Network (PIBiNN) that combines two key components: (i) data-driven one-bit quantization with a single global scale, and (ii) problem-driven sparsity predefined by physics and requiring no updates during training. This design yields compression rates below one bit per weight by exploiting structural zeros, while preserving essential operator geometry. Unlike ternary or pruning-based schemes, our approach avoids ad-hoc sparsification, reduces metadata overhead, and aligns directly with the underlying task. Experiments suggest that PIBiNN achieves advantages in both memory efficiency and generalization compared to competitive baselines such as ternary and channel-wise quantization.

[1060] On Different Notions of Redundancy in Conditional-Independence-Based Discovery of Graphical Models

Philipp M. Faller, Dominik Janzing

Main category: cs.LG

TL;DR: Redundant conditional independence tests can detect or correct errors in graphical models, but only when they rely on graphical assumptions rather than universal probability properties.

Details

Motivation: Conditional independence tests used in graphical model discovery are unreliable and algorithms are sensitive to errors, but unused redundant tests could potentially detect or correct these errors.

Method: Analyze the potential of redundant conditional independence tests to detect and correct errors in learned graphical models, distinguishing between tests that hold universally versus those that follow from graphical assumptions.

Result: Redundant tests can help detect or correct errors in graphical models, but only those that follow from graphical assumptions are effective - tests that hold for every probability distribution are unlikely to be useful.

Conclusion: Redundant conditional independence tests should be applied selectively, focusing on those that depend on graphical assumptions rather than universal probability properties, to effectively detect and correct errors in graphical model discovery.

Abstract: Conditional-independence-based discovery uses statistical tests to identify a graphical model that represents the independence structure of variables in a dataset. These tests, however, can be unreliable, and algorithms are sensitive to errors and violated assumptions. Often, there are tests that were not used in the construction of the graph. In this work, we show that these redundant tests have the potential to detect or sometimes correct errors in the learned model. But we further show that not all tests contain this additional information and that such redundant tests have to be applied with care. Precisely, we argue that the conditional (in)dependence statements that hold for every probability distribution are unlikely to detect and correct errors - in contrast to those that follow only from graphical assumptions.

[1061] Steering LLMs for Formal Theorem Proving

Shashank Kirtania, Arun Iyer

Main category: cs.LG

TL;DR: The paper introduces activation steering to improve LLM-based theorem proving by adjusting residual activations associated with informal reasoning, enhancing proof generation without fine-tuning.

Details

Motivation: Existing LLM methods for theorem proving struggle with interpreting ambiguous informal cues, and little is known about how LLMs internally represent these cues to influence proof generation.

Method: Activation steering - an inference-time intervention that identifies linear directions in residual activations linked to informal reasoning traces and adjusts them to improve proof construction.

Result: The intervention improves performance in generating formal proofs under both sampling and best-first search decoding strategies without additional training.

Conclusion: Activation steering provides both a practical method for enhancing LLM-based theorem proving and interpretable insights into how reasoning is encoded in LLM activation spaces.

Abstract: Recent advances in automated theorem proving use Large Language Models (LLMs) to translate informal mathematical statements into formal proofs. However, informal cues are often ambiguous or lack strict logical structure, making it hard for models to interpret them precisely. While existing methods achieve strong performance, little is known about how LLMs internally represent informal cues, or how these influence proof generation. To address this, we explore \textit{activation steering}, an inference-time intervention that identifies linear directions in residual activations associated with informal reasoning traces and adjusts them to improve proof construction without fine-tuning. This mechanism also yields interpretable information about how reasoning is internally encoded in the activation space of LLMs. We test our method for generating formal proofs from already-formalized theorems. Our contributions are twofold: (1) a novel activation-based intervention for guiding proof synthesis in LLMs; and (2) demonstration that this intervention improves performance under two decoding strategies (sampling and best-first search) without any further training.

[1062] Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM

Yongqiang Yao, Jingru Tan, Kaihuan Liang, Feizhao Zhang, Jiahao Hu, Shuo Wu, Yazhe Niu, Ruihao Gong, Dahua Lin, Ningyi Xu

Main category: cs.LG

TL;DR: Proposes Hierarchical Balance Packing (HBP) to address training inefficiencies in long-context LLMs by creating multi-level data packing groups with optimized settings, reducing training time by 2.4x for DeepSeek-V2 while maintaining performance.

Details

Motivation: Training long-context LLMs with hybrid data leads to workload imbalances, inefficient attention computation, and wasted communication overhead that existing data packing methods fail to address.

Method: HBP constructs multi-level data packing groups with distinct packing lengths, assigns samples to optimal groups, configures group settings (sequential parallelism, gradient checkpointing), and uses dynamic training pipeline with curriculum learning, adaptive parallelism, and stable loss.

Result: Significantly reduces training time across multiple datasets and models, achieving 2.4x speedup for DeepSeek-V2 (236B) MoE model while maintaining competitive performance.

Conclusion: HBP effectively addresses training inefficiencies in long-context LLMs through hierarchical data organization and optimized training configurations, enabling faster training without compromising model quality.

Abstract: Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue, but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address those inefficiencies. In particular, the HBP constructs multi-level data packing groups, each optimized with a distinct packing length. It assigns training samples to their optimal groups and configures each group with the most effective settings, including sequential parallelism degree and gradient checkpointing configuration. To effectively utilize multi-level groups of data, we design a dynamic training pipeline specifically tailored to HBP, including curriculum learning, adaptive sequential parallelism, and stable loss. Our extensive experiments demonstrate that our method significantly reduces training time over multiple datasets and open-source models while maintaining strong performance. For the largest DeepSeek-V2 (236B) MoE model, our method speeds up the training by 2.4$\times$ with competitive performance. Codes will be released at https://github.com/ModelTC/HBP.

[1063] Adaptive UAV-Assisted Hierarchical Federated Learning: Optimizing Energy, Latency, and Resilience for Dynamic Smart IoT

Xiaohong Yang, Minghui Liwang, Liqun Fu, Yuhan Su, Seyyedali Hosseinalipour, Xianbin Wang, Yiguang Hong

Main category: cs.LG

TL;DR: This paper proposes a hierarchical federated learning approach using UAVs as mobile aggregators for IoT systems with limited connectivity. It addresses joint optimization of learning configuration, bandwidth allocation, and device-to-UAV association to minimize training costs while handling dynamic UAV deployments and communication disruptions.

Details

Motivation: To enable efficient federated learning in geographically dispersed IoT environments with limited cellular connectivity, particularly for applications like remote monitoring and battlefield operations where UAVs can serve as mobile aggregators for terrestrial IoT devices.

Method: Decomposed the NP-hard optimization problem into three subproblems: (1) learning configuration and bandwidth allocation via augmented Lagrangian, (2) device-to-UAV assignment using TD3-based algorithm with fitness scores based on data heterogeneity, proximity, and computational resources, (3) two-stage greedy strategy for UAV redeployment and global aggregator selection.

Result: Experiments on real-world datasets showed the approach effectively reduces training costs and maintains robust performance under communication disruptions.

Conclusion: The proposed hierarchical federated learning framework with UAV aggregators provides an efficient solution for distributed learning in dynamic IoT environments with limited connectivity, successfully handling communication disruptions and optimizing resource allocation.

Abstract: Hierarchical Federated Learning (HFL) extends conventional Federated Learning (FL) by introducing intermediate aggregation layers, enabling distributed learning in geographically dispersed environments, particularly relevant for smart IoT systems, such as remote monitoring and battlefield operations, where cellular connectivity is limited. In these scenarios, UAVs serve as mobile aggregators, dynamically connecting terrestrial IoT devices. This paper investigates an HFL architecture with energy-constrained, dynamically deployed UAVs prone to communication disruptions. We propose a novel approach to minimize global training costs by formulating a joint optimization problem that integrates learning configuration, bandwidth allocation, and device-to-UAV association, ensuring timely global aggregation before UAV disconnections and redeployments. The problem accounts for dynamic IoT devices and intermittent UAV connectivity and is NP-hard. To tackle this, we decompose it into three subproblems: \textit{(i)} optimizing learning configuration and bandwidth allocation via an augmented Lagrangian to reduce training costs; \textit{(ii)} introducing a device fitness score based on data heterogeneity (via Kullback-Leibler divergence), device-to-UAV proximity, and computational resources, using a TD3-based algorithm for adaptive device-to-UAV assignment; \textit{(iii)} developing a low-complexity two-stage greedy strategy for UAV redeployment and global aggregator selection, ensuring efficient aggregation despite UAV disconnections. Experiments on diverse real-world datasets validate the approach, demonstrating cost reduction and robust performance under communication disruptions.

[1064] Clustering by Nonparametric Smoothing

David P. Hofmeyr

Main category: cs.LG

TL;DR: A novel clustering method that estimates cluster membership distributions using nonparametric smoothing, automatically determining both flexibility and number of clusters.

Details

Motivation: To create a clustering approach that avoids explicit modeling assumptions like GMMs and leverages flexible nonparametric estimation for better performance.

Method: Formulates clustering as an estimation problem where a function maps points to cluster membership distributions using nonparametric smoothing with automatic parameter selection.

Result: Strong performance demonstrated on large collection of public datasets compared to relevant benchmarks.

Conclusion: The proposed nonparametric clustering method effectively determines clustering parameters automatically and outperforms existing approaches.

Abstract: A novel formulation of the clustering problem is introduced in which the task is expressed as an estimation problem, where the object to be estimated is a function which maps a point to its distribution of cluster membership. Unlike existing approaches which implicitly estimate such a function, like Gaussian Mixture Models (GMMs), the proposed approach bypasses any explicit modelling assumptions and exploits the flexible estimation potential of nonparametric smoothing. An intuitive approach for selecting the tuning parameters governing estimation is provided, which allows the proposed method to automatically determine both an appropriate level of flexibility and also the number of clusters to extract from a given data set. Experiments on a large collection of publicly available data sets are used to document the strong performance of the proposed approach, in comparison with relevant benchmarks from the literature. R code to implement the proposed approach is available from https://github.com/DavidHofmeyr/CNS

[1065] On the Mathematical Relationship Between Layer Normalization and Dynamic Activation Functions

Felix Stollenwerk

Main category: cs.LG

TL;DR: This paper establishes a theoretical foundation for dynamic activation functions by deriving them from RMSNorm, proposing DyISRU as the exact element-wise counterpart that better reproduces normalization effects than existing methods like DyT.

Details

Motivation: To provide a mathematical foundation for dynamic activation functions like Dynamic Tanh (DyT) by establishing their relationship with layer normalization techniques, particularly RMSNorm, since current dynamic activation methods lack theoretical grounding.

Method: Derived DyT from RMSNorm through decoupling in derivative space and approximation, then applied the same decoupling procedure directly in function space to obtain the exact element-wise counterpart called Dynamic Inverse Square Root Unit (DyISRU).

Result: Numerical demonstrations show that DyISRU reproduces the normalization effect on outliers more accurately than DyT does, providing better performance in handling outlier values.

Conclusion: The paper successfully bridges the gap between layer normalization and dynamic activation functions, proposing DyISRU as a theoretically grounded alternative that more accurately captures normalization effects compared to existing dynamic activation methods.

Abstract: Layer normalization (LN) is an essential component of modern neural networks. While many alternative techniques have been proposed, none of them have succeeded in replacing LN so far. The latest suggestion in this line of research is a dynamic activation function called Dynamic Tanh (DyT). Although it is empirically well-motivated and appealing from a practical point of view, it lacks a theoretical foundation. In this work, we shed light on the mathematical relationship between LN and dynamic activation functions. In particular, we derive DyT from the LN variant RMSNorm, and show that a well-defined decoupling in derivative space as well as an approximation are needed to do so. By applying the same decoupling procedure directly in function space, we are able to omit the approximation and obtain the exact element-wise counterpart of RMSNorm, which we call Dynamic Inverse Square Root Unit (DyISRU). We demonstrate numerically that DyISRU reproduces the normalization effect on outliers more accurately than DyT does.

[1066] Towards Unified and Lossless Latent Space for 3D Molecular Latent Diffusion Modeling

Yanchen Luo, Zhiyuan Liu, Yi Zhao, Sihang Li, Hengxing Cai, Kenji Kawaguchi, Tat-Seng Chua, Yang Zhang, Xiang Wang

Main category: cs.LG

TL;DR: UAE-3D is a unified variational autoencoder that compresses 3D molecules into latent sequences from a single latent space, enabling efficient latent diffusion modeling while maintaining SE(3) equivariance and achieving state-of-the-art performance in molecule generation.

Details

Motivation: Existing 3D molecule generation approaches struggle with integrating multi-modal data (atom types, bonds, 3D coordinates) while maintaining SE(3) equivariance, typically requiring separate latent spaces that reduce training and sampling efficiency.

Method: Proposed UAE-3D uses a multi-modal VAE to compress 3D molecules into unified latent sequences, then employs Diffusion Transformer for latent generation without molecular inductive bias, eliminating complexities of handling multi-modality and equivariance.

Result: Achieved new benchmarks on GEOM-Drugs and QM9 datasets, reducing FCD by 72.6% over previous best results on GEOM-Drugs and achieving over 70% relative average improvements in geometric fidelity.

Conclusion: UAE-3D provides an efficient unified framework for 3D molecule generation that significantly outperforms existing methods in both quality and efficiency while maintaining geometric properties.

Abstract: 3D molecule generation is crucial for drug discovery and material science, requiring models to process complex multi-modalities, including atom types, chemical bonds, and 3D coordinates. A key challenge is integrating these modalities of different shapes while maintaining SE(3) equivariance for 3D coordinates. To achieve this, existing approaches typically maintain separate latent spaces for invariant and equivariant modalities, reducing efficiency in both training and sampling. In this work, we propose \textbf{U}nified Variational \textbf{A}uto-\textbf{E}ncoder for \textbf{3D} Molecular Latent Diffusion Modeling (\textbf{UAE-3D}), a multi-modal VAE that compresses 3D molecules into latent sequences from a unified latent space, while maintaining near-zero reconstruction error. This unified latent space eliminates the complexities of handling multi-modality and equivariance when performing latent diffusion modeling. We demonstrate this by employing the Diffusion Transformer–a general-purpose diffusion model without any molecular inductive bias–for latent generation. Extensive experiments on GEOM-Drugs and QM9 datasets demonstrate that our method significantly establishes new benchmarks in both \textit{de novo} and conditional 3D molecule generation, achieving leading efficiency and quality. On GEOM-Drugs, it reduces FCD by 72.6% over the previous best result, while achieving over 70% relative average improvements in geometric fidelity. Our code is released at https://github.com/lyc0930/UAE-3D/.

[1067] Rethinking Graph Structure Learning in the Era of LLMs

Zhihan Zhang, Xunkai Li, Zhu Lei, Guang Zeng, Ronghua Li, Guoren Wang

Main category: cs.LG

TL;DR: The paper proposes LLaTA, a novel graph structure learning method for text-attributed graphs that uses tree-based LLM in-context learning to enhance topology and text understanding, achieving state-of-the-art performance.

Details

Motivation: Traditional graph structure learning methods are designed for graphs without textual information, creating a need for new approaches that can effectively integrate LLMs with graph data to handle text-attributed graphs.

Method: Reformulates GSL optimization as a tree optimization framework and proposes decoupled, training-free model design using tree-based LLM in-context learning to understand topology and text without fine-tuning.

Result: Extensive experiments on 11 datasets show LLaTA outperforms other LLM-enhanced graph learning methods, achieves state-of-the-art predictive performance, and demonstrates flexibility with any backbone model.

Conclusion: LLaTA provides an effective paradigm for graph structure learning in the LLM era, enabling reliable inference and improved graph structure generation through efficient LLM integration without intensive computation.

Abstract: Recently, the emergence of LLMs has prompted researchers to integrate language descriptions into graphs, aiming to enhance model encoding capabilities from a data-centric perspective. This graph representation is called text-attributed graphs (TAGs). A review of prior advancements highlights that graph structure learning (GSL) is a pivotal technique for improving data utility, making it highly relevant to efficient TAG learning. However, most GSL methods are tailored for traditional graphs without textual information, underscoring the necessity of developing a new GSL paradigm. Despite clear motivations, it remains challenging: (1) How can we define a reasonable optimization objective for GSL in the era of LLMs, considering the massive parameters in LLM? (2) How can we design an efficient model architecture that enables seamless integration of LLM for this optimization objective? For Question 1, we reformulate existing GSL optimization objectives as a tree optimization framework, shifting the focus from obtaining a well-trained edge predictor to a language-aware tree sampler. For Question 2, we propose decoupled and training-free model design principles for LLM integration, shifting the focus from computation-intensive fine-tuning to more efficient inference. Based on this, we propose Large Language and Tree Assistant (LLaTA), which leverages tree-based LLM in-context learning to enhance the understanding of topology and text, enabling reliable inference and generating improved graph structure. Extensive experiments on 11 datasets demonstrate that LLaTA enjoys flexibility-incorporated with any backbone; scalability-outperforms other LLM-enhanced graph learning methods; effectiveness-achieves SOTA predictive performance.

[1068] DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training

Zhenting Wang, Guofeng Cui, Yu-Jhe Li, Kun Wan, Wentian Zhao

Main category: cs.LG

TL;DR: A curriculum learning framework for RL-based LLM post-training that dynamically schedules training across diverse data distributions using policy advantages and UCB principle to optimize learning efficiency.

Details

Motivation: Existing RL-based post-training methods treat training data as unified, ignoring that modern LLM training involves heterogeneous data from diverse distributions with varying sources and difficulties, creating a need for adaptive training scheduling.

Method: Proposes distribution-level curriculum learning using policy advantages to measure learnability, and applies Upper Confidence Bound (UCB) to dynamically adjust sampling probabilities across distributions, balancing exploitation (high advantage) and exploration (low sample count).

Result: Experiments with GRPO on logic reasoning datasets show significant improvements in convergence speed and final performance compared to non-curriculum approaches.

Conclusion: Distribution-aware curriculum strategies are valuable for LLM post-training, with the proposed framework effectively optimizing training efficiency across heterogeneous data distributions.

Abstract: Recent advances in reinforcement learning (RL)-based post-training have led to notable improvements in large language models (LLMs), particularly in enhancing their reasoning capabilities to handle complex tasks. However, most existing methods treat the training data as a unified whole, overlooking the fact that modern LLM training often involves a mixture of data from diverse distributions-varying in both source and difficulty. This heterogeneity introduces a key challenge: how to adaptively schedule training across distributions to optimize learning efficiency. In this paper, we present a principled curriculum learning framework grounded in the notion of distribution-level learnability. Our core insight is that the magnitude of policy advantages reflects how much a model can still benefit from further training on a given distribution. Based on this, we propose a distribution-level curriculum learning framework for RL-based LLM post-training, which leverages the Upper Confidence Bound (UCB) principle to dynamically adjust sampling probabilities for different distrubutions. This approach prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration), yielding an adaptive and theoretically grounded training schedule. We instantiate our curriculum learning framework with GRPO as the underlying RL algorithm and demonstrate its effectiveness on logic reasoning datasets with multiple difficulties and sources. Our experiments show that our framework significantly improves convergence speed and final performance, highlighting the value of distribution-aware curriculum strategies in LLM post-training. Code: https://github.com/ZhentingWang/DUMP.

[1069] VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization

Menglan Chen, Xianghe Pang, Jingjing Dong, WenHao Wang, Yaxin Du, Siheng Chen

Main category: cs.LG

TL;DR: VLMGuard-R1 is a proactive safety framework that uses multimodal reasoning to rewrite user prompts, enhancing vision-language model safety without modifying core model parameters.

Details

Motivation: Vision-language models present unique safety challenges due to multimodal complexity, requiring new approaches beyond conventional safeguards to address subtle threats emerging from text-image interactions.

Method: A three-stage reasoning pipeline synthesizes training data for a reasoning-guided rewriter that dynamically interprets text-image interactions to refine user prompts, enabling tailored safety responses.

Result: Extensive experiments show VLMGuard-R1 outperforms four baselines, achieving a 43.59% average safety improvement across five models on the SIUO benchmark.

Conclusion: Multimodal reasoning-driven prompt rewriting provides an effective approach for enhancing VLM safety while maintaining model functionality across diverse architectures.

Abstract: Aligning Vision-Language Models (VLMs) with safety standards is essential to mitigate risks arising from their multimodal complexity, where integrating vision and language unveils subtle threats beyond the reach of conventional safeguards. Inspired by the insight that reasoning across modalities is key to preempting intricate vulnerabilities, we propose a novel direction for VLM safety: multimodal reasoning-driven prompt rewriting. To this end, we introduce VLMGuard-R1, a proactive framework that refines user inputs through a reasoning-guided rewriter, dynamically interpreting text-image interactions to deliver refined prompts that bolster safety across diverse VLM architectures without altering their core parameters. To achieve this, we devise a three-stage reasoning pipeline to synthesize a dataset that trains the rewriter to infer subtle threats, enabling tailored, actionable responses over generic refusals. Extensive experiments across three benchmarks with five VLMs reveal that VLMGuard-R1 outperforms four baselines. In particular, VLMGuard-R1 achieves a remarkable 43.59% increase in average safety across five models on the SIUO benchmark.

[1070] Why Ask One When You Can Ask $k$? Learning-to-Defer to the Top-$k$ Experts

Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi

Main category: cs.LG

TL;DR: First framework for Top-k Learning-to-Defer that allocates queries to k most cost-effective experts, generalizing prior approaches and enabling principled multi-expert collaboration.

Details

Motivation: Existing L2D frameworks are limited to single-expert deferral, preventing use of collective expertise and collaboration with multiple experts.

Method: Propose Top-k Learning-to-Defer framework and adaptive Top-k(x) variant that learns optimal number of experts per query. Develop novel Bayes-consistent surrogate loss that works across different k values.

Result: Experiments show superior accuracy-cost trade-offs compared to existing approaches, enabling flexible multi-expert deferral policies.

Conclusion: Opens new direction for multi-expert deferral in L2D, with framework that unifies and generalizes prior approaches while enabling principled collaboration with multiple experts.

Abstract: Existing Learning-to-Defer (L2D) frameworks are limited to single-expert deferral, forcing each query to rely on only one expert and preventing the use of collective expertise. We introduce the first framework for Top-$k$ Learning-to-Defer, which allocates queries to the $k$ most cost-effective entities. Our formulation unifies and strictly generalizes prior approaches, including the one-stage and two-stage regimes, selective prediction, and classical cascades. In particular, it recovers the usual Top-1 deferral rule as a special case while enabling principled collaboration with multiple experts when $k>1$. We further propose Top-$k(x)$ Learning-to-Defer, an adaptive variant that learns the optimal number of experts per query based on input difficulty, expert quality, and consultation cost. To enable practical learning, we develop a novel surrogate loss that is Bayes-consistent, $\mathcal{H}_h$-consistent in the one-stage setting, and $(\mathcal{H}_r,\mathcal{H}_g)$-consistent in the two-stage setting. Crucially, this surrogate is independent of $k$, allowing a single policy to be learned once and deployed flexibly across $k$. Experiments across both regimes show that Top-$k$ and Top-$k(x)$ deliver superior accuracy-cost trade-offs, opening a new direction for multi-expert deferral in L2D.

[1071] An Effective Gram Matrix Characterizes Generalization in Deep Networks

Rubing Yang, Pratik Chaudhari

Main category: cs.LG

TL;DR: The paper derives a differential equation to model generalization gap evolution during neural network training, controlled by contraction and perturbation factors. It introduces an ’effective Gram matrix’ concept and shows how alignment with initial residuals predicts test loss.

Details

Motivation: To understand and mathematically characterize how the generalization gap evolves during deep network training with gradient descent, providing insights into why neural networks generalize well despite their complexity.

Method: Derived a differential equation governing generalization gap evolution, analyzed it to compute an effective Gram matrix, and examined alignment between this matrix and initial residuals through empirical evaluations on image classification datasets.

Result: The analysis accurately predicts test loss, and shows that during training, residuals predominantly lie in the subspace of the effective Gram matrix with smallest eigenvalues, indicating slow accumulation of generalization gap along training direction.

Conclusion: The approach provides novel perspectives for explaining neural network generalization ability through the alignment pattern between residuals and the effective Gram matrix, characterizing benign training processes.

Abstract: We derive a differential equation that governs the evolution of the generalization gap when a deep network is trained by gradient descent. This differential equation is controlled by two quantities, a contraction factor that brings together trajectories corresponding to slightly different datasets, and a perturbation factor that accounts for them training on different datasets. We analyze this differential equation to compute an effective Gram matrix'' that characterizes the generalization gap in terms of the alignment between this Gram matrix and a certain initial residual’’. Empirical evaluations on image classification datasets indicate that this analysis can predict the test loss accurately. Further, during training, the residual predominantly lies in the subspace of the effective Gram matrix with the smallest eigenvalues. This indicates that the generalization gap accumulates slowly along the direction of training, charactering a benign training process. We provide novel perspectives for explaining the generalization ability of neural network training with different datasets and architectures through the alignment pattern of the residual" and the effective Gram matrix".

[1072] Accurate and Diverse LLM Mathematical Reasoning via Automated PRM-Guided GFlowNets

Adam Younsi, Ahmed Attia, Abdalgader Abubaker, Mohamed El Amine Seddik, Hakim Hacid, Salem Lahlou

Main category: cs.LG

TL;DR: The paper introduces a Process Reward Model (PRM) trained automatically using Monte Carlo Tree Search and similarity-based data augmentation to evaluate intermediate reasoning steps in LLMs. It then adapts GFlowNets to operate at step level, enabling diverse high-quality solutions proportional to rewards.

Details

Motivation: Achieving both accuracy and diverse reasoning in complex domains like mathematics is challenging for LLMs. A key bottleneck is evaluating intermediate reasoning steps without costly human annotations.

Method: 1) Train Process Reward Model (PRM) using Monte Carlo Tree Search with similarity-based data augmentation to capture step-level reasoning quality. 2) Adapt Generative Flow Networks (GFlowNets) to operate at reasoning step level, sampling diverse solutions proportional to PRM rewards.

Result: Strong improvements in accuracy (+2.59% absolute on MATH Level 5 for Llama3.2-3B) and solution diversity on mathematical benchmarks. Effective generalization to unseen datasets (+9.4% absolute on SAT MATH). PRM shows superior alignment with reasoning quality compared to existing reward models.

Conclusion: The work demonstrates the potential of PRM-guided, step-level GFlowNets for developing more robust and versatile mathematical reasoning in LLMs.

Abstract: Achieving both accuracy and diverse reasoning remains challenging for Large Language Models (LLMs) in complex domains like mathematics. A key bottleneck is evaluating intermediate reasoning steps to guide generation without costly human annotations. To address this, we first introduce a novel Process Reward Model (PRM) trained automatically using Monte Carlo Tree Search coupled with a similarity-based data augmentation technique, effectively capturing step-level reasoning quality. Leveraging this PRM, we then adapt Generative Flow Networks (GFlowNets) to operate at the reasoning step level. Unlike traditional reinforcement learning focused on maximizing a single reward, GFlowNets naturally sample diverse, high-quality solutions proportional to their rewards, as measured by our PRM. Empirical evaluation shows strong improvements in both accuracy and solution diversity on challenging mathematical benchmarks (e.g., +2.59% absolute accuracy on MATH Level 5 for Llama3.2-3B), with effective generalization to unseen datasets (+9.4% absolute on SAT MATH). Furthermore, we benchmark our PRM against existing open-source reward models, demonstrating superior alignment with reasoning quality and more consistent guidance for downstream generation. Our work demonstrates the potential of PRM-guided, step-level GFlowNets for developing more robust and versatile mathematical reasoning in LLMs.

[1073] A Representation Learning Approach to Feature Drift Detection in Wireless Networks

Athanasios Tziouvaras, Blaz Bertalanic, George Floros, Kostas Kolomvatsos, Panagiotis Sarigiannidis, Carolina Fortuna

Main category: cs.LG

TL;DR: ALERT is a method for detecting feature distribution changes in AI models for wireless networks, triggering re-training to maintain performance in wireless fingerprinting and link anomaly detection use cases.

Details

Motivation: AI models in wireless networks can degrade due to feature distribution changes, leading to undesired behaviors. Current methods may not detect this degradation effectively.

Method: ALERT uses three components: representation learning (MLP), statistical testing (Kolmogorov-Smirnov and Population Stability Index tests), and a new utility assessment function.

Result: ALERT shows superiority over ten standard drift detection methods from literature on two wireless network use cases.

Conclusion: The proposed ALERT method effectively detects feature distribution changes and triggers model re-training, outperforming existing methods in wireless network applications.

Abstract: AI is foreseen to be a centerpiece in next generation wireless networks enabling enabling ubiquitous communication as well as new services. However, in real deployment, feature distribution changes may degrade the performance of AI models and lead to undesired behaviors. To counter for undetected model degradation, we propose ALERT; a method that can detect feature distribution changes and trigger model re-training that works well on two wireless network use cases: wireless fingerprinting and link anomaly detection. ALERT includes three components: representation learning, statistical testing and utility assessment. We rely on MLP for designing the representation learning component, on Kolmogorov-Smirnov and Population Stability Index tests for designing the statistical testing and a new function for utility assessment. We show the superiority of the proposed method against ten standard drift detection methods available in the literature on two wireless network use cases.

[1074] GenoArmory: A Unified Evaluation Framework for Adversarial Attacks on Genomic Foundation Models

Haozheng Luo, Chenghao Qiu, Yimin Wang, Shang Wu, Jiahao Yu, Zhenyu Pan, Weian Mao, Haoyang Fang, Hao Xu, Han Liu, Binghui Wang, Yan Chen

Main category: cs.LG

TL;DR: GenoArmory is the first unified adversarial attack benchmark for Genomic Foundation Models (GFMs), providing comprehensive evaluation of model vulnerabilities using various attack algorithms and defense strategies.

Details

Motivation: Existing GFM benchmarks lack systematic assessment of adversarial robustness, creating a need for a comprehensive framework to evaluate model vulnerabilities in genomic AI.

Method: Evaluated 5 state-of-the-art GFMs using 4 attack algorithms and 3 defense strategies, analyzing vulnerabilities across model architecture, quantization schemes, and training datasets. Created GenoAdv dataset for improved safety.

Result: Classification models showed greater robustness than generative models, with adversarial attacks frequently targeting biologically significant genomic regions, indicating models capture meaningful sequence features.

Conclusion: GenoArmory provides an essential framework for assessing GFM security vulnerabilities, revealing task-dependent robustness patterns and demonstrating that models learn biologically relevant features through adversarial analysis.

Abstract: We propose the first unified adversarial attack benchmark for Genomic Foundation Models (GFMs), named GenoArmory. Unlike existing GFM benchmarks, GenoArmory offers the first comprehensive evaluation framework to systematically assess the vulnerability of GFMs to adversarial attacks. Methodologically, we evaluate the adversarial robustness of five state-of-the-art GFMs using four widely adopted attack algorithms and three defense strategies. Importantly, our benchmark provides an accessible and comprehensive framework to analyze GFM vulnerabilities with respect to model architecture, quantization schemes, and training datasets. Additionally, we introduce GenoAdv, a new adversarial sample dataset designed to improve GFM safety. Empirically, classification models exhibit greater robustness to adversarial perturbations compared to generative models, highlighting the impact of task type on model vulnerability. Moreover, adversarial attacks frequently target biologically significant genomic regions, suggesting that these models effectively capture meaningful sequence features.

[1075] Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers

Zhexiang Li, Haoyu Wang, Yutong Bao, David Woodruff

Main category: cs.LG

TL;DR: Proposes a pre-scoring mechanism to improve HyperAttention by prioritizing significant keys before applying attention, achieving better perplexity while maintaining efficiency.

Details

Motivation: HyperAttention fails to find all significant keys, which raises perplexity. A pre-scoring mechanism can better identify important keys to improve attention quality.

Method: Introduces three pre-scoring methods: k-means/kernel k-means clustering, k-median clustering, and leverage score-based ranking. Replaces HyperAttention’s uniform residual sampling with pre-scoring.

Result: Reduces perplexity from 12 to 8.3 on ChatGLM2 (131k context), outperforms standard HyperAttention. On ViT, achieves similar accuracy to LevAttention and surpasses it with specific parameters. 20x faster than FlashAttention.

Conclusion: Integrating pre-scoring into hierarchical attention mechanisms significantly improves transformer efficiency, providing balanced trade-off between speed and accuracy.

Abstract: Recent advances in transformer architectures deeply enhanced long-context language modeling. Among them, HyperAttention achieves competitive efficiency by combining a single-level LSH-based clustering with uniform residual sampling. However, HyperAttention fails to find all significant keys, which in turn raises the overall perplexity. We propose a pre-scoring mechanism that prioritizes significant keys before applying HyperAttention. We introduce three scoring methods: $k$-means and kernel $k$-means clustering, $k$-median clustering, and leverage score-based ranking (inspired by LevAttention) to filter keys effectively. We further replace HyperAttention’s original uniform residual sampling, relying exclusively on our pre-scoring mechanism. Experiments on ChatGLM2 (131k token context) reduce perplexity from 12 to 8.3, which outperforms standard HyperAttention. Moreover, when running on the Vision-Transformer (ViT), our method shows that it can guarantee similar accuracy compared with LevAttention, and will surpass LevAttention given specific parameters. Although this method introduces some computational overhead, its combination with HyperAttention achieves up to 20 times faster than FlashAttention, providing a balanced trade-off between speed and modeling accuracy. Our results highlight the effectiveness of integrating pre-scoring into hierarchical attention mechanisms, significantly improving transformer efficiency.

[1076] A Set-Sequence Model for Time Series

Elliot L. Epstein, Apaar Sadhwani, Kay Giesecke

Main category: cs.LG

TL;DR: Set-Sequence model combines permutation-invariant Set modules with Sequence modules to learn cross-sectional structure directly, eliminating manual feature engineering for large cross-sections of individual time series.

Details

Motivation: Many prediction problems involve large cross-sections of individual time series where capturing cross-sectional effects still relies on hand-crafted summary features, limiting expressivity and requiring manual engineering.

Method: Proposes Set-Sequence architecture with Set module that summarizes unit sets permutation-invariantly at each time step, and Sequence module that models each unit’s dynamics conditioned on both its features and learned summary. Supports unaligned series, varying unit counts, integrates with Transformers, scales linearly.

Result: Significantly outperforms strong baselines across synthetic contagion task, equity portfolio optimization, and loan risk prediction, delivering higher Sharpe ratios, improved AUCs, and interpretable cross-sectional summaries.

Conclusion: Set-Sequence model effectively learns cross-sectional structure directly, enhancing prediction performance while eliminating manual feature engineering across diverse applications.

Abstract: Many prediction problems across science and engineering, especially in finance and economics, involve large cross-sections of individual time series, where each unit (e.g., a loan, stock, or customer) is driven by unit-level features and latent cross-sectional dynamics. While sequence models have advanced per-unit temporal prediction, capturing cross-sectional effects often still relies on hand-crafted summary features. We propose Set-Sequence, a model that learns cross-sectional structure directly, enhancing expressivity and eliminating manual feature engineering. At each time step, a permutation-invariant Set module summarizes the unit set; a Sequence module then models each unit’s dynamics conditioned on both its features and the learned summary. The architecture accommodates unaligned series, supports varying numbers of units at inference, integrates with standard sequence backbones (e.g., Transformers), and scales linearly in cross-sectional size. Across a synthetic contagion task and two large-scale real-world applications, equity portfolio optimization and loan risk prediction, Set-Sequence significantly outperforms strong baselines, delivering higher Sharpe ratios, improved AUCs, and interpretable cross-sectional summaries.

[1077] Quantization Meets Reasoning: Exploring and Mitigating Degradation of Low-Bit LLMs in Mathematical Reasoning

Zhen Li, Yupeng Su, Songmiao Wang, Runming Yang, Congkai Xie, Aofan Liu, Ming Li, Jiannong Cao, Ngai Wong, Hongxia Yang

Main category: cs.LG

TL;DR: Low-bit PTQ severely impairs LLM mathematical reasoning. The paper identifies that failures occur early in step-structured solutions and proposes a lightweight intervention to restore performance by detecting the first faulty step and applying targeted tuning.

Details

Motivation: Low-bit post-training quantization is essential for deploying LLMs under memory constraints but severely degrades mathematical reasoning capabilities. The research aims to understand where this degradation occurs in step-by-step reasoning processes and develop efficient mitigation strategies.

Method: Used format-aligned chain-of-thought with step-aligned attribution across multiple PTQ methods, model families, and benchmarks. Proposed a measure→locate→restore loop that detects the first faulty step, constructs “Silver Bullet” datasets, and applies small-scale supervised/preference tuning.

Result: The approach recovers 4-bit weight math reasoning toward full-precision baseline with only 332 curated examples and 3-5 minutes of compute on a single GPU, while preserving PTQ efficiency. The framework is quantizer- and architecture-agnostic.

Conclusion: Low-bit degradation in mathematical reasoning can be addressed as a local, reproducible process intervention rather than a global accuracy problem, enabling efficient deployment of reasoning-capable LLMs under tight constraints.

Abstract: Low-bit post-training quantization (PTQ) is a practical route to deploy reasoning-capable LLMs under tight memory and latency budgets, yet it can markedly impair mathematical reasoning (drops up to 69.81% in our harder settings). We address two deployment-critical questions with process-level precision: Where along a step-structured solution does degradation first arise? How to mitigate it while staying in the low-bit regime? Across widely used PTQ methods (AWQ, GPTQ, SmoothQuant), open-source model families (Qwen, LLaMA; 0.5–7B), and math reasoning benchmarks (GSM8K, MATH, AIME), we perform format-aligned chain-of-thought with step-aligned attribution and uncover two robust regularities: (i) PTQ disproportionately elevates method and execution errors relative to high-level conceptual mistakes; and (ii) failures emerge early, with the first vulnerable step flipping and cascading to the final answer. These regularities suggest a general intervention principle: restore local token-level margins exactly at the earliest failure frontier. We instantiate this principle as a lightweight measure$\rightarrow$locate$\rightarrow$restore loop that operates directly on the quantized model: detect the first faulty step, construct our “Silver Bullet” datasets, and apply small-scale supervised/preference tuning. In our settings, as few as 332 curated examples and 3–5 minutes of compute on a single GPU recover 4-bit weight math reasoning toward the full-precision baseline while preserving PTQ efficiency. Our framework is quantizer- and architecture-agnostic within the evaluated regimes, and turns low-bit degradation from a global accuracy problem into a local, reproducible process intervention.

[1078] Simple and Effective Specialized Representations for Fair Classifiers

Alberto Sinigaglia, Davide Sartor, Marina Ceccon, Gian Antonio Susto

Main category: cs.LG

TL;DR: Proposes a fair classification method using characteristic function distance to remove sensitive information from representations while maintaining task effectiveness, offering better stability and efficiency than adversarial or distribution matching approaches.

Details

Motivation: Address limitations of existing fair classification methods - adversarial learning is unstable and distribution matching is computationally intensive, while meeting regulatory requirements for high-stakes decision-making.

Method: Uses characteristic function distance to ensure learned representations contain minimal sensitive information. Introduces a simple relaxation of the objective function that guarantees fairness in common classification models without performance degradation.

Result: Experimental results on benchmark datasets show the approach consistently matches or achieves better fairness and predictive accuracy than existing methods, while maintaining robustness and computational efficiency.

Conclusion: The proposed method provides a practical, stable, and efficient solution for fair classification that outperforms traditional approaches and is suitable for real-world applications.

Abstract: Fair classification is a critical challenge that has gained increasing importance due to international regulations and its growing use in high-stakes decision-making settings. Existing methods often rely on adversarial learning or distribution matching across sensitive groups; however, adversarial learning can be unstable, and distribution matching can be computationally intensive. To address these limitations, we propose a novel approach based on the characteristic function distance. Our method ensures that the learned representation contains minimal sensitive information while maintaining high effectiveness for downstream tasks. By utilizing characteristic functions, we achieve a more stable and efficient solution compared to traditional methods. Additionally, we introduce a simple relaxation of the objective function that guarantees fairness in common classification models with no performance degradation. Experimental results on benchmark datasets demonstrate that our approach consistently matches or achieves better fairness and predictive accuracy than existing methods. Moreover, our method maintains robustness and computational efficiency, making it a practical solution for real-world applications.

[1079] Lightweight and Interpretable Transformer via Mixed Graph Algorithm Unrolling for Traffic Forecast

Ji Qi, Tam Thuc Do, Mingxiao Liu, Zhuoshi Pan, Yuzhe Li, Gene Cheung, H. Vicky Zhao

Main category: cs.LG

TL;DR: The paper proposes a lightweight and interpretable transformer-like neural network for traffic forecasting by unrolling a mixed-graph optimization algorithm, using undirected and directed graphs to capture spatial and temporal correlations respectively.

Details

Motivation: To create a more interpretable and lightweight alternative to conventional 'black-box' transformers with classical self-attention mechanisms for traffic forecasting.

Method: Constructs two graphs (undirected for spatial correlations, directed for temporal relationships), designs l2 and l1-norm variational terms for signal smoothness, develops an ADMM-based iterative algorithm, and unrolls it into a feed-forward network with graph learning modules replacing self-attention.

Result: The unrolled networks achieve competitive traffic forecast performance compared to state-of-the-art methods while drastically reducing parameter counts.

Conclusion: The proposed approach successfully creates an interpretable and lightweight transformer-like architecture for traffic forecasting that maintains competitive performance with significantly fewer parameters.

Abstract: Unlike conventional “black-box” transformers with classical self-attention mechanism, we build a lightweight and interpretable transformer-like neural net by unrolling a mixed-graph-based optimization algorithm to forecast traffic with spatial and temporal dimensions. We construct two graphs: an undirected graph $\mathcal{G}^u$ capturing spatial correlations across geography, and a directed graph $\mathcal{G}^d$ capturing sequential relationships over time. We predict future samples of signal $\mathbf{x}$, assuming it is “smooth” with respect to both $\mathcal{G}^u$ and $\mathcal{G}^d$, where we design new $\ell_2$ and $\ell_1$-norm variational terms to quantify and promote signal smoothness (low-frequency reconstruction) on a directed graph. We design an iterative algorithm based on alternating direction method of multipliers (ADMM), and unroll it into a feed-forward network for data-driven parameter learning. We insert graph learning modules for $\mathcal{G}^u$ and $\mathcal{G}^d$ that play the role of self-attention. Experiments show that our unrolled networks achieve competitive traffic forecast performance as state-of-the-art prediction schemes, while reducing parameter counts drastically. Our code is available in https://github.com/SingularityUndefined/Unrolling-GSP-STForecast .

[1080] Approximation theory for 1-Lipschitz ResNets

Davide Murari, Takashi Furuya, Carola-Bibiane Schönlieb

Main category: cs.LG

TL;DR: This paper provides universal approximation guarantees for 1-Lipschitz residual networks, showing they can approximate any 1-Lipschitz function on compact domains with proper width/depth scaling.

Details

Motivation: 1-Lipschitz neural networks are fundamental for generative modeling, inverse problems, and robust classifiers, but their approximation capabilities needed rigorous theoretical foundation.

Method: The authors study 1-Lipschitz ResNets based on explicit Euler steps of negative gradient flows, using the Restricted Stone-Weierstrass Theorem and inserting norm-constrained linear maps between residual blocks.

Result: The paper shows that 1-Lipschitz ResNets are dense in the set of scalar 1-Lipschitz functions on compact domains, can exactly represent piecewise affine 1-Lipschitz functions, and maintain density even with fixed hidden width.

Conclusion: This work provides the first universal approximation guarantees for 1-Lipschitz ResNets, establishing a rigorous theoretical foundation for their practical use in applications requiring Lipschitz constraints.

Abstract: 1-Lipschitz neural networks are fundamental for generative modelling, inverse problems, and robust classifiers. In this paper, we focus on 1-Lipschitz residual networks (ResNets) based on explicit Euler steps of negative gradient flows and study their approximation capabilities. Leveraging the Restricted Stone-Weierstrass Theorem, we first show that these 1-Lipschitz ResNets are dense in the set of scalar 1-Lipschitz functions on any compact domain when width and depth are allowed to grow. We also show that these networks can exactly represent scalar piecewise affine 1-Lipschitz functions. We then prove a stronger statement: by inserting norm-constrained linear maps between the residual blocks, the same density holds when the hidden width is fixed. Because every layer obeys simple norm constraints, the resulting models can be trained with off-the-shelf optimisers. This paper provides the first universal approximation guarantees for 1-Lipschitz ResNets, laying a rigorous foundation for their practical use.

[1081] Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression

Xiaohui Wang, Peng Ye, Chenyu Huang, Shenghe Zheng, Bo Zhang, Lei Bai, Wanli Ouyang, Tao Chen

Main category: cs.LG

TL;DR: UltraDelta is a data-free delta compression method that achieves ultra-high compression while maintaining strong performance across various model types by minimizing redundancy and maximizing information through variance-based sparsity allocation, distribution-aware compression, and trace-norm-guided rescaling.

Details

Motivation: With the fine-tuned-pretrained paradigm, storing multiple fine-tuned models creates significant storage overhead. Existing delta compression methods fail to maintain both high compression and performance, and often rely on data.

Method: Three key components: (1) Variance-Based Mixed Sparsity Allocation assigns sparsity based on variance, (2) Distribution-Aware Compression applies uniform quantization and group-wise pruning, (3) Trace-Norm-Guided Rescaling uses trace norm for global rescaling.

Result: Achieves up to 50x compression for LLMs, 224x for NLP models, 132x for vision models, and 18x for multi-modal models. Consistently outperforms existing methods, especially under ultra-high compression.

Conclusion: UltraDelta provides an effective data-free delta compression solution that maintains strong performance while achieving ultra-high compression rates across diverse model types.

Abstract: With the rise of the fine-tuned-pretrained paradigm, storing numerous fine-tuned models for multi-tasking creates significant storage overhead. Delta compression alleviates this by storing only the pretrained model and the highly compressed delta weights (the differences between fine-tuned and pretrained model weights). However, existing methods fail to maintain both high compression and performance, and often rely on data. To address these challenges, we propose UltraDelta, the first data-free delta compression pipeline that achieves both ultra-high compression and strong performance. UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions, using three key components: (1) Variance-Based Mixed Sparsity Allocation assigns sparsity based on variance, giving lower sparsity to high-variance layers to preserve inter-layer information. (2) Distribution-Aware Compression applies uniform quantization and then groups parameters by value, followed by group-wise pruning, to better preserve intra-layer distribution. (3) Trace-Norm-Guided Rescaling uses the trace norm of delta weights to estimate a global rescaling factor, improving model stability under higher compression. Extensive experiments across (a) large language models (fine-tuned on LLaMA-2 7B and 13B) with up to 50x compression, (b) general NLP models (RoBERTa-base, T5-base) with up to 224x compression, (c) vision models (ViT-B/32, ViT-L/14) with up to 132x compression, and (d) multi-modal models (BEiT-3) with 18x compression, demonstrate that UltraDelta consistently outperforms existing methods, especially under ultra-high compression. Code is available at https://github.com/xiaohuiwang000/UltraDelta.

[1082] Does Machine Unlearning Truly Remove Knowledge?

Haokun Chen, Yueqi Zhang, Yuan Bi, Yao Zhang, Tong Liu, Jinhe Bi, Jian Lan, Jindong Gu, Claudia Grosser, Denis Krompass, Nassir Navab, Volker Tresp

Main category: cs.LG

TL;DR: A comprehensive auditing framework for evaluating machine unlearning algorithms in LLMs, featuring benchmark datasets, multiple unlearning methods, and novel activation-based auditing techniques.

Details

Motivation: Address concerns about data privacy and copyright in LLMs by developing effective evaluation methods for machine unlearning algorithms that remove sensitive information without costly retraining.

Method: Proposed framework includes three benchmark datasets, six unlearning algorithms, five prompt-based auditing methods, and a novel technique using intermediate activation perturbations.

Result: The framework enables systematic evaluation of unlearning effectiveness and robustness, with activation-based auditing overcoming limitations of input/output-only methods.

Conclusion: The comprehensive auditing framework provides a robust methodology for assessing machine unlearning in LLMs, addressing privacy concerns while maintaining model utility.

Abstract: In recent years, Large Language Models (LLMs) have achieved remarkable advancements, drawing significant attention from the research community. Their capabilities are largely attributed to large-scale architectures, which require extensive training on massive datasets. However, such datasets often contain sensitive or copyrighted content sourced from the public internet, raising concerns about data privacy and ownership. Regulatory frameworks, such as the General Data Protection Regulation (GDPR), grant individuals the right to request the removal of such sensitive information. This has motivated the development of machine unlearning algorithms that aim to remove specific knowledge from models without the need for costly retraining. Despite these advancements, evaluating the efficacy of unlearning algorithms remains a challenge due to the inherent complexity and generative nature of LLMs. In this work, we introduce a comprehensive auditing framework for unlearning evaluation, comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods. By using various auditing algorithms, we evaluate the effectiveness and robustness of different unlearning strategies. To explore alternatives beyond prompt-based auditing, we propose a novel technique that leverages intermediate activation perturbations, addressing the limitations of auditing methods that rely solely on model inputs and outputs.

[1083] Personalized Bayesian Federated Learning with Wasserstein Barycenter Aggregation

Ting Wei, Biao Mei, Junliang Lyu, Renquan Zhang, Feng Zhou, Yifan Sun

Main category: cs.LG

TL;DR: FedWBA is a personalized Bayesian federated learning method that uses particle-based variational inference for local posterior estimation and Wasserstein barycenter aggregation for global model combination, addressing limitations of parametric assumptions and naive averaging in existing methods.

Details

Motivation: Existing personalized Bayesian federated learning methods have restrictive parametric assumptions for client posterior inference and use naive parameter averaging for server aggregation, which limits their effectiveness.

Method: FedWBA uses particle-based variational inference for nonparametric posterior representation at client level and particle-based Wasserstein barycenter aggregation at server level for more geometrically meaningful model combination.

Result: Theoretical analysis shows local KL divergence decrease lower bound and global convergence to true parameters. Empirical experiments demonstrate superior performance in prediction accuracy, uncertainty calibration, and convergence rate compared to baselines.

Conclusion: FedWBA effectively addresses limitations of existing PBFL methods through its nonparametric local inference and geometrically meaningful global aggregation, with both theoretical guarantees and empirical validation.

Abstract: Personalized Bayesian federated learning (PBFL) handles non-i.i.d. client data and quantifies uncertainty by combining personalization with Bayesian inference. However, existing PBFL methods face two limitations: restrictive parametric assumptions in client posterior inference and naive parameter averaging for server aggregation. To overcome these issues, we propose FedWBA, a novel PBFL method that enhances both local inference and global aggregation. At the client level, we use particle-based variational inference for nonparametric posterior representation. At the server level, we introduce particle-based Wasserstein barycenter aggregation, offering a more geometrically meaningful approach. Theoretically, we provide local and global convergence guarantees for FedWBA. Locally, we prove a KL divergence decrease lower bound per iteration for variational inference convergence. Globally, we show that the Wasserstein barycenter converges to the true parameter as the client data size increases. Empirically, experiments show that FedWBA outperforms baselines in prediction accuracy, uncertainty calibration, and convergence rate, with ablation studies confirming its robustness.

[1084] Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism

Kunyun Wang, Bohan Li, Kai Yu, Minyi Guo, Jieru Zhao

Main category: cs.LG

TL;DR: ParaStep is a parallelization method that accelerates diffusion model inference by exploiting similarity between adjacent denoising steps using a reuse-then-predict mechanism with lightweight step-wise communication.

Details

Motivation: Diffusion models suffer from significant inference latency due to their sequential denoising process, and existing parallelization strategies incur high communication overhead that hinders deployment on commercial hardware.

Method: Proposes ParaStep based on a reuse-then-predict mechanism that parallelizes diffusion inference by exploiting similarity between adjacent denoising steps, using lightweight step-wise communication instead of layer-wise or stage-wise communication.

Result: Achieves end-to-end speedups of 3.88× on SVD, 2.43× on CogVideoX-2b, and 6.56× on AudioLDM2-large while maintaining generation quality.

Conclusion: ParaStep is a scalable and communication-efficient solution for accelerating diffusion inference, particularly in bandwidth-constrained environments.

Abstract: Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. However, their deployment is often limited by significant inference latency, primarily due to the inherently sequential nature of the denoising process. While existing parallelization strategies attempt to accelerate inference by distributing computation across multiple devices, they typically incur high communication overhead, hindering deployment on commercial hardware. To address this challenge, we propose \textbf{ParaStep}, a novel parallelization method based on a reuse-then-predict mechanism that parallelizes diffusion inference by exploiting similarity between adjacent denoising steps. Unlike prior approaches that rely on layer-wise or stage-wise communication, ParaStep employs lightweight, step-wise communication, substantially reducing overhead. ParaStep achieves end-to-end speedups of up to \textbf{3.88}$\times$ on SVD, \textbf{2.43}$\times$ on CogVideoX-2b, and \textbf{6.56}$\times$ on AudioLDM2-large, while maintaining generation quality. These results highlight ParaStep as a scalable and communication-efficient solution for accelerating diffusion inference, particularly in bandwidth-constrained environments.

[1085] LLMSynthor: Macro-Aligned Micro-Records Synthesis with Large Language Models

Yihong Tang, Menglin Kong, Junlin He, Tong Nie, Lijun Sun

Main category: cs.LG

TL;DR: LLMSynthor uses pretrained LLMs to generate realistic micro-level data that matches target macro-statistics, enabling credible simulations in social science and urban studies.

Details

Motivation: Researchers often only have macro-level data but need micro-level records for reliable simulations (e.g., epidemic models require individual mobility patterns that match aggregate statistics).

Method: Iteratively generates synthetic datasets using LLM as nonparametric copula; employs LLM Proposal Sampling to guide record generation targeting specific variable ranges to efficiently correct statistical discrepancies.

Result: Achieves strong realism, statistical fidelity, and practical utility across mobility, e-commerce, and population domains.

Conclusion: LLMSynthor provides a broadly applicable solution for generating macro-aligned micro-records in economics, social science, and urban studies.

Abstract: Macro-aligned micro-records are crucial for credible simulations in social science and urban studies. For example, epidemic models are only reliable when individual-level mobility and contacts mirror real behavior, while aggregates match real-world statistics like case counts or travel flows. However, collecting such fine-grained data at scale is impractical, leaving researchers with only macro-level data. LLMSynthor addresses this by turning a pretrained LLM into a macro-aware simulator that generates realistic micro-records consistent with target macro-statistics. It iteratively builds synthetic datasets: in each step, the LLM generates batches of records to minimize discrepancies between synthetic and target aggregates. Treating the LLM as a nonparametric copula allows the model to capture realistic joint dependencies among variables. To improve efficiency, LLM Proposal Sampling guides the LLM to propose targeted record batches, specifying variable ranges and counts, to efficiently correct discrepancies while preserving realism grounded in the model’s priors. Evaluations across domains (mobility, e-commerce, population) show that LLMSynthor achieves strong realism, statistical fidelity, and practical utility, making it broadly applicable to economics, social science, and urban studies.

[1086] Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator

Beier Luo, Shuoyuan Wang, Yixuan Li, Hongxin Wei

Main category: cs.LG

TL;DR: DACA is an unsupervised method that improves confidence calibration in post-trained language models by selectively using agreement examples between pre-trained and post-trained models during temperature scaling, avoiding over-confidence issues.

Details

Motivation: Post-trained language models often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which undermines reliability in critical applications. The main challenge is the scarcity of labeled data for individual downstream tasks.

Method: DACA (Disagreement-Aware Confidence Alignment) optimizes temperature parameters in post-hoc calibration by selectively using only agreement examples between PLM and PoLM, effectively decoupling the influence of disagreement examples that cause under-confidence issues.

Result: Extensive experiments show DACA improves average ECE of open-sourced and API-based LLMs (including GPT-4o) by up to 15.08% on common benchmarks.

Conclusion: DACA effectively addresses the over-confidence problem in post-trained language models through disagreement-aware calibration, significantly improving calibration performance without requiring labeled data.

Abstract: Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature $\tau$) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM while aligning their confidence via temperature scaling. Theoretically, the PLM’s confidence underestimates PoLM’s prediction accuracy on disagreement examples, causing a larger $\tau$ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large $\tau$ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-sourced and API-based LLMs (e.g. GPT-4o) by up to 15.08$%$ on common benchmarks.

[1087] Equivalent Linear Mappings of Large Language Models

James R. Golden

Main category: cs.LG

TL;DR: The paper shows that LLMs can be mapped to equivalent linear systems that reconstruct outputs with high precision, revealing low-dimensional semantic structures in next-token prediction.

Details

Motivation: To understand the computational mechanisms of LLMs beyond just interpreting hidden representations, by exposing how those representations are generated through linear transformations.

Method: Strategic gradient detachment to freeze input-dependent linear transforms, creating equivalent linear mappings that reconstruct model outputs with one linear operator per input token.

Result: Successfully demonstrated on Qwen 3, Gemma 3 and Llama 3 models (up to 14B parameters), showing LLMs operate in extremely low-dimensional subspaces with interpretable semantic concepts.

Conclusion: Despite global nonlinearity, LLMs can be interpreted through equivalent linear representations that reveal low-dimensional semantic structures in the prediction process.

Abstract: Despite significant progress in transformer interpretability, an understanding of the computational mechanisms of large language models (LLMs) remains a fundamental challenge. Many approaches interpret a network’s hidden representations but remain agnostic about how those representations are generated. We address this by mapping LLM inference for a given input sequence to an equivalent and interpretable linear system which reconstructs the predicted output embedding with relative error below $10^{-13}$ at double floating-point precision, requiring no additional model training. We exploit a property of transformers wherein every operation (gated activations, attention, and normalization) can be expressed as $A(x) \cdot x$, where $A(x)$ represents an input-dependent linear transform and $x$ preserves the linear pathway. To expose this linear structure, we strategically detach components of the gradient computation with respect to an input sequence, freezing the $A(x)$ terms at their values computed during inference, such that the Jacobian yields an equivalent linear mapping. This detached Jacobian of the model reconstructs the output with one linear operator per input token, which is shown for Qwen 3, Gemma 3 and Llama 3, up to Qwen 3 14B. These linear representations demonstrate that LLMs operate in extremely low-dimensional subspaces where the singular vectors can be decoded to interpretable semantic concepts. The computation for each intermediate output also has a linear equivalent, and we examine how the linear representations of individual layers and their attention and multilayer perceptron modules build predictions, and use these as steering operators to insert semantic concepts into unrelated text. Despite their global nonlinearity, LLMs can be interpreted through equivalent linear representations that reveal low-dimensional semantic structures in the next-token prediction process.

[1088] FRIREN: Beyond Trajectories – A Spectral Lens on Time

Qilin Wang

Main category: cs.LG

TL;DR: FRIREN is a novel LTSF model that focuses on geometric structure preservation rather than pointwise prediction, using Wasserstein-2 distance and spectral analysis for accurate long-horizon forecasting in chaotic systems.

Details

Motivation: Current LTSF models assume all data is pointwise predictable, but this fails for chaotic systems. The authors argue that geometric structure preservation is the right abstraction for dynamic-agnostic foundational models.

Method: FRIREN uses an augmented normalizing-flow block to embed data into normally distributed latent space, then generates W2-efficient optimal paths decomposed into rotation, scaling, inverse rotation, and translation. It provides spectral representations that function as finite Koopman operators.

Result: FRIREN achieves superior performance on chaotic systems: MSE 11.4 vs 27.3 on Lorenz-63, and MSE 0.0349 vs 4.3988 on Rossler, maintaining effective prediction for ~2.5 Lyapunov times. It’s also competitive on standard LTSF datasets like ETT and Weather.

Conclusion: By connecting modern generative flows with classical spectral analysis, FRIREN makes long-term forecasting both accurate and interpretable, setting a new benchmark for LTSF model design with geometry-preserving predictions independent of underlying dynamics.

Abstract: Long-term time-series forecasting (LTSF) models are often presented as general-purpose solutions that can be applied across domains, implicitly assuming that all data is pointwise predictable. Using chaotic systems such as Lorenz-63 as a case study, we argue that geometric structure - not pointwise prediction - is the right abstraction for a dynamic-agnostic foundational model. Minimizing the Wasserstein-2 distance (W2), which captures geometric changes, and providing a spectral view of dynamics are essential for long-horizon forecasting. Our model, FRIREN (Flow-inspired Representations via Interpretable Eigen-networks), implements an augmented normalizing-flow block that embeds data into a normally distributed latent representation. It then generates a W2-efficient optimal path that can be decomposed into rotation, scaling, inverse rotation, and translation. This architecture yields locally generated, geometry-preserving predictions that are independent of the underlying dynamics, and a global spectral representation that functions as a finite Koopman operator with a small modification. This enables practitioners to identify which modes grow, decay, or oscillate, both locally and system-wide. FRIREN achieves an MSE of 11.4, MAE of 1.6, and SWD of 0.96 on Lorenz-63 in a 336-in, 336-out, dt=0.01 setting, surpassing TimeMixer (MSE 27.3, MAE 2.8, SWD 2.1). The model maintains effective prediction for 274 out of 336 steps, approximately 2.5 Lyapunov times. On Rossler (96-in, 336-out), FRIREN achieves an MSE of 0.0349, MAE of 0.0953, and SWD of 0.0170, outperforming TimeMixer’s MSE of 4.3988, MAE of 0.886, and SWD of 3.2065. FRIREN is also competitive on standard LTSF datasets such as ETT and Weather. By connecting modern generative flows with classical spectral analysis, FRIREN makes long-term forecasting both accurate and interpretable, setting a new benchmark for LTSF model design.

[1089] Tversky Neural Networks: Psychologically Plausible Deep Learning with Differentiable Tversky Similarity

Moussa Koulako Bala Doumbouya, Dan Jurafsky, Christopher D. Manning

Main category: cs.LG

TL;DR: The paper introduces a differentiable parameterization of Tversky’s similarity model for deep learning, replacing standard linear projection layers with Tversky projection layers that better align with human psychological similarity perception.

Details

Motivation: Standard geometric similarity models in deep learning lack psychological plausibility, while Tversky's feature-based similarity model from psychology has not been used in deep learning due to challenges with discrete set operations.

Method: Developed a differentiable parameterization of Tversky’s similarity that is learnable through gradient descent, creating Tversky projection layers that can model non-linear functions like XOR.

Result: Tversky projection layers significantly improved performance: 24.7% relative accuracy improvement on NABirds image classification, 7.8% perplexity decrease on PTB with GPT-2, and 34.8% parameter count reduction.

Conclusion: The work provides a new paradigm for similarity modeling in deep learning that is psychologically plausible and interpretable, with Tversky projection layers offering better performance and interpretability than standard linear projections.

Abstract: Work in psychology has highlighted that the geometric model of similarity standard in deep learning is not psychologically plausible because its metric properties such as symmetry do not align with human perception of similarity. In contrast, Tversky (1977) proposed an axiomatic theory of similarity with psychological plausibility based on a representation of objects as sets of features, and their similarity as a function of their common and distinctive features. This model of similarity has not been used in deep learning before, in part because of the challenge of incorporating discrete set operations. In this paper, we develop a differentiable parameterization of Tversky’s similarity that is learnable through gradient descent, and derive basic neural network building blocks such as the Tversky projection layer, which unlike the linear projection layer can model non-linear functions such as XOR. Through experiments with image recognition and language modeling neural networks, we show that the Tversky projection layer is a beneficial replacement for the linear projection layer. For instance, on the NABirds image classification task, a frozen ResNet-50 adapted with a Tversky projection layer achieves a 24.7% relative accuracy improvement over the linear layer adapter baseline. With Tversky projection layers, GPT-2’s perplexity on PTB decreases by 7.8%, and its parameter count by 34.8%. Finally, we propose a unified interpretation of both types of projection layers as computing similarities of input stimuli to learned prototypes for which we also propose a novel visualization technique highlighting the interpretability of Tversky projection layers. Our work offers a new paradigm for thinking about the similarity model implicit in modern deep learning, and designing neural networks that are interpretable under an established theory of psychological similarity.

[1090] Evolving Machine Learning: A Survey

Ignacio Cabrera Martin, Subhaditya Mukherjee, Almas Baimagambetov, Joaquin Vanschoren, Nikolaos Polatidis

Main category: cs.LG

TL;DR: This survey provides a comprehensive analysis of Evolving Machine Learning (EML), addressing core challenges like data drift, concept drift, and catastrophic forgetting, while reviewing over 100 studies across various learning approaches.

Details

Motivation: Traditional ML models struggle with dynamic environments and real-time data streams, necessitating the development of EML for continuous learning and adaptation.

Method: Systematic review of over 100 studies, categorizing state-of-the-art methods across supervised, unsupervised, and semi-supervised approaches, with exploration of evaluation metrics and benchmark datasets.

Result: The survey maps the current EML landscape, compares effectiveness of techniques, and highlights the growing role of adaptive neural architectures, meta-learning, and ensemble strategies.

Conclusion: Identifies critical research gaps and opportunities, aiming to guide development of robust, ethical, and scalable EML systems for real-world deployment.

Abstract: In an era defined by rapid data evolution, traditional Machine Learning (ML) models often fall short in adapting to dynamic environments. Evolving Machine Learning (EML) has emerged as a critical paradigm, enabling continuous learning and adaptation in real-time data streams. This survey presents a comprehensive analysis of EML, focusing on five core challenges: data drift, concept drift, catastrophic forgetting, skewed learning, and network adaptation. We systematically review over 100 studies, categorizing state-of-the-art methods across supervised, unsupervised, and semi-supervised approaches. The survey explores diverse evaluation metrics, benchmark datasets, and real-world applications, offering a comparative lens on the effectiveness and limitations of current techniques. Additionally, we highlight the growing role of adaptive neural architectures, meta-learning, and ensemble strategies in addressing evolving data complexities. By synthesizing insights from recent literature, this work not only maps the current landscape of EML but also identifies critical gaps and opportunities for future research. Our findings aim to guide researchers and practitioners in developing robust, ethical, and scalable EML systems for real-world deployment.

[1091] Rolling Ball Optimizer: Learning by ironing out loss landscape wrinkles

Mohammed D. Belgoumri, Mohamed Reda Bouadjenek, Hakim Hacid, Imran Razzak, Sunil Aryal

Main category: cs.LG

TL;DR: The Rolling Ball Optimizer (RBO) is a new optimization method that simulates a rigid sphere rolling on the loss landscape, incorporating information from larger regions to overcome local geometry issues caused by noisy data and improve generalization.

Details

Motivation: Gradient-based optimization methods are vulnerable to noisy data because they rely on local geometry, which can lead to poor generalization. The large-scale geometry of loss landscapes is less data-specific and easier to optimize than fine-grained structure.

Method: RBO simulates the motion of a rigid sphere with finite radius rolling on the loss landscape, generalizing Gradient Descent. The radius hyperparameter controls the scale at which the algorithm interacts with the landscape, providing smoothing effects.

Result: Evaluation on MNIST and CIFAR-10/100 shows promising results in convergence speed, training accuracy, and generalization performance compared to SGD, SAM, and Entropy-SGD.

Conclusion: RBO effectively addresses the limitations of local optimization methods by incorporating larger-scale geometric information, leading to improved optimization dynamics and better generalization in neural network training.

Abstract: Training large neural networks (NNs) requires optimizing high-dimensional data-dependent loss functions. The optimization landscape of these functions is often highly complex and textured, even fractal-like, with many spurious local minima, ill-conditioned valleys, degenerate points, and saddle points. Complicating things further is the fact that these landscape characteristics are a function of the data, meaning that noise in the training data can propagate forward and give rise to unrepresentative small-scale geometry. This poses a difficulty for gradient-based optimization methods, which rely on local geometry to compute updates and are, therefore, vulnerable to being derailed by noisy data. In practice,this translates to a strong dependence of the optimization dynamics on the noise in the data, i.e., poor generalization performance. To remediate this problem, we propose a new optimization procedure: Rolling Ball Optimizer (RBO), that breaks this spatial locality by incorporating information from a larger region of the loss landscape in its updates. We achieve this by simulating the motion of a rigid sphere of finite radius rolling on the loss landscape, a straightforward generalization of Gradient Descent (GD) that simplifies into it in the infinitesimal limit. The radius serves as a hyperparameter that determines the scale at which RBO sees the loss landscape, allowing control over the granularity of its interaction therewith. We are motivated by the intuition that the large-scale geometry of the loss landscape is less data-specific than its fine-grained structure, and that it is easier to optimize. We support this intuition by proving that our algorithm has a smoothing effect on the loss function. Evaluation against SGD, SAM, and Entropy-SGD, on MNIST and CIFAR-10/100 demonstrates promising results in terms of convergence speed, training accuracy, and generalization performance.

[1092] Complexity-aware fine-tuning

Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev

Main category: cs.LG

TL;DR: Proposes an efficient fine-tuning method that uses reasoning only for complex data identified by entropy, achieving better performance than standard SFT and distillation approaches while using significantly less data.

Details

Motivation: To enhance LLM performance in specific domains more efficiently, avoiding the high costs of distillation approaches that require numerous expensive calls and large amounts of data.

Method: Split training data into complexity categories using single token answer entropy, then fine-tune LLMs via SFT and distillation only for complex data identified by this entropy-based classification (ROC AUC 0.73).

Result: The proposed pipeline achieved 0.58 average accuracy, outperforming standard SFT (0.45) and distillation (0.56) approaches while using 81% less data.

Conclusion: The entropy-based complexity categorization enables efficient fine-tuning that maintains high performance while dramatically reducing data requirements, making LLM fine-tuning more practical and cost-effective.

Abstract: General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. Better results can be achieved by distilling the chain-of-thought of a larger model at the cost of numerous expensive calls and a much greater amount of data. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy. Specifically, across two small open models ($~3B$) we split the training data into complexity categories by a single token answer entropy (ROC AUC $0.73$), fine-tune large language models (LLMs) via SFT and distillation, and show that our pipeline significantly outperforms the standard SFT approach ($0.58$ vs $0.45$ average accuracy) and outperforms the distillation approach ($0.58$ vs $0.56$ average accuracy) while using $81%$ less data.

[1093] LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, Chongxuan Li

Main category: cs.LG

TL;DR: VRPO is a variance-reduced preference optimization framework that improves alignment of Masked Diffusion Models (MDMs) like LLaDA with human preferences through theoretical analysis and practical variance reduction strategies.

Details

Motivation: There has been little effort in aligning Masked Diffusion Models with human preferences via reinforcement learning due to high variance in ELBO-based likelihood estimates required for preference optimization.

Method: Proposed VRPO framework that formally analyzes variance of ELBO estimators, derives bounds on bias and variance of preference optimization gradients, and introduces unbiased variance reduction strategies including optimal Monte Carlo budget allocation and antithetic sampling.

Result: Applied VRPO to LLaDA to create LLaDA 1.5, which significantly outperforms its predecessor across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard +4.3).

Conclusion: VRPO effectively addresses the variance challenge in MDM alignment and enables highly competitive performance compared to strong language MDMs and ARMs.

Abstract: While Masked Diffusion Models (MDMs), such as LLaDA, present a promising paradigm for language modeling, there has been relatively little effort in aligning these models with human preferences via reinforcement learning. The challenge primarily arises from the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and variance of preference optimization gradients. Building on this theoretical foundation, we introduce unbiased variance reduction strategies, including optimal Monte Carlo budget allocation and antithetic sampling, that significantly improve the performance of MDM alignment. We demonstrate the effectiveness of VRPO by applying it to LLaDA, and the resulting model, LLaDA 1.5, outperforms its SFT-only predecessor consistently and significantly across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard +4.3). Furthermore, LLaDA 1.5 demonstrates a highly competitive mathematical performance compared to strong language MDMs and ARMs. Project page: https://ml-gsai.github.io/LLaDA-1.5-Demo/.

[1094] STRAP: Spatio-Temporal Pattern Retrieval for Out-of-Distribution Generalization

Haoyu Zhang, Wentao Zhang, Hao Miao, Xinke Jiang, Yuchen Fang, Yifan Zhang

Main category: cs.LG

TL;DR: STRAP is a retrieval-augmented framework that enhances STGNN generalization in out-of-distribution scenarios by storing and retrieving spatio-temporal patterns during inference.

Details

Motivation: STGNNs fail to generalize in Spatio-Temporal Out-of-Distribution (STOOD) scenarios where both temporal dynamics and spatial structures evolve beyond training distribution.

Method: Proposes STRAP framework with a compact pattern library storing enriched spatio-temporal patterns. During inference, retrieves relevant patterns via similarity matching and injects them via plug-and-play prompting with knowledge-balancing objective.

Result: Extensive experiments show STRAP consistently outperforms state-of-the-art STGNN baselines on STOOD tasks, demonstrating robustness and strong generalization without task-specific fine-tuning.

Conclusion: STRAP effectively enhances STGNN generalization in out-of-distribution scenarios through retrieval-augmented pattern learning, mitigating catastrophic forgetting while maintaining adaptability.

Abstract: Spatio-Temporal Graph Neural Networks (STGNNs) have emerged as a powerful tool for modeling dynamic graph-structured data across diverse domains. However, they often fail to generalize in Spatio-Temporal Out-of-Distribution (STOOD) scenarios, where both temporal dynamics and spatial structures evolve beyond the training distribution. To address this problem, we propose an innovative Spatio-Temporal Retrieval-Augmented Pattern Learning framework,STRAP, which enhances model generalization by integrating retrieval-augmented learning into the STGNN continue learning pipeline. The core of STRAP is a compact and expressive pattern library that stores representative spatio-temporal patterns enriched with historical, structural, and semantic information, which is obtained and optimized during the training phase. During inference, STRAP retrieves relevant patterns from this library based on similarity to the current input and injects them into the model via a plug-and-play prompting mechanism. This not only strengthens spatio-temporal representations but also mitigates catastrophic forgetting. Moreover, STRAP introduces a knowledge-balancing objective to harmonize new information with retrieved knowledge. Extensive experiments across multiple real-world streaming graph datasets show that STRAP consistently outperforms state-of-the-art STGNN baselines on STOOD tasks, demonstrating its robustness, adaptability, and strong generalization capability without task-specific fine-tuning.

[1095] Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning

Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, Boris Yangel

Main category: cs.LG

TL;DR: RL applied to multi-turn interactive tasks like software engineering, achieving significant performance improvements on SWE-bench through a pipeline combining rejection fine-tuning and DAPO-based RL.

Details

Motivation: Most RL research on LLMs focuses on single-turn problems, but real-world domains like software engineering require multi-turn interactions with stateful environments that provide meaningful feedback.

Method: Two-stage pipeline: 1) Rejection fine-tuning (RFT) using execution feedback to train instruction following, 2) Synchronous RL pipeline using DAPO for iterative improvement.

Result: Increased Pass@1 on SWE-bench Verified from 11% to 39%, outperforming 20% RFT baseline. Achieved 35% and 31% on SWE-rebench splits, competitive with larger models like DeepSeek-V3-0324 and Qwen3-235B-A22B.

Conclusion: The methodology offers a practical approach for training capable agents for multi-turn interactive tasks using open-weight models, bridging the gap between single-turn RL applications and real-world interactive domains.

Abstract: Research on applications of reinforcement learning (RL) to large language models has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn Markov decision processes (MDPs), this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Our methodology begins with rejection fine-tuning (RFT) using execution feedback to train a policy to follow instructions and formatting effectively, followed by a synchronous RL pipeline using DAPO for iterative improvement. Applying this pipeline to Qwen2.5-72B-Instruct, we increase its Pass@1 on the SWE-bench Verified benchmark from 11% to 39%, substantially improving upon the 20% RFT baseline. On the May and June splits of SWE-rebench, the resulting agent achieves Pass@1 of 35% and 31% respectively, competitive with even larger models such as DeepSeek-V3-0324 or Qwen3-235B-A22B, demonstrating that our methodology offers a practical approach for training capable agents for multi-turn interactive tasks using open-weight models.

[1096] TabAttackBench: A Benchmark for Adversarial Attacks on Tabular Data

Zhipeng He, Chun Ouyang, Lijie Wen, Cong Liu, Catarina Moreira

Main category: cs.LG

TL;DR: A comprehensive benchmark evaluating adversarial attacks on tabular data, assessing 5 white-box attacks across 4 models using 11 datasets from finance, energy, and healthcare domains.

Details

Motivation: Adversarial attacks on tabular data remain underexplored due to mixed feature types and complex inter-feature dependencies, unlike well-studied attacks on unstructured data like images.

Method: Evaluates FGSM, BIM, PGD, DeepFool, and C&W attacks on LR, MLP, TabTransformer and FT-Transformer models using 11 datasets. Uses 4 imperceptibility metrics: proximity, sparsity, deviation, and sensitivity.

Result: ℓ∞-based attacks achieve higher success rates but lower subtlety, while ℓ2-based attacks offer more realistic perturbations. Quantifies trade-off between attack effectiveness and imperceptibility.

Conclusion: Provides actionable insights for designing more imperceptible adversarial attacks and advances understanding of adversarial vulnerability in tabular machine learning.

Abstract: Adversarial attacks pose a significant threat to machine learning models by inducing incorrect predictions through imperceptible perturbations to input data. While these attacks are well studied in unstructured domains such as images, their behaviour on tabular data remains underexplored due to mixed feature types and complex inter-feature dependencies. This study introduces a comprehensive benchmark that evaluates adversarial attacks on tabular datasets with respect to both effectiveness and imperceptibility. We assess five white-box attack algorithms (FGSM, BIM, PGD, DeepFool, and C&W) across four representative models (LR, MLP, TabTransformer and FT-Transformer) using eleven datasets spanning finance, energy, and healthcare domains. The benchmark employs four quantitative imperceptibility metrics (proximity, sparsity, deviation, and sensitivity) to characterise perturbation realism. The analysis quantifies the trade-off between these two aspects and reveals consistent differences between attack types, with $\ell_\infty$-based attacks achieving higher success but lower subtlety, and $\ell_2$-based attacks offering more realistic perturbations. The benchmark findings offer actionable insights for designing more imperceptible adversarial attacks, advancing the understanding of adversarial vulnerability in tabular machine learning.

[1097] Inclusive, Differentially Private Federated Learning for Clinical Data

Santhosh Parampottupadam, Melih Coşğun, Sarthak Pati, Maximilian Zenk, Saikat Roy, Dimitrios Bounias, Benjamin Hamm, Sinem Sav, Ralf Floca, Klaus Maier-Hein

Main category: cs.LG

TL;DR: A compliance-aware federated learning framework that adaptively adjusts differential privacy noise based on client compliance scores, improving accuracy by up to 15% over traditional FL while maintaining privacy in clinical settings.

Details

Motivation: Federated Learning faces adoption challenges in healthcare due to privacy concerns, resource constraints, and compliance issues. Uniform differential privacy noise disproportionately degrades model performance across institutions with varying compliance levels.

Method: Proposed a compliance-aware FL framework with adaptive differential privacy noise adjustment based on quantifiable client compliance scores. Also introduced a compliance scoring tool based on healthcare and security standards.

Result: Extensive experiments on public datasets showed that integrating under-resourced, less compliant clinics with highly regulated institutions yields accuracy improvements of up to 15% over traditional FL.

Conclusion: This work advances FL by balancing privacy, compliance, and performance, making it a viable solution for real-world clinical workflows in global healthcare.

Abstract: Federated Learning (FL) offers a promising approach for training clinical AI models without centralizing sensitive patient data. However, its real-world adoption is hindered by challenges related to privacy, resource constraints, and compliance. Existing Differential Privacy (DP) approaches often apply uniform noise, which disproportionately degrades model performance, even among well-compliant institutions. In this work, we propose a novel compliance-aware FL framework that enhances DP by adaptively adjusting noise based on quantifiable client compliance scores. Additionally, we introduce a compliance scoring tool based on key healthcare and security standards to promote secure, inclusive, and equitable participation across diverse clinical settings. Extensive experiments on public datasets demonstrate that integrating under-resourced, less compliant clinics with highly regulated institutions yields accuracy improvements of up to 15% over traditional FL. This work advances FL by balancing privacy, compliance, and performance, making it a viable solution for real-world clinical workflows in global healthcare.

[1098] Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan

Main category: cs.LG

TL;DR: Muddit is a unified discrete diffusion transformer that enables fast parallel generation across text and image modalities by integrating pretrained visual priors with a lightweight text decoder.

Details

Motivation: Autoregressive unified models have slow inference due to sequential decoding, while non-autoregressive models have weak generalization due to limited pretrained backbones. The authors aim to create a unified model that overcomes both limitations.

Method: Muddit uses a unified discrete diffusion transformer architecture that integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible multimodal generation under a unified framework.

Result: Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency, demonstrating the effectiveness of discrete diffusion with strong visual priors.

Conclusion: The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation across modalities.

Abstract: Unified generation models aim to handle diverse tasks across modalities – such as text generation, image generation, and vision-language reasoning – within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.

[1099] AMSbench: A Comprehensive Benchmark for Evaluating MLLM Capabilities in AMS Circuits

Yichen Shi, Ze Zhang, Hongyang Wang, Zhuofu Tao, Zhongyi Li, Bingyu Chen, Yaxin Wang, Zhen huang, Xuhua Liu, Quan Chen, Zhiping Yu, Ting-Jung Lin, Lei He

Main category: cs.LG

TL;DR: Introduces AMSbench, a comprehensive benchmark for evaluating Multi-modal Large Language Models (MLLMs) on Analog/Mixed-Signal circuit design tasks, revealing significant gaps in current MLLM capabilities.

Details

Motivation: Automating AMS circuit design remains challenging, and while MLLMs show promise, current research lacks systematic evaluation across diverse AMS-related tasks.

Method: Developed AMSbench with ~8000 test questions across multiple difficulty levels, evaluating 8 prominent MLLMs on circuit schematic perception, analysis, and design tasks.

Result: Current MLLMs show significant limitations in complex multi-modal reasoning and sophisticated circuit design tasks, with performance gaps relative to human expertise.

Conclusion: Advancing MLLMs’ understanding and application of circuit-specific knowledge is necessary to achieve fully automated AMS circuit design workflows.

Abstract: Analog/Mixed-Signal (AMS) circuits play a critical role in the integrated circuit (IC) industry. However, automating Analog/Mixed-Signal (AMS) circuit design has remained a longstanding challenge due to its difficulty and complexity. Although recent advances in Multi-modal Large Language Models (MLLMs) offer promising potential for supporting AMS circuit analysis and design, current research typically evaluates MLLMs on isolated tasks within the domain, lacking a comprehensive benchmark that systematically assesses model capabilities across diverse AMS-related challenges. To address this gap, we introduce AMSbench, a benchmark suite designed to evaluate MLLM performance across critical tasks including circuit schematic perception, circuit analysis, and circuit design. AMSbench comprises approximately 8000 test questions spanning multiple difficulty levels and assesses eight prominent models, encompassing both open-source and proprietary solutions such as Qwen 2.5-VL and Gemini 2.5 Pro. Our evaluation highlights significant limitations in current MLLMs, particularly in complex multi-modal reasoning and sophisticated circuit design tasks. These results underscore the necessity of advancing MLLMs’ understanding and effective application of circuit-specific knowledge, thereby narrowing the existing performance gap relative to human expertise and moving toward fully automated AMS circuit design workflows. Our data is released at this URL.

[1100] Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL

Hanyi Mao, Quanjia Xiao, Lei Pang, Haixiao Liu

Main category: cs.LG

TL;DR: FSPO is a sequence-level RL method for LLMs that introduces length-fair clipping on importance-sampling weights to address systematic length bias in PPO/GRPO-style methods.

Details

Motivation: There's a mismatch when PPO/GRPO-style clipping is applied to sequences - fixed clip ranges systematically reweight short vs. long responses, distorting optimization direction and creating length bias.

Method: FSPO clips the sequence log-importance-sampling ratio with a band that scales as √L (square root of length), formalizing length fairness via Length Reweighting Error (LRE).

Result: FSPO flattens clip rates across length bins, stabilizes training, and outperforms baselines across model sizes and evaluation datasets, with largest gains on Qwen3-8B-Base model.

Conclusion: Length-fair clipping through FSPO effectively addresses systematic length bias in sequence-level RL for LLMs, providing better optimization direction and improved performance.

Abstract: We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping on the importance-sampling (IS) weight. We study RL methods with sequence-level IS and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs. long responses, distorting the optimization direction. FSPO introduces a simple remedy: we clip the sequence log-IS ratio with a band that scales as $\sqrt{L}$. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a cosine directional guarantee between the clipped and true updates. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms baselines across model sizes and evaluation datasets, with the largest gains on the Qwen3-8B-Base model.

[1101] QiMeng-CodeV-R1: Reasoning-Enhanced Verilog Generation

Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen

Main category: cs.LG

TL;DR: CodeV-R1 is an RLVR framework that trains LLMs to generate Verilog code from natural language specifications, achieving state-of-the-art performance through automated verification, data synthesis, and efficient training methods.

Details

Motivation: Extend RLVR to EDA for Verilog generation from natural language, addressing challenges of automated verification, data scarcity, and high computation costs.

Method: Developed rule-based testbench generator for equivalence checking, round-trip data synthesis for high-quality dataset, and two-stage ‘distill-then-RL’ training with adaptive DAPO algorithm.

Result: CodeV-R1-7B achieves 68.6% pass@1 on VerilogEval v2 and 72.9% on RTLLM v1.1, surpassing prior SOTA by 12-20% and even exceeding 671B DeepSeek-R1 on RTLLM.

Conclusion: The framework successfully addresses key challenges in Verilog generation and sets new benchmarks, with released model, code, and dataset to advance EDA and LLM research.

Abstract: Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computation cost of RLVR. To this end, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage “distill-then-RL” training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that can reduce training cost by adaptively adjusting sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing prior state-of-the-art by 12~20%, while even exceeding the performance of 671B DeepSeek-R1 on RTLLM. We have released our model, training code, and dataset to facilitate research in EDA and LLM communities.

[1102] State-Covering Trajectory Stitching for Diffusion Planners

Kyowoon Lee, Jaesik Choi

Main category: cs.LG

TL;DR: SCoTS is a reward-free trajectory augmentation method that stitches short trajectory segments to generate diverse extended trajectories, improving diffusion planners’ performance and generalization in offline goal-conditioned RL.

Details

Motivation: Diffusion-based generative models for RL planning are limited by training data quality and diversity, restricting their generalization to out-of-distribution tasks and longer planning horizons.

Method: SCoTS learns temporal distance-preserving latent representations, then iteratively stitches trajectory segments using directional exploration and novelty to systematically expand latent space coverage.

Result: SCoTS significantly improves diffusion planners’ performance and generalization on offline goal-conditioned benchmarks requiring stitching and long-horizon reasoning, and enhances offline goal-conditioned RL algorithms across diverse environments.

Conclusion: SCoTS effectively addresses data limitations in diffusion-based planning by generating diverse extended trajectories through systematic trajectory stitching in latent space.

Abstract: Diffusion-based generative models are emerging as powerful tools for long-horizon planning in reinforcement learning (RL), particularly with offline datasets. However, their performance is fundamentally limited by the quality and diversity of training data. This often restricts their generalization to tasks outside their training distribution or longer planning horizons. To overcome this challenge, we propose State-Covering Trajectory Stitching (SCoTS), a novel reward-free trajectory augmentation method that incrementally stitches together short trajectory segments, systematically generating diverse and extended trajectories. SCoTS first learns a temporal distance-preserving latent representation that captures the underlying temporal structure of the environment, then iteratively stitches trajectory segments guided by directional exploration and novelty to effectively cover and expand this latent space. We demonstrate that SCoTS significantly improves the performance and generalization capabilities of diffusion planners on offline goal-conditioned benchmarks requiring stitching and long-horizon reasoning. Furthermore, augmented trajectories generated by SCoTS significantly improve the performance of widely used offline goal-conditioned RL algorithms across diverse environments.

[1103] IIET: Efficient Numerical Transformer via Implicit Iterative Euler Method

Xinyu Liu, Bei Li, Jiahao Liu, Junhao Ruan, Kechen Jiao, Hongyin Tang, Jingang Wang, Xiao Tong, Jingbo Zhu

Main category: cs.LG

TL;DR: IIET improves Transformer performance using iterative implicit Euler methods, achieving better accuracy than vanilla Transformers and PCformer while enabling efficient variants through novel distillation.

Details

Motivation: High-order numerical methods improve Transformer performance but create efficiency trade-offs, and conventional efficiency techniques like distillation can harm performance in models like PCformer.

Method: Propose Iterative Implicit Euler Transformer (IIET) using iterative implicit Euler approach, and Iteration Influence-Aware Distillation (IIAD) with flexible threshold for balancing performance-efficiency trade-off.

Result: IIET boosts average accuracy by 2.65% over vanilla Transformers and 0.8% over PCformer. E-IIET variant reduces inference overhead by 55% while retaining 99.4% accuracy. Most efficient variant achieves >1.6% gain over vanilla Transformer with comparable speed.

Conclusion: IIET provides superior performance over existing methods while enabling efficient variants through IIAD distillation, effectively balancing the performance-efficiency trade-off in high-order Transformer architectures.

Abstract: High-order numerical methods enhance Transformer performance in tasks like NLP and CV, but introduce a performance-efficiency trade-off due to increased computational overhead. Our analysis reveals that conventional efficiency techniques, such as distillation, can be detrimental to the performance of these models, exemplified by PCformer. To explore more optimizable ODE-based Transformer architectures, we propose the Iterative Implicit Euler Transformer (IIET), which simplifies high-order methods using an iterative implicit Euler approach. This simplification not only leads to superior performance but also facilitates model compression compared to PCformer. To enhance inference efficiency, we introduce Iteration Influence-Aware Distillation (IIAD). Through a flexible threshold, IIAD allows users to effectively balance the performance-efficiency trade-off. On lm-evaluation-harness, IIET boosts average accuracy by 2.65% over vanilla Transformers and 0.8% over PCformer. Its efficient variant, E-IIET, significantly cuts inference overhead by 55% while retaining 99.4% of the original task accuracy. Moreover, the most efficient IIET variant achieves an average performance gain exceeding 1.6% over vanilla Transformer with comparable speed.

[1104] Invariance Makes LLM Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning

Changsheng Wang, Yihua Zhang, Jinghan Jia, Parikshit Ram, Dennis Wei, Yuguang Yao, Soumyadeep Pal, Nathalie Baracaldo, Sijia Liu

Main category: cs.LG

TL;DR: ILU introduces invariant risk minimization to machine unlearning for LLMs, making unlearning robust against downstream fine-tuning that could recover forgotten information.

Details

Motivation: Current machine unlearning methods for LLMs are sensitive to downstream fine-tuning, which can quickly recover forgotten information even from unrelated tasks, posing privacy and safety concerns.

Method: Proposed invariant LLM unlearning (ILU), a regularization-based framework that incorporates invariance inspired by invariant risk minimization (IRM) to enhance robustness against downstream fine-tuning.

Result: ILU significantly outperforms state-of-the-art unlearning methods (NPO and RMU) on WMDP and MUSE benchmarks, achieving superior unlearning robustness across diverse downstream fine-tuning scenarios while preserving fine-tuning performance.

Conclusion: ILU successfully addresses the vulnerability of current unlearning methods to downstream fine-tuning by introducing invariance, providing a more robust solution for selective knowledge removal in LLMs.

Abstract: Machine unlearning offers a promising solution to privacy and safety concerns in large language models (LLMs) by selectively removing targeted knowledge while preserving utility. However, current methods are highly sensitive to downstream fine-tuning, which can quickly recover forgotten information-even from unrelated tasks. To address this, we introduce invariance into unlearning for the first time, inspired by invariant risk minimization (IRM). Building on this principle, we propose invariant LLM unlearning (ILU), a regularization-based framework that enhances robustness. Notably, ILU generalizes well to diverse fine-tuning tasks, even when trained using a single dataset. A task vector analysis is also provided to further elucidate the rationale behind ILU’s effectiveness. Extensive experiments on the WMDP and MUSE benchmark, reveal that ILU significantly outperforms state-of-the-art unlearning methods, including negative preference optimization (NPO) and representation misdirection for unlearning (RMU). Notably, ILU achieves superior unlearning robustness across diverse downstream fine-tuning scenarios (e.g., math, paraphrase detection, and sentiment analysis) while preserving the fine-tuning performance.

[1105] Towards Unsupervised Training of Matching-based Graph Edit Distance Solver via Preference-aware GAN

Wei Huang, Hanchen Wang, Dong Wen, Shaozhen Ma, Wenjie Zhang, Xuemin Lin

Main category: cs.LG

TL;DR: GEDRanker is an unsupervised GAN-based framework for Graph Edit Distance computation that eliminates the need for costly ground-truth node matching supervision by using a preference-aware discriminator.

Details

Motivation: Traditional GED methods rely heavily on ground-truth node matchings which are expensive to obtain in real-world scenarios, creating a need for unsupervised approaches.

Method: Proposes GEDRanker with a matching-based GED solver and interpretable preference-aware discriminator that uses preference signals from edit path lengths to guide node matching without ground-truth supervision.

Result: Extensive experiments show GEDRanker enables matching-based GED solvers to achieve near-optimal solution quality without any ground-truth supervision.

Conclusion: GEDRanker successfully addresses the supervision dependency problem in GED computation through an unsupervised GAN-based approach with preference-aware guidance.

Abstract: Graph Edit Distance (GED) is a fundamental graph similarity metric widely used in various applications. However, computing GED is an NP-hard problem. Recent state-of-the-art hybrid GED solver has shown promising performance by formulating GED as a bipartite graph matching problem, then leveraging a generative diffusion model to predict node matching between two graphs, from which both the GED and its corresponding edit path can be extracted using a traditional algorithm. However, such methods typically rely heavily on ground-truth supervision, where the ground-truth node matchings are often costly to obtain in real-world scenarios. In this paper, we propose GEDRanker, a novel unsupervised GAN-based framework for GED computation. Specifically, GEDRanker consists of a matching-based GED solver and introduces an interpretable preference-aware discriminator. By leveraging preference signals over different node matchings derived from edit path lengths, the discriminator can guide the matching-based solver toward generating high-quality node matching without the need for ground-truth supervision. Extensive experiments on benchmark datasets demonstrate that our GEDRanker enables the matching-based GED solver to achieve near-optimal solution quality without any ground-truth supervision.

[1106] MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation

Wei Shen, Zhang Yaxiang, Minhui Huang, Mengfan Xu, Jiawei Zhang, Cong Shen

Main category: cs.LG

TL;DR: MLorc is a memory-efficient training method that compresses momentum of matrix parameters during LLM fine-tuning, achieving performance comparable to full fine-tuning while reducing memory usage.

Details

Motivation: Full-parameter fine-tuning of large language models requires substantial memory, creating a need for more memory-efficient training methods that don't compromise performance.

Method: Compresses and reconstructs momentum of matrix parameters during training, avoiding fixed-rank constraints like LoRA and directly compressing momentum rather than gradients like GaLore.

Result: Outperforms other memory-efficient methods, matches or exceeds full fine-tuning performance at small ranks (e.g., r=4), and generalizes well across different optimizers without compromising time or memory efficiency.

Conclusion: MLorc provides an effective memory-efficient training paradigm with theoretical convergence guarantees that maintains the training dynamics of full-parameter fine-tuning while significantly reducing memory consumption.

Abstract: With increasing size of large language models (LLMs), full-parameter fine-tuning imposes substantial memory demands. To alleviate this, we propose a novel memory-efficient training paradigm called Momentum Low-rank compression (MLorc). The key idea of MLorc is to compress and reconstruct the momentum of matrix parameters during training to reduce memory consumption. Compared to LoRA, MLorc avoids enforcing a fixed-rank constraint on weight update matrices and thus enables full-parameter learning. Compared to GaLore, MLorc directly compress the momentum rather than gradients, thereby better preserving the training dynamics of full-parameter fine-tuning. We provide a theoretical guarantee for its convergence under mild assumptions. Empirically, MLorc consistently outperforms other memory-efficient training methods, matches or even exceeds the performance of full fine-tuning at small ranks (e.g., $r=4$), and generalizes well across different optimizers – all while not compromising time or memory efficiency.

[1107] Bridging Neural ODE and ResNet: A Formal Error Bound for Safety Verification

Abdelrahman Sayed Sayed, Pierre-Jean Meyer, Mohamed Ghazel

Main category: cs.LG

TL;DR: This paper establishes formal error bounds between neural ODEs and ResNets, enabling cross-model safety verification where verifying one model with error bounds guarantees safety in the other.

Details

Motivation: To formalize the relationship between neural ODEs and ResNets and enable efficient safety verification across both models without redundant verification.

Method: Developed approximation error bounds between neural ODEs and their corresponding ResNet discretizations, allowing one model to serve as a verification proxy for the other.

Result: Obtained formal error bounds that enable safety verification on one model to guarantee safety on the other when accounting for the approximation error.

Conclusion: The established error bounds provide a practical framework for cross-model safety verification, demonstrated on a neural ODE fixed-point attractor system.

Abstract: A neural ordinary differential equation (neural ODE) is a machine learning model that is commonly described as a continuous-depth generalization of a residual network (ResNet) with a single residual block, or conversely, the ResNet can be seen as the Euler discretization of the neural ODE. These two models are therefore strongly related in a way that the behaviors of either model are considered to be an approximation of the behaviors of the other. In this work, we establish a more formal relationship between these two models by bounding the approximation error between two such related models. The obtained error bound then allows us to use one of the models as a verification proxy for the other, without running the verification tools twice: if the reachable output set expanded by the error bound satisfies a safety property on one of the models, this safety property is then guaranteed to be also satisfied on the other model. This feature is fully reversible, and the initial safety verification can be run indifferently on either of the two models. This novel approach is illustrated on a numerical example of a fixed-point attractor system modeled as a neural ODE.

[1108] Understanding the Impact of Sampling Quality in Direct Preference Optimization

Kyung Rok Kim, Yumo Bai, Chonghuan Wang, Guanting Chen

Main category: cs.LG

TL;DR: Higher quality data improves DPO performance by enhancing gradient signals and optimization landscape, with theoretical support for online DPO framework.

Details

Motivation: To understand how data quality affects Direct Preference Optimization training dynamics and leverage this to improve performance.

Method: Analyzed DPO training dynamics, designed simplified alignment model to avoid likelihood displacement, developed quantitative results on gradient signal amplification.

Result: Higher quality responses amplify gradient signals and improve optimization landscape, leading to more effective policy learning.

Conclusion: Theoretical findings support online DPO framework and show that data quality significantly impacts DPO convergence and solution space.

Abstract: We study how data of higher quality can be leveraged to improve performance in Direct Preference Optimization (DPO), aiming to understand its impact on DPO training dynamics. Our analyses show that both the solution space and the convergence behavior of DPO depend on the support and quality of the data-generating distribution. We first analyze how data and reference policy influence policy updates during gradient descent, and how a practical phenomenon known as likelihood displacement can interfere with the desired dynamics. We then design a simplified yet well-structured alignment model as a proxy that preserves most of the beneficial properties of RLHF while avoiding likelihood displacement. Based on this model, we develop quantitative results showing how more frequent high-quality responses amplify the gradient signal and improve the optimization landscape, leading to more effective policy learning. Our theoretical findings are supported by empirical experiments and provide a principled justification for the online DPO framework in practice.

[1109] Learning to Reason as Action Abstractions with Scalable Mid-Training RL

Shenao Zhang, Donghan Yu, Yihao Feng, Bowen Jin, Zhaoran Wang, John Peebles, Zirui Wang

Main category: cs.LG

TL;DR: Mid-training shapes post-training by identifying compact action subspaces that minimize value approximation and RL errors, with effectiveness determined by pruning efficiency and RL convergence.

Details

Motivation: To fully unlock large language models' potential with reinforcement learning by developing an effective mid-training phase that identifies useful action sets and enables fast online RL selection.

Method: Proposed Reasoning as Action Abstractions (RA3) - a scalable mid-training algorithm that optimizes a sequential variational lower bound by iteratively discovering temporally-consistent latent structures via RL, followed by fine-tuning on bootstrapped data.

Result: RA3 improves average performance on HumanEval and MBPP by 8 and 4 points over base models and next-token prediction baselines, achieving faster convergence and higher asymptotic performance in RLVR on multiple code generation benchmarks.

Conclusion: Mid-training is most effective when operating in compact action abstraction spaces rather than primitive actions, with RA3 demonstrating significant improvements in code generation tasks across multiple models and benchmarks.

Abstract: Large language models excel with reinforcement learning (RL), but fully unlocking this potential requires a mid-training stage. An effective mid-training phase should identify a compact set of useful actions and enable fast selection among them through online RL. We formalize this intuition by presenting the first theoretical result on how mid-training shapes post-training: it characterizes an action subspace that minimizes both the value approximation error from pruning and the RL error during subsequent planning. Our analysis reveals two key determinants of mid-training effectiveness: pruning efficiency, which shapes the prior of the initial RL policy, and its impact on RL convergence, which governs the extent to which that policy can be improved via online interactions. These results suggest that mid-training is most effective when the decision space is compact and the effective horizon is short, highlighting the importance of operating in the space of action abstractions rather than primitive actions. Building on these insights, we propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm. Specifically, we derive a sequential variational lower bound and optimize it by iteratively discovering temporally-consistent latent structures via RL, followed by fine-tuning on the bootstrapped data. Experiments on code generation tasks demonstrate the effectiveness of our approach. Across multiple base models, RA3 improves the average performance on HumanEval and MBPP by 8 and 4 points over the base model and the next-token prediction baseline. Furthermore, RA3 achieves faster convergence and higher asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.

[1110] Superior Molecular Representations from Intermediate Encoder Layers

Luis Pinto

Main category: cs.LG

TL;DR: Intermediate layers in molecular encoders contain more general-purpose features than final layers, and using optimal intermediate layers for downstream tasks significantly improves performance by 5.4-8.5% on average.

Details

Motivation: Standard practice of using only final-layer embeddings for molecular property prediction may discard valuable information from intermediate layers.

Method: Analyzed information flow in 5 molecular encoders, performed layer-wise evaluation across 22 property prediction tasks using both frozen embeddings and finetuned truncated encoders.

Result: Using frozen embeddings from optimal intermediate layers improved performance by 5.4% average (up to 28.6%). Finetuning truncated encoders achieved 8.5% average improvement (up to 40.8%), setting new SOTA on several benchmarks.

Conclusion: Exploring full representational depth of molecular encoders is crucial for substantial performance improvements and computational efficiency.

Abstract: Pretrained molecular encoders have become indispensable in computational chemistry for tasks such as property prediction and molecular generation. However, the standard practice of relying solely on final-layer embeddings for downstream tasks may discard valuable information. In this work, we first analyze the information flow in five diverse molecular encoders and find that intermediate layers retain more general-purpose features, whereas the final-layer specializes and compresses information. We then perform an empirical layer-wise evaluation across 22 property prediction tasks. We find that using frozen embeddings from optimal intermediate layers improves downstream performance by an average of 5.4%, up to 28.6%, compared to the final-layer. Furthermore, finetuning encoders truncated at intermediate depths achieves even greater average improvements of 8.5%, with increases as high as 40.8%, obtaining new state-of-the-art results on several benchmarks. These findings highlight the importance of exploring the full representational depth of molecular encoders to achieve substantial performance improvements and computational efficiency. The code will be made publicly available.

[1111] Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning

Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis

Main category: cs.LG

TL;DR: THR is a token-level metric that quantifies each token’s influence on correct responses under GRPO. Tokens with positive THR favor exploitation while negative THR enables exploration. A THR-guided reweighting algorithm can bias training toward either strategy.

Details

Motivation: Current reinforcement learning with verifiable rewards advances reasoning in LLMs, but lacks explicit control over exploration vs exploitation dynamics during training.

Method: Introduce Token Hidden Reward (THR) metric to quantify token influence, then develop THR-guided reweighting algorithm that modulates GRPO’s learning signals to bias training toward exploration or exploitation.

Result: Amplifying positive THR tokens improves greedy-decoding accuracy (exploitation), while amplifying negative THR tokens improves Pass@K accuracy (exploration). Algorithm works with other RL objectives like GSPO and generalizes across architectures including Llama.

Conclusion: THR provides a principled mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, enabling targeted fine-tuning for reasoning-intensive applications.

Abstract: Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token’s influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO’s learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR value and weakening negative ones, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.

[1112] Monotone and Conservative Policy Iteration Beyond the Tabular Case

S. R. Eshwar, Gugan Thoppe, Ananyabrata Barua, Aditya Gopalan, Gal Dalal

Main category: cs.LG

TL;DR: RPI and CRPI are new policy iteration variants that maintain tabular guarantees under function approximation, addressing divergence issues in popular RL algorithms like TRPO and PPO.

Details

Motivation: Address the foundational gap where popular RL algorithms (TRPO, PPO) derived from tabular CPI lose guarantees when deployed with function approximation, leading to divergence, oscillations, or suboptimal convergence.

Method: RPI uses Bellman-constrained optimization for policy evaluation, while CRPI adds conservative policy updates by maximizing a performance-difference lower bound that accounts for function-approximation errors.

Result: RPI restores monotonicity of value estimates and provides provable lower bounds on true returns. CRPI inherits these guarantees and admits per-step improvement bounds. In simulations, both outperform PI and its variants.

Conclusion: RPI and CRPI provide a principled basis for next-generation RL by restoring PI/CPI-style guarantees for arbitrary function classes, addressing fundamental limitations of current function approximation methods.

Abstract: We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI), that retain tabular guarantees under function approximation. RPI uses a novel Bellman-constrained optimization for policy evaluation. We show that RPI restores the textbook \textit{monotonicity} of value estimates and that these estimates provably \textit{lower-bound} the true return; moreover, their limit partially satisfies the \textit{unprojected} Bellman equation. CRPI shares RPI’s evaluation, but updates policies conservatively by maximizing a new performance-difference \textit{lower bound} that explicitly accounts for function-approximation-induced errors. CRPI inherits RPI’s guarantees and, crucially, admits per-step improvement bounds. In initial simulations, RPI and CRPI outperform PI and its variants. Our work addresses a foundational gap in RL: popular algorithms such as TRPO and PPO derive from tabular CPI yet are deployed with function approximation, where CPI’s guarantees often fail-leading to divergence, oscillations, or convergence to suboptimal policies. By restoring PI/CPI-style guarantees for \textit{arbitrary} function classes, RPI and CRPI provide a principled basis for next-generation RL.

[1113] LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning

Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, Lianhui Qin

Main category: cs.LG

TL;DR: LaDiR is a novel reasoning framework that combines latent diffusion models with LLMs to enable iterative refinement of reasoning steps, improving accuracy and diversity over traditional autoregressive methods.

Details

Motivation: LLMs using chain-of-thought generation are limited by autoregressive decoding, which prevents holistic refinement of earlier reasoning steps and limits exploration of diverse solutions.

Method: Uses a VAE to encode reasoning steps into structured latent thought tokens, then applies a latent diffusion model with blockwise bidirectional attention to enable iterative refinement and parallel generation of diverse reasoning trajectories.

Result: Empirical evaluations on mathematical reasoning and planning benchmarks show consistent improvements in accuracy, diversity, and interpretability over existing methods.

Conclusion: LaDiR represents a new paradigm for text reasoning using latent diffusion, offering better performance and more flexible reasoning capabilities.

Abstract: Large Language Models (LLMs) demonstrate their reasoning ability through chain-of-thought (CoT) generation. However, LLM’s autoregressive decoding may limit the ability to revisit and refine earlier tokens in a holistic manner, which can also lead to inefficient exploration for diverse solutions. In this paper, we propose LaDiR (Latent Diffusion Reasoner), a novel reasoning framework that unifies the expressiveness of continuous latent representation with the iterative refinement capabilities of latent diffusion models for an existing LLM. We first construct a structured latent reasoning space using a Variational Autoencoder (VAE) that encodes text reasoning steps into blocks of thought tokens, preserving semantic information and interpretability while offering compact but expressive representations. Subsequently, we utilize a latent diffusion model that learns to denoise a block of latent thought tokens with a blockwise bidirectional attention mask, enabling longer horizon and iterative refinement with adaptive test-time compute. This design allows efficient parallel generation of diverse reasoning trajectories, allowing the model to plan and revise the reasoning process holistically. We conduct evaluations on a suite of mathematical reasoning and planning benchmarks. Empirical results show that LaDiR consistently improves accuracy, diversity, and interpretability over existing autoregressive, diffusion-based, and latent reasoning methods, revealing a new paradigm for text reasoning with latent diffusion.

[1114] Wavelet Scattering Transform and Fourier Representation for Offline Detection of Malicious Clients in Federated Learning

Alessandro Licciardi, Davide Leo, Davide Carbone

Main category: cs.LG

TL;DR: WAFFLE detects malicious clients in Federated Learning before training using compressed representations from Wavelet or Fourier transforms, improving detection accuracy and model performance.

Details

Motivation: Anomalous or corrupted clients in Federated Learning can degrade model performance, but detecting them without accessing raw data is challenging.

Method: Uses Wavelet Scattering Transform or Fourier Transform to create low-dimensional embeddings, with a lightweight detector trained on public dataset for client labeling.

Result: Improves detection accuracy and downstream classification performance compared to existing FL anomaly detection algorithms.

Conclusion: WAFFLE provides effective pre-training alternative to online detection strategies, with WST offering theoretical advantages for federated scenarios.

Abstract: Federated Learning (FL) enables the training of machine learning models across decentralized clients while preserving data privacy. However, the presence of anomalous or corrupted clients - such as those with faulty sensors or non representative data distributions - can significantly degrade model performance. Detecting such clients without accessing raw data remains a key challenge. We propose WAFFLE (Wavelet and Fourier representations for Federated Learning) a detection algorithm that labels malicious clients {\it before training}, using locally computed compressed representations derived from either the Wavelet Scattering Transform (WST) or the Fourier Transform. Both approaches provide low-dimensional, task-agnostic embeddings suitable for unsupervised client separation. A lightweight detector, trained on a distillated public dataset, performs the labeling with minimal communication and computational overhead. While both transforms enable effective detection, WST offers theoretical advantages, such as non-invertibility and stability to local deformations, that make it particularly well-suited to federated scenarios. Experiments on benchmark datasets show that our method improves detection accuracy and downstream classification performance compared to existing FL anomaly detection algorithms, validating its effectiveness as a pre-training alternative to online detection strategies.

[1115] Load Balancing Mixture of Experts with Similarity Preserving Routers

Nabil Omi, Siddhartha Sen, Ali Farhadi

Main category: cs.LG

TL;DR: A novel load balancing loss for Sparse Mixture of Experts models that preserves token-wise relational structure, enabling 36% faster convergence and lower redundancy compared to existing methods.

Details

Motivation: Current load balancing mechanisms in Sparse MoE models encourage uniform expert distribution but cause inconsistent routing behavior, leading to redundant knowledge learning and limited model capacity utilization.

Method: Introduces a novel load balancing loss that preserves token-wise relational structure, encouraging consistent expert choices for similar inputs during training.

Result: 36% faster convergence and lower redundancy compared to popular load balancing loss methods.

Conclusion: The proposed relational structure-preserving load balancing loss effectively addresses inconsistent routing in Sparse MoE models, improving convergence speed and reducing redundancy.

Abstract: Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks by activating only a subset of parameters (“experts”) for each input. A learned router computes a distribution over these experts, and assigns input tokens to a small subset. However, without auxiliary balancing mechanisms, routers often converge to using only a few experts, severely limiting model capacity and degrading performance. Most current load balancing mechanisms encourage a distribution over experts that resembles a roughly uniform distribution of experts per token. During training, this can result in inconsistent routing behavior, resulting in the model spending its capacity to learn redundant knowledge. We address this by introducing a novel load balancing loss that preserves token-wise relational structure, encouraging consistent expert choices for similar inputs during training. Our experimental results show that applying our loss to the router results in 36% faster convergence and lower redundancy compared to a popular load balancing loss.

[1116] Ignition Phase : Standard Training for Fast Adversarial Robustness

Wang Yu-Hang, Liu ying, Fang liang, Wang Xuelin, Junkang Guo, Shiwei Li, Lei Gao, Jian Liu, Wenfei Yin

Main category: cs.LG

TL;DR: Adversarial Evolution Training (AET) prepends an ERM phase to standard adversarial training, achieving faster and more efficient robustness with improved clean accuracy and reduced training costs.

Details

Motivation: Existing adversarial training variants focus too much on stronger attacks while overlooking foundational feature representations. AET aims to cultivate better feature manifolds for more efficient robustness acquisition.

Method: AET strategically adds an initial Empirical Risk Minimization (ERM) phase before conventional adversarial training to pre-condition features, enabling more effective robustness learning.

Result: AET achieves comparable or superior robustness more rapidly, improves clean accuracy, and reduces training costs by 8-25% across multiple datasets and architectures.

Conclusion: Feature pre-conditioning through standard training is crucial for developing more efficient and principled robust defenses, as demonstrated by AET’s effectiveness.

Abstract: Adversarial Training (AT) is a cornerstone defense, but many variants overlook foundational feature representations by primarily focusing on stronger attack generation. We introduce Adversarial Evolution Training (AET), a simple yet powerful framework that strategically prepends an Empirical Risk Minimization (ERM) phase to conventional AT. We hypothesize this initial ERM phase cultivates a favorable feature manifold, enabling more efficient and effective robustness acquisition. Empirically, AET achieves comparable or superior robustness more rapidly, improves clean accuracy, and cuts training costs by 8-25%. Its effectiveness is shown across multiple datasets, architectures, and when augmenting established AT methods. Our findings underscore the impact of feature pre-conditioning via standard training for developing more efficient, principled robust defenses. Code is available in the supplementary material.

[1117] Online Selective Generation with Adversarial Bandit Feedback

Minjae Lee, Yoonjae Jung, Sangdon Park

Main category: cs.LG

TL;DR: Proposes an online learning algorithm for selective generation that controls false discovery rates (FDR) using adversarial bandit methods with partial feedback.

Details

Motivation: Address the problem of hallucination in large language models by enabling selective abstention from answering when uncertain, particularly in adversarial environments with partial user feedback.

Method: Repurposes adversarial bandit algorithms with novel conversion from regret to FDR control and feedback unlocking to reuse partial feedback efficiently.

Result: Empirical evaluation shows the method effectively controls FDR while maintaining reasonable selection efficiency across diverse learning environments.

Conclusion: The proposed online selective generation algorithm provides a practical solution for controlling hallucination rates in language models operating under adversarial conditions with partial feedback.

Abstract: Large language generative models increasingly interact with humans, while their falsified responses raise concerns. To mitigate this hallucination effect, selectively abstaining from answering, called selective generation, provides an effective way for generators to control the hallucination when uncertain about their answers. However, as selective generators interact under adversarial environments and receive partial feedback from users on selected generation (e.g., thumbs up or down on the selected answer), learning methods for selective generation under such practical setups are crucial but currently missing. To address this limitation, we propose an online learning algorithm for selective generation with partial feedback under an adaptive adversary. In particular, we re-purpose an adversarial bandit algorithm to design an online selective generation method with controllable false discovery rates (FDR), which measures the rate of hallucination. The key building blocks include a novel conversion lemma from regret of any bandit algorithm to the FDR, and the exploitation of a unique structure of selective generation to reuse partial feedback, which we call feedback unlocking. We empirically evaluate the efficacy of the proposed online selective generation algorithm with partial feedback over diverse learning environments, demonstrating its ability to control the FDR, while maintaining reasonable selection efficiency, i.e., the ratio of non-abstaining answers, compared to baselines.

[1118] SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification

Zhenglin Lai, Mengyao Liao, Bingzhe Wu, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li

Main category: cs.LG

TL;DR: This paper identifies positional vulnerability in MoE models as a safety risk, develops SAFEx framework to detect safety-critical experts, and shows targeted interventions can efficiently mitigate safety issues.

Details

Motivation: MoE architectures introduce unique safety challenges not addressed by techniques for dense models, particularly the positional vulnerability where safety behaviors depend on specific expert modules.

Method: Developed SAFEx analytical framework with stability-based expert selection to identify safety-critical experts, categorized into Harmful Content Detection Group (HCDG) and Harmful Response Control Group (HRCG). Conducted expert-level interventions including targeted masking and LoRA adaptation.

Result: On Qwen3-30B-A3B (6,144 total experts), disabling just 12 selected experts reduced refusal rate by 22%. Lightweight LoRA adaptation targeted at HRCG improved refusal under adversarial prompts without full retraining.

Conclusion: Positional vulnerability is a distinct MoE-specific safety challenge, and expert-level interventions provide compute-efficient safety mitigation pathways for routed architectures.

Abstract: Large language models with Mixture-of-Experts (MoE) architectures achieve efficiency and scalability, yet their routing mechanisms introduce safety alignment challenges insufficiently addressed by techniques developed for dense models. In this work, the MoE-specific safety risk of positional vulnerability-that safety-aligned behaviors rely on specific expert modules-is formalized and systematically analyzed. An analytical framework, SAFEx, is presented to robustly identify, characterize, and validate safety-critical experts via a stability-based expert selection procedure, and to decompose them into two functional groups: the Harmful Content Detection Group (HCDG), which specializes in identifying and recognizing harmful content within user inputs, and the Harmful Response Control Group (HRCG), which specializes in controlling and enforcing model behaviors to generate appropriate safety responses. Expert-level interventions are conducted to probe causality and to test mitigation. Targeted masking of SAFEx-selected experts reveals that safety behavior is highly concentrated. On Qwen3-30B-A3B, configured with 48 MoE-FFN layers and 128 experts per layer under top-8 routing (48x128=6,144 experts in total), disabling 12 selected experts reduces the refusal rate by 22%. In addition, lightweight adaptation is performed using LoRA under three configurations-the HRCG, the union of HCDG and HRCG, and all experts-and the resulting updates are composed through negative weight merging targeted at the HRCG, leading to improved refusal under adversarial prompts without full-model retraining. These results establish positional vulnerability as a distinct MoE-specific safety challenge and provide a practical, compute-efficient pathway for expert-level safety interventions within routed architectures.

[1119] Structured Generative Modeling with the Thermodynamic Kolmogorov-Arnold Model

Prithvi Raj

Main category: cs.LG

TL;DR: T-KAM is a novel generative model that combines energy-based modeling in latent space with Kolmogorov-Arnold representation, enabling fast exact inference and efficient importance sampling while addressing multimodal sampling challenges.

Details

Motivation: To overcome limitations of existing energy-based models in latent spaces, particularly the inefficiency of Langevin Monte Carlo sampling and challenges with multimodal distributions, while leveraging interpretability for better model design.

Method: Proposes Thermodynamic Kolmogorov-Arnold Model (T-KAM) that constrains prior to univariate relationships for fast exact inference via inverse transform method, uses importance sampling for efficient posterior sampling, and employs population-based LMC for multimodal cases.

Result: T-KAM achieves fast inference, interpretability, stable training, and efficient multimodal sampling while being compatible with next-generation hardware.

Conclusion: T-KAM elegantly balances trade-offs in generative modeling by offering fast inference, interpretability, stable training, and efficient sampling, making it well-suited for modern computing architectures.

Abstract: Learning an energy-based model (EBM) in the latent space of a top-down generative model offers a versatile framework for generation across multiple data modalities. However, it remains unclear how its interpretability can be used to guide model design, improve generative quality, and reduce training time. Moreover, the reliance on Langevin Monte Carlo (LMC) sampling presents challenges in efficiency and sampling multimodal latent distributions. In this work, we propose a novel adaptation of the Kolmogorov-Arnold representation theorem for generative modeling and introduce the Thermodynamic Kolmogorov-Arnold Model (T-KAM) to take advantage of structural and inductive biases. By constraining the prior to univariate relationships, T-KAM enables fast and exact inference via the inverse transform method. With the low dimensionality of the latent space and suitable inductive biases encoded, we demonstrate that importance sampling (IS) becomes a viable, unbiased, and highly efficient posterior sampler. For situations where IS fails, we investigate a novel strategy using population-based LMC, which decomposes posterior sampling into a sequence of annealed distributions to improve multimodal sampling. T-KAM elegantly balances common trade-offs in generative modeling, offering fast inference, interpretability, and stable training, while being naturally suited to upcoming Zettascale Computing Corp. hardware.

[1120] Structured Kolmogorov-Arnold Neural ODEs for Interpretable Learning and Symbolic Discovery of Nonlinear Dynamics

Wei Liu, Kiran Bacsa, Loon Ching Tang, Eleni Chatzi

Main category: cs.LG

TL;DR: SKANODE integrates structured state-space modeling with Kolmogorov-Arnold Networks in a Neural ODE framework to create interpretable models of nonlinear dynamical systems, achieving high accuracy while discovering physics-consistent dynamics.

Details

Motivation: To address the challenge of creating deep learning models for nonlinear dynamical systems that are both accurate and physically interpretable, bridging the gap between black-box neural networks and interpretable physical models.

Method: Proposes Structured Kolmogorov-Arnold Neural ODEs (SKANODEs) that use trainable KANs as universal function approximators for virtual sensing to recover interpretable latent states, then leverage KAN’s symbolic regression to extract compact governing equations.

Result: SKANODE achieves superior predictive accuracy, discovers physics-consistent dynamics, reveals complex nonlinear behavior, identifies hysteretic behavior in F-16 aircraft, and recovers concise symbolic equations describing the phenomena.

Conclusion: SKANODE enables interpretable, data-driven discovery of physically grounded models for complex nonlinear dynamical systems, combining the accuracy of deep learning with the interpretability of symbolic representations.

Abstract: Understanding and modeling nonlinear dynamical systems is a fundamental challenge across science and engineering. Deep learning has shown remarkable potential for capturing complex system behavior, yet achieving models that are both accurate and physically interpretable remains difficult. To address this, we propose Structured Kolmogorov-Arnold Neural ODEs (SKANODEs), a framework that integrates structured state-space modeling with Kolmogorov-Arnold Networks (KANs). Within a Neural ODE architecture, SKANODE employs a fully trainable KAN as a universal function approximator to perform virtual sensing, recovering latent states that correspond to interpretable physical quantities such as displacements and velocities. Leveraging KAN’s symbolic regression capability, SKANODE then extracts compact, interpretable expressions for the system’s governing dynamics. Extensive experiments on simulated and real-world systems demonstrate that SKANODE achieves superior predictive accuracy, discovers physics-consistent dynamics, and reveals complex nonlinear behavior. Notably, it identifies hysteretic behavior in an F-16 aircraft and recovers a concise symbolic equation describing this phenomenon. SKANODE thus enables interpretable, data-driven discovery of physically grounded models for complex nonlinear dynamical systems.

[1121] On Convolutions, Intrinsic Dimension, and Diffusion Models

Kin Kwan Leung, Rasa Hosseinzadeh, Gabriel Loaiza-Ganem

Main category: cs.LG

TL;DR: This paper provides a theoretical foundation for FLIPD, a state-of-the-art local intrinsic dimension (LID) estimator derived from diffusion models, by proving its correctness under realistic assumptions rather than the previously required unrealistic affine submanifold assumption.

Details

Motivation: The motivation is to bridge the theoretical gap in FLIPD's foundation, as previous proofs only worked under the unrealistic assumption of affine submanifolds, despite FLIPD achieving state-of-the-art performance in practical applications.

Method: The authors formally prove the correctness of FLIPD under realistic assumptions about data manifolds, and extend the analysis to show analogous results hold when Gaussian convolutions are replaced with uniform convolutions.

Result: The paper successfully establishes the theoretical correctness of FLIPD under realistic assumptions, providing a solid mathematical foundation for its practical success in LID estimation tasks.

Conclusion: This work completes the theoretical underpinnings of FLIPD by proving its validity under realistic conditions, confirming its reliability as a state-of-the-art LID estimator derived from diffusion models.

Abstract: The manifold hypothesis asserts that data of interest in high-dimensional ambient spaces, such as image data, lies on unknown low-dimensional submanifolds. Diffusion models (DMs) – which operate by convolving data with progressively larger amounts of Gaussian noise and then learning to revert this process – have risen to prominence as the most performant generative models, and are known to be able to learn distributions with low-dimensional support. For a given datum in one of these submanifolds, we should thus intuitively expect DMs to have implicitly learned its corresponding local intrinsic dimension (LID), i.e. the dimension of the submanifold it belongs to. Kamkari et al. (2024b) recently showed that this is indeed the case by linking this LID to the rate of change of the log marginal densities of the DM with respect to the amount of added noise, resulting in an LID estimator known as FLIPD. LID estimators such as FLIPD have a plethora of uses, among others they quantify the complexity of a given datum, and can be used to detect outliers, adversarial examples and AI-generated text. FLIPD achieves state-of-the-art performance at LID estimation, yet its theoretical underpinnings are incomplete since Kamkari et al. (2024b) only proved its correctness under the highly unrealistic assumption of affine submanifolds. In this work we bridge this gap by formally proving the correctness of FLIPD under realistic assumptions. Additionally, we show that an analogous result holds when Gaussian convolutions are replaced with uniform ones, and discuss the relevance of this result.

[1122] The Hidden Link Between RLHF and Contrastive Learning

Xufei Lv, Kehai Chen, Haoyuan Sun, Xuefeng Bai, Min Zhang, Houde Liu, Kehai Chen

Main category: cs.LG

TL;DR: The paper proposes Mutual Information Optimization (MIO), a new method for aligning LLMs with human values that replaces the Donsker-Varadhan bound used in RLHF and DPO with the Jensen-Shannon mutual information estimator, achieving better performance on reasoning tasks.

Details

Motivation: Both RLHF and DPO can be interpreted from mutual information maximization perspective, revealing limitations in how they incentivize reasoning capabilities beyond the base model.

Method: Propose Mutual Information Optimization (MIO) by replacing the Donsker-Varadhan mutual information bound with the Jensen-Shannon estimator within the contrastive learning framework.

Result: MIO mitigates the late-stage decline in chosen-likelihood observed in DPO and achieves competitive or superior performance on various reasoning and mathematical benchmarks.

Conclusion: The mutual information perspective provides deeper understanding of alignment methods, and MIO offers improved performance particularly for reasoning tasks compared to existing approaches.

Abstract: Alignment of large language models (LLMs) with human values has recently garnered significant attention, with prominent examples including the canonical yet costly Reinforcement Learning from Human Feedback (RLHF) and the simple Direct Preference Optimization (DPO). In this work, we demonstrate that both RLHF and DPO can be interpreted from the perspective of mutual information (MI) maximization, uncovering a profound connection to contrastive learning. Within this framework, both RLHF and DPO can be interpreted as methods that performing contrastive learning based on the positive and negative samples derived from base model, leveraging the Donsker-Varadhan (DV) lower bound on MI (equivalently, the MINE estimator). Such paradigm further illuminates why RLHF may not intrinsically incentivize reasoning capacities in LLMs beyond what is already present in the base model. Building on the perspective, we replace the DV/MINE bound with the Jensen-Shannon (JS) MI estimator and propose the Mutual Information Optimization (MIO). Comprehensive theoretical analysis and extensive empirical evaluations demonstrate that MIO mitigates the late-stage decline in chosen-likelihood observed in DPO, achieving competitive or superior performance across various challenging reasoning and mathematical benchmarks.

[1123] Multi-model Online Conformal Prediction with Graph-Structured Feedback

Erfan Hajihashemi, Yanning Shen

Main category: cs.LG

TL;DR: The paper proposes a novel multi-model online conformal prediction algorithm that dynamically selects effective models using bipartite graph feedback, reducing computational complexity and prediction set sizes while maintaining coverage guarantees.

Details

Motivation: Existing multi-model online conformal prediction faces challenges with large candidate sets: increased computational complexity and inclusion of poor-performing models that lead to unnecessarily large prediction sets.

Method: A bipartite graph-based approach that identifies effective model subsets at each time step using feedback from new data. Uses both prediction set size and model loss as feedback to improve efficiency.

Result: The proposed algorithms ensure valid coverage and achieve sublinear regret. Experiments on real and synthetic datasets show smaller prediction sets and better performance than existing approaches.

Conclusion: The method successfully addresses computational complexity and prediction set size issues in multi-model online conformal prediction while maintaining theoretical guarantees and practical performance improvements.

Abstract: Online conformal prediction has demonstrated its capability to construct a prediction set for each incoming data point that covers the true label with a predetermined probability. To cope with potential distribution shift, multi-model online conformal prediction has been introduced to select and leverage different models from a preselected candidate set. Along with the improved flexibility, the choice of the preselected set also brings challenges. A candidate set that includes a large number of models may increase the computational complexity. In addition, the inclusion of irrelevant models with poor performance may negatively impact the performance and lead to unnecessarily large prediction sets. To address these challenges, we propose a novel multi-model online conformal prediction algorithm that identifies a subset of effective models at each time step by collecting feedback from a bipartite graph, which is refined upon receiving new data. A model is then selected from this subset to construct the prediction set, resulting in reduced computational complexity and smaller prediction sets. Additionally, we demonstrate that using prediction set size as feedback, alongside model loss, can significantly improve efficiency by constructing smaller prediction sets while still satisfying the required coverage guarantee. The proposed algorithms are proven to ensure valid coverage and achieve sublinear regret. Experiments on real and synthetic datasets validate that the proposed methods construct smaller prediction sets and outperform existing multi-model online conformal prediction approaches.

[1124] Theoretical Modeling of LLM Self-Improvement Training Dynamics Through Solver-Verifier Gap

Yifan Sun, Yushan Liang, Zhen Zhang, Jiaye Teng

Main category: cs.LG

TL;DR: This paper proposes a theoretical framework using solver-verifier gap to model LLM self-improvement dynamics, showing how performance evolves and reaches limits during training without external data.

Details

Motivation: Self-improvement is important for LLMs but how performance evolves during this process remains underexplored theoretically.

Method: Theoretical modeling of training dynamics using solver-verifier gap concept, fitting theoretical model to experimental results to quantify capability limits.

Result: Empirical validation shows effectiveness across various LLMs and datasets. External data analysis reveals it can be used at any stage without affecting final performance under limited data regimes.

Conclusion: The solver-verifier gap framework successfully models self-improvement dynamics and provides insights into performance limits and external data utilization.

Abstract: Self-improvement is among the most prominent techniques within the realm of large language models (LLM), aiming to enhance the LLM performance without relying on external data. Despite its significance, generally how LLM performances evolve during the self-improvement process remains underexplored. In this paper, we theoretically model the training dynamics of self-improvement via the concept of solver-verifier gap. This is inspired by the conjecture that the performance enhancement of self-improvement stems from the gap between LLM’s solver capability and verifier capability. Based on the theoretical framework, we further show how to model the entire training trajectory. This framework allows quantifying the capability limit of self-improvement by fitting the theoretical model to the experiment results. We empirically validate the effectiveness of the theoretical framework on various LLMs and datasets. Beyond self-improvement, we extend our analysis to investigate how external data influences these dynamics within the framework. Notably, we find that under limited external data regimes, such external data can be utilized at any stage without significantly affecting final performances, which accords with the empirical observations.

[1125] The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models

Lijun Sheng, Jian Liang, Ran He, Zilei Wang, Tieniu Tan

Main category: cs.LG

TL;DR: TTA-VLM is a comprehensive benchmark for evaluating test-time adaptation methods on vision-language models, addressing limitations in current TTA research through unified evaluation across multiple datasets and metrics.

Details

Motivation: Current TTA research suffers from duplicated baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient analysis, making fair comparisons and practical assessment difficult.

Method: Implemented 8 episodic TTA and 7 online TTA methods in a unified framework, evaluated across 15 datasets, extended evaluation to SigLIP beyond CLIP, and included training-time tuning methods for broader assessment.

Result: 1) Existing TTA methods show limited gains compared to pioneering work; 2) Poor collaboration between TTA and training-time fine-tuning methods; 3) Accuracy improvements often reduce model trustworthiness.

Conclusion: TTA-VLM provides fair comparison and comprehensive evaluation of TTA methods for VLMs, encouraging development of more reliable and generalizable TTA strategies.

Abstract: Test-time adaptation (TTA) methods have gained significant attention for enhancing the performance of vision-language models (VLMs) such as CLIP during inference, without requiring additional labeled data. However, current TTA researches generally suffer from major limitations such as duplication of baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient analysis. These problems hinder fair comparisons between TTA methods and make it difficult to assess their practical strengths and weaknesses. To address these challenges, we introduce TTA-VLM, a comprehensive benchmark for evaluating TTA methods on VLMs. Our benchmark implements 8 episodic TTA and 7 online TTA methods within a unified and reproducible framework, and evaluates them across 15 widely used datasets. Unlike prior studies focused solely on CLIP, we extend the evaluation to SigLIP–a model trained with a Sigmoid loss–and include training-time tuning methods such as CoOp, MaPLe, and TeCoA to assess generality. Beyond classification accuracy, TTA-VLM incorporates various evaluation metrics, including robustness, calibration, out-of-distribution detection, and stability, enabling a more holistic assessment of TTA methods. Through extensive experiments, we find that 1) existing TTA methods produce limited gains compared to the previous pioneering work; 2) current TTA methods exhibit poor collaboration with training-time fine-tuning methods; 3) accuracy gains frequently come at the cost of reduced model trustworthiness. We release TTA-VLM to provide fair comparison and comprehensive evaluation of TTA methods for VLMs, and we hope it encourages the community to develop more reliable and generalizable TTA strategies.

[1126] PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning

Tatsuki Kawakami, Kazuki Egashira, Atsuyuki Miyai, Go Irie, Kiyoharu Aizawa

Main category: cs.LG

TL;DR: The paper introduces PULSE protocol for evaluating unlearning in large multimodal models, focusing on pre-trained knowledge unlearning and long-term sustainability, revealing limitations in existing methods.

Details

Motivation: Address the lack of practical evaluation frameworks for unlearning in large multimodal models, particularly for realistic scenarios involving pre-trained knowledge and sequential unlearning requests.

Method: Proposed PULSE protocol with two key perspectives: (i) Pre-trained knowledge Unlearning to analyze effects across different knowledge acquisition phases, and (ii) Long-term Sustainability Evaluation for sequential unlearning requests.

Result: Existing unlearning methods successfully unlearn fine-tuned knowledge but struggle with pre-trained knowledge. Methods effective for batch unlearning show significant performance degradation when data is split and unlearned sequentially.

Conclusion: Current unlearning techniques have limitations in handling pre-trained knowledge and sequential unlearning scenarios, highlighting the need for more robust unlearning methods for large multimodal models.

Abstract: In recent years, unlearning techniques, which are methods for inducing a model to “forget” previously learned information, have attracted attention as a way to address privacy and copyright concerns in large language models (LLMs) and large multimodal models (LMMs). While several unlearning benchmarks have been established for LLMs, a practical evaluation framework for unlearning in LMMs has been less explored. Specifically, existing unlearning benchmark for LMMs considers only scenarios in which the model is required to unlearn fine-tuned knowledge through a single unlearning operation. In this study, we introduce PULSE protocol for realistic unlearning scenarios for LMMs by introducing two critical perspectives: (i) Pre-trained knowledge Unlearning for analyzing the effect across different knowledge acquisition phases and (ii) Long-term Sustainability Evaluation to address sequential requests. We then evaluate existing unlearning methods along these dimensions. Our results reveal that, although some techniques can successfully unlearn knowledge acquired through fine-tuning, they struggle to eliminate information learned during pre-training. Moreover, methods that effectively unlearn a batch of target data in a single operation exhibit substantial performance degradation when the same data are split and unlearned sequentially.

[1127] GradMetaNet: An Equivariant Architecture for Learning on Gradients

Yoav Gelberg, Yam Eitan, Aviv Navon, Aviv Shamsian, Theo, Putterman, Michael Bronstein, Haggai Maron

Main category: cs.LG

TL;DR: GradMetaNet is a novel architecture designed specifically for processing neural network gradients, featuring equivariant design, multi-point gradient processing, and efficient rank-1 decomposition.

Details

Motivation: Existing gradient processing methods use architectures not specifically designed for gradients, limiting their effectiveness. Gradients contain valuable optimization and model analysis information that requires specialized processing.

Method: Three principles guide GradMetaNet: (1) equivariant design preserving neuron permutation symmetries, (2) processing gradient sets across multiple data points for curvature information, (3) efficient gradient representation via rank-1 decomposition. Built from simple equivariant blocks.

Result: GradMetaNet achieves universality in gradient function approximation, outperforms previous approaches, and demonstrates effectiveness on diverse tasks including learned optimization, INR editing, and loss landscape curvature estimation for MLPs and transformers.

Conclusion: Specialized architectures like GradMetaNet that respect gradient-specific properties enable more effective processing of neural network gradients across various applications.

Abstract: Gradients of neural networks encode valuable information for optimization, editing, and analysis of models. Therefore, practitioners often treat gradients as inputs to task-specific algorithms, e.g. for pruning or optimization. Recent works explore learning algorithms that operate directly on gradients but use architectures that are not specifically designed for gradient processing, limiting their applicability. In this paper, we present a principled approach for designing architectures that process gradients. Our approach is guided by three principles: (1) equivariant design that preserves neuron permutation symmetries, (2) processing sets of gradients across multiple data points to capture curvature information, and (3) efficient gradient representation through rank-1 decomposition. Based on these principles, we introduce GradMetaNet, a novel architecture for learning on gradients, constructed from simple equivariant blocks. We prove universality results for GradMetaNet, and show that previous approaches cannot approximate natural gradient-based functions that GradMetaNet can. We then demonstrate GradMetaNet’s effectiveness on a diverse set of gradient-based tasks on MLPs and transformers, such as learned optimization, INR editing, and estimating loss landscape curvature.

[1128] Understanding and Improving Length Generalization in Recurrent Models

Ricardo Buitrago Ruiz, Albert Gu

Main category: cs.LG

TL;DR: Recurrent models fail to generalize to longer sequences due to limited state exposure during training. Simple interventions like state initialization with noise or cross-sequence states enable length generalization with minimal additional training.

Details

Motivation: Recurrent models have linear complexity but fail to generalize beyond training context lengths due to limited state distribution exposure during training.

Method: Empirical and theoretical analysis of the ‘unexplored states hypothesis’, plus simple training interventions: state initialization with Gaussian noise or cross-sequence final states.

Result: With only 500 post-training steps (~0.1% of pre-training budget), models generalize to sequences orders of magnitude longer (e.g., 2k→128k) and show improved long-context performance.

Conclusion: Simple state coverage interventions enable robust length generalization in recurrent models with minimal computational cost.

Abstract: Recently, recurrent models such as state space models and linear attention have become popular due to their linear complexity in the sequence length. Thanks to their recurrent nature, in principle they can process arbitrarily long sequences, but their performance sometimes drops considerably beyond their training context lengths-i.e. they fail to length generalize. In this work, we provide comprehensive empirical and theoretical analysis to support the unexplored states hypothesis, which posits that models fail to length generalize when during training they are only exposed to a limited subset of the distribution of all attainable states (i.e. states that would be attained if the recurrence was applied to long sequences). Furthermore, we investigate simple training interventions that aim to increase the coverage of the states that the model is trained on, e.g. by initializing the state with Gaussian noise or with the final state of a different input sequence. With only 500 post-training steps ($\sim 0.1%$ of the pre-training budget), these interventions enable length generalization for sequences that are orders of magnitude longer than the training context (e.g. $2k\longrightarrow 128k$) and show improved performance in long context tasks, thus presenting a simple and efficient way to enable robust length generalization in general recurrent models.

[1129] Train-before-Test Harmonizes Language Model Rankings

Guanhua Zhang, Ricardo Dominguez-Olmedo, Moritz Hardt

Main category: cs.LG

TL;DR: The paper proposes ’train-before-test’ approach where models are fine-tuned on benchmark-specific data before evaluation, revealing consistent model potential rankings across 24 benchmarks and 61 models.

Details

Motivation: Existing language model benchmarks provide contradictory model rankings, making model selection and comparison difficult. The dilemma of conflicting rankings hampers progress in model evaluation.

Method: Instead of direct evaluation, the authors compare model potential by providing identical benchmark-specific fine-tuning before evaluation (train-before-test approach). They evaluate 61 models across 24 benchmarks.

Result: Train-before-test shows remarkable consistency in model rankings across all benchmarks, restores connection between perplexity and downstream performance, and reduces model-score matrix to essentially rank one, revealing a dominant latent factor of model potential.

Conclusion: Train-before-test should be made a default component of LLM benchmarking as it provides consistent model potential rankings that transfer well across different benchmarks.

Abstract: Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare model potential by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach train-before-test. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas traditional rankings demonstrate little external validity under direct evaluation, they enjoy a significant degree of external validity when applying train-before-test: model potential rankings transfer gracefully from one benchmark to another. Second, train-before-test restores the connection between perplexity and downstream task performance, lost under direct evaluation. Remarkably, even pre-finetuning perplexity of a base model predicts post-finetuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by one latent factor, uncovered by train-before-test. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.

[1130] Simulating Three-dimensional Turbulence with Physics-informed Neural Networks

Sifan Wang, Shyam Sankaran, Xiantao Fan, Panos Stinis, Paris Perdikaris

Main category: cs.LG

TL;DR: PINNs can simulate turbulent flows in 2D and 3D without computational grids or training data, using adaptive architectures and causal training to overcome chaotic dynamics challenges.

Details

Motivation: Traditional turbulent flow simulations require enormous computational resources that become prohibitive at high flow speeds, creating a need for alternative approaches.

Method: Physics-informed neural networks (PINNs) trained directly from fluid equations using adaptive network architectures, causal training, and advanced optimization methods.

Result: PINNs accurately reproduce key turbulence statistics including energy spectra, kinetic energy, enstrophy, and Reynolds stresses in both 2D and 3D flows.

Conclusion: Neural equation solvers can handle complex chaotic systems, opening new possibilities for continuous turbulence modeling that transcends traditional computational limitations.

Abstract: Turbulent fluid flows are among the most computationally demanding problems in science, requiring enormous computational resources that become prohibitive at high flow speeds. Physics-informed neural networks (PINNs) represent a radically different approach that trains neural networks directly from physical equations rather than data, offering the potential for continuous, mesh-free solutions. Here we show that appropriately designed PINNs can successfully simulate fully turbulent flows in both two and three dimensions, directly learning solutions to the fundamental fluid equations without traditional computational grids or training data. Our approach combines several algorithmic innovations including adaptive network architectures, causal training, and advanced optimization methods to overcome the inherent challenges of learning chaotic dynamics. Through rigorous validation on challenging turbulence problems, we demonstrate that PINNs accurately reproduce key flow statistics including energy spectra, kinetic energy, enstrophy, and Reynolds stresses. Our results demonstrate that neural equation solvers can handle complex chaotic systems, opening new possibilities for continuous turbulence modeling that transcends traditional computational limitations.

[1131] Learning Diffusion Models with Flexible Representation Guidance

Chenyu Wang, Cai Zhou, Sharut Gupta, Zongyu Lin, Stefanie Jegelka, Stephen Bates, Tommi Jaakkola

Main category: cs.LG

TL;DR: The paper presents a systematic framework for improving diffusion models through representation guidance, introducing two strategies: learning joint models over multimodal pairs and designing optimal training curricula that balance representation learning and data generation.

Details

Motivation: Diffusion models can be enhanced by aligning their internal representations with pre-trained models, as prior empirical work has shown this improves generation quality.

Method: 1) Alternative decompositions of denoising models with associated training criteria; 2) Pairing examples with target representations (self-derived or from synthetic modalities) to learn joint models; 3) Designing optimal training curriculum balancing representation learning and data generation.

Result: Superior performance and accelerated training across image, protein sequence, and molecule generation tasks. On ImageNet 256×256 benchmark, achieved 23.3× faster training than SiT-XL and 4× speedup over state-of-the-art REPA.

Conclusion: The proposed representation guidance framework significantly improves diffusion model training efficiency and performance across multiple domains.

Abstract: Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre-trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arisen from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class-conditional ImageNet $256\times 256$ benchmark, our guidance results in $23.3$ times faster training than the original SiT-XL as well as four times speedup over the state-of-the-art method REPA. The code is available at https://github.com/ChenyuWang-Monica/REED.

Fei Zhao, Chonggang Lu, Yue Wang, Zheyong Xie, Ziyan Liu, Haofu Qian, JianZhao Huang, Fangcheng Shi, Zijie Meng, Hongcheng Guo, Mingqian He, Xinze Lyu, Yiming Lu, Ziyang Xiang, Zheyu Ye, Chengqiang Lu, Zhe Xu, Yi Wu, Yao Hu, Yan Gao, Jun Fan, Xiaolong Jiang, Weiting Liu, Boyang Wang, Shaosheng Cao

Main category: cs.LG

TL;DR: RedOne is a domain-specific LLM for social networking services that achieves significant performance improvements across multiple SNS tasks through a three-stage training approach, outperforming single-task baselines by up to 14.02% and demonstrating strong real-world applicability.

Details

Motivation: Social networking services face challenges in content management and interaction quality. Existing LLM solutions focus on isolated tasks, leading to diminishing benefits from data scaling and poor adaptation to diverse real-world contexts.

Method: Three-stage training strategy: continued pretraining, supervised fine-tuning, and preference optimization using large-scale real-world SNS datasets to create a comprehensive foundation model.

Result: Average improvement of 14.02% across 8 major SNS tasks and 7.56% in SNS bilingual evaluation benchmark. Online testing showed 11.23% reduction in harmful content exposure and 14.95% improvement in post-view search click rates compared to single-task baselines.

Conclusion: RedOne establishes itself as a robust domain-specific LLM for SNS with excellent generalization across various tasks and promising real-world applicability, breaking the performance bottleneck of single-task approaches.

Abstract: As a primary medium for modern information dissemination, social networking services (SNS) have experienced rapid growth, which has proposed significant challenges for platform content management and interaction quality improvement. Recently, the development of large language models (LLMs) has offered potential solutions but existing studies focus on isolated tasks, which not only encounter diminishing benefit from the data scaling within individual scenarios but also fail to flexibly adapt to diverse real-world context. To address these challenges, we introduce RedOne, a domain-specific LLM designed to break the performance bottleneck of single-task baselines and establish a comprehensive foundation for the SNS. RedOne was developed through a three-stage training strategy consisting of continue pretraining, supervised fine-tuning, and preference optimization, using a large-scale real-world dataset. Through extensive experiments, RedOne maintains strong general capabilities, and achieves an average improvement up to 14.02% across 8 major SNS tasks and 7.56% in SNS bilingual evaluation benchmark, compared with base models. Furthermore, through online testing, RedOne reduced the exposure rate in harmful content detection by 11.23% and improved the click page rate in post-view search by 14.95% compared with single-tasks finetuned baseline models. These results establish RedOne as a robust domain-specific LLM for SNS, demonstrating excellent generalization across various tasks and promising applicability in real-world scenarios.

[1133] Learning Representations of Event Time Series with Sparse Autoencoders for Anomaly Detection, Similarity Search, and Unsupervised Classification

Steven Dillmann, Juan Rafael Martínez-Galarza

Main category: cs.LG

TL;DR: Proposes novel tensor representations and sparse autoencoders for analyzing irregular event time series, enabling anomaly detection, similarity retrieval, clustering, and classification across various domains.

Details

Motivation: Event time series with irregular intervals and domain-specific modalities pose challenges for conventional analysis techniques in fields like astrophysics, finance, and healthcare.

Method: Develops two- and three-dimensional tensor representations coupled with sparse autoencoders to learn physically meaningful latent representations from event time series.

Result: Successfully captures temporal and spectral signatures, isolates diverse classes of X-ray transients in astronomy datasets, and supports multiple downstream tasks.

Conclusion: Provides a flexible, scalable, and generalizable framework for analyzing complex, irregular event time series across scientific and industrial domains.

Abstract: Event time series are sequences of discrete events occurring at irregular time intervals, each associated with a domain-specific observational modality. They are common in domains such as high-energy astrophysics, computational social science, cybersecurity, finance, healthcare, neuroscience, and seismology. Their unstructured and irregular structure poses significant challenges for extracting meaningful patterns and identifying salient phenomena using conventional techniques. We propose novel two- and three-dimensional tensor representations for event time series, coupled with sparse autoencoders that learn physically meaningful latent representations. These embeddings support a variety of downstream tasks, including anomaly detection, similarity-based retrieval, semantic clustering, and unsupervised classification. We demonstrate our approach on a real-world dataset from X-ray astronomy, showing that these representations successfully capture temporal and spectral signatures and isolate diverse classes of X-ray transients. Our framework offers a flexible, scalable, and generalizable solution for analyzing complex, irregular event time series across scientific and industrial domains.

[1134] Robust Causal Discovery in Real-World Time Series with Power-Laws

Matteo Tusoni, Giuseppe Masi, Andrea Coletta, Aldo Glielmo, Viviana Arrigoni, Novella Bartolini

Main category: cs.LG

TL;DR: Proposes a robust causal discovery method using power-law spectral features to handle noisy real-world time series data.

Details

Motivation: Existing causal discovery algorithms are highly sensitive to noise, leading to misleading inferences in real applications like finance and neuroscience.

Method: Leverages the observation that real-world time series follow power-law spectral distributions due to self-organizing behavior, extracting power-law spectral features to amplify genuine causal signals.

Result: Consistently outperforms state-of-the-art methods on both synthetic benchmarks and real-world datasets with known causal structures.

Conclusion: The proposed method demonstrates robustness and practical relevance for causal discovery in noisy time series data across various domains.

Abstract: Exploring causal relationships in stochastic time series is a challenging yet crucial task with a vast range of applications, including finance, economics, neuroscience, and climate science. Many algorithms for Causal Discovery (CD) have been proposed, but they often exhibit a high sensitivity to noise, resulting in misleading causal inferences when applied to real data. In this paper, we observe that the frequency spectra of typical real-world time series follow a power-law distribution, notably due to an inherent self-organizing behavior. Leveraging this insight, we build a robust CD method based on the extraction of power -law spectral features that amplify genuine causal signals. Our method consistently outperforms state-of-the-art alternatives on both synthetic benchmarks and real-world datasets with known causal structures, demonstrating its robustness and practical relevance.

[1135] Cost-aware Stopping for Bayesian Optimization

Qian Xie, Linda Cai, Alexander Terenin, Peter I. Frazier, Ziv Scully

Main category: cs.LG

TL;DR: A cost-aware stopping rule for Bayesian optimization that adapts to varying evaluation costs without heuristic tuning, with theoretical guarantees on cumulative evaluation costs.

Details

Motivation: Existing adaptive stopping rules in Bayesian optimization lack guarantees for stopping before excessive function evaluation costs in cost-aware settings.

Method: Proposed a cost-aware stopping rule grounded in theoretical connection to state-of-the-art cost-aware acquisition functions (Pandora’s Box Gittins Index and log expected improvement per cost).

Result: Theoretical guarantee bounding expected cumulative evaluation cost. Experiments on synthetic and empirical tasks show the stopping rule with PBGI acquisition function usually matches or outperforms other methods in cost-adjusted simple regret.

Conclusion: The proposed cost-aware stopping rule provides effective performance-cost trade-offs in Bayesian optimization applications.

Abstract: In automated machine learning, scientific discovery, and other applications of Bayesian optimization, deciding when to stop evaluating expensive black-box functions is an important practical consideration. While several adaptive stopping rules have been proposed, in the cost-aware setting they lack guarantees ensuring they stop before incurring excessive function evaluation costs. We propose a cost-aware stopping rule for Bayesian optimization that adapts to varying evaluation costs and is free of heuristic tuning. Our rule is grounded in a theoretical connection to state-of-the-art cost-aware acquisition functions, namely the Pandora’s Box Gittins Index (PBGI) and log expected improvement per cost. We prove a theoretical guarantee bounding the expected cumulative evaluation cost incurred by our stopping rule when paired with these two acquisition functions. In experiments on synthetic and empirical tasks, including hyperparameter optimization and neural architecture size search, we show that combining our stopping rule with the PBGI acquisition function usually matches or outperforms other acquisition-function–stopping-rule pairs in terms of cost-adjusted simple regret, a metric capturing trade-offs between solution quality and cumulative evaluation cost.

[1136] Efficient Approximate Posterior Sampling with Annealed Langevin Monte Carlo

Advait Parulekar, Litu Rout, Karthikeyan Shanmugam, Sanjay Shakkottai

Main category: cs.LG

TL;DR: The paper addresses posterior sampling in score-based generative models, showing that while exact sampling is intractable under standard assumptions, approximate sampling is possible by biasing the prior distribution toward measurements.

Details

Motivation: Prior work established that exact posterior sampling is intractable in KL divergence under computational hardness assumptions, yet popular algorithms for tasks like image super-resolution and reconstruction work well in practice. This motivates studying approximate posterior sampling.

Method: The paper views posterior sampling as a “tilting” problem of biasing a distribution toward a measurement. Under minimal assumptions, it shows tractable sampling from a distribution that is close to the posterior of a noised prior in KL divergence and the true posterior in Fisher divergence.

Result: The authors demonstrate that one can sample from a distribution that is simultaneously close to the posterior of a noised prior in KL divergence and the true posterior in Fisher divergence, ensuring consistency with both the measurement and the prior.

Conclusion: These are the first formal results showing that (approximate) posterior sampling is possible in polynomial time, bridging the gap between theoretical intractability and empirical success in practical applications.

Abstract: We study the problem of posterior sampling in the context of score based generative models. We have a trained score network for a prior $p(x)$, a measurement model $p(y|x)$, and are tasked with sampling from the posterior $p(x|y)$. Prior work has shown this to be intractable in KL (in the worst case) under well-accepted computational hardness assumptions. Despite this, popular algorithms for tasks such as image super-resolution, stylization, and reconstruction enjoy empirical success. Rather than establishing distributional assumptions or restricted settings under which exact posterior sampling is tractable, we view this as a more general “tilting” problem of biasing a distribution towards a measurement. Under minimal assumptions, we show that one can tractably sample from a distribution that is simultaneously close to the posterior of a noised prior in KL divergence and the true posterior in Fisher divergence. Intuitively, this combination ensures that the resulting sample is consistent with both the measurement and the prior. To the best of our knowledge these are the first formal results for (approximate) posterior sampling in polynomial time.

[1137] A Comprehensive Evaluation on Quantization Techniques for Large Language Models

Yutong Liu, Cairong Zhao, Guosheng Hu

Main category: cs.LG

TL;DR: This paper provides a comprehensive review and fair evaluation of post-training quantization methods for LLMs, analyzing connections between different approaches and evaluating performance under consistent conditions.

Details

Motivation: To address the lack of fair comparisons in LLM quantization research, where methods are often evaluated under different settings, making it difficult to understand their true performance and connections between approaches.

Method: Decoupled quantization methods into pre-quantization transformation and quantization error mitigation steps; conducted extensive evaluations under same conditions; analyzed granularity, symmetry, and new FP4 formats (MXFP4, NVFP4).

Result: Optimized rotation and scaling provide best pre-quantization performance; combining low-rank compensation with GPTQ can sometimes outperform GPTQ alone; finer granularity improves performance but increases storage; FP4 performance depends heavily on scaling-factor format and precision.

Conclusion: Rotation-based strategies effective for INT4 offer limited gains for newer FP4 formats, highlighting the need for further research into quantization methods specifically designed for floating-point formats.

Abstract: For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is rapidly evolving. Though many papers report breakthrough results, they are often evaluated under different settings because a method typically contains multiple components. Analyzing connections among existing methods is important for deeper understanding. To bridge these gaps, we conduct an extensive review of state-of-the-art methods and perform comprehensive evaluations under the same conditions for fair comparison. To our knowledge, such a fair and extensive investigation remains critically underexplored. To better understand connections, first, we decouple published quantization methods into two steps: pre-quantization transformation and quantization error mitigation. The former is a preprocessing step that reduces outlier impact by flattening the data distribution; the latter offsets quantization errors to improve performance. Second, we evaluate and analyze the impact of different settings, including granularity and symmetry. Third, we analyze and evaluate the latest MXFP4 and NVFP4 data formats and their performance. Our experiments first demonstrate that optimized rotation and scaling yield the best pre-quantization performance, and that combining low-rank compensation with GPTQ can occasionally outperform GPTQ alone for error mitigation. Second, finer granularity improves performance but increases storage overhead. Third, we find that scaling-factor format and precision greatly affect FP4 performance, and that rotation-based strategies effective for INT4 offer limited gains for MXFP4 and NVFP4, motivating further study.

[1138] A Vision-Language Pre-training Model-Guided Approach for Mitigating Backdoor Attacks in Federated Learning

Keke Gai, Dongjue Wang, Jing Yu, Liehuang Zhu, Qi Wu

Main category: cs.LG

TL;DR: CLIP-Fed is a federated learning backdoor defense framework that uses vision-language models and multimodal LLMs to defend against attacks in Non-IID data distributions without requiring client samples.

Details

Motivation: Existing FL backdoor defenses struggle with heterogeneous client data distributions and privacy concerns, often relying on homogeneous data assumptions or clean server datasets.

Method: Integrates pre-aggregation and post-aggregation defense strategies using CLIP’s zero-shot learning, prototype contrastive loss, KL divergence, and multimodal LLM-based dataset augmentation without client samples.

Result: Achieves average ASR reduction of 2.03% on CIFAR-10 and 1.35% on CIFAR-10-LT, while improving MTA by 7.92% and 0.48% respectively compared to existing methods.

Conclusion: CLIP-Fed effectively defends against backdoor attacks in FL with Non-IID data while preserving privacy and improving model performance.

Abstract: Defending backdoor attacks in Federated Learning (FL) under heterogeneous client data distributions encounters limitations balancing effectiveness and privacy-preserving, while most existing methods highly rely on the assumption of homogeneous client data distributions or the availability of a clean serve dataset. In this paper, we propose an FL backdoor defense framework, named CLIP-Fed, that utilizes the zero-shot learning capabilities of vision-language pre-training models. Our scheme overcomes the limitations of Non-IID imposed on defense effectiveness by integrating pre-aggregation and post-aggregation defense strategies. CLIP-Fed aligns the knowledge of the global model and CLIP on the augmented dataset using prototype contrastive loss and Kullback-Leibler divergence, so that class prototype deviations caused by backdoor samples are ensured and the correlation between trigger patterns and target labels is eliminated. In order to balance privacy-preserving and coverage enhancement of the dataset against diverse triggers, we further construct and augment the server dataset via using the multimodal large language model and frequency analysis without any client samples. Extensive experiments on representative datasets evidence the effectiveness of CLIP-Fed. Comparing to other existing methods, CLIP-Fed achieves an average reduction in Attack Success Rate, {\em i.e.}, 2.03% on CIFAR-10 and 1.35% on CIFAR-10-LT, while improving average Main Task Accuracy by 7.92% and 0.48%, respectively. Our codes are available at https://anonymous.4open.science/r/CLIP-Fed.

[1139] Causal Explanation of Concept Drift – A Truly Actionable Approach

David Komnick, Kathrin Lammers, Barbara Hammer, Valerie Vaquet, Fabian Hinder

Main category: cs.LG

TL;DR: Extends model-based concept drift explanations to causal explanations for more actionable insights, enabling targeted interventions by identifying causally relevant features affected by drift.

Details

Motivation: Understanding how changes impact systems is crucial for preventing model failures and physical world errors. Current drift explanations lack actionability, so causal explanations are needed to enable targeted interventions.

Method: Extends model-based drift explanations towards causal explanations, isolating causally relevant features impacted by concept drift.

Result: Evaluated on multiple use cases, demonstrating practical usefulness in identifying features for targeted intervention.

Conclusion: The framework successfully provides causal explanations for concept drift, increasing actionability and enabling targeted interventions to address model failures and system malfunctions.

Abstract: In a world that constantly changes, it is crucial to understand how those changes impact different systems, such as industrial manufacturing or critical infrastructure. Explaining critical changes, referred to as concept drift in the field of machine learning, is the first step towards enabling targeted interventions to avoid or correct model failures, as well as malfunctions and errors in the physical world. Therefore, in this work, we extend model-based drift explanations towards causal explanations, which increases the actionability of the provided explanations. We evaluate our explanation strategy on a number of use cases, demonstrating the practical usefulness of our framework, which isolates the causally relevant features impacted by concept drift and, thus, allows for targeted intervention.

[1140] Imbalance-Robust and Sampling-Efficient Continuous Conditional GANs via Adaptive Vicinity and Auxiliary Regularization

Xin Ding, Yun Chen, Yongwei Wang, Kao Zhang, Sen Zhang, Peibei Cao, Xiangxue Wang

Main category: cs.LG

TL;DR: CcGAN-AVAR is an enhanced conditional GAN framework that addresses data imbalance in CcGAN and computational inefficiency in CCDM through adaptive vicinity mechanisms and multi-task discriminators, achieving state-of-the-art generation quality with 30x-2000x faster inference.

Details

Motivation: To overcome fundamental limitations in existing conditional generative models: CcGAN suffers from data imbalance due to fixed-size vicinity constraints, while CCDM requires computationally expensive iterative sampling.

Method: Proposes CcGAN-AVAR with two novel components: (1) adaptive vicinity mechanism that dynamically adjusts vicinity size to handle data imbalance, and (2) multi-task discriminator that enhances generator training through auxiliary regression and density ratio estimation, leveraging GAN’s native one-step generator.

Result: Extensive experiments on four benchmark datasets (64x64 to 256x256 resolution) across eleven challenging settings demonstrate state-of-the-art generation quality while maintaining sampling efficiency with 30x-2000x faster inference than CCDM.

Conclusion: CcGAN-AVAR successfully addresses the limitations of both CcGAN and CCDM, achieving superior generation quality with significantly improved computational efficiency.

Abstract: Recent advances in conditional generative modeling have introduced Continuous conditional Generative Adversarial Network (CcGAN) and Continuous Conditional Diffusion Model (CCDM) for estimating high-dimensional data distributions conditioned on scalar, continuous regression labels (e.g., angles, ages, or temperatures). However, these approaches face fundamental limitations: CcGAN suffers from data imbalance due to fixed-size vicinity constraints, while CCDM requires computationally expensive iterative sampling. To address these issues, we propose CcGAN-AVAR, an enhanced CcGAN framework featuring (1) two novel components for handling data imbalance - an adaptive vicinity mechanism that dynamically adjusts vicinity size and a multi-task discriminator that enhances generator training through auxiliary regression and density ratio estimation - and (2) the GAN framework’s native one-step generator, enable 30x-2000x faster inference than CCDM. Extensive experiments on four benchmark datasets (64x64 to 256x256 resolution) across eleven challenging settings demonstrate that CcGAN-AVAR achieves state-of-the-art generation quality while maintaining sampling efficiency.

[1141] Learning Satellite Attitude Dynamics with Physics-Informed Normalising Flow

Carlo Cena, Mauro Martini, Marcello Chiaberge

Main category: cs.LG

TL;DR: Physics-Informed Neural Networks (PINNs) combined with Real NVP architecture improve spacecraft attitude control by 27.08% over purely data-driven models, with up to 42.86% better performance stability in MPC applications.

Details

Motivation: Traditional MPC relies on accurate physics models, but these can be incomplete or computationally expensive. Machine learning offers an alternative but struggles with generalization and stability outside training data.

Method: Used Real NVP neural network with self-attention mechanism trained on Basilisk simulator data. Compared purely data-driven baseline with physics-informed variant (PINNs) to improve robustness.

Result: PINN-based models reduced mean relative error by 27.08% and improved performance stability error by up to 42.86% in MPC framework. Also showed increased robustness to noise.

Conclusion: Incorporating physics-based information into neural networks significantly enhances spacecraft attitude dynamics modeling, providing better control accuracy and robustness compared to purely data-driven approaches.

Abstract: Attitude control is a fundamental aspect of spacecraft operations. Model Predictive Control (MPC) has emerged as a powerful strategy for these tasks, relying on accurate models of the system dynamics to optimize control actions over a prediction horizon. In scenarios where physics models are incomplete, difficult to derive, or computationally expensive, machine learning offers a flexible alternative by learning the system behavior directly from data. However, purely data-driven models often struggle with generalization and stability, especially when applied to inputs outside their training domain. To address these limitations, we investigate the benefits of incorporating Physics-Informed Neural Networks (PINNs) into the learning of spacecraft attitude dynamics, comparing their performance with that of purely data-driven approaches. Using a Real-valued Non-Volume Preserving (Real NVP) neural network architecture with a self-attention mechanism, we trained several models on simulated data generated with the Basilisk simulator. Two training strategies were considered: a purely data-driven baseline and a physics-informed variant to improve robustness and stability. Our results demonstrate that the inclusion of physics-based information significantly enhances the performance in terms of the mean relative error of the best architectures found by 27.08%. These advantages are particularly evident when the learned models are integrated into an MPC framework, where PINN-based models consistently outperform their purely data-driven counterparts in terms of control accuracy and robustness, yielding improvements of up to 42.86% in performance stability error and increased robustness-to-noise.

[1142] Bridging Graph and State-Space Modeling for Intensive Care Unit Length of Stay Prediction

Shuqi Zi, Haitz Sáez de Ocáriz Borde, Emma Rocheteau, Pietro Lio’

Main category: cs.LG

TL;DR: S²G-Net is a neural architecture combining state-space models with multi-view GNNs for ICU length of stay prediction, outperforming existing methods on MIMIC-IV dataset.

Details

Motivation: ICU length of stay prediction is crucial for hospital resource management but challenging due to heterogeneous and irregularly sampled EHR data.

Method: Proposes S²G-Net with temporal path using Mamba state-space models for patient trajectories and graph path using optimized GraphGPS backbone with heterogeneous patient similarity graphs from diagnostic, administrative, and semantic features.

Result: Outperforms sequence models (BiLSTM, Mamba, Transformer), graph models (classic GNNs, GraphGPS), and hybrid approaches across all primary metrics on MIMIC-IV dataset.

Conclusion: S²G-Net provides an effective and scalable solution for ICU LOS prediction with multi-modal clinical data, with ablation studies confirming complementary contributions of each component.

Abstract: Predicting a patient’s length of stay (LOS) in the intensive care unit (ICU) is a critical task for hospital resource management, yet remains challenging due to the heterogeneous and irregularly sampled nature of electronic health records (EHRs). In this work, we propose S$^2$G-Net, a novel neural architecture that unifies state-space sequence modeling with multi-view Graph Neural Networks (GNNs) for ICU LOS prediction. The temporal path employs Mamba state-space models (SSMs) to capture patient trajectories, while the graph path leverages an optimized GraphGPS backbone, designed to integrate heterogeneous patient similarity graphs derived from diagnostic, administrative, and semantic features. Experiments on the large-scale MIMIC-IV cohort dataset show that S$^2$G-Net consistently outperforms sequence models (BiLSTM, Mamba, Transformer), graph models (classic GNNs, GraphGPS), and hybrid approaches across all primary metrics. Extensive ablation studies and interpretability analyses highlight the complementary contributions of each component of our architecture and underscore the importance of principled graph construction. These results demonstrate that S$^2$G-Net provides an effective and scalable solution for ICU LOS prediction with multi-modal clinical data. The code can be found at https://github.com/ShuqiZi1/S2G-Net.

[1143] On Surjectivity of Neural Networks: Can you elicit any behavior from your model?

Haozhe Jiang, Nika Haghtalab

Main category: cs.LG

TL;DR: The paper proves that many modern neural architectures like GPT-style transformers and diffusion models are almost always surjective, meaning any output can be generated by some input, revealing inherent vulnerabilities to adversarial attacks.

Details

Motivation: To understand whether neural networks can generate any specified output, which has implications for model safety and jailbreak vulnerabilities in generative models.

Method: Mathematical proofs showing that fundamental building blocks of modern neural architectures (pre-layer normalization, linear-attention modules) are almost always surjective.

Result: Proved that widely used generative frameworks including GPT-style transformers and diffusion models with deterministic ODE solvers admit inverse mappings for arbitrary outputs.

Conclusion: Modern neural architectures have unavoidable vulnerability to adversarial attacks due to their surjective nature, raising significant safety concerns for generative models.

Abstract: Given a trained neural network, can any specified output be generated by some input? Equivalently, does the network correspond to a function that is surjective? In generative models, surjectivity implies that any output, including harmful or undesirable content, can in principle be generated by the networks, raising concerns about model safety and jailbreak vulnerabilities. In this paper, we prove that many fundamental building blocks of modern neural architectures, such as networks with pre-layer normalization and linear-attention modules, are almost always surjective. As corollaries, widely used generative frameworks, including GPT-style transformers and diffusion models with deterministic ODE solvers, admit inverse mappings for arbitrary outputs. By studying surjectivity of these modern and commonly used neural architectures, we contribute a formalism that sheds light on their unavoidable vulnerability to a broad class of adversarial attacks.

[1144] VendiRL: A Framework for Self-Supervised Reinforcement Learning of Diversely Diverse Skills

Erik M. Lintunen

Main category: cs.LG

TL;DR: The paper introduces VendiRL, a framework for learning diverse skills in self-supervised RL using the Vendi Score from ecology to measure and optimize for various forms of diversity without predefined notions.

Details

Motivation: Current self-supervised RL faces scalability issues in high-dimensional feature spaces and lacks consistent evaluation of skill diversity, making results hard to compare across methods.

Method: Proposes VendiRL framework that uses the Vendi Score metric to measure skill diversity based on user-specified similarity functions, enabling optimization for different forms of diversity.

Result: VendiRL facilitates evaluation of skill diversity and enables learning of diversely diverse skill sets that can support pretraining in interactive environments.

Conclusion: The Vendi Score provides a flexible way to define and evaluate skill diversity, and VendiRL offers a unified approach to learn diverse skills that can adapt to various downstream task requirements.

Abstract: In self-supervised reinforcement learning (RL), one of the key challenges is learning a diverse set of skills to prepare agents for unknown future tasks. Despite impressive advances, scalability and evaluation remain prevalent issues. Regarding scalability, the search for meaningful skills can be obscured by high-dimensional feature spaces, where relevant features may vary across downstream task domains. For evaluating skill diversity, defining what constitutes “diversity” typically requires a hard commitment to a specific notion of what it means for skills to be diverse, potentially leading to inconsistencies in how skill diversity is understood, making results across different approaches hard to compare, and leaving many forms of diversity unexplored. To address these issues, we adopt a measure of sample diversity that translates ideas from ecology to machine learning – the Vendi Score – allowing the user to specify and evaluate any desired form of diversity. We demonstrate how this metric facilitates skill evaluation and introduce VendiRL, a unified framework for learning diversely diverse sets of skills. Given distinct similarity functions, VendiRL motivates distinct forms of diversity, which could support skill-diversity pretraining in new and richly interactive environments where optimising for various forms of diversity may be desirable.

[1145] A Graph Laplacian Eigenvector-based Pre-training Method for Graph Neural Networks

Howard Dai, Nyambura Njenga, Hiren Madhu, Siddharth Viswanath, Ryan Pellico, Ian Adelstein, Smita Krishnaswamy

Main category: cs.LG

TL;DR: Proposes LELM, a novel graph pre-training module that predicts Laplacian eigenvectors to capture global structure while overcoming oversmoothing in deep GNNs.

Details

Motivation: Structure-based pre-training is under-explored but crucial for graph foundation models. Traditional GNNs struggle with capturing global structure due to oversmoothing in deep networks.

Method: LELM pre-trains GNNs by predicting low-frequency eigenvectors of graph Laplacian. Uses novel architecture that prevents oversmoothing and enables learning long-range dependencies.

Result: Models pre-trained with LELM outperform baseline models on downstream molecular property prediction tasks.

Conclusion: LELM effectively captures graph structure through Laplacian eigenvector prediction and addresses oversmoothing, improving performance on molecular property prediction.

Abstract: The development of self-supervised graph pre-training methods is a crucial ingredient in recent efforts to design robust graph foundation models (GFMs). Structure-based pre-training methods are under-explored yet crucial for downstream applications which rely on underlying graph structure. In addition, pre-training traditional message passing GNNs to capture global and regional structure is often challenging due to the risk of oversmoothing as network depth increases. We address these gaps by proposing the Laplacian Eigenvector Learning Module (LELM), a novel pre-training module for graph neural networks (GNNs) based on predicting the low-frequency eigenvectors of the graph Laplacian. Moreover, LELM introduces a novel architecture that overcomes oversmoothing, allowing the GNN model to learn long-range interdependencies. Empirically, we show that models pre-trained via our framework outperform baseline models on downstream molecular property prediction tasks.

[1146] Long-Range Graph Wavelet Networks

Filippo Guerranti, Fabrizio Forte, Simon Geisler, Stephan Günnemann

Main category: cs.LG

TL;DR: LR-GWN is a novel graph neural network that decomposes wavelet filters into local and global components to better capture long-range interactions in graphs, overcoming limitations of existing polynomial-based wavelet methods.

Details

Motivation: Existing wavelet-based graph neural networks rely on finite-order polynomial approximations that limit receptive fields and hinder long-range information propagation across distant parts of graphs.

Method: Decompose wavelet filters into complementary local and global components: local aggregation uses efficient low-order polynomials, while long-range interactions are captured through flexible spectral-domain parameterization.

Result: LR-GWN achieves state-of-the-art performance among wavelet-based methods on long-range benchmarks while remaining competitive on short-range datasets.

Conclusion: The hybrid design successfully unifies short- and long-distance information flow within a principled wavelet framework, enabling effective modeling of both local and global graph structures.

Abstract: Modeling long-range interactions, the propagation of information across distant parts of a graph, is a central challenge in graph machine learning. Graph wavelets, inspired by multi-resolution signal processing, provide a principled way to capture both local and global structures. However, existing wavelet-based graph neural networks rely on finite-order polynomial approximations, which limit their receptive fields and hinder long-range propagation. We propose Long-Range Graph Wavelet Networks (LR-GWN), which decompose wavelet filters into complementary local and global components. Local aggregation is handled with efficient low-order polynomials, while long-range interactions are captured through a flexible spectral-domain parameterization. This hybrid design unifies short- and long-distance information flow within a principled wavelet framework. Experiments show that LR-GWN achieves state-of-the-art performance among wavelet-based methods on long-range benchmarks, while remaining competitive on short-range datasets.

[1147] VL Norm: Rethink Loss Aggregation in RLVR

Zhiyuan He, Xufang Luo, Yike Zhang, Yuqing Yang, Lili Qiu

Main category: cs.LG

TL;DR: VL Norm is a variance-reduced length-dependent normalization method that addresses gradient variance issues in RLVR by providing unbiased policy loss estimates with minimal variance.

Details

Motivation: RLVR shows strong potential for improving LLM reasoning but suffers from high gradient variance due to variable response lengths during training. Existing methods like GRPO, DAPO, and Dr. GRPO produce biased estimates or still have high variance.

Method: VL Norm reformulates the problem as finding a minimum-variance unbiased estimator by analyzing length effects on policy loss. It’s implemented with less than 10 lines of code change and provides unbiased policy loss estimates while minimizing gradient variance.

Result: Extensive experiments show consistent superior results across different model sizes, maximum lengths, and tasks. When integrated with DAPO, it achieves up to 2.67x faster convergence on the CountDown task.

Conclusion: VL Norm is a simple yet effective loss aggregation method that successfully addresses gradient variance issues in RLVR, providing unbiased estimates and faster convergence while being easy to implement.

Abstract: We propose VL Norm (Variance-reduced Length-dependent Normalization), a simple yet effective loss aggregation method tailored to the characteristic of dynamic generation lengths in Reinforcement Learning with Verifiable Rewards (RLVR). Recently, RLVR has demonstrated strong potential in improving the reasoning capabilities of large language models (LLMs), but a major challenge lies in the large variability of response lengths during training, which leads to high gradient variance and unstable optimization. Although previous methods such as GRPO, DAPO, and Dr. GRPO introduce different loss normalization terms to address this issue, they either produce biased estimates or still suffer from high gradient variance. By analyzing the effect of varying lengths on policy loss both theoretically and empirically, we reformulate the problem as finding a minimum-variance unbiased estimator. Our proposed VL Norm not only provides an unbiased estimate of the true policy loss but also minimizes gradient variance in theory. Besides, VL Norm is easy to implement with less than 10 lines of code change. Extensive experiments show that it consistently achieves superior results across different model sizes, maximum lengths, and tasks. When integrated into the state-of-the-art RL algorithm DAPO, it achieves up to 2.67x faster convergence on the CountDown task. Our code is public at https://github.com/zerolllin/Delta-L-Normalization.

[1148] Machine Learning with Multitype Protected Attributes: Intersectional Fairness through Regularisation

Ho Ming Lee, Katrien Antonio, Benjamin Avanzi, Lorenzo Marchi, Rui Zhou

Main category: cs.LG

TL;DR: Proposes distance covariance regularization for fairness in regression and classification tasks, addressing multiple protected attributes and intersectional subgroups.

Details

Motivation: Existing fairness methods focus on binary classification and fail to handle continuous attributes or multiple protected attributes simultaneously, ignoring fairness gerrymandering in intersectional subgroups.

Method: Distance covariance regularization framework that mitigates association between predictions and protected attributes, extended with joint distance covariance (JdCov) and novel concatenated distance covariance (CCdCov) for multiple attributes.

Result: Applied to COMPAS recidivism and motor insurance datasets, the framework effectively addresses fairness gerrymandering and captures both linear and nonlinear dependencies in protected attributes.

Conclusion: The proposed distance covariance regularization provides an effective approach for achieving demographic parity fairness in regression and classification tasks with multiple protected attributes of various types.

Abstract: Ensuring equitable treatment (fairness) across protected attributes (such as gender or ethnicity) is a critical issue in machine learning. Most existing literature focuses on binary classification, but achieving fairness in regression tasks-such as insurance pricing or hiring score assessments-is equally important. Moreover, anti-discrimination laws also apply to continuous attributes, such as age, for which many existing methods are not applicable. In practice, multiple protected attributes can exist simultaneously; however, methods targeting fairness across several attributes often overlook so-called “fairness gerrymandering”, thereby ignoring disparities among intersectional subgroups (e.g., African-American women or Hispanic men). In this paper, we propose a distance covariance regularisation framework that mitigates the association between model predictions and protected attributes, in line with the fairness definition of demographic parity, and that captures both linear and nonlinear dependencies. To enhance applicability in the presence of multiple protected attributes, we extend our framework by incorporating two multivariate dependence measures based on distance covariance: the previously proposed joint distance covariance (JdCov) and our novel concatenated distance covariance (CCdCov), which effectively address fairness gerrymandering in both regression and classification tasks involving protected attributes of various types. We discuss and illustrate how to calibrate regularisation strength, including a method based on Jensen-Shannon divergence, which quantifies dissimilarities in prediction distributions across groups. We apply our framework to the COMPAS recidivism dataset and a large motor insurance claims dataset.

[1149] Prompt Optimization Meets Subspace Representation Learning for Few-shot Out-of-Distribution Detection

Faizul Rakib Sayem, Shahana Ibrahim

Main category: cs.LG

TL;DR: A novel OOD detection framework that combines context optimization with subspace representation learning to improve ID-OOD separability by projecting features into prompt-based subspaces.

Details

Motivation: Existing few-shot OOD detection methods using VLMs rely only on softmax probabilities, ignoring the rich discriminative information in feature embeddings learned from millions of training samples.

Method: Proposes a CoOp-based framework integrating subspace representation learning with prompt tuning, projecting ID features into a subspace spanned by prompt vectors and ID-irrelevant features into an orthogonal null space.

Result: Experiments on real-world datasets demonstrate the effectiveness of the approach in improving OOD detection performance.

Conclusion: The proposed framework successfully addresses the limitations of existing methods by leveraging feature embeddings and subspace learning for better ID-OOD separability.

Abstract: The reliability of artificial intelligence (AI) systems in open-world settings depends heavily on their ability to flag out-of-distribution (OOD) inputs unseen during training. Recent advances in large-scale vision-language models (VLMs) have enabled promising few-shot OOD detection frameworks using only a handful of in-distribution (ID) samples. However, existing prompt learning-based OOD methods rely solely on softmax probabilities, overlooking the rich discriminative potential of the feature embeddings learned by VLMs trained on millions of samples. To address this limitation, we propose a novel context optimization (CoOp)-based framework that integrates subspace representation learning with prompt tuning. Our approach improves ID-OOD separability by projecting the ID features into a subspace spanned by prompt vectors, while projecting ID-irrelevant features into an orthogonal null space. To train such OOD detection framework, we design an easy-to-handle end-to-end learning criterion that ensures strong OOD detection performance as well as high ID classification accuracy. Experiments on real-world datasets showcase the effectiveness of our approach.

[1150] “A 6 or a 9?”: Ensemble Learning Through the Multiplicity of Performant Models and Explanations

Gianlucca Zuin, Adriano Veloso

Main category: cs.LG

TL;DR: The Rashomon Ensemble method selects diverse high-performing models from the Rashomon set to improve generalization by maximizing diversity while maintaining accuracy.

Details

Motivation: Model selection for good generalization is challenging, and the Rashomon Effect shows multiple models can perform similarly well on the same problem, especially in real-world scenarios with diverse data patterns.

Method: Group models based on both performance and explanations, then strategically select diverse models from the Rashomon set to construct ensembles that cover distinct regions of the solution space.

Result: Validated on real-world datasets showing up to 0.20+ AUROC improvements when Rashomon ratio is large, with demonstrated business benefits in various applications.

Conclusion: The Rashomon Ensemble approach is robust, practical, and effective for improving generalization by leveraging diverse high-performing solutions.

Abstract: Creating models from past observations and ensuring their effectiveness on new data is the essence of machine learning. However, selecting models that generalize well remains a challenging task. Related to this topic, the Rashomon Effect refers to cases where multiple models perform similarly well for a given learning problem. This often occurs in real-world scenarios, like the manufacturing process or medical diagnosis, where diverse patterns in data lead to multiple high-performing solutions. We propose the Rashomon Ensemble, a method that strategically selects models from these diverse high-performing solutions to improve generalization. By grouping models based on both their performance and explanations, we construct ensembles that maximize diversity while maintaining predictive accuracy. This selection ensures that each model covers a distinct region of the solution space, making the ensemble more robust to distribution shifts and variations in unseen data. We validate our approach on both open and proprietary collaborative real-world datasets, demonstrating up to 0.20+ AUROC improvements in scenarios where the Rashomon ratio is large. Additionally, we demonstrate tangible benefits for businesses in various real-world applications, highlighting the robustness, practicality, and effectiveness of our approach.

[1151] Tree Search for LLM Agent Reinforcement Learning

Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu

Main category: cs.LG

TL;DR: Tree-GRPO is a tree-based reinforcement learning method that addresses sparse supervision in long-term agent tasks by using tree search to generate step-wise process supervision from outcome rewards and optimize grouped relative advantages.

Details

Motivation: Existing RL approaches for LLM agents driven solely by outcome rewards suffer from sparse supervision problems in long-term, multi-turn agent tasks.

Method: Proposes Tree-GRPO, a grouped agent RL method based on tree search where each node represents an agent interaction step. Uses tree-structured trajectories to construct step-wise process supervision from outcome rewards and estimates grouped relative advantages at intra-tree and inter-tree levels.

Result: Experiments across 11 datasets and 3 types of QA tasks demonstrate superiority over chain-based RL methods.

Conclusion: Tree-based RL with grouped relative policy optimization effectively addresses sparse supervision in long-term agent tasks and provides better performance than chain-based approaches.

Abstract: Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents the complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.

[1152] Fused Lasso Improves Accuracy of Co-occurrence Network Inference in Grouped Samples

Daniel Agyapong, Briana H. Beatty, Peter G. Kennedy, Toby D. Hocking

Main category: cs.LG

TL;DR: The paper introduces fuser, a novel algorithm for microbiome co-occurrence network inference that addresses limitations of existing methods by considering spatial and temporal dynamics across different environmental niches, rather than treating all samples as homogeneous.

Details

Motivation: Existing co-occurrence network inference algorithms typically analyze microbial associations within single environmental niches, capturing only static snapshots and failing to account for how microbial communities adapt their associations across varying ecological conditions.

Method: Proposed fuser algorithm that retains subsample-specific signals while sharing relevant information across environments during training, generating distinct environment-specific predictive networks. Evaluated using Same-All Cross-validation (SAC) framework comparing training/testing within same niche (Same) vs. across multiple niches (All).

Result: fuser achieves comparable performance to existing algorithms like glmnet in homogeneous environments (Same), and significantly reduces test error compared to baseline algorithms in cross-environment (All) scenarios.

Conclusion: The fuser algorithm effectively addresses the limitations of conventional methods by incorporating spatial and temporal dynamics, enabling more accurate microbial association predictions across diverse environmental conditions.

Abstract: Co-occurrence network inference algorithms have significantly advanced our understanding of microbiome communities. However, these algorithms typically analyze microbial associations within samples collected from a single environmental niche, often capturing only static snapshots rather than dynamic microbial processes. Previous studies have commonly grouped samples from different environmental niches together without fully considering how microbial communities adapt their associations when faced with varying ecological conditions. Our study addresses this limitation by explicitly investigating both spatial and temporal dynamics of microbial communities. We analyzed publicly available microbiome abundance data across multiple locations and time points, to evaluate algorithm performance in predicting microbial associations using our proposed Same-All Cross-validation (SAC) framework. SAC evaluates algorithms in two distinct scenarios: training and testing within the same environmental niche (Same), and training and testing on combined data from multiple environmental niches (All). To overcome the limitations of conventional algorithms, we propose fuser, an algorithm that, while not entirely new in machine learning, is novel for microbiome community network inference. It retains subsample-specific signals while simultaneously sharing relevant information across environments during training. Unlike standard approaches that infer a single generalized network from combined data, fuser generates distinct, environment-specific predictive networks. Our results demonstrate that fuser achieves comparable predictive performance to existing algorithms such as glmnet when evaluated within homogeneous environments (Same), and notably reduces test error compared to baseline algorithms in cross-environment (All) scenarios.

[1153] CrunchLLM: Multitask LLMs for Structured Business Reasoning and Outcome Prediction

Rabeya Tus Sadia, Qiang Cheng

Main category: cs.LG

TL;DR: CrunchLLM is a domain-adapted LLM framework that integrates structured and unstructured data to predict startup success with over 80% accuracy, outperforming traditional methods and providing interpretable reasoning.

Details

Motivation: Predicting startup success is crucial but challenging due to heterogeneous data types. Traditional ML methods have moderate accuracy and LLMs struggle with domain-specific business data, creating a need for specialized approaches.

Method: CrunchLLM integrates structured company attributes with unstructured text, using parameter-efficient fine-tuning strategies and prompt optimization to specialize foundation models for entrepreneurship data from Crunchbase.

Result: Achieves accuracy exceeding 80% on startup success prediction, significantly outperforming traditional classifiers and baseline LLMs. Provides interpretable reasoning traces for transparency.

Conclusion: Domain-aware fine-tuning and structured-unstructured data fusion can advance predictive modeling of entrepreneurial outcomes, providing a practical tool for venture capital and innovation policy decision making.

Abstract: Predicting the success of start-up companies, defined as achieving an exit through acquisition or IPO, is a critical problem in entrepreneurship and innovation research. Datasets such as Crunchbase provide both structured information (e.g., funding rounds, industries, investor networks) and unstructured text (e.g., company descriptions), but effectively leveraging this heterogeneous data for prediction remains challenging. Traditional machine learning approaches often rely only on structured features and achieve moderate accuracy, while large language models (LLMs) offer rich reasoning abilities but struggle to adapt directly to domain-specific business data. We present \textbf{CrunchLLM}, a domain-adapted LLM framework for startup success prediction. CrunchLLM integrates structured company attributes with unstructured textual narratives and applies parameter-efficient fine-tuning strategies alongside prompt optimization to specialize foundation models for entrepreneurship data. Our approach achieves accuracy exceeding 80% on Crunchbase startup success prediction, significantly outperforming traditional classifiers and baseline LLMs. Beyond predictive performance, CrunchLLM provides interpretable reasoning traces that justify its predictions, enhancing transparency and trustworthiness for financial and policy decision makers. This work demonstrates how adapting LLMs with domain-aware fine-tuning and structured–unstructured data fusion can advance predictive modeling of entrepreneurial outcomes. CrunchLLM contributes a methodological framework and a practical tool for data-driven decision making in venture capital and innovation policy.

[1154] Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm

Yang Chen, Menglin Zou, Jiaqi Zhang, Yitan Zhang, Junyi Yang, Gael Gendron, Libo Zhang, Jiamou Liu, Michael J. Witbrock

Main category: cs.LG

TL;DR: TRRO is a new IRL framework that guarantees monotonic improvement in expert behavior likelihood via Minorization-Maximization, addressing training instability in adversarial IRL methods.

Details

Motivation: Modern adversarial IRL methods suffer from unstable training, while recent non-adversarial approaches lack formal guarantees despite improved stability.

Method: Proposed Trust Region Reward Optimization (TRRO) framework and its instantiation PIRO algorithm, using Minorization-Maximization to guarantee monotonic improvement in expert behavior likelihood.

Result: PIRO matches or surpasses state-of-the-art baselines in reward recovery and policy imitation on MuJoCo, Gym-Robotics benchmarks, and real-world animal behavior modeling.

Conclusion: TRRO provides IRL counterpart to TRPO’s stability guarantees in forward RL, offering a theoretically grounded and empirically effective approach to stable IRL.

Abstract: Inverse Reinforcement Learning (IRL) learns a reward function to explain expert demonstrations. Modern IRL methods often use the adversarial (minimax) formulation that alternates between reward and policy optimization, which often lead to unstable training. Recent non-adversarial IRL approaches improve stability by jointly learning reward and policy via energy-based formulations but lack formal guarantees. This work bridges this gap. We first present a unified view showing canonical non-adversarial methods explicitly or implicitly maximize the likelihood of expert behavior, which is equivalent to minimizing the expected return gap. This insight leads to our main contribution: Trust Region Reward Optimization (TRRO), a framework that guarantees monotonic improvement in this likelihood via a Minorization-Maximization process. We instantiate TRRO into Proximal Inverse Reward Optimization (PIRO), a practical and stable IRL algorithm. Theoretically, TRRO provides the IRL counterpart to the stability guarantees of Trust Region Policy Optimization (TRPO) in forward RL. Empirically, PIRO matches or surpasses state-of-the-art baselines in reward recovery, policy imitation with high sample efficiency on MuJoCo and Gym-Robotics benchmarks and a real-world animal behavior modeling task.

[1155] FedIA: A Plug-and-Play Importance-Aware Gradient Pruning Aggregation Method for Domain-Robust Federated Graph Learning on Node Classification

Zhanting Zhou, KaHou Tam, Zeqin Wu, Pengzhao Sun, Jinbo Wang, Fengli Zhang

Main category: cs.LG

TL;DR: FedIA addresses federated graph learning under domain skew by using a projection-first strategy to denoise client updates before aggregation, achieving stable convergence and higher accuracy without extra communication costs.

Details

Motivation: Federated Graph Learning under domain skew leads to incompatible client representations and ineffective aggregation due to noisy gradient signals dominated by domain-specific variance.

Method: FedIA uses a two-stage pipeline: (1) server-side top-ρ mask to keep only the most informative 5% of gradient coordinates, and (2) influence-regularised momentum weight to suppress outlier clients.

Result: FedIA achieves smoother, more stable convergence and higher final accuracy than nine strong baselines on both homogeneous (Twitch Gamers) and heterogeneous (Wikipedia) graphs.

Conclusion: FedIA’s projection-first approach with dynamic projection maintains optimal convergence rate while being communication-efficient and readily deployable.

Abstract: Federated Graph Learning (FGL) under domain skew – as observed on platforms such as \emph{Twitch Gamers} and multilingual \emph{Wikipedia} networks – drives client models toward incompatible representations, rendering naive aggregation both unstable and ineffective. We find that the culprit is not the weighting scheme but the \emph{noisy gradient signal}: empirical analysis of baseline methods suggests that a vast majority of gradient dimensions can be dominated by domain-specific variance. We therefore shift focus from “aggregation-first” to a \emph{projection-first} strategy that denoises client updates \emph{before} they are combined. The proposed FedIA framework realises this \underline{I}mportance-\underline{A}ware idea through a two-stage, plug-and-play pipeline: (i) a server-side top-$\rho$ mask keeps only the most informative about 5% of coordinates, and (ii) a lightweight influence-regularised momentum weight suppresses outlier clients. FedIA adds \emph{no extra uplink traffic and only negligible server memory}, making it readily deployable. On both homogeneous (Twitch Gamers) and heterogeneous (Wikipedia) graphs, it yields smoother, more stable convergence and higher final accuracy than nine strong baselines. A convergence sketch further shows that dynamic projection maintains the optimal $\mathcal{O}(\sigma^{2}/\sqrt{T})$ rate.

[1156] Graph Your Own Prompt

Xi Ding, Lei Wang, Piotr Koniusz, Yongsheng Gao

Main category: cs.LG

TL;DR: Graph Consistency Regularization (GCR) is a framework that uses model predictions to create relational graphs and align them with feature similarity graphs, promoting semantically meaningful representations without architectural changes.

Details

Motivation: Deep networks learn noisy inter-class similarities that contradict predicted semantics, so GCR aims to enforce class-consistent feature relationships throughout the network.

Method: Uses parameter-free Graph Consistency Layers (GCLs) that build feature similarity graphs and align them with class-aware masked prediction graphs, with adaptive layer weighting based on graph discrepancy.

Result: GCR promotes cleaner feature structure, stronger intra-class cohesion, and improved generalization across various networks and datasets.

Conclusion: GCR offers a lightweight, model-agnostic approach to enhance semantic structure by learning from prediction structure through multi-layer graph alignment.

Abstract: We propose Graph Consistency Regularization (GCR), a novel framework that injects relational graph structures, derived from model predictions, into the learning process to promote class-aware, semantically meaningful feature representations. Functioning as a form of self-prompting, GCR enables the model to refine its internal structure using its own outputs. While deep networks learn rich representations, these often capture noisy inter-class similarities that contradict the model’s predicted semantics. GCR addresses this issue by introducing parameter-free Graph Consistency Layers (GCLs) at arbitrary depths. Each GCL builds a batch-level feature similarity graph and aligns it with a global, class-aware masked prediction graph, derived by modulating softmax prediction similarities with intra-class indicators. This alignment enforces that feature-level relationships reflect class-consistent prediction behavior, acting as a semantic regularizer throughout the network. Unlike prior work, GCR introduces a multi-layer, cross-space graph alignment mechanism with adaptive weighting, where layer importance is learned from graph discrepancy magnitudes. This allows the model to prioritize semantically reliable layers and suppress noisy ones, enhancing feature quality without modifying the architecture or training procedure. GCR is model-agnostic, lightweight, and improves semantic structure across various networks and datasets. Experiments show that GCR promotes cleaner feature structure, stronger intra-class cohesion, and improved generalization, offering a new perspective on learning from prediction structure. Project website Code

[1157] Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models

Jonas Hübotter, Patrik Wolf, Alexander Shevchenko, Dennis Jüni, Andreas Krause, Gil Kur

Main category: cs.LG

TL;DR: Test-time training (TTT) improves performance by allowing foundation models to specialize on test tasks, addressing global underparameterization through concept-focused capacity allocation.

Details

Motivation: To understand why TTT works effectively even with in-distribution test data, challenging previous explanations focused on out-of-distribution adaptation or privileged data.

Method: Proposed a theoretical model under linear representation hypothesis, trained sparse autoencoder on ImageNet to validate assumptions, and conducted scaling studies across image and language tasks.

Result: TTT achieves substantially smaller in-distribution test error than global training, with empirical validation showing semantically related data points share few concepts.

Conclusion: TTT enables specialization after generalization, focusing model capacity on test-relevant concepts, with identified regimes where specialization is most effective.

Abstract: Recent empirical studies have explored the idea of continuing to train a model at test-time for a given task, known as test-time training (TTT), and have found it to yield significant performance improvements. However, there is limited understanding of why and when TTT is effective. Earlier explanations mostly focused on the observation that TTT may help when applied to out-of-distribution adaptation or used with privileged data. However, the growing scale of foundation models with most test data being in-distribution questions these explanations. We instead posit that foundation models remain globally underparameterized, with TTT providing a mechanism for specialization after generalization, focusing capacity on concepts relevant to the test task. Specifically, under the linear representation hypothesis, we propose a model in which TTT achieves a substantially smaller in-distribution test error than global training. We empirically validate our model’s key assumptions by training a sparse autoencoder on ImageNet, showing that semantically related data points are explained by only a few shared concepts. Finally, we perform scaling studies across image and language tasks that confirm the practical implications of our model, identifying the regimes where specialization is most effective.

[1158] BALF: Budgeted Activation-Aware Low-Rank Factorization for Fine-Tuning-Free Model Compression

David González-Martínez

Main category: cs.LG

TL;DR: BALF is a fine-tuning-free neural network compression framework that uses activation-aware factorization and a scalable budgeted rank allocator to achieve efficient compression across various models and scales.

Details

Motivation: Traditional neural network compression methods require expensive fine-tuning or search procedures, making them impractical on commodity hardware. The goal is to develop a compression approach that works without fine-tuning.

Method: The method combines an activation-aware factorization framework applicable to various layers with a scalable budgeted rank allocator that enables flexible control over compression targets without overhead.

Result: BALF achieves excellent compression results without fine-tuning, reducing FLOPs on ResNeXt-101 by 45% with only a 1-percentage-point top-1 accuracy drop. It demonstrates effectiveness across multiple scales and architectures including ResNet-20, ResNeXt-101, and vision transformers.

Conclusion: BALF provides an efficient pipeline for compressing models without fine-tuning, achieving significant compression with minimal accuracy loss across diverse neural network architectures.

Abstract: Neural network compression techniques typically require expensive fine-tuning or search procedures, rendering them impractical on commodity hardware. Inspired by recent LLM compression research, we present a general activation-aware factorization framework that can be applied to a broad range of layers. Moreover, we introduce a scalable budgeted rank allocator that allows flexible control over compression targets (e.g., retaining 50% of parameters) with no overhead. Together, these components form BALF, an efficient pipeline for compressing models without fine-tuning. We demonstrate its effectiveness across multiple scales and architectures, from ResNet-20 on CIFAR-10 to ResNeXt-101 and vision transformers on ImageNet, and show that it achieves excellent results in the fine-tuning-free regime. For instance, BALF reduces FLOPs on ResNeXt-101 by 45% with only a 1-percentage-point top-1 accuracy drop.

[1159] Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models

Runqian Wang, Yilun Du

Main category: cs.LG

TL;DR: Equilibrium Matching (EqM) is a generative modeling framework that learns the equilibrium gradient of an implicit energy landscape, replacing time-conditional dynamics with optimization-based sampling via gradient descent.

Details

Motivation: To overcome limitations of non-equilibrium, time-conditional dynamics in traditional diffusion and flow-based models by learning from an equilibrium perspective.

Method: Discards time-conditional dynamics and learns equilibrium gradient of implicit energy landscape; uses optimization-based sampling with gradient descent, adjustable step sizes, adaptive optimizers, and adaptive compute at inference.

Result: Achieves state-of-the-art FID of 1.90 on ImageNet 256×256, surpassing diffusion/flow models; theoretically justified for learning and sampling from data manifold.

Conclusion: EqM provides a unified framework bridging flow and energy-based models, enabling optimization-driven inference and handling diverse tasks like denoising, OOD detection, and image composition.

Abstract: We introduce Equilibrium Matching (EqM), a generative modeling framework built from an equilibrium dynamics perspective. EqM discards the non-equilibrium, time-conditional dynamics in traditional diffusion and flow-based generative models and instead learns the equilibrium gradient of an implicit energy landscape. Through this approach, we can adopt an optimization-based sampling process at inference time, where samples are obtained by gradient descent on the learned landscape with adjustable step sizes, adaptive optimizers, and adaptive compute. EqM surpasses the generation performance of diffusion/flow models empirically, achieving an FID of 1.90 on ImageNet 256$\times$256. EqM is also theoretically justified to learn and sample from the data manifold. Beyond generation, EqM is a flexible framework that naturally handles tasks including partially noised image denoising, OOD detection, and image composition. By replacing time-conditional velocities with a unified equilibrium landscape, EqM offers a tighter bridge between flow and energy-based models and a simple route to optimization-driven inference.

[1160] How Effective Are Time-Series Models for Rainfall Nowcasting? A Comprehensive Benchmark for Rainfall Nowcasting Incorporating PWV Data

Yifang Zhang, Pengfei Duan, Henan Wang, Wenjie Yin, Chen Zhou, Shengwu Xiong

Main category: cs.LG

TL;DR: RainfallBench is a new benchmark for rainfall nowcasting (0-3 hour prediction) that addresses limitations of existing meteorological benchmarks by focusing on complex rainfall characteristics like zero inflation, temporal decay, and non-stationarity, and includes precipitable water vapor data.

Details

Motivation: Existing time series forecasting benchmarks in meteorology focus on periodic variables like temperature and humidity, failing to capture the complexity of rainfall nowcasting which is critical for disaster mitigation and real-time response planning.

Method: Created RainfallBench dataset from 5 years of meteorological observations at 15-minute intervals across 6 variables from 12,000+ GNSS stations, incorporating precipitable water vapor. Also developed Bi-Focus Precipitation Forecaster (BFPF) - a plug-and-play module with domain-specific priors to handle zero-inflation and temporal decay.

Result: Evaluated over 20 state-of-the-art models across 6 major architectures on RainfallBench. Statistical analysis and ablation studies validated dataset comprehensiveness and methodology superiority.

Conclusion: RainfallBench provides a comprehensive benchmark for rainfall nowcasting that better reflects real-world meteorological challenges, and the proposed BFPF module effectively addresses key rainfall forecasting issues overlooked by existing models.

Abstract: Rainfall nowcasting, which aims to predict precipitation within the next 0 to 3 hours, is critical for disaster mitigation and real-time response planning. However, most time series forecasting benchmarks in meteorology are evaluated on variables with strong periodicity, such as temperature and humidity, which fail to reflect model capabilities in more complex and practically meteorology scenarios like rainfall nowcasting. To address this gap, we propose RainfallBench, a benchmark designed for rainfall nowcasting, a highly challenging and practically relevant task characterized by zero inflation, temporal decay, and non-stationarity, focused on predicting precipitation within the next 0 to 3 hours. The dataset is derived from five years of meteorological observations, recorded at 15-minute intervals across six essential variables, and collected from more than 12,000 GNSS stations globally. In particular, it incorporates precipitable water vapor (PWV), a crucial indicator of rainfall that is absent in other datasets. We further design specialized evaluation strategies to assess model performance on key meteorological challenges, such as multi-scale prediction and extreme rainfall events, and evaluate over 20 state-of-the-art models across six major architectures on RainfallBench. Additionally, to address the zero-inflation and temporal decay issues overlooked by existing models, we introduce Bi-Focus Precipitation Forecaster (BFPF), a plug-and-play module that incorporates domain-specific priors to enhance rainfall time series forecasting. Statistical analysis and ablation studies validate the comprehensiveness of our dataset as well as the superiority of our methodology. Code and datasets are available at https://anonymous.4open.science/r/RainfallBench-A710.

[1161] Vicinity-Guided Discriminative Latent Diffusion for Privacy-Preserving Domain Adaptation

Jing Wang, Wonho Bae, Jiahong Chen, Wenxu Wang, Junhyug Noh

Main category: cs.LG

TL;DR: DVD is a novel latent diffusion model framework for source-free domain adaptation that uses diffusion networks to transfer decision boundaries without accessing source data, achieving state-of-the-art performance.

Details

Motivation: To address the unexplored potential of latent diffusion models for discriminative transfer and solve the practical challenge of source-free domain adaptation where source data cannot be shared due to privacy concerns.

Method: Encodes source feature label information into latent vicinity using Gaussian priors over k-nearest neighbors, trains diffusion network to drift noisy samples back to label-consistent representations, then aligns target encoder to generated source-like cues using InfoNCE loss during adaptation.

Result: Outperforms state-of-the-art methods across standard SFDA benchmarks and enhances source classifier accuracy on in-domain data, also boosting performance in supervised classification and domain generalization.

Conclusion: DVD reinterprets latent diffusion models as practical, privacy-preserving bridges for explicit knowledge transfer, solving a core challenge in source-free domain adaptation that prior methods couldn’t address.

Abstract: Recent work on latent diffusion models (LDMs) has focused almost exclusively on generative tasks, leaving their potential for discriminative transfer largely unexplored. We introduce Discriminative Vicinity Diffusion (DVD), a novel LDM-based framework for a more practical variant of source-free domain adaptation (SFDA): the source provider may share not only a pre-trained classifier but also an auxiliary latent diffusion module, trained once on the source data and never exposing raw source samples. DVD encodes each source feature’s label information into its latent vicinity by fitting a Gaussian prior over its k-nearest neighbors and training the diffusion network to drift noisy samples back to label-consistent representations. During adaptation, we sample from each target feature’s latent vicinity, apply the frozen diffusion module to generate source-like cues, and use a simple InfoNCE loss to align the target encoder to these cues, explicitly transferring decision boundaries without source access. Across standard SFDA benchmarks, DVD outperforms state-of-the-art methods. We further show that the same latent diffusion module enhances the source classifier’s accuracy on in-domain data and boosts performance in supervised classification and domain generalization experiments. DVD thus reinterprets LDMs as practical, privacy-preserving bridges for explicit knowledge transfer, addressing a core challenge in source-free domain adaptation that prior methods have yet to solve.

[1162] Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning

Yicheng Lang, Yihua Zhang, Chongyu Fan, Changsheng Wang, Jinghan Jia, Sijia Liu

Main category: cs.LG

TL;DR: LLM unlearning effects are fragile and can be easily reversed by post-processing. This paper shows that using lower-grade optimizers (zeroth-order or gradient-compressed) improves unlearning robustness by converging to more stable loss basins, and proposes a hybrid optimizer combining first-order and zeroth-order methods.

Details

Motivation: Address the fragility of LLM unlearning effects that can be neutralized by weight quantization or fine-tuning, and investigate how optimizer choice affects unlearning robustness independent of specific unlearning objectives.

Method: Analyze how optimizer ‘grade’ (zeroth-order to second-order) affects unlearning robustness, finding that downgraded optimizers produce more robust unlearning. Propose a hybrid optimizer combining first-order and zeroth-order updates.

Result: Lower-grade optimizers (zeroth-order, gradient-compressed) lead to stronger unlearning robustness by converging to harder-to-disturb loss landscape basins. The proposed hybrid optimizer achieves more resilient forgetting without sacrificing unlearning quality on MUSE and WMDP benchmarks.

Conclusion: Optimizer choice significantly impacts LLM unlearning robustness, with downgraded optimizers providing natural advantages. A hybrid optimizer approach can preserve unlearning efficacy while enhancing robustness against post-training manipulations.

Abstract: Large language model (LLM) unlearning aims to surgically remove the influence of undesired data or knowledge from an existing model while preserving its utility on unrelated tasks. This paradigm has shown promise in addressing privacy and safety concerns. However, recent findings reveal that unlearning effects are often fragile: post-unlearning manipulations such as weight quantization or fine-tuning can quickly neutralize the intended forgetting. Prior efforts to improve robustness primarily reformulate unlearning objectives by explicitly assuming the role of vulnerability sources. In this work, we take a different perspective by investigating the role of the optimizer, independent of unlearning objectives and formulations, in shaping unlearning robustness. We show that the ‘grade’ of the optimizer, defined by the level of information it exploits, ranging from zeroth-order (gradient-free) to first-order (gradient-based) to second-order (Hessian-based), is tightly linked to the resilience of unlearning. Surprisingly, we find that downgrading the optimizer, such as using zeroth-order methods or compressed-gradient variants (e.g., gradient sign-based optimizers), often leads to stronger robustness. While these optimizers produce noisier and less precise updates, they encourage convergence to harder-to-disturb basins in the loss landscape, thereby resisting post-training perturbations. By connecting zeroth-order methods with randomized smoothing, we further highlight their natural advantage for robust unlearning. Motivated by these insights, we propose a hybrid optimizer that combines first-order and zeroth-order updates, preserving unlearning efficacy while enhancing robustness. Extensive experiments on the MUSE and WMDP benchmarks, across multiple LLM unlearning algorithms, validate that our approach achieves more resilient forgetting without sacrificing unlearning quality.

[1163] TetriServe: Efficient DiT Serving for Heterogeneous Image Generation

Runyu Lu, Shiqi He, Wenxuan Tan, Shenggui Li, Ruofan Wu, Jeff J. Ma, Ang Chen, Mosharaf Chowdhury

Main category: cs.LG

TL;DR: TetriServe is a DiT serving system that uses step-level sequence parallelism to dynamically adjust parallelism for individual requests based on deadlines, achieving up to 32% higher SLO attainment without degrading image quality.

Details

Motivation: Existing serving systems use fixed degree sequence parallelism, which is inefficient for heterogeneous workloads with mixed resolutions and deadlines, leading to poor GPU utilization and low SLO attainment.

Method: TetriServe introduces step-level sequence parallelism and a round-based scheduling mechanism that: (1) discretizes time into fixed rounds for tractable deadline-aware scheduling, (2) adapts parallelism at step level to minimize GPU hour consumption, and (3) jointly packs requests to minimize late completions.

Result: Extensive evaluation on state-of-the-art DiT models shows TetriServe achieves up to 32% higher SLO attainment compared to existing solutions without degrading image quality.

Conclusion: TetriServe’s step-level sequence parallelism and round-based scheduling effectively address the challenges of serving DiT models under strict SLOs, significantly improving performance for heterogeneous workloads.

Abstract: Diffusion Transformer (DiT) models excel at generating highquality images through iterative denoising steps, but serving them under strict Service Level Objectives (SLOs) is challenging due to their high computational cost, particularly at large resolutions. Existing serving systems use fixed degree sequence parallelism, which is inefficient for heterogeneous workloads with mixed resolutions and deadlines, leading to poor GPU utilization and low SLO attainment. In this paper, we propose step-level sequence parallelism to dynamically adjust the parallel degree of individual requests according to their deadlines. We present TetriServe, a DiT serving system that implements this strategy for highly efficient image generation. Specifically, TetriServe introduces a novel round-based scheduling mechanism that improves SLO attainment: (1) discretizing time into fixed rounds to make deadline-aware scheduling tractable, (2) adapting parallelism at the step level and minimize GPU hour consumption, and (3) jointly packing requests to minimize late completions. Extensive evaluation on state-of-the-art DiT models shows that TetriServe achieves up to 32% higher SLO attainment compared to existing solutions without degrading image quality.

[1164] Fusing Multi- and Hyperspectral Satellite Data for Harmful Algal Bloom Monitoring with Self-Supervised and Hierarchical Deep Learning

Nicholas LaHaye, Kelly M. Luis, Michelle M. Gierach

Main category: cs.LG

TL;DR: SIT-FUSE is a self-supervised framework that detects and maps harmful algal bloom severity and speciation using multi-sensor satellite data without requiring labeled datasets.

Details

Motivation: To advance scalable HAB monitoring in label-scarce environments and enable operational self-supervised learning for global aquatic biogeochemistry.

Method: Fuses reflectance data from VIIRS, MODIS, Sentinel-3, PACE with TROPOMI solar-induced fluorescence, using self-supervised representation learning and hierarchical deep clustering to segment phytoplankton concentrations.

Result: Strong agreement with in-situ measurements of total phytoplankton, Karenia brevis, Alexandrium spp., and Pseudo-nitzschia spp. in Gulf of Mexico and Southern California (2018-2025).

Conclusion: The framework successfully enables exploratory analysis via hierarchical embeddings and represents a critical step toward operationalizing self-supervised learning for global aquatic biogeochemistry.

Abstract: We present a self-supervised machine learning framework for detecting and mapping harmful algal bloom (HAB) severity and speciation using multi-sensor satellite data. By fusing reflectance data from operational instruments (VIIRS, MODIS, Sentinel-3, PACE) with TROPOMI solar-induced fluorescence (SIF), our framework, called SIT-FUSE, generates HAB severity and speciation products without requiring per-instrument labeled datasets. The framework employs self-supervised representation learning, hierarchical deep clustering to segment phytoplankton concentrations and speciations into interpretable classes, validated against in-situ data from the Gulf of Mexico and Southern California (2018-2025). Results show strong agreement with total phytoplankton, Karenia brevis, Alexandrium spp., and Pseudo-nitzschia spp. measurements. This work advances scalable HAB monitoring in label-scarce environments while enabling exploratory analysis via hierarchical embeddings: a critical step toward operationalizing self-supervised learning for global aquatic biogeochemistry.

[1165] Curl Descent: Non-Gradient Learning Dynamics with Sign-Diverse Plasticity

Hugo Ninou, Jonathan Kadmon, N. Alex Cayco-Gajic

Main category: cs.LG

TL;DR: The paper investigates whether biological neural networks use non-gradient “curl” components in learning dynamics, showing these can emerge from inhibitory-excitatory connectivity or Hebbian/anti-Hebbian plasticity and can either preserve stability or create chaotic dynamics depending on their strength.

Details

Motivation: To understand if biological neural networks employ gradient-based learning strategies, given the diversity of synaptic plasticity rules observed experimentally and the possibility that learning includes non-gradient components.

Method: Analysis of feedforward networks using a student-teacher framework, systematically introducing non-gradient dynamics through neurons with rule-flipped plasticity to study the impact of curl terms on learning.

Result: Small curl terms maintain stability similar to gradient descent, while strong curl terms destabilize solutions, potentially causing chaotic dynamics or surprisingly speeding learning by escaping saddles through temporary loss ascent.

Conclusion: Specific neural architectures can support robust learning via diverse non-gradient rules, challenging normative gradient-based theories and highlighting the potential benefits of curl dynamics in biological learning.

Abstract: Gradient-based algorithms are a cornerstone of artificial neural network training, yet it remains unclear whether biological neural networks use similar gradient-based strategies during learning. Experiments often discover a diversity of synaptic plasticity rules, but whether these amount to an approximation to gradient descent is unclear. Here we investigate a previously overlooked possibility: that learning dynamics may include fundamentally non-gradient “curl”-like components while still being able to effectively optimize a loss function. Curl terms naturally emerge in networks with inhibitory-excitatory connectivity or Hebbian/anti-Hebbian plasticity, resulting in learning dynamics that cannot be framed as gradient descent on any objective. To investigate the impact of these curl terms, we analyze feedforward networks within an analytically tractable student-teacher framework, systematically introducing non-gradient dynamics through neurons exhibiting rule-flipped plasticity. Small curl terms preserve the stability of the original solution manifold, resulting in learning dynamics similar to gradient descent. Beyond a critical value, strong curl terms destabilize the solution manifold. Depending on the network architecture, this loss of stability can lead to chaotic learning dynamics that destroy performance. In other cases, the curl terms can counterintuitively speed learning compared to gradient descent by allowing the weight dynamics to escape saddles by temporarily ascending the loss. Our results identify specific architectures capable of supporting robust learning via diverse learning rules, providing an important counterpoint to normative theories of gradient-based learning in neural networks.

[1166] PDE-Transformer: A Continuous Dynamical Systems Approach to Sequence Modeling

Yukun Zhang, Xueqing Zhou

Main category: cs.LG

TL;DR: PDE-Transformer casts Transformer forward pass as numerical discretization of continuous reaction-diffusion system, introducing Adaptive PDE Diffusion Layer for local smoothness with linear complexity.

Details

Motivation: To provide a principled mechanism for long-range dependency modeling by harmonizing continuous PDE smoothing with discrete self-attention in Transformers.

Method: Proposes PDE-Transformer framework where token embeddings evolve under PDE with nonlocal integral term for self-attention, local reaction for feed-forward layers, diffusion for positional smoothing, and stability control for layer normalization. Introduces Adaptive PDE Diffusion Layer as efficient learnable finite-difference stencil.

Result: On Long Range Arena benchmark, placing the PDE layer immediately after embedding yields 4.1 pp average accuracy gain over strong baseline, with adaptive multi-scale variant delivering further improvements.

Conclusion: PDE-Transformer offers principled, lightweight mechanism to bolster long-range dependency modeling by combining continuous PDE smoothing with discrete self-attention through systematic theoretical framework.

Abstract: We propose PDE-Transformer, a novel sequence modeling paradigm that casts the forward pass of a Transformer as the numerical discretization of a continuous reaction-diffusion system derived from a variational energy functional. In our framework, token embeddings evolve under a partial differential equation whose nonlocal integral term models self-attention, local reaction term models feed-forward layers, diffusion term encodes positional smoothing, and a stability control term corresponds to layer normalization. From this unifying perspective, we design an Adaptive PDE Diffusion Layer-an efficient, learnable finite-difference stencil that enforces local smoothness in feature space with linear time complexity and complements self-attention’s global routing. Through a systematic theoretical analysis based on four pillars:stability, diffusion geometry, multi-scale dynamics, and component coupling, we derive principled guidelines for integrating the PDE layer at seven candidate points in the Transformer. Empirically, on the Long Range Arena benchmark, placing the layer immediately after embedding yields a 4.1 pp average accuracy gain over a strong baseline, and an adaptive multi-scale variant delivers further improvements. Our work thus offers a principled, lightweight mechanism to bolster long-range dependency modeling by harmonizing continuous PDE smoothing with discrete self-attention.

Flavio Giorgi, Matteo Silvestri, Cesare Campagnano, Fabrizio Silvestri, Gabriele Tolomei

Main category: cs.LG

TL;DR: A pipeline using Language Models to generate natural language narratives for counterfactual explanations, making AI explanations more accessible to non-experts through knowledge distillation and evaluation methods.

Details

Motivation: Counterfactual explanations are promising for AI explainability but are often too technical for non-experts to understand, creating a barrier to practical adoption.

Method: Proposed a pipeline that leverages Language Models (both large and small) with knowledge distillation and refining mechanisms to generate narrative explanations, plus an evaluation method to verify alignment with factual ground truth.

Result: The pipeline enhances reasoning capabilities and practical performance of student models, enabling small models to perform comparably to larger ones while maintaining robust reasoning.

Conclusion: The approach makes counterfactual explanations more interpretable and suitable for real-world applications by translating technical explanations into accessible natural language narratives.

Abstract: Explainable Artificial Intelligence has become a crucial area of research, aiming to demystify the decision-making processes of deep learning models. Among various explainability techniques, counterfactual explanations have been proven particularly promising, as they offer insights into model behavior by highlighting minimal changes that would alter a prediction. Despite their potential, these explanations are often complex and technical, making them difficult for non-experts to interpret. To address this challenge, we propose a novel pipeline that leverages Language Models, large and small, to compose narratives for counterfactual explanations. We employ knowledge distillation techniques along with a refining mechanism to enable Small Language Models to perform comparably to their larger counterparts while maintaining robust reasoning abilities. In addition, we introduce a simple but effective evaluation method to assess natural language narratives, designed to verify whether the models’ responses are in line with the factual, counterfactual ground truth. As a result, our proposed pipeline enhances both the reasoning capabilities and practical performance of student models, making them more suitable for real-world use cases.

[1168] Early-Warning of Thunderstorm-Driven Power Outages with a Two-Stage Machine Learning Model

Iryna Stanishevska

Main category: cs.LG

TL;DR: A two-stage early-warning model for thunderstorm-driven power outages in Michigan using open data sources, combining logistic gate and LSTM regressor to predict outages 24-48 hours in advance.

Details

Motivation: Thunderstorm outages are difficult to predict due to chaotic convective processes, most storms not causing damage, and noisy/incomplete public data.

Method: Two-stage model with logistic gate and LSTM regressor using kriging-preserved convective signals, causal spatio-temporal features (moisture advection, wind shifts, pressure drops), and event-centric evaluation metrics.

Result: Two-Stage model detects more reference peaks (3/4 vs 2/4 at +/-48h, F1 66.7% vs 57.1%) with modest amplitude gains near peaks (2-3% lower cMASE at +/-0-12h) but comparable overall errors to baseline.

Conclusion: Despite open-data noise, the feature-driven pipeline provides actionable early warnings for thunderstorm outages, with SHAP confirming moisture-advection and wind precursors.

Abstract: Thunderstorm-driven outages are difficult to predict because most storms do not cause damage, convective processes occur rapidly and chaotically, and the available public data are both noisy and incomplete. We develop a 24-48 h early-warning model for summer, thunderstorm-related outages in Michigan using only open sources (EAGLE-I for ground truth; METAR for weather). We use the publicly released EAGLE-I outage dataset (2014-2022), maintained by Oak Ridge National Laboratory for the U.S. Department of Energy. The pipeline preserves convective micro-signals from a sparse station network via parameter-specific kriging with hourly variograms and targeted overdrafting to retain extremes, and builds causal spatio-temporal features (lags/rolling statistics; k-NN/IDW spatial aggregates) capturing precursors of severe convection (moisture advection, wind shifts, and pressure drops). The two-stage model design, combining a logistic gate and an LSTM regressor, limits routine periods and reduces noise exposure. The study uses event-centric metrics (cluster-based hits/misses/false alarms) and peak-conditional MASE (cMASE) in +/-Delta-hour windows around state-level peaks (>= 50,000), with uncertainty quantified by hourly moving-block bootstrap. On the test sample, Two-Stage detects more reference peaks across all windows (e.g., at +/-48 h it records 3/4 vs. 2/4; F1 66.7% vs. 57.1%) with one extra false alarm. Near peaks, it shows modest amplitude gains (2-3% lower cMASE at +/-0-12 h; bootstrap medians +9-13% at +/-6-12 h) but small losses at +/-36-48 h (~3-4%). Overall, errors are comparable to the one-step LSTM baseline. SHAP analysis confirms moisture-advection and wind/gust precursors, underscoring the value of the feature engineering. Despite open-data noise, the feature-driven pipeline yields actionable, event-focused early warnings for thunderstorm outages.

[1169] A Clinical-grade Universal Foundation Model for Intraoperative Pathology

Zihan Zhao, Fengtao Zhou, Ronggang Li, Bing Chu, Xinke Zhang, Xueyi Zheng, Ke Zheng, Xiaobo Wen, Jiabo Ma, Yihui Wang, Jiewei Chen, Chengyou Zheng, Jiangyu Zhang, Yongqin Wen, Jiajia Meng, Ziqi Zeng, Xiaoqing Li, Jing Li, Dan Xie, Yaping Ye, Yu Wang, Hao Chen, Muyan Cai

Main category: cs.LG

TL;DR: CRISP is a clinical-grade foundation model for intraoperative pathology that was trained on 100,000+ frozen sections and validated on 15,000+ slides across 100 diagnostic tasks, demonstrating robust performance in prospective clinical use.

Details

Motivation: Intraoperative pathology is crucial for precision surgery but faces challenges with diagnostic complexity and limited high-quality frozen-section data. Computational pathology has advanced but lacks large-scale prospective validation for routine surgical use.

Method: Developed CRISP foundation model on 100,000+ frozen sections from 8 medical centers. Evaluated on 15,000+ intraoperative slides across nearly 100 diagnostic tasks including benign-malignant discrimination, intraoperative decision-making, and pan-cancer detection.

Result: Model showed robust generalization across institutions, tumor types, and anatomical sites. In prospective cohort of 2,000+ patients: 92.6% cases informed surgical decisions, human-AI collaboration reduced diagnostic workload by 35%, avoided 105 ancillary tests, detected micrometastases with 87.5% accuracy.

Conclusion: CRISP represents a clinical-grade paradigm for AI-driven intraoperative pathology, bridging computational advances with surgical precision and accelerating AI translation into routine clinical practice.

Abstract: Intraoperative pathology is pivotal to precision surgery, yet its clinical impact is constrained by diagnostic complexity and the limited availability of high-quality frozen-section data. While computational pathology has made significant strides, the lack of large-scale, prospective validation has impeded its routine adoption in surgical workflows. Here, we introduce CRISP, a clinical-grade foundation model developed on over 100,000 frozen sections from eight medical centers, specifically designed to provide Clinical-grade Robust Intraoperative Support for Pathology (CRISP). CRISP was comprehensively evaluated on more than 15,000 intraoperative slides across nearly 100 retrospective diagnostic tasks, including benign-malignant discrimination, key intraoperative decision-making, and pan-cancer detection, etc. The model demonstrated robust generalization across diverse institutions, tumor types, and anatomical sites-including previously unseen sites and rare cancers. In a prospective cohort of over 2,000 patients, CRISP sustained high diagnostic accuracy under real-world conditions, directly informing surgical decisions in 92.6% of cases. Human-AI collaboration further reduced diagnostic workload by 35%, avoided 105 ancillary tests and enhanced detection of micrometastases with 87.5% accuracy. Together, these findings position CRISP as a clinical-grade paradigm for AI-driven intraoperative pathology, bridging computational advances with surgical precision and accelerating the translation of artificial intelligence into routine clinical practice.

[1170] Exact Causal Attention with 10% Fewer Operations

Dmitry Rybin, Yushun Zhang, Ding Tian, Zhihang Lin, Zhi-Quan Luo

Main category: cs.LG

TL;DR: Exact Causal Attention (ECA) is a Strassen-style algorithm that reduces operations by 10% for computing exact Causal Attention, using algebraic identities discovered via machine learning.

Details

Motivation: To improve efficiency in causal attention computations by reducing the number of operations required for matrix multiplications involving triangular matrices, which are common in attention mechanisms.

Method: Developed through machine learning and combinatorial search to discover algebraic identities, ECA optimizes matrix multiplications where operands or outputs are upper- or lower-triangular, including masked products like Mask(QK^T).

Result: ECA achieves a 10% reduction in operations for exact Causal Attention computation, though it cannot accelerate fused kernels like FlashAttention due to memory requirements for intermediate expressions.

Conclusion: ECA provides an alternative approach for compute-bound applications and scenarios where FLOPs considerations are important, despite limitations with fused kernels.

Abstract: We present Exact Causal Attention (ECA), a Strassen-style algorithm that computes exact Causal Attention using 10% fewer operations. ECA improves a special class of matrix multiplications where either one operand or the output matrix is upper- or lower-triangular. This includes all matrix multiplication operations in the forward and backward pass of Causal Attention, such as masked product $\mathrm{Mask}(QK^{T})$. ECA is built upon algebraic identities discovered via machine learning and combinatorial search. We note that ECA cannot accelerate fused kernels such as FlashAttention on GPU. This is because ECA requires materialization of large intermediate expressions in the memory, while FlashAttention does not. However, it provides an alternative approach for compute-bound applications and can potentially be useful in scenarios with FLOPs considerations.

[1171] Logistic-Gated Operators Enable Auditable Unit-Aware Thresholds in Symbolic Regression

Ou Deng, Ruichen Cong, Jianting Xu, Shoji Nishimura, Atsushi Ogihara, Qun Jin

Main category: cs.LG

TL;DR: Proposes logistic-gated operators (LGO) for symbolic regression to handle unit-aware thresholds and conditional logic, achieving clinically plausible cut-points with competitive accuracy while maintaining compact equations.

Details

Motivation: Symbolic regression struggles with encoding unit-aware thresholds and conditional logic, limiting its ability to produce clinically interpretable equations with explicit thresholds that can be audited against medical guidelines.

Method: Introduces logistic-gated operators (LGO) - differentiable gates with learnable location and steepness parameters, embedded as typed primitives and mapped back to physical units for auditability.

Result: Hard-gate variant recovers clinically plausible cut-points: 71% within 10% of guideline anchors and 100% within 20%, using fewer gates than soft variant while maintaining competitive accuracy with symbolic regression baselines.

Conclusion: Enables compact symbolic equations with explicit, unit-aware thresholds that turn interpretability into a modeling constraint, providing practical calculus for regime switching and governance-ready deployment in clinical applications.

Abstract: Symbolic regression promises readable equations but struggles to encode unit-aware thresholds and conditional logic. We propose logistic-gated operators (LGO) – differentiable gates with learnable location and steepness – embedded as typed primitives and mapped back to physical units for audit. Across two primary health datasets (ICU, NHANES), the hard-gate variant recovers clinically plausible cut-points: 71% (5/7) of assessed thresholds fall within 10% of guideline anchors and 100% within 20%, while using far fewer gates than the soft variant (ICU median 4.0 vs 10.0; NHANES 5.0 vs 12.5), and remaining within the competitive accuracy envelope of strong SR baselines. On predominantly smooth tasks, gates are pruned, preserving parsimony. The result is compact symbolic equations with explicit, unit-aware thresholds that can be audited against clinical anchors – turning interpretability from a post-hoc explanation into a modeling constraint and equipping symbolic regression with a practical calculus for regime switching and governance-ready deployment.

[1172] Revisiting Node Affinity Prediction in Temporal Graphs

Krishna Sri Ipsit Mantri, Or Feldman, Moshe Eliasof, Chaim Baskin

Main category: cs.LG

TL;DR: NAViS is a novel node affinity prediction model that addresses limitations of current Temporal Graph Neural Networks by exploiting the equivalence between heuristics and state space models, achieving state-of-the-art performance.

Details

Motivation: Current dynamic link property prediction models adapted for node affinity prediction are outperformed by simple heuristics like Persistent Forecast or Moving Average, indicating challenges in training existing Temporal Graph Neural Networks for this task.

Method: Developed NAViS (Node Affinity prediction model using Virtual State) by combining solutions to training challenges and exploiting the equivalence between heuristics and state space models, along with a novel loss function for node affinity prediction.

Result: NAViS outperforms state-of-the-art methods including heuristics on TGB benchmark.

Conclusion: NAViS successfully addresses training challenges in node affinity prediction and demonstrates superior performance over existing approaches.

Abstract: Node affinity prediction is a common task that is widely used in temporal graph learning with applications in social and financial networks, recommender systems, and more. Recent works have addressed this task by adapting state-of-the-art dynamic link property prediction models to node affinity prediction. However, simple heuristics, such as Persistent Forecast or Moving Average, outperform these models. In this work, we analyze the challenges in training current Temporal Graph Neural Networks for node affinity prediction and suggest appropriate solutions. Combining the solutions, we develop NAViS - Node Affinity prediction model using Virtual State, by exploiting the equivalence between heuristics and state space models. While promising, training NAViS is non-trivial. Therefore, we further introduce a novel loss function for node affinity prediction. We evaluate NAViS on TGB and show that it outperforms the state-of-the-art, including heuristics. Our source code is available at https://github.com/orfeld415/NAVIS

[1173] HTMformer: Hybrid Time and Multivariate Transformer for Time Series Forecasting

Tan Wang, Yun Wei Dong, Tao Zhang, Qi Wang

Main category: cs.LG

TL;DR: HTMformer introduces Hybrid Temporal and Multivariate Embeddings (HTME) to enhance Transformer-based time series forecasting by extracting richer multidimensional features, achieving better accuracy and efficiency than existing methods.

Details

Motivation: Existing Transformers overemphasize temporal dependencies in time series forecasting, incurring computational overhead without corresponding performance gains. The performance depends heavily on effective embedding methods for sequence representations.

Method: Proposed HTME extractor integrates lightweight temporal feature extraction with multivariate feature extraction to create multidimensional embeddings. Combined with Transformer architecture to form HTMformer - a lightweight forecaster.

Result: Experiments on eight real-world datasets show HTMformer outperforms existing baselines in both accuracy and efficiency.

Conclusion: HTME provides richer sequence representations that enable Transformers to better understand time series, achieving optimal balance between model complexity and performance.

Abstract: Transformer-based methods have achieved impressive results in time series forecasting. However, existing Transformers still exhibit limitations in sequence modeling as they tend to overemphasize temporal dependencies. This incurs additional computational overhead without yielding corresponding performance gains. We find that the performance of Transformers is highly dependent on the embedding method used to learn effective representations. To address this issue, we extract multivariate features to augment the effective information captured in the embedding layer, yielding multidimensional embeddings that convey richer and more meaningful sequence representations. These representations enable Transformer-based forecasters to better understand the series. Specifically, we introduce Hybrid Temporal and Multivariate Embeddings (HTME). The HTME extractor integrates a lightweight temporal feature extraction module with a carefully designed multivariate feature extraction module to provide complementary features, thereby achieving a balance between model complexity and performance. By combining HTME with the Transformer architecture, we present HTMformer, leveraging the enhanced feature extraction capability of the HTME extractor to build a lightweight forecaster. Experiments conducted on eight real-world datasets demonstrate that our approach outperforms existing baselines in both accuracy and efficiency.

[1174] Enhancing Self-Supervised Learning with Semantic Pairs A New Dataset and Empirical Study

Mohammad Alkhalefi, Georgios Leontidis, Mingjun Zhong

Main category: cs.LG

TL;DR: Using semantic pairs (instances from same semantic category) in self-supervised learning improves model generalizability beyond traditional data transformations alone.

Details

Motivation: Traditional instance discrimination relies on limited handcrafted transformations, which cannot cover the full spectrum of real-world data variations, limiting model generalizability to unseen datasets and downstream tasks.

Method: Proposed incorporating semantic pairs - two instances from the same semantic category - to expose models to varied real-world scene contexts and enhance representation learning.

Result: Empirical validation using a novel curated semantic pairs dataset showed that including semantic pairs enables learning more general representations and improves performance across diverse downstream tasks.

Conclusion: Semantic pairs effectively mitigate the limitation of finite transformation coverage in self-supervised learning, leading to more generalizable object representations.

Abstract: Instance discrimination is a self-supervised representation learning paradigm wherein individual instances within a dataset are treated as distinct classes. This is typically achieved by generating two disparate views of each instance by applying stochastic transformations, encouraging the model to learn representations invariant to the common underlying object across these views. While this approach facilitates the acquisition of invariant representations for dataset instances under various handcrafted transformations (e.g., random cropping, colour jittering), an exclusive reliance on such data transformations for achieving invariance may inherently limit the model’s generalizability to unseen datasets and diverse downstream tasks. The inherent limitation stems from the fact that the finite set of transformations within the data processing pipeline is unable to encompass the full spectrum of potential data variations. In this study, we provide the technical foundation for leveraging semantic pairs to enhance the generalizability of the model’s representation and empirically demonstrate that incorporating semantic pairs mitigates the issue of limited transformation coverage. Specifically, we propose that by exposing the model to semantic pairs (i.e., two instances belonging to the same semantic category), we introduce varied real-world scene contexts, thereby fostering the development of more generalizable object representations. To validate this hypothesis, we constructed and released a novel dataset comprising curated semantic pairs and conducted extensive experimentation to empirically establish that their inclusion enables the model to learn more general representations, ultimately leading to improved performance across diverse downstream tasks.

[1175] Automated Machine Learning for Unsupervised Tabular Tasks

Prabhant Singh, Pieter Gijsbers, Elif Ceren Gok Yildirim, Murat Onur Yildirim, Joaquin Vanschoren

Main category: cs.LG

TL;DR: LOTUS is a method for model selection in unsupervised ML tasks using Optimal Transport to find dataset similarity and recommend pipelines.

Details

Motivation: ML pipelines perform well on new datasets if they worked well on similar data distributions, enabling effective model selection for unsupervised tasks.

Method: Uses Optimal Transport distances to measure similarity between unlabeled tabular datasets and recommends ML pipelines for outlier detection and clustering.

Result: Experiments show LOTUS outperforms strong baselines and is promising for model selection in unsupervised ML tasks.

Conclusion: LOTUS is an effective first step toward unified model selection for multiple unsupervised ML tasks using dataset similarity.

Abstract: In this work, we present LOTUS (Learning to Learn with Optimal Transport for Unsupervised Scenarios), a simple yet effective method to perform model selection for multiple unsupervised machine learning(ML) tasks such as outlier detection and clustering. Our intuition behind this work is that a machine learning pipeline will perform well in a new dataset if it previously worked well on datasets with a similar underlying data distribution. We use Optimal Transport distances to find this similarity between unlabeled tabular datasets and recommend machine learning pipelines with one unified single method on two downstream unsupervised tasks: outlier detection and clustering. We present the effectiveness of our approach with experiments against strong baselines and show that LOTUS is a very promising first step toward model selection for multiple unsupervised ML tasks.

[1176] Design-Based Bandits Under Network Interference: Trade-Off Between Regret and Statistical Inference

Zichen Wang, Haoyang Hong, Chuanhao Li, Haoxuan Li, Zhiheng Zhang, Huazheng Wang

Main category: cs.LG

TL;DR: This paper establishes a theoretical Pareto frontier for the trade-off between regret minimization and inference accuracy in multi-armed bandits with network interference (MABNI), and introduces an algorithm called EXP3-N-CS with anytime-valid confidence sequences.

Details

Motivation: Existing MABNI research focuses primarily on regret minimization but overlooks how excessive emphasis on optimal arms can undermine inference accuracy for sub-optimal arms, creating a critical trade-off that becomes more pronounced in network interference settings.

Method: The authors establish a theoretical Pareto frontier for the regret-inference trade-off in adversarial MABNI and develop an algorithm called EXP3-N-CS with anytime-valid asymptotic confidence sequences specifically designed to balance this trade-off.

Result: The paper provides the first theoretical characterization of the Pareto frontier between regret minimization and inference accuracy in adversarial MABNI settings, along with a practical algorithm that achieves this balance.

Conclusion: This work addresses the important trade-off between regret minimization and inference accuracy in network bandits, providing both theoretical foundations and practical algorithmic solutions for balanced performance in MABNI problems.

Abstract: In multi-armed bandits with network interference (MABNI), the action taken by one node can influence the rewards of others, creating complex interdependence. While existing research on MABNI largely concentrates on minimizing regret, it often overlooks the crucial concern that an excessive emphasis on the optimal arm can undermine the inference accuracy for sub-optimal arms. Although initial efforts have been made to address this trade-off in single-unit scenarios, these challenges have become more pronounced in the context of MABNI. In this paper, we establish, for the first time, a theoretical Pareto frontier characterizing the trade-off between regret minimization and inference accuracy in adversarial (design-based) MABNI. We further introduce an anytime-valid asymptotic confidence sequence along with a corresponding algorithm, $\texttt{EXP3-N-CS}$, specifically designed to balance the trade-off between regret minimization and inference accuracy in this setting.

[1177] Continual Learning for Adaptive AI Systems

Md Hasibul Amin, Tamzid Tanvi Alam

Main category: cs.LG

TL;DR: CAR is a continual learning framework that combines class-balanced replay with Inter-Cluster Fitness regularization to reduce catastrophic forgetting by separating feature representations between tasks.

Details

Motivation: To address catastrophic forgetting in neural networks when learning sequential tasks, overcoming limitations of overfitting and interference between tasks.

Method: Hybrid framework integrating small class-balanced replay buffer with Inter-Cluster Fitness (ICF) regularization that penalizes overlapping feature representations between new and old tasks.

Result: Initial experiments on Split CIFAR-10 with ResNet-18 show CAR better preserves earlier task performance compared to fine-tuning alone.

Conclusion: Feature-space regularization is a promising direction for mitigating catastrophic forgetting in continual learning.

Abstract: Continual learning the ability of a neural network to learn multiple sequential tasks without catastrophic forgetting remains a central challenge in developing adaptive artificial intelligence systems. While deep learning models achieve state-of-the-art performance across domains, they remain limited by overfitting and forgetting. This paper introduces Cluster-Aware Replay (CAR), a hybrid continual learning framework that integrates a small, class-balanced replay buffer with a regularization term based on Inter-Cluster Fitness (ICF) in the feature space. The ICF loss penalizes overlapping feature representations between new and previously learned tasks, encouraging geometric separation in the latent space and reducing interference. Using the standard five-task Split CIFAR-10 benchmark with a ResNet-18 backbone, initial experiments demonstrate that CAR better preserves earlier task performance compared to fine-tuning alone. These findings are preliminary but highlight feature-space regularization as a promising direction for mitigating catastrophic forgetting.

[1178] Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Fine-tuning

Chen Wang, Zhaochun Li, Jionghao Bai, Yuzhi Zhang, Shisheng Cui, Zhou Zhao, Yue Wang

Main category: cs.LG

TL;DR: AEPO eliminates entropy collapse in reinforcement fine-tuning by replacing entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions and stabilizing entropy through temperature regulation.

Details

Motivation: Existing methods like GRPO suffer from entropy collapse where entropy monotonically decreases, exploration vanishes, and policies converge prematurely. Current entropy-regularized methods only partially address this while introducing bias and instability.

Method: AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization. It replaces entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions and stabilizes entropy through temperature regulation.

Result: AEPO stabilizes entropy at arbitrary target levels, effectively removing collapse in GRPO. It reveals a non-monotonic relation where performance first improves then declines with increasing entropy, clarifying the link between entropy, exploration, and reasoning.

Conclusion: AEPO provides a broader RFT paradigm where superior target distributions can serve as REINFORCE regularizers, enabling precise entropy control without distorting optimization and generalizing beyond entropy control.

Abstract: Reinforcement fine-tuning (RFT) is essential for enhancing the reasoning capabilities of large language models (LLM), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, where entropy monotonically decreases, exploration vanishes, and policies converge prematurely. Existing entropy-regularized methods only partially alleviate this issue while introducing bias and instability, leaving entropy control unresolved and the connection between entropy, exploration, and performance unclear. We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions and stabilizing entropy through temperature regulation. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization. Experiments demonstrate three major contributions: AEPO (1) stabilizes entropy at arbitrary target levels, effectively removing collapse in GRPO; (2) reveals a non-monotonic relation where performance first improves then declines with increasing entropy, clarifying the link between entropy, exploration, and reasoning; and (3) generalizes beyond entropy, providing a broader RFT paradigm where superior target distributions can serve as REINFORCE regularizers.

[1179] TinyGraphEstimator: Adapting Lightweight Language Models for Graph Structure Inference

Michal Podstawski

Main category: cs.LG

TL;DR: Small transformer models can infer graph parameters from graph representations, and lightweight fine-tuning with LoRA improves performance across all metrics.

Details

Motivation: To explore whether compact, resource-efficient language models can perform structural inference on graph data, as this capability remains largely unexplored despite LLMs showing symbolic reasoning abilities.

Method: Created TinyGraphEstimator dataset with connected graphs from multiple random models, evaluated small open models on predicting graph parameters, and applied LoRA fine-tuning for efficient parameter adaptation.

Result: Small language models demonstrated non-trivial reasoning capacity over graph-structured data, with LoRA fine-tuning achieving consistent improvements across all evaluated metrics.

Conclusion: Compact transformer models can effectively perform structural inference on graphs through efficient parameter tuning, showing promising capabilities for graph analysis tasks.

Abstract: Graphs provide a universal framework for representing complex relational systems, and inferring their structural properties is a core challenge in graph analysis and reasoning. While large language models have recently demonstrated emerging abilities to perform symbolic and numerical reasoning, the potential of smaller, resource-efficient models in this context remains largely unexplored. This paper investigates whether compact transformer-based language models can infer graph-theoretic parameters directly from graph representations. To enable systematic evaluation, we introduce the TinyGraphEstimator dataset - a balanced collection of connected graphs generated from multiple random graph models and annotated with detailed structural metadata. We evaluate several small open models on their ability to predict key graph parameters such as density, clustering, and chromatic number. Furthermore, we apply lightweight fine-tuning using the Low-Rank Adaptation (LoRA) technique, achieving consistent improvements across all evaluated metrics. The results demonstrate that small language models possess non-trivial reasoning capacity over graph-structured data and can be effectively adapted for structural inference tasks through efficient parameter tuning.

[1180] Robustness and Regularization in Hierarchical Re-Basin

Benedikt Franke, Florian Heinrich, Markus Lange, Arne Raulf

Main category: cs.LG

TL;DR: Git Re-Basin model merging approach is analyzed, showing it induces robustness but causes larger performance drops than originally reported.

Details

Motivation: To investigate Git Re-Basin's model merging capabilities and develop improved merging algorithms.

Method: Proposed hierarchical model merging scheme that outperforms standard MergeMany algorithm.

Result: Re-Basin induces adversarial and perturbation robustness in merged models, with stronger effects from more models in hierarchical merging, but causes larger performance drops than originally reported.

Conclusion: Git Re-Basin provides robustness benefits but has more significant performance trade-offs than initially claimed.

Abstract: This paper takes a closer look at Git Re-Basin, an interesting new approach to merge trained models. We propose a hierarchical model merging scheme that significantly outperforms the standard MergeMany algorithm. With our new algorithm, we find that Re-Basin induces adversarial and perturbation robustness into the merged models, with the effect becoming stronger the more models participate in the hierarchical merging scheme. However, in our experiments Re-Basin induces a much bigger performance drop than reported by the original authors.

[1181] FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning

Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Chubin Zhang, Mingcong Lei, Bo An, Ivor Tsang

Main category: cs.LG

TL;DR: Proposes a student-teacher framework where a simple MLP student policy explores online via RL with rewards from a teacher Flow Matching model, overcoming FM’s limitations in online interaction while leveraging its expressiveness.

Details

Motivation: Flow Matching policies have strong behavioral cloning capabilities but lack environmental interaction and exploration, leading to poor generalization in unseen scenarios. Online optimization of FM policies is challenging due to gradient instability and high inference costs.

Method: Use a student MLP policy for online exploration updated via RL with a reward model derived from a teacher FM model. The teacher FM model provides expert distribution information and regularizes the student policy to stabilize learning.

Result: Significantly enhances learning efficiency, generalization, and robustness, especially when learning from suboptimal expert data. Avoids gradient instability of FM policies while leveraging their expressiveness.

Conclusion: The student-teacher framework successfully combines the expressiveness of FM models with efficient online exploration, addressing key limitations of pure FM-based approaches in imitation learning.

Abstract: Flow Matching (FM) has shown remarkable ability in modeling complex distributions and achieves strong performance in offline imitation learning for cloning expert behaviors. However, despite its behavioral cloning expressiveness, FM-based policies are inherently limited by their lack of environmental interaction and exploration. This leads to poor generalization in unseen scenarios beyond the expert demonstrations, underscoring the necessity of online interaction with environment. Unfortunately, optimizing FM policies via online interaction is challenging and inefficient due to instability in gradient computation and high inference costs. To address these issues, we propose to let a student policy with simple MLP structure explore the environment and be online updated via RL algorithm with a reward model. This reward model is associated with a teacher FM model, containing rich information of expert data distribution. Furthermore, the same teacher FM model is utilized to regularize the student policy’s behavior to stabilize policy learning. Due to the student’s simple architecture, we avoid the gradient instability of FM policies and enable efficient online exploration, while still leveraging the expressiveness of the teacher FM model. Extensive experiments show that our approach significantly enhances learning efficiency, generalization, and robustness, especially when learning from suboptimal expert data.

[1182] Mitigating Model Drift in Developing Economies Using Synthetic Data and Outliers

Ilyas Varshavskiy, Bonu Boboeva, Shuhrat Khalilbekov, Azizjon Azimi, Sergey Shulgin, Akhlitdin Nizamitdinov, Haitz Sáez de Ocáriz Borde

Main category: cs.LG

TL;DR: This paper investigates using synthetic outliers to mitigate model drift in financial ML models for developing economies in Central Asia and the Caucasus, introducing a two-level evaluation framework.

Details

Motivation: Financial ML models in developing economies like Tajikistan, Uzbekistan, Kazakhstan, and Azerbaijan are highly vulnerable to model drift due to frequent macroeconomic shocks that destabilize data distributions.

Method: The study explores synthetic outliers as a drift mitigation approach and introduces a two-level framework to measure performance degradation and shock severity on macroeconomic tabular datasets.

Result: Adding a small proportion of synthetic outliers generally improves model stability compared to baseline models, though the optimal amount varies by dataset and model.

Conclusion: Synthetic outliers show promise for mitigating model drift in financial applications for developing economies, representing one of the first studies focused on these regions.

Abstract: Machine Learning models in finance are highly susceptible to model drift, where predictive performance declines as data distributions shift. This issue is especially acute in developing economies such as those in Central Asia and the Caucasus - including Tajikistan, Uzbekistan, Kazakhstan, and Azerbaijan - where frequent and unpredictable macroeconomics shocks destabilize financial data. To the best of our knowledge, this is among the first studies to examine drift mitigation methods on financial datasets from these regions. We investigate the use of synthetic outliers, a largely unexplored approach, to improve model stability against unforeseen shocks. To evaluate effectiveness, we introduce a two-level framework that measures both the extent of performance degradation and the severity of shocks. Our experiments on macroeconomic tabular datasets show that adding a small proportion of synthetic outliers generally improves stability compared to baseline models, though the optimal amount varies by dataset and model

cs.MA

[1183] A Hybrid Agent-Based and System Dynamics Framework for Modelling Project Execution and Technology Maturity in Early-Stage R&D

R. W. S. Pessoa, M. H. Næss, J. C. Bijos, C. M. Rebello, D. Colombo, L. Schnitman, I. B. R. Nogueira

Main category: cs.MA

TL;DR: Hybrid System Dynamics and Agent-Based Modeling framework predicts technological maturity evolution in R&D projects, tested on oil and gas sector with scenarios showing optimal team sizes and reduced rework.

Details

Motivation: To address uncertainties in R&D project evolution (work effort, team size, duration) and limited use of hybrid SD-ABM approaches in R&D contexts.

Method: Multi-level framework integrating System Dynamics (system feedback structures) with Agent-Based Modeling (decentralized agents like team members, tasks, controllers) to capture emergent project dynamics.

Result: Base case: 15 parallel tasks over 156 weeks; Sequential scenario showed 88% rework reduction; Mixed scenarios found optimal teams of 4-5 members, with larger teams potentially decreasing performance due to communication complexity.

Conclusion: Model outputs align with expert understanding, supporting validity as quantitative tools for analyzing resource allocation, scheduling efficiency, and technology maturity progression.

Abstract: This paper presents a hybrid approach to predict the evolution of technological maturity in R and D projects, using the oil and gas sector as an example. Integrating System Dynamics (SD) and Agent Based Modelling (ABM) allows the proposed multi level framework to capture uncertainties in work effort, team size, and project duration, which influence technological progress. While AB SD hybrid models are established in other fields, their use in R and D remains limited. The model combines system level feedback structures governing work phases, rework cycles, and duration with decentralised agents such as team members, tasks, and controllers, whose interactions generate emergent project dynamics. A base case scenario analysed early stage innovation projects with 15 parallel tasks over 156 weeks. A comparative sequential scenario showed an 88 percent reduction in rework duration. A second scenario assessed mixed parallel sequential task structures with varying team sizes. In parallel configurations, increasing team size reduced project duration and improved task completion, with optimal results for teams of four to five members. These findings align with empirical evidence showing that moderate team expansion enhances coordination efficiency without excessive communication overhead. However, larger teams may decrease performance due to communication complexity and management delays. Overall, the model outputs and framework align with expert understanding, supporting their validity as quantitative tools for analysing resource allocation, scheduling efficiency, and technology maturity progression.

[1184] Structured Cooperative Multi-Agent Reinforcement Learning: a Bayesian Network Perspective

Shahbaz P Qadri Syed, He Bai

Main category: cs.MA

TL;DR: A new partially decentralized training decentralized execution (P-DTDE) paradigm for multi-agent reinforcement learning that leverages inter-agent coupling structures through Bayesian networks to improve efficiency and scalability.

Details

Motivation: Existing MARL algorithms do not fully exploit inter-agent coupling information, limiting efficiency and scalability for large multi-agent systems.

Method: Model cooperative MARL via Bayesian networks, identify value dependency sets, propose P-DTDE paradigm, derive policy gradient theorem, and develop scalable actor-critic algorithm with approximation for dense dependencies.

Result: Theoretical proof that P-DTDE policy gradient estimator has lower variance than CTDE, and empirical demonstration of efficiency/scalability on resource allocation and temperature control tasks.

Conclusion: The proposed approach enables more efficient and scalable MARL by systematically exploiting inter-agent coupling structures, with approximation schemes handling large-scale systems.

Abstract: The empirical success of multi-agent reinforcement learning (MARL) has motivated the search for more efficient and scalable algorithms for large scale multi-agent systems. However, existing state-of-the-art algorithms do not fully exploit inter-agent coupling information to develop MARL algorithms. In this paper, we propose a systematic approach to leverage structures in the inter-agent couplings for efficient model-free reinforcement learning. We model the cooperative MARL problem via a Bayesian network and characterize the subset of agents, termed as the value dependency set, whose information is required by each agent to estimate its local action value function exactly. Moreover, we propose a partially decentralized training decentralized execution (P-DTDE) paradigm based on the value dependency set. We theoretically establish that the total variance of our P-DTDE policy gradient estimator is less than the centralized training decentralized execution (CTDE) policy gradient estimator. We derive a multi-agent policy gradient theorem based on the P-DTDE scheme and develop a scalable actor-critic algorithm. We demonstrate the efficiency and scalability of the proposed algorithm on multi-warehouse resource allocation and multi-zone temperature control examples. For dense value dependency sets, we propose an approximation scheme based on truncation of the Bayesian network and empirically show that it achieves a faster convergence than the exact value dependence set for applications with a large number of agents.

[1185] KG-MAS: Knowledge Graph-Enhanced Multi-Agent Infrastructure for coupling physical and digital robotic environments

Walid Abdela

Main category: cs.MA

TL;DR: KG-MAS uses a centralized Knowledge Graph as a shared world model for Multi-Agent Systems to enable intelligent coordination between physical and digital components in Cyber-Physical Systems.

Details

Motivation: Traditional approaches in CPS lack semantic richness and flexibility for intelligent autonomous coordination, relying on rigid data-centric solutions that struggle with system heterogeneity and complexity.

Method: KG-MAS employs a centralized Knowledge Graph as a dynamic shared world model, with autonomous agents querying and updating the KG for decision-making. It features model-driven architecture for automatic agent generation from semantic descriptions.

Result: The infrastructure abstracts away underlying communication protocols and provides unified intelligent coordination, offering a robust, scalable, and flexible solution for heterogeneous physical-digital environments.

Conclusion: KG-MAS successfully addresses limitations of traditional CPS approaches by providing a semantic-rich, flexible framework that simplifies system extension and maintenance while enabling intelligent autonomous coordination.

Abstract: The seamless integration of physical and digital environments in Cyber-Physical Systems(CPS), particularly within Industry 4.0, presents significant challenges stemming from system heterogeneity and complexity. Traditional approaches often rely on rigid, data-centric solutions like co-simulation frameworks or brittle point-to-point middleware bridges, which lack the semantic richness and flexibility required for intelligent, autonomous coordination. This report introduces the Knowledge Graph-Enhanced Multi-Agent Infrastructure(KG-MAS), as resolution in addressing such limitations. KG-MAS leverages a centralized Knowledge Graph (KG) as a dynamic, shared world model, providing a common semantic foundation for a Multi-Agent System(MAS). Autonomous agents, representing both physical and digital components, query this KG for decision-making and update it with real-time state information. The infrastructure features a model-driven architecture which facilitates the automatic generation of agents from semantic descriptions, thereby simplifying system extension and maintenance. By abstracting away underlying communication protocols and providing a unified, intelligent coordination mechanism, KG-MAS offers a robust, scalable, and flexible solution for coupling heterogeneous physical and digital robotic environments.

[1186] HyperAgent: Leveraging Hypergraphs for Topology Optimization in Multi-Agent Communication

Heng Zhang, Yuling Shi, Xiaodong Gu, Zijian Zhang, Haochen You, Lubin Gan, Yilei Yuan, Jin Huang

Main category: cs.MA

TL;DR: HyperAgent is a hypergraph-based framework that addresses limitations in multi-agent systems by using hyperedges to model group collaborations and dynamically optimize communication topologies based on task complexity.

Details

Motivation: Existing multi-agent systems face ineffective group collaboration modeling (limited to pairwise relationships) and limited task-adaptiveness in communication topology design, restricting scalability and practical deployment.

Method: Uses hyperedges to link multiple agents within subtasks, employs hypergraph convolutional layers for one-step information aggregation, and incorporates a variational autoencoder with sparsity regularization to dynamically adjust hypergraph topologies.

Result: Achieves 95.07% accuracy on GSM8K while reducing token consumption by 25.33%, demonstrating superior performance and efficiency compared to existing approaches.

Conclusion: HyperAgent demonstrates the potential of hypergraph-based optimization for multi-agent communication, effectively capturing group collaboration patterns and adapting communication topologies to task complexity.

Abstract: Recent advances in large language model-powered multi-agent systems have demonstrated remarkable collective intelligence through effective communication. However, existing approaches face two primary challenges: (i) \textit{Ineffective group collaboration modeling}, as they rely on pairwise edge representations in graph structures, limiting their ability to capture relationships among multiple agents; and (ii) \textit{Limited task-adaptiveness in communication topology design}, leading to excessive communication cost for simple tasks and insufficient coordination for complex scenarios. These issues restrict the scalability and practical deployment of adaptive collaboration frameworks. To address these challenges, we propose \textbf{HyperAgent}, a hypergraph-based framework that optimizes communication topologies and effectively captures group collaboration patterns using direct hyperedge representations. Unlike edge-based approaches, HyperAgent uses hyperedges to link multiple agents within the same subtask and employs hypergraph convolutional layers to achieve one-step information aggregation in collaboration groups. Additionally, it incorporates a variational autoencoder framework with sparsity regularization to dynamically adjust hypergraph topologies based on task complexity. Experiments highlight the superiority of HyperAgent in both performance and efficiency. For instance, on GSM8K, HyperAgent achieves 95.07% accuracy while reducing token consumption by 25.33%, demonstrating the potential of hypergraph-based optimization for multi-agent communication.

[1187] Fast and the Furious: Hot Starts in Pursuit-Evasion Games

Gabriel Smithline, Scott Nivison

Main category: cs.MA

TL;DR: A novel approach combining game-theoretic control and Graph Neural Networks for pursuer positioning in pursuit-evasion games, using GCN-generated “hot starts” that significantly outperform random configurations.

Details

Motivation: Effectively positioning pursuers without prior knowledge of evader locations remains a significant challenge in pursuit-evasion games.

Method: Conceptualizes pursuer configurations as graphs, constructs Graph Characteristic Space via multi-objective optimization for Pareto-optimal configurations, and trains a Graph Convolutional Network (GCN) to generate strategic “hot starts”.

Result: GCN-generated hot starts provide significant advantage over random configurations - hasten decline in evader survival rates, reduce pursuer travel distances, and enhance containment in multi-pursuer/evader scenarios.

Conclusion: The method demonstrates clear strategic benefits for pursuer positioning in pursuit-evasion games through game-theoretic GCN approach.

Abstract: Effectively positioning pursuers in pursuit-evasion games without prior knowledge of evader locations remains a significant challenge. A novel approach that combines game-theoretic control theory with Graph Neural Networks is introduced in this work. By conceptualizing pursuer configurations as strategic arrangements and representing them as graphs, a Graph Characteristic Space is constructed via multi-objective optimization to identify Pareto-optimal configurations. A Graph Convolutional Network (GCN) is trained on these Pareto-optimal graphs to generate strategically effective initial configurations, termed “hot starts”. Empirical evaluations demonstrate that the GCN-generated hot starts provide a significant advantage over random configurations. In scenarios considering multiple pursuers and evaders, this method hastens the decline in evader survival rates, reduces pursuer travel distances, and enhances containment, showcasing clear strategic benefits.

Thi-Nhung Nguyen, Linhao Luo, Thuy-Trang Vu, Dinh Phung

Main category: cs.MA

TL;DR: This paper studies bias in multi-agent LLM systems, finding they are less robust than single agents with bias emerging through in-group favoritism, but cooperative communication and robust base LLMs can help mitigate bias.

Details

Motivation: While bias in individual LLMs is well-studied, the rise of multi-agent systems introduces new unexplored dynamics in bias emergence and propagation that need investigation.

Method: Simulated social contexts with agents representing different social groups, evaluated system behavior under various interaction and adversarial scenarios using three bias benchmarks.

Result: MAS are generally less robust than single-agent systems with bias emerging early through in-group favoritism, but cooperative/debate-based communication and robust underlying LLMs can mitigate bias amplification.

Conclusion: Multi-agent LLM systems present unique bias challenges requiring careful consideration of communication protocols and base model robustness to ensure fairness and resilience.

Abstract: Bias in large language models (LLMs) remains a persistent challenge, manifesting in stereotyping and unfair treatment across social groups. While prior research has primarily focused on individual models, the rise of multi-agent systems (MAS), where multiple LLMs collaborate and communicate, introduces new and largely unexplored dynamics in bias emergence and propagation. In this work, we present a comprehensive study of stereotypical bias in MAS, examining how internal specialization, underlying LLMs and inter-agent communication protocols influence bias robustness, propagation, and amplification. We simulate social contexts where agents represent different social groups and evaluate system behavior under various interaction and adversarial scenarios. Experiments on three bias benchmarks reveal that MAS are generally less robust than single-agent systems, with bias often emerging early through in-group favoritism. However, cooperative and debate-based communication can mitigate bias amplification, while more robust underlying LLMs improve overall system stability. Our findings highlight critical factors shaping fairness and resilience in multi-agent LLM systems.

[1189] Automating Structural Engineering Workflows with Large Language Model Agents

Haoran Liang, Yufa Zhou, Mohammad Talebi Kalaleh, Qipei Mei

Main category: cs.MA

TL;DR: MASSE is the first Multi-Agent System for Structural Engineering that integrates LLM-based agents with real-world engineering workflows, automating most structural engineering tasks and reducing expert workload from hours to minutes.

Details

Motivation: Structural engineering is a fundamental but stagnant domain with workflows unchanged for decades, despite its economic importance. Recent LLM advancements in complex reasoning and tool utilization align well with engineering tasks like code interpretation and load calculations.

Method: A training-free LLM-based multi-agent system that integrates with real-world engineering workflows, using LLM agents for interpreting design codes, executing calculations, and verifying structural capacities.

Result: MASSE can fully automate most real-world structural engineering workflows, reducing expert workload from approximately two hours to mere minutes while enhancing reliability and accuracy in practical scenarios.

Conclusion: MASSE enables immediate deployment in professional environments and demonstrates that structural engineering workflows can be effectively automated through LLM-based multi-agent systems, significantly improving efficiency and accuracy.

Abstract: We introduce $\textbf{MASSE}$, the first Multi-Agent System for Structural Engineering, effectively integrating large language model (LLM)-based agents with real-world engineering workflows. Structural engineering is a fundamental yet traditionally stagnant domain, with core workflows remaining largely unchanged for decades despite its substantial economic impact and global market size. Recent advancements in LLMs have significantly enhanced their ability to perform complex reasoning, long-horizon planning, and precise tool utilization – capabilities well aligned with structural engineering tasks such as interpreting design codes, executing load calculations, and verifying structural capacities. We present a proof-of-concept showing that most real-world structural engineering workflows can be fully automated through a training-free LLM-based multi-agent system. MASSE enables immediate deployment in professional environments, and our comprehensive validation on real-world case studies demonstrates that it can reduce expert workload from approximately two hours to mere minutes, while enhancing both reliability and accuracy in practical engineering scenarios.

[1190] A Vision for Access Control in LLM-based Agent Systems

Xinfeng Li, Dong Huang, Jie Li, Hongyi Cai, Zhenhong Zhou, Wei Dong, XiaoFeng Wang, Yang Liu

Main category: cs.MA

TL;DR: Proposes Agent Access Control (AAC) as a new framework for governing information flow in LLM-based agents, moving beyond traditional binary access control to dynamic, context-aware information governance.

Details

Motivation: Traditional access control mechanisms are insufficient for LLM-based agents due to their autonomy and contextual complexity, requiring a shift from binary permission systems to sophisticated information flow governance.

Method: Introduces AAC framework with two core modules: multi-dimensional contextual evaluation (assessing identity, relationships, scenarios, norms) and adaptive response formulation (using redaction, summarization, paraphrasing beyond simple allow/deny decisions).

Result: Proposes a conceptual framework that reframes access control as dynamic information flow governance, powered by a dedicated AC reasoning engine to bridge nuanced human judgment with scalable AI safety.

Conclusion: AAC represents a paradigm shift in trustworthy agent design, offering a new conceptual lens for future research that addresses the fundamental challenge of governing information flow rather than just managing permissions.

Abstract: The autonomy and contextual complexity of LLM-based agents render traditional access control (AC) mechanisms insufficient. Static, rule-based systems designed for predictable environments are fundamentally ill-equipped to manage the dynamic information flows inherent in agentic interactions. This position paper argues for a paradigm shift from binary access control to a more sophisticated model of information governance, positing that the core challenge is not merely about permission, but about governing the flow of information. We introduce Agent Access Control (AAC), a novel framework that reframes AC as a dynamic, context-aware process of information flow governance. AAC operates on two core modules: (1) multi-dimensional contextual evaluation, which assesses not just identity but also relationships, scenarios, and norms; and (2) adaptive response formulation, which moves beyond simple allow/deny decisions to shape information through redaction, summarization, and paraphrasing. This vision, powered by a dedicated AC reasoning engine, aims to bridge the gap between human-like nuanced judgment and scalable Al safety, proposing a new conceptual lens for future research in trustworthy agent design.

Anastasia Psarou, Łukasz Gorczyca, Dominik Gaweł, Rafał Kucharski

Main category: cs.MA

TL;DR: Introducing social awareness through marginal cost rewards in MARL for AV routing reduces training time and improves convergence, benefiting both system-wide and individual performance.

Details

Motivation: Selfish AV routing strategies using MARL destabilize traffic systems and take too long to converge, requiring years of real-world commuting time.

Method: Add intrinsic reward based on marginal cost matrix to align agents’ objectives, quantifying each route-choice’s impact on total travel time while preserving system equilibria.

Result: MARL algorithms with marginal cost rewards converge to optimal solution in both toy and real-world networks, while baseline algorithms fail to converge.

Conclusion: Social awareness through marginal cost inclusion improves both system-wide and individual performance in AV routing systems.

Abstract: Previous work has shown that when multiple selfish Autonomous Vehicles (AVs) are introduced to future cities and start learning optimal routing strategies using Multi-Agent Reinforcement Learning (MARL), they may destabilize traffic systems, as they would require a significant amount of time to converge to the optimal solution, equivalent to years of real-world commuting. We demonstrate that moving beyond the selfish component in the reward significantly relieves this issue. If each AV, apart from minimizing its own travel time, aims to reduce its impact on the system, this will be beneficial not only for the system-wide performance but also for each individual player in this routing game. By introducing an intrinsic reward signal based on the marginal cost matrix, we significantly reduce training time and achieve convergence more reliably. Marginal cost quantifies the impact of each individual action (route-choice) on the system (total travel time). Including it as one of the components of the reward can reduce the degree of non-stationarity by aligning agents’ objectives. Notably, the proposed counterfactual formulation preserves the system’s equilibria and avoids oscillations. Our experiments show that training MARL algorithms with our novel reward formulation enables the agents to converge to the optimal solution, whereas the baseline algorithms fail to do so. We show these effects in both a toy network and the real-world network of Saint-Arnoult. Our results optimistically indicate that social awareness (i.e., including marginal costs in routing decisions) improves both the system-wide and individual performance of future urban systems with AVs.

[1192] Agent-Based Modelling for Real-World Stock Markets under Behavioral Economic Principles

Tianlang He, Fengming Zhu, Keyan Lu, Chang Xu, Yang Liu, Weiqing Liu, Fangzhen Lin, S. -H. Gary Chan, Jiang Bian

Main category: cs.MA

TL;DR: Agent-based modeling with deep learning calibration reproduces financial market dynamics with 90% confidence, addressing limitations of traditional time series forecasting methods.

Details

Motivation: Traditional time series forecasting for financial markets faces challenges like overfitting historical data, failing to reconstruct stylized facts, and limiting counterfactual analysis capabilities.

Method: Agent-based modeling where traders act as autonomous agents guided by behavioral-economic principles, with parameters calibrated using deep learning and aligned with economic indices like CPI.

Result: ABM method reproduces market dynamics with 90% confidence, accurately reflects stylized facts, and shows computational efficiency in calibration compared to other simulation-based inference methods.

Conclusion: Agent-based modeling combined with deep learning calibration provides an effective approach for realistic financial market simulation that addresses key limitations of traditional forecasting methods.

Abstract: The reproduction of realistic dynamics in financial markets is of great significance, as it enhances our understanding of market evolution beyond other physical processes, and facilitates the development and backtesting of investment strategies. Most existing literature approaches this issue as a time series forecasting problem, which often faces challenges such as 1) overfitting historical data, 2) failing to reconstruct stylized facts, and 3) limiting users’ ability to conduct counterfactual analyses. To address these limitations, we employ agent-based modeling (ABM) for market simulation, where each trader acts as an autonomous agent guided by established behavioral-economic principles. The parameters of the agent model are subsequently calibrated using deep learning techniques. Additionally, we align our agent model with publicly available economic indices, such as the Consumer Price Index (CPI), to enhance the explainability of our system’s outcomes. Our experiments demonstrate that the ABM method effectively reproduces market dynamics with a confidence level of 90%, accurately reflecting well-known stylized facts. Furthermore, the calibration process proves to be more computationally efficient compared to other existing methods that perform simulation-based inference. We also present case studies illustrating the correlation between agent parameters and economic indices.

cs.MM

[1193] Building and Evaluating a Realistic Virtual World for Large Scale Urban Exploration from 360° Videos

Mizuki Takenawa, Naoki Sugimoto, Leslie Wöhler, Satoshi Ikehata, Kiyoharu Aizawa

Main category: cs.MM

TL;DR: 360RVW creates realistic virtual worlds from 360° videos for urban exploration, enabling interactive navigation with avatar-based movement and virtual collision detection.

Details

Motivation: To build highly realistic and immersive virtual urban environments directly from 360° videos for interactive exploration, avoiding the need for complex 3D modeling.

Method: Four main operations: video segmentation by intersection detection, video completion to remove videographer, semantic segmentation for collision detection, and projection onto moving distorted spheres along camera trajectory.

Result: Users can freely navigate urban environments via avatars, change directions at intersections, select locations via map, and experience collision detection even without 3D models. System supports web browser streaming.

Conclusion: The system provides high-quality virtual tours with strong user presence and interactive exploration capabilities, making it ideal for urban environment exploration.

Abstract: We propose to build realistic virtual worlds, called 360RVW, for large urban environments directly from 360{\deg} videos. We provide an interface for interactive exploration, where users can freely navigate via their own avatars. 360{\deg} videos record the entire environment of the shooting location simultaneously leading to highly realistic and immersive representations. Our system uses 360{\deg} videos recorded along streets and builds a 360RVW through four main operations: video segmentation by intersection detection, video completion to remove the videographer, semantic segmentation for virtual collision detection with the avatar, and projection onto a distorted sphere that moves along the camera trajectory following the avatar’s movements. Our interface allows users to explore large urban environments by changing their walking direction at intersections or choosing a new location by clicking on a map. Even without a 3D model, the users can experience collision with buildings using metadata produced by semantic segmentation. Furthermore, we stream the 360{\deg} videos so users can directly access 360RVW via their web browser. We fully evaluate our system, including a perceptual experiment comparing our approach to previous exploratory interfaces. The results confirm the quality of our system, especially regarding the presence of users and the interactive exploration, making it most suitable for a virtual tour of urban environments.

[1194] TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity

Yuzhuo Chen, Zehua Ma, Han Fang, Weiming Zhang, Nenghai Yu

Main category: cs.MM

TL;DR: TAG-WM is a tamper-aware generative image watermarking method that embeds copyright and localization watermarks in latent space while maintaining image quality, with robust detection of tampered regions.

Details

Motivation: Address copyright and authenticity risks in AI-generated content by developing robust watermarking that can withstand generative image editing tools and detect malicious tampering.

Method: Uses four modules: dual-mark joint sampling for watermark embedding, watermark latent reconstruction, dense variation region detector using diffusion inversion sensitivity, and tamper-aware decoding guided by localization results.

Result: Achieves state-of-the-art performance in tampering robustness and localization capability under distortion, maintains lossless generation quality and 256-bit watermark capacity.

Conclusion: TAG-WM provides an effective solution for protecting AI-generated content against tampering while preserving image quality and enabling reliable source tracing.

Abstract: AI-generated content (AIGC) enables efficient visual creation but raises copyright and authenticity risks. As a common technique for integrity verification and source tracing, digital image watermarking is regarded as a potential solution to above issues. However, the widespread adoption and advancing capabilities of generative image editing tools have amplified malicious tampering risks, while simultaneously posing new challenges to passive tampering detection and watermark robustness. To address these challenges, this paper proposes a Tamper-Aware Generative image WaterMarking method named TAG-WM. The proposed method comprises four key modules: a dual-mark joint sampling (DMJS) algorithm for embedding copyright and localization watermarks into the latent space while preserving generative quality, the watermark latent reconstruction (WLR) utilizing reversed DMJS, a dense variation region detector (DVRD) leveraging diffusion inversion sensitivity to identify tampered areas via statistical deviation analysis, and the tamper-aware decoding (TAD) guided by localization results. The experimental results demonstrate that TAG-WM achieves state-of-the-art performance in both tampering robustness and localization capability even under distortion, while preserving lossless generation quality and maintaining a watermark capacity of 256 bits. The code is available at: https://github.com/Suchenl/TAG-WM.

[1195] Towards Robust and Realible Multimodal Fake News Detection with Incomplete Modality

Hengyang Zhou, Yiwei Wei, Jian Yang, Zhenyu Zhang

Main category: cs.MM

TL;DR: MMLNet is a robust multimodal fake news detection framework that handles modality incompleteness through multi-expert collaboration, incomplete modality adapters, and contrastive learning.

Details

Motivation: Real-world multimodal fake news often suffers from missing modalities during dissemination, which harms model generalization and robustness of existing methods.

Method: Three-step approach: (1) Multi-Expert Collaborative Reasoning for missing modality compensation, (2) Incomplete Modality Adapters for feature distribution adaptation, (3) Modality Missing Learning with adaptive weighting and contrastive learning.

Result: Superior performance on three real-world benchmarks across two languages compared to state-of-the-art methods while maintaining simplicity.

Conclusion: MMLNet effectively improves fake news detection accuracy in incomplete modality scenarios, helping curb malicious misinformation spread.

Abstract: Multimodal fake news detection (MFND) has become an urgent task with the emergence of huge multimodal fake content on social media platforms. Previous studies mainly focus on complex feature extraction and fusion to learn discriminative information from multimodal content. However, in real-world applications, multimedia news may naturally lose some information during dissemination, resulting in modality incompleteness, which is detrimental to the generalization and robustness of existing models. To this end, we propose a novel generic and robust multimodal fusion strategy, termed Multi-expert Modality-incomplete Learning Network (MMLNet), which is simple yet effective. It consists of three key steps: (1) Multi-Expert Collaborative Reasoning to compensate for missing modalities by dynamically leveraging complementary information through multiple experts. (2) Incomplete Modality Adapters compensates for the missing information by leveraging the new feature distribution. (3) Modality Missing Learning leveraging an label-aware adaptive weighting strategy to learn a robust representation with contrastive learning. We evaluate MMLNet on three real-world benchmarks across two languages, demonstrating superior performance compared to state-of-the-art methods while maintaining relative simplicity. By ensuring the accuracy of fake news detection in incomplete modality scenarios caused by information propagation, MMLNet effectively curbs the spread of malicious misinformation. Code is publicly available at https://github.com/zhyhome/MMLNet.

eess.AS

[1196] Perceptual Compensation of Ambisonics Recordings for Reproduction in Room

Ali Fallah, Shun Nakamura, Steven van de Par

Main category: eess.AS

TL;DR: A perceptually-motivated Ambisonics method that compensates for playback room reverberation by spectrally and spatially adjusting direct and reverberant sound components to preserve auditory cues.

Details

Motivation: Conventional Ambisonics assumes ideal playback conditions, but real playback room acoustics degrade sound quality, requiring compensation for accurate sound field reproduction.

Method: Record direct and reverberant sound field components in spherical harmonics domain, then apply spectral and spatial compensation to preserve direction of arrival, spectral energy distribution, and interaural coherence across auditory bands.

Result: Listening tests show the proposed method provides perceptually accurate rendering, outperforming conventional Ambisonics without compensation and even ideal Ambisonics in simulated anechoic rooms.

Conclusion: The method enables flexible Ambisonics channel usage and remains robust to head rotation and minor listener displacements while delivering superior perceptual accuracy.

Abstract: Ambisonics is a method for capturing and rendering a sound field accurately, assuming that the acoustics of the playback room does not significantly influence the sound field. However, in practice, the acoustics of the playback room may lead to a noticeable degradation in sound quality. We propose a recording and rendering method based on Ambisonics that utilizes a perceptually-motivated approach to compensate for the reverberation of the playback room. The recorded direct and reverberant sound field components in the spherical harmonics (SHs) domain are spectrally and spatially compensated to preserve the relevant auditory cues including the direction of arrival of the direct sound, the spectral energy of the direct and reverberant sound components, and the Interaural Coherence (IC) across each auditory band. In contrast to the conventional Ambisonics, a flexible number of Ambisonics channels can be used for audio rendering. Listening test results show that the proposed method provides a perceptually accurate rendering of the originally recorded sound field, outperforming both conventional Ambisonics without compensation and even ideal Ambisonics rendering in a simulated anechoic room. Additionally, subjective evaluations of listeners seated at the center of the loudspeaker array demonstrate that the method remains robust to head rotation and minor displacements.

[1197] Phase Aware Ear-Conditioned Learning for Multi-Channel Binaural Speaker Separation

Ruben Johnson Robert Jeremiah, Peyman Goli, Steven van de Par

Main category: eess.AS

TL;DR: PEASE-8 is a phase-aware ear-conditioned speaker separation network that uses eight microphones to separate two competing speakers in reverberant environments while preserving spatial cues and maintaining efficiency.

Details

Motivation: To address the challenge of separating competing speech in reverberant environments while preserving spatial cues and maintaining separation efficiency, especially for applications requiring accurate spatial information.

Method: Uses a phase-aware ear-conditioned speaker separation network with eight microphones that consumes complex STFTs and introduces raw-STFT input directly to early decoder layers, bypassing the encoder pathway. Trained end-to-end with SI-SDR-based objective against direct-path ear targets, jointly performing separation and dereverberation for two speakers at fixed azimuth without permutation invariant training.

Result: Achieves strong separation and intelligibility across anechoic, reverberant, and noisy conditions. Specifically in reverberant environments (T60 = 0.6 s): 12.37 dB SI-SDR, 0.87 STOI, and 1.86 PESQ. Remains competitive under anechoic conditions.

Conclusion: PEASE-8 effectively separates competing speech in challenging reverberant environments while preserving spatial cues and maintaining computational efficiency, demonstrating robust performance across various acoustic conditions.

Abstract: Separating competing speech in reverberant environments requires models that preserve spatial cues while maintaining separation efficiency. We present a Phase-aware Ear-conditioned speaker Separation network using eight microphones (PEASE-8) that consumes complex STFTs and directly introduces a raw-STFT input to the early decoder layer, bypassing the entire encoder pathway to improve reconstruction. The model is trained end-to-end with an SI-SDR-based objective against direct-path ear targets, jointly performing separation and dereverberation for two speakers in a fixed azimuth, eliminating the need for permutation invariant training. On spatialized two-speaker mixtures spanning anechoic, reverberant, and noisy conditions, PEASE-8 delivers strong separation and intelligibility. In reverberant environments, it achieves 12.37 dB SI-SDR, 0.87 STOI, and 1.86 PESQ at T60 = 0.6 s, while remaining competitive under anechoic conditions.

[1198] Dynamically Slimmable Speech Enhancement Network with Metric-Guided Training

Haixin Zhao, Kaixuan Yang, Nilesh Madhu

Main category: eess.AS

TL;DR: A gating-based Dynamically Slimmable Network (DSN) with static and dynamic components reduces complexity in lightweight speech enhancement models. It adaptively controls computational load based on input signal quality using frame-wise policy decisions and Metric-Guided Training.

Details

Motivation: To further reduce the complexity of lightweight speech enhancement models while maintaining performance.

Method: DSN with static and dynamic components targeting common neural network layers (grouped RNN, multi-head attention, convolutional, fully connected). Uses policy module for frame-wise adaptive computation control and Metric-Guided Training to guide policy decisions based on input quality.

Result: Achieves comparable enhancement performance to state-of-the-art lightweight baseline while using only 73% of computational load on average. Appropriately allocates network resources based on input signal distortion severity.

Conclusion: The DSN with Metric-Guided Training effectively reduces computational complexity in speech enhancement while maintaining performance through adaptive resource allocation based on input signal quality.

Abstract: To further reduce the complexity of lightweight speech enhancement models, we introduce a gating-based Dynamically Slimmable Network (DSN). The DSN comprises static and dynamic components. For architecture-independent applicability, we introduce distinct dynamic structures targeting the commonly used components, namely, grouped recurrent neural network units, multi-head attention, convolutional, and fully connected layers. A policy module adaptively governs the use of dynamic parts at a frame-wise resolution according to the input signal quality, controlling computational load. We further propose Metric-Guided Training (MGT) to explicitly guide the policy module in assessing input speech quality. Experimental results demonstrate that the DSN achieves comparable enhancement performance in instrumental metrics to the state-of-the-art lightweight baseline, while using only 73% of its computational load on average. Evaluations of dynamic component usage ratios indicate that the MGT-DSN can appropriately allocate network resources according to the severity of input signal distortion.

[1199] ILD-VIT: A Unified Vision Transformer Architecture for Detection of Interstitial Lung Disease from Respiratory Sounds

Soubhagya Ranjan Hota, Arka Roy, Udit Satija

Main category: eess.AS

TL;DR: A vision transformer-based deep learning framework called ILD-VIT is developed to detect interstitial lung disease using respiratory sound recordings, achieving high accuracy and successfully implemented on a Raspberry Pi for clinical screening.

Details

Motivation: Interstitial lung disease (ILD) causes irreversible lung damage and is typically diagnosed through various clinical methods including respiratory sounds. There's a need for automated detection systems using accessible modalities.

Method: The ILD-VIT framework uses three stages: pre-processing, mel spectrogram extraction, and classification using a vision transformer architecture that processes mel spectrogram image patches.

Result: The system achieved 84.86% accuracy, 82.67% sensitivity, and 86.91% specificity in subject-independent blind testing on BRACETS and KAUH databases.

Conclusion: The successful implementation on Raspberry Pi demonstrates the framework’s potential as a standalone clinical system for ILD screening in real-world scenarios.

Abstract: Interstitial lung disease (ILD) represents a group of restrictive chronic pulmonary diseases that impair oxygen acquisition by causing irreversible changes in the lungs such as fibrosis, scarring of parenchyma, etc. ILD conditions are often diagnosed by various clinical modalities such as spirometry, high-resolution lung imaging techniques, crackling respiratory sounds (RSs), etc. In this letter, we develop a novel vision transformer (VIT)-based deep learning framework namely, ILD-VIT, to detect the ILD condition using the RS recordings. The proposed framework comprises three major stages: pre-processing, mel spectrogram extraction, and classification using the proposed VIT architecture using the mel spectrogram image patches. Experimental results using the publicly available BRACETS and KAUH databases show that our proposed ILD-VIT achieves an accuracy, sensitivity, and specificity of 84.86%, 82.67%, and 86.91%, respectively, for subject-independent blind testing. The successful onboard implantation of the proposed framework on a Raspberry-pi-4 microcontroller indicates its potential as a standalone clinical system for ILD screening in a real clinical scenario.

[1200] Speech Enhancement and Dereverberation with Diffusion-based Generative Models

Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Timo Gerkmann

Main category: eess.AS

TL;DR: This paper presents an improved diffusion-based speech enhancement method that starts from noisy speech rather than pure Gaussian noise, enabling high-quality results with only 30 diffusion steps and achieving competitive performance with better generalization.

Details

Motivation: To improve upon previous diffusion-based speech enhancement methods by addressing limitations in the network architecture and formalism, enabling more efficient and effective speech enhancement with better generalization capabilities.

Method: Uses diffusion models with a modified forward process that moves from clean to noisy speech via a drift term, and starts the reverse process from a mixture of noisy speech and Gaussian noise rather than pure noise. Adapts network architecture and examines different sampler configurations.

Result: Achieves high-quality speech enhancement with only 30 diffusion steps, competes with recent discriminative models, shows better cross-dataset generalization, performs well on real-world recordings, and is also suitable for dereverberation tasks.

Conclusion: The proposed diffusion-based approach effectively improves speech enhancement performance, with the network architecture being the main limitation in previous work rather than the formalism. The method offers a good balance between performance and computational efficiency.

Abstract: In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve into an extensive theoretical examination of its implications. Opposed to usual conditional generation tasks, we do not start the reverse process from pure Gaussian noise but from a mixture of noisy speech and Gaussian noise. This matches our forward process which moves from clean speech to noisy speech by including a drift term. We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates. By adapting the network architecture, we are able to significantly improve the speech enhancement performance, indicating that the network, rather than the formalism, was the main limitation of our original approach. In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models and achieves better generalization when evaluating on a different corpus than used for training. We complement the results with an instrumental evaluation using real-world noisy recordings and a listening experiment, in which our proposed method is rated best. Examining different sampler configurations for solving the reverse process allows us to balance the performance and computational speed of the proposed method. Moreover, we show that the proposed method is also suitable for dereverberation and thus not limited to additive background noise removal. Code and audio examples are available online, see https://github.com/sp-uhh/sgmse.

[1201] Stimulus Modality Matters: Impact of Perceptual Evaluations from Different Modalities on Speech Emotion Recognition System Performance

Huang-Cheng Chou, Haibin Wu, Chi-Chun Lee

Main category: eess.AS

TL;DR: This paper compares the effectiveness of Speech Emotion Recognition (SER) systems trained with emotional labels collected from different modality stimuli (voice-only vs. video with sound), finding that voice-only labels perform better.

Details

Motivation: Different emotion databases collect perceptual evaluations differently - some use video clips with sounds while others use only speech. This raises the question of which type of emotional labels are most effective for training SER systems.

Method: Comprehensive comparison of SER systems trained with labels elicited by different modality stimuli, evaluation on various testing conditions, and introduction of an all-inclusive label combining all modalities.

Result: Using labels elicited by voice-only stimuli for training yields better performance on the test set compared to other modalities.

Conclusion: Voice-only emotional labels are more effective for training SER systems than labels collected from multimodal stimuli like video with sound.

Abstract: Speech Emotion Recognition (SER) systems rely on speech input and emotional labels annotated by humans. However, various emotion databases collect perceptional evaluations in different ways. For instance, the IEMOCAP dataset uses video clips with sounds for annotators to provide their emotional perceptions. However, the most significant English emotion dataset, the MSP-PODCAST, only provides speech for raters to choose the emotional ratings. Nevertheless, using speech as input is the standard approach to training SER systems. Therefore, the open question is the emotional labels elicited by which scenarios are the most effective for training SER systems. We comprehensively compare the effectiveness of SER systems trained with labels elicited by different modality stimuli and evaluate the SER systems on various testing conditions. Also, we introduce an all-inclusive label that combines all labels elicited by various modalities. We show that using labels elicited by voice-only stimuli for training yields better performance on the test set, whereas labels elicited by voice-only stimuli.

[1202] Enhancing Noise Robustness for Neural Speech Codecs through Resource-Efficient Progressive Quantization Perturbation Simulation

Rui-Chen Zheng, Yang Ai, Hui-Peng Du, Li-Rong Dai

Main category: eess.AS

TL;DR: A novel training strategy to enhance noise robustness of neural speech codecs by simulating perturbations at quantization level, using distance-weighted probabilistic top-K sampling and progressive training, trained only on clean speech.

Details

Motivation: Noise robustness is critical for deploying neural speech codecs in real-world scenarios where background noise is inevitable, as slight input noise perturbations cause unintended shifts in quantized codewords that degrade reconstructed speech quality.

Method: Proposes two core mechanisms: 1) distance-weighted probabilistic top-K sampling that replaces deterministic nearest-neighbor selection in RVQ, and 2) progressive training scheme that introduces perturbations from last to first quantizer in controlled manner. Method is trained exclusively on clean speech.

Result: Substantially improves robustness under noisy conditions - boosting UTMOS from 3.475 to 3.586 at 15 dB SNR on Encodec - while also enhancing coding quality for clean speech. Demonstrated effectiveness on Encodec and WavTokenizer codecs.

Conclusion: The proposed resource-efficient training strategy successfully enhances noise robustness of speech codecs without requiring paired noisy-clean data, achieving improved performance in both noisy and clean conditions.

Abstract: Noise robustness remains a critical challenge for deploying neural speech codecs in real-world acoustic scenarios where background noise is often inevitable. A key observation we make is that even slight input noise perturbations can cause unintended shifts in quantized codewords, thereby degrading the quality of reconstructed speech. Motivated by this finding, we propose a novel and resource-efficient training strategy to enhance the noise robustness of speech codecs by simulating such perturbations directly at the quantization level. Our approach introduces two core mechanisms: (1) a distance-weighted probabilistic top-K sampling strategy that replaces the conventional deterministic nearest-neighbor selection in residual vector quantization (RVQ); and (2) a progressive training scheme that introduces perturbations from the last to the first quantizer in a controlled manner. Crucially, our method is trained exclusively on clean speech, eliminating the need for any paired noisy-clean data. Experiments on two advanced neural speech codecs, Encodec and WavTokenizer, demonstrate that the proposed strategy substantially improves robustness under noisy conditions-for example, boosting UTMOS from 3.475 to 3.586 at 15 dB SNR on Encodec-while also enhancing coding quality for clean speech.

[1203] SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

Chunbo Hao, Ruibin Yuan, Jixun Yao, Qixin Deng, Xinyi Bai, Wei Xue, Lei Xie

Main category: eess.AS

TL;DR: SongFormer is a scalable framework for music structure analysis that fuses short- and long-window audio representations and uses learned source embeddings to handle heterogeneous supervision, achieving state-of-the-art performance on the largest MSA corpus.

Details

Motivation: Progress in music structure analysis has been limited by small, inconsistent corpora, creating a need for scalable frameworks that can learn from diverse and imperfect data sources.

Method: Fuses short- and long-window self-supervised audio representations to capture both fine-grained and long-range dependencies, and introduces learned source embeddings to enable training with partial, noisy, and schema-mismatched labels.

Result: Sets new state-of-the-art in strict boundary detection (HR.5F) and achieves highest functional label accuracy on SongFormBench, surpassing strong baselines and Gemini 2.5 Pro while remaining computationally efficient.

Conclusion: SongFormer provides an effective scalable framework for music structure analysis that handles heterogeneous supervision and achieves superior performance, supported by the largest MSA corpus and expert-verified benchmark.

Abstract: Music structure analysis (MSA) underpins music understanding and controllable generation, yet progress has been limited by small, inconsistent corpora. We present SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short- and long-window self-supervised audio representations to capture both fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to enable training with partial, noisy, and schema-mismatched labels. To support scaling and fair evaluation, we release SongFormDB, the largest MSA corpus to date (over 10k tracks spanning languages and genres), and SongFormBench, a 300-song expert-verified benchmark. On SongFormBench, SongFormer sets a new state of the art in strict boundary detection (HR.5F) and achieves the highest functional label accuracy, while remaining computationally efficient; it surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed tolerance (HR3F). Code, datasets, and model are publicly available.

[1204] Enhancing Speaker Verification with w2v-BERT 2.0 and Knowledge Distillation guided Structured Pruning

Ze Li, Ming Cheng, Ming Li

Main category: eess.AS

TL;DR: This paper uses w2v-BERT 2.0, a 600M parameter model pre-trained on 4.5M hours of multilingual data, for speaker verification. It employs MFA structure with Layer Adapter and LoRA fine-tuning, achieving state-of-the-art results with 0.12% EER on Vox1-O and 0.55% on Vox1-H, plus 80% model size reduction via knowledge distillation pruning.

Details

Motivation: Leverage large-scale self-supervised pre-trained models to improve speaker verification performance by utilizing their rich feature representations.

Method: Use w2v-BERT 2.0 PTM with MFA structure and Layer Adapter for multi-layer feature processing, LoRA for efficient fine-tuning, and knowledge distillation guided structured pruning for model compression.

Result: Achieved state-of-the-art results: 0.12% EER on Vox1-O and 0.55% EER on Vox1-H test sets. Successfully reduced model size by 80% with only 0.04% EER degradation through pruning.

Conclusion: Large-scale pre-trained models combined with efficient fine-tuning and compression techniques can achieve excellent speaker verification performance while maintaining model efficiency.

Abstract: Large-scale self-supervised Pre-Trained Models (PTMs) have shown significant improvements in the speaker verification (SV) task by providing rich feature representations. In this paper, we utilize w2v-BERT 2.0, a model with approximately 600 million parameters trained on 4.5 million hours of unlabeled data across 143 languages, for the SV task. The MFA structure with Layer Adapter is employed to process the multi-layer feature outputs from the PTM and extract speaker embeddings. Additionally, we incorporate LoRA for efficient fine-tuning. Our model achieves state-of-the-art results with 0.12% and 0.55% EER on the Vox1-O and Vox1-H test sets, respectively. Furthermore, we apply knowledge distillation guided structured pruning, reducing the model size by 80% while achieving only a 0.04% EER degradation. Source code and models are released at https://github.com/ZXHY-82/w2v-BERT-2.0_SV.

eess.IV

[1205] Chlorophyll-a Mapping and Prediction in the Mar Menor Lagoon Using C2RCC-Processed Sentinel 2 Imagery

Antonio Martínez-Ibarra, Aurora González-Vidal, Adrián Cánovas-Rodríguez, Antonio F. Skarmeta

Main category: eess.IV

TL;DR: This study developed a machine learning-based methodology using Sentinel 2 satellite imagery and buoy data to monitor chlorophyll-a concentrations at different depth layers in the Mar Menor lagoon, enabling improved eutrophication monitoring and early warning capabilities.

Details

Motivation: To overcome limitations of traditional chlorophyll monitoring methods that are spatially and temporally limited, and to provide comprehensive, depth-specific monitoring for anticipating harmful algal blooms in the Mar Menor lagoon.

Method: Integrated nearly a decade of Sentinel 2 imagery (atmospherically corrected with C2RCC) with buoy data aggregated by depth layers. Used multiple ML/DL algorithms (RF, XGBoost, CatBoost, MLP, ensembles) with cross-validation and systematic band-combination experiments.

Result: Achieved high prediction accuracy across depth layers: R²=0.89 at surface, R²=0.87 at 1-2m, R²=0.81 at 2-3m, and R²=0.66 at 3-4m. Successfully reproduced known eutrophication events like the 2016 crisis and 2025 surge.

Conclusion: The study provides an end-to-end, validated methodology for depth-specific chlorophyll-a mapping that offers a transferable framework for monitoring other turbid coastal systems.

Abstract: The Mar Menor, Europe’s largest coastal lagoon, located in Spain, has undergone severe eutrophication crises. Monitoring chlorophyll-a (Chl-a) is essential to anticipate harmful algal blooms and guide mitigation. Traditional in situ measurements are spatially and temporally limited. Satellite-based approaches provide a more comprehensive view, enabling scalable, long-term, and transferable monitoring. This study aims to overcome limitations of chlorophyll monitoring, often restricted to surface estimates or limited temporal coverage, by developing a reliable methodology to predict and map Chl-a across the water column of the Mar Menor. The work integrates Sentinel 2 imagery with buoy-based ground truth to create models capable of high-resolution, depth-specific monitoring, enhancing early-warning capabilities for eutrophication. Nearly a decade of Sentinel 2 images was atmospherically corrected using C2RCC processors. Buoy data were aggregated by depth (0-1 m, 1-2 m, 2-3 m, 3-4 m). Multiple ML and DL algorithms-including RF, XGBoost, CatBoost, Multilater Perceptron Networks, and ensembles-were trained and validated using cross-validation. Systematic band-combination experiments and spatial aggregation strategies were tested to optimize prediction. Results show depth-dependent performance. At the surface, C2X-Complex with XGBoost and ensemble models achieved R2 = 0.89; at 1-2 m, CatBoost and ensemble models reached R2 = 0.87; at 2-3 m, TOA reflectances with KNN performed best (R2 = 0.81); while at 3-4 m, RF achieved R2 = 0.66. Generated maps successfully reproduced known eutrophication events (e.g., 2016 crisis, 2025 surge), confirming robustness. The study delivers an end-to-end, validated methodology for depth-specific Chl-amapping. Its integration of multispectral band combinations, buoy calibration, and ML/DL modeling offers a transferable framework for other turbid coastal systems.

[1206] Generative Latent Video Compression

Zongyu Guo, Zhaoyang Jia, Jiahao Li, Xiaoyi Zhang, Bin Li, Yan Lu

Main category: eess.IV

TL;DR: GLVC is a generative latent video compression framework that uses a pretrained tokenizer to project frames into a perceptually aligned latent space, achieving state-of-the-art performance with stable temporal coherence at nearly half the rate of existing neural video codecs.

Details

Motivation: Balancing rate-distortion-perception tradeoff in video compression is challenging, with perceptually optimized neural video codecs often suffering from flickering artifacts due to frame-wise quality fluctuations.

Method: Uses pretrained continuous tokenizer to project video frames into perceptually aligned latent space, redesigns codec architecture for latent domain with unified intra/inter coding and recurrent memory mechanism.

Result: Achieves state-of-the-art performance in DISTS and LPIPS metrics across multiple benchmarks, rivals latest neural video codecs at nearly half their rate while maintaining stable temporal coherence.

Conclusion: GLVC marks a step toward practical perceptual video compression by effectively addressing flickering artifacts and achieving superior perceptual quality with temporal stability.

Abstract: Perceptual optimization is widely recognized as essential for neural compression, yet balancing the rate-distortion-perception tradeoff remains challenging. This difficulty is especially pronounced in video compression, where frame-wise quality fluctuations often cause perceptually optimized neural video codecs to suffer from flickering artifacts. In this paper, inspired by the success of latent generative models, we present Generative Latent Video Compression (GLVC), an effective framework for perceptual video compression. GLVC employs a pretrained continuous tokenizer to project video frames into a perceptually aligned latent space, thereby offloading perceptual constraints from the rate-distortion optimization. We redesign the codec architecture explicitly for the latent domain, drawing on extensive insights from prior neural video codecs, and further equip it with innovations such as unified intra/inter coding and a recurrent memory mechanism. Experimental results across multiple benchmarks show that GLVC achieves state-of-the-art performance in terms of DISTS and LPIPS metrics. Notably, our user study confirms GLVC rivals the latest neural video codecs at nearly half their rate while maintaining stable temporal coherence, marking a step toward practical perceptual video compression.

[1207] Towards Efficient 3D Gaussian Human Avatar Compression: A Prior-Guided Framework

Shanzhi Yin, Bolin Chen, Xinju Wu, Ru-Ling Liao, Jie Chen, Shiqi Wang, Yan Ye

Main category: eess.IV

TL;DR: An efficient 3D avatar coding framework using canonical Gaussian avatars and compact human priors for ultra-low bit rate compression, achieving superior rate-distortion performance over conventional codecs.

Details

Motivation: To enable high-quality 3D human avatar video compression at ultra-low bit rates for immersive meta-verse applications by leveraging compact human priors and separating appearance from temporal evolution.

Method: Trains a canonical Gaussian avatar using articulated splatting, captures temporal body movements via compact parametric representations (94 parameters per frame), and generates target avatars through Linear Blend Skinning transformation.

Result: Significantly outperforms conventional 2D/3D codecs and existing learnable dynamic 3D Gaussian splatting compression methods in rate-distortion performance on mainstream multi-view human video datasets.

Conclusion: The framework enables efficient compression by sharing canonical avatars across sequences and transmitting minimal temporal parameters, paving the way for seamless immersive multimedia experiences.

Abstract: This paper proposes an efficient 3D avatar coding framework that leverages compact human priors and canonical-to-target transformation to enable high-quality 3D human avatar video compression at ultra-low bit rates. The framework begins by training a canonical Gaussian avatar using articulated splatting in a network-free manner, which serves as the foundation for avatar appearance modeling. Simultaneously, a human-prior template is employed to capture temporal body movements through compact parametric representations. This decomposition of appearance and temporal evolution minimizes redundancy, enabling efficient compression: the canonical avatar is shared across the sequence, requiring compression only once, while the temporal parameters, consisting of just 94 parameters per frame, are transmitted with minimal bit-rate. For each frame, the target human avatar is generated by deforming canonical avatar via Linear Blend Skinning transformation, facilitating temporal coherent video reconstruction and novel view synthesis. Experimental results demonstrate that the proposed method significantly outperforms conventional 2D/3D codecs and existing learnable dynamic 3D Gaussian splatting compression method in terms of rate-distortion performance on mainstream multi-view human video datasets, paving the way for seamless immersive multimedia experiences in meta-verse applications.

[1208] JND-Guided Light-Weight Neural Pre-Filter for Perceptual Image Coding

Chenlong He, Zijing Dong, Min Li, Zhijian Hao, Leilei Huang, Xiaoyang Zeng, Yibo Fan

Main category: eess.IV

TL;DR: This paper introduces FJNDF-Pytorch, a unified benchmark for frequency-domain JND-guided pre-filters, and proposes a lightweight CNN framework that achieves state-of-the-art compression efficiency with significantly reduced computational cost.

Details

Motivation: Existing JND-guided pre-filter methods are computationally expensive and lack standardized benchmarks for fair comparison, limiting their practical application and evaluation.

Method: Developed FJNDF-Pytorch as a unified benchmark platform and proposed a complete learning framework for a novel lightweight CNN architecture.

Result: The proposed method achieves state-of-the-art compression efficiency across multiple datasets and encoders, requiring only 7.15 GFLOPs for 1080p images (14.1% of recent lightweight networks).

Conclusion: The work provides a robust, efficient solution for perceptual image compression with a reproducible research platform, making significant contributions to both performance and computational efficiency.

Abstract: Just Noticeable Distortion (JND)-guided pre-filter is a promising technique for improving the perceptual compression efficiency of image coding. However, existing methods are often computationally expensive, and the field lacks standardized benchmarks for fair comparison. To address these challenges, this paper introduces a twofold contribution. First, we develop and open-source FJNDF-Pytorch, a unified benchmark for frequency-domain JND-Guided pre-filters. Second, leveraging this platform, we propose a complete learning framework for a novel, lightweight Convolutional Neural Network (CNN). Experimental results demonstrate that our proposed method achieves state-of-the-art compression efficiency, consistently outperforming competitors across multiple datasets and encoders. In terms of computational cost, our model is exceptionally lightweight, requiring only 7.15 GFLOPs to process a 1080p image, which is merely 14.1% of the cost of recent lightweight network. Our work presents a robust, state-of-the-art solution that excels in both performance and efficiency, supported by a reproducible research platform. The open-source implementation is available at https://github.com/viplab-fudan/FJNDF-Pytorch.

[1209] Bit Allocation Transfer for Perceptual Quality Enhancement of VVC Intra Coding

Runyu Yang, Ivan V. Bajić

Main category: eess.IV

TL;DR: Proposes a low-complexity method to enhance perceptual quality in VVC intra coding by transferring bit allocation knowledge from end-to-end image compression, achieving over 11% BD-rate reduction in MS-SSIM.

Details

Motivation: Traditional block-based hybrid coding frameworks (H.266/VVC, AVS3, AV1) optimize well for PSNR but struggle with perceptually-aligned metrics like MS-SSIM, creating a need for better perceptual quality optimization.

Method: Uses a lightweight model trained with perceptual losses to generate a quantization step map that captures block-level perceptual importance, enabling efficient derivation of QP map for VVC intra coding.

Result: Experiments on Kodak and CLIC datasets show significant advantages in execution time and perceptual metric performance, with more than 11% BD-rate reduction in MS-SSIM.

Conclusion: The scheme provides an efficient, practical pathway for perceptual enhancement of traditional codecs while maintaining low complexity.

Abstract: Mainstream image and video coding standards – including state-of-the-art codecs like H.266/VVC, AVS3, and AV1 – adopt a block-based hybrid coding framework. While this framework facilitates straightforward optimization for Peak Signal-to-Noise Ratio (PSNR), it struggles to effectively optimize perceptually-aligned metrics such as Multi-Scale Structural Similarity (MS-SSIM). To address this challenge, this paper proposes a low-complexity method to enhance perceptual quality in VVC intra coding by transferring bit allocation knowledge from end-to-end image compression. We introduce a lightweight model trained with perceptual losses to generate a quantization step map. This map implicitly captures block-level perceptual importance, enabling efficient derivation of a QP map for VVC. Experiments on Kodak and CLIC datasets demonstrate significant advantages, both in execution time and perceptual metric performance, with more than 11% BD-rate reduction in terms of MS-SSIM. Our scheme provides an efficient, practical pathway for perceptual enhancement of traditional codecs.

[1210] Generalisation of automatic tumour segmentation in histopathological whole-slide images across multiple cancer types

Ole-Johan Skrede, Manohar Pradhan, Maria Xepapadakis Isaksen, Tarjei Sveinsgjerd Hveem, Ljiljana Vlatkovic, Arild Nesbakken, Kristina Lindemann, Gunnar B Kristensen, Jenneke Kasius, Alain G Zeimet, Odd Terje Brustugun, Lill-Tove Rasmussen Busund, Elin H Richardsen, Erik Skaaheim Haug, Bjørn Brennhovd, Emma Rewcastle, Melinda Lillesand, Vebjørn Kvikstad, Emiel Janssen, David J Kerr, Knut Liestøl, Fritz Albregtsen, Andreas Kleppe

Main category: eess.IV

TL;DR: A universal deep learning model for tumor segmentation in histopathological images was developed and validated across multiple cancer types with high performance.

Details

Motivation: To develop one universal tumor segmentation model that can work across different cancer types rather than requiring specialized models for each cancer type.

Method: Developed a deep learning model using over 20,000 whole-slide images from 4,000+ patients with colorectal, endometrial, lung, or prostate carcinoma. Validated on external cohorts with 3,000+ patients across six cancer types and exploratory analyses on 1,500+ additional patients from The Cancer Genome Atlas.

Result: Average Dice coefficient was over 80% in all validation cohorts with en bloc resection specimens and in The Cancer Genome Atlas cohorts. No performance loss compared to specialized single-cancer models.

Conclusion: Generic tumor segmentation by a single model is feasible across cancer types, patient populations, sample preparations, and slide scanners.

Abstract: Deep learning is expected to aid pathologists by automating tasks such as tumour segmentation. We aimed to develop one universal tumour segmentation model for histopathological images and examine its performance in different cancer types. The model was developed using over 20 000 whole-slide images from over 4 000 patients with colorectal, endometrial, lung, or prostate carcinoma. Performance was validated in pre-planned analyses on external cohorts with over 3 000 patients across six cancer types. Exploratory analyses included over 1 500 additional patients from The Cancer Genome Atlas. Average Dice coefficient was over 80% in all validation cohorts with en bloc resection specimens and in The Cancer Genome Atlas cohorts. No loss of performance was observed when comparing the universal model with models specialised on single cancer types. In conclusion, extensive and rigorous evaluations demonstrate that generic tumour segmentation by a single model is possible across cancer types, patient populations, sample preparations, and slide scanners.

[1211] GADA: Graph Attention-based Detection Aggregation for Ultrasound Video Classification

Li Chen, Naveen Balaraju, Jochen Kruecker, Balasundar Raju, Alvin Chen

Main category: eess.IV

TL;DR: GADA is a Graph Attention-based Detection Aggregation framework that reformulates medical ultrasound video classification as a graph reasoning problem over detected pathology regions, outperforming conventional methods while providing interpretable attention.

Details

Motivation: Medical ultrasound video analysis is challenging due to variable sequence lengths, subtle spatial cues, and the need for interpretable video-level assessment.

Method: GADA detects pathology-relevant regions across frames and represents them as nodes in a spatiotemporal graph with edges encoding spatial and temporal dependencies. A graph attention network aggregates node-level predictions through edge-aware attention.

Result: Evaluated on a large-scale, multi-center clinical lung ultrasound dataset, GADA outperforms conventional baselines on two pathology video classification tasks.

Conclusion: The framework provides interpretable region- and frame-level attention while generating compact, discriminative video-level outputs for medical ultrasound analysis.

Abstract: Medical ultrasound video analysis is challenging due to variable sequence lengths, subtle spatial cues, and the need for interpretable video-level assessment. We introduce GADA, a Graph Attention-based Detection Aggregation framework that reformulates video classification as a graph reasoning problem over spatially localized regions of interest. Rather than relying on 3D CNNs or full-frame analysis, GADA detects pathology-relevant regions across frames and represents them as nodes in a spatiotemporal graph, with edges encoding spatial and temporal dependencies. A graph attention network aggregates these node-level predictions through edge-aware attention to generate a compact, discriminative video-level output. Evaluated on a large-scale, multi-center clinical lung ultrasound dataset, GADA outperforms conventional baselines on two pathology video classification tasks while providing interpretable region- and frame-level attention.

[1212] Rethinking Medical Anomaly Detection in Brain MRI: An Image Quality Assessment Perspective

Zixuan Pan, Jun Xia, Zheyu Yan, Guoyue Xu, Yifan Qin, Xueyang Li, Yawen Wu, Zhenge Jia, Jianxu Chen, Yiyu Shi

Main category: eess.IV

TL;DR: The paper proposes a novel image quality assessment (IQA) approach for brain MRI anomaly detection that combines structural similarity (SSIM) with pixel-level precision (l1) through a fusion quality metric, enhanced by average intensity ratio (AIR)-based data transformation to amplify divisive discrepancies between normal and abnormal regions.

Details

Motivation: Existing reconstruction-based anomaly detection methods in brain MRI focus on architectural improvements, but conventional metrics like l1 fail to capture nuanced differences in reconstructed images. The paper addresses this gap by exploring image quality assessment as an under-explored direction for improving anomaly detection accuracy.

Method: Proposes fusion quality metric integrating SSIM’s structure-level sensitivity with l1’s pixel-level precision, considering intensity, contrast, and structural similarity. Also designs AIR-based data transformation to amplify divisive discrepancies between normal and abnormal regions by leveraging SSIM’s inherent divisive properties.

Result: Experimental results on two distinct brain MRI datasets demonstrate that the proposed IQA approach significantly enhances medical anomaly detection performance when integrated with state-of-the-art baselines.

Conclusion: The IQA perspective offers a promising alternative to architectural innovations for improving brain MRI anomaly detection, with the fusion quality metric and AIR-based transformation effectively capturing subtle regional variations that conventional metrics miss.

Abstract: Reconstruction-based methods, particularly those leveraging autoencoders, have been widely adopted for anomaly detection task in brain MRI. Unlike most existing works try to improve the task accuracy through architectural or algorithmic innovations, we tackle this task from image quality assessment (IQA) perspective, an under-explored direction in the field. Due to the limitations of conventional metrics such as l1 in capturing the nuanced differences in reconstructed images for medical anomaly detection, we propose fusion quality, a novel metric that wisely integrates the structure-level sensitivity of Structural Similarity Index Measure (SSIM) with the pixel-level precision of l1. The metric offers a more comprehensive assessment of reconstruction quality, considering intensity (subtractive property of l1 and divisive property of SSIM), contrast, and structural similarity. Furthermore, the proposed metric makes subtle regional variations more impactful in the final assessment. Thus, considering the inherent divisive properties of SSIM, we design an average intensity ratio (AIR)-based data transformation that amplifies the divisive discrepancies between normal and abnormal regions, thereby enhancing anomaly detection. By fusing the aforementioned two components, we devise the IQA approach. Experimental results on two distinct brain MRI datasets show that our IQA approach significantly enhances medical anomaly detection performance when integrated with state-of-the-art baselines.

[1213] Cell as Point: One-Stage Framework for Efficient Cell Tracking

Yaxuan Song, Jianan Fan, Heng Huang, Mei Chen, Weidong Cai

Main category: eess.IV

TL;DR: CAP is a novel end-to-end one-stage cell tracking framework that treats cells as points, eliminating the need for detection/segmentation and achieving 8-32x efficiency gains over existing methods.

Details

Motivation: To overcome limitations of conventional multi-stage cell tracking approaches that require high-quality segmentation masks and increase prediction time, by developing a simpler, more efficient method.

Method: Jointly tracks cells as points in one stage using trajectory correlations, with adaptive event-guided sampling to handle cell division imbalance and rolling-as-window inference for long sequences.

Result: CAP demonstrates promising cell tracking performance and is 8 to 32 times more efficient than existing methods while removing dependency on segmentation-based preprocessing.

Conclusion: The proposed CAP framework successfully addresses key challenges in cell tracking by simplifying the pipeline through point-based representation and innovative sampling/inference strategies.

Abstract: Conventional multi-stage cell tracking approaches rely heavily on detection or segmentation in each frame as a prerequisite, requiring substantial resources for high-quality segmentation masks and increasing the overall prediction time. To address these limitations, we propose CAP, a novel end-to-end one-stage framework that reimagines cell tracking by treating Cell as Point. Unlike traditional methods, CAP eliminates the need for explicit detection or segmentation, instead jointly tracking cells for sequences in one stage by leveraging the inherent correlations among their trajectories. This simplification reduces both labeling requirements and pipeline complexity. However, directly processing the entire sequence in one stage poses challenges related to data imbalance in capturing cell division events and long sequence inference. To solve these challenges, CAP introduces two key innovations: (1) adaptive event-guided (AEG) sampling, which prioritizes cell division events to mitigate the occurrence imbalance of cell events, and (2) the rolling-as-window (RAW) inference strategy, which ensures continuous and stable tracking of newly emerging cells over extended sequences. By removing the dependency on segmentation-based preprocessing while addressing the challenges of imbalanced occurrence of cell events and long-sequence tracking, CAP demonstrates promising cell tracking performance and is 8 to 32 times more efficient than existing methods. The code and model checkpoints will be available soon.

[1214] MedVKAN: Efficient Feature Extraction with Mamba and KAN for Medical Image Segmentation

Hancan Zhu, Jinhao Chen, Guanghua He

Main category: eess.IV

TL;DR: MedVKAN integrates Visual State Space (VSS) model with Expanded Field Convolutional KAN (EFC-KAN) in a U-Net framework, achieving state-of-the-art medical image segmentation with near-linear computational complexity.

Details

Motivation: To overcome limitations of CNNs (limited receptive fields) and Transformers (quadratic complexity) in medical image segmentation by combining the strengths of Mamba's selective state-space design and KAN's enhanced nonlinear expressiveness.

Method: Proposes VKAN module that integrates VSS with EFC-KAN to replace Transformer modules, then embeds VKAN into U-Net framework to create MedVKAN model for efficient feature extraction.

Result: Achieves state-of-the-art performance on 4 out of 5 public medical image datasets and ranks second on the remaining one, demonstrating superior segmentation accuracy with computational efficiency.

Conclusion: The combination of Mamba and KAN provides an effective and computationally efficient feature extraction framework for medical image segmentation, offering a novel alternative to traditional CNN and Transformer approaches.

Abstract: Medical image segmentation has traditionally relied on convolutional neural networks (CNNs) and Transformer-based models. CNNs, however, are constrained by limited receptive fields, while Transformers face scalability challenges due to quadratic computational complexity. To over-come these issues, recent studies have explored alternative architectures. The Mamba model, a selective state-space design, achieves near-linear complexity and effectively captures long-range dependencies. Its vision-oriented variant, the Visual State Space (VSS) model, extends these strengths to image feature learning. In parallel, the Kolmogorov-Arnold Network (KAN) enhanc-es nonlinear expressiveness by replacing fixed activation functions with learnable ones. Moti-vated by these advances, we propose the VSS-Enhanced KAN (VKAN) module, which integrates VSS with the Expanded Field Convolutional KAN (EFC-KAN) as a replacement for Transformer modules, thereby strengthening feature extraction. We further embed VKAN into a U-Net frame-work, resulting in MedVKAN, an efficient medical image segmentation model. Extensive exper-iments on five public datasets demonstrate that MedVKAN achieves state-of-the-art performance on four datasets and ranks second on the remaining one. These results underscore the effective-ness of combining Mamba and KAN while introducing a novel and computationally efficient feature extraction framework. The source code is available at: https://github.com/beginner-cjh/MedVKAN.

[1215] OSCAR: One-Step Diffusion Codec Across Multiple Bit-rates

Jinpei Guo, Yifei Ji, Zheng Chen, Kai Liu, Min Liu, Wang Rao, Wenbo Li, Yong Guo, Yulun Zhang

Main category: eess.IV

TL;DR: OSCAR is a one-step diffusion-based image compression method that achieves high quality reconstruction across multiple bit-rates using a single model, significantly improving computational efficiency over traditional multi-step diffusion approaches.

Details

Motivation: Existing diffusion-based image compression methods suffer from substantial computational overhead due to multi-step sampling processes and require training separate models for different bit-rates, leading to high training and storage costs.

Method: OSCAR models compressed latents as noisy variants of original latents along a diffusion trajectory, establishes a mapping from compression bit-rate to pseudo diffusion timestep, and uses a single generative model with one-step denoising for reconstruction across multiple bit-rates.

Result: Extensive experiments show OSCAR achieves superior performance in both quantitative and visual quality metrics while significantly improving inference efficiency through single-step reconstruction.

Conclusion: OSCAR successfully addresses the computational and training cost challenges of diffusion-based compression by enabling one-step reconstruction across multiple bit-rates with a single model, maintaining high reconstruction quality.

Abstract: Pretrained latent diffusion models have shown strong potential for lossy image compression, owing to their powerful generative priors. Most existing diffusion-based methods reconstruct images by iteratively denoising from random noise, guided by compressed latent representations. While these approaches have achieved high reconstruction quality, their multi-step sampling process incurs substantial computational overhead. Moreover, they typically require training separate models for different compression bit-rates, leading to significant training and storage costs. To address these challenges, we propose a one-step diffusion codec across multiple bit-rates. termed OSCAR. Specifically, our method views compressed latents as noisy variants of the original latents, where the level of distortion depends on the bit-rate. This perspective allows them to be modeled as intermediate states along a diffusion trajectory. By establishing a mapping from the compression bit-rate to a pseudo diffusion timestep, we condition a single generative model to support reconstructions at multiple bit-rates. Meanwhile, we argue that the compressed latents retain rich structural information, thereby making one-step denoising feasible. Thus, OSCAR replaces iterative sampling with a single denoising pass, significantly improving inference efficiency. Extensive experiments demonstrate that OSCAR achieves superior performance in both quantitative and visual quality metrics. The code and models are available at https://github.com/jp-guo/OSCAR.